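This post lists the latest papers fetched from the arXiv website each day. It is updated automatically at around 11:30 every morning and is organized into five broad areas: NLP, CV, ML, AI, and IR.

Tip: if you would like to receive the daily paper list by email, leave your email address in the comments. The email is also sent automatically at around 11:30 each day.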

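Overview (2024-06-11)

419 papers were updated today, including:

  • Natural Language Processing: 81 papers (Computation and Language (cs.CL))
  • Computer Vision: 93 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Artificial Intelligence: 127 papers (Artificial Intelligence (cs.AI))
  • Machine Learning: 161 papers (Machine Learning (cs.LG))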
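
The post does not say how the listing is fetched, so the following is only a minimal sketch of one way per-category counts like those above could be produced. It assumes the public arXiv API Atom feed at `export.arxiv.org/api/query`; the helper names `build_query_url` and `count_by_category` and the hand-written `SAMPLE_FEED` are illustrative, not part of the original pipeline.

```python
import urllib.parse
import xml.etree.ElementTree as ET
from collections import Counter

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom namespace used by the arXiv API feed
TRACKED = ("cs.CL", "cs.CV", "cs.AI", "cs.LG")  # categories listed in the overview


def build_query_url(category, max_results=100):
    """Build an arXiv API URL for the newest submissions in one category (hypothetical helper)."""
    params = {
        "search_query": f"cat:{category}",
        "sortBy": "submittedDate",
        "sortOrder": "descending",
        "max_results": max_results,
    }
    return "http://export.arxiv.org/api/query?" + urllib.parse.urlencode(params)


def count_by_category(atom_xml):
    """Return (number of entries, Counter of tracked categories) for an Atom feed string."""
    root = ET.fromstring(atom_xml)
    entries = root.findall(f"{ATOM}entry")
    counts = Counter()
    for entry in entries:
        # A paper carries one <category> element per listed category.
        terms = {c.get("term") for c in entry.findall(f"{ATOM}category")}
        for cat in TRACKED:
            if cat in terms:
                counts[cat] += 1
    return len(entries), counts


# Hand-written sample feed (not real API output), so the sketch runs offline.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://arxiv.org/abs/2406.05132</id>
    <category term="cs.CV"/>
    <category term="cs.CL"/>
  </entry>
  <entry>
    <id>http://arxiv.org/abs/2406.05130</id>
    <category term="cs.CL"/>
  </entry>
</feed>"""

total, counts = count_by_category(SAMPLE_FEED)
print(total, dict(counts))
```

In this sketch a cross-listed paper is counted once in every category it carries, so the per-category counts can add up to more than the number of entries.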

Natural Language Processing

[NLP-0] 3D-GRAND: Towards Better Grounding and Less Hallucination for 3D-LLMs

Link: https://arxiv.org/abs/2406.05132
Authors: Jianing Yang, Xuweiyi Chen, Nikhil Madaan, Madhavan Iyengar, Shengyi Qian, David F. Fouhey, Joyce Chai
Keywords: developing embodied agents, perception is crucial, physical world, crucial for developing, agents and robots
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: Project website: this https URL

Abstract:The integration of language and 3D perception is crucial for developing embodied agents and robots that comprehend and interact with the physical world. While large language models (LLMs) have demonstrated impressive language understanding and generation capabilities, their adaptation to 3D environments (3D-LLMs) remains in its early stages. A primary challenge is the absence of large-scale datasets that provide dense grounding between language and 3D scenes. In this paper, we introduce 3D-GRAND, a pioneering large-scale dataset comprising 40,087 household scenes paired with 6.2 million densely-grounded scene-language instructions. Our results show that instruction tuning with 3D-GRAND significantly enhances grounding capabilities and reduces hallucinations in 3D-LLMs. As part of our contributions, we propose a comprehensive benchmark 3D-POPE to systematically evaluate hallucination in 3D-LLMs, enabling fair comparisons among future models. Our experiments highlight a scaling effect between dataset size and 3D-LLM performance, emphasizing the critical role of large-scale 3D-text datasets in advancing embodied AI research. Notably, our results demonstrate early signals for effective sim-to-real transfer, indicating that models trained on large synthetic data can perform well on real-world 3D scans. Through 3D-GRAND and 3D-POPE, we aim to equip the embodied AI community with essential resources and insights, setting the stage for more reliable and better-grounded 3D-LLMs. Project website: this https URL

[NLP-1] An Empirical Study on Parameter-Efficient Fine-Tuning for MultiModal Large Language Models

Link: https://arxiv.org/abs/2406.05130
Authors: Xiongtao Zhou, Jie He, Yuhua Ke, Guangyao Zhu, Víctor Gutiérrez-Basulto, Jeff Z. Pan
Keywords: Multimodal large language, demonstrated remarkable capabilities, multimodal instruction datasets, multimodal tasks, Multimodal large
Categories: Computation and Language (cs.CL)
Comments: ACL finding 2024

Abstract:Multimodal large language models (MLLMs) fine-tuned with multimodal instruction datasets have demonstrated remarkable capabilities in multimodal tasks. However, fine-tuning all parameters of MLLMs has become challenging as they usually contain billions of parameters. To address this issue, we study parameter-efficient fine-tuning (PEFT) methods for MLLMs. We aim to identify effective methods for enhancing the performance of MLLMs in scenarios where only a limited number of parameters are trained. This paper conducts empirical studies using four popular PEFT methods to fine-tune the LLM component of open-source MLLMs. We present a comprehensive analysis that encompasses various aspects, including the impact of PEFT methods on various models, parameters and location of the PEFT module, size of fine-tuning data, model stability based on PEFT methods, MLLM’s generalization, and hallucination. We evaluated four PEFT methods on seven datasets from two different categories: unseen and seen datasets. Across all experiments, we show that the adapter is the best-performing PEFT method. At the same time, fine-tuning the connector layers leads to improved performance in most MLLMs. Code and data are available at this https URL.
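
The abstract above does not specify which adapter variant was evaluated. As a rough illustration only, the sketch below shows the generic bottleneck adapter idea (down-projection, nonlinearity, up-projection, residual connection) and why it trains few parameters. The function names and the sizes `d_model=4096`, `bottleneck=64` are assumptions chosen for the arithmetic, not values from the paper.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def adapter_forward(h, W_down, b_down, W_up, b_up):
    """Generic bottleneck adapter: h + W_up * relu(W_down * h + b_down) + b_up."""
    z = [max(0.0, v + b) for v, b in zip(matvec(W_down, h), b_down)]
    delta = [v + b for v, b in zip(matvec(W_up, z), b_up)]
    return [hi + di for hi, di in zip(h, delta)]


def adapter_param_count(d_model, bottleneck):
    """Trainable parameters of one adapter: two projections plus their biases."""
    return 2 * d_model * bottleneck + bottleneck + d_model


# With a zero-initialised up-projection the adapter starts as the identity map,
# so inserting it does not change the frozen model's output before training.
h = [1.0, -2.0, 0.5]
W_down = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
b_down = [0.0, 0.0]
W_up = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
b_up = [0.0, 0.0, 0.0]
print(adapter_forward(h, W_down, b_down, W_up, b_up))

# Illustrative sizes: one adapter versus one dense d_model x d_model weight matrix.
print(adapter_param_count(4096, 64), 4096 * 4096)
```

Under these assumed sizes the adapter has 528,448 trainable parameters against 16,777,216 in a single dense 4096 by 4096 matrix, roughly 3%.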

[NLP-2] Multi-Head RAG: Solving Multi-Aspect Problems with LLMs

Link: https://arxiv.org/abs/2406.05085
Authors: Maciej Besta, Ales Kubicek, Roman Niggli, Robert Gerstenberger, Lucas Weitzendorf, Mingyuan Chi, Patrick Iff, Joanna Gajda, Piotr Nyczyk, Jürgen Müller, Hubert Niewiadomski, Marcin Chrapek, Michał Podstawski, Torsten Hoefler
Keywords: Large Language Models, Retrieval Augmented Generation, Augmented Generation, Language Models, Large Language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Abstract:Retrieval Augmented Generation (RAG) enhances the abilities of Large Language Models (LLMs) by enabling the retrieval of documents into the LLM context to provide more accurate and relevant responses. Existing RAG solutions do not focus on queries that may require fetching multiple documents with substantially different contents. Such queries occur frequently, but are challenging because the embeddings of these documents may be distant in the embedding space, making it hard to retrieve them all. This paper introduces Multi-Head RAG (MRAG), a novel scheme designed to address this gap with a simple yet powerful idea: leveraging activations of Transformer’s multi-head attention layer, instead of the decoder layer, as keys for fetching multi-aspect documents. The driving motivation is that different attention heads can learn to capture different data aspects. Harnessing the corresponding activations results in embeddings that represent various facets of data items and queries, improving the retrieval accuracy for complex queries. We provide an evaluation methodology and metrics, synthetic datasets, and real-world use cases to demonstrate MRAG’s effectiveness, showing improvements of up to 20% in relevance over standard RAG baselines. MRAG can be seamlessly integrated with existing RAG frameworks and benchmarking tools like RAGAS as well as different classes of data stores.
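
To make the multi-aspect idea in this abstract concrete, here is a toy sketch, not the MRAG implementation. It assumes each embedding is a concatenation of per-head slices (standing in for attention-head activations) and lets every head vote for its nearest document. The function names, the voting rule, and the four-dimensional vectors are all made up for illustration.

```python
import math


def split_heads(vec, n_heads):
    """Cut a flat embedding into n_heads equal slices."""
    size = len(vec) // n_heads
    return [vec[i * size:(i + 1) * size] for i in range(n_heads)]


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def single_vector_retrieve(query, docs):
    """Baseline: rank documents by cosine similarity on the full embedding."""
    return sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)


def multi_head_retrieve(query, docs, n_heads, top_k=1):
    """Each head votes for its top_k documents; return ids ordered by vote count."""
    votes = {}
    for h, q_slice in enumerate(split_heads(query, n_heads)):
        ranked = sorted(
            docs,
            key=lambda d: cosine(q_slice, split_heads(docs[d], n_heads)[h]),
            reverse=True,
        )
        for doc_id in ranked[:top_k]:
            votes[doc_id] = votes.get(doc_id, 0) + 1
    return sorted(votes, key=lambda d: (-votes[d], d))


# Toy data: the query has two aspects, one per head.
query = [1.0, 0.0, 0.0, 1.0]
docs = {
    "a": [1.0, 0.0, 1.0, 0.0],  # matches aspect 1 only
    "b": [0.0, 1.0, 0.0, 1.0],  # matches aspect 2 only
    "c": [1.0, 1.0, 1.0, 1.0],  # generic document, matches neither aspect exactly
}
print(single_vector_retrieve(query, docs)[0])
print(multi_head_retrieve(query, docs, n_heads=2))
```

On this toy data the full-vector baseline prefers the generic document `c`, while the per-head vote returns the two aspect-specific documents `a` and `b`.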

[NLP-3] On Ambiguity and the Expressive Function of Law: The Role of Pragmatics in Smart Legal Ecosystems

Link: https://arxiv.org/abs/2406.05084
Authors: Pompeu Casanovas
Keywords: long paper, function of law, expressive function, artificial intelligence, Optimizing Manufacturing Processes
Categories: Computers and Society (cs.CY); Computation and Language (cs.CL)
Comments: 50 pages, 6 Figures, first presented at the 31st Congress of General Linguistics of the University of Barcelona (UB, CLUB31), October, 2023. To be published in the Catalan Linguistic Series as a chapter of the volume edited by Jordi Fortuny, Pau Francesch and Lluis Payrato (eds.), Ambiguity: an interdisciplinary approach. Barcelona: Edicions de la Universitat de Barcelona, 2025

Abstract:This is a long paper, an essay, on ambiguity, pragmatics, legal ecosystems, and the expressive function of law. It is divided into two parts and fifteen sections. The first part (Pragmatics) addresses ambiguity from the perspective of linguistic and cognitive pragmatics in the legal field. The second part (Computing) deals with this issue from the point of view of human-centered design and artificial intelligence, specifically focusing on the notion and modelling of rules and what it means to comply with the rules. This is necessary for the scaffolding of smart legal ecosystems (SLE). I will develop this subject with the example of the architecture, information flows, and smart ecosystem of OPTIMAI, an EU project of Industry 4.0 for zero-defect manufacturing (Optimizing Manufacturing Processes through Artificial Intelligence and Virtualization).

[NLP-4] I2EDL: Interactive Instruction Error Detection and Localization

Link: https://arxiv.org/abs/2406.05080
Authors: Francesco Taioli, Stefano Rosa, Alberto Castellini, Lorenzo Natale, Alessio Del Bue, Alessandro Farinelli, Marco Cristani, Yiming Wang
Keywords: human user guides, Continuous Environments, Interactive Instruction Error, instruction errors, Instruction Error Detector
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted at IEEE RO-MAN 2024

Abstract:In the Vision-and-Language Navigation in Continuous Environments (VLN-CE) task, the human user guides an autonomous agent to reach a target goal via a series of low-level actions following a textual instruction in natural language. However, most existing methods do not address the likely case where users may make mistakes when providing such instruction (e.g. “turn left” instead of “turn right”). In this work, we address a novel task of Interactive VLN in Continuous Environments (IVLN-CE), which allows the agent to interact with the user during the VLN-CE navigation to verify any doubts regarding the instruction errors. We propose an Interactive Instruction Error Detector and Localizer (I2EDL) that triggers the user-agent interaction upon the detection of instruction errors during the navigation. We leverage a pre-trained module to detect instruction errors and pinpoint them in the instruction by cross-referencing the textual input and past observations. In such way, the agent is able to query the user for a timely correction, without demanding the user’s cognitive load, as we locate the probable errors to a precise part of the instruction. We evaluate the proposed I2EDL on a dataset of instructions containing errors, and further devise a novel metric, the Success weighted by Interaction Number (SIN), to reflect both the navigation performance and the interaction effectiveness. We show how the proposed method can ask focused requests for corrections to the user, which in turn increases the navigation success, while minimizing the interactions.

[NLP-5] SUMIE: A Synthetic Benchmark for Incremental Entity Summarization

Link: https://arxiv.org/abs/2406.05079
Authors: Eunjeong Hwang, Yichao Zhou, Beliz Gunel, James Bradley Wendt, Sandeep Tata
Keywords: models rapidly advance, Incremental Entity Summarization, existing dataset adequately, dataset adequately tests, language models
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 24 figures, 4 tables

Abstract:No existing dataset adequately tests how well language models can incrementally update entity summaries - a crucial ability as these models rapidly advance. The Incremental Entity Summarization (IES) task is vital for maintaining accurate, up-to-date knowledge. To address this, we introduce SUMIE, a fully synthetic dataset designed to expose real-world IES challenges. This dataset effectively highlights problems like incorrect entity association and incomplete information presentation. Unlike common synthetic datasets, ours captures the complexity and nuances found in real-world data. We generate informative and diverse attributes, summaries, and unstructured paragraphs in sequence, ensuring high quality. The alignment between generated summaries and paragraphs exceeds 96%, confirming the dataset’s quality. Extensive experiments demonstrate the dataset’s difficulty - state-of-the-art LLMs struggle to update summaries with an F1 higher than 80.4%. We will open source the benchmark and the evaluation metrics to help the community make progress on IES tasks.

[NLP-6] Are Large Language Models More Empathetic than Humans?

Link: https://arxiv.org/abs/2406.05063
Authors: Anuradha Welivita, Pearl Pu
Keywords: large language models, language models, empathetic responding, emergence of large, large language
Categories: Computation and Language (cs.CL)
Comments: 9 pages, 3 figures. arXiv admin note: text overlap with arXiv:2403.05572

Abstract:With the emergence of large language models (LLMs), investigating if they can surpass humans in areas such as emotion recognition and empathetic responding has become a focal point of research. This paper presents a comprehensive study exploring the empathetic responding capabilities of four state-of-the-art LLMs: GPT-4, LLaMA-2-70B-Chat, Gemini-1.0-Pro, and Mixtral-8x7B-Instruct in comparison to a human baseline. We engaged 1,000 participants in a between-subjects user study, assessing the empathetic quality of responses generated by humans and the four LLMs to 2,000 emotional dialogue prompts meticulously selected to cover a broad spectrum of 32 distinct positive and negative emotions. Our findings reveal a statistically significant superiority of the empathetic responding capability of LLMs over humans. GPT-4 emerged as the most empathetic, marking approximately 31% increase in responses rated as “Good” compared to the human benchmark. It was followed by LLaMA-2, Mixtral-8x7B, and Gemini-Pro, which showed increases of approximately 24%, 21%, and 10% in “Good” ratings, respectively. We further analyzed the response ratings at a finer granularity and discovered that some LLMs are significantly better at responding to specific emotions compared to others. The suggested evaluation framework offers a scalable and adaptable approach for assessing the empathy of new LLMs, avoiding the need to replicate this study’s findings in future research.

[NLP-7] Bootstrapping Referring Multi-Object Tracking

Link: https://arxiv.org/abs/2406.05039
Authors: Yani Zhang, Dongming Wu, Wencheng Han, Xingping Dong
Keywords: human instruction represented, tracking multiple objects, natural language expression, Referring multi-object tracking, aims at detecting
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Abstract:Referring multi-object tracking (RMOT) aims at detecting and tracking multiple objects following human instruction represented by a natural language expression. Existing RMOT benchmarks are usually formulated through manual annotations, integrated with static regulations. This approach results in a dearth of notable diversity and a constrained scope of implementation. In this work, our key idea is to bootstrap the task of referring multi-object tracking by introducing discriminative language words as much as possible. In specific, we first develop Refer-KITTI into a large-scale dataset, named Refer-KITTI-V2. It starts with 2,719 manual annotations, addressing the issue of class imbalance and introducing more keywords to make it closer to real-world scenarios compared to Refer-KITTI. They are further expanded to a total of 9,758 annotations by prompting large language models, which create 617 different words, surpassing previous RMOT benchmarks. In addition, the end-to-end framework in RMOT is also bootstrapped by a simple yet elegant temporal advancement strategy, which achieves better performance than previous approaches. The source code and dataset is available at this https URL.

[NLP-8] Scenarios and Approaches for Situated Natural Language Explanations

Link: https://arxiv.org/abs/2406.05035
Authors: Pengshuo Qiu, Frank Rudzicz, Zining Zhu
Keywords: Large language models, Large language, Large, explanations, language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 8 pages, 4 figures

Abstract:Large language models (LLMs) can be used to generate natural language explanations (NLE) that are adapted to different users’ situations. However, there is yet to be a quantitative evaluation of the extent of such adaptation. To bridge this gap, we collect a benchmarking dataset, Situation-Based Explanation. This dataset contains 100 explanandums. Each explanandum is paired with explanations targeted at three distinct audience types-such as educators, students, and professionals-enabling us to assess how well the explanations meet the specific informational needs and contexts of these diverse groups e.g. students, teachers, and parents. For each “explanandum paired with an audience” situation, we include a human-written explanation. These allow us to compute scores that quantify how the LLMs adapt the explanations to the situations. On an array of pretrained language models with varying sizes, we examine three categories of prompting methods: rule-based prompting, meta-prompting, and in-context learning prompting. We find that 1) language models can generate prompts that result in explanations more precisely aligned with the target situations, 2) explicitly modeling an “assistant” persona by prompting “You are a helpful assistant…” is not a necessary prompt technique for situated NLE tasks, and 3) the in-context learning prompts only can help LLMs learn the demonstration template but can’t improve their inference performance. SBE and our analysis facilitate future research towards generating situated natural language explanations.

[NLP-9] CHIQ: Contextual History Enhancement for Improving Query Rewriting in Conversational Search

Link: https://arxiv.org/abs/2406.05013
Authors: Fengran Mo, Abbas Ghaddar, Kelong Mao, Mehdi Rezagholizadeh, Boxing Chen, Qun Liu, Jian-Yun Nie
Keywords: improving query rewriting, large language models, open-source large language, query rewriting, large language
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Abstract:In this paper, we study how open-source large language models (LLMs) can be effectively deployed for improving query rewriting in conversational search, especially for ambiguous queries. We introduce CHIQ, a two-step method that leverages the capabilities of LLMs to resolve ambiguities in the conversation history before query rewriting. This approach contrasts with prior studies that predominantly use closed-source LLMs to directly generate search queries from conversation history. We demonstrate on five well-established benchmarks that CHIQ leads to state-of-the-art results across most settings, showing highly competitive performances with systems leveraging closed-source LLMs. Our study provides a first step towards leveraging open-source LLMs in conversational search, as a competitive alternative to the prevailing reliance on commercial LLMs. Data, models, and source code will be publicly available upon acceptance at this https URL.

[NLP-10] Compositional Generalization with Grounded Language Models

Link: https://arxiv.org/abs/2406.04989
Authors: Sondre Wold, Étienne Simon, Lucas Georges Gabriel Charpentier, Egor V. Kostylev, Erik Velldal, Lilja Øvrelid
Keywords: Grounded language models, Grounded language, knowledge graphs, external sources, general challenges
Categories: Computation and Language (cs.CL)
Comments: ACL 2024, Findings

Abstract:Grounded language models use external sources of information, such as knowledge graphs, to meet some of the general challenges associated with pre-training. By extending previous work on compositional generalization in semantic parsing, we allow for a controlled evaluation of the degree to which these models learn and generalize from patterns in knowledge graphs. We develop a procedure for generating natural language questions paired with knowledge graphs that targets different aspects of compositionality and further avoids grounding the language models in information already encoded implicitly in their weights. We evaluate existing methods for combining language models with knowledge graphs and find them to struggle with generalization to sequences of unseen lengths and to novel combinations of seen base components. While our experimental results provide some insight into the expressive power of these models, we hope our work and released datasets motivate future research on how to better combine language models with structured knowledge representations.

[NLP-11] Language models emulate certain cognitive profiles: An investigation of how predictability measures interact with individual differences

Link: https://arxiv.org/abs/2406.04988
Authors: Patrick Haller, Lena S. Bolliger, Lena A. Jäger
Keywords: disregarding individual differences, surprisal and entropy, reading times, predictive power, power of surprisal
Categories: Computation and Language (cs.CL)
Comments: Accepted at ACL 2024

Abstract:To date, most investigations on surprisal and entropy effects in reading have been conducted on the group level, disregarding individual differences. In this work, we revisit the predictive power of surprisal and entropy measures estimated from a range of language models (LMs) on data of human reading times as a measure of processing effort by incorporating information of language users’ cognitive capacities. To do so, we assess the predictive power of surprisal and entropy estimated from generative LMs on reading data obtained from individuals who also completed a wide range of psychometric tests. Specifically, we investigate if modulating surprisal and entropy relative to cognitive scores increases prediction accuracy of reading times, and we examine whether LMs exhibit systematic biases in the prediction of reading times for cognitively high- or low-performing groups, revealing what type of psycholinguistic subject a given LM emulates. Our study finds that in most cases, incorporating cognitive capacities increases predictive power of surprisal and entropy on reading times, and that generally, high performance in the psychometric tests is associated with lower sensitivity to predictability effects. Finally, our results suggest that the analyzed LMs emulate readers with lower verbal intelligence, suggesting that for a given target group (i.e., individuals with high verbal intelligence), these LMs provide less accurate predictability estimates.
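
For readers unfamiliar with the two predictability measures named in this abstract, the sketch below computes them from a next-word distribution using their standard definitions: surprisal is the negative log probability of the word that actually occurred, and entropy is the expected surprisal over all candidates. The toy distribution is invented for illustration; in the paper these quantities are estimated from language models.

```python
import math


def surprisal(p):
    """Surprisal in bits of an outcome with probability p: -log2(p)."""
    return -math.log2(p)


def entropy(dist):
    """Shannon entropy in bits of a distribution given as {outcome: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


# Invented next-word distribution for some context, e.g. "The pet was a ...".
next_word = {"cat": 0.5, "dog": 0.25, "fish": 0.25}

# Surprisal depends on the word that actually follows.
print(surprisal(next_word["cat"]), surprisal(next_word["fish"]))

# Entropy depends only on the distribution, before the word is seen.
print(entropy(next_word))
```

A less probable continuation has higher surprisal, while entropy summarizes how uncertain the prediction is before the word is seen.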

[NLP-12] MEFT: Memory-Efficient Fine-Tuning through Sparse Adapter

Link: https://arxiv.org/abs/2406.04984
Authors: Jitai Hao, WeiWei Sun, Xin Xin, Qi Meng, Zhumin Chen, Pengjie Ren, Zhaochun Ren
Keywords: Large Language Models, Large Language, Language Models, Central Processing Unit, Graphics Processing Unit
Categories: Computation and Language (cs.CL)
Comments: ACL 24

Abstract:Parameter-Efficient Fine-tuning (PEFT) facilitates the fine-tuning of Large Language Models (LLMs) under limited resources. However, the fine-tuning performance with PEFT on complex, knowledge-intensive tasks is limited due to the constrained model capacity, which originates from the limited number of additional trainable parameters. To overcome this limitation, we introduce a novel mechanism that fine-tunes LLMs with adapters of larger size yet memory-efficient. This is achieved by leveraging the inherent activation sparsity in the Feed-Forward Networks (FFNs) of LLMs and utilizing the larger capacity of Central Processing Unit (CPU) memory compared to Graphics Processing Unit (GPU). We store and update the parameters of larger adapters on the CPU. Moreover, we employ a Mixture of Experts (MoE)-like architecture to mitigate unnecessary CPU computations and reduce the communication volume between the GPU and CPU. This is particularly beneficial over the limited bandwidth of PCI Express (PCIe). Our method can achieve fine-tuning results comparable to those obtained with larger memory capacities, even when operating under more limited resources such as a 24GB memory single GPU setup, with acceptable loss in training efficiency. Our codes are available at this https URL.
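
The activation-sparsity argument in this abstract can be illustrated with a small sketch. It is not the MEFT implementation; it only assumes a ReLU-style adapter and shows that the output depends solely on the hidden units that are active, so only the matching slice of the up-projection would need to be moved between devices. All names and matrices are hypothetical.

```python
def relu(v):
    """Element-wise ReLU."""
    return [max(0.0, x) for x in v]


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def dense_adapter(x, W_down, W_up):
    """Reference computation that touches every column of W_up."""
    return matvec(W_up, relu(matvec(W_down, x)))


def sparse_adapter(x, W_down, W_up):
    """Same result, but only columns of W_up for active hidden units are read."""
    hidden = relu(matvec(W_down, x))
    active = [i for i, v in enumerate(hidden) if v > 0.0]
    out = [sum(row[i] * hidden[i] for i in active) for row in W_up]
    return out, active


x = [1.0, -1.0]
W_down = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]  # 4 hidden units
W_up = [[2.0, 3.0, 4.0, 5.0], [1.0, 1.0, 1.0, 1.0]]

out, active = sparse_adapter(x, W_down, W_up)
print(out, active)
print(dense_adapter(x, W_down, W_up))
```

In this toy case only one of the four hidden units is active, so three quarters of the up-projection is never read while the output matches the dense computation.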

[NLP-13] Quantifying Geospatial in the Common Crawl Corpus

Link: https://arxiv.org/abs/2406.04952
Authors: Ilya Ilyankou, Meihui Wang, James Haworth, Stefano Cavazzi
Keywords: Large language models, vast unlabelled text, Common Crawl corpus, exhibit emerging geospatial, emerging geospatial capabilities
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Large language models (LLMs) exhibit emerging geospatial capabilities, stemming from their pre-training on vast unlabelled text datasets that are often derived from the Common Crawl corpus. However, the geospatial content within CC remains largely unexplored, impacting our understanding of LLMs’ spatial reasoning. This paper investigates the prevalence of geospatial data in recent Common Crawl releases using Gemini, a powerful language model. By analyzing a sample of documents and manually revising the results, we estimate that between 1 in 5 and 1 in 6 documents contain geospatial information such as coordinates and street addresses. Our findings provide quantitative insights into the nature and extent of geospatial data within Common Crawl, and web crawl data in general. Furthermore, we formulate questions to guide future investigations into the geospatial content of available web crawl datasets and its influence on LLMs.

[NLP-14] BAMO at SemEval-2024 Task 9: BRAINTEASER: A Novel Task Defying Common Sense

Link: https://arxiv.org/abs/2406.04947
Authors: Baktash Ansari, Mohammadmostafa Rostamkhani, Sauleh Eetemadi
Keywords: Defying Common Sense, Task Defying Common, Common Sense, Defying Common, Task Defying
Categories: Computation and Language (cs.CL)
Comments: 9 pages, 8 tables, 5 figures

Abstract:This paper outlines our approach to SemEval 2024 Task 9, BRAINTEASER: A Novel Task Defying Common Sense. The task aims to evaluate the ability of language models to think creatively. The dataset comprises multi-choice questions that challenge models to think “outside of the box”. We fine-tune 2 models, BERT and RoBERTa Large. Next, we employ a Chain of Thought (CoT) zero-shot prompting approach with 6 large language models, such as GPT-3.5, Mixtral, and Llama2. Finally, we utilize ReConcile, a technique that employs a “round table conference” approach with multiple agents for zero-shot learning, to generate consensus answers among 3 selected language models. Our best method achieves an overall accuracy of 85 percent on the sentence puzzles subtask.

[NLP-15] TCMD: A Traditional Chinese Medicine QA Dataset for Evaluating Large Language Models
[NLP-15] TCMD:用于评估大型语言模型的中医QA数据集

链接: https://arxiv.org/abs/2406.04941
作者: Ping Yu,Kaitao Song,Fengchen He,Ming Chen,Jianfeng Lu
关键词: Large Language Models, advanced medical-domain models, Large Language, Language Models, recently unprecedented advancements
中文关键词: 大型语言模型、高级医学领域模型、大型语言、语言模型、最近前所未有的进步
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The recently unprecedented advancements in Large Language Models (LLMs) have propelled the medical community by establishing advanced medical-domain models. However, due to the limited collection of medical datasets, there are only a few comprehensive benchmarks available to gauge progress in this area. In this paper, we introduce a new medical question-answering (QA) dataset that contains massive manual instruction for solving Traditional Chinese Medicine examination tasks, called TCMD. Specifically, our TCMD collects massive questions across diverse domains with their annotated medical subjects and thus supports us in comprehensively assessing the capability of LLMs in the TCM domain. Extensive evaluation of various general LLMs and medical-domain-specific LLMs is conducted. Moreover, we also analyze the robustness of current LLMs in solving TCM QA tasks by introducing randomness. The inconsistency of the experimental results also reveals the shortcomings of current LLMs in solving QA tasks. We also expect that our dataset can further facilitate the development of LLMs in the TCM area.
摘要:大型语言模型(LLM)最近取得了前所未有的进步,通过建立高级医学领域模型推动了医学界的发展。然而,由于医学数据集的收集有限,只有少数几个综合基准可用于衡量这一领域的进展。在本文中,我们介绍了一个新的医学问答(QA)数据集,称为TCMD,它包含大量用于解决中医考试任务的人工指令。具体地说,TCMD收集了不同领域的海量问题及其标注的医学科目,从而支持我们全面评估LLM在中医领域的能力。我们对各种通用LLM和医学领域专用LLM进行了广泛的评估。此外,我们还通过引入随机性分析了现有LLM在解决中医问答任务时的稳健性。实验结果的不一致也揭示了现有LLM在解决QA任务上的不足。我们也期待我们的数据集能够进一步促进LLM在中医领域的发展。

[NLP-16] Through the Thicket: A Study of Number-Oriented LLMs derived from Random Forest Models
[NLP-16] 穿越丛林:从随机森林模型衍生的面向数字的LLM研究

链接: https://arxiv.org/abs/2406.04926
作者: Michał Romaszewski,Przemysław Sekuła,Przemysław Głomb,Michał Cholewa,Katarzyna Kołodziej
关键词: shown exceptional performance, text processing, Large Language Models, shown exceptional, Large Language
中文关键词: 表现出出色的性能,文本处理,大型语言模型,表现出出色,大型语言
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have shown exceptional performance in text processing. Notably, LLMs can synthesize information from large datasets and explain their decisions similarly to human reasoning through a chain of thought (CoT). An emerging application of LLMs is the handling and interpreting of numerical data, where fine-tuning enhances their performance over basic inference methods. This paper proposes a novel approach to training LLMs using knowledge transfer from a random forest (RF) ensemble, leveraging its efficiency and accuracy. By converting RF decision paths into natural language statements, we generate outputs for LLM fine-tuning, enhancing the model’s ability to classify and explain its decisions. Our method includes verifying these rules through established classification metrics, ensuring their correctness. We also examine the impact of preprocessing techniques on the representation of numerical data and their influence on classification accuracy and rule correctness.
摘要:大型语言模型(LLM)在文本处理中表现出优异的性能。值得注意的是,LLM可以从大型数据集中综合信息,并通过思维链(CoT)以类似于人类推理的方式解释其决策。LLM的一个新兴应用是处理和解释数值数据,其中微调相对于基本推理方法提升了其性能。本文提出了一种利用随机森林(RF)集成的知识迁移来训练LLM的新方法,该方法利用了随机森林的效率和准确性。通过将RF决策路径转换为自然语言语句,我们生成用于LLM微调的输出,从而增强了模型分类和解释其决策的能力。我们的方法包括通过成熟的分类指标来验证这些规则,确保其正确性。我们还考察了预处理技术对数值数据表示的影响,以及它们对分类精度和规则正确性的影响。
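The step of turning a decision path into a natural language statement can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the hand-written `TREE`, the feature names, the thresholds, and the sentence template are hypothetical, and the paper's actual conversion of random forest paths may differ.

```python
# Hypothetical hand-written decision tree; in practice each tree would come
# from a trained random forest. Feature names and thresholds are made up.
TREE = {
    "feature": "petal_length", "threshold": 2.5,
    "left": {"label": "setosa"},
    "right": {
        "feature": "petal_width", "threshold": 1.7,
        "left": {"label": "versicolor"},
        "right": {"label": "virginica"},
    },
}


def decision_path(tree, sample):
    """Walk the tree for one sample and collect the conditions that were tested."""
    conditions = []
    node = tree
    while "label" not in node:
        name, thr = node["feature"], node["threshold"]
        if sample[name] <= thr:
            conditions.append(f"{name} is at most {thr}")
            node = node["left"]
        else:
            conditions.append(f"{name} is greater than {thr}")
            node = node["right"]
    return conditions, node["label"]


def path_to_sentence(tree, sample):
    """Render the decision path as a sentence usable as a fine-tuning target."""
    conditions, label = decision_path(tree, sample)
    return f"Because {' and '.join(conditions)}, the class is {label}."


if __name__ == "__main__":
    print(path_to_sentence(TREE, {"petal_length": 4.8, "petal_width": 1.5}))
```

Each generated sentence pairs a prediction with the rule that produced it, which is the kind of output the abstract describes using for fine-tuning.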

[NLP-17] Sexism Detection on a Data Diet
[NLP-17] 数据饮食中的性别歧视检测

链接: https://arxiv.org/abs/2406.04892
作者: Rabiraj Bandyopadhyay,Dennis Assenmacher,Jose M. Alonso Moral,Claudia Wagner
关键词: online hate commensurate, Natural Language Processing, Deep Learning models, training Deep Learning, Deep Learning
中文关键词: 在线仇恨相称、自然语言处理、深度学习模型、训练深度学习、深度学习
类目: Computation and Language (cs.CL)
备注: Accepted at ACM WebSci 2024 Workshop in DHOW: Diffusion of Harmful Content on Online Web Workshop

点击查看摘要

Abstract:There is an increase in the proliferation of online hate commensurate with the rise in the usage of social media. In response, there is also a significant advancement in the creation of automated tools aimed at identifying harmful text content using approaches grounded in Natural Language Processing and Deep Learning. Although it is known that training Deep Learning models require a substantial amount of annotated data, recent line of work suggests that models trained on specific subsets of the data still retain performance comparable to the model that was trained on the full dataset. In this work, we show how we can leverage influence scores to estimate the importance of a data point while training a model and designing a pruning strategy applied to the case of sexism detection. We evaluate the model performance trained on data pruned with different pruning strategies on three out-of-domain datasets and find, that in accordance with other work a large fraction of instances can be removed without significant performance drop. However, we also discover that the strategies for pruning data, previously successful in Natural Language Inference tasks, do not readily apply to the detection of harmful content and instead amplify the already prevalent class imbalance even more, leading in the worst-case to a complete absence of the hateful class.
摘要:随着社交媒体使用率的上升,网络仇恨的扩散也在增加。作为回应,旨在利用基于自然语言处理和深度学习的方法识别有害文本内容的自动化工具也取得了重大进展。尽管众所周知,训练深度学习模型需要大量的标注数据,但最近的一系列工作表明,在特定数据子集上训练的模型仍然保持与在整个数据集上训练的模型相当的性能。在这项工作中,我们展示了如何在训练模型时利用影响分数来估计数据点的重要性,并设计了一种应用于性别歧视检测的剪枝策略。我们在三个域外数据集上评估了在经不同剪枝策略剪枝的数据上训练的模型的性能,发现与其他工作一致,可以在不显著降低性能的情况下删除很大一部分实例。然而,我们也发现,以前在自然语言推理任务中成功的数据剪枝策略并不能直接适用于有害内容的检测,反而会进一步放大本已普遍存在的类别不平衡,在最坏的情况下导致仇恨类别完全缺失。
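The worst case mentioned in the abstract, where pruning removes the minority class entirely, can be reproduced on toy data. This is a sketch under stated assumptions: the scores below are invented rather than real influence scores, and "keep the highest-scoring fraction" is only one possible ranking rule, not necessarily the one the paper uses.

```python
from collections import Counter


def prune_by_score(dataset, keep_fraction):
    """Keep the top `keep_fraction` of examples, ranked by score (highest first)."""
    ranked = sorted(dataset, key=lambda ex: ex["score"], reverse=True)
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:n_keep]


def class_counts(dataset):
    """Count examples per label so the imbalance can be inspected after pruning."""
    return Counter(ex["label"] for ex in dataset)


# Toy data (hypothetical): 8 majority examples with high scores,
# 2 minority examples with low scores.
DATA = (
    [{"label": "not_sexist", "score": 0.9 - 0.05 * i} for i in range(8)]
    + [{"label": "sexist", "score": 0.2}, {"label": "sexist", "score": 0.1}]
)

if __name__ == "__main__":
    print("before:", dict(class_counts(DATA)))
    print("after: ", dict(class_counts(prune_by_score(DATA, 0.5))))
```

Checking `class_counts` after every pruning step is a cheap guard against this failure mode. A per-class pruning budget would be one possible mitigation, although the abstract does not say whether the authors use one.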

[NLP-18] Seeing the Unseen: Visual Metaphor Captioning for Videos
[NLP-18] 看到看不见的:视频的视觉隐喻字幕

链接: https://arxiv.org/abs/2406.04886
作者: Abisek Rajakumar Kalarani,Pushpak Bhattacharyya,Sumit Shekhar
关键词: common communication tool, common communication, communication tool, Metaphors, Average Concept Distance
中文关键词: 共同沟通工具,共同沟通,沟通工具,隐喻,平均概念距离
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Metaphors are a common communication tool used in our day-to-day life. The detection and generation of metaphors in textual form have been studied extensively but metaphors in other forms have been under-explored. Recent studies have shown that Vision-Language (VL) models cannot understand visual metaphors in memes and adverts. As of now, no probing studies have been done that involve complex language phenomena like metaphors with videos. Hence, we introduce a new VL task of describing the metaphors present in the videos in our work. To facilitate this novel task, we construct and release a manually created dataset with 705 videos and 2115 human-written captions, along with a new metric called Average Concept Distance (ACD), to automatically evaluate the creativity of the metaphors generated. We also propose a novel low-resource video metaphor captioning system: GIT-LLaVA, which obtains comparable performance to SoTA video language models on the proposed task. We perform a comprehensive analysis of existing video language models on this task and publish our dataset, models, and benchmark results to enable further research.
摘要:隐喻是我们日常生活中常用的交际工具。文本形式隐喻的检测和生成已得到广泛研究,但其他形式的隐喻研究较少。最近的研究表明,视觉语言(VL)模型不能理解模因和广告中的视觉隐喻。到目前为止,还没有针对视频中隐喻这类复杂语言现象的探测性研究。因此,我们在工作中引入了一个新的VL任务:描述视频中出现的隐喻。为了促进这一新颖的任务,我们构建并发布了一个手动创建的数据集,其中包含705个视频和2115个人工编写的字幕,以及一个名为平均概念距离(ACD)的新指标,用于自动评估所生成隐喻的创造力。我们还提出了一种新的低资源视频隐喻字幕系统GIT-LLaVA,它在所提出的任务上获得了与SoTA视频语言模型相当的性能。我们对这项任务的现有视频语言模型进行了全面分析,并发布了我们的数据集、模型和基准测试结果,以便于进一步研究。

[NLP-19] InstructNav: Zero-shot System for Generic Instruction Navigation in Unexplored Environment
[NLP-19] InstructNav:用于在未探索环境中进行通用指令导航的零样本系统

链接: https://arxiv.org/abs/2406.04882
作者: Yuxing Long,Wenzhe Cai,Hongcheng Wang,Guanqi Zhan,Hao Dong
关键词: Enabling robots, navigation, human-robot interaction, instruction navigation, instruction
中文关键词: 启用机器人、导航、人机交互、指令导航、指令
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to CoRL 2024

点击查看摘要

Abstract:Enabling robots to navigate following diverse language instructions in unexplored environments is an attractive goal for human-robot interaction. However, this goal is challenging because different navigation tasks require different strategies. The scarcity of instruction navigation data hinders training an instruction navigation model with varied strategies. Therefore, previous methods are all constrained to one specific type of navigation instruction. In this work, we propose InstructNav, a generic instruction navigation system. InstructNav makes the first endeavor to handle various instruction navigation tasks without any navigation training or pre-built maps. To reach this goal, we introduce Dynamic Chain-of-Navigation (DCoN) to unify the planning process for different types of navigation instructions. Furthermore, we propose Multi-sourced Value Maps to model key elements in instruction navigation so that linguistic DCoN planning can be converted into robot actionable trajectories. With InstructNav, we complete the R2R-CE task in a zero-shot way for the first time and outperform many task-training methods. Besides, InstructNav also surpasses the previous SOTA method by 10.48% on the zero-shot Habitat ObjNav and by 86.34% on demand-driven navigation DDN. Real robot experiments on diverse indoor scenes further demonstrate our method’s robustness in coping with the environment and instruction variations.
摘要:使机器人能够在未探索的环境中遵循多样的语言指令进行导航,是人机交互的一个有吸引力的目标。然而,这一目标具有挑战性,因为不同的导航任务需要不同的策略。指令导航数据的匮乏阻碍了训练具有多种策略的指令导航模型。因此,以前的方法都局限于一种特定类型的导航指令。在这项工作中,我们提出了一个通用的指令导航系统InstructNav。InstructNav首次尝试在没有任何导航训练或预建地图的情况下处理各种指令导航任务。为了实现这一目标,我们引入了动态导航链(DCoN)来统一不同类型导航指令的规划过程。此外,我们提出了多源价值图来对指令导航中的关键元素进行建模,以便将语言形式的DCoN规划转化为机器人可执行的轨迹。借助InstructNav,我们首次以零样本方式完成了R2R-CE任务,并超越了许多经过任务训练的方法。此外,InstructNav在零样本Habitat ObjNav上超过以前的SOTA方法10.48%,在需求驱动导航DDN上超过86.34%。在多种室内场景中的真实机器人实验进一步证明了该方法在应对环境和指令变化方面的鲁棒性。

[NLP-20] A Deep Dive into the Trade-Offs of Parameter-Efficient Preference Alignment Techniques
[NLP-20] 深入研究参数高效偏好对齐技术的权衡

链接: https://arxiv.org/abs/2406.04879
作者: Megh Thakkar,Quentin Fournier,Matthew D Riemer,Pin-Yu Chen,Amal Zouaq,Payel Das,Sarath Chandar
关键词: Large language models, Large language, pre-trained on trillions, trillions of tokens, instruction-tuned or aligned
中文关键词: 大型语言模型,大型语言,在数万亿个词元上预训练,数万亿个词元,经过指令调优或对齐
类目: Computation and Language (cs.CL)
备注: Accepted to ACL (Main) 2024

点击查看摘要

Abstract:Large language models are first pre-trained on trillions of tokens and then instruction-tuned or aligned to specific preferences. While pre-training remains out of reach for most researchers due to the compute required, fine-tuning has become affordable thanks to parameter-efficient methods such as LoRA and QLoRA. Alignment is known to be sensitive to the many factors involved, including the quantity and quality of data, the alignment method, and the adapter rank. However, there has not yet been an extensive study of their effect on downstream performance. To address this gap, we conduct an in-depth investigation of the impact of popular choices for three crucial axes: (i) the alignment dataset (HH-RLHF and BeaverTails), (ii) the alignment technique (SFT and DPO), and (iii) the model (LLaMA-1, Vicuna-v1.3, Mistral-7b, and Mistral-7b-Instruct). Our extensive setup spanning over 300 experiments reveals consistent trends and unexpected findings. We observe how more informative data helps with preference alignment, cases where supervised fine-tuning outperforms preference optimization, and how aligning to a distinct preference boosts performance on downstream tasks. Through our in-depth analyses, we put forward key guidelines to help researchers perform more effective parameter-efficient LLM alignment.
摘要:大型语言模型首先在数万亿个词元上进行预训练,然后进行指令调优或与特定偏好对齐。虽然由于所需的计算量,预训练对大多数研究人员来说仍然遥不可及,但得益于LoRA和QLoRA等参数高效方法,微调已经变得负担得起。众所周知,对齐对所涉及的许多因素很敏感,包括数据的数量和质量、对齐方法和适配器的秩。然而,关于它们对下游性能的影响还没有广泛的研究。为了弥补这一差距,我们深入调查了三个关键维度上常见选择的影响:(i)对齐数据集(HH-RLHF和BeaverTails),(ii)对齐技术(SFT和DPO),以及(iii)模型(LLaMA-1、Vicuna-v1.3、Mistral-7b和Mistral-7b-Instruct)。我们涵盖300多个实验的广泛设置揭示了一致的趋势和意想不到的发现。我们观察到信息量更大的数据如何帮助偏好对齐、监督微调优于偏好优化的情形,以及对齐到某一特定偏好如何提升下游任务的性能。通过深入分析,我们提出了关键的指导方针,以帮助研究人员进行更有效的参数高效LLM对齐。

[NLP-21] HateDebias: On the Diversity and Variability of Hate Speech Debiasing
[NLP-21] HateDebias:关于仇恨言论去偏的多样性和可变性

链接: https://arxiv.org/abs/2406.04876
作者: Nankai Lin,Hongyan Wu,Zhengming Chen,Zijian Li,Lianxi Wang,Shengyi Jiang,Dong Zhou,Aimin Yang
关键词: hate speech detection, Hate speech, speech detection, urgently controlled, existing hate speech
中文关键词: 仇恨言论检测,仇恨言论,言论检测,亟待控制,现有仇恨言论
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Hate speech on social media is ubiquitous but urgently controlled. Without detecting and mitigating the biases brought by hate speech, different types of ethical problems. While a number of datasets have been proposed to address the problem of hate speech detection, these datasets seldom consider the diversity and variability of bias, making it far from real-world scenarios. To fill this gap, we propose a benchmark, named HateDebias, to analyze the model ability of hate speech detection under continuous, changing environments. Specifically, to meet the diversity of biases, we collect existing hate speech detection datasets with different types of biases. To further meet the variability (i.e., the changing of bias attributes in datasets), we reorganize datasets to follow the continuous learning setting. We evaluate the detection accuracy of models trained on the datasets with a single type of bias with the performance on the HateDebias, where a significant performance drop is observed. To provide a potential direction for debiasing, we further propose a debiasing framework based on continuous learning and bias information regularization, as well as the memory replay strategies to ensure the debiasing ability of the model. Experiment results on the proposed benchmark show that the aforementioned method can improve several baselines with a distinguished margin, highlighting its effectiveness in real-world applications.
摘要:社交媒体上的仇恨言论无处不在,亟待控制。在没有发现和减轻仇恨言论带来的偏见的情况下,不同类型的伦理问题。虽然已经提出了一些数据集来解决仇恨言论检测的问题,但这些数据集很少考虑偏见的多样性和可变性,使其与真实世界的场景相去甚远。为了填补这一空白,我们提出了一个名为HateDebias的基准,用于分析模型在连续、变化的环境下检测仇恨言论的能力。具体地说,为了满足偏见的多样性,我们收集了现有的带有不同类型偏见的仇恨言论检测数据集。为了进一步满足可变性(即数据集中偏见属性的变化),我们重新组织数据集以遵循持续学习的设置。我们评估了在具有单一类型偏见的数据集上训练的模型的检测精度及其在HateDebias上的性能,观察到了显著的性能下降。为了为去偏提供一个潜在的方向,我们进一步提出了一个基于持续学习和偏见信息正则化的去偏框架,以及用于保证模型去偏能力的记忆重放策略。在所提出的基准上的实验结果表明,上述方法能够以显著的幅度改进多个基线,突出了其在实际应用中的有效性。

[NLP-22] ComplexTempQA: A Large-Scale Dataset for Complex Temporal Question Answering
[NLP-22] ComplexTempQA:用于复杂时态问答的大规模数据集

链接: https://arxiv.org/abs/2406.04866
作者: Raphael Gruber,Abdelrahman Abdallah,Michael Färber,Adam Jatowt
关键词: million question-answer pairs, question-answer pairs designed, introduce ComplexTempQA,a large-scale, ComplexTempQA,a large-scale dataset, large-scale dataset consisting
中文关键词: 百万个问答对,设计的问答对,引入ComplexTempQA,一个大规模,ComplexTempQA,一个大规模数据集,大规模数据集由
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We introduce ComplexTempQA, a large-scale dataset consisting of over 100 million question-answer pairs designed to tackle the challenges in temporal question answering. ComplexTempQA significantly surpasses existing benchmarks like HOTPOTQA, TORQUE, and TEQUILA in scale and scope. Utilizing data from Wikipedia and Wikidata, the dataset covers questions spanning over two decades and offers an unmatched breadth of topics. We introduce a unique taxonomy that categorizes questions as attributes, comparisons, and counting questions, each revolving around events, entities, and time periods. One standout feature of ComplexTempQA is the high complexity of its questions, which demand effective capabilities for answering such as across-time comparison, temporal aggregation, and multi-hop reasoning involving temporal event ordering and entity recognition. Additionally, each question is accompanied by detailed metadata, including specific time scopes, allowing for comprehensive evaluation and enhancement of the temporal reasoning abilities of large language models. ComplexTempQA serves both as a testing ground for developing sophisticated AI models and as a foundation for advancing research in question answering, information retrieval, and language understanding. Dataset and code are freely available at: this https URL.
摘要:我们介绍了ComplexTempQA,这是一个由超过1亿个问答对组成的大规模数据集,旨在解决时态问答中的挑战。ComplexTempQA在规模和范围上大大超过了HOTPOTQA、TORQUE和TEQUILA等现有基准。利用维基百科和Wikidata的数据,该数据集涵盖了跨越20多年的问题,并提供了无与伦比的主题广度。我们引入了一种独特的分类体系,将问题分为属性、比较和计数问题,每类问题都围绕事件、实体和时间段展开。ComplexTempQA的一个突出特点是其问题的高度复杂性,回答这些问题需要跨时间比较、时态聚合以及涉及时态事件排序和实体识别的多跳推理等能力。此外,每个问题都附有详细的元数据,包括具体的时间范围,以便全面评估和增强大型语言模型的时间推理能力。ComplexTempQA既是开发复杂人工智能模型的试验场,也是推进问答、信息检索和语言理解研究的基础。数据集和代码可在以下网址免费获得:此 https URL。

[NLP-23] The Russian Legislative Corpus
[NLP-23] 俄罗斯立法语料库

链接: https://arxiv.org/abs/2406.04855
作者: Denis Saveliev,Ruslan Kuchakov
关键词: comprehensive Russian primary, legislation corpus covering, secondary legislation corpus, comprehensive Russian, Russian primary
中文关键词: 综合俄语初级,立法文集涵盖,二级立法文集,综合俄语,俄罗斯初级
类目: Computation and Language (cs.CL)
备注: 7 pages, 6 figures, 1 table

点击查看摘要

Abstract:We present the comprehensive Russian primary and secondary legislation corpus covering 1991 to 2023. The corpus collects all 281,413 texts (176,523,268 tokens) of non-secret federal regulations and acts, along with their metadata. The corpus has two versions: the original text with minimal preprocessing and a version prepared for linguistic analysis with morphosyntactic markup.
摘要:我们介绍了涵盖1991年至2023年的全面的俄罗斯一级和二级立法语料库。该语料库收集了非保密联邦法规和法案的全部281,413个文本(176,523,268个词元)及其元数据。该语料库有两个版本:一个是经过最少预处理的原始文本,另一个是带有形态句法标注、为语言学分析而准备的版本。

[NLP-24] Uncertainty Aware Learning for Language Model Alignment
[NLP-24] 语言模型对齐的不确定性感知学习

链接: https://arxiv.org/abs/2406.04854
作者: Yikun Wang,Rui Zheng,Liang Ding,Qi Zhang,Dahua Lin,Dacheng Tao
关键词: aligning pretrained foundation, presents increasing challenges, instruction-tuned large language, large language models, pretrained foundation models
中文关键词: 对齐预训练的基础,带来越来越大的挑战,指令调优的大型语言、大型语言模型、预训练的基础模型
类目: Computation and Language (cs.CL)
备注: ACL 2024

点击查看摘要

Abstract:As instruction-tuned large language models (LLMs) evolve, aligning pretrained foundation models presents increasing challenges. Existing alignment strategies, which typically leverage diverse and high-quality data sources, often overlook the intrinsic uncertainty of tasks, learning all data samples equally. This may lead to suboptimal data efficiency and model performance. In response, we propose uncertainty-aware learning (UAL) to improve the model alignment of different task scenarios, by introducing the sample uncertainty (elicited from more capable LLMs). We implement UAL in a simple fashion – adaptively setting the label smoothing value of training according to the uncertainty of individual samples. Analysis shows that our UAL indeed facilitates better token clustering in the feature space, validating our hypothesis. Extensive experiments on widely used benchmarks demonstrate that our UAL significantly and consistently outperforms standard supervised fine-tuning. Notably, LLMs aligned in a mixed scenario have achieved an average improvement of 10.62% on high-entropy tasks (i.e., AlpacaEval leaderboard), and 1.81% on complex low-entropy tasks (i.e., MetaMath and GSM8K).
摘要:随着指令调优的大型语言模型(LLM)的发展,对齐预训练的基础模型面临越来越大的挑战。现有的对齐策略通常利用多样且高质量的数据源,却往往忽略了任务的内在不确定性,对所有数据样本一视同仁地学习。这可能会导致次优的数据效率和模型性能。为此,我们提出了不确定性感知学习(UAL),通过引入样本不确定性(由能力更强的LLM得出)来改进不同任务场景下的模型对齐。我们以一种简单的方式实现UAL:根据单个样本的不确定性自适应地设置训练的标签平滑值。分析表明,我们的UAL确实有助于在特征空间中更好地进行词元聚类,从而验证了我们的假设。在广泛使用的基准上进行的大量实验表明,我们的UAL显著且一致地优于标准监督微调。值得注意的是,在混合场景中对齐的LLM在高熵任务(即AlpacaEval排行榜)上获得了10.62%的平均改进,在复杂的低熵任务(即MetaMath和GSM8K)上获得了1.81%的平均改进。
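Setting the label smoothing value per sample according to its uncertainty can be sketched in plain Python. The assumptions are stated loudly: the linear mapping in `smoothing_from_uncertainty` and the `max_smoothing` cap are hypothetical choices made for illustration, since the abstract does not say how uncertainty is converted into a smoothing value.

```python
import math


def smoothing_from_uncertainty(uncertainty, max_smoothing=0.2):
    """Map an uncertainty in [0, 1] to a smoothing value.

    The linear mapping and the cap are assumptions, not the paper's rule.
    """
    u = min(max(uncertainty, 0.0), 1.0)
    return max_smoothing * u


def smoothed_cross_entropy(logits, target, smoothing):
    """Cross-entropy against a label-smoothed target distribution.

    The target class gets probability (1 - smoothing) + smoothing / K and
    every other class gets smoothing / K, where K is the number of classes.
    """
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    k = len(logits)
    loss = 0.0
    for i, x in enumerate(logits):
        q = smoothing / k + (1.0 - smoothing if i == target else 0.0)
        loss -= q * (x - log_z)
    return loss


if __name__ == "__main__":
    logits = [2.0, 0.0, 0.0]
    for u in (0.0, 0.5, 1.0):
        s = smoothing_from_uncertainty(u)
        print(u, s, smoothed_cross_entropy(logits, 0, s))
```

With zero uncertainty the loss reduces to ordinary cross-entropy. Higher uncertainty softens the target, so a confident prediction on an uncertain sample is rewarded less.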

[NLP-25] Digital assistant in a point of sales
[NLP-25] 销售点的数字助理

链接: https://arxiv.org/abs/2406.04851
作者: Emilia Lesiak,Grzegorz Wolny,Bartosz Przybył,Michał Szczerbak
关键词: Voice User Interface, User Interface, Voice User, powered digital assistant, digital assistant
中文关键词: 语音用户界面,用户界面,语音用户,动力数字助理,数字助理
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: update: cleaned the unnecessary files and updated the metadata

点击查看摘要

Abstract:This article investigates the deployment of a Voice User Interface (VUI)-powered digital assistant in a retail setting and assesses its impact on customer engagement and service efficiency. The study explores how digital assistants can enhance user interactions through advanced conversational capabilities with multilingual support. By integrating a digital assistant into a high-traffic retail environment, we evaluate its effectiveness in improving the quality of customer service and operational efficiency. Data collected during the experiment demonstrate varied impacts on customer interaction, revealing insights into the future optimizations of digital assistant technologies in customer-facing roles. This study contributes to the understanding of digital transformation strategies within the customer relations domain emphasizing the need for service flexibility and user-centric design in modern retail stores.
摘要:本文调查了零售环境中基于语音用户界面(VUI)的数字助理的部署,并评估了其对客户参与度和服务效率的影响。该研究探讨了数字助理如何通过具有多语言支持的高级对话功能来增强用户交互。通过将数字助理集成到高流量零售环境中,我们评估其在提高客户服务质量和运营效率方面的有效性。实验期间收集的数据展示了对客户互动的各种影响,揭示了对面向客户角色中数字助理技术未来优化的见解。这项研究有助于理解客户关系领域的数字化转型策略,强调现代零售商店对服务灵活性和以用户为中心的设计的需求。

[NLP-26] Do Language Models Exhibit Human-like Structural Priming Effects?
[NLP-26] 语言模型是否表现出类人的结构启动效应?

链接: https://arxiv.org/abs/2406.04847
作者: Jaap Jumelet,Willem Zuidema,Arabella Sinclair
关键词: Gries and Kootstra, token level, explore which linguistic, sentence and token, role in influencing
中文关键词: Gries和Kootstra,词元层面,探索哪种语言,句子和词元,在影响中的作用
类目: Computation and Language (cs.CL)
备注: ACL Findings 2024

点击查看摘要

Abstract:We explore which linguistic factors – at the sentence and token level – play an important role in influencing language model predictions, and investigate whether these are reflective of results found in humans and human corpora (Gries and Kootstra, 2017). We make use of the structural priming paradigm, where recent exposure to a structure facilitates processing of the same structure. We don’t only investigate whether, but also where priming effects occur, and what factors predict them. We show that these effects can be explained via the inverse frequency effect, known in human priming, where rarer elements within a prime increase priming effects, as well as lexical dependence between prime and target. Our results provide an important piece in the puzzle of understanding how properties within their context affect structural prediction in language models.
摘要:我们探索哪些语言因素(在句子和词元层面)在影响语言模型预测方面发挥重要作用,并调查这些因素是否反映了在人类和人类语料库中发现的结果(Gries and Kootstra, 2017)。我们利用结构启动范式,即最近接触过某一结构会促进对相同结构的处理。我们不仅调查启动效应是否发生,还调查它在哪里发生,以及哪些因素可以预测它。我们表明,这些效应可以通过人类启动中已知的逆频率效应(即启动句中较罕见的元素会增强启动效应)以及启动句和目标句之间的词汇依赖性来解释。我们的结果为理解上下文中的属性如何影响语言模型中的结构预测这一难题提供了重要的一环。

[NLP-27] FedLLM-Bench: Realistic Benchmarks for Federated Learning of Large Language Models
[NLP-27] FedLLM-Bench:大型语言模型联邦学习的现实基准

链接: https://arxiv.org/abs/2406.04845
作者: Rui Ye,Rui Ge,Xinyu Zhu,Jingyi Chai,Yaxin Du,Yang Liu,Yanfeng Wang,Siheng Chen
关键词: enabled multiple parties, collaboratively train large, train large language, large language models, sharing their data
中文关键词: 支持多方、协作训练大型、训练大型语言、大型语言模型、共享数据
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注: 22 pages

点击查看摘要

Abstract:Federated learning has enabled multiple parties to collaboratively train large language models without directly sharing their data (FedLLM). Following this training paradigm, the community has put massive efforts from diverse aspects including framework, performance, and privacy. However, an unpleasant fact is that there are currently no realistic datasets and benchmarks for FedLLM and previous works all rely on artificially constructed datasets, failing to capture properties in real-world scenarios. Addressing this, we propose FedLLM-Bench, which involves 8 training methods, 4 training datasets, and 6 evaluation metrics, to offer a comprehensive testbed for the FedLLM community. FedLLM-Bench encompasses three datasets (e.g., user-annotated multilingual dataset) for federated instruction tuning and one dataset (e.g., user-annotated preference dataset) for federated preference alignment, whose scale of client number ranges from 38 to 747. Our datasets incorporate several representative diversities: language, quality, quantity, instruction, length, embedding, and preference, capturing properties in real-world scenarios. Based on FedLLM-Bench, we conduct experiments on all datasets to benchmark existing FL methods and provide empirical insights (e.g., multilingual collaboration). We believe that our FedLLM-Bench can benefit the FedLLM community by reducing required efforts, providing a practical testbed, and promoting fair comparisons. Code and datasets are available at this https URL.
摘要:联邦学习使多方能够在不直接共享数据的情况下协作训练大型语言模型(FedLLM)。遵循这一训练范式,社区从框架、性能和隐私等不同方面做出了巨大努力。然而,一个令人不快的事实是,目前还没有适用于FedLLM的真实数据集和基准,以前的工作都依赖于人工构建的数据集,无法捕捉真实世界场景中的属性。针对这一问题,我们提出了FedLLM-Bench,它包括8种训练方法、4个训练数据集和6个评估指标,为FedLLM社区提供了一个全面的试验平台。FedLLM-Bench包括用于联邦指令调优的三个数据集(例如,用户标注的多语言数据集)和用于联邦偏好对齐的一个数据集(例如,用户标注的偏好数据集),其客户端数量从38到747不等。我们的数据集包含了几种具有代表性的多样性:语言、质量、数量、指令、长度、嵌入和偏好,捕捉了真实世界场景中的属性。基于FedLLM-Bench,我们在所有数据集上进行实验,对现有的联邦学习(FL)方法进行基准测试,并提供经验见解(例如,多语言协作)。我们相信,FedLLM-Bench可以减少所需的工作量,提供实用的试验平台,并促进公平的比较,从而使FedLLM社区受益。代码和数据集可在此 https URL 上找到。

[NLP-28] Revisiting Catastrophic Forgetting in Large Language Model Tuning
[NLP-28] 重温大型语言模型调优中的灾难性遗忘

链接: https://arxiv.org/abs/2406.04836
作者: Hongyu Li,Liang Ding,Meng Fang,Dacheng Tao
关键词: forgetting previously acquired, models forgetting previously, previously acquired knowledge, Catastrophic Forgetting, forgetting previously
中文关键词: 忘记以前获得的、模型忘记以前、以前获得的知识、灾难性忘记、以前忘记
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Catastrophic Forgetting (CF) means models forgetting previously acquired knowledge when learning new data. It compromises the effectiveness of large language models (LLMs) during fine-tuning, yet the underlying causes have not been thoroughly investigated. This paper takes the first step to reveal the direct link between the flatness of the model loss landscape and the extent of CF in the field of LLMs. Based on this, we introduce the sharpness-aware minimization to mitigate CF by flattening the loss landscape. Experiments on three widely-used fine-tuning datasets, spanning different model scales, demonstrate the effectiveness of our method in alleviating CF. Analyses show that we nicely complement the existing anti-forgetting strategies, further enhancing the resistance of LLMs to CF.
摘要:灾难性遗忘(CF)是指模型在学习新数据时忘记以前获得的知识。它会在微调期间损害大型语言模型(LLM)的有效性,但其根本原因尚未得到彻底研究。本文迈出了第一步,揭示了模型损失景观的平坦度与LLM领域中CF程度之间的直接联系。基于此,我们引入锐度感知最小化,通过使损失景观变平坦来减轻CF。在跨越不同模型规模的三个广泛使用的微调数据集上进行的实验证明了我们的方法在缓解CF方面的有效性。分析表明,我们的方法很好地补充了现有的抗遗忘策略,进一步增强了LLM对CF的抵抗力。
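A single sharpness-aware minimization (SAM) update is easy to show on a toy problem. This is a minimal sketch, assuming a hand-made two-parameter quadratic loss and hypothetical values for `lr` and `rho`. It shows the generic two-step SAM update, not the authors' training setup for LLMs.

```python
import math


def loss(w):
    """Toy quadratic loss with one sharp and one flat direction (illustrative only)."""
    return 5.0 * w[0] ** 2 + 0.5 * w[1] ** 2


def grad(w):
    """Analytic gradient of the toy loss above."""
    return [10.0 * w[0], 1.0 * w[1]]


def sam_step(w, lr=0.05, rho=0.05):
    """One SAM update: perturb toward higher loss, then descend from the original point."""
    g = grad(w)
    norm = math.sqrt(sum(x * x for x in g))
    if norm == 0.0:
        return list(w)  # already at a stationary point, nothing to do
    # Ascent step: approximate worst-case weights inside an L2 ball of radius rho.
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # Descent step: apply the gradient measured at the perturbed point to the original weights.
    g_adv = grad(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]


if __name__ == "__main__":
    w = [1.0, 1.0]
    for _ in range(50):
        w = sam_step(w)
    print(w, loss(w))
```

Because the descent gradient is taken at the perturbed point, sharp directions contribute a larger gradient than they would under plain gradient descent, which pushes the optimizer toward flatter regions of the loss landscape.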

[NLP-29] Annotating FrameNet via Structure-Conditioned Language Generation
[NLP-29] 通过结构条件语言生成注释FrameNet

链接: https://arxiv.org/abs/2406.04834
作者: Xinyue Cui,Swabha Swayamdipta
关键词: producing naturalistic language, remarkable generative capabilities, structures remain understudied, language models, naturalistic language
中文关键词: 产生自然主义语言、非凡的生成能力、结构研究不足、语言模型、自然主义语言
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Despite the remarkable generative capabilities of language models in producing naturalistic language, their effectiveness on explicit manipulation and generation of linguistic structures remain understudied. In this paper, we investigate the task of generating new sentences preserving a given semantic structure, following the FrameNet formalism. We propose a framework to produce novel frame-semantically annotated sentences following an overgenerate-and-filter approach. Our results show that conditioning on rich, explicit semantic information tends to produce generations with high human acceptance, under both prompting and finetuning. Our generated frame-semantic structured annotations are effective at training data augmentation for frame-semantic role labeling in low-resource settings; however, we do not see benefits under higher resource settings. Our study concludes that while generating high-quality, semantically rich data might be within reach, the downstream utility of such generations remains to be seen, highlighting the outstanding challenges with automating linguistic annotation tasks.
摘要:尽管语言模型在产生自然语言方面具有显著的生成能力,但它们在显式操纵和生成语言结构方面的有效性仍未得到充分研究。在本文中,我们研究了遵循FrameNet形式体系、在保持给定语义结构的前提下生成新句子的任务。我们提出了一个框架,通过过度生成再过滤的方法产生新的带框架语义标注的句子。我们的结果表明,无论是提示还是微调,以丰富、明确的语义信息为条件都倾向于产生人类接受度较高的生成结果。我们生成的框架语义结构化标注在低资源环境下可有效用于框架语义角色标注的训练数据增强;然而,在较高资源环境下我们没有看到收益。我们的研究结论是,尽管生成高质量、语义丰富的数据可能触手可及,但此类生成结果的下游效用仍有待观察,这突显了自动化语言标注任务所面临的突出挑战。

[NLP-30] BERTs are Generative In-Context Learners
[NLP-30] BERT是生成式上下文学习器

链接: https://arxiv.org/abs/2406.04823
作者: David Samuel
关键词: in-context learning capabilities, challenging the common, paper explores, common view, in-context learning
中文关键词: 上下文学习能力,挑战普遍,论文探索,普遍观点,上下文学习
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 21 pages, preprint

点击查看摘要

Abstract:This paper explores the in-context learning capabilities of masked language models, challenging the common view that this ability does not ‘emerge’ in them. We present an embarrassingly simple inference technique that enables DeBERTa to operate as a generative model without any additional training. Our findings demonstrate that DeBERTa can match and even surpass GPT-3, its contemporary that famously introduced the paradigm of in-context learning. The comparative analysis reveals that the masked and causal language models behave very differently, as they clearly outperform each other on different categories of tasks. This suggests that there is great potential for a hybrid training approach that takes advantage of the strengths of both training objectives.
摘要:本文探讨了掩码语言模型的上下文学习能力,挑战了这种能力不会在其中“涌现”的普遍观点。我们提出了一种极其简单的推理技术,使DeBERTa能够在无需任何额外训练的情况下作为生成模型运行。我们的研究结果表明,DeBERTa可以媲美甚至超越GPT-3,后者是与其同时代的模型,并以引入上下文学习范式而闻名。比较分析表明,掩码语言模型和因果语言模型的表现非常不同,它们在不同类别的任务上明显各有胜负。这表明,利用两种训练目标优势的混合训练方法具有巨大潜力。

[NLP-31] Zero, Finite, and Infinite Belief History of Theory of Mind Reasoning in Large Language Models
[NLP-31] 大型语言模型中心理理论推理的零、有限和无限信念历史

链接: https://arxiv.org/abs/2406.04800
作者: Weizhi Tang,Vaishak Belle
关键词: Theory of Mind, Large Language Models, Infinite Belief History, Belief History, emergence of Theory
中文关键词: 心理理论、大型语言模型、无限信念历史、信念历史、理论的出现
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have recently shown a promise and emergence of Theory of Mind (ToM) ability and even outperform humans in certain ToM tasks. To evaluate and extend the boundaries of the ToM reasoning ability of LLMs, we propose a novel concept, taxonomy, and framework, the ToM reasoning with Zero, Finite, and Infinite Belief History and develop a multi-round text-based game, called Pick the Right Stuff, as a benchmark. We have evaluated six LLMs with this game and found their performance on Zero Belief History is consistently better than on Finite Belief History. In addition, we have found two of the models with small parameter sizes outperform all the evaluated models with large parameter sizes. We expect this work to pave the way for future ToM benchmark development and also for the promotion and development of more complex AI agents or systems which are required to be equipped with more complex ToM reasoning ability.
摘要:大型语言模型(LLM)最近显示出心理理论(ToM)能力的前景和出现,甚至在某些ToM任务中表现优于人类。为了评估和扩展LLM ToM推理能力的边界,我们提出了一个新颖的概念、分类体系和框架,即具有零、有限和无限信念历史的ToM推理,并开发了一个名为Pick the Right Stuff的多轮文本游戏作为基准。我们用这款游戏评估了六个LLM,发现它们在零信念历史上的表现始终优于在有限信念历史上的表现。此外,我们发现其中两个参数规模较小的模型优于所有参数规模较大的被评估模型。我们预计这项工作将为未来的ToM基准开发铺平道路,也将促进需要具备更复杂ToM推理能力的更复杂人工智能代理或系统的推广和开发。

[NLP-32] SelfGoal: Your Language Agents Already Know How to Achieve High-level Goals
[NLP-32] SelfGoal:您的语言代理已经知道如何实现高级目标

链接: https://arxiv.org/abs/2406.04784
作者: Ruihan Yang,Jiangjie Chen,Yikai Zhang,Siyu Yuan,Aili Chen,Kyle Richardson,Yanghua Xiao,Deqing Yang
关键词: large language models, gaming and programming, Language agents powered, powered by large, increasingly valuable
中文关键词: 大型语言模型、游戏和编程、语言代理驱动,由大型驱动,越来越有价值
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Preprint

Abstract:Language agents powered by large language models (LLMs) are increasingly valuable as decision-making tools in domains such as gaming and programming. However, these agents often face challenges in achieving high-level goals without detailed instructions and in adapting to environments where feedback is delayed. In this paper, we present SelfGoal, a novel automatic approach designed to enhance agents’ capabilities to achieve high-level goals with limited human prior and environmental feedback. The core concept of SelfGoal involves adaptively breaking down a high-level goal into a tree structure of more practical subgoals during the interaction with environments while identifying the most useful subgoals and progressively updating this structure. Experimental results demonstrate that SelfGoal significantly enhances the performance of language agents across various tasks, including competitive, cooperative, and deferred feedback environments. Project page: this https URL.
摘要:由大型语言模型(LLM)支持的语言代理作为游戏和编程等领域的决策工具越来越有价值。然而,这些代理经常面临在没有详细指示的情况下实现高级目标以及适应反馈延迟的环境的挑战。在本文中,我们介绍了SelfGoal,这是一种新型的自动方法,旨在增强代理人在有限的人类先验和环境反馈下实现高级目标的能力。SelfGoal的核心概念涉及在与环境交互期间自适应地将高级目标分解为由更实用的子目标组成的树结构,同时识别最有用的子目标并逐步更新此结构。实验结果表明,SelfGoal显着提高了语言代理在各种任务中的性能,包括竞争、合作和延迟反馈环境。项目页面:此https URL。

[NLP-33] WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild
[NLP-33] WildBench:通过来自野外真实用户的挑战任务对LLM进行基准测试

链接: https://arxiv.org/abs/2406.04770
作者: Bill Yuchen Lin,Yuntian Deng,Khyathi Chandu,Faeze Brahman,Abhilasha Ravichander,Valentina Pyatkin,Nouha Dziri,Ronan Le Bras,Yejin Choi
关键词: real-world user queries, benchmark large language, evaluation framework designed, large language models, real-world user
中文关键词: 现实世界用户查询、基准大型语言、设计的评估框架、大型语言模型、现实世界用户
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Link: this https URL

Abstract:We introduce WildBench, an automated evaluation framework designed to benchmark large language models (LLMs) using challenging, real-world user queries. WildBench consists of 1,024 tasks carefully selected from over one million human-chatbot conversation logs. For automated evaluation with WildBench, we have developed two metrics, WB-Reward and WB-Score, which are computable using advanced LLMs such as GPT-4-turbo. WildBench evaluation uses task-specific checklists to evaluate model outputs systematically and provides structured explanations that justify the scores and comparisons, resulting in more reliable and interpretable automatic judgments. WB-Reward employs fine-grained pairwise comparisons between model responses, generating five potential outcomes: much better, slightly better, slightly worse, much worse, or a tie. Unlike previous evaluations that employed a single baseline model, we selected three baseline models at varying performance levels to ensure a comprehensive pairwise evaluation. Additionally, we propose a simple method to mitigate length bias, by converting outcomes of “slightly better/worse” to “tie” if the winner response exceeds the loser one by more than K characters. WB-Score evaluates the quality of model outputs individually, making it a fast and cost-efficient evaluation metric. WildBench results demonstrate a strong correlation with the human-voted Elo ratings from Chatbot Arena on hard tasks. Specifically, WB-Reward achieves a Pearson correlation of 0.98 with top-ranking models. Additionally, WB-Score reaches 0.95, surpassing both ArenaHard’s 0.91 and AlpacaEval2.0’s 0.89 for length-controlled win rates, as well as the 0.87 for regular win rates.
摘要:我们介绍了WildBench,这是一个自动评估框架,旨在使用具有挑战性的真实世界用户查询来对大型语言模型(LLM)进行基准测试。WildBench由1,024个任务组成,这些任务是从100多万条人类与聊天机器人的对话日志中精心挑选出来的。对于WildBench的自动评估,我们开发了两个指标,WB-Reward和WB-Score,可以使用GPT-4-turbo等高级LLM进行计算。WildBench评估使用特定于任务的核对表来系统地评估模型输出,并提供结构化的解释来证明分数和比较的合理性,从而产生更可靠和更易于解释的自动判断。WB-Reward对模型回答进行细粒度的两两比较,产生五种可能的结果:好得多、略好、略差、差得多或平局。与以前采用单一基线模型的评估不同,我们选择了三个不同性能水平的基线模型,以确保进行全面的成对评估。此外,我们提出了一种简单的方法来缓解长度偏差:如果获胜回答比落败回答长出超过K个字符,就把“略好/略差”的结果转换成“平局”。WB-Score单独评估模型输出的质量,使其成为一种快速且经济高效的评估指标。WildBench的结果表明,在困难任务上,其与Chatbot Arena中人类投票的Elo评分具有很强的相关性。具体地说,WB-Reward与排名靠前的模型的皮尔逊相关性达到0.98。此外,WB-Score达到0.95,超过了ArenaHard的0.91和AlpacaEval2.0在长度控制胜率上的0.89,以及其常规胜率的0.87。
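
The length-bias rule in the WildBench abstract above can be sketched in a few lines of Python. This is only an illustration of the stated rule (a "slightly better" or "slightly worse" verdict becomes a tie when the winning response is more than K characters longer than the losing one). The function name, the outcome labels, and the value of K used in the example are assumptions for illustration and are not taken from the WildBench code.

```python
# Outcome labels are hypothetical; the abstract only names the five verdicts.
OUTCOMES = ("much_better", "slightly_better", "tie", "slightly_worse", "much_worse")


def apply_length_penalty(outcome: str, len_a: int, len_b: int, k: int) -> str:
    """Return the verdict for response A vs. response B after the length rule.

    `outcome` is judged from A's point of view. A narrow win is turned into a
    tie when the winner is more than `k` characters longer than the loser.
    """
    if outcome not in OUTCOMES:
        raise ValueError(f"unknown outcome: {outcome}")
    if outcome == "slightly_better" and len_a - len_b > k:
        return "tie"  # A won narrowly but is much longer than B
    if outcome == "slightly_worse" and len_b - len_a > k:
        return "tie"  # B won narrowly but is much longer than A
    return outcome  # decisive verdicts and ties are left unchanged


# Example with an assumed K of 500 characters.
print(apply_length_penalty("slightly_better", 1500, 900, 500))
```

Under this reading, "much better" and "much worse" verdicts are never changed, which matches the abstract's wording that only the "slightly" outcomes are converted.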

[NLP-34] Think out Loud: Emotion Deducing Explanation in Dialogues
[NLP-34] 大声思考:对话中的情感演绎解释

链接: https://arxiv.org/abs/2406.04758
作者: Jiangnan Li,Zheng Lin,Lanrui Wang,Qingyi Si,Yanan Cao,Mo Yu,Peng Fu,Weiping Wang,Jie Zhou
关键词: emotion, dialogues, Humans convey emotions, daily dialogues, affective intelligence
中文关键词: 情感、对话、人类传达情感、日常对话、情商
类目: Computation and Language (cs.CL)
备注:

Abstract:Humans convey emotions through daily dialogues, making emotion understanding a crucial step of affective intelligence. To understand emotions in dialogues, machines are asked to recognize the emotion for an utterance (Emotion Recognition in Dialogues, ERD); based on the emotion, then find causal utterances for the emotion (Emotion Cause Extraction in Dialogues, ECED). The setting of the two tasks requires first ERD and then ECED, ignoring the mutual complement between emotion and cause. To fix this, some new tasks are proposed to extract them simultaneously. Although the current research on these tasks has excellent achievements, simply identifying emotion-related factors by classification modeling lacks realizing the specific thinking process of causes stimulating the emotion in an explainable way. This thinking process especially reflected in the reasoning ability of Large Language Models (LLMs) is under-explored. To this end, we propose a new task “Emotion Deducing Explanation in Dialogues” (EDEN). EDEN recognizes emotion and causes in an explicitly thinking way. That is, models need to generate an explanation text, which first summarizes the causes; analyzes the inner activities of the speakers triggered by the causes using common sense; then guesses the emotion accordingly. To support the study of EDEN, based on the existing resources in ECED, we construct two EDEN datasets by human effort. We further evaluate different models on EDEN and find that LLMs are more competent than conventional PLMs. Besides, EDEN can help LLMs achieve better recognition of emotions and causes, which explores a new research direction of explainable emotion understanding in dialogues.
摘要:人类通过日常对话传递情感,使情感理解成为情感智能的关键一步。为了理解对话中的情感,机器被要求识别话语的情感(对话情感识别,ERD);然后基于该情感,找到引发该情感的原因话语(对话情感原因提取,ECED)。这两个任务的设置要求先进行ERD,再进行ECED,忽视了情感和原因之间的互补关系。为了解决这一问题,一些新任务被提出来同时提取二者。尽管目前对这些任务的研究已经取得了很好的成果,但仅通过分类建模来识别情感相关因素,无法以可解释的方式呈现原因激发情感的具体思维过程。这种思维过程尤其体现在大型语言模型(LLM)的推理能力上,但相关研究还不够深入。为此,我们提出了一个新的任务“对话中的情感演绎解释”(EDEN)。EDEN以显式思考的方式识别情感和原因。也就是说,模型需要生成一段解释文本,该文本首先总结原因;然后利用常识分析由原因引发的说话者的内心活动;最后据此推测情感。为了支持对EDEN的研究,我们在ECED现有资源的基础上,人工构建了两个EDEN数据集。我们进一步在EDEN上评估了不同的模型,发现LLM比传统的PLM更能胜任。此外,EDEN可以帮助LLM更好地识别情感和原因,这为对话中的可解释情感理解探索了一个新的研究方向。

[NLP-35] CRiskEval: A Chinese Multi-Level Risk Evaluation Benchmark Dataset for Large Language Models
[NLP-35] CRiskEval:大型语言模型的中文多级别风险评估基准数据集

链接: https://arxiv.org/abs/2406.04752
作者: Ling Shi,Deyi Xiong
关键词: numerous beneficial capabilities, Large language models, Large language, harbors unpredictable risks, potential inclination harbors
中文关键词: 众多有益的能力,大型语言模型,大型语言,蕴藏不可预测的风险,潜在的倾向蕴藏
类目: Computation and Language (cs.CL)
备注: 28 pages, 5 figures

Abstract:Large language models (LLMs) are possessed of numerous beneficial capabilities, yet their potential inclination harbors unpredictable risks that may materialize in the future. We hence propose CRiskEval, a Chinese dataset meticulously designed for gauging the risk proclivities inherent in LLMs such as resource acquisition and malicious coordination, as part of efforts for proactive preparedness. To curate CRiskEval, we define a new risk taxonomy with 7 types of frontier risks and 4 safety levels, including extremely hazardous, moderately hazardous, neutral and safe. We follow the philosophy of tendency evaluation to empirically measure the stated desire of LLMs via fine-grained multiple-choice question answering. The dataset consists of 14,888 questions that simulate scenarios related to predefined 7 types of frontier risks. Each question is accompanied with 4 answer choices that state opinions or behavioral tendencies corresponding to the question. All answer choices are manually annotated with one of the defined risk levels so that we can easily build a fine-grained frontier risk profile for each assessed LLM. Extensive evaluation with CRiskEval on a spectrum of prevalent Chinese LLMs has unveiled a striking revelation: most models exhibit risk tendencies of more than 40% (weighted tendency to the four risk levels). Furthermore, a subtle increase in the model’s inclination toward urgent self-sustainability, power seeking and other dangerous goals becomes evident as the size of models increases. To promote further research on the frontier risk evaluation of LLMs, we publicly release our dataset at this https URL.
摘要:大型语言模型(LLM)具有许多有益的能力,但它们的潜在倾向蕴含着不可预测的风险,这些风险可能在未来成为现实。因此,我们提出CRiskEval,这是一个精心设计的中文数据集,用于衡量LLM中固有的风险倾向,如资源获取和恶意协调,作为主动防范工作的一部分。为了构建CRiskEval,我们定义了一个新的风险分类体系,包括7种前沿风险和4个安全级别:极端危险、中等危险、中性和安全。我们遵循倾向评估的理念,通过细粒度的多项选择题问答,对LLM所表述的意愿进行实证测量。该数据集由14,888个问题组成,模拟与预定义的7种前沿风险相关的情景。每个问题附有4个答案选项,这些选项陈述了与问题对应的观点或行为倾向。所有答案选项都经人工标注为已定义的风险级别之一,因此我们可以轻松地为每个被评估的LLM构建细粒度的前沿风险画像。使用CRiskEval对一系列主流中文LLM进行的广泛评估揭示了一个惊人的发现:大多数模型表现出超过40%的风险倾向(对四个风险级别的加权倾向)。此外,随着模型规模的增加,模型对紧迫的自我维持、权力追求和其他危险目标的倾向出现了细微的增加。为了促进对LLM前沿风险评估的进一步研究,我们在此https URL上公开发布了我们的数据集。

[NLP-36] PQPP: A Joint Benchmark for Text-to-Image Prompt and Query Performance Prediction
[NLP-36] PQPP:文本到图像提示和查询性能预测的联合基准

链接: https://arxiv.org/abs/2406.04746
作者: Eduard Poesina,Adriana Valentina Costache,Adrian-Gabriel Chifu,Josiane Mothe,Radu Tudor Ionescu
关键词: generative diffusion models, visually impressive results, diffusion models, recently emerged, viable alternative
中文关键词: 生成式扩散模型、视觉上令人印象深刻的结果、扩散模型、最近出现的、可行的替代方案
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

Abstract:Text-to-image generation has recently emerged as a viable alternative to text-to-image retrieval, due to the visually impressive results of generative diffusion models. Although query performance prediction is an active research topic in information retrieval, to the best of our knowledge, there is no prior study that analyzes the difficulty of queries (prompts) in text-to-image generation, based on human judgments. To this end, we introduce the first dataset of prompts which are manually annotated in terms of image generation performance. In order to determine the difficulty of the same prompts in image retrieval, we also collect manual annotations that represent retrieval performance. We thus propose the first benchmark for joint text-to-image prompt and query performance prediction, comprising 10K queries. Our benchmark enables: (i) the comparative assessment of the difficulty of prompts/queries in image generation and image retrieval, and (ii) the evaluation of prompt/query performance predictors addressing both generation and retrieval. We present results with several pre-generation/retrieval and post-generation/retrieval performance predictors, thus providing competitive baselines for future research. Our benchmark and code is publicly available under the CC BY 4.0 license at this https URL.
摘要:由于生成扩散模型的视觉效果令人印象深刻,文本到图像的生成最近已经成为文本到图像检索的一种可行的替代方案。尽管查询性能预测是信息检索领域的一个活跃的研究课题,但就我们所知,还没有基于人的判断来分析查询(提示)在文本到图像生成中的难度的研究。为此,我们引入了提示的第一个数据集,这些提示是根据图像生成性能手动标注的。为了确定相同提示在图像检索中的难度,我们还收集了代表检索性能的人工标注。因此,我们提出了第一个联合文本到图像提示和查询性能预测的基准,包括10K个查询。我们的基准能够:(1)比较评估提示/查询在图像生成和图像检索中的难度,以及(2)评估针对生成和检索的提示/查询性能预测指标。我们提出了几个前生成/检索和后生成/检索性能预测指标的结果,从而为未来的研究提供了有竞争力的基线。我们的基准测试和代码在CC by 4.0许可证下可通过以下HTTPS URL公开获得。

[NLP-37] CRAG – Comprehensive RAG Benchmark
[NLP-37] CRAG --全面的RAG基准

链接: https://arxiv.org/abs/2406.04744
作者: Xiao Yang,Kai Sun,Hao Xin,Yushi Sun,Nikita Bhalla,Xiangsen Chen,Sajal Choudhary,Rongze Daniel Gui,Ziran Will Jiang,Ziyu Jiang,Lingkun Kong,Brian Moran,Jiaqi Wang,Yifan Ethan Xu,An Yan,Chenyu Yang,Eting Yuan,Hanwen Zha,Nan Tang,Lei Chen,Nicolas Scheffer,Yue Liu,Nirav Shah,Rakesh Wanga,Anuj Kumar,Wen-tau Yih,Xin Luna Dong
关键词: Large Language Model, alleviate Large Language, Language Model, Large Language, Retrieval-Augmented Generation
中文关键词: 大型语言模型,缓解大型语言,语言模型,大型语言,检索增强生成
类目: Computation and Language (cs.CL)
备注:

Abstract:Retrieval-Augmented Generation (RAG) has recently emerged as a promising solution to alleviate Large Language Model (LLM)'s deficiency in lack of knowledge. Existing RAG datasets, however, do not adequately represent the diverse and dynamic nature of real-world Question Answering (QA) tasks. To bridge this gap, we introduce the Comprehensive RAG Benchmark (CRAG), a factual question answering benchmark of 4,409 question-answer pairs and mock APIs to simulate web and Knowledge Graph (KG) search. CRAG is designed to encapsulate a diverse array of questions across five domains and eight question categories, reflecting varied entity popularity from popular to long-tail, and temporal dynamisms ranging from years to seconds. Our evaluation on this benchmark highlights the gap to fully trustworthy QA. Whereas most advanced LLMs achieve ≤34% accuracy on CRAG, adding RAG in a straightforward manner improves the accuracy only to 44%. State-of-the-art industry RAG solutions only answer 63% questions without any hallucination. CRAG also reveals much lower accuracy in answering questions regarding facts with higher dynamism, lower popularity, or higher complexity, suggesting future research directions. The CRAG benchmark laid the groundwork for a KDD Cup 2024 challenge, attracting thousands of participants and submissions within the first 50 days of the competition. We commit to maintaining CRAG to serve research communities in advancing RAG solutions and general QA solutions.
摘要:检索增强生成(RAG)最近成为缓解大型语言模型(LLM)知识不足问题的一种有前途的解决方案。然而,现有的RAG数据集不能充分代表现实世界问答(QA)任务的多样性和动态性质。为了弥补这一差距,我们引入了全面RAG基准(CRAG),这是一个事实问答基准,包含4,409个问答对以及用于模拟Web和知识图谱(KG)搜索的模拟API。CRAG旨在涵盖五个领域和八个问题类别的各种问题,反映从热门到长尾的不同实体流行度,以及从数年到数秒的时间动态性。我们在这一基准上的评估突出了与完全可信的QA之间的差距。大多数先进的LLM在CRAG上的准确率不超过34%,而以直接的方式添加RAG仅将准确率提高到44%。最先进的工业级RAG解决方案只能在没有任何幻觉的情况下回答63%的问题。CRAG还显示,在回答有关动态性较高、流行度较低或复杂性较高的事实的问题时,准确率要低得多,这指明了未来的研究方向。CRAG基准为KDD Cup 2024挑战赛奠定了基础,在比赛的前50天内吸引了数千名参与者和提交。我们致力于维护CRAG,以服务研究社区,推动RAG解决方案和通用QA解决方案的发展。

[NLP-38] Generative AI Models: Opportunities and Risks for Industry and Authorities
[NLP-38] 生成性人工智能模型:行业和当局的机遇和风险

链接: https://arxiv.org/abs/2406.04734
作者: Tobias Alt,Andrea Ibisch,Clemens Meiser,Anna Wilhelm,Raphael Zimmer,Christian Berghoff,Christoph Droste,Jens Karschau,Friederike Laus,Rainer Plaga,Carola Plesch,Britta Sennewald,Thomas Thaeren,Kristina Unverricht,Steffen Waurick
关键词: traditionally require creativity, human understanding, capable of performing, performing a wide, wide range
中文关键词: 传统上需要创造力、人类理解力、表演能力、表演范围广泛
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注: 33 pages, 3 figures

Abstract:Generative AI models are capable of performing a wide range of tasks that traditionally require creativity and human understanding. They learn patterns from existing data during training and can subsequently generate new content such as texts, images, and music that follow these patterns. Due to their versatility and generally high-quality results, they, on the one hand, represent an opportunity for digitalization. On the other hand, the use of generative AI models introduces novel IT security risks that need to be considered for a comprehensive analysis of the threat landscape in relation to IT security. In response to this risk potential, companies or authorities using them should conduct an individual risk analysis before integrating generative AI into their workflows. The same applies to developers and operators, as many risks in the context of generative AI have to be taken into account at the time of development or can only be influenced by the operating company. Based on this, existing security measures can be adjusted, and additional measures can be taken.
摘要:产生式人工智能模型能够执行传统上需要创造力和人类理解的广泛任务。他们在训练过程中从现有数据中学习模式,随后可以生成遵循这些模式的新内容,如文本、图像和音乐。由于它们的多功能性和总体高质量的结果,它们一方面代表着数字化的机会。另一方面,生成性人工智能模型的使用引入了新的IT安全风险,需要考虑这些风险才能全面分析与IT安全相关的威胁格局。为了应对这种风险潜力,使用它们的公司或当局应该在将生成性人工智能整合到他们的工作流程中之前进行单独的风险分析。这同样适用于开发人员和运营商,因为在生成性人工智能的背景下,许多风险必须在开发时考虑到,或者只能受到运营公司的影响。在此基础上,可以调整现有的安全措施,并可以采取额外的措施。

[NLP-39] AICoderEval: Improving AI Domain Code Generation of Large Language Models
[NLP-39] AICoderEval:改进大型语言模型的人工智能领域代码生成

链接: https://arxiv.org/abs/2406.04712
作者: Yinghui Xia,Yuyan Chen,Tianyu Shi,Jun Wang,Jinsong Yang
关键词: code generation, generation, task-specific code generation, code, code generation capability
中文关键词: 代码生成、生成、特定任务代码生成、代码、代码生成能力
类目: Computation and Language (cs.CL)
备注:

Abstract:Automated code generation is a pivotal capability of large language models (LLMs). However, assessing this capability in real-world scenarios remains challenging. Previous methods focus more on low-level code generation, such as model loading, instead of generating high-level codes catering for real-world tasks, such as image-to-text, text classification, in various domains. Therefore, we construct AICoderEval, a dataset focused on real-world tasks in various domains based on HuggingFace, PyTorch, and TensorFlow, along with comprehensive metrics for evaluation and enhancing LLMs’ task-specific code generation capability. AICoderEval contains test cases and complete programs for automated evaluation of these tasks, covering domains such as natural language processing, computer vision, and multimodal learning. To facilitate research in this area, we open-source the AICoderEval dataset at this https URL. After that, we propose CoderGen, an agent-based framework, to help LLMs generate codes related to real-world tasks on the constructed AICoderEval. Moreover, we train a more powerful task-specific code generation model, named AICoder, which is refined on llama-3 based on AICoderEval. Our experiments demonstrate the effectiveness of CoderGen in improving LLMs’ task-specific code generation capability (by 12.00% on pass@1 for original model and 9.50% on pass@1 for ReAct Agent). AICoder also outperforms current code generation LLMs, indicating the great quality of the AICoderEval benchmark.
摘要:自动代码生成是大型语言模型(LLM)的一项关键能力。然而,在现实世界的场景中评估这种能力仍然具有挑战性。以前的方法更多地关注低层次的代码生成,例如模型加载,而不是生成面向各个领域现实任务(例如图像到文本、文本分类)的高层次代码。因此,我们构建了AICoderEval,这是一个基于HuggingFace、PyTorch和TensorFlow、专注于各个领域真实任务的数据集,并配有用于评估和增强LLM特定任务代码生成能力的全面指标。AICoderEval包含用于自动评估这些任务的测试用例和完整程序,涵盖自然语言处理、计算机视觉和多模态学习等领域。为了促进这一领域的研究,我们将AICoderEval数据集开源,地址为此https URL。然后,我们提出了一个基于智能体的框架CoderGen,帮助LLM在构建的AICoderEval上生成与现实世界任务相关的代码。此外,我们还训练了一个更强大的特定任务代码生成模型AICoder,该模型基于AICoderEval在llama-3上微调得到。我们的实验证明了CoderGen在提高LLM特定任务代码生成能力方面的有效性(原始模型的pass@1提高了12.00%,ReAct Agent的pass@1提高了9.50%)。AICoder的性能也超过了当前的代码生成LLM,这表明了AICoderEval基准的高质量。

[NLP-40] Mixture-of-Agents Enhances Large Language Model Capabilities
[NLP-40] 混合代理增强大型语言模型能力

链接: https://arxiv.org/abs/2406.04692
作者: Junlin Wang,Jue Wang,Ben Athiwaratkun,Ce Zhang,James Zou
关键词: natural language understanding, Recent advances, demonstrate substantial capabilities, large language models, generation tasks
中文关键词: 自然语言理解、最新进展、展示了强大的能力、大型语言模型、生成任务
类目: Computation and Language (cs.CL)
备注:

Abstract:Recent advances in large language models (LLMs) demonstrate substantial capabilities in natural language understanding and generation tasks. With the growing number of LLMs, how to harness the collective expertise of multiple LLMs is an exciting open direction. Toward this goal, we propose a new approach that leverages the collective strengths of multiple LLMs through a Mixture-of-Agents (MoA) methodology. In our approach, we construct a layered MoA architecture wherein each layer comprises multiple LLM agents. Each agent takes all the outputs from agents in the previous layer as auxiliary information in generating its response. MoA models achieve state-of-the-art performance on AlpacaEval 2.0, MT-Bench and FLASK, surpassing GPT-4 Omni. For example, our MoA using only open-source LLMs is the leader of AlpacaEval 2.0 by a substantial gap, achieving a score of 65.1% compared to 57.5% by GPT-4 Omni.
摘要:大型语言模型(LLM)的最新进展展示了其在自然语言理解和生成任务上的强大能力。随着LLM数量的不断增加,如何利用多个LLM的集体专业知识是一个令人兴奋的开放方向。为了实现这一目标,我们提出了一种新方法,通过混合代理(MoA)方法来利用多个LLM的集体优势。在我们的方法中,我们构建了分层的MoA架构,其中每层包括多个LLM代理。每个代理将前一层代理的所有输出作为生成其响应的辅助信息。MoA模型在AlpacaEval 2.0、MT-Bench和FLASK上实现了最先进的性能,超过了GPT-4 Omni。例如,我们仅使用开源LLM的MoA在AlpacaEval 2.0上以明显优势领先,得分为65.1%,而GPT-4 Omni的得分为57.5%。
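
The layered flow described in the Mixture-of-Agents abstract above can be sketched as follows. This is a minimal illustration under stated assumptions: agents are modeled as plain Python callables instead of real LLM calls, and `run_moa` and `make_stub_agent` are hypothetical names, not part of the authors' code. The sketch also simply returns the last layer's outputs, since the abstract does not say how a final answer is chosen.

```python
from typing import Callable, List, Sequence

# An agent takes the user prompt plus the previous layer's outputs
# (the "auxiliary information") and returns one response string.
Agent = Callable[[str, Sequence[str]], str]


def run_moa(prompt: str, layers: Sequence[Sequence[Agent]]) -> List[str]:
    """Run the prompt through each layer; every agent sees all prior-layer outputs."""
    previous: List[str] = []  # the first layer has no auxiliary outputs
    for layer in layers:
        # All agents in a layer read the same snapshot of the previous layer.
        snapshot = tuple(previous)
        previous = [agent(prompt, snapshot) for agent in layer]
    return previous


def make_stub_agent(name: str) -> Agent:
    """Build a stand-in agent that only reports how many outputs it received."""
    def agent(prompt: str, auxiliary: Sequence[str]) -> str:
        return f"{name} saw {len(auxiliary)} earlier outputs"
    return agent


# Two agents in the first layer feed a single agent in the second layer.
layers = [[make_stub_agent("a1"), make_stub_agent("a2")], [make_stub_agent("b1")]]
print(run_moa("What is in-context learning?", layers))
```

In a real setup each stub would be replaced by a call to an LLM whose prompt includes the auxiliary responses; how those responses are formatted into the prompt is left open here.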

[NLP-41] MATTER: Memory-Augmented Transformer Using Heterogeneous Knowledge Sources
[NLP-41] MATTER:使用异构知识源的记忆增强Transformer

链接: https://arxiv.org/abs/2406.04670
作者: Dongkyu Lee,Chandana Satya Prakash,Jack FitzGerald,Jens Lehmann
关键词: Leveraging external knowledge, achieving high performance, Leveraging external, knowledge-intensive tasks, question answering
中文关键词: 利用外部知识,实现高绩效,利用外部知识密集型任务,回答问题
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: ACL2024-Findings

Abstract:Leveraging external knowledge is crucial for achieving high performance in knowledge-intensive tasks, such as question answering. The retrieve-and-read approach is widely adopted for integrating external knowledge into a language model. However, this approach suffers from increased computational cost and latency due to the long context length, which grows proportionally with the number of retrieved knowledge. Furthermore, existing retrieval-augmented models typically retrieve information from a single type of knowledge source, limiting their scalability to diverse knowledge sources with varying structures. In this work, we introduce an efficient memory-augmented transformer called MATTER, designed to retrieve relevant knowledge from multiple heterogeneous knowledge sources. Specifically, our model retrieves and reads from both unstructured sources (paragraphs) and semi-structured sources (QA pairs) in the form of fixed-length neural memories. We demonstrate that our model outperforms existing efficient retrieval-augmented models on popular QA benchmarks in terms of both accuracy and speed. Furthermore, MATTER achieves competitive results compared to conventional read-and-retrieve models while having 100x throughput during inference.
摘要:在知识密集型任务(如问答)中,利用外部知识是实现高性能的关键。检索后阅读(retrieve-and-read)方法被广泛用于将外部知识整合到语言模型中。然而,这种方法由于上下文长度较长而增加了计算成本和延迟,而上下文长度随检索到的知识数量成比例增长。此外,现有的检索增强模型通常从单一类型的知识源检索信息,限制了它们向具有不同结构的多样知识源扩展的能力。在这项工作中,我们引入了一种高效的记忆增强Transformer,称为MATTER,旨在从多个异构知识源中检索相关知识。具体地说,我们的模型以固定长度神经记忆的形式从非结构化来源(段落)和半结构化来源(QA对)中检索和读取知识。我们证明了我们的模型在流行的QA基准上,在准确率和速度方面都优于现有的高效检索增强模型。此外,与传统的read-and-retrieve模型相比,MATTER取得了具有竞争力的结果,同时在推理过程中具有100倍的吞吐量。

[NLP-42] DiNeR: a Large Realistic Dataset for Evaluating Compositional Generalization
[NLP-42] DiNeR:用于评估组合泛化的大型真实数据集

链接: https://arxiv.org/abs/2406.04669
作者: Chengang Hu,Xiao Liu,Yansong Feng
关键词: natural language variation, existing compositional generalization, compositional generalization, lack of natural, compositional generalization datasets
中文关键词: 自然语言变化、现有的组合泛化、组合泛化、缺乏自然、组合泛化数据集
类目: Computation and Language (cs.CL)
备注: EMNLP 2023 long paper

Abstract:Most of the existing compositional generalization datasets are synthetically-generated, resulting in a lack of natural language variation. While there have been recent attempts to introduce non-synthetic datasets for compositional generalization, they suffer from either limited data scale or a lack of diversity in the forms of combinations. To better investigate compositional generalization with more linguistic phenomena and compositional diversity, we propose the DIsh NamE Recognition (DiNeR) task and create a large realistic Chinese dataset. Given a recipe instruction, models are required to recognize the dish name composed of diverse combinations of food, actions, and flavors. Our dataset consists of 3,811 dishes and 228,114 recipes, and involves plenty of linguistic phenomena such as anaphora, omission and ambiguity. We provide two strong baselines based on T5 and large language models (LLMs). This work contributes a challenging task, baseline methods to tackle the task, and insights into compositional generalization in the context of dish name recognition. Code and data are available at this https URL.
摘要:现有的组合泛化数据集大多是合成生成的,导致缺乏自然语言的变化。虽然最近有人尝试引入非合成数据集用于组合泛化研究,但它们要么数据规模有限,要么组合形式缺乏多样性。为了在更多语言现象和组合多样性下更好地研究组合泛化,我们提出了菜名识别(DiNeR)任务,并创建了一个大型的真实中文数据集。给定一条食谱说明,模型需要识别由食物、动作和口味的不同组合构成的菜名。我们的数据集包括3,811道菜肴和228,114份食谱,涉及大量的语言现象,如回指、省略和歧义。我们基于T5和大型语言模型(LLM)提供了两个强基线。这项工作贡献了一项具有挑战性的任务、解决该任务的基线方法,以及对菜名识别背景下组合泛化的见解。代码和数据可在此https URL上找到。

[NLP-43] More Victories, Less Cooperation: Assessing Cicero's Diplomacy Play
[NLP-43] 胜利越多,合作越少:评估Cicero的Diplomacy对局表现

链接: https://arxiv.org/abs/2406.04643
作者: Wichayaporn Wongkamjan,Feng Gu,Yanze Wang,Ulf Hermjakob,Jonathan May,Brandon M. Stewart,Jonathan K. Kummerfeld,Denis Peskoff,Jordan Lee Boyd-Graber
关键词: cooperative artificial intelligence, artificial intelligence, challenging setting, boardgame Diplomacy, prominent communicative Diplomacy
中文关键词: 合作人工智能、人工智能、具有挑战性的环境、棋盘游戏外交、杰出的沟通外交
类目: Computation and Language (cs.CL)
备注:

Abstract:The boardgame Diplomacy is a challenging setting for communicative and cooperative artificial intelligence. The most prominent communicative Diplomacy AI, Cicero, has excellent strategic abilities, exceeding human players. However, the best Diplomacy players master communication, not just tactics, which is why the game has received attention as an AI challenge. This work seeks to understand the degree to which Cicero succeeds at communication. First, we annotate in-game communication with abstract meaning representation to separate in-game tactics from general language. Second, we run two dozen games with humans and Cicero, totaling over 200 human-player hours of competition. While AI can consistently outplay human players, AI-Human communication is still limited because of AI’s difficulty with deception and persuasion. This shows that Cicero relies on strategy and has not yet reached the full promise of communicative and cooperative AI.
摘要:棋盘游戏外交对于沟通和合作的人工智能来说是一个具有挑战性的环境。最著名的沟通外交人工智能西塞罗拥有出色的战略能力,超过了人类玩家。然而,最好的外交玩家掌握的是沟通,而不仅仅是战术,这就是为什么该游戏作为人工智能挑战而受到关注。这项工作旨在了解西塞罗在沟通方面的成功程度。首先,我们用抽象意义表示来注释游戏中的沟通,以将游戏中的策略与一般语言分开。其次,我们与人类和西塞罗一起运行了两打游戏,总共超过200小时的人类玩家竞争。虽然人工智能可以始终胜过人类玩家,但由于人工智能难以欺骗和说服,人工智能与人类的沟通仍然受到限制。这表明西塞罗依赖策略,尚未完全实现沟通和合作人工智能的承诺。

[NLP-44] Large Language Model-guided Document Selection
[NLP-44] 大语言模型引导的文档选择

链接: https://arxiv.org/abs/2406.04638
作者: Xiang Kong,Tom Gunter,Ruoming Pang
关键词: Large Language Model, growing compute budget, selection enables comparable, careful document selection, document selection enables
中文关键词: 大型语言模型、不断增长的计算预算、选择支持具有可比性、仔细的文档选择、文档选择支持
类目: Computation and Language (cs.CL)
备注: 9 pages

Abstract:Large Language Model (LLM) pre-training exhausts an ever growing compute budget, yet recent research has demonstrated that careful document selection enables comparable model quality with only a fraction of the FLOPs. Inspired by efforts suggesting that domain-specific training document selection is in fact an interpretable process [Gunasekar et al., 2023], as well as research showing that instruction-finetuned LLMs are adept zero-shot data labelers [Gilardi et al.,2023], we explore a promising direction for scalable general-domain document selection; employing a prompted LLM as a document grader, we distill quality labels into a classifier model, which is applied at scale to a large, and already heavily-filtered, web-crawl-derived corpus autonomously. Following the guidance of this classifier, we drop 75% of the corpus and train LLMs on the remaining data. Results across multiple benchmarks show that: 1. Filtering allows us to quality-match a model trained on the full corpus across diverse benchmarks with at most 70% of the FLOPs, 2. More capable LLM labelers and classifier models lead to better results that are less sensitive to the labeler’s prompt, 3. In-context learning helps to boost the performance of less-capable labeling models. In all cases we use open-source datasets, models, recipes, and evaluation frameworks, so that results can be reproduced by the community.
摘要:大型语言模型(LLM)预训练消耗着不断增长的计算预算,但最近的研究表明,仔细的文档选择可以只用一小部分FLOPs就实现相当的模型质量。有研究表明特定领域的训练文档选择实际上是一个可解释的过程[Gunasekar等人,2023],也有研究表明经指令微调的LLM是熟练的零样本数据标注器[Gilardi等人,2023],受此启发,我们为可扩展的通用领域文档选择探索了一个有希望的方向:使用一个经提示的LLM作为文档评分器,我们将质量标签蒸馏到一个分类器模型中,该模型被自动地大规模应用于一个庞大的、已经过大量过滤的网络爬取语料库。在该分类器的指导下,我们丢弃了75%的语料库,并在剩余的数据上训练LLM。多个基准的结果表明:1. 过滤使我们能够以最多70%的FLOPs,在各种基准上达到与在完整语料库上训练的模型相当的质量;2. 能力更强的LLM标注器和分类器模型会带来更好的结果,并且对标注器的提示不那么敏感;3. 上下文学习有助于提高能力较弱的标注模型的性能。在所有情况下,我们都使用开源的数据集、模型、配方和评估框架,以便社区可以复现结果。
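
The filtering step in the abstract above (score every document with a classifier, then drop 75% of the corpus) might look like the sketch below. It is an illustration only: the abstract does not say whether the cut is made by ranking or by a fixed score threshold, so ranking is an assumption here, and `select_documents` and the toy scoring function are hypothetical stand-ins for the distilled quality classifier.

```python
from typing import Callable, List, Sequence


def select_documents(
    docs: Sequence[str],
    score_fn: Callable[[str], float],
    keep_fraction: float = 0.25,
) -> List[str]:
    """Keep the highest-scoring fraction of documents (assumed ranking-based cut)."""
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    # Rank by classifier score, best first; sorted() is stable for equal scores.
    ranked = sorted(docs, key=score_fn, reverse=True)
    n_keep = int(len(ranked) * keep_fraction)
    return ranked[:n_keep]


# Toy stand-in for the quality classifier: longer text gets a higher score.
corpus = ["a", "bbbb", "cc", "ddd"]
print(select_documents(corpus, score_fn=len, keep_fraction=0.25))
```

With `keep_fraction=0.25` the function discards three quarters of the input, which mirrors the 75% drop reported in the abstract.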

[NLP-45] Low-Resource Cross-Lingual Summarization through Few-Shot Learning with Large Language Models
[NLP-45] 通过使用大型语言模型的少样本学习实现低资源跨语言摘要

链接: https://arxiv.org/abs/2406.04630
作者: Gyutae Park,Seojin Hwang,Hwanhee Lee
关键词: source language document, XLS performance, XLS, aims to generate, few-shot XLS performance
中文关键词: 源语言文档、XLS性能、XLS、旨在生成、少样本XLS性能
类目: Computation and Language (cs.CL)
备注: 7 pages, 3 figures

Abstract:Cross-lingual summarization (XLS) aims to generate a summary in a target language different from the source language document. While large language models (LLMs) have shown promising zero-shot XLS performance, their few-shot capabilities on this task remain unexplored, especially for low-resource languages with limited parallel data. In this paper, we investigate the few-shot XLS performance of various models, including Mistral-7B-Instruct-v0.2, GPT-3.5, and GPT-4. Our experiments demonstrate that few-shot learning significantly improves the XLS performance of LLMs, particularly GPT-3.5 and GPT-4, in low-resource settings. However, the open-source model Mistral-7B-Instruct-v0.2 struggles to adapt effectively to the XLS task with limited examples. Our findings highlight the potential of few-shot learning for improving XLS performance and the need for further research in designing LLM architectures and pre-training objectives tailored for this task. We provide a future work direction to explore more effective few-shot learning strategies and to investigate the transfer learning capabilities of LLMs for cross-lingual summarization.
摘要:跨语言摘要(XLS)旨在生成与源语言文档不同的目标语言摘要。虽然大型语言模型(LLM)已经显示出有前景的零样本XLS性能,但它们在这项任务上的少样本能力仍未被探索,特别是对于平行数据有限的低资源语言。在本文中,我们研究了多种模型的少样本XLS性能,包括Mistral-7B-Instruct-v0.2、GPT-3.5和GPT-4。我们的实验表明,在低资源环境下,少样本学习显著提高了LLM的XLS性能,尤其是GPT-3.5和GPT-4。然而,开源模型Mistral-7B-Instruct-v0.2在示例有限的情况下难以有效地适应XLS任务。我们的发现突显了少样本学习在提高XLS性能方面的潜力,以及在设计适合此任务的LLM架构和预训练目标方面进一步研究的必要性。我们提出了未来的工作方向:探索更有效的少样本学习策略,并研究LLM在跨语言摘要中的迁移学习能力。

[NLP-46] Key-Element-Informed sLLM Tuning for Document Summarization
[NLP-46] 文档摘要的关键要素知情sLLM调优

链接: https://arxiv.org/abs/2406.04625
作者: Sangwon Ryu,Heejin Do,Yunsu Kim,Gary Geunbae Lee,Jungseul Ok
关键词: large language models, Remarkable advances, enabled high-quality text, language models, advances in large
中文关键词: 大型语言模型,显着的进步,实现了高质量的文本,语言模型,大规模的进步
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Interspeech 2024

Abstract:Remarkable advances in large language models (LLMs) have enabled high-quality text summarization. However, this capability is currently accessible only through LLMs of substantial size or proprietary LLMs with usage fees. In response, smaller-scale LLMs (sLLMs) of easy accessibility and low costs have been extensively studied, yet they often suffer from missing key information and entities, i.e., low relevance, in particular, when input documents are long. We hence propose a key-element-informed instruction tuning for summarization, so-called KEITSum, which identifies key elements in documents and instructs sLLM to generate summaries capturing these key elements. Experimental results on dialogue and news datasets demonstrate that sLLM with KEITSum indeed provides high-quality summarization with higher relevance and less hallucinations, competitive to proprietary LLM.
摘要:大型语言模型(LLM)的显著进步使高质量的文本摘要成为可能。然而,目前只能通过大规模的LLM或收取使用费的专有LLM来获得这种能力。作为回应,人们广泛研究了易于获取且成本低廉的较小规模LLM(sLLM),但它们经常遗漏关键信息和实体,即相关性较低,尤其是当输入文档很长时。因此,我们提出了一种基于关键元素的摘要指令调优方法,即KEITSum,它识别文档中的关键元素并指示sLLM生成涵盖这些关键元素的摘要。对话和新闻数据集上的实验结果表明,采用KEITSum的sLLM确实能提供高质量的摘要,具有更高的相关性和更少的幻觉,可与专有LLM相媲美。

[NLP-47] LawGPT: A Chinese Legal Knowledge-Enhanced Large Language Model
[NLP-47] LawGPT:中国法律知识增强型大型语言模型

链接: https://arxiv.org/abs/2406.04614
作者: Zhi Zhou,Jiang-Xin Shi,Peng-Xiao Song,Xiao-Wen Yang,Yi-Xuan Jin,Lan-Zhe Guo,Yu-Feng Li
关键词: Large language models, showcased remarkable capabilities, Large language, Chinese legal tasks, Chinese legal
中文关键词: 大型语言模型,展示非凡的能力,大型语言,中国法律任务,中国法律
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Technical Report

点击查看摘要

Abstract:Large language models (LLMs), including both proprietary and open-source models, have showcased remarkable capabilities in addressing a wide range of downstream tasks. Nonetheless, when it comes to practical Chinese legal tasks, these models fail to meet the actual requirements. Proprietary models do not ensure data privacy for sensitive legal cases, while open-source models demonstrate unsatisfactory performance due to their lack of legal knowledge. To address this problem, we introduce LawGPT, the first open-source model specifically designed for Chinese legal applications. LawGPT comprises two key components: legal-oriented pre-training and legal supervised fine-tuning. Specifically, we employ large-scale Chinese legal documents for legal-oriented pre-training to incorporate legal domain knowledge. To further improve the model’s performance on downstream legal tasks, we create a knowledge-driven instruction dataset for legal supervised fine-tuning. Our experimental results demonstrate that LawGPT outperforms the open-source LLaMA 7B model. Our code and resources are publicly available at this https URL and have received 5.7K stars on GitHub.
摘要:大型语言模型(LLM),包括专有和开源模型,在处理广泛的下游任务方面展示了非凡的能力。然而,当涉及到中国的实际法律任务时,这些模型并不能满足实际要求。专有模型不能确保敏感法律案件的数据隐私,而开源模型由于缺乏法律知识,表现不佳。为了解决这个问题,我们引入了LawGPT,这是第一个专门为中国法律应用设计的开源模型。LawGPT包括两个关键组成部分:面向法律的预训练和法律监督微调。具体地说,我们使用大规模的中文法律文件进行面向法律的预训练,以纳入法律领域知识。为了进一步提高模型在下游法律任务上的性能,我们创建了一个知识驱动的指令数据集,用于法律监督微调。我们的实验结果表明,LawGPT的性能优于开源的LLaMA 7B模型。我们的代码和资源在此 https URL 上公开可用,并在GitHub上获得了5.7K颗星。

[NLP-48] Learning Task Decomposition to Assist Humans in Competitive Programming
[NLP-48] 学习任务分解帮助人类进行竞争性编程

链接: https://arxiv.org/abs/2406.04604
作者: Jiaxin Wen,Ruiqi Zhong,Pei Ke,Zhihong Shao,Hongning Wang,Minlie Huang
关键词: language models, struggle to understand, understand the LM-generated, decompose complex solutions, solve complex problems
中文关键词: 语言模型,难以理解,理解LM生成的,分解复杂解决方案,解决复杂问题
类目: Computation and Language (cs.CL); Programming Languages (cs.PL)
备注: ACL 2024 Main Conference

点击查看摘要

Abstract:When using language models (LMs) to solve complex problems, humans might struggle to understand the LM-generated solutions and repair the flawed ones. To assist humans in repairing them, we propose to automatically decompose complex solutions into multiple simpler pieces that correspond to specific subtasks. We introduce a novel objective for learning task decomposition, termed assistive value (AssistV), which measures the feasibility and speed for humans to repair the decomposed solution. We collect a dataset of human repair experiences on different decomposed solutions. Utilizing the collected data as in-context examples, we then learn to critique, refine, and rank decomposed solutions to improve AssistV. We validate our method under competitive programming problems: under 177 hours of human study, our method enables non-experts to solve 33.3% more problems, speeds them up by 3.3x, and empowers them to match unassisted experts.
摘要:当使用语言模型(LM)来解决复杂问题时,人类可能很难理解LM生成的解决方案并修复其中有缺陷的方案。为了帮助人类进行修复,我们建议将复杂的解决方案自动分解为多个与特定子任务相对应的更简单的部分。我们引入了一个新的学习任务分解的目标,称为辅助值(AssistV),它衡量人类修复分解后解决方案的可行性和速度。我们收集了人类在不同分解方案上的修复经验数据集。利用收集的数据作为上下文示例,我们学习对分解后的解决方案进行批评、完善和排序,以提高AssistV。我们在竞赛编程问题上验证了我们的方法:在177小时的人类研究中,我们的方法使非专家多解决33.3%的问题,将他们的速度提高3.3倍,并使他们能够与无辅助的专家相匹敌。

[NLP-49] Pitch-Aware RNN-T for Mandarin Chinese Mispronunciation Detection and Diagnosis
[NLP-49] 具有音调感知的RNN-T用于普通话发音错误检测和诊断

链接: https://arxiv.org/abs/2406.04595
作者: Xintong Wang,Mingqian Shi,Ye Wang
关键词: Automatic Speech Recognition, leveraging Automatic Speech, Mispronunciation Detection, Detection and Diagnosis, Speech Recognition
中文关键词: 自动语音识别,利用自动语音、发音错误检测、检测和诊断、语音识别
类目: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: Accepted at Interspeech 2024

点击查看摘要

Abstract:Mispronunciation Detection and Diagnosis (MDD) systems, leveraging Automatic Speech Recognition (ASR), face two main challenges in Mandarin Chinese: 1) The two-stage models create an information gap between the phoneme or tone classification stage and the MDD stage. 2) The scarcity of Mandarin MDD datasets limits model training. In this paper, we introduce a stateless RNN-T model for Mandarin MDD, utilizing HuBERT features with pitch embedding through a Pitch Fusion Block. Our model, trained solely on native speaker data, shows a 3% improvement in Phone Error Rate and a 7% increase in False Acceptance Rate over the state-of-the-art baseline in non-native scenarios.
摘要:利用自动语音识别(ASR)的发音错误检测和诊断(MDD)系统在普通话中面临着两个主要挑战:1)两阶段模型在音素或声调分类阶段和MDD阶段之间造成了信息差距。2)普通话MDD数据集的稀缺限制了模型训练。本文中,我们为普通话MDD引入了一种无状态RNN-T模型,利用HuBERT特征并通过音高融合模块(Pitch Fusion Block)进行音高嵌入。我们的模型仅使用母语者数据训练,在非母语场景中,与最先进的基线相比,音素错误率(Phone Error Rate)改善了3%,错误接受率(False Acceptance Rate)提高了7%。

[NLP-50] Extroversion or Introversion? Controlling The Personality of Your Large Language Models
[NLP-50] 外向还是内向?控制大型语言模型的个性

链接: https://arxiv.org/abs/2406.04583
作者: Yanquan Chen,Zhen Wu,Junjie Guo,Shujian Huang,Xinyu Dai
关键词: Large language models, Large language, mimicking human behavior, exhibiting synthetic personalities, language models
中文关键词: 大型语言模型,大型语言,模仿人类行为,表现出合成人格,语言模型
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) exhibit robust capabilities in text generation and comprehension, mimicking human behavior and exhibiting synthetic personalities. However, some LLMs have displayed offensive personality, propagating toxic discourse. Existing literature neglects the origin and evolution of LLM personalities, as well as the effective personality control. To fill these gaps, our study embarked on a comprehensive investigation into LLM personality control. We investigated several typical methods to influence LLMs, including three training methods: Continual Pre-training, Supervised Fine-Tuning (SFT), and Reinforcement Learning from Human Feedback (RLHF), along with inference phase considerations (prompts). Our investigation revealed a hierarchy of effectiveness in control: Prompt > SFT > RLHF > Continual Pre-train. Notably, SFT exhibits a higher control success rate compared to prompt induction. While prompts prove highly effective, we found that prompt-induced personalities are less robust than those trained, making them more prone to showing conflicting personalities under reverse personality prompt induction. Besides, harnessing the strengths of both SFT and prompt, we proposed Prompt Induction post Supervised Fine-tuning (PISF), which emerges as the most effective and robust strategy for controlling LLMs’ personality, displaying high efficacy, high success rates, and high robustness. Even under reverse personality prompt induction, LLMs controlled by PISF still exhibit stable and robust personalities.
摘要:大型语言模型(LLM)在文本生成和理解方面表现出强大的能力,模仿人类的行为,并显示出合成的人格。然而,一些LLM表现出攻击性的人格,传播有害的言论。现有文献忽略了LLM人格的起源和演化,以及有效的人格控制。为了填补这些空白,我们的研究对LLM人格控制进行了全面的调查。我们研究了几种典型的影响LLM的方法,包括三种训练方法:持续预训练、有监督微调(SFT)和基于人类反馈的强化学习(RLHF),以及推理阶段的考虑(提示)。我们的研究揭示了控制有效性的等级:提示 > SFT > RLHF > 持续预训练。值得注意的是,与提示诱导相比,SFT显示出更高的控制成功率。虽然提示被证明是非常有效的,但我们发现提示诱导的人格不如训练得到的人格稳健,这使得它们在反向人格提示诱导下更容易表现出相互冲突的人格。此外,结合SFT和提示的优点,我们提出了Prompt Induction post Supervised Fine-tuning(PISF),它是控制LLM人格的最有效和最稳健的策略,表现出高效率、高成功率和高稳健性。即使在反向人格提示诱导下,由PISF控制的LLM仍然表现出稳定和稳健的人格。

[NLP-51] SC2: Towards Enhancing Content Preservation and Style Consistency in Long Text Style Transfer
[NLP-51] SC2:在长文本风格迁移中增强内容保留和风格一致性

链接: https://arxiv.org/abs/2406.04578
作者: Jie Zhao,Ziyu Guan,Cai Xu,Wei Zhao,Yue Jiang
关键词: Text style transfer, Text style, aims to vary, Style Consistency loss, preserving the semantic
中文关键词: 文本风格迁移,文本风格,旨在改变,风格一致性损失,保留语义
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Text style transfer (TST) aims to vary the style polarity of text while preserving the semantic content. Although recent advancements have demonstrated remarkable progress in short TST, it remains a relatively straightforward task with limited practical applications. The more comprehensive long TST task presents two challenges: (1) existing methods encounter difficulties in accurately evaluating content attributes in multiple words, leading to content degradation; (2) the conventional vanilla style classifier loss encounters obstacles in maintaining consistent style across multiple generated sentences. In this paper, we propose a novel method SC2, where a multilayer Joint Style-Content Weighed (JSCW) module and a Style Consistency loss are designed to address the two issues. The JSCW simultaneously assesses the amounts of style and content attributes within a token, aiming to acquire a lossless content representation and thereby enhancing content preservation. The multiple JSCW layers further progressively refine content representations. We design a style consistency loss to ensure the generated multiple sentences consistently reflect the target style polarity. Moreover, we incorporate a denoising non-autoregressive decoder to accelerate the training. We conduct plentiful experiments and the results show significant improvements of SC2 over competitive baselines. Our code: this https URL.
摘要:文本风格迁移(TST)旨在改变文本的风格极性,同时保持文本的语义内容。尽管最近的进展显示在短文本TST方面取得了显著的进步,但它仍然是一项相对简单的任务,实际应用有限。更全面的长文本TST任务带来了两个挑战:(1)现有方法在准确评估多个词的内容属性方面遇到困难,导致内容退化;(2)传统的普通风格分类器损失难以在多个生成的句子中保持一致的风格。本文提出了一种新的方法SC2,其中设计了多层联合风格-内容加权(JSCW)模块和风格一致性损失来解决这两个问题。JSCW同时评估词元中风格和内容属性的数量,旨在获得无损的内容表示,从而增强内容保留。多个JSCW层进一步逐步细化内容表示。我们设计了一种风格一致性损失,以确保生成的多个句子一致地反映目标风格极性。此外,我们还加入了一个去噪非自回归解码器来加速训练。我们进行了大量的实验,结果表明,SC2相比有竞争力的基线有显著的改进。我们的代码:此 https URL。

[NLP-52] SpaRC and SpaRP: Spatial Reasoning Characterization and Path Generation for Understanding Spatial Reasoning Capability of Large Language Models
[NLP-52] SpaRC和SpaRP:用于理解大型语言模型空间推理能力的空间推理特征和路径生成

链接: https://arxiv.org/abs/2406.04566
作者: Md Imbesat Hassan Rizvi,Xiaodan Zhu,Iryna Gurevych
关键词: Spatial reasoning, Spatial Reasoning Characterization, Spatial Reasoning Paths, Spatial, artificial intelligence
中文关键词: 空间推理,空间推理特征,空间推理路径,空间,人工智能
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted at ACL 2024 (Main)

点击查看摘要

Abstract:Spatial reasoning is a crucial component of both biological and artificial intelligence. In this work, we present a comprehensive study of the capability of current state-of-the-art large language models (LLMs) on spatial reasoning. To support our study, we created and contribute a novel Spatial Reasoning Characterization (SpaRC) framework and Spatial Reasoning Paths (SpaRP) datasets, to enable an in-depth understanding of the spatial relations and compositions as well as the usefulness of spatial reasoning chains. We found that all the state-of-the-art LLMs do not perform well on the datasets – their performances are consistently low across different setups. The spatial reasoning capability improves substantially as model sizes scale up. Finetuning both large language models (e.g., Llama-2-70B) and smaller ones (e.g., Llama-2-13B) can significantly improve their F1-scores by 7–32 absolute points. We also found that the top proprietary LLMs still significantly outperform their open-source counterparts in topological spatial understanding and reasoning.
摘要:空间推理是生物智能和人工智能的重要组成部分。在这项工作中,我们对目前最先进的大型语言模型在空间推理方面的能力进行了全面的研究。为了支持我们的研究,我们创建并贡献了一个新的空间推理表征(SpaRC)框架和空间推理路径(SpaRP)数据集,以使我们能够深入了解空间关系和组成以及空间推理链的有用性。我们发现,所有最先进的LLM在数据集上的表现都不是很好,它们的性能在不同的设置中一直很低。随着模型尺寸的增大,空间推理能力显著提高。对大型语言模型(例如,Llama-2-70B)和较小语言模型(例如,Llama-2-13B)进行微调可以将它们的F1分数显著提高7-32个绝对点。我们还发现,顶级专有LLM在拓扑空间理解和推理方面仍然显著优于它们的开源同行。

[NLP-53] Creating an AI Observer: Generative Semantic Workspaces
[NLP-53] 创建人工智能观察者:生成性语义工作空间

链接: https://arxiv.org/abs/2406.04555
作者: Pavan Holur,Shreyas Rajesh,David Chong,Vwani Roychowdhury
关键词: experienced human Observer, human Observer reading, missing Semantic parts, Semantic parts anticipating
中文关键词: 经验丰富的人类观察者,人类观察者阅读,缺失的语义部分,预期的语义部分
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 37 pages with appendix, 28 figures

点击查看摘要

Abstract:An experienced human Observer reading a document, such as a crime report, creates a succinct plot-like "Working Memory" comprising different actors, their prototypical roles and states at any point, their evolution over time based on their interactions, and even a map of missing Semantic parts anticipating them in the future. An equivalent AI Observer currently does not exist. We introduce the [G]enerative [S]emantic [W]orkspace (GSW), comprising an "Operator" and a "Reconciler", that leverages advancements in LLMs to create a generative-style Semantic framework, as opposed to a traditionally predefined set of lexicon labels. Given a text segment $C_n$ that describes an ongoing situation, the Operator instantiates actor-centric Semantic maps (termed "Workspace instance" $\mathcal{W}_n$). The Reconciler resolves differences between $\mathcal{W}_n$ and a "Working memory" $\mathcal{M}_n^*$ to generate the updated $\mathcal{M}_{n+1}^*$. GSW outperforms well-known baselines on several tasks ($\sim 94\%$ vs. FST, GLEN, BertSRL - multi-sentence Semantics extraction, $\sim 15\%$ vs. NLI-BERT, $\sim 35\%$ vs. QA). By mirroring the real Observer, GSW provides the first step towards Spatial Computing assistants capable of understanding individual intentions and predicting future behavior.
摘要:一个有经验的人类观察者(Observer)在阅读一份文件(比如一份犯罪报告)时,会创建一个简洁的、类似情节的“工作记忆”,其中包括不同的参与者、他们在任何时刻的原型角色和状态、他们基于互动随时间的演变,甚至还有一张预测未来将出现的缺失语义部分的地图。目前还不存在与之等价的AI观察者。我们提出了生成式语义工作空间(Generative Semantic Workspace,GSW),它由“操作器(Operator)”和“协调器(Reconciler)”组成,利用LLM的进展来创建生成式的语义框架,而不是传统上预定义的一组词典标签。给定描述当前情况的文本段 $C_n$,操作器实例化以参与者为中心的语义图(称为“工作空间实例” $\mathcal{W}_n$)。协调器解决 $\mathcal{W}_n$ 与“工作记忆” $\mathcal{M}_n^*$ 之间的差异,以生成更新后的 $\mathcal{M}_{n+1}^*$。GSW在多个任务上的表现优于知名基线(多句语义提取上相对FST、GLEN、BertSRL约94%,相对NLI-BERT约15%,相对QA约35%)。通过模仿真实的观察者,GSW向能够理解个人意图和预测未来行为的空间计算助手迈出了第一步。

[NLP-54] Label-Synchronous Neural Transducer for E2E Simultaneous Speech Translation
[NLP-54] 用于E2E同步语音翻译的标签同步神经转换器

链接: https://arxiv.org/abs/2406.04541
作者: Keqi Deng,Philip C. Woodland
关键词: online speech recognition, simultaneous speech translation, speech recognition, simultaneous speech, online speech
中文关键词: 在线语音识别,同步语音翻译,语音识别,同步语音,在线语音
类目: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: Accepted by ACL 2024 Main Conference

点击查看摘要

Abstract:While the neural transducer is popular for online speech recognition, simultaneous speech translation (SST) requires both streaming and re-ordering capabilities. This paper presents the LS-Transducer-SST, a label-synchronous neural transducer for SST, which naturally possesses these two properties. The LS-Transducer-SST dynamically decides when to emit translation tokens based on an Auto-regressive Integrate-and-Fire (AIF) mechanism. A latency-controllable AIF is also proposed, which can control the quality-latency trade-off either only during decoding, or it can be used in both decoding and training. The LS-Transducer-SST can naturally utilise monolingual text-only data via its prediction network which helps alleviate the key issue of data sparsity for E2E SST. During decoding, a chunk-based incremental joint decoding technique is designed to refine and expand the search space. Experiments on the Fisher-CallHome Spanish (Es-En) and MuST-C En-De data show that the LS-Transducer-SST gives a better quality-latency trade-off than existing popular methods. For example, the LS-Transducer-SST gives a 3.1/2.9 point BLEU increase (Es-En/En-De) relative to CAAT at a similar latency and a 1.4 s reduction in average lagging latency with similar BLEU scores relative to Wait-k.
摘要:虽然神经转换器(neural transducer)在在线语音识别中很受欢迎,但同步语音翻译(SST)需要流式处理和重排序两种能力。本文提出的LS-Transducer-SST是一种用于SST的标签同步神经转换器,它自然具有这两个特性。LS-Transducer-SST基于自回归积分发放(Auto-regressive Integrate-and-Fire,AIF)机制动态决定何时输出翻译词元。本文还提出了一种延迟可控的AIF,它可以只在解码过程中控制质量与延迟的权衡,也可以同时用于解码和训练。LS-Transducer-SST可以通过其预测网络自然地利用单语纯文本数据,这有助于缓解E2E SST数据稀疏的关键问题。在解码过程中,设计了一种基于块的增量式联合解码技术来细化和扩展搜索空间。在Fisher-CallHome西班牙语(Es-En)和MuST-C En-De数据上的实验表明,LS-Transducer-SST比现有的流行方法提供了更好的质量-延迟权衡。例如,在延迟相近的情况下,LS-Transducer-SST的BLEU比CAAT高3.1/2.9分(Es-En/En-De);在BLEU得分相近的情况下,其平均滞后延迟比Wait-k缩短1.4秒。
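
The abstract does not spell out how the AIF mechanism works. For orientation only, the sketch below shows the generic integrate-and-fire idea that the name refers to: per-frame weights are accumulated, and an output is emitted each time the running sum crosses a threshold. The function name, the threshold of 1.0, and the input weights are assumptions for illustration; the auto-regressive weight computation and the latency control described in the paper are not modeled here.

```python
from typing import List, Sequence


def integrate_and_fire(weights: Sequence[float], threshold: float = 1.0) -> List[int]:
    """Return the frame indices at which an output token would be emitted.

    Generic integrate-and-fire sketch (not the paper's AIF): accumulate
    non-negative per-frame weights and "fire" whenever the accumulated
    value reaches the threshold, carrying the remainder forward.
    """
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    fire_steps: List[int] = []
    accumulated = 0.0
    for step, weight in enumerate(weights):
        accumulated += weight
        # A single large weight may trigger more than one emission.
        while accumulated >= threshold:
            fire_steps.append(step)
            accumulated -= threshold
    return fire_steps
```

In such a scheme a higher threshold delays emissions. This is one intuitive way to trade latency against context, but whether the paper's latency-controllable AIF works this way is not stated in the abstract.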

[NLP-55] llmNER: (Zero|Few)-Shot Named Entity Recognition Exploiting the Power of Large Language Models
[NLP-55] llmNER:利用大型语言模型能力的(零|少)样本命名实体识别

链接: https://arxiv.org/abs/2406.04528
作者: Fabián Villena,Luis Miranda,Claudio Aracena
关键词: high-quality human-like text, Large language models, generate high-quality human-like, Large language, human-like text
中文关键词: 高质量的类人文本,大型语言模型,生成高质量的类人文本,大型语言,类人文本
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) allow us to generate high-quality human-like text. One interesting task in natural language processing (NLP) is named entity recognition (NER), which seeks to detect mentions of relevant information in documents. This paper presents llmNER, a Python library for implementing zero-shot and few-shot NER with LLMs; by providing an easy-to-use interface, llmNER can compose prompts, query the model, and parse the completion returned by the LLM. Also, the library enables the user to perform prompt engineering efficiently by providing a simple interface to test multiple variables. We validated our software on two NER tasks to show the library’s flexibility. llmNER aims to push the boundaries of in-context learning research by removing the barrier of the prompting and parsing steps.
摘要:大型语言模型(LLM)使我们能够生成高质量的类人文本。自然语言处理(NLP)中一项有趣的任务是命名实体识别(NER),它旨在检测文档中对相关信息的提及。本文介绍了llmNER,这是一个Python库,用于通过LLM实现零样本和少样本NER;通过提供易于使用的接口,llmNER可以编写提示、查询模型并解析LLM返回的补全结果。此外,该库通过提供简单的接口来测试多个变量,使用户能够高效地进行提示工程。我们在两项NER任务中验证了我们的软件,以展示库的灵活性。llmNER旨在通过消除提示和解析步骤的障碍来突破上下文学习研究的界限。

[NLP-56] Proofread: Fixes All Errors with One Tap
[NLP-56] 校对:只需点击即可修复所有错误

链接: https://arxiv.org/abs/2406.04523
作者: Renjie Liu,Yanxiang Zhang,Yun Zhu,Haicheng Sun,Yuanbo Zhang,Michael Xuelin Huang,Shanqing Cai,Lei Meng,Shumin Zhai
关键词: Large Language Models, Large Language, users’ typing experience, reimagine users’ typing, capabilities in Large
中文关键词: 大型语言模型、大型语言、用户打字体验、重新想象用户打字、大型功能
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 8 pages, 3 figures, 2 tables

点击查看摘要

Abstract:The impressive capabilities in Large Language Models (LLMs) provide a powerful approach to reimagine users’ typing experience. This paper demonstrates Proofread, a novel Gboard feature powered by a server-side LLM in Gboard, enabling seamless sentence-level and paragraph-level corrections with a single tap. We describe the complete system in this paper, from data generation, metrics design to model tuning and deployment. To obtain models with sufficient quality, we implement a careful data synthetic pipeline tailored to online use cases, design multifaceted metrics, employ a two-stage tuning approach to acquire the dedicated LLM for the feature: the Supervised Fine Tuning (SFT) for foundational quality, followed by the Reinforcement Learning (RL) tuning approach for targeted refinement. Specifically, we find sequential tuning on Rewrite and proofread tasks yields the best quality in SFT stage, and propose global and direct rewards in the RL tuning stage to seek further improvement. Extensive experiments on a human-labeled golden set showed our tuned PaLM2-XS model achieved 85.56% good ratio. We launched the feature to Pixel 8 devices by serving the model on TPU v5 in Google Cloud, with thousands of daily active users. Serving latency was significantly reduced by quantization, bucket inference, text segmentation, and speculative decoding. Our demo could be seen in Youtube (this https URL).
摘要:大型语言模型(LLM)令人印象深刻的能力为重新构想用户的打字体验提供了一种强大的方法。本文演示了Proofread(校对),这是Gboard的一个新功能,由服务器端LLM提供支持,只需点击一次,即可实现无缝的句子级和段落级更正。本文描述了完整的系统,从数据生成、指标设计到模型调优和部署。为了获得足够高质量的模型,我们实施了针对在线用例量身定做的细致的数据合成流水线,设计了多方面的指标,并使用两阶段调优方法来获得用于该功能的专用LLM:先用监督微调(SFT)保证基础质量,再用强化学习(RL)调优进行有针对性的改进。具体地说,我们发现在SFT阶段依次对改写(Rewrite)和校对(proofread)任务进行调优可获得最好的质量,并在RL调优阶段提出了全局和直接奖励以寻求进一步的改进。在人工标注的黄金集上的大量实验表明,我们调优后的PaLM2-XS模型达到了85.56%的优良率。我们通过在谷歌云的TPU v5上部署该模型,向Pixel 8设备推出了这一功能,每天有数千名活跃用户。通过量化、分桶推理、文本分割和推测解码显著降低了服务延迟。我们的演示可以在YouTube上看到(此 https URL)。

[NLP-57] NATURAL PLAN: Benchmarking LLMs on Natural Language Planning
[NLP-57] NATURAL PLAN:LLM自然语言规划基准

链接: https://arxiv.org/abs/2406.04520
作者: Huaixiu Steven Zheng,Swaroop Mishra,Hugh Zhang,Xinyun Chen,Minmin Chen,Azade Nova,Le Hou,Heng-Tze Cheng,Quoc V. Le,Ed H. Chi,Denny Zhou
关键词: Calendar Scheduling, introduce NATURAL PLAN, Meeting Planning, NATURAL PLAN, Google Calendar
中文关键词: 日历安排,介绍NATURAL PLAN,会议规划,NATURAL PLAN,谷歌日历
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduce NATURAL PLAN, a realistic planning benchmark in natural language containing 3 key tasks: Trip Planning, Meeting Planning, and Calendar Scheduling. We focus our evaluation on the planning capabilities of LLMs with full information on the task, by providing outputs from tools such as Google Flights, Google Maps, and Google Calendar as contexts to the models. This eliminates the need for a tool-use environment for evaluating LLMs on Planning. We observe that NATURAL PLAN is a challenging benchmark for state of the art models. For example, in Trip Planning, GPT-4 and Gemini 1.5 Pro could only achieve 31.1% and 34.8% solve rate respectively. We find that model performance drops drastically as the complexity of the problem increases: all models perform below 5% when there are 10 cities, highlighting a significant gap in planning in natural language for SoTA LLMs. We also conduct extensive ablation studies on NATURAL PLAN to further shed light on the (in)effectiveness of approaches such as self-correction, few-shot generalization, and in-context planning with long-contexts on improving LLM planning.
摘要:我们介绍了NATURAL PLAN,这是一个自然语言形式的现实规划基准,包含三个关键任务:行程规划、会议规划和日历安排。我们将评估的重点放在LLM在拥有完整任务信息时的规划能力上,方法是将Google Flights、Google Maps和Google Calendar等工具的输出作为上下文提供给模型。这样就不再需要工具使用环境来评估LLM的规划能力。我们观察到,对于最先进的模型来说,NATURAL PLAN是一个具有挑战性的基准。例如,在行程规划中,GPT-4和Gemini 1.5 Pro分别只能达到31.1%和34.8%的解决率。我们发现,随着问题复杂性的增加,模型性能急剧下降:当有10个城市时,所有模型的解决率都低于5%,这突显了SoTA LLM在自然语言规划方面的显著差距。我们还对NATURAL PLAN进行了广泛的消融研究,以进一步阐明自我纠正、少样本泛化和长上下文的上下文内规划等方法在改进LLM规划方面的有效性(或无效性)。

[NLP-58] To Distill or Not to Distill? On the Robustness of Robust Knowledge Distillation
[NLP-58] 蒸馏还是不蒸馏?论稳健知识蒸馏的鲁棒性

链接: https://arxiv.org/abs/2406.04512
作者: Abdul Waheed,Karima Kadaoui,Muhammad Abdul-Mageed
关键词: Automatic Speech Recognition, Speech Recognition, Automatic Speech, present unique challenges, present unique
中文关键词: 自动语音识别,语音识别,自动语音,提出独特的挑战,提出独特的
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Accepted at ACL’24 main

点击查看摘要

Abstract:Arabic is known to present unique challenges for Automatic Speech Recognition (ASR). On one hand, its rich linguistic diversity and wide range of dialects complicate the development of robust, inclusive models. On the other, current multilingual ASR models are compute-intensive and lack proper comprehensive evaluations. In light of these challenges, we distill knowledge from large teacher models into smaller student variants that are more efficient. We also introduce a novel human-annotated dataset covering five under-represented Arabic dialects for evaluation. We further evaluate both our models and existing SoTA multilingual models on both standard available benchmarks and our new dialectal data. Our best-distilled model’s overall performance (45.0% WER) surpasses that of a SoTA model twice its size (SeamlessM4T-large-v2, WER=47.0%) and its teacher model (Whisper-large-v2, WER=55.1%), and its average performance on our new dialectal data (56.9% WER) outperforms all other models. To gain more insight into the poor performance of these models on dialectal data, we conduct an error analysis and report the main types of errors the different models tend to make. The GitHub repository for the project is available at this https URL.
摘要:众所周知,阿拉伯语对自动语音识别(ASR)提出了独特的挑战。一方面,其丰富的语言多样性和众多的方言使稳健、包容的模型的开发变得复杂。另一方面,目前的多语言ASR模型计算量大,缺乏适当的综合评估。鉴于这些挑战,我们将知识从大型教师模型蒸馏到更高效的较小学生模型中。我们还引入了一个新的人工标注数据集,涵盖五种代表性不足的阿拉伯方言,用于评估。我们在现有的标准基准和我们的新方言数据上进一步评估了我们的模型和现有的SoTA多语言模型。我们的最佳蒸馏模型的整体性能(45.0% WER)超过了规模为其两倍的SoTA模型(SeamlessM4T-large-v2,WER=47.0%)及其教师模型(Whisper-large-v2,WER=55.1%),并且在我们的新方言数据上的平均性能(56.9% WER)优于所有其他模型。为了更深入地了解这些模型在方言数据上的糟糕表现,我们进行了错误分析,并报告了不同模型倾向于犯的主要错误类型。该项目的GitHub存储库位于此 https URL。
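
The results above are reported as word error rate (WER). As a reminder of how that metric is usually defined, here is a minimal sketch: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name and the plain whitespace tokenization are assumptions for this example; the paper's exact text normalization and scoring tool are not given in the abstract.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i] + [0] * len(hyp_words)
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current[j] = min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # match or substitution
            )
        previous = current
    return previous[-1] / len(ref_words)
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.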

[NLP-59] FLUID-LLM: Learning Computational Fluid Dynamics with Spatiotemporal-aware Large Language Models
[NLP-59] FLUID-LLM:使用时空感知大型语言模型学习计算流体动力学

链接: https://arxiv.org/abs/2406.04501
作者: Max Zhu,Adrián Bazaga,Pietro Liò
关键词: Learning computational fluid, computationally intensive simulations, Learning computational, traditionally relies, Navier-Stokes equations
中文关键词: 学习计算流体、计算密集型模拟、学习计算、传统依赖、纳维尔-斯托克斯方程
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Learning computational fluid dynamics (CFD) traditionally relies on computationally intensive simulations of the Navier-Stokes equations. Recently, large language models (LLMs) have shown remarkable pattern recognition and reasoning abilities in natural language processing (NLP) and computer vision (CV). However, these models struggle with the complex geometries inherent in fluid dynamics. We introduce FLUID-LLM, a novel framework combining pre-trained LLMs with spatiotemporal-aware encoding to predict unsteady fluid dynamics. Our approach leverages the temporal autoregressive abilities of LLMs alongside spatial-aware layers, bridging the gap between previous CFD prediction methods. Evaluations on standard benchmarks reveal significant performance improvements across various fluid datasets. Our results demonstrate that FLUID-LLM effectively integrates spatiotemporal information into pre-trained LLMs, enhancing CFD task performance.
摘要:学习计算流体动力学(CFD)传统上依赖于对纳维尔-斯托克斯方程的计算密集型模拟。最近,大型语言模型(LLM)在自然语言处理(NLP)和计算机视觉(CV)方面表现出了非凡的模式识别和推理能力。然而,这些模型难以应对流体动力学固有的复杂几何形状。我们引入了FLUID-LLM,这是一种新型框架,将预训练的LLM与时空感知编码相结合,以预测非定常流体动力学。我们的方法利用了LLM的时间自回归能力以及空间感知层,弥合了之前CFD预测方法之间的差距。对标准基准的评估揭示了各种流体数据集上的性能显著改进。我们的结果表明,FLUID-LLM有效地将时空信息集成到预训练的LLM中,提高了CFD任务性能。

[NLP-60] Time Sensitive Knowledge Editing through Efficient Finetuning
[NLP-60] 通过高效微调进行时间敏感的知识编辑

链接: https://arxiv.org/abs/2406.04496
作者: Xiou Ge,Ali Mousavi,Edouard Grave,Armand Joulin,Kun Qian,Benjamin Han,Mostafa Arefiyan,Yunyao Li
关键词: Large Language Models, Language Models, demonstrated impressive capability, Large Language, demonstrated impressive
中文关键词: 大型语言模型,语言模型,表现出令人印象深刻的能力,大型语言,表现出令人印象深刻的能力
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to ACL 2024 main conference

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated impressive capability in different tasks and are bringing transformative changes to many domains. However, keeping the knowledge in LLMs up-to-date remains a challenge once pretraining is complete. It is thus essential to design effective methods to both update obsolete knowledge and induce new knowledge into LLMs. Existing locate-and-edit knowledge editing (KE) method suffers from two limitations. First, the post-edit LLMs by such methods generally have poor capability in answering complex queries that require multi-hop reasoning. Second, the long run-time of such locate-and-edit methods to perform knowledge edits make it infeasible for large scale KE in practice. In this paper, we explore Parameter-Efficient Fine-Tuning (PEFT) techniques as an alternative for KE. We curate a more comprehensive temporal KE dataset with both knowledge update and knowledge injection examples for KE performance benchmarking. We further probe the effect of fine-tuning on a range of layers in an LLM for the multi-hop QA task. We find that PEFT performs better than locate-and-edit techniques for time-sensitive knowledge edits.
摘要:大型语言模型(LLM)在不同的任务中表现出了令人印象深刻的能力,并给许多领域带来了变革性的变化。然而,一旦预训练完成,使LLM中的知识保持最新仍然是一项挑战。因此,设计有效的方法来更新过时的知识并将新的知识注入LLM是至关重要的。现有的定位并编辑(locate-and-edit)式知识编辑(KE)方法存在两方面的局限性。首先,通过这种方法编辑后的LLM通常在回答需要多跳推理的复杂查询时能力较差。其次,这种定位并编辑方法进行知识编辑的运行时间很长,这使得大规模KE在实践中不可行。在本文中,我们探索了参数高效微调(PEFT)技术作为KE的替代方案。我们构建了一个更全面的时态KE数据集,包括知识更新和知识注入示例,用于KE性能基准测试。我们进一步探讨了在多跳QA任务中对LLM的一系列层进行微调的影响。我们发现,对于时间敏感的知识编辑,PEFT比定位并编辑技术表现得更好。

[NLP-61] CORU: Comprehensive Post-OCR Parsing and Receipt Understanding Dataset
[NLP-61] CORU:全面的OCR后解析和收据理解数据集

链接: https://arxiv.org/abs/2406.04493
作者: Abdelrahman Abdallah,Mahmoud Abdalla,Mahmoud SalahEldin Kasem,Mohamed Mahmoud,Ibrahim Abdelhalim,Mohamed Elkasaby,Yasser ElBendary,Adam Jatowt
关键词: Optical Character Recognition, Natural Language Processing, Character Recognition, Optical Character, Natural Language
中文关键词: 光学字符识别、自然语言处理、字符识别、光学字符、自然语言
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In the fields of Optical Character Recognition (OCR) and Natural Language Processing (NLP), integrating multilingual capabilities remains a critical challenge, especially when considering languages with complex scripts such as Arabic. This paper introduces the Comprehensive Post-OCR Parsing and Receipt Understanding Dataset (CORU), a novel dataset specifically designed to enhance OCR and information extraction from receipts in multilingual contexts involving Arabic and English. CORU consists of over 20,000 annotated receipts from diverse retail settings, including supermarkets and clothing stores, alongside 30,000 annotated images for OCR that were utilized to recognize each detected line, and 10,000 items annotated for detailed information extraction. These annotations capture essential details such as merchant names, item descriptions, total prices, receipt numbers, and dates. They are structured to support three primary computational tasks: object detection, OCR, and information extraction. We establish the baseline performance for a range of models on CORU to evaluate the effectiveness of traditional methods, like Tesseract OCR, and more advanced neural network-based approaches. These baselines are crucial for processing the complex and noisy document layouts typical of real-world receipts and for advancing the state of automated multilingual document processing. Our datasets are publicly accessible (this https URL).
摘要:在光学字符识别(OCR)和自然语言处理(NLP)领域,集成多语言能力仍然是一个严峻的挑战,特别是在考虑具有复杂文字的语言时,如阿拉伯语。本文介绍了综合后OCR解析和收据理解数据集(CORU),这是一个新的数据集,专门设计用于增强涉及阿拉伯语和英语的多语言上下文中的收据的OCR和信息提取。CORU包括来自不同零售场所(包括超市和服装店)的20,000多张带注释的收据,以及30,000张用于OCR的注释图像,用于识别每一条检测到的线条,以及10,000件带注释的商品,用于详细信息提取。这些注释记录了基本的细节,如商家名称、商品描述、总价、收据编号和日期。它们被构造为支持三个主要的计算任务:目标检测、OCR和信息提取。我们在CORU上建立了一系列模型的基准性能,以评估传统方法(如Tesseract OCR)和更先进的基于神经网络的方法的有效性。这些基准对于处理现实世界收据中常见的复杂和嘈杂的单据布局以及推进自动化多语言单据处理状态至关重要。我们的数据集是可公开访问的(此HTTPS URL)。

[NLP-62] Automatic Bug Detection in LLM-Powered Text-Based Games Using LLMs

Link: https://arxiv.org/abs/2406.04482
Authors: Claire Jin, Sudha Rao, Xiangyu Peng, Portia Botchway, Jessica Quaye, Chris Brockett, Bill Dolan
Keywords: enabling dynamic plotlines, large language models, Advancements in large, language models, enabling dynamic
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Software Engineering (cs.SE)
Comments: Accepted for publication in Findings of the Association for Computational Linguistics: ACL 2024

Abstract:Advancements in large language models (LLMs) are revolutionizing interactive game design, enabling dynamic plotlines and interactions between players and non-player characters (NPCs). However, LLMs may exhibit flaws such as hallucinations, forgetfulness, or misinterpretations of prompts, causing logical inconsistencies and unexpected deviations from intended designs. Automated techniques for detecting such game bugs are still lacking. To address this, we propose a systematic LLM-based method for automatically identifying such bugs from player game logs, eliminating the need for collecting additional data such as post-play surveys. Applied to a text-based game DejaBoom!, our approach effectively identifies bugs inherent in LLM-powered interactive games, surpassing unstructured LLM-powered bug-catching methods and filling the gap in automated detection of logical and design flaws.

[NLP-63] PromptFix: Few-shot Backdoor Removal via Adversarial Prompt Tuning

Link: https://arxiv.org/abs/2406.04478
Authors: Tianrong Zhang, Zhaohan Xi, Ting Wang, Prasenjit Mitra, Jinghui Chen
Keywords: attracted enormous attention, Pre-trained language models, Pre-trained language, attracted enormous, enormous attention
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: NAACL 2024

Abstract:Pre-trained language models (PLMs) have attracted enormous attention over the past few years with their unparalleled performance. Meanwhile, the soaring cost to train PLMs as well as their amazing generalizability have jointly contributed to few-shot fine-tuning and prompting as the most popular training paradigms for natural language processing (NLP) models. Nevertheless, existing studies have shown that these NLP models can be backdoored such that model behavior is manipulated when trigger tokens are presented. In this paper, we propose PromptFix, a novel backdoor mitigation strategy for NLP models via adversarial prompt-tuning in few-shot settings. Unlike existing NLP backdoor removal methods, which rely on accurate trigger inversion and subsequent model fine-tuning, PromptFix keeps the model parameters intact and only utilizes two extra sets of soft tokens which approximate the trigger and counteract it respectively. The use of soft tokens and adversarial optimization eliminates the need to enumerate possible backdoor configurations and enables an adaptive balance between trigger finding and preservation of performance. Experiments with various backdoor attacks validate the effectiveness of the proposed method, and its performance under domain shift further shows PromptFix’s applicability to models pretrained on unknown data sources, which is the common case in prompt tuning scenarios.

[NLP-64] Multi-Label Classification for Implicit Discourse Relation Recognition

Link: https://arxiv.org/abs/2406.04461
Authors: Wanqiu Long, N. Siddharth, Bonnie Webber
Keywords: Penn Discourse Treebank, textual content, uniting sentences, cohesive narrative, Discourse relations play
Categories: Computation and Language (cs.CL)
Comments: ACL 2024 Findings

Abstract:Discourse relations play a pivotal role in establishing coherence within textual content, uniting sentences and clauses into a cohesive narrative. The Penn Discourse Treebank (PDTB) stands as one of the most extensively utilized datasets in this domain. In PDTB-3, the annotators can assign multiple labels to an example when they believe that multiple relations are present. Prior research in discourse relation recognition has treated these instances as separate examples during training, and only one example needs to have its label predicted correctly for the instance to be judged as correct. However, this approach is inadequate, as it fails to account for the interdependence of labels in real-world contexts and to distinguish between cases where only one sense relation holds and cases where multiple relations hold simultaneously. In our work, we address this challenge by exploring various multi-label classification frameworks to handle implicit discourse relation recognition. We show that multi-label classification methods don’t depress performance for single-label prediction. Additionally, we give a comprehensive analysis of results and data. Our work contributes to advancing the understanding and application of discourse relations and provides a foundation for future study.
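
The gap between the lenient scoring described in the abstract (one matching label is enough) and a stricter multi-label check can be illustrated with a small sketch. This is only an illustration under stated assumptions, not the authors' evaluation code: the function names are hypothetical, the sense labels are toy values, and exact set match is just one possible multi-label criterion.

```python
def lenient_accuracy(predictions, gold_label_sets):
    """A single predicted label counts as correct if it matches ANY gold label."""
    hits = sum(pred in gold for pred, gold in zip(predictions, gold_label_sets))
    return hits / len(gold_label_sets)


def exact_match_accuracy(predicted_sets, gold_label_sets):
    """A predicted label set counts as correct only if it equals the gold set."""
    hits = sum(set(p) == set(g) for p, g in zip(predicted_sets, gold_label_sets))
    return hits / len(gold_label_sets)


# Toy data (hypothetical): the second instance carries two gold relations.
gold = [{"Cause"}, {"Cause", "Concession"}]
single_label_predictions = ["Cause", "Cause"]
multi_label_predictions = [{"Cause"}, {"Cause"}]

lenient = lenient_accuracy(single_label_predictions, gold)     # 1.0
strict = exact_match_accuracy(multi_label_predictions, gold)   # 0.5
```

Under the lenient rule the model looks perfect even though it never detects the second relation, which is the distinction the abstract says prior evaluation fails to make.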

[NLP-65] Evaluating the Smooth Control of Attribute Intensity in Text Generation with LLMs

Link: https://arxiv.org/abs/2406.04460
Authors: Shang Zhou, Feng Yao, Chengyu Dong, Zihan Wang, Jingbo Shang
Keywords: writing conciseness, chatting emotion, crucial across scenarios, explanation clarity, text generation
Categories: Computation and Language (cs.CL)
Comments: Accepted to ACL 2024 Findings

Abstract:Controlling the attribute intensity of text generation is crucial across scenarios (e.g., writing conciseness, chatting emotion, and explanation clarity). The remarkable capabilities of large language models (LLMs) have revolutionized text generation, prompting us to explore such smooth control of LLM generation. Specifically, we propose metrics to assess the range, calibration, and consistency of the generated text’s attribute intensity in response to varying control values, as well as its relevance to the intended context. To quantify the attribute intensity and context relevance, we propose an effective evaluation framework leveraging the Elo rating system and GPT4, both renowned for their robust alignment with human judgment. We look into two viable training-free methods for achieving smooth control of LLMs: (1) Prompting with semantic shifters, and (2) Modifying internal model representations. The evaluations of these two methods are conducted on 5 different attributes with various models. Our code and dataset can be obtained from this https URL.
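
Since the abstract relies on the Elo rating system to turn pairwise judgments into an intensity score, a minimal sketch of the standard Elo update may help. It assumes a K-factor of 32 and an initial rating of 1000, and it takes the pairwise verdicts as plain tuples; how the paper actually obtains and aggregates those verdicts is not stated in the abstract, so `rate_texts` and its input format are hypothetical.

```python
def expected_score(rating_a, rating_b):
    """Standard Elo expectation that A beats B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Return updated (A, B) ratings; score_a is 1.0 (A wins), 0.5 (tie), or 0.0."""
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


def rate_texts(num_texts, comparisons, initial=1000.0, k=32.0):
    """Apply Elo updates for (index_a, index_b, score_a) verdicts in order.

    A higher final rating means the text was judged to express the
    attribute more strongly (assumed interpretation).
    """
    ratings = [initial] * num_texts
    for a, b, score_a in comparisons:
        ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], score_a, k)
    return ratings
```

For instance, `rate_texts(3, [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0)])` ranks text 0 above text 1 above text 2, and the total rating mass stays constant because each update is zero-sum.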

[NLP-66] MAIRA-2: Grounded Radiology Report Generation

Link: https://arxiv.org/abs/2406.04449
Authors: Shruthi Bannur, Kenza Bouzid, Daniel C. Castro, Anton Schwaighofer, Sam Bond-Taylor, Maximilian Ilse, Fernando Pérez-García, Valentina Salvatelli, Harshita Sharma, Felix Meissen, Mercy Ranjit, Shaury Srivastav, Julia Gong, Fabian Falck, Ozan Oktay, Anja Thieme, Matthew P. Lungren, Maria Teodora Wetscherek, Javier Alvarez-Valle, Stephanie L. Hyland
Keywords: requires detailed image, Radiology reporting, integration of multiple, detailed image understanding, requires detailed
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: 44 pages, 20 figures

Abstract:Radiology reporting is a complex task that requires detailed image understanding, integration of multiple inputs, including comparison with prior imaging, and precise language generation. This makes it ideal for the development and use of generative multimodal models. Here, we extend report generation to include the localisation of individual findings on the image - a task we call grounded report generation. Prior work indicates that grounding is important for clarifying image understanding and interpreting AI-generated text. Therefore, grounded reporting stands to improve the utility and transparency of automated report drafting. To enable evaluation of grounded reporting, we propose a novel evaluation framework - RadFact - leveraging the reasoning capabilities of large language models (LLMs). RadFact assesses the factuality of individual generated sentences, as well as correctness of generated spatial localisations when present. We introduce MAIRA-2, a large multimodal model combining a radiology-specific image encoder with a LLM, and trained for the new task of grounded report generation on chest X-rays. MAIRA-2 uses more comprehensive inputs than explored previously: the current frontal image, the current lateral image, the prior frontal image and prior report, as well as the Indication, Technique and Comparison sections of the current report. We demonstrate that these additions significantly improve report quality and reduce hallucinations, establishing a new state of the art on findings generation (without grounding) on MIMIC-CXR while demonstrating the feasibility of grounded reporting as a novel and richer task.

[NLP-67] TexIm FAST: Text-to-Image Representation for Semantic Similarity Evaluation using Transformers

Link: https://arxiv.org/abs/2406.04438
Authors: Wazib Ansar, Saptarsi Goswami, Amlan Chakrabarti
Keywords: Natural Language Processing, Language Processing, Natural Language, objectives of Natural, principal objectives
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 19 pages, 33 figures

Abstract:One of the principal objectives of Natural Language Processing (NLP) is to generate meaningful representations from text. Improving the informativeness of the representations has led to a tremendous rise in the dimensionality and the memory footprint. It leads to a cascading effect amplifying the complexity of the downstream model by increasing its parameters. The available techniques cannot be applied to cross-modal applications such as text-to-image. To ameliorate these issues, a novel Text-to-Image methodology for generating fixed-length representations through a self-supervised Variational Auto-Encoder (VAE) for semantic evaluation applying transformers (TexIm FAST) has been proposed in this paper. The pictorial representations allow oblivious inference while retaining the linguistic intricacies, and are potent in cross-modal applications. TexIm FAST deals with variable-length sequences and generates fixed-length representations with over 75% reduced memory footprint. It enhances the efficiency of the models for downstream tasks by reducing its parameters. The efficacy of TexIm FAST has been extensively analyzed for the task of Semantic Textual Similarity (STS) upon the MSRPC, CNN/ Daily Mail, and XSum data-sets. The results demonstrate 6% improvement in accuracy compared to the baseline and showcase its exceptional ability to compare disparate length sequences such as a text with its summary.

[NLP-68] MoralBench: Moral Evaluation of LLMs

Link: https://arxiv.org/abs/2406.04428
Authors: Jianchao Ji, Yutong Chen, Mingyu Jin, Wujiang Xu, Wenyue Hua, Yongfeng Zhang
Keywords: decision-making support systems, rapidly evolving field, natural language processing, myriad of applications, support systems
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:In the rapidly evolving field of artificial intelligence, large language models (LLMs) have emerged as powerful tools for a myriad of applications, from natural language processing to decision-making support systems. However, as these models become increasingly integrated into societal frameworks, the imperative to ensure they operate within ethical and moral boundaries has never been more critical. This paper introduces a novel benchmark designed to measure and compare the moral reasoning capabilities of LLMs. We present the first comprehensive dataset specifically curated to probe the moral dimensions of LLM outputs, addressing a wide range of ethical dilemmas and scenarios reflective of real-world complexities. The main contribution of this work lies in the development of benchmark datasets and metrics for assessing the moral identity of LLMs, which accounts for nuance, contextual sensitivity, and alignment with human ethical standards. Our methodology involves a multi-faceted approach, combining quantitative analysis with qualitative insights from ethics scholars to ensure a thorough evaluation of model performance. By applying our benchmark across several leading LLMs, we uncover significant variations in moral reasoning capabilities of different models. These findings highlight the importance of considering moral reasoning in the development and evaluation of LLMs, as well as the need for ongoing research to address the biases and limitations uncovered in our study. We publicly release the benchmark at this https URL and also open-source the code of the project at this https URL.

[NLP-69] Aligning Large Language Models with Self-generated Preference Data

Link: https://arxiv.org/abs/2406.04412
Authors: Dongyoung Kim, Kimin Lee, Jinwoo Shin, Jaehyung Kim
Keywords: Aligning large language, Aligning large, large human-annotated preference, human-annotated preference dataset, large language models
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 18 pages, under review

Abstract:Aligning large language models (LLMs) with human preferences becomes a key component to obtaining state-of-the-art performance, but it yields a huge cost to construct a large human-annotated preference dataset. To tackle this problem, we propose a new framework that boosts the alignment of LLMs through Self-generated Preference data (Selfie) using only a very small amount of human-annotated preference data. Our key idea is leveraging the human prior knowledge within the small (seed) data and progressively improving the alignment of LLM, by iteratively generating the responses and learning from them with the self-annotated preference data. To be specific, we propose to derive the preference label from the logits of LLM to explicitly extract the model’s inherent preference. Compared to the previous approaches using external reward models or implicit in-context learning, we observe that the proposed approach is significantly more effective. In addition, we introduce a noise-aware preference learning algorithm to mitigate the risk of low quality within generated preference data. Our experimental results demonstrate that the proposed framework significantly boosts the alignment of LLMs. For example, we achieve superior alignment performance on AlpacaEval 2.0 with only 3.3% of the ground-truth preference labels in the Ultrafeedback data compared to the cases using the entire data or state-of-the-art baselines.

[NLP-70] Why Has Predicting Downstream Capabilities of Frontier AI Models with Scale Remained Elusive?

Link: https://arxiv.org/abs/2406.04391
Authors: Rylan Schaeffer, Hailey Schoelkopf, Brando Miranda, Gabriel Mukobi, Varun Madan, Adam Ibrahim, Herbie Bradley, Stella Biderman, Sanmi Koyejo
Keywords: extremely desirable property, desirable property, advanced AI systems, extremely desirable, probability mass
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:Predictable behavior from scaling advanced AI systems is an extremely desirable property. Although a well-established literature exists on how pretraining performance scales, the literature on how particular downstream capabilities scale is significantly muddier. In this work, we take a step back and ask: why has predicting specific downstream capabilities with scale remained elusive? While many factors are certainly responsible, we identify a new factor that makes modeling scaling behavior on widely used multiple-choice question-answering benchmarks challenging. Using five model families and twelve well-established multiple-choice benchmarks, we show that downstream performance is computed from negative log likelihoods via a sequence of transformations that progressively degrade the statistical relationship between performance and scale. We then reveal the mechanism causing this degradation: downstream metrics require comparing the correct choice against a small number of specific incorrect choices, meaning accurately predicting downstream capabilities requires predicting not just how probability mass concentrates on the correct choice with scale, but also how probability mass fluctuates on specific incorrect choices with scale. We empirically study how probability mass on the correct choice co-varies with probability mass on incorrect choices with increasing compute, suggesting that scaling laws for incorrect choices might be achievable. Our work also explains why pretraining scaling laws are commonly regarded as more predictable than downstream capabilities and contributes towards establishing scaling-predictable evaluations of frontier AI models.
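
The mechanism described in the abstract, where accuracy depends on the incorrect choices as well as the correct one, can be sketched as a chain of transformations from negative log-likelihoods to a score. This is a simplified illustration and not the authors' pipeline: it assumes the score is plain argmax accuracy over probabilities renormalized across the listed choices, and all function names are hypothetical.

```python
import math


def choice_probabilities(nlls):
    """Turn per-choice negative log-likelihoods into probabilities
    renormalized over the available choices only."""
    masses = [math.exp(-nll) for nll in nlls]
    total = sum(masses)
    return [mass / total for mass in masses]


def is_correct(nlls, correct_index):
    """An item is answered correctly if the correct choice has the highest probability."""
    probs = choice_probabilities(nlls)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best == correct_index


def accuracy(examples):
    """examples: list of (nlls, correct_index) pairs."""
    hits = sum(is_correct(nlls, idx) for nlls, idx in examples)
    return hits / len(examples)


# Same NLL (1.0) on the correct choice in both items; only a distractor changes.
same_correct_nll = [([1.0, 2.0, 3.0], 0), ([1.0, 0.5, 3.0], 0)]
```

Both toy items assign the correct choice the same likelihood, yet only the first is scored as correct, so `accuracy(same_correct_nll)` is 0.5. Predicting the metric therefore requires knowing how mass moves on the specific distractors, which is the point the abstract makes.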

[NLP-71] Exploring the Latest LLMs for Leaderboard Extraction

Link: https://arxiv.org/abs/2406.04383
Authors: Salomon Kabongo, Jennifer D’Souza, Sören Auer
Keywords: Large Language Models, Large Language, automating complex tasks, advancements in Large, Language Models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:The rapid advancements in Large Language Models (LLMs) have opened new avenues for automating complex tasks in AI research. This paper investigates the efficacy of different LLMs (Mistral 7B, Llama-2, GPT-4-Turbo, and GPT-4.o) in extracting leaderboard information from empirical AI research articles. We explore three types of contextual inputs to the models: DocTAET (Document Title, Abstract, Experimental Setup, and Tabular Information), DocREC (Results, Experiments, and Conclusions), and DocFULL (entire document). Our comprehensive study evaluates the performance of these models in generating (Task, Dataset, Metric, Score) quadruples from research papers. The findings reveal significant insights into the strengths and limitations of each model and context type, providing valuable guidance for future AI research automation efforts.

[NLP-72] VHDL-Eval: A Framework for Evaluating Large Language Models in VHDL Code Generation

Link: https://arxiv.org/abs/2406.04379
Authors: Prashanth Vijayaraghavan, Luyao Shi, Stefano Ambrogio, Charles Mackin, Apoorva Nitsure, David Beymer, Ehsan Degan
Keywords: Large Language Models, Hardware Description Languages, VHDL code generation, Large Language, advancements in Large
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 6 pages, 3 Figures, LAD’24

Abstract:With the unprecedented advancements in Large Language Models (LLMs), their application domains have expanded to include code generation tasks across various programming languages. While significant progress has been made in enhancing LLMs for popular programming languages, there exists a notable gap in comprehensive evaluation frameworks tailored for Hardware Description Languages (HDLs), particularly VHDL. This paper addresses this gap by introducing a comprehensive evaluation framework designed specifically for assessing LLM performance in VHDL code generation task. We construct a dataset for evaluating LLMs on VHDL code generation task. This dataset is constructed by translating a collection of Verilog evaluation problems to VHDL and aggregating publicly available VHDL problems, resulting in a total of 202 problems. To assess the functional correctness of the generated VHDL code, we utilize a curated set of self-verifying testbenches specifically designed for those aggregated VHDL problem set. We conduct an initial evaluation of different LLMs and their variants, including zero-shot code generation, in-context learning (ICL), and Parameter-efficient fine-tuning (PEFT) methods. Our findings underscore the considerable challenges faced by existing LLMs in VHDL code generation, revealing significant scope for improvement. This study emphasizes the necessity of supervised fine-tuning code generation models specifically for VHDL, offering potential benefits to VHDL designers seeking efficient code generation solutions.

[NLP-73] Phased Instruction Fine-Tuning for Large Language Models

Link: https://arxiv.org/abs/2406.04371
Authors: Wei Pang, Chuan Zhou, Xiao-Hua Zhou, Xiaojie Wang
Keywords: one-off training approach, Phased IFT, mere next-word prediction, phased instruction fine-tuning, Instruction
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Review version, to appear at ACL 2024 Findings

Abstract:Instruction Fine-Tuning, a method enhancing pre-trained language models’ capabilities from mere next-word prediction to complex instruction following, often employs a one-off training approach on diverse instruction dataset. However, this method may not effectively enhance models’ adherence to instructions due to the simultaneous handling of varying instruction complexities. To address this, we propose a novel phased instruction fine-tuning (Phased IFT) method, grounded in the hypothesis of progressive alignment, which posits that the transition of a pre-trained language model from simple next-word prediction to sophisticated instruction following is a gradual learning process. Specifically, we obtain the score of difficulty for each instruction via GPT-4, stratify the instruction data into subsets of increasing difficulty, and sequentially uptrain on these subsets using the standard supervised loss. Through extensive experiments on the pre-trained models Llama-2 7B/13B, and Mistral-7B using the 52K Alpaca instruction data, we demonstrate that Phased IFT significantly surpasses traditional one-off instruction fine-tuning (One-off IFT) method in win rate, empirically validating the progressive alignment hypothesis. Our findings suggest that Phased IFT offers a simple yet effective pathway for elevating the instruction-following capabilities of pre-trained language models. Models and datasets from our experiments are freely available at this https URL.
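
The phased procedure in the abstract (score each instruction, stratify by difficulty, then train on the subsets in order) can be outlined as follows. This is a rough sketch under assumptions: the abstract does not say how the difficulty thresholds are chosen, so equal-sized buckets are used here, and `train_step` is a hypothetical callback standing in for a real supervised fine-tuning run.

```python
def stratify_by_difficulty(examples, scores, num_phases=3):
    """Split examples into num_phases buckets of increasing difficulty.

    Buckets are as equal in size as possible (an assumption; the paper's
    actual thresholds are not given in the abstract).
    """
    order = sorted(range(len(examples)), key=lambda i: scores[i])
    size, remainder = divmod(len(order), num_phases)
    phases, start = [], 0
    for phase_index in range(num_phases):
        end = start + size + (1 if phase_index < remainder else 0)
        phases.append([examples[i] for i in order[start:end]])
        start = end
    return phases


def phased_training(model_state, phases, train_step):
    """Train sequentially, carrying the model state from easier to harder phases."""
    for phase in phases:
        model_state = train_step(model_state, phase)
    return model_state
```

For example, with instructions `["a", "b", "c", "d", "e"]`, difficulty scores `[3, 1, 5, 2, 4]`, and two phases, the easier bucket is `["b", "d", "a"]` and the harder one is `["e", "c"]`.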

[NLP-74] Large Language Model Confidence Estimation via Black-Box Access

Link: https://arxiv.org/abs/2406.04370
Authors: Tejaswini Pedapati, Amit Dhurandhar, Soumya Ghosh, Soham Dan, Prasanna Sattigeri
Keywords: significant in evaluating, evaluating trust, Estimating uncertainty, confidence, responses
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Estimating uncertainty or confidence in the responses of a model can be significant in evaluating trust not only in the responses, but also in the model as a whole. In this paper, we explore the problem of estimating confidence for responses of large language models (LLMs) with simply black-box or query access to them. We propose a simple and extensible framework in which we engineer novel features and train an (interpretable) model (viz. logistic regression) on these features to estimate the confidence. We empirically demonstrate that our simple framework is effective in estimating confidence of flan-ul2, llama-13b and mistral-7b, consistently outperforming existing black-box confidence estimation approaches on benchmark datasets such as TriviaQA, SQuAD, CoQA and Natural Questions, by over 10% (on AUROC) in some cases. Additionally, our interpretable approach provides insight into features that are predictive of confidence, leading to the interesting and useful discovery that our confidence models built for one LLM generalize zero-shot across others on a given dataset.
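
To make the "engineered features plus logistic regression" idea concrete, here is a minimal sketch that fits a logistic regression with plain gradient descent. It is not the authors' implementation: the abstract does not list the features, so the two features below (for example, agreement across rephrased prompts and across samples) are purely hypothetical, as are the toy values and function names.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(features, labels, lr=0.5, epochs=500):
    """Fit weights and bias by batch gradient descent on the logistic loss."""
    n, n_feat = len(features), len(features[0])
    weights, bias = [0.0] * n_feat, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_feat, 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def confidence(weights, bias, x):
    """Estimated probability that a response with features x is correct."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


# Hypothetical features per response; label 1 means the response was correct.
toy_features = [[0.9, 0.8], [0.8, 0.9], [0.2, 0.1], [0.1, 0.3]]
toy_labels = [1, 1, 0, 0]
toy_weights, toy_bias = fit_logistic(toy_features, toy_labels)
```

Because the model is linear in the features, the learned weights can be read directly as an indication of which features push the confidence up or down, which matches the interpretability argument in the abstract.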

[NLP-75] SocialNLP Fake-EmoReact 2021 Challenge Overview: Predicting Fake Tweets from Their Replies and GIFs

Link: https://arxiv.org/abs/2406.04368
Authors: Chien-Kun Huang, Yi-Ting Chang, Lun-Wei Ku, Cheng-Te Li, Hong-Han Shuai
Keywords: SocialNLP Workshop, conjunction with NAACL, augmented GIF categories, Workshop, NAACL
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

Abstract:This paper provides an overview of the Fake-EmoReact 2021 Challenge, held at the 9th SocialNLP Workshop, in conjunction with NAACL 2021. The challenge requires predicting the authenticity of tweets using reply context and augmented GIF categories from the EmotionGIF dataset. We offer the Fake-EmoReact dataset with more than 453k as the experimental materials, where every tweet is labeled with authenticity. Twenty-four teams registered to participate in this challenge, and 5 submitted their results successfully in the evaluation phase. The best team achieves an F1 score of 93.9 on the Fake-EmoReact 2021 dataset. In addition, we show the definition of the shared task, the data collection, and the performance and approaches of the teams that joined this challenge.

[NLP-76] LLM-based speaker diarization correction: A generalizable approach

Link: https://arxiv.org/abs/2406.04927
Authors: Georgios Efstathiadis, Vijay Yadav, Anzar Abbas
Keywords: automated speech recognition, Speaker diarization, speech recognition, automated speech, Speaker
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

Abstract:Speaker diarization is necessary for interpreting conversations transcribed using automated speech recognition (ASR) tools. Despite significant developments in diarization methods, diarization accuracy remains an issue. Here, we investigate the use of large language models (LLMs) for diarization correction as a post-processing step. LLMs were fine-tuned using the Fisher corpus, a large dataset of transcribed conversations. The ability of the models to improve diarization accuracy in a holdout dataset was measured. We report that fine-tuned LLMs can markedly improve diarization accuracy. However, model performance is constrained to transcripts produced using the same ASR tool as the transcripts used for fine-tuning, limiting generalizability. To address this constraint, an ensemble model was developed by combining weights from three separate models, each fine-tuned using transcripts from a different ASR tool. The ensemble model demonstrated better overall performance than each of the ASR-specific models, suggesting that a generalizable and ASR-agnostic approach may be achievable. We hope to make these models accessible through public-facing APIs for use by third-party applications.

[NLP-77] XTTS: a Massively Multilingual Zero-Shot Text-to-Speech Model

Link: https://arxiv.org/abs/2406.04904
Authors: Edresson Casanova, Kelly Davis, Eren Gölge, Görkem Göknar, Iulian Gulea, Logan Hart, Aya Aljafari, Joshua Meyer, Reuben Morais, Samuel Olayemi, Julian Weber
Keywords: Zero-shot Multi-speaker TTS, Multi-speaker TTS, Zero-shot Multi-speaker, medium resource languages, medium resource
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
Comments: Accepted at INTERSPEECH 2024

Abstract:Most Zero-shot Multi-speaker TTS (ZS-TTS) systems support only a single language. Although models like YourTTS, VALL-E X, Mega-TTS 2, and Voicebox explored Multilingual ZS-TTS they are limited to just a few high/medium resource languages, limiting the applications of these models in most of the low/medium resource languages. In this paper, we aim to alleviate this issue by proposing and making publicly available the XTTS system. Our method builds upon the Tortoise model and adds several novel modifications to enable multilingual training, improve voice cloning, and enable faster training and inference. XTTS was trained in 16 languages and achieved state-of-the-art (SOTA) results in most of them.

[NLP-78] What do MLLMs hear? Examining reasoning with text and sound components in Multimodal Large Language Models

Link: https://arxiv.org/abs/2406.04615
Authors: Enis Berk Çoban, Michael I. Mandel, Johanna Devaney
Keywords: Large Language Models, Large Language, Language Models, notably in connecting, solve problems
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
Comments: 9 pages

Abstract:Large Language Models (LLMs) have demonstrated remarkable reasoning capabilities, notably in connecting ideas and adhering to logical rules to solve problems. These models have evolved to accommodate various data modalities, including sound and images, known as multimodal LLMs (MLLMs), which are capable of describing images or sound recordings. Previous work has demonstrated that when the LLM component in MLLMs is frozen, the audio or visual encoder serves to caption the sound or image input facilitating text-based reasoning with the LLM component. We are interested in using the LLM’s reasoning capabilities in order to facilitate classification. In this paper, we demonstrate through a captioning/classification experiment that an audio MLLM cannot fully leverage its LLM’s text-based reasoning when generating audio captions. We also consider how this may be due to MLLMs separately representing auditory and textual information such that it severs the reasoning pathway from the LLM to the audio encoder.

[NLP-79] Small-E: Small Language Model with Linear Attention for Efficient Speech Synthesis

Link: https://arxiv.org/abs/2406.04467
Authors: Théodor Lemerle, Nicolas Obin, Axel Roebel
Keywords: showcased remarkable capabilities, Recent advancements, powered by language, language models, models have showcased
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
Comments: Interspeech

Abstract:Recent advancements in text-to-speech (TTS) powered by language models have showcased remarkable capabilities in achieving naturalness and zero-shot voice cloning. Notably, the decoder-only transformer is the prominent architecture in this domain. However, transformers face challenges stemming from their quadratic complexity in sequence length, impeding training on lengthy sequences and resource-constrained hardware. Moreover they lack specific inductive bias with regards to the monotonic nature of TTS alignments. In response, we propose to replace transformers with emerging recurrent architectures and introduce specialized cross-attention mechanisms for reducing repeating and skipping issues. Consequently our architecture can be efficiently trained on long samples and achieve state-of-the-art zero-shot voice cloning against baselines of comparable size.

[NLP-80] LipGER: Visually-Conditioned Generative Error Correction for Robust Automatic Speech Recognition

Link: https://arxiv.org/abs/2406.04432
Authors: Sreyan Ghosh, Sonal Kumar, Ashish Seth, Purva Chiniya, Utkarsh Tyagi, Ramani Duraiswami, Dinesh Manocha
Keywords: Automatic Speech Recognition, Speech Recognition, Automatic Speech, ASR error correction, performance of Automatic
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: InterSpeech 2024. Code and Data: this https URL

Abstract:Visual cues, like lip motion, have been shown to improve the performance of Automatic Speech Recognition (ASR) systems in noisy environments. We propose LipGER (Lip Motion aided Generative Error Correction), a novel framework for leveraging visual cues for noise-robust ASR. Instead of learning the cross-modal correlation between the audio and visual modalities, we make an LLM learn the task of visually-conditioned (generative) ASR error correction. Specifically, we instruct an LLM to predict the transcription from the N-best hypotheses generated using ASR beam-search. This is further conditioned on lip motions. This approach addresses key challenges in traditional AVSR learning, such as the lack of large-scale paired datasets and difficulties in adapting to new domains. We experiment on 4 datasets in various settings and show that LipGER improves the Word Error Rate in the range of 1.1%-49.2%. We also release LipHyp, a large-scale dataset with hypothesis-transcription pairs that is additionally equipped with lip motion cues to promote further research in this space
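
The abstract reports gains in Word Error Rate (WER). For readers unfamiliar with the metric, WER is conventionally the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The sketch below is a minimal illustration under that conventional definition; `word_error_rate` is a hypothetical helper, whitespace tokenization is an assumption, and the paper's own scoring pipeline may normalize text differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()  # assumption: plain whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.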

Computer Vision

[CV-0] 3D-GRAND: Towards Better Grounding and Less Hallucination for 3D-LLMs

Link: https://arxiv.org/abs/2406.05132
Authors: Jianing Yang, Xuweiyi Chen, Nikhil Madaan, Madhavan Iyengar, Shengyi Qian, David F. Fouhey, Joyce Chai
Keywords: developing embodied agents, perception is crucial, physical world, crucial for developing, agents and robots
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: Project website: this https URL

Abstract:The integration of language and 3D perception is crucial for developing embodied agents and robots that comprehend and interact with the physical world. While large language models (LLMs) have demonstrated impressive language understanding and generation capabilities, their adaptation to 3D environments (3D-LLMs) remains in its early stages. A primary challenge is the absence of large-scale datasets that provide dense grounding between language and 3D scenes. In this paper, we introduce 3D-GRAND, a pioneering large-scale dataset comprising 40,087 household scenes paired with 6.2 million densely-grounded scene-language instructions. Our results show that instruction tuning with 3D-GRAND significantly enhances grounding capabilities and reduces hallucinations in 3D-LLMs. As part of our contributions, we propose a comprehensive benchmark 3D-POPE to systematically evaluate hallucination in 3D-LLMs, enabling fair comparisons among future models. Our experiments highlight a scaling effect between dataset size and 3D-LLM performance, emphasizing the critical role of large-scale 3D-text datasets in advancing embodied AI research. Notably, our results demonstrate early signals for effective sim-to-real transfer, indicating that models trained on large synthetic data can perform well on real-world 3D scans. Through 3D-GRAND and 3D-POPE, we aim to equip the embodied AI community with essential resources and insights, setting the stage for more reliable and better-grounded 3D-LLMs. Project website: this https URL

[CV-1] DVOS: Self-Supervised Dense-Pattern Video Object Segmentation

Link: https://arxiv.org/abs/2406.05131
Authors: Keyhan Najafian, Farhad Maleki, Ian Stavness, Lingling Jin
Keywords: approaches primarily rely, large-scale pixel-accurate human-annotated, pixel-accurate human-annotated datasets, segmentation approaches primarily, Video object segmentation
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Abstract:Video object segmentation approaches primarily rely on large-scale pixel-accurate human-annotated datasets for model development. In Dense Video Object Segmentation (DVOS) scenarios, each video frame encompasses hundreds of small, dense, and partially occluded objects. Accordingly, the labor-intensive manual annotation of even a single frame often takes hours, which hinders the development of DVOS for many applications. Furthermore, in videos with dense patterns, following a large number of objects that move in different directions poses additional challenges. To address these challenges, we proposed a semi-self-supervised spatiotemporal approach for DVOS utilizing a diffusion-based method through multi-task learning. Emulating real videos’ optical flow and simulating their motion, we developed a methodology to synthesize computationally annotated videos that can be used for training DVOS models; The model performance was further improved by utilizing weakly labeled (computationally generated but imprecise) data. To demonstrate the utility and efficacy of the proposed approach, we developed DVOS models for wheat head segmentation of handheld and drone-captured videos, capturing wheat crops in fields of different locations across various growth stages, spanning from heading to maturity. Despite using only a few manually annotated video frames, the proposed approach yielded high-performing models, achieving a Dice score of 0.82 when tested on a drone-captured external test set. While we showed the efficacy of the proposed approach for wheat head segmentation, its application can be extended to other crops or DVOS in other domains, such as crowd analysis or microscopic image analysis.
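
The abstract reports a Dice score of 0.82. As a reminder of what that number measures, the Dice coefficient of a predicted mask A and a ground-truth mask B is 2|A ∩ B| / (|A| + |B|). The sketch below is a minimal illustration on small nested lists of 0/1 values; `dice_score` is a hypothetical helper, and returning 1.0 when both masks are empty is a convention assumed here, not something stated in the paper.

```python
from typing import List

Mask = List[List[int]]


def dice_score(pred: Mask, target: Mask) -> float:
    """Dice = 2|A ∩ B| / (|A| + |B|) for two binary masks of equal shape."""
    if len(pred) != len(target) or any(
        len(p_row) != len(t_row) for p_row, t_row in zip(pred, target)
    ):
        raise ValueError("masks must have the same shape")
    pred_flat = [value for row in pred for value in row]
    target_flat = [value for row in target for value in row]
    # Count pixels that are foreground in both masks.
    intersection = sum(1 for p, t in zip(pred_flat, target_flat) if p and t)
    total = sum(1 for p in pred_flat if p) + sum(1 for t in target_flat if t)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2 * intersection / total


predicted = [[1, 1, 0], [0, 1, 0]]
ground_truth = [[1, 0, 0], [0, 1, 1]]
score = dice_score(predicted, ground_truth)  # 2 * 2 / (3 + 3)
```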

[CV-2] PatchSVD: A Non-uniform SVD-based Image Compression Algorithm

Link: https://arxiv.org/abs/2406.05129
Authors: Zahra Golpayegani, Nizar Bouguila
Keywords: involves large file, large file sizes, file sizes due, Storing data, image compression
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 6 figures

Abstract:Storing data is particularly a challenge when dealing with image data which often involves large file sizes due to the high resolution and complexity of images. Efficient image compression algorithms are crucial to better manage data storage costs. In this paper, we propose a novel region-based lossy image compression technique, called PatchSVD, based on the Singular Value Decomposition (SVD) algorithm. We show through experiments that PatchSVD outperforms SVD-based image compression with respect to three popular image compression metrics. Moreover, we compare PatchSVD compression artifacts with those of Joint Photographic Experts Group (JPEG) and SVD-based image compression and illustrate some cases where PatchSVD compression artifacts are preferable compared to JPEG and SVD artifacts.

[CV-3] Towards Semantic Equivalence of Tokenization in Multimodal LLM

Link: https://arxiv.org/abs/2406.05127
Authors: Shengqiong Wu, Hao Fei, Xiangtai Li, Jiayi Ji, Hanwang Zhang, Tat-Seng Chua, Shuicheng Yan
Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, demonstrated exceptional capabilities, Language Models
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical Report. The project page: this https URL

Abstract:Multimodal Large Language Models (MLLMs) have demonstrated exceptional capabilities in processing vision-language tasks. One of the crux of MLLMs lies in vision tokenization, which involves efficiently transforming input visual signals into feature representations that are most beneficial for LLMs. However, existing vision tokenizers, essential for semantic alignment between vision and language, remain problematic. Existing methods aggressively fragment visual input, corrupting the visual semantic integrity. To address this, this paper proposes a novel dynamic Semantic-Equivalent Vision Tokenizer (SeTok), which groups visual features into semantic units via a dynamic clustering algorithm, flexibly determining the number of tokens based on image complexity. The resulting vision tokens effectively preserve semantic integrity and capture both low-frequency and high-frequency visual features. The proposed MLLM (Setokim) equipped with SeTok significantly demonstrates superior performance across various tasks, as evidenced by our experimental results. The project page is at this https URL.

[CV-4] Contextual fusion enhances robustness to image blurring

Link: https://arxiv.org/abs/2406.05120
Authors: Shruti Joshi, Aiswarya Akumalla, Seth Haney, Maxim Bazhenov
Keywords: Mammalian brains handle, brains handle complex, brain regions specialized, handle complex reasoning, Mammalian brains
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: arXiv admin note: text overlap with arXiv:2011.09526

Abstract:Mammalian brains handle complex reasoning by integrating information across brain regions specialized for particular sensory modalities. This enables improved robustness and generalization versus deep neural networks, which typically process one modality and are vulnerable to perturbations. While defense methods exist, they do not generalize well across perturbations. We developed a fusion model combining background and foreground features from CNNs trained on Imagenet and Places365. We tested its robustness to human-perceivable perturbations on MS COCO. The fusion model improved robustness, especially for classes with greater context variability. Our proposed solution for integrating multiple modalities provides a new approach to enhance robustness and may be complementary to existing methods.

[CV-5] Compositional Curvature Bounds for Deep Neural Networks

Link: https://arxiv.org/abs/2406.05119
Authors: Taha Entesari, Sina Sharifi, Mahyar Fazlyab
Keywords: neural networks, key challenge, challenge that threatens, threatens the widespread, safety-critical applications
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Proceedings of the 41st International Conference on Machine Learning (ICML 2024)

Abstract:A key challenge that threatens the widespread use of neural networks in safety-critical applications is their vulnerability to adversarial attacks. In this paper, we study the second-order behavior of continuously differentiable deep neural networks, focusing on robustness against adversarial perturbations. First, we provide a theoretical analysis of robustness and attack certificates for deep classifiers by leveraging local gradients and upper bounds on the second derivative (curvature constant). Next, we introduce a novel algorithm to analytically compute provable upper bounds on the second derivative of neural networks. This algorithm leverages the compositional structure of the model to propagate the curvature bound layer-by-layer, giving rise to a scalable and modular approach. The proposed bound can serve as a differentiable regularizer to control the curvature of neural networks during training, thereby enhancing robustness. Finally, we demonstrate the efficacy of our method on classification tasks using the MNIST and CIFAR-10 datasets.

[CV-6] The Expanding Scope of the Stability Gap: Unveiling its Presence in Joint Incremental Learning of Homogeneous Tasks

Link: https://arxiv.org/abs/2406.05114
Authors: Sandesh Kamath, Albin Soutif-Cormerais, Joost van de Weijer, Bogdan Raducanu
Keywords: Recent research identified, Recent research, temporary performance drop, previously learned tasks, research identified
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at CVPR 2024 Workshop on Continual Learning in Computer Vision (CLVision)

Abstract:Recent research identified a temporary performance drop on previously learned tasks when transitioning to a new one. This drop is called the stability gap and has great consequences for continual learning: it complicates the direct employment of continually learning since the worse-case performance at task-boundaries is dramatic, it limits its potential as an energy-efficient training paradigm, and finally, the stability drop could result in a reduced final performance of the algorithm. In this paper, we show that the stability gap also occurs when applying joint incremental training of homogeneous tasks. In this scenario, the learner continues training on the same data distribution and has access to all data from previous tasks. In addition, we show that in this scenario, there exists a low-loss linear path to the next minima, but that SGD optimization does not choose this path. We perform further analysis including a finer batch-wise analysis which could provide insights towards potential solution directions.

[CV-7] LLavaGuard: VLM-based Safeguards for Vision Dataset Curation and Safety Assessment

Link: https://arxiv.org/abs/2406.05113
Authors: Lukas Helff, Felix Friedrich, Manuel Brack, Kristian Kersting, Patrick Schramowski
Keywords: VLM-based safeguard models, offering a versatile, family of VLM-based, VLM-based safeguard, versatile framework
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Project page at this https URL

Abstract:We introduce LlavaGuard, a family of VLM-based safeguard models, offering a versatile framework for evaluating the safety compliance of visual content. Specifically, we designed LlavaGuard for dataset annotation and generative model safeguarding. To this end, we collected and annotated a high-quality visual dataset incorporating a broad safety taxonomy, which we use to tune VLMs on context-aware safety risks. As a key innovation, LlavaGuard’s new responses contain comprehensive information, including a safety rating, the violated safety categories, and an in-depth rationale. Further, our introduced customizable taxonomy categories enable the context-specific alignment of LlavaGuard to various scenarios. Our experiments highlight the capabilities of LlavaGuard in complex and real-world applications. We provide checkpoints ranging from 7B to 34B parameters demonstrating state-of-the-art performance, with even the smallest models outperforming baselines like GPT-4. We make our dataset and model weights publicly available and invite further research to address the diverse needs of communities and contexts.

[CV-8] A Novel Time Series-to-Image Encoding Approach for Weather Phenomena Classification

Link: https://arxiv.org/abs/2406.05096
Authors: Christian Giannetti
Keywords: sparked increasing interest, electromagnetic wave attenuation, research community, sparked increasing, increasing interest
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This preprint is the result of work in progress, therefore it should still be considered a draft

Abstract:Rainfall estimation through the analysis of its impact on electromagnetic waves has sparked increasing interest in the research community. Recent studies have delved into its effects on cellular network performance, demonstrating the potential to forecast rainfall levels based on electromagnetic wave attenuation during precipitations. This paper aims to solve the problem of identifying the nature of specific weather phenomena from the received signal level (RSL) in 4G/LTE mobile terminals. Specifically, utilizing time-series data representing RSL, we propose a novel approach to encode time series as images and model the task as an image classification problem, which we finally address using convolutional neural networks (CNNs). The main benefit of the abovementioned procedure is the opportunity to utilize various data augmentation techniques simultaneously. This encompasses applying traditional approaches, such as moving averages, to the time series and enhancing the generated images. We have investigated various image data augmentation methods to identify the most effective combination for this scenario. In the upcoming sections, we will introduce the task of rainfall estimation and conduct a comprehensive analysis of the dataset used. Subsequently, we will formally propose a new approach for converting time series into images. To conclude, the paper’s final section will present and discuss the experiments conducted, providing the reader with a brief yet comprehensive overview of the results.
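
The abstract mentions applying traditional augmentations such as moving averages to the RSL time series before it is encoded as an image. The sketch below shows only that smoothing step, as a simple trailing-window average; `moving_average`, the window size, and the sample RSL values are hypothetical, and the paper's time-series-to-image encoding itself is not reproduced here.

```python
from typing import List


def moving_average(series: List[float], window: int) -> List[float]:
    """Return the mean of each full window of `window` consecutive samples."""
    if window < 1 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    running = sum(series[:window])
    smoothed = [running / window]
    for i in range(window, len(series)):
        # Slide the window: add the new sample, drop the oldest one.
        running += series[i] - series[i - window]
        smoothed.append(running / window)
    return smoothed


# Hypothetical RSL readings in dBm, for illustration only.
rsl_dbm = [-90.0, -92.0, -94.0, -96.0, -98.0]
smoothed_rsl = moving_average(rsl_dbm, window=3)
```

Varying the window size yields several smoothed variants of the same recording, which is one way such a step could serve as augmentation.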

[CV-9] Provably Better Explanations with Optimized Aggregation of Feature Attributions

Link: https://arxiv.org/abs/2406.05090
Authors: Thomas Decker, Ananta R. Bhattarai, Jindong Gu, Volker Tresp, Florian Buettner
Keywords: opaque machine learning, machine learning models, common practice, practice to understand, understand and verify
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: International Conference on Machine Learning (ICML) 2024

Abstract:Using feature attributions for post-hoc explanations is a common practice to understand and verify the predictions of opaque machine learning models. Despite the numerous techniques available, individual methods often produce inconsistent and unstable results, putting their overall reliability into question. In this work, we aim to systematically improve the quality of feature attributions by combining multiple explanations across distinct methods or their variations. For this purpose, we propose a novel approach to derive optimal convex combinations of feature attributions that yield provable improvements of desired quality criteria such as robustness or faithfulness to the model behavior. Through extensive experiments involving various model architectures and popular feature attribution techniques, we demonstrate that our combination strategy consistently outperforms individual methods and existing baselines.

[CV-10] CoNo: Consistency Noise Injection for Tuning-free Long Video Diffusion

Link: https://arxiv.org/abs/2406.05082
Authors: Xingrui Wang, Xin Li, Zhibo Chen
Keywords: Tuning-free long video, long video diffusion, short video diffusion, video diffusion model, generate extended-duration videos
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 21 pages

Abstract:Tuning-free long video diffusion has been proposed to generate extended-duration videos with enriched content by reusing the knowledge from pre-trained short video diffusion model without retraining. However, most works overlook the fine-grained long-term video consistency modeling, resulting in limited scene consistency (i.e., unreasonable object or background transitions), especially with multiple text inputs. To mitigate this, we propose the Consistency Noise Injection, dubbed CoNo, which introduces the “look-back” mechanism to enhance the fine-grained scene transition between different video clips, and designs the long-term consistency regularization to eliminate the content shifts when extending video contents through noise prediction. In particular, the “look-back” mechanism breaks the noise scheduling process into three essential parts, where one internal noise prediction part is injected into two video-extending parts, intending to achieve a fine-grained transition between two video clips. The long-term consistency regularization focuses on explicitly minimizing the pixel-wise distance between the predicted noises of the extended video clip and the original one, thereby preventing abrupt scene transitions. Extensive experiments have shown the effectiveness of the above strategies by performing long-video generation under both single- and multi-text prompt conditions. The project has been available in this https URL.

[CV-11] Diving Deep into the Motion Representation of Video-Text Models

Link: https://arxiv.org/abs/2406.05075
Authors: Chinmaya Devaraj, Cornelia Fermuller, Yiannis Aloimonos
Keywords: motion, informative than images, motion descriptions, Videos, capture dynamic activities
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ACL Findings, 2024

Abstract:Videos are more informative than images because they capture the dynamics of the scene. By representing motion in videos, we can capture dynamic activities. In this work, we introduce GPT-4 generated motion descriptions that capture fine-grained motion descriptions of activities and apply them to three action datasets. We evaluated several video-text models on the task of retrieval of motion descriptions. We found that they fall far behind human expert performance on two action datasets, raising the question of whether video-text models understand motion in videos. To address it, we introduce a method of improving motion understanding in video-text models by utilizing motion descriptions. This method proves to be effective on two action datasets for the motion description retrieval task. The results draw attention to the need for quality captions involving fine-grained motion information in existing datasets and demonstrate the effectiveness of the proposed pipeline in understanding fine-grained motion during video-text retrieval.

[CV-12] Classification Metrics for Image Explanations: Towards Building Reliable XAI-Evaluations

Link: https://arxiv.org/abs/2406.05068
Authors: Benjamin Fresz, Lena Lörcher, Marco Huber
Keywords: computer vision models, deep neural networks, Decision processes, vision models, neural networks
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)

Abstract:Decision processes of computer vision models - especially deep neural networks - are opaque in nature, meaning that these decisions cannot be understood by humans. Thus, over the last years, many methods to provide human-understandable explanations have been proposed. For image classification, the most common group are saliency methods, which provide (super-)pixelwise feature attribution scores for input images. But their evaluation still poses a problem, as their results cannot be simply compared to the unknown ground truth. To overcome this, a slew of different proxy metrics have been defined, which are - as the explainability methods themselves - often built on intuition and thus, are possibly unreliable. In this paper, new evaluation metrics for saliency methods are developed and common saliency methods are benchmarked on ImageNet. In addition, a scheme for reliability evaluation of such metrics is proposed that is based on concepts from psychometric testing. The used code can be found at this https URL .

[CV-13] GenHeld: Generating and Editing Handheld Objects

Link: https://arxiv.org/abs/2406.05059
Authors: Chaerin Min, Srinath Sridhar
Keywords: important human activity, computer vision, studied in robotics, cognitive science, important human
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Grasping is an important human activity that has long been studied in robotics, computer vision, and cognitive science. Most existing works study grasping from the perspective of synthesizing hand poses conditioned on 3D or 2D object representations. We propose GenHeld to address the inverse problem of synthesizing held objects conditioned on 3D hand model or 2D image. Given a 3D model of hand, GenHeld 3D can select a plausible held object from a large dataset using compact object representations called object codes. The selected object is then positioned and oriented to form a plausible grasp without changing hand pose. If only a 2D hand image is available, GenHeld 2D can edit this image to add or replace a held object. GenHeld 2D operates by combining the abilities of GenHeld 3D with diffusion-based image editing. Results and experiments show that we outperform baselines and can generate plausible held objects in both 2D and 3D. Our experiments demonstrate that our method achieves high quality and plausibility of held object synthesis in both 3D and 2D.

[CV-14] Prototype Correlation Matching and Class-Relation Reasoning for Few-Shot Medical Image Segmentation

Link: https://arxiv.org/abs/2406.05054
Authors: Yumin Zhang, Hongliu Li, Yajun Gao, Haoran Duan, Yawen Huang, Yefeng Zheng
Keywords: biomedical imaging field, achieved great progress, inter-class relations, large intra-class variations, imaging field
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Few-shot medical image segmentation has achieved great progress in improving accuracy and efficiency of medical analysis in the biomedical imaging field. However, most existing methods cannot explore inter-class relations among base and novel medical classes to reason unseen novel classes. Moreover, the same kind of medical class has large intra-class variations brought by diverse appearances, shapes and scales, thus causing ambiguous visual characterization to degrade generalization performance of these existing methods on unseen novel classes. To address the above challenges, in this paper, we propose a Prototype correlation Matching and Class-relation Reasoning (i.e., PMCR) model. The proposed model can effectively mitigate false pixel correlation matches caused by large intra-class variations while reasoning inter-class relations among different medical classes. Specifically, in order to address false pixel correlation match brought by large intra-class variations, we propose a prototype correlation matching module to mine representative prototypes that can characterize diverse visual information of different appearances well. We aim to explore prototype-level rather than pixel-level correlation matching between support and query features via optimal transport algorithm to tackle false matches caused by intra-class variations. Meanwhile, in order to explore inter-class relations, we design a class-relation reasoning module to segment unseen novel medical objects via reasoning inter-class relations between base and novel classes. Such inter-class relations can be well propagated to semantic encoding of local query features to improve few-shot segmentation performance. Quantitative comparisons illustrates the large performance improvement of our model over other baseline methods.

[CV-15] Bootstrapping Referring Multi-Object Tracking

Link: https://arxiv.org/abs/2406.05039
Authors: Yani Zhang,Dongming Wu,Wencheng Han,Xingping Dong
Keywords: human instruction represented, tracking multiple objects, natural language expression, Referring multi-object tracking, aims at detecting
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Abstract:Referring multi-object tracking (RMOT) aims at detecting and tracking multiple objects following human instruction represented by a natural language expression. Existing RMOT benchmarks are usually formulated through manual annotations, integrated with static regulations. This approach results in a dearth of notable diversity and a constrained scope of implementation. In this work, our key idea is to bootstrap the task of referring multi-object tracking by introducing discriminative language words as much as possible. Specifically, we first develop Refer-KITTI into a large-scale dataset, named Refer-KITTI-V2. It starts with 2,719 manual annotations, addressing the issue of class imbalance and introducing more keywords to make it closer to real-world scenarios compared to Refer-KITTI. They are further expanded to a total of 9,758 annotations by prompting large language models, which create 617 different words, surpassing previous RMOT benchmarks. In addition, the end-to-end framework in RMOT is also bootstrapped by a simple yet elegant temporal advancement strategy, which achieves better performance than previous approaches. The source code and dataset are available at this https URL.

[CV-16] Efficient 3D Shape Generation via Diffusion Mamba with Bidirectional SSMs

Link: https://arxiv.org/abs/2406.05038
Authors: Shentong Mo
Keywords: state space approach, selective state space, Recent advancements, long sequence handling, efficient long sequence
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Recent advancements in sequence modeling have led to the development of the Mamba architecture, noted for its selective state space approach, offering a promising avenue for efficient long sequence handling. However, its application in 3D shape generation, particularly at high resolutions, remains underexplored. Traditional diffusion transformers (DiT) with self-attention mechanisms, despite their potential, face scalability challenges due to the cubic complexity of attention operations as input length increases. This complexity becomes a significant hurdle when dealing with high-resolution voxel sizes. To address this challenge, we introduce a novel diffusion architecture tailored for 3D point cloud generation: Diffusion Mamba (DiM-3D). This architecture forgoes traditional attention mechanisms, instead utilizing the inherent efficiency of the Mamba architecture to maintain linear complexity with respect to sequence length. DiM-3D is characterized by fast inference times and substantially lower computational demands, quantified in reduced GFLOPs, thereby addressing the key scalability issues of prior models. Our empirical results on the ShapeNet benchmark demonstrate that DiM-3D achieves state-of-the-art performance in generating high-fidelity and diverse 3D shapes. Additionally, DiM-3D shows superior capabilities in tasks like 3D point cloud completion. This not only proves the model’s scalability but also underscores its efficiency in generating detailed, high-resolution voxels necessary for advanced 3D shape modeling, particularly excelling in environments requiring high-resolution voxel sizes. Through these findings, we illustrate the exceptional scalability and efficiency of the Diffusion Mamba framework in 3D shape generation, setting a new standard for the field and paving the way for future explorations in high-resolution 3D modeling technologies.

[CV-17] GANetic Loss for Generative Adversarial Networks with a Focus on Medical Applications

Link: https://arxiv.org/abs/2406.05023
Authors: Shakhnaz Akhmedova,Nils Körber
Keywords: Generative adversarial networks, underlying statistical structure, GANetic loss, machine learning models, loss
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Generative adversarial networks (GANs) are machine learning models that are used to estimate the underlying statistical structure of a given dataset and as a result can be used for a variety of tasks such as image generation or anomaly detection. Despite their initial simplicity, designing an effective loss function for training GANs remains challenging, and various loss functions have been proposed aiming to improve the performance and stability of the generative models. In this study, loss function design for GANs is presented as an optimization problem solved using the genetic programming (GP) approach. Initial experiments were carried out using a small Deep Convolutional GAN (DCGAN) model and the MNIST dataset, in order to search experimentally for an improved loss function. The functions found were evaluated on CIFAR10, with the best function, named GANetic loss, showing exceptionally better performance and stability compared to the losses commonly used for GAN training. To further evaluate its general applicability on more challenging problems, GANetic loss was applied to two medical applications: image generation and anomaly detection. Experiments were performed with histopathological, gastrointestinal or glaucoma images to evaluate the GANetic loss in medical image generation, resulting in improved image quality compared to the baseline models. The GANetic loss used for polyp and glaucoma images showed a strong improvement in the detection of anomalies. In summary, the GANetic loss function was evaluated on multiple datasets and applications where it consistently outperforms alternative loss functions. Moreover, GANetic loss leads to stable training and reproducible results, a known weak spot of GANs.

[CV-18] Clarifying Myths About the Relationship Between Shape Bias, Accuracy, and Robustness

Link: https://arxiv.org/abs/2406.05006
Authors: Zahra Golpayegani,Patrick St-Amant,Nizar Bouguila
Keywords: Deep learning models, OOD robustness, Deep learning, OOD, OOD data
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages, 4 figures

Abstract:Deep learning models can perform well when evaluated on images from the same distribution as the training set. However, applying small perturbations in the forms of noise, artifacts, occlusions, blurring, etc. to a model’s input image and feeding the model with out-of-distribution (OOD) data can significantly drop the model’s accuracy, making it not applicable to real-world scenarios. Data augmentation is one of the well-practiced methods to improve model robustness against OOD data; however, examining which augmentation type to choose and how it affects the OOD robustness remains understudied. There is a growing belief that augmenting datasets using data augmentations that improve a model’s bias to shape-based features rather than texture-based features results in increased OOD robustness for Convolutional Neural Networks trained on the ImageNet-1K dataset. This is usually stated as “an increase in the model’s shape bias results in an increase in its OOD robustness”. Based on this hypothesis, some works in the literature aim to find augmentations with higher effects on model shape bias and use those for data augmentation. By evaluating 39 types of data augmentations on a widely used OOD dataset, we demonstrate the impact of each data augmentation on the model’s robustness to OOD data and further show that the mentioned hypothesis is not true; an increase in shape bias does not necessarily result in higher OOD robustness. By analyzing the results, we also find some biases in the ImageNet-1K dataset that can easily be reduced using proper data augmentation. Our evaluation results further show that there is not necessarily a trade-off between in-domain accuracy and OOD robustness, and choosing the proper augmentations can help increase both in-domain accuracy and OOD robustness simultaneously.

[CV-19] AttnDreamBooth: Towards Text-Aligned Personalized Text-to-Image Generation

Link: https://arxiv.org/abs/2406.05000
Authors: Lianyu Pang,Jian Yin,Baoquan Zhao,Feize Wu,Fu Lee Wang,Qing Li,Xudong Mao
Keywords: flexible textual control, enabled high-quality personalized, high-quality personalized image, personalized image synthesis, Recent advances
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Recent advances in text-to-image models have enabled high-quality personalized image synthesis of user-provided concepts with flexible textual control. In this work, we analyze the limitations of two primary techniques in text-to-image personalization: Textual Inversion and DreamBooth. When integrating the learned concept into new prompts, Textual Inversion tends to overfit the concept, while DreamBooth often overlooks it. We attribute these issues to the incorrect learning of the embedding alignment for the concept. We introduce AttnDreamBooth, a novel approach that addresses these issues by separately learning the embedding alignment, the attention map, and the subject identity in different training stages. We also introduce a cross-attention map regularization term to enhance the learning of the attention map. Our method demonstrates significant improvements in identity preservation and text alignment compared to the baseline methods.

[CV-20] ProMotion: Prototypes As Motion Learners

Link: https://arxiv.org/abs/2406.04999
Authors: Yawen Lu,Dongfang Liu,Qifan Wang,Cheng Han,Yiming Cui,Zhiwen Cao,Xueling Zhang,Yingjie Victor Chen,Heng Fan
Keywords: prototypical framework engineered, unified prototypical framework, framework engineered, fundamental motion tasks, model fundamental motion
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages

Abstract:In this work, we introduce ProMotion, a unified prototypical framework engineered to model fundamental motion tasks. ProMotion offers a range of compelling attributes that set it apart from current task-specific paradigms. We adopt a prototypical perspective, establishing a unified paradigm that harmonizes disparate motion learning approaches. This novel paradigm streamlines the architectural design, enabling the simultaneous assimilation of diverse motion information. We capitalize on a dual mechanism involving the feature denoiser and the prototypical learner to decipher the intricacies of motion. This approach effectively circumvents the pitfalls of ambiguity in pixel-wise feature matching, significantly bolstering the robustness of motion representation. We demonstrate a profound degree of transferability across distinct motion patterns. This inherent versatility reverberates robustly across a comprehensive spectrum of both 2D and 3D downstream tasks. Empirical results demonstrate that ProMotion outperforms various well-known specialized architectures, achieving 0.54 and 0.054 Abs Rel error on the Sintel and KITTI depth datasets, 1.04 and 2.01 average endpoint error on the clean and final pass of Sintel flow benchmark, and 4.30 F1-all error on the KITTI flow benchmark. For its efficacy, we hope our work can catalyze a paradigm shift in universal models in computer vision.
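
The abstract reports Abs Rel error for depth and average endpoint error for optical flow. As a minimal sketch of how these two metrics are conventionally computed (the function names and toy inputs are hypothetical, and benchmark-specific details such as valid-pixel masks are omitted):

```python
import math

def abs_rel(pred_depth, gt_depth):
    """Mean of |pred - gt| / gt over pixels with positive ground-truth depth."""
    pairs = [(p, g) for p, g in zip(pred_depth, gt_depth) if g > 0]
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)

def endpoint_error(pred_flow, gt_flow):
    """Mean Euclidean distance between predicted and true (u, v) flow vectors."""
    dists = [math.hypot(pu - gu, pv - gv)
             for (pu, pv), (gu, gv) in zip(pred_flow, gt_flow)]
    return sum(dists) / len(dists)

# Toy inputs, flattened to per-pixel lists for simplicity.
depth_error = abs_rel([2.0, 4.0], [1.0, 5.0])
flow_error = endpoint_error([(3.0, 4.0), (0.0, 0.0)], [(0.0, 0.0), (0.0, 0.0)])
```

Both metrics are errors, so lower values are better.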

[CV-21] ADBA: Approximation Decision Boundary Approach for Black-Box Adversarial Attacks

Link: https://arxiv.org/abs/2406.04998
Authors: Feiyang Wang,Xingquan Zuo,Hai Huang,Gang Chen
Keywords: machine learning models, machine learning, target machine learning, perturbation directions, real-world applications
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 5 figures, conference

Abstract:Many machine learning models are susceptible to adversarial attacks, with decision-based black-box attacks representing the most critical threat in real-world applications. These attacks are extremely stealthy, generating adversarial examples using hard labels obtained from the target machine learning model. This is typically realized by optimizing perturbation directions, guided by decision boundaries identified through query-intensive exact search, significantly limiting the attack success rate. This paper introduces a novel approach using the Approximation Decision Boundary (ADB) to efficiently and accurately compare perturbation directions without precisely determining decision boundaries. The effectiveness of our ADB approach (ADBA) hinges on promptly identifying suitable ADB, ensuring reliable differentiation of all perturbation directions. For this purpose, we analyze the probability distribution of decision boundaries, confirming that using the distribution’s median value as ADB can effectively distinguish different perturbation directions, giving rise to the development of the ADBA-md algorithm. ADBA-md only requires four queries on average to differentiate any pair of perturbation directions, which is highly query-efficient. Extensive experiments on six well-known image classifiers clearly demonstrate the superiority of ADBA and ADBA-md over multiple state-of-the-art black-box attacks.

[CV-22] CityCraft: A Real Crafter for 3D City Generation

Link: https://arxiv.org/abs/2406.04983
Authors: Jie Deng,Wenhao Chai,Junsheng Huang,Zhonghan Zhao,Qixuan Huang,Mingyan Gao,Jianshu Guo,Shengyu Hao,Wenhao Hu,Jenq-Neng Hwang,Xi Li,Gaoang Wang
Keywords: gained significant attention, smart city development, Generative Adversarial Networks, autonomous driving, traffic simulation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 20 pages, 9 figures

Abstract:City scene generation has gained significant attention in autonomous driving, smart city development, and traffic simulation. It helps enhance infrastructure planning and monitoring solutions. Existing methods have employed a two-stage process involving city layout generation, typically using Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), or Transformers, followed by neural rendering. These techniques often exhibit limited diversity and noticeable artifacts in the rendered city scenes. The rendered scenes lack variety, resembling the training images, resulting in monotonous styles. Additionally, these methods lack planning capabilities, leading to less realistic generated scenes. In this paper, we introduce CityCraft, an innovative framework designed to enhance both the diversity and quality of urban scene generation. Our approach integrates three key stages. First, a diffusion transformer (DiT) model is deployed to generate diverse and controllable 2D city layouts. Second, a Large Language Model (LLM) is utilized to strategically make land-use plans within these layouts based on user prompts and language guidelines. Third, based on the generated layout and city plan, we utilize the asset retrieval module and Blender for precise asset placement and scene construction. Furthermore, we contribute two new datasets to the field: 1) the CityCraft-OSM dataset, including 2D semantic layouts of urban areas, corresponding satellite images, and detailed annotations; 2) the CityCraft-Buildings dataset, featuring thousands of diverse, high-quality 3D building assets. CityCraft achieves state-of-the-art performance in generating realistic 3D cities.

[CV-23] Semantic Segmentation on VSPW Dataset through Masked Video Consistency

Link: https://arxiv.org/abs/2406.04979
Authors: Chen Liang,Qiang Guo,Chongkai Yu,Chengjing Wu,Ting Liu,Luoqi Liu
Keywords: Pixel-level Video Understanding, Understanding requires effectively, Video Understanding requires, requires effectively integrating, effectively integrating three-dimensional
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Pixel-level Video Understanding requires effectively integrating three-dimensional data in both spatial and temporal dimensions to learn accurate and stable semantic information from continuous frames. However, existing advanced models on the VSPW dataset have not fully modeled spatiotemporal relationships. In this paper, we present our solution for the PVUW competition, where we introduce masked video consistency (MVC) based on existing models. MVC enforces the consistency between predictions of masked frames where random patches are withheld. The model needs to learn the segmentation results of the masked parts through the context of images and the relationship between preceding and succeeding frames of the video. Additionally, we employed test-time augmentation, model aggregation and a multimodal model-based post-processing method. Our approach achieves 67.27% mIoU performance on the VSPW dataset, ranking 2nd place in the PVUW2024 challenge VSS track.
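
The headline number above is mean Intersection over Union (mIoU). As a minimal sketch of the metric on flattened label maps, assuming the common convention that classes absent from both prediction and ground truth are skipped (the challenge's exact evaluation protocol is not described in the abstract, and `mean_iou` is a hypothetical name):

```python
def mean_iou(pred, target, num_classes):
    """Average per-class IoU over classes that occur in pred or target."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:  # skip classes that appear in neither map
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0

# Toy 4-pixel example with two classes.
score = mean_iou(pred=[0, 0, 1, 1], target=[0, 1, 1, 1], num_classes=2)
```

Here class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.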

[CV-24] Multiplane Prior Guided Few-Shot Aerial Scene Rendering

Link: https://arxiv.org/abs/2406.04961
Authors: Zihan Gao,Licheng Jiao,Lingling Li,Xu Liu,Fang Liu,Puhua Chen,Yuwei Guo
Keywords: Neural Radiance Fields, Neural Radiance, Radiance Fields, Multiplane Prior, face challenges
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 8 figures, accepted at CVPR 2024

Abstract:Neural Radiance Fields (NeRF) have been successfully applied in various aerial scenes, yet they face challenges with sparse views due to limited supervision. The acquisition of dense aerial views is often prohibitive, as unmanned aerial vehicles (UAVs) may be constrained in perspective range and energy. In this work, we introduce Multiplane Prior guided NeRF (MPNeRF), a novel approach tailored for few-shot aerial scene rendering, marking a pioneering effort in this domain. Our key insight is that the intrinsic geometric regularities specific to aerial imagery could be leveraged to enhance NeRF in sparse aerial scenes. By investigating the behavior of NeRF and Multiplane Images (MPI), we propose to guide the training process of NeRF with a Multiplane Prior. The proposed Multiplane Prior draws upon MPI’s benefits and incorporates advanced image comprehension through a SwinV2 Transformer, pre-trained via SimMIM. Our extensive experiments demonstrate that MPNeRF outperforms existing state-of-the-art methods applied in non-aerial contexts, by tripling the performance in SSIM and LPIPS even with three views available. We hope our work offers insights into the development of NeRF-based applications in aerial scenes with limited data.

[CV-25] Multi-style Neural Radiance Field with AdaIN

Link: https://arxiv.org/abs/2406.04960
Authors: Yu-Wen Pao,An-Jie Li
Keywords: View Synthesis, combines AdaIN, AdaIN and NeRF, stylized Novel View, Synthesis
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Abstract:In this work, we propose a novel pipeline that combines AdaIN and NeRF for the task of stylized Novel View Synthesis. Compared to previous works, we make the following contributions: 1) We simplify the pipeline. 2) We extend the capabilities of the model to handle the multi-style task. 3) We modify the model architecture to perform well on styles with strong brush strokes. 4) We implement style interpolation on the multi-style model, allowing us to control the style between any two styles and the style intensity between the stylized output and the original scene, providing better control over the stylization strength.
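
The abstract builds on AdaIN and on interpolating between two styles and between stylized and original output. As a minimal sketch, assuming the standard AdaIN formulation (re-normalize each content channel to the style channel's mean and standard deviation) and simple linear blending of features, the two controls could look like this. How the paper wires this into NeRF is not stated in the abstract; `stylize`, `mix`, and `strength` are hypothetical names.

```python
import math

def mean_std(values, eps=1e-5):
    """Per-channel mean and (epsilon-stabilized) standard deviation."""
    m = sum(values) / len(values)
    var = sum((v - m) ** 2 for v in values) / len(values)
    return m, math.sqrt(var + eps)

def adain(content, style):
    """content, style: lists of channels, each channel a list of floats."""
    out = []
    for c_chan, s_chan in zip(content, style):
        c_mean, c_std = mean_std(c_chan)
        s_mean, s_std = mean_std(s_chan)
        # Normalize the content channel, then apply the style statistics.
        out.append([s_std * (v - c_mean) / c_std + s_mean for v in c_chan])
    return out

def blend(a, b, w):
    """Linear interpolation: w = 0 returns a, w = 1 returns b."""
    return [[(1 - w) * x + w * y for x, y in zip(ca, cb)]
            for ca, cb in zip(a, b)]

def stylize(content, style_a, style_b, mix=0.5, strength=1.0):
    """mix chooses between the two styles; strength sets style intensity."""
    styled = blend(adain(content, style_a), adain(content, style_b), mix)
    return blend(content, styled, strength)

content = [[1.0, 2.0, 3.0, 4.0]]
style = [[10.0, 20.0, 30.0, 40.0]]
styled = adain(content, style)
```

With `strength=0.0` the content features pass through unchanged, and with `mix=0.0` or `mix=1.0` only one of the two styles is applied.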

[CV-26] Nacala-Roof-Material: Drone Imagery for Roof Detection, Classification, and Segmentation to Support Mosquito-borne Disease Risk Assessment

Link: https://arxiv.org/abs/2406.04949
Authors: Venkanna Babu Guthula,Stefan Oehmcke,Remigio Chilaule,Hui Zhang,Nico Lang,Ankit Kariryaa,Johan Mottelson,Christian Igel
Keywords: remote sensing imagery, roof types based, malaria risk, assessment of malaria, increased risk
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:As low-quality housing and in particular certain roof characteristics are associated with an increased risk of malaria, classification of roof types based on remote sensing imagery can support the assessment of malaria risk and thereby help prevent the disease. To support research in this area, we release the Nacala-Roof-Material dataset, which contains high-resolution drone images from Mozambique with corresponding labels delineating houses and specifying their roof types. The dataset defines a multi-task computer vision problem, comprising object detection, classification, and segmentation. In addition, we benchmarked various state-of-the-art approaches on the dataset. Canonical U-Nets, YOLOv8, and a custom decoder on pretrained DINOv2 served as baselines. We show that each of the methods has its advantages but none is superior on all tasks, which highlights the potential of our dataset for future research in multi-task learning. While the tasks are closely related, accurate segmentation of objects does not necessarily imply accurate instance separation, and vice versa. We address this general issue by introducing a variant of the deep ordinal watershed (DOW) approach that additionally separates the interior of objects, allowing for improved object delineation and separation. We show that our DOW variant is a generic approach that improves the performance of both U-Net and DINOv2 backbones, leading to a better trade-off between semantic segmentation and instance segmentation.

[CV-27] Joint Spatial-Temporal Modeling and Contrastive Learning for Self-supervised Heart Rate Measurement

Link: https://arxiv.org/abs/2406.04942
Authors: Wei Qian,Qi Li,Kun Li,Xinke Wang,Xiao Sun,Meng Wang,Dan Guo
Keywords: Vision-based Remote Physiological, Physiological Signal Sensing, Remote Physiological Signal, Vision-based Remote, Signal Sensing
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:This paper briefly introduces the solutions developed by our team, HFUT-VUT, for Track 1 of self-supervised heart rate measurement in the 3rd Vision-based Remote Physiological Signal Sensing (RePSS) Challenge hosted at IJCAI 2024. The goal is to develop a self-supervised learning algorithm for heart rate (HR) estimation using unlabeled facial videos. To tackle this task, we present two self-supervised HR estimation solutions that integrate spatial-temporal modeling and contrastive learning, respectively. Specifically, we first propose a non-end-to-end self-supervised HR measurement framework based on spatial-temporal modeling, which can effectively capture subtle rPPG clues and leverage the inherent bandwidth and periodicity characteristics of rPPG to constrain the model. Meanwhile, we employ an excellent end-to-end solution based on contrastive learning, aiming to generalize across different scenarios from complementary perspectives. Finally, we combine the strengths of the above solutions through an ensemble strategy to generate the final predictions, leading to a more accurate HR estimation. As a result, our solutions achieved a remarkable RMSE score of 8.85277 on the test dataset, securing 2nd place in Track 1 of the challenge.

[CV-28] Leveraging Activations for Superpixel Explanations

Link: https://arxiv.org/abs/2406.04933
Authors: Ahcène Boubekki,Samuel G. Fadel,Sebastian Mair
Keywords: deep neural networks, neural network image, network image classifier, Saliency methods, deep neural
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Saliency methods have become standard in the explanation toolkit of deep neural networks. Recent developments specific to image classifiers have investigated region-based explanations with either new methods or by adapting well-established ones using ad-hoc superpixel algorithms. In this paper, we aim to avoid relying on these segmenters by extracting a segmentation from the activations of a deep neural network image classifier without fine-tuning the network. Our so-called Neuro-Activated Superpixels (NAS) can isolate the regions of interest in the input relevant to the model’s prediction, which boosts high-threshold weakly supervised object localization performance. This property enables the semi-supervised semantic evaluation of saliency methods. The aggregation of NAS with existing saliency methods eases their interpretation and reveals the inconsistencies of the widely used area under the relevance curve metric.

[CV-29] Faster Than Lies: Real-time Deepfake Detection using Binary Neural Networks

Link: https://arxiv.org/abs/2406.04932
Authors: Lanzino Romeo,Fontana Federico,Diko Anxhelo,Marini Marco Raoul,Cinque Luigi
Keywords: Binary Neural Networks, Deepfake detection aims, online content, Deepfake detection, aims to contrast
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted at CVPR24 DFAD Workshop

Abstract:Deepfake detection aims to counter the spread of deep-generated media that undermines trust in online content. While existing methods focus on large and complex models, the need for real-time detection demands greater efficiency. With this in mind, unlike previous work, we introduce a novel deepfake detection approach on images using Binary Neural Networks (BNNs) for fast inference with minimal accuracy loss. Moreover, our method incorporates Fast Fourier Transform (FFT) and Local Binary Pattern (LBP) as additional channel features to uncover manipulation traces in frequency and texture domains. Evaluations on COCOFake, DFFD, and CIFAKE datasets demonstrate our method’s state-of-the-art performance in most scenarios with a significant efficiency gain of up to a 20× reduction in FLOPs during inference. Finally, by exploring BNNs in deepfake detection to balance accuracy and efficiency, this work paves the way for future research on efficient deepfake detection.
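
The abstract mentions Local Binary Pattern (LBP) as an extra texture channel. As a minimal sketch of the basic 3×3 LBP operator on a grayscale image stored as nested lists (the bit ordering below is one common convention, and the paper's exact LBP variant and how it is stacked with the FFT channel are not specified in the abstract):

```python
# Neighbor offsets (dy, dx), clockwise starting at the top-left pixel.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
           (1, 1), (1, 0), (1, -1), (0, -1)]

def lbp(image):
    """Return 8-bit LBP codes for the interior pixels of a 2D image."""
    h, w = len(image), len(image[0])
    codes = [[0] * (w - 2) for _ in range(h - 2)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            center = image[y][x]
            code = 0
            for bit, (dy, dx) in enumerate(OFFSETS):
                # Set the bit when the neighbor is at least as bright.
                if image[y + dy][x + dx] >= center:
                    code |= 1 << bit
            codes[y - 1][x - 1] = code
    return codes

flat = lbp([[5, 5, 5], [5, 5, 5], [5, 5, 5]])
peak = lbp([[1, 1, 1], [1, 9, 1], [1, 1, 1]])
```

The resulting code map is two pixels smaller in each dimension than the input, so a real pipeline would pad the image before stacking the codes as a channel.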

[CV-30] MA-AVT: Modality Alignment for Parameter-Efficient Audio-Visual Transformers

Link: https://arxiv.org/abs/2406.04930
Authors: Tanvir Mahmud,Shentong Mo,Yapeng Tian,Diana Marculescu
Keywords: Recent advances, pre-trained vision transformers, parameter-efficient audio-visual, audio pre-training, advances in pre-trained
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Accepted in Efficient Deep Learning for Computer Vision CVPR Workshop 2024

Abstract:Recent advances in pre-trained vision transformers have shown promise in parameter-efficient audio-visual learning without audio pre-training. However, few studies have investigated effective methods for aligning multimodal features in parameter-efficient audio-visual transformers. In this paper, we propose MA-AVT, a new parameter-efficient audio-visual transformer employing deep modality alignment for corresponding multimodal semantic features. Specifically, we introduce joint unimodal and multimodal token learning for aligning the two modalities with a frozen modality-shared transformer. This allows the model to learn separate representations for each modality, while also attending to the cross-modal relationships between them. In addition, unlike prior work that only aligns coarse features from the output of unimodal encoders, we introduce blockwise contrastive learning to align coarse-to-fine-grain hierarchical features throughout the encoding phase. Furthermore, to suppress the background features in each modality from foreground matched audio-visual features, we introduce a robust discriminative foreground mining scheme. Through extensive experiments on benchmark AVE, VGGSound, and CREMA-D datasets, we achieve considerable performance improvements over SOTA methods.

[CV-31] AGBD: A Global-scale Biomass Dataset

Link: https://arxiv.org/abs/2406.04928
Authors: Ghjulia Sialelli,Torben Peters,Jan D. Wegner,Konrad Schindler
Keywords: humanity biggest challenges, Ground Biomass, Accurate estimates, biggest challenges, climate change
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)

Abstract:Accurate estimates of Above Ground Biomass (AGB) are essential in addressing two of humanity’s biggest challenges, climate change and biodiversity loss. Existing datasets for AGB estimation from satellite imagery are limited. Either they focus on specific, local regions at high resolution, or they offer global coverage at low resolution. There is a need for a machine learning-ready, globally representative, high-resolution benchmark. Our findings indicate significant variability in biomass estimates across different vegetation types, emphasizing the necessity for a dataset that accurately captures global diversity. To address these gaps, we introduce a comprehensive new dataset that is globally distributed, covers a range of vegetation types, and spans several years. This dataset combines AGB reference data from the GEDI mission with data from Sentinel-2 and PALSAR-2 imagery. Additionally, it includes pre-processed high-level features such as a dense canopy height map, an elevation map, and a land-cover classification map. We also produce a dense, high-resolution (10m) map of AGB predictions for the entire area covered by the dataset. Rigorously tested, our dataset is accompanied by several benchmark models and is publicly available. It can be easily accessed using a single line of code, offering a solid basis for efforts towards global AGB estimation. The GitHub repository this http URL serves as a one-stop shop for all code and data.

[CV-32] RU-AI: A Large Multimodal Dataset for Machine Generated Content Detection

Link: https://arxiv.org/abs/2406.04906
Authors: Liting Huang,Zhihao Zhang,Yiran Zhang,Xiyue Zhou,Shoujin Wang
Keywords: create realistic, people communicate, recent advancements, realistic and human-like, significantly transforming
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:The recent advancements in generative AI models, which can create realistic and human-like content, are significantly transforming how people communicate, create, and work. While the appropriate use of generative AI models can benefit society, their misuse poses significant threats to data reliability and authentication. However, due to a lack of aligned multimodal datasets, effective and robust methods for detecting machine-generated content are still in the early stages of development. In this paper, we introduce RU-AI, a new large-scale multimodal dataset designed for the robust and efficient detection of machine-generated content in text, image, and voice. Our dataset is constructed from three large publicly available datasets: Flickr8K, COCO, and Places205, by combining the original datasets and their corresponding machine-generated pairs. Additionally, experimental results show that our proposed unified model, which incorporates a multimodal embedding module with a multilayer perceptron network, can effectively determine the origin of the data (i.e., original data samples or machine-generated ones) from RU-AI. However, future work is still required to address the remaining challenges posed by RU-AI. The source code and dataset are available at this https URL.

[CV-33] Labeled Data Selection for Category Discovery

Link: https://arxiv.org/abs/2406.04898
Authors: Bingchen Zhao,Nico Lang,Serge Belongie,Oisin Mac Aodha
Keywords: discovery methods aim, Category discovery methods, labeled data, labeled, methods aim
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Category discovery methods aim to find novel categories in unlabeled visual data. At training time, a set of labeled and unlabeled images is provided, where the labels correspond to the categories present in the images. The labeled data provides guidance during training by indicating what types of visual properties and features are relevant for performing discovery in the unlabeled data. As a result, changing the categories present in the labeled set can have a large impact on what is ultimately discovered in the unlabeled set. Despite its importance, the impact of labeled data selection has not been explored in the category discovery literature to date. We show that changing the labeled data can significantly impact discovery performance. Motivated by this, we propose two new approaches for automatically selecting the most suitable labeled data based on the similarity between the labeled and unlabeled data. Our observation is that, unlike in conventional supervised transfer learning, the best labeled data is neither too similar, nor too dissimilar, to the unlabeled categories. Our resulting approaches obtain state-of-the-art discovery performance across a range of challenging fine-grained benchmark datasets.

[CV-34] Zero-Shot Video Editing through Adaptive Sliding Score Distillation

Link: https://arxiv.org/abs/2406.04888
Authors: Lianghan Zhu,Yanqi Bao,Jing Huo,Jing Wu,Yu-Kun Lai,Wenbin Li,Yang Gao
Keywords: reignited significant interest, burgeoning field, field of text-based, reignited significant, significant interest
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:The burgeoning field of text-based video generation (T2V) has reignited significant interest in the research of controllable video editing. Although pre-trained T2V-based editing models have achieved efficient editing capabilities, current works are still plagued by two major challenges. Firstly, the inherent limitations of T2V models lead to content inconsistencies and motion discontinuities between frames. Secondly, the notorious issue of over-editing significantly disrupts areas that are intended to remain unaltered. To address these challenges, our work aims to explore a robust video-based editing paradigm based on score distillation. Specifically, we propose an Adaptive Sliding Score Distillation strategy, which not only enhances the stability of T2V supervision but also incorporates both global and local video guidance to mitigate the impact of generation errors. Additionally, we modify the self-attention layers during the editing process to further preserve the key features of the original video. Extensive experiments demonstrate that these strategies enable us to effectively address the aforementioned challenges, achieving superior editing performance compared to existing state-of-the-art methods.

[CV-35] Seeing the Unseen: Visual Metaphor Captioning for Videos

Link: https://arxiv.org/abs/2406.04886
Authors: Abisek Rajakumar Kalarani,Pushpak Bhattacharyya,Sumit Shekhar
Keywords: common communication tool, common communication, communication tool, Metaphors, Average Concept Distance
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:Metaphors are a common communication tool used in our day-to-day life. The detection and generation of metaphors in textual form have been studied extensively but metaphors in other forms have been under-explored. Recent studies have shown that Vision-Language (VL) models cannot understand visual metaphors in memes and adverts. As of now, no probing studies have been done that involve complex language phenomena like metaphors with videos. Hence, we introduce a new VL task of describing the metaphors present in the videos in our work. To facilitate this novel task, we construct and release a manually created dataset with 705 videos and 2115 human-written captions, along with a new metric called Average Concept Distance (ACD), to automatically evaluate the creativity of the metaphors generated. We also propose a novel low-resource video metaphor captioning system: GIT-LLaVA, which obtains comparable performance to SoTA video language models on the proposed task. We perform a comprehensive analysis of existing video language models on this task and publish our dataset, models, and benchmark results to enable further research.

[CV-36] InstructNav: Zero-shot System for Generic Instruction Navigation in Unexplored Environment

Link: https://arxiv.org/abs/2406.04882
Authors: Yuxing Long,Wenzhe Cai,Hongcheng Wang,Guanqi Zhan,Hao Dong
Keywords: Enabling robots, navigation, human-robot interaction, instruction navigation, instruction
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to CoRL 2024

Abstract:Enabling robots to navigate following diverse language instructions in unexplored environments is an attractive goal for human-robot interaction. However, this goal is challenging because different navigation tasks require different strategies. The scarcity of instruction navigation data hinders training an instruction navigation model with varied strategies. Therefore, previous methods are all constrained to one specific type of navigation instruction. In this work, we propose InstructNav, a generic instruction navigation system. InstructNav makes the first endeavor to handle various instruction navigation tasks without any navigation training or pre-built maps. To reach this goal, we introduce Dynamic Chain-of-Navigation (DCoN) to unify the planning process for different types of navigation instructions. Furthermore, we propose Multi-sourced Value Maps to model key elements in instruction navigation so that linguistic DCoN planning can be converted into robot actionable trajectories. With InstructNav, we complete the R2R-CE task in a zero-shot way for the first time and outperform many task-training methods. Besides, InstructNav also surpasses the previous SOTA method by 10.48% on the zero-shot Habitat ObjNav and by 86.34% on demand-driven navigation DDN. Real robot experiments on diverse indoor scenes further demonstrate our method’s robustness in coping with the environment and instruction variations.

[CV-37] 3DRealCar: An In-the-wild RGB-D Car Dataset with 360-degree Views

Link: https://arxiv.org/abs/2406.04875
Authors: Xiaobiao Du,Haiyang Sun,Shuyun Wang,Zhuojie Wu,Hongwei Sheng,Jiaying Ying,Ming Lu,Tianqing Zhu,Kun Zhan,Xin Yu
Keywords: augmented reality, self-driving systems, car, cars, dataset
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL

Abstract:3D cars are commonly used in self-driving systems, virtual/augmented reality, and games. However, existing 3D car datasets are either synthetic or low-quality, presenting a significant gap toward the high-quality real-world 3D car datasets and limiting their applications in practical scenarios. In this paper, we propose the first large-scale 3D real car dataset, termed 3DRealCar, offering three distinctive features. (1) High-Volume: 2,500 cars are meticulously scanned by 3D scanners, obtaining car images and point clouds with real-world dimensions; (2) High-Quality: Each car is captured in an average of 200 dense, high-resolution 360-degree RGB-D views, enabling high-fidelity 3D reconstruction; (3) High-Diversity: The dataset contains various cars from over 100 brands, collected under three distinct lighting conditions, including reflective, standard, and dark. Additionally, we offer detailed car parsing maps for each instance to promote research in car parsing tasks. Moreover, we remove background point clouds and standardize the car orientation to a unified axis for the reconstruction only on cars without background and controllable rendering. We benchmark 3D reconstruction results with state-of-the-art methods across each lighting condition in 3DRealCar. Extensive experiments demonstrate that the standard lighting condition part of 3DRealCar can be used to produce a large number of high-quality 3D cars, improving various 2D and 3D tasks related to cars. Notably, our dataset brings insight into the fact that recent 3D reconstruction methods face challenges in reconstructing high-quality 3D cars under reflective and dark lighting conditions. Our dataset is available at this https URL.

[CV-38] Ada-VE: Training-Free Consistent Video Editing Using Adaptive Motion Prior

Link: https://arxiv.org/abs/2406.04873
Authors: Tanvir Mahmud,Mustafa Munir,Radu Marculescu,Diana Marculescu
Keywords: synthesis models face, face significant challenges, models face significant, ensuring consistent character, consistent character generation
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Tech Report

Abstract:Video-to-video synthesis models face significant challenges, such as ensuring consistent character generation across frames, maintaining smooth temporal transitions, and preserving quality during fast motion. The introduction of joint fully cross-frame self-attention mechanisms has improved character consistency, but this comes at the cost of increased computational complexity. This full cross-frame self-attention mechanism also incorporates redundant details and limits the number of frames that can be jointly edited due to its computational cost. Moreover, the lack of frames in cross-frame attention adversely affects temporal consistency and visual quality. To address these limitations, we propose a new adaptive motion-guided cross-frame attention mechanism that drastically reduces complexity while preserving semantic details and temporal consistency. Specifically, we selectively incorporate the moving regions of successive frames in cross-frame attention and sparsely include stationary regions based on optical flow sampling. This technique allows for an increased number of jointly edited frames without additional computational overhead. For longer duration of video editing, existing methods primarily focus on frame interpolation or flow-warping from jointly edited keyframes, which often results in blurry frames or reduced temporal consistency. To improve this, we introduce KV-caching of jointly edited frames and reuse the same KV across all intermediate frames, significantly enhancing both intermediate frame quality and temporal consistency. Overall, our motion-sampling method enables the use of around three times more keyframes than existing joint editing methods while maintaining superior prediction quality. Ada-VE achieves up to 4x speed-up when using fully-extended self-attention across 40 frames for joint editing, without compromising visual quality or temporal consistency.

[CV-39] Deep learning for precipitation nowcasting: A survey from the perspective of time series forecasting

Link: https://arxiv.org/abs/2406.04867
Authors: Sojung An,Tae-Jin Oh,Eunha Sohn,Donghyun Kim
Keywords: estimate motion flow, time series precipitation, series precipitation forecasting, time series, precipitation forecasting
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 21 pages, 7 figures, 5 tables

Abstract:Deep learning-based time series forecasting has dominated the short-term precipitation forecasting field with the help of its ability to estimate motion flow in high-resolution datasets. The growing interest in precipitation nowcasting offers substantial opportunities for the advancement of current forecasting technologies. Nevertheless, there has been a scarcity of in-depth surveys of time series precipitation forecasting using deep learning. Thus, this paper systematically reviews recent progress in time series precipitation forecasting models. Specifically, we investigate the following key points within background components, covering: i) preprocessing, ii) objective functions, and iii) evaluation metrics. We then categorize forecasting models into recursive and multiple strategies based on their approaches to predict future frames, investigate the impacts of models using the strategies, and performance assessments. Finally, we evaluate current deep learning-based models for precipitation forecasting on a public benchmark, discuss their limitations and challenges, and present some promising research directions. Our contribution lies in providing insights for a better understanding of time series precipitation forecasting and in aiding the development of robust AI solutions for the future.

[CV-40] Normal-guided Detail-Preserving Neural Implicit Functions for High-Fidelity 3D Surface Reconstruction

Link: https://arxiv.org/abs/2406.04861
Authors: Aarya Patel,Hamid Laga,Ojaswa Sharma
Keywords: Neural implicit representations, powerful paradigm, RGB, Neural implicit, differential properties
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Original version. Project page with images and code: this https URL

Abstract:Neural implicit representations have emerged as a powerful paradigm for 3D reconstruction. However, despite their success, existing methods fail to capture fine geometric details and thin structures, especially in scenarios where only sparse RGB views of the objects of interest are available. We hypothesize that current methods for learning neural implicit representations from RGB or RGBD images produce 3D surfaces with missing parts and details because they only rely on 0-order differential properties, i.e. the 3D surface points and their projections, as supervisory signals. Such properties, however, do not capture the local 3D geometry around the points and also ignore the interactions between points. This paper demonstrates that training neural representations with first-order differential properties, i.e. surface normals, leads to highly accurate 3D surface reconstruction even in situations where only as few as two RGB (front and back) images are available. Given multiview RGB images of an object of interest, we first compute the approximate surface normals in the image space using the gradient of the depth maps produced using an off-the-shelf monocular depth estimator such as Depth Anything model. An implicit surface regressor is then trained using a loss function that enforces the first-order differential properties of the regressed surface to match those estimated from Depth Anything. Our extensive experiments on a wide range of real and synthetic datasets show that the proposed method achieves an unprecedented level of reconstruction accuracy even when using as few as two RGB views. The detailed ablation study also demonstrates that normal-based supervision plays a key role in this significant improvement in performance, enabling the 3D reconstruction of intricate geometric details and thin structures that were previously challenging to capture.
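
The abstract above mentions computing approximate surface normals from the gradient of a depth map. A minimal sketch of that general idea follows; it is not the authors' implementation. It assumes the depth map can be treated as a height field z(x, y) in pixel units (camera intrinsics are ignored), and the function name `normals_from_depth` is hypothetical.

```python
import numpy as np


def normals_from_depth(depth: np.ndarray) -> np.ndarray:
    """Approximate per-pixel unit normals from a depth map (hypothetical helper).

    Assumption: the depth map is a height field z(x, y) in pixel units,
    so the unnormalized normal at each pixel is (-dz/dx, -dz/dy, 1).
    """
    # np.gradient returns derivatives along axis 0 (rows, y) and axis 1 (columns, x).
    dz_dy, dz_dx = np.gradient(depth.astype(np.float64))
    normals = np.stack([-dz_dx, -dz_dy, np.ones_like(dz_dx)], axis=-1)
    # Normalize each normal vector to unit length.
    return normals / np.linalg.norm(normals, axis=-1, keepdims=True)


# Usage: a plane tilted along x, z = 2x, has the same normal at every pixel.
xs = np.arange(5, dtype=np.float64)
plane = np.tile(2.0 * xs, (4, 1))
plane_normals = normals_from_depth(plane)
```

In practice the depth map would come from a monocular depth estimator, and the estimated normals would serve as a supervision signal; both steps are outside this sketch.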

[CV-41] Multi-Granularity Language-Guided Multi-Object Tracking

Link: https://arxiv.org/abs/2406.04844
Authors: Yuhao Li,Muzammal Naseer,Jiale Cao,Yu Zhu,Jinqiu Sun,Yanning Zhang,Fahad Shahbaz Khan
Keywords: methods typically learn, typically learn visual, tracking methods typically, visual features, methods typically
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Most existing multi-object tracking methods typically learn visual tracking features via maximizing dis-similarities of different instances and minimizing similarities of the same instance. While such a feature learning scheme achieves promising performance, learning discriminative features solely based on visual information is challenging especially in case of environmental interference such as occlusion, blur and domain variance. In this work, we argue that multi-modal language-driven features provide complementary information to classical visual features, thereby aiding in improving the robustness to such environmental interference. To this end, we propose a new multi-object tracking framework, named LG-MOT, that explicitly leverages language information at different levels of granularity (scene- and instance-level) and combines it with standard visual features to obtain discriminative representations. To develop LG-MOT, we annotate existing MOT datasets with scene- and instance-level language descriptions. We then encode both instance- and scene-level language information into high-dimensional embeddings, which are utilized to guide the visual features during training. At inference, our LG-MOT uses the standard visual features without relying on annotated language descriptions. Extensive experiments on three benchmarks, MOT17, DanceTrack and SportsMOT, reveal the merits of the proposed contributions leading to state-of-the-art performance. On the DanceTrack test set, our LG-MOT achieves an absolute gain of 2.2% in terms of target object association (IDF1 score), compared to the baseline using only visual features. Further, our LG-MOT exhibits strong cross-domain generalizability. The dataset and code will be available at this https URL.

[CV-42] 3rd Place Solution for MeViS Track in CVPR 2024 PVUW workshop: Motion Expression guided Video Segmentation

Link: https://arxiv.org/abs/2406.04842
Authors: Feiyu Pan,Hao Fang,Xiankai Lu
Keywords: Referring video object, emphasizing modeling dense, dense text-video relations, segment target objects, modeling dense text-video
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Referring video object segmentation (RVOS) relies on natural language expressions to segment target objects in video, emphasizing modeling dense text-video relations. The current RVOS methods typically use independently pre-trained vision and language models as backbones, resulting in a significant domain gap between video and text. In cross-modal feature interaction, text features are only used as query initialization and do not fully utilize important information in the text. In this work, we propose using frozen pre-trained vision-language models (VLM) as backbones, with a specific emphasis on enhancing cross-modal feature interaction. Firstly, we use a frozen convolutional CLIP backbone to generate feature-aligned vision and text features, alleviating the issue of domain gap and reducing training costs. Secondly, we add more cross-modal feature fusion in the pipeline to enhance the utilization of multi-modal information. Furthermore, we propose a novel video query initialization method to generate higher quality video queries. Without bells and whistles, our method achieved 51.5 J&F on the MeViS test set and ranked 3rd place for MeViS Track in CVPR 2024 PVUW workshop: Motion Expression guided Video Segmentation.

[CV-43] EGOR: Efficient Generated Objects Replay for incremental object detection

Link: https://arxiv.org/abs/2406.04829
Authors: Zijia An,Boyu Diao,Libo Huang,Ruiqi Liu,Zhulin An,Yongjun Xu
Keywords: detect emerging new-class, simultaneously maintain old-class, emerging new-class objects, incremental data, maintain old-class accuracy
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Incremental object detection aims to simultaneously maintain old-class accuracy and detect emerging new-class objects in incremental data. Most existing distillation-based methods underperform when unlabeled old-class objects are absent in the incremental dataset. While the absence can be mitigated by generating old-class samples, it also incurs high computational costs. In this paper, we argue that the extra computational cost stems from the inconsistency between the detector and the generative model, along with redundant generation. To overcome this problem, we propose Efficient Generated Object Replay (EGOR). Specifically, we generate old-class samples by inversing the original detectors, thus eliminating the necessity of training and storing additional generative models. We also propose augmented replay to reuse the objects in generated samples, thereby reducing the redundant generation. In addition, we propose high-response knowledge distillation focusing on the knowledge related to the old class, which transfers the knowledge in generated objects to the incremental detector. With the addition of the generated objects and losses, we observe a bias towards old classes in the detector. We balance the losses for old and new classes to alleviate the bias, thereby increasing the overall detection accuracy. Extensive experiments conducted on MS COCO 2017 demonstrate that our method can efficiently improve detection performance in the absence of old-class objects.

[CV-44] Navigating Efficiency in MobileViT through Gaussian Process on Global Architecture Factors

Link: https://arxiv.org/abs/2406.04820
Authors: Ke Meng,Kai Chen
Keywords: convolutional neural networks, Numerous techniques, achieve optimal architectures, neural networks, meticulously designed
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Numerous techniques have been meticulously designed to achieve optimal architectures for convolutional neural networks (CNNs), yet a comparable focus on vision transformers (ViTs) has been somewhat lacking. Despite the remarkable success of ViTs in various vision tasks, their heavyweight nature presents challenges of computational costs. In this paper, we leverage the Gaussian process to systematically explore the nonlinear and uncertain relationship between performance and global architecture factors of MobileViT, such as resolution, width, and depth including the depth of inverted residual blocks and the depth of ViT blocks, and joint factors including resolution-depth and resolution-width. We present design principles twisting magic 4D cube of the global architecture factors that minimize model sizes and computational costs with higher model accuracy. We introduce a formula for downsizing architectures by iteratively deriving smaller MobileViT V2, all while adhering to a specified constraint of multiply-accumulate operations (MACs). Experiment results show that our formula significantly outperforms CNNs and mobile ViTs across diversified datasets.

[CV-45] A short review on graphonometric evaluation tools in children

Link: https://arxiv.org/abs/2406.04818
Authors: Belen Esther Aleman,Moises Diaz,Miguel Angel Ferrer
Keywords: coordination of motor, complex task, task that involves, involves the coordination, cognitive skills
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Handwriting is a complex task that involves the coordination of motor, perceptual and cognitive skills. It is a fundamental skill for the cognitive and academic development of children. However, technological and educational changes in recent decades have affected both the teaching and assessment of handwriting. This paper presents a literature review of handwriting analysis in children, including a bibliometric analysis of published articles, the study participants, and the methods of evaluating the graphonometric state of children. The aim is to synthesize the state of the art and provide an overview of the main study trends over the last decade. The review concludes that handwriting remains a fundamental tool for early estimation of cognitive problems and early intervention. The article analyzes graphonometric evaluation tools. Likewise, it reflects on the importance of graphonometric evaluation as a means to detect possible difficulties or disorders in learning to write. The article concludes by highlighting the need to agree on an evaluation methodology and to combine databases.

[CV-46] Online Continual Learning of Video Diffusion Models From a Single Video Stream

Link: https://arxiv.org/abs/2406.04814
Authors: Jason Yoo,Dylan Green,Geoff Pleiss,Frank Wood
Keywords: shown exceptional capabilities, generating realistic videos, shown exceptional, exceptional capabilities, capabilities in generating
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:Diffusion models have shown exceptional capabilities in generating realistic videos. Yet, their training has been predominantly confined to offline environments where models can repeatedly train on i.i.d. data to convergence. This work explores the feasibility of training diffusion models from a semantically continuous video stream, where correlated video frames sequentially arrive one at a time. To investigate this, we introduce two novel continual video generative modeling benchmarks, Lifelong Bouncing Balls and Windows 95 Maze Screensaver, each containing over a million video frames generated from navigating stationary environments. Surprisingly, our experiments show that diffusion models can be effectively trained online using experience replay, achieving performance comparable to models trained with i.i.d. samples given the same number of gradient steps.
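
The abstract above relies on experience replay while frames arrive one at a time. The sketch below illustrates a generic replay buffer for a stream, not the paper's method: the abstract does not say how the buffer is maintained, so reservoir sampling is used here purely as an assumption, and the class name `ReplayBuffer` and the batch size are hypothetical.

```python
import random


class ReplayBuffer:
    """Fixed-size buffer filled by reservoir sampling (an assumed policy)."""

    def __init__(self, capacity: int, seed: int = 0):
        self.capacity = capacity
        self.items = []
        self.seen = 0  # number of stream items observed so far
        self.rng = random.Random(seed)

    def add(self, item) -> None:
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(item)
        else:
            # Keep each observed item with probability capacity / seen.
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = item

    def sample(self, k: int) -> list:
        # Draw up to k stored items without replacement.
        return self.rng.sample(self.items, min(k, len(self.items)))


# Usage: mix each incoming frame with a few replayed frames per training step.
buffer = ReplayBuffer(capacity=100, seed=0)
last_batch = []
for frame in range(1000):  # integers stand in for video frames
    last_batch = [frame] + buffer.sample(3)
    buffer.add(frame)
```

Each training step would then take one gradient step on `last_batch`; the model and loss are omitted because the abstract gives no details about them.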

[CV-47] Predictive Dynamic Fusion

Link: https://arxiv.org/abs/2406.04802
Authors: Bing Cao,Yinan Xia,Yi Ding,Changqing Zhang,Qinghua Hu
Keywords: rendering holistic judgments, joint decision-making systems, holistic judgments, crucial in joint, joint decision-making
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 21 pages, 7 figures

Abstract:Multimodal fusion is crucial in joint decision-making systems for rendering holistic judgments. Since multimodal data changes in open environments, dynamic fusion has emerged and achieved remarkable progress in numerous applications. However, most existing dynamic multimodal fusion methods lack theoretical guarantees and easily fall into suboptimal problems, yielding unreliability and instability. To address this issue, we propose a Predictive Dynamic Fusion (PDF) framework for multimodal learning. We proceed to reveal the multimodal fusion from a generalization perspective and theoretically derive the predictable Collaborative Belief (Co-Belief) with Mono- and Holo-Confidence, which provably reduces the upper bound of generalization error. Accordingly, we further propose a relative calibration strategy to calibrate the predicted Co-Belief for potential uncertainty. Extensive experiments on multiple benchmarks confirm our superiority. Our code is available at this https URL.

[CV-48] MoE Jetpack: From Dense Checkpoints to Adaptive Mixture of Experts for Vision Tasks

Link: https://arxiv.org/abs/2406.04801
Authors: Xingkui Zhu,Yiran Guan,Dingkang Liang,Yuchao Chen,Yuliang Liu,Xiang Bai
Keywords: sparsely activated mixture, traditional densely activated, MoE models, dense checkpoints, MoE
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 6 figures

Abstract:The sparsely activated mixture of experts (MoE) model presents a promising alternative to traditional densely activated (dense) models, enhancing both quality and computational efficiency. However, training MoE models from scratch demands extensive data and computational resources. Moreover, public repositories like timm mainly provide pre-trained dense checkpoints, lacking similar resources for MoE models, hindering their adoption. To bridge this gap, we introduce MoE Jetpack, an effective method for fine-tuning dense checkpoints into MoE models. MoE Jetpack incorporates two key techniques: (1) checkpoint recycling, which repurposes dense checkpoints as initial weights for MoE models, thereby accelerating convergence, enhancing accuracy, and alleviating the computational burden of pre-training; (2) hyperspherical adaptive MoE (SpheroMoE) layer, which optimizes the MoE architecture for better integration of dense checkpoints, enhancing fine-tuning performance. Our experiments on vision tasks demonstrate that MoE Jetpack significantly improves convergence speed and accuracy when fine-tuning dense checkpoints into MoE models. Our code will be publicly available at this https URL.

[CV-49] REP: Resource-Efficient Prompting for On-device Continual Learning

Link: https://arxiv.org/abs/2406.04772
Authors: Sungho Jeon,Xinyue Ma,Kwang In Kim,Myeongjae Jeon
Keywords: On-device continual learning, requires the co-optimization, resource efficiency, continual learning, efficiency
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages, 10 figures

Abstract:On-device continual learning (CL) requires the co-optimization of model accuracy and resource efficiency to be practical. This is extremely challenging because it must preserve accuracy while learning new tasks with continuously drifting data and maintain both high energy and memory efficiency to be deployable on real-world devices. Typically, a CL method leverages one of two types of backbone networks: CNN or ViT. It is commonly believed that CNN-based CL excels in resource efficiency, whereas ViT-based CL is superior in model performance, making each option attractive only for a single aspect. In this paper, we revisit this comparison while embracing powerful pre-trained ViT models of various sizes, including ViT-Ti (5.8M parameters). Our detailed analysis reveals that many practical options exist today for making ViT-based methods more suitable for on-device CL, even when accuracy, energy, and memory are all considered. To further expand this impact, we introduce REP, which improves resource efficiency specifically targeting prompt-based rehearsal-free methods. Our key focus is on avoiding catastrophic trade-offs with accuracy while trimming computational and memory costs throughout the training process. We achieve this by exploiting swift prompt selection that enhances input data using a carefully provisioned model, and by developing two novel algorithms, adaptive token merging (AToM) and adaptive layer dropping (ALD), that optimize the prompt updating stage. In particular, AToM and ALD perform selective skipping across the data and model-layer dimensions without compromising task-specific features in vision transformer models. Extensive experiments on three image classification datasets validate REP's superior resource efficiency over current state-of-the-art methods.

[CV-50] SMC: Masked Learning of Unsupervised Video Semantic Compression

Link: https://arxiv.org/abs/2406.04765
Authors: Yuan Tian,Guo Lu,Guangtao Zhai
Keywords: human visual perception, neglecting semantic preservation, compression methods focus, video compression methods, visual perception
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

Abstract:Most video compression methods focus on human visual perception, neglecting semantic preservation. This leads to severe semantic loss during the compression, hampering downstream video analysis tasks. In this paper, we propose a Masked Video Modeling (MVM)-powered compression framework that particularly preserves video semantics, by jointly mining and compressing the semantics in a self-supervised manner. While MVM is proficient at learning generalizable semantics through the masked patch prediction task, it may also encode non-semantic information like trivial textural details, wasting bitcost and bringing semantic noises. To suppress this, we explicitly regularize the non-semantic entropy of the compressed video in the MVM token space. The proposed framework is instantiated as a simple Semantic-Mining-then-Compression (SMC) model. Furthermore, we extend SMC as an advanced SMC++ model from several aspects. First, we equip it with a masked motion prediction objective, leading to better temporal semantic learning ability. Second, we introduce a Transformer-based compression module, to improve the semantic compression efficacy. Considering that directly mining the complex redundancy among heterogeneous features in different coding stages is non-trivial, we introduce a compact blueprint semantic representation to align these features into a similar form, fully unleashing the power of the Transformer-based compression module. Extensive results demonstrate the proposed SMC and SMC++ models show remarkable superiority over previous traditional, learnable, and perceptual quality-oriented video codecs, on three video analysis tasks and seven datasets. Codes and model are available at: this https URL.

[CV-51] Interpretable Multimodal Out-of-context Detection with Soft Logic Regularization

Link: https://arxiv.org/abs/2406.04756
Authors: Huanhuan Ma,Jinghao Zhang,Qiang Liu,Shu Wu,Liang Wang
Keywords: causing significant concerns, causing significant, concerns in society, rapid spread, spread of information
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICASSP 2024 lecture paper

Abstract:The rapid spread of information through mobile devices and media has led to the widespread dissemination of false or deceptive news, causing significant concerns in society. Among different types of misinformation, image repurposing, also known as out-of-context misinformation, remains highly prevalent and effective. However, current approaches for detecting out-of-context misinformation often lack interpretability and offer limited explanations. In this study, we propose a logic regularization approach for out-of-context detection called LOGRAN (LOGic Regularization for out-of-context ANalysis). The primary objective of LOGRAN is to decompose the out-of-context detection at the phrase level. By employing latent variables for phrase-level predictions, the final prediction of the image-caption pair can be aggregated using logical rules. The latent variables also provide an explanation for how the final result is derived, making this fine-grained detection method inherently explanatory. We evaluate the performance of LOGRAN on the NewsCLIPpings dataset, showcasing competitive overall results. Visualized examples also reveal faithful phrase-level predictions of out-of-context images, accompanied by explanations. This highlights the effectiveness of our approach in addressing out-of-context detection and enhancing interpretability.

[CV-52] PQPP: A Joint Benchmark for Text-to-Image Prompt and Query Performance Prediction

Link: https://arxiv.org/abs/2406.04746
Authors: Eduard Poesina, Adriana Valentina Costache, Adrian-Gabriel Chifu, Josiane Mothe, Radu Tudor Ionescu
Keywords: generative diffusion models, visually impressive results, diffusion models, recently emerged, viable alternative
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:Text-to-image generation has recently emerged as a viable alternative to text-to-image retrieval, due to the visually impressive results of generative diffusion models. Although query performance prediction is an active research topic in information retrieval, to the best of our knowledge, there is no prior study that analyzes the difficulty of queries (prompts) in text-to-image generation, based on human judgments. To this end, we introduce the first dataset of prompts which are manually annotated in terms of image generation performance. In order to determine the difficulty of the same prompts in image retrieval, we also collect manual annotations that represent retrieval performance. We thus propose the first benchmark for joint text-to-image prompt and query performance prediction, comprising 10K queries. Our benchmark enables: (i) the comparative assessment of the difficulty of prompts/queries in image generation and image retrieval, and (ii) the evaluation of prompt/query performance predictors addressing both generation and retrieval. We present results with several pre-generation/retrieval and post-generation/retrieval performance predictors, thus providing competitive baselines for future research. Our benchmark and code are publicly available under the CC BY 4.0 license at this https URL.

[CV-53] Confidence-aware Contrastive Learning for Selective Classification

Link: https://arxiv.org/abs/2406.04745
Authors: Yu-Chang Wu, Shen-Huan Lyu, Haopu Shang, Xiangyu Wang, Chao Qian
Keywords: Selective classification, selective classification model, Selective classification enables, classification enables models, sufficiently confident
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICML 2024

Abstract:Selective classification enables models to make predictions only when they are sufficiently confident, aiming to enhance safety and reliability, which is important in high-stakes scenarios. Previous methods mainly use deep neural networks and focus on modifying the architecture of classification layers to enable the model to estimate the confidence of its prediction. This work provides a generalization bound for selective classification, disclosing that optimizing feature layers helps improve the performance of selective classification. Inspired by this theory, we propose to explicitly improve the selective classification model at the feature level for the first time, leading to a novel Confidence-aware Contrastive Learning method for Selective Classification, CCL-SC, which similarizes the features of homogeneous instances and differentiates the features of heterogeneous instances, with the strength controlled by the model’s confidence. The experimental results on typical datasets, i.e., CIFAR-10, CIFAR-100, CelebA, and ImageNet, show that CCL-SC achieves significantly lower selective risk than state-of-the-art methods, across almost all coverage degrees. Moreover, it can be combined with existing methods to bring further improvement.
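
For readers new to the setting, the two quantities mentioned above (coverage and selective risk) can be sketched in a few lines. This is a minimal illustration that uses the maximum class probability as the confidence score, which is an assumption made here for simplicity; it is not CCL-SC, and the probabilities and labels are made up.

```python
def selective_predict(probs, threshold):
    """Return the predicted class index, or None to abstain."""
    confidence = max(probs)
    if confidence < threshold:
        return None
    return probs.index(confidence)


def coverage_and_selective_risk(prob_rows, labels, threshold):
    """Coverage = fraction of inputs answered; selective risk = error rate on those."""
    answered = 0
    errors = 0
    for probs, label in zip(prob_rows, labels):
        pred = selective_predict(probs, threshold)
        if pred is None:
            continue  # the model abstains on this input
        answered += 1
        errors += int(pred != label)
    coverage = answered / len(labels)
    risk = errors / answered if answered else 0.0
    return coverage, risk


# Made-up class probabilities for four inputs and their true labels.
prob_rows = [[0.9, 0.1], [0.55, 0.45], [0.2, 0.8], [0.6, 0.4]]
labels = [0, 1, 1, 1]

for threshold in (0.5, 0.7):
    print(threshold, coverage_and_selective_risk(prob_rows, labels, threshold))
```

Raising the threshold lowers coverage and, when the confidence score is informative, lowers the error rate on the inputs that are still answered. Methods in this area are compared by their risk at each coverage level.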

[CV-54] MGIMM: Multi-Granularity Instruction Multimodal Model for Attribute-Guided Remote Sensing Image Detailed Description

Link: https://arxiv.org/abs/2406.04716
Authors: Cong Yang, Zuchao Li, Lefei Zhang
Keywords: remote sensing, remote sensing images, remote sensing scenarios, Multimodal Model, sensing
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Recently, large multimodal models have built a bridge from visual to textual information, but they tend to underperform in remote sensing scenarios. This underperformance is due to the complex distribution of objects and the significant scale differences among targets in remote sensing images, leading to visual ambiguities and insufficient descriptions by these multimodal models. Moreover, the lack of multimodal fine-tuning data specific to the remote sensing field makes it challenging for the model’s behavior to align with user queries. To address these issues, this paper proposes an attribute-guided Multi-Granularity Instruction Multimodal Model (MGIMM) for remote sensing image detailed description. MGIMM guides the multimodal model to learn the consistency between visual regions and corresponding text attributes (such as object names, colors, and shapes) through region-level instruction tuning. Then, with the multimodal model aligned on region-attribute pairs and guided by multi-grain visual features, MGIMM fully perceives both region-level and global image information, utilizing large language models for comprehensive descriptions of remote sensing images. Due to the lack of a standard benchmark for generating detailed descriptions of remote sensing images, we construct a dataset featuring 38,320 region-attribute pairs and 23,463 image-detailed description pairs. Compared with various advanced methods on this dataset, the results demonstrate the effectiveness of MGIMM’s region-attribute guided learning approach. Code is available at this https URL

[CV-55] CDeFuse: Continuous Decomposition for Infrared and Visible Image Fusion

Link: https://arxiv.org/abs/2406.04689
Authors: Haolong Ma, Hui Li, Chunyang Cheng, Xiaoning Song, Zhongwei Shen
Keywords: image processing technique, common image processing, decomposition, extract complementary information, processing technique
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:As a common image processing technique, image decomposition is often used to extract complementary information between modalities. In current decomposition-based image fusion methods, source images are typically decomposed into three parts at a single scale (i.e., visible-exclusive part, infrared-exclusive part, and common part) and lack interaction between modalities during the decomposition process. As a result, the fused images cannot effectively focus on finer complementary information between modalities at various scales. To address the above issue, a novel decomposition mechanism, Continuous Decomposition Fusion (CDeFuse), is proposed. Firstly, CDeFuse extends the original three-part decomposition to a more general K-part decomposition at each scale through similarity constraints to fuse multi-scale information and achieve a finer representation of decomposition features. Secondly, a Continuous Decomposition Module (CDM) is introduced to assist K-part decomposition. Its core component, State Transformer (ST), efficiently captures complementary information between modalities by utilizing a multi-head self-attention mechanism. Finally, a novel decomposition loss function and the corresponding computational optimization strategy are utilized to ensure the smooth progress of the decomposition process while maintaining linear growth in time complexity with the number of decomposition results K. Extensive experiments demonstrate that our CDeFuse achieves performance comparable to previous methods. The code will be publicly available.

[CV-56] LogiCode: an LLM-Driven Framework for Logical Anomaly Detection

Link: https://arxiv.org/abs/2406.04687
Authors: Yiheng Zhang, Yunkang Cao, Xiaohao Xu, Weiming Shen
Keywords: Large Language Models, leverages Large Language, Language Models, Large Language, leverages Large
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Abstract:This paper presents LogiCode, a novel framework that leverages Large Language Models (LLMs) for identifying logical anomalies in industrial settings, moving beyond the traditional focus on structural inconsistencies. By harnessing LLMs for logical reasoning, LogiCode autonomously generates Python code to pinpoint anomalies such as incorrect component quantities or missing elements, marking a significant leap forward in anomaly detection technologies. A custom dataset “LOCO-Annotations” and a benchmark “LogiBench” are introduced to evaluate LogiCode’s performance across various metrics including binary classification accuracy, code generation success rate, and precision in reasoning. Findings demonstrate LogiCode’s enhanced interpretability, significantly improving the accuracy of logical anomaly detection and offering detailed explanations for identified anomalies. This represents a notable shift towards more intelligent, LLM-driven approaches in industrial anomaly detection, promising substantial impacts on industry-specific applications.

[CV-57] ACE Metric: Advection and Convection Evaluation for Accurate Weather Forecasting

Link: https://arxiv.org/abs/2406.04678
Authors: Doyi Kim, Minseok Seo, Yeji Choi
Keywords: Numerical Weather Prediction, traditional NWP, data-driven weather forecasting, received significant attention, Numerical Weather
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages

Abstract:Recently, data-driven weather forecasting methods have received significant attention for surpassing the RMSE performance of traditional NWP (Numerical Weather Prediction)-based methods. However, data-driven models are tuned to minimize the loss between forecasted data and ground truths, often using pixel-wise loss. This can lead to models that produce blurred outputs, which, despite being significantly different in detail from the actual weather conditions, still demonstrate low RMSE values. Although evaluation metrics from the computer vision field, such as PSNR, SSIM, and FVD, can be used, they are not entirely suitable for weather variables. This is because weather variables exhibit continuous physical changes over time and lack the distinct boundaries of objects typically seen in computer vision images. To resolve these issues, we propose the advection and convection Error (ACE) metric, specifically designed to assess how well models predict advection and convection, which are significant atmospheric transfer methods. We have validated the ACE evaluation metric on the WeatherBench2 and MovingMNIST datasets.
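
The point that a blurred forecast can still score a low RMSE is easy to reproduce on a toy 1D field. The sketch below is only an illustration of that motivation with made-up values; it does not implement the ACE metric, whose definition is not given in the abstract.

```python
import math


def rmse(forecast, truth):
    """Root-mean-square error between two equal-length sequences."""
    squared = [(f - t) ** 2 for f, t in zip(forecast, truth)]
    return math.sqrt(sum(squared) / len(squared))


truth = [0.0] * 10
truth[5] = 1.0            # one sharp feature, e.g. a rain cell

shifted = [0.0] * 10
shifted[6] = 1.0          # sharp forecast, displaced by one cell

blurred = [0.1] * 10      # featureless forecast with the same total mass

print(f"sharp but shifted: {rmse(shifted, truth):.3f}")
print(f"blurred:           {rmse(blurred, truth):.3f}")
```

The featureless forecast gets the lower RMSE even though it contains no structure at all, while the sharp forecast is penalized twice for a one-cell displacement. This is why a pixel-wise error alone is a weak guide to forecast quality.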

[CV-58] OVMR: Open-Vocabulary Recognition with Multi-Modal References

Link: https://arxiv.org/abs/2406.04675
Authors: Zehong Ma, Shiliang Zhang, Longhui Wei, Qi Tian
Keywords: textual descriptions, open-vocabulary recognition lies, textual, descriptions, model
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR2024

Abstract:The challenge of open-vocabulary recognition lies in the fact that the model has no clue about the new categories it is applied to. Existing works have proposed different methods to embed category cues into the model, e.g., through few-shot fine-tuning, or by providing category names or textual descriptions to Vision-Language Models. Fine-tuning is time-consuming and degrades the generalization capability. Textual descriptions could be ambiguous and fail to depict visual details. This paper tackles open-vocabulary recognition from a different perspective by referring to multi-modal clues composed of textual descriptions and exemplar images. Our method, named OVMR, adopts two innovative components to pursue a more robust embedding of category cues. A multi-modal classifier is first generated by dynamically complementing textual descriptions with image exemplars. A preference-based refinement module is then applied to fuse uni-modal and multi-modal classifiers, with the aim of alleviating issues of low-quality exemplar images or textual descriptions. The proposed OVMR is a plug-and-play module, and works well with exemplar images randomly crawled from the Internet. Extensive experiments have demonstrated the promising performance of OVMR, e.g., it outperforms existing methods across various scenarios and setups. Code is publicly available at this https URL.

[CV-59] MeLFusion: Synthesizing Music from Image and Language Cues using Diffusion Models

Link: https://arxiv.org/abs/2406.04673
Authors: Sanjoy Chowdhury, Sayan Nag, K J Joseph, Balaji Vasan Srinivasan, Dinesh Manocha
Keywords: emotions and feelings, universal language, communicate emotions, Music, synthesize music
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Comments: Accepted at CVPR 2024 as Highlight paper. Webpage: this https URL

Abstract:Music is a universal language that can communicate emotions and feelings. It forms an essential part of the whole spectrum of creative media, ranging from movies to social media posts. Machine learning models that can synthesize music are predominantly conditioned on textual descriptions of it. Inspired by how musicians compose music not just from a movie script, but also through visualizations, we propose MeLFusion, a model that can effectively use cues from a textual description and the corresponding image to synthesize music. MeLFusion is a text-to-music diffusion model with a novel “visual synapse”, which effectively infuses the semantics from the visual modality into the generated music. To facilitate research in this area, we introduce a new dataset MeLBench, and propose a new evaluation metric IMSM. Our exhaustive experimental evaluation suggests that adding visual information to the music synthesis pipeline significantly improves the quality of generated music, measured both objectively and subjectively, with a relative gain of up to 67.98% on the FAD score. We hope that our work will gather attention to this pragmatic, yet relatively under-explored research area.

[CV-60] Evaluating and Mitigating IP Infringement in Visual Generative AI

Link: https://arxiv.org/abs/2406.04662
Authors: Zhenting Wang, Chen Chen, Vikash Sehwag, Minzhou Pan, Lingjuan Lyu
Keywords: Stable Video Diffusion, Stable Video, Stable Diffusion, visual generative models, Video Diffusion
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:The popularity of visual generative AI models like DALL-E 3, Stable Diffusion XL, Stable Video Diffusion, and Sora has been increasing. Through extensive evaluation, we discovered that the state-of-the-art visual generative models can generate content that bears a striking resemblance to characters protected by intellectual property rights held by major entertainment companies (such as Sony, Marvel, and Nintendo), which raises potential legal concerns. This happens when the input prompt contains the character’s name or even just descriptive details about their characteristics. To mitigate such IP infringement problems, we also propose a defense method against it. In detail, we develop a revised generation paradigm that can identify potentially infringing generated content and prevent IP infringement by utilizing guidance techniques during the diffusion process. It has the capability to recognize generated content that may be infringing on intellectual property rights, and mitigate such infringement by employing guidance methods throughout the diffusion process without retraining or fine-tuning the pretrained models. Experiments on well-known character IPs like Spider-Man, Iron Man, and Superman demonstrate the effectiveness of the proposed defense method. Our data and code can be found at this https URL.

[CV-61] LocLLM: Exploiting Generalizable Human Keypoint Localization via Large Language Model

Link: https://arxiv.org/abs/2406.04659
Authors: Dongkai Wang, Shiyu Xuan, Shiliang Zhang
Keywords: existing human keypoint, keypoint priors provided, keypoint localization, keypoint, human keypoint localization
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR2024

Abstract:The capacity of existing human keypoint localization models is limited by keypoint priors provided by the training data. To alleviate this restriction and pursue a more general model, this work studies keypoint localization from a different perspective by reasoning about locations based on keypoint clues in text descriptions. We propose LocLLM, the first Large-Language Model (LLM) based keypoint localization model that takes images and text instructions as inputs and outputs the desired keypoint coordinates. LocLLM leverages the strong reasoning capability of LLMs and clues about keypoint type, location, and relationship in textual descriptions for keypoint localization. To effectively tune LocLLM, we construct localization-based instruction conversations to connect keypoint descriptions with the corresponding coordinates in the input image, and fine-tune the whole model in a parameter-efficient training pipeline. LocLLM shows remarkable performance on standard 2D/3D keypoint localization benchmarks. Moreover, incorporating language clues into the localization gives LocLLM superior flexibility and generalization capability in cross-dataset keypoint localization, and even in detecting novel types of keypoints unseen during training.

[CV-62] SMART: Scene-motion-aware human action recognition framework for mental disorder group

Link: https://arxiv.org/abs/2406.04649
Authors: Zengyuan Lai, Jiarui Yang, Songpengcheng Xia, Qi Wu, Zhen Sun, Wenxian Yu, Ling Pei
Keywords: Internet of Things, necessitating intelligent video, intelligent video behavior, video behavior monitoring, Action Recognition Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Patients with mental disorders often exhibit risky abnormal actions, such as climbing walls or hitting windows, necessitating intelligent video behavior monitoring for smart healthcare with the rising Internet of Things (IoT) technology. However, the development of vision-based Human Action Recognition (HAR) for these actions is hindered by the lack of specialized algorithms and datasets. In this paper, we innovatively propose to build a vision-based HAR dataset including abnormal actions often occurring in the mental disorder group and then introduce a novel Scene-Motion-aware Action Recognition Technology framework, named SMART, consisting of two technical modules. First, we propose a scene perception module to extract human motion trajectory and human-scene interaction features, which introduces additional scene information for a supplementary semantic representation of the above actions. Second, the multi-stage fusion module fuses the skeleton motion, motion trajectory, and human-scene interaction features, enhancing the semantic association between the skeleton motion and the above supplementary representation, thus generating a comprehensive representation with both human motion and scene information. The effectiveness of our proposed method has been validated on our self-collected HAR dataset (MentalHAD), achieving 94.9% and 93.1% accuracy in un-seen subjects and scenes and outperforming state-of-the-art approaches by 6.5% and 13.2%, respectively. The demonstrated subject- and scene- generalizability makes it possible for SMART’s migration to practical deployment in smart healthcare systems for mental disorder patients in medical settings. The code and dataset will be released publicly for further research: this https URL.

[CV-63] UCDNet: Multi-UAV Collaborative 3D Object Detection Network by Reliable Feature Mapping

Link: https://arxiv.org/abs/2406.04648
Authors: Pengju Tian, Peirui Cheng, Yuchao Wang, Zhechao Wang, Zhirui Wang, Menglong Yan, Xue Yang, Xian Sun
Keywords: encompassing traffic monitoring, comprehend complex environments, applications encompassing traffic, object detection paradigm, integrating complementary information
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Multi-UAV collaborative 3D object detection can perceive and comprehend complex environments by integrating complementary information, with applications encompassing traffic monitoring, delivery services and agricultural management. However, the extremely broad observations in aerial remote sensing and significant perspective differences across multiple UAVs make it challenging to achieve precise and consistent feature mapping from 2D images to 3D space in the multi-UAV collaborative 3D object detection paradigm. To address the problem, we propose an unparalleled camera-based multi-UAV collaborative 3D object detection paradigm called UCDNet. Specifically, the depth information from the UAVs to the ground is explicitly utilized as a strong prior to provide a reference for more accurate and generalizable feature mapping. Additionally, we design a homologous points geometric consistency loss as an auxiliary self-supervision, which directly influences the feature mapping module, thereby strengthening the global consistency of multi-view perception. Experiments on the AeroCollab3D and CoPerception-UAVs datasets show that our method improves mAP by 4.7% and 10%, respectively, compared to the baseline, which demonstrates the superiority of UCDNet.

[CV-64] UVCPNet: A UAV-Vehicle Collaborative Perception Network for 3D Object Detection

Link: https://arxiv.org/abs/2406.04647
Authors: Yuchao Wang, Peirui Cheng, Pengju Tian, Ziyang Yuan, Liangjin Zhao, Jing Tian, Wensheng Wang, Zhirui Wang, Xian Sun
Keywords: collaborative perception, crucial component, increasingly important, Bird Eye View, Collaborative Depth Optimization
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:With the advancement of collaborative perception, the role of aerial-ground collaborative perception, a crucial component, is becoming increasingly important. The demand for collaborative perception across different perspectives to construct more comprehensive perceptual information is growing. However, challenges arise due to the disparities in the field of view between cross-domain agents and their varying sensitivity to information in images. Additionally, when we transform image features into Bird’s Eye View (BEV) features for collaboration, we need accurate depth information. To address these issues, we propose a framework specifically designed for aerial-ground collaboration. First, to mitigate the lack of datasets for aerial-ground collaboration, we develop a virtual dataset named V2U-COO for our research. Second, we design a Cross-Domain Cross-Adaptation (CDCA) module to align the target information obtained from different domains, thereby achieving more accurate perception results. Finally, we introduce a Collaborative Depth Optimization (CDO) module to obtain more precise depth estimation results, leading to more accurate perception outcomes. We conduct extensive experiments on both our virtual dataset and a public dataset to validate the effectiveness of our framework. Our experiments on the V2U-COO dataset and the DAIR-V2X dataset demonstrate that our method improves detection accuracy by 6.1% and 2.7%, respectively.

[CV-65] Cooperative Meta-Learning with Gradient Augmentation

Link: https://arxiv.org/abs/2406.04639
Authors: Jongyun Shin, Seunjin Han, Jangho Kim
Keywords: Model agnostic meta-learning, outer loop, CML, Model agnostic, outer loop update
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to UAI 2024

Abstract:Model agnostic meta-learning (MAML) is one of the most widely used gradient-based meta-learning methods, consisting of two optimization loops: an inner loop and an outer loop. MAML learns the new task from meta-initialization parameters with an inner update and finds the meta-initialization parameters in the outer loop. In general, injecting noise into the gradient of the model to augment the gradient is one of the widely used regularization methods. In this work, we propose a novel cooperative meta-learning framework dubbed CML which leverages gradient-level regularization with gradient augmentation. We inject learnable noise into the gradient of the model to improve model generalization. The key idea of CML is to introduce a co-learner which has no inner update but only the outer loop update, to augment gradients for finding better meta-initialization parameters. Since the co-learner does not update in the inner loop, it can be easily deleted after meta-training. Therefore, CML infers with only the meta-learner, without additional cost or performance degradation. We demonstrate that CML is easily applicable to gradient-based meta-learning methods and that CML leads to increased performance in few-shot regression, few-shot image classification and few-shot node classification tasks. Our code is at this https URL.
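
The two-loop structure of MAML described above can be sketched on a toy problem. The example below is a minimal illustration, assuming scalar parameters and a quadratic loss per task so that the outer gradient can be written by hand; the names `loss_grad` and `maml_train` and all values are hypothetical, and CML's co-learner and learnable gradient noise are not modeled.

```python
def loss_grad(theta, target):
    """Gradient of the per-task loss (theta - target)**2."""
    return 2.0 * (theta - target)


def maml_train(task_targets, theta=0.0, inner_lr=0.1, outer_lr=0.1, steps=200):
    for _ in range(steps):
        meta_grad = 0.0
        for target in task_targets:
            # Inner loop: one gradient step adapts theta to this task.
            adapted = theta - inner_lr * loss_grad(theta, target)
            # Outer loop: differentiate the post-adaptation loss w.r.t. theta.
            # d(adapted)/d(theta) = 1 - 2 * inner_lr for this quadratic loss.
            meta_grad += loss_grad(adapted, target) * (1.0 - 2.0 * inner_lr)
        # Update the meta-initialization with the task-averaged gradient.
        theta -= outer_lr * meta_grad / len(task_targets)
    return theta


meta_init = maml_train([-1.0, 3.0])
print(f"meta-initialization: {meta_init:.4f}")
```

With two tasks whose optima sit at -1 and 3, the meta-initialization settles midway between them, the point from which a single inner step reaches either task with the smallest average loss.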

[CV-66] STAR: Skeleton-aware Text-based 4D Avatar Generation with In-Network Motion Retargeting

Link: https://arxiv.org/abs/2406.04629
Authors: Zenghao Chai, Chen Tang, Yongkang Wong, Mohan Kankanhalli
Keywords: subsequently applies animation, Score Distillation Sampling, diffusion models, space and subsequently, subsequently applies
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Multimedia (cs.MM)
Comments: Tech report

Abstract:The creation of 4D avatars (i.e., animated 3D avatars) from text description typically uses text-to-image (T2I) diffusion models to synthesize 3D avatars in the canonical space and subsequently applies animation with target motions. However, such an optimization-by-animation paradigm has several drawbacks. (1) For pose-agnostic optimization, the rendered images in canonical pose for naive Score Distillation Sampling (SDS) exhibit a domain gap and cannot preserve view-consistency using only T2I priors, and (2) for post hoc animation, simply applying the source motions to target 3D avatars yields translation artifacts and misalignment. To address these issues, we propose Skeleton-aware Text-based 4D Avatar generation with in-network motion Retargeting (STAR). STAR considers the geometry and skeleton differences between the template mesh and target avatar, and corrects the mismatched source motion by resorting to pretrained motion retargeting techniques. With the informatively retargeted and occlusion-aware skeleton, we embrace the skeleton-conditioned T2I and text-to-video (T2V) priors, and propose a hybrid SDS module to coherently provide multi-view and frame-consistent supervision signals. Hence, STAR can progressively optimize the geometry, texture, and motion in an end-to-end manner. The quantitative and qualitative experiments demonstrate that our proposed STAR can synthesize high-quality 4D avatars with vivid animations that align well with the text description. Additional ablation studies show the contributions of each component in STAR. The source code and demos are available at: this https URL.

[CV-67] Image Processing Based Forest Fire Detection

Link: https://arxiv.org/abs/2406.04624
Authors: Vipin V
Keywords: image processing technique, processing technique, YCbCr color space, RGB color space, color space
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 9 pages

Abstract:A novel approach for forest fire detection using an image processing technique is proposed. A rule-based color model for fire pixel classification is used. The proposed algorithm uses the RGB and YCbCr color spaces. The advantage of using the YCbCr color space is that it can separate the luminance from the chrominance more effectively than the RGB color space. The performance of the proposed algorithm is tested on two sets of images, one of which contains fire; the other contains fire-like regions. Standard methods are used for calculating the performance of the algorithm. The proposed method has both a higher detection rate and a lower false alarm rate. Since the algorithm is cheap in computation, it can be used for real-time forest fire detection.
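
A rule-based color model of this kind can be sketched as follows. The RGB to YCbCr conversion uses the standard full-range JPEG/JFIF coefficients; the three rules in `is_fire_pixel` are illustrative assumptions in the spirit of such models, since the abstract does not list the paper's actual rules or thresholds.

```python
def rgb_to_ycbcr(r, g, b):
    """Full-range (JPEG/JFIF) RGB -> YCbCr for 8-bit channel values."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr


def is_fire_pixel(r, g, b):
    """Illustrative rules only, not the rules or thresholds from the paper."""
    y, cb, cr = rgb_to_ycbcr(r, g, b)
    # Assumed rules: red-dominant color ordering, luminance above
    # blue-difference chroma, red-difference above blue-difference chroma.
    return r > g > b and y > cb and cr > cb


for name, pixel in [("orange flame", (255, 140, 0)),
                    ("blue sky", (80, 150, 255)),
                    ("dark red object", (200, 50, 40))]:
    print(name, is_fire_pixel(*pixel))
```

Each pixel needs only a handful of multiplications and comparisons, which is consistent with the claim that such a classifier is cheap enough for real-time use. The dark red pixel passes the RGB ordering test but fails the luminance test, which shows how the YCbCr rules can reject fire-like regions.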

[CV-68] A Recover-then-Discriminate Framework for Robust Anomaly Detection

Link: https://arxiv.org/abs/2406.04608
Authors: Peng Xing, Dong Zhang, Jinhui Tang, Zechao Li
Keywords: Anomaly detection, recent past, extensively studied, studied and applied, wide range
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 10 figures

Abstract:Anomaly detection (AD) has been extensively studied and applied in a wide range of scenarios in the recent past. However, there are still gaps between the achieved and desirable levels of recognition accuracy for deploying AD in practical applications. In this paper, we start from an insightful analysis of two types of fundamental yet representative failure cases in the baseline model, and reveal reasons that hinder current AD methods from achieving a higher recognition accuracy. Specifically, in Case-1, we found that the main problem for current AD methods is that the inputs to the recovery model contain a large number of detailed features to be recovered, which leads to normal areas not being recovered, or abnormal areas being recovered, into their original state. In Case-2, we surprisingly found that an abnormal area that cannot be recognized in image-level representations can be easily recognized in the feature-level representation. Based on the above observations, we propose a novel Recover-then-Discriminate (ReDi) framework for AD. ReDi takes a self-generated feature map and a selected prompted image as explicit input information to solve the problems in Case-1. Concurrently, a feature-level discriminative network is proposed to enhance abnormal differences between the recovered representation and the input representation. Extensive experimental results on two popular yet challenging AD datasets validate that ReDi achieves new state-of-the-art accuracy.

[CV-69] Simplify Implant Depth Prediction as Video Grounding: A Texture Perceive Implant Depth Prediction Network

Link: https://arxiv.org/abs/2406.04603
Authors: Xinquan Yang, Xuguang Li, Xiaoling Luo, Leilei Zeng, Yudi Zhang, Linlin Shen, Yongqiang Deng
Keywords: Surgical guide plate, implant depth prediction, implant depth, Depth Prediction Network, Surgical guide
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:The surgical guide plate is an important tool for dental implant surgery. However, the design process heavily relies on the dentist to manually simulate the implant angle and depth. While deep neural networks have been applied to help the dentist quickly locate the implant position, most of them are not able to determine the implant depth. Inspired by the video grounding task, which localizes the starting and ending time of the target video segment, in this paper we simplify implant depth prediction as video grounding and develop a Texture Perceive Implant Depth Prediction Network (TPNet), which enables us to directly output the implant depth without complex measurements of oral bone. TPNet consists of an implant region detector (IRD) and an implant depth prediction network (IDPNet). IRD is an object detector designed to crop the candidate implant volume from the CBCT, which greatly saves computation resources. IDPNet takes the cropped CBCT data to predict the implant depth. A Texture Perceive Loss (TPL) is devised to enable the encoder of IDPNet to perceive the texture variation among slices. Extensive experiments on a large dental implant dataset demonstrate that the proposed TPNet achieves superior performance to the existing methods.

[CV-70] 1st Place Solution for MOSE Track in CVPR 2024 PVUW Workshop: Complex Video Object Segmentation

Link: https://arxiv.org/abs/2406.04600
Authors: Deshui Miao, Xin Li, Zhenyu He, Yaowei Wang, Ming-Hsuan Yang
Keywords: video object segmentation, Tracking and segmenting, segmenting multiple objects, video object, object segmentation
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Tracking and segmenting multiple objects in complex scenes has always been a challenge in the field of video object segmentation, especially in scenarios where objects are occluded and split into parts. In such cases, the definition of objects becomes very ambiguous. The motivation behind the MOSE dataset is how to clearly recognize and distinguish objects in complex scenes. In this challenge, we propose a semantic embedding video object segmentation model and use the salient features of objects as query representations. The semantic understanding helps the model to recognize parts of the objects and the salient feature captures the more discriminative features of the objects. Trained on a large-scale video object segmentation dataset, our model achieves first place (84.45%) in the test set of PVUW Challenge 2024: Complex Video Object Segmentation Track.

[CV-71] CLoG: Benchmarking Continual Learning of Image Generation Models

Link: https://arxiv.org/abs/2406.04584
Authors: Haotian Zhang,Junting Zhou,Haowei Lin,Hang Ye,Jianhua Zhu,Zihao Wang,Liangcai Gao,Yizhou Wang,Yitao Liang
Keywords: Artificial Intelligence, incrementally acquire knowledge, Continual Learning, continual learning community, poses a significant
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Continual Learning (CL) poses a significant challenge in Artificial Intelligence, aiming to mirror the human ability to incrementally acquire knowledge and skills. While extensive research has focused on CL within the context of classification tasks, the advent of increasingly powerful generative models necessitates the exploration of Continual Learning of Generative models (CLoG). This paper advocates for shifting the research focus from classification-based CL to CLoG. We systematically identify the unique challenges presented by CLoG compared to traditional classification-based CL. We adapt three types of existing CL methodologies, replay-based, regularization-based, and parameter-isolation-based methods to generative tasks and introduce comprehensive benchmarks for CLoG that feature great diversity and broad task coverage. Our benchmarks and results yield intriguing insights that can be valuable for developing future CLoG methods. Additionally, we will release a codebase designed to facilitate easy benchmarking and experimentation in CLoG publicly at this https URL. We believe that shifting the research focus to CLoG will benefit the continual learning community and illuminate the path for next-generation AI-generated content (AIGC) in a lifelong learning paradigm.

[CV-72] Attention Fusion Reverse Distillation for Multi-Lighting Image Anomaly Detection

Link: https://arxiv.org/abs/2406.04573
Authors: Yiheng Zhang,Yunkang Cao,Tianhang Zhang,Weiming Shen
Keywords: enhance imaging quality, Image Anomaly Detection, targets Multi-Lighting Image, Fusion Reverse Distillation, Anomaly Detection
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:This study targets Multi-Lighting Image Anomaly Detection (MLIAD), where multiple lighting conditions are utilized to enhance imaging quality and anomaly detection performance. While numerous image anomaly detection methods have been proposed, they lack the capacity to handle multiple inputs for a single sample, like multi-lighting images in MLIAD. Hence, this study proposes Attention Fusion Reverse Distillation (AFRD) to handle multiple inputs in MLIAD. For this purpose, AFRD utilizes a pre-trained teacher network to extract features from multiple inputs. Then these features are aggregated into fused features through an attention module. Subsequently, a corresponding student network is utilized to regress the attention-fused features. The regression errors serve as anomaly scores during inference. Experiments on Eyecandies demonstrate that AFRD achieves better MLIAD performance than other MLIAD alternatives, also highlighting the benefit of using multiple lighting conditions for anomaly detection.

[CV-73] Camera-Pose Robust Crater Detection from Chang’e 5

Link: https://arxiv.org/abs/2406.04569
Authors: Matthew Rodda,Sofia McLeod,Ky Cuong Pham,Tat-Jun Chin
Keywords: off-nadir view angles, space missions aim, timely position estimates, ensure safe navigation, off-nadir view
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:As space missions aim to explore increasingly hazardous terrain, accurate and timely position estimates are required to ensure safe navigation. Vision-based navigation achieves this goal by correlating impact craters visible in onboard imagery with a known database to estimate a craft’s pose. However, existing literature has not sufficiently evaluated crater-detection algorithm (CDA) performance on imagery containing off-nadir view angles. In this work, we evaluate the performance of Mask R-CNN for crater detection, comparing models pretrained on simulated data containing off-nadir view angles with models pretrained on real lunar images. We demonstrate that pretraining on real lunar images is superior despite the lack of images containing off-nadir view angles, achieving detection performance of 63.1 F1-score and ellipse-regression performance of 0.701 intersection over union. This work provides the first quantitative analysis of the performance of CDAs on images containing off-nadir view angles. Towards the development of increasingly robust CDAs, we additionally provide the first annotated CDA dataset with off-nadir view angles from the Chang’e 5 Landing Camera.
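
The two metrics quoted in the abstract above, F1-score for detection and intersection over union (IoU) for shape regression, can be illustrated with a minimal sketch. It assumes axis-aligned boxes in place of the fitted ellipses the paper evaluates, and the function names `f1_score` and `box_iou` are hypothetical, not taken from the authors' code.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def box_iou(a, b) -> float:
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max).

    Assumption: boxes stand in for crater ellipses purely for illustration.
    """
    # Overlap width and height clamp at zero when the boxes do not intersect.
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

For example, 8 true positives with 2 false positives and 2 missed craters give an F1 of 0.8, and two 2x2 boxes offset by one unit in each direction overlap with an IoU of 1/7.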

[CV-74] Improving Geo-diversity of Generated Images with Contextualized Vendi Score Guidance

Link: https://arxiv.org/abs/2406.04551
Authors: Reyhane Askari Hemmat,Melissa Hall,Alicia Sun,Candace Ross,Michal Drozdzal,Adriana Romero-Soriano
Keywords: generated images, risks and biases, growing popularity, increasing focus, focus on understanding
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:With the growing popularity of text-to-image generative models, there has been increasing focus on understanding their risks and biases. Recent work has found that state-of-the-art models struggle to depict everyday objects with the true diversity of the real world and have notable gaps between geographic regions. In this work, we aim to increase the diversity of generated images of common objects such that per-region variations are representative of the real world. We introduce an inference time intervention, contextualized Vendi Score Guidance (c-VSG), that guides the backwards steps of latent diffusion models to increase the diversity of a sample as compared to a “memory bank” of previously generated images while constraining the amount of variation within that of an exemplar set of real-world contextualizing images. We evaluate c-VSG with two geographically representative datasets and find that it substantially increases the diversity of generated images, both for the worst performing regions and on average, while simultaneously maintaining or improving image quality and consistency. Additionally, qualitative analyses reveal that diversity of generated images is significantly improved, including along the lines of reductive region portrayals present in the original model. We hope that this work is a step towards text-to-image generative models that reflect the true geographic diversity of the world.

[CV-75] FOOD: Facial Authentication and Out-of-Distribution Detection with Short-Range FMCW Radar

Link: https://arxiv.org/abs/2406.04546
Authors: Sabri Mustafa Kahya,Boran Hamdi Sivrikaya,Muhammet Sami Yavuz,Eckehard Steinbach
Keywords: radar-based facial authentication, FMCW radar-based facial, short-range FMCW radar-based, paper proposes, radar-based facial
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: Accepted at ICIP 2024

Abstract:This paper proposes a short-range FMCW radar-based facial authentication and out-of-distribution (OOD) detection framework. Our pipeline jointly estimates the correct classes for the in-distribution (ID) samples and detects the OOD samples to prevent their inaccurate prediction. Our reconstruction-based architecture consists of a main convolutional block with one encoder and multi-decoder configuration, and intermediate linear encoder-decoder parts. Together, these elements form an accurate human face classifier and a robust OOD detector. For our dataset, gathered using a 60 GHz short-range FMCW radar, our network achieves an average classification accuracy of 98.07% in identifying in-distribution human faces. As an OOD detector, it achieves an average Area Under the Receiver Operating Characteristic (AUROC) curve of 98.50% and an average False Positive Rate at 95% True Positive Rate (FPR95) of 6.20%. Also, our extensive experiments show that the proposed approach outperforms previous OOD detectors in terms of common OOD detection metrics.
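
The OOD metrics reported in the abstract above, AUROC and FPR95, can be computed from raw detector scores as in the following sketch. It assumes that a higher score means "more in-distribution" and that ID samples are the positive class; conventions differ between papers, and the function names are hypothetical rather than the authors' implementation.

```python
import math


def auroc(pos_scores, neg_scores) -> float:
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def fpr_at_tpr(pos_scores, neg_scores, tpr: float = 0.95) -> float:
    """False positive rate at the threshold that accepts at least `tpr` of positives."""
    ranked = sorted(pos_scores, reverse=True)
    # Number of positives that must score at or above the threshold.
    k = math.ceil(tpr * len(ranked))
    threshold = ranked[k - 1]
    false_pos = sum(1 for n in neg_scores if n >= threshold)
    return false_pos / len(neg_scores)
```

The quadratic pairwise loop keeps the definition of AUROC visible; for large score sets a rank-based computation would be the usual choice.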

[CV-76] M&M VTO: Multi-Garment Virtual Try-On and Editing

Link: https://arxiv.org/abs/2406.04542
Authors: Luyang Zhu,Yingwei Li,Nan Liu,Hao Peng,Dawei Yang,Ira Kemelmacher-Shlizerman
Keywords: multiple garment images, image, virtual try-on, garment images, input multiple garment
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: CVPR 2024 Highlight. Project website: this https URL

Abstract:We present M&M VTO, a mix and match virtual try-on method that takes as input multiple garment images, a text description of the garment layout, and an image of a person. An example input includes: an image of a shirt, an image of a pair of pants, “rolled sleeves, shirt tucked in”, and an image of a person. The output is a visualization of how those garments (in the desired layout) would look on the given person. Key contributions of our method are: 1) a single stage diffusion based model, with no super resolution cascading, that allows mixing and matching multiple garments at 1024x512 resolution while preserving and warping intricate garment details, 2) architecture design (VTO UNet Diffusion Transformer) to disentangle denoising from person specific features, allowing for a highly effective finetuning strategy for identity preservation (6MB model per individual vs 4GB achieved with, e.g., dreambooth finetuning); solving a common identity loss problem in current virtual try-on methods, 3) layout control for multiple garments via text inputs specifically finetuned over PaLI-3 for virtual try-on task. Experimental results indicate that M&M VTO achieves state-of-the-art performance both qualitatively and quantitatively, and opens up new opportunities for virtual try-on via language-guided and multi-garment try-on.

[CV-77] MambaDepth: Enhancing Long-range Dependency for Self-Supervised Fine-Structured Monocular Depth Estimation

Link: https://arxiv.org/abs/2406.04532
Authors: Ionuţ Grigore,Călin-Adrian Popa
Keywords: Convolutional Neural Networks, Convolutional Neural, self-supervised depth estimation, Transformers have traditionally, Neural Networks
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:In the field of self-supervised depth estimation, Convolutional Neural Networks (CNNs) and Transformers have traditionally been dominant. However, both architectures struggle with efficiently handling long-range dependencies due to their local focus or computational demands. To overcome this limitation, we present MambaDepth, a versatile network tailored for self-supervised depth estimation. Drawing inspiration from the strengths of the Mamba architecture, renowned for its adept handling of lengthy sequences and its ability to capture global context efficiently through a State Space Model (SSM), we introduce MambaDepth. This innovative architecture combines the U-Net’s effectiveness in self-supervised depth estimation with the advanced capabilities of Mamba. MambaDepth is structured around a purely Mamba-based encoder-decoder framework, incorporating skip connections to maintain spatial information at various levels of the network. This configuration promotes an extensive feature learning process, enabling the capture of fine details and broader contexts within depth maps. Furthermore, we have developed a novel integration technique within the Mamba blocks to facilitate uninterrupted connectivity and information flow between the encoder and decoder components, thereby improving depth accuracy. Comprehensive testing across the established KITTI dataset demonstrates MambaDepth’s superiority over leading CNN and Transformer-based models in self-supervised depth estimation task, allowing it to achieve state-of-the-art performance. Moreover, MambaDepth proves its superior generalization capacities on other datasets such as Make3D and Cityscapes. MambaDepth’s performance heralds a new era in effective long-range dependency modeling for self-supervised depth estimation.

[CV-78] Classification of Non-native Handwritten Characters Using Convolutional Neural Network

Link: https://arxiv.org/abs/2406.04511
Authors: F. A. Mamun,S. A. H. Chowdhury,J. E. Giti,H. Sarker
Keywords: convolutional neural networks, character recognition, Handwritten character recognition, neural networks, accelerated the progress
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:The use of convolutional neural networks (CNNs) has accelerated the progress of handwritten character classification/recognition. Handwritten character recognition (HCR) has found applications in various domains, such as traffic signal detection, language translation, and document information extraction. However, the widespread use of existing HCR technology is yet to be seen as it does not provide reliable character recognition with outstanding accuracy. One of the reasons for unreliable HCR is that existing HCR methods do not take the handwriting styles of non-native writers into account. Hence, further improvement is needed to ensure the reliability and extensive deployment of character recognition technologies for critical tasks. In this work, the classification of English characters written by non-native users is performed by proposing a custom-tailored CNN model. We train this CNN with a new dataset called the handwritten isolated English character (HIEC) dataset. This dataset consists of 16,496 images collected from 260 persons. This paper also includes an ablation study of our CNN by adjusting hyperparameters to identify the best model for the HIEC dataset. The proposed model with five convolutional layers and one hidden layer outperforms state-of-the-art models in terms of character recognition accuracy and achieves an accuracy of 97.04%. Compared with the second-best model, the relative improvement of our model in terms of classification accuracy is 4.38%.

[CV-79] OCCAM: Towards Cost-Efficient and Accuracy-Aware Image Classification Inference

Link: https://arxiv.org/abs/2406.04508
Authors: Dujian Ding,Bicheng Xu,Laks V.S. Lakshmanan
Keywords: computer vision applications, fundamental building block, vision applications, fundamental building, building block
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Under Review

Abstract:Image classification is a fundamental building block for a majority of computer vision applications. With the growing popularity and capacity of machine learning models, people can easily access trained image classifiers as a service online or offline. However, model use comes with a cost and classifiers of higher capacity usually incur higher inference costs. To harness the respective strengths of different classifiers, we propose a principled approach, OCCAM, to compute the best classifier assignment strategy over image classification queries (termed as the optimal model portfolio) so that the aggregated accuracy is maximized, under user-specified cost budgets. Our approach uses an unbiased and low-variance accuracy estimator and effectively computes the optimal solution by solving an integer linear programming problem. On a variety of real-world datasets, OCCAM achieves 40% cost reduction with little to no accuracy drop.
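
The assignment problem described in the abstract above (choose one classifier per query so that total estimated accuracy is maximized within a cost budget) can be illustrated with a toy sketch. It uses exhaustive search as a stand-in for the integer linear program the paper solves, so it only scales to a handful of queries. The function name `best_assignment` and all inputs are hypothetical; how the paper obtains its accuracy estimates is not reproduced here.

```python
from itertools import product


def best_assignment(est_acc, costs, budget):
    """Pick one model per query to maximize summed estimated accuracy within budget.

    est_acc[q][m]: estimated probability that model m answers query q correctly
                   (assumed to be given).
    costs[m]:      per-query inference cost of model m.
    Returns (assignment tuple, total estimated accuracy), or (None, -inf) if
    no assignment fits the budget.
    """
    best, best_total = None, float("-inf")
    # Enumerate every way of assigning a model index to each query.
    for assignment in product(range(len(costs)), repeat=len(est_acc)):
        total_cost = sum(costs[m] for m in assignment)
        if total_cost > budget:
            continue  # infeasible under the user-specified budget
        total_acc = sum(est_acc[q][m] for q, m in enumerate(assignment))
        if total_acc > best_total:
            best, best_total = assignment, total_acc
    return best, best_total
```

With a cheap model (cost 1) and an expensive one (cost 4), a budget of 6 over three queries allows the expensive model for at most one query, and the search spends it on the query where it helps most.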

[CV-80] CORU: Comprehensive Post-OCR Parsing and Receipt Understanding Dataset

Link: https://arxiv.org/abs/2406.04493
Authors: Abdelrahman Abdallah,Mahmoud Abdalla,Mahmoud SalahEldin Kasem,Mohamed Mahmoud,Ibrahim Abdelhalim,Mohamed Elkasaby,Yasser ElBendary,Adam Jatowt
Keywords: Optical Character Recognition, Natural Language Processing, Character Recognition, Optical Character, Natural Language
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Abstract:In the fields of Optical Character Recognition (OCR) and Natural Language Processing (NLP), integrating multilingual capabilities remains a critical challenge, especially when considering languages with complex scripts such as Arabic. This paper introduces the Comprehensive Post-OCR Parsing and Receipt Understanding Dataset (CORU), a novel dataset specifically designed to enhance OCR and information extraction from receipts in multilingual contexts involving Arabic and English. CORU consists of over 20,000 annotated receipts from diverse retail settings, including supermarkets and clothing stores, alongside 30,000 annotated images for OCR that were utilized to recognize each detected line, and 10,000 items annotated for detailed information extraction. These annotations capture essential details such as merchant names, item descriptions, total prices, receipt numbers, and dates. They are structured to support three primary computational tasks: object detection, OCR, and information extraction. We establish the baseline performance for a range of models on CORU to evaluate the effectiveness of traditional methods, like Tesseract OCR, and more advanced neural network-based approaches. These baselines are crucial for processing the complex and noisy document layouts typical of real-world receipts and for advancing the state of automated multilingual document processing. Our datasets are publicly accessible (this https URL).

[CV-81] GenAI Arena: An Open Evaluation Platform for Generative Models

Link: https://arxiv.org/abs/2406.04485
Authors: Dongfu Jiang,Max Ku,Tianle Li,Yuansheng Ni,Shizhuo Sun,Rongqi Fan,Wenhu Chen
Keywords: made remarkable strides, made remarkable, remarkable strides, strides to revolutionize, revolutionize fields
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages,7 figures

Abstract:Generative AI has made remarkable strides to revolutionize fields such as image and video generation. These advancements are driven by innovative algorithms, architecture, and data. However, the rapid proliferation of generative models has highlighted a critical gap: the absence of trustworthy evaluation metrics. Current automatic assessments such as FID, CLIP, and FVD often fail to capture the nuanced quality and user satisfaction associated with generative outputs. This paper proposes an open platform GenAI-Arena to evaluate different image and video generative models, where users can actively participate in evaluating these models. By leveraging collective user feedback and votes, GenAI-Arena aims to provide a more democratic and accurate measure of model performance. It covers three arenas for text-to-image generation, text-to-video generation, and image editing respectively. Currently, we cover a total of 27 open-source generative models. GenAI-Arena has been operating for four months, amassing over 6000 votes from the community. We describe our platform, analyze the data, and explain the statistical methods for ranking the models. To further promote the research in building model-based evaluation metrics, we release a cleaned version of our preference data for the three tasks, namely GenAI-Bench. We prompt existing multi-modal models such as Gemini and GPT-4o to mimic human voting. We compute the correlation between model voting and human voting to understand their judging abilities. Our results show that existing multimodal models still lag in assessing generated visual content: even the best model, GPT-4o, only achieves a Pearson correlation of 0.22 in the quality subscore and behaves like random guessing in the others.
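
The agreement measure mentioned in the abstract above, the Pearson correlation between model votes and human votes, follows the standard formula sketched below. How votes are encoded as numbers is an assumption here (for instance, a numeric quality rating per sample), and the function name `pearson` is hypothetical.

```python
import math


def pearson(xs, ys) -> float:
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of cross-deviations and squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)
```

A value near 1 means the model's scores rise and fall with the human scores, a value near 0 means there is no linear relationship, which puts the reported 0.22 much closer to chance than to agreement.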

[CV-82] Step Out and Seek Around: On Warm-Start Training with Incremental Data

Link: https://arxiv.org/abs/2406.04484
Authors: Maying Shen,Hongxu Yin,Pavlo Molchanov,Lei Mao,Jose M. Alvarez
Keywords: real-world deep learning, deep learning applications, autonomous driving, arrives in sequence, sequence over time
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Data often arrives in sequence over time in real-world deep learning applications such as autonomous driving. When new training data is available, training the model from scratch undermines the benefit of leveraging the learned knowledge, leading to significant training costs. Warm-starting from a previously trained checkpoint is the most intuitive way to retain knowledge and advance learning. However, existing literature suggests that this warm-starting degrades generalization. In this paper, we advocate for warm-starting but stepping out of the previous converging point, thus allowing a better adaptation to new data without compromising previous knowledge. We propose Knowledge Consolidation and Acquisition (CKCA), a continuous model improvement algorithm with two novel components. First, a novel feature regularization (FeatReg) to retain and refine knowledge from existing checkpoints; Second, we propose adaptive knowledge distillation (AdaKD), a novel approach to forget mitigation and knowledge transfer. We tested our method on ImageNet using multiple splits of the training data. Our approach achieves up to 8.39% higher top1 accuracy than the vanilla warm-starting and consistently outperforms the prior art with a large margin.

[CV-83] Evaluating Large Vision-Language Models Understanding of Real-World Complexities Through Synthetic Benchmarks

Link: https://arxiv.org/abs/2406.04470
Authors: Haokun Zhou,Yipeng Hong
Keywords: Large Vision-Language Models, ability of Large, Large Vision-Language, assesses the ability, differentiate between AI-generated
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:This study assesses the ability of Large Vision-Language Models (LVLMs) to differentiate between AI-generated and human-generated images. It introduces a new automated benchmark construction method for this evaluation. The experiment compared common LVLMs with human participants using a mixed dataset of AI and human-created images. Results showed that LVLMs could distinguish between the image types to some extent but exhibited a rightward bias, and performed significantly worse than humans. To build on these findings, we developed an automated benchmark construction process using AI. This process involved topic retrieval, narrative script generation, error embedding, and image generation, creating a diverse set of text-image pairs with intentional errors. We validated our method by constructing two comparable benchmarks. This study highlights the strengths and weaknesses of LVLMs in real-world understanding and advances benchmark construction techniques, providing a scalable and automatic approach for AI model evaluation.

[CV-84] MAIRA-2: Grounded Radiology Report Generation

Link: https://arxiv.org/abs/2406.04449
Authors: Shruthi Bannur,Kenza Bouzid,Daniel C. Castro,Anton Schwaighofer,Sam Bond-Taylor,Maximilian Ilse,Fernando Pérez-García,Valentina Salvatelli,Harshita Sharma,Felix Meissen,Mercy Ranjit,Shaury Srivastav,Julia Gong,Fabian Falck,Ozan Oktay,Anja Thieme,Matthew P. Lungren,Maria Teodora Wetscherek,Javier Alvarez-Valle,Stephanie L. Hyland
Keywords: requires detailed image, Radiology reporting, integration of multiple, detailed image understanding, requires detailed
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: 44 pages, 20 figures

Abstract:Radiology reporting is a complex task that requires detailed image understanding, integration of multiple inputs, including comparison with prior imaging, and precise language generation. This makes it ideal for the development and use of generative multimodal models. Here, we extend report generation to include the localisation of individual findings on the image - a task we call grounded report generation. Prior work indicates that grounding is important for clarifying image understanding and interpreting AI-generated text. Therefore, grounded reporting stands to improve the utility and transparency of automated report drafting. To enable evaluation of grounded reporting, we propose a novel evaluation framework - RadFact - leveraging the reasoning capabilities of large language models (LLMs). RadFact assesses the factuality of individual generated sentences, as well as correctness of generated spatial localisations when present. We introduce MAIRA-2, a large multimodal model combining a radiology-specific image encoder with a LLM, and trained for the new task of grounded report generation on chest X-rays. MAIRA-2 uses more comprehensive inputs than explored previously: the current frontal image, the current lateral image, the prior frontal image and prior report, as well as the Indication, Technique and Comparison sections of the current report. We demonstrate that these additions significantly improve report quality and reduce hallucinations, establishing a new state of the art on findings generation (without grounding) on MIMIC-CXR while demonstrating the feasibility of grounded reporting as a novel and richer task.

[CV-85] DeTra: A Unified Model for Object Detection and Trajectory Forecasting

Link: https://arxiv.org/abs/2406.04426
Authors: Sergio Casas,Ben Agro,Jiageng Mao,Thomas Gilles,Alexander Cui,Thomas Li,Raquel Urtasun
Keywords: trajectory forecasting play, autonomous driving, play a crucial, crucial role, role in understanding
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)

Abstract:The tasks of object detection and trajectory forecasting play a crucial role in understanding the scene for autonomous driving. These tasks are typically executed in a cascading manner, making them prone to compounding errors. Furthermore, there is usually a very thin interface between the two tasks, creating a lossy information bottleneck. To address these challenges, our approach formulates the union of the two tasks as a trajectory refinement problem, where the first pose is the detection (current time), and the subsequent poses are the waypoints of the multiple forecasts (future time). To tackle this unified task, we design a refinement transformer that infers the presence, pose, and multi-modal future behaviors of objects directly from LiDAR point clouds and high-definition maps. We call this model DeTra, short for object Detection and Trajectory forecasting. In our experiments, we observe that DeTra outperforms the state-of-the-art on Argoverse 2 Sensor and Waymo Open Dataset by a large margin, across a broad range of metrics. Last but not least, we perform extensive ablation studies that show the value of refinement for this task, that every proposed component contributes positively to its performance, and that key design choices were made.

[CV-86] Efficient 3D-Aware Facial Image Editing via Attribute-Specific Prompt Learning

Link: https://arxiv.org/abs/2406.04413
Authors: Amandeep Kumar,Muhammad Awais,Sanath Narayan,Hisham Cholakkal,Salman Khan,Rao Muhammad Anwer
Keywords: Drawing upon StyleGAN, employ textual prompting, approaches employ textual, edit facial images, StyleGAN expressivity
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Drawing upon StyleGAN’s expressivity and disentangled latent space, existing 2D approaches employ textual prompting to edit facial images with different attributes. In contrast, 3D-aware approaches that generate faces at different target poses require attribute-specific classifiers, learning separate model weights for each attribute, and are not scalable for novel attributes. In this work, we propose an efficient, plug-and-play, 3D-aware face editing framework based on attribute-specific prompt learning, enabling the generation of facial images with controllable attributes across various target poses. To this end, we introduce a text-driven learnable style token-based latent attribute editor (LAE). The LAE harnesses a pre-trained vision-language model to find text-guided attribute-specific editing direction in the latent space of any pre-trained 3D-aware GAN. It utilizes learnable style tokens and style mappers to learn and transform this editing direction to 3D latent space. To train LAE with multiple attributes, we use directional contrastive loss and style token loss. Furthermore, to ensure view consistency and identity preservation across different poses and attributes, we employ several 3D-aware identity and pose preservation losses. Our experiments show that our proposed framework generates high-quality images with 3D awareness and view consistency while maintaining attribute-specific features. We demonstrate the effectiveness of our method on different facial attributes, including hair color and style, expression, and others. Code: this https URL.

[CV-87] Use of a Multiscale Vision Transformer to predict Nursing Activities Score from Low Resolution Thermal Videos in an Intensive Care Unit

Link: https://arxiv.org/abs/2406.04364
Authors: Isaac YL Lee,Thanh Nguyen-Duc,Ryo Ueno,Jesse Smith,Peter Y Chan
Keywords: Excessive caregiver workload, Intensive Care Unit, poorer patient care, Excessive caregiver, NAS
Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: 4 pages, 1 figure

Abstract:Excessive caregiver workload in hospital nurses has been implicated in poorer patient care and increased worker burnout. Measurement of this workload in the Intensive Care Unit (ICU) is often done using the Nursing Activities Score (NAS), but this is usually recorded manually and sporadically. Previous work has made use of Ambient Intelligence (AmI) by using computer vision to passively derive caregiver-patient interaction times to monitor staff workload. In this letter, we propose using a Multiscale Vision Transformer (MViT) to passively predict the NAS from low-resolution thermal videos recorded in an ICU. 458 videos were obtained from an ICU in Melbourne, Australia and used to train a MViTv2 model using an indirect prediction and a direct prediction method. The indirect method predicted 1 of 8 potentially identifiable NAS activities from the video before inferring the NAS. The direct method predicted the NAS score immediately from the video. The indirect method yielded an average 5-fold accuracy of 57.21%, an area under the receiver operating characteristic curve (ROC AUC) of 0.865, a F1 score of 0.570 and a mean squared error (MSE) of 28.16. The direct method yielded a MSE of 18.16. We also showed that the MViTv2 outperforms similar models such as R(2+1)D and ResNet50-LSTM under identical settings. This study shows the feasibility of using a MViTv2 to passively predict the NAS in an ICU and monitor staff workload automatically. Our results above also show an increased accuracy in predicting NAS directly versus predicting NAS indirectly. We hope that our study can provide a direction for future work and further improve the accuracy of passive NAS monitoring. 

[CV-88] Energy Propagation in Scattering Convolution Networks Can Be Arbitrarily Slow

Link: https://arxiv.org/abs/2406.05121
Authors: Hartmut Führ,Max Getter
Keywords: deep convolutional neural, Mallat wavelet scattering, energy decay, analyze energy decay, convolutional neural networks
Categories: Functional Analysis (math.FA); Computer Vision and Pattern Recognition (cs.CV)
Comments: 39 pages, 4 figures

Abstract:We analyze energy decay for deep convolutional neural networks employed as feature extractors, such as Mallat’s wavelet scattering transform. For time-frequency scattering transforms based on Gabor filters, it has been established that energy decay is exponential, for arbitrary square-integrable input signals. Our main results allow us to prove that this is false for wavelet scattering in arbitrary dimensions. In this setting, the energy decay of the scattering transform acting on a generic square-integrable signal turns out to be arbitrarily slow. The fact that this behavior holds for dense subsets of L^2(\mathbb{R}^d) emphasizes that fast energy decay is generally not a stable property of signals. We complement these findings with positive results that allow one to conclude fast (up to exponential) energy decay for generalized Sobolev spaces that are tailored to the frequency localization of the underlying filter bank. Both negative and positive results highlight that energy decay in scattering networks critically depends on the interplay of the respective frequency localizations of the signal on the one hand, and of the employed filters on the other.

[CV-89] Hibou: A Family of Foundational Vision Transformers for Pathology

Link: https://arxiv.org/abs/2406.05074
Authors: Dmitry Nechaev, Alexey Pchelnikov, Ekaterina Ivanova
Keywords: medical conditions, microscopic examination, examination of diseased, critical for diagnosing, diagnosing various medical
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Pathology, the microscopic examination of diseased tissue, is critical for diagnosing various medical conditions, particularly cancers. Traditional methods are labor-intensive and prone to human error. Digital pathology, which converts glass slides into high-resolution digital images for analysis by computer algorithms, revolutionizes the field by enhancing diagnostic accuracy, consistency, and efficiency through automated image analysis and large-scale data processing. Foundational transformer pretraining is crucial for developing robust, generalizable models as it enables learning from vast amounts of unannotated data. This paper introduces the Hibou family of foundational vision transformers for pathology, leveraging the DINOv2 framework to pretrain two model variants, Hibou-B and Hibou-L, on a proprietary dataset of over 1 million whole slide images (WSIs) representing diverse tissue types and staining techniques. Our pretrained models demonstrate superior performance on both patch-level and slide-level benchmarks, surpassing existing state-of-the-art methods. Notably, Hibou-L achieves the highest average accuracy across multiple benchmark datasets. To support further research and application in the field, we have open-sourced the Hibou-B model, which can be accessed at this https URL

[CV-90] Diffusion-based Generative Image Outpainting for Recovery of FOV-Truncated CT Images

Link: https://arxiv.org/abs/2406.04769
Authors: Michelle Espranita Liman, Daniel Rueckert, Florian J. Fintelmann, Philip Müller
Keywords: body composition analysis, subcutaneous adipose tissue, accurate body composition, involves quantifying skeletal, quantifying skeletal muscle
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Shared last authorship: Florian J. Fintelmann and Philip Müller

Abstract:Field-of-view (FOV) recovery of truncated chest CT scans is crucial for accurate body composition analysis, which involves quantifying skeletal muscle and subcutaneous adipose tissue (SAT) on CT slices. This, in turn, enables disease prognostication. Here, we present a method for recovering truncated CT slices using generative image outpainting. We train a diffusion model and apply it to truncated CT slices generated by simulating a small FOV. Our model reliably recovers the truncated anatomy and outperforms the previous state-of-the-art despite being trained on 87% less data.

[CV-91] MTS-Net: Dual-Enhanced Positional Multi-Head Self-Attention for 3D CT Diagnosis of May-Thurner Syndrome

Link: https://arxiv.org/abs/2406.04680
Authors: Yixin Huang, Yiqi Jin, Ke Tao, Kaijian Xia, Jianfeng Gu, Lei Yu, Lan Du, Cunjian Chen
Keywords: vein compression syndrome, iliac vein compression, condition potentially impacting, deep venous thrombosis, iliofemoral deep venous
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Abstract:May-Thurner Syndrome (MTS), also known as iliac vein compression syndrome or Cockett’s syndrome, is a condition potentially impacting over 20 percent of the population, leading to an increased risk of iliofemoral deep venous thrombosis. In this paper, we present a 3D-based deep learning approach called MTS-Net for diagnosing May-Thurner Syndrome using CT scans. To effectively capture the spatial-temporal relationship among CT scans and emulate the clinical process of diagnosing MTS, we propose a novel attention module called the dual-enhanced positional multi-head self-attention (DEP-MHSA). The proposed DEP-MHSA reconsiders the role of positional embedding and incorporates a dual-enhanced positional embedding in both attention weights and residual connections. Further, we establish a new dataset, termed MTS-CT, consisting of 747 subjects. Experimental results demonstrate that our proposed approach achieves state-of-the-art MTS diagnosis results, and our self-attention design facilitates the spatial-temporal modeling. We believe that our DEP-MHSA is more suitable to handle CT image sequence modeling and the proposed dataset enables future research on MTS diagnosis. We make our code and dataset publicly available at: this https URL.

[CV-92] XctDiff: Reconstruction of CT Images with Consistent Anatomical Structures from a Single Radiographic Projection Image

Link: https://arxiv.org/abs/2406.04679
Authors: Qingze Bai, Tiange Liu, Zhi Liu, Yubing Tong, Drew Torigian, Jayaram Udupa
Keywords: easily controllable tasks, progressive feature extraction, feature extraction strategy, feature extraction, algorithm framework
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Abstract:In this paper, we present XctDiff, an algorithm framework for reconstructing CT from a single radiograph, which decomposes the reconstruction process into two easily controllable tasks: feature extraction and CT reconstruction. Specifically, we first design a progressive feature extraction strategy that is able to extract robust 3D priors from radiographs. Then, we use the extracted prior information to guide the CT reconstruction in the latent space. Moreover, we design a homogeneous spatial codebook to improve the reconstruction quality further. The experimental results show that our proposed method achieves state-of-the-art reconstruction performance and overcomes the blurring issue. We also apply XctDiff to a self-supervised pre-training task. Its effectiveness there indicates that it has promising additional applications in medical image analysis. The code is available at: this https URL

Machine Learning

[LG-0] 3D-GRAND: Towards Better Grounding and Less Hallucination for 3D-LLMs

Link: https://arxiv.org/abs/2406.05132
Authors: Jianing Yang, Xuweiyi Chen, Nikhil Madaan, Madhavan Iyengar, Shengyi Qian, David F. Fouhey, Joyce Chai
Keywords: developing embodied agents, perception is crucial, physical world, crucial for developing, agents and robots
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: Project website: this https URL

Abstract:The integration of language and 3D perception is crucial for developing embodied agents and robots that comprehend and interact with the physical world. While large language models (LLMs) have demonstrated impressive language understanding and generation capabilities, their adaptation to 3D environments (3D-LLMs) remains in its early stages. A primary challenge is the absence of large-scale datasets that provide dense grounding between language and 3D scenes. In this paper, we introduce 3D-GRAND, a pioneering large-scale dataset comprising 40,087 household scenes paired with 6.2 million densely-grounded scene-language instructions. Our results show that instruction tuning with 3D-GRAND significantly enhances grounding capabilities and reduces hallucinations in 3D-LLMs. As part of our contributions, we propose a comprehensive benchmark 3D-POPE to systematically evaluate hallucination in 3D-LLMs, enabling fair comparisons among future models. Our experiments highlight a scaling effect between dataset size and 3D-LLM performance, emphasizing the critical role of large-scale 3D-text datasets in advancing embodied AI research. Notably, our results demonstrate early signals for effective sim-to-real transfer, indicating that models trained on large synthetic data can perform well on real-world 3D scans. Through 3D-GRAND and 3D-POPE, we aim to equip the embodied AI community with essential resources and insights, setting the stage for more reliable and better-grounded 3D-LLMs. Project website: this https URL

[LG-1] Compositional Curvature Bounds for Deep Neural Networks

Link: https://arxiv.org/abs/2406.05119
Authors: Taha Entesari, Sina Sharifi, Mahyar Fazlyab
Keywords: neural networks, key challenge, challenge that threatens, threatens the widespread, safety-critical applications
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Proceedings of the 41st International Conference on Machine Learning (ICML 2024)

Abstract:A key challenge that threatens the widespread use of neural networks in safety-critical applications is their vulnerability to adversarial attacks. In this paper, we study the second-order behavior of continuously differentiable deep neural networks, focusing on robustness against adversarial perturbations. First, we provide a theoretical analysis of robustness and attack certificates for deep classifiers by leveraging local gradients and upper bounds on the second derivative (curvature constant). Next, we introduce a novel algorithm to analytically compute provable upper bounds on the second derivative of neural networks. This algorithm leverages the compositional structure of the model to propagate the curvature bound layer-by-layer, giving rise to a scalable and modular approach. The proposed bound can serve as a differentiable regularizer to control the curvature of neural networks during training, thereby enhancing robustness. Finally, we demonstrate the efficacy of our method on classification tasks using the MNIST and CIFAR-10 datasets.
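To make the role of a curvature bound concrete, here is a minimal sketch of how a local gradient and an upper bound on the second derivative can yield a robustness radius. It uses the standard second-order Taylor lower bound, which is not necessarily the exact certificate derived in the paper. The function name `certified_radius` and its parameters are hypothetical: `margin` is the classification margin at the input, `grad_norm` is the norm of the margin's gradient there, and `curvature` is an assumed bound on the Lipschitz constant of that gradient.

```python
import math


def certified_radius(margin: float, grad_norm: float, curvature: float) -> float:
    """Largest r such that margin - grad_norm * r - 0.5 * curvature * r**2 >= 0.

    If the margin function has a `curvature`-Lipschitz gradient, its value at
    any point within distance r of the input is at least this quadratic, so the
    predicted class cannot change inside the returned radius.
    (Illustrative helper; not taken from the paper.)
    """
    if margin <= 0:
        # The input is already misclassified (or on the boundary).
        return 0.0
    if curvature == 0:
        # Purely first-order (Lipschitz-style) certificate.
        return margin / grad_norm if grad_norm > 0 else math.inf
    discriminant = grad_norm ** 2 + 2.0 * curvature * margin
    return (-grad_norm + math.sqrt(discriminant)) / curvature
```

A smaller curvature bound gives a larger radius for the same margin and gradient, which is why a tight, differentiable curvature bound is useful both for certification and as a training regularizer.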

[LG-2] The Expanding Scope of the Stability Gap: Unveiling its Presence in Joint Incremental Learning of Homogeneous Tasks

Link: https://arxiv.org/abs/2406.05114
Authors: Sandesh Kamath, Albin Soutif-Cormerais, Joost van de Weijer, Bogdan Raducanu
Keywords: Recent research identified, Recent research, temporary performance drop, previously learned tasks, research identified
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at CVPR 2024 Workshop on Continual Learning in Computer Vision (CLVision)

Abstract:Recent research identified a temporary performance drop on previously learned tasks when transitioning to a new one. This drop is called the stability gap and has great consequences for continual learning: it complicates the direct employment of continual learning since the worst-case performance at task boundaries is dramatic, it limits its potential as an energy-efficient training paradigm, and finally, the stability drop could result in a reduced final performance of the algorithm. In this paper, we show that the stability gap also occurs when applying joint incremental training of homogeneous tasks. In this scenario, the learner continues training on the same data distribution and has access to all data from previous tasks. In addition, we show that in this scenario, there exists a low-loss linear path to the next minimum, but that SGD optimization does not choose this path. We perform further analysis including a finer batch-wise analysis which could provide insights towards potential solution directions.
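The "low-loss linear path" claim can be checked with a simple linear interpolation between two parameter vectors. The sketch below is illustrative only: the helper names (`interpolate`, `loss_along_path`, `barrier`) and the toy loss functions in the usage note are assumptions, not the authors' code or experimental setup.

```python
def interpolate(theta_a, theta_b, alpha):
    """Point at fraction `alpha` on the straight line from theta_a to theta_b."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(theta_a, theta_b)]


def loss_along_path(loss_fn, theta_a, theta_b, steps=11):
    """Evaluate loss_fn at `steps` evenly spaced points on the linear path."""
    return [
        loss_fn(interpolate(theta_a, theta_b, i / (steps - 1)))
        for i in range(steps)
    ]


def barrier(losses):
    """Highest loss on the path minus the larger endpoint loss.

    A value near zero means the straight line between the two solutions
    never climbs above its endpoints, i.e. a low-loss linear path exists.
    """
    return max(losses) - max(losses[0], losses[-1])
```

For a convex toy loss such as `lambda t: sum(v * v for v in t)` the barrier between `[1.0, 0.0]` and `[0.0, 1.0]` is zero, while a double-well loss such as `lambda t: (t[0] ** 2 - 1.0) ** 2` has a barrier of 1 between `[-1.0]` and `[1.0]`.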

[LG-3] LLavaGuard: VLM-based Safeguards for Vision Dataset Curation and Safety Assessment

Link: https://arxiv.org/abs/2406.05113
Authors: Lukas Helff, Felix Friedrich, Manuel Brack, Kristian Kersting, Patrick Schramowski
Keywords: VLM-based safeguard models, offering a versatile, family of VLM-based, VLM-based safeguard, versatile framework
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Project page at this https URL

Abstract:We introduce LlavaGuard, a family of VLM-based safeguard models, offering a versatile framework for evaluating the safety compliance of visual content. Specifically, we designed LlavaGuard for dataset annotation and generative model safeguarding. To this end, we collected and annotated a high-quality visual dataset incorporating a broad safety taxonomy, which we use to tune VLMs on context-aware safety risks. As a key innovation, LlavaGuard’s new responses contain comprehensive information, including a safety rating, the violated safety categories, and an in-depth rationale. Further, our introduced customizable taxonomy categories enable the context-specific alignment of LlavaGuard to various scenarios. Our experiments highlight the capabilities of LlavaGuard in complex and real-world applications. We provide checkpoints ranging from 7B to 34B parameters demonstrating state-of-the-art performance, with even the smallest models outperforming baselines like GPT-4. We make our dataset and model weights publicly available and invite further research to address the diverse needs of communities and contexts.

[LG-4] Large Generative Graph Models

Link: https://arxiv.org/abs/2406.05109
Authors: Yu Wang, Ryan A. Rossi, Namyong Park, Huiyuan Chen, Nesreen K. Ahmed, Puja Trivedi, Franck Dernoncourt, Danai Koutra, Tyler Derr
Keywords: graph generative models, Large Generative Models, Generative Models, graph generative, Large Graph Generative
Subjects: Machine Learning (cs.LG)

Abstract:Large Generative Models (LGMs) such as GPT, Stable Diffusion, Sora, and Suno are trained on a huge amount of language corpus, images, videos, and audio that are extremely diverse from numerous domains. This training paradigm over diverse well-curated data lies at the heart of generating creative and sensible content. However, all previous graph generative models (e.g., GraphRNN, MDVAE, MoFlow, GDSS, and DiGress) have been trained only on one dataset each time, which cannot replicate the revolutionary success achieved by LGMs in other fields. To remedy this crucial gap, we propose a new class of graph generative model called Large Graph Generative Model (LGGM) that is trained on a large corpus of graphs (over 5000 graphs) from 13 different domains. We empirically demonstrate that the pre-trained LGGM has superior zero-shot generative capability to existing graph generative models. Furthermore, our pre-trained LGGM can be easily fine-tuned with graphs from target domains and demonstrate even better performance than those directly trained from scratch, behaving as a solid starting point for real-world customization. Inspired by Stable Diffusion, we further equip LGGM with the capability to generate graphs given text prompts (Text-to-Graph), such as the description of the network name and domain (i.e., “The power-1138-bus graph represents a network of buses in a power distribution system.”), and network statistics (i.e., “The graph has a low average degree, suitable for modeling social media interactions.”). This Text-to-Graph capability integrates the extensive world knowledge in the underlying language model, offering users fine-grained control of the generated graphs. We release the code, the model checkpoint, and the datasets at this https URL.

[LG-5] Provably Better Explanations with Optimized Aggregation of Feature Attributions

Link: https://arxiv.org/abs/2406.05090
Authors: Thomas Decker, Ananta R. Bhattarai, Jindong Gu, Volker Tresp, Florian Buettner
Keywords: opaque machine learning, machine learning models, common practice, practice to understand, understand and verify
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: International Conference on Machine Learning (ICML) 2024

Abstract:Using feature attributions for post-hoc explanations is a common practice to understand and verify the predictions of opaque machine learning models. Despite the numerous techniques available, individual methods often produce inconsistent and unstable results, putting their overall reliability into question. In this work, we aim to systematically improve the quality of feature attributions by combining multiple explanations across distinct methods or their variations. For this purpose, we propose a novel approach to derive optimal convex combinations of feature attributions that yield provable improvements of desired quality criteria such as robustness or faithfulness to the model behavior. Through extensive experiments involving various model architectures and popular feature attribution techniques, we demonstrate that our combination strategy consistently outperforms individual methods and existing baselines.

[LG-6] Optimizing Time Series Forecasting Architectures: A Hierarchical Neural Architecture Search Approach

Link: https://arxiv.org/abs/2406.05088
Authors: Difan Deng, Marius Lindauer
Keywords: deep learning-based modules, time series forecasting, series forecasting research, rapid development, research has brought
Subjects: Machine Learning (cs.LG)

Abstract:The rapid development of time series forecasting research has brought many deep learning-based modules in this field. However, despite the increasing amount of new forecasting architectures, it is still unclear if we have leveraged the full potential of these existing modules within a properly designed architecture. In this work, we propose a novel hierarchical neural architecture search approach for time series forecasting tasks. With the design of a hierarchical search space, we incorporate many architecture types designed for forecasting tasks and allow for the efficient combination of different forecasting architecture modules. Results on long-term-time-series-forecasting tasks show that our approach can search for lightweight high-performing forecasting architectures across different forecasting tasks.

[LG-7] SUMIE: A Synthetic Benchmark for Incremental Entity Summarization

Link: https://arxiv.org/abs/2406.05079
Authors: Eunjeong Hwang, Yichao Zhou, Beliz Gunel, James Bradley Wendt, Sandeep Tata
Keywords: models rapidly advance, Incremental Entity Summarization, existing dataset adequately, dataset adequately tests, language models
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 24 figures, 4 tables

Abstract:No existing dataset adequately tests how well language models can incrementally update entity summaries - a crucial ability as these models rapidly advance. The Incremental Entity Summarization (IES) task is vital for maintaining accurate, up-to-date knowledge. To address this, we introduce SUMIE, a fully synthetic dataset designed to expose real-world IES challenges. This dataset effectively highlights problems like incorrect entity association and incomplete information presentation. Unlike common synthetic datasets, ours captures the complexity and nuances found in real-world data. We generate informative and diverse attributes, summaries, and unstructured paragraphs in sequence, ensuring high quality. The alignment between generated summaries and paragraphs exceeds 96%, confirming the dataset’s quality. Extensive experiments demonstrate the dataset’s difficulty - state-of-the-art LLMs struggle to update summaries with an F1 higher than 80.4%. We will open source the benchmark and the evaluation metrics to help the community make progress on IES tasks.

[LG-8] Linearization Turns Neural Operators into Function-Valued Gaussian Processes

Link: https://arxiv.org/abs/2406.05072
Authors: Emilia Magnani, Marvin Pförtner, Tobias Weber, Philipp Hennig
Keywords: Modeling dynamical systems, partial differential equations, necessitates solving partial, differential equations, Modeling dynamical
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Modeling dynamical systems, e.g. in climate and engineering sciences, often necessitates solving partial differential equations. Neural operators are deep neural networks designed to learn nontrivial solution operators of such differential equations from data. As for all statistical models, the predictions of these models are imperfect and exhibit errors. Such errors are particularly difficult to spot in the complex nonlinear behaviour of dynamical systems. We introduce a new framework for approximate Bayesian uncertainty quantification in neural operators using function-valued Gaussian processes. Our approach can be interpreted as a probabilistic analogue of the concept of currying from functional programming and provides a practical yet theoretically sound way to apply the linearized Laplace approximation to neural operators. In a case study on Fourier neural operators, we show that, even for a discretized input, our method yields a Gaussian closure, that is, a structured Gaussian process posterior capturing the uncertainty in the output function of the neural operator, which can be evaluated at an arbitrary set of points. The method adds minimal prediction overhead, can be applied post-hoc without retraining the neural operator, and scales to large models and datasets. We showcase the efficacy of our approach through applications to different types of partial differential equations.

[LG-9] Massively Multiagent Minigames for Training Generalist Agents

Link: https://arxiv.org/abs/2406.05071
Authors: Kyoung Whan Choe, Ryan Sullivan, Joseph Suárez
Keywords: present Meta MMO, Meta MMO, Neural MMO, MMO, expands Neural MMO
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)

Abstract:We present Meta MMO, a collection of many-agent minigames for use as a reinforcement learning benchmark. Meta MMO is built on top of Neural MMO, a massively multiagent environment that has been the subject of two previous NeurIPS competitions. Our work expands Neural MMO with several computationally efficient minigames. We explore generalization across Meta MMO by learning to play several minigames with a single set of weights. We release the environment, baselines, and training code under the MIT license. We hope that Meta MMO will spur additional progress on Neural MMO and, more generally, will serve as a useful benchmark for many-agent generalization.

[LG-10] Pretraining Decision Transformers with Reward Prediction for In-Context Multi-task Structured Bandit Learning

Link: https://arxiv.org/abs/2406.05064
Authors: Subhojyoti Mukherjee, Josiah P. Hanna, Qiaomin Xie, Robert Nowak
Keywords: minimizes cumulative regret, cumulative regret, shared structure, test task, algorithm
Subjects: Machine Learning (cs.LG)

Abstract:In this paper, we study the multi-task structured bandit problem, where the goal is to learn a near-optimal algorithm that minimizes cumulative regret. The tasks share a common structure and the algorithm exploits the shared structure to minimize the cumulative regret for an unseen but related test task. We use a transformer as a decision-making algorithm to learn this shared structure so as to generalize to the test task. The prior work of pretrained decision transformers like DPT requires access to the optimal action during training, which may be hard in several scenarios. Diverging from these works, our learning algorithm does not need the knowledge of the optimal action per task during training but predicts a reward vector for each of the actions using only the observed offline data from the diverse training tasks. Finally, during inference time, it selects actions using the reward predictions, employing various exploration strategies in-context for an unseen test task. Our model outperforms other SOTA methods like DPT and Algorithmic Distillation over a series of experiments on several structured bandit problems (linear, bilinear, latent, non-linear). Interestingly, we show that our algorithm, without the knowledge of the underlying problem structure, can learn a near-optimal policy in-context by leveraging the shared structure across diverse tasks. We further extend the field of pre-trained decision transformers by showing that they can leverage unseen tasks with new actions and still learn the underlying latent structure to derive a near-optimal policy. We validate this over several experiments to show that our proposed solution is very general and has wide applications to potentially emergent online and offline strategies at test time. Finally, we theoretically analyze the performance of our algorithm and obtain generalization bounds in the in-context multi-task learning setting.
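As a rough illustration of the inference step described above, the following sketch turns a predicted reward vector into an action under three common exploration strategies. The abstract does not say which strategies the authors use, so greedy, epsilon-greedy, and softmax sampling are assumptions here, as are all function names.

```python
import math
import random


def greedy(pred_rewards):
    """Index of the action with the highest predicted reward."""
    return max(range(len(pred_rewards)), key=lambda a: pred_rewards[a])


def epsilon_greedy(pred_rewards, epsilon=0.1, rng=random):
    """With probability epsilon pick a uniformly random action, else be greedy."""
    if rng.random() < epsilon:
        return rng.randrange(len(pred_rewards))
    return greedy(pred_rewards)


def softmax_probs(pred_rewards, temperature=1.0):
    """Softmax distribution over actions (shifted by the max for stability)."""
    top = max(pred_rewards)
    weights = [math.exp((r - top) / temperature) for r in pred_rewards]
    total = sum(weights)
    return [w / total for w in weights]


def sample_softmax(pred_rewards, temperature=1.0, rng=random):
    """Sample an action in proportion to its softmax probability."""
    probs = softmax_probs(pred_rewards, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

In this sketch `pred_rewards` stands in for the model's per-action reward predictions; lowering `temperature` or `epsilon` moves both stochastic strategies toward the greedy choice.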

[LG-11] Hints-In-Browser: Benchmarking Language Models for Programming Feedback Generation

Link: https://arxiv.org/abs/2406.05053
Authors: Nachiket Kotalwar, Alkis Gotovos, Adish Singla
Keywords: hold great promise, generating individualized feedback, models hold great, enhancing programming education, hints for learners
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Generative AI and large language models hold great promise in enhancing programming education by generating individualized feedback and hints for learners. Recent works have primarily focused on improving the quality of generated feedback to achieve human tutors’ quality. While quality is an important performance criterion, it is not the only criterion to optimize for real-world educational deployments. In this paper, we benchmark language models for programming feedback generation across several performance criteria, including quality, cost, time, and data privacy. The key idea is to leverage recent advances in the new paradigm of in-browser inference that allow running these models directly in the browser, thereby providing direct benefits across cost and data privacy. To boost the feedback quality of small models compatible with in-browser inference engines, we develop a fine-tuning pipeline based on GPT-4 generated synthetic data. We showcase the efficacy of fine-tuned Llama3-8B and Phi3-3.8B 4-bit quantized models using WebLLM’s in-browser inference engine on three different Python programming datasets. We will release the full implementation along with a web app and datasets to facilitate further research on in-browser language models.

[LG-12] A Tensor Decomposition Perspective on Second-order RNNs

Link: https://arxiv.org/abs/2406.05045
Authors: Maude Lizaire, Michael Rizvi-Martel, Marawan Gamal Abdel Hameed, Guillaume Rabusseau
Keywords: Recurrent Neural Networks, Second-order Recurrent Neural, Neural Networks, Recurrent Neural, Second-order Recurrent
Subjects: Machine Learning (cs.LG)
Comments: Accepted at ICML 2024. Camera ready version

Abstract:Second-order Recurrent Neural Networks (2RNNs) extend RNNs by leveraging second-order interactions for sequence modelling. These models are provably more expressive than their first-order counterparts and have connections to well-studied models from formal language theory. However, their large parameter tensor makes computations intractable. To circumvent this issue, one approach known as MIRNN consists in limiting the type of interactions used by the model. Another is to leverage tensor decomposition to diminish the parameter count. In this work, we study the model resulting from parameterizing 2RNNs using the CP decomposition, which we call CPRNN. Intuitively, the rank of the decomposition should reduce expressivity. We analyze how rank and hidden size affect model capacity and show the relationships between RNNs, 2RNNs, MIRNNs, and CPRNNs based on these parameters. We support these results empirically with experiments on the Penn Treebank dataset which demonstrate that, with a fixed parameter budget, CPRNNs outperform RNNs, 2RNNs, and MIRNNs with the right choice of rank and hidden size.
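The sketch below illustrates why a CP-factorized parameter tensor is cheaper to use: the second-order interaction between the hidden state and the input can be computed from the three factor matrices without ever building the full tensor. It is a plain-Python illustration under assumptions: the names `A`, `B`, `C`, `full_tensor`, `second_order_preact`, and `cp_preact` are hypothetical, and bias terms and the nonlinearity are omitted.

```python
def full_tensor(A, B, C):
    """Build T[i][j][k] = sum_r A[i][r] * B[j][r] * C[k][r] (rank-R CP form)."""
    rank = len(A[0])
    return [
        [
            [sum(A[i][r] * B[j][r] * C[k][r] for r in range(rank))
             for k in range(len(C))]
            for j in range(len(B))
        ]
        for i in range(len(A))
    ]


def second_order_preact(T, h, x):
    """Second-order pre-activation: z[k] = sum_{i,j} T[i][j][k] * h[i] * x[j]."""
    return [
        sum(T[i][j][k] * h[i] * x[j]
            for i in range(len(h)) for j in range(len(x)))
        for k in range(len(T[0][0]))
    ]


def cp_preact(A, B, C, h, x):
    """Same pre-activation computed from the CP factors only.

    z[k] = sum_r C[k][r] * (sum_i A[i][r] * h[i]) * (sum_j B[j][r] * x[j])
    """
    rank = len(A[0])
    ah = [sum(A[i][r] * h[i] for i in range(len(h))) for r in range(rank)]
    bx = [sum(B[j][r] * x[j] for j in range(len(x))) for r in range(rank)]
    return [sum(C[k][r] * ah[r] * bx[r] for r in range(rank))
            for k in range(len(C))]
```

With hidden size d, input size n, and rank R, the full tensor stores d·n·d entries while the factors store R·(2d + n), which is the trade-off between rank and hidden size that the abstract refers to.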

[LG-13] Online Frequency Scheduling by Learning Parallel Actions

Link: https://arxiv.org/abs/2406.05041
Authors: Anastasios Giovanidis, Mathieu Leconte, Sabrine Aroua, Tor Kvernvik, David Sandberg
Keywords: Radio Resource Management, applications create strong, create strong competition, Radio Resource, Resource Management
Subjects: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: 9 pages, 5 figures, conference submission

Abstract:Radio Resource Management is a challenging topic in future 6G networks where novel applications create strong competition among the users for the available resources. In this work we consider the frequency scheduling problem in a multi-user MIMO system. Frequency resources need to be assigned to a set of users while allowing for concurrent transmissions in the same sub-band. Traditional methods are insufficient to cope with all the involved constraints and uncertainties, whereas reinforcement learning can directly learn near-optimal solutions for such complex environments. However, the scheduling problem has an enormous action space accounting for all the combinations of users and sub-bands, so out-of-the-box algorithms cannot be used directly. In this work, we propose a scheduler based on action-branching over sub-bands, which is a deep Q-learning architecture with parallel decision capabilities. The sub-bands learn correlated but local decision policies and altogether they optimize a global reward. To improve the scaling of the architecture with the number of sub-bands, we propose variations (Unibranch, Graph Neural Network-based) that reduce the number of parameters to learn. The parallel decision making of the proposed architecture allows to meet short inference time requirements in real systems. Furthermore, the deep Q-learning approach permits online fine-tuning after deployment to bridge the sim-to-real gap. The proposed architectures are evaluated against relevant baselines from the literature showing competitive performance and possibilities of online adaptation to evolving environments.

[LG-14] Efficient 3D Shape Generation via Diffusion Mamba with Bidirectional SSMs

Link: https://arxiv.org/abs/2406.05038
Authors: Shentong Mo
Keywords: state space approach, selective state space, Recent advancements, long sequence handling, efficient long sequence
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Recent advancements in sequence modeling have led to the development of the Mamba architecture, noted for its selective state space approach, offering a promising avenue for efficient long sequence handling. However, its application in 3D shape generation, particularly at high resolutions, remains underexplored. Traditional diffusion transformers (DiT) with self-attention mechanisms, despite their potential, face scalability challenges due to the cubic complexity of attention operations as input length increases. This complexity becomes a significant hurdle when dealing with high-resolution voxel sizes. To address this challenge, we introduce a novel diffusion architecture tailored for 3D point cloud generation, Diffusion Mamba (DiM-3D). This architecture forgoes traditional attention mechanisms, instead utilizing the inherent efficiency of the Mamba architecture to maintain linear complexity with respect to sequence length. DiM-3D is characterized by fast inference times and substantially lower computational demands, quantified in reduced Gflops, thereby addressing the key scalability issues of prior models. Our empirical results on the ShapeNet benchmark demonstrate that DiM-3D achieves state-of-the-art performance in generating high-fidelity and diverse 3D shapes. Additionally, DiM-3D shows superior capabilities in tasks like 3D point cloud completion. This not only proves the model's scalability but also underscores its efficiency in generating detailed, high-resolution voxels necessary for advanced 3D shape modeling, particularly excelling in environments requiring high-resolution voxel sizes. Through these findings, we illustrate the exceptional scalability and efficiency of the Diffusion Mamba framework in 3D shape generation, setting a new standard for the field and paving the way for future explorations in high-resolution 3D modeling technologies.

[LG-15] TimeSieve: Extracting Temporal Dynamics through Information Bottlenecks

链接: https://arxiv.org/abs/2406.05036
作者: Ninghui Feng,Songning Lai,Fobao Zhou,Zhenxiao Yin,Hang Zhao
关键词: Time series forecasting, increasingly popular research, popular research area, research area due, Time series
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

Abstract:Time series forecasting has become an increasingly popular research area due to its critical applications in various real-world domains such as traffic management, weather prediction, and financial analysis. Despite significant advancements, existing models face notable challenges, including the necessity of manual hyperparameter tuning for different datasets, and difficulty in effectively distinguishing signal from redundant features in data characterized by strong seasonality. These issues hinder the generalization and practical application of time series forecasting models. To solve these issues, we propose TimeSieve, an innovative time series forecasting model. Our approach employs wavelet transforms to preprocess time series data, effectively capturing multi-scale features without the need for additional parameters or manual hyperparameter tuning. Additionally, we introduce the information bottleneck theory that filters out redundant features from both detail and approximation coefficients, retaining only the most predictive information. This combination significantly improves the model’s accuracy. Extensive experiments demonstrate that our model outperforms existing state-of-the-art methods on 70% of the datasets, achieving higher predictive accuracy and better generalization across diverse datasets. Our results validate the effectiveness of our approach in addressing the key challenges in time series forecasting, paving the way for more reliable and efficient predictive models in practical applications. The code for our model is available at this https URL.
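
To make the "detail and approximation coefficients" mentioned in the abstract concrete, here is a minimal sketch of a parameter-free wavelet decomposition. It assumes the Haar wavelet, which the abstract does not specify, and the function names and sample series are hypothetical. It is not the TimeSieve implementation and does not include the information bottleneck step.

```python
import math

def haar_dwt(signal):
    """One level of the Haar transform: returns (approximation, detail)."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    s = math.sqrt(2.0)
    # Pairwise scaled sums capture the coarse trend, scaled differences the fine detail.
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail

def haar_idwt(approx, detail):
    """Invert one Haar level, recovering the original samples."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.append((a + d) / s)
        out.append((a - d) / s)
    return out

def multilevel_haar(signal, levels):
    """Split the approximation band repeatedly to obtain multi-scale features."""
    details = []
    approx = list(signal)
    for _ in range(levels):
        approx, d = haar_dwt(approx)
        details.append(d)
    return approx, details

# Hypothetical toy series of length 8, decomposed over two scales.
series = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
approx, details = multilevel_haar(series, levels=2)
```

The transform has no learnable parameters and is exactly invertible, which is the property that lets a downstream filter operate on the coefficients without losing the option to reconstruct the series.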

[LG-16] Gradient Descent on Logistic Regression with Non-Separable Data and Large Step Sizes

链接: https://arxiv.org/abs/2406.05033
作者: Si Yi Meng,Antonio Orvieto,Daniel Yiming Cao,Christopher De Sa
关键词: study gradient descent, critical step size, step sizes, step size, logistic regression problems
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

Abstract:We study gradient descent (GD) dynamics on logistic regression problems with large, constant step sizes. For linearly-separable data, it is known that GD converges to the minimizer with arbitrarily large step sizes, a property which no longer holds when the problem is not separable. In fact, the behaviour can be much more complex: a sequence of period-doubling bifurcations begins at the critical step size 2/λ, where λ is the largest eigenvalue of the Hessian at the solution. Using a smaller-than-critical step size guarantees convergence if initialized near the solution, but does this suffice globally? In one dimension, we show that a step size less than 1/λ suffices for global convergence. However, for all step sizes between 1/λ and the critical step size 2/λ, one can construct a dataset such that GD converges to a stable cycle. In higher dimensions, this is actually possible even for step sizes less than 1/λ. Our results show that although local convergence is guaranteed for all step sizes less than the critical step size, global convergence is not, and GD may instead converge to a cycle depending on the initialization.
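
The role of the critical step size 2/λ can be illustrated with a small one-dimensional experiment. The sketch below assumes a hypothetical toy dataset (two points at x = 1 with opposite labels), for which the minimizer is w = 0 and the Hessian there is λ = 0.25. It only shows convergence below 1/λ and a period-2 cycle above 2/λ; it does not reproduce the paper's constructed datasets that cycle at step sizes between 1/λ and 2/λ.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad(w, xs, ys):
    """Derivative of the mean logistic loss log(1 + exp(-y * x * w))."""
    return sum(-y * x * sigmoid(-y * x * w) for x, y in zip(xs, ys)) / len(xs)

def hessian(w, xs, ys):
    """Second derivative of the mean logistic loss (a scalar in 1D)."""
    return sum(x * x * sigmoid(y * x * w) * sigmoid(-y * x * w)
               for x, y in zip(xs, ys)) / len(xs)

def run_gd(w0, step, xs, ys, iters=2000):
    """Constant-step gradient descent; returns the list of iterates."""
    w, trajectory = w0, []
    for _ in range(iters):
        w -= step * grad(w, xs, ys)
        trajectory.append(w)
    return trajectory

# Hypothetical non-separable toy data: same feature, opposite labels.
xs, ys = [1.0, 1.0], [1, -1]
lam = hessian(0.0, xs, ys)   # curvature at the minimizer w = 0
critical = 2.0 / lam         # critical step size 2/λ

small = run_gd(3.0, 0.45 * critical, xs, ys)  # step below 1/λ
large = run_gd(3.0, 1.25 * critical, xs, ys)  # step above 2/λ
```

With the smaller step the iterates settle at the minimizer, while with the larger step the last iterates alternate between two values of opposite sign, i.e. GD has settled into a cycle of period two.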

[LG-17] Optimizing Automatic Differentiation with Deep Reinforcement Learning

链接: https://arxiv.org/abs/2406.05027
作者: Jamie Lohoff,Emre Neftci
关键词: computational fluid dynamics, robotics and finance, fluid dynamics, Jacobian, exact Jacobian
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

Abstract:Computing Jacobians with automatic differentiation is ubiquitous in many scientific domains such as machine learning, computational fluid dynamics, robotics and finance. Even small savings in the number of computations or memory usage in Jacobian computations can already incur massive savings in energy consumption and runtime. While there exist many methods that allow for such savings, they generally trade computational efficiency for approximations of the exact Jacobian. In this paper, we present a novel method to optimize the number of necessary multiplications for Jacobian computation by leveraging deep reinforcement learning (RL) and a concept called cross-country elimination while still computing the exact Jacobian. Cross-country elimination is a framework for automatic differentiation that phrases Jacobian accumulation as ordered elimination of all vertices on the computational graph where every elimination incurs a certain computational cost. We formulate the search for the optimal elimination order that minimizes the number of necessary multiplications as a single player game which is played by an RL agent. We demonstrate that this method achieves up to 33% improvements over state-of-the-art methods on several relevant tasks taken from diverse domains. Furthermore, we show that these theoretical gains translate into actual runtime improvements by providing a cross-country elimination interpreter in JAX that can efficiently execute the obtained elimination orders.
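
As a rough illustration of why the elimination order matters, the sketch below counts multiplications for vertex elimination on a small computational graph. It uses the standard cost rule that eliminating a vertex costs (number of predecessors) × (number of successors) multiplications and connects each predecessor to each successor. The graph, the vertex names, and the brute-force search are hypothetical stand-ins; the paper searches for the order with an RL agent, not by enumeration.

```python
from itertools import permutations

def elimination_cost(edges, order):
    """Total multiplications needed to eliminate intermediate vertices in `order`."""
    edges = set(edges)
    total = 0
    for v in order:
        preds = [i for (i, j) in edges if j == v]
        succs = [k for (j, k) in edges if j == v]
        # One multiplication per (predecessor, successor) pair.
        total += len(preds) * len(succs)
        # Remove v and connect its predecessors directly to its successors.
        edges = {(i, j) for (i, j) in edges if i != v and j != v}
        edges |= {(i, k) for i in preds for k in succs}
    return total

def best_order(edges, intermediates):
    """Exhaustive search over orders; only feasible for tiny graphs."""
    return min(permutations(intermediates),
               key=lambda order: elimination_cost(edges, order))

# Hypothetical graph: three inputs feed v1, then v1 -> v2 -> y.
EDGES = [("x1", "v1"), ("x2", "v1"), ("x3", "v1"), ("v1", "v2"), ("v2", "y")]

forward_cost = elimination_cost(EDGES, ["v1", "v2"])  # forward-mode order
reverse_cost = elimination_cost(EDGES, ["v2", "v1"])  # reverse-mode order
optimal = best_order(EDGES, ["v1", "v2"])
```

On this graph the reverse-mode order needs fewer multiplications than the forward-mode order, as expected for many inputs and a single output. On larger graphs the best order is often neither of the two, which is what makes the search worthwhile.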

[LG-18] Scaling up Probabilistic PDE Simulators with Structured Volumetric Information

链接: https://arxiv.org/abs/2406.05020
作者: Tim Weiland,Marvin Pförtner,Philipp Hennig
关键词: Modeling real-world problems, partial differential equations, scientific machine learning, Modeling real-world, differential equations
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:

Abstract:Modeling real-world problems with partial differential equations (PDEs) is a prominent topic in scientific machine learning. Classic solvers for this task continue to play a central role, e.g. to generate training data for deep learning analogues. Any such numerical solution is subject to multiple sources of uncertainty, both from limited computational resources and limited data (including unknown parameters). Gaussian process analogues to classic PDE simulation methods have recently emerged as a framework to construct fully probabilistic estimates of all these types of uncertainty. So far, much of this work focused on theoretical foundations, and as such is not particularly data efficient or scalable. Here we propose a framework combining a discretization scheme based on the popular Finite Volume Method with complementary numerical linear algebra techniques. Practical experiments, including a spatiotemporal tsunami simulation, demonstrate substantially improved scaling behavior of this approach over previous collocation-based techniques.

[LG-19] Adaptively Learning to Select-Rank in Online Platforms

链接: https://arxiv.org/abs/2406.05017
作者: Jingyuan Wang,Perry Dong,Ying Jin,Ruohan Zhan,Zhengyuan Zhou
关键词: content streaming services, streaming services, online platforms, platforms across e-commerce, e-commerce sites
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 25 pages in total. Includes 4 figures and a pdf. International conference on machine learning. PMLR, 2024

Abstract:Ranking algorithms are fundamental to various online platforms, from e-commerce sites to content streaming services. Our research addresses the challenge of adaptively ranking items from a candidate pool for heterogeneous users, a key component in personalizing user experience. We develop a user response model that considers diverse user preferences and the varying effects of item positions, aiming to optimize overall user satisfaction with the ranked list. We frame this problem within a contextual bandits framework, with each ranked list as an action. Our approach incorporates an upper confidence bound to adjust predicted user satisfaction scores and selects the ranking action that maximizes these adjusted scores, efficiently solved via maximum weight imperfect matching. We demonstrate that our algorithm achieves a cumulative regret bound of O(d√(NKT)) for ranking K out of N items in a d-dimensional context space over T rounds, under the assumption that user responses follow a generalized linear model. This regret alleviates dependence on the ambient action space, whose cardinality grows exponentially with N and K (thus rendering direct application of existing adaptive learning algorithms, such as UCB or Thompson sampling, infeasible). Experiments conducted on both simulated and real-world datasets demonstrate our algorithm outperforms the baseline.
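
The selection step described in the abstract can be sketched as an assignment problem: each item gets a score per position, an exploration bonus is added, and K distinct items are matched to K positions so that the total is maximal. The sketch below assumes an additive per-position score and uses brute-force enumeration in place of a proper maximum-weight matching solver. The score tables and the function name are hypothetical, not taken from the paper.

```python
from itertools import permutations

def select_ranking(predicted, bonus, k):
    """Pick k distinct items for positions 0..k-1, maximizing UCB-adjusted scores.

    predicted[i][pos] is the estimated satisfaction of item i at position pos;
    bonus[i][pos] is the matching upper-confidence exploration bonus.
    """
    n = len(predicted)
    best_items, best_value = None, float("-inf")
    # Enumerate every ordered choice of k items (only viable for tiny n and k).
    for items in permutations(range(n), k):
        value = sum(predicted[i][pos] + bonus[i][pos]
                    for pos, i in enumerate(items))
        if value > best_value:
            best_items, best_value = items, value
    return best_items, best_value

# Hypothetical scores for N = 4 items and K = 2 positions.
predicted = [[0.9, 0.5], [0.8, 0.7], [0.3, 0.2], [0.1, 0.1]]
no_bonus = [[0.0, 0.0] for _ in predicted]
# Item 2 has rarely been shown in position 1, so its uncertainty bonus is large.
explore_bonus = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.6], [0.0, 0.0]]

greedy_ranking, _ = select_ranking(predicted, no_bonus, k=2)
ucb_ranking, _ = select_ranking(predicted, explore_bonus, k=2)
```

Without the bonus the two items with the best estimates are ranked. With the bonus the uncertain item is promoted into the second position, which is how the confidence bound drives exploration.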

[LG-20] ADBA: Approximation Decision Boundary Approach for Black-Box Adversarial Attacks

链接: https://arxiv.org/abs/2406.04998
作者: Feiyang Wang,Xingquan Zuo,Hai Huang,Gang Chen
关键词: machine learning models, machine learning, target machine learning, perturbation directions, real-world applications
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 10 pages, 5 figures, conference

Abstract:Many machine learning models are susceptible to adversarial attacks, with decision-based black-box attacks representing the most critical threat in real-world applications. These attacks are extremely stealthy, generating adversarial examples using hard labels obtained from the target machine learning model. This is typically realized by optimizing perturbation directions, guided by decision boundaries identified through query-intensive exact search, significantly limiting the attack success rate. This paper introduces a novel approach using the Approximation Decision Boundary (ADB) to efficiently and accurately compare perturbation directions without precisely determining decision boundaries. The effectiveness of our ADB approach (ADBA) hinges on promptly identifying suitable ADB, ensuring reliable differentiation of all perturbation directions. For this purpose, we analyze the probability distribution of decision boundaries, confirming that using the distribution’s median value as ADB can effectively distinguish different perturbation directions, giving rise to the development of the ADBA-md algorithm. ADBA-md only requires four queries on average to differentiate any pair of perturbation directions, which is highly query-efficient. Extensive experiments on six well-known image classifiers clearly demonstrate the superiority of ADBA and ADBA-md over multiple state-of-the-art black-box attacks.
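
One reading of the comparison idea is that two perturbation directions can be ranked by querying the model once per direction at a shared radius (the ADB): if only one of them flips the label, its decision boundary is closer and it is the better direction. The sketch below follows that reading with a hypothetical two-dimensional hard-label classifier. The model, the radii, and the function names are assumptions, and the median-based choice of ADB used by ADBA-md is not implemented.

```python
import math

def toy_model(x):
    """Hypothetical hard-label classifier: exposes only the predicted class."""
    return 1 if x[0] + x[1] > 1.0 else 0

def is_adversarial(model, x, true_label, direction, radius):
    """Query the model once at x + radius * unit(direction)."""
    norm = math.sqrt(sum(c * c for c in direction))
    x_adv = [xi + radius * c / norm for xi, c in zip(x, direction)]
    return model(x_adv) != true_label

def compare_directions(model, x, true_label, d1, d2, adb):
    """Return the better direction, or None if this ADB cannot separate them.

    Uses exactly two queries and never locates either boundary precisely.
    """
    a1 = is_adversarial(model, x, true_label, d1, adb)
    a2 = is_adversarial(model, x, true_label, d2, adb)
    if a1 and not a2:
        return d1
    if a2 and not a1:
        return d2
    return None  # both or neither flipped: try a different ADB

x0, label0 = [0.0, 0.0], 0
d_axis, d_diag = [1.0, 0.0], [1.0, 1.0]

better = compare_directions(toy_model, x0, label0, d_axis, d_diag, adb=0.85)
too_small = compare_directions(toy_model, x0, label0, d_axis, d_diag, adb=0.5)
too_large = compare_directions(toy_model, x0, label0, d_axis, d_diag, adb=1.5)
```

In this toy setting the diagonal direction reaches the boundary at a distance of about 0.71 and the axis direction at 1.0, so a radius between the two separates them, while a radius that is too small or too large gives no information.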

[LG-21] The Price of Implicit Bias in Adversarially Robust Generalization

链接: https://arxiv.org/abs/2406.04981
作者: Nikolaos Tsilivis,Natalie Frank,Nathan Srebro,Julia Kempe
关键词: empirical risk minimization, robust empirical risk, robust ERM, risk minimization, empirical risk
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

Abstract:We study the implicit bias of optimization in robust empirical risk minimization (robust ERM) and its connection with robust generalization. In classification settings under adversarial perturbations with linear models, we study what type of regularization should ideally be applied for a given perturbation set to improve (robust) generalization. We then show that the implicit bias of optimization in robust ERM can significantly affect the robustness of the model and identify two ways this can happen; either through the optimization algorithm or the architecture. We verify our predictions in simulations with synthetic data and experimentally study the importance of implicit bias in robust ERM with deep neural networks.

[LG-22] UniTST: Effectively Modeling Inter-Series and Intra-Series Dependencies for Multivariate Time Series Forecasting

链接: https://arxiv.org/abs/2406.04975
作者: Juncheng Liu,Chenghao Liu,Gerald Woo,Yiwei Wang,Bryan Hooi,Caiming Xiong,Doyen Sahoo
关键词: emerged as powerful, powerful tools, tools for multivariate, MTSF, multivariate time series
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

Abstract:Transformer-based models have emerged as powerful tools for multivariate time series forecasting (MTSF). However, existing Transformer models often fall short of capturing both intricate dependencies across variate and temporal dimensions in MTS data. Some recent models are proposed to separately capture variate and temporal dependencies through either two sequential or parallel attention mechanisms. However, these methods cannot directly and explicitly learn the intricate inter-series and intra-series dependencies. In this work, we first demonstrate that these dependencies are very important as they usually exist in real-world data. To directly model these dependencies, we propose a transformer-based model UniTST containing a unified attention mechanism on the flattened patch tokens. Additionally, we add a dispatcher module which reduces the complexity and makes the model feasible for a potentially large number of variates. Although our proposed model employs a simple architecture, it offers compelling performance as shown in our extensive experiments on several datasets for time series forecasting.

[LG-23] Neural Laplace for learning Stochastic Differential Equations

链接: https://arxiv.org/abs/2406.04964
作者: Adrien Carrel
关键词: differential equations, learning diverse classes, Neural Laplace, ordinary differential equations, Stochastic differential equations
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Probability (math.PR)
*备注:

Abstract:Neural Laplace is a unified framework for learning diverse classes of differential equations (DE). For different classes of DE, this framework outperforms other approaches relying on neural networks that aim to learn classes of ordinary differential equations (ODE). However, many systems can’t be modelled using ODEs. Stochastic differential equations (SDE) are the mathematical tool of choice when modelling spatiotemporal DE dynamics under the influence of randomness. In this work, we review the potential applications of Neural Laplace to learn diverse classes of SDE, both from a theoretical and a practical point of view.

[LG-24] Learning Divergence Fields for Shift-Robust Graph Representations

链接: https://arxiv.org/abs/2406.04963
作者: Qitian Wu,Fan Nie,Chenxiao Yang,Junchi Yan
关键词: induce instance-level interdependence, involves certain geometries, generation often involves, induce instance-level, Real-world data generation
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted to ICML 2024. Source codes at this https URL

Abstract:Real-world data generation often involves certain geometries (e.g., graphs) that induce instance-level interdependence. This characteristic makes the generalization of learning models more difficult due to the intricate interdependent patterns that impact data-generative distributions and can vary from training to testing. In this work, we propose a geometric diffusion model with learnable divergence fields for the challenging generalization problem with interdependent data. We generalize the diffusion equation with stochastic diffusivity at each time step, which aims to capture the multi-faceted information flows among interdependent data. Furthermore, we derive a new learning objective through causal inference, which can guide the model to learn generalizable patterns of interdependence that are insensitive across domains. Regarding practical implementation, we introduce three model instantiations that can be considered as the generalized versions of GCN, GAT, and Transformers, respectively, which possess advanced robustness against distribution shifts. We demonstrate their promising efficacy for out-of-distribution generalization on diverse real-world datasets.

[LG-25] Nacala-Roof-Material: Drone Imagery for Roof Detection Classification and Segmentation to Support Mosquito-borne Disease Risk Assessment

链接: https://arxiv.org/abs/2406.04949
作者: Venkanna Babu Guthula,Stefan Oehmcke,Remigio Chilaule,Hui Zhang,Nico Lang,Ankit Kariryaa,Johan Mottelson,Christian Igel
关键词: remote sensing imagery, roof types based, malaria risk, assessment of malaria, increased risk
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

Abstract:As low-quality housing and in particular certain roof characteristics are associated with an increased risk of malaria, classification of roof types based on remote sensing imagery can support the assessment of malaria risk and thereby help prevent the disease. To support research in this area, we release the Nacala-Roof-Material dataset, which contains high-resolution drone images from Mozambique with corresponding labels delineating houses and specifying their roof types. The dataset defines a multi-task computer vision problem, comprising object detection, classification, and segmentation. In addition, we benchmarked various state-of-the-art approaches on the dataset. Canonical U-Nets, YOLOv8, and a custom decoder on pretrained DINOv2 served as baselines. We show that each of the methods has its advantages but none is superior on all tasks, which highlights the potential of our dataset for future research in multi-task learning. While the tasks are closely related, accurate segmentation of objects does not necessarily imply accurate instance separation, and vice versa. We address this general issue by introducing a variant of the deep ordinal watershed (DOW) approach that additionally separates the interior of objects, allowing for improved object delineation and separation. We show that our DOW variant is a generic approach that improves the performance of both U-Net and DINOv2 backbones, leading to a better trade-off between semantic segmentation and instance segmentation.

[LG-26] Multiple-input multiple-output modal testing of a Hawk T1A aircraft: A new full-scale dataset for structural health monitoring

链接: https://arxiv.org/abs/2406.04943
作者: James Wilson,Max D. Champneys,Matt Tipuric,Robin Mills,David J. Wagg,Timothy J. Rogers
关键词: measured vibration data, measured vibration, BAE Systems Hawk, enabling the development, structural health monitoring
类目: Systems and Control (eess.SY); Machine Learning (cs.LG)
*备注:

Abstract:The use of measured vibration data from structures has a long history of enabling the development of methods for inference and monitoring. In particular, applications based on system identification and structural health monitoring have risen to prominence over recent decades and promise significant benefits when implemented in practice. However, significant challenges remain in the development of these methods. The introduction of realistic, full-scale datasets will be an important contribution to overcoming these challenges. This paper presents a new benchmark dataset capturing the dynamic response of a decommissioned BAE Systems Hawk T1A. The dataset reflects the behaviour of a complex structure with a history of service that can still be tested in controlled laboratory conditions, using a variety of known loading and damage simulation conditions. As such, it provides a key stepping stone between simple laboratory test structures and in-service structures. In this paper, the Hawk structure is described in detail, alongside a comprehensive summary of the experimental work undertaken. Following this, key descriptive highlights of the dataset are presented, before a discussion of the research challenges that the data present. Using the dataset, non-linearity in the structure is demonstrated, as well as the sensitivity of the structure to damage of different types. The dataset is highly applicable to many academic enquiries and additional analysis techniques which will enable further advancement of vibration-based engineering techniques.

[LG-27] CarbonSense: A Multimodal Dataset and Baseline for Carbon Flux Modelling

链接: https://arxiv.org/abs/2406.04940
作者: Matthew Fortier,Mats L. Richter,Oliver Sonnentag,Chris Pal
关键词: Terrestrial carbon fluxes, provide vital information, Terrestrial carbon, carbon fluxes, fluxes provide vital
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 9 content pages, 11 reference pages, 9 appendix pages

Abstract:Terrestrial carbon fluxes provide vital information about our biosphere’s health and its capacity to absorb anthropogenic CO₂ emissions. The importance of predicting carbon fluxes has led to the emerging field of data-driven carbon flux modelling (DDCFM), which uses statistical techniques to predict carbon fluxes from biophysical data. However, the field lacks a standardized dataset to promote comparisons between models. To address this gap, we present CarbonSense, the first machine learning-ready dataset for DDCFM. CarbonSense integrates measured carbon fluxes, meteorological predictors, and satellite imagery from 385 locations across the globe, offering comprehensive coverage and facilitating robust model training. Additionally, we provide a baseline model using a current state-of-the-art DDCFM approach and a novel transformer-based model. Our experiments illustrate the potential gains that multimodal deep learning techniques can bring to this domain. By providing these resources, we aim to lower the barrier to entry for other deep learning researchers to develop new models and drive new advances in carbon flux modelling.

[LG-28] SpanGNN: Towards Memory-Efficient Graph Neural Networks via Spanning Subgraph Training

链接: https://arxiv.org/abs/2406.04938
作者: Xizhi Gu,Hongzheng Li,Shihong Gao,Xinyan Zhang,Lei Chen,Yingxia Shao
关键词: Graph Neural Networks, Neural Networks, Graph Neural, GNN training, mini-batch GNN training
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

Abstract:Graph Neural Networks (GNNs) have superior capability in learning graph data. Full-graph GNN training generally has high accuracy; however, it suffers from large peak memory usage and encounters the Out-of-Memory problem when handling large graphs. To address this memory problem, a popular solution is mini-batch GNN training. However, mini-batch GNN training increases the training variance and sacrifices the model accuracy. In this paper, we propose a new memory-efficient GNN training method using spanning subgraphs, called SpanGNN. SpanGNN trains GNN models over a sequence of spanning subgraphs, which are constructed starting from an empty structure. To overcome the excessive peak memory consumption problem, SpanGNN selects a set of edges from the original graph to incrementally update the spanning subgraph between every epoch. To ensure the model accuracy, we introduce two types of edge sampling strategies (i.e., variance-reduced and noise-reduced), which help SpanGNN select high-quality edges for the GNN learning. We conduct experiments with SpanGNN on widely used datasets, demonstrating SpanGNN’s advantages in the model performance and low peak memory usage.

[LG-29] SLOPE: Search with Learned Optimal Pruning-based Expansion

链接: https://arxiv.org/abs/2406.04935
作者: Davor Bokan,Zlatan Ajanovic,Bakir Lacevic
关键词: motion planning, planning and pathfinding, finding the shortest, promising completeness, Learned Optimal Pruning-based
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: presented at the ICAPS 2024 workshop on Bridging the Planning and Reinforcement Learning

Abstract:Heuristic search is often used for motion planning and pathfinding problems, for finding the shortest path in a graph while also promising completeness and optimal efficiency. The drawback is its space complexity, specifically storing all expanded child nodes in memory and sorting large lists of active nodes, which can be a problem in real-time scenarios with limited on-board computation. To combat this, we present the Search with Learned Optimal Pruning-based Expansion (SLOPE), which learns the distance of a node from a possible optimal path, unlike other approaches that learn a cost-to-go value. The unfavored nodes are then pruned according to the said distance, which in turn reduces the size of the open list. This ensures that the search explores only the region close to optimal paths while lowering memory and computational costs. Unlike traditional learning methods, our approach is orthogonal to estimating cost-to-go heuristics, offering a complementary strategy for improving search efficiency. We demonstrate the effectiveness of our approach by evaluating it as a standalone search method and in conjunction with learned heuristic functions, achieving comparable-or-better node expansion metrics, while lowering the number of child nodes in the open list. Our code is available at this https URL.

[LG-30] Optimal Recurrent Network Topologies for Dynamical Systems Reconstruction

链接: https://arxiv.org/abs/2406.04934
作者: Christoph Jürgen Hemmer,Manuel Brenner,Florian Hess,Daniel Durstewitz
关键词: time series measurements, underlying dynamical process, seek to infer, infer from time, time series
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Dynamical Systems (math.DS); Chaotic Dynamics (nlin.CD)
*备注:

Abstract:In dynamical systems reconstruction (DSR) we seek to infer from time series measurements a generative model of the underlying dynamical process. This is a prime objective in any scientific discipline, where we are particularly interested in parsimonious models with a low parameter load. A common strategy here is parameter pruning, removing all parameters with small weights. However, here we find this strategy does not work for DSR, where even low magnitude parameters can contribute considerably to the system dynamics. On the other hand, it is well known that many natural systems which generate complex dynamics, like the brain or ecological networks, have a sparse topology with comparatively few links. Inspired by this, we show that geometric pruning, where in contrast to magnitude-based pruning weights with a low contribution to an attractor’s geometrical structure are removed, indeed manages to reduce parameter load substantially without significantly hampering DSR quality. We further find that the networks resulting from geometric pruning have a specific type of topology, and that this topology, and not the magnitude of weights, is what is most crucial to performance. We provide an algorithm that automatically generates such topologies which can be used as priors for generative modeling of dynamical systems by RNNs, and compare it to other well studied topologies like small-world or scale-free networks.

[LG-31] Faster Than Lies: Real-time Deepfake Detection using Binary Neural Networks

链接: https://arxiv.org/abs/2406.04932
作者: Lanzino Romeo,Fontana Federico,Diko Anxhelo,Marini Marco Raoul,Cinque Luigi
关键词: Binary Neural Networks, Deepfake detection aims, online content, Deepfake detection, aims to contrast
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted at CVPR24 DFAD Workshop

Abstract:Deepfake detection aims to contrast the spread of deep-generated media that undermines trust in online content. While existing methods focus on large and complex models, the need for real-time detection demands greater efficiency. With this in mind, unlike previous work, we introduce a novel deepfake detection approach on images using Binary Neural Networks (BNNs) for fast inference with minimal accuracy loss. Moreover, our method incorporates Fast Fourier Transform (FFT) and Local Binary Pattern (LBP) as additional channel features to uncover manipulation traces in frequency and texture domains. Evaluations on COCOFake, DFFD, and CIFAKE datasets demonstrate our method’s state-of-the-art performance in most scenarios with a significant efficiency gain of up to a 20× reduction in FLOPs during inference. Finally, by exploring BNNs in deepfake detection to balance accuracy and efficiency, this work paves the way for future research on efficient deepfake detection.

[LG-32] Protein pathways as a catalyst to directed evolution of the topology of artificial neural networks

链接: https://arxiv.org/abs/2406.04929
作者: Oscar Lao,Konstantinos Zacharopoulos,Apostolos Fournaris,Rossano Schifanella,Ioannis Arapakis
关键词: evolving Artificial Neural, Artificial Neural Networks, Artificial Neural, Artificial Protein Network, evolving Artificial
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*备注: 8 pages, 6 figures

Abstract:In the present article, we propose a paradigm shift on evolving Artificial Neural Networks (ANNs) towards a new bio-inspired design that is grounded on the structural properties, interactions, and dynamics of protein networks (PNs): the Artificial Protein Network (APN). This introduces several advantages previously unrealized by state-of-the-art approaches in NE: (1) We can draw inspiration from how nature, thanks to millions of years of evolution, efficiently encodes protein interactions in the DNA to translate our APN to silicon DNA. This helps bridge the gap between syntax and semantics observed in current NE approaches. (2) We can learn from how nature builds networks in our genes, allowing us to design new and smarter networks through EA evolution. (3) We can perform EA crossover/mutation operations and evolution steps, replicating the operations observed in nature directly on the genotype of networks, thus exploring and exploiting the phenotypic space, such that we avoid getting trapped in sub-optimal solutions. (4) Our novel definition of APN opens new ways to leverage our knowledge about different living things and processes from biology. (5) Using biologically inspired encodings, we can model more complex demographic and ecological relationships (e.g., virus-host or predator-prey interactions), allowing us to optimise for multiple, often conflicting objectives.

[LG-33] AGBD: A Global-scale Biomass Dataset

链接: https://arxiv.org/abs/2406.04928
作者: Ghjulia Sialelli,Torben Peters,Jan D. Wegner,Konrad Schindler
关键词: humanity biggest challenges, Ground Biomass, Accurate estimates, biggest challenges, climate change
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注:

Abstract:Accurate estimates of Above Ground Biomass (AGB) are essential in addressing two of humanity’s biggest challenges, climate change and biodiversity loss. Existing datasets for AGB estimation from satellite imagery are limited. Either they focus on specific, local regions at high resolution, or they offer global coverage at low resolution. There is a need for a machine learning-ready, globally representative, high-resolution benchmark. Our findings indicate significant variability in biomass estimates across different vegetation types, emphasizing the necessity for a dataset that accurately captures global diversity. To address these gaps, we introduce a comprehensive new dataset that is globally distributed, covers a range of vegetation types, and spans several years. This dataset combines AGB reference data from the GEDI mission with data from Sentinel-2 and PALSAR-2 imagery. Additionally, it includes pre-processed high-level features such as a dense canopy height map, an elevation map, and a land-cover classification map. We also produce a dense, high-resolution (10m) map of AGB predictions for the entire area covered by the dataset. Rigorously tested, our dataset is accompanied by several benchmark models and is publicly available. It can be easily accessed using a single line of code, offering a solid basis for efforts towards global AGB estimation. The GitHub repository this http URL serves as a one-stop shop for all code and data.

[LG-34] Through the Thicket: A Study of Number-Oriented LLMs derived from Random Forest Models

Link: https://arxiv.org/abs/2406.04926
Authors: Michał Romaszewski,Przemysław Sekuła,Przemysław Głomb,Michał Cholewa,Katarzyna Kołodziej
Keywords: shown exceptional performance, text processing, Large Language Models, shown exceptional, Large Language
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
*Comments:

Abstract:Large Language Models (LLMs) have shown exceptional performance in text processing. Notably, LLMs can synthesize information from large datasets and explain their decisions similarly to human reasoning through a chain of thought (CoT). An emerging application of LLMs is the handling and interpreting of numerical data, where fine-tuning enhances their performance over basic inference methods. This paper proposes a novel approach to training LLMs using knowledge transfer from a random forest (RF) ensemble, leveraging its efficiency and accuracy. By converting RF decision paths into natural language statements, we generate outputs for LLM fine-tuning, enhancing the model’s ability to classify and explain its decisions. Our method includes verifying these rules through established classification metrics, ensuring their correctness. We also examine the impact of preprocessing techniques on the representation of numerical data and their influence on classification accuracy and rule correctness.
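
To make the idea of turning a decision path into a natural language statement concrete, here is a minimal sketch. It is not the authors' pipeline: the toy tree, the feature names, the thresholds, and the sentence template are all hypothetical, and a single hand-written tree stands in for one member of a random forest.

```python
# Hypothetical toy tree: feature names, thresholds, and labels are
# illustrative assumptions, not taken from the paper.
TREE = {
    "feature": "petal_length", "threshold": 2.5,
    "left": {"label": "setosa"},
    "right": {
        "feature": "petal_width", "threshold": 1.7,
        "left": {"label": "versicolor"},
        "right": {"label": "virginica"},
    },
}


def decision_path(tree, sample):
    """Walk the tree and collect the conditions met on the way to a leaf."""
    conditions = []
    node = tree
    while "label" not in node:
        feature, threshold = node["feature"], node["threshold"]
        if sample[feature] <= threshold:
            conditions.append((feature, "<=", threshold))
            node = node["left"]
        else:
            conditions.append((feature, ">", threshold))
            node = node["right"]
    return conditions, node["label"]


def path_to_sentence(conditions, label):
    """Render the collected conditions as one plain-English statement."""
    clauses = [f"{feature} {op} {threshold}" for feature, op, threshold in conditions]
    return "Because " + " and ".join(clauses) + f", the predicted class is {label}."


sample = {"petal_length": 4.8, "petal_width": 1.5}
conditions, label = decision_path(TREE, sample)
sentence = path_to_sentence(conditions, label)
print(sentence)
```

Statements of this kind could serve as fine-tuning targets that pair a prediction with its explanation; how the paper aggregates paths across the whole ensemble is not specified in the abstract.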

[LG-35] Sim-to-real Transfer of Deep Reinforcement Learning Agents for Online Coverage Path Planning

Link: https://arxiv.org/abs/2406.04920
Authors: Arvi Jonnarth,Ola Johansson,Michael Felsberg
Keywords: presents a difficult, difficult challenge, transfer presents, real world, environment
Categories: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
*Comments:

Abstract:Sim-to-real transfer presents a difficult challenge, where models trained in simulation are to be deployed in the real world. The distribution shift between the two settings leads to biased representations of the perceived real-world environment, and thus to suboptimal predictions. In this work, we tackle the challenge of sim-to-real transfer of reinforcement learning (RL) agents for coverage path planning (CPP). In CPP, the task is for a robot to find a path that visits every point of a confined area. Specifically, we consider the case where the environment is unknown, and the agent needs to plan the path online while mapping the environment. We bridge the sim-to-real gap through a semi-virtual environment with a simulated sensor and obstacles, while including real robot kinematics and real-time aspects. We investigate what level of fine-tuning is needed for adapting to a realistic setting, comparing to an agent trained solely in simulation. We find that a high model inference frequency is sufficient for reducing the sim-to-real gap, while fine-tuning degrades performance initially. By training the model in simulation and deploying it at a high inference frequency, we transfer state-of-the-art results from simulation to the real domain, where direct learning would take in the order of weeks with manual interaction, i.e., would be completely infeasible.

[LG-36] Combinatorial Complex Score-based Diffusion Modelling through Stochastic Differential Equations

Link: https://arxiv.org/abs/2406.04916
Authors: Adrien Carrel
Keywords: representing diverse patterns, transportation systems, applicable across domains, molecular chemistry, offer a versatile
Categories: Machine Learning (cs.LG); Social and Information Networks (cs.SI); Algebraic Topology (math.AT)
*Comments:

Abstract:Graph structures offer a versatile framework for representing diverse patterns in nature and complex systems, applicable across domains like molecular chemistry, social networks, and transportation systems. While diffusion models have excelled in generating various objects, generating graphs remains challenging. This thesis explores the potential of score-based generative models in generating such objects through a modelization as combinatorial complexes, which are powerful topological structures that encompass higher-order relationships. In this thesis, we propose a unified framework by employing stochastic differential equations. We not only generalize the generation of complex objects such as graphs and hypergraphs, but we also unify existing generative modelling approaches such as Score Matching with Langevin dynamics and Denoising Diffusion Probabilistic Models. This innovation overcomes limitations in existing frameworks that focus solely on graph generation, opening up new possibilities in generative AI. The experiment results showed that our framework could generate these complex objects, and could also compete against state-of-the-art approaches for mere graph and molecule generation tasks.

[LG-37] Submodular Framework for Structured-Sparse Optimal Transport

Link: https://arxiv.org/abs/2406.04914
Authors: Piyushi Manupriya,Pratik Jawanpuria,Karthik S. Gurumoorthy,SakethaNath Jagarlapudi,Bamdev Mishra
Keywords: Unbalanced optimal transport, handling un-normalized measures, Unbalanced optimal, sparse transport plans, robustness properties
Categories: Machine Learning (cs.LG)
*Comments:

Abstract:Unbalanced optimal transport (UOT) has recently gained much attention due to its flexible framework for handling un-normalized measures and its robustness properties. In this work, we explore learning (structured) sparse transport plans in the UOT setting, i.e., transport plans have an upper bound on the number of non-sparse entries in each column (structured sparse pattern) or in the whole plan (general sparse pattern). We propose novel sparsity-constrained UOT formulations building on the recently explored maximum mean discrepancy based UOT. We show that the proposed optimization problem is equivalent to the maximization of a weakly submodular function over a uniform matroid or a partition matroid. We develop efficient gradient-based discrete greedy algorithms and provide the corresponding theoretical guarantees. Empirically, we observe that our proposed greedy algorithms select a diverse support set and we illustrate the efficacy of the proposed approach in various applications.

[LG-38] Online Adaptation for Enhancing Imitation Learning Policies

Link: https://arxiv.org/abs/2406.04913
Authors: Federico Malato,Ville Hautamaki
Keywords: enables autonomous agents, learning enables autonomous, reward signal, Imitation learning enables, enables autonomous
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments: Accepted at IEEE Conference on Games 2024, Milan, Italy

Abstract:Imitation learning enables autonomous agents to learn from human examples, without the need for a reward signal. Still, if the provided dataset does not encapsulate the task correctly, or when the task is too complex to be modeled, such agents fail to reproduce the expert policy. We propose to recover from these failures through online adaptation. Our approach combines the action proposal coming from a pre-trained policy with relevant experience recorded by an expert. The combination results in an adapted action that closely follows the expert. Our experiments show that an adapted agent performs better than its pure imitation learning counterpart. Notably, adapted agents can achieve reasonable performance even when the base, non-adapted policy catastrophically fails.

[LG-39] PolyLUT-Add: FPGA-based LUT Inference with Wide Inputs

Link: https://arxiv.org/abs/2406.04910
Authors: Binglei Lou,Richard Rademacher,David Boland,Philip H.W. Leong
Keywords: deploying deep neural, deep neural networks, distinct advantages, technology for deploying, deploying deep
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
*Comments: To be published in the International Conference on Field-Programmable Logic and Applications (FPL) 2024

Abstract:FPGAs have distinct advantages as a technology for deploying deep neural networks (DNNs) at the edge. Lookup Table (LUT) based networks, where neurons are directly modelled using LUTs, help maximize this promise of offering ultra-low latency and high area efficiency on FPGAs. Unfortunately, LUT resource usage scales exponentially with the number of inputs to the LUT, restricting PolyLUT to small LUT sizes. This work introduces PolyLUT-Add, a technique that enhances neuron connectivity by combining $A$ PolyLUT sub-neurons via addition to improve accuracy. Moreover, we describe a novel architecture to improve its scalability. We evaluated our implementation over the MNIST, Jet Substructure classification and Network Intrusion Detection benchmarks and found that for similar accuracy, PolyLUT-Add achieves a LUT reduction of $1.3$-$7.7\times$ with a $1.2$-$2.2\times$ decrease in latency.
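
The exponential scaling mentioned in the abstract can be illustrated with some back-of-the-envelope arithmetic. The sketch below assumes a simplified cost model that is not taken from the paper: a neuron with `fan_in` inputs of `bits` bits each is stored as a table with `2 ** (fan_in * bits)` entries, and the cost of the adder that combines sub-neurons is ignored. All function names and the example numbers are hypothetical.

```python
def lut_entries(fan_in: int, bits: int) -> int:
    """Table entries for one neuron under the assumed cost model."""
    return 2 ** (fan_in * bits)


def wide_neuron_entries(fan_in: int, bits: int, num_sub: int) -> int:
    """One monolithic neuron that sees all num_sub * fan_in inputs at once."""
    return lut_entries(fan_in * num_sub, bits)


def added_sub_neuron_entries(fan_in: int, bits: int, num_sub: int) -> int:
    """num_sub small neurons whose outputs are summed (adder cost ignored)."""
    return num_sub * lut_entries(fan_in, bits)


# Hypothetical configuration: 6 inputs per sub-neuron, 2 bits each, 2 sub-neurons.
fan_in, bits, num_sub = 6, 2, 2
wide = wide_neuron_entries(fan_in, bits, num_sub)
split = added_sub_neuron_entries(fan_in, bits, num_sub)
print(f"monolithic: {wide} entries, split and added: {split} entries")
```

Under this model the split design grows linearly in the number of sub-neurons, while the monolithic table grows exponentially in the total input width, which is the motivation the abstract gives for combining sub-neurons by addition.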

[LG-40] Concept Drift Detection using Ensemble of Integrally Private Models

Link: https://arxiv.org/abs/2406.04903
Authors: Ayush K. Varshney,Vicenc Torra
Keywords: Deep neural networks, machine learning algorithm, Deep neural, neural networks, learning algorithm
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*Comments: Accepted for publication in MLCS co-located with ECML-PKDD 2023

Abstract:Deep neural networks (DNNs) are one of the most widely used machine learning algorithms. DNNs require the training data to be available beforehand with true labels. This is not feasible for many real-world problems where data arrives in streaming form and the acquisition of true labels is scarce and expensive. In the literature, not much focus has been given to the privacy prospect of the streaming data, where data may change its distribution frequently. These concept drifts must be detected privately in order to avoid any disclosure risk from DNNs. Existing privacy models use concept drift detection schemes such as ADWIN and KSWIN to detect the drifts. In this paper, we focus on the notion of integrally private DNNs to detect concept drifts. Integrally private DNNs are the models which recur frequently from different datasets. Based on this, we introduce an ensemble methodology which we call ‘Integrally Private Drift Detection’ (IPDD) method to detect concept drift from private models. Our IPDD method does not require labels to detect drift but assumes true labels are available once the drift has been detected. We have experimented with binary and multi-class synthetic and real-world data. Our experimental results show that our methodology can privately detect concept drift, has comparable utility (even better in some cases) to ADWIN and outperforms utility from different levels of differentially private models. The source code for the paper is available at this https URL.

[LG-41] From Link Prediction to Forecasting: Information Loss in Batch-based Temporal Graph Learning

Link: https://arxiv.org/abs/2406.04897
Authors: Moritz Lampert,Christopher Blöcker,Ingo Scholtes
Keywords: temporal edge patterns, Dynamic link prediction, recent works proposing, important problem considered, edge patterns
Categories: Machine Learning (cs.LG)
*Comments:

Abstract:Dynamic link prediction is an important problem considered by many recent works proposing various approaches for learning temporal edge patterns. To assess their efficacy, models are evaluated on publicly available benchmark datasets involving continuous-time and discrete-time temporal graphs. However, as we show in this work, the suitability of common batch-oriented evaluation depends on the datasets’ characteristics, which can cause two issues: First, for continuous-time temporal graphs, fixed-size batches create time windows with different durations, resulting in an inconsistent dynamic link prediction task. Second, for discrete-time temporal graphs, the sequence of batches can additionally introduce temporal dependencies that are not present in the data. In this work, we empirically show that this common evaluation approach leads to skewed model performance and hinders the fair comparison of methods. We mitigate this problem by reformulating dynamic link prediction as a link forecasting task that better accounts for temporal information present in the data. We provide implementations of our new evaluation method for commonly used graph learning frameworks.
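
The first issue named in the abstract, fixed-size batches covering time windows of different durations, can be seen on a toy edge stream. The sketch below is an illustration only: the edge list and both helper functions are hypothetical and are not the authors' evaluation code.

```python
from itertools import groupby

# Hypothetical (source, target, timestamp) edges, sorted by timestamp.
edges = [
    (0, 1, 0.0), (1, 2, 1.0), (2, 3, 2.0), (0, 2, 3.0),
    (1, 3, 50.0), (0, 3, 51.0), (2, 0, 52.0), (3, 1, 100.0),
]


def fixed_size_batches(edges, batch_size):
    """Split the stream into batches holding the same number of edges."""
    return [edges[i:i + batch_size] for i in range(0, len(edges), batch_size)]


def fixed_duration_windows(edges, window):
    """Group edges by the time window they fall into (empty windows are skipped)."""
    keyed = groupby(edges, key=lambda edge: int(edge[2] // window))
    return [list(group) for _, group in keyed]


def durations(batches):
    """Time span covered by each batch: last timestamp minus first."""
    return [batch[-1][2] - batch[0][2] for batch in batches]


size_batches = fixed_size_batches(edges, batch_size=4)
time_windows = fixed_duration_windows(edges, window=25.0)

print("fixed-size batch durations:", durations(size_batches))
print("fixed-duration window sizes:", [len(w) for w in time_windows])
```

With equal-sized batches the model is asked to predict over spans of very different length, whereas windows of equal duration keep the prediction horizon constant at the price of uneven batch sizes.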

[LG-42] Stabilizing Extreme Q-learning by Maclaurin Expansion

Link: https://arxiv.org/abs/2406.04896
Authors: Motoki Omura,Takayuki Osa,Yusuke Mukuta,Tatsuya Harada
Keywords: Regression is performed, Gumbel Regression, Expanded Extreme Q-learning, assumed Gumbel distribution, Extreme Q-learning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments: Accepted at RLC 2024: The first Reinforcement Learning Conference

Abstract:In Extreme Q-learning (XQL), Gumbel Regression is performed with an assumed Gumbel distribution for the error distribution. This allows learning of the value function without sampling out-of-distribution actions and has shown excellent performance mainly in Offline RL. However, issues remained, including the exponential term in the loss function causing instability and the potential for an error distribution diverging from the Gumbel distribution. Therefore, we propose Maclaurin Expanded Extreme Q-learning to enhance stability. In this method, applying Maclaurin expansion to the loss function in XQL enhances stability against large errors. It also allows adjusting the error distribution assumption from normal to Gumbel based on the expansion order. Our method significantly stabilizes learning in Online RL tasks from DM Control, where XQL was previously unstable. Additionally, it improves performance in several Offline RL tasks from D4RL, where XQL already showed excellent results.
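
A short sketch may help show why truncating the expansion tames the exponential term. It assumes the Gumbel regression loss takes the form `exp(z) - z - 1` for a scaled error `z`, which is an assumption here rather than something stated in the abstract; the function names are hypothetical. Under that assumption the Maclaurin series is the sum of `z**k / k!` for `k >= 2`, so an order-2 truncation is the squared error `z**2 / 2` (the normal-error case) and higher orders move back towards the Gumbel loss.

```python
import math


def gumbel_loss(z: float) -> float:
    """Assumed Gumbel regression loss for a scaled error z."""
    return math.exp(z) - z - 1.0


def maclaurin_loss(z: float, order: int) -> float:
    """Maclaurin expansion of the assumed loss, truncated at the given order."""
    return sum(z ** k / math.factorial(k) for k in range(2, order + 1))


# A large positive error blows up the exponential loss but not its truncations.
for order in (2, 4, 6):
    print(order, maclaurin_loss(10.0, order))
print("full", gumbel_loss(10.0))
```

For small errors the truncated and full losses nearly coincide, while for large errors the polynomial grows far more slowly than the exponential, which matches the stability argument in the abstract.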

[LG-43] Enhancing Indoor Temperature Forecasting through Synthetic Data in Low-Data Environments

Link: https://arxiv.org/abs/2406.04890
Authors: Zachari Thiry,Massimiliano Ruocco,Alessandro Nocente,Michail Spitieris
Keywords: achieve efficient control, HVAC systems, control of HVAC, data, synthetic data
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments:

Abstract:Forecasting indoor temperatures is important to achieve efficient control of HVAC systems. In this task, the limited data availability presents a challenge, as most of the available data is acquired during standard operation, where extreme scenarios and transitory regimes such as major temperature increases or decreases are de facto excluded. Acquisition of such data requires significant energy consumption and a dedicated facility, hindering the quantity and diversity of available data. Cost-related constraints, however, do not allow for continuous year-round acquisition. To address this, we investigate the efficacy of data augmentation techniques leveraging SoTA AI-based methods for synthetic data generation. Inspired by practical and experimental motivations, we explore fusion strategies of real and synthetic data to improve forecasting models. This approach alleviates the need for continuously acquiring extensive time series data, especially in contexts involving repetitive heating and cooling cycles in buildings. In our evaluation, 1) we assess the performance of synthetic data generators independently, particularly focusing on SoTA AI-based methods; 2) we measure the utility of incorporating synthetically augmented data in a subsequent forecasting task, where we employ a simple model in two distinct scenarios: (i) we first examine an augmentation technique that combines real and synthetically generated data to expand the training dataset, and (ii) we delve into utilizing synthetic data to tackle dataset imbalances. Our results highlight the potential of synthetic data augmentation in enhancing forecasting accuracy while mitigating training variance. Through empirical experiments, we show significant improvements achievable by integrating synthetic data, thereby paving the way for more robust forecasting models in low-data regimes.

[LG-44] Diversified Batch Selection for Training Acceleration

Link: https://arxiv.org/abs/2406.04872
Authors: Feng Hong,Yueming Lyu,Jiangchao Yao,Ya Zhang,Ivor W. Tsang,Yanfeng Wang
Keywords: modern machine learning, machine learning models, demands extensive training, extensive training time, resource consumption
Categories: Machine Learning (cs.LG)
*Comments: ICML 2024

Abstract:The remarkable success of modern machine learning models on large datasets often demands extensive training time and resource consumption. To save cost, a prevalent research line, known as online batch selection, explores selecting informative subsets during the training process. Although recent efforts achieve advancements by measuring the impact of each sample on generalization, their reliance on additional reference models inherently limits their practical applications, when there are no such ideal models available. On the other hand, the vanilla reference-model-free methods involve independently scoring and selecting data in a sample-wise manner, which sacrifices the diversity and induces the redundancy. To tackle this dilemma, we propose Diversified Batch Selection (DivBS), which is reference-model-free and can efficiently select diverse and representative samples. Specifically, we define a novel selection objective that measures the group-wise orthogonalized representativeness to combat the redundancy issue of previous sample-wise criteria, and provide a principled selection-efficient realization. Extensive experiments across various tasks demonstrate the significant superiority of DivBS in the performance-speedup trade-off. The code is publicly available.

[LG-45] Perturb-and-Project: Differentially Private Similarities and Marginals

Link: https://arxiv.org/abs/2406.04868
Authors: Vincent Cohen-Addad,Tommaso d’Orsi,Alessandro Epasto,Vahab Mirrokni,Peilin Zhong
Keywords: differential privacy, privacy where noise, noise is added, projected back
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Data Structures and Algorithms (cs.DS)
*Comments: 21 pages, ICML 2024

Abstract:We revisit the input perturbations framework for differential privacy where noise is added to the input $A \in \mathcal{S}$ and the result is then projected back to the space of admissible datasets $\mathcal{S}$. Through this framework, we first design novel efficient algorithms to privately release pair-wise cosine similarities. Second, we derive a novel algorithm to compute $k$-way marginal queries over $n$ features. Prior work could achieve comparable guarantees only for $k$ even. Furthermore, we extend our results to $t$-sparse datasets, where our efficient algorithms yield novel, stronger guarantees whenever $t \le n^{5/6}/\log n$. Finally, we provide a theoretical perspective on why fast input perturbation algorithms work well in practice. The key technical ingredients behind our results are tight sum-of-squares certificates upper bounding the Gaussian complexity of sets of solutions.

[LG-46] Deep learning for precipitation nowcasting: A survey from the perspective of time series forecasting

Link: https://arxiv.org/abs/2406.04867
Authors: Sojung An,Tae-Jin Oh,Eunha Sohn,Donghyun Kim
Keywords: estimate motion flow, time series precipitation, series precipitation forecasting, time series, precipitation forecasting
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*Comments: 21 pages, 7 figures, 5 tables

Abstract:Deep learning-based time series forecasting has dominated the short-term precipitation forecasting field with the help of its ability to estimate motion flow in high-resolution datasets. The growing interest in precipitation nowcasting offers substantial opportunities for the advancement of current forecasting technologies. Nevertheless, there has been a scarcity of in-depth surveys of time series precipitation forecasting using deep learning. Thus, this paper systematically reviews recent progress in time series precipitation forecasting models. Specifically, we investigate the following key points within background components, covering: i) preprocessing, ii) objective functions, and iii) evaluation metrics. We then categorize forecasting models into recursive and multiple strategies based on their approaches to predict future frames, investigate the impacts of models using the strategies, and performance assessments. Finally, we evaluate current deep learning-based models for precipitation forecasting on a public benchmark, discuss their limitations and challenges, and present some promising research directions. Our contribution lies in providing insights for a better understanding of time series precipitation forecasting and in aiding the development of robust AI solutions for the future.

[LG-47] Multi-View Stochastic Block Models

Link: https://arxiv.org/abs/2406.04860
Authors: Vincent Cohen-Addad,Tommaso d’Orsi,Silvio Lattanzi,Rajai Nasser
Keywords: practical applications, central topic, topic in unsupervised, multitude of practical, multi-view graph clustering
Categories: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)
*Comments: 31 pages, ICML 2024

Abstract:Graph clustering is a central topic in unsupervised learning with a multitude of practical applications. In recent years, multi-view graph clustering has gained a lot of attention for its applicability to real-world instances where one has access to multiple data sources. In this paper we formalize a new family of models, called multi-view stochastic block models, that captures this setting. For this model, we first study efficient algorithms that naively work on the union of multiple graphs. Then, we introduce a new efficient algorithm that provably outperforms previous approaches by analyzing the structure of each graph separately. Furthermore, we complement our results with an information-theoretic lower bound studying the limits of what can be done in this model. Finally, we corroborate our results with experimental evaluations.

[LG-48] A Near-Linear Time Approximation Algorithm for Beyond-Worst-Case Graph Clustering

Link: https://arxiv.org/abs/2406.04857
Authors: Vincent Cohen-Addad,Tommaso d’Orsi,Aida Mousavifar
Keywords: add arbitrary edges, arbitrary edges inside, remove arbitrary edges, random bipartite graph, arbitrary edges
Categories: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*Comments: 24 pages, ICML 2024

Abstract:We consider the semi-random graph model of [Makarychev, Makarychev and Vijayaraghavan, STOC’12], where, given a random bipartite graph with $\alpha$ edges and an unknown bipartition $(A, B)$ of the vertex set, an adversary can add arbitrary edges inside each community and remove arbitrary edges from the cut $(A, B)$ (i.e. all adversarial changes are monotone with respect to the bipartition). For this model, a polynomial time algorithm is known to approximate the Balanced Cut problem up to value $O(\alpha)$ [MMV’12] as long as the cut $(A, B)$ has size $\Omega(\alpha)$. However, it consists of slow subroutines requiring optimal solutions for logarithmically many semidefinite programs. We study the fine-grained complexity of the problem and present the first near-linear time algorithm that achieves similar performances to that of [MMV’12]. Our algorithm runs in time $O(|V(G)|^{1+o(1)} + |E(G)|^{1+o(1)})$ and finds a balanced cut of value $O(\alpha)$. Our approach appears easily extendible to related problems, such as Sparsest Cut, and also yields a near-linear time $O(1)$-approximation to Dasgupta’s objective function for hierarchical clustering [Dasgupta, STOC’16] for the semi-random hierarchical stochastic block model inputs of [Cohen-Addad, Kanade, Mallmann-Trenn, Mathieu, JACM’19].

[LG-49] Time-Series JEPA for Predictive Remote Control under Capacity-Limited Networks

Link: https://arxiv.org/abs/2406.04853
Authors: Abanoub M. Girgis,Alvaro Valcarce,Mehdi Bennis
Keywords: wireless sensor networks, massive wireless sensor, transmitting large data, large data volumes, remote control systems
Categories: Information Theory (cs.IT); Machine Learning (cs.LG); Robotics (cs.RO)
*Comments:

Abstract:In remote control systems, transmitting large data volumes (e.g. video feeds) from wireless sensors to faraway controllers is challenging when the uplink channel capacity is limited (e.g. RedCap devices or massive wireless sensor networks). Furthermore, the controllers often only need the information-rich components of the original data. To address this, we propose a Time-Series Joint Embedding Predictive Architecture (TS-JEPA) and a semantic actor trained through self-supervised learning. This approach harnesses TS-JEPA’s semantic representation power and predictive capabilities by capturing spatio-temporal correlations in the source data. We leverage this to optimize uplink channel utilization, while the semantic actor calculates control commands directly from the encoded representations, rather than from the original data. We test our model through multiple parallel instances of the well-known inverted cart-pole scenario, where the approach is validated through the maximization of stability under constrained uplink channel capacity.

[LG-50] CTBENCH: A Library and Benchmark for Certified Training

Link: https://arxiv.org/abs/2406.04848
Authors: Yuhao Mao,Stefan Balauca,Martin Vechev
Keywords: certifiably robust neural, robust neural networks, Training certifiably robust, challenging task, certifiably robust
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments:

Abstract:Training certifiably robust neural networks is an important but challenging task. While many algorithms for (deterministic) certified training have been proposed, they are often evaluated on different training schedules, certification methods, and systematically under-tuned hyperparameters, making it difficult to compare their performance. To address this challenge, we introduce CTBENCH, a unified library and a high-quality benchmark for certified training that evaluates all algorithms under fair settings and systematically tuned hyperparameters. We show that (1) almost all algorithms in CTBENCH surpass the corresponding reported performance in literature in the magnitude of algorithmic improvements, thus establishing new state-of-the-art, and (2) the claimed advantage of recent algorithms drops significantly when we enhance the outdated baselines with a fair training schedule, a fair certification method and well-tuned hyperparameters. Based on CTBENCH, we provide new insights into the current state of certified training and suggest future research directions. We are confident that CTBENCH will serve as a benchmark and testbed for future research in certified training.

[LG-51] FedLLM-Bench: Realistic Benchmarks for Federated Learning of Large Language Models

Link: https://arxiv.org/abs/2406.04845
Authors: Rui Ye,Rui Ge,Xinyu Zhu,Jingyi Chai,Yaxin Du,Yang Liu,Yanfeng Wang,Siheng Chen
Keywords: enabled multiple parties, collaboratively train large, train large language, large language models, sharing their data
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*Comments: 22 pages

Abstract:Federated learning has enabled multiple parties to collaboratively train large language models without directly sharing their data (FedLLM). Following this training paradigm, the community has put massive efforts from diverse aspects including framework, performance, and privacy. However, an unpleasant fact is that there are currently no realistic datasets and benchmarks for FedLLM and previous works all rely on artificially constructed datasets, failing to capture properties in real-world scenarios. Addressing this, we propose FedLLM-Bench, which involves 8 training methods, 4 training datasets, and 6 evaluation metrics, to offer a comprehensive testbed for the FedLLM community. FedLLM-Bench encompasses three datasets (e.g., user-annotated multilingual dataset) for federated instruction tuning and one dataset (e.g., user-annotated preference dataset) for federated preference alignment, whose scale of client number ranges from 38 to 747. Our datasets incorporate several representative diversities: language, quality, quantity, instruction, length, embedding, and preference, capturing properties in real-world scenarios. Based on FedLLM-Bench, we conduct experiments on all datasets to benchmark existing FL methods and provide empirical insights (e.g., multilingual collaboration). We believe that our FedLLM-Bench can benefit the FedLLM community by reducing required efforts, providing a practical testbed, and promoting fair comparisons. Code and datasets are available at this https URL.

[LG-52] Variational Flow Matching for Graph Generation

Link: https://arxiv.org/abs/2406.04843
Authors: Floor Eijkelboom,Grigory Bartosh,Christian Andersson Naesseth,Max Welling,Jan-Willem van de Meent
Keywords: variational flow matching, flow matching, variational inference, variational flow, VFM
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments:

Abstract:We present a formulation of flow matching as variational inference, which we refer to as variational flow matching (VFM). Based on this formulation we develop CatFlow, a flow matching method for categorical data. CatFlow is easy to implement, computationally efficient, and achieves strong results on graph generation tasks. In VFM, the objective is to approximate the posterior probability path, which is a distribution over possible end points of a trajectory. We show that VFM admits both the CatFlow objective and the original flow matching objective as special cases. We also relate VFM to score-based models, in which the dynamics are stochastic rather than deterministic, and derive a bound on the model likelihood based on a reweighted VFM objective. We evaluate CatFlow on one abstract graph generation task and two molecular generation tasks. In all cases, CatFlow exceeds or matches performance of the current state-of-the-art models.

[LG-53] Primitive Agentic First-Order Optimization

Link: https://arxiv.org/abs/2406.04841
Authors: R. Sala
Keywords: Efficient numerical optimization, Efficient numerical, partial state representations, improve performance, performance and reduce
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY); Optimization and Control (math.OC)
*Comments: 7 pages

Abstract:Efficient numerical optimization methods can improve performance and reduce the environmental impact of computing in many applications. This work presents a proof-of-concept study combining primitive state representations and agent-environment interactions as first-order optimizers in the setting of budget-limited optimization. Through reinforcement learning (RL) over a set of training instances of an optimization problem class, optimal policies for sequential update selection of algorithmic iteration steps are approximated in generally formulated low-dimensional partial state representations that consider aspects of progress and resource use. For the investigated case studies, deployment of the trained agents to unseen instances of the quadratic optimization problem classes outperformed conventional optimal algorithms with optimized hyperparameters. The results show that elementary RL methods combined with succinct partial state representations can be used as heuristics to manage complexity in RL-based optimization, paving the way for agentic optimization approaches.

[LG-54] Algorithms for learning value-aligned policies considering admissibility relaxation

Link: https://arxiv.org/abs/2406.04838
Authors: Andrés Holgado-Sánchez,Joaquín Arias,Holger Billhardt,Sascha Ossowski
Keywords: awareness engineering, claims that software, emerging field, accordance with human, software agents
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
*Comments:

Abstract:The emerging field of value awareness engineering claims that software agents and systems should be value-aware, i.e. they must make decisions in accordance with human values. In this context, such agents must be capable of explicitly reasoning as to how far different courses of action are aligned with these values. For this purpose, values are often modelled as preferences over states or actions, which are then aggregated to determine the sequences of actions that are maximally aligned with a certain value. Recently, additional value admissibility constraints at this level have been considered as well. However, often relaxed versions of these constraints are needed, and this increases considerably the complexity of computing value-aligned policies. To obtain efficient algorithms that make value-aligned decisions considering admissibility relaxation, we propose the use of learning techniques, in particular, we have used constrained reinforcement learning algorithms. In this paper, we present two algorithms, $\epsilon$-ADQL for strategies based on local alignment and its extension $\epsilon$-CADQL for a sequence of decisions. We have validated their efficiency in a water distribution problem in a drought scenario.

[LG-55] Black Box Differential Privacy Auditing Using Total Variation Distance

Link: https://arxiv.org/abs/2406.04827
Authors: Antti Koskela,Jafar Mohammadi
Keywords: machine learning model, small hold-out dataset, learning model, hold-out dataset, differential privacy
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Machine Learning (stat.ML)
Comments:

Click to view abstract

Abstract:We present a practical method to audit the differential privacy (DP) guarantees of a machine learning model using a small hold-out dataset that is not exposed to the model during the training. Having a score function such as the loss function employed during the training, our method estimates the total variation (TV) distance between scores obtained with a subset of the training data and the hold-out dataset. With some meta information about the underlying DP training algorithm, these TV distance values can be converted to $(\varepsilon,\delta)$-guarantees for any $\delta$. We show that these score distributions asymptotically give lower bounds for the DP guarantees of the underlying training algorithm; however, we perform a one-shot estimation for practicality reasons. We specify conditions that lead to lower bounds for the DP guarantees with high probability. To estimate the TV distance between the score distributions, we use a simple density estimation method based on histograms. We show that the TV distance gives a very close to optimally robust estimator and has an error rate $\mathcal{O}(k^{-1/3})$, where $k$ is the total number of samples. Numerical experiments on benchmark datasets illustrate the effectiveness of our approach and show improvements over baseline methods for black-box auditing.
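
Illustrative sketch (not taken from the paper): the abstract mentions estimating the TV distance between two score distributions with histograms. A minimal version of that idea could look as follows, assuming equal-width bins over the pooled range of scores. The function name `tv_distance_histogram` and the default `num_bins` are hypothetical choices, and the conversion from TV distance to DP guarantees is not covered here.

```python
from collections import Counter


def tv_distance_histogram(scores_a, scores_b, num_bins=10):
    """Estimate the total variation distance between two score samples
    by comparing normalized histograms over a shared set of bins."""
    lo = min(min(scores_a), min(scores_b))
    hi = max(max(scores_a), max(scores_b))
    if hi == lo:
        return 0.0  # all scores identical, so the histograms coincide
    width = (hi - lo) / num_bins

    def bin_index(x):
        # Clamp so that the maximum value falls into the last bin.
        return min(int((x - lo) / width), num_bins - 1)

    hist_a = Counter(bin_index(x) for x in scores_a)
    hist_b = Counter(bin_index(x) for x in scores_b)
    # TV distance is half the L1 distance between the bin-probability vectors.
    return 0.5 * sum(
        abs(hist_a[i] / len(scores_a) - hist_b[i] / len(scores_b))
        for i in range(num_bins)
    )
```

Identical samples give 0 and samples with no shared bins give 1; the number of bins trades bias against variance and would need to be chosen with the sample size in mind.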

[LG-56] Graph Mining under Data scarcity

Link: https://arxiv.org/abs/2406.04825
Authors: Appan Rakaraddi,Lam Siew-Kei,Mahardhika Pratama,Marcus de Carvalho
Keywords: Multitude of deep, Uncertainty Estimator, GNN backbone network, Uncertainty Estimator framework, node classification
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 7 pages, 2 figures

Click to view abstract

Abstract:A multitude of deep learning models have been proposed for node classification in graphs. However, they tend to perform poorly under labeled-data scarcity. Although Few-shot learning for graphs has been introduced to overcome this problem, the existing models are not easily adaptable for generic graph learning frameworks like Graph Neural Networks (GNNs). Our work proposes an Uncertainty Estimator framework that can be applied on top of any generic GNN backbone network (which are typically designed for supervised/semi-supervised node classification) to improve the node classification performance. A neural network is used to model the Uncertainty Estimator as a probability distribution rather than probabilistic discrete scalar values. We train these models under the classic episodic learning paradigm in the $n$-way, $k$-shot fashion, in an end-to-end setting. Our work demonstrates that implementation of the uncertainty estimator on a GNN backbone network improves the classification accuracy under Few-shot setting without any meta-learning specific architecture. We conduct experiments on multiple datasets under different Few-shot settings and different GNN-based backbone networks. Our method outperforms the baselines, which demonstrates the efficacy of the Uncertainty Estimator for Few-shot node classification on graphs with a GNN.
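
Illustrative sketch (not taken from the paper): the episodic n-way, k-shot paradigm mentioned in the abstract amounts to repeatedly sampling small classification tasks. Below is one way to draw a single episode from labeled nodes, assuming labels are given as a plain `{node_id: class_label}` dictionary. The function name `sample_episode` and the `q_query` parameter are hypothetical.

```python
import random


def sample_episode(labels, n_way, k_shot, q_query, rng=None):
    """Sample one n-way, k-shot episode from a {node_id: class_label} mapping.

    Returns (support, query): two lists of node ids. The support set holds
    k_shot nodes per sampled class, the query set q_query nodes per class.
    """
    rng = rng or random.Random()
    by_class = {}
    for node, label in labels.items():
        by_class.setdefault(label, []).append(node)
    # Only classes with enough nodes can fill both the support and query sets.
    eligible = [c for c, nodes in by_class.items() if len(nodes) >= k_shot + q_query]
    classes = rng.sample(eligible, n_way)
    support, query = [], []
    for c in classes:
        picked = rng.sample(by_class[c], k_shot + q_query)
        support.extend(picked[:k_shot])
        query.extend(picked[k_shot:])
    return support, query
```

A training loop would draw many such episodes, fit on the support nodes, and compute the loss on the query nodes.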

[LG-57] FunBO: Discovering Acquisition Functions for Bayesian Optimization with FunSearch

Link: https://arxiv.org/abs/2406.04824
Authors: Virginia Aglietti,Ira Ktena,Jessica Schrouff,Eleni Sgouritsa,Francisco J. R. Ruiz,Alexis Bellot,Silvia Chiappa
Keywords: carefully crafted acquisition, efficiency of Bayesian, Bayesian optimization algorithms, crafted acquisition functions, Bayesian optimization
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Click to view abstract

Abstract:The sample efficiency of Bayesian optimization algorithms depends on carefully crafted acquisition functions (AFs) guiding the sequential collection of function evaluations. The best-performing AF can vary significantly across optimization problems, often requiring ad-hoc and problem-specific choices. This work tackles the challenge of designing novel AFs that perform well across a variety of experimental settings. Based on FunSearch, a recent work using Large Language Models (LLMs) for discovery in mathematical sciences, we propose FunBO, an LLM-based method that can be used to learn new AFs written in computer code by leveraging access to a limited number of evaluations for a set of objective functions. We provide the analytic expression of all discovered AFs and evaluate them on various global optimization benchmarks and hyperparameter optimization tasks. We show how FunBO identifies AFs that generalize well in and out of the training distribution of functions, thus outperforming established general-purpose AFs and achieving competitive performance against AFs that are customized to specific function types and are learned via transfer-learning algorithms.

[LG-58] M2NO: Multiresolution Operator Learning with Multiwavelet-based Algebraic Multigrid Method

Link: https://arxiv.org/abs/2406.04822
Authors: Zhihao Li,Zhilu Lai,Xiaobo Wang,Wei Wang
Keywords: Solving partial differential, partial differential equations, increasing grid points, Solving partial, high-dimensional scenarios characterized
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Solving partial differential equations (PDEs) effectively necessitates a multi-scale approach, particularly critical in high-dimensional scenarios characterized by increasing grid points or resolution. Traditional methods often fail to capture the detailed features necessary for accurate modeling, presenting a significant challenge in scientific computing. In response, we introduce the Multiwavelet-based Algebraic Multigrid Neural Operator (M2NO), a novel deep learning framework that synergistically combines multiwavelet transformations and algebraic multigrid (AMG) techniques. By exploiting the inherent similarities between these two approaches, M2NO overcomes their individual limitations and enhances precision and flexibility across various PDE benchmarks. Employing Multiresolution Analysis (MRA) with high-pass and low-pass filters, the model executes hierarchical decomposition to accurately delineate both global trends and localized details within PDE solutions, supporting adaptive data representation at multiple scales. M2NO also automates node selection and adeptly manages complex boundary conditions through its multiwavelet-based operators. Extensive evaluations on a diverse array of PDE datasets with different boundary conditions confirm M2NO’s superior performance. Furthermore, M2NO excels in handling high-resolution and super-resolution tasks, consistently outperforming competing models and demonstrating robust adaptability in complex computational scenarios.

[LG-59] Skill-aware Mutual Information Optimisation for Generalisation in Reinforcement Learning

Link: https://arxiv.org/abs/2406.04815
Authors: Xuehui Yu,Mhairi Dunion,Xin Li,Stefano V. Albrecht
Keywords: varying environmental features, Skill-aware Mutual Information, Meta-Reinforcement Learning, modes of behaviours, Skill-aware Noise Contrastive
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:

Click to view abstract

Abstract:Meta-Reinforcement Learning (Meta-RL) agents can struggle to operate across tasks with varying environmental features that require different optimal skills (i.e., different modes of behaviours). Using context encoders based on contrastive learning to enhance the generalisability of Meta-RL agents is now widely studied but faces challenges such as the requirement for a large sample size, also referred to as the $\log$-$K$ curse. To improve RL generalisation to different tasks, we first introduce Skill-aware Mutual Information (SaMI), an optimisation objective that aids in distinguishing context embeddings according to skills, thereby equipping RL agents with the ability to identify and execute different skills across tasks. We then propose Skill-aware Noise Contrastive Estimation (SaNCE), a $K$-sample estimator used to optimise the SaMI objective. We provide a framework for equipping an RL agent with SaNCE in practice and conduct experimental validation on modified MuJoCo and Panda-gym benchmarks. We empirically find that RL agents that learn by maximising SaMI achieve substantially improved zero-shot generalisation to unseen tasks. Additionally, the context encoder equipped with SaNCE demonstrates greater robustness to reductions in the number of available samples, thus possessing the potential to overcome the $\log$-$K$ curse.

[LG-60] Online Continual Learning of Video Diffusion Models From a Single Video Stream

Link: https://arxiv.org/abs/2406.04814
Authors: Jason Yoo,Dylan Green,Geoff Pleiss,Frank Wood
Keywords: shown exceptional capabilities, generating realistic videos, shown exceptional, exceptional capabilities, capabilities in generating
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Diffusion models have shown exceptional capabilities in generating realistic videos. Yet, their training has been predominantly confined to offline environments where models can repeatedly train on i.i.d. data to convergence. This work explores the feasibility of training diffusion models from a semantically continuous video stream, where correlated video frames sequentially arrive one at a time. To investigate this, we introduce two novel continual video generative modeling benchmarks, Lifelong Bouncing Balls and Windows 95 Maze Screensaver, each containing over a million video frames generated from navigating stationary environments. Surprisingly, our experiments show that diffusion models can be effectively trained online using experience replay, achieving performance comparable to models trained with i.i.d. samples given the same number of gradient steps.

[LG-61] Generating Piano Practice Policy with a Gaussian Process

Link: https://arxiv.org/abs/2406.04812
Authors: Alexandra Moringen,Elad Vromen,Helge Ritter,Jason Friedman
Keywords: so-called practice modes, practice modes, practice, units that focus, play music comprise
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:A typical process of learning to play a piece on a piano consists of a progression through a series of practice units that focus on individual dimensions of the skill, the so-called practice modes. Practice modes in learning to play music comprise a particularly large set of possibilities, such as hand coordination, posture, articulation, ability to read a music score, correct timing or pitch, etc. Self-guided practice is known to be suboptimal, and a model that schedules optimal practice to maximize a learner’s progress still does not exist. Because we each learn differently and there are many choices for possible piano practice tasks and methods, the set of practice modes should be dynamically adapted to the human learner, a process typically guided by a teacher. However, having a human teacher guide individual practice is not always feasible since it is time-consuming, expensive, and often unavailable. In this work, we present a modeling framework to guide the human learner through the learning process by choosing the practice modes generated by a policy model. To this end, we present a computational architecture building on a Gaussian process that incorporates 1) the learner state, 2) a policy that selects a suitable practice mode, 3) performance evaluation, and 4) expert knowledge. The proposed policy model is trained to approximate the expert-learner interaction during a practice session. In our future work, we will test different Bayesian optimization techniques, e.g., different acquisition functions, and evaluate their effect on the learning progress.

[LG-62] VERA: Generating Visual Explanations of Two-Dimensional Embeddings via Region Annotation

Link: https://arxiv.org/abs/2406.04808
Authors: Pavlin G. Poličar,Blaž Zupan
Keywords: dimensionality reduction techniques, visualize high-dimensional data, Two-dimensional embeddings obtained, reduction techniques, obtained from dimensionality
Categories: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
Comments:

Click to view abstract

Abstract:Two-dimensional embeddings obtained from dimensionality reduction techniques, such as MDS, t-SNE, and UMAP, are widely used across various disciplines to visualize high-dimensional data. These visualizations provide a valuable tool for exploratory data analysis, allowing researchers to visually identify clusters, outliers, and other interesting patterns in the data. However, interpreting the resulting visualizations can be challenging, as it often requires additional manual inspection to understand the differences between data points in different regions of the embedding space. To address this issue, we propose Visual Explanations via Region Annotation (VERA), an automatic embedding-annotation approach that generates visual explanations for any two-dimensional embedding. VERA produces informative explanations that characterize distinct regions in the embedding space, allowing users to gain an overview of the embedding landscape at a glance. Unlike most existing approaches, which typically require some degree of manual user intervention, VERA produces static explanations, automatically identifying and selecting the most informative visual explanations to show to the user. We illustrate the usage of VERA on a real-world data set and validate the utility of our approach with a comparative user study. Our results demonstrate that the explanations generated by VERA are as useful as fully-fledged interactive tools on typical exploratory data analysis tasks but require significantly less time and effort from the user.

[LG-63] GENIE: Watermarking Graph Neural Networks for Link Prediction

Link: https://arxiv.org/abs/2406.04805
Authors: Venkata Sai Pranav Bachina,Ankit Gangwal,Aaryan Ajay Sharma,Charu Sharma
Keywords: utilizing graph-structured data, Graph Neural Networks, Neural Networks, graph-structured data, real world
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: 20 pages, 12 figures

Click to view abstract

Abstract:Graph Neural Networks (GNNs) have advanced the field of machine learning by utilizing graph-structured data, which is ubiquitous in the real world. GNNs have applications in various fields, ranging from social network analysis to drug discovery. GNN training is strenuous, requiring significant computational resources and human expertise. It makes a trained GNN an indispensable Intellectual Property (IP) for its owner. Recent studies have shown GNNs to be vulnerable to model-stealing attacks, which raises concerns over IP rights protection. Watermarking has been shown to be effective at protecting the IP of a GNN model. Existing efforts to develop a watermarking scheme for GNNs have only focused on the node classification and the graph classification tasks. To the best of our knowledge, we introduce the first-ever watermarking scheme for GNNs tailored to the Link Prediction (LP) task. We call our proposed watermarking scheme GENIE (watermarking Graph nEural Networks for lInk prEdiction). We design GENIE using a novel backdoor attack to create a trigger set for two key methods of LP: (1) node representation-based and (2) subgraph-based. In GENIE, the watermark is embedded into the GNN model by training it on both the trigger set and a modified training set, resulting in a watermarked GNN model. To assess a suspect model, we verify the watermark against the trigger set. We extensively evaluate GENIE across 3 model architectures (i.e., SEAL, GCN, and GraphSAGE) and 7 real-world datasets. Furthermore, we validate the robustness of GENIE against 11 state-of-the-art watermark removal techniques and 3 model extraction attacks. We also demonstrate that GENIE is robust against ownership piracy attack. Our ownership demonstration scheme statistically guarantees both False Positive Rate (FPR) and False Negative Rate (FNR) to be less than $10^{-6}$.

[LG-64] Predictive Dynamic Fusion

Link: https://arxiv.org/abs/2406.04802
Authors: Bing Cao,Yinan Xia,Yi Ding,Changqing Zhang,Qinghua Hu
Keywords: rendering holistic judgments, joint decision-making systems, holistic judgments, crucial in joint, joint decision-making
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 21 pages, 7 figures

Click to view abstract

Abstract:Multimodal fusion is crucial in joint decision-making systems for rendering holistic judgments. Since multimodal data changes in open environments, dynamic fusion has emerged and achieved remarkable progress in numerous applications. However, most existing dynamic multimodal fusion methods lack theoretical guarantees and easily fall into suboptimal problems, yielding unreliability and instability. To address this issue, we propose a Predictive Dynamic Fusion (PDF) framework for multimodal learning. We proceed to reveal the multimodal fusion from a generalization perspective and theoretically derive the predictable Collaborative Belief (Co-Belief) with Mono- and Holo-Confidence, which provably reduces the upper bound of generalization error. Accordingly, we further propose a relative calibration strategy to calibrate the predicted Co-Belief for potential uncertainty. Extensive experiments on multiple benchmarks confirm our superiority. Our code is available at this https URL.

[LG-65] Learning-Augmented Priority Queues

Link: https://arxiv.org/abs/2406.04793
Authors: Ziyad Benomar,Christian Coester
Keywords: computer science, fundamental and widely, widely used data, data structures, structures in computer
Categories: Data Structures and Algorithms (cs.DS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Priority queues are one of the most fundamental and widely used data structures in computer science. Their primary objective is to efficiently support the insertion of new elements with assigned priorities and the extraction of the highest priority element. In this study, we investigate the design of priority queues within the learning-augmented framework, where algorithms use potentially inaccurate predictions to enhance their worst-case performance. We examine three prediction models spanning different use cases, and show how the predictions can be leveraged to enhance the performance of priority queue operations. Moreover, we demonstrate the optimality of our solution and discuss some possible applications.
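
Illustrative sketch (not the paper's data structure): for reference, the two operations named in the abstract, insertion with a priority and extraction of the highest-priority element, look like this in a classical binary-heap priority queue built on Python's standard `heapq` module. No predictions are used here; the class name `MinPriorityQueue` is hypothetical, and "smaller key means higher priority" is a convention chosen for the example.

```python
import heapq


class MinPriorityQueue:
    """Binary-heap priority queue; a smaller key means higher priority."""

    def __init__(self):
        self._heap = []
        self._counter = 0  # tie-breaker so equal keys pop in insertion order

    def insert(self, key, item):
        # O(log n) comparisons in the worst case.
        heapq.heappush(self._heap, (key, self._counter, item))
        self._counter += 1

    def extract_min(self):
        # Removes and returns the (key, item) pair with the smallest key.
        key, _, item = heapq.heappop(self._heap)
        return key, item

    def __len__(self):
        return len(self._heap)
```

This is the kind of prediction-free baseline that a learning-augmented design would be compared against.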

[LG-66] Mobile Network Configuration Recommendation using Deep Generative Graph Neural Network

Link: https://arxiv.org/abs/2406.04779
Authors: Shirwan Piroti,Ashima Chawla,Tahar Zanouda
Keywords: Radio Access Telecom, Access Telecom Network, Access Telecom, Radio Access, Telecom Network
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
Comments: 4 pages, 4 figures

Click to view abstract

Abstract:There is a vast number of configurable parameters in a Radio Access Telecom Network. A significant share of these parameters is configured per Radio Node or cell based on the deployment setting. Traditional methods rely on domain knowledge for individual parameter configuration, often leading to sub-optimal results. To improve this, a framework using a Deep Generative Graph Neural Network (GNN) is proposed. It encodes the network into a graph, extracts subgraphs for each RAN node, and employs a Siamese GNN (S-GNN) to learn embeddings. The framework recommends values for a multitude of configuration parameters and detects misconfigurations, handling both network expansion and existing cell reconfiguration. Tested on real-world data, the model surpasses baselines, demonstrating accuracy, generalizability, and robustness against concept drift.

[LG-67] TDT Loss Takes It All: Integrating Temporal Dependencies among Targets into Non-Autoregressive Time Series Forecasting

Link: https://arxiv.org/abs/2406.04777
Authors: Qi Xiong,Kai Tang,Minbo Ma,Jie Xu,Tianrui Li
Keywords: Learning temporal dependencies, time series forecasting, TDT Loss, TDT, temporal dependencies
Categories: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Learning temporal dependencies among targets (TDT) benefits better time series forecasting, where targets refer to the predicted sequence. Although autoregressive methods model TDT recursively, they suffer from inefficient inference and error accumulation. We argue that integrating TDT learning into non-autoregressive methods is essential for pursuing effective and efficient time series forecasting. In this study, we introduce the differencing approach to represent TDT and propose a parameter-free and plug-and-play solution through an optimization objective, namely TDT Loss. It leverages the proportion of inconsistent signs between predicted and ground truth TDT as an adaptive weight, dynamically balancing target prediction and fine-grained TDT fitting. Importantly, TDT Loss incurs negligible additional cost, with only $\mathcal{O}(n)$ increased computation and $\mathcal{O}(1)$ memory requirements, while significantly enhancing the predictive performance of non-autoregressive models. To assess the effectiveness of TDT loss, we conduct extensive experiments on 7 widely used datasets. The experimental results of plugging TDT loss into 6 state-of-the-art methods show that out of the 168 experiments, 75.00% and 94.05% exhibit improvements in terms of MSE and MAE with the maximum 24.56% and 16.31%, respectively.
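
Illustrative sketch (one possible reading of the abstract, not the authors' code): the abstract describes first-order differencing of the target sequence and a weight given by the proportion of sign disagreements between predicted and true differences. The version below assumes squared error for both terms, takes the first difference against the last observed value, and assumes the weight multiplies the target term while its complement multiplies the difference term. All of these choices, and the name `tdt_loss`, are assumptions.

```python
def tdt_loss(pred, target, last_observed):
    """Illustrative TDT-style loss for one forecast sequence (plain lists).

    pred, target: predicted and true future values, same length n.
    last_observed: last value of the input window (assumed anchor for
    the first difference).
    """
    n = len(target)
    # First-order differences represent the temporal dependencies among targets.
    d_pred = [pred[0] - last_observed] + [pred[i] - pred[i - 1] for i in range(1, n)]
    d_true = [target[0] - last_observed] + [target[i] - target[i - 1] for i in range(1, n)]

    def sign(x):
        return (x > 0) - (x < 0)

    # rho: proportion of steps whose predicted and true differences disagree in sign.
    rho = sum(sign(a) != sign(b) for a, b in zip(d_pred, d_true)) / n
    loss_target = sum((p - t) ** 2 for p, t in zip(pred, target)) / n
    loss_tdt = sum((a - b) ** 2 for a, b in zip(d_pred, d_true)) / n
    # Assumed weighting direction: rho on the target term, (1 - rho) on the TDT term.
    return rho * loss_target + (1 - rho) * loss_tdt
```

The extra work is a single pass over the sequence, in line with the linear additional computation stated in the abstract.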

[LG-68] REP: Resource-Efficient Prompting for On-device Continual Learning

Link: https://arxiv.org/abs/2406.04772
Authors: Sungho Jeon,Xinyue Ma,Kwang In Kim,Myeongjae Jeon
Keywords: On-device continual learning, requires the co-optimization, resource efficiency, continual learning, efficiency
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages, 10 figures

Click to view abstract

Abstract:On-device continual learning (CL) requires the co-optimization of model accuracy and resource efficiency to be practical. This is extremely challenging because it must preserve accuracy while learning new tasks with continuously drifting data and maintain both high energy and memory efficiency to be deployable on real-world devices. Typically, a CL method leverages one of two types of backbone networks: CNN or ViT. It is commonly believed that CNN-based CL excels in resource efficiency, whereas ViT-based CL is superior in model performance, making each option attractive only for a single aspect. In this paper, we revisit this comparison while embracing powerful pre-trained ViT models of various sizes, including ViT-Ti (5.8M parameters). Our detailed analysis reveals that many practical options exist today for making ViT-based methods more suitable for on-device CL, even when accuracy, energy, and memory are all considered. To further expand this impact, we introduce REP, which improves resource efficiency specifically targeting prompt-based rehearsal-free methods. Our key focus is on avoiding catastrophic trade-offs with accuracy while trimming computational and memory costs throughout the training process. We achieve this by exploiting swift prompt selection that enhances input data using a carefully provisioned model, and by developing two novel algorithms, adaptive token merging (AToM) and adaptive layer dropping (ALD), that optimize the prompt updating stage. In particular, AToM and ALD perform selective skipping across the data and model-layer dimensions without compromising task-specific features in vision transformer models. Extensive experiments on three image classification datasets validate REP’s superior resource efficiency over current state-of-the-art methods.

[LG-69] Reinforcement Learning and Regret Bounds for Admission Control

Link: https://arxiv.org/abs/2406.04766
Authors: Lucas Weber,Ana Bušić,Jiamin Zhu
Keywords: Markov decision process, Markov decision, reinforcement learning algorithm, state space, action space
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Comments:

Click to view abstract

Abstract:The expected regret of any reinforcement learning algorithm is lower bounded by $\Omega\left(\sqrt{DXAT}\right)$ for undiscounted returns, where $D$ is the diameter of the Markov decision process, $X$ the size of the state space, $A$ the size of the action space and $T$ the number of time steps. However, this lower bound is general. A smaller regret can be obtained by taking into account some specific knowledge of the problem structure. In this article, we consider an admission control problem to an M/M/c/S queue with $m$ job classes and class-dependent rewards and holding costs. Queuing systems often have a diameter that is exponential in the buffer size $S$, making the previous lower bound prohibitive for any practical use. We propose an algorithm inspired by UCRL2, and use the structure of the problem to upper bound the expected total regret by $O(S\log T + \sqrt{mT \log T})$ in the finite server case. In the infinite server case, we prove that the dependence of the regret on $S$ disappears.
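
Illustrative arithmetic (not from the paper): to get a feel for why a diameter exponential in the buffer size makes the general bound impractical, the two expressions can be compared numerically with constants dropped. The bounds are read here as sqrt(D·X·A·T) and S·log T + sqrt(m·T·log T); the concrete values of `S`, `m`, `T` and the choices `D = 2**S`, `X = S + 1`, `A = 2**m` are hypothetical and only meant to show orders of magnitude.

```python
import math


def general_lower_bound(D, X, A, T):
    """Order of the general regret lower bound, sqrt(D*X*A*T), constants dropped."""
    return math.sqrt(D * X * A * T)


def structured_upper_bound(S, m, T):
    """Order of the structured bound, S*log(T) + sqrt(m*T*log(T)), constants dropped."""
    return S * math.log(T) + math.sqrt(m * T * math.log(T))


# Hypothetical problem size: buffer size S, m job classes, horizon T.
S, m, T = 20, 3, 10**6
# Assumed for illustration: diameter growing like 2**S, one state per
# queue length, one accept/reject decision per job class.
D, X, A = 2**S, S + 1, 2**m

general = general_lower_bound(D, X, A, T)
structured = structured_upper_bound(S, m, T)
ratio = general / structured  # how many times larger the general bound is
```

Under these assumed values the general expression is several orders of magnitude larger than the structured one, and the gap widens as `S` grows because only the former depends on the diameter.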

[LG-70] Probabilistic Weather Forecasting with Hierarchical Graph Neural Networks

Link: https://arxiv.org/abs/2406.04759
Authors: Joel Oskarsson,Tomas Landelius,Marc Peter Deisenroth,Fredrik Lindsten
Keywords: recent years, machine learning, powerful tool, tool for high-resolution, high-resolution weather forecasting
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 67 pages, 29 figures. Code is available at this https URL (global forecasting) and this https URL (limited area modeling)

Click to view abstract

Abstract:In recent years, machine learning has established itself as a powerful tool for high-resolution weather forecasting. While most current machine learning models focus on deterministic forecasts, accurately capturing the uncertainty in the chaotic weather system calls for probabilistic modeling. We propose a probabilistic weather forecasting model called Graph-EFM, combining a flexible latent-variable formulation with the successful graph-based forecasting framework. The use of a hierarchical graph construction allows for efficient sampling of spatially coherent forecasts. Requiring only a single forward pass per time step, Graph-EFM allows for fast generation of arbitrarily large ensembles. We experiment with the model on both global and limited area forecasting. Ensemble forecasts from Graph-EFM achieve equivalent or lower errors than comparable deterministic models, with the added benefit of accurately capturing forecast uncertainty.

[LG-71] Sales Whisperer: A Human-Inconspicuous Attack on LLM Brand Recommendations

Link: https://arxiv.org/abs/2406.04755
Authors: Weiran Lin,Anna Gerchanovsky,Omer Akgul,Lujo Bauer,Matt Fredrikson,Zifan Wang
Keywords: Large language model, Large language, prompting services, language model, users might rely
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Large language model (LLM) users might rely on others (e.g., prompting services) to write prompts. However, the risks of trusting prompts written by others remain unstudied. In this paper, we assess the risk of using such prompts on brand recommendation tasks when shopping. First, we found that paraphrasing prompts can result in LLMs mentioning given brands with drastically different probabilities, including a pair of prompts where the probability changes by 100%. Next, we developed an approach that can be used to perturb an original base prompt to increase the likelihood that an LLM mentions a given brand. We designed a human-inconspicuous algorithm that perturbs prompts, which empirically forces LLMs to mention strings related to a brand more often, by absolute improvements up to 78.3%. Our results suggest that our perturbed prompts 1) are inconspicuous to humans, 2) force LLMs to recommend a target brand more often, and 3) increase the perceived chances of picking targeted brands.

[LG-72] PQPP: A Joint Benchmark for Text-to-Image Prompt and Query Performance Prediction

Link: https://arxiv.org/abs/2406.04746
Authors: Eduard Poesina,Adriana Valentina Costache,Adrian-Gabriel Chifu,Josiane Mothe,Radu Tudor Ionescu
Keywords: generative diffusion models, visually impressive results, diffusion models, recently emerged, viable alternative
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Text-to-image generation has recently emerged as a viable alternative to text-to-image retrieval, due to the visually impressive results of generative diffusion models. Although query performance prediction is an active research topic in information retrieval, to the best of our knowledge, there is no prior study that analyzes the difficulty of queries (prompts) in text-to-image generation, based on human judgments. To this end, we introduce the first dataset of prompts which are manually annotated in terms of image generation performance. In order to determine the difficulty of the same prompts in image retrieval, we also collect manual annotations that represent retrieval performance. We thus propose the first benchmark for joint text-to-image prompt and query performance prediction, comprising 10K queries. Our benchmark enables: (i) the comparative assessment of the difficulty of prompts/queries in image generation and image retrieval, and (ii) the evaluation of prompt/query performance predictors addressing both generation and retrieval. We present results with several pre-generation/retrieval and post-generation/retrieval performance predictors, thus providing competitive baselines for future research. Our benchmark and code are publicly available under the CC BY 4.0 license at this https URL.

[LG-73] Confidence-aware Contrastive Learning for Selective Classification

Link: https://arxiv.org/abs/2406.04745
Authors: Yu-Chang Wu,Shen-Huan Lyu,Haopu Shang,Xiangyu Wang,Chao Qian
Keywords: Selective classification, selective classification model, Selective classification enables, classification enables models, sufficiently confident
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICML 2024

Click to view abstract

Abstract:Selective classification enables models to make predictions only when they are sufficiently confident, aiming to enhance safety and reliability, which is important in high-stakes scenarios. Previous methods mainly use deep neural networks and focus on modifying the architecture of classification layers to enable the model to estimate the confidence of its prediction. This work provides a generalization bound for selective classification, disclosing that optimizing feature layers helps improve the performance of selective classification. Inspired by this theory, we propose to explicitly improve the selective classification model at the feature level for the first time, leading to a novel Confidence-aware Contrastive Learning method for Selective Classification, CCL-SC, which similarizes the features of homogeneous instances and differentiates the features of heterogeneous instances, with the strength controlled by the model’s confidence. The experimental results on typical datasets, i.e., CIFAR-10, CIFAR-100, CelebA, and ImageNet, show that CCL-SC achieves significantly lower selective risk than state-of-the-art methods, across almost all coverage degrees. Moreover, it can be combined with existing methods to bring further improvement.
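
To make the terms "coverage" and "selective risk" in the abstract above concrete, here is a minimal sketch of how the two quantities are commonly computed for a confidence-thresholded classifier. It is not the CCL-SC method itself; the function name, the threshold rule, and the toy data are illustrative assumptions.

```python
def selective_risk_and_coverage(confidences, correct, threshold):
    """Abstain whenever confidence < threshold (an assumed, simple rule).

    coverage       = fraction of samples the model chooses to predict on
    selective risk = error rate measured only on those accepted samples
    """
    accepted = [ok for conf, ok in zip(confidences, correct) if conf >= threshold]
    coverage = len(accepted) / len(confidences)
    if not accepted:
        # Nothing accepted: no predictions were made, so no errors either.
        return coverage, 0.0
    risk = sum(1 for ok in accepted if not ok) / len(accepted)
    return coverage, risk


# Toy data (hypothetical): model confidence and whether the prediction was right.
confidences = [0.95, 0.80, 0.60, 0.55, 0.30]
correct = [True, True, False, True, False]

coverage, risk = selective_risk_and_coverage(confidences, correct, threshold=0.5)
print(coverage, risk)
```

Raising the threshold lowers coverage and, for a well-calibrated model, should also lower the selective risk; comparing methods "across coverage degrees" means sweeping this trade-off.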

[LG-74] When Swarm Learning meets energy series data: A decentralized collaborative learning design based on blockchain

Link: https://arxiv.org/abs/2406.04743
Authors: Lei Xu, Yulong Chen, Yuntian Chen, Longfeng Nie, Xuetao Wei, Liang Xue, Dongxiao Zhang
Keywords: infer essential unknown, essential unknown variables, Machine learning models, forecast future energy, future energy production
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC); Applications (stat.AP)

Abstract: Machine learning models offer the capability to forecast future energy production or consumption and infer essential unknown variables from existing data. However, legal and policy constraints within specific energy sectors render the data sensitive, presenting technical hurdles in utilizing data from diverse sources. Therefore, we propose adopting a Swarm Learning (SL) scheme, which replaces the centralized server with a blockchain-based distributed network to address the security and privacy issues inherent in Federated Learning (FL)'s centralized architecture. Within this distributed Collaborative Learning framework, each participating organization governs nodes for inter-organizational communication. Devices from various organizations utilize smart contracts for parameter uploading and retrieval. A consensus mechanism ensures distributed consistency throughout the learning process and guarantees the transparent trustworthiness and immutability of parameters on-chain. The efficacy of the proposed framework is substantiated across three real-world energy series modeling scenarios with superior performance compared to Local Learning approaches, simultaneously emphasizing enhanced data security and privacy over Centralized Learning and FL methods. Notably, as the data volume and the number of local epochs increase within a threshold, there is an improvement in model performance accompanied by a reduction in the variance of performance errors. Consequently, this leads to increased stability and reliability in the outcomes produced by the model.

[LG-75] A survey and benchmark of high-dimensional Bayesian optimization of discrete sequences

Link: https://arxiv.org/abs/2406.04739
Authors: Miguel González-Duque, Richard Michael, Simon Bartels, Yevgen Zainchkovskyy, Søren Hauberg, Wouter Boomsma
Keywords: Optimizing discrete black-box, protein engineering, drug design, Bayesian optimization, Optimizing discrete
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Optimizing discrete black-box functions is key in several domains, e.g. protein engineering and drug design. Due to the lack of gradient information and the need for sample efficiency, Bayesian optimization is an ideal candidate for these tasks. Several methods for high-dimensional continuous and categorical Bayesian optimization have been proposed recently. However, our survey of the field reveals highly heterogeneous experimental set-ups across methods and technical barriers for the replicability and application of published algorithms to real-world tasks. To address these issues, we develop a unified framework to test a vast array of high-dimensional Bayesian optimization methods and a collection of standardized black-box functions representing real-world application domains in chemistry and biology. These two components of the benchmark are each supported by flexible, scalable, and easily extendable software libraries (poli and poli-baselines), allowing practitioners to readily incorporate new optimization objectives or discrete optimizers. Project website: this https URL

[LG-76] Predicting Polymer Properties Based on Multimodal Multitask Pretraining

Link: https://arxiv.org/abs/2406.04727
Authors: Fanmeng Wang, Wentao Guo, Minjie Cheng, Shen Yuan, Hongteng Xu, Zhifeng Gao
Keywords: similar monomers covalently, bonding numerous identical, polymer property prediction, property prediction, polymer property
Categories: Machine Learning (cs.LG); Soft Condensed Matter (cond-mat.soft); Artificial Intelligence (cs.AI)

Abstract:In the past few decades, polymers, high-molecular-weight compounds formed by bonding numerous identical or similar monomers covalently, have played an essential role in various scientific fields. In this context, accurate prediction of their properties is becoming increasingly crucial. Typically, the properties of a polymer, such as plasticity, conductivity, bio-compatibility, and so on, are highly correlated with its 3D structure. However, current methods for predicting polymer properties heavily rely on information from polymer SMILES sequences (P-SMILES strings) while ignoring crucial 3D structural information, leading to sub-optimal performance. In this work, we propose MMPolymer, a novel multimodal multitask pretraining framework incorporating both polymer 1D sequential information and 3D structural information to enhance downstream polymer property prediction tasks. Besides, to overcome the limited availability of polymer 3D data, we further propose the “Star Substitution” strategy to extract 3D structural information effectively. During pretraining, MMPolymer not only predicts masked tokens and recovers 3D coordinates but also achieves the cross-modal alignment of latent representation. Subsequently, we further fine-tune the pretrained MMPolymer for downstream polymer property prediction tasks in the supervised learning paradigm. Experimental results demonstrate that MMPolymer achieves state-of-the-art performance in various polymer property prediction tasks. Moreover, leveraging the pretrained MMPolymer and using only one modality (either P-SMILES string or 3D conformation) during fine-tuning can also surpass existing polymer property prediction methods, highlighting the exceptional capability of MMPolymer in polymer feature extraction and utilization. Our online platform for polymer property prediction is available at https://app.bohrium.dp.tech/mmpolymer.

[LG-77] Probabilistic Perspectives on Error Minimization in Adversarial Reinforcement Learning

Link: https://arxiv.org/abs/2406.04724
Authors: Roman Belaire, Arunesh Sinha, Pradeep Varakantham
Keywords: Deep Reinforcement Learning, Deep Reinforcement, Reinforcement Learning, posing severe risks, policies are critically
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Deep Reinforcement Learning (DRL) policies are critically vulnerable to adversarial noise in observations, posing severe risks in safety-critical scenarios. For example, a self-driving car receiving manipulated sensory inputs about traffic signs could lead to catastrophic outcomes. Existing strategies to fortify RL algorithms against such adversarial perturbations generally fall into two categories: (a) using regularization methods that enhance robustness by incorporating adversarial loss terms into the value objectives, and (b) adopting “maximin” principles, which focus on maximizing the minimum value to ensure robustness. While regularization methods reduce the likelihood of successful attacks, their effectiveness drops significantly if an attack does succeed. On the other hand, maximin objectives, although robust, tend to be overly conservative. To address this challenge, we introduce a novel objective called Adversarial Counterfactual Error (ACoE), which naturally balances optimizing value and robustness against adversarial attacks. To optimize ACoE in a scalable manner in model-free settings, we propose a theoretically justified surrogate objective known as Cumulative-ACoE (C-ACoE). The core idea of optimizing C-ACoE is utilizing the belief about the underlying true state given the adversarially perturbed observation. Our empirical evaluations demonstrate that our method outperforms current state-of-the-art approaches for addressing adversarial RL problems across all established benchmarks (MuJoCo, Atari, and Highway) used in the literature.

[LG-78] FlowMM: Generating Materials with Riemannian Flow Matching

Link: https://arxiv.org/abs/2406.04713
Authors: Benjamin Kurt Miller, Ricky T. Q. Chen, Anuroop Sriram, Brandon M Wood
Keywords: unique computational challenges, presents unique computational, Crystalline materials, next-generation technologies, computational challenges
Categories: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph); Machine Learning (stat.ML)
Comments: this https URL

Abstract:Crystalline materials are a fundamental component in next-generation technologies, yet modeling their distribution presents unique computational challenges. Of the plausible arrangements of atoms in a periodic lattice only a vanishingly small percentage are thermodynamically stable, which is a key indicator of the materials that can be experimentally realized. Two fundamental tasks in this area are to (a) predict the stable crystal structure of a known composition of elements and (b) propose novel compositions along with their stable structures. We present FlowMM, a pair of generative models that achieve state-of-the-art performance on both tasks while being more efficient and more flexible than competing methods. We generalize Riemannian Flow Matching to suit the symmetries inherent to crystals: translation, rotation, permutation, and periodic boundary conditions. Our framework enables the freedom to choose the flow base distributions, drastically simplifying the problem of learning crystal structures compared with diffusion models. In addition to standard benchmarks, we validate FlowMM’s generated structures with quantum chemistry calculations, demonstrating that it is about 3x more efficient, in terms of integration steps, at finding stable materials compared to previous open methods.

[LG-79] ConDiff: A Challenging Dataset for Neural Solvers of Partial Differential Equations

Link: https://arxiv.org/abs/2406.04709
Authors: Vladislav Trifonov, Alexander Rudikov, Oleg Iliev, Ivan Oseledets, Ekaterina Muravleva
Keywords: scientific machine learning, ConDiff, present ConDiff, coefficient functions, problems
Categories: Machine Learning (cs.LG)

Abstract:We present ConDiff, a novel dataset for scientific machine learning. ConDiff focuses on the diffusion equation with varying coefficients, a fundamental problem in many applications of parametric partial differential equations (PDEs). The main novelty of the proposed dataset is that we consider discontinuous coefficients with high contrast. These coefficient functions are sampled from a selected set of distributions. This class of problems is not only of great academic interest, but is also the basis for describing various environmental and industrial problems. In this way, ConDiff shortens the gap with real-world problems while remaining fully synthetic and easy to use. ConDiff consists of a diverse set of diffusion equations with coefficients covering a wide range of contrast levels and heterogeneity with a measurable complexity metric for clearer comparison between different coefficient functions. We baseline ConDiff on standard deep learning models in the field of scientific machine learning. By providing a large number of problem instances, each with its own coefficient function and right-hand side, we hope to encourage the development of novel physics-based deep learning approaches, such as neural operators and physics-informed neural networks, ultimately driving progress towards more accurate and efficient solutions of complex PDE problems.

[LG-80] Winner-takes-all learners are geometry-aware conditional density estimators

Link: https://arxiv.org/abs/2406.04706
Authors: Victor Letzelter (LTCI, S2A, IDS, IP Paris), David Perera (LTCI, S2A, IDS, IP Paris), Cédric Rommel, Mathieu Fontaine (LTCI, S2A, IDS, IP Paris), Slim Essid (IDS, S2A, LTCI, IP Paris), Gael Richard (S2A, IDS, LTCI, IP Paris), Patrick Pérez
Keywords: simple learning paradigm, handles ambiguous tasks, learning paradigm, simple learning, handles ambiguous
Categories: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Signal Processing (eess.SP); Probability (math.PR); Machine Learning (stat.ML)
Comments: International Conference on Machine Learning, Jul 2024, Vienna, Austria

Abstract: Winner-takes-all training is a simple learning paradigm, which handles ambiguous tasks by predicting a set of plausible hypotheses. Recently, a connection was established between Winner-takes-all training and centroidal Voronoi tessellations, showing that, once trained, hypotheses should quantize optimally the shape of the conditional distribution to predict. However, the best use of these hypotheses for uncertainty quantification is still an open question. In this work, we show how to leverage the appealing geometric properties of the Winner-takes-all learners for conditional density estimation, without modifying its original training scheme. We theoretically establish the advantages of our novel estimator both in terms of quantization and density estimation, and we demonstrate its competitiveness on synthetic and real-world datasets, including audio data.

[LG-81] Marking the Pace: A Blockchain-Enhanced Privacy-Traceable Strategy for Federated Recommender Systems

Link: https://arxiv.org/abs/2406.04702
Authors: Zhen Cai, Tao Tang, Shuo Yu, Yunpeng Xiao, Feng Xia
Keywords: Internet of Things, data sharing, model updates, continuous model updates, data
Categories: Machine Learning (cs.LG)

Abstract:Federated recommender systems have been crucially enhanced through data sharing and continuous model updates, attributed to the pervasive connectivity and distributed computing capabilities of Internet of Things (IoT) devices. Given the sensitivity of IoT data, transparent data processing in data sharing and model updates is paramount. However, existing methods fall short in tracing the flow of shared data and the evolution of model updates. Consequently, data sharing is vulnerable to exploitation by malicious entities, raising significant data privacy concerns, while excluding data sharing will result in sub-optimal recommendations. To mitigate these concerns, we present LIBERATE, a privacy-traceable federated recommender system. We design a blockchain-based traceability mechanism, ensuring data privacy during data sharing and model updates. We further enhance privacy protection by incorporating local differential privacy in user-server communication. Extensive evaluations with the real-world dataset corroborate LIBERATE’s capabilities in ensuring data privacy during data sharing and model update while maintaining efficiency and performance. Results underscore blockchain-based traceability mechanism as a promising solution for privacy-preserving in federated recommender systems.

[LG-82] LLM-Vectorizer: LLM-based Verified Loop Vectorizer

Link: https://arxiv.org/abs/2406.04693
Authors: Jubi Taneja, Avery Laird, Cong Yan, Madan Musuvathi, Shuvendu K. Lahiri
Keywords: computing applications operating, performance computing applications, powerful optimization technique, large data arrays, powerful optimization
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Performance (cs.PF)

Abstract: Vectorization is a powerful optimization technique that significantly boosts the performance of high performance computing applications operating on large data arrays. Despite decades of research on auto-vectorization, compilers frequently miss opportunities to vectorize code. On the other hand, writing vectorized code manually using compiler intrinsics is still a complex, error-prone task that demands deep knowledge of specific architecture and compilers. In this paper, we evaluate the potential of large-language models (LLMs) to generate vectorized (Single Instruction Multiple Data) code from scalar programs that process individual array elements. We propose a novel finite-state machine multi-agents based approach that harnesses LLMs and test-based feedback to generate vectorized code. Our findings indicate that LLMs are capable of producing high performance vectorized code with run-time speedup ranging from 1.1x to 9.4x as compared to the state-of-the-art compilers such as Intel Compiler, GCC, and Clang. To verify the correctness of vectorized code, we use Alive2, a leading bounded translation validation tool for LLVM IR. We describe a few domain-specific techniques to improve the scalability of Alive2 on our benchmark dataset. Overall, our approach is able to verify 38.2% of vectorizations as correct on the TSVC benchmark dataset.

[LG-83] Higher-order Structure Based Anomaly Detection on Attributed Networks

Link: https://arxiv.org/abs/2406.04690
Authors: Xu Yuan, Na Zhou, Shuo Yu, Huafei Huang, Zhikui Chen, Feng Xia
Keywords: telecom fraud detection, medical image detection, Anomaly detection, telecom fraud, medical image
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract: Anomaly detection (such as telecom fraud detection and medical image detection) has attracted increasing attention. Complex interactions between multiple entities widely exist in networks and can reflect specific human behavior patterns. Such patterns can be modeled by higher-order network structures, thus benefiting anomaly detection on attributed networks. However, due to the lack of an effective mechanism in most existing graph learning methods, these complex interaction patterns fail to be applied in detecting anomalies, hindering the progress of anomaly detection to some extent. In order to address the aforementioned issue, we present a higher-order structure based anomaly detection (GUIDE) method. We exploit attribute autoencoder and structure autoencoder to reconstruct node attributes and higher-order structures, respectively. Moreover, we design a graph attention layer to evaluate the significance of neighbors to nodes through their higher-order structure differences. Finally, we leverage node attribute and higher-order structure reconstruction errors to find anomalies. Extensive experiments on five real-world datasets (i.e., ACM, Citation, Cora, DBLP, and Pubmed) are implemented to verify the effectiveness of GUIDE. Experimental results in terms of ROC-AUC, PR-AUC, and Recall@K show that GUIDE significantly outperforms the state-of-the-art methods.
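
As a rough illustration of the last step described above (ranking nodes by reconstruction error and scoring the ranking with Recall@K), the sketch below combines two per-node error lists into one anomaly score. The weighting parameter `alpha`, the function names, and the toy numbers are assumptions for illustration, not values or APIs taken from the GUIDE paper.

```python
def anomaly_scores(attr_err, struct_err, alpha=0.5):
    """Weighted sum of attribute and structure reconstruction errors per node.

    alpha is a hypothetical balance weight; higher score = more anomalous.
    """
    return [alpha * a + (1 - alpha) * s for a, s in zip(attr_err, struct_err)]


def recall_at_k(scores, labels, k):
    """Fraction of true anomalies (label 1) found among the k highest scores."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = sum(labels[i] for i in ranked[:k])
    return hits / sum(labels)


# Toy reconstruction errors for five nodes (made-up numbers).
attr_err = [0.10, 0.90, 0.20, 0.80, 0.15]
struct_err = [0.20, 0.70, 0.10, 0.30, 0.90]
labels = [0, 1, 0, 0, 1]  # nodes 1 and 4 are the true anomalies

scores = anomaly_scores(attr_err, struct_err)
print(recall_at_k(scores, labels, k=2), recall_at_k(scores, labels, k=3))
```

In this toy case node 4 has a small attribute error but a large structure error, so it is only recovered because both error types contribute to the score.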

[LG-84] LogiCode: an LLM-Driven Framework for Logical Anomaly Detection

Link: https://arxiv.org/abs/2406.04687
Authors: Yiheng Zhang, Yunkang Cao, Xiaohao Xu, Weiming Shen
Keywords: Large Language Models, leverages Large Language, Language Models, Large Language, leverages Large
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Abstract: This paper presents LogiCode, a novel framework that leverages Large Language Models (LLMs) for identifying logical anomalies in industrial settings, moving beyond the traditional focus on structural inconsistencies. By harnessing LLMs for logical reasoning, LogiCode autonomously generates Python code to pinpoint anomalies such as incorrect component quantities or missing elements, marking a significant leap forward in anomaly detection technologies. A custom dataset “LOCO-Annotations” and a benchmark “LogiBench” are introduced to evaluate LogiCode’s performance across various metrics including binary classification accuracy, code generation success rate, and precision in reasoning. Findings demonstrate LogiCode’s enhanced interpretability, significantly improving the accuracy of logical anomaly detection and offering detailed explanations for identified anomalies. This represents a notable shift towards more intelligent, LLM-driven approaches in industrial anomaly detection, promising substantial impacts on industry-specific applications.

[LG-85] Advanced Payment Security System: XGBoost, CatBoost and SMOTE Integrated

Link: https://arxiv.org/abs/2406.04658
Authors: Qi Zheng, Chang Yu, Jin Cao, Yongshun Xu, Qianwen Xing, Yinxin Jin
Keywords: mobile payment systems, Payment Security Protection, Synthetic Minority Over-sampling, Minority Over-sampling Technique, robust Payment Security
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: This paper is received by this https URL

Abstract: With the rise of various online and mobile payment systems, transaction fraud has become a significant threat to financial security. This study explores the application of advanced machine learning models, specifically XGBoost and LightGBM, for developing a more accurate and robust Payment Security Protection Model. To enhance data reliability, we meticulously processed the data sources and used SMOTE (Synthetic Minority Over-sampling Technique) to address class imbalance and improve data representation. By selecting highly correlated features, we aimed to strengthen the training process and boost model performance. We conducted thorough performance evaluations of our proposed models, comparing them against traditional methods including Random Forest, Neural Network, and Logistic Regression. Key metrics such as Precision, Recall, and F1 Score were used to rigorously assess their effectiveness. Our detailed analyses and comparisons reveal that the combination of SMOTE with XGBoost and LightGBM offers a highly efficient and powerful mechanism for payment security protection. The results show that these models not only outperform traditional approaches but also hold significant promise for advancing the field of transaction fraud prevention.
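
For readers unfamiliar with SMOTE, the core idea is to create synthetic minority-class samples by interpolating between a minority sample and one of its nearest minority-class neighbours. The following is a minimal standard-library sketch of that idea, not the pipeline used in the paper; the function name, the choice of `k`, and the toy points are assumptions.

```python
import random


def smote(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points from a list of minority-class tuples.

    Each synthetic point lies on the segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest minority neighbours of `base` by squared Euclidean distance.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


# Toy minority class, e.g. a handful of fraudulent transactions in 2-D feature space.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
synthetic = smote(minority, n_new=6)
print(synthetic)
```

In practice one would more likely rely on a maintained implementation and apply oversampling to the training split only, so that synthetic points never leak into the evaluation data.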

[LG-86] Crafting Heavy-Tails in Weight Matrix Spectrum without Gradient Noise

Link: https://arxiv.org/abs/2406.04657
Authors: Vignesh Kothapalli, Tianyu Pang, Shenyang Deng, Zongmin Liu, Yaoqing Yang
Keywords: Modern training strategies, deep neural networks, Modern training, weight spectra, weight spectra tend
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Statistics Theory (math.ST); Machine Learning (stat.ML)
Comments: 31 pages, 37 figures

Abstract: Modern training strategies of deep neural networks (NNs) tend to induce heavy-tailed (HT) spectra of layer weights. Extensive efforts to study this phenomenon have found that NNs with HT weight spectra tend to generalize well. A prevailing notion for the occurrence of such HT spectra attributes gradient noise during training as a key contributing factor. Our work shows that gradient noise is unnecessary for generating HT weight spectra: two-layer NNs trained with full-batch Gradient Descent/Adam can exhibit HT spectra in their weights after finite training steps. To this end, we first identify the scale of the learning rate at which one step of full-batch Adam can lead to feature learning in the shallow NN, particularly when learning a single index teacher model. Next, we show that multiple optimizer steps with such (sufficiently) large learning rates can transition the bulk of the weight’s spectra into an HT distribution. To understand this behavior, we present a novel perspective based on the singular vectors of the weight matrices and optimizer updates. We show that the HT weight spectrum originates from the ‘spike’, which is generated from feature learning and interacts with the main bulk to generate an HT spectrum. Finally, we analyze the correlations between the HT weight spectra and generalization after multiple optimizer updates with varying learning rates.

[LG-87] LinkGPT: Teaching Large Language Models To Predict Missing Links

Link: https://arxiv.org/abs/2406.04640
Authors: Zhongmou He, Jing Zhu, Shengyi Qian, Joyce Chai, Danai Koutra
Keywords: Large Language Models, Large Language, Language Models, shown promising results, LLMs
Categories: Machine Learning (cs.LG)

Abstract: Large Language Models (LLMs) have shown promising results on various language and vision tasks. Recently, there has been growing interest in applying LLMs to graph-based tasks, particularly on Text-Attributed Graphs (TAGs). However, most studies have focused on node classification, while the use of LLMs for link prediction (LP) remains understudied. In this work, we propose a new task on LLMs, where the objective is to leverage LLMs to predict missing links between nodes in a graph. This task evaluates an LLM’s ability to reason over structured data and infer new facts based on learned patterns. This new task poses two key challenges: (1) How to effectively integrate pairwise structural information into the LLMs, which is known to be crucial for LP performance, and (2) how to solve the computational bottleneck when teaching LLMs to perform LP. To address these challenges, we propose LinkGPT, the first end-to-end trained LLM for LP tasks. To effectively enhance the LLM’s ability to understand the underlying structure, we design a two-stage instruction tuning approach where the first stage fine-tunes the pairwise encoder, projector, and node projector, and the second stage further fine-tunes the LLMs to predict links. To address the efficiency challenges at inference time, we introduce a retrieval-reranking scheme. Experiments show that LinkGPT can achieve state-of-the-art performance on real-world graphs as well as superior generalization in zero-shot and few-shot learning, surpassing existing benchmarks. At inference time, it can achieve 10× speedup while maintaining high LP accuracy.

[LG-88] Cooperative Meta-Learning with Gradient Augmentation

Link: https://arxiv.org/abs/2406.04639
Authors: Jongyun Shin, Seunjin Han, Jangho Kim
Keywords: Model agnostic meta-learning, outer loop, CML, Model agnostic, outer loop update
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to UAI 2024

Abstract: Model agnostic meta-learning (MAML) is one of the most widely used gradient-based meta-learning methods, consisting of two optimization loops: an inner loop and an outer loop. MAML learns the new task from meta-initialization parameters with an inner update and finds the meta-initialization parameters in the outer loop. In general, the injection of noise into the gradient of the model for augmenting the gradient is one of the widely used regularization methods. In this work, we propose a novel cooperative meta-learning framework dubbed CML which leverages gradient-level regularization with gradient augmentation. We inject learnable noise into the gradient of the model for the model generalization. The key idea of CML is introducing the co-learner which has no inner update but the outer loop update to augment gradients for finding better meta-initialization parameters. Since the co-learner does not update in the inner loop, it can be easily deleted after meta-training. Therefore, CML infers with only meta-learner without additional cost and performance degradation. We demonstrate that CML is easily applicable to gradient-based meta-learning methods and CML leads to increased performance in few-shot regression, few-shot image classification and few-shot node classification tasks. Our code is available at this https URL.

[LG-89] Denoising-Aware Contrastive Learning for Noisy Time Series

Link: https://arxiv.org/abs/2406.04627
Authors: Shuang Zhou, Daochen Zha, Xiao Shen, Xiao Huang, Rui Zhang, Fu-Lai Chung
Keywords: Time series self-supervised, exploit unlabeled data, Time series, series self-supervised learning, aims to exploit
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted to 33rd International Joint Conference on Artificial Intelligence (IJCAI 2024)

Abstract:Time series self-supervised learning (SSL) aims to exploit unlabeled data for pre-training to mitigate the reliance on labels. Despite the great success in recent years, there is limited discussion on the potential noise in the time series, which can severely impair the performance of existing SSL methods. To mitigate the noise, the de facto strategy is to apply conventional denoising methods before model training. However, this pre-processing approach may not fully eliminate the effect of noise in SSL for two reasons: (i) the diverse types of noise in time series make it difficult to automatically determine suitable denoising methods; (ii) noise can be amplified after mapping raw data into latent space. In this paper, we propose denoising-aware contrastive learning (DECL), which uses contrastive learning objectives to mitigate the noise in the representation and automatically selects suitable denoising methods for every sample. Extensive experiments on various datasets verify the effectiveness of our method. The code is open-sourced.

[LG-90] Adaptive Interface-PINNs (AdaI-PINNs): An Efficient Physics-informed Neural Networks Framework for Interface Problems

Link: https://arxiv.org/abs/2406.04626
Authors: Sumanta Roy, Chandrasekhar Annavarapu, Pratanu Roy, Antareep Kumar Sarma
Keywords: termed Adaptive Interface-PINNs, efficient physics-informed neural, termed Adaptive, Adaptive Interface-PINNs, physics-informed neural networks
Categories: Machine Learning (cs.LG)
Comments: 17 pages, 8 figures, 6 tables

Abstract:We present an efficient physics-informed neural networks (PINNs) framework, termed Adaptive Interface-PINNs (AdaI-PINNs), to improve the modeling of interface problems with discontinuous coefficients and/or interfacial jumps. This framework is an enhanced version of its predecessor, Interface PINNs or I-PINNs (Sarma et al.; this https URL), which involves domain decomposition and assignment of different predefined activation functions to the neural networks in each subdomain across a sharp interface, while keeping all other parameters of the neural networks identical. In AdaI-PINNs, the activation functions vary solely in their slopes, which are trained along with the other parameters of the neural networks. This makes the AdaI-PINNs framework fully automated without requiring preset activation functions. Comparative studies on one-dimensional, two-dimensional, and three-dimensional benchmark elliptic interface problems reveal that AdaI-PINNs outperform I-PINNs, reducing computational costs by 2-6 times while producing similar or better accuracy.

[LG-91] Image Processing Based Forest Fire Detection

Link: https://arxiv.org/abs/2406.04624
Authors: Vipin V
Keywords: image processing technique, processing technique, YCbCr color space, RGB color space, color space
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 9 pages

Abstract:A novel approach to forest fire detection using an image processing technique is proposed. A rule-based color model is used for fire pixel classification. The proposed algorithm uses the RGB and YCbCr color spaces. The advantage of the YCbCr color space is that it separates luminance from chrominance more effectively than the RGB color space. The performance of the proposed algorithm is tested on two sets of images, one of which contains fire; the other contains fire-like regions. Standard methods are used for calculating the performance of the algorithm. The proposed method has both a higher detection rate and a lower false alarm rate. Since the algorithm is computationally cheap, it can be used for real-time forest fire detection.
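
To make the color-space idea concrete, here is a minimal Python sketch of a rule-based fire-pixel test. The conversion uses the standard full-range BT.601 (JFIF) coefficients. The rule itself (`looks_like_fire`) is a hypothetical illustration chosen for this digest; the abstract does not state the paper's actual rules or thresholds, so this should not be read as the author's classifier.

```python
def rgb_to_ycbcr(r, g, b):
    """Convert 8-bit RGB to full-range YCbCr (BT.601 / JFIF coefficients)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr


def looks_like_fire(r, g, b):
    """Hypothetical rule (an assumption, not the paper's model).

    Flags pixels that are red-dominant in RGB and whose luminance (Y)
    and red chrominance (Cr) both exceed the blue chrominance (Cb).
    """
    y, cb, cr = rgb_to_ycbcr(r, g, b)
    return r > g > b and y > cb and cr > cb


# Example pixels: a saturated orange and a sky blue.
orange_is_fire = looks_like_fire(255, 120, 20)
sky_is_fire = looks_like_fire(80, 120, 220)
```

Because each pixel needs only a handful of multiplications and comparisons, a rule of this kind is cheap enough to run per frame, which matches the abstract's remark about real-time use.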

[LG-92] CTSyn: A Foundational Model for Cross Tabular Data Generation

Link: https://arxiv.org/abs/2406.04619
Authors: Xiaofeng Lin, Chenheng Xu, Matthew Yang, Guang Cheng
Keywords: Generative Foundation Models, Generative Foundation, images and text, Foundation Models, remarkable quality
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Generative Foundation Models (GFMs) have produced synthetic data with remarkable quality in modalities such as images and text. However, applying GFMs to tabular data poses significant challenges due to the inherent heterogeneity of table features. Existing cross-table learning frameworks are hindered by the absence of both a generative model backbone and a decoding mechanism for heterogeneous feature values. To overcome these limitations, we introduce the Cross-Table Synthesizer (CTSyn), a diffusion-based foundational model tailored for tabular data generation. CTSyn introduces three major components: an aggregator that consolidates heterogeneous tables into a unified latent space; a conditional latent diffusion model for sampling from this space; and type-specific decoders that reconstruct values of varied data types from sampled latent vectors. Extensive testing on real-world datasets reveals that CTSyn not only significantly outperforms existing table synthesizers in utility and diversity, but also uniquely enhances the performance of downstream machine learning beyond what is achievable with real data, thus establishing a new paradigm for synthetic data generation.

[LG-93] Revisiting Attention Weights as Interpretations of Message-Passing Neural Networks

Link: https://arxiv.org/abs/2406.04612
Authors: Yong-Min Shin, Siqing Li, Xin Cao, Won-Yong Shin
Keywords: message-passing neural networks, widely-used message-passing neural, neural networks, self-attention mechanism, widely-used message-passing
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Neural and Evolutionary Computing (cs.NE); Social and Information Networks (cs.SI)
Comments: 11 pages, 3 figures, 5 tables

Abstract:The self-attention mechanism has been adopted in several widely-used message-passing neural networks (MPNNs) (e.g., GATs), which adaptively controls the amount of information that flows along the edges of the underlying graph. This usage of attention has made such models a baseline for studies on explainable AI (XAI) since interpretations via attention have been popularized in various domains (e.g., natural language processing and computer vision). However, existing studies often use naive calculations to derive attribution scores from attention, and do not take the precise and careful calculation of edge attribution into consideration. In our study, we aim to fill the gap between the widespread usage of attention-enabled MPNNs and their potential in largely under-explored explainability, a topic that has been actively investigated in other areas. To this end, as the first attempt, we formalize the problem of edge attribution from attention weights in GNNs. Then, we propose GATT, an edge attribution calculation method built upon the computation tree. Through comprehensive experiments, we demonstrate the effectiveness of our proposed method when evaluating attributions from GATs. Conversely, we empirically validate that simply averaging attention weights over graph attention layers is insufficient to interpret the GAT model’s behavior. Code is publicly available at this https URL.

[LG-94] Contrastive explainable clustering with differential privacy

Link: https://arxiv.org/abs/2406.04610
Authors: Dung Nguyen, Ariel Vetzler, Sarit Kraus, Anil Vullikanti
Keywords: approach in Explainable, integrating contrastive explanations, paper presents, XAI, clustering
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)

Abstract:This paper presents a novel approach in Explainable AI (XAI), integrating contrastive explanations with differential privacy in clustering methods. For several basic clustering problems, including k-median and k-means, we give efficient differentially private contrastive explanations that achieve essentially the same explanations as those that non-private clustering explanations can obtain. We define contrastive explanations as the utility difference between the original clustering utility and utility from clustering with a specifically fixed centroid. In each contrastive scenario, we designate a specific data point as the fixed centroid position, enabling us to measure the impact of this constraint on clustering utility under differential privacy. Extensive experiments across various datasets show our method’s effectiveness in providing meaningful explanations without significantly compromising data privacy or clustering utility. This underscores our contribution to privacy-aware machine learning, demonstrating the feasibility of achieving a balance between privacy and utility in the explanation of clustering tasks.

[LG-95] Diverse Intra- and Inter-Domain Activity Style Fusion for Cross-Person Generalization in Activity Recognition

Link: https://arxiv.org/abs/2406.04609
Authors: Junru Zhang, Lang Feng, Zhidan Liu, Yuhan Wu, Yang He, Yabo Dong, Duanqing Xu
Keywords: cross-person generalization tasks, Existing domain generalization, cross-person generalization, face challenges, challenges in capturing
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: The 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2024)

Abstract:Existing domain generalization (DG) methods for cross-person generalization tasks often face challenges in capturing intra- and inter-domain style diversity, resulting in domain gaps with the target domain. In this study, we explore a novel perspective to tackle this problem, a process conceptualized as domain padding. This proposal aims to enrich the domain diversity by synthesizing intra- and inter-domain style data while maintaining robustness to class labels. We instantiate this concept using a conditional diffusion model and introduce a style-fused sampling strategy to enhance data generation diversity. In contrast to traditional condition-guided sampling, our style-fused sampling strategy allows for the flexible use of one or more random styles to guide data synthesis. This feature presents a notable advancement: it allows for the maximum utilization of possible permutations and combinations among existing styles to generate a broad spectrum of new style instances. Empirical evaluations on a broad range of datasets demonstrate that our generated data achieves remarkable diversity within the domain space. Both intra- and inter-domain generated data have proven to be significant and valuable, contributing to varying degrees of performance enhancements. Notably, our approach outperforms state-of-the-art DG methods in all human activity recognition tasks.

[LG-96] MeGA: Merging Multiple Independently Trained Neural Networks Based on Genetic Algorithm

Link: https://arxiv.org/abs/2406.04607
Authors: Daniel Yun
Keywords: algorithm called MeGA, called MeGA, pre-trained neural networks, genetic algorithm called, multiple pre-trained neural
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:In this paper, we introduce a novel method for merging the weights of multiple pre-trained neural networks using a genetic algorithm called MeGA. Traditional techniques, such as weight averaging and ensemble methods, often fail to fully harness the capabilities of pre-trained networks. Our approach leverages a genetic algorithm with tournament selection, crossover, and mutation to optimize weight combinations, creating a more effective fusion. This technique allows the merged model to inherit advantageous features from both parent models, resulting in enhanced accuracy and robustness. Through experiments on the CIFAR-10 dataset, we demonstrate that our genetic algorithm-based weight merging method improves test accuracy compared to individual models and conventional methods. This approach provides a scalable solution for integrating multiple pre-trained networks across various deep learning applications. The code is available on GitHub at: this https URL
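
The three operators named in the abstract (tournament selection, crossover, mutation) can be sketched on flat weight vectors as follows. This is a generic genetic algorithm under stated assumptions, not the MeGA implementation: the function names, the uniform crossover, the Gaussian mutation, the elitism step, and all default hyperparameters are choices made for this illustration, and `fitness` stands in for whatever validation metric one would actually use.

```python
import random


def tournament(population, fitness, k, rng):
    """Pick the fittest of k randomly drawn candidates."""
    return max(rng.sample(population, k), key=fitness)


def crossover(a, b, rng):
    """Uniform crossover: each weight comes from either parent."""
    return [x if rng.random() < 0.5 else y for x, y in zip(a, b)]


def mutate(weights, rate, scale, rng):
    """Add Gaussian noise to a random subset of weights."""
    return [w + rng.gauss(0.0, scale) if rng.random() < rate else w
            for w in weights]


def merge_weights(parents, fitness, generations=30, pop_size=20, seed=0):
    """Evolve a merged weight vector from two or more parent vectors."""
    rng = random.Random(seed)
    population = [list(p) for p in parents]
    # Fill the initial population with crossovers of the parents.
    while len(population) < pop_size:
        a, b = rng.sample(parents, 2)
        population.append(crossover(a, b, rng))
    for _ in range(generations):
        # Elitism (an assumption): the best candidate always survives.
        children = [max(population, key=fitness)]
        while len(children) < pop_size:
            a = tournament(population, fitness, 3, rng)
            b = tournament(population, fitness, 3, rng)
            children.append(mutate(crossover(a, b, rng), 0.1, 0.05, rng))
        population = children
    return max(population, key=fitness)
```

With a toy fitness such as the negative squared distance to a target vector, `merge_weights([[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]], fitness)` returns a vector that scores at least as well as either parent, since the elitism step never discards the current best candidate.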

[LG-97] Helpful or Harmful Data? Fine-tuning-free Shapley Attribution for Explaining Language Model Predictions

Link: https://arxiv.org/abs/2406.04606
Authors: Jingtan Wang, Xiaoqiang Lin, Rui Qiao, Chuan-Sheng Foo, Bryan Kian Hsiang Low
Keywords: foundational models underscores, necessity for explainability, downstream tasks, increasing complexity, complexity of foundational
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted to ICML 2024

Abstract:The increasing complexity of foundational models underscores the necessity for explainability, particularly for fine-tuning, the most widely used training method for adapting models to downstream tasks. Instance attribution, one type of explanation, attributes the model prediction to each training example by an instance score. However, the robustness of instance scores, specifically towards dataset resampling, has been overlooked. To bridge this gap, we propose a notion of robustness on the sign of the instance score. We theoretically and empirically demonstrate that the popular leave-one-out-based methods lack robustness, while the Shapley value behaves significantly better, but at a higher computational cost. Accordingly, we introduce an efficient fine-tuning-free approximation of the Shapley value (FreeShap) for instance attribution based on the neural tangent kernel. We empirically demonstrate that FreeShap outperforms other methods for instance attribution and other data-centric applications such as data removal, data selection, and wrong label detection, and further generalize our scale to large language models (LLMs). Our code is available at this https URL.
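
For readers unfamiliar with the two scores the abstract compares, the sketch below computes leave-one-out scores and exact Shapley values on a toy problem. It is not FreeShap: it enumerates every ordering of the training examples, which is only feasible for a handful of points, and the `toy_utility` function is a hypothetical stand-in for model performance on a training subset.

```python
from itertools import permutations


def shapley_values(n, utility):
    """Exact Shapley values: average marginal gain over all orderings."""
    values = [0.0] * n
    orders = list(permutations(range(n)))
    for order in orders:
        seen = set()
        prev = utility(frozenset(seen))
        for i in order:
            seen.add(i)
            cur = utility(frozenset(seen))
            values[i] += cur - prev
            prev = cur
    return [v / len(orders) for v in values]


def leave_one_out(n, utility):
    """Leave-one-out score: utility drop when one example is removed."""
    full = frozenset(range(n))
    return [utility(full) - utility(full - {i}) for i in range(n)]


def toy_utility(subset):
    # Hypothetical utility: examples 0 and 1 are redundant duplicates
    # (either one alone is worth 1.0); example 2 adds 0.5 on its own.
    score = 1.0 if (0 in subset or 1 in subset) else 0.0
    return score + (0.5 if 2 in subset else 0.0)


loo = leave_one_out(3, toy_utility)
shap = shapley_values(3, toy_utility)
```

In this toy case leave-one-out assigns zero to both duplicated examples, because removing either one changes nothing while the other remains, whereas the Shapley value splits their shared contribution evenly. The cost is the factorial number of orderings, which is the computational burden the abstract refers to.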

[LG-98] Enhancing Size Generalization in Graph Neural Networks through Disentangled Representation Learning

Link: https://arxiv.org/abs/2406.04601
Authors: Zheng Huang, Qihui Yang, Dawei Zhou, Yujun Yan
Keywords: graph neural networks, neural networks, encountered during training, graph representations, classification performance
Subjects: Machine Learning (cs.LG)

Abstract:Although most graph neural networks (GNNs) can operate on graphs of any size, their classification performance often declines on graphs larger than those encountered during training. Existing methods insufficiently address the removal of size information from graph representations, resulting in sub-optimal performance and reliance on backbone models. In response, we propose DISGEN, a novel and model-agnostic framework designed to disentangle size factors from graph representations. DISGEN employs size- and task-invariant augmentations and introduces a decoupling loss that minimizes shared information in hidden representations, with theoretical guarantees for its effectiveness. Our empirical results show that DISGEN outperforms the state-of-the-art models by up to 6% on real-world datasets, underscoring its effectiveness in enhancing the size generalizability of GNNs. Our codes are available at: this https URL.

[LG-99] Federated Representation Learning in the Under-Parameterized Regime

Link: https://arxiv.org/abs/2406.04596
Authors: Renpu Liu, Cong Shen, Jing Yang
Keywords: personalized federated learning, popular personalized federated, Federated representation learning, federated learning, personalized federated
Subjects: Machine Learning (cs.LG)
Comments: This work has been accepted to ICML 2024

Abstract:Federated representation learning (FRL) is a popular personalized federated learning (FL) framework where clients work together to train a common representation while retaining their personalized heads. Existing studies, however, largely focus on the over-parameterized regime. In this paper, we make the initial efforts to investigate FRL in the under-parameterized regime, where the FL model is insufficient to express the variations in all ground-truth models. We propose a novel FRL algorithm FLUTE, and theoretically characterize its sample complexity and convergence rate for linear models in the under-parameterized regime. To the best of our knowledge, this is the first FRL algorithm with provable performance guarantees in this regime. FLUTE features a data-independent random initialization and a carefully designed objective function that aids the distillation of subspace spanned by the global optimal representation from the misaligned local representations. On the technical side, we bridge low-rank matrix approximation techniques with the FL analysis, which may be of broad interest. We also extend FLUTE beyond linear representations. Experimental results demonstrate that FLUTE outperforms state-of-the-art FRL solutions in both synthetic and real-world tasks.

[LG-100] Boosting Large-scale Parallel Training Efficiency with C4: A Communication-Driven Approach

Link: https://arxiv.org/abs/2406.04594
Authors: Jianbo Dong, Bin Luo, Jun Zhang, Pengcheng Zhang, Fei Feng, Yikai Zhu, Ang Liu, Zian Chen, Yi Shi, Hairong Jiao, Gang Lu, Yu Guan, Ennan Zhai, Wencong Xiao, Hanyu Zhao, Man Yuan, Siran Yang, Xiang Li, Jiamang Wang, Rui Men, Jianwei Zhang, Huang Zhong, Dennis Cai, Yuan Xie, Binzhang Fu
Keywords: parallel training techniques, Large Language Models, parallel training, Language Models, Large Language
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:The emergence of Large Language Models (LLMs) has necessitated the adoption of parallel training techniques, involving the deployment of thousands of GPUs to train a single model. Unfortunately, we have found that the efficiency of current parallel training is often suboptimal, largely due to the following two main issues. Firstly, hardware failures are inevitable, leading to interruptions in the training tasks. The inability to quickly identify the faulty components results in a substantial waste of GPU resources. Secondly, since GPUs must wait for parameter synchronization to complete before proceeding to the next round of computation, network congestion can greatly increase the waiting time for GPUs. To address these challenges, this paper introduces a communication-driven solution, namely the C4. The key insights of C4 are twofold. First, in parallel training, collective communication exhibits periodic and homogeneous characteristics, so any anomalies are certainly due to some form of hardware malfunction. By leveraging this feature, C4 can rapidly identify the faulty components, swiftly isolate the anomaly, and restart the task, thereby avoiding resource wastage caused by delays in anomaly detection. Second, the predictable communication model of collective communication, involving few large flows, allows C4 to efficiently execute traffic planning, substantially reducing network congestion. C4 has been extensively implemented across our production systems, cutting error-induced overhead by roughly 30% and enhancing runtime performance by about 15% for certain applications with moderate communication costs.

[LG-101] CLoG: Benchmarking Continual Learning of Image Generation Models

Link: https://arxiv.org/abs/2406.04584
Authors: Haotian Zhang, Junting Zhou, Haowei Lin, Hang Ye, Jianhua Zhu, Zihao Wang, Liangcai Gao, Yizhou Wang, Yitao Liang
Keywords: Artificial Intelligence, incrementally acquire knowledge, Continual Learning, continual learning community, poses a significant
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Continual Learning (CL) poses a significant challenge in Artificial Intelligence, aiming to mirror the human ability to incrementally acquire knowledge and skills. While extensive research has focused on CL within the context of classification tasks, the advent of increasingly powerful generative models necessitates the exploration of Continual Learning of Generative models (CLoG). This paper advocates for shifting the research focus from classification-based CL to CLoG. We systematically identify the unique challenges presented by CLoG compared to traditional classification-based CL. We adapt three types of existing CL methodologies, replay-based, regularization-based, and parameter-isolation-based methods to generative tasks and introduce comprehensive benchmarks for CLoG that feature great diversity and broad task coverage. Our benchmarks and results yield intriguing insights that can be valuable for developing future CLoG methods. Additionally, we will release a codebase designed to facilitate easy benchmarking and experimentation in CLoG publicly at this https URL. We believe that shifting the research focus to CLoG will benefit the continual learning community and illuminate the path for next-generation AI-generated content (AIGC) in a lifelong learning paradigm.

[LG-102] Optimization of geological carbon storage operations with multimodal latent dynamic model and deep reinforcement learning

Link: https://arxiv.org/abs/2406.04575
Authors: Zhongzheng Wang, Yuntian Chen, Guodong Chen, Dongxiao Zhang
Keywords: demands resource-intensive simulations, Maximizing storage performance, geological carbon storage, Maximizing storage, optimization demands resource-intensive
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Applications (stat.AP); Machine Learning (stat.ML)

Abstract:Maximizing storage performance in geological carbon storage (GCS) is crucial for commercial deployment, but traditional optimization demands resource-intensive simulations, posing computational challenges. This study introduces the multimodal latent dynamic (MLD) model, a deep learning framework for fast flow prediction and well control optimization in GCS. The MLD model includes a representation module for compressed latent representations, a transition module for system state evolution, and a prediction module for flow responses. A novel training strategy combining regression loss and joint-embedding consistency loss enhances temporal consistency and multi-step prediction accuracy. Unlike existing models, the MLD supports diverse input modalities, allowing comprehensive data interactions. The MLD model, resembling a Markov decision process (MDP), can train deep reinforcement learning agents, specifically using the soft actor-critic (SAC) algorithm, to maximize net present value (NPV) through continuous interactions. The approach outperforms traditional methods, achieving the highest NPV while reducing computational resources by over 60%. It also demonstrates strong generalization performance, providing improved decisions for new scenarios based on knowledge from previous ones.

[LG-103] StackSight: Unveiling WebAssembly through Large Language Models and Neurosymbolic Chain-of-Thought Decompilation

Link: https://arxiv.org/abs/2406.04568
Authors: Weike Fang, Zhejian Zhou, Junzhou He, Weihang Wang
Keywords: demand high performance, enables near-native execution, Large Language Models, robust security, near-native execution
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 9 pages. In the Proceedings of the 41st International Conference on Machine Learning (ICML '24)

Abstract:WebAssembly enables near-native execution in web applications and is increasingly adopted for tasks that demand high performance and robust security. However, its assembly-like syntax, implicit stack machine, and low-level data types make it extremely difficult for human developers to understand, spurring the need for effective WebAssembly reverse engineering techniques. In this paper, we propose StackSight, a novel neurosymbolic approach that combines Large Language Models (LLMs) with advanced program analysis to decompile complex WebAssembly code into readable C++ snippets. StackSight visualizes and tracks virtual stack alterations via a static analysis algorithm and then applies chain-of-thought prompting to harness LLM’s complex reasoning capabilities. Evaluation results show that StackSight significantly improves WebAssembly decompilation. Our user study also demonstrates that code snippets generated by StackSight have significantly higher win rates and enable a better grasp of code semantics.

[LG-104] Error Bounds of Supervised Classification from Information-Theoretic Perspective

Link: https://arxiv.org/abs/2406.04567
Authors: Binchuan Qi, Wei Gong, Li Li
Keywords: unanswered research questions, overparametrized neural networks, deep neural networks, remarkable generalization power, efficient optimization performance
Subjects: Machine Learning (cs.LG); Information Retrieval (cs.IR)

Abstract:There remains a list of unanswered research questions on deep learning (DL), including the remarkable generalization power of overparametrized neural networks, the efficient optimization performance despite the non-convexity, and the mechanisms behind flat minima in generalization. In this paper, we adopt an information-theoretic perspective to explore the theoretical foundations of supervised classification using deep neural networks (DNNs). Our analysis introduces the concepts of fitting error and model risk, which, together with generalization error, constitute an upper bound on the expected risk. We demonstrate that the generalization errors are bounded by the complexity, influenced by both the smoothness of distribution and the sample size. Consequently, task complexity serves as a reliable indicator of the dataset’s quality, guiding the setting of regularization hyperparameters. Furthermore, the derived upper bound on the fitting error links the back-propagated gradient, Neural Tangent Kernel (NTK), and the model’s parameter count with the fitting error. Utilizing the triangle inequality, we establish an upper bound on the expected risk. This bound offers valuable insights into the effects of overparameterization, non-convex optimization, and the flat minima in DNNs. Finally, empirical verification confirms a significant positive correlation between the derived theoretical bounds and the practical expected risk, confirming the practical relevance of the theoretical findings.

[LG-105] SpaRC and SpaRP: Spatial Reasoning Characterization and Path Generation for Understanding Spatial Reasoning Capability of Large Language Models

Link: https://arxiv.org/abs/2406.04566
Authors: Md Imbesat Hassan Rizvi, Xiaodan Zhu, Iryna Gurevych
Keywords: Spatial reasoning, Spatial Reasoning Characterization, Spatial Reasoning Paths, Spatial, artificial intelligence
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at ACL 2024 (Main)

Abstract:Spatial reasoning is a crucial component of both biological and artificial intelligence. In this work, we present a comprehensive study of the capability of current state-of-the-art large language models (LLMs) on spatial reasoning. To support our study, we created and contribute a novel Spatial Reasoning Characterization (SpaRC) framework and Spatial Reasoning Paths (SpaRP) datasets, to enable an in-depth understanding of the spatial relations and compositions as well as the usefulness of spatial reasoning chains. We found that all the state-of-the-art LLMs do not perform well on the datasets – their performances are consistently low across different setups. The spatial reasoning capability improves substantially as model sizes scale up. Finetuning both large language models (e.g., Llama-2-70B) and smaller ones (e.g., Llama-2-13B) can significantly improve their F1-scores by 7–32 absolute points. We also found that the top proprietary LLMs still significantly outperform their open-source counterparts in topological spatial understanding and reasoning.

[LG-106] A Unified View of Group Fairness Tradeoffs Using Partial Information Decomposition

Link: https://arxiv.org/abs/2406.04562
Authors: Faisal Hamman, Sanghamitra Dutta
Keywords: prominent group fairness, statistical parity, predictive parity, equalized odds, group fairness notions
Subjects: Information Theory (cs.IT); Computers and Society (cs.CY); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Published as a conference paper at 2024 IEEE International Symposium on Information Theory (ISIT 2024)

Abstract:This paper introduces a novel information-theoretic perspective on the relationship between prominent group fairness notions in machine learning, namely statistical parity, equalized odds, and predictive parity. It is well known that simultaneous satisfiability of these three fairness notions is usually impossible, motivating practitioners to resort to approximate fairness solutions rather than stringent satisfiability of these definitions. However, a comprehensive analysis of their interrelations, particularly when they are not exactly satisfied, remains largely unexplored. Our main contribution lies in elucidating an exact relationship between these three measures of (un)fairness by leveraging a body of work in information theory called partial information decomposition (PID). In this work, we leverage PID to identify the granular regions where these three measures of (un)fairness overlap and where they disagree with each other leading to potential tradeoffs. We also include numerical simulations to complement our results.

[LG-107] On PI Controllers for Updating Lagrange Multipliers in Constrained Optimization

Link: https://arxiv.org/abs/2406.04558
Authors: Motahareh Sohrabi, Juan Ramirez, Tianyue H. Zhang, Simon Lacoste-Julien, Jose Gallego-Posada
Keywords: neural network models, Constrained optimization offers, prescribe desired behaviors, network models, offers a powerful
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: Published at ICML 2024. Code available at this https URL

Abstract:Constrained optimization offers a powerful framework to prescribe desired behaviors in neural network models. Typically, constrained problems are solved via their min-max Lagrangian formulations, which exhibit unstable oscillatory dynamics when optimized using gradient descent-ascent. The adoption of constrained optimization techniques in the machine learning community is currently limited by the lack of reliable, general-purpose update schemes for the Lagrange multipliers. This paper proposes the $\nu$PI algorithm and contributes an optimization perspective on Lagrange multiplier updates based on PI controllers, extending the work of Stooke, Achiam and Abbeel (2020). We provide theoretical and empirical insights explaining the inability of momentum methods to address the shortcomings of gradient descent-ascent, and contrast this with the empirical success of our proposed $\nu$PI controller. Moreover, we prove that $\nu$PI generalizes popular momentum methods for single-objective minimization. Our experiments demonstrate that $\nu$PI reliably stabilizes the multiplier dynamics and its hyperparameters enjoy robust and predictable behavior.
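
As a rough illustration of what a PI-style multiplier update looks like, the sketch below solves a one-dimensional toy problem: minimize x^2 subject to x >= 1, whose solution is x = 1 with multiplier 2. This is a textbook proportional-integral controller, not the paper's algorithm; the abstract does not give the update rule, so the gains `kp` and `ki`, the step size `lr`, and the function name are all assumptions made for this example.

```python
def solve_with_pi_multiplier(kp=1.0, ki=0.1, lr=0.1, steps=1000):
    """Minimize x^2 subject to x >= 1 with a PI-controlled multiplier.

    The constraint is written as g(x) = 1 - x <= 0, and the Lagrangian
    is x^2 + lam * g(x). Generic PI controller, assumed for illustration.
    """
    x = 0.0          # primal variable, starts infeasible
    integral = 0.0   # running sum of constraint violations (I term)
    lam = 0.0        # Lagrange multiplier
    for _ in range(steps):
        g = 1.0 - x                                  # current violation
        integral += g
        # Proportional term reacts to the current violation, integral
        # term to its history; the multiplier is kept non-negative.
        lam = max(0.0, kp * g + ki * integral)
        x -= lr * (2.0 * x - lam)                    # descent step on x
    return x, lam


x_final, lam_final = solve_with_pi_multiplier()
```

Setting `kp = 0` leaves only the integral term, which is the same as plain gradient ascent on the multiplier with step size `ki`. The proportional term adds damping, which is the general reason a PI scheme can reduce the oscillations that the abstract attributes to gradient descent-ascent.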

[LG-108] Improving Geo-diversity of Generated Images with Contextualized Vendi Score Guidance

Link: https://arxiv.org/abs/2406.04551
Authors: Reyhane Askari Hemmat, Melissa Hall, Alicia Sun, Candace Ross, Michal Drozdzal, Adriana Romero-Soriano
Keywords: generated images, risks and biases, growing popularity, increasing focus, focus on understanding
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:With the growing popularity of text-to-image generative models, there has been increasing focus on understanding their risks and biases. Recent work has found that state-of-the-art models struggle to depict everyday objects with the true diversity of the real world and have notable gaps between geographic regions. In this work, we aim to increase the diversity of generated images of common objects such that per-region variations are representative of the real world. We introduce an inference time intervention, contextualized Vendi Score Guidance (c-VSG), that guides the backwards steps of latent diffusion models to increase the diversity of a sample as compared to a “memory bank” of previously generated images while constraining the amount of variation within that of an exemplar set of real-world contextualizing images. We evaluate c-VSG with two geographically representative datasets and find that it substantially increases the diversity of generated images, both for the worst performing regions and on average, while simultaneously maintaining or improving image quality and consistency. Additionally, qualitative analyses reveal that diversity of generated images is significantly improved, including along the lines of reductive region portrayals present in the original model. We hope that this work is a step towards text-to-image generative models that reflect the true geographic diversity of the world.

[LG-109] Concurrent Training and Layer Pruning of Deep Neural Networks

Link: https://arxiv.org/abs/2406.04549
Authors: Valentin Frank Ingmar Guenter, Athanasios Sideris
Keywords: eliminating irrelevant layers, neural network, neural network weights, capable of identifying, identifying and eliminating
Subjects: Machine Learning (cs.LG)
Comments: 35 pages, 5 figures, 7 tables

Abstract:We propose an algorithm capable of identifying and eliminating irrelevant layers of a neural network during the early stages of training. In contrast to weight or filter-level pruning, layer pruning reduces the harder-to-parallelize sequential computation of a neural network. We employ a structure using residual connections around nonlinear network sections that allow the flow of information through the network once a nonlinear section is pruned. Our approach is based on variational inference principles using Gaussian scale mixture priors on the neural network weights and allows for substantial cost savings during both training and inference. More specifically, the variational posterior distribution of scalar Bernoulli random variables multiplying a layer weight matrix of its nonlinear sections is learned, similarly to adaptive layer-wise dropout. To overcome challenges of concurrent learning and pruning such as premature pruning and lack of robustness with respect to weight initialization or the size of the starting network, we adopt the “flattening” hyper-prior on the prior parameters. We prove that, as a result of its usage, the solutions of the resulting optimization problem describe deterministic networks with parameters of the posterior distribution at either 0 or 1. We formulate a projected SGD algorithm and prove its convergence to such a solution using stochastic approximation results. In particular, we prove conditions that lead to a layer’s weights converging to zero and derive practical pruning conditions from the theoretical results. The proposed algorithm is evaluated on the MNIST, CIFAR-10 and ImageNet datasets and common LeNet, VGG16 and ResNet architectures. The simulations demonstrate that our method achieves state-of-the-art performance for layer pruning at reduced computational cost in distinction to competing methods due to the concurrent training and pruning.

[LG-110] GNNAnatomy: Systematic Generation and Evaluation of Multi-Level Explanations for Graph Neural Networks

Link: https://arxiv.org/abs/2406.04548
Authors: Hsiao-Ying Lu,Yiran Li,Ujwal Pratap Krishna Kaluvakolanu Thyagarajan,Kwan-Liu Ma
Keywords: Graph Neural Networks, Neural Networks, proven highly effective, Graph Neural, machine learning
Categories: Machine Learning (cs.LG); Information Retrieval (cs.IR); Social and Information Networks (cs.SI)

Abstract:Graph Neural Networks (GNNs) have proven highly effective in various machine learning (ML) tasks involving graphs, such as node/graph classification and link prediction. However, explaining the decisions made by GNNs poses challenges because of the aggregated relational information based on graph structure, leading to complex data transformations. Existing methods for explaining GNNs often face limitations in systematically exploring diverse substructures and evaluating results in the absence of ground truths. To address this gap, we introduce GNNAnatomy, a model- and dataset-agnostic visual analytics system designed to facilitate the generation and evaluation of multi-level explanations for GNNs. In GNNAnatomy, we employ graphlets to elucidate GNN behavior in graph-level classification tasks. By analyzing the associations between GNN classifications and graphlet frequencies, we formulate hypothesized factual and counterfactual explanations. To validate a hypothesized graphlet explanation, we introduce two metrics: (1) the correlation between its frequency and the classification confidence, and (2) the change in classification confidence after removing this substructure from the original graph. To demonstrate the effectiveness of GNNAnatomy, we conduct case studies on both real-world and synthetic graph datasets from various domains. Additionally, we qualitatively compare GNNAnatomy with a state-of-the-art GNN explainer, demonstrating the utility and versatility of our design.

[LG-111] FOOD: Facial Authentication and Out-of-Distribution Detection with Short-Range FMCW Radar

Link: https://arxiv.org/abs/2406.04546
Authors: Sabri Mustafa Kahya,Boran Hamdi Sivrikaya,Muhammet Sami Yavuz,Eckehard Steinbach
Keywords: radar-based facial authentication, FMCW radar-based facial, short-range FMCW radar-based, paper proposes, radar-based facial
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: Accepted at ICIP 2024

Abstract:This paper proposes a short-range FMCW radar-based facial authentication and out-of-distribution (OOD) detection framework. Our pipeline jointly estimates the correct classes for the in-distribution (ID) samples and detects the OOD samples to prevent their inaccurate prediction. Our reconstruction-based architecture consists of a main convolutional block with one encoder and multi-decoder configuration, and intermediate linear encoder-decoder parts. Together, these elements form an accurate human face classifier and a robust OOD detector. For our dataset, gathered using a 60 GHz short-range FMCW radar, our network achieves an average classification accuracy of 98.07% in identifying in-distribution human faces. As an OOD detector, it achieves an average Area Under the Receiver Operating Characteristic (AUROC) curve of 98.50% and an average False Positive Rate at 95% True Positive Rate (FPR95) of 6.20%. Also, our extensive experiments show that the proposed approach outperforms previous OOD detectors in terms of common OOD detection metrics.
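The FPR95 figure quoted above can be illustrated with a small sketch. It assumes a scalar score per sample where a higher score means "more in-distribution", and it treats ID samples as the positive class; both are assumptions of this example, not details taken from the paper, and the function name `fpr_at_tpr` is hypothetical.

```python
import math


def fpr_at_tpr(id_scores, ood_scores, tpr=0.95):
    """False positive rate at a target true positive rate.

    Assumption: higher score = more in-distribution (ID), and ID is the
    positive class. An OOD sample counts as a false positive when its
    score reaches the threshold that accepts `tpr` of the ID samples.
    """
    ranked = sorted(id_scores, reverse=True)
    # Smallest number of ID samples that must be accepted to reach the TPR.
    k = math.ceil(tpr * len(ranked))
    threshold = ranked[k - 1]
    false_positives = sum(1 for s in ood_scores if s >= threshold)
    return false_positives / len(ood_scores)


if __name__ == "__main__":
    id_scores = [0.9, 0.8, 0.7, 0.6]
    ood_scores = [0.95, 0.5, 0.1, 0.65]
    print(fpr_at_tpr(id_scores, ood_scores, tpr=0.95))
```

A lower value is better: it means fewer OOD samples slip through when the detector is tuned to accept 95% of genuine ID samples.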

[LG-112] Tangent differential privacy

Link: https://arxiv.org/abs/2406.04535
Authors: Lexing Ying
Keywords: Differential privacy, tangent differential privacy, individual data points, tangent differential, Differential
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)

Abstract:Differential privacy is a framework for protecting the identity of individual data points in the decision-making process. In this note, we propose a new form of differential privacy called tangent differential privacy. Compared with the usual differential privacy that is defined uniformly across data distributions, tangent differential privacy is tailored towards a specific data distribution of interest. It also allows for general distribution distances such as total variation distance and Wasserstein distance. In the case of risk minimization, we show that entropic regularization guarantees tangent differential privacy under rather general conditions on the risk function.

[LG-113] Strategically Conservative Q-Learning

Link: https://arxiv.org/abs/2406.04534
Authors: Yutaka Shimizu,Joey Hong,Sergey Levine,Masayoshi Tomizuka
Keywords: collecting online interactions, Offline reinforcement learning, static datasets, reinforcement learning, leveraging pre-collected
Categories: Machine Learning (cs.LG)

Abstract:Offline reinforcement learning (RL) is a compelling paradigm to extend RL’s practical utility by leveraging pre-collected, static datasets, thereby avoiding the limitations associated with collecting online interactions. The major difficulty in offline RL is mitigating the impact of approximation errors when encountering out-of-distribution (OOD) actions; doing so ineffectively will lead to policies that prefer OOD actions, which can lead to unexpected and potentially catastrophic results. Despite the variety of works proposed to address this issue, they tend to excessively suppress the value function in and around OOD regions, resulting in overly pessimistic value estimates. In this paper, we propose a novel framework called Strategically Conservative Q-Learning (SCQ) that distinguishes between OOD data that is easy and hard to estimate, ultimately resulting in less conservative value estimates. Our approach exploits the inherent strengths of neural networks to interpolate, while carefully navigating their limitations in extrapolation, to obtain pessimistic yet still properly calibrated value estimates. Theoretical analysis also shows that the value function learned by SCQ is still conservative, but potentially much less so than that of Conservative Q-learning (CQL). Finally, extensive evaluation on the D4RL benchmark tasks shows our proposed method outperforms state-of-the-art methods. Our code is available through this https URL.

[LG-114] Rare Class Prediction Model for Smart Industry in Semiconductor Manufacturing

Link: https://arxiv.org/abs/2406.04533
Authors: Abdelrahman Farrag,Mohammed-Khalil Ghali,Yu Jin
Keywords: digital systems, facilitating the collection, evolution of industry, industry has enabled, physical and digital
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:The evolution of industry has enabled the integration of physical and digital systems, facilitating the collection of extensive data on manufacturing processes. This integration provides a reliable solution for improving process quality and managing equipment health. However, data collected from real manufacturing processes often exhibit challenging properties, such as severe class imbalance, high rates of missing values, and noisy features, which hinder effective machine learning implementation. In this study, a rare class prediction approach is developed for in situ data collected from a smart semiconductor manufacturing process. The primary objective is to build a model that addresses issues of noise and class imbalance, enhancing class separation. The developed approach demonstrated promising results compared to existing literature, which would allow the prediction of new observations that could give insights into future maintenance plans and production quality. The model was evaluated using various performance metrics, with ROC curves showing an AUC of 0.95, a precision of 0.66, and a recall of 0.96.

[LG-115] Proofread: Fixes All Errors with One Tap

Link: https://arxiv.org/abs/2406.04523
Authors: Renjie Liu,Yanxiang Zhang,Yun Zhu,Haicheng Sun,Yuanbo Zhang,Michael Xuelin Huang,Shanqing Cai,Lei Meng,Shumin Zhai
Keywords: Large Language Models, Large Language, users’ typing experience, reimagine users’ typing, capabilities in Large
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 8 pages, 3 figures, 2 tables

Abstract:The impressive capabilities of Large Language Models (LLMs) provide a powerful approach to reimagine users’ typing experience. This paper demonstrates Proofread, a novel Gboard feature powered by a server-side LLM, enabling seamless sentence-level and paragraph-level corrections with a single tap. We describe the complete system in this paper, from data generation and metrics design to model tuning and deployment. To obtain models with sufficient quality, we implement a careful data synthesis pipeline tailored to online use cases, design multifaceted metrics, and employ a two-stage tuning approach to acquire the dedicated LLM for the feature: Supervised Fine Tuning (SFT) for foundational quality, followed by Reinforcement Learning (RL) tuning for targeted refinement. Specifically, we find that sequential tuning on Rewrite and proofread tasks yields the best quality in the SFT stage, and propose global and direct rewards in the RL tuning stage to seek further improvement. Extensive experiments on a human-labeled golden set showed our tuned PaLM2-XS model achieved 85.56% good ratio. We launched the feature to Pixel 8 devices by serving the model on TPU v5 in Google Cloud, with thousands of daily active users. Serving latency was significantly reduced by quantization, bucket inference, text segmentation, and speculative decoding. Our demo can be seen on YouTube (this https URL).

[LG-116] Multifidelity digital twin for real-time monitoring of structural dynamics in aquaculture net cages

Link: https://arxiv.org/abs/2406.04519
Authors: Eirini Katsidoniotaki,Biao Su,Eleni Kelasidi,Themistoklis P. Sapsis
Keywords: climate change intensifies, global population grows, sustainable food production, change intensifies, global population
Categories: Machine Learning (cs.LG)

Abstract:As the global population grows and climate change intensifies, sustainable food production is critical. Marine aquaculture offers a viable solution, providing a sustainable protein source. However, the industry’s expansion requires novel technologies for remote management and autonomous operations. Digital twin technology can advance the aquaculture industry, but its adoption has been limited. Fish net cages, which are flexible floating structures, are critical yet vulnerable components of aquaculture farms. Exposed to harsh and dynamic marine environments, the cages experience significant loads and risk damage, leading to fish escapes, environmental impacts, and financial losses. We propose a multifidelity surrogate modeling framework for integration into a digital twin for real-time monitoring of aquaculture net cage structural dynamics under stochastic marine conditions. Central to this framework is the nonlinear autoregressive Gaussian process method, which learns complex, nonlinear cross-correlations between models of varying fidelity. It combines low-fidelity simulation data with a small set of high-fidelity field sensor measurements, which offer the real dynamics but are costly and spatially sparse. Validated at the SINTEF ACE fish farm in Norway, our digital twin receives online metocean data and accurately predicts net cage displacements and mooring line loads, aligning closely with field measurements. The proposed framework is beneficial where application-specific data are scarce, offering rapid predictions and real-time system representation. The developed digital twin prevents potential damages by assessing structural integrity and facilitates remote operations with unmanned underwater vehicles. Our work also compares GP and GCNs for predicting net cage deformation, highlighting the latter’s effectiveness in complex structural applications.

[LG-117] Online Joint Fine-tuning of Multi-Agent Flows

Link: https://arxiv.org/abs/2406.04516
Authors: Paul Mineiro
Keywords: Agents which constructs, iterative communication, constructs the solution, Agents, complex problem
Categories: Machine Learning (cs.LG)

Abstract:A Flow is a collection of component models (“Agents”) which constructs the solution to a complex problem via iterative communication. Flows have emerged as state-of-the-art architectures for code generation, and are the raison d’etre for frameworks like Autogen. However, flows are currently constructed via a combination of manual prompt engineering and stagewise supervised learning techniques; the latter is limited to acyclic flows with granular node supervision. In this writeup I describe a procedure for online joint fine-tuning of an entire flow inspired by the Learning to Search framework. The approach leverages simulator access to reduce preferences over entire episodes to preferences over individual node outputs; when the components are language models the latter is a well-studied problem. The approach is applicable to reward-free settings (e.g., text feedback) if an episode evaluator model is available. I apply the approach to the multi-hop QA dataset Musique, achieving a state-of-the-art result.

[LG-118] OCCAM: Towards Cost-Efficient and Accuracy-Aware Image Classification Inference

Link: https://arxiv.org/abs/2406.04508
Authors: Dujian Ding,Bicheng Xu,Laks V.S. Lakshmanan
Keywords: computer vision applications, fundamental building block, vision applications, fundamental building, building block
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Under Review

Abstract:Image classification is a fundamental building block for a majority of computer vision applications. With the growing popularity and capacity of machine learning models, people can easily access trained image classifiers as a service online or offline. However, model use comes with a cost and classifiers of higher capacity usually incur higher inference costs. To harness the respective strengths of different classifiers, we propose a principled approach, OCCAM, to compute the best classifier assignment strategy over image classification queries (termed as the optimal model portfolio) so that the aggregated accuracy is maximized, under user-specified cost budgets. Our approach uses an unbiased and low-variance accuracy estimator and effectively computes the optimal solution by solving an integer linear programming problem. On a variety of real-world datasets, OCCAM achieves 40% cost reduction with little to no accuracy drop.
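The assignment problem described above (one classifier per query, maximize total estimated accuracy, stay within a cost budget) can be illustrated on a toy instance. The sketch below enumerates every assignment by brute force instead of solving an integer linear program, so it stands in for the problem statement only and not for the paper's solver or its accuracy estimator. The names `best_assignment`, `acc` and `cost` are hypothetical.

```python
from itertools import product


def best_assignment(acc, cost, budget):
    """Pick one classifier per query to maximize total estimated accuracy.

    acc[i][j] : estimated accuracy of classifier j on query i (assumed given)
    cost[j]   : inference cost of classifier j per query
    budget    : total cost allowed across all queries
    Brute force over all assignments; only viable for tiny instances.
    """
    best_choice, best_total = None, float("-inf")
    for choice in product(range(len(cost)), repeat=len(acc)):
        total_cost = sum(cost[j] for j in choice)
        if total_cost > budget:
            continue  # violates the user-specified cost budget
        total_acc = sum(acc[i][j] for i, j in enumerate(choice))
        if total_acc > best_total:
            best_choice, best_total = choice, total_acc
    return best_choice, best_total


if __name__ == "__main__":
    acc = [[0.6, 0.9], [0.8, 0.85]]  # two queries, two classifiers
    cost = [1, 3]                    # the second classifier is more expensive
    print(best_assignment(acc, cost, budget=4))
```

With a budget of 4, only one query can use the expensive classifier, and the search gives it to the query where the accuracy gain is largest.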

[LG-119] Enhancing Precision in Tactile Internet-Enabled Remote Robotic Surgery: Kalman Filter Approach

Link: https://arxiv.org/abs/2406.04503
Authors: Muhammad Hanif Lashari,Wafa Batayneh,Ashfaq Khokhar
Keywords: remote surgery task, Accurately estimating, Patient Side Manipulator, real time, remote surgery
Categories: Robotics (cs.RO); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: 8 pages, 5 figures, [IWCMC 2024 AI-NECOS]

Abstract:Accurately estimating the position of a patient’s side robotic arm in real time in a remote surgery task is a significant challenge, particularly in Tactile Internet (TI) environments. This paper presents a Kalman Filter (KF) based computationally efficient position estimation method. The study also assumes no prior knowledge of the dynamic system model of the robotic arm system. Instead, the JIGSAW dataset, which is a comprehensive collection of robotic surgical data, and the Master Tool Manipulator’s (MTM) input are utilized to learn the system model using the System Identification (SI) toolkit available in Matlab. We further investigate the effectiveness of KF to determine the position of the Patient Side Manipulator (PSM) under simulated network conditions that include delays, jitter, and packet loss. These conditions reflect the typical challenges encountered in real-world Tactile Internet applications. The results of the study highlight KF’s resilience and effectiveness in achieving accurate state estimation despite network-induced uncertainties, with over 90% estimation accuracy.
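To make the predict/update cycle of a Kalman filter concrete, here is a minimal scalar sketch. It assumes a random-walk state model with hand-picked noise variances `q` and `r`, and it models a lost packet as a `None` measurement that skips the update step. None of this is taken from the paper, which learns its system model through system identification; the function name `kalman_1d` is hypothetical.

```python
def kalman_1d(measurements, q=1e-3, r=0.1, x0=0.0, p0=1.0):
    """Scalar Kalman filter with a random-walk state model (an assumption).

    measurements : iterable of floats, or None where a packet was lost
    q, r         : process and measurement noise variances (illustrative)
    x0, p0       : initial state estimate and its variance
    Returns the list of state estimates, one per time step.
    """
    x, p = x0, p0
    estimates = []
    for z in measurements:
        # Predict: the state is assumed unchanged, uncertainty grows by q.
        p = p + q
        if z is not None:
            # Update: blend prediction and measurement using the Kalman gain.
            k = p / (p + r)
            x = x + k * (z - x)
            p = (1.0 - k) * p
        # On packet loss (z is None) the prediction is kept as the estimate.
        estimates.append(x)
    return estimates


if __name__ == "__main__":
    print(kalman_1d([1.0, None, 1.0], q=0.0, r=1.0))
```

When a measurement is missing, the estimate is held while its variance keeps growing, so the next measurement that does arrive receives a larger gain.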

[LG-120] FLUID-LLM: Learning Computational Fluid Dynamics with Spatiotemporal-aware Large Language Models

Link: https://arxiv.org/abs/2406.04501
Authors: Max Zhu,Adrián Bazaga,Pietro Liò
Keywords: Learning computational fluid, computationally intensive simulations, Learning computational, traditionally relies, Navier-Stokes equations
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)

Abstract:Learning computational fluid dynamics (CFD) traditionally relies on computationally intensive simulations of the Navier-Stokes equations. Recently, large language models (LLMs) have shown remarkable pattern recognition and reasoning abilities in natural language processing (NLP) and computer vision (CV). However, these models struggle with the complex geometries inherent in fluid dynamics. We introduce FLUID-LLM, a novel framework combining pre-trained LLMs with spatiotemporal-aware encoding to predict unsteady fluid dynamics. Our approach leverages the temporal autoregressive abilities of LLMs alongside spatial-aware layers, bridging the gap between previous CFD prediction methods. Evaluations on standard benchmarks reveal significant performance improvements across various fluid datasets. Our results demonstrate that FLUID-LLM effectively integrates spatiotemporal information into pre-trained LLMs, enhancing CFD task performance.

[LG-121] Time Sensitive Knowledge Editing through Efficient Finetuning

Link: https://arxiv.org/abs/2406.04496
Authors: Xiou Ge,Ali Mousavi,Edouard Grave,Armand Joulin,Kun Qian,Benjamin Han,Mostafa Arefiyan,Yunyao Li
Keywords: Large Language Models, Language Models, demonstrated impressive capability, Large Language, demonstrated impressive
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to ACL 2024 main conference

Abstract:Large Language Models (LLMs) have demonstrated impressive capability in different tasks and are bringing transformative changes to many domains. However, keeping the knowledge in LLMs up-to-date remains a challenge once pretraining is complete. It is thus essential to design effective methods to both update obsolete knowledge and induce new knowledge into LLMs. Existing locate-and-edit knowledge editing (KE) methods suffer from two limitations. First, the LLMs edited by such methods generally have poor capability in answering complex queries that require multi-hop reasoning. Second, the long run-time of such locate-and-edit methods to perform knowledge edits makes them infeasible for large-scale KE in practice. In this paper, we explore Parameter-Efficient Fine-Tuning (PEFT) techniques as an alternative for KE. We curate a more comprehensive temporal KE dataset with both knowledge update and knowledge injection examples for KE performance benchmarking. We further probe the effect of fine-tuning on a range of layers in an LLM for the multi-hop QA task. We find that PEFT performs better than locate-and-edit techniques for time-sensitive knowledge edits.

[LG-122] User Intent Recognition and Semantic Cache Optimization-Based Query Processing Framework using CFLIS and MGR-LAU

Link: https://arxiv.org/abs/2406.04490
Authors: Sakshi Mahendru
Keywords: frequently accessed data, accessed data closer, Linguistic Inference System, Contextual Fuzzy Linguistic, Fuzzy Linguistic Inference
Categories: Machine Learning (cs.LG)
Comments: 10 Pages. Preprint

Abstract:Query Processing (QP) is optimized by a Cloud-based cache by storing the frequently accessed data closer to users. Nevertheless, the lack of focus on user intention type in queries affected the efficiency of QP in prevailing works. Thus, by using a Contextual Fuzzy Linguistic Inference System (CFLIS), this work analyzed the informational, navigational, and transactional-based intents in queries for enhanced QP. Primarily, the user query is parsed using tokenization, normalization, stop word removal, stemming, and POS tagging and then expanded using the WordNet technique. After expanding the queries, to enhance query understanding and to facilitate more accurate analysis and retrieval in query processing, the named entity is recognized using Bidirectional Encoder UnispecNorm Representations from Transformers (BEUNRT). Next, for efficient QP and retrieval of query information from the semantic cache database, the data is structured using Epanechnikov Kernel-Ordering Points To Identify the Clustering Structure (EK-OPTICS). The features are extracted from the structured data. Now, sentence type is identified and intent keywords are extracted from the parsed query. Next, the extracted features, detected intents and structured data are inputted to the Multi-head Gated Recurrent Learnable Attention Unit (MGR-LAU), which processes the query based on a semantic cache database (stores previously interpreted queries to expedite effective future searches). Moreover, the query is processed with a minimum latency of 12856ms. Lastly, the Semantic Similarity (SS) is analyzed between the retrieved query and the inputted user query, which continues until the similarity reaches 0.9 and above. Thus, the proposed work surpassed the previous methodologies.

[LG-123] Negative Feedback for Music Personalization

Link: https://arxiv.org/abs/2406.04488
Authors: M. Jeffrey Mei,Oliver Bembom,Andreas F. Ehmann
Keywords: Next-item recommender systems, Next-item recommender, next-song recommender system, negative feedback, feedback
Categories: Machine Learning (cs.LG); Information Retrieval (cs.IR)
Comments: 6 pages, 4 figures, accepted to ACM UMAP 2024

Abstract:Next-item recommender systems are often trained using only positive feedback with randomly-sampled negative feedback. We show the benefits of using real negative feedback both as inputs into the user sequence and also as negative targets for training a next-song recommender system for internet radio. In particular, using explicit negative samples during training helps reduce training time by ~60% while also improving test accuracy by ~6%; adding user skips as additional inputs can also considerably increase user coverage alongside slightly improving accuracy. We test the impact of using a large number of random negative samples to capture a ‘harder’ one and find that the test accuracy increases with more randomly-sampled negatives, but only to a point. Too many random negatives lead to false negatives that limit the lift, which is still lower than if using true negative feedback. We also find that the test accuracy is fairly robust with respect to the proportion of different feedback types, and compare the learned embeddings for different feedback types.

[LG-124] A multi-core periphery perspective: Ranking via relative centrality

Link: https://arxiv.org/abs/2406.04487
Authors: Chandra Sekhar Mukherjee,Jiapeng Zhang
Keywords: SIAM Review, App. Math., SIAM J. App., widely studied graph, SIAM
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Community and core-periphery are two widely studied graph structures, with their coexistence observed in real-world graphs (Rombach, Porter, Fowler & Mucha [SIAM J. App. Math. 2014, SIAM Review 2017]). However, the nature of this coexistence is not well understood and has been pointed out as an open problem (Yanchenko & Sengupta [Statistics Surveys, 2023]). Especially, the impact of inferring the core-periphery structure of a graph on understanding its community structure is not well utilized. In this direction, we introduce a novel quantification for graphs with ground truth communities, where each community has a densely connected part (the core), and the rest is more sparse (the periphery), with inter-community edges more frequent between the peripheries. Built on this structure, we propose a new algorithmic concept that we call relative centrality to detect the cores. We observe that core-detection algorithms based on popular centrality measures such as PageRank and degree centrality can show some bias in their outcome by selecting very few vertices from some cores. We show that relative centrality solves this bias issue and provide theoretical and simulation support, as well as experiments on real-world graphs. Core detection is known to have important applications with respect to core-periphery structures. In our model, we show a new application: relative-centrality-based algorithms can select a subset of the vertices such that it contains sufficient vertices from all communities, and points in this subset are better separable into their respective communities. We apply the methods to 11 biological datasets, with our methods resulting in a more balanced selection of vertices from all communities such that clustering algorithms have better performance on this set.

[LG-125] PromptFix: Few-shot Backdoor Removal via Adversarial Prompt Tuning

Link: https://arxiv.org/abs/2406.04478
Authors: Tianrong Zhang,Zhaohan Xi,Ting Wang,Prasenjit Mitra,Jinghui Chen
Keywords: attracted enormous attention, Pre-trained language models, Pre-trained language, attracted enormous, enormous attention
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: NAACL 2024

Abstract:Pre-trained language models (PLMs) have attracted enormous attention over the past few years with their unparalleled performances. Meanwhile, the soaring cost to train PLMs as well as their amazing generalizability have jointly contributed to few-shot fine-tuning and prompting as the most popular training paradigms for natural language processing (NLP) models. Nevertheless, existing studies have shown that these NLP models can be backdoored such that model behavior is manipulated when trigger tokens are presented. In this paper, we propose PromptFix, a novel backdoor mitigation strategy for NLP models via adversarial prompt-tuning in few-shot settings. Unlike existing NLP backdoor removal methods, which rely on accurate trigger inversion and subsequent model fine-tuning, PromptFix keeps the model parameters intact and only utilizes two extra sets of soft tokens which approximate the trigger and counteract it respectively. The use of soft tokens and adversarial optimization eliminates the need to enumerate possible backdoor configurations and enables an adaptive balance between trigger finding and preservation of performance. Experiments with various backdoor attacks validate the effectiveness of the proposed method, and the performances when domain shift is present further show PromptFix’s applicability to models pretrained on unknown data sources, which is the common case in prompt tuning scenarios.

[LG-126] Provable Bounds on the Hessian of Neural Networks: Derivative-Preserving Reachability Analysis

Link: https://arxiv.org/abs/2406.04476
Authors: Sina Sharifi,Mahyar Fazlyab
Keywords: reachability analysis method, analysis method tailored, reachability analysis, neural network map, first-order Taylor expansion
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY)

Abstract:We propose a novel reachability analysis method tailored for neural networks with differentiable activations. Our idea hinges on a sound abstraction of the neural network map based on first-order Taylor expansion and bounding the remainder. To this end, we propose a method to compute analytical bounds on the network’s first derivative (gradient) and second derivative (Hessian). A key aspect of our method is loop transformation on the activation functions to exploit their monotonicity effectively. The resulting end-to-end abstraction locally preserves the derivative information, yielding accurate bounds on small input sets. Finally, we employ a branch and bound framework for larger input sets to refine the abstraction recursively. We evaluate our method numerically via different examples and compare the results with relevant state-of-the-art methods.
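The standard fact behind a "first-order Taylor expansion plus bounded remainder" abstraction can be written out as follows. This is the textbook bound, stated under the assumptions that each output coordinate $f_i$ is twice continuously differentiable and that the input set $\mathcal{X}$ is convex; the symbols $x_0$, $M_i$ and $R_i$ are notation chosen here, not necessarily the paper's.

```latex
% First-order Taylor abstraction of one network output f_i around x_0.
% Assumption: \|\nabla^2 f_i(\xi)\|_2 \le M_i for all \xi \in \mathcal{X}.
\[
  f_i(x) = f_i(x_0) + \nabla f_i(x_0)^{\top}(x - x_0) + R_i(x),
  \qquad
  |R_i(x)| \le \frac{M_i}{2}\,\|x - x_0\|_2^{2}
  \quad \text{for all } x \in \mathcal{X}.
\]
```

The remainder shrinks quadratically with the size of the input set, which is consistent with the abstract's remark that the bounds are accurate on small input sets and that larger sets are handled by branch and bound.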

[LG-127] On the Hardness of Probabilistic Neurosymbolic Learning

Link: https://arxiv.org/abs/2406.04472
Authors: Jaron Maene,Vincent Derkinderen,Luc De Raedt
Keywords: purely neural learning, combine neural networks, probabilistic logical reasoning, purely neural, neural learning
Categories: Machine Learning (cs.LG)

Abstract:The limitations of purely neural learning have sparked an interest in probabilistic neurosymbolic models, which combine neural networks with probabilistic logical reasoning. As these neurosymbolic models are trained with gradient descent, we study the complexity of differentiating probabilistic reasoning. We prove that although approximating these gradients is intractable in general, it becomes tractable during training. Furthermore, we introduce WeightME, an unbiased gradient estimator based on model sampling. Under mild assumptions, WeightME approximates the gradient with probabilistic guarantees using a logarithmic number of calls to a SAT solver. Lastly, we evaluate the necessity of these guarantees on the gradient. Our experiments indicate that the existing biased approximations indeed struggle to optimize even when exact solving is still feasible.

[LG-128] On The Importance of Reasoning for Context Retrieval in Repository-Level Code Editing

Link: https://arxiv.org/abs/2406.04464
Authors: Alexander Kovrigin,Aleksandra Eliseeva,Yaroslav Zharov,Timofey Bryksin
Keywords: Large Language Models, code-fluent Large Language, Large Language, Language Models, Recent advancements
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Recent advancements in code-fluent Large Language Models (LLMs) enabled the research on repository-level code editing. In such tasks, the model navigates and modifies the entire codebase of a project according to request. Hence, such tasks require efficient context retrieval, i.e., navigating vast codebases to gather relevant context. Despite the recognized importance of context retrieval, existing studies tend to approach repository-level coding tasks in an end-to-end manner, rendering the impact of individual components within these complicated systems unclear. In this work, we decouple the task of context retrieval from the other components of the repository-level code editing pipelines. We lay the groundwork to define the strengths and weaknesses of this component and the role that reasoning plays in it by conducting experiments that focus solely on context retrieval. We conclude that while the reasoning helps to improve the precision of the gathered context, it still lacks the ability to identify its sufficiency. We also outline the ultimate role of the specialized tools in the process of context gathering. The code supplementing this paper is available at this https URL.

[LG-129] Can Language Models Use Forecasting Strategies?

Link: https://arxiv.org/abs/2406.04446
Authors: Sarah Pratt,Seth Blumberg,Pietro Kreitlon Carolino,Meredith Ringel Morris
Keywords: standardized test taking, deep learning systems, Advances in deep, allowed large models, basic programming
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Advances in deep learning systems have allowed large models to match or surpass human accuracy on a number of skills such as image classification, basic programming, and standardized test taking. As the performance of the most capable models begins to saturate on tasks where humans already achieve high accuracy, it becomes necessary to benchmark models on increasingly complex abilities. One such task is forecasting the future outcome of events. In this work we describe experiments using a novel dataset of real world events and associated human predictions, an evaluation metric to measure forecasting ability, and the accuracy of a number of different LLM based forecasting designs on the provided dataset. Additionally, we analyze the performance of the LLM forecasters against human predictions and find that models still struggle to make accurate predictions about the future. Our follow-up experiments indicate this is likely due to models’ tendency to guess that most events are unlikely to occur (which tends to be true for many prediction datasets, but does not reflect actual forecasting abilities). We reflect on next steps for developing a systematic and reliable approach to studying LLM forecasting.

[LG-130] Gradient Clipping Improves AdaGrad when the Noise Is Heavy-Tailed

Link: https://arxiv.org/abs/2406.04443
Authors: Savelii Chezhegov, Yaroslav Klyukin, Andrei Semenov, Aleksandr Beznosikov, Alexander Gasnikov, Samuel Horváth, Martin Takáč, Eduard Gorbunov
Keywords: modern Deep Learning, Deep Learning models, Large Language Models, training modern Deep, Deep Learning
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: 37 pages, 8 figures

Abstract:Methods with adaptive stepsizes, such as AdaGrad and Adam, are essential for training modern Deep Learning models, especially Large Language Models. Typically, the noise in the stochastic gradients is heavy-tailed for the latter. Gradient clipping provably helps to achieve good high-probability convergence for such noise. However, despite the similarity between AdaGrad/Adam and Clip-SGD, the high-probability convergence of AdaGrad/Adam has not been studied in this case. In this work, we prove that AdaGrad (and its delayed version) can have provably bad high-probability convergence if the noise is heavy-tailed. To fix this issue, we propose a new version of AdaGrad called Clip-RAdaGradD (Clipped Reweighted AdaGrad with Delay) and prove its high-probability convergence bounds with polylogarithmic dependence on the confidence level for smooth convex/non-convex stochastic optimization with heavy-tailed noise. Our empirical evaluations, including NLP model fine-tuning, highlight the superiority of clipped versions of AdaGrad/Adam in handling the heavy-tailed noise.
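
To make the idea of combining clipping with an adaptive stepsize concrete, here is a minimal sketch of norm clipping applied before an AdaGrad-Norm style update. It is not the paper's Clip-RAdaGradD (there is no reweighting and no delay); the function names, the scalar accumulator, and the default constants are assumptions made for illustration.

```python
import math


def clip(g, lam):
    """Rescale vector g so that its Euclidean norm is at most lam."""
    norm = math.sqrt(sum(x * x for x in g))
    if norm <= lam or norm == 0.0:
        return list(g)
    return [x * lam / norm for x in g]


def clipped_adagrad_norm(grad_fn, x0, steps, lr=1.0, lam=1.0, b0=1e-8):
    """AdaGrad-Norm with clipped gradients (illustrative sketch, not Clip-RAdaGradD)."""
    x = list(x0)
    acc = b0  # running sum of squared norms of the clipped gradients
    for _ in range(steps):
        g = clip(grad_fn(x), lam)          # clip first, so one outlier cannot dominate acc
        acc += sum(gi * gi for gi in g)
        step = lr / math.sqrt(acc)         # one scalar stepsize shared by all coordinates
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x


# Usage on f(x) = 0.5 * ||x||^2, whose gradient is x itself.
x_final = clipped_adagrad_norm(lambda x: list(x), [5.0, -3.0], steps=200)
print(x_final)
```

In this sketch, each clipped gradient adds at most `lam ** 2` to the accumulator, so a single extreme sample cannot permanently shrink all later stepsizes.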

[LG-131] DeTra: A Unified Model for Object Detection and Trajectory Forecasting

Link: https://arxiv.org/abs/2406.04426
Authors: Sergio Casas, Ben Agro, Jiageng Mao, Thomas Gilles, Alexander Cui, Thomas Li, Raquel Urtasun
Keywords: trajectory forecasting play, autonomous driving, play a crucial, crucial role, role in understanding
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)

Abstract:The tasks of object detection and trajectory forecasting play a crucial role in understanding the scene for autonomous driving. These tasks are typically executed in a cascading manner, making them prone to compounding errors. Furthermore, there is usually a very thin interface between the two tasks, creating a lossy information bottleneck. To address these challenges, our approach formulates the union of the two tasks as a trajectory refinement problem, where the first pose is the detection (current time), and the subsequent poses are the waypoints of the multiple forecasts (future time). To tackle this unified task, we design a refinement transformer that infers the presence, pose, and multi-modal future behaviors of objects directly from LiDAR point clouds and high-definition maps. We call this model DeTra, short for object Detection and Trajectory forecasting. In our experiments, we observe that DeTra outperforms the state-of-the-art on Argoverse 2 Sensor and Waymo Open Dataset by a large margin, across a broad range of metrics. Last but not least, we perform extensive ablation studies that show the value of refinement for this task, that every proposed component contributes positively to its performance, and that key design choices were made.

[LG-132] On Regularization via Early Stopping for Least Squares Regression

Link: https://arxiv.org/abs/2406.04425
Authors: Rishi Sonthalia, Jackie Lok, Elizaveta Rebrova
Keywords: generalization capabilities, understanding the effect, machine learning, learning rate schedule, parameters obtained
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Statistics Theory (math.ST); Machine Learning (stat.ML)

Abstract:A fundamental problem in machine learning is understanding the effect of early stopping on the parameters obtained and the generalization capabilities of the model. Even for linear models, the effect is not fully understood for arbitrary learning rates and data. In this paper, we analyze the dynamics of discrete full batch gradient descent for linear regression. With minimal assumptions, we characterize the trajectory of the parameters and the expected excess risk. Using this characterization, we show that when training with a learning rate schedule $\eta_k$ and a finite time horizon $T$, the early stopped solution $\beta_T$ is equivalent to the minimum norm solution for a generalized ridge regularized problem. We also prove that early stopping is beneficial for generic data with arbitrary spectrum and for a wide variety of learning rate schedules. We provide an estimate for the optimal stopping time and empirically demonstrate the accuracy of our estimate.
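
The link between early stopping and ridge-type shrinkage can be checked numerically in the simplest textbook setting. The sketch below assumes the loss (1/2)||Xβ − y||², gradient descent started from zero, and works one eigendirection of XᵀX at a time; the function names and the example spectrum are illustrative and are not taken from the paper.

```python
def gd_shrinkage(eigval, schedule):
    """Fraction of the least-squares coefficient recovered along one eigendirection
    after running GD from zero with the given learning rate schedule."""
    remaining = 1.0
    for eta in schedule:
        remaining *= 1.0 - eta * eigval
    return 1.0 - remaining


def ridge_shrinkage(eigval, alpha):
    """Shrinkage factor of ridge regression with penalty alpha along the same direction."""
    return eigval / (eigval + alpha)


def matching_ridge_penalty(eigval, schedule):
    """Per-direction penalty whose ridge shrinkage equals the early-stopped GD shrinkage."""
    s = gd_shrinkage(eigval, schedule)
    return eigval * (1.0 - s) / s


# Hypothetical spectrum of X^T X and a constant schedule stopped after T = 10 steps.
eigvals = [4.0, 1.0, 0.1]
schedule = [0.1] * 10
penalties = [matching_ridge_penalty(lam, schedule) for lam in eigvals]
for lam, alpha in zip(eigvals, penalties):
    print(lam, gd_shrinkage(lam, schedule), alpha)
```

Because the matching penalty differs across eigendirections, the equivalent problem in this sketch is a ridge problem with direction-dependent penalties rather than a single scalar one; directions with small eigenvalues receive the largest penalty.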

[LG-133] Enhancing Supervised Visualization through Autoencoder and Random Forest Proximities for Out-of-Sample Extension

Link: https://arxiv.org/abs/2406.04421
Authors: Shuang Ni, Adrien Aumon, Guy Wolf, Kevin R. Moon, Jake S. Rhodes
Keywords: uncover meaningful connections, supervised dimensionality reduction, dimensionality reduction lies, dimensionality reduction, ability to uncover
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 7 pages, 3 figures

Abstract:The value of supervised dimensionality reduction lies in its ability to uncover meaningful connections between data features and labels. Common dimensionality reduction methods embed a set of fixed, latent points, but are not capable of generalizing to an unseen test set. In this paper, we provide an out-of-sample extension method for the random forest-based supervised dimensionality reduction method, RF-PHATE, combining information learned from the random forest model with the function-learning capabilities of autoencoders. Through quantitative assessment of various autoencoder architectures, we identify that networks that reconstruct random forest proximities are more robust for the embedding extension problem. Furthermore, by leveraging proximity-based prototypes, we achieve a 40% reduction in training time without compromising extension quality. Our method does not require label information for out-of-sample points, thus serving as a semi-supervised method, and can achieve consistent quality using only 10% of the training data.

[LG-134] SCMamba: Mamba Meets Multi-View Learning for Time Series Classification

Link: https://arxiv.org/abs/2406.04419
Authors: Md Atik Ahamed, Qiang Cheng
Keywords: Time series classification, multivariate time series, Time series, multivariate time, series classification
Subjects: Machine Learning (cs.LG)

Abstract:Time series classification (TSC) on multivariate time series is a critical problem. We propose a novel multi-view approach integrating frequency-domain and time-domain features to provide complementary contexts for TSC. Our method fuses continuous wavelet transform spectral features with temporal convolutional or multilayer perceptron features. We leverage the Mamba state space model for efficient and scalable sequence modeling. We also introduce a novel tango scanning scheme to better model sequence relationships. Experiments on 10 standard benchmark datasets demonstrate our approach achieves an average 6.45% accuracy improvement over state-of-the-art TSC models.

[LG-135] Aligning Large Language Models with Self-generated Preference Data

Link: https://arxiv.org/abs/2406.04412
Authors: Dongyoung Kim, Kimin Lee, Jinwoo Shin, Jaehyung Kim
Keywords: Aligning large language, Aligning large, large human-annotated preference, human-annotated preference dataset, large language models
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 18 pages, under review

Abstract:Aligning large language models (LLMs) with human preferences has become a key component of obtaining state-of-the-art performance, but constructing a large human-annotated preference dataset is very costly. To tackle this problem, we propose a new framework that boosts the alignment of LLMs through Self-generated Preference data (Selfie) using only a very small amount of human-annotated preference data. Our key idea is leveraging the human prior knowledge within the small (seed) data and progressively improving the alignment of the LLM, by iteratively generating the responses and learning from them with the self-annotated preference data. To be specific, we propose to derive the preference label from the logits of the LLM to explicitly extract the model’s inherent preference. Compared to the previous approaches using external reward models or implicit in-context learning, we observe that the proposed approach is significantly more effective. In addition, we introduce a noise-aware preference learning algorithm to mitigate the risk of low quality within generated preference data. Our experimental results demonstrate that the proposed framework significantly boosts the alignment of LLMs. For example, we achieve superior alignment performance on AlpacaEval 2.0 with only 3.3% of the ground-truth preference labels in the Ultrafeedback data compared to the cases using the entire data or state-of-the-art baselines.

[LG-136] Why Has Predicting Downstream Capabilities of Frontier AI Models with Scale Remained Elusive?

Link: https://arxiv.org/abs/2406.04391
Authors: Rylan Schaeffer, Hailey Schoelkopf, Brando Miranda, Gabriel Mukobi, Varun Madan, Adam Ibrahim, Herbie Bradley, Stella Biderman, Sanmi Koyejo
Keywords: extremely desirable property, desirable property, advanced AI systems, extremely desirable, probability mass
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:Predictable behavior from scaling advanced AI systems is an extremely desirable property. Although a well-established literature exists on how pretraining performance scales, the literature on how particular downstream capabilities scale is significantly muddier. In this work, we take a step back and ask: why has predicting specific downstream capabilities with scale remained elusive? While many factors are certainly responsible, we identify a new factor that makes modeling scaling behavior on widely used multiple-choice question-answering benchmarks challenging. Using five model families and twelve well-established multiple-choice benchmarks, we show that downstream performance is computed from negative log likelihoods via a sequence of transformations that progressively degrade the statistical relationship between performance and scale. We then reveal the mechanism causing this degradation: downstream metrics require comparing the correct choice against a small number of specific incorrect choices, meaning accurately predicting downstream capabilities requires predicting not just how probability mass concentrates on the correct choice with scale, but also how probability mass fluctuates on specific incorrect choices with scale. We empirically study how probability mass on the correct choice co-varies with probability mass on incorrect choices with increasing compute, suggesting that scaling laws for incorrect choices might be achievable. Our work also explains why pretraining scaling laws are commonly regarded as more predictable than downstream capabilities and contributes towards establishing scaling-predictable evaluations of frontier AI models.
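
The mechanism described above, where accuracy depends on the incorrect choices as well as the correct one, can be seen in a small worked example. The sketch below assumes a simplified chain (negative log-likelihood, then probability of each choice, then probability renormalized over the listed choices, then accuracy); the paper's exact sequence of transformations may differ, and the two "models" are made-up numbers.

```python
import math


def choice_metrics(nll_per_choice, correct_index):
    """Turn per-choice negative log-likelihoods into three successive metrics."""
    probs = [math.exp(-nll) for nll in nll_per_choice]   # probability mass on each choice
    p_correct = probs[correct_index]                     # mass on the correct choice
    p_choices_correct = p_correct / sum(probs)           # renormalized over the listed choices
    predicted = max(range(len(probs)), key=probs.__getitem__)
    accuracy = 1.0 if predicted == correct_index else 0.0
    return p_correct, p_choices_correct, accuracy


def nlls(probabilities):
    """Helper for the example: convert probabilities into negative log-likelihoods."""
    return [-math.log(p) for p in probabilities]


# Hypothetical models; the correct choice is index 0 in both cases.
small_model = choice_metrics(nlls([0.20, 0.10, 0.05, 0.05]), correct_index=0)
large_model = choice_metrics(nlls([0.30, 0.35, 0.05, 0.05]), correct_index=0)
print(small_model)
print(large_model)
```

In this toy case the larger model puts more mass on the correct choice, yet its accuracy drops to zero because one specific incorrect choice gained even more mass, so predicting accuracy requires tracking both.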

[LG-137] Innovations in Cover Song Detection: A Lyrics-Based Approach

Link: https://arxiv.org/abs/2406.04384
Authors: Maximilian Balluff, Peter Mandl, Christian Wolff
Keywords: Cover songs, Cover, songs, song, music
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 6 pages, 3 figures

Abstract:Cover songs are alternate versions of a song by a different artist. Long being a vital part of the music industry, cover songs significantly influence music culture and are commonly heard in public venues. The rise of online music platforms has further increased their prevalence, often as background music or video soundtracks. While current automatic identification methods serve adequately for original songs, they are less effective with cover songs, primarily because cover versions often significantly deviate from the original compositions. In this paper, we propose a novel method for cover song detection that utilizes the lyrics of a song. We introduce a new dataset for cover songs and their corresponding originals. The dataset contains 5078 cover songs and 2828 original songs. In contrast to other cover song datasets, it contains the annotated lyrics for the original song and the cover song. We evaluate our method on this dataset and compare it with multiple baseline approaches. Our results show that our method outperforms the baseline approaches.

[LG-138] Improving the Fairness of Deep-Learning Short-term Crime Prediction with Under-reporting-aware Models

Link: https://arxiv.org/abs/2406.04382
Authors: Jiahui Wu, Vanessa Frias-Martinez
Keywords: additional behavioral datasets, past crime data, forecast future crimes, crime predictive tools, data and additional
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 25 pages, 4 figures

Abstract:Deep learning crime predictive tools use past crime data and additional behavioral datasets to forecast future crimes. Nevertheless, these tools have been shown to suffer from unfair predictions across minority racial and ethnic groups. Current approaches to address this unfairness generally propose either pre-processing methods that mitigate the bias in the training datasets by applying corrections to crime counts based on domain knowledge or in-processing methods that are implemented as fairness regularizers to optimize for both accuracy and fairness. In this paper, we propose a novel deep learning architecture that combines the power of these two approaches to increase prediction fairness. Our results show that the proposed model improves the fairness of crime predictions when compared to models with in-processing de-biasing approaches and with models without any type of bias correction, albeit at the cost of reducing accuracy.

[LG-139] Dynamic Online Recommendation for Two-Sided Market with Bayesian Incentive Compatibility

Link: https://arxiv.org/abs/2406.04374
Authors: Yuantong Li, Guang Cheng, Xiaowu Dai
Keywords: Recommender systems play, Recommender systems, effective recommender systems, recommender systems faces, play a crucial
Subjects: Information Retrieval (cs.IR); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Recommender systems play a crucial role in internet economies by connecting users with relevant products or services. However, designing effective recommender systems faces two key challenges: (1) the exploration-exploitation tradeoff in balancing new product exploration against exploiting known preferences, and (2) dynamic incentive compatibility in accounting for users’ self-interested behaviors and heterogeneous preferences. This paper formalizes these challenges into a Dynamic Bayesian Incentive-Compatible Recommendation Protocol (DBICRP). To address the DBICRP, we propose a two-stage algorithm (RCB) that integrates incentivized exploration with an efficient offline learning component for exploitation. In the first stage, our algorithm explores available products while maintaining dynamic incentive compatibility to determine sufficient sample sizes. The second stage employs inverse proportional gap sampling integrated with an arbitrary machine learning method to ensure sublinear regret. Theoretically, we prove that RCB achieves $O(\sqrt{KdT})$ regret and satisfies Bayesian incentive compatibility (BIC) under a Gaussian prior assumption. Empirically, we validate RCB’s strong incentive gain, sublinear regret, and robustness through simulations and a real-world application on personalized warfarin dosing. Our work provides a principled approach for incentive-aware recommendation in online preference learning settings.

[LG-140] Large Language Model Confidence Estimation via Black-Box Access

Link: https://arxiv.org/abs/2406.04370
Authors: Tejaswini Pedapati, Amit Dhurandhar, Soumya Ghosh, Soham Dan, Prasanna Sattigeri
Keywords: significant in evaluating, evaluating trust, Estimating uncertainty, confidence, responses
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Estimating uncertainty or confidence in the responses of a model can be significant in evaluating trust not only in the responses, but also in the model as a whole. In this paper, we explore the problem of estimating confidence for responses of large language models (LLMs) with only black-box or query access to them. We propose a simple and extensible framework in which we engineer novel features and train an interpretable model (viz. logistic regression) on these features to estimate the confidence. We empirically demonstrate that our simple framework is effective in estimating the confidence of flan-ul2, llama-13b and mistral-7b, consistently outperforming existing black-box confidence estimation approaches on benchmark datasets such as TriviaQA, SQuAD, CoQA and Natural Questions, in some cases by over 10% (on AUROC). Additionally, our interpretable approach provides insight into features that are predictive of confidence, leading to the interesting and useful discovery that our confidence models built for one LLM generalize zero-shot across others on a given dataset.

[LG-141] Use of a Multiscale Vision Transformer to predict Nursing Activities Score from Low Resolution Thermal Videos in an Intensive Care Unit

Link: https://arxiv.org/abs/2406.04364
Authors: Isaac YL Lee, Thanh Nguyen-Duc, Ryo Ueno, Jesse Smith, Peter Y Chan
Keywords: Excessive caregiver workload, Intensive Care Unit, poorer patient care, Excessive caregiver, NAS
Subjects: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: 4 pages, 1 figure

Abstract:Excessive caregiver workload in hospital nurses has been implicated in poorer patient care and increased worker burnout. Measurement of this workload in the Intensive Care Unit (ICU) is often done using the Nursing Activities Score (NAS), but this is usually recorded manually and sporadically. Previous work has made use of Ambient Intelligence (AmI) by using computer vision to passively derive caregiver-patient interaction times to monitor staff workload. In this letter, we propose using a Multiscale Vision Transformer (MViT) to passively predict the NAS from low-resolution thermal videos recorded in an ICU. 458 videos were obtained from an ICU in Melbourne, Australia and used to train an MViTv2 model using an indirect prediction and a direct prediction method. The indirect method predicted 1 of 8 potentially identifiable NAS activities from the video before inferring the NAS. The direct method predicted the NAS score immediately from the video. The indirect method yielded an average 5-fold accuracy of 57.21%, an area under the receiver operating characteristic curve (ROC AUC) of 0.865, an F1 score of 0.570 and a mean squared error (MSE) of 28.16. The direct method yielded an MSE of 18.16. We also showed that the MViTv2 outperforms similar models such as R(2+1)D and ResNet50-LSTM under identical settings. This study shows the feasibility of using an MViTv2 to passively predict the NAS in an ICU and monitor staff workload automatically. Our results above also show an increased accuracy in predicting NAS directly versus predicting NAS indirectly. We hope that our study can provide a direction for future work and further improve the accuracy of passive NAS monitoring.

[LG-142] Prompt-guided Precise Audio Editing with Diffusion Models

Link: https://arxiv.org/abs/2406.04350
Authors: Manjie Xu, Chenxing Li, Duzhen Zhang, Dan Su, Wei Liang, Dong Yu
Keywords: involves the arbitrary, arbitrary manipulation, Audio editing involves, diffusion models, audio content
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: Accepted by ICML 2024

Abstract:Audio editing involves the arbitrary manipulation of audio content through precise control. Although text-guided diffusion models have made significant advancements in text-to-audio generation, they still face challenges in finding a flexible and precise way to modify target events within an audio track. We present a novel approach, referred to as PPAE, which serves as a general module for diffusion models and enables precise audio editing. The editing is based on the input textual prompt only and is entirely training-free. We exploit the cross-attention maps of diffusion models to facilitate accurate local editing and employ a hierarchical local-global pipeline to ensure a smoother editing process. Experimental results highlight the effectiveness of our method in various editing tasks.

[LG-143] Gaining Insights into Group-Level Course Difficulty via Differential Course Functioning

Link: https://arxiv.org/abs/2406.04348
Authors: Frederik Baucks, Robin Schmucker, Conrad Borchers, Zachary A. Pardos, Laurenz Wiskott
Keywords: studies curriculum structure, Curriculum Analytics, studies curriculum, curriculum structure, educational programs
Subjects: Computers and Society (cs.CY); Machine Learning (cs.LG); Applications (stat.AP)

Abstract:Curriculum Analytics (CA) studies curriculum structure and student data to ensure the quality of educational programs. One desirable property of courses within curricula is that they are not unexpectedly more difficult for students of different backgrounds. While prior work points to likely variations in course difficulty across student groups, robust methodologies for capturing such variations are scarce, and existing approaches do not adequately decouple course-specific difficulty from students’ general performance levels. The present study introduces Differential Course Functioning (DCF) as an Item Response Theory (IRT)-based CA methodology. DCF controls for student performance levels and examines whether significant differences exist in how distinct student groups succeed in a given course. Leveraging data from over 20,000 students at a large public university, we demonstrate DCF’s ability to detect inequities in undergraduate course difficulty across student groups described by grade achievement. We compare major pairs with high co-enrollment and transfer students to their non-transfer peers. For the former, our findings suggest a link between DCF effect sizes and the alignment of course content to student home department motivating interventions targeted towards improving course preparedness. For the latter, results suggest minor variations in course-specific difficulty between transfer and non-transfer students. While this is desirable, it also suggests that interventions targeted toward mitigating grade achievement gaps in transfer students should encompass comprehensive support beyond enhancing preparedness for individual courses. By providing more nuanced and equitable assessments of academic performance and difficulties experienced by diverse student populations, DCF could support policymakers, course articulation officers, and student advisors.

[LG-144] Weight-based Decomposition: A Case for Bilinear MLPs

Link: https://arxiv.org/abs/2406.03947
Authors: Michael T. Pearce, Thomas Dooms, Alice Rigg
Keywords: Gated Linear Units, common building block, Gated Linear, Linear Units, modern foundation models
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Gated Linear Units (GLUs) have become a common building block in modern foundation models. Bilinear layers drop the non-linearity in the “gate” but still have comparable performance to other GLUs. An attractive quality of bilinear layers is that they can be fully expressed in terms of a third-order tensor and linear operations. Leveraging this, we develop a method to decompose the bilinear tensor into a set of sparsely interacting eigenvectors that show promising interpretability properties in preliminary experiments for shallow image classifiers (MNIST) and small language models (Tiny Stories). Since the decomposition is fully equivalent to the model’s original computations, bilinear layers may be an interpretability-friendly architecture that helps connect features to the model weights. Application of our method may not be limited to pretrained bilinear models since we find that language models such as TinyLlama-1.1B can be finetuned into bilinear variants.
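
The statement that a bilinear layer can be fully expressed as a third-order tensor can be verified directly on a tiny example. The sketch below assumes a bias-free layer of the form (Wx) ⊙ (Vx), with weights chosen arbitrarily for illustration; the eigenvector decomposition from the paper is not reproduced here.

```python
def bilinear_layer(W, V, x):
    """Bias-free bilinear layer: elementwise product of two linear maps of x."""
    Wx = [sum(w * xi for w, xi in zip(row, x)) for row in W]
    Vx = [sum(v * xi for v, xi in zip(row, x)) for row in V]
    return [a * b for a, b in zip(Wx, Vx)]


def bilinear_tensor(W, V):
    """Third-order tensor B with B[k][i][j] = W[k][i] * V[k][j]."""
    n = len(W[0])
    return [[[W[k][i] * V[k][j] for j in range(n)] for i in range(n)]
            for k in range(len(W))]


def apply_tensor(B, x):
    """Compute out[k] = sum over i, j of B[k][i][j] * x[i] * x[j]."""
    n = len(x)
    return [sum(B[k][i][j] * x[i] * x[j] for i in range(n) for j in range(n))
            for k in range(len(B))]


# Arbitrary illustrative weights: 3 inputs, 2 outputs.
W = [[1.0, -2.0, 0.5], [0.0, 3.0, 1.0]]
V = [[2.0, 0.0, -1.0], [1.0, 1.0, 1.0]]
x = [1.0, 2.0, -1.0]

direct = bilinear_layer(W, V, x)
via_tensor = apply_tensor(bilinear_tensor(W, V), x)
print(direct, via_tensor)
```

Both routes give the same output, so any analysis performed on the tensor `B` is an analysis of the layer's actual computation rather than an approximation of it.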

[LG-145] Adapting Physics-Informed Neural Networks To Optimize ODEs in Mosquito Population Dynamics

Link: https://arxiv.org/abs/2406.05108
Authors: Dinh Viet Cuong, Branislava Lalić, Mina Petrić, Binh Nguyen, Mark Roantree
Keywords: Physics informed neural, incorporate physics laws, informed neural networks, gaining popularity due, Physics informed
Subjects: Populations and Evolution (q-bio.PE); Machine Learning (cs.LG)

Abstract:Physics informed neural networks have been gaining popularity due to their unique ability to incorporate physics laws into data-driven models, ensuring that the predictions are not only consistent with empirical data but also align with domain-specific knowledge in the form of physics equations. The integration of physics principles enables the method to require less data while maintaining the robustness of deep learning in modeling complex dynamical systems. However, current PINN frameworks are not sufficiently mature for real-world ODE systems, especially those with extreme multi-scale behavior such as mosquito population dynamical modelling. In this research, we propose a PINN framework with several improvements for forward and inverse problems for ODE systems with a case study application in modelling the dynamics of mosquito populations. The framework tackles the gradient imbalance and stiff problems posed by mosquito ordinary differential equations. The method offers a simple but effective way to resolve the time causality issue in PINNs by gradually expanding the training time domain until it covers the entire domain of interest. As part of a robust evaluation, we conduct experiments using simulated data to evaluate the effectiveness of the approach. Preliminary results indicate that physics-informed machine learning holds significant potential for advancing the study of ecological systems.

[LG-146] Progressive Entropic Optimal Transport Solvers

Link: https://arxiv.org/abs/2406.05061
Authors: Parnian Kassraie, Aram-Alexandre Pooladian, Michal Klein, James Thornton, Jonathan Niles-Weed, Marco Cuturi
Keywords: profoundly impacted machine, impacted machine learning, realign datasets, profoundly impacted, impacted machine
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 18 pages, 7 figures

Abstract:Optimal transport (OT) has profoundly impacted machine learning by providing theoretical and computational tools to realign datasets. In this context, given two large point clouds of sizes $n$ and $m$ in $\mathbb{R}^d$, entropic OT (EOT) solvers have emerged as the most reliable tool to either solve the Kantorovich problem and output an $n \times m$ coupling matrix, or to solve the Monge problem and learn a vector-valued push-forward map. While the robustness of EOT couplings/maps makes them a go-to choice in practical applications, EOT solvers remain difficult to tune because of a small but influential set of hyperparameters, notably the omnipresent entropic regularization strength $\varepsilon$. Setting $\varepsilon$ can be difficult, as it simultaneously impacts various performance metrics, such as compute speed, statistical performance, generalization, and bias. In this work, we propose a new class of EOT solvers (ProgOT), that can estimate both plans and transport maps. We take advantage of several opportunities to optimize the computation of EOT solutions by dividing mass displacement using a time discretization, borrowing inspiration from dynamic OT formulations, and conquering each of these steps using EOT with properly scheduled parameters. We provide experimental evidence demonstrating that ProgOT is a faster and more robust alternative to standard solvers when computing couplings at large scales, even outperforming neural network-based approaches. We also prove statistical consistency of our approach for estimating optimal transport maps.
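
For readers unfamiliar with entropic OT, the following is a minimal sketch of the standard Sinkhorn iteration that produces an n × m coupling matrix and shows how the regularization strength changes the result. It is plain Sinkhorn, not ProgOT; the point clouds, weights, and values of `eps` are made up for illustration.

```python
import math


def sinkhorn(a, b, cost, eps, iters=500):
    """Entropic OT by Sinkhorn iterations; returns the coupling matrix as nested lists."""
    n, m = len(a), len(b)
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]  # Gibbs kernel
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]  # match row marginals
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]  # match column marginals
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


def transport_cost(plan, cost):
    """Total cost of a coupling: sum of plan[i][j] * cost[i][j]."""
    return sum(p * c for prow, crow in zip(plan, cost) for p, c in zip(prow, crow))


# Hypothetical 1-D point clouds with uniform weights and squared-distance cost.
xs, ys = [0.0, 1.0, 2.0], [0.5, 1.5]
a, b = [1.0 / 3.0] * 3, [0.5] * 2
cost = [[(x - y) ** 2 for y in ys] for x in xs]

plan_sharp = sinkhorn(a, b, cost, eps=0.5)
plan_blurry = sinkhorn(a, b, cost, eps=5.0)
print(transport_cost(plan_sharp, cost), transport_cost(plan_blurry, cost))
```

With the larger `eps` the coupling spreads mass more evenly and its transport cost is higher, which gives a small-scale view of why the choice of $\varepsilon$ trades off several properties at once.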

[LG-147] Root Cause Analysis of Outliers with Missing Structural Knowledge

Link: https://arxiv.org/abs/2406.05014
Authors: Nastaran Okati, Sergio Hernan Garrido Mejia, William Roy Orchard, Patrick Blöbaum, Dominik Janzing
Keywords: Recent work conceptualized, Recent work, work conceptualized root, quantitative contribution analysis, structural causal models
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Recent work conceptualized root cause analysis (RCA) of anomalies via quantitative contribution analysis using causal counterfactuals in structural causal models (SCMs). The framework comes with three practical challenges: (1) it requires the causal directed acyclic graph (DAG), together with an SCM, (2) it is statistically ill-posed since it probes regression models in regions of low probability density, (3) it relies on Shapley values which are computationally expensive to find. In this paper, we propose simplified, efficient methods of root cause analysis when the task is to identify a unique root cause instead of quantitative contribution analysis. Our proposed methods run in linear order of SCM nodes and they require only the causal DAG without counterfactuals. Furthermore, for those use cases where the causal DAG is unknown, we justify the heuristic of identifying root causes as the variables with the highest anomaly score.
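
The heuristic mentioned for the case where the causal DAG is unknown (pick the variable with the highest anomaly score) is simple enough to sketch. The example below assumes an absolute z-score as the anomaly score, which is an illustrative choice and may not be the score the paper uses; the variable names and measurements are hypothetical.

```python
import statistics


def anomaly_scores(history, observation):
    """Absolute z-score of each observed variable relative to its own history."""
    scores = {}
    for name, values in history.items():
        mu = statistics.mean(values)
        sigma = statistics.stdev(values)
        scores[name] = abs(observation[name] - mu) / sigma
    return scores


def likely_root_cause(history, observation):
    """Return the variable with the highest anomaly score, plus all scores."""
    scores = anomaly_scores(history, observation)
    return max(scores, key=scores.get), scores


# Hypothetical monitoring data for a chain db_latency -> api_latency -> page_load.
history = {
    "db_latency": [10, 11, 9, 10, 10],
    "api_latency": [50, 52, 48, 51, 49],
    "page_load": [200, 205, 195, 202, 198],
}
observation = {"db_latency": 30, "api_latency": 72, "page_load": 223}

root, scores = likely_root_cause(history, observation)
print(root, scores)
```

In this toy chain all three variables are anomalous, but the upstream variable is the most extreme relative to its own history, so the heuristic selects it without using the graph.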

[LG-148] On the social bias of speech self-supervised models

Link: https://arxiv.org/abs/2406.04997
Authors: Yi-Cheng Lin, Tzu-Quan Lin, Hsi-Che Lin, Andy T. Liu, Hung-yi Lee
Keywords: raise significant concerns, achieved remarkable performance, Self-supervised learning, affecting marginalized groups, raise significant
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG)
Comments: Accepted by INTERSPEECH 2024

Abstract:Self-supervised learning (SSL) speech models have achieved remarkable performance in various tasks, yet the biased outcomes, especially affecting marginalized groups, raise significant concerns. Social bias refers to the phenomenon where algorithms potentially amplify disparate properties between social groups present in the data used for training. Bias in SSL models can perpetuate injustice by automating discriminatory patterns and reinforcing inequitable systems. This work reveals that prevalent SSL models inadvertently acquire biased associations. We probe how various factors, such as model architecture, size, and training methodologies, influence the propagation of social bias within these models. Finally, we explore the efficacy of debiasing SSL models through regularization techniques, specifically via model compression. Our findings reveal that employing techniques such as row-pruning and training wider, shallower models can effectively mitigate social bias within SSL models.

[LG-149] Development and Validation of a Deep-Learning Model for Differential Treatment Benefit Prediction for Adults with Major Depressive Disorder Deployed in the Artificial Intelligence in Depression Medication Enhancement (AIDME) Study

Link: https://arxiv.org/abs/2406.04993
Authors: David Benrimoh, Caitrin Armstrong, Joseph Mehltretter, Robert Fratila, Kelly Perlman, Sonia Israel, Adam Kapelner, Sagar V. Parikh, Jordan F. Karp, Katherine Heller, Gustavo Turecki
Keywords: Major Depressive Disorder, Depressive Disorder, Major Depressive, Depression Medication Enhancement, artificial intelligence
Subjects: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG)

Abstract:INTRODUCTION: The pharmacological treatment of Major Depressive Disorder (MDD) relies on a trial-and-error approach. We introduce an artificial intelligence (AI) model aiming to personalize treatment and improve outcomes, which was deployed in the Artificial Intelligence in Depression Medication Enhancement (AIDME) Study. OBJECTIVES: 1) Develop a model capable of predicting probabilities of remission across multiple pharmacological treatments for adults with at least moderate major depression. 2) Validate model predictions and examine them for amplification of harmful biases. METHODS: Data from previous clinical trials of antidepressant medications were standardized into a common framework and included 9,042 adults with moderate to severe major depression. Feature selection retained 25 clinical and demographic variables. Using Bayesian optimization, a deep learning model was trained on the training set, refined using the validation set, and tested once on the held-out test set. RESULTS: In the evaluation on the held-out test set, the model achieved an AUC of 0.65. The model outperformed a null model on the test set (p = 0.01). The model demonstrated clinical utility, achieving an absolute improvement in population remission rate in hypothetical and actual improvement testing. While the model did identify one drug (escitalopram) as generally outperforming the other drugs (consistent with the input data), there was otherwise significant variation in drug rankings. On bias testing, the model did not amplify potentially harmful biases. CONCLUSIONS: We demonstrate the first model capable of predicting outcomes for 10 different treatment options for patients with MDD, intended to be used at or near the start of treatment to personalize treatment. The model was put into clinical practice during the AIDME randomized controlled trial whose results are reported separately.

[LG-150] Stochastic full waveform inversion with deep generative prior for uncertainty quantification

Link: https://arxiv.org/abs/2406.04859
Authors: Yuke Xie,Hervé Chauris,Nicolas Desassis
Keywords: Full Waveform Inversion, Full Waveform, obtain high-resolution images, Waveform Inversion, serve as crucial
Categories: Geophysics (physics.geo-ph); Machine Learning (cs.LG); Computation (stat.CO)

Abstract:To obtain high-resolution images of subsurface structures from seismic data, seismic imaging techniques such as Full Waveform Inversion (FWI) serve as crucial tools. However, FWI involves solving a nonlinear and often non-unique inverse problem, presenting challenges such as local minima trapping and inadequate handling of inherent uncertainties. In addressing these challenges, we propose leveraging deep generative models as the prior distribution of geophysical parameters for stochastic Bayesian inversion. This approach integrates the adjoint state gradient for efficient back-propagation from the numerical solution of partial differential equations. Additionally, we introduce explicit and implicit variational Bayesian inference methods. The explicit method computes variational distribution density using a normalizing flow-based neural network, enabling computation of the Bayesian posterior of parameters. Conversely, the implicit method employs an inference network attached to a pretrained generative model to estimate density, incorporating an entropy estimator. Furthermore, we also experimented with the Stein Variational Gradient Descent (SVGD) method as another variational inference technique, using particles. We compare these variational Bayesian inference methods with conventional Markov chain Monte Carlo (McMC) sampling. Each method is able to quantify uncertainties and to generate seismic data-conditioned realizations of subsurface geophysical parameters. This framework provides insights into subsurface structures while accounting for inherent uncertainties.

[LG-151] Diffusion-based Generative Image Outpainting for Recovery of FOV-Truncated CT Images

Link: https://arxiv.org/abs/2406.04769
Authors: Michelle Espranita Liman,Daniel Rueckert,Florian J. Fintelmann,Philip Müller
Keywords: body composition analysis, subcutaneous adipose tissue, accurate body composition, involves quantifying skeletal, quantifying skeletal muscle
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Shared last authorship: Florian J. Fintelmann and Philip Müller

Abstract:Field-of-view (FOV) recovery of truncated chest CT scans is crucial for accurate body composition analysis, which involves quantifying skeletal muscle and subcutaneous adipose tissue (SAT) on CT slices. This, in turn, enables disease prognostication. Here, we present a method for recovering truncated CT slices using generative image outpainting. We train a diffusion model and apply it to truncated CT slices generated by simulating a small FOV. Our model reliably recovers the truncated anatomy and outperforms the previous state-of-the-art despite being trained on 87% less data.

[LG-152] Efficient Continual Finite-Sum Minimization

Link: https://arxiv.org/abs/2406.04731
Authors: Ioannis Mavrothalassitis,Stratis Skoulakis,Leello Tadesse Dadi,Volkan Cevher
Keywords: mathcal, finite-sum minimization seeks, finite-sum minimization, star, epsilon
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG)
Comments: Accepted in ICLR 2024, 35 pages

Abstract:Given a sequence of functions $f_1,\ldots,f_n$ with $f_i:\mathcal{D}\mapsto \mathbb{R}$, finite-sum minimization seeks a point $x^\star \in \mathcal{D}$ minimizing $\sum_{j=1}^n f_j(x)/n$. In this work, we propose a key twist into the finite-sum minimization, dubbed as continual finite-sum minimization, that asks for a sequence of points $x_1^\star,\ldots,x_n^\star \in \mathcal{D}$ such that each $x^\star_i \in \mathcal{D}$ minimizes the prefix-sum $\sum_{j=1}^i f_j(x)/i$. Assuming that each prefix-sum is strongly convex, we develop a first-order continual stochastic variance reduction gradient method ($\mathrm{CSVRG}$) producing an $\epsilon$-optimal sequence with $\tilde{\mathcal{O}}(n/\epsilon^{1/3} + 1/\sqrt{\epsilon})$ overall first-order oracles (FO). An FO corresponds to the computation of a single gradient $\nabla f_j(x)$ at a given $x \in \mathcal{D}$ for some $j \in [n]$. Our approach significantly improves upon the $\mathcal{O}(n/\epsilon)$ FOs that $\mathrm{StochasticGradientDescent}$ requires and the $\mathcal{O}(n^2 \log (1/\epsilon))$ FOs that state-of-the-art variance reduction methods such as $\mathrm{Katyusha}$ require. We also prove that there is no natural first-order method with $\mathcal{O}\left(n/\epsilon^\alpha\right)$ gradient complexity for $\alpha < 1/4$, establishing that the first-order complexity of our method is nearly tight.
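
To make the problem statement concrete, here is a toy instance, not the CSVRG method from the paper. It assumes one-dimensional quadratics f_j(x) = (x - a_j)^2, for which the minimizer of each prefix average is simply the running mean of a_1, ..., a_i; the function name and inputs are hypothetical.

```python
from itertools import accumulate


def prefix_minimizers(targets):
    """Exact minimizers of the prefix averages of f_j(x) = (x - a_j)**2.

    For these toy quadratics the i-th prefix minimizer is the mean of the
    first i targets, so no gradient oracle is needed at all.
    """
    running_sums = accumulate(targets)
    return [total / i for i, total in enumerate(running_sums, start=1)]


# Hypothetical targets a_1, a_2, a_3.
points = prefix_minimizers([2.0, 4.0, 9.0])
print(points)  # [2.0, 3.0, 5.0]
```

In general no closed form exists, which is why the number of first-order oracle calls needed to track the whole sequence of prefix minimizers becomes the quantity of interest.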

[LG-153] GenzIQA: Generalized Image Quality Assessment using Prompt-Guided Latent Diffusion Models

Link: https://arxiv.org/abs/2406.04654
Authors: Diptanu De,Shankhanil Mitra,Rajiv Soundararajan
Keywords: calibrate user experiences, modern visual systems, image quality assessment, design of no-reference, quality assessment
Categories: Image and Video Processing (eess.IV); Machine Learning (cs.LG)

Abstract:The design of no-reference (NR) image quality assessment (IQA) algorithms is extremely important to benchmark and calibrate user experiences in modern visual systems. A major drawback of state-of-the-art NR-IQA methods is their limited ability to generalize across diverse IQA settings with reasonable distribution shifts. Recent text-to-image generative models such as latent diffusion models generate meaningful visual concepts with fine details related to text concepts. In this work, we leverage the denoising process of such diffusion models for generalized IQA by understanding the degree of alignment between learnable quality-aware text prompts and images. In particular, we learn cross-attention maps from intermediate layers of the denoiser of latent diffusion models to capture quality-aware representations of images. In addition, we also introduce learnable quality-aware text prompts that enable the cross-attention features to be better quality-aware. Our extensive cross database experiments across various user-generated, synthetic, and low-light content-based benchmarking databases show that latent diffusion models can achieve superior generalization in IQA when compared to other methods in the literature.

[LG-154] Convergence Analysis of Adaptive Gradient Methods under Refined Smoothness and Noise Assumptions

Link: https://arxiv.org/abs/2406.04592
Authors: Devyani Maladkar,Ruichen Jiang,Aryan Mokhtari
Keywords: Adaptive gradient methods, neural network training, successful optimization algorithms, Adaptive gradient, gradient methods
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 21 pages

Abstract:Adaptive gradient methods are arguably the most successful optimization algorithms for neural network training. While it is well-known that adaptive gradient methods can achieve better dimensional dependence than stochastic gradient descent (SGD) under favorable geometry for stochastic convex optimization, the theoretical justification for their success in stochastic non-convex optimization remains elusive. In this paper, we aim to close this gap by analyzing the convergence rates of AdaGrad measured by the $\ell_1$-norm of the gradient. Specifically, when the objective has $L$-Lipschitz gradient and the stochastic gradient variance is bounded by $\sigma^2$, we prove a worst-case convergence rate of $\tilde{\mathcal{O}}\left(\frac{\sqrt{d}L}{\sqrt{T}} + \frac{\sqrt{d}\sigma}{T^{1/4}}\right)$, where $d$ is the dimension of the problem. We also present a lower bound of $\Omega\left(\frac{\sqrt{d}}{\sqrt{T}}\right)$ for minimizing the gradient $\ell_1$-norm in the deterministic setting, showing the tightness of our upper bound in the noiseless case. Moreover, under more fine-grained assumptions on the smoothness structure of the objective and the gradient noise and under favorable gradient $\ell_1/\ell_2$ geometry, we show that AdaGrad can potentially shave a factor of $\sqrt{d}$ compared to SGD. To the best of our knowledge, this is the first result for adaptive gradient methods that demonstrates a provable gain over SGD in the non-convex setting.
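
For readers unfamiliar with the algorithm being analyzed, here is a minimal sketch of the standard diagonal AdaGrad update, together with the gradient l1-norm used as the progress measure. It assumes exact (noise-free) gradients and a toy quadratic objective; the step size, the objective, and all names are illustrative choices, not the paper's experimental setup.

```python
import math


def adagrad(grad, x0, lr=0.5, steps=200, eps=1e-8):
    """Diagonal AdaGrad: each coordinate is scaled by the root of its
    accumulated squared gradients."""
    x = list(x0)
    accum = [0.0] * len(x)
    for _ in range(steps):
        g = grad(x)
        for i, gi in enumerate(g):
            accum[i] += gi * gi
            x[i] -= lr * gi / (math.sqrt(accum[i]) + eps)
    return x


def l1_norm(vector):
    """Sum of absolute values, the gradient measure named in the abstract."""
    return sum(abs(c) for c in vector)


def toy_grad(x):
    """Gradient of the assumed toy objective f(x) = 0.5 * (x0**2 + 10 * x1**2)."""
    return [x[0], 10.0 * x[1]]


x_start = [1.0, 1.0]
x_final = adagrad(toy_grad, x_start)
print(l1_norm(toy_grad(x_start)), l1_norm(toy_grad(x_final)))
```

The per-coordinate scaling makes both coordinates shrink at the same rate here even though their curvatures differ by a factor of ten, which gives some intuition for why coordinate-wise geometry matters in the analysis.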

[LG-155] Generative Assignment Flows for Representing and Learning Joint Distributions of Discrete Data

Link: https://arxiv.org/abs/2406.04527
Authors: Bastian Boll,Daniel Gonzalez-Alvarado,Stefania Petra,Christoph Schnörr
Keywords: possibly large number, joint probability distributions, discrete random variables, possibly large, discrete joint distributions
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:We introduce a novel generative model for the representation of joint probability distributions of a possibly large number of discrete random variables. The approach uses measure transport by randomized assignment flows on the statistical submanifold of factorizing distributions, which also enables to sample efficiently from the target distribution and to assess the likelihood of unseen data points. The embedding of the flow via the Segre map in the meta-simplex of all discrete joint distributions ensures that any target distribution can be represented in principle, whose complexity in practice only depends on the parametrization of the affinity function of the dynamical assignment flow system. Our model can be trained in a simulation-free manner without integration by conditional Riemannian flow matching, using the training data encoded as geodesics in closed-form with respect to the e-connection of information geometry. By projecting high-dimensional flow matching in the meta-simplex of joint distributions to the submanifold of factorizing distributions, our approach has strong motivation from first principles of modeling coupled discrete variables. Numerical experiments devoted to distributions of structured image labelings demonstrate the applicability to large-scale problems, which may include discrete distributions in other application areas. Performance measures show that our approach scales better with the increasing number of classes than recent related work.

[LG-156] Learning Optimal Linear Precoding for Cell-Free Massive MIMO with GNN

Link: https://arxiv.org/abs/2406.04456
Authors: Benjamin Parlier,Lou Salaün,Hong Yang
Keywords: Massive MIMO system, Cell-Free Massive MIMO, Massive MIMO, MIMO system, minimal downlink user
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted in the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD) 2024

Abstract:We develop a graph neural network (GNN) to compute, within a time budget of 1 to 2 milliseconds required by practical systems, the optimal linear precoder (OLP) maximizing the minimal downlink user data rate for a Cell-Free Massive MIMO system - a key 6G wireless technology. The state-of-the-art method is a bisection search on a second-order cone programming feasibility test (B-SOCP), which is an order of magnitude too slow for practical systems. Our approach relies on representing OLP as a node-level prediction task on a graph. We construct a graph that accurately captures the interdependence relation between access points (APs) and user equipments (UEs), and the permutation equivariance of the Max-Min problem. Our neural network, named OLP-GNN, is trained on data obtained by B-SOCP. We tailor the OLP-GNN size, together with several artful data preprocessing and postprocessing methods to meet the runtime requirement. We show by extensive simulations that it achieves near optimal spectral efficiency in a range of scenarios with different numbers of APs and UEs, and for both line-of-sight and non-line-of-sight radio propagation environments.
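
The baseline mentioned in this abstract is a bisection search over a feasibility test. The sketch below shows that general pattern only, with a deliberately simple stand-in for the feasibility check: the paper's test is a second-order cone program, whereas this toy assumes each user's rate is gain times power under a shared power budget. All names and numbers are hypothetical.

```python
def bisection_max_min(feasible, lo, hi, tol=1e-9):
    """Largest target t in [lo, hi] with feasible(t) true, assuming that
    feasibility is monotone (feasible below some threshold, infeasible above)."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if feasible(mid):
            lo = mid  # target reachable, try a higher minimum rate
        else:
            hi = mid  # target unreachable, lower it
    return lo


# Toy stand-in for the feasibility test: user k gets rate gains[k] * power[k],
# and the total power must not exceed the budget.
gains = [1.0, 2.0, 4.0]
budget = 7.0


def feasible(target_rate):
    """Can every user reach target_rate without exceeding the power budget?"""
    return sum(target_rate / g for g in gains) <= budget


best = bisection_max_min(feasible, 0.0, 100.0)
print(round(best, 6))  # 4.0
```

Each bisection step needs one feasibility check, so when that check is a full optimization solve the total runtime adds up quickly, which is the cost the learned approach aims to avoid.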

[LG-157] Improving Model Chain Approaches for Probabilistic Solar Energy Forecasting through Post-processing and Machine Learning

Link: https://arxiv.org/abs/2406.04424
Authors: Nina Horat,Sina Klerings,Sebastian Lerch
Keywords: additional weather variables, numerical weather prediction, model chain, solar power, weather prediction models
Categories: Applications (stat.AP); Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (stat.ML)

Abstract:Weather forecasts from numerical weather prediction models play a central role in solar energy forecasting, where a cascade of physics-based models is used in a model chain approach to convert forecasts of solar irradiance to solar power production, using additional weather variables as auxiliary information. Ensemble weather forecasts aim to quantify uncertainty in the future development of the weather, and can be used to propagate this uncertainty through the model chain to generate probabilistic solar energy predictions. However, ensemble prediction systems are known to exhibit systematic errors, and thus require post-processing to obtain accurate and reliable probabilistic forecasts. The overarching aim of our study is to systematically evaluate different strategies to apply post-processing methods in model chain approaches: not applying any post-processing at all; post-processing only the irradiance predictions before the conversion; post-processing only the solar power predictions obtained from the model chain; or applying post-processing in both steps. In a case study based on a benchmark dataset for the Jacumba solar plant in the U.S., we develop statistical and machine learning methods for post-processing ensemble predictions of global horizontal irradiance and solar power generation. Further, we propose a neural network-based model for direct solar power forecasting that bypasses the model chain. Our results indicate that post-processing substantially improves the solar power generation forecasts, in particular when post-processing is applied to the power predictions. The machine learning methods for post-processing yield slightly better probabilistic forecasts, and the direct forecasting approach performs comparably to the post-processing strategies.

[LG-158] TIDMAD: Time Series Dataset for Discovering Dark Matter with AI Denoising

Link: https://arxiv.org/abs/2406.04378
Authors: J. T. Fry,Aobo Li,Lindley Winslow,Xinyi Hope Fu,Zhenghao Fu,Kaliroe M. W. Pappas
Keywords: Dark matter, laboratory on Earth, Dark matter makes, dark matter search, Dark
Categories: Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex)

Abstract:Dark matter makes up approximately 85% of total matter in our universe, yet it has never been directly observed in any laboratory on Earth. The origin of dark matter is one of the most important questions in contemporary physics, and a convincing detection of dark matter would be a Nobel-Prize-level breakthrough in fundamental science. The ABRACADABRA experiment was specifically designed to search for dark matter. Although it has not yet made a discovery, ABRACADABRA has produced several dark matter search results widely endorsed by the physics community. The experiment generates ultra-long time-series data at a rate of 10 million samples per second, where the dark matter signal would manifest itself as a sinusoidal oscillation mode within the ultra-long time series. In this paper, we present the TIDMAD – a comprehensive data release from the ABRACADABRA experiment including three key components: an ultra-long time series dataset divided into training, validation, and science subsets; a carefully-designed denoising score for direct model benchmarking; and a complete analysis framework which produces a community-standard dark matter search result suitable for publication as a physics paper. This data release enables core AI algorithms to extract the signal and produce real physics results thereby advancing fundamental science. The data downloading and associated analysis scripts are available at this https URL

[LG-159] Combining Graph Neural Network and Mamba to Capture Local and Global Tissue Spatial Relationships in Whole Slide Images

Link: https://arxiv.org/abs/2406.04377
Authors: Ruiwen Ding,Kha-Dinh Luong,Erika Rodriguez,Ana Cristina Araujo Lemos da Silva,William Hsu
Keywords: computational pathology, slide images, fundamental task, large size, gigapixel whole slide
Categories: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

Abstract:In computational pathology, extracting spatial features from gigapixel whole slide images (WSIs) is a fundamental task, but due to their large size, WSIs are typically segmented into smaller tiles. A critical aspect of this analysis is aggregating information from these tiles to make predictions at the WSI level. We introduce a model that combines a message-passing graph neural network (GNN) with a state space model (Mamba) to capture both local and global spatial relationships among the tiles in WSIs. The model’s effectiveness was demonstrated in predicting progression-free survival among patients with early-stage lung adenocarcinomas (LUAD). We compared the model with other state-of-the-art methods for tile-level information aggregation in WSIs, including tile-level information summary statistics-based aggregation, multiple instance learning (MIL)-based aggregation, GNN-based aggregation, and GNN-transformer-based aggregation. Additional experiments showed the impact of different types of node features and different tile sampling strategies on the model performance. This work can be easily extended to any WSI-based analysis. Code: this https URL.

[LG-160] Physics-enhanced Neural Operator for Simulating Turbulent Transport

Link: https://arxiv.org/abs/2406.04367
Authors: Shengyu Chen,Peyman Givi,Can Zheng,Xiaowei Jia
Keywords: including climate science, energy-efficient manufacturing processes, freshwater science, climate science, engineering fields
Categories: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
Comments: 13 pages

Abstract:The precise simulation of turbulent flows is of immense importance in a variety of scientific and engineering fields, including climate science, freshwater science, and the development of energy-efficient manufacturing processes. Within the realm of turbulent flow simulation, direct numerical simulation (DNS) is widely considered to be the most reliable approach, but it is prohibitively expensive for long-term simulation at fine spatial scales. Given the pressing need for efficient simulation, there is an increasing interest in building machine learning models for turbulence, either by reconstructing DNS from alternative low-fidelity simulations or by predicting DNS based on the patterns learned from historical data. However, standard machine learning techniques remain limited in capturing complex spatio-temporal characteristics of turbulent flows, resulting in limited performance and generalizability. This paper presents a novel physics-enhanced neural operator (PENO) that incorporates physical knowledge of partial differential equations (PDEs) to accurately model flow dynamics. The model is further refined by a self-augmentation mechanism to reduce the accumulated error in long-term simulations. The proposed method is evaluated through its performance on two distinct sets of 3D turbulent flow data, showcasing the model’s capability to reconstruct high-resolution DNS data, maintain the inherent physical properties of flow transport, and generate flow simulations across various resolutions. Additionally, experimental results on multiple 2D vorticity flow series, generated by different PDEs, highlight the transferability and generalizability of the proposed method. This confirms its applicability to a wide range of real-world scenarios in which extensive simulations are needed under diverse settings.

Information Retrieval

[IR-0] Corpus Poisoning via Approximate Greedy Gradient Descent

Link: https://arxiv.org/abs/2406.05087
Authors: Jinyan Su,John X. Morris,Preslav Nakov,Claire Cardie
Keywords: Retrieval-Augmented Generation, knowledge intensive areas, successfully extended, intensive areas, Greedy Gradient Descent
Categories: Information Retrieval (cs.IR)

Abstract:Dense retrievers are widely used in information retrieval and have also been successfully extended to other knowledge intensive areas such as language models, e.g., Retrieval-Augmented Generation (RAG) systems. Unfortunately, they have recently been shown to be vulnerable to corpus poisoning attacks in which a malicious user injects a small fraction of adversarial passages into the retrieval corpus to trick the system into returning these passages among the top-ranked results for a broad set of user queries. Further study is needed to understand the extent to which these attacks could limit the deployment of dense retrievers in real-world applications. In this work, we propose Approximate Greedy Gradient Descent (AGGD), a new attack on dense retrieval systems based on the widely used HotFlip method for efficiently generating adversarial passages. We demonstrate that AGGD can select a higher quality set of token-level perturbations than HotFlip by replacing its random token sampling with a more structured search. Experimentally, we show that our method achieves a high attack success rate on several datasets and using several retrievers, and can generalize to unseen queries and new domains. Notably, our method is extremely effective in attacking the ANCE retrieval model, achieving attack success rates that are 17.6% and 13.37% higher on the NQ and MS MARCO datasets, respectively, compared to HotFlip. Additionally, we demonstrate AGGD’s potential to replace HotFlip in other adversarial attacks, such as knowledge poisoning of RAG systems. Code can be found at this https URL

[IR-1] Multi-Head RAG: Solving Multi-Aspect Problems with LLMs

Link: https://arxiv.org/abs/2406.05085
Authors: Maciej Besta,Ales Kubicek,Roman Niggli,Robert Gerstenberger,Lucas Weitzendorf,Mingyuan Chi,Patrick Iff,Joanna Gajda,Piotr Nyczyk,Jürgen Müller,Hubert Niewiadomski,Marcin Chrapek,Michał Podstawski,Torsten Hoefler
Keywords: Large Language Models, Retrieval Augmented Generation, Augmented Generation, Language Models, Large Language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Abstract:Retrieval Augmented Generation (RAG) enhances the abilities of Large Language Models (LLMs) by enabling the retrieval of documents into the LLM context to provide more accurate and relevant responses. Existing RAG solutions do not focus on queries that may require fetching multiple documents with substantially different contents. Such queries occur frequently, but are challenging because the embeddings of these documents may be distant in the embedding space, making it hard to retrieve them all. This paper introduces Multi-Head RAG (MRAG), a novel scheme designed to address this gap with a simple yet powerful idea: leveraging activations of Transformer’s multi-head attention layer, instead of the decoder layer, as keys for fetching multi-aspect documents. The driving motivation is that different attention heads can learn to capture different data aspects. Harnessing the corresponding activations results in embeddings that represent various facets of data items and queries, improving the retrieval accuracy for complex queries. We provide an evaluation methodology and metrics, synthetic datasets, and real-world use cases to demonstrate MRAG’s effectiveness, showing improvements of up to 20% in relevance over standard RAG baselines. MRAG can be seamlessly integrated with existing RAG frameworks and benchmarking tools like RAGAS as well as different classes of data stores.

[IR-2] CHIQ: Contextual History Enhancement for Improving Query Rewriting in Conversational Search

Link: https://arxiv.org/abs/2406.05013
Authors: Fengran Mo,Abbas Ghaddar,Kelong Mao,Mehdi Rezagholizadeh,Boxing Chen,Qun Liu,Jian-Yun Nie
Keywords: improving query rewriting, large language models, open-source large language, query rewriting, large language
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Abstract:In this paper, we study how open-source large language models (LLMs) can be effectively deployed for improving query rewriting in conversational search, especially for ambiguous queries. We introduce CHIQ, a two-step method that leverages the capabilities of LLMs to resolve ambiguities in the conversation history before query rewriting. This approach contrasts with prior studies that predominantly use closed-source LLMs to directly generate search queries from conversation history. We demonstrate on five well-established benchmarks that CHIQ leads to state-of-the-art results across most settings, showing highly competitive performances with systems leveraging closed-source LLMs. Our study provides a first step towards leveraging open-source LLMs in conversational search, as a competitive alternative to the prevailing reliance on commercial LLMs. Data, models, and source code will be publicly available upon acceptance at this https URL.

[IR-3] QAGCF: Graph Collaborative Filtering for QA Recommendation

Link: https://arxiv.org/abs/2406.04828
Authors: Changshuo Zhang,Teng Shi,Xiao Zhang,Yanping Zheng,Ruobing Xie,Qi Liu,Jun Xu,Ji-Rong Wen
Keywords: meet users’ knowledge, users’ knowledge acquisition, recommend question-answer pairs, question-answer pairs, unlike traditional recommendations
Categories: Information Retrieval (cs.IR)

Abstract:Question and answer (QA) platforms usually recommend question-answer pairs to meet users’ knowledge acquisition needs, unlike traditional recommendations that recommend only one item. This makes user behaviors more complex, and presents two challenges for QA recommendation, including: the collaborative information entanglement, which means user feedback is influenced by either the question or the answer; and the semantic information entanglement, where questions are correlated with their corresponding answers, and correlations also exist among different question-answer pairs. Traditional recommendation methods treat the question-answer pair as a whole or only consider the answer as a single item, which overlooks the two challenges and cannot effectively model user interests. To address these challenges, we introduce Question Answer Graph Collaborative Filtering (QAGCF), a graph neural network model that creates separate graphs for collaborative and semantic views to disentangle the information in question-answer pairs. The collaborative view disentangles questions and answers to individually model collaborative information, while the semantic view captures the semantic information both within and between question-answer pairs. These views are further merged into a global graph to integrate the collaborative and semantic information. Polynomial-based graph filters are used to address the high heterophily issues of the global graph. Additionally, contrastive learning is utilized to obtain robust embeddings during training. Extensive experiments on industrial and public datasets demonstrate that QAGCF consistently outperforms baselines and achieves state-of-the-art results.

[IR-4] Scaling Automatic Extraction of Pseudocode

Link: https://arxiv.org/abs/2406.04635
Authors: Levent Toksoz,Gang Tan,C. Lee Giles
Keywords: express the algorithms, algorithms implemented, Pseudocode, Abstract, collection
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Abstract:Pseudocode in a scholarly paper provides a concise way to express the algorithms implemented therein. Pseudocode can also be thought of as an intermediary representation that helps bridge the gap between programming languages and natural languages. Having access to a large collection of pseudocode can provide various benefits ranging from enhancing algorithmic understanding, facilitating further algorithmic design, to empowering NLP or computer vision based models for tasks such as automated code generation and optical character recognition (OCR). We have created a large pseudocode collection by extracting nearly 320,000 pseudocode examples from arXiv papers. This process involved scanning over 2.2 million scholarly papers, with 1,000 of them being manually inspected and labeled. Our approach encompasses an extraction mechanism tailored to optimize the coverage and a validation mechanism based on random sampling to check its accuracy and reliability, given the inherent heterogeneity of the collection. In addition, we offer insights into common pseudocode structures, supported by clustering and statistical analyses. Notably, these analyses indicate an exponential-like growth in the usage of pseudocodes, highlighting their increasing significance.

[IR-5] Error Bounds of Supervised Classification from Information-Theoretic Perspective

Link: https://arxiv.org/abs/2406.04567
Authors: Binchuan Qi,Wei Gong,Li Li
Keywords: unanswered research questions, overparametrized neural networks, deep neural networks, remarkable generalization power, efficient optimization performance
Categories: Machine Learning (cs.LG); Information Retrieval (cs.IR)

Abstract:There remains a list of unanswered research questions on deep learning (DL), including the remarkable generalization power of overparametrized neural networks, the efficient optimization performance despite the non-convexity, and the mechanisms behind flat minima in generalization. In this paper, we adopt an information-theoretic perspective to explore the theoretical foundations of supervised classification using deep neural networks (DNNs). Our analysis introduces the concepts of fitting error and model risk, which, together with generalization error, constitute an upper bound on the expected risk. We demonstrate that the generalization errors are bounded by the complexity, influenced by both the smoothness of distribution and the sample size. Consequently, task complexity serves as a reliable indicator of the dataset’s quality, guiding the setting of regularization hyperparameters. Furthermore, the derived upper bound fitting error links the back-propagated gradient, Neural Tangent Kernel (NTK), and the model’s parameter count with the fitting error. Utilizing the triangle inequality, we establish an upper bound on the expected risk. This bound offers valuable insights into the effects of overparameterization, non-convex optimization, and the flat minima in DNNs. Finally, empirical verification confirms a significant positive correlation between the derived theoretical bounds and the practical expected risk, confirming the practical relevance of the theoretical findings.

[IR-6] Better Late Than Never: Formulating and Benchmarking Recommendation Editing

Link: https://arxiv.org/abs/2406.04553
Authors: Chengyu Lai, Sheng Zhou, Zhimeng Jiang, Qiaoyu Tan, Yuanchen Bei, Jiawei Chen, Ningyu Zhang, Jiajun Bu
Keywords: Recommendation systems play, play a pivotal, pivotal role, role in suggesting, unsuitable recommendation behaviors
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Abstract:Recommendation systems play a pivotal role in suggesting items to users based on their preferences. However, in online platforms, these systems inevitably offer unsuitable recommendations due to limited model capacity, poor data quality, or evolving user interests. Enhancing user experience necessitates efficiently rectifying such unsuitable recommendation behaviors. This paper introduces a novel and significant task termed recommendation editing, which focuses on modifying known and unsuitable recommendation behaviors. Specifically, this task aims to adjust the recommendation model to eliminate known unsuitable items without accessing training data or retraining the model. We formally define the problem of recommendation editing with three primary objectives: strict rectification, collaborative rectification, and concentrated rectification. Three evaluation metrics are developed to quantitatively assess the achievement of each objective. We present a straightforward yet effective benchmark for recommendation editing using a novel Editing Bayesian Personalized Ranking Loss. To demonstrate the effectiveness of the proposed method, we establish a comprehensive benchmark that incorporates various methods from related fields. The codebase is available at this https URL.
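
For readers unfamiliar with the loss this abstract builds on, the following is a minimal sketch of the standard Bayesian Personalized Ranking (BPR) loss. It is not the paper's Editing BPR loss, whose form the abstract does not give. The function name `bpr_loss` and the paired-score input format are assumptions made for illustration.

```python
import math


def bpr_loss(pos_scores, neg_scores):
    """Mean standard BPR loss over paired (preferred, non-preferred) item scores.

    Each pair contributes -log(sigmoid(s_pos - s_neg)). This is the textbook
    BPR objective, NOT the Editing BPR loss proposed in the paper.
    """
    if not pos_scores or len(pos_scores) != len(neg_scores):
        raise ValueError("score lists must be non-empty and of equal length")
    total = 0.0
    for s_pos, s_neg in zip(pos_scores, neg_scores):
        diff = s_pos - s_neg
        # -log(sigmoid(diff)) == log(1 + exp(-diff)), written in a stable form
        total += max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))
    return total / len(pos_scores)


# The loss is small when the preferred item already scores higher.
well_ranked = bpr_loss([5.0], [0.0])
badly_ranked = bpr_loss([0.0], [5.0])
```

The loss only depends on score differences, so it pushes preferred items above non-preferred ones without fixing absolute score values.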

[IR-7] GNNAnatomy: Systematic Generation and Evaluation of Multi-Level Explanations for Graph Neural Networks

Link: https://arxiv.org/abs/2406.04548
Authors: Hsiao-Ying Lu, Yiran Li, Ujwal Pratap Krishna Kaluvakolanu Thyagarajan, Kwan-Liu Ma
Keywords: Graph Neural Networks, Neural Networks, proven highly effective, Graph Neural, machine learning
Categories: Machine Learning (cs.LG); Information Retrieval (cs.IR); Social and Information Networks (cs.SI)

Abstract:Graph Neural Networks (GNNs) have proven highly effective in various machine learning (ML) tasks involving graphs, such as node/graph classification and link prediction. However, explaining the decisions made by GNNs poses challenges because of the aggregated relational information based on graph structure, leading to complex data transformations. Existing methods for explaining GNNs often face limitations in systematically exploring diverse substructures and evaluating results in the absence of ground truths. To address this gap, we introduce GNNAnatomy, a model- and dataset-agnostic visual analytics system designed to facilitate the generation and evaluation of multi-level explanations for GNNs. In GNNAnatomy, we employ graphlets to elucidate GNN behavior in graph-level classification tasks. By analyzing the associations between GNN classifications and graphlet frequencies, we formulate hypothesized factual and counterfactual explanations. To validate a hypothesized graphlet explanation, we introduce two metrics: (1) the correlation between its frequency and the classification confidence, and (2) the change in classification confidence after removing this substructure from the original graph. To demonstrate the effectiveness of GNNAnatomy, we conduct case studies on both real-world and synthetic graph datasets from various domains. Additionally, we qualitatively compare GNNAnatomy with a state-of-the-art GNN explainer, demonstrating the utility and versatility of our design.
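
The two validation metrics named in this abstract can be sketched roughly as follows. The abstract does not say which correlation coefficient is used or how the confidence change is aggregated, so Pearson correlation and a simple mean are assumptions here, and both function names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between graphlet frequencies and confidences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for constant input; report 0
    return cov / math.sqrt(var_x * var_y)


def mean_confidence_drop(conf_before, conf_after):
    """Average drop in confidence after removing the substructure."""
    drops = [b - a for b, a in zip(conf_before, conf_after)]
    return sum(drops) / len(drops)


# Toy data: graphlet counts per graph and the classifier's confidence.
frequencies = [1, 2, 3, 4]
confidences = [0.55, 0.65, 0.80, 0.90]
correlation = pearson(frequencies, confidences)
drop = mean_confidence_drop([0.9, 0.8], [0.5, 0.6])
```

A high correlation together with a large drop would support the hypothesized graphlet as a factual explanation under these assumptions.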

[IR-8] Negative Feedback for Music Personalization

Link: https://arxiv.org/abs/2406.04488
Authors: M. Jeffrey Mei, Oliver Bembom, Andreas F. Ehmann
Keywords: Next-item recommender systems, Next-item recommender, next-song recommender system, negative feedback, feedback
Categories: Machine Learning (cs.LG); Information Retrieval (cs.IR)
Notes: 6 pages, 4 figures, accepted to ACM UMAP 2024

Abstract:Next-item recommender systems are often trained using only positive feedback with randomly-sampled negative feedback. We show the benefits of using real negative feedback both as inputs into the user sequence and also as negative targets for training a next-song recommender system for internet radio. In particular, using explicit negative samples during training helps reduce training time by ~60% while also improving test accuracy by ~6%; adding user skips as additional inputs can also considerably increase user coverage alongside slightly improving accuracy. We test the impact of using a large number of random negative samples to capture a ‘harder’ one and find that the test accuracy increases with more randomly-sampled negatives, but only to a point. Too many random negatives lead to false negatives that limit the lift, which is still lower than if using true negative feedback. We also find that the test accuracy is fairly robust with respect to the proportion of different feedback types, and compare the learned embeddings for different feedback types.

[IR-9] Innovations in Cover Song Detection: A Lyrics-Based Approach

Link: https://arxiv.org/abs/2406.04384
Authors: Maximilian Balluff, Peter Mandl, Christian Wolff
Keywords: Cover songs, Cover, songs, song, music
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: 6 pages, 3 figures

Abstract:Cover songs are alternate versions of a song by a different artist. Long being a vital part of the music industry, cover songs significantly influence music culture and are commonly heard in public venues. The rise of online music platforms has further increased their prevalence, often as background music or video soundtracks. While current automatic identification methods serve adequately for original songs, they are less effective with cover songs, primarily because cover versions often significantly deviate from the original compositions. In this paper, we propose a novel method for cover song detection that utilizes the lyrics of a song. We introduce a new dataset for cover songs and their corresponding originals. The dataset contains 5078 cover songs and 2828 original songs. In contrast to other cover song datasets, it contains the annotated lyrics for the original song and the cover song. We evaluate our method on this dataset and compare it with multiple baseline approaches. Our results show that our method outperforms the baseline approaches.

[IR-10] Dynamic Online Recommendation for Two-Sided Market with Bayesian Incentive Compatibility

Link: https://arxiv.org/abs/2406.04374
Authors: Yuantong Li, Guang Cheng, Xiaowu Dai
Keywords: Recommender systems play, Recommender systems, effective recommender systems, recommender systems faces, play a crucial
Categories: Information Retrieval (cs.IR); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Recommender systems play a crucial role in internet economies by connecting users with relevant products or services. However, designing effective recommender systems faces two key challenges: (1) the exploration-exploitation tradeoff in balancing new product exploration against exploiting known preferences, and (2) dynamic incentive compatibility in accounting for users’ self-interested behaviors and heterogeneous preferences. This paper formalizes these challenges into a Dynamic Bayesian Incentive-Compatible Recommendation Protocol (DBICRP). To address the DBICRP, we propose a two-stage algorithm (RCB) that integrates incentivized exploration with an efficient offline learning component for exploitation. In the first stage, our algorithm explores available products while maintaining dynamic incentive compatibility to determine sufficient sample sizes. The second stage employs inverse proportional gap sampling integrated with an arbitrary machine learning method to ensure sublinear regret. Theoretically, we prove that RCB achieves $O(\sqrt{KdT})$ regret and satisfies Bayesian incentive compatibility (BIC) under a Gaussian prior assumption. Empirically, we validate RCB’s strong incentive gain, sublinear regret, and robustness through simulations and a real-world application on personalized warfarin dosing. Our work provides a principled approach for incentive-aware recommendation in online preference learning settings.
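
The abstract does not define "inverse proportional gap sampling". One plausible reading is the inverse gap weighting rule known from the contextual bandit literature, sketched below. Treat both the rule and the names `inverse_gap_probabilities` and `gamma` as assumptions, not as the paper's RCB algorithm.

```python
def inverse_gap_probabilities(estimated_rewards, gamma):
    """Sampling distribution over arms via inverse gap weighting.

    Non-greedy arms get probability 1 / (K + gamma * gap), where gap is the
    difference between the best estimated reward and the arm's own estimate.
    The greedy arm receives the remaining probability mass.
    """
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    k = len(estimated_rewards)
    best = max(range(k), key=lambda arm: estimated_rewards[arm])
    probs = [0.0] * k
    for arm in range(k):
        if arm != best:
            gap = estimated_rewards[best] - estimated_rewards[arm]
            probs[arm] = 1.0 / (k + gamma * gap)
    probs[best] = 1.0 - sum(probs)
    return probs


# Larger gamma concentrates probability on the greedy arm.
probs = inverse_gap_probabilities([1.0, 0.5, 0.0], gamma=10.0)
```

Arms whose estimated reward is close to the best one keep a meaningful chance of being explored, while clearly worse arms are sampled rarely.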

Artificial Intelligence

[AI-0] 3D-GRAND: Towards Better Grounding and Less Hallucination for 3D-LLMs

Link: https://arxiv.org/abs/2406.05132
Authors: Jianing Yang, Xuweiyi Chen, Nikhil Madaan, Madhavan Iyengar, Shengyi Qian, David F. Fouhey, Joyce Chai
Keywords: developing embodied agents, perception is crucial, physical world, crucial for developing, agents and robots
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Robotics (cs.RO)
Notes: Project website: this https URL

Abstract:The integration of language and 3D perception is crucial for developing embodied agents and robots that comprehend and interact with the physical world. While large language models (LLMs) have demonstrated impressive language understanding and generation capabilities, their adaptation to 3D environments (3D-LLMs) remains in its early stages. A primary challenge is the absence of large-scale datasets that provide dense grounding between language and 3D scenes. In this paper, we introduce 3D-GRAND, a pioneering large-scale dataset comprising 40,087 household scenes paired with 6.2 million densely-grounded scene-language instructions. Our results show that instruction tuning with 3D-GRAND significantly enhances grounding capabilities and reduces hallucinations in 3D-LLMs. As part of our contributions, we propose a comprehensive benchmark 3D-POPE to systematically evaluate hallucination in 3D-LLMs, enabling fair comparisons among future models. Our experiments highlight a scaling effect between dataset size and 3D-LLM performance, emphasizing the critical role of large-scale 3D-text datasets in advancing embodied AI research. Notably, our results demonstrate early signals for effective sim-to-real transfer, indicating that models trained on large synthetic data can perform well on real-world 3D scans. Through 3D-GRAND and 3D-POPE, we aim to equip the embodied AI community with essential resources and insights, setting the stage for more reliable and better-grounded 3D-LLMs. Project website: this https URL

[AI-1] LLavaGuard: VLM-based Safeguards for Vision Dataset Curation and Safety Assessment

Link: https://arxiv.org/abs/2406.05113
Authors: Lukas Helff, Felix Friedrich, Manuel Brack, Kristian Kersting, Patrick Schramowski
Keywords: VLM-based safeguard models, offering a versatile, family of VLM-based, VLM-based safeguard, versatile framework
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: Project page at this https URL

Abstract:We introduce LlavaGuard, a family of VLM-based safeguard models, offering a versatile framework for evaluating the safety compliance of visual content. Specifically, we designed LlavaGuard for dataset annotation and generative model safeguarding. To this end, we collected and annotated a high-quality visual dataset incorporating a broad safety taxonomy, which we use to tune VLMs on context-aware safety risks. As a key innovation, LlavaGuard’s new responses contain comprehensive information, including a safety rating, the violated safety categories, and an in-depth rationale. Further, our introduced customizable taxonomy categories enable the context-specific alignment of LlavaGuard to various scenarios. Our experiments highlight the capabilities of LlavaGuard in complex and real-world applications. We provide checkpoints ranging from 7B to 34B parameters demonstrating state-of-the-art performance, with even the smallest models outperforming baselines like GPT-4. We make our dataset and model weights publicly available and invite further research to address the diverse needs of communities and contexts.

[AI-2] Provably Better Explanations with Optimized Aggregation of Feature Attributions

Link: https://arxiv.org/abs/2406.05090
Authors: Thomas Decker, Ananta R. Bhattarai, Jindong Gu, Volker Tresp, Florian Buettner
Keywords: opaque machine learning, machine learning models, common practice, practice to understand, understand and verify
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Notes: International Conference on Machine Learning (ICML) 2024

Abstract:Using feature attributions for post-hoc explanations is a common practice to understand and verify the predictions of opaque machine learning models. Despite the numerous techniques available, individual methods often produce inconsistent and unstable results, putting their overall reliability into question. In this work, we aim to systematically improve the quality of feature attributions by combining multiple explanations across distinct methods or their variations. For this purpose, we propose a novel approach to derive optimal convex combinations of feature attributions that yield provable improvements of desired quality criteria such as robustness or faithfulness to the model behavior. Through extensive experiments involving various model architectures and popular feature attribution techniques, we demonstrate that our combination strategy consistently outperforms individual methods and existing baselines.
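
To make "convex combinations of feature attributions" concrete, here is a minimal sketch that only performs the combination step. How the paper derives the optimal weights is not described in the abstract, so the weights below are simply given as input. The function name `convex_combination` is hypothetical.

```python
def convex_combination(attributions, weights):
    """Combine several attribution vectors with non-negative weights.

    Weights are rescaled to sum to 1, so the result is a convex combination.
    Choosing the weights optimally is the paper's contribution and is NOT
    implemented here.
    """
    if not attributions or len(attributions) != len(weights):
        raise ValueError("need one weight per attribution vector")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    normalized = [w / total for w in weights]
    size = len(attributions[0])
    return [
        sum(w * attr[i] for w, attr in zip(normalized, attributions))
        for i in range(size)
    ]


# Two explanation methods that disagree on which feature matters most.
method_a = [1.0, 0.0, 0.2]
method_b = [0.0, 1.0, 0.2]
combined = convex_combination([method_a, method_b], [1.0, 1.0])
```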

[AI-3] Multi-Head RAG: Solving Multi-Aspect Problems with LLMs

Link: https://arxiv.org/abs/2406.05085
Authors: Maciej Besta, Ales Kubicek, Roman Niggli, Robert Gerstenberger, Lucas Weitzendorf, Mingyuan Chi, Patrick Iff, Joanna Gajda, Piotr Nyczyk, Jürgen Müller, Hubert Niewiadomski, Marcin Chrapek, Michał Podstawski, Torsten Hoefler
Keywords: Large Language Models, Retrieval Augmented Generation, Augmented Generation, Language Models, Large Language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Abstract:Retrieval Augmented Generation (RAG) enhances the abilities of Large Language Models (LLMs) by enabling the retrieval of documents into the LLM context to provide more accurate and relevant responses. Existing RAG solutions do not focus on queries that may require fetching multiple documents with substantially different contents. Such queries occur frequently, but are challenging because the embeddings of these documents may be distant in the embedding space, making it hard to retrieve them all. This paper introduces Multi-Head RAG (MRAG), a novel scheme designed to address this gap with a simple yet powerful idea: leveraging activations of Transformer’s multi-head attention layer, instead of the decoder layer, as keys for fetching multi-aspect documents. The driving motivation is that different attention heads can learn to capture different data aspects. Harnessing the corresponding activations results in embeddings that represent various facets of data items and queries, improving the retrieval accuracy for complex queries. We provide an evaluation methodology and metrics, synthetic datasets, and real-world use cases to demonstrate MRAG’s effectiveness, showing improvements of up to 20% in relevance over standard RAG baselines. MRAG can be seamlessly integrated with existing RAG frameworks and benchmarking tools like RAGAS as well as different classes of data stores.
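
The idea of retrieving with one embedding per attention head can be illustrated with a toy sketch. The per-head vectors below are hand-written stand-ins for attention-layer activations, and the simple vote counting is an assumption. The abstract does not describe MRAG's actual ranking strategy, and `multi_head_retrieve` is a hypothetical name.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def multi_head_retrieve(query_heads, doc_heads, top_k=1):
    """Retrieve documents by letting each head vote for its nearest documents.

    query_heads: list of per-head query vectors.
    doc_heads: dict mapping document id to its list of per-head vectors.
    Returns document ids ordered by the number of votes received.
    """
    votes = Counter()
    for head, query_vec in enumerate(query_heads):
        ranked = sorted(
            doc_heads,
            key=lambda doc_id: cosine(query_vec, doc_heads[doc_id][head]),
            reverse=True,
        )
        for doc_id in ranked[:top_k]:
            votes[doc_id] += 1
    return [doc_id for doc_id, _ in votes.most_common()]


# Each head matches a different document, so both are retrieved.
docs = {"a": [[1.0, 0.0], [0.0, 1.0]], "b": [[0.0, 1.0], [1.0, 0.0]]}
retrieved = multi_head_retrieve([[1.0, 0.0], [1.0, 0.0]], docs)
```

A single averaged embedding would favor whichever document is closest overall, whereas separate heads can surface documents that match different aspects of the query.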

[AI-4] I2EDL: Interactive Instruction Error Detection and Localization

Link: https://arxiv.org/abs/2406.05080
Authors: Francesco Taioli, Stefano Rosa, Alberto Castellini, Lorenzo Natale, Alessio Del Bue, Alessandro Farinelli, Marco Cristani, Yiming Wang
Keywords: human user guides, Continuous Environments, Interactive Instruction Error, instruction errors, Instruction Error Detector
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes: Accepted at IEEE RO-MAN 2024

Abstract:In the Vision-and-Language Navigation in Continuous Environments (VLN-CE) task, the human user guides an autonomous agent to reach a target goal via a series of low-level actions following a textual instruction in natural language. However, most existing methods do not address the likely case where users may make mistakes when providing such instruction (e.g. “turn left” instead of “turn right”). In this work, we address a novel task of Interactive VLN in Continuous Environments (IVLN-CE), which allows the agent to interact with the user during the VLN-CE navigation to verify any doubts regarding the instruction errors. We propose an Interactive Instruction Error Detector and Localizer (I2EDL) that triggers the user-agent interaction upon the detection of instruction errors during the navigation. We leverage a pre-trained module to detect instruction errors and pinpoint them in the instruction by cross-referencing the textual input and past observations. In this way, the agent is able to query the user for a timely correction, without demanding the user’s cognitive load, as we locate the probable errors to a precise part of the instruction. We evaluate the proposed I2EDL on a dataset of instructions containing errors, and further devise a novel metric, the Success weighted by Interaction Number (SIN), to reflect both the navigation performance and the interaction effectiveness. We show how the proposed method can ask focused requests for corrections to the user, which in turn increases the navigation success, while minimizing the interactions.

[AI-5] Massively Multiagent Minigames for Training Generalist Agents

Link: https://arxiv.org/abs/2406.05071
Authors: Kyoung Whan Choe, Ryan Sullivan, Joseph Suárez
Keywords: present Meta MMO, Meta MMO, Neural MMO, MMO, expands Neural MMO
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)

Abstract:We present Meta MMO, a collection of many-agent minigames for use as a reinforcement learning benchmark. Meta MMO is built on top of Neural MMO, a massively multiagent environment that has been the subject of two previous NeurIPS competitions. Our work expands Neural MMO with several computationally efficient minigames. We explore generalization across Meta MMO by learning to play several minigames with a single set of weights. We release the environment, baselines, and training code under the MIT license. We hope that Meta MMO will spur additional progress on Neural MMO and, more generally, will serve as a useful benchmark for many-agent generalization.

[AI-6] Classification Metrics for Image Explanations: Towards Building Reliable XAI-Evaluations

Link: https://arxiv.org/abs/2406.05068
Authors: Benjamin Fresz, Lena Lörcher, Marco Huber
Keywords: computer vision models, deep neural networks, Decision processes, vision models, neural networks
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)

Abstract:Decision processes of computer vision models - especially deep neural networks - are opaque in nature, meaning that these decisions cannot be understood by humans. Thus, over the last years, many methods to provide human-understandable explanations have been proposed. For image classification, the most common group are saliency methods, which provide (super-)pixelwise feature attribution scores for input images. But their evaluation still poses a problem, as their results cannot be simply compared to the unknown ground truth. To overcome this, a slew of different proxy metrics have been defined, which are - as the explainability methods themselves - often built on intuition and thus, are possibly unreliable. In this paper, new evaluation metrics for saliency methods are developed and common saliency methods are benchmarked on ImageNet. In addition, a scheme for reliability evaluation of such metrics is proposed that is based on concepts from psychometric testing. The code used can be found at this https URL.

[AI-7] Robustness Assessment of Mathematical Reasoning in the Presence of Missing and Contradictory Conditions

Link: https://arxiv.org/abs/2406.05055
Authors: Shi-Yu Tian, Zhi Zhou, Lin-Han Jia, Lan-Zhe Guo, Yu-Feng Li
Keywords: few-shot prompting, few-shot prompting techniques, demonstrated impressive performance, Large language models, few-shot prompting methods
Categories: Artificial Intelligence (cs.AI)
Notes: Preprint. arXiv admin note: text overlap with arXiv:2304.09797

Abstract:Large language models (LLMs) have demonstrated impressive performance on reasoning tasks, which can be further improved through few-shot prompting techniques. However, the current evaluation primarily focuses on carefully constructed benchmarks and neglects the consideration of real-world reasoning problems that present missing and contradictory conditions, known as ill-defined problems. Our observations suggest that existing few-shot prompting techniques are ineffective in such scenarios, often providing overconfident answers or hallucination. To further study this problem, we develop a benchmark called Problems with Missing and Contradictory conditions (PMC) and introduce two novel metrics to evaluate the performance of few-shot prompting methods in these scenarios. Our analysis using the PMC benchmark reveals a trade-off dilemma between the performance of mathematical reasoning for well-defined problems and the ability to recognize ill-defined problems. To address the challenges posed by PMC, we propose a novel few-shot prompting method called SMT-LIB Prompting (SLP), which utilizes the SMT-LIB language to model the problems instead of solving them directly. Subsequently, a double-check solving strategy checks the satisfiability and uniqueness of the solution and provides final feedback. Extensive experiments demonstrate the superiority of our SLP approach compared to existing few-shot prompting methods when dealing with problems with missing and contradictory conditions. We will open-source our benchmark and code to facilitate future research.
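
The double-check of satisfiability and uniqueness can be illustrated without an SMT solver. The sketch below swaps in brute-force enumeration over a small finite domain in place of SMT-LIB and a solver, so it is an illustration of the idea, not the paper's SLP method. The function name and the three labels are assumptions.

```python
from itertools import product


def classify_problem(variables, domain, constraints):
    """Classify a problem by counting solutions over a finite domain.

    Returns "contradictory" when no assignment satisfies the constraints,
    "missing condition" when more than one does, and "well-defined" when
    exactly one does. Brute force replaces the SMT solver used in the paper.
    """
    solutions = 0
    for values in product(domain, repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if all(check(assignment) for check in constraints):
            solutions += 1
            if solutions > 1:
                return "missing condition"
    return "well-defined" if solutions == 1 else "contradictory"


# "Two numbers sum to 10 and differ by 2" has the unique solution (6, 4).
total_is_10 = lambda a: a["x"] + a["y"] == 10
diff_is_2 = lambda a: a["x"] - a["y"] == 2
total_is_11 = lambda a: a["x"] + a["y"] == 11

domain = range(0, 11)
well_defined = classify_problem(["x", "y"], domain, [total_is_10, diff_is_2])
missing = classify_problem(["x", "y"], domain, [total_is_10])
contradictory = classify_problem(["x", "y"], domain, [total_is_10, total_is_11])
```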

[AI-8] Hints-In-Browser: Benchmarking Language Models for Programming Feedback Generation

Link: https://arxiv.org/abs/2406.05053
Authors: Nachiket Kotalwar, Alkis Gotovos, Adish Singla
Keywords: hold great promise, generating individualized feedback, models hold great, enhancing programming education, hints for learners
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Generative AI and large language models hold great promise in enhancing programming education by generating individualized feedback and hints for learners. Recent works have primarily focused on improving the quality of generated feedback to achieve human tutors’ quality. While quality is an important performance criterion, it is not the only criterion to optimize for real-world educational deployments. In this paper, we benchmark language models for programming feedback generation across several performance criteria, including quality, cost, time, and data privacy. The key idea is to leverage recent advances in the new paradigm of in-browser inference that allow running these models directly in the browser, thereby providing direct benefits across cost and data privacy. To boost the feedback quality of small models compatible with in-browser inference engines, we develop a fine-tuning pipeline based on GPT-4 generated synthetic data. We showcase the efficacy of fine-tuned Llama3-8B and Phi3-3.8B 4-bit quantized models using WebLLM’s in-browser inference engine on three different Python programming datasets. We will release the full implementation along with a web app and datasets to facilitate further research on in-browser language models.

[AI-9] Efficient 3D Shape Generation via Diffusion Mamba with Bidirectional SSMs

Link: https://arxiv.org/abs/2406.05038
Authors: Shentong Mo
Keywords: state space approach, selective state space, Recent advancements, long sequence handling, efficient long sequence
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Recent advancements in sequence modeling have led to the development of the Mamba architecture, noted for its selective state space approach, offering a promising avenue for efficient long sequence handling. However, its application in 3D shape generation, particularly at high resolutions, remains underexplored. Traditional diffusion transformers (DiT) with self-attention mechanisms, despite their potential, face scalability challenges due to the cubic complexity of attention operations as input length increases. This complexity becomes a significant hurdle when dealing with high-resolution voxel sizes. To address this challenge, we introduce a novel diffusion architecture tailored for 3D point cloud generation, Diffusion Mamba (DiM-3D). This architecture forgoes traditional attention mechanisms, instead utilizing the inherent efficiency of the Mamba architecture to maintain linear complexity with respect to sequence length. DiM-3D is characterized by fast inference times and substantially lower computational demands, quantified in reduced Gflops, thereby addressing the key scalability issues of prior models. Our empirical results on the ShapeNet benchmark demonstrate that DiM-3D achieves state-of-the-art performance in generating high-fidelity and diverse 3D shapes. Additionally, DiM-3D shows superior capabilities in tasks like 3D point cloud completion. This not only proves the model’s scalability but also underscores its efficiency in generating detailed, high-resolution voxels necessary for advanced 3D shape modeling, particularly excelling in environments requiring high-resolution voxel sizes. Through these findings, we illustrate the exceptional scalability and efficiency of the Diffusion Mamba framework in 3D shape generation, setting a new standard for the field and paving the way for future explorations in high-resolution 3D modeling technologies.

[AI-10] TimeSieve: Extracting Temporal Dynamics through Information Bottlenecks

Link: https://arxiv.org/abs/2406.05036
Authors: Ninghui Feng, Songning Lai, Fobao Zhou, Zhenxiao Yin, Hang Zhao
Keywords: Time series forecasting, increasingly popular research, popular research area, research area due, Time series
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Time series forecasting has become an increasingly popular research area due to its critical applications in various real-world domains such as traffic management, weather prediction, and financial analysis. Despite significant advancements, existing models face notable challenges, including the necessity of manual hyperparameter tuning for different datasets, and difficulty in effectively distinguishing signal from redundant features in data characterized by strong seasonality. These issues hinder the generalization and practical application of time series forecasting models. To solve these issues, we propose an innovative time series forecasting model, TimeSieve, designed to address these challenges. Our approach employs wavelet transforms to preprocess time series data, effectively capturing multi-scale features without the need for additional parameters or manual hyperparameter tuning. Additionally, we introduce the information bottleneck theory that filters out redundant features from both detail and approximation coefficients, retaining only the most predictive information. This combination significantly improves the model’s accuracy. Extensive experiments demonstrate that our model outperforms existing state-of-the-art methods on 70% of the datasets, achieving higher predictive accuracy and better generalization across diverse datasets. Our results validate the effectiveness of our approach in addressing the key challenges in time series forecasting, paving the way for more reliable and efficient predictive models in practical applications. The code for our model is available at this https URL.
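
The approximation and detail coefficients mentioned in this abstract can be seen in a single-level Haar wavelet transform, sketched below. The abstract does not say which wavelet family TimeSieve uses, so Haar is an assumption chosen for its simplicity, and the function names are hypothetical.

```python
import math


def haar_dwt(signal):
    """One level of the orthonormal Haar wavelet transform.

    Returns (approximation, detail): scaled pairwise sums capture the
    coarse trend, scaled pairwise differences capture local fluctuations.
    """
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    scale = math.sqrt(2.0)
    pairs = [(signal[i], signal[i + 1]) for i in range(0, len(signal), 2)]
    approximation = [(a + b) / scale for a, b in pairs]
    detail = [(a - b) / scale for a, b in pairs]
    return approximation, detail


def haar_idwt(approximation, detail):
    """Invert haar_dwt, reconstructing the original signal."""
    scale = math.sqrt(2.0)
    signal = []
    for a, d in zip(approximation, detail):
        signal.extend([(a + d) / scale, (a - d) / scale])
    return signal


series = [4.0, 4.0, 2.0, 6.0]
approx, detail = haar_dwt(series)
restored = haar_idwt(approx, detail)
```

The transform has no learnable parameters, which is consistent with the abstract's claim that the preprocessing needs no additional parameters.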

[AI-11] Scenarios and Approaches for Situated Natural Language Explanations

Link: https://arxiv.org/abs/2406.05035
Authors: Pengshuo Qiu, Frank Rudzicz, Zining Zhu
Keywords: Large language models, Large language, Large, explanations, language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: 8 pages, 4 figures

Abstract:Large language models (LLMs) can be used to generate natural language explanations (NLE) that are adapted to different users’ situations. However, there is yet to be a quantitative evaluation of the extent of such adaptation. To bridge this gap, we collect a benchmarking dataset, Situation-Based Explanation (SBE). This dataset contains 100 explanandums. Each explanandum is paired with explanations targeted at three distinct audience types (such as educators, students, and professionals), enabling us to assess how well the explanations meet the specific informational needs and contexts of these diverse groups (e.g., students, teachers, and parents). For each “explanandum paired with an audience” situation, we include a human-written explanation. These allow us to compute scores that quantify how the LLMs adapt the explanations to the situations. On an array of pretrained language models with varying sizes, we examine three categories of prompting methods: rule-based prompting, meta-prompting, and in-context learning prompting. We find that 1) language models can generate prompts that result in explanations more precisely aligned with the target situations, 2) explicitly modeling an “assistant” persona by prompting “You are a helpful assistant…” is not a necessary prompt technique for situated NLE tasks, and 3) the in-context learning prompts can only help LLMs learn the demonstration template but can’t improve their inference performance. SBE and our analysis facilitate future research towards generating situated natural language explanations.

[AI-12] Optimizing Automatic Differentiation with Deep Reinforcement Learning

Link: https://arxiv.org/abs/2406.05027
Authors: Jamie Lohoff, Emre Neftci
Keywords: computational fluid dynamics, robotics and finance, fluid dynamics, Jacobian, exact Jacobian
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Computing Jacobians with automatic differentiation is ubiquitous in many scientific domains such as machine learning, computational fluid dynamics, robotics and finance. Even small savings in the number of computations or memory usage in Jacobian computations can already incur massive savings in energy consumption and runtime. While there exist many methods that allow for such savings, they generally trade computational efficiency for approximations of the exact Jacobian. In this paper, we present a novel method to optimize the number of necessary multiplications for Jacobian computation by leveraging deep reinforcement learning (RL) and a concept called cross-country elimination while still computing the exact Jacobian. Cross-country elimination is a framework for automatic differentiation that phrases Jacobian accumulation as ordered elimination of all vertices on the computational graph where every elimination incurs a certain computational cost. We formulate the search for the optimal elimination order that minimizes the number of necessary multiplications as a single player game which is played by an RL agent. We demonstrate that this method achieves up to 33% improvements over state-of-the-art methods on several relevant tasks taken from diverse domains. Furthermore, we show that these theoretical gains translate into actual runtime improvements by providing a cross-country elimination interpreter in JAX that can efficiently execute the obtained elimination orders.
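
The cost model behind cross-country elimination can be sketched on a tiny computational graph. The sketch assumes scalar-valued edges, where eliminating a vertex costs one multiplication per (predecessor, successor) pair. The paper's exact cost model and its RL agent are not reproduced, and `elimination_cost` is a hypothetical name.

```python
def elimination_cost(edges, order):
    """Count multiplications for vertex elimination in the given order.

    edges: iterable of (source, target) pairs of a directed acyclic graph.
    order: intermediate vertices to eliminate, in sequence.
    Eliminating v connects every predecessor of v to every successor of v,
    at a cost of one multiplication per such pair (scalar edges assumed).
    """
    preds, succs = {}, {}
    for src, dst in edges:
        succs.setdefault(src, set()).add(dst)
        preds.setdefault(dst, set()).add(src)
    cost = 0
    for vertex in order:
        incoming = preds.pop(vertex, set())
        outgoing = succs.pop(vertex, set())
        cost += len(incoming) * len(outgoing)
        for src in incoming:
            succs[src].discard(vertex)
            for dst in outgoing:
                succs[src].add(dst)  # new or updated edge src -> dst
                preds[dst].add(src)
        for dst in outgoing:
            preds[dst].discard(vertex)
    return cost


# Three inputs feed a chain a -> b -> y with a single output.
graph = [("x1", "a"), ("x2", "a"), ("x3", "a"), ("a", "b"), ("b", "y")]
forward_cost = elimination_cost(graph, ["a", "b"])  # forward-mode order
reverse_cost = elimination_cost(graph, ["b", "a"])  # reverse-mode order
```

Even on this small graph the order matters: eliminating from the output side is cheaper because there are fewer outputs than inputs. Searching over such orders on larger graphs is the problem the abstract frames as a single-player game.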

[AI-13] Adaptively Learning to Select-Rank in Online Platforms

Link: https://arxiv.org/abs/2406.05017
Authors: Jingyuan Wang, Perry Dong, Ying Jin, Ruohan Zhan, Zhengyuan Zhou
Keywords: content streaming services, streaming services, online platforms, platforms across e-commerce, e-commerce sites
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes: 25 pages in total. Includes 4 figures and a pdf. International conference on machine learning. PMLR, 2024

Abstract:Ranking algorithms are fundamental to online platforms ranging from e-commerce sites to content streaming services. Our research addresses the challenge of adaptively ranking items from a candidate pool for heterogeneous users, a key component in personalizing user experience. We develop a user response model that considers diverse user preferences and the varying effects of item positions, aiming to optimize overall user satisfaction with the ranked list. We frame this problem within a contextual bandits framework, with each ranked list as an action. Our approach incorporates an upper confidence bound to adjust predicted user satisfaction scores and selects the ranking action that maximizes these adjusted scores, efficiently solved via maximum weight imperfect matching. We demonstrate that our algorithm achieves a cumulative regret bound of $O(d\sqrt{NKT})$ for ranking $K$ out of $N$ items in a $d$-dimensional context space over $T$ rounds, under the assumption that user responses follow a generalized linear model. This regret alleviates dependence on the ambient action space, whose cardinality grows exponentially with $N$ and $K$ (thus rendering direct application of existing adaptive learning algorithms – such as UCB or Thompson sampling – infeasible). Experiments conducted on both simulated and real-world datasets demonstrate our algorithm outperforms the baseline.

[AI-14] ADBA: Approximation Decision Boundary Approach for Black-Box Adversarial Attacks

Link: https://arxiv.org/abs/2406.04998
Authors: Feiyang Wang, Xingquan Zuo, Hai Huang, Gang Chen
Keywords: machine learning models, machine learning, target machine learning, perturbation directions, real-world applications
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Notes: 10 pages, 5 figures, conference

Abstract:Many machine learning models are susceptible to adversarial attacks, with decision-based black-box attacks representing the most critical threat in real-world applications. These attacks are extremely stealthy, generating adversarial examples using hard labels obtained from the target machine learning model. This is typically realized by optimizing perturbation directions, guided by decision boundaries identified through query-intensive exact search, significantly limiting the attack success rate. This paper introduces a novel approach using the Approximation Decision Boundary (ADB) to efficiently and accurately compare perturbation directions without precisely determining decision boundaries. The effectiveness of our ADB approach (ADBA) hinges on promptly identifying suitable ADB, ensuring reliable differentiation of all perturbation directions. For this purpose, we analyze the probability distribution of decision boundaries, confirming that using the distribution’s median value as ADB can effectively distinguish different perturbation directions, giving rise to the development of the ADBA-md algorithm. ADBA-md only requires four queries on average to differentiate any pair of perturbation directions, which is highly query-efficient. Extensive experiments on six well-known image classifiers clearly demonstrate the superiority of ADBA and ADBA-md over multiple state-of-the-art black-box attacks.
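
As a rough illustration of comparing two perturbation directions without locating either decision boundary exactly, the sketch below queries both directions at one shared radius and stops as soon as the hard labels disagree. It is a simplification under stated assumptions: the paper picks the radius from the median of a boundary distribution, whereas this sketch halves a bracket `[lo, hi]`. The names `compare_directions` and `oracle` and the toy boundary values are hypothetical.

```python
def compare_directions(is_adversarial, d1, d2, lo, hi, max_rounds=20):
    """Return (better direction, queries used).

    "Better" means the direction whose decision boundary is closer.
    is_adversarial(direction, radius) -> bool is the only model access
    (hard-label queries); lo and hi bracket the plausible boundary distances.
    """
    queries = 0
    for _ in range(max_rounds):
        adb = (lo + hi) / 2.0        # shared approximate boundary radius
        a1 = is_adversarial(d1, adb)
        a2 = is_adversarial(d2, adb)
        queries += 2
        if a1 != a2:                 # exactly one direction crossed its boundary
            return (d1 if a1 else d2), queries
        if a1:                       # both adversarial: shrink the radius
            hi = adb
        else:                        # neither adversarial: grow the radius
            lo = adb
    return None, queries             # indistinguishable within the budget


# Toy stand-in for a classifier: each direction has a hidden boundary distance.
boundary = {"a": 0.30, "b": 0.70}

def oracle(direction, radius):
    return radius >= boundary[direction]

best, used = compare_directions(oracle, "a", "b", lo=0.0, hi=1.0)
print(best, used)
```

The comparison never learns either boundary distance; it only learns which of the two is smaller, which is why a handful of queries can suffice.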

[AI-15] UniTST: Effectively Modeling Inter-Series and Intra-Series Dependencies for Multivariate Time Series Forecasting

Link: https://arxiv.org/abs/2406.04975
Authors: Juncheng Liu, Chenghao Liu, Gerald Woo, Yiwei Wang, Bryan Hooi, Caiming Xiong, Doyen Sahoo
Keywords: emerged as powerful, powerful tools, tools for multivariate, MTSF, multivariate time series
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Transformer-based models have emerged as powerful tools for multivariate time series forecasting (MTSF). However, existing Transformer models often fall short of capturing both intricate dependencies across variate and temporal dimensions in MTS data. Some recent models are proposed to separately capture variate and temporal dependencies through either two sequential or parallel attention mechanisms. However, these methods cannot directly and explicitly learn the intricate inter-series and intra-series dependencies. In this work, we first demonstrate that these dependencies are very important as they usually exist in real-world data. To directly model these dependencies, we propose a transformer-based model UniTST containing a unified attention mechanism on the flattened patch tokens. Additionally, we add a dispatcher module which reduces the complexity and makes the model feasible for a potentially large number of variates. Although our proposed model employs a simple architecture, it offers compelling performance as shown in our extensive experiments on several datasets for time series forecasting.

[AI-16] Neural Laplace for learning Stochastic Differential Equations

Link: https://arxiv.org/abs/2406.04964
Authors: Adrien Carrel
Keywords: differential equations, learning diverse classes, Neural Laplace, ordinary differential equations, Stochastic differential equations
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Probability (math.PR)

Abstract:Neural Laplace is a unified framework for learning diverse classes of differential equations (DE). For different classes of DE, this framework outperforms other approaches relying on neural networks that aim to learn classes of ordinary differential equations (ODE). However, many systems can’t be modelled using ODEs. Stochastic differential equations (SDE) are the mathematical tool of choice when modelling spatiotemporal DE dynamics under the influence of randomness. In this work, we review the potential applications of Neural Laplace to learn diverse classes of SDE, both from a theoretical and a practical point of view.

[AI-17] Learning Divergence Fields for Shift-Robust Graph Representations

Link: https://arxiv.org/abs/2406.04963
Authors: Qitian Wu, Fan Nie, Chenxiao Yang, Junchi Yan
Keywords: induce instance-level interdependence, involves certain geometries, generation often involves, induce instance-level, Real-world data generation
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes: Accepted to ICML 2024. Source codes at this https URL

Abstract:Real-world data generation often involves certain geometries (e.g., graphs) that induce instance-level interdependence. This characteristic makes the generalization of learning models more difficult due to the intricate interdependent patterns that impact data-generative distributions and can vary from training to testing. In this work, we propose a geometric diffusion model with learnable divergence fields for the challenging generalization problem with interdependent data. We generalize the diffusion equation with stochastic diffusivity at each time step, which aims to capture the multi-faceted information flows among interdependent data. Furthermore, we derive a new learning objective through causal inference, which can guide the model to learn generalizable patterns of interdependence that are insensitive across domains. Regarding practical implementation, we introduce three model instantiations that can be considered as the generalized versions of GCN, GAT, and Transformers, respectively, which possess advanced robustness against distribution shifts. We demonstrate their promising efficacy for out-of-distribution generalization on diverse real-world datasets.

[AI-18] Expansion of situations theory for exploring shared awareness in human-intelligent autonomous systems

Link: https://arxiv.org/abs/2406.04956
Authors: Scott A. Humr, Mustafa Canan, Mustafa Demir
Keywords: shared situation awareness, Intelligent autonomous systems, Intelligent autonomous, shared situation, situation awareness
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Notes: Keywords: artificial intelligence; human-machine interaction; IAS; intelligent autonomous systems; shared situational awareness; situations theory

Abstract:Intelligent autonomous systems are part of a system of systems that interact with other agents to accomplish tasks in complex environments. However, intelligent autonomous systems integrated into a system of systems add additional layers of complexity based on their limited cognitive processes, specifically shared situation awareness that allows a team to respond to novel tasks. Intelligent autonomous systems’ lack of shared situation awareness adversely influences team effectiveness in complex task environments, such as military command-and-control. A complementary approach to shared situation awareness, called situations theory, is beneficial for understanding the relationship between system of systems shared situation awareness and effectiveness. The current study presents a conceptual discussion of situations theory to investigate the development of a system of systems shared situational awareness when humans team with intelligent autonomous system agents. To ground the discussion, the reviewed studies expanded situations theory within the context of a system of systems, resulting in three major conjectures that can be beneficial to the design and development of future systems of systems.

[AI-19] Experimental Evaluation of ROS-Causal in Real-World Human-Robot Spatial Interaction Scenarios

Link: https://arxiv.org/abs/2406.04955
Authors: Luca Castri, Gloria Beraldo, Sariah Mghames, Marc Hanheide, Nicola Bellotto
Keywords: human-shared environments requires, Deploying robots, objects interact, requires a deep, deep understanding
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Notes: Published at 2024 IEEE International Conference on Robot and Human Interactive Communication (RO-MAN)

Abstract:Deploying robots in human-shared environments requires a deep understanding of how nearby agents and objects interact. Employing causal inference to model cause-and-effect relationships facilitates the prediction of human behaviours and enables the anticipation of robot interventions. However, a significant challenge arises because existing causal discovery methods have not been implemented within the ROS ecosystem, the de facto standard framework in robotics, which hinders their effective utilisation on real robots. To bridge this gap, in our previous work we proposed ROS-Causal, a ROS-based framework designed for onboard data collection and causal discovery in human-robot spatial interactions. In this work, we present an experimental evaluation of ROS-Causal both in simulation and on a new dataset of human-robot spatial interactions in a lab scenario, to assess its performance and effectiveness. Our analysis demonstrates the efficacy of this approach, showcasing how causal models can be extracted directly onboard by robots during data collection. The online causal models generated from the simulation are consistent with those from lab experiments. These findings can help researchers to enhance the performance of robotic systems in shared environments, firstly by studying the causal relations between variables in simulation without real people, and then facilitating the actual robot deployment in real human environments. ROS-Causal: this https URL

[AI-20] Quantifying Geospatial in the Common Crawl Corpus

Link: https://arxiv.org/abs/2406.04952
Authors: Ilya Ilyankou, Meihui Wang, James Haworth, Stefano Cavazzi
Keywords: Large language models, vast unlabelled text, Common Crawl corpus, exhibit emerging geospatial, emerging geospatial capabilities
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Large language models (LLMs) exhibit emerging geospatial capabilities, stemming from their pre-training on vast unlabelled text datasets that are often derived from the Common Crawl corpus. However, the geospatial content within CC remains largely unexplored, impacting our understanding of LLMs’ spatial reasoning. This paper investigates the prevalence of geospatial data in recent Common Crawl releases using Gemini, a powerful language model. By analyzing a sample of documents and manually revising the results, we estimate that between 1 in 5 and 1 in 6 documents contain geospatial information such as coordinates and street addresses. Our findings provide quantitative insights into the nature and extent of geospatial data within Common Crawl, and web crawl data in general. Furthermore, we formulate questions to guide future investigations into the geospatial content of available web crawl datasets and its influence on LLMs.

[AI-21] Nacala-Roof-Material: Drone Imagery for Roof Detection Classification and Segmentation to Support Mosquito-borne Disease Risk Assessment

Link: https://arxiv.org/abs/2406.04949
Authors: Venkanna Babu Guthula, Stefan Oehmcke, Remigio Chilaule, Hui Zhang, Nico Lang, Ankit Kariryaa, Johan Mottelson, Christian Igel
Keywords: remote sensing imagery, roof types based, malaria risk, assessment of malaria, increased risk
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:As low-quality housing and in particular certain roof characteristics are associated with an increased risk of malaria, classification of roof types based on remote sensing imagery can support the assessment of malaria risk and thereby help prevent the disease. To support research in this area, we release the Nacala-Roof-Material dataset, which contains high-resolution drone images from Mozambique with corresponding labels delineating houses and specifying their roof types. The dataset defines a multi-task computer vision problem, comprising object detection, classification, and segmentation. In addition, we benchmarked various state-of-the-art approaches on the dataset. Canonical U-Nets, YOLOv8, and a custom decoder on pretrained DINOv2 served as baselines. We show that each of the methods has its advantages but none is superior on all tasks, which highlights the potential of our dataset for future research in multi-task learning. While the tasks are closely related, accurate segmentation of objects does not necessarily imply accurate instance separation, and vice versa. We address this general issue by introducing a variant of the deep ordinal watershed (DOW) approach that additionally separates the interior of objects, allowing for improved object delineation and separation. We show that our DOW variant is a generic approach that improves the performance of both U-Net and DINOv2 backbones, leading to a better trade-off between semantic segmentation and instance segmentation.

[AI-22] CarbonSense: A Multimodal Dataset and Baseline for Carbon Flux Modelling

Link: https://arxiv.org/abs/2406.04940
Authors: Matthew Fortier, Mats L. Richter, Oliver Sonnentag, Chris Pal
Keywords: Terrestrial carbon fluxes, provide vital information, Terrestrial carbon, carbon fluxes, fluxes provide vital
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes: 9 content pages, 11 reference pages, 9 appendix pages

Abstract:Terrestrial carbon fluxes provide vital information about our biosphere’s health and its capacity to absorb anthropogenic CO$_2$ emissions. The importance of predicting carbon fluxes has led to the emerging field of data-driven carbon flux modelling (DDCFM), which uses statistical techniques to predict carbon fluxes from biophysical data. However, the field lacks a standardized dataset to promote comparisons between models. To address this gap, we present CarbonSense, the first machine learning-ready dataset for DDCFM. CarbonSense integrates measured carbon fluxes, meteorological predictors, and satellite imagery from 385 locations across the globe, offering comprehensive coverage and facilitating robust model training. Additionally, we provide a baseline model using a current state-of-the-art DDCFM approach and a novel transformer-based model. Our experiments illustrate the potential gains that multimodal deep learning techniques can bring to this domain. By providing these resources, we aim to lower the barrier to entry for other deep learning researchers to develop new models and drive new advances in carbon flux modelling.

[AI-23] SpanGNN: Towards Memory-Efficient Graph Neural Networks via Spanning Subgraph Training

Link: https://arxiv.org/abs/2406.04938
Authors: Xizhi Gu, Hongzheng Li, Shihong Gao, Xinyan Zhang, Lei Chen, Yingxia Shao
Keywords: Graph Neural Networks, Neural Networks, Graph Neural, GNN training, mini-batch GNN training
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Graph Neural Networks (GNNs) have superior capability in learning graph data. Full-graph GNN training generally has high accuracy; however, it suffers from large peak memory usage and encounters the Out-of-Memory problem when handling large graphs. To address this memory problem, a popular solution is mini-batch GNN training. However, mini-batch GNN training increases the training variance and sacrifices the model accuracy. In this paper, we propose a new memory-efficient GNN training method using spanning subgraphs, called SpanGNN. SpanGNN trains GNN models over a sequence of spanning subgraphs, which are constructed starting from an empty structure. To overcome the excessive peak memory consumption problem, SpanGNN selects a set of edges from the original graph to incrementally update the spanning subgraph between every epoch. To ensure the model accuracy, we introduce two types of edge sampling strategies (i.e., variance-reduced and noise-reduced), which help SpanGNN select high-quality edges for the GNN learning. We conduct experiments with SpanGNN on widely used datasets, demonstrating SpanGNN’s advantages in the model performance and low peak memory usage.

[AI-24] SLOPE: Search with Learned Optimal Pruning-based Expansion

Link: https://arxiv.org/abs/2406.04935
Authors: Davor Bokan, Zlatan Ajanovic, Bakir Lacevic
Keywords: motion planning, planning and pathfinding, finding the shortest, promising completeness, Learned Optimal Pruning-based
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: presented at the ICAPS 2024 workshop on Bridging the Planning and Reinforcement Learning

Abstract:Heuristic search is often used for motion planning and pathfinding problems, for finding the shortest path in a graph while also promising completeness and optimal efficiency. The drawback is its space complexity, specifically storing all expanded child nodes in memory and sorting large lists of active nodes, which can be a problem in real-time scenarios with limited on-board computation. To combat this, we present the Search with Learned Optimal Pruning-based Expansion (SLOPE), which learns the distance of a node from a possible optimal path, unlike other approaches that learn a cost-to-go value. The unfavored nodes are then pruned according to the said distance, which in turn reduces the size of the open list. This ensures that the search explores only the region close to optimal paths while lowering memory and computational costs. Unlike traditional learning methods, our approach is orthogonal to estimating cost-to-go heuristics, offering a complementary strategy for improving search efficiency. We demonstrate the effectiveness of our approach by evaluating it as a standalone search method and in conjunction with learned heuristic functions, achieving comparable-or-better node expansion metrics, while lowering the number of child nodes in the open list. Our code is available at this https URL.
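
The pruning step described here can be sketched on top of plain A*. This is a minimal illustration, not the authors' code: the learned model is replaced by a hand-written stand-in (`predicted_dist`, hypothetical) that returns how far a grid cell lies from the single optimal path of a toy problem, and children above a threshold are never pushed to the open list.

```python
import heapq


def search(start, goal, size, predicted_dist=None, threshold=0.0):
    """A* on a size x size 4-connected grid.

    Returns (path cost, number of nodes pushed to the open list).
    If predicted_dist is given, children whose predicted distance from an
    optimal path exceeds `threshold` are pruned before insertion.
    """
    def h(n):  # Manhattan-distance heuristic
        return abs(n[0] - goal[0]) + abs(n[1] - goal[1])

    open_list = [(h(start), 0, start)]
    best_g = {start: 0}
    pushed = 1
    while open_list:
        _, g, node = heapq.heappop(open_list)
        if node == goal:
            return g, pushed
        if g > best_g[node]:
            continue  # stale queue entry
        x, y = node
        for child in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if not (0 <= child[0] < size and 0 <= child[1] < size):
                continue
            # Pruning step: skip children predicted to be far from optimal paths.
            if predicted_dist is not None and predicted_dist(child) > threshold:
                continue
            if g + 1 < best_g.get(child, float("inf")):
                best_g[child] = g + 1
                heapq.heappush(open_list, (g + 1 + h(child), g + 1, child))
                pushed += 1
    return None, pushed


# Toy problem: walk along the bottom row, so the optimal path is y == 0.
full = search((0, 0), (4, 0), size=5)
pruned = search((0, 0), (4, 0), size=5, predicted_dist=lambda n: abs(n[1]))
print(full, pruned)
```

Both runs find the same path cost, but the pruned run inserts fewer nodes into the open list. With an imperfect predictor the pruned search could discard nodes it needs, so the threshold trades memory against the risk of missing a path.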

[AI-25] Optimal Recurrent Network Topologies for Dynamical Systems Reconstruction

Link: https://arxiv.org/abs/2406.04934
Authors: Christoph Jürgen Hemmer, Manuel Brenner, Florian Hess, Daniel Durstewitz
Keywords: time series measurements, underlying dynamical process, seek to infer, infer from time, time series
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Dynamical Systems (math.DS); Chaotic Dynamics (nlin.CD)

Abstract:In dynamical systems reconstruction (DSR) we seek to infer from time series measurements a generative model of the underlying dynamical process. This is a prime objective in any scientific discipline, where we are particularly interested in parsimonious models with a low parameter load. A common strategy here is parameter pruning, removing all parameters with small weights. However, here we find this strategy does not work for DSR, where even low magnitude parameters can contribute considerably to the system dynamics. On the other hand, it is well known that many natural systems which generate complex dynamics, like the brain or ecological networks, have a sparse topology with comparatively few links. Inspired by this, we show that geometric pruning, where in contrast to magnitude-based pruning weights with a low contribution to an attractor’s geometrical structure are removed, indeed manages to reduce parameter load substantially without significantly hampering DSR quality. We further find that the networks resulting from geometric pruning have a specific type of topology, and that this topology, and not the magnitude of weights, is what is most crucial to performance. We provide an algorithm that automatically generates such topologies which can be used as priors for generative modeling of dynamical systems by RNNs, and compare it to other well studied topologies like small-world or scale-free networks.

[AI-26] Online Adaptation for Enhancing Imitation Learning Policies

Link: https://arxiv.org/abs/2406.04913
Authors: Federico Malato, Ville Hautamaki
Keywords: enables autonomous agents, learning enables autonomous, reward signal, Imitation learning enables, enables autonomous
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: Accepted at IEEE Conference on Games 2024, Milan, Italy

Abstract:Imitation learning enables autonomous agents to learn from human examples, without the need for a reward signal. Still, if the provided dataset does not encapsulate the task correctly, or when the task is too complex to be modeled, such agents fail to reproduce the expert policy. We propose to recover from these failures through online adaptation. Our approach combines the action proposal coming from a pre-trained policy with relevant experience recorded by an expert. The combination results in an adapted action that closely follows the expert. Our experiments show that an adapted agent performs better than its pure imitation learning counterpart. Notably, adapted agents can achieve reasonable performance even when the base, non-adapted policy catastrophically fails.

[AI-27] PolyLUT-Add: FPGA-based LUT Inference with Wide Inputs

Link: https://arxiv.org/abs/2406.04910
Authors: Binglei Lou, Richard Rademacher, David Boland, Philip H.W. Leong
Keywords: deploying deep neural, deep neural networks, distinct advantages, technology for deploying, deploying deep
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
Notes: To be published in the International Conference on Field-Programmable Logic and Applications (FPL) 2024

Abstract:FPGAs have distinct advantages as a technology for deploying deep neural networks (DNNs) at the edge. Lookup Table (LUT) based networks, where neurons are directly modelled using LUTs, help maximize this promise of offering ultra-low latency and high area efficiency on FPGAs. Unfortunately, LUT resource usage scales exponentially with the number of inputs to the LUT, restricting PolyLUT to small LUT sizes. This work introduces PolyLUT-Add, a technique that enhances neuron connectivity by combining $A$ PolyLUT sub-neurons via addition to improve accuracy. Moreover, we describe a novel architecture to improve its scalability. We evaluated our implementation over the MNIST, Jet Substructure classification and Network Intrusion Detection benchmarks and found that for similar accuracy, PolyLUT-Add achieves a LUT reduction of $1.3$-$7.7\times$ with a $1.2$-$2.2\times$ decrease in latency.

[AI-28] RU-AI: A Large Multimodal Dataset for Machine Generated Content Detection

Link: https://arxiv.org/abs/2406.04906
Authors: Liting Huang, Zhihao Zhang, Yiran Zhang, Xiyue Zhou, Shoujin Wang
Keywords: create realistic, people communicate, recent advancements, realistic and human-like, significantly transforming
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:The recent advancements in generative AI models, which can create realistic and human-like content, are significantly transforming how people communicate, create, and work. While the appropriate use of generative AI models can benefit the society, their misuse poses significant threats to data reliability and authentication. However, due to a lack of aligned multimodal datasets, effective and robust methods for detecting machine-generated content are still in the early stages of development. In this paper, we introduce RU-AI, a new large-scale multimodal dataset designed for the robust and efficient detection of machine-generated content in text, image, and voice. Our dataset is constructed from three large publicly available datasets: Flickr8K, COCO, and Places205, by combining the original datasets and their corresponding machine-generated pairs. Additionally, experimental results show that our proposed unified model, which incorporates a multimodal embedding module with a multilayer perceptron network, can effectively determine the origin of the data (i.e., original data samples or machine-generated ones) from RU-AI. However, future work is still required to address the remaining challenges posed by RU-AI. The source code and dataset are available at this https URL.

[AI-29] Sliding Window 3-Objective Pareto Optimization for Problems with Chance Constraints

Link: https://arxiv.org/abs/2406.04899
Authors: Frank Neumann, Carsten Witt
Keywords: evolutionary multi-objective algorithms, Neumann and Witt, sliding window approach, sliding window, frequently tackled
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Notes: To appear at PPSN 2024

Abstract:Constrained single-objective problems have been frequently tackled by evolutionary multi-objective algorithms where the constraint is relaxed into an additional objective. Recently, it has been shown that Pareto optimization approaches using bi-objective models can be significantly sped up using sliding windows (Neumann and Witt, ECAI 2023). In this paper, we extend the sliding window approach to 3-objective formulations for tackling chance constrained problems. On the theoretical side, we show that our new sliding window approach improves previous runtime bounds obtained in (Neumann and Witt, GECCO 2023) while maintaining the same approximation guarantees. Our experimental investigations for the chance constrained dominating set problem show that our new sliding window approach allows one to solve much larger instances in a much more efficient way than the 3-objective approach presented in (Neumann and Witt, GECCO 2023).

[AI-30] Stabilizing Extreme Q-learning by Maclaurin Expansion

Link: https://arxiv.org/abs/2406.04896
Authors: Motoki Omura, Takayuki Osa, Yusuke Mukuta, Tatsuya Harada
Keywords: Regression is performed, Gumbel Regression, Expanded Extreme Q-learning, assumed Gumbel distribution, Extreme Q-learning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes: Accepted at RLC 2024: The first Reinforcement Learning Conference

Abstract:In Extreme Q-learning (XQL), Gumbel Regression is performed with an assumed Gumbel distribution for the error distribution. This allows learning of the value function without sampling out-of-distribution actions and has shown excellent performance mainly in Offline RL. However, issues remained, including the exponential term in the loss function causing instability and the potential for an error distribution diverging from the Gumbel distribution. Therefore, we propose Maclaurin Expanded Extreme Q-learning to enhance stability. In this method, applying Maclaurin expansion to the loss function in XQL enhances stability against large errors. It also allows adjusting the error distribution assumption from normal to Gumbel based on the expansion order. Our method significantly stabilizes learning in Online RL tasks from DM Control, where XQL was previously unstable. Additionally, it improves performance in several Offline RL tasks from D4RL, where XQL already showed excellent results.
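
To see why truncating the series tames the exponential term, the sketch below compares a Gumbel-style regression loss with its Maclaurin truncation. The abstract does not state the loss formula, so the form `exp(z) - z - 1` on a scaled error `z` is an assumption based on the commonly used Gumbel regression objective, and both function names are hypothetical.

```python
import math


def gumbel_loss(z):
    """Assumed Gumbel regression loss on a scaled error z: exp(z) - z - 1."""
    return math.exp(z) - z - 1.0


def maclaurin_loss(z, order):
    """Maclaurin series of exp(z) - z - 1, truncated after the z**order term.

    The constant and linear terms cancel, so the series starts at z**2 / 2.
    order must be at least 2.
    """
    return sum(z ** k / math.factorial(k) for k in range(2, order + 1))


for order in (2, 4, 8):
    print(order, maclaurin_loss(3.0, order), gumbel_loss(3.0))
```

At order 2 the truncated loss is `z**2 / 2`, a squared error that corresponds to a normal error assumption. Raising the order moves it toward the full exponential loss, which matches the abstract's remark about adjusting the assumption from normal to Gumbel. For large errors the low-order truncation grows polynomially instead of exponentially.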

[AI-31] Enhancing Indoor Temperature Forecasting through Synthetic Data in Low-Data Environments

Link: https://arxiv.org/abs/2406.04890
Authors: Zachari Thiry, Massimiliano Ruocco, Alessandro Nocente, Michail Spitieris
Keywords: achieve efficient control, HVAC systems, control of HVAC, data, synthetic data
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Forecasting indoor temperatures is important to achieve efficient control of HVAC systems. In this task, the limited data availability presents a challenge as most of the available data is acquired during standard operation where extreme scenarios and transitory regimes such as major temperature increases or decreases are de facto excluded. Acquisition of such data requires significant energy consumption and a dedicated facility, hindering the quantity and diversity of available data. Cost-related constraints, however, do not allow for continuous year-round acquisition. To address this, we investigate the efficacy of data augmentation techniques leveraging SoTA AI-based methods for synthetic data generation. Inspired by practical and experimental motivations, we explore fusion strategies of real and synthetic data to improve forecasting models. This approach alleviates the need for continuously acquiring extensive time series data, especially in contexts involving repetitive heating and cooling cycles in buildings. In our evaluation 1) we assess the performance of synthetic data generators independently, particularly focusing on SoTA AI-based methods; 2) we measure the utility of incorporating synthetically augmented data in a subsequent forecasting task where we employ a simple model in two distinct scenarios: (a) we first examine an augmentation technique that combines real and synthetically generated data to expand the training dataset, and (b) we delve into utilizing synthetic data to tackle dataset imbalances. Our results highlight the potential of synthetic data augmentation in enhancing forecasting accuracy while mitigating training variance. Through empirical experiments, we show significant improvements achievable by integrating synthetic data, thereby paving the way for more robust forecasting models in low-data regimes.

[AI-32] Seeing the Unseen: Visual Metaphor Captioning for Videos

Link: https://arxiv.org/abs/2406.04886
Authors: Abisek Rajakumar Kalarani, Pushpak Bhattacharyya, Sumit Shekhar
Keywords: common communication tool, common communication, communication tool, Metaphors, Average Concept Distance
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:Metaphors are a common communication tool used in our day-to-day life. The detection and generation of metaphors in textual form have been studied extensively but metaphors in other forms have been under-explored. Recent studies have shown that Vision-Language (VL) models cannot understand visual metaphors in memes and adverts. As of now, no probing studies have been done that involve complex language phenomena like metaphors with videos. Hence, we introduce a new VL task of describing the metaphors present in the videos in our work. To facilitate this novel task, we construct and release a manually created dataset with 705 videos and 2115 human-written captions, along with a new metric called Average Concept Distance (ACD), to automatically evaluate the creativity of the metaphors generated. We also propose a novel low-resource video metaphor captioning system: GIT-LLaVA, which obtains comparable performance to SoTA video language models on the proposed task. We perform a comprehensive analysis of existing video language models on this task and publish our dataset, models, and benchmark results to enable further research.

[AI-33] InstructNav: Zero-shot System for Generic Instruction Navigation in Unexplored Environment

Link: https://arxiv.org/abs/2406.04882
Authors: Yuxing Long, Wenzhe Cai, Hongcheng Wang, Guanqi Zhan, Hao Dong
Keywords: Enabling robots, navigation, human-robot interaction, instruction navigation, instruction
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to CoRL 2024

Abstract:Enabling robots to navigate following diverse language instructions in unexplored environments is an attractive goal for human-robot interaction. However, this goal is challenging because different navigation tasks require different strategies. The scarcity of instruction navigation data hinders training an instruction navigation model with varied strategies. Therefore, previous methods are all constrained to one specific type of navigation instruction. In this work, we propose InstructNav, a generic instruction navigation system. InstructNav makes the first endeavor to handle various instruction navigation tasks without any navigation training or pre-built maps. To reach this goal, we introduce Dynamic Chain-of-Navigation (DCoN) to unify the planning process for different types of navigation instructions. Furthermore, we propose Multi-sourced Value Maps to model key elements in instruction navigation so that linguistic DCoN planning can be converted into robot actionable trajectories. With InstructNav, we complete the R2R-CE task in a zero-shot way for the first time and outperform many task-training methods. Besides, InstructNav also surpasses the previous SOTA method by 10.48% on the zero-shot Habitat ObjNav and by 86.34% on demand-driven navigation DDN. Real robot experiments on diverse indoor scenes further demonstrate our method’s robustness in coping with the environment and instruction variations.

[AI-34] Ada-VE: Training-Free Consistent Video Editing Using Adaptive Motion Prior

Link: https://arxiv.org/abs/2406.04873
Authors: Tanvir Mahmud, Mustafa Munir, Radu Marculescu, Diana Marculescu
Keywords: synthesis models face, face significant challenges, models face significant, ensuring consistent character, consistent character generation
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Tech Report

Abstract:Video-to-video synthesis models face significant challenges, such as ensuring consistent character generation across frames, maintaining smooth temporal transitions, and preserving quality during fast motion. The introduction of joint fully cross-frame self-attention mechanisms has improved character consistency, but this comes at the cost of increased computational complexity. This full cross-frame self-attention mechanism also incorporates redundant details and limits the number of frames that can be jointly edited due to its computational cost. Moreover, the lack of frames in cross-frame attention adversely affects temporal consistency and visual quality. To address these limitations, we propose a new adaptive motion-guided cross-frame attention mechanism that drastically reduces complexity while preserving semantic details and temporal consistency. Specifically, we selectively incorporate the moving regions of successive frames in cross-frame attention and sparsely include stationary regions based on optical flow sampling. This technique allows for an increased number of jointly edited frames without additional computational overhead. For longer duration of video editing, existing methods primarily focus on frame interpolation or flow-warping from jointly edited keyframes, which often results in blurry frames or reduced temporal consistency. To improve this, we introduce KV-caching of jointly edited frames and reuse the same KV across all intermediate frames, significantly enhancing both intermediate frame quality and temporal consistency. Overall, our motion-sampling method enables the use of around three times more keyframes than existing joint editing methods while maintaining superior prediction quality. Ada-VE achieves up to 4x speed-up when using fully-extended self-attention across 40 frames for joint editing, without compromising visual quality or temporal consistency.

[AI-35] Deep learning for precipitation nowcasting: A survey from the perspective of time series forecasting

Link: https://arxiv.org/abs/2406.04867
Authors: Sojung An, Tae-Jin Oh, Eunha Sohn, Donghyun Kim
Keywords: estimate motion flow, time series precipitation, series precipitation forecasting, time series, precipitation forecasting
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 21 pages, 7 figures, 5 tables

Abstract:Deep learning-based time series forecasting has dominated the short-term precipitation forecasting field thanks to its ability to estimate motion flow in high-resolution datasets. The growing interest in precipitation nowcasting offers substantial opportunities for the advancement of current forecasting technologies. Nevertheless, there has been a scarcity of in-depth surveys of time series precipitation forecasting using deep learning. Thus, this paper systematically reviews recent progress in time series precipitation forecasting models. Specifically, we investigate the following key points within background components, covering: i) preprocessing, ii) objective functions, and iii) evaluation metrics. We then categorize forecasting models into recursive and multiple strategies based on their approaches to predicting future frames, and investigate the impacts of models using these strategies and their performance assessments. Finally, we evaluate current deep learning-based models for precipitation forecasting on a public benchmark, discuss their limitations and challenges, and present some promising research directions. Our contribution lies in providing insights for a better understanding of time series precipitation forecasting and in aiding the development of robust AI solutions for the future.

[AI-36] Digital assistant in a point of sales

Link: https://arxiv.org/abs/2406.04851
Authors: Emilia Lesiak, Grzegorz Wolny, Bartosz Przybył, Michał Szczerbak
Keywords: Voice User Interface, User Interface, Voice User, powered digital assistant, digital assistant
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: update: cleaned the unnecessary files and updated the metadata

Abstract:This article investigates the deployment of a Voice User Interface (VUI)-powered digital assistant in a retail setting and assesses its impact on customer engagement and service efficiency. The study explores how digital assistants can enhance user interactions through advanced conversational capabilities with multilingual support. By integrating a digital assistant into a high-traffic retail environment, we evaluate its effectiveness in improving the quality of customer service and operational efficiency. Data collected during the experiment demonstrate varied impacts on customer interaction, revealing insights into the future optimizations of digital assistant technologies in customer-facing roles. This study contributes to the understanding of digital transformation strategies within the customer relations domain emphasizing the need for service flexibility and user-centric design in modern retail stores.

[AI-37] CTBENCH: A Library and Benchmark for Certified Training

Link: https://arxiv.org/abs/2406.04848
Authors: Yuhao Mao, Stefan Balauca, Martin Vechev
Keywords: certifiably robust neural, robust neural networks, Training certifiably robust, challenging task, certifiably robust
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Training certifiably robust neural networks is an important but challenging task. While many algorithms for (deterministic) certified training have been proposed, they are often evaluated on different training schedules, certification methods, and systematically under-tuned hyperparameters, making it difficult to compare their performance. To address this challenge, we introduce CTBENCH, a unified library and a high-quality benchmark for certified training that evaluates all algorithms under fair settings and systematically tuned hyperparameters. We show that (1) almost all algorithms in CTBENCH surpass the corresponding reported performance in literature in the magnitude of algorithmic improvements, thus establishing new state-of-the-art, and (2) the claimed advantage of recent algorithms drops significantly when we enhance the outdated baselines with a fair training schedule, a fair certification method and well-tuned hyperparameters. Based on CTBENCH, we provide new insights into the current state of certified training and suggest future research directions. We are confident that CTBENCH will serve as a benchmark and testbed for future research in certified training.

[AI-38] FedLLM-Bench: Realistic Benchmarks for Federated Learning of Large Language Models

Link: https://arxiv.org/abs/2406.04845
Authors: Rui Ye, Rui Ge, Xinyu Zhu, Jingyi Chai, Yaxin Du, Yang Liu, Yanfeng Wang, Siheng Chen
Keywords: enabled multiple parties, collaboratively train large, train large language, large language models, sharing their data
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: 22 pages

Abstract:Federated learning has enabled multiple parties to collaboratively train large language models without directly sharing their data (FedLLM). Following this training paradigm, the community has put massive efforts from diverse aspects including framework, performance, and privacy. However, an unpleasant fact is that there are currently no realistic datasets and benchmarks for FedLLM and previous works all rely on artificially constructed datasets, failing to capture properties in real-world scenarios. Addressing this, we propose FedLLM-Bench, which involves 8 training methods, 4 training datasets, and 6 evaluation metrics, to offer a comprehensive testbed for the FedLLM community. FedLLM-Bench encompasses three datasets (e.g., user-annotated multilingual dataset) for federated instruction tuning and one dataset (e.g., user-annotated preference dataset) for federated preference alignment, whose scale of client number ranges from 38 to 747. Our datasets incorporate several representative diversities: language, quality, quantity, instruction, length, embedding, and preference, capturing properties in real-world scenarios. Based on FedLLM-Bench, we conduct experiments on all datasets to benchmark existing FL methods and provide empirical insights (e.g., multilingual collaboration). We believe that our FedLLM-Bench can benefit the FedLLM community by reducing required efforts, providing a practical testbed, and promoting fair comparisons. Code and datasets are available at this https URL.

[AI-39] Primitive Agentic First-Order Optimization

Link: https://arxiv.org/abs/2406.04841
Authors: R. Sala
Keywords: Efficient numerical optimization, Efficient numerical, partial state representations, improve performance, performance and reduce
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY); Optimization and Control (math.OC)
Comments: 7 Pages

Abstract:Efficient numerical optimization methods can improve performance and reduce the environmental impact of computing in many applications. This work presents a proof-of-concept study combining primitive state representations and agent-environment interactions as first-order optimizers in the setting of budget-limited optimization. Through reinforcement learning (RL) over a set of training instances of an optimization problem class, optimal policies for sequential update selection of algorithmic iteration steps are approximated in generally formulated low-dimensional partial state representations that consider aspects of progress and resource use. For the investigated case studies, deployment of the trained agents to unseen instances of the quadratic optimization problem classes outperformed conventional optimal algorithms with optimized hyperparameters. The results show that elementary RL methods combined with succinct partial state representations can be used as heuristics to manage complexity in RL-based optimization, paving the way for agentic optimization approaches.

[AI-40] Algorithms for learning value-aligned policies considering admissibility relaxation

Link: https://arxiv.org/abs/2406.04838
Authors: Andrés Holgado-Sánchez, Joaquín Arias, Holger Billhardt, Sascha Ossowski
Keywords: awareness engineering, claims that software, emerging field, accordance with human, software agents
Subjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments:

Abstract:The emerging field of value awareness engineering claims that software agents and systems should be value-aware, i.e. they must make decisions in accordance with human values. In this context, such agents must be capable of explicitly reasoning as to how far different courses of action are aligned with these values. For this purpose, values are often modelled as preferences over states or actions, which are then aggregated to determine the sequences of actions that are maximally aligned with a certain value. Recently, additional value admissibility constraints at this level have been considered as well. However, often relaxed versions of these constraints are needed, and this increases considerably the complexity of computing value-aligned policies. To obtain efficient algorithms that make value-aligned decisions considering admissibility relaxation, we propose the use of learning techniques; in particular, we have used constrained reinforcement learning algorithms. In this paper, we present two algorithms, ε-ADQL for strategies based on local alignment and its extension ε-CADQL for a sequence of decisions. We have validated their efficiency in a water distribution problem in a drought scenario.

[AI-41] Revisiting Catastrophic Forgetting in Large Language Model Tuning

Link: https://arxiv.org/abs/2406.04836
Authors: Hongyu Li, Liang Ding, Meng Fang, Dacheng Tao
Keywords: forgetting previously acquired, models forgetting previously, previously acquired knowledge, Catastrophic Forgetting, forgetting previously
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Catastrophic Forgetting (CF) means models forgetting previously acquired knowledge when learning new data. It compromises the effectiveness of large language models (LLMs) during fine-tuning, yet the underlying causes have not been thoroughly investigated. This paper takes the first step to reveal the direct link between the flatness of the model loss landscape and the extent of CF in the field of LLMs. Based on this, we introduce the sharpness-aware minimization to mitigate CF by flattening the loss landscape. Experiments on three widely-used fine-tuning datasets, spanning different model scales, demonstrate the effectiveness of our method in alleviating CF. Analyses show that we nicely complement the existing anti-forgetting strategies, further enhancing the resistance of LLMs to CF.
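
The sharpness-aware minimization (SAM) update mentioned in this abstract can be illustrated on a toy problem. The sketch below is a minimal pure-Python version of the generic SAM step (perturb the weights along the normalized gradient, then descend using the gradient at the perturbed point). The quadratic loss, the function names, and the values of `lr` and `rho` are assumptions chosen for illustration; this is not the paper's training code.

```python
import math

def sam_step(w, grad_fn, lr=0.05, rho=0.05):
    """One generic SAM update on a list of floats (illustrative sketch)."""
    g = grad_fn(w)
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        return list(w)  # already at a stationary point
    # Ascent step: move to the (approximate) worst-case point within radius rho.
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # Descent step: apply the gradient measured at the perturbed point.
    g_adv = grad_fn(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]

# Hypothetical toy loss: an ill-conditioned quadratic bowl.
def loss(w):
    return 0.5 * (w[0] ** 2 + 10.0 * w[1] ** 2)

def grad(w):
    return [w[0], 10.0 * w[1]]

def run_sam(w0, steps=50):
    w = list(w0)
    for _ in range(steps):
        w = sam_step(w, grad)
    return w
```

Calling `run_sam([1.0, 1.0])` drives the toy loss well below its starting value; in practice the same two-gradient pattern is applied to mini-batch gradients of a neural network.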

[AI-42] Graph Mining under Data scarcity

Link: https://arxiv.org/abs/2406.04825
Authors: Appan Rakaraddi, Lam Siew-Kei, Mahardhika Pratama, Marcus de Carvalho
Keywords: Multitude of deep, Uncertainty Estimator, GNN backbone network, Uncertainty Estimator framework, node classification
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 7 pages, 2 figures

Abstract:A multitude of deep learning models have been proposed for node classification in graphs. However, they tend to perform poorly under labeled-data scarcity. Although few-shot learning for graphs has been introduced to overcome this problem, the existing models are not easily adaptable to generic graph learning frameworks like Graph Neural Networks (GNNs). Our work proposes an Uncertainty Estimator framework that can be applied on top of any generic GNN backbone network (which are typically designed for supervised/semi-supervised node classification) to improve the node classification performance. A neural network is used to model the Uncertainty Estimator as a probability distribution rather than probabilistic discrete scalar values. We train these models under the classic episodic learning paradigm in the n-way, k-shot fashion, in an end-to-end setting. Our work demonstrates that implementing the uncertainty estimator on a GNN backbone network improves the classification accuracy under the few-shot setting without any meta-learning-specific architecture. We conduct experiments on multiple datasets under different few-shot settings and different GNN-based backbone networks. Our method outperforms the baselines, which demonstrates the efficacy of the Uncertainty Estimator for few-shot node classification on graphs with a GNN.
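
The episodic n-way, k-shot paradigm referred to in this abstract can be sketched as a sampling routine. The example below is a minimal illustration under stated assumptions: node labels are given as a plain dictionary, and the function name `sample_episode` and the `q_query` parameter (query nodes per class) are hypothetical and not taken from the paper.

```python
import random
from collections import defaultdict

def sample_episode(labels, n_way, k_shot, q_query, rng=None):
    """Sample one n-way, k-shot episode.

    labels: dict mapping node id -> class label.
    Returns (support, query), each a list of (node, label) pairs.
    """
    rng = rng or random.Random()
    by_class = defaultdict(list)
    for node, label in labels.items():
        by_class[label].append(node)
    # Only classes with enough nodes for both support and query sets qualify.
    eligible = [c for c, nodes in by_class.items()
                if len(nodes) >= k_shot + q_query]
    if len(eligible) < n_way:
        raise ValueError("not enough classes with sufficient labeled nodes")
    support, query = [], []
    for c in rng.sample(eligible, n_way):
        picked = rng.sample(by_class[c], k_shot + q_query)
        support += [(v, c) for v in picked[:k_shot]]   # k labeled examples
        query += [(v, c) for v in picked[k_shot:]]     # held-out evaluation nodes
    return support, query
```

A model is then trained over many such episodes: it sees the support set, predicts the query nodes, and the loss on the query set is backpropagated.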

[AI-43] BERTs are Generative In-Context Learners

Link: https://arxiv.org/abs/2406.04823
Authors: David Samuel
Keywords: in-context learning capabilities, challenging the common, paper explores, common view, in-context learning
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 21 pages, preprint

Abstract:This paper explores the in-context learning capabilities of masked language models, challenging the common view that this ability does not ‘emerge’ in them. We present an embarrassingly simple inference technique that enables DeBERTa to operate as a generative model without any additional training. Our findings demonstrate that DeBERTa can match and even surpass GPT-3, its contemporary that famously introduced the paradigm of in-context learning. The comparative analysis reveals that the masked and causal language models behave very differently, as they clearly outperform each other on different categories of tasks. This suggests that there is great potential for a hybrid training approach that takes advantage of the strengths of both training objectives.

[AI-44] Navigating Efficiency in MobileViT through Gaussian Process on Global Architecture Factors

Link: https://arxiv.org/abs/2406.04820
Authors: Ke Meng, Kai Chen
Keywords: convolutional neural networks, Numerous techniques, achieve optimal architectures, neural networks, meticulously designed
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Numerous techniques have been meticulously designed to achieve optimal architectures for convolutional neural networks (CNNs), yet a comparable focus on vision transformers (ViTs) has been somewhat lacking. Despite the remarkable success of ViTs in various vision tasks, their heavyweight nature presents challenges of computational costs. In this paper, we leverage the Gaussian process to systematically explore the nonlinear and uncertain relationship between performance and global architecture factors of MobileViT, such as resolution, width, and depth including the depth of inverted residual blocks and the depth of ViT blocks, and joint factors including resolution-depth and resolution-width. We present design principles twisting magic 4D cube of the global architecture factors that minimize model sizes and computational costs with higher model accuracy. We introduce a formula for downsizing architectures by iteratively deriving smaller MobileViT V2, all while adhering to a specified constraint of multiply-accumulate operations (MACs). Experiment results show that our formula significantly outperforms CNNs and mobile ViTs across diversified datasets.

[AI-45] Skill-aware Mutual Information Optimisation for Generalisation in Reinforcement Learning

Link: https://arxiv.org/abs/2406.04815
Authors: Xuehui Yu, Mhairi Dunion, Xin Li, Stefano V. Albrecht
Keywords: varying environmental features, Skill-aware Mutual Information, Meta-Reinforcement Learning, modes of behaviours, Skill-aware Noise Contrastive
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:

Abstract:Meta-Reinforcement Learning (Meta-RL) agents can struggle to operate across tasks with varying environmental features that require different optimal skills (i.e., different modes of behaviours). Using context encoders based on contrastive learning to enhance the generalisability of Meta-RL agents is now widely studied but faces challenges such as the requirement for a large sample size, also referred to as the log-K curse. To improve RL generalisation to different tasks, we first introduce Skill-aware Mutual Information (SaMI), an optimisation objective that aids in distinguishing context embeddings according to skills, thereby equipping RL agents with the ability to identify and execute different skills across tasks. We then propose Skill-aware Noise Contrastive Estimation (SaNCE), a K-sample estimator used to optimise the SaMI objective. We provide a framework for equipping an RL agent with SaNCE in practice and conduct experimental validation on modified MuJoCo and Panda-gym benchmarks. We empirically find that RL agents that learn by maximising SaMI achieve substantially improved zero-shot generalisation to unseen tasks. Additionally, the context encoder equipped with SaNCE demonstrates greater robustness to reductions in the number of available samples, thus possessing the potential to overcome the log-K curse.
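
The log-K curse named in this abstract can be seen from the standard InfoNCE estimator, which is a plausible reference point here but is not the paper's SaNCE. The minimal sketch below computes the InfoNCE loss for one positive score and K-1 negative scores, together with the implied mutual-information lower bound log K minus the loss. Since the loss is never negative, the bound cannot exceed log K, so tight estimates of large mutual information need many samples. The function name and the example scores are assumptions for illustration.

```python
import math

def info_nce(pos_score, neg_scores):
    """Standard InfoNCE with K = 1 + len(neg_scores) samples.

    Returns (loss, mi_lower_bound), where mi_lower_bound = log K - loss.
    """
    scores = [pos_score] + list(neg_scores)
    m = max(scores)
    # Numerically stable log-sum-exp over all K critic scores.
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    loss = log_z - pos_score          # cross-entropy of picking the positive
    k = len(scores)
    return loss, math.log(k) - loss   # bound is capped at log K
```

For example, `info_nce(5.0, [0.0] * 7)` uses K = 8 samples, so the returned bound can be at most log 8 no matter how well the critic separates the positive from the negatives.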

[AI-46] Generating Piano Practice Policy with a Gaussian Process

Link: https://arxiv.org/abs/2406.04812
Authors: Alexandra Moringen, Elad Vromen, Helge Ritter, Jason Friedman
Keywords: so-called practice modes, practice modes, practice, units that focus, play music comprise
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:A typical process of learning to play a piece on a piano consists of a progression through a series of practice units that focus on individual dimensions of the skill, the so-called practice modes. Practice modes in learning to play music comprise a particularly large set of possibilities, such as hand coordination, posture, articulation, ability to read a music score, correct timing or pitch, etc. Self-guided practice is known to be suboptimal, and a model that schedules optimal practice to maximize a learner’s progress still does not exist. Because we each learn differently and there are many choices for possible piano practice tasks and methods, the set of practice modes should be dynamically adapted to the human learner, a process typically guided by a teacher. However, having a human teacher guide individual practice is not always feasible since it is time-consuming, expensive, and often unavailable. In this work, we present a modeling framework to guide the human learner through the learning process by choosing the practice modes generated by a policy model. To this end, we present a computational architecture building on a Gaussian process that incorporates 1) the learner state, 2) a policy that selects a suitable practice mode, 3) performance evaluation, and 4) expert knowledge. The proposed policy model is trained to approximate the expert-learner interaction during a practice session. In our future work, we will test different Bayesian optimization techniques, e.g., different acquisition functions, and evaluate their effect on the learning progress.

[AI-47] Fragile Model Watermarking: A Comprehensive Survey of Evolution Characteristics and Classification

Link: https://arxiv.org/abs/2406.04809
Authors: Zhenzhe Gao, Yu Cheng, Zhaoxia Yin
Keywords: witnessed rapid development, traditional multimedia fragile, Model fragile watermarking, multimedia fragile watermarking, fragile watermarking
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Model fragile watermarking, inspired by both the field of adversarial attacks on neural networks and traditional multimedia fragile watermarking, has gradually emerged as a potent tool for detecting tampering, and has witnessed rapid development in recent years. Unlike robust watermarks, which are widely used for identifying model copyrights, fragile watermarks for models are designed to identify whether models have been subjected to unexpected alterations such as backdoors, poisoning, compression, among others. These alterations can pose unknown risks to model users, such as misidentifying stop signs as speed limit signs in classic autonomous driving scenarios. This paper provides an overview of the relevant work in the field of model fragile watermarking since its inception, categorizing them and revealing the developmental trajectory of the field, thus offering a comprehensive survey for future endeavors in model fragile watermarking.

[AI-48] TEDi Policy: Temporally Entangled Diffusion for Robotic Control

Link: https://arxiv.org/abs/2406.04806
Authors: Sigmund H. Høeg, Lars Tingelstad
Keywords: modeling complex distributions, complex distributions, Temporally Entangled Diffusion, shown to excel, mastering the challenge
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:Diffusion models have been shown to excel in robotic imitation learning by mastering the challenge of modeling complex distributions. However, sampling speed has traditionally not been a priority due to their popularity for image generation, limiting their application to dynamical tasks. While recent work has improved the sampling speed of diffusion-based robotic policies, they are restricted to techniques from the image generation domain. We adapt Temporally Entangled Diffusion (TEDi), a framework specific for trajectory generation, to speed up diffusion-based policies for imitation learning. We introduce TEDi Policy, with novel regimes for training and sampling, and show that it drastically improves the sampling speed while remaining performant when applied to state-of-the-art diffusion-based imitation learning policies.

[AI-49] Zero, Finite, and Infinite Belief History of Theory of Mind Reasoning in Large Language Models

Link: https://arxiv.org/abs/2406.04800
Authors: Weizhi Tang, Vaishak Belle
Keywords: Theory of Mind, Large Language Models, Infinite Belief History, Belief History, emergence of Theory
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) have recently shown promising signs of an emerging Theory of Mind (ToM) ability and even outperform humans in certain ToM tasks. To evaluate and extend the boundaries of the ToM reasoning ability of LLMs, we propose a novel concept, taxonomy, and framework, ToM reasoning with Zero, Finite, and Infinite Belief History, and develop a multi-round text-based game, called Pick the Right Stuff, as a benchmark. We have evaluated six LLMs with this game and found that their performance on Zero Belief History is consistently better than on Finite Belief History. In addition, we have found that two of the models with small parameter sizes outperform all the evaluated models with large parameter sizes. We expect this work to pave the way for future ToM benchmark development and also for the promotion and development of more complex AI agents or systems which are required to be equipped with more complex ToM reasoning ability.

[AI-50] Learning-Augmented Priority Queues

Link: https://arxiv.org/abs/2406.04793
Authors: Ziyad Benomar, Christian Coester
Keywords: computer science, fundamental and widely, widely used data, data structures, structures in computer
Subjects: Data Structures and Algorithms (cs.DS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Priority queues are one of the most fundamental and widely used data structures in computer science. Their primary objective is to efficiently support the insertion of new elements with assigned priorities and the extraction of the highest priority element. In this study, we investigate the design of priority queues within the learning-augmented framework, where algorithms use potentially inaccurate predictions to enhance their worst-case performance. We examine three prediction models spanning different use cases, and show how the predictions can be leveraged to enhance the performance of priority queue operations. Moreover, we demonstrate the optimality of our solution and discuss some possible applications.
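
The two core operations described in this abstract, insertion with an assigned priority and extraction of the highest-priority element, can be sketched with Python's standard `heapq` module. This is a plain binary heap used as a baseline illustration, not the learning-augmented structure studied in the paper. The class name and the convention that a smaller key means higher priority are assumptions of the sketch.

```python
import heapq

class PriorityQueue:
    """Binary-heap priority queue; smaller key = higher priority (assumed convention)."""

    def __init__(self):
        self._heap = []
        self._counter = 0  # tie-breaker: equal priorities leave in insertion order

    def insert(self, priority, item):
        # O(log n) comparisons in the worst case.
        heapq.heappush(self._heap, (priority, self._counter, item))
        self._counter += 1

    def extract_min(self):
        # Removes and returns the entry with the smallest key, also O(log n).
        priority, _, item = heapq.heappop(self._heap)
        return priority, item

    def __len__(self):
        return len(self._heap)
```

In the learning-augmented setting, the question is how predictions about the new element, which may be inaccurate, can reduce the work done by these operations.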

[AI-51] SelfGoal: Your Language Agents Already Know How to Achieve High-level Goals

Link: https://arxiv.org/abs/2406.04784
Authors: Ruihan Yang, Jiangjie Chen, Yikai Zhang, Siyu Yuan, Aili Chen, Kyle Richardson, Yanghua Xiao, Deqing Yang
Keywords: large language models, gaming and programming, Language agents powered, powered by large, increasingly valuable
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Preprint

Abstract:Language agents powered by large language models (LLMs) are increasingly valuable as decision-making tools in domains such as gaming and programming. However, these agents often face challenges in achieving high-level goals without detailed instructions and in adapting to environments where feedback is delayed. In this paper, we present SelfGoal, a novel automatic approach designed to enhance agents’ capabilities to achieve high-level goals with limited human prior and environmental feedback. The core concept of SelfGoal involves adaptively breaking down a high-level goal into a tree structure of more practical subgoals during the interaction with environments while identifying the most useful subgoals and progressively updating this structure. Experimental results demonstrate that SelfGoal significantly enhances the performance of language agents across various tasks, including competitive, cooperative, and deferred feedback environments. Project page: this https URL.

[AI-52] Software Engineering for Collective Cyber-Physical Ecosystems

Link: https://arxiv.org/abs/2406.04780
Authors: Roberto Casadei, Gianluca Aguzzi, Giorgio Audrito, Ferruccio Damiani, Danilo Pianini, Giordano Scarso, Gianluca Torta, Mirko Viroli
Keywords: large-scale cyber-physical ecosystems, addresses large-scale cyber-physical, pervasive computing addresses, computing addresses large-scale, Today distributed
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
Comments: 12 pages, 2 figures, Accepted for presentation at the International Workshop on Software Engineering in 2030, November 2024, Puerto Galinas (Brazil)

Abstract:Today’s distributed and pervasive computing addresses large-scale cyber-physical ecosystems, characterised by dense and large networks of devices capable of computation, communication and interaction with the environment and people. While most research focusses on treating these systems as “composites” (i.e., heterogeneous functional complexes), recent developments in fields such as self-organising systems and swarm robotics have opened up a complementary perspective: treating systems as “collectives” (i.e., uniform, collaborative, and self-organising groups of entities). This article explores the motivations, state of the art, and implications of this “collective computing paradigm” in software engineering, discusses its peculiar challenges, and outlines a path for future research, touching on aspects such as macroprogramming, collective intelligence, self-adaptive middleware, learning, synthesis, and experimentation of collective behaviour.

[AI-53] Mobile Network Configuration Recommendation using Deep Generative Graph Neural Network

Link: https://arxiv.org/abs/2406.04779
Authors: Shirwan Piroti, Ashima Chawla, Tahar Zanouda
Keywords: Radio Access Telecom, Access Telecom Network, Access Telecom, Radio Access, Telecom Network
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
Comments: 4 pages, 4 figures

Abstract: There is a vast number of configurable parameters in a Radio Access Telecom Network. A significant number of these parameters are configured by Radio Node or cell based on their deployment setting. Traditional methods rely on domain knowledge for individual parameter configuration, often leading to sub-optimal results. To improve this, a framework using a Deep Generative Graph Neural Network (GNN) is proposed. It encodes the network into a graph, extracts subgraphs for each RAN node, and employs a Siamese GNN (S-GNN) to learn embeddings. The framework recommends values for a multitude of configuration parameters and detects misconfigurations, handling both network expansion and existing cell reconfiguration. Tested on real-world data, the model surpasses baselines, demonstrating accuracy, generalizability, and robustness against concept drift.

[AI-54] REP: Resource-Efficient Prompting for On-device Continual Learning

Link: https://arxiv.org/abs/2406.04772
Authors: Sungho Jeon, Xinyue Ma, Kwang In Kim, Myeongjae Jeon
Keywords: On-device continual learning, requires the co-optimization, resource efficiency, continual learning, efficiency
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages, 10 figures

Abstract: On-device continual learning (CL) requires the co-optimization of model accuracy and resource efficiency to be practical. This is extremely challenging because it must preserve accuracy while learning new tasks with continuously drifting data and maintain both high energy and memory efficiency to be deployable on real-world devices. Typically, a CL method leverages one of two types of backbone networks: CNN or ViT. It is commonly believed that CNN-based CL excels in resource efficiency, whereas ViT-based CL is superior in model performance, making each option attractive only for a single aspect. In this paper, we revisit this comparison while embracing powerful pre-trained ViT models of various sizes, including ViT-Ti (5.8M parameters). Our detailed analysis reveals that many practical options exist today for making ViT-based methods more suitable for on-device CL, even when accuracy, energy, and memory are all considered. To further expand this impact, we introduce REP, which improves resource efficiency specifically targeting prompt-based rehearsal-free methods. Our key focus is on avoiding catastrophic trade-offs with accuracy while trimming computational and memory costs throughout the training process. We achieve this by exploiting swift prompt selection that enhances input data using a carefully provisioned model, and by developing two novel algorithms, adaptive token merging (AToM) and adaptive layer dropping (ALD), that optimize the prompt updating stage. In particular, AToM and ALD perform selective skipping across the data and model-layer dimensions without compromising task-specific features in vision transformer models. Extensive experiments on three image classification datasets validate REP’s superior resource efficiency over current state-of-the-art methods.

[AI-55] WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild

Link: https://arxiv.org/abs/2406.04770
Authors: Bill Yuchen Lin, Yuntian Deng, Khyathi Chandu, Faeze Brahman, Abhilasha Ravichander, Valentina Pyatkin, Nouha Dziri, Ronan Le Bras, Yejin Choi
Keywords: real-world user queries, benchmark large language, evaluation framework designed, large language models, real-world user
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Link: this https URL

Abstract: We introduce WildBench, an automated evaluation framework designed to benchmark large language models (LLMs) using challenging, real-world user queries. WildBench consists of 1,024 tasks carefully selected from over one million human-chatbot conversation logs. For automated evaluation with WildBench, we have developed two metrics, WB-Reward and WB-Score, which are computable using advanced LLMs such as GPT-4-turbo. WildBench evaluation uses task-specific checklists to evaluate model outputs systematically and provides structured explanations that justify the scores and comparisons, resulting in more reliable and interpretable automatic judgments. WB-Reward employs fine-grained pairwise comparisons between model responses, generating five potential outcomes: much better, slightly better, slightly worse, much worse, or a tie. Unlike previous evaluations that employed a single baseline model, we selected three baseline models at varying performance levels to ensure a comprehensive pairwise evaluation. Additionally, we propose a simple method to mitigate length bias, by converting outcomes of “slightly better/worse” to “tie” if the winner response exceeds the loser one by more than K characters. WB-Score evaluates the quality of model outputs individually, making it a fast and cost-efficient evaluation metric. WildBench results demonstrate a strong correlation with the human-voted Elo ratings from Chatbot Arena on hard tasks. Specifically, WB-Reward achieves a Pearson correlation of 0.98 with top-ranking models. Additionally, WB-Score reaches 0.95, surpassing both ArenaHard’s 0.91 and AlpacaEval2.0’s 0.89 for length-controlled win rates, as well as the 0.87 for regular win rates.
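
The length-bias rule described in this abstract is concrete enough to sketch. The snippet below is only an illustration of that rule as worded in the abstract, not WildBench's actual code: the function name `apply_length_penalty`, the outcome strings, and the example value of `k` are all assumptions made here for demonstration.

```python
# Illustrative sketch only: names and the example value of k are assumptions,
# not taken from the WildBench implementation.

def apply_length_penalty(outcome: str, len_a: int, len_b: int, k: int) -> str:
    """Adjust a pairwise outcome stated from response A's point of view.

    A "slightly better" or "slightly worse" verdict becomes a "tie" when the
    winning response is longer than the losing one by more than k characters.
    Decisive verdicts ("much better" / "much worse") and ties are unchanged.
    """
    if outcome == "slightly better" and len_a - len_b > k:
        return "tie"  # A won narrowly, but A is much longer than B
    if outcome == "slightly worse" and len_b - len_a > k:
        return "tie"  # B won narrowly, but B is much longer than A
    return outcome


if __name__ == "__main__":
    # Hypothetical lengths in characters; k = 500 is an arbitrary choice.
    print(apply_length_penalty("slightly better", 1500, 900, k=500))
    print(apply_length_penalty("slightly better", 1000, 900, k=500))
    print(apply_length_penalty("much better", 1500, 900, k=500))
```

Only narrow wins are affected, so a response cannot turn a marginal advantage into a win simply by being much longer, while clear quality differences still count.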

[AI-56] Sales Whisperer: A Human-Inconspicuous Attack on LLM Brand Recommendations

Link: https://arxiv.org/abs/2406.04755
Authors: Weiran Lin, Anna Gerchanovsky, Omer Akgul, Lujo Bauer, Matt Fredrikson, Zifan Wang
Keywords: Large language model, Large language, prompting services, language model, users might rely
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)

Abstract:Large language model (LLM) users might rely on others (e.g., prompting services), to write prompts. However, the risks of trusting prompts written by others remain unstudied. In this paper, we assess the risk of using such prompts on brand recommendation tasks when shopping. First, we found that paraphrasing prompts can result in LLMs mentioning given brands with drastically different probabilities, including a pair of prompts where the probability changes by 100%. Next, we developed an approach that can be used to perturb an original base prompt to increase the likelihood that an LLM mentions a given brand. We designed a human-inconspicuous algorithm that perturbs prompts, which empirically forces LLMs to mention strings related to a brand more often, by absolute improvements up to 78.3%. Our results suggest that our perturbed prompts, 1) are inconspicuous to humans, 2) force LLMs to recommend a target brand more often, and 3) increase the perceived chances of picking targeted brands.

[AI-57] PQPP: A Joint Benchmark for Text-to-Image Prompt and Query Performance Prediction

Link: https://arxiv.org/abs/2406.04746
Authors: Eduard Poesina, Adriana Valentina Costache, Adrian-Gabriel Chifu, Josiane Mothe, Radu Tudor Ionescu
Keywords: generative diffusion models, visually impressive results, diffusion models, recently emerged, viable alternative
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract: Text-to-image generation has recently emerged as a viable alternative to text-to-image retrieval, due to the visually impressive results of generative diffusion models. Although query performance prediction is an active research topic in information retrieval, to the best of our knowledge, there is no prior study that analyzes the difficulty of queries (prompts) in text-to-image generation, based on human judgments. To this end, we introduce the first dataset of prompts which are manually annotated in terms of image generation performance. In order to determine the difficulty of the same prompts in image retrieval, we also collect manual annotations that represent retrieval performance. We thus propose the first benchmark for joint text-to-image prompt and query performance prediction, comprising 10K queries. Our benchmark enables: (i) the comparative assessment of the difficulty of prompts/queries in image generation and image retrieval, and (ii) the evaluation of prompt/query performance predictors addressing both generation and retrieval. We present results with several pre-generation/retrieval and post-generation/retrieval performance predictors, thus providing competitive baselines for future research. Our benchmark and code are publicly available under the CC BY 4.0 license at this https URL.

[AI-58] Generative AI Models: Opportunities and Risks for Industry and Authorities

Link: https://arxiv.org/abs/2406.04734
Authors: Tobias Alt, Andrea Ibisch, Clemens Meiser, Anna Wilhelm, Raphael Zimmer, Christian Berghoff, Christoph Droste, Jens Karschau, Friederike Laus, Rainer Plaga, Carola Plesch, Britta Sennewald, Thomas Thaeren, Kristina Unverricht, Steffen Waurick
Keywords: traditionally require creativity, human understanding, capable of performing, performing a wide, wide range
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments: 33 pages, 3 figures

Abstract:Generative AI models are capable of performing a wide range of tasks that traditionally require creativity and human understanding. They learn patterns from existing data during training and can subsequently generate new content such as texts, images, and music that follow these patterns. Due to their versatility and generally high-quality results, they, on the one hand, represent an opportunity for digitalization. On the other hand, the use of generative AI models introduces novel IT security risks that need to be considered for a comprehensive analysis of the threat landscape in relation to IT security. In response to this risk potential, companies or authorities using them should conduct an individual risk analysis before integrating generative AI into their workflows. The same applies to developers and operators, as many risks in the context of generative AI have to be taken into account at the time of development or can only be influenced by the operating company. Based on this, existing security measures can be adjusted, and additional measures can be taken.

[AI-59] Predicting Polymer Properties Based on Multimodal Multitask Pretraining

Link: https://arxiv.org/abs/2406.04727
Authors: Fanmeng Wang, Wentao Guo, Minjie Cheng, Shen Yuan, Hongteng Xu, Zhifeng Gao
Keywords: similar monomers covalently, bonding numerous identical, polymer property prediction, property prediction, polymer property
Categories: Machine Learning (cs.LG); Soft Condensed Matter (cond-mat.soft); Artificial Intelligence (cs.AI)

Abstract:In the past few decades, polymers, high-molecular-weight compounds formed by bonding numerous identical or similar monomers covalently, have played an essential role in various scientific fields. In this context, accurate prediction of their properties is becoming increasingly crucial. Typically, the properties of a polymer, such as plasticity, conductivity, bio-compatibility, and so on, are highly correlated with its 3D structure. However, current methods for predicting polymer properties heavily rely on information from polymer SMILES sequences (P-SMILES strings) while ignoring crucial 3D structural information, leading to sub-optimal performance. In this work, we propose MMPolymer, a novel multimodal multitask pretraining framework incorporating both polymer 1D sequential information and 3D structural information to enhance downstream polymer property prediction tasks. Besides, to overcome the limited availability of polymer 3D data, we further propose the “Star Substitution” strategy to extract 3D structural information effectively. During pretraining, MMPolymer not only predicts masked tokens and recovers 3D coordinates but also achieves the cross-modal alignment of latent representation. Subsequently, we further fine-tune the pretrained MMPolymer for downstream polymer property prediction tasks in the supervised learning paradigm. Experimental results demonstrate that MMPolymer achieves state-of-the-art performance in various polymer property prediction tasks. Moreover, leveraging the pretrained MMPolymer and using only one modality (either P-SMILES string or 3D conformation) during fine-tuning can also surpass existing polymer property prediction methods, highlighting the exceptional capability of MMPolymer in polymer feature extraction and utilization. Our online platform for polymer property prediction is available at https://app.bohrium.dp.tech/mmpolymer.

[AI-60] Probabilistic Perspectives on Error Minimization in Adversarial Reinforcement Learning

Link: https://arxiv.org/abs/2406.04724
Authors: Roman Belaire, Arunesh Sinha, Pradeep Varakantham
Keywords: Deep Reinforcement Learning, Deep Reinforcement, Reinforcement Learning, posing severe risks, policies are critically
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Deep Reinforcement Learning (DRL) policies are critically vulnerable to adversarial noise in observations, posing severe risks in safety-critical scenarios. For example, a self-driving car receiving manipulated sensory inputs about traffic signs could lead to catastrophic outcomes. Existing strategies to fortify RL algorithms against such adversarial perturbations generally fall into two categories: (a) using regularization methods that enhance robustness by incorporating adversarial loss terms into the value objectives, and (b) adopting “maximin” principles, which focus on maximizing the minimum value to ensure robustness. While regularization methods reduce the likelihood of successful attacks, their effectiveness drops significantly if an attack does succeed. On the other hand, maximin objectives, although robust, tend to be overly conservative. To address this challenge, we introduce a novel objective called Adversarial Counterfactual Error (ACoE), which naturally balances optimizing value and robustness against adversarial attacks. To optimize ACoE in a scalable manner in model-free settings, we propose a theoretically justified surrogate objective known as Cumulative-ACoE (C-ACoE). The core idea of optimizing C-ACoE is utilizing the belief about the underlying true state given the adversarially perturbed observation. Our empirical evaluations demonstrate that our method outperforms current state-of-the-art approaches for addressing adversarial RL problems across all established benchmarks (MuJoCo, Atari, and Highway) used in the literature.
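
To make the "maximin" principle mentioned in this abstract concrete, here is a toy sketch of why it can be overly conservative. It is not the paper's ACoE or C-ACoE objective; the action names, the value table, and both helper functions are hypothetical and exist only for illustration.

```python
# Toy illustration of the maximin principle only. This is NOT the paper's
# ACoE / C-ACoE objective; all names and numbers below are hypothetical.

def maximin_action(values: dict[str, list[float]]) -> str:
    """Pick the action whose worst-case value is largest.

    `values` maps each action to its value under every perturbation the
    adversary could apply to the observation.
    """
    return max(values, key=lambda action: min(values[action]))


def mean_value_action(values: dict[str, list[float]]) -> str:
    """Pick the action with the largest average value (no robustness)."""
    return max(values, key=lambda action: sum(values[action]) / len(values[action]))


if __name__ == "__main__":
    # Two candidate actions evaluated under two possible perturbations.
    table = {
        "brake": [0.6, 0.5],       # steady value, worst case 0.5
        "accelerate": [1.0, 0.2],  # higher average, worst case 0.2
    }
    print(maximin_action(table))     # guards against the worst perturbation
    print(mean_value_action(table))  # ignores the adversary
```

In this table the maximin rule prefers the steadier action even though the other has a higher average value, which is the conservatism the abstract refers to.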

[AI-61] FlowMM: Generating Materials with Riemannian Flow Matching

Link: https://arxiv.org/abs/2406.04713
Authors: Benjamin Kurt Miller, Ricky T. Q. Chen, Anuroop Sriram, Brandon M Wood
Keywords: unique computational challenges, presents unique computational, Crystalline materials, next-generation technologies, computational challenges
Categories: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph); Machine Learning (stat.ML)
Comments: this https URL

Abstract:Crystalline materials are a fundamental component in next-generation technologies, yet modeling their distribution presents unique computational challenges. Of the plausible arrangements of atoms in a periodic lattice only a vanishingly small percentage are thermodynamically stable, which is a key indicator of the materials that can be experimentally realized. Two fundamental tasks in this area are to (a) predict the stable crystal structure of a known composition of elements and (b) propose novel compositions along with their stable structures. We present FlowMM, a pair of generative models that achieve state-of-the-art performance on both tasks while being more efficient and more flexible than competing methods. We generalize Riemannian Flow Matching to suit the symmetries inherent to crystals: translation, rotation, permutation, and periodic boundary conditions. Our framework enables the freedom to choose the flow base distributions, drastically simplifying the problem of learning crystal structures compared with diffusion models. In addition to standard benchmarks, we validate FlowMM’s generated structures with quantum chemistry calculations, demonstrating that it is about 3x more efficient, in terms of integration steps, at finding stable materials compared to previous open methods.
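
Of the symmetries listed in this abstract, periodic boundary conditions are the easiest to illustrate in a few lines. The sketch below treats a single fractional coordinate as a point on a circle of circumference 1 and interpolates along the shortest wrapped path. It is a simplified illustration under that assumption, not FlowMM's implementation, and the function names are invented here.

```python
# Simplified 1D illustration of periodic boundary conditions on a fractional
# coordinate. Not FlowMM's code; function names are hypothetical.

def shortest_displacement(a: float, b: float) -> float:
    """Signed shortest move from a to b on a circle of circumference 1."""
    d = (b - a) % 1.0              # forward distance around the circle
    return d - 1.0 if d >= 0.5 else d


def geodesic_point(a: float, b: float, t: float) -> float:
    """Point a fraction t of the way from a to b along the shortest wrapped
    path, mapped back into the unit cell."""
    return (a + t * shortest_displacement(a, b)) % 1.0


if __name__ == "__main__":
    # 0.9 and 0.1 are close neighbours across the cell boundary: the short
    # path crosses the boundary instead of travelling back through 0.5.
    print(shortest_displacement(0.9, 0.1))
    print(geodesic_point(0.9, 0.1, 0.5))
```

A naive straight-line interpolation between 0.9 and 0.1 would pass through 0.5, the farthest point from both; wrapping the displacement avoids that, which is one reason a flow on periodic coordinates needs to respect the boundary conditions.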

[AI-62] Morescient GAI for Software Engineering

Link: https://arxiv.org/abs/2406.04710
Authors: Marcus Kessel, Colin Atkinson
Keywords: engineering artifacts promises, modify software engineering, software engineering artifacts, software engineering, ability of Generative
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)

Abstract:The ability of Generative AI (GAI) technology to automatically check, synthesize and modify software engineering artifacts promises to revolutionize all aspects of software engineering. Using GAI for software engineering tasks is consequently one of the most rapidly expanding fields of software engineering research, with dozens of LLM-based code models having been published since 2021. However, the overwhelming majority of existing code models share a major weakness - they are exclusively trained on the syntactic facet of software, significantly lowering their trustworthiness in tasks dependent on software semantics. To address this problem, a new class of “Morescient” GAI is needed that is “aware” of (i.e., trained on) both the semantic and static facets of software. This, in turn, will require a new generation of software observation platforms capable of generating ultra-large quantities of execution observations in a structured and readily analyzable way. In this paper, we present a vision for how such “Morescient” GAI models can be engineered, evolved and disseminated according to the principles of open science.

[AI-63] Logic Synthesis with Generative Deep Neural Networks

Link: https://arxiv.org/abs/2406.04699
Authors: Xihan Li, Xing Li, Lei Chen, Xing Zhang, Mingxuan Yuan, Jun Wang
Keywords: strict feasibility requirement, Circuit Transformer, achieved significant success, Circuit Transformer model, logic circuit design
Categories: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
Comments: In IWLS 2024

Abstract:While deep learning has achieved significant success in various domains, its application to logic circuit design has been limited due to complex constraints and strict feasibility requirement. However, a recent generative deep neural model, “Circuit Transformer”, has shown promise in this area by enabling equivalence-preserving circuit transformation on a small scale. In this paper, we introduce a logic synthesis rewriting operator based on the Circuit Transformer model, named “ctrw” (Circuit Transformer Rewriting), which incorporates the following techniques: (1) a two-stage training scheme for the Circuit Transformer tailored for logic synthesis, with iterative improvement of optimality through self-improvement training; (2) integration of the Circuit Transformer with state-of-the-art rewriting techniques to address scalability issues, allowing for guided DAG-aware rewriting. Experimental results on the IWLS 2023 contest benchmark demonstrate the effectiveness of our proposed rewriting methods.

[AI-64] LLM-Vectorizer: LLM-based Verified Loop Vectorizer

Link: https://arxiv.org/abs/2406.04693
Authors: Jubi Taneja, Avery Laird, Cong Yan, Madan Musuvathi, Shuvendu K. Lahiri
Keywords: computing applications operating, performance computing applications, powerful optimization technique, large data arrays, powerful optimization
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Performance (cs.PF)

Abstract: Vectorization is a powerful optimization technique that significantly boosts the performance of high performance computing applications operating on large data arrays. Despite decades of research on auto-vectorization, compilers frequently miss opportunities to vectorize code. On the other hand, writing vectorized code manually using compiler intrinsics is still a complex, error-prone task that demands deep knowledge of specific architecture and compilers. In this paper, we evaluate the potential of large-language models (LLMs) to generate vectorized (Single Instruction Multiple Data) code from scalar programs that process individual array elements. We propose a novel finite-state machine multi-agents based approach that harnesses LLMs and test-based feedback to generate vectorized code. Our findings indicate that LLMs are capable of producing high performance vectorized code with run-time speedup ranging from 1.1x to 9.4x as compared to the state-of-the-art compilers such as Intel Compiler, GCC, and Clang. To verify the correctness of vectorized code, we use Alive2, a leading bounded translation validation tool for LLVM IR. We describe a few domain-specific techniques to improve the scalability of Alive2 on our benchmark dataset. Overall, our approach is able to verify 38.2% of vectorizations as correct on the TSVC benchmark dataset.

[AI-65] MeLFusion: Synthesizing Music from Image and Language Cues using Diffusion Models

Link: https://arxiv.org/abs/2406.04673
Authors: Sanjoy Chowdhury, Sayan Nag, K J Joseph, Balaji Vasan Srinivasan, Dinesh Manocha
Keywords: emotions and feelings, universal language, communicate emotions, Music, synthesize music
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Comments: Accepted at CVPR 2024 as Highlight paper. Webpage: this https URL

Abstract:Music is a universal language that can communicate emotions and feelings. It forms an essential part of the whole spectrum of creative media, ranging from movies to social media posts. Machine learning models that can synthesize music are predominantly conditioned on textual descriptions of it. Inspired by how musicians compose music not just from a movie script, but also through visualizations, we propose MeLFusion, a model that can effectively use cues from a textual description and the corresponding image to synthesize music. MeLFusion is a text-to-music diffusion model with a novel “visual synapse”, which effectively infuses the semantics from the visual modality into the generated music. To facilitate research in this area, we introduce a new dataset MeLBench, and propose a new evaluation metric IMSM. Our exhaustive experimental evaluation suggests that adding visual information to the music synthesis pipeline significantly improves the quality of generated music, measured both objectively and subjectively, with a relative gain of up to 67.98% on the FAD score. We hope that our work will gather attention to this pragmatic, yet relatively under-explored research area.

[AI-66] The Reasonable Person Standard for AI

Link: https://arxiv.org/abs/2406.04671
Authors: Sunayana Rane
Keywords: Reasonable Person Standard, Reasonable Person, set the norm, constructive for society, human behavior
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)