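This post lists the latest papers fetched daily from arXiv. It updates automatically at around 11:30 every morning and groups papers into five broad areas: NLP, CV, ML, AI, and IR.

To receive the daily paper list by email, leave your email address in the comments. The emails are also sent automatically at around 11:30 each day.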

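Overview (2024-06-12)

585 papers were updated today, including:

  • Natural Language Processing: 148 papers (Computation and Language (cs.CL))
  • Computer Vision: 111 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Artificial Intelligence: 195 papers (Artificial Intelligence (cs.AI))
  • Machine Learning: 221 papers (Machine Learning (cs.LG))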
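
The four category counts above add up to 675, which is more than the 585 papers reported. A likely explanation, assumed here rather than stated in the post, is that arXiv papers are cross-listed in several categories and are counted once per category. The sketch below illustrates this with the category lists of the first three entries in this digest; the variable names are hypothetical.

```python
from collections import Counter

# Category lists copied from the first three entries of this digest.
papers = {
    "2406.07546": ["cs.CV", "cs.AI", "cs.CL"],
    "2406.07545": ["cs.CL", "cs.AI"],
    "2406.07544": ["cs.CV", "cs.AI", "cs.CL", "cs.LG"],
}

# Count each paper once for every category it is listed under.
per_category = Counter(cat for cats in papers.values() for cat in cats)

total_papers = len(papers)
sum_of_counts = sum(per_category.values())

print(per_category)
print(total_papers, sum_of_counts)
```

With cross-listing, the per-category counts sum to more than the number of distinct papers, which matches the pattern in the overview.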

Natural Language Processing

[NLP-0] Commonsense-T2I Challenge: Can Text-to-Image Generation Models Understand Commonsense?

Link: https://arxiv.org/abs/2406.07546
Authors: Xingyu Fu, Muyu He, Yujie Lu, William Yang Wang, Dan Roth
Keywords: evaluating the ability, lightbulb, real life, produce images, Abstract
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Text-to-Image Generation, Commonsense, Project Url: this https URL

Abstract:We present a novel task and benchmark for evaluating the ability of text-to-image(T2I) generation models to produce images that fit commonsense in real life, which we call Commonsense-T2I. Given two adversarial text prompts containing an identical set of action words with minor differences, such as “a lightbulb without electricity” v.s. “a lightbulb with electricity”, we evaluate whether T2I models can conduct visual-commonsense reasoning, e.g. produce images that fit “the lightbulb is unlit” vs. “the lightbulb is lit” correspondingly. Commonsense-T2I presents an adversarial challenge, providing pairwise text prompts along with expected outputs. The dataset is carefully hand-curated by experts and annotated with fine-grained labels, such as commonsense type and likelihood of the expected outputs, to assist analyzing model behavior. We benchmark a variety of state-of-the-art (sota) T2I models and surprisingly find that, there is still a large gap between image synthesis and real life photos–even the DALL-E 3 model could only achieve 48.92% on Commonsense-T2I, and the stable diffusion XL model only achieves 24.92% accuracy. Our experiments show that GPT-enriched prompts cannot solve this challenge, and we include a detailed analysis about possible reasons for such deficiency. We aim for Commonsense-T2I to serve as a high-quality evaluation benchmark for T2I commonsense checking, fostering advancements in real life image generation.

[NLP-1] Open-LLM-Leaderboard: From Multi-choice to Open-style Questions for LLMs Evaluation Benchmark and Arena

Link: https://arxiv.org/abs/2406.07545
Authors: Aidar Myrzakhan, Sondos Mahmoud Bsharat, Zhiqiang Shen
Keywords: large language models, assess large language, Multiple-choice questions, language models, assess large
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Code and dataset are available at this https URL

Abstract: Multiple-choice questions (MCQ) are frequently used to assess large language models (LLMs). Typically, an LLM is given a question and selects the answer deemed most probable after adjustments for factors like length. Unfortunately, LLMs may inherently favor certain answer choice IDs, such as A/B/C/D, due to inherent biases of priori unbalanced probabilities, influencing the prediction of answers based on these IDs. Previous research has introduced methods to reduce this “selection bias” by simply permutating options on a few test samples and applying to new ones. Another problem of MCQ is the lottery ticket choice by “random guessing”. The LLM does not learn particular knowledge, but the option is guessed correctly. This situation is especially serious for those small-scale LLMs. To address them, a more thorough approach involves shifting from MCQ to open-style questions, which can fundamentally eliminate selection bias and random guessing issues. However, transitioning causes its own set of challenges in (1) identifying suitable open-style questions and (2) validating the correctness of LLM open-style responses against human-annotated ground-truths. This work aims to tackle these significant difficulties, and establish a new LLM evaluation benchmark through entirely open-style questions. Consequently, we introduce the Open-LLM-Leaderboard to track various LLMs’ performance and reflect true capability of them, such as GPT-4o/4/3.5, Claude 3, Gemini, etc. Our code and dataset are available at this https URL.
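
The selection bias described in this abstract can be illustrated with a toy scorer. The sketch below is a simplified illustration of the general idea of permuting options, not the method of the cited prior work or of this paper. The scorer, its quality values, and its position bias are all made-up assumptions.

```python
# Hypothetical toy scorer: true answer quality plus a bias toward position 0.
QUALITY = {"Paris": 1.0, "Rome": 0.2, "Oslo": 0.1, "Bern": 0.0}
POSITION_BIAS = [2.0, 0.0, 0.0, 0.0]


def toy_score(option, position):
    """Score an option shown at a given position (assumed stand-in for an LLM)."""
    return QUALITY[option] + POSITION_BIAS[position]


def naive_choice(options, score):
    """Pick the best-scoring option in the order the options were presented."""
    return max(options, key=lambda opt: score(opt, options.index(opt)))


def debiased_choice(options, score):
    """Sum scores over all cyclic shifts so every option visits every position once."""
    totals = {opt: 0.0 for opt in options}
    for shift in range(len(options)):
        rotated = options[shift:] + options[:shift]
        for position, opt in enumerate(rotated):
            totals[opt] += score(opt, position)
    return max(totals, key=totals.get)


options = ["Rome", "Paris", "Oslo", "Bern"]
print(naive_choice(options, toy_score))
print(debiased_choice(options, toy_score))
```

Because each option occupies each position exactly once across the cyclic shifts, the position bias adds the same constant to every option's total and no longer affects the ranking.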

[NLP-2] Situational Awareness Matters in 3D Vision Language Reasoning

Link: https://arxiv.org/abs/2406.07544
Authors: Yunze Man, Liang-Yan Gui, Yu-Xiong Wang
Keywords: developing household robots, vision language reasoning, complicated vision language, language reasoning tasks, vision language
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: CVPR 2024. Project Page: this https URL

Abstract:Being able to carry out complicated vision language reasoning tasks in 3D space represents a significant milestone in developing household robots and human-centered embodied AI. In this work, we demonstrate that a critical and distinct challenge in 3D vision language reasoning is situational awareness, which incorporates two key components: (1) The autonomous agent grounds its self-location based on a language prompt. (2) The agent answers open-ended questions from the perspective of its calculated position. To address this challenge, we introduce SIG3D, an end-to-end Situation-Grounded model for 3D vision language reasoning. We tokenize the 3D scene into sparse voxel representation and propose a language-grounded situation estimator, followed by a situated question answering module. Experiments on the SQA3D and ScanQA datasets show that SIG3D outperforms state-of-the-art models in situation estimation and question answering by a large margin (e.g., an enhancement of over 30% on situation estimation accuracy). Subsequent analysis corroborates our architectural design choices, explores the distinct functions of visual and textual tokens, and highlights the importance of situational awareness in the domain of 3D question answering.

[NLP-3] Simple and Effective Masked Diffusion Language Models

Link: https://arxiv.org/abs/2406.07524
Authors: Subham Sekhar Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin T Chiu, Alexander Rush, Volodymyr Kuleshov
Keywords: generating high-quality images, prior work reports, significant performance gap, high-quality images, excel at generating
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:While diffusion models excel at generating high-quality images, prior work reports a significant performance gap between diffusion and autoregressive (AR) methods in language modeling. In this work, we show that simple masked discrete diffusion is more performant than previously thought. We apply an effective training recipe that improves the performance of masked diffusion models and derive a simplified, Rao-Blackwellized objective that results in additional improvements. Our objective has a simple form – it is a mixture of classical masked language modeling losses – and can be used to train encoder-only language models that admit efficient samplers, including ones that can generate arbitrary lengths of text semi-autoregressively like a traditional language model. On language modeling benchmarks, a range of masked diffusion models trained with modern engineering practices achieves a new state-of-the-art among diffusion models, and approaches AR perplexity. We release our code at: this https URL

[NLP-4] Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling

Link: https://arxiv.org/abs/2406.07522
Authors: Liliang Ren, Yang Liu, Yadong Lu, Yelong Shen, Chen Liang, Weizhu Chen
Keywords: long-standing problem, Samba, Sliding Window Attention, infinite context length, Efficiently modeling sequences
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:Efficiently modeling sequences with infinite context length has been a long-standing problem. Past works suffer from either the quadratic computation complexity or the limited extrapolation ability on length generalization. In this work, we present Samba, a simple hybrid architecture that layer-wise combines Mamba, a selective State Space Model (SSM), with Sliding Window Attention (SWA). Samba selectively compresses a given sequence into recurrent hidden states while still maintaining the ability to precisely recall memories with the attention mechanism. We scale Samba up to 3.8B parameters with 3.2T training tokens and show that Samba substantially outperforms the state-of-the-art models based on pure attention or SSMs on a wide range of benchmarks. When trained on 4K length sequences, Samba can be efficiently extrapolated to 256K context length with perfect memory recall and show improved token predictions up to 1M context length. As a linear-time sequence model, Samba enjoys a 3.73x higher throughput compared to Transformers with grouped-query attention when processing user prompts of 128K length, and 3.64x speedup when generating 64K tokens with unlimited streaming. A sample implementation of Samba is publicly available in this https URL.
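
A minimal sketch of the attention pattern behind Sliding Window Attention (SWA), the component this abstract combines with Mamba. It is a generic illustration of a causal sliding-window mask, not Samba's implementation. The function name and the window size are assumptions.

```python
def sliding_window_mask(seq_len: int, window: int) -> list[list[int]]:
    """Build a causal mask: token i may attend to tokens i-window+1 .. i."""
    return [
        [1 if 0 <= i - j < window else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]


mask = sliding_window_mask(seq_len=5, window=2)
for row in mask:
    print(row)
```

Each row contains at most `window` ones, so for a fixed window the number of attended positions grows linearly with sequence length, not quadratically.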

[NLP-5] THaLLE: Text Hyperlocally Augmented Large Language Extension – Technical Report

Link: https://arxiv.org/abs/2406.07505
Authors: KBTG Labs, Danupat Khamnuansin, Atthakorn Petchsod, Anuruth Lertpiya, Pornchanan Balee, Thanawat Lodkaew, Tawunrat Chalothorn, Thadpong Pongthawornkamol, Monchai Lertsutthiwong
Keywords: Recent advancements, Large Language Models, Large Language Extension, Augmented Large Language, Large Language
Subjects: Computation and Language (cs.CL)

Abstract:Recent advancements in Large Language Models (LLMs) have revealed new capabilities and opportunities across the technological landscape. However, the practicality of very large LLMs is challenged by their high compute cost, which does not justify the benefits given their limited capability compared to humans. While smaller, more practical LLMs have shown potential in financial analysis, though they are not yet fully proficient, as evidenced by their near-passing performance on the Chartered Financial Analyst (CFA) exam. In this work, we present Financial Analyst Extension to our Text Hyperlocally Augmented Large Language Extension (THaLLE), a series of 8B LLMs consistently achieving highest performance on mock CFA exams against models of comparable size. We thoroughly document the fine-tuning techniques used to facilitate future research. Additionally, we introduce the use of Flare CFA, a publicly available dataset for evaluating LLMs as a financial advisor.

[NLP-6] Just Because We Camp Doesn't Mean We Should: The Ethics of Modelling Queer Voices

Link: https://arxiv.org/abs/2406.07504
Authors: Atli Sigurgeirsson, Eddie L. Ungless
Keywords: Modern voice cloning, Modern voice, cloning models claim, gay voice, diverse range
Subjects: Computation and Language (cs.CL)
Comments: 4 pages (+1 page references). To be presented at Interspeech 2024

Abstract: Modern voice cloning models claim to be able to capture a diverse range of voices. We test the ability of a typical pipeline to capture the style known colloquially as “gay voice” and notice a homogenisation effect: synthesised speech is rated as sounding significantly “less gay” (by LGBTQ+ participants) than its corresponding ground-truth for speakers with “gay voice”, but ratings actually increase for control speakers. Loss of “gay voice” has implications for accessibility. We also find that for speakers with “gay voice”, loss of “gay voice” corresponds to lower similarity ratings. However, we caution that improving the ability of such models to synthesise “gay voice” comes with a great number of risks. We use this pipeline as a starting point for a discussion on the ethics of modelling queer voices more broadly. Collecting “clean” queer data has safety and fairness ramifications, and the resulting technology may cause harms from mockery to death.

[NLP-7] Image Textualization: An Automatic Framework for Creating Accurate and Detailed Image Descriptions

Link: https://arxiv.org/abs/2406.07502
Authors: Renjie Pi, Jianshu Zhang, Jipeng Zhang, Rui Pan, Zhekai Chen, Tong Zhang
Keywords: Image, text-image retrieval, description datasets play, image descriptions, Image description datasets
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Abstract:Image description datasets play a crucial role in the advancement of various applications such as image understanding, text-to-image generation, and text-image retrieval. Currently, image description datasets primarily originate from two sources. One source is the scraping of image-text pairs from the web. Despite their abundance, these descriptions are often of low quality and noisy. Another is through human labeling. Datasets such as COCO are generally very short and lack details. Although detailed image descriptions can be annotated by humans, the high annotation cost limits the feasibility. These limitations underscore the need for more efficient and scalable methods to generate accurate and detailed image descriptions. In this paper, we propose an innovative framework termed Image Textualization (IT), which automatically produces high-quality image descriptions by leveraging existing multi-modal large language models (MLLMs) and multiple vision expert models in a collaborative manner, which maximally convert the visual information into text. To address the current lack of benchmarks for detailed descriptions, we propose several benchmarks for comprehensive evaluation, which verifies the quality of image descriptions created by our framework. Furthermore, we show that LLaVA-7B, benefiting from training on IT-curated descriptions, acquire improved capability to generate richer image descriptions, substantially increasing the length and detail of their output with less hallucination.

[NLP-8] TextGrad: Automatic “Differentiation” via Text

Link: https://arxiv.org/abs/2406.07496
Authors: Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Zhi Huang, Carlos Guestrin, James Zou
Keywords: orchestrating multiple large, systems orchestrating multiple, large language models, multiple large language, paradigm shift
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 41 pages, 6 figures

Abstract: AI is undergoing a paradigm shift, with breakthroughs achieved by systems orchestrating multiple large language models (LLMs) and other complex components. As a result, developing principled and automated optimization methods for compound AI systems is one of the most important new challenges. Neural networks faced a similar challenge in its early days until backpropagation and automatic differentiation transformed the field by making optimization turn-key. Inspired by this, we introduce TextGrad, a powerful framework performing automatic “differentiation” via text. TextGrad backpropagates textual feedback provided by LLMs to improve individual components of a compound AI system. In our framework, LLMs provide rich, general, natural language suggestions to optimize variables in computation graphs, ranging from code snippets to molecular structures. TextGrad follows PyTorch’s syntax and abstraction and is flexible and easy-to-use. It works out-of-the-box for a variety of tasks, where the users only provide the objective function without tuning components or prompts of the framework. We showcase TextGrad’s effectiveness and generality across a diverse range of applications, from question answering and molecule optimization to radiotherapy treatment planning. Without modifying the framework, TextGrad improves the zero-shot accuracy of GPT-4o in Google-Proof Question Answering from 51% to 55%, yields 20% relative performance gain in optimizing LeetCode-Hard coding problem solutions, improves prompts for reasoning, designs new druglike small molecules with desirable in silico binding, and designs radiation oncology treatment plans with high specificity. TextGrad lays a foundation to accelerate the development of the next-generation of AI systems.

[NLP-9] CADS: A Systematic Literature Review on the Challenges of Abstractive Dialogue Summarization

Link: https://arxiv.org/abs/2406.07494
Authors: Frederic Kirstein, Jan Philip Wahle, Bela Gipp, Terry Ruas
Keywords: Abstractive dialogue summarization, Transformer-based abstractive summarization, concise summaries, distilling conversations, conversations into informative
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Abstractive dialogue summarization is the task of distilling conversations into informative and concise summaries. Although reviews have been conducted on this topic, there is a lack of comprehensive work detailing the challenges of dialogue summarization, unifying the differing understanding of the task, and aligning proposed techniques, datasets, and evaluation metrics with the challenges. This article summarizes the research on Transformer-based abstractive summarization for English dialogues by systematically reviewing 1262 unique research papers published between 2019 and 2024, relying on the Semantic Scholar and DBLP databases. We cover the main challenges present in dialog summarization (i.e., language, structure, comprehension, speaker, salience, and factuality) and link them to corresponding techniques such as graph-based approaches, additional training tasks, and planning strategies, which typically overly rely on BART-based encoder-decoder models. We find that while some challenges, like language, have seen considerable progress, mainly due to training methods, others, such as comprehension, factuality, and salience, remain difficult and hold significant research opportunities. We investigate how these approaches are typically assessed, covering the datasets for the subdomains of dialogue (e.g., meeting, medical), the established automatic metrics and human evaluation approaches for assessing scores and annotator agreement. We observe that only a few datasets span across all subdomains. The ROUGE metric is the most used, while human evaluation is frequently reported without sufficient detail on inner-annotator agreement and annotation guidelines. Additionally, we discuss the possible implications of the recently explored large language models and conclude that despite a potential shift in relevance and difficulty, our described challenge taxonomy remains relevant.

[NLP-10] Paraphrasing in Affirmative Terms Improves Negation Understanding

Link: https://arxiv.org/abs/2406.07492
Authors: MohammadHossein Rezaei, Eduardo Blanco
Keywords: common linguistic phenomenon, linguistic phenomenon, common linguistic, Negation, natural language understanding
Subjects: Computation and Language (cs.CL)
Comments: Accepted to ACL 2024

Abstract:Negation is a common linguistic phenomenon. Yet language models face challenges with negation in many natural language understanding tasks such as question answering and natural language inference. In this paper, we experiment with seamless strategies that incorporate affirmative interpretations (i.e., paraphrases without negation) to make models more robust against negation. Crucially, our affirmative interpretations are obtained automatically. We show improvements with CondaQA, a large corpus requiring reasoning with negation, and five natural language understanding tasks.

[NLP-11] Advancing Annotation of Stance in Social Media Posts: A Comparative Analysis of Large Language Models and Crowd Sourcing

Link: https://arxiv.org/abs/2406.07483
Authors: Mao Li, Frederick Conrad
Keywords: Natural Language Processing, Large Language Models, Language Processing, Language Models, Natural Language
Subjects: Computation and Language (cs.CL)

Abstract:In the rapidly evolving landscape of Natural Language Processing (NLP), the use of Large Language Models (LLMs) for automated text annotation in social media posts has garnered significant interest. Despite the impressive innovations in developing LLMs like ChatGPT, their efficacy, and accuracy as annotation tools are not well understood. In this paper, we analyze the performance of eight open-source and proprietary LLMs for annotating the stance expressed in social media posts, benchmarking their performance against human annotators’ (i.e., crowd-sourced) judgments. Additionally, we investigate the conditions under which LLMs are likely to disagree with human judgment. A significant finding of our study is that the explicitness of text expressing a stance plays a critical role in how faithfully LLMs’ stance judgments match humans’. We argue that LLMs perform well when human annotators do, and when LLMs fail, it often corresponds to situations in which human annotators struggle to reach an agreement. We conclude with recommendations for a comprehensive approach that combines the precision of human expertise with the scalability of LLM predictions. This study highlights the importance of improving the accuracy and comprehensiveness of automated stance detection, aiming to advance these technologies for more efficient and unbiased analysis of social media.

[NLP-12] VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

Link: https://arxiv.org/abs/2406.07476
Authors: Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, Lidong Bing
Keywords: Video Large Language, Large Language Models, Large Language, enhance spatial-temporal modeling, Video Large
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: ZC, SL, HZ, YX, and XL contributed equally to this project

Abstract:In this paper, we present the VideoLLaMA 2, a set of Video Large Language Models (Video-LLMs) designed to enhance spatial-temporal modeling and audio understanding in video and audio-oriented tasks. Building upon its predecessor, VideoLLaMA 2 incorporates a tailor-made Spatial-Temporal Convolution (STC) connector, which effectively captures the intricate spatial and temporal dynamics of video data. Additionally, we integrate an Audio Branch into the model through joint training, thereby enriching the multimodal understanding capabilities of the model by seamlessly incorporating audio cues. Comprehensive evaluations on multiple-choice video question answering (MC-VQA), open-ended video question answering (OE-VQA), and video captioning (VC) tasks demonstrate that VideoLLaMA 2 consistently achieves competitive results among open-source models and even gets close to some proprietary models on several benchmarks. Furthermore, VideoLLaMA 2 exhibits reasonable improvements in audio-only and audio-video question-answering (AQA OE-AVQA) benchmarks over existing models. These advancements underline VideoLLaMA 2’s superior performance in multimodal comprehension, setting a new standard for intelligent video analysis systems. All models are public to facilitate further research.

[NLP-13] Multimodal Belief Prediction

Link: https://arxiv.org/abs/2406.07466
Authors: John Murzaku, Adil Soubki, Owen Rambow
Keywords: belief prediction task, words in context, level of commitment, interpret the meaning, understand cues
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: John Murzaku and Adil Soubki contributed equally to this work

Abstract:Recognizing a speaker’s level of commitment to a belief is a difficult task; humans do not only interpret the meaning of the words in context, but also understand cues from intonation and other aspects of the audio signal. Many papers and corpora in the NLP community have approached the belief prediction task using text-only approaches. We are the first to frame and present results on the multimodal belief prediction task. We use the CB-Prosody corpus (CBP), containing aligned text and audio with speaker belief annotations. We first report baselines and significant features using acoustic-prosodic features and traditional machine learning methods. We then present text and audio baselines for the CBP corpus fine-tuning on BERT and Whisper respectively. Finally, we present our multimodal architecture which fine-tunes on BERT and Whisper and uses multiple fusion methods, improving on both modalities alone.

[NLP-14] On the Robustness of Document-Level Relation Extraction Models to Entity Name Variations

Link: https://arxiv.org/abs/2406.07444
Authors: Shiao Meng, Xuming Hu, Aiwei Liu, Fukun Ma, Yawen Yang, Shuang Li, Lijie Wen
Keywords: increasing research interest, attracted increasing research, document-level relation extraction, large-scale relation extraction, relation extraction
Subjects: Computation and Language (cs.CL)
Comments: Accepted to ACL 2024 Findings

Abstract: Driven by the demand for cross-sentence and large-scale relation extraction, document-level relation extraction (DocRE) has attracted increasing research interest. Despite the continuous improvement in performance, we find that existing DocRE models which initially perform well may make more mistakes when merely changing the entity names in the document, hindering the generalization to novel entity names. To this end, we systematically investigate the robustness of DocRE models to entity name variations in this work. We first propose a principled pipeline to generate entity-renamed documents by replacing the original entity names with names from Wikidata. By applying the pipeline to the DocRED and Re-DocRED datasets, we construct two novel benchmarks named Env-DocRED and Env-Re-DocRED for robustness evaluation. Experimental results show that both the three representative DocRE models and the two in-context learned large language models consistently lack sufficient robustness to entity name variations, particularly on cross-sentence relation instances and documents with more entities. Finally, we propose an entity variation robust training method which not only improves the robustness of DocRE models but also enhances their understanding and reasoning capabilities. We further verify that the basic idea of this method can be effectively transferred to in-context learning for DocRE as well.

[NLP-15] Textual Similarity as a Key Metric in Machine Translation Quality Estimation

Link: https://arxiv.org/abs/2406.07440
Authors: Kun Sun, Rong Wang
Keywords: Quality Estimation, assesses translation reliability, Machine Translation, assesses translation, translation reliability
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: none

Abstract: Machine Translation (MT) Quality Estimation (QE) assesses translation reliability without reference texts. This study introduces “textual similarity” as a new metric for QE, using sentence transformers and cosine similarity to measure semantic closeness. Analyzing data from the MLQE-PE dataset, we found that textual similarity exhibits stronger correlations with human scores than traditional metrics (hter, model evaluation, etc.). Employing GAMMs as a statistical tool, we demonstrated that textual similarity consistently outperforms other metrics across multiple language pairs in predicting human scores. We also found that “hter” actually failed to predict human scores in QE. Our findings highlight the effectiveness of textual similarity as a robust QE metric, recommending its integration with other metrics into QE frameworks and MT system training for improved accuracy and usability.
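
To make the similarity score in this abstract concrete, here is a minimal sketch of cosine similarity between two sentence embeddings. The vectors are made-up placeholders (in the paper they would come from a sentence transformer), and the function name `cosine_similarity` is an assumption for illustration, not the authors' code.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("a zero vector has no direction")
    return dot / (norm_u * norm_v)

# Hypothetical embeddings of a source sentence and its machine translation.
source_vec = [0.2, 0.7, 0.1]
translation_vec = [0.25, 0.6, 0.05]
score = cosine_similarity(source_vec, translation_vec)
```

A score close to 1 means the two embeddings point in nearly the same direction, which is how the abstract's "semantic closeness" is usually read.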

[NLP-16] Learning Domain-Invariant Features for Out-of-Context News Detection

Link: https://arxiv.org/abs/2406.07430
Authors: Yimeng Gu, Mengqi Zhang, Ignacio Castro, Shu Wu, Gareth Tyson
Keywords: online media platforms, media platforms, online media, Multimodal, domain
Categories: Computation and Language (cs.CL); Multimedia (cs.MM)
Comments: none

Abstract: Multimodal out-of-context news is a common type of misinformation on online media platforms. This involves posting a caption alongside an invalid out-of-context news image. Reflecting its importance, researchers have developed models to detect such misinformation. However, a common limitation of these models is that they only consider the scenario where pre-labeled data is available for each domain, failing to address out-of-context news detection on unlabeled domains (e.g., unverified news on new topics or agencies). In this work, we therefore focus on domain adaptive out-of-context news detection. In order to effectively adapt the detection model to unlabeled news topics or agencies, we propose ConDA-TTA (Contrastive Domain Adaptation with Test-Time Adaptation), which applies contrastive learning and maximum mean discrepancy (MMD) to learn the domain-invariant feature. In addition, it leverages target domain statistics at test time to further assist domain adaptation. Experimental results show that our approach outperforms baselines in 5 out of 7 domain adaptation settings on two public datasets, by as much as 2.93% in F1 and 2.08% in accuracy.
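
The maximum mean discrepancy (MMD) mentioned in this abstract can be sketched as below. This is a generic biased estimator with an RBF kernel, not the ConDA-TTA implementation; the kernel choice, the `gamma` value, and the function names are assumptions for illustration.

```python
import math

def rbf_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def mmd_squared(source, target, gamma=1.0):
    """Biased estimate of squared MMD between two samples of feature vectors."""
    def mean_kernel(xs, ys):
        total = sum(rbf_kernel(x, y, gamma) for x in xs for y in ys)
        return total / (len(xs) * len(ys))
    return (mean_kernel(source, source)
            + mean_kernel(target, target)
            - 2.0 * mean_kernel(source, target))

# Toy one-dimensional features from two hypothetical news domains.
same = mmd_squared([[0.0], [1.0]], [[0.0], [1.0]])
shifted = mmd_squared([[0.0], [0.1]], [[5.0], [5.1]])
```

Minimizing a quantity like this during training pushes the source and target feature distributions together, which is the sense in which the learned features become domain-invariant.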

[NLP-17] MINERS: Multilingual Language Models as Semantic Retrievers

Link: https://arxiv.org/abs/2406.07424
Authors: Genta Indra Winata, Ruochen Zhang, David Ifeoluwa Adelani
Keywords: enabling downstream applications, high-dimensional vector space, enabling downstream, high-dimensional vector, vector space
Categories: Computation and Language (cs.CL)
Comments: Preprint

Abstract: Words have been represented in a high-dimensional vector space that encodes their semantic similarities, enabling downstream applications such as retrieving synonyms, antonyms, and relevant contexts. However, despite recent advances in multilingual language models (LMs), the effectiveness of these models’ representations in semantic retrieval contexts has not been comprehensively explored. To fill this gap, this paper introduces MINERS, a benchmark designed to evaluate the ability of multilingual LMs in semantic retrieval tasks, including bitext mining and classification via retrieval-augmented contexts. We create a comprehensive framework to assess the robustness of LMs in retrieving samples across over 200 diverse languages, including extremely low-resource languages in challenging cross-lingual and code-switching settings. Our results demonstrate that solely retrieving semantically similar embeddings yields performance competitive with state-of-the-art approaches, without requiring any fine-tuning.

[NLP-18] VersiCode: Towards Version-controllable Code Generation

Link: https://arxiv.org/abs/2406.07411
Authors: Tongtong Wu, Weigang Wu, Xingyu Wang, Kang Xu, Suyu Ma, Bo Jiang, Ping Yang, Zhenchang Xing, Yuan-Fang Li, Gholamreza Haffari
Keywords: code-related tasks due, practical importance, Significant research, large language model, focused on improving
Categories: Software Engineering (cs.SE); Computation and Language (cs.CL)
Comments: none

Abstract: Significant research has focused on improving the performance of large language models on code-related tasks due to their practical importance. Although performance is typically evaluated using public benchmark datasets, the existing datasets do not account for the concept of version, which is crucial in professional software development. In this paper, we introduce VersiCode, the first comprehensive dataset designed to assess the ability of large language models to generate verifiable code for specific library versions. VersiCode encompasses 300 libraries across more than 2,000 versions spanning 9 years. We design two dedicated evaluation tasks: version-specific code completion (VSCC) and version-aware code editing (VACE). Comprehensive experiments are conducted to benchmark the performance of LLMs, revealing the challenging nature of these tasks and of VersiCode: even state-of-the-art LLMs struggle to generate version-correct code. This dataset, together with the proposed tasks, sheds light on LLMs’ capabilities and limitations in handling version-specific code generation, and opens up an important new area of research for further investigation. The resources can be found at this https URL.

[NLP-19] Limited Out-of-Context Knowledge Reasoning in Large Language Models

Link: https://arxiv.org/abs/2406.07393
Authors: Peng Hu, Changjiang Gao, Ruiqi Gao, Jiajun Chen, Shujian Huang
Keywords: Large Language Models, Large Language, demonstrated strong capabilities, significant in-context reasoning, in-context reasoning capabilities
Categories: Computation and Language (cs.CL)
Comments: none

Abstract: Large Language Models (LLMs) have demonstrated strong capabilities as knowledge bases and significant in-context reasoning capabilities. However, previous work challenges their out-of-context reasoning ability, i.e., the ability to infer information from their training data, instead of from the context or prompt. This paper focuses on a significant facet of out-of-context reasoning: Out-of-Context Knowledge Reasoning (OCKR), which is to combine multiple pieces of knowledge to infer new knowledge. We designed a synthetic dataset with seven representative OCKR tasks to systematically assess the OCKR capabilities of LLMs. Using this dataset, we evaluated the LLaMA2-13B-chat model and discovered that its proficiency in this aspect is limited, regardless of whether the knowledge is trained in separate or adjacent training settings. Moreover, training the model to reason with complete reasoning data did not result in significant improvement. Training the model to perform explicit knowledge retrieval helps in only one of the tasks, indicating that the model’s limited OCKR capabilities are due to difficulties in retrieving relevant knowledge. Furthermore, we treat cross-lingual knowledge transfer as a distinct form of OCKR, and evaluate this ability. Our results show that the evaluated model also exhibits limited ability in transferring knowledge across languages. The dataset used in this study is available at this https URL.

[NLP-20] Large Language Models for Constrained-Based Causal Discovery

Link: https://arxiv.org/abs/2406.07378
Authors: Kai-Hendrik Cohrs, Gherardo Varando, Emiliano Diaz, Vasileios Sitokonstantinou, Gustau Camps-Valls
Keywords: Causality is essential, understanding complex systems, essential for understanding, understanding complex, Large Language Models
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: none

Abstract: Causality is essential for understanding complex systems, such as the economy, the brain, and the climate. Constructing causal graphs often relies on either data-driven or expert-driven approaches, both fraught with challenges. The former methods, like the celebrated PC algorithm, face issues with data requirements and assumptions of causal sufficiency, while the latter demand substantial time and domain knowledge. This work explores the capabilities of Large Language Models (LLMs) as an alternative to domain experts for causal graph generation. We frame conditional independence queries as prompts to LLMs and employ the PC algorithm with the answers. The performance of the LLM-based conditional independence oracle on systems with known causal graphs shows a high degree of variability. We improve the performance through a proposed statistical-inspired voting schema that allows some control over false-positive and false-negative rates. Inspecting the chain-of-thought argumentation, we find that the model uses causal reasoning to justify its answer to a probabilistic query. We show evidence that knowledge-based CIT could eventually become a complementary tool for data-driven causal discovery.
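
The abstract does not spell out its voting schema, so the following is only a generic threshold vote over repeated yes/no answers, shown to illustrate how a threshold can trade false positives against false negatives. The function name, the boolean encoding, and the default threshold are all assumptions.

```python
def vote_independent(answers, threshold=0.5):
    """Aggregate repeated oracle answers to one conditional independence query.

    answers: list of booleans, True meaning "the variables are independent".
    Returns True when the share of "independent" votes reaches the threshold.
    """
    if not answers:
        raise ValueError("need at least one answer")
    share = sum(answers) / len(answers)
    return share >= threshold

# Three hypothetical answers sampled for the same query.
sampled = [True, True, False]
lenient = vote_independent(sampled)                 # simple majority
strict = vote_independent(sampled, threshold=0.8)   # demands near agreement
```

Raising the threshold makes the aggregate less willing to declare independence, so fewer edges are removed in a PC-style search; lowering it has the opposite effect.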

[NLP-21] When Linear Attention Meets Autoregressive Decoding: Towards More Effective and Efficient Linearized Large Language Models

Link: https://arxiv.org/abs/2406.07368
Authors: Haoran You, Yichao Fu, Zheng Wang, Amir Yazdanbakhsh, Yingyan (Celine) Lin
Keywords: Autoregressive Large Language, Large Language Models, Large Language, limited efficiency due, achieved impressive performance
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted by ICML 2024; 17 pages; 10 figures; 16 tables

Abstract: Autoregressive Large Language Models (LLMs) have achieved impressive performance in language tasks but face two significant bottlenecks: (1) quadratic complexity in the attention module as the number of tokens increases, and (2) limited efficiency due to the sequential processing nature of autoregressive LLMs during generation. While linear attention and speculative decoding offer potential solutions, their applicability and synergistic potential for enhancing autoregressive LLMs remain uncertain. We conduct the first comprehensive study on the efficacy of existing linear attention methods for autoregressive LLMs, integrating them with speculative decoding. We introduce an augmentation technique for linear attention that ensures compatibility with speculative decoding, enabling more efficient training and serving of LLMs. Extensive experiments and ablation studies involving seven existing linear attention models and five encoder/decoder-based LLMs consistently validate the effectiveness of our augmented linearized LLMs. Notably, our approach achieves up to a 6.67 reduction in perplexity on the LLaMA model and up to a 2× speedup during generation compared to prior linear attention methods. Code and models are available at this https URL.
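
For readers unfamiliar with linear attention, the sketch below shows the common kernelized formulation in its causal, recurrent form: a running state replaces the quadratic attention matrix, so each new token costs constant work. It is not the augmented variant proposed in the paper; the feature map `elu(x) + 1` and all names here are assumptions for illustration.

```python
import math

def feature_map(x):
    """elu(x) + 1, which keeps every feature strictly positive."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]

def causal_linear_attention(queries, keys, values):
    """Process tokens left to right, keeping only running sums as state."""
    d_k, d_v = len(keys[0]), len(values[0])
    state = [[0.0] * d_v for _ in range(d_k)]  # running sum of phi(k) * v^T
    normalizer = [0.0] * d_k                   # running sum of phi(k)
    outputs = []
    for q, k, v in zip(queries, keys, values):
        phi_k = feature_map(k)
        for i in range(d_k):
            normalizer[i] += phi_k[i]
            for j in range(d_v):
                state[i][j] += phi_k[i] * v[j]
        phi_q = feature_map(q)
        denom = sum(phi_q[i] * normalizer[i] for i in range(d_k))
        outputs.append([
            sum(phi_q[i] * state[i][j] for i in range(d_k)) / denom
            for j in range(d_v)
        ])
    return outputs

# Two toy tokens with 2-dimensional queries/keys and 1-dimensional values.
out = causal_linear_attention(
    queries=[[0.5, -0.2], [0.1, 0.3]],
    keys=[[0.3, 0.1], [-0.4, 0.2]],
    values=[[1.0], [3.0]],
)
```

Because the first token can only attend to itself, its output equals its own value; later outputs are positive-weighted averages of the values seen so far.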

[NLP-22] BvSP: Broad-view Soft Prompting for Few-Shot Aspect Sentiment Quad Prediction

Link: https://arxiv.org/abs/2406.07365
Authors: Yinhao Bai, Yalan Xie, Xiaoyi Liu, Yuhua Zhao, Zhixin Han, Mengting Hu, Hang Gao, Renhong Cheng
Keywords: including aspect term, sentiment quad prediction, Aspect sentiment quad, opinion term, aspect term
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to ACL 2024 Main Conference

Abstract: Aspect sentiment quad prediction (ASQP) aims to predict four aspect-based elements, including aspect term, opinion term, aspect category, and sentiment polarity. In practice, unseen aspects, due to distinct data distribution, impose many challenges for a trained neural model. Motivated by this, this work formulates ASQP into the few-shot scenario, which aims for fast adaptation in real applications. Therefore, we first construct a few-shot ASQP dataset (FSQP) that contains richer categories and is more balanced for the few-shot study. Moreover, recent methods extract quads through a generation paradigm, which involves converting the input sentence into a templated target sequence. However, they primarily focus on the utilization of a single template or the consideration of different template orders, thereby overlooking the correlations among various templates. To tackle this issue, we further propose a Broad-view Soft Prompting (BvSP) method that aggregates multiple templates with a broader view by taking into account the correlation between the different templates. Specifically, BvSP uses the pre-trained language model to select the most relevant k templates with Jensen-Shannon divergence. BvSP further introduces soft prompts to guide the pre-trained language model using the selected templates. Then, we aggregate the results of multi-templates by a voting mechanism. Empirical results demonstrate that BvSP significantly outperforms the state-of-the-art methods under four few-shot settings and other public datasets. Our code and dataset are available at this https URL.
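
The Jensen-Shannon divergence used for template selection can be computed as in the sketch below. How BvSP turns template outputs into probability distributions is not described here, so the inputs are placeholder distributions and the function names are assumptions.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Symmetric, bounded divergence between two discrete distributions."""
    mixture = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)

# Placeholder distributions, e.g. over a tiny vocabulary.
identical = js_divergence([0.5, 0.5], [0.5, 0.5])
disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])
```

The value is 0 for identical distributions and reaches its maximum of ln 2 (in nats) when the two distributions share no support, which makes it a convenient score for ranking how closely templates agree.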

[NLP-23] GLIMPSE: Pragmatically Informative Multi-Document Summarization for Scholarly Reviews

Link: https://arxiv.org/abs/2406.07359
Authors: Maxime Darrin, Ines Arous, Pablo Piantanida, Jackie CK Cheung
Keywords: Scientific peer review, Scientific peer, academic publications, quality of academic, Scientific
Categories: Computation and Language (cs.CL)
Comments: none

Abstract: Scientific peer review is essential for the quality of academic publications. However, the increasing number of paper submissions to conferences has strained the reviewing process. This surge poses a burden on area chairs who have to carefully read an ever-growing volume of reviews and discern each reviewer’s main arguments as part of their decision process. In this paper, we introduce GLIMPSE, a summarization method designed to offer a concise yet comprehensive overview of scholarly reviews. Unlike traditional consensus-based methods, GLIMPSE extracts both common and unique opinions from the reviews. We introduce novel uniqueness scores based on the Rational Speech Act framework to identify relevant sentences in the reviews. Our method aims to provide a pragmatic glimpse into all reviews, offering a balanced perspective on their opinions. Our experimental results with both automatic metrics and human evaluation show that GLIMPSE generates more discriminative summaries than baseline methods in terms of human evaluation while achieving comparable performance with these methods in terms of automatic metrics.

[NLP-24] AI Sandbagging: Language Models can Strategically Underperform on Evaluations

Link: https://arxiv.org/abs/2406.07358
Authors: Teun van der Weij, Felix Hofstätter, Ollie Jaffe, Samuel F. Brown, Francis Rhys Ward
Keywords: Trustworthy capability evaluations, Trustworthy capability, crucial for ensuring, key component, capability evaluations
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: We publish our code and results at [this https URL](https://github.com/your-repo/your-project)

Abstract: Trustworthy capability evaluations are crucial for ensuring the safety of AI systems, and are becoming a key component of AI regulation. However, the developers of an AI system, or the AI system itself, may have incentives for evaluations to understate the AI’s actual capability. These conflicting interests lead to the problem of sandbagging, which we define as “strategic underperformance on an evaluation”. In this paper we assess sandbagging capabilities in contemporary language models (LMs). We prompt frontier LMs, like GPT-4 and Claude 3 Opus, to selectively underperform on dangerous capability evaluations, while maintaining performance on general (harmless) capability evaluations. Moreover, we find that models can be fine-tuned, on a synthetic dataset, to hide specific capabilities unless given a password. This behaviour generalizes to high-quality, held-out benchmarks such as WMDP. In addition, we show that both frontier and smaller models can be prompted, or password-locked, to target specific scores on a capability evaluation. Even more, we found that a capable password-locked model (Llama 3 70b) is reasonably able to emulate a less capable model (Llama 2 7b). Overall, our results suggest that capability evaluations are vulnerable to sandbagging. This vulnerability decreases the trustworthiness of evaluations, and thereby undermines important safety decisions regarding the development and deployment of advanced AI systems.

[NLP-25] Toxic Memes: A Survey of Computational Perspectives on the Detection and Explanation of Meme Toxicities

Link: https://arxiv.org/abs/2406.07353
Authors: Delfina Sol Martinez Pandiani, Erik Tjong Kim Sang, Davide Ceolin
Keywords: spread toxic messages, Internet memes, toxic meme analysis, meme, toxic meme
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
Comments: 39 pages, 12 figures, 9 tables

Abstract: Internet memes, channels for humor, social commentary, and cultural expression, are increasingly used to spread toxic messages. Studies on the computational analyses of toxic memes have significantly grown over the past five years, and the only three surveys on computational toxic meme analysis cover only work published until 2022, leading to inconsistent terminology and unexplored trends. Our work fills this gap by surveying content-based computational perspectives on toxic memes, and reviewing key developments until early 2024. Employing the PRISMA methodology, we systematically extend the previously considered papers, achieving a threefold result. First, we survey 119 new papers, analyzing 158 computational works focused on content-based toxic meme analysis. We identify over 30 datasets used in toxic meme analysis and examine their labeling systems. Second, after observing the existence of unclear definitions of meme toxicity in computational works, we introduce a new taxonomy for categorizing meme toxicity types. We also note an expansion in computational tasks beyond the simple binary classification of memes as toxic or non-toxic, indicating a shift towards achieving a nuanced comprehension of toxicity. Third, we identify three content-based dimensions of meme toxicity under automatic study: target, intent, and conveyance tactics. We develop a framework illustrating the relationships between these dimensions and meme toxicities. The survey analyzes key challenges and recent trends, such as enhanced cross-modal reasoning, integrating expert and cultural knowledge, the demand for automatic toxicity explanations, and handling meme toxicity in low-resource languages. Also, it notes the rising use of Large Language Models (LLMs) and generative AI for detecting and generating toxic memes. Finally, it proposes pathways for advancing toxic meme detection and interpretation.

[NLP-26] DR-RAG: Applying Dynamic Document Relevance to Retrieval-Augmented Generation for Question-Answering

Link: https://arxiv.org/abs/2406.07348
Authors: Zijian Hei, Weiling Wei, Wenjie Ou, Juyi Qiao, Junming Jiao, Zhiqing Zhu, Guowen Song
Keywords: Large Language Models, Language Models, Large Language, performance of Large, knowledge-intensive tasks
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: none

Abstract: Retrieval-Augmented Generation (RAG) has markedly improved the performance of Large Language Models (LLMs) in knowledge-intensive tasks such as Question-Answering (QA). RAG expands the query context by incorporating external knowledge bases to enhance the response accuracy. However, it would be inefficient to access LLMs multiple times for each query and unreliable to retrieve all the relevant documents by a single query. We find that even though there is low relevance between some critical documents and the query, it is possible to retrieve the remaining documents by combining parts of the documents with the query. To mine the relevance, a two-stage retrieval framework called Dynamic-Relevant Retrieval-Augmented Generation (DR-RAG) is proposed to improve document retrieval recall and the accuracy of answers while maintaining efficiency. Also, a small classifier is applied to two different selection strategies to determine the contribution of the retrieved documents to answering the query and retrieve the relatively relevant documents. Meanwhile, DR-RAG calls the LLM only once, which significantly improves the efficiency of the experiment. The experimental results on multi-hop QA datasets show that DR-RAG can significantly improve the accuracy of the answers and achieve new progress in QA systems.

[NLP-27] CTC-based Non-autoregressive Textless Speech-to-Speech Translation

Link: https://arxiv.org/abs/2406.07330
Authors: Qingkai Fang, Zhengrui Ma, Yan Zhou, Min Zhang, Yang Feng
Keywords: slow decoding due, speech sequences, achieved impressive translation, faces the challenge, challenge of slow
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: ACL 2024 Findings

Abstract: Direct speech-to-speech translation (S2ST) has achieved impressive translation quality, but it often faces the challenge of slow decoding due to the considerable length of speech sequences. Recently, some research has turned to non-autoregressive (NAR) models to expedite decoding, yet the translation quality typically lags behind autoregressive (AR) models significantly. In this paper, we investigate the performance of CTC-based NAR models in S2ST, as these models have shown impressive results in machine translation. Experimental results demonstrate that by combining pretraining, knowledge distillation, and advanced NAR training techniques such as glancing training and non-monotonic latent alignments, CTC-based NAR models achieve translation quality comparable to the AR model, while preserving up to a 26.81× decoding speedup.
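
CTC-based models emit one label per frame in parallel and then collapse the frame sequence into the final output, which is what makes non-autoregressive decoding fast. The sketch below shows the standard greedy collapse rule (merge repeats, then drop blanks); the function name and the choice of `0` as the blank id are assumptions, and this is not the paper's full decoding pipeline.

```python
def ctc_collapse(frame_labels, blank=0):
    """Merge consecutive repeated labels, then remove blank symbols."""
    decoded = []
    previous = None
    for label in frame_labels:
        # Keep a label only when it starts a new run and is not the blank.
        if label != previous and label != blank:
            decoded.append(label)
        previous = label
    return decoded

# Hypothetical per-frame argmax labels; the blank between the 3s keeps both.
units = ctc_collapse([0, 3, 3, 0, 3, 5, 5, 0])
```

Note that a blank between two identical labels is what allows the same unit to appear twice in a row in the output.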

[NLP-28] 3D-Properties: Identifying Challenges in DPO and Charting a Path Forward

Link: https://arxiv.org/abs/2406.07327
Authors: Yuzi Yan, Yibo Miao, Jialian Li, Yipin Zhang, Jian Xie, Zhijie Deng, Dong Yan
Keywords: Direct Preference Optimization, Aligning large language, gained tremendous attention, straightforward Direct Preference, recently gained tremendous
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: none

Abstract: Aligning large language models (LLMs) with human preference has recently gained tremendous attention, with the canonical yet costly RLHF-PPO and the simple and straightforward Direct Preference Optimization (DPO) as two examples. Despite its efficiency, DPO has rarely been used in state-of-the-art production-level LLMs, implying its potential pathologies. In this work, we revisit DPO with a comprehensive examination of its empirical efficacy and a systematic comparison with RLHF-PPO. We identify the 3D-properties of DPO’s learning outcomes: the Drastic drop in the likelihood of rejected responses, the Degradation into LLM unlearning, and the Dispersion effect on unseen responses, through experiments with both a carefully designed toy model and practical LLMs on tasks including mathematical problem-solving and instruction following. These findings inherently connect to some observations made by related works and we additionally contribute a plausible theoretical explanation for them. Accordingly, we propose easy regularization methods to mitigate the issues caused by the 3D-properties, improving the training stability and final performance of DPO. Our contributions also include an investigation into how the distribution of the paired preference data impacts the effectiveness of DPO. We hope this work could offer research directions to narrow the gap between reward-free preference learning methods and reward-based ones.
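
As background for the properties discussed in this abstract, the sketch below computes the standard per-pair DPO loss from sequence log-probabilities. It follows the usual published form of the objective, not any code from this paper; the function name, argument names, and `beta=0.1` are assumptions.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """-log(sigmoid(margin)) for one preference pair."""
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # softplus(-margin), written in a numerically stable way.
    return math.log1p(math.exp(-abs(margin))) + max(-margin, 0.0)

# Policy identical to the reference: the margin is zero, loss is ln 2.
at_init = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Only the rejected response's log-probability was pushed down.
after_push = dpo_loss(-10.0, -20.0, -10.0, -12.0)
```

In the second call the loss falls even though the chosen response's likelihood did not change, which gives some intuition for why lowering the likelihood of rejected responses is an easy way for training to reduce the objective.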

[NLP-29] BertaQA: How Much Do Language Models Know About Local Culture?

Link: https://arxiv.org/abs/2406.07302
Authors: Julen Etxaniz, Gorka Azkune, Aitor Soroa, Oier Lopez de Lacalle, Mikel Artetxe
Keywords: Large Language Models, exhibit extensive knowledge, Large Language, exhibit extensive, anglocentric subjects
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: none

Abstract:Large Language Models (LLMs) exhibit extensive knowledge about the world, but most evaluations have been limited to global or anglocentric subjects. This raises the question of how well these models perform on topics relevant to other cultures, whose presence on the web is not that prominent. To address this gap, we introduce BertaQA, a multiple-choice trivia dataset that is parallel in English and Basque. The dataset consists of a local subset with questions pertinent to the Basque culture, and a global subset with questions of broader interest. We find that state-of-the-art LLMs struggle with local cultural knowledge, even as they excel on global topics. However, we show that continued pre-training in Basque significantly improves the models’ performance on Basque culture, even when queried in English. To our knowledge, this is the first solid evidence of knowledge transfer from a low-resource to a high-resource language. Our analysis sheds light on the complex interplay between language and knowledge, and reveals that some prior findings do not fully hold when reassessed on local topics. Our dataset and evaluation code are available under open licenses at this https URL.

[NLP-30] Instruct Large Language Models to Drive like Humans

Link: https://arxiv.org/abs/2406.07296
Authors: Ruijun Zhang, Xianda Guo, Wenzhao Zheng, Chenming Zhang, Kurt Keutzer, Long Chen
Keywords: core challenge, challenge in autonomous, driving, autonomous driving, complex scenarios
Subjects: Robotics (cs.RO); Computation and Language (cs.CL)
Comments: Project page: this https URL

Abstract:Motion planning in complex scenarios is the core challenge in autonomous driving. Conventional methods apply predefined rules or learn from driving data to plan the future trajectory. Recent methods seek the knowledge preserved in large language models (LLMs) and apply them in the driving scenarios. Despite the promising results, it is still unclear whether the LLM learns the underlying human logic to drive. In this paper, we propose an InstructDriver method to transform LLM into a motion planner with explicit instruction tuning to align its behavior with humans. We derive driving instruction data based on human logic (e.g., do not cause collisions) and traffic rules (e.g., proceed only when green lights). We then employ an interpretable InstructChain module to further reason the final planning reflecting the instructions. Our InstructDriver allows the injection of human rules and learning from driving data, enabling both interpretability and data scalability. Different from existing methods that experimented on closed-loop or simulated settings, we adopt the real-world closed-loop motion planning nuPlan benchmark for better evaluation. InstructDriver demonstrates the effectiveness of the LLM planner in a real-world closed-loop setting. Our code is publicly available at this https URL.

[NLP-31] Joint Learning of Context and Feedback Embeddings in Spoken Dialogue

Link: https://arxiv.org/abs/2406.07291
Authors: Livia Qian, Gabriel Skantze
Keywords: feedback responses, play an important, Short feedback responses, important role, role in spoken
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Interspeech 2024

Abstract:Short feedback responses, such as backchannels, play an important role in spoken dialogue. So far, most of the modeling of feedback responses has focused on their timing, often neglecting how their lexical and prosodic form influence their contextual appropriateness and conversational function. In this paper, we investigate the possibility of embedding short dialogue contexts and feedback responses in the same representation space using a contrastive learning objective. In our evaluation, we primarily focus on how such embeddings can be used as a context-feedback appropriateness metric and thus for feedback response ranking in U.S. English dialogues. Our results show that the model outperforms humans given the same ranking task and that the learned embeddings carry information about the conversational function of feedback responses.
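As a rough illustration of the ranking use case this abstract describes, the sketch below scores candidate feedback responses against a dialogue context in a shared embedding space. The vectors are toy values and `rank_feedback` is a hypothetical helper. The abstract does not specify the encoders or the scoring function, so using cosine similarity as the appropriateness score is an assumption.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_feedback(
    context_vec: Sequence[float],
    feedback_vecs: Dict[str, Sequence[float]],
) -> List[str]:
    """Return feedback labels sorted from most to least similar to the context."""
    return sorted(
        feedback_vecs,
        key=lambda label: cosine(context_vec, feedback_vecs[label]),
        reverse=True,
    )


# Toy embeddings (assumed values, not taken from the paper).
context = [1.0, 0.0]
candidates = {"mhm": [0.9, 0.1], "wow": [0.1, 0.9], "yeah": [0.5, 0.5]}
ranking = rank_feedback(context, candidates)
```

In a trained model the vectors would come from the context and feedback encoders. Only the ranking step is sketched here.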

[NLP-32] Can We Achieve High-quality Direct Speech-to-Speech Translation without Parallel Speech Data?

Link: https://arxiv.org/abs/2406.07289
Authors: Qingkai Fang, Shaolei Zhang, Zhengrui Ma, Min Zhang, Yang Feng
Keywords: Recently proposed two-pass, Recently proposed, yielding promising results, parallel speech data, TTS
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: ACL 2024 main conference. Project Page: this https URL

Abstract: Recently proposed two-pass direct speech-to-speech translation (S2ST) models decompose the task into speech-to-text translation (S2TT) and text-to-speech (TTS) within an end-to-end model, yielding promising results. However, the training of these models still relies on parallel speech data, which is extremely challenging to collect. In contrast, S2TT and TTS have accumulated a large amount of data and pretrained models, which have not been fully utilized in the development of S2ST models. Inspired by this, in this paper, we first introduce a composite S2ST model named ComSpeech, which can seamlessly integrate any pretrained S2TT and TTS models into a direct S2ST model. Furthermore, to eliminate the reliance on parallel speech data, we propose a novel training method ComSpeech-ZS that solely utilizes S2TT and TTS data. It aligns representations in the latent space through contrastive learning, enabling the speech synthesis capability learned from the TTS data to generalize to S2ST in a zero-shot manner. Experimental results on the CVSS dataset show that when the parallel speech data is available, ComSpeech surpasses previous two-pass models like UnitY and Translatotron 2 in both translation quality and decoding speed. When there is no parallel speech data, ComSpeech-ZS lags behind ComSpeech by only 0.7 ASR-BLEU and outperforms the cascaded models.

[NLP-33] Fine-tuning with HED-IT: The impact of human post-editing for dialogical language models

Link: https://arxiv.org/abs/2406.07288
Authors: Daniela Occhipinti, Michele Marchi, Irene Mondella, Huiyuan Lai, Felice Dell’Orletta, Malvina Nissim, Marco Guerini
Keywords: resourced than English, fine-tuning Language Models, gathering linguistic data, fine-tuning Language, Language Models
Subjects: Computation and Language (cs.CL)

Abstract:Automatic methods for generating and gathering linguistic data have proven effective for fine-tuning Language Models (LMs) in languages less resourced than English. Still, while there has been emphasis on data quantity, less attention has been given to its quality. In this work, we investigate the impact of human intervention on machine-generated data when fine-tuning dialogical models. In particular, we study (1) whether post-edited dialogues exhibit higher perceived quality compared to the originals that were automatically generated; (2) whether fine-tuning with post-edited dialogues results in noticeable differences in the generated outputs; and (3) whether post-edited dialogues influence the outcomes when considering the parameter size of the LMs. To this end we created HED-IT, a large-scale dataset where machine-generated dialogues are paired with the version post-edited by humans. Using both the edited and unedited portions of HED-IT, we fine-tuned three different sizes of an LM. Results from both human and automatic evaluation show that the different quality of training data is clearly perceived and it has an impact also on the models trained on such data. Additionally, our findings indicate that larger models are less sensitive to data quality, whereas this has a crucial impact on smaller models. These results enhance our comprehension of the impact of human intervention on training data in the development of high-quality LMs.

[NLP-34] Bilingual Sexism Classification: Fine-Tuned XLM-RoBERTa and GPT-3.5 Few-Shot Learning

Link: https://arxiv.org/abs/2406.07287
Authors: AmirMohammad Azadi, Baktash Ansari, Sina Zamani
Keywords: necessitates effective classification, effective classification techniques, necessitates effective, effective classification, classification techniques
Subjects: Computation and Language (cs.CL)
Comments: 8 pages, 6 tables

Abstract:Sexism in online content is a pervasive issue that necessitates effective classification techniques to mitigate its harmful impact. Online platforms often have sexist comments and posts that create a hostile environment, especially for women and minority groups. This content not only spreads harmful stereotypes but also causes emotional harm. Reliable methods are essential to find and remove sexist content, making online spaces safer and more welcoming. Therefore, the sEXism Identification in Social neTworks (EXIST) challenge addresses this issue at CLEF 2024. This study aims to improve sexism identification in bilingual contexts (English and Spanish) by leveraging natural language processing models. The tasks are to determine whether a text is sexist and what the source intention behind it is. We fine-tuned the XLM-RoBERTa model and separately used GPT-3.5 with few-shot learning prompts to classify sexist content. The XLM-RoBERTa model exhibited robust performance in handling complex linguistic structures, while GPT-3.5’s few-shot learning capability allowed for rapid adaptation to new data with minimal labeled examples. Our approach using XLM-RoBERTa achieved 4th place in the soft-soft evaluation of Task 1 (sexism identification). For Task 2 (source intention), we achieved 2nd place in the soft-soft evaluation.

[NLP-35] Speaking Your Language: Spatial Relationships in Interpretable Emergent Communication

Link: https://arxiv.org/abs/2406.07277
Authors: Olaf Lipinski, Adam J. Sobey, Federico Cerutti, Timothy J. Norman
Keywords: Effective communication requires, Effective communication, ability to refer, refer to specific, Effective
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 16 pages, 3 figures

Abstract:Effective communication requires the ability to refer to specific parts of an observation in relation to others. While emergent communication literature shows success in developing various language properties, no research has shown the emergence of such positional references. This paper demonstrates how agents can communicate about spatial relationships within their observations. The results indicate that agents can develop a language capable of expressing the relationships between parts of their observation, achieving over 90% accuracy when trained in a referential game which requires such communication. Using a collocation measure, we demonstrate how the agents create such references. This analysis suggests that agents use a mixture of non-compositional and compositional messages to convey spatial relationships. We also show that the emergent language is interpretable by humans. The translation accuracy is tested by communicating with the receiver agent, where the receiver achieves over 78% accuracy using parts of this lexicon, confirming that the interpretation of the emergent language was successful.

[NLP-36] Advancing Grounded Multimodal Named Entity Recognition via LLM-Based Reformulation and Box-Based Segmentation

Link: https://arxiv.org/abs/2406.07268
Authors: Jinyuan Li, Ziyan Li, Han Li, Jianfei Yu, Rui Xia, Di Sun, Gang Pan
Keywords: Grounded Multimodal Named, Named Entity Recognition, Multimodal Named Entity, identify named entities, Grounded Multimodal
Subjects: Multimedia (cs.MM); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Extension of our Findings of EMNLP 2023 and ACL 2024 paper

Abstract:Grounded Multimodal Named Entity Recognition (GMNER) task aims to identify named entities, entity types and their corresponding visual regions. GMNER task exhibits two challenging attributes: 1) The tenuous correlation between images and text on social media contributes to a notable proportion of named entities being ungroundable. 2) There exists a distinction between coarse-grained noun phrases used in similar tasks (e.g., phrase localization) and fine-grained named entities. In this paper, we propose RiVEG, a unified framework that reformulates GMNER into a joint MNER-VE-VG task by leveraging large language models (LLMs) as connecting bridges. This reformulation brings two benefits: 1) It enables us to optimize the MNER module for optimal MNER performance and eliminates the need to pre-extract region features using object detection methods, thus naturally addressing the two major limitations of existing GMNER methods. 2) The introduction of Entity Expansion Expression module and Visual Entailment (VE) module unifies Visual Grounding (VG) and Entity Grounding (EG). This endows the proposed framework with unlimited data and model scalability. Furthermore, to address the potential ambiguity stemming from the coarse-grained bounding box output in GMNER, we further construct the new Segmented Multimodal Named Entity Recognition (SMNER) task and corresponding Twitter-SMNER dataset aimed at generating fine-grained segmentation masks, and experimentally demonstrate the feasibility and effectiveness of using box prompt-based Segment Anything Model (SAM) to empower any GMNER model with the ability to accomplish the SMNER task. Extensive experiments demonstrate that RiVEG significantly outperforms SoTA methods on four datasets across the MNER, GMNER, and SMNER tasks.

[NLP-37] Scientific Computing with Large Language Models

Link: https://arxiv.org/abs/2406.07259
Authors: Christopher Culver, Peter Hicks, Mihailo Milenkovic, Sanjif Shanmugavelu, Tobias Becker
Keywords: provide an overview, emergence of large, large language models, scientific computing applications, large language
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 13 pages

Abstract:We provide an overview of the emergence of large language models for scientific computing applications. We highlight use cases that involve natural language processing of scientific documents and specialized languages designed to describe physical systems. For the former, chatbot style applications appear in medicine, mathematics and physics and can be used iteratively with domain experts for problem solving. We also review specialized languages within molecular biology, the languages of molecules, proteins, and DNA where language models are being used to predict properties and even create novel physical systems at much faster rates than traditional computing methods.

[NLP-38] Scholarly Question Answering using Large Language Models in the NFDI4DataScience Gateway

Link: https://arxiv.org/abs/2406.07257
Authors: Hamed Babaei Giglou, Tilahun Abedissa Taffa, Rana Abdullah, Aida Usmanova, Ricardo Usbeck, Jennifer D’Souza, Sören Auer
Keywords: Retrieval Augmented Generation-based, scholarly Question Answering, Question Answering, Augmented Generation-based, Retrieval Augmented
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 13 pages main content, 16 pages overall, 3 Figures, accepted for publication at NSLP 2024 workshop at ESWC 2024

Abstract:This paper introduces a scholarly Question Answering (QA) system on top of the NFDI4DataScience Gateway, employing a Retrieval Augmented Generation-based (RAG) approach. The NFDI4DS Gateway, as a foundational framework, offers a unified and intuitive interface for querying various scientific databases using federated search. The RAG-based scholarly QA, powered by a Large Language Model (LLM), facilitates dynamic interaction with search results, enhancing filtering capabilities and fostering a conversational engagement with the Gateway search. The effectiveness of both the Gateway and the scholarly QA system is demonstrated through experimental analysis.

[NLP-39] MBBQ: A Dataset for Cross-Lingual Comparison of Stereotypes in Generative LLMs

Link: https://arxiv.org/abs/2406.07243
Authors: Vera Neplenbroek, Arianna Bisazza, Raquel Fernández
Keywords: Generative large language, exhibit harmful biases, Generative large, shown to exhibit, exhibit harmful
Subjects: Computation and Language (cs.CL)

Abstract:Generative large language models (LLMs) have been shown to exhibit harmful biases and stereotypes. While safety fine-tuning typically takes place in English, if at all, these models are being used by speakers of many different languages. There is existing evidence that the performance of these models is inconsistent across languages and that they discriminate based on demographic factors of the user. Motivated by this, we investigate whether the social stereotypes exhibited by LLMs differ as a function of the language used to prompt them, while controlling for cultural differences and task accuracy. To this end, we present MBBQ (Multilingual Bias Benchmark for Question-answering), a carefully curated version of the English BBQ dataset extended to Dutch, Spanish, and Turkish, which measures stereotypes commonly held across these languages. We further complement MBBQ with a parallel control dataset to measure task performance on the question-answering task independently of bias. Our results based on several open-source and proprietary LLMs confirm that some non-English languages suffer from bias more than English, even when controlling for cultural shifts. Moreover, we observe significant cross-lingual differences in bias behaviour for all except the most accurate models. With the release of MBBQ, we hope to encourage further research on bias in multilingual settings. The dataset and code are available at this https URL.

[NLP-40] On the Hallucination in Simultaneous Machine Translation

Link: https://arxiv.org/abs/2406.07239
Authors: Meizhi Zhong, Kehai Chen, Zhengshan Xue, Lemao Liu, Mingming Yang, Min Zhang
Keywords: Simultaneous Machine Translation, Machine Translation, Simultaneous Machine, issue in Simultaneous, critical issue
Subjects: Computation and Language (cs.CL)

Abstract:It is widely known that hallucination is a critical issue in Simultaneous Machine Translation (SiMT) due to the absence of source-side information. While many efforts have been made to enhance performance for SiMT, few of them attempt to understand and analyze hallucination in SiMT. Therefore, we conduct a comprehensive analysis of hallucination in SiMT from two perspectives: understanding the distribution of hallucination words and the target-side context usage of them. Intensive experiments demonstrate some valuable findings and particularly show that it is possible to alleviate hallucination by decreasing the over usage of target-side information for SiMT.

[NLP-41] DUAL-REFLECT: Enhancing Large Language Models for Reflective Translation through Dual Learning Feedback Mechanisms

Link: https://arxiv.org/abs/2406.07232
Authors: Andong Chen, Lianzhang Lou, Kehai Chen, Xuefeng Bai, Yang Xiang, Muyun Yang, Tiejun Zhao, Min Zhang
Keywords: large language models, achieved promising performance, achieved promising, translation, Recently
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to ACL 2024 main conference

Abstract:Recently, large language models (LLMs) enhanced by self-reflection have achieved promising performance on machine translation. The key idea is guiding LLMs to generate translation with human-like feedback. However, existing self-reflection methods lack effective feedback information, limiting the translation performance. To address this, we introduce a DUAL-REFLECT framework, leveraging the dual learning of translation tasks to provide effective feedback, thereby enhancing the models’ self-reflective abilities and improving translation performance. The application of this method across various translation tasks has proven its effectiveness in improving translation accuracy and eliminating ambiguities, especially in translation tasks with low-resource language pairs.

[NLP-42] Decipherment-Aware Multilingual Learning in Jointly Trained Language Models

Link: https://arxiv.org/abs/2406.07231
Authors: Grandee Lee
Keywords: trained language models, governs unsupervised multilingual, jointly trained language, unsupervised multilingual learning, principle that governs
Subjects: Computation and Language (cs.CL)

Abstract:The principle that governs unsupervised multilingual learning (UCL) in jointly trained language models (mBERT as a popular example) is still being debated. Many find it surprising that one can achieve UCL with multiple monolingual corpora. In this work, we anchor UCL in the context of language decipherment and show that the joint training methodology is a decipherment process pivotal for UCL. In a controlled setting, we investigate the effect of different decipherment settings on the multilingual learning performance and consolidate the existing opinions on the contributing factors to multilinguality. From an information-theoretic perspective we draw a limit to the UCL performance and demonstrate the importance of token alignment in challenging decipherment settings caused by differences in the data domain, language order and tokenization granularity. Lastly, we apply lexical alignment to mBERT and investigate the contribution of aligning different lexicon groups to downstream performance.

[NLP-43] Improving Commonsense Bias Classification by Mitigating the Influence of Demographic Terms

Link: https://arxiv.org/abs/2406.07229
Authors: JinKyu Lee, Jihie Kim
Keywords: Natural Language Processing, Language Processing, Natural Language, Understanding commonsense knowledge, field of Natural
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 10 pages, 5 figures, conference presentation, supported by MSIT (Korea) under ITRC program (IITP-2024-2020-0-01789) and AI Convergence Innovation HR Development (IITP-2024-RS-2023-00254592)

Abstract:Understanding commonsense knowledge is crucial in the field of Natural Language Processing (NLP). However, the presence of demographic terms in commonsense knowledge poses a potential risk of compromising the performance of NLP models. This study aims to investigate and propose methods for enhancing the performance and effectiveness of a commonsense polarization classifier by mitigating the influence of demographic terms. Three methods are introduced in this paper: (1) hierarchical generalization of demographic terms (2) threshold-based augmentation and (3) integration of hierarchical generalization and threshold-based augmentation methods (IHTA). The first method involves replacing demographic terms with more general ones based on a term hierarchy ontology, aiming to mitigate the influence of specific terms. To address the limited bias-related information, the second method measures the polarization of demographic terms by comparing the changes in the model’s predictions when these terms are masked versus unmasked. This method augments commonsense sentences containing terms with high polarization values by replacing their predicates with synonyms generated by ChatGPT. The third method combines the two approaches, starting with threshold-based augmentation followed by hierarchical generalization. The experiments show that the first method increases the accuracy over the baseline by 2.33%, and the second one by 0.96% over standard augmentation methods. The IHTA techniques yielded an 8.82% and 9.96% higher accuracy than threshold-based and standard augmentation methods, respectively.
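The second method in this abstract (measuring a term's polarization from masked versus unmasked predictions, then augmenting only above a threshold) can be sketched roughly as follows. This is a minimal illustration under stated assumptions: `toy_predict` is a hypothetical stand-in for the commonsense polarization classifier, the `[MASK]` token, the absolute-difference score, and the threshold value are all assumptions, and the ChatGPT synonym-replacement step is left out.

```python
from typing import Callable, List, Tuple

MASK = "[MASK]"  # assumed mask token


def polarization(sentence: str, term: str, predict: Callable[[str], float]) -> float:
    """Absolute change in the model's score when `term` is masked out."""
    masked = sentence.replace(term, MASK)
    return abs(predict(sentence) - predict(masked))


def select_for_augmentation(
    examples: List[Tuple[str, str]],
    predict: Callable[[str], float],
    threshold: float,
) -> List[str]:
    """Keep sentences whose demographic term has polarization >= threshold."""
    return [s for s, term in examples if polarization(s, term, predict) >= threshold]


def toy_predict(text: str) -> float:
    """Hypothetical classifier score: reacts only to the word 'elderly'."""
    return 0.75 if "elderly" in text else 0.25


examples = [
    ("The elderly man forgot his keys.", "elderly"),
    ("The tall man forgot his keys.", "tall"),
]
selected = select_for_augmentation(examples, toy_predict, threshold=0.3)
```

Only the first sentence is selected, because masking "elderly" changes the toy score while masking "tall" does not.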

[NLP-44] Improving Autoformalization using Type Checking

Link: https://arxiv.org/abs/2406.07222
Authors: Auguste Poiroux, Gail Weiss, Viktor Kunčak, Antoine Bosselut
Keywords: Large language models, translating natural language, automatically translating natural, Large language, natural language
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Large language models show promise for autoformalization, the task of automatically translating natural language into formal languages. However, current autoformalization methods remain limited. The last reported state-of-the-art performance on the ProofNet formalization benchmark for the Lean proof assistant, achieved using Codex for Lean 3, only showed successful formalization of 16.1% of informal statements. Similarly, our evaluation of GPT-4o for Lean 4 only produces successful translations 34.9% of the time. Our analysis shows that the performance of these models is largely limited by their inability to generate formal statements that successfully type-check (i.e., are syntactically correct and consistent with types) - with a whopping 86.6% of GPT-4o errors starting from a type-check failure. In this work, we propose a method to fix this issue through decoding with type-check filtering, where we initially sample a diverse set of candidate formalizations for an informal statement, then use the Lean proof assistant to filter out candidates that do not type-check. Using GPT-4o as a base model, and combining our method with self-consistency, we obtain a +18.3% absolute increase in formalization accuracy, and achieve a new state-of-the-art of 53.2% on ProofNet with Lean 4.
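The decoding scheme described in this abstract (sample several candidate formalizations, discard those that fail to type-check, then apply self-consistency) might look roughly like the sketch below. It is not the authors' implementation: `toy_type_check` is a hypothetical stand-in for a call to the Lean proof assistant, and self-consistency is approximated here by a majority vote over whitespace-normalised strings, which is an assumption.

```python
from collections import Counter
from typing import Callable, Iterable, Optional


def select_formalization(
    candidates: Iterable[str],
    type_checks: Callable[[str], bool],
) -> Optional[str]:
    """Filter candidates with the checker, then return the most frequent survivor."""
    # Collapse whitespace so trivially different strings count as the same vote.
    valid = [" ".join(c.split()) for c in candidates if type_checks(c)]
    if not valid:
        return None  # every sampled candidate failed the type check
    # Ties are broken by first-seen order.
    return Counter(valid).most_common(1)[0][0]


def toy_type_check(statement: str) -> bool:
    """Hypothetical placeholder; a real system would invoke Lean here."""
    return statement.strip().startswith("theorem") and ":=" in statement


samples = [
    "theorem add_comm' (a b : Nat) : a + b = b + a := by sorry",
    "theorem add_comm'  (a b : Nat) : a + b = b + a := by sorry",
    "theorem add_comm' (a b : Nat) : a + b = a + b := by sorry",
    "lemma broken (a b) a + b",
]
best = select_formalization(samples, toy_type_check)
```

The figures reported in the abstract are mutually consistent: the 34.9% GPT-4o baseline plus the 18.3-point absolute gain gives the stated 53.2%.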

[NLP-45] A Synthetic Dataset for Personal Attribute Inference

Link: https://arxiv.org/abs/2406.07217
Authors: Hanna Yukhymenko, Robin Staab, Mark Vero, Martin Vechev
Keywords: Large Language Models, powerful Large Language, Language Models, Large Language, powerful Large
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:Recently, powerful Large Language Models (LLMs) have become easily accessible to hundreds of millions of users worldwide. However, their strong capabilities and vast world knowledge do not come without associated privacy risks. In this work, we focus on the emerging privacy threat LLMs pose - the ability to accurately infer personal information from online texts. Despite the growing importance of LLM-based author profiling, research in this area has been hampered by a lack of suitable public datasets, largely due to ethical and privacy concerns associated with real personal data. In this work, we take two steps to address this problem: (i) we construct a simulation framework for the popular social media platform Reddit using LLM agents seeded with synthetic personal profiles; (ii) using this framework, we generate SynthPAI, a diverse synthetic dataset of over 7800 comments manually labeled for personal attributes. We validate our dataset with a human study showing that humans barely outperform random guessing on the task of distinguishing our synthetic comments from real ones. Further, we verify that our dataset enables meaningful personal attribute inference research by showing across 18 state-of-the-art LLMs that our synthetic comments allow us to draw the same conclusions as real-world data. Together, this indicates that our dataset and pipeline provide a strong and privacy-preserving basis for future research toward understanding and mitigating the inference-based privacy threats LLMs pose.
摘要:最近,功能强大的大型语言模型(LLM)已可供全球数亿用户轻松使用。然而,它们强大的能力和广博的世界知识也伴随着相应的隐私风险。在这项工作中,我们关注LLM带来的一种新兴隐私威胁:从在线文本中准确推断个人信息的能力。尽管基于LLM的作者画像日益重要,但由于缺乏合适的公开数据集,这一领域的研究一直受到阻碍,这主要源于真实个人数据涉及的伦理和隐私问题。在这项工作中,我们采取两个步骤来解决这个问题:(i)我们使用以合成个人档案为种子的LLM智能体,为流行的社交媒体平台Reddit构建了一个模拟框架;(ii)利用该框架,我们生成了SynthPAI,这是一个包含7800多条评论、并经人工标注个人属性的多样化合成数据集。我们通过一项人类研究验证了该数据集,结果表明,在区分我们的合成评论与真实评论的任务上,人类的表现仅略优于随机猜测。此外,我们在18个最先进的LLM上表明,基于合成评论得出的结论与基于真实数据得出的结论一致,从而验证了该数据集能够支持有意义的个人属性推断研究。总之,这表明我们的数据集和流水线为未来理解并缓解LLM带来的基于推断的隐私威胁的研究提供了坚实且保护隐私的基础。

[NLP-46] Towards Human-AI Collaboration in Healthcare: Guided Deferral Systems with Large Language Models
[NLP-46] 迈向医疗保健领域的人机协作:基于大型语言模型的引导式转交系统

链接: https://arxiv.org/abs/2406.07212
作者: Joshua Strong,Qianhui Men,Alison Noble
关键词: critical decision-making situations, Large language models, hallucinate introduces unacceptable, Large language, introduces unacceptable uncertainty
中文关键词: 关键决策情况,大型语言模型,幻觉引入不可接受,大型语言,引入不可接受的不确定性
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

Abstract:Large language models (LLMs) present a valuable technology for various applications in healthcare, but their tendency to hallucinate introduces unacceptable uncertainty in critical decision-making situations. Human-AI collaboration (HAIC) can mitigate this uncertainty by combining human and AI strengths for better outcomes. This paper presents a novel guided deferral system that provides intelligent guidance when AI defers cases to human decision-makers. We leverage LLMs’ verbalisation capabilities and internal states to create this system, demonstrating that fine-tuning smaller LLMs with data from larger models enhances performance while maintaining computational efficiency. A pilot study showcases the effectiveness of our deferral system.
摘要:大型语言模型(LLM)为医疗保健领域的各种应用提供了一项有价值的技术,但其产生幻觉的倾向会在关键决策场景中引入不可接受的不确定性。人机协作(HAIC)可以结合人类与人工智能各自的优势来降低这种不确定性,从而获得更好的结果。本文提出了一种新型的引导式转交系统:当人工智能将案例转交给人类决策者时,该系统会提供智能指导。我们利用LLM的语言表达能力和内部状态来构建该系统,并证明用较大模型产生的数据微调较小的LLM可以在保持计算效率的同时提升性能。一项试点研究展示了该转交系统的有效性。

[NLP-47] Merging Improves Self-Critique Against Jailbreak Attacks
[NLP-47] 模型合并提升了针对越狱攻击的自我批评能力

链接: https://arxiv.org/abs/2406.07188
作者: Victor Gallego
关键词: large language models, remains a significant, significant challenge, large language, robustness of large
中文关键词: 大型语言模型,仍然是一个重大的挑战,大型语言,大型的鲁棒性
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

Abstract:The robustness of large language models (LLMs) against adversarial manipulations, such as jailbreak attacks, remains a significant challenge. In this work, we propose an approach that enhances the self-critique capability of the LLM and further fine-tunes it over sanitized synthetic data. This is done with the addition of an external critic model that can be merged with the original, thus bolstering self-critique capabilities and improving the robustness of the LLM's response to adversarial prompts. Our results demonstrate that the combination of merging and self-critique can reduce the attack success rate of adversaries significantly, thus offering a promising defense mechanism against jailbreak attacks. Code, data and models released at this https URL .
摘要:大型语言模型(LLM)对越狱攻击等对抗性操纵的鲁棒性仍然是一个重大挑战。在这项工作中,我们提出了一种增强LLM自我批评能力的方法,并在经过净化的合成数据上进一步对其进行微调。这是通过添加一个可以与原始模型合并的外部批评者模型来实现的,从而增强自我批评能力,并提高LLM对对抗性提示的响应的鲁棒性。我们的结果表明,合并与自我批评相结合可以显著降低对手的攻击成功率,从而提供一种有前景的针对越狱攻击的防御机制。代码、数据和模型已在此https URL发布。

[NLP-48] Teaching Language Models to Self-Improve by Learning from Language Feedback
[NLP-48] 教会语言模型通过从语言反馈中学习来自我改进

链接: https://arxiv.org/abs/2406.07168
作者: Chi Hu,Yimin Hu,Hang Cao,Tong Xiao,Jingbo Zhu
关键词: Aligning Large Language, Aligning Large, Large Language Models, Large Language, SRT
中文关键词: 调整大型语言,调整大型,大型语言模型,大型语言,SRT
类目: Computation and Language (cs.CL)
备注: Findings of ACL 2024

Abstract:Aligning Large Language Models (LLMs) with human intentions and values is crucial yet challenging. Current methods primarily rely on human preferences, which are costly and insufficient in capturing nuanced feedback expressed in natural language. In this paper, we present Self-Refinement Tuning (SRT), a method that leverages model feedback for alignment, thereby reducing reliance on human annotations. SRT uses a base language model (e.g., Tulu2) to generate initial responses, which are critiqued and refined by a more advanced model (e.g., GPT-4-Turbo). This process enables the base model to self-evaluate and improve its outputs, facilitating continuous learning. SRT further optimizes the model by learning from its self-generated feedback and refinements, creating a feedback loop that promotes model improvement. Our empirical evaluations demonstrate that SRT significantly outperforms strong baselines across diverse tasks and model sizes. When applied to a 70B parameter model, SRT increases the win rate from 9.6% to 25.8% on the AlpacaEval 2.0 benchmark, surpassing well-established systems such as GPT-4-0314, Claude 2, and Gemini. Our analysis highlights the crucial role of language feedback in the success of SRT, suggesting potential for further exploration in this direction.
摘要:使大型语言模型(LLM)与人类的意图和价值观保持一致是至关重要的,但也是具有挑战性的。目前的方法主要依赖于人类的偏好,这在捕捉以自然语言表达的细微差别反馈方面代价高昂,而且不够充分。在本文中,我们提出了自精化调整(SRT),这是一种利用模型反馈进行对齐的方法,从而减少了对人工注释的依赖。SRT使用基本语言模型(例如,Tulu2)来生成初始响应,该初始响应由更高级的模型(例如,GPT-4-Turbo)进行评价和改进。这一过程使基本模型能够自我评估并改进其产出,从而促进持续学习。SRT通过学习其自生成的反馈和改进进一步优化了模型,创建了一个促进模型改进的反馈循环。我们的经验评估表明,在不同的任务和模型大小上,SRT显著优于强大的基线。当应用于70B参数模型时,SRT在AlpacaEval 2.0基准上将胜率从9.6%提高到25.8%,超过了GPT-4-0314、Claude 2和Gemini等成熟的系统。我们的分析强调了语言反馈在SRT成功中的关键作用,暗示了在这一方向上进一步探索的潜力。

[NLP-49] EmoBox: Multilingual Multi-corpus Speech Emotion Recognition Toolkit and Benchmark
[NLP-49] EmoBox:多语言多语料库语音情感识别工具包和基准

链接: https://arxiv.org/abs/2406.07162
作者: Ziyang Ma,Mingjie Chen,Hezhao Zhang,Zhisheng Zheng,Wenxi Chen,Xiquan Li,Jiaxin Ye,Xie Chen,Thomas Hain
关键词: receiving extensive attention, human-computer interaction, receiving extensive, industry and academia, SER
中文关键词: 受到广泛关注,人机互动,受到广泛,行业和学术界,SER
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
备注: Accepted by INTERSPEECH 2024. GitHub Repository: this https URL

Abstract:Speech emotion recognition (SER) is an important part of human-computer interaction, receiving extensive attention from both industry and academia. However, the current research field of SER has long suffered from the following problems: 1) There are few reasonable and universal splits of the datasets, making comparing different models and methods difficult. 2) No commonly used benchmark covers numerous corpus and languages for researchers to refer to, making reproduction a burden. In this paper, we propose EmoBox, an out-of-the-box multilingual multi-corpus speech emotion recognition toolkit, along with a benchmark for both intra-corpus and cross-corpus settings. For intra-corpus settings, we carefully designed the data partitioning for different datasets. For cross-corpus settings, we employ a foundation SER model, emotion2vec, to mitigate annotation errors and obtain a test set that is fully balanced in speakers and emotions distributions. Based on EmoBox, we present the intra-corpus SER results of 10 pre-trained speech models on 32 emotion datasets with 14 languages, and the cross-corpus SER results on 4 datasets with the fully balanced test sets. To the best of our knowledge, this is the largest SER benchmark, across language scopes and quantity scales. We hope that our toolkit and benchmark can facilitate the research of SER in the community.
摘要:语音情感识别(SER)是人机交互的重要组成部分,受到工业界和学术界的广泛关注。然而,目前SER研究领域长期存在以下问题:1)数据集缺乏合理、通用的划分,使得不同的模型和方法难以比较。2)没有一个常用的基准涵盖大量语料库和语种供研究人员参考,这使得结果复现成为负担。在本文中,我们提出了EmoBox,一个开箱即用的多语言多语料库语音情感识别工具包,以及一个同时面向语料库内和跨语料库设置的基准。对于语料库内设置,我们为不同数据集精心设计了数据划分。对于跨语料库设置,我们采用SER基础模型emotion2vec来减少标注错误,并获得一个在说话人和情感分布上完全平衡的测试集。基于EmoBox,我们给出了10个预训练语音模型在覆盖14种语言的32个情感数据集上的语料库内SER结果,以及在4个数据集的完全平衡测试集上的跨语料库SER结果。据我们所知,无论从语言覆盖范围还是数据规模来看,这都是目前最大的SER基准。我们希望我们的工具包和基准能够促进学术社区的SER研究。

[NLP-50] Scaling Large-Language-Model-based Multi-Agent Collaboration
[NLP-50] 扩展基于大语言模型的多Agent协作

链接: https://arxiv.org/abs/2406.07155
作者: Chen Qian,Zihao Xie,Yifei Wang,Wei Liu,Yufan Dang,Zhuoyun Du,Weize Chen,Cheng Yang,Zhiyuan Liu,Maosong Sun
关键词: large language model-powered, Pioneering advancements, language model-powered agents, demonstrating that collective, advancements in large
中文关键词: 大型语言模型驱动,开创性进步,语言模型驱动的代理,表明集体,大型领域的进步
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA); Networking and Internet Architecture (cs.NI); Social and Information Networks (cs.SI)
备注: Work in progress; The code and data will be available at this https URL

Abstract:Pioneering advancements in large language model-powered agents have underscored the design pattern of multi-agent collaboration, demonstrating that collective intelligence can surpass the capabilities of each individual. Inspired by the neural scaling law, which posits that increasing neurons leads to emergent abilities, this study investigates whether a similar principle applies to increasing agents in multi-agent collaboration. Technically, we propose multi-agent collaboration networks (MacNet), which utilize directed acyclic graphs to organize agents and streamline their interactive reasoning via topological ordering, with solutions derived from their dialogues. Extensive experiments show that MacNet consistently outperforms baseline models, enabling effective agent collaboration across various network topologies and supporting cooperation among more than a thousand agents. Notably, we observed a small-world collaboration phenomenon, where topologies resembling small-world properties achieved superior performance. Additionally, we identified a collaborative scaling law, indicating that normalized solution quality follows a logistic growth pattern as scaling agents, with collaborative emergence occurring much earlier than previously observed instances of neural emergence. The code and data will be available at this https URL.
摘要:大型语言模型驱动的智能体的开创性进展凸显了多智能体协作这一设计模式,表明集体智能可以超越每个个体的能力。神经缩放定律认为增加神经元会带来涌现能力;受此启发,本研究探讨类似的原理是否也适用于在多智能体协作中增加智能体。在技术上,我们提出了多智能体协作网络(MacNet),它利用有向无环图来组织智能体,并通过拓扑排序来梳理它们的交互推理,最终的解决方案从它们的对话中得出。大量实验表明,MacNet始终优于基线模型,能够在各种网络拓扑下实现有效的智能体协作,并支持一千多个智能体之间的合作。值得注意的是,我们观察到一种小世界协作现象,即具有类似小世界性质的拓扑取得了更优的性能。此外,我们发现了一条协作缩放定律:随着智能体数量的增加,归一化的解的质量遵循逻辑斯谛(logistic)增长模式,并且协作涌现出现的时间远早于此前观察到的神经涌现。代码和数据将在此https URL上提供。
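
The idea of organizing agents as a directed acyclic graph and running them in topological order can be illustrated with the standard library (Python 3.9+ for `graphlib`). This is only a sketch of that ordering idea, not MacNet itself: `run_agents_in_topological_order` and the `act` callback are hypothetical, and a real agent would call a language model instead of concatenating strings.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set


def run_agents_in_topological_order(
    predecessors: Dict[str, Set[str]],
    act: Callable[[str, List[str]], str],
) -> Dict[str, str]:
    """Run each agent only after all of its upstream agents have produced output.

    `predecessors` maps an agent name to the set of agents it depends on.
    `act(agent, upstream_outputs)` is a hypothetical callback returning the agent's output.
    """
    outputs: Dict[str, str] = {}
    # static_order() yields every node after all of its predecessors.
    for agent in TopologicalSorter(predecessors).static_order():
        upstream = [outputs[p] for p in sorted(predecessors.get(agent, ()))]
        outputs[agent] = act(agent, upstream)
    return outputs


# A small diamond-shaped graph: a -> b, a -> c, then b and c -> d.
graph = {"b": {"a"}, "c": {"a"}, "d": {"b", "c"}}
result = run_agents_in_topological_order(
    graph, lambda agent, inputs: f"{agent}({','.join(inputs)})"
)
print(result["d"])
```

`TopologicalSorter` raises `graphlib.CycleError` if the graph contains a cycle, which matches the acyclicity requirement mentioned in the abstract.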

[NLP-51] Never Miss A Beat: An Efficient Recipe for Context Window Extension of Large Language Models with Consistent “Middle” Enhancement
[NLP-51] 永不错过:具有一致“中间”增强的大型语言模型上下文窗口扩展的高效方案

链接: https://arxiv.org/abs/2406.07138
作者: Tong Wu,Yanpeng Zhao,Zilong Zheng
关键词: pre-trained large language, large language models, effectively utilize information, large language
中文关键词: 预训练的大型语言、大型语言模型、有效利用信息、大型语言
类目: Computation and Language (cs.CL)
备注:

Abstract:Recently, many methods have been developed to extend the context length of pre-trained large language models (LLMs), but they often require fine-tuning at the target length (≫4K) and struggle to effectively utilize information from the middle part of the context. To address these issues, we propose Continuity-Relativity indExing with gAussian Middle (CREAM), which interpolates positional encodings by manipulating position indices. Apart from being simple, CREAM is training-efficient: it only requires fine-tuning at the pre-trained context window (e.g., Llama 2-4K) and can extend LLMs to a much longer target context length (e.g., 256K). To ensure that the model focuses more on the information in the middle, we introduce a truncated Gaussian to encourage sampling from the middle part of the context during fine-tuning, thus alleviating the “Lost-in-the-Middle” problem faced by long-context LLMs. Experimental results show that CREAM successfully extends LLMs to the target length for both Base and Chat versions of Llama2-7B with “Never Miss A Beat”. Our code will be publicly available soon.
摘要:最近,人们提出了许多方法来扩展预训练大型语言模型(LLM)的上下文长度,但这些方法往往需要在目标长度(≫4K)上进行微调,并且难以有效利用上下文中间部分的信息。为了解决这些问题,我们提出了CREAM(Continuity-Relativity indExing with gAussian Middle),它通过操纵位置索引来对位置编码进行插值。除了简单之外,CREAM的训练效率也很高:它只需要在预训练的上下文窗口(例如,Llama 2-4K)上进行微调,就可以将LLM扩展到长得多的目标上下文长度(例如,256K)。为了让模型更多地关注中间部分的信息,我们引入截断高斯分布,在微调过程中鼓励从上下文的中间部分采样,从而缓解长上下文LLM面临的“迷失在中间”(Lost-in-the-Middle)问题。实验结果表明,对于Llama2-7B的Base和Chat版本,CREAM都能成功地将LLM扩展到目标长度,做到“永不错过”(Never Miss A Beat)。我们的代码即将公开。
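
To make the "truncated Gaussian" idea concrete, the sketch below draws positions from a Gaussian centred on the middle of a range and rejects anything outside the range. It illustrates only the sampling bias toward the middle; it is not CREAM's actual index construction, and `sample_middle_position` and the `rel_std` parameter are assumptions made for this example.

```python
import random


def sample_middle_position(lo: int, hi: int, rng: random.Random, rel_std: float = 0.25) -> int:
    """Draw an integer in [lo, hi] from a Gaussian centred on the midpoint.

    Truncation is done by rejection: out-of-range draws are discarded.
    `rel_std` (hypothetical) sets the standard deviation as a fraction of the range.
    """
    if lo > hi:
        raise ValueError("lo must not exceed hi")
    mean = (lo + hi) / 2
    std = max((hi - lo) * rel_std, 1e-9)  # avoid a zero-width Gaussian when lo == hi
    while True:
        x = round(rng.gauss(mean, std))
        if lo <= x <= hi:
            return x


rng = random.Random(0)
draws = [sample_middle_position(0, 4095, rng) for _ in range(1000)]
in_middle_half = sum(1024 <= d <= 3071 for d in draws)
print(f"{in_middle_half} of {len(draws)} draws fall in the middle half of the window")
```

A uniform sampler would place about half of the draws in the middle half of the window; the truncated Gaussian places clearly more there, which is the bias the abstract describes.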

[NLP-52] Advancing Tool-Augmented Large Language Models: Integrating Insights from Errors in Inference Trees
[NLP-52] 推进工具增强大型语言模型:集成推理树中错误的见解

链接: https://arxiv.org/abs/2406.07115
作者: Sijia Chen,Yibo Wang,Yi-Feng Wu,Qing-Guo Chen,Zhao Xu,Weihua Luo,Kaifu Zhang,Lijun Zhang
关键词: intelligent agents interacting, Tool-augmented large language, tool-augmented LLMs compared, large language models, real world
中文关键词: 智能代理交互、工具增强的大型语言、工具增强的LLM比较、大型语言模型、现实世界
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

Abstract:Tool-augmented large language models (LLMs) leverage tools, often in the form of APIs, to enhance their reasoning capabilities on complex tasks, thus taking on the role of intelligent agents interacting with the real world. The recently introduced ToolLLaMA model by Qin et al. [2024] utilizes the depth-first search-based decision tree (DFSDT) method for reasoning with 16000+ real-world APIs, which effectively improves the planning and inferencing performance of tool-augmented LLMs compared to traditional chain reasoning approaches. However, their approach only employs successful paths from decision trees (also called inference trees) for supervised fine-tuning (SFT) during training, which does not fully exploit the advantages of the tree of thought. In this study, we propose an inference trajectory optimization framework based on the preference data extracted from decision trees to address this limitation. We first introduce a novel method for constructing preference data from the tree of thought, capitalizing on the failed explorations previously overlooked in the trees. Specifically, we generate an effective step-wise preference dataset, named ToolPreference, for tool use based on the ToolBench dataset. In the subsequent training phase, we first fine-tune the LLM with tool-usage expert trajectories and then use these step-wise preference pairs for direct preference optimization (DPO) to update the policy of the LLM, resulting in our ToolPrefer-LLaMA (TP-LLaMA) model. Our experiments demonstrate that by obtaining insights from errors in inference trees, TP-LLaMA significantly outperforms the baselines across almost all test scenarios by a large margin and exhibits better generalization capabilities with unseen APIs. At the same time, TP-LLaMA has also demonstrated superior reasoning efficiency compared to the baselines, making it more suitable for complex tool-usage reasoning tasks.
摘要:工具增强的大型语言模型(LLM)利用工具(通常以API的形式)来增强其在复杂任务上的推理能力,从而扮演与现实世界交互的智能代理的角色。Qin等人[2024]最近提出的ToolLLaMA模型利用基于深度优先搜索的决策树(DFSDT)方法,在16000多个真实世界API上进行推理,与传统的链式推理方法相比,有效提高了工具增强LLM的规划和推理性能。然而,他们的方法在训练时只使用决策树(也称为推理树)中的成功路径进行监督微调(SFT),没有充分利用思维树的优势。在本研究中,我们提出了一种基于从决策树中提取的偏好数据的推理轨迹优化框架来解决这一局限。我们首先介绍一种从思维树构建偏好数据的新方法,利用树中此前被忽视的失败探索。具体来说,我们基于ToolBench数据集生成了一个有效的、面向工具使用的逐步偏好数据集,名为ToolPreference。在随后的训练阶段,我们首先用工具使用的专家轨迹对LLM进行微调,然后使用这些逐步偏好对进行直接偏好优化(DPO)来更新LLM的策略,从而得到我们的ToolPrefer-LLaMA(TP-LLaMA)模型。我们的实验表明,通过从推理树中的错误中获得洞见,TP-LLaMA在几乎所有测试场景中都大幅优于基线,并且在未见过的API上表现出更好的泛化能力。同时,与基线相比,TP-LLaMA还展现出更高的推理效率,使其更适合复杂的工具使用推理任务。
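
One plausible way to turn a decision tree with failed branches into step-wise preference pairs is sketched below. The `Step` structure, the `succeeded` flag and the pairing rule (a successful child is preferred over each failed sibling) are assumptions made for illustration; the abstract does not spell out the exact construction used for ToolPreference.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Step:
    """One node of a hypothetical inference tree."""
    action: str
    succeeded: bool = False  # True if this step lies on a path that reached a valid answer
    children: List["Step"] = field(default_factory=list)


def stepwise_preference_pairs(root: Step) -> List[Tuple[str, str, str]]:
    """Return (context, chosen, rejected) triples along the successful paths."""
    pairs: List[Tuple[str, str, str]] = []

    def visit(node: Step, history: List[str]) -> None:
        path = history + [node.action]
        context = " -> ".join(path)
        good = [c for c in node.children if c.succeeded]
        bad = [c for c in node.children if not c.succeeded]
        # At each branching point, prefer every successful child over every failed sibling.
        for g in good:
            for b in bad:
                pairs.append((context, g.action, b.action))
        # Only successful children are expanded further.
        for g in good:
            visit(g, path)

    visit(root, [])
    return pairs


tree = Step("start", True, [
    Step("call_weather_api", True, [Step("finish", True)]),
    Step("call_news_api", False),
])
for triple in stepwise_preference_pairs(tree):
    print(triple)
```

Triples of this shape are the usual input format for direct preference optimization, where each pair shares a context and differs only in the next step.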

[NLP-53] Efficiently Exploring Large Language Models for Document-Level Machine Translation with In-context Learning
[NLP-53] 通过上下文学习有效探索文档级机器翻译的大型语言模型

链接: https://arxiv.org/abs/2406.07081
作者: Menglong Cui,Jiangcun Du,Shaolin Zhu,Deyi Xiong
关键词: Large language models, exhibit outstanding performance, Large language, in-context learning, language models
中文关键词: 大型语言模型,表现出出色的性能,大型语言,上下文学习,语言模型
类目: Computation and Language (cs.CL)
备注: Accepted to ACL2024 long paper (Findings)

Abstract:Large language models (LLMs) exhibit outstanding performance in machine translation via in-context learning. In contrast to sentence-level translation, document-level translation (DOCMT) by LLMs based on in-context learning faces two major challenges: firstly, document translations generated by LLMs are often incoherent; secondly, the length of demonstration for in-context learning is usually limited. To address these issues, we propose a Context-Aware Prompting method (CAP), which enables LLMs to generate more accurate, cohesive, and coherent translations via in-context learning. CAP takes into account multi-level attention, selects the most relevant sentences to the current one as context, and then generates a summary from these collected sentences. Subsequently, sentences most similar to the summary are retrieved from the datastore as demonstrations, which effectively guide LLMs in generating cohesive and coherent translations. We conduct extensive experiments across various DOCMT tasks, and the results demonstrate the effectiveness of our approach, particularly in zero pronoun translation (ZPT) and literary translation tasks.
摘要:大型语言模型(LLM)通过上下文学习在机器翻译中表现出优异的性能。与句子级翻译不同,LLM基于上下文学习进行的文档级翻译(DOCMT)面临两大挑战:第一,LLM生成的文档译文往往不连贯;第二,上下文学习所用示例的长度通常有限。为了解决这些问题,我们提出了一种上下文感知提示方法(CAP),使LLM能够通过上下文学习生成更准确、更衔接、更连贯的译文。CAP考虑多层次注意力,选择与当前句子最相关的句子作为上下文,然后根据这些收集到的句子生成摘要。随后,从数据存储中检索出与该摘要最相似的句子作为示例,从而有效引导LLM生成衔接且连贯的译文。我们在多种DOCMT任务上进行了大量实验,结果证明了我们方法的有效性,尤其是在零代词翻译(ZPT)和文学翻译任务上。
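
The retrieval step (pick the datastore sentences most similar to a summary and use them as demonstrations) can be sketched with a simple bag-of-words cosine similarity. The similarity measure, the function names and the toy datastore are assumptions for this example; the abstract does not say which similarity CAP uses.

```python
import math
from collections import Counter
from typing import List


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve_demonstrations(summary: str, datastore: List[str], k: int = 2) -> List[str]:
    """Return the k datastore sentences most similar to the summary."""
    query = Counter(summary.lower().split())
    ranked = sorted(
        datastore,
        key=lambda s: cosine(query, Counter(s.lower().split())),
        reverse=True,
    )
    return ranked[:k]


store = [
    "stock prices fell sharply",
    "the hero leaves home",
    "the hero returns home at dawn",
]
print(retrieve_demonstrations("the hero returns home", store, k=2))
```

A production system would more likely use dense sentence embeddings, but the ranking logic stays the same.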

[NLP-54] DARA: Decomposition-Alignment-Reasoning Autonomous Language Agent for Question Answering over Knowledge Graphs
[NLP-54] DARA:用于知识图问题解答的分解-对齐-推理自治语言代理

链接: https://arxiv.org/abs/2406.07080
作者: Haishuo Fang,Xiaodan Zhu,Iryna Gurevych
关键词: Knowledge Graphs, well-functioning autonomous language, Answering Questions, autonomous language agents, Large Language Models
中文关键词: 知识图、功能良好的自治语言、回答问题、自治语言代理、大型语言模型
类目: Computation and Language (cs.CL)
备注: Accepted by ACL2024 findings

Abstract:Answering Questions over Knowledge Graphs (KGQA) is key to well-functioning autonomous language agents in various real-life applications. To improve the neural-symbolic reasoning capabilities of language agents powered by Large Language Models (LLMs) in KGQA, we propose the Decomposition-Alignment-Reasoning Agent (DARA) framework. DARA effectively parses questions into formal queries through a dual mechanism: high-level iterative task decomposition and low-level task grounding. Importantly, DARA can be efficiently trained with a small number of high-quality reasoning trajectories. Our experimental results demonstrate that DARA fine-tuned on LLMs (e.g. Llama-2-7B, Mistral) outperforms both in-context learning-based agents with GPT-4 and alternative fine-tuned agents, across different benchmarks in zero-shot evaluation, making such models more accessible for real-life applications. We also show that DARA attains performance comparable to state-of-the-art enumerating-and-ranking-based methods for KGQA.
摘要:基于知识图谱的问答(KGQA)是自主语言代理在各种实际应用中良好运行的关键。为了提高由大型语言模型(LLM)驱动的语言代理在KGQA中的神经符号推理能力,我们提出了分解-对齐-推理代理(DARA)框架。DARA通过一种双重机制将问题有效地解析为形式化查询:高层的迭代式任务分解和低层的任务落地(grounding)。重要的是,DARA只需少量高质量的推理轨迹即可高效训练。我们的实验结果表明,在零样本评估的不同基准上,基于LLM(例如Llama-2-7B、Mistral)微调的DARA优于使用GPT-4的基于上下文学习的代理以及其他微调代理,使此类模型更易用于实际应用。我们还表明,DARA在KGQA上达到了与最先进的基于枚举和排序的方法相当的性能。

[NLP-55] HalluDial: A Large-Scale Benchmark for Automatic Dialogue-Level Hallucination Evaluation
[NLP-55] HalluDial:自动对话级幻觉评估的大规模基准

链接: https://arxiv.org/abs/2406.07070
作者: Wen Luo,Tianshu Shen,Wei Li,Guangyue Peng,Richeng Xuan,Houfeng Wang,Xi Yang
关键词: Natural Language Processing, Large Language Models, widespread real-world applications, enabling widespread real-world, field of Natural
中文关键词: 自然语言处理、大型语言模型、广泛的现实世界应用程序,实现广泛的现实世界、自然领域
类目: Computation and Language (cs.CL)
备注:

Abstract:Large Language Models (LLMs) have significantly advanced the field of Natural Language Processing (NLP), achieving remarkable performance across diverse tasks and enabling widespread real-world applications. However, LLMs are prone to hallucination, generating content that either conflicts with established knowledge or is unfaithful to the original sources. Existing hallucination benchmarks primarily focus on sentence- or passage-level hallucination detection, neglecting dialogue-level evaluation, hallucination localization, and rationale provision. They also predominantly target factuality hallucinations while underestimating faithfulness hallucinations, often relying on labor-intensive or non-specialized evaluators. To address these limitations, we propose HalluDial, the first comprehensive large-scale benchmark for automatic dialogue-level hallucination evaluation. HalluDial encompasses both spontaneous and induced hallucination scenarios, covering factuality and faithfulness hallucinations. The benchmark includes 4,094 dialogues with a total of 146,856 samples. Leveraging HalluDial, we conduct a comprehensive meta-evaluation of LLMs’ hallucination evaluation capabilities in information-seeking dialogues and introduce a specialized judge language model, HalluJudge. The high data quality of HalluDial enables HalluJudge to achieve superior or competitive performance in hallucination evaluation, facilitating the automatic assessment of dialogue-level hallucinations in LLMs and providing valuable insights into this phenomenon. The dataset and the code are available at this https URL.
摘要:大型语言模型(LLM)极大地推动了自然语言处理(NLP)领域的发展,在多种任务上取得了显著的性能,并使广泛的现实应用成为可能。然而,LLM容易产生幻觉,生成的内容要么与既有知识相冲突,要么不忠实于原始来源。现有的幻觉基准主要关注句子级或段落级的幻觉检测,忽略了对话级评估、幻觉定位和理由提供。它们还主要针对事实性幻觉,而低估了忠实性幻觉,并且往往依赖劳动密集型或非专门化的评估者。为了解决这些局限,我们提出了HalluDial,这是第一个用于自动对话级幻觉评估的全面大规模基准。HalluDial同时包含自发幻觉和诱导幻觉两类场景,涵盖事实性幻觉和忠实性幻觉。该基准包括4,094个对话,共计146,856个样本。借助HalluDial,我们对LLM在信息寻求型对话中的幻觉评估能力进行了全面的元评估,并引入了一个专门的评判语言模型HalluJudge。HalluDial的高数据质量使HalluJudge在幻觉评估方面取得了更优或具有竞争力的表现,有助于自动评估LLM的对话级幻觉,并为这一现象提供有价值的见解。数据集和代码可在此https URL获取。

[NLP-56] Reading Miscue Detection in Primary School through Automatic Speech Recognition
[NLP-56] 基于自动语音识别的小学阅读错误检测

链接: https://arxiv.org/abs/2406.07060
作者: Lingyun Gao,Cristian Tejedor-Garcia,Helmer Strik,Catia Cucchiarini
关键词: accessing reading exercises, Automatic Speech Recognition, Automatic reading diagnosis, reading diagnosis systems, feedback more easily
中文关键词: 访问阅读练习、自动语音识别、自动阅读诊断、阅读诊断系统、更轻松地反馈
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Proc. INTERSPEECH 2024, 1-5 September 2024. Kos Island, Greece

Abstract:Automatic reading diagnosis systems can benefit both teachers for more efficient scoring of reading exercises and students for accessing reading exercises with feedback more easily. However, there are limited studies on Automatic Speech Recognition (ASR) for child speech in languages other than English, and limited research on ASR-based reading diagnosis systems. This study investigates how efficiently state-of-the-art (SOTA) pretrained ASR models recognize Dutch native children speech and manage to detect reading miscues. We found that Hubert Large finetuned on Dutch speech achieves SOTA phoneme-level child speech recognition (PER at 23.1%), while Whisper (Faster Whisper Large-v2) achieves SOTA word-level performance (WER at 9.8%). Our findings suggest that Wav2Vec2 Large and Whisper are the two best ASR models for reading miscue detection. Specifically, Wav2Vec2 Large shows the highest recall at 0.83, whereas Whisper exhibits the highest precision at 0.52 and an F1 score of 0.52.
摘要:自动阅读诊断系统既能帮助教师更高效地为阅读练习评分,也能让学生更方便地获得带反馈的阅读练习。然而,针对英语以外语言的儿童语音的自动语音识别(ASR)研究有限,基于ASR的阅读诊断系统的研究也很有限。本研究考察了最先进的(SOTA)预训练ASR模型识别荷兰语母语儿童语音并检测阅读错误的效果。我们发现,在荷兰语语音上微调的Hubert Large取得了SOTA的音素级儿童语音识别性能(PER为23.1%),而Whisper(Faster Whisper Large-v2)取得了SOTA的词级性能(WER为9.8%)。我们的研究结果表明,Wav2Vec2 Large和Whisper是阅读错误检测方面最好的两个ASR模型。具体来说,Wav2Vec2 Large的召回率最高,为0.83,而Whisper的精确率最高,为0.52,F1得分为0.52。
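
For readers unfamiliar with the WER figure quoted above, word error rate is the word-level edit distance between the reference and the recognized text, divided by the number of reference words. The sketch below is a generic textbook implementation with an invented Dutch example; it is not the evaluation code used in the paper, and `word_error_rate` is a name chosen for this example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Invented example: one substitution ("zit" -> "zat") and one deletion ("de").
print(word_error_rate("de kat zit op de mat", "de kat zat op mat"))
```

Phoneme error rate (PER) is computed the same way over phoneme sequences instead of words.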

[NLP-57] Benchmarking Trustworthiness of Multimodal Large Language Models: A Comprehensive Study
[NLP-57] 多模式大型语言模型的可信度基准:全面研究

链接: https://arxiv.org/abs/2406.07057
作者: Yichi Zhang,Yao Huang,Yitong Sun,Chang Liu,Zhe Zhao,Zhengwei Fang,Yifan Wang,Huanran Chen,Xiao Yang,Xingxing Wei,Hang Su,Yinpeng Dong,Jun Zhu
关键词: Large Language Models, Multimodal Large Language, Large Language, significant trustworthiness challenges, face significant trustworthiness
中文关键词: 大型语言模型,多模态大型语言,大型语言,重大可信度挑战,面临重大可信度
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 100 pages, 84 figures, 33 tables

Abstract:Despite the superior capabilities of Multimodal Large Language Models (MLLMs) across diverse tasks, they still face significant trustworthiness challenges. Yet, current literature on the assessment of trustworthy MLLMs remains limited, lacking a holistic evaluation to offer thorough insights into future improvements. In this work, we establish MultiTrust, the first comprehensive and unified benchmark on the trustworthiness of MLLMs across five primary aspects: truthfulness, safety, robustness, fairness, and privacy. Our benchmark employs a rigorous evaluation strategy that addresses both multimodal risks and cross-modal impacts, encompassing 32 diverse tasks with self-curated datasets. Extensive experiments with 21 modern MLLMs reveal some previously unexplored trustworthiness issues and risks, highlighting the complexities introduced by the multimodality and underscoring the necessity for advanced methodologies to enhance their reliability. For instance, typical proprietary models still struggle with the perception of visually confusing images and are vulnerable to multimodal jailbreaking and adversarial attacks; MLLMs are more inclined to disclose privacy in text and reveal ideological and cultural biases even when paired with irrelevant images in inference, indicating that the multimodality amplifies the internal risks from base LLMs. Additionally, we release a scalable toolbox for standardized trustworthiness research, aiming to facilitate future advancements in this important field. Code and resources are publicly available at: this https URL.
摘要:尽管多模态大型语言模型(MLLM)在各种任务中具有卓越的能力,但它们仍然面临重大的可信度挑战。然而,目前关于可信MLLM评估的文献仍然有限,缺乏能为未来改进提供透彻见解的整体性评估。在这项工作中,我们建立了MultiTrust,这是第一个关于MLLM可信度的全面且统一的基准,涵盖五个主要方面:真实性、安全性、鲁棒性、公平性和隐私。我们的基准采用严格的评估策略,同时考察多模态风险和跨模态影响,包含32项多样化的任务和自行整理的数据集。对21个现代MLLM进行的大量实验揭示了一些此前未被探索的可信度问题和风险,凸显了多模态带来的复杂性,并强调需要先进的方法来提高其可靠性。例如,典型的专有模型仍然难以感知视觉上令人困惑的图像,并且容易受到多模态越狱和对抗攻击;MLLM更倾向于在文本中泄露隐私,并且即使在推理时与无关图像配对,也会暴露意识形态和文化偏见,这表明多模态放大了来自基础LLM的内在风险。此外,我们还发布了一个用于标准化可信度研究的可扩展工具箱,旨在促进这一重要领域的未来发展。代码和资源可在此https URL公开获取。

[NLP-58] Effectively Compress KV Heads for LLM
[NLP-58] 有效压缩LLM的KV头部

链接: https://arxiv.org/abs/2406.07056
作者: Hao Yu,Zelan Yang,Shen Li,Yong Li,Jianxin Wu
关键词: language processing tasks, pre-trained large language, natural language processing, large language models, large language
中文关键词: 语言处理任务、预训练的大型语言、自然语言处理、大型语言模型、大型语言
类目: Computation and Language (cs.CL)
备注:

Abstract:The advent of pre-trained large language models (LLMs) has revolutionized various natural language processing tasks. These models predominantly employ an auto-regressive decoding mechanism that utilizes Key-Value (KV) caches to eliminate redundant calculations for previous tokens. Nevertheless, as context lengths and batch sizes increase, the linear expansion in memory footprint of KV caches becomes a key bottleneck of LLM deployment, which decreases generation speeds significantly. To mitigate this issue, previous techniques like multi-query attention (MQA) and grouped-query attention (GQA) have been developed, in order to reduce KV heads to accelerate inference with comparable accuracy to multi-head attention (MHA). Despite their effectiveness, existing strategies for compressing MHA often overlook the intrinsic properties of the KV caches. In this work, we explore the low-rank characteristics of the KV caches and propose a novel approach for compressing KV heads. In particular, we carefully optimize the MHA-to-GQA transformation to minimize compression error, and to remain compatible with rotary position embeddings (RoPE), we also introduce specialized strategies for key caches with RoPE. We demonstrate that our method can compress half or even three-quarters of KV heads while maintaining performance comparable to the original LLMs, which presents a promising direction for more efficient LLM deployment in resource-constrained environments.
摘要:预训练大型语言模型(LLM)的出现彻底改变了各种自然语言处理任务。这些模型主要采用自回归解码机制,利用键值(KV)缓存来避免对先前词元的重复计算。然而,随着上下文长度和批大小的增加,KV缓存内存占用的线性增长成为LLM部署的关键瓶颈,显著降低了生成速度。为了缓解这一问题,此前已提出多查询注意力(MQA)和分组查询注意力(GQA)等技术,通过减少KV头来加速推理,同时保持与多头注意力(MHA)相当的精度。尽管这些方法有效,现有的MHA压缩策略往往忽略了KV缓存的内在性质。在这项工作中,我们探索了KV缓存的低秩特性,并提出了一种压缩KV头的新方法。具体而言,我们仔细优化了MHA到GQA的转换以最小化压缩误差;为了保持与旋转位置嵌入(RoPE)的兼容,我们还针对带有RoPE的键缓存引入了专门的策略。我们证明,我们的方法可以压缩一半甚至四分之三的KV头,同时保持与原始LLM相当的性能,这为在资源受限的环境中更高效地部署LLM提供了一个有前景的方向。
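
The linear growth of the KV cache, and the saving from removing three-quarters of the KV heads, can be checked with simple arithmetic. The configuration below (32 layers, 32 KV heads, head dimension 128, 16-bit values, 4096 tokens) is an illustrative assumption for a 7B-class model, not a figure taken from the paper, and `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int = 1,
    bytes_per_value: int = 2,  # 16-bit floats
) -> int:
    """Bytes needed to cache keys and values for every layer, head and position."""
    keys_and_values = 2  # one tensor for K and one for V
    return (keys_and_values * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


GIB = 2 ** 30
full = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)
quarter = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=4096)
print(full / GIB)     # 2.0
print(quarter / GIB)  # 0.5
```

Because every factor enters multiplicatively, doubling the context length or the batch size doubles the cache, while keeping a quarter of the KV heads cuts it to a quarter.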

[NLP-59] CoEvol: Constructing Better Responses for Instruction Finetuning through Multi-Agent Cooperation
[NLP-59] CoEvol:通过多智能体协作为指令微调构建更好的响应

链接: https://arxiv.org/abs/2406.07054
作者: Renhao Li,Minghuan Tan,Derek F. Wong,Min Yang
关键词: garnered considerable attention, large language models, enhance model performance, recent years, unseen tasks
中文关键词: 引起了相当大的关注,大型语言模型,增强模型性能,近年来,看不见的任务
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

Abstract:In recent years, instruction fine-tuning (IFT) on large language models (LLMs) has garnered considerable attention to enhance model performance on unseen tasks. Attempts have been made on automatic construction and effective selection for IFT data. However, we posit that previous methods have not fully harnessed the potential of LLMs for enhancing data quality. The responses within IFT data could be further enhanced by leveraging the capabilities of LLMs themselves. In this paper, we propose CoEvol, an LLM-based multi-agent cooperation framework for the improvement of responses to instructions. To effectively refine the responses, we develop an iterative framework following a debate-advise-edit-judge paradigm. A two-stage multi-agent debate strategy is further devised to ensure the diversity and reliability of editing suggestions within the framework. Empirically, models equipped with CoEvol outperform competitive baselines evaluated by MT-Bench and AlpacaEval, demonstrating its effectiveness in enhancing instruction-following capabilities for LLMs.
摘要:近年来,面向大型语言模型(LLM)的指令微调(IFT)因能提升模型在未见任务上的性能而受到广泛关注。已有工作尝试对IFT数据进行自动构建和有效筛选。然而,我们认为以往的方法尚未充分发挥LLM在提升数据质量方面的潜力。借助LLM自身的能力,IFT数据中的响应还可以得到进一步改进。本文提出了CoEvol,一个基于LLM的多智能体协作框架,用于改进对指令的响应。为了有效地精炼响应,我们开发了一个遵循“辩论-建议-编辑-评判”范式的迭代框架,并进一步设计了两阶段多智能体辩论策略,以确保框架内编辑建议的多样性和可靠性。实验表明,使用CoEvol的模型在MT-Bench和AlpacaEval的评估中优于有竞争力的基线,证明了其在增强LLM指令遵循能力方面的有效性。

[NLP-60] Paying More Attention to Source Context: Mitigating Unfaithful Translations from Large Language Model
[NLP-60] 更加关注源上下文:减少大型语言模型中的不忠实翻译

链接: https://arxiv.org/abs/2406.07036
作者: Hongbin Zhang,Kehai Chen,Xuefeng Bai,Yang Xiang,Min Zhang
关键词: Large language models, showcased impressive multilingual, impressive multilingual machine, machine translation ability, multilingual machine translation
中文关键词: 大型语言模型,展示了令人印象深刻的多语言、令人印象深刻的多语言机器、机器翻译能力、多语言机器翻译
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted by ACL2024 Findings

Abstract:Large language models (LLMs) have showcased impressive multilingual machine translation ability. However, unlike encoder-decoder style models, decoder-only LLMs lack an explicit alignment between source and target contexts. Analyzing contribution scores during generation processes revealed that LLMs can be biased towards previously generated tokens over corresponding source tokens, leading to unfaithful translations. To address this issue, we propose to encourage LLMs to pay more attention to the source context from both source and target perspectives in zeroshot prompting: 1) adjust source context attention weights; 2) suppress irrelevant target prefix influence; Additionally, we propose 3) avoiding over-reliance on the target prefix in instruction tuning. Experimental results from both human-collected unfaithfulness test sets focusing on LLM-generated unfaithful translations and general test sets, verify our methods’ effectiveness across multiple language pairs. Further human evaluation shows our method’s efficacy in reducing hallucinatory translations and facilitating faithful translation generation.

[NLP-61] Improving Multi-hop Logical Reasoning in Knowledge Graphs with Context-Aware Query Representation Learning

Link: https://arxiv.org/abs/2406.07034
Authors: Jeonghoon Kim, Heesoo Jung, Hyeju Jang, Hogun Park
Keywords: answer First-Order Logic, natural language processing, First-Order Logic, numerous approaches aiming, FOL query graph
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted to ACL 2024 Findings

Abstract:Multi-hop logical reasoning on knowledge graphs is a pivotal task in natural language processing, with numerous approaches aiming to answer First-Order Logic (FOL) queries. Recent geometry (e.g., box, cone) and probability (e.g., beta distribution)-based methodologies have effectively addressed complex FOL queries. However, a common challenge across these methods lies in determining accurate geometric bounds or probability parameters for these queries. The challenge arises because existing methods rely on linear sequential operations within their computation graphs, overlooking the logical structure of the query and the relation-induced information that can be gleaned from the relations of the query, which we call the context of the query. To address the problem, we propose a model-agnostic methodology that enhances the effectiveness of existing multi-hop logical reasoning approaches by fully integrating the context of the FOL query graph. Our approach distinctively discerns (1) the structural context inherent to the query structure and (2) the relation-induced context unique to each node in the query graph as delineated in the corresponding knowledge graph. This dual-context paradigm helps nodes within a query graph attain refined internal representations throughout the multi-hop reasoning steps. Through experiments on two datasets, our method consistently enhances the three multi-hop reasoning foundation models, achieving performance improvements of up to 19.5%. Our code is available at this https URL.

[NLP-62] MoreauPruner: Robust Pruning of Large Language Models against Weight Perturbations

Link: https://arxiv.org/abs/2406.07017
Authors: Zixiao Wang, Jingwei Zhang, Wenqian Zhao, Farzan Farnia, Bei Yu
Keywords: few-shot gradient pruning, Few-shot gradient, Few-shot gradient methods, potential weight perturbations, regarded as static
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Abstract:Few-shot gradient methods have been extensively utilized in existing model pruning methods, where the model weights are regarded as static values and the effects of potential weight perturbations are not considered. However, the widely used large language models (LLMs) have several billion model parameters, which could increase the fragility of few-shot gradient pruning. In this work, we experimentally show that one-shot gradient pruning algorithms could lead to unstable results under perturbations to model weights. And the minor error of switching between data formats bfloat16 and float16 could result in drastically different outcomes. To address such instabilities, we leverage optimization analysis and propose an LLM structural pruning method, called MoreauPruner, with provable robustness against weight perturbations. In MoreauPruner, the model weight importance is estimated based on the neural network’s Moreau envelope, which can be flexibly combined with $\ell_1$-norm regularization techniques to induce the sparsity required in the pruning task. We extensively evaluate the MoreauPruner algorithm on several well-known LLMs, including LLaMA-7B, LLaMA-13B, LLaMA3-8B, and Vicuna-7B. Our numerical results suggest the robustness of MoreauPruner against weight perturbations, and indicate the MoreauPruner’s successful accuracy-based scores in comparison to several existing pruning methods. We have released the code at this https URL.
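
For readers unfamiliar with the term, the Moreau envelope mentioned in the abstract has a standard definition in optimization. The notation below is chosen for this note and is an assumption: f stands for the loss as a function of the weights w, and λ is a smoothing parameter. The paper's exact formulation may differ.

```latex
M_{\lambda} f(w) \;=\; \min_{v} \Big\{ f(v) + \frac{1}{2\lambda}\,\lVert v - w \rVert_2^2 \Big\}, \qquad \lambda > 0
```

Because the envelope takes a minimum over points v near w, its value depends on a neighbourhood of the weights rather than on w alone, which is one intuition for why an importance score built on it could be less sensitive to small weight perturbations. How the envelope is turned into a pruning score is not stated in the abstract.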

[NLP-63] Delving into ChatGPT usage in academic writing through excess vocabulary

Link: https://arxiv.org/abs/2406.07016
Authors: Dmitry Kobak, Rita González Márquez, Emőke-Ágnes Horvát, Jan Lause
Keywords: Recent large language, large language models, Recent large, human-level performance, systems like ChatGPT
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Digital Libraries (cs.DL); Social and Information Networks (cs.SI)

Abstract:Recent large language models (LLMs) can generate and revise text with human-level performance, and have been widely commercialized in systems like ChatGPT. These models come with clear limitations: they can produce inaccurate information, reinforce existing biases, and be easily misused. Yet, many scientists have been using them to assist their scholarly writing. How wide-spread is LLM usage in the academic literature currently? To answer this question, we use an unbiased, large-scale approach, free from any assumptions on academic LLM usage. We study vocabulary changes in 14 million PubMed abstracts from 2010-2024, and show how the appearance of LLMs led to an abrupt increase in the frequency of certain style words. Our analysis based on excess words usage suggests that at least 10% of 2024 abstracts were processed with LLMs. This lower bound differed across disciplines, countries, and journals, and was as high as 30% for some PubMed sub-corpora. We show that the appearance of LLM-based writing assistants has had an unprecedented impact in the scientific literature, surpassing the effect of major world events such as the Covid pandemic.
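
The excess-vocabulary idea can be illustrated with a small sketch. Everything below is an assumption made for illustration: the linear extrapolation from two earlier years, the function names, and the toy frequencies are not taken from the paper.

```python
def expected_frequency(freq_two_years_ago, freq_last_year):
    """Counterfactual frequency, linearly extrapolated from two earlier years.

    The extrapolation rule is an assumption of this sketch.
    """
    trend = freq_last_year - freq_two_years_ago
    return max(freq_last_year + trend, 0.0)


def excess_usage(observed, expected):
    """Return (gap, ratio) between observed and expected word frequency."""
    gap = observed - expected
    ratio = observed / expected if expected > 0 else float("inf")
    return gap, ratio


# Toy numbers (made up): share of abstracts that contain some style word.
expected = expected_frequency(0.0010, 0.0011)
gap, ratio = excess_usage(0.0050, expected)
print(f"expected={expected:.4f} gap={gap:.4f} ratio={ratio:.2f}")
```

Under the assumption that the excess is caused by LLM use, the gap for a single word gives a rough lower bound on the share of affected abstracts, since an LLM-processed abstract need not contain that particular word.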

[NLP-64] Bridging Language Gaps in Audio-Text Retrieval

Link: https://arxiv.org/abs/2406.07012
Authors: Zhiyong Yan, Heinrich Dinkel, Yongqing Wang, Jizhong Liu, Junbo Zhang, Yujun Wang, Bin Wang
Keywords: challenging task, requiring the search, English audio-text retrieval, Audio-text retrieval, text caption
Categories: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: Interspeech 2024

Abstract:Audio-text retrieval is a challenging task, requiring the search for an audio clip or a text caption within a database. The predominant focus of existing research on English descriptions poses a limitation on the applicability of such models, given the abundance of non-English content in real-world data. To address these linguistic disparities, we propose a language enhancement (LE), using a multilingual text encoder (SONAR) to encode the text data with language-specific information. Additionally, we optimize the audio encoder through the application of consistent ensemble distillation (CED), enhancing support for variable-length audio-text retrieval. Our methodology excels in English audio-text retrieval, demonstrating state-of-the-art (SOTA) performance on commonly used datasets such as AudioCaps and Clotho. Simultaneously, the approach exhibits proficiency in retrieving content in seven other languages with only 10% of additional language-enhanced training data, yielding promising results. The source code is publicly available at this https URL.

[NLP-65] Crayon: Customized On-Device LLM via Instant Adapter Blending and Edge-Server Hybrid Inference

Link: https://arxiv.org/abs/2406.07007
Authors: Jihwan Bang, Juntae Lee, Kyuhong Shim, Seunghan Yang, Simyung Chang
Keywords: large language models, large language, On-device LLMs, LLMs, language models
Categories: Computation and Language (cs.CL)
Comments: ACL 2024 Main

Abstract:The customization of large language models (LLMs) for user-specified tasks is becoming increasingly important. However, maintaining all the customized LLMs on cloud servers incurs substantial memory and computational overheads, and uploading user data can also lead to privacy concerns. On-device LLMs can offer a promising solution by mitigating these issues. Yet, the performance of on-device LLMs is inherently constrained by the limitations of small-scaled models. To overcome these restrictions, we first propose Crayon, a novel approach for on-device LLM customization. Crayon begins by constructing a pool of diverse base adapters, and then we instantly blend them into a customized adapter without extra training. In addition, we develop a device-server hybrid inference strategy, which deftly allocates more demanding queries or non-customized tasks to a larger, more capable LLM on a server. This ensures optimal performance without sacrificing the benefits of on-device customization. We carefully craft a novel benchmark from multiple question-answer datasets, and show the efficacy of our method in the LLM customization.

[NLP-66] Mitigating Boundary Ambiguity and Inherent Bias for Text Classification in the Era of Large Language Models

Link: https://arxiv.org/abs/2406.07001
Authors: Zhenyi Lu, Jie Tian, Wei Wei, Xiaoye Qu, Yu Cheng, Wenfeng Xie, Dangyang Chen
Keywords: large language models, crucial task encountered, task encountered frequently, Text classification, practical scenarios
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: ACL2024 findings

Abstract:Text classification is a crucial task encountered frequently in practical scenarios, yet it is still under-explored in the era of large language models (LLMs). This study shows that LLMs are vulnerable to changes in the number and arrangement of options in text classification. Our extensive empirical analyses reveal that the key bottleneck arises from ambiguous decision boundaries and inherent biases towards specific tokens and positions. To mitigate these issues, we make the first attempt and propose a novel two-stage classification framework for LLMs. Our approach is grounded in the empirical observation that pairwise comparisons can effectively alleviate boundary ambiguity and inherent bias. Specifically, we begin with a self-reduction technique to efficiently narrow down numerous options, which contributes to reduced decision space and a faster comparison process. Subsequently, pairwise contrastive comparisons are employed in a chain-of-thought manner to draw out nuances and distinguish confusable options, thus refining the ambiguous decision boundary. Extensive experiments on four datasets (Banking77, HWU64, LIU54, and Clinic150) verify the effectiveness of our framework. Furthermore, benefitting from our framework, various LLMs can achieve consistent improvements. Our code and data are available at this https URL.
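
The two-stage shape described in the abstract (narrow the options first, then compare the survivors pairwise) can be sketched as below. This is a minimal illustration, not the authors' implementation: `score` and `prefer` are hypothetical callables that stand in for LLM calls, the shortlist size `k` is an assumed parameter, and the toy stand-ins use simple word overlap. The label strings are illustrative.

```python
def two_stage_classify(text, labels, score, prefer, k=3):
    """Stage 1 keeps the k best-scoring labels; stage 2 runs pairwise duels."""
    # Stage 1 (reduction): shrink the option set before any comparison.
    shortlist = sorted(labels, key=lambda lab: score(text, lab), reverse=True)[:k]
    # Stage 2 (pairwise comparison): the current winner faces each challenger.
    best = shortlist[0]
    for challenger in shortlist[1:]:
        best = prefer(text, best, challenger)
    return best


def toy_score(text, label):
    """Hypothetical stand-in for an LLM relevance judgment: word overlap."""
    return len(set(text.lower().split()) & set(label.lower().split("_")))


def toy_prefer(text, a, b):
    """Hypothetical stand-in for an LLM pairwise comparison."""
    return a if toy_score(text, a) >= toy_score(text, b) else b


labels = ["card_arrival", "declined_card_payment", "exchange_rate", "top_up_failed"]
print(two_stage_classify("my card payment was declined", labels, toy_score, toy_prefer))
```

With n options, the reduction step means only k - 1 pairwise comparisons are needed instead of n - 1, which matches the abstract's point about a smaller decision space and a faster comparison process.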

[NLP-67] Missingness-resilient Video-enhanced Multimodal Disfluency Detection

Link: https://arxiv.org/abs/2406.06964
Authors: Payal Mohapatra, Shamika Likhite, Subrata Biswas, Bashima Islam, Qi Zhu
Keywords: existing speech disfluency, speech disfluency detection, existing speech, rely upon acoustic, disfluency detection
Categories: Computation and Language (cs.CL); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Accepted to Interspeech 2024

Abstract:Most existing speech disfluency detection techniques only rely upon acoustic data. In this work, we present a practical multimodal disfluency detection approach that leverages available video data together with audio. We curate an audiovisual dataset and propose a novel fusion technique with unified weight-sharing modality-agnostic encoders to learn the temporal and semantic context. Our resilient design accommodates real-world scenarios where the video modality may sometimes be missing during inference. We also present alternative fusion strategies when both modalities are assured to be complete. In experiments across five disfluency-detection tasks, our unified multimodal approach significantly outperforms Audio-only unimodal methods, yielding an average absolute improvement of 10% (i.e., 10 percentage point increase) when both video and audio modalities are always available, and 7% even when video modality is missing in half of the samples.

[NLP-68] Evolving Subnetwork Training for Large Language Models

Link: https://arxiv.org/abs/2406.06962
Authors: Hanqi Li, Lu Chen, Da Ma, Zijian Wu, Su Zhu, Kai Yu
Keywords: artificial intelligence research, Large language models, Large language, intelligence research, era of artificial
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to ICML 2024

Abstract:Large language models have ushered in a new era of artificial intelligence research. However, their substantial training costs hinder further development and widespread adoption. In this paper, inspired by the redundancy in the parameters of large language models, we propose a novel training paradigm: Evolving Subnetwork Training (EST). EST samples subnetworks from the layers of the large language model and from commonly used modules within each layer, Multi-Head Attention (MHA) and Multi-Layer Perceptron (MLP). By gradually increasing the size of the subnetworks during the training process, EST can save the cost of training. We apply EST to train GPT2 model and TinyLlama model, resulting in 26.7% FLOPs saving for GPT2 and 25.0% for TinyLlama without an increase in loss on the pre-training dataset. Moreover, EST leads to performance improvements in downstream tasks, indicating that it benefits generalization. Additionally, we provide intuitive theoretical studies based on training dynamics and Dropout theory to ensure the feasibility of EST. Our code is available at this https URL.
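
The idea of training on subnetworks whose size grows over time can be sketched as follows. This is an illustration under stated assumptions, not the paper's method: the staged schedule, the starting fraction of 0.5, and the function names are hypothetical, and only whole layers are sampled here, whereas the abstract also mentions sampling inside the MHA and MLP modules.

```python
import random


def keep_fraction(step, total_steps, start=0.5, stages=3):
    """Fraction of layers kept at `step`; rises from `start` to 1.0 in equal stages.

    Assumes stages >= 2. The schedule itself is an assumption of this sketch.
    """
    stage = min(stages - 1, step * stages // total_steps)
    return start + (1.0 - start) * stage / (stages - 1)


def sample_layers(num_layers, fraction, rng):
    """Pick which layer indices are active for one training step."""
    k = max(1, round(num_layers * fraction))
    return sorted(rng.sample(range(num_layers), k))


rng = random.Random(0)
for step in (0, 150, 299):
    frac = keep_fraction(step, total_steps=300)
    print(step, frac, sample_layers(12, frac, rng))
```

Early steps touch only part of the network and therefore cost fewer FLOPs, while the final stage trains the full model, which is the general mechanism by which such a schedule could save compute.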

[NLP-69] A Probabilistic Framework for LLM Hallucination Detection via Belief Tree Propagation

Link: https://arxiv.org/abs/2406.06950
Authors: Bairu Hou, Yang Zhang, Jacob Andreas, Shiyu Chang
Keywords: LLM, LLM belief, aims to determine, determine the truthfulness, truthfulness of LLM-generated
Categories: Computation and Language (cs.CL)
Comments: 26 pages, 18 figures

Abstract:This paper focuses on the task of hallucination detection, which aims to determine the truthfulness of LLM-generated statements. To address this problem, a popular class of methods utilize the LLM’s self-consistencies in its beliefs in a set of logically related augmented statements generated by the LLM, which does not require external knowledge databases and can work with both white-box and black-box LLMs. However, in many existing approaches, the augmented statements tend to be very monotone and unstructured, which makes it difficult to integrate meaningful information from the LLM beliefs in these statements. Also, many methods work with the binarized version of the LLM’s belief, instead of the continuous version, which significantly loses information. To overcome these limitations, in this paper, we propose Belief Tree Propagation (BTProp), a probabilistic framework for LLM hallucination detection. BTProp introduces a belief tree of logically related statements by recursively decomposing a parent statement into child statements with three decomposition strategies, and builds a hidden Markov tree model to integrate the LLM’s belief scores in these statements in a principled way. Experiment results show that our method improves baselines by 3%-9% (evaluated by AUROC and AUC-PR) on multiple hallucination detection benchmarks. Code is available at this https URL.

[NLP-70] Post-Hoc Answer Attribution for Grounded and Trustworthy Long Document Comprehension: Task Insights and Challenges

Link: https://arxiv.org/abs/2406.06938
Authors: Abhilasha Sancheti, Koustava Goswami, Balaji Vasan Srinivasan
Keywords: Attributing answer text, Attributing answer, building trustworthy, questions is crucial, crucial for building
Categories: Computation and Language (cs.CL)
Comments: Accepted to *SEM 2024

Abstract:Attributing answer text to its source document for information-seeking questions is crucial for building trustworthy, reliable, and accountable systems. We formulate a new task of post-hoc answer attribution for long document comprehension (LDC). Owing to the lack of long-form abstractive and information-seeking LDC datasets, we refactor existing datasets to assess the strengths and weaknesses of existing retrieval-based and proposed answer decomposition and textual entailment-based optimal selection attribution systems for this task. We throw light on the limitations of existing datasets and the need for datasets to assess the actual performance of systems on this task.

[NLP-71] A Non-autoregressive Generation Framework for End-to-End Simultaneous Speech-to-Any Translation

Link: https://arxiv.org/abs/2406.06937
Authors: Zhengrui Ma, Qingkai Fang, Shaolei Zhang, Shoutao Guo, Yang Feng, Min Zhang
Keywords: facilitating communication, play a crucial, crucial role, role in facilitating, translation models play
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: ACL 2024; Codes and demos are at this https URL

Abstract:Simultaneous translation models play a crucial role in facilitating communication. However, existing research primarily focuses on text-to-text or speech-to-text models, necessitating additional cascade components to achieve speech-to-speech translation. These pipeline methods suffer from error propagation and accumulate delays in each cascade component, resulting in reduced synchronization between the speaker and listener. To overcome these challenges, we propose a novel non-autoregressive generation framework for simultaneous speech translation (NAST-S2X), which integrates speech-to-text and speech-to-speech tasks into a unified end-to-end framework. We develop a non-autoregressive decoder capable of concurrently generating multiple text or acoustic unit tokens upon receiving fixed-length speech chunks. The decoder can generate blank or repeated tokens and employ CTC decoding to dynamically adjust its latency. Experimental results show that NAST-S2X outperforms state-of-the-art models in both speech-to-text and speech-to-speech tasks. It achieves high-quality simultaneous interpretation within a delay of less than 3 seconds and provides a 28 times decoding speedup in offline generation.
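
The abstract notes that the decoder may emit blank or repeated tokens and relies on CTC decoding. The standard CTC collapsing rule (merge consecutive repeats, then drop blanks) can be sketched as below. The blank symbol and the example tokens are assumptions for illustration, and this shows only the generic rule, not how NAST-S2X uses it to control latency.

```python
from itertools import groupby

BLANK = "<blank>"  # hypothetical blank symbol used in this sketch


def ctc_collapse(tokens, blank=BLANK):
    """Merge consecutive repeated tokens, then remove blanks."""
    merged = [tok for tok, _ in groupby(tokens)]
    return [tok for tok in merged if tok != blank]


chunk_output = [BLANK, "he", "he", BLANK, "llo", "llo", BLANK, BLANK]
print(ctc_collapse(chunk_output))
```

A blank between two identical tokens keeps them separate, so a genuinely repeated output token survives the collapse. Emitting more blanks for a chunk yields fewer output tokens for that chunk, which is one way such a decoder can defer output.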

[NLP-72] Agent-SiMT: Agent-assisted Simultaneous Machine Translation with Large Language Models

Link: https://arxiv.org/abs/2406.06910
Authors: Shoutao Guo, Shaolei Zhang, Zhengrui Ma, Min Zhang, Yang Feng
Keywords: Simultaneous Machine Translation, Simultaneous Machine, Machine Translation, Translation, generates target translations
Categories: Computation and Language (cs.CL)
Comments: 18 pages, 8 figures, 7 tables. arXiv admin note: substantial text overlap with arXiv:2402.13036

Abstract:Simultaneous Machine Translation (SiMT) generates target translations while reading the source sentence. It relies on a policy to determine the optimal timing for reading sentences and generating translations. Existing SiMT methods generally adopt the traditional Transformer architecture, which concurrently determines the policy and generates translations. While they excel at determining policies, their translation performance is suboptimal. Conversely, Large Language Models (LLMs), trained on extensive corpora, possess superior generation capabilities, but it is difficult for them to acquire translation policy through the training methods of SiMT. Therefore, we introduce Agent-SiMT, a framework combining the strengths of LLMs and traditional SiMT methods. Agent-SiMT contains the policy-decision agent and the translation agent. The policy-decision agent is managed by a SiMT model, which determines the translation policy using partial source sentence and translation. The translation agent, leveraging an LLM, generates translation based on the partial source sentence. The two agents collaborate to accomplish SiMT. Experiments demonstrate that Agent-SiMT attains state-of-the-art performance.
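
The division of labour between the two agents can be sketched as a read/write loop. This is a minimal illustration under stated assumptions: `policy` and `translator` are hypothetical callables standing in for the SiMT policy model and the LLM, the action names "READ" and "WRITE" are chosen for this sketch, and the toy stand-ins simply upper-case source tokens.

```python
def simultaneous_translate(source_tokens, policy, translator):
    """Interleave READ and WRITE actions until the translator has nothing to emit."""
    read, target = 0, []
    while True:
        action = policy(source_tokens[:read], target)
        if action == "READ" and read < len(source_tokens):
            read += 1  # reveal one more source token
            continue
        token = translator(source_tokens[:read], target)
        if token is None:  # translator signals the end of the output
            break
        target.append(token)
    return target


def toy_policy(source_prefix, target):
    """Stand-in for the policy agent: keep the source one token ahead."""
    return "READ" if len(source_prefix) <= len(target) else "WRITE"


def toy_translator(source_prefix, target):
    """Stand-in for the translation agent: upper-case the next source token."""
    if len(target) >= len(source_prefix):
        return None
    return source_prefix[len(target)].upper()


print(simultaneous_translate(["guten", "morgen"], toy_policy, toy_translator))
```

Keeping the policy and the translator behind separate callables mirrors the abstract's design point: the component that decides when to translate can be swapped independently of the component that decides what to output.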

[NLP-73] SignMusketeers: An Efficient Multi-Stream Approach for Sign Language Translation at Scale

Link: https://arxiv.org/abs/2406.06907
Authors: Shester Gueuwou, Xiaodan Du, Greg Shakhnarovich, Karen Livescu
Keywords: irrelevant visual differences, sign language, language video processing, written language translation, sign language video
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:A persistent challenge in sign language video processing, including the task of sign language to written language translation, is how we learn representations of sign language in an effective and efficient way that can preserve the important attributes of these languages, while remaining invariant to irrelevant visual differences. Informed by the nature and linguistics of signed languages, our proposed method focuses on just the most relevant parts in a signing video: the face, hands and body posture of the signer. However, instead of using pose estimation coordinates from off-the-shelf pose tracking models, which have inconsistent performance for hands and faces, we propose to learn the complex handshapes and rich facial expressions of sign languages in a self-supervised fashion. Our approach is based on learning from individual frames (rather than video sequences) and is therefore much more efficient than prior work on sign language pre-training. Compared to a recent model that established a new state of the art in sign language translation on the How2Sign dataset, our approach yields similar translation performance, using less than 3% of the compute.

[NLP-74] PLUM: Preference Learning Plus Test Cases Yields Better Code Language Models

Link: https://arxiv.org/abs/2406.06887
Authors: Dylan Zhang, Shizhe Diao, Xueyan Zou, Hao Peng
Keywords: Instruction-finetuned code language, Instruction-finetuned code, programming tasks, shown promise, Instruction-finetuned
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Programming Languages (cs.PL); Software Engineering (cs.SE)

Abstract:Instruction-finetuned code language models (LMs) have shown promise in various programming tasks. They are trained, using a language modeling objective, on natural language instructions and gold code snippet pairs. Recent evidence suggests that these models, never exposed to incorrect solutions during training, often struggle to distinguish between correct and incorrect solutions. This observation raises our inquiry: Can preference learning, which trains models to prefer correct solutions over incorrect ones, help push the boundaries of code LMs even further? We propose PLUM, a novel preference learning framework augmented with test cases tailored for code LMs. PLUM aims to investigate the key success factors and potential benefits of preference learning in code LMs, which remain elusive despite its success in aligning LMs with human values. PLUM consists of three stages: (1) Generating test cases for natural language instructions, (2) sampling candidate solutions from the policy and evaluating them against the test cases to create a preference dataset, which is then used to (3) train the policy with a preference learning algorithm. Experiments demonstrate that PLUM substantially improves the performance of existing code LMs on established code generation benchmarks such as HumanEval (+) and MBPP (+), even for the state-of-the-art open-source language model CodeQwen-1.5-7B-Chat. PLUM complements the supervised fine-tuning (SFT) stage, demonstrating synergistic effects.
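
Stage (2) of the pipeline, turning test outcomes into preference data, can be sketched as below. This is a minimal illustration under stated assumptions: candidates are plain Python callables rather than sampled model outputs, the record layout (`prompt`, `chosen`, `rejected`) is a common convention rather than something the abstract specifies, and all names are hypothetical.

```python
from itertools import product


def passes_all(candidate, tests):
    """True if the candidate returns the expected value on every test case."""
    try:
        return all(candidate(*args) == expected for args, expected in tests)
    except Exception:
        return False  # a crash counts as a failure


def build_preference_pairs(prompt, candidates, tests):
    """Pair every passing candidate (chosen) with every failing one (rejected)."""
    passed = [c for c in candidates if passes_all(c, tests)]
    failed = [c for c in candidates if not passes_all(c, tests)]
    return [
        {"prompt": prompt, "chosen": good.__name__, "rejected": bad.__name__}
        for good, bad in product(passed, failed)
    ]


def add_ok(a, b):
    return a + b


def add_bug(a, b):
    return a - b


def add_crash(a, b):
    raise ValueError("broken candidate")


tests = [((1, 2), 3), ((0, 0), 0)]
pairs = build_preference_pairs("Add two integers.", [add_ok, add_bug, add_crash], tests)
print(pairs)
```

The resulting pairs could then be passed to a preference learning algorithm in stage (3). Which algorithm is used is not stated in the abstract, so none is shown here.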

[NLP-75] Modeling language contact with the Iterated Learning Model

Link: https://arxiv.org/abs/2406.06878
Authors: Seth Bullock, Conor Houghton
Keywords: iterated learning model, iterated learning, language contact, potential to transmit, transmit vocabulary
Categories: Computation and Language (cs.CL)
Comments: to appear ALIFE24

Abstract:Contact between languages has the potential to transmit vocabulary and other language features; however, this does not always happen. Here, an iterated learning model is used to examine, in a simple way, the resistance of languages to change during language contact. Iterated learning models are agent-based models of language change; they demonstrate that languages that are expressive and compositional arise spontaneously as a consequence of a language transmission bottleneck. A recently introduced type of iterated learning model, the Semi-Supervised ILM, is used to simulate language contact. These simulations do not include many of the complex factors involved in language contact and do not model a population of speakers; nonetheless, the model demonstrates that the dynamics which lead languages in the model to spontaneously become expressive and compositional also cause a language to maintain its core traits even after mixing with another language.

[NLP-76] What's in an embedding? Would a rose by any embedding smell as sweet?

Link: https://arxiv.org/abs/2406.06870
Authors: Venkat Venkatasubramanian
Keywords: advanced autocomplete systems, advanced autocomplete, Large Language Models, understanding, lacking true
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 7 pages, 9 images

Abstract:Large Language Models (LLMs) are often criticized for lacking true “understanding” and an ability to “reason” with their knowledge, being seen merely as advanced autocomplete systems. We believe that this perspective might be missing an important insight. We suggest that LLMs do develop a kind of empirical “understanding” that is “geometry”-like, which seems quite sufficient for a range of applications in NLP, computer vision, coding assistance, etc. However, this “geometric” understanding, built from incomplete and noisy data, makes them unreliable, difficult to generalize, and lacking in inference capabilities and explanations, similar to the challenges faced by heuristics-based expert systems decades ago. To overcome these limitations, we suggest that LLMs should be integrated with an “algebraic” representation of knowledge that includes symbolic AI elements used in expert systems. This integration aims to create large knowledge models (LKMs) that not only possess “deep” knowledge grounded in first principles, but also have the ability to reason and explain, mimicking human expert capabilities. To harness the full potential of generative AI safely and effectively, a paradigm shift from LLMs to the more comprehensive LKMs is needed.
摘要:大型语言模型(LLM)经常被批评为缺乏真正的“理解”和运用其知识进行“推理”的能力,仅仅被视为高级的自动补全系统。我们认为,这种观点可能遗漏了一个重要的见解。我们认为LLM确实发展出了一种类似“几何”的经验性“理解”,这似乎足以支撑NLP、计算机视觉、编码辅助等领域的一系列应用。然而,这种基于不完整且带噪声的数据建立的“几何”理解使它们不可靠、难以泛化,并且缺乏推理能力和解释能力,类似于几十年前基于启发式的专家系统所面临的挑战。为了克服这些限制,我们建议将LLM与知识的“代数”表示相集成,其中包括专家系统中使用的符号人工智能元素。这种集成旨在创建大型知识模型(LKM),这些模型不仅拥有基于第一性原理的“深层”知识,而且还具有推理和解释的能力,模仿人类专家的能力。为了安全有效地利用生成式人工智能的全部潜力,需要从LLM向更全面的LKM进行范式转变。

[NLP-77] A Survey of Backdoor Attacks and Defenses on Large Language Models: Implications for Security Measures
[NLP-77] 大型语言模型后门攻击和防御的调查:对安全措施的影响

链接: https://arxiv.org/abs/2406.06852
作者: Shuai Zhao,Meihuizi Jia,Zhongliang Guo,Leilei Gan,Jie Fu,Yichao Feng,Fengjun Pan,Luu Anh Tuan
关键词: NLP tasks, human language understanding, backdoor attacks, large language models, language models
中文关键词: NLP任务、人类语言理解、后门攻击、大型语言模型、语言模型
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

Abstract:The large language models (LLMs), which bridge the gap between human language understanding and complex problem-solving, achieve state-of-the-art performance on several NLP tasks, particularly in few-shot and zero-shot settings. Despite the demonstrable efficacy of LLMs, due to constraints on computational resources, users have to engage with open-source language models or outsource the entire training process to third-party platforms. However, research has demonstrated that language models are susceptible to potential security vulnerabilities, particularly in backdoor attacks. Backdoor attacks are designed to introduce targeted vulnerabilities into language models by poisoning training samples or model weights, allowing attackers to manipulate model responses through malicious triggers. While existing surveys on backdoor attacks provide a comprehensive overview, they lack an in-depth examination of backdoor attacks specifically targeting LLMs. To bridge this gap and grasp the latest trends in the field, this paper presents a novel perspective on backdoor attacks for LLMs by focusing on fine-tuning methods. Specifically, we systematically classify backdoor attacks into three categories: full-parameter fine-tuning, parameter-efficient fine-tuning, and attacks without fine-tuning. Based on insights from a substantial review, we also discuss crucial issues for future research on backdoor attacks, such as further exploring attack algorithms that do not require fine-tuning, or developing more covert attack algorithms.
摘要:大型语言模型(LLM)在人类语言理解和复杂问题求解之间架起了桥梁,在多个自然语言处理任务上取得了最先进的性能,特别是在少样本和零样本设置下。尽管LLM具有明显的功效,但由于计算资源的限制,用户不得不使用开源语言模型或将整个训练过程外包给第三方平台。然而,研究表明,语言模型容易受到潜在安全漏洞的影响,特别是后门攻击。后门攻击旨在通过毒化训练样本或模型权重,将有针对性的漏洞引入语言模型,使攻击者能够通过恶意触发器操纵模型响应。虽然现有的后门攻击综述提供了全面的概述,但它们缺乏对专门针对LLM的后门攻击的深入考察。为了弥补这一空白并把握该领域的最新趋势,本文聚焦于微调方法,提出了一种看待LLM后门攻击的新视角。具体来说,我们系统地将后门攻击分为三类:全参数微调、参数高效微调和无需微调的攻击。基于大量文献综述得到的见解,我们还讨论了未来后门攻击研究的关键问题,如进一步探索不需要微调的攻击算法,或开发更隐蔽的攻击算法。

[NLP-78] Silent Signals Loud Impact: LLMs for Word-Sense Disambiguation of Coded Dog Whistles
[NLP-78] 无声信号,巨大影响:用LLM对编码“狗哨”用语进行词义消歧

链接: https://arxiv.org/abs/2406.06840
作者: Julia Kruk,Michela Marchini,Rijul Ragu,Caleb Ziems,David Muchlinski,Diyi Yang
关键词: United States politics, socioeconomic discrimination, Large Language Models, carries a secondary, secondary meaning
中文关键词: 美国政治、社会经济歧视、大型语言模型,具有次要的、次要的含义
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: ACL 2024

Abstract:A dog whistle is a form of coded communication that carries a secondary meaning to specific audiences and is often weaponized for racial and socioeconomic discrimination. Dog whistling historically originated from United States politics, but in recent years has taken root in social media as a means of evading hate speech detection systems and maintaining plausible deniability. In this paper, we present an approach for word-sense disambiguation of dog whistles from standard speech using Large Language Models (LLMs), and leverage this technique to create a dataset of 16,550 high-confidence coded examples of dog whistles used in formal and informal communication. Silent Signals is the largest dataset of disambiguated dog whistle usage, created for applications in hate speech detection, neology, and political science. The dataset can be found at this https URL. ACM classes: J.4; K.4.1; K.4.2
摘要:狗哨是一种对特定受众带有第二层含义的编码交流形式,经常被用作种族和社会经济歧视的武器。狗哨用语在历史上起源于美国政治,但近年来已在社交媒体上扎根,成为逃避仇恨言论检测系统、保持合理推诿空间的一种手段。在本文中,我们提出了一种使用大型语言模型(LLM)将狗哨用语与标准言语区分开来的词义消歧方法,并利用该技术创建了一个数据集,包含16,550个在正式和非正式交流中使用的高置信度编码狗哨示例。Silent Signals是最大的经过消歧的狗哨用法数据集,面向仇恨言论检测、新词学和政治学中的应用而创建。数据集可在此https URL找到。ACM分类:J.4;K.4.1;K.4.2

[NLP-79] EAVE: Efficient Product Attribute Value Extraction via Lightweight Sparse-layer Interaction
[NLP-79] EAVE:通过轻量级稀疏层交互高效的产品属性值提取

链接: https://arxiv.org/abs/2406.06839
作者: Li Yang,Qifan Wang,Jianfeng Chi,Jiahao Liu,Jingang Wang,Fuli Feng,Zenglin Xu,Yi Fang,Lifu Huang,Dongfang Liu
关键词: extraction involves identifying, involves identifying, identifying the specific, Efficient product Attribute, extraction involves
中文关键词: 提取涉及识别,涉及识别,识别特定的,高效的产品属性,提取涉及
类目: Computation and Language (cs.CL)
备注:

Abstract:Product attribute value extraction involves identifying the specific values associated with various attributes from a product profile. While existing methods often prioritize the development of effective models to improve extraction performance, there has been limited emphasis on extraction efficiency. However, in real-world scenarios, products are typically associated with multiple attributes, necessitating multiple extractions to obtain all corresponding values. In this work, we propose an Efficient product Attribute Value Extraction (EAVE) approach via lightweight sparse-layer interaction. Specifically, we employ a heavy encoder to separately encode the product context and attribute. The resulting non-interacting heavy representations of the context can be cached and reused for all attributes. Additionally, we introduce a light encoder to jointly encode the context and the attribute, facilitating lightweight interactions between them. To enrich the interaction within the lightweight encoder, we design a sparse-layer interaction module to fuse the non-interacting heavy representation into the lightweight encoder. Comprehensive evaluation on two benchmarks demonstrates that our method achieves significant efficiency gains with neutral or marginal loss in performance when the context is long and number of attributes is large. Our code is available at https://anonymous.4open.science/r/EAVE-EA18.
摘要:产品属性值提取涉及从产品资料中识别与各种属性相关联的特定值。虽然现有的方法往往优先考虑开发有效的模型来提高提取性能,但对提取效率的重视有限。然而,在现实世界的场景中,产品通常与多个属性相关联,需要多次提取才能获得所有相应的值。在这项工作中,我们提出了一种基于轻量级稀疏层交互的高效产品属性值提取(EAVE)方法。具体地说,我们使用一个重型编码器分别对产品上下文和属性进行编码。由此得到的上下文的非交互重型表示可以被缓存,并在所有属性上重复使用。此外,我们引入了一个轻量级编码器对上下文和属性进行联合编码,促进它们之间的轻量级交互。为了丰富轻量级编码器内部的交互,我们设计了一个稀疏层交互模块,将非交互的重型表示融合到轻量级编码器中。在两个基准上的综合评估表明,当上下文较长且属性数量较多时,该方法在性能持平或仅有轻微损失的情况下取得了显著的效率提升。我们的代码可从 https://anonymous.4open.science/r/EAVE-EA18 获得。

[NLP-80] AGB-DE: A Corpus for the Automated Legal Assessment of Clauses in German Consumer Contracts
[NLP-80] AGB-DE:德国消费者合同条款自动法律评估的数据库

链接: https://arxiv.org/abs/2406.06809
作者: Daniel Braun,Florian Matthes
关键词: language models, annotated datasets, datasets are rare, Legal tasks, open language models
中文关键词: 语言模型、注释数据集、数据集很少见、法律任务、开放语言模型
类目: Computation and Language (cs.CL)
备注:

Abstract:Legal tasks and datasets are often used as benchmarks for the capabilities of language models. However, openly available annotated datasets are rare. In this paper, we introduce AGB-DE, a corpus of 3,764 clauses from German consumer contracts that have been annotated and legally assessed by legal experts. Together with the data, we present a first baseline for the task of detecting potentially void clauses, comparing the performance of an SVM baseline with three fine-tuned open language models and the performance of GPT-3.5. Our results show the challenging nature of the task, with no approach exceeding an F1-score of 0.54. While the fine-tuned models often performed better with regard to precision, GPT-3.5 outperformed the other approaches with regard to recall. An analysis of the errors indicates that one of the main challenges could be the correct interpretation of complex clauses, rather than the decision boundaries of what is permissible and what is not.
摘要:法律任务和数据集通常被用作语言模型能力的基准。然而,公开可用的注释数据集很少见。本文中,我们介绍AGB-DE,这是一个包含德国消费者合同中3,764个条款的文集,这些条款已由法律专家注释和法律评估。与这些数据一起,我们为检测潜在无效条款的任务提供了第一个基线,比较了三个微调开放语言模型的性能以及GPT-3.5的性能。我们的结果显示了该任务的挑战性,没有任何方法超过F1评分0.54。虽然微调模型在精确度方面通常表现更好,但GPT-3.5在召回方面优于其他方法。对错误的分析表明,主要挑战之一可能是对复杂条款的正确解释,而不是什么是允许的和什么是不允许的决策界限。

[NLP-81] LLM-dCache: Improving Tool-Augmented LLMs with GPT-Driven Localized Data Caching
[NLP-81] LLM-dCache:利用GPT驱动的本地化数据缓存改进工具增强的LLM

链接: https://arxiv.org/abs/2406.06799
作者: Simranjit Singh,Michael Fore,Andreas Karatzas,Chaehong Lee,Yanan Jian,Longfei Shangguan,Fuxun Yu,Iraklis Anagnostopoulos,Dimitrios Stamoulis
关键词: Large Language Models, Language Models, Large Language, complex data operations, API calls
中文关键词: 大型语言模型、语言模型、大型语言、复杂数据操作、API调用
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Computation and Language (cs.CL)
备注:

Abstract:As Large Language Models (LLMs) broaden their capabilities to manage thousands of API calls, they are confronted with complex data operations across vast datasets with significant overhead to the underlying system. In this work, we introduce LLM-dCache to optimize data accesses by treating cache operations as callable API functions exposed to the tool-augmented agent. We grant LLMs the autonomy to manage cache decisions via prompting, seamlessly integrating with existing function-calling mechanisms. Tested on an industry-scale massively parallel platform that spans hundreds of GPT endpoints and terabytes of imagery, our method improves Copilot times by an average of 1.24x across various LLMs and prompting techniques.
摘要:随着大型语言模型(LLM)扩展其管理数千个API调用的能力,它们面临着跨越庞大数据集的复杂数据操作,并给底层系统带来了巨大的开销。在这项工作中,我们引入了LLM-dCache,通过将缓存操作视为暴露给工具增强智能体的可调用API函数来优化数据访问。我们通过提示授予LLM管理缓存决策的自主权,并与现有的函数调用机制无缝集成。在跨越数百个GPT端点和TB级图像的行业规模大规模并行平台上进行测试,我们的方法在各种LLM和提示技术上将Copilot时间平均改善了1.24倍。
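
The abstract describes cache operations exposed to the agent as callable tool functions. A minimal Python sketch of that idea follows. Everything in it is an assumption made for illustration: the names `ToolCache`, `cache_read`, `cache_update`, `load_data`, `tool_registry`, and `run_plan` are hypothetical, the LRU eviction policy is a stand-in choice, and the hard-coded `run_plan` takes the place of the tool calls an LLM would choose through prompting. It is not the authors' implementation.

```python
from collections import OrderedDict
from typing import Any, Callable, Dict, Optional


class ToolCache:
    """Small LRU cache whose operations are meant to be exposed as tools."""

    def __init__(self, capacity: int, loader: Callable[[str], Any]) -> None:
        self.capacity = capacity
        self.loader = loader          # hypothetical slow data source
        self.store: "OrderedDict[str, Any]" = OrderedDict()
        self.loads = 0                # counts slow loads, for inspection

    def cache_read(self, key: str) -> Optional[Any]:
        """Return the cached value, or None on a miss."""
        if key in self.store:
            self.store.move_to_end(key)   # mark as most recently used
            return self.store[key]
        return None

    def cache_update(self, key: str, value: Any) -> None:
        """Insert a value and evict the least recently used entries."""
        self.store[key] = value
        self.store.move_to_end(key)
        while len(self.store) > self.capacity:
            self.store.popitem(last=False)

    def load_data(self, key: str) -> Any:
        """Fetch from the slow source, bypassing the cache."""
        self.loads += 1
        return self.loader(key)


def tool_registry(cache: ToolCache) -> Dict[str, Callable[..., Any]]:
    """Map tool names to callables, as a function-calling agent would see them."""
    return {
        "cache_read": cache.cache_read,
        "cache_update": cache.cache_update,
        "load_data": cache.load_data,
    }


def run_plan(tools: Dict[str, Callable[..., Any]], key: str) -> Any:
    """Stand-in for the call sequence a model might emit: read, else load and update."""
    value = tools["cache_read"](key)
    if value is None:
        value = tools["load_data"](key)
        tools["cache_update"](key, value)
    return value
```

In this sketch a repeated request for the same key is served from `store` without another `load_data` call, which is the kind of saving the abstract attributes to cache-aware tool use.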

[NLP-82] Evaluating Zero-Shot Long-Context LLM Compression
[NLP-82] 评估零样本长上下文LLM压缩

链接: https://arxiv.org/abs/2406.06773
作者: Chenyu Wang,Yihan Wang
关键词: large language models, language models, study evaluates, evaluates the effectiveness, effectiveness of zero-shot
中文关键词: 大型语言模型、语言模型、研究评估、评估有效性、零样本的有效性
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

Abstract:This study evaluates the effectiveness of zero-shot compression techniques on large language models (LLMs) under long-context. We identify the tendency for computational errors to increase under long-context when employing certain compression methods. We propose a hypothesis to explain the varied behavior of different LLM compression techniques and explore remedies to mitigate the performance decline observed in some techniques under long-context. This is a course report for COS 598D Machine Learning and Systems by Prof. Kai Li at Princeton University. Due to limited computational resources, our experiments were conducted only on LLaMA-2-7B-32K.
摘要:本研究评估了零样本压缩技术在长上下文下对大型语言模型(LLM)的有效性。我们发现,当使用某些压缩方法时,计算误差在长上下文下有增加的趋势。我们提出了一个假设来解释不同LLM压缩技术的不同行为,并探索补救措施,以缓解某些技术在长上下文下出现的性能下降。这是普林斯顿大学李凯教授的COS 598D“机器学习与系统”课程的课程报告。由于计算资源有限,我们的实验仅在LLaMA-2-7B-32K上进行。

[NLP-83] DISCOVERYWORLD: A Virtual Environment for Developing and Evaluating Automated Scientific Discovery Agents
[NLP-83] DISCOVERYWORLD:开发和评估自动化科学发现代理的虚拟环境

链接: https://arxiv.org/abs/2406.06769
作者: Peter Jansen,Marc-Alexandre Côté,Tushar Khot,Erin Bransom,Bhavana Dalvi Mishra,Bodhisattwa Prasad Majumder,Oyvind Tafjord,Peter Clark
关键词: Automated scientific discovery, Automated scientific, DISCOVERYWORLD, scientific discovery promises, scientific discovery
中文关键词: 自动科学发现,自动科学,发现世界,科学发现承诺,科学发现
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 9 pages, 4 figures. Preprint, under review

Abstract:Automated scientific discovery promises to accelerate progress across scientific domains. However, developing and evaluating an AI agent’s capacity for end-to-end scientific reasoning is challenging as running real-world experiments is often prohibitively expensive or infeasible. In this work we introduce DISCOVERYWORLD, the first virtual environment for developing and benchmarking an agent’s ability to perform complete cycles of novel scientific discovery. DISCOVERYWORLD contains a variety of different challenges, covering topics as diverse as radioisotope dating, rocket science, and proteomics, to encourage development of general discovery skills rather than task-specific solutions. DISCOVERYWORLD itself is an inexpensive, simulated, text-based environment (with optional 2D visual overlay). It includes 120 different challenge tasks, spanning eight topics each with three levels of difficulty and several parametric variations. Each task requires an agent to form hypotheses, design and run experiments, analyze results, and act on conclusions. DISCOVERYWORLD further provides three automatic metrics for evaluating performance, based on (a) task completion, (b) task-relevant actions taken, and (c) the discovered explanatory knowledge. We find that strong baseline agents, that perform well in prior published environments, struggle on most DISCOVERYWORLD tasks, suggesting that DISCOVERYWORLD captures some of the novel challenges of discovery, and thus that DISCOVERYWORLD may help accelerate near-term development and assessment of scientific discovery competency in agents. Code available at: this http URL
摘要:自动科学发现有望加速科学领域的进步。然而,开发和评估人工智能代理的端到端科学推理能力是具有挑战性的,因为运行现实世界的实验往往昂贵得令人望而却步,或者是不可行的。在这项工作中,我们引入了DISCOVERYWORLD,这是第一个用于开发和基准测试代理人执行新科学发现的完整周期的能力的虚拟环境。DISCOVERYWORLD包含各种不同的挑战,涵盖了放射性同位素测年、火箭科学和蛋白质组学等各种主题,以鼓励开发一般的发现技能,而不是特定任务的解决方案。DISCOVERYWORLD本身是一个廉价的、模拟的、基于文本的环境(具有可选的2D视觉覆盖)。它包括120个不同的挑战任务,跨越8个主题,每个主题有三个难度级别和几个参数变化。每项任务都需要一个代理人来形成假设,设计和运行实验,分析结果,并根据结论采取行动。DISCOVERYWORLD还根据(A)任务完成、(B)采取的与任务相关的行动和(C)发现的解释性知识,为评估绩效提供了三个自动衡量标准。我们发现,在以前发表的环境中表现良好的强基线代理在大多数DISCOVERYWORLD任务中都表现不佳,这表明DISCOVERYWORLD捕获了发现的一些新挑战,因此DISCOVERYWORLD可能有助于加快代理中科学发现能力的近期开发和评估。代码可从以下地址获得:此http URL

[NLP-84] Classi|Q⟩ Towards a Translation Framework To Bridge The Classical-Quantum Programming Gap
[NLP-84] Classi|Q⟩:迈向弥合经典与量子编程差距的翻译框架

链接: https://arxiv.org/abs/2406.06764
作者: Matteo Esposito,Maryam Tavassoli Sabzevari,Boshuai Ye,Davide Falessi,Arif Ali Khan,Davide Taibi
关键词: complex programming paradigms, Quantum computing, Classi, albeit readily, learning curves
中文关键词: 复杂的编程范式、量子计算、Classi,尽管很容易,学习曲线
类目: Software Engineering (cs.SE); Computation and Language (cs.CL); Emerging Technologies (cs.ET); Programming Languages (cs.PL)
备注:

Abstract:Quantum computing, albeit readily available as hardware or emulated on the cloud, is still far from being available in general regarding complex programming paradigms and learning curves. This vision paper introduces Classi|Q⟩, a translation framework idea to bridge Classical and Quantum Computing by translating high-level programming languages, e.g., Python or C++, into a low-level language, e.g., Quantum Assembly. Our idea paper serves as a blueprint for ongoing efforts in quantum software engineering, offering a roadmap for further Classi|Q⟩ development to meet the diverse needs of researchers and practitioners. Classi|Q⟩ is designed to empower researchers and practitioners with no prior quantum experience to harness the potential of hybrid quantum computation. We also discuss future enhancements to Classi|Q⟩, including support for additional quantum languages, improved optimization strategies, and integration with emerging quantum computing platforms.
摘要:量子计算虽然很容易以硬件形式或云端仿真的方式获得,但就复杂的编程范式和学习曲线而言,仍然远未达到普遍可用。这篇愿景论文介绍了Classi|Q⟩,这是一种翻译框架构想,通过将高级编程语言(如Python或C++)翻译成低级语言(如量子汇编语言)来连接经典计算和量子计算。我们的构想论文为量子软件工程的持续努力提供了蓝图,并为Classi|Q⟩的进一步发展提供了路线图,以满足研究人员和从业者的不同需求。Classi|Q⟩旨在让没有量子经验的研究人员和从业者能够利用混合量子计算的潜力。我们还讨论了Classi|Q⟩的未来增强功能,包括支持更多量子语言、改进优化策略以及与新兴量子计算平台的集成。

[NLP-85] Scaling the Vocabulary of Non-autoregressive Models for Efficient Generative Retrieval
[NLP-85] 扩展非自回归模型的词汇以实现高效生成式检索

链接: https://arxiv.org/abs/2406.06739
作者: Ravisri Valluri,Akash Kumar Mohankumar,Kushal Dave,Amit Singh,Jian Jiao,Manik Varma,Gaurav Sinha
关键词: constrained generation task, leveraging recent advancements, Generative Retrieval introduces, advancements in Autoregressive, Generative Retrieval
中文关键词: 受约束的生成任务,利用最新进展,引入生成式检索,自回归、生成式检索的进展
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: 14 pages, 6 tables, 2 figures

Abstract:Generative Retrieval introduces a new approach to Information Retrieval by reframing it as a constrained generation task, leveraging recent advancements in Autoregressive (AR) language models. However, AR-based Generative Retrieval methods suffer from high inference latency and cost compared to traditional dense retrieval techniques, limiting their practical applicability. This paper investigates fully Non-autoregressive (NAR) language models as a more efficient alternative for generative retrieval. While standard NAR models alleviate latency and cost concerns, they exhibit a significant drop in retrieval performance (compared to AR models) due to their inability to capture dependencies between target tokens. To address this, we question the conventional choice of limiting the target token space to solely words or sub-words. We propose PIXAR, a novel approach that expands the target vocabulary of NAR models to include multi-word entities and common phrases (up to 5 million tokens), thereby reducing token dependencies. PIXAR employs inference optimization strategies to maintain low inference latency despite the significantly larger vocabulary. Our results demonstrate that PIXAR achieves a relative improvement of 31.0% in MRR@10 on MS MARCO and 23.2% in Hits@5 on Natural Questions compared to standard NAR models with similar latency and cost. Furthermore, online A/B experiments on a large commercial search engine show that PIXAR increases ad clicks by 5.08% and revenue by 4.02%.
摘要:生成式检索将信息检索重新表述为一个受约束的生成任务,利用自回归(AR)语言模型的最新进展,为信息检索引入了一种新方法。然而,与传统的稠密检索技术相比,基于AR的生成式检索方法推理延迟和成本较高,限制了其实际应用。本文研究了完全非自回归(NAR)语言模型,将其作为生成式检索的一种更高效的替代方案。虽然标准NAR模型缓解了延迟和成本问题,但由于无法捕获目标词元之间的依赖关系,它们的检索性能(与AR模型相比)显著下降。为了解决这个问题,我们质疑将目标词元空间仅限于单词或子词的传统选择。我们提出了PIXAR,这是一种新方法,它扩展了NAR模型的目标词汇表,使其包含多词实体和常见短语(多达500万个词元),从而减少词元间的依赖。尽管词汇表明显更大,PIXAR仍采用推理优化策略来保持较低的推理延迟。实验结果表明,与延迟和成本相近的标准NAR模型相比,PIXAR在MS MARCO上的MRR@10相对提升了31.0%,在Natural Questions上的Hits@5相对提升了23.2%。此外,在一家大型商业搜索引擎上进行的在线A/B实验表明,PIXAR使广告点击量增加了5.08%,收入增加了4.02%。
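
The abstract reports relative gains in MRR@10 and Hits@5. For readers unfamiliar with these metrics, the sketch below shows one common way to compute them from ranked result lists. The function names and the `toy_runs` data are made up for illustration and have no connection to the paper's evaluation code or numbers.

```python
from typing import Callable, List, Sequence, Set, Tuple

Run = Tuple[Sequence[str], Set[str]]  # (ranked doc ids, relevant doc ids)


def reciprocal_rank_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """1/rank of the first relevant document within the top k, else 0."""
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def hit_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """1 if any relevant document appears within the top k, else 0."""
    return 1.0 if any(doc_id in relevant for doc_id in ranked[:k]) else 0.0


def mean_metric(metric: Callable[[Sequence[str], Set[str], int], float],
                runs: List[Run], k: int) -> float:
    """Average a per-query metric over all queries (MRR@k or Hits@k)."""
    return sum(metric(ranked, relevant, k) for ranked, relevant in runs) / len(runs)


def relative_improvement(new: float, baseline: float) -> float:
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# Made-up example: three queries with their ranked results and relevant ids.
toy_runs: List[Run] = [
    (["d3", "d1", "d7"], {"d1"}),   # first relevant hit at rank 2
    (["d2", "d9"], {"d4"}),         # no relevant document retrieved
    (["d5"], {"d5"}),               # relevant hit at rank 1
]

mrr_at_10 = mean_metric(reciprocal_rank_at_k, toy_runs, 10)
hits_at_5 = mean_metric(hit_at_k, toy_runs, 5)
```

A "relative improvement of 31.0%" means `relative_improvement(new, baseline)` equals 31.0, which is different from an absolute gain of 31 points.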

[NLP-86] Raccoon: Prompt Extraction Benchmark of LLM-Integrated Applications
[NLP-86] Raccoon:LLM集成应用的提示词提取基准

链接: https://arxiv.org/abs/2406.06737
作者: Junlin Wang,Tianyi Yang,Roy Xie,Bhuwan Dhingra
关键词: prompt extraction attacks, proprietary instruction prompts, offering valuable services, prompt extraction, millions are deployed
中文关键词: 提示词提取攻击、专有指令提示词、提供有价值的服务、提示词提取、已部署数百万个
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注:

Abstract:With the proliferation of LLM-integrated applications such as GPT-s, millions are deployed, offering valuable services through proprietary instruction prompts. These systems, however, are prone to prompt extraction attacks through meticulously designed queries. To help mitigate this problem, we introduce the Raccoon benchmark which comprehensively evaluates a model’s susceptibility to prompt extraction attacks. Our novel evaluation method assesses models under both defenseless and defended scenarios, employing a dual approach to evaluate the effectiveness of existing defenses and the resilience of the models. The benchmark encompasses 14 categories of prompt extraction attacks, with additional compounded attacks that closely mimic the strategies of potential attackers, alongside a diverse collection of defense templates. This array is, to our knowledge, the most extensive compilation of prompt theft attacks and defense mechanisms to date. Our findings highlight universal susceptibility to prompt theft in the absence of defenses, with OpenAI models demonstrating notable resilience when protected. This paper aims to establish a more systematic benchmark for assessing LLM robustness against prompt extraction attacks, offering insights into their causes and potential countermeasures. Resources of Raccoon are publicly available at this https URL.
摘要:随着GPT-s等LLM集成应用的激增,数以百万计的应用被部署,通过专有的指令提示词提供有价值的服务。然而,这些系统容易受到通过精心设计的查询发起的提示词提取攻击。为了帮助缓解这个问题,我们引入了Raccoon基准,全面评估模型对提示词提取攻击的易感性。我们的新评估方法在无防御和有防御两种场景下对模型进行评估,采用双重方法来评估现有防御的有效性和模型的韧性。该基准涵盖14类提示词提取攻击,另有贴近潜在攻击者策略的复合攻击,以及多样化的防御模板集合。据我们所知,这是迄今为止对提示词窃取攻击和防御机制最全面的汇编。我们的发现表明,在没有防御的情况下,模型普遍容易遭受提示词窃取,而OpenAI模型在受到保护时表现出显著的韧性。本文旨在建立一个更系统的基准来评估LLM对提示词提取攻击的鲁棒性,并提供对其成因和潜在对策的见解。Raccoon的资源可在此https URL公开获取。

[NLP-87] Synthetic Query Generation using Large Language Models for Virtual Assistants
[NLP-87] 使用大型语言模型为虚拟助手生成合成查询

链接: https://arxiv.org/abs/2406.06729
作者: Sonal Sannigrahi,Thiago Fraga-Silva,Youssef Oualil,Christophe Van Gysel
关键词: Virtual Assistants, important Information Retrieval, Information Retrieval platforms, Information Retrieval, spoken commands
中文关键词: 虚拟助理、重要信息检索、信息检索平台、信息检索、口头命令
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: SIGIR '24. The 47th International ACM SIGIR Conference on Research Development in Information Retrieval

Abstract:Virtual Assistants (VAs) are important Information Retrieval platforms that help users accomplish various tasks through spoken commands. The speech recognition system (speech-to-text) uses query priors, trained solely on text, to distinguish between phonetically confusing alternatives. Hence, the generation of synthetic queries that are similar to existing VA usage can greatly improve upon the VA’s abilities – especially for use-cases that do not (yet) occur in paired audio/text data. In this paper, we provide a preliminary exploration of the use of Large Language Models (LLMs) to generate synthetic queries that are complementary to template-based methods. We investigate whether the methods (a) generate queries that are similar to randomly sampled, representative, and anonymized user queries from a popular VA, and (b) whether the generated queries are specific. We find that LLMs generate more verbose queries, compared to template-based methods, and reference aspects specific to the entity. The generated queries are similar to VA user queries, and are specific enough to retrieve the relevant entity. We conclude that queries generated by LLMs and templates are complementary. Related DOI: https://doi.org/10.1145/3626772.3661355
摘要:虚拟助手(VA)是帮助用户通过语音命令完成各种任务的重要信息检索平台。语音识别系统(语音转文本)使用仅基于文本训练的查询先验来区分发音上容易混淆的候选项。因此,生成与现有VA使用情况相似的合成查询可以极大地提高VA的能力,特别是对于(尚未)出现在成对音频/文本数据中的用例。在本文中,我们对使用大型语言模型(LLM)生成合成查询进行了初步探索,这些合成查询是对基于模板的方法的补充。我们考察这些方法(a)生成的查询是否类似于从某款流行VA中随机抽样、具有代表性且经过匿名化的用户查询,以及(b)生成的查询是否具体。我们发现,与基于模板的方法相比,LLM生成的查询更详细,并且会引用特定于实体的方面。生成的查询与VA用户查询相似,并且足够具体,可以检索到相关实体。我们的结论是,LLM和模板生成的查询是互补的。相关DOI:https://doi.org/10.1145/3626772.3661355

[NLP-88] Leveraging Large Language Models for Knowledge-free Weak Supervision in Clinical Natural Language Processing
[NLP-88] 利用大型语言模型在临床自然语言处理中实现无知识弱监督

链接: https://arxiv.org/abs/2406.06723
作者: Enshuo Hsu,Kirk Roberts
关键词: deep learning-based natural, learning-based natural language, natural language processing, deep learning-based, learning-based natural
中文关键词: 基于深度学习的自然,基于学习的自然语言,自然语言处理,基于深度学习,基于学习的自然
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:

Abstract:The performance of deep learning-based natural language processing systems is based on large amounts of labeled training data which, in the clinical domain, are not easily available or affordable. Weak supervision and in-context learning offer partial solutions to this issue, particularly using large language models (LLMs), but their performance still trails traditional supervised methods with moderate amounts of gold-standard data. In particular, inferencing with LLMs is computationally heavy. We propose an approach leveraging fine-tuning LLMs and weak supervision with virtually no domain knowledge that still achieves consistently dominant performance. Using a prompt-based approach, the LLM is used to generate weakly-labeled data for training a downstream BERT model. The weakly supervised model is then further fine-tuned on small amounts of gold standard data. We evaluate this approach using Llama2 on three different n2c2 datasets. With no more than 10 gold standard notes, our final BERT models weakly supervised by fine-tuned Llama2-13B consistently outperformed out-of-the-box PubMedBERT by 4.7% to 47.9% in F1 scores. With only 50 gold standard notes, our models achieved close performance to fully fine-tuned systems.
摘要:基于深度学习的自然语言处理系统的性能依赖于大量带标签的训练数据,而在临床领域,这些数据不易获得,成本也难以承受。弱监督和上下文学习为这个问题提供了部分解决方案,特别是在使用大型语言模型(LLM)时,但它们的性能仍然落后于使用中等数量金标准数据的传统监督方法。特别是,使用LLM进行推理的计算量很大。我们提出了一种利用微调LLM和弱监督的方法,几乎不需要领域知识,仍能取得稳定领先的性能。我们采用基于提示的方法,用LLM生成弱标注数据来训练下游BERT模型。然后,在少量金标准数据上进一步微调该弱监督模型。我们使用Llama2在三个不同的n2c2数据集上评估了该方法。在不超过10份金标准记录的情况下,由微调后的Llama2-13B进行弱监督的最终BERT模型,其F1分数始终比开箱即用的PubMedBERT高出4.7%至47.9%。仅用50份金标准记录,我们的模型就达到了与完全微调系统接近的性能。

[NLP-89] In-Context Learning and Fine-Tuning GPT for Argument Mining
[NLP-89] 用于论辩挖掘的上下文学习与GPT微调

链接: https://arxiv.org/abs/2406.06699
作者: Jérémie Cabessa,Hugo Hernault,Umer Mushtaq
关键词: Large Language Models, Large Language, Language Models, ubiquitous in NLP, NLP and deep
中文关键词: 大型语言模型、大型语言、语言模型、NLP和深度中无处不在
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

Abstract:Large Language Models (LLMs) have become ubiquitous in NLP and deep learning. In-Context Learning (ICL) has been suggested as a bridging paradigm between the training-free and fine-tuning LLMs settings. In ICL, an LLM is conditioned to solve tasks by means of a few solved demonstration examples included as prompt. Argument Mining (AM) aims to extract the complex argumentative structure of a text, and Argument Type Classification (ATC) is an essential sub-task of AM. We introduce an ICL strategy for ATC combining kNN-based examples selection and majority vote ensembling. In the training-free ICL setting, we show that GPT-4 is able to leverage relevant information from only a few demonstration examples and achieve very competitive classification accuracy on ATC. We further set up a fine-tuning strategy incorporating well-crafted structural features given directly in textual form. In this setting, GPT-3.5 achieves state-of-the-art performance on ATC. Overall, these results emphasize the emergent ability of LLMs to grasp global discursive flow in raw text in both off-the-shelf and fine-tuned setups.
摘要:大型语言模型(LLM)在自然语言处理和深度学习中已经无处不在。上下文学习(ICL)被认为是介于免训练和微调这两种LLM使用方式之间的桥梁范式。在ICL中,LLM通过提示中包含的少量已解答的示例来完成任务。论辩挖掘(AM)旨在提取文本中复杂的论辩结构,论辩类型分类(ATC)是AM的一个重要子任务。我们为ATC提出了一种ICL策略,结合了基于kNN的示例选择和多数投票集成。在免训练的ICL设置中,我们表明GPT-4能够仅从少量示例中利用相关信息,并在ATC上取得非常有竞争力的分类准确率。我们进一步建立了一种微调策略,融入了直接以文本形式给出的精心设计的结构特征。在这种设置下,GPT-3.5在ATC上取得了最先进的性能。总体而言,这些结果凸显了LLM在开箱即用和微调两种设置下把握原始文本中全局语篇流的涌现能力。
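
The abstract combines kNN-based example selection with majority-vote ensembling for in-context classification. The sketch below illustrates the general pattern only, under assumptions that are not from the paper: similarity is a plain bag-of-words cosine rather than a learned embedding, the ensemble votes across several values of k, the prompt template is invented, and `llm` is any caller-supplied function that maps a prompt to a label.

```python
import math
from collections import Counter
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, str]  # (text, label)


def bow(text: str) -> Counter:
    """Bag-of-words vector; a stand-in for a real sentence embedding."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_examples(query: str, pool: List[Example], k: int) -> List[Example]:
    """Pick the k labelled examples most similar to the query."""
    q = bow(query)
    return sorted(pool, key=lambda ex: cosine(q, bow(ex[0])), reverse=True)[:k]


def build_prompt(query: str, demos: List[Example]) -> str:
    """Format demonstrations followed by the unlabelled query (invented template)."""
    blocks = [f"Text: {text}\nType: {label}" for text, label in demos]
    blocks.append(f"Text: {query}\nType:")
    return "\n\n".join(blocks)


def classify(query: str, pool: List[Example], llm: Callable[[str], str],
             ks: Sequence[int] = (1, 3, 5)) -> str:
    """Query the model once per k and return the majority label."""
    votes = [llm(build_prompt(query, knn_examples(query, pool, k))) for k in ks]
    return Counter(votes).most_common(1)[0][0]
```

Passing the model in as a plain callable keeps the sketch independent of any particular API client; in practice `llm` would wrap a chat-completion request and parse the returned label.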

[NLP-90] Enrolment-based personalisation for improving individual-level fairness in speech emotion recognition
[NLP-90] 基于注册的个性化提高语音情感识别中个人层面的公平性

链接: https://arxiv.org/abs/2406.06665
作者: Andreas Triantafyllopoulos,Björn Schuller
关键词: highly individualistic, Abstract, emotion, SER, contemporary speech emotion
中文关键词: 高度个人主义、抽象、情感、BER、当代言语情感
类目: Computation and Language (cs.CL)
备注: Accepted to INTERSPEECH 2024

Abstract:The expression of emotion is highly individualistic. However, contemporary speech emotion recognition (SER) systems typically rely on population-level models that adopt a ‘one-size-fits-all’ approach for predicting emotion. Moreover, standard evaluation practices measure performance also on the population level, thus failing to characterise how models work across different speakers. In the present contribution, we present a new method for capitalising on individual differences to adapt an SER model to each new speaker using a minimal set of enrolment utterances. In addition, we present novel evaluation schemes for measuring fairness across different speakers. Our findings show that aggregated evaluation metrics may obfuscate fairness issues on the individual-level, which are uncovered by our evaluation, and that our proposed method can improve performance both in aggregated and disaggregated terms.
摘要:情感的表达是高度个人化的。然而,当代语音情感识别(SER)系统通常依赖于群体级模型,这些模型采用“一刀切”的方法来预测情感。此外,标准评估实践同样在群体层面衡量性能,因此无法刻画模型在不同说话者身上的表现。在本文中,我们提出了一种利用个体差异的新方法,使用最少量的注册话语使SER模型适应每个新说话者。此外,我们还提出了新颖的评估方案来衡量不同说话者之间的公平性。我们的研究结果表明,汇总的评估指标可能会掩盖个体层面的公平性问题,而我们的评估揭示了这些问题;并且我们提出的方法在汇总和分解两个层面都能提高性能。

[NLP-91] SecureNet: A Comparative Study of DeBERTa and Large Language Models for Phishing Detection
[NLP-91] SecureNet:用于网络钓鱼检测的DeBERTa和大型语言模型的比较研究

链接: https://arxiv.org/abs/2406.06663
作者: Sakshi Mahendru,Tejul Pandit
关键词: revealing sensitive information, poses a major, sensitive information, social engineering, engineering to trick
中文关键词: 泄露敏感信息,构成重大敏感信息,社会工程,工程欺骗
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Preprint. 10 pages, Accepted in IEEE 7th International Conference on Big Data and Artificial Intelligence (BDAI 2024)

Abstract:Phishing, whether through email, SMS, or malicious websites, poses a major threat to organizations by using social engineering to trick users into revealing sensitive information. It not only compromises company’s data security but also incurs significant financial losses. In this paper, we investigate whether the remarkable performance of Large Language Models (LLMs) can be leveraged for particular task like text classification, particularly detecting malicious content and compare its results with state-of-the-art Deberta V3 (DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing) model. We systematically assess the potential and limitations of both approaches using comprehensive public datasets comprising diverse data sources such as email, HTML, URL, SMS, and synthetic data generation. Additionally, we demonstrate how LLMs can generate convincing phishing emails, making it harder to spot scams and evaluate the performance of both models in this context. Our study delves further into the challenges encountered by DeBERTa V3 during its training phases, fine-tuning methodology and transfer learning processes. Similarly, we examine the challenges associated with LLMs and assess their respective performance. Among our experimental approaches, the transformer-based DeBERTa method emerged as the most effective, achieving a test dataset (HuggingFace phishing dataset) recall (sensitivity) of 95.17% closely followed by GPT-4 providing a recall of 91.04%. We performed additional experiments with other datasets on the trained DeBERTa V3 model and LLMs like GPT 4 and Gemini 1.5. Based on our findings, we provide valuable insights into the effectiveness and robustness of these advanced language models, offering a detailed comparative analysis that can inform future research efforts in strengthening cybersecurity measures for detecting and mitigating phishing threats.
摘要:网络钓鱼,无论是通过电子邮件、短信还是恶意网站,都利用社会工程诱骗用户泄露敏感信息,对组织构成重大威胁。它不仅危及公司的数据安全,还会造成重大的经济损失。在本文中,我们研究了大语言模型(LLM)的出色性能能否用于文本分类等特定任务,特别是检测恶意内容,并将其结果与最先进的DeBERTa V3(采用ELECTRA风格预训练和梯度解耦嵌入共享的DeBERTa)模型进行比较。我们使用全面的公共数据集系统地评估了这两种方法的潜力和局限性,这些数据集涵盖电子邮件、HTML、URL、短信和合成生成数据等多种数据源。此外,我们还演示了LLM如何生成令人信服的钓鱼电子邮件,使诈骗更难被识别,并评估了这两种模型在此情境下的性能。我们的研究进一步探讨了DeBERTa V3在训练阶段、微调方法和迁移学习过程中遇到的挑战。同样,我们考察了与LLM相关的挑战,并评估了它们各自的表现。在我们的实验方法中,基于Transformer的DeBERTa方法最为有效,在测试数据集(HuggingFace钓鱼数据集)上达到了95.17%的召回率(敏感度),紧随其后的是GPT-4,召回率为91.04%。我们还在训练好的DeBERTa V3模型以及GPT 4和Gemini 1.5等LLM上使用其他数据集进行了额外的实验。基于我们的发现,我们对这些先进语言模型的有效性和稳健性提供了有价值的见解,并给出了详细的比较分析,可为今后加强检测和缓解网络钓鱼威胁的网络安全措施的研究工作提供参考。
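
The abstract above reports recall (sensitivity) as its headline metric. As a reminder of what that number measures, here is a minimal sketch of recall, TP / (TP + FN), for a binary phishing classifier. The function name `recall` and the toy labels are assumptions made for illustration; they are not taken from the paper or its datasets.

```python
def recall(y_true, y_pred, positive=1):
    """Recall (sensitivity) = TP / (TP + FN) for the given positive label."""
    # True positives: positive examples the classifier caught.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    # False negatives: positive examples the classifier missed.
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp + fn == 0:
        raise ValueError("no positive examples in y_true")
    return tp / (tp + fn)


# Toy labels (1 = phishing, 0 = legitimate), invented for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0, 1, 1, 0]
print(f"recall = {recall(y_true, y_pred):.2f}")
```

Recall ignores false positives (legitimate messages flagged as phishing), so it is usually read alongside precision.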

[NLP-92] Harnessing AI for efficient analysis of complex policy documents: a case study of Executive Order 14110
[NLP-92] 利用人工智能高效分析复杂政策文件:第14110号行政命令的案例研究

链接: https://arxiv.org/abs/2406.06657
作者: Mark A. Kramer,Allen Leavens,Alexander Scarlat
关键词: shaping society, crucial in shaping, Artificial intelligence, Policy, Policy documents
中文关键词: 塑造社会,塑造至关重要,人工智能,政策,政策文件
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: 28 pages, 1 figure

Abstract:Policy documents, such as legislation, regulations, and executive orders, are crucial in shaping society. However, their length and complexity make interpretation and application challenging and time-consuming. Artificial intelligence (AI), particularly large language models (LLMs), has the potential to automate the process of analyzing these documents, improving accuracy and efficiency. This study aims to evaluate the potential of AI in streamlining policy analysis and to identify the strengths and limitations of current AI approaches. The research focuses on question answering and tasks involving content extraction from policy documents. A case study was conducted using Executive Order 14110 on “Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence” as a test case. Four commercial AI systems were used to analyze the document and answer a set of representative policy questions. The performance of the AI systems was compared to manual analysis conducted by human experts. The study found that two AI systems, Gemini 1.5 Pro and Claude 3 Opus, demonstrated significant potential for supporting policy analysis, providing accurate and reliable information extraction from complex documents. They performed comparably to human analysts but with significantly higher efficiency. However, achieving reproducibility remains a challenge, necessitating further research and development.
摘要:法律、法规和行政命令等政策文件在塑造社会方面起着至关重要的作用。然而,它们的长度和复杂性使得解释和应用具有挑战性和耗时。人工智能(AI),特别是大型语言模型(LLM),有可能使分析这些文档的过程自动化,从而提高准确性和效率。这项研究旨在评估人工智能在简化政策分析方面的潜力,并确定当前人工智能方法的优势和局限性。研究的重点是问答和涉及政策文档内容提取的任务。以14110号行政命令“人工智能的安全、可靠和值得信赖的开发和使用”为测试案例,进行了案例研究。四个商业人工智能系统被用来分析这份文件,并回答了一组具有代表性的政策问题。将人工智能系统的性能与人类专家进行的手动分析进行了比较。研究发现,两个人工智能系统Gemini 1.5 Pro和Claude 3 Opus在支持政策分析方面显示出巨大的潜力,可以从复杂的文件中提供准确可靠的信息提取。他们的表现与人类分析师相当,但效率要高得多。然而,实现可重复性仍然是一项挑战,需要进一步的研究和开发。

[NLP-93] SignBLEU: Automatic Evaluation of Multi-channel Sign Language Translation
[NLP-93] SignBLEU:多通道手语翻译的自动评估

链接: https://arxiv.org/abs/2406.06648
作者: Jung-Ho Kim,Mathew Huerta-Enochian,Changyong Ko,Du Hui Lee
关键词: upper body movements, sign language, body movements, sign language translation, facial expressions
中文关键词: 上身动作、手语、身体动作、手语翻译、面部表情
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Published in LREC-Coling 2024

Abstract:Sign languages are multi-channel languages that communicate information through not just the hands (manual signals) but also facial expressions and upper body movements (non-manual signals). However, since automatic sign language translation is usually performed by generating a single sequence of glosses, researchers eschew non-manual and co-occurring manual signals in favor of a simplified list of manual glosses. This can lead to significant information loss and ambiguity. In this paper, we introduce a new task named multi-channel sign language translation (MCSLT) and present a novel metric, SignBLEU, designed to capture multiple signal channels. We validated SignBLEU on a system-level task using three sign language corpora with varied linguistic structures and transcription methodologies and examined its correlation with human judgment through two segment-level tasks. We found that SignBLEU consistently correlates better with human judgment than competing metrics. To facilitate further MCSLT research, we report benchmark scores for the three sign language corpora and release the source code for SignBLEU at this https URL.
摘要:手语是一种多渠道语言,不仅通过手(手动信号),而且还通过面部表情和上半身动作(非手动信号)来交流信息。然而,由于自动手语翻译通常是通过生成单个注释序列来执行的,研究人员避免使用非手动和共现的手动信号,而是支持简化的手动注释列表。这可能会导致严重的信息丢失和歧义。本文介绍了一种新的手语翻译任务–多通道手语翻译,并提出了一种用于捕获多个信号通道的新度量SignBLEU。我们使用三个具有不同语言结构和转录方法的手语语料库在系统级任务上验证了SignBLEU,并通过两个片段级任务检验了SignBLEU与人类判断的相关性。我们发现,与竞争指标相比,SignBLEU始终更好地与人类判断相关。为了促进MCSLT的进一步研究,我们报告了三个手语语料库的基准分数,并在这个HTTPS URL上发布了SignBLEU的源代码。

[NLP-94] Investigation of the Impact of Economic and Social Factors on Energy Demand through Natural Language Processing
[NLP-94] 通过自然语言处理研究经济和社会因素对能源需求的影响

链接: https://arxiv.org/abs/2406.06641
作者: Yun Bai,Simon Camal,Andrea Michiorri
关键词: activity and weather, energy demand, economic activity, relationship between energy, demand
中文关键词: 活动和天气、能源需求、经济活动、能源之间的关系、需求
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

Abstract:The relationship between energy demand and variables such as economic activity and weather is well established. However, this paper aims to explore the connection between energy demand and other social aspects, which receive little attention. Through the use of natural language processing on a large news corpus, we shed light on this important link. This study was carried out in five regions of the UK and Ireland and considers multiple horizons from 1 to 30 days. It also considers economic variables such as GDP, unemployment and inflation. We found that: 1) News about military conflicts, transportation, the global pandemic, regional economics, and the international energy market are related to electricity demand. 2) Economic indicators are more important in the East Midlands and Northern Ireland, while social indicators are more useful in the West Midlands and the South West of England. 3) The use of these indices improved forecasting performance by up to 9%.
摘要:能源需求与经济活动和天气等变量之间的关系已经得到充分证实。然而,本文旨在探索能源需求与其他社会方面之间的联系,而这些方面很少受到关注。通过对大型新闻语料库使用自然语言处理,我们揭示了这一重要联系。这项研究在英国和爱尔兰的五个地区进行,考虑了1至30天的多个预测时域。它还考虑GDP、失业率和通货膨胀等经济变量。我们发现:1)有关军事冲突、交通、全球疫情、区域经济和国际能源市场的新闻与电力需求有关。2)经济指标在东米德兰兹和北爱尔兰更为重要,而社会指标在西米德兰兹和英格兰西南部更为有用。3)使用这些指标使预测性能最多提高了9%。

[NLP-95] LLM Questionnaire Completion for Automatic Psychiatric Assessment
[NLP-95] 自动精神病学评估LLM问卷填写

链接: https://arxiv.org/abs/2406.06636
作者: Gony Rosenman,Lior Wolf,Talma Hendler
关键词: Large Language Model, Language Model, Large Language, structured questionnaires spanning, employ a Large
中文关键词: 大型语言模型,语言模型,大型语言,结构化调查问卷跨越,采用大型
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

Abstract:We employ a Large Language Model (LLM) to convert unstructured psychological interviews into structured questionnaires spanning various psychiatric and personality domains. The LLM is prompted to answer these questionnaires by impersonating the interviewee. The obtained answers are coded as features, which are used to predict standardized psychiatric measures of depression (PHQ-8) and PTSD (PCL-C), using a Random Forest regressor. Our approach is shown to enhance diagnostic accuracy compared to multiple baselines. It thus establishes a novel framework for interpreting unstructured psychological interviews, bridging the gap between narrative-driven and data-driven approaches for mental health assessment.
摘要:我们采用大语言模型(LLM)将非结构化心理访谈转化为跨越各种精神和性格领域的结构化调查问卷。LLM被提示通过冒充受访者回答这些调查问卷。获得的答案被编码为特征,用于使用随机森林回归量预测抑郁症(PHQ-8)和创伤后应激障碍(PCL-C)的标准化精神指标。与多个基线相比,我们的方法可以提高诊断准确性。因此,它建立了一个新颖的框架来解释非结构化心理访谈,弥合叙事驱动和数据驱动的心理健康评估方法之间的差距。

[NLP-96] Adversarial Tuning: Defending Against Jailbreak Attacks for LLMs
[NLP-96] 对抗性调整:防御LLM的越狱攻击

链接: https://arxiv.org/abs/2406.06622
作者: Fan Liu,Zhao Xu,Hao Liu
关键词: Large Language Models, enhanced Large Language, safely enhanced Large, Language Models, Large Language
中文关键词: 大型语言模型、增强型大型语言、安全增强型大型、语言模型、大型语言
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

Abstract:Although safely enhanced Large Language Models (LLMs) have achieved remarkable success in tackling various complex tasks in a zero-shot manner, they remain susceptible to jailbreak attacks, particularly the unknown jailbreak attack. To enhance LLMs’ generalized defense capabilities, we propose a two-stage adversarial tuning framework, which generates adversarial prompts to explore worst-case scenarios by optimizing datasets containing pairs of adversarial prompts and their safe responses. In the first stage, we introduce the hierarchical meta-universal adversarial prompt learning to efficiently and effectively generate token-level adversarial prompts. In the second stage, we propose the automatic adversarial prompt learning to iteratively refine semantic-level adversarial prompts, further enhancing LLM’s defense capabilities. We conducted comprehensive experiments on three widely used jailbreak datasets, comparing our framework with six defense baselines under five representative attack scenarios. The results underscore the superiority of our proposed methods. Furthermore, our adversarial tuning framework exhibits empirical generalizability across various attack strategies and target LLMs, highlighting its potential as a transferable defense mechanism.
摘要:尽管经过安全增强的大语言模型(LLM)以零样本方式在处理各种复杂任务方面取得了显著的成功,但它们仍然容易受到越狱攻击,特别是未知的越狱攻击。为了增强LLM的泛化防御能力,我们提出了一个两阶段对抗性调整框架,该框架通过优化包含对抗性提示及其安全响应配对的数据集,生成对抗性提示以探索最坏的情况。在第一阶段,我们引入了层次化的元通用对抗性提示学习来高效、有效地生成令牌级对抗性提示。在第二阶段,我们提出了自动对抗性提示学习来迭代提炼语义级的对抗性提示,进一步增强了LLM的防御能力。我们在三个广泛使用的越狱数据集上进行了全面的实验,在五个典型攻击场景下将我们的框架与六个防御基线进行了比较。结果强调了我们所提出的方法的优越性。此外,我们的对抗性调整框架在各种攻击策略和目标LLM上展示了经验上的泛化能力,突出了其作为一种可迁移防御机制的潜力。

[NLP-97] LinkQ: An LLM-Assisted Visual Interface for Knowledge Graph Question-Answering
[NLP-97] LinkQ:用于知识图谱问答的LLM辅助可视化界面

链接: https://arxiv.org/abs/2406.06621
作者: Harry Li,Gabriel Appleby,Ashley Suh
关键词: large language model, facilitate knowledge graph, natural language question-answering, leverages a large, construction through natural
中文关键词: 大型语言模型,促进知识图谱,自然语言问答,利用大型,通过自然构建
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

Abstract:We present LinkQ, a system that leverages a large language model (LLM) to facilitate knowledge graph (KG) query construction through natural language question-answering. Traditional approaches often require detailed knowledge of complex graph querying languages, limiting the ability for users – even experts – to acquire valuable insights from KG data. LinkQ simplifies this process by first interpreting a user’s question, then converting it into a well-formed KG query. By using the LLM to construct a query instead of directly answering the user’s question, LinkQ guards against the LLM hallucinating or generating false, erroneous information. By integrating an LLM into LinkQ, users are able to conduct both exploratory and confirmatory data analysis, with the LLM helping to iteratively refine open-ended questions into precise ones. To demonstrate the efficacy of LinkQ, we conducted a qualitative study with five KG practitioners and distill their feedback. Our results indicate that practitioners find LinkQ effective for KG question-answering, and desire future LLM-assisted systems for the exploratory analysis of graph databases.
摘要:我们介绍了LinkQ系统,它利用大型语言模型(LLM),通过自然语言问答来辅助知识图谱(KG)查询的构建。传统方法通常要求用户详细掌握复杂的图查询语言,这限制了用户(甚至是专家)从KG数据中获得有价值见解的能力。LinkQ首先解释用户的问题,然后将其转换为格式良好的KG查询,从而简化了这一过程。通过使用LLM构建查询,而不是直接回答用户的问题,LinkQ可以防止LLM产生幻觉或生成虚假、错误的信息。通过将LLM集成到LinkQ中,用户能够进行探索性和验证性数据分析,LLM则帮助将开放式问题迭代细化为精确的问题。为了证明LinkQ的有效性,我们对五名KG从业者进行了定性研究,并提炼了他们的反馈。我们的结果表明,从业者认为LinkQ对KG问答是有效的,并希望未来出现用于图数据库探索性分析的LLM辅助系统。

[NLP-98] DualTime: A Dual-Adapter Multimodal Language Model for Time Series Representation
[NLP-98] DualTime:用于时间序列表示的双适配器多模态语言模型

链接: https://arxiv.org/abs/2406.06620
作者: Weiqi Zhang,Jiexia Ye,Ziyue Li,Jia Li,Fugee Tsung
关键词: time series multimodal, recent rapid development, time series, including multimodal time, multimodal time series
中文关键词: 时间序列多模态,最近快速发展,时间序列,包括多模态时间,多模态时间序列
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 15 pages, 12 figures, 5 tables

Abstract:The recent rapid development of language models (LMs) has attracted attention in the field of time series, including multimodal time series modeling. However, we note that current time series multimodal methods are biased, often assigning a primary role to one modality while the other assumes a secondary role. They overlook the mutual benefits and complementary of different modalities. For example, in seizure diagnosis, relying solely on textual clinical reports makes it difficult to pinpoint the area and type of the disease, while electroencephalograms (EEGs) alone cannot provide an accurate diagnosis without considering the symptoms. In this study, based on the complementary information mining of time series multimodal data, we propose DualTime, a Dual-adapter multimodal language model for Time series representation implementing temporal-primary and textual-primary modeling simultaneously. By injecting lightweight adaption tokens, the LM pipeline shared by dual adapters encourages embedding alignment and achieves efficient fine-tuning. Empirically, our method outperforms state-of-the-art models in both supervised and unsupervised settings, highlighting the complementary benefits of different modalities. In addition, we conduct few-shot label transfer experiments, which further verifies the transferability and expressiveness of our proposed DualTime.
摘要:近年来,语言模型(LM)的快速发展引起了时间序列领域的关注,其中包括多模态时间序列建模。然而,我们注意到,当前的时间序列多模态方法存在偏向,往往让一种模态承担主要角色,而另一种模态承担次要角色。它们忽视了不同模态之间的互利与互补。例如,在癫痫诊断中,仅依靠文字临床报告很难准确定位疾病的区域和类型,而如果不考虑症状,仅靠脑电图(EEG)也无法提供准确的诊断。本文在对时间序列多模态数据进行互补信息挖掘的基础上,提出了一种用于时间序列表示的双适配器多模态语言模型DualTime,该模型同时实现了以时间序列为主和以文本为主的建模。通过注入轻量级适配令牌,双适配器共享的LM管道促进了嵌入对齐并实现了高效的微调。实验表明,我们的方法在有监督和无监督设置下都优于最先进的模型,突出了不同模态的互补优势。此外,我们还进行了少样本标签迁移实验,进一步验证了所提出的DualTime的可迁移性和表达能力。

[NLP-99] Transforming Dental Diagnostics with Artificial Intelligence: Advanced Integration of ChatGPT and Large Language Models for Patient Care
[NLP-99] 用人工智能改造牙科诊断:ChatGPT和大型语言模型的高级集成用于患者护理

链接: https://arxiv.org/abs/2406.06616
作者: Masoumeh Farhadi Nia,Mohsen Ahmadi,Elyas Irankhah
关键词: Large Language Models, natural language processing, Large Language, Artificial intelligence, algorithms and Large
中文关键词: 大型语言模型、自然语言处理、大型语言、人工智能、算法和大型
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

Abstract:Artificial intelligence has dramatically reshaped our interaction with digital technologies, ushering in an era where advancements in AI algorithms and Large Language Models (LLMs) have natural language processing (NLP) systems like ChatGPT. This study delves into the impact of cutting-edge LLMs, notably OpenAI’s ChatGPT, on medical diagnostics, with a keen focus on the dental sector. Leveraging publicly accessible datasets, these models augment the diagnostic capabilities of medical professionals, streamline communication between patients and healthcare providers, and enhance the efficiency of clinical procedures. The advent of ChatGPT-4 is poised to make substantial inroads into dental practices, especially in the realm of oral surgery. This paper sheds light on the current landscape and explores potential future research directions in the burgeoning field of LLMs, offering valuable insights for both practitioners and developers. Furthermore, it critically assesses the broad implications and challenges within various sectors, including academia and healthcare, thus mapping out an overview of AI’s role in transforming dental diagnostics for enhanced patient care.
摘要:人工智能极大地重塑了我们与数字技术的互动,开创了一个人工智能算法和大语言模型(LLM)的进步拥有ChatGPT等自然语言处理(NLP)系统的时代。这项研究深入探讨了尖端LLM,特别是OpenAI的ChatGPT对医疗诊断的影响,并将重点放在牙科行业。利用可公开访问的数据集,这些模型增强了医疗专业人员的诊断能力,简化了患者和医疗保健提供者之间的沟通,并提高了临床程序的效率。ChatGPT-4的出现将在牙科实践中取得实质性进展,特别是在口腔外科领域。本文阐明了当前的情况,并探索了新兴的LLM领域潜在的未来研究方向,为从业人员和开发人员提供了有价值的见解。此外,它还批判性地评估了包括学术界和医疗保健在内的不同行业的广泛影响和挑战,从而概述了人工智能在变革牙科诊断以加强患者护理方面的作用。

[NLP-100] Language Guided Skill Discovery
[NLP-100] 语言引导的技能发现

链接: https://arxiv.org/abs/2406.06615
作者: Seungeun Rho,Laura Smith,Tianyu Li,Sergey Levine,Xue Bin Peng,Sehoon Ha
关键词: learn diverse emergent, Skill discovery, explicit rewards, skills, Skill discovery methods
中文关键词: 学习多样化的新兴、技能发现、明确的奖励、技能、技能发现方法
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注:

Abstract:Skill discovery methods enable agents to learn diverse emergent behaviors without explicit rewards. To make learned skills useful for unknown downstream tasks, obtaining a semantically diverse repertoire of skills is essential. While some approaches introduce a discriminator to distinguish skills and others aim to increase state coverage, no existing work directly addresses the “semantic diversity” of skills. We hypothesize that leveraging the semantic knowledge of large language models (LLMs) can lead us to improve semantic diversity of resulting behaviors. In this sense, we introduce Language Guided Skill Discovery (LGSD), a skill discovery framework that aims to directly maximize the semantic diversity between skills. LGSD takes user prompts as input and outputs a set of semantically distinctive skills. The prompts serve as a means to constrain the search space into a semantically desired subspace, and the generated LLM outputs guide the agent to visit semantically diverse states within the subspace. We demonstrate that LGSD enables legged robots to visit different user-intended areas on a plane by simply changing the prompt. Furthermore, we show that language guidance aids in discovering more diverse skills compared to five existing skill discovery methods in robot-arm manipulation environments. Lastly, LGSD provides a simple way of utilizing learned skills via natural language.
摘要:技能发现方法使智能体能够在没有明确奖励的情况下学习多样的涌现行为。要使学到的技能对未知的下游任务有用,获得语义多样化的技能库是必不可少的。虽然一些方法引入了鉴别器来区分技能,而另一些方法旨在增加状态覆盖率,但现有的工作都没有直接解决技能的“语义多样性”问题。我们假设,利用大型语言模型(LLM)的语义知识可以帮助我们提高所得行为的语义多样性。基于此,我们引入了语言引导的技能发现(LGSD),这是一个旨在直接最大化技能之间语义多样性的技能发现框架。LGSD将用户提示作为输入,并输出一组语义上不同的技能。提示用作将搜索空间约束到语义上所需的子空间的手段,生成的LLM输出引导智能体访问子空间内语义不同的状态。我们演示了LGSD使足式机器人只需更改提示即可访问平面上不同的用户预期区域。此外,我们还表明,在机械臂操作环境中,与现有的五种技能发现方法相比,语言引导有助于发现更多样化的技能。最后,LGSD提供了一种通过自然语言利用所学技能的简单方法。

[NLP-101] GameBench: Evaluating Strategic Reasoning Abilities of LLM Agents
[NLP-101] GameBench:评估LLM代理的战略推理能力

链接: https://arxiv.org/abs/2406.06613
作者: Anthony Costarelli,Mat Allen,Roman Hauksson,Grace Sodunke,Suhas Hariharan,Carlson Cheng,Wenjie Li,Arjun Yadav
关键词: demonstrated remarkable few-shot, language understanding tasks, Large language models, natural language understanding, remarkable few-shot performance
中文关键词: 展示了非凡的少样本、语言理解任务、大型语言模型、自然语言理解、非凡的少样本性能
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

Abstract:Large language models have demonstrated remarkable few-shot performance on many natural language understanding tasks. Despite several demonstrations of using large language models in complex, strategic scenarios, there lacks a comprehensive framework for evaluating agents’ performance across various types of reasoning found in games. To address this gap, we introduce GameBench, a cross-domain benchmark for evaluating strategic reasoning abilities of LLM agents. We focus on 9 different game environments, where each covers at least one axis of key reasoning skill identified in strategy games, and select games for which strategy explanations are unlikely to form a significant portion of models’ pretraining corpuses. Our evaluations use GPT-3 and GPT-4 in their base form along with two scaffolding frameworks designed to enhance strategic reasoning ability: Chain-of-Thought (CoT) prompting and Reasoning Via Planning (RAP). Our results show that none of the tested models match human performance, and at worst GPT-4 performs worse than random action. CoT and RAP both improve scores but not comparable to human levels.
摘要:大型语言模型在许多自然语言理解任务中表现出了出色的少样本性能。尽管已有若干在复杂的战略场景中使用大型语言模型的演示,但仍缺乏一个全面的框架来评估智能体在游戏中各类推理上的表现。为了弥补这一差距,我们引入了GameBench,这是一个用于评估LLM智能体战略推理能力的跨领域基准。我们关注9个不同的游戏环境,每个环境至少涵盖策略游戏中确定的关键推理技能的一个维度,并选择策略解释不太可能构成模型预训练语料库重要部分的游戏。我们的评估使用了基础形式的GPT-3和GPT-4,以及两个旨在增强战略推理能力的脚手架框架:思维链(CoT)提示和通过规划进行推理(RAP)。我们的结果表明,所有测试的模型都达不到人类的表现,在最差的情况下,GPT-4的表现甚至不如随机动作。CoT和RAP都提高了分数,但仍无法与人类水平相提并论。

[NLP-102] Reinterpreting the Company a Word Keeps: Towards Explainable and Ontologically Grounded Language Models
[NLP-102] 重新诠释词语的“同伴”:迈向可解释且具有本体论基础的语言模型

链接: https://arxiv.org/abs/2406.06610
作者: Walid S. Saba
关键词: large language models, successful bottom-up strategy, relative success, success of large, reverse engineering
中文关键词: 大型语言模型、成功的自下而上策略、相对成功、大型的成功、逆向工程
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 12 pages, 4 figures. arXiv admin note: text overlap with arXiv:2308.14199 , arXiv:2306.00017

Abstract:We argue that the relative success of large language models (LLMs) is not a reflection on the symbolic vs. subsymbolic debate but a reflection on employing a successful bottom-up strategy of a reverse engineering of language at scale. However, and due to their subsymbolic nature whatever knowledge these systems acquire about language will always be buried in millions of weights none of which is meaningful on its own, rendering such systems utterly unexplainable. Furthermore, and due to their stochastic nature, LLMs will often fail in making the correct inferences in various linguistic contexts that require reasoning in intensional, temporal, or modal contexts. To remedy these shortcomings we suggest employing the same successful bottom-up strategy employed in LLMs but in a symbolic setting, resulting in explainable, language-agnostic, and ontologically grounded language models.
摘要:我们认为,大型语言模型(LLM)的相对成功并不是对符号与亚符号之争的反映,而是反映了一种成功的自下而上策略,即对语言进行大规模逆向工程。然而,由于它们的亚符号性质,这些系统获得的关于语言的任何知识都将永远被埋藏在数百万个权重中,而其中任何一个权重本身都没有意义,从而使此类系统完全无法解释。此外,由于其随机性,LLM在需要进行内涵、时间或模态推理的各种语言语境中往往无法做出正确的推断。为了弥补这些缺陷,我们建议采用与LLM相同的成功自下而上策略,但要在符号环境中进行,从而产生可解释的、与语言无关的、具有本体论基础的语言模型。

[NLP-103] The Prompt Report: A Systematic Survey of Prompting Techniques
[NLP-103] 提示报告:提示技术的系统综述

链接: https://arxiv.org/abs/2406.06608
作者: Sander Schulhoff,Michael Ilie,Nishant Balepur,Konstantine Kahadze,Amanda Liu,Chenglei Si,Yinheng Li,Aayush Gupta,HyoJung Han,Sevien Schulhoff,Pranav Sandeep Dulepet,Saurav Vidyadhara,Dayeon Ki,Sweta Agrawal,Chau Pham,Gerson Kroiz,Feileen Li,Hudson Tao,Ashay Srivastava,Hevander Da Costa,Saloni Gupta,Megan L. Rogers,Inna Goncearenco,Giuseppe Sarli,Igor Galynker,Denis Peskoff,Marine Carpuat,Jules White,Shyamal Anadkat,Alexander Hoyle,Philip Resnik
关键词: Generative Artificial Intelligence, Generative Artificial, Artificial Intelligence, research settings, increasingly deployed
中文关键词: 生成人工智能,生成人工,人工智能,研究环境,越来越多地部署
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

Abstract:Generative Artificial Intelligence (GenAI) systems are being increasingly deployed across all parts of industry and research settings. Developers and end users interact with these systems through the use of prompting or prompt engineering. While prompting is a widespread and highly researched concept, there exists conflicting terminology and a poor ontological understanding of what constitutes a prompt due to the area’s nascency. This paper establishes a structured understanding of prompts, by assembling a taxonomy of prompting techniques and analyzing their use. We present a comprehensive vocabulary of 33 vocabulary terms, a taxonomy of 58 text-only prompting techniques, and 40 techniques for other modalities. We further present a meta-analysis of the entire literature on natural language prefix-prompting.
摘要:生成式人工智能(GenAI)系统越来越多地部署在工业和研究环境的各个部分。开发人员和最终用户通过提示或提示工程与这些系统互动。虽然提示是一个广泛且经过深入研究的概念,但由于该领域尚处于起步阶段,存在相互矛盾的术语,并且对什么构成提示缺乏清晰的本体论理解。本文通过整理提示技术的分类体系并分析其使用,建立了对提示的结构化理解。我们提供了由33个词汇术语组成的全面词汇表、由58种纯文本提示技术组成的分类体系,以及40种用于其他模态的技术。我们进一步对有关自然语言前缀提示的全部文献进行了元分析。

[NLP-104] Prototypical Reward Network for Data-Efficient RLHF
[NLP-104] 面向数据高效RLHF的原型奖励网络

链接: https://arxiv.org/abs/2406.06606
作者: Jinghan Zhang,Xiting Wang,Yiqiao Jin,Changyu Chen,Xinhao Zhang,Kunpeng Liu
关键词: Large Language Models, fine-tuning Large Language, Reinforcement Learning, Large Language, Human Feedback
中文关键词: 大型语言模型、微调大型语言、强化学习、大型语言、人类反馈
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted by ACL 2024

Abstract:The reward model for Reinforcement Learning from Human Feedback (RLHF) has proven effective in fine-tuning Large Language Models (LLMs). Notably, collecting human feedback for RLHF can be resource-intensive and lead to scalability issues for LLMs and complex tasks. Our proposed framework Proto-RM leverages prototypical networks to enhance reward models under limited human feedback. By enabling stable and reliable structural learning from fewer samples, Proto-RM significantly enhances LLMs’ adaptability and accuracy in interpreting human preferences. Extensive experiments on various datasets demonstrate that Proto-RM significantly improves the performance of reward models and LLMs in human feedback tasks, achieving comparable and usually better results than traditional methods, while requiring significantly less data in data-limited scenarios. This research offers a promising direction for enhancing the efficiency of reward models and optimizing the fine-tuning of language models under restricted feedback conditions.
摘要:人类反馈强化学习(RLHF)的奖励模型已被证明在微调大型语言模型(LLM)方面是有效的。值得注意的是,为RLHF收集人类反馈可能是资源密集型的,并导致LLM和复杂任务的可扩展性问题。我们提出的框架Proto-RM利用原型网络来增强有限人类反馈下的奖励模型。通过从更少的样本中实现稳定和可靠的结构学习,Proto-RM显著提高了LLM在解释人类偏好方面的适应性和准确性。在不同数据集上的广泛实验表明,Proto-RM显著提高了奖励模型和LLM在人类反馈任务中的性能,在数据受限的情况下获得了与传统方法相当且通常更好的结果,同时需要的数据显著减少。本研究为提高奖励模型的效率和优化受限反馈条件下语言模型的微调提供了一个有希望的方向。
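
The abstract above builds on prototypical networks. As a rough illustration of the general idea only (not of Proto-RM itself, whose architecture is not described here), the sketch below averages a few labeled embeddings into one prototype per class and assigns a query to the nearest prototype. The two-dimensional vectors, the labels `preferred`/`rejected`, and all function names are hypothetical.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def build_prototypes(support):
    """Map each label to the mean of its support embeddings (its prototype)."""
    return {label: mean_vector(vecs) for label, vecs in support.items()}


def classify(query, prototypes):
    """Return the label whose prototype is closest in Euclidean distance."""
    return min(prototypes, key=lambda label: math.dist(query, prototypes[label]))


# Hypothetical 2-D embeddings; a real system would obtain these from an encoder.
support = {
    "preferred": [[0.9, 0.1], [0.8, 0.2]],
    "rejected": [[0.1, 0.9], [0.2, 0.7]],
}
prototypes = build_prototypes(support)
print(classify([0.7, 0.3], prototypes))
```

Because each class is summarized by a single mean vector, a handful of examples per class is enough to define a classifier, which is why this family of methods is often used when labeled data is scarce.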

[NLP-105] A Human-in-the-Loop Approach to Improving Cross-Text Prosody Transfer
[NLP-105] 改善跨文本韵律迁移的人在环方法

链接: https://arxiv.org/abs/2406.06601
作者: Himanshu Maurya,Atli Sigurgeirsson
关键词: generate varied prosodic, target text, generate varied, text, TTS
中文关键词: 生成不同的韵律、目标文本、生成不同的、文本、TTS
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: 4 pages (+1 references), 4 figures, to be presented at Interspeech 2024

Abstract:Text-To-Speech (TTS) prosody transfer models can generate varied prosodic renditions, for the same text, by conditioning on a reference utterance. These models are trained with a reference that is identical to the target utterance. But when the reference utterance differs from the target text, as in cross-text prosody transfer, these models struggle to separate prosody from text, resulting in reduced perceived naturalness. To address this, we propose a Human-in-the-Loop (HitL) approach. HitL users adjust salient correlates of prosody to make the prosody more appropriate for the target text, while maintaining the overall reference prosodic effect. Human adjusted renditions maintain the reference prosody while being rated as more appropriate for the target text 57.8% of the time. Our analysis suggests that limited user effort suffices for these improvements, and that closeness in the latent reference space is not a reliable prosodic similarity metric for the cross-text condition.
摘要:文本到语音(TTS)韵律迁移模型可以通过以参考话语为条件,为同一文本生成不同的韵律演绎。这些模型使用与目标话语相同的参考进行训练。但是,当参考话语与目标文本不同时(如跨文本韵律迁移),这些模型难以将韵律与文本分离,从而降低了感知自然度。为了解决这个问题,我们提出了一种人在环(HitL)方法。HitL用户调整韵律的显著相关特征,使韵律更适合目标文本,同时保持整体的参考韵律效果。人工调整后的演绎保持了参考韵律,同时在57.8%的情况下被评为更适合目标文本。我们的分析表明,有限的用户投入就足以实现这些改进,并且潜在参考空间中的接近程度并不是跨文本条件下可靠的韵律相似性度量。

[NLP-106] HORAE: A Domain-Agnostic Modeling Language for Automating Multimodal Service Regulation
[NLP-106] HORAE:一种用于自动化多模态服务监管的领域无关建模语言

链接: https://arxiv.org/abs/2406.06600
作者: Yutao Sun,Mingshuai Chen,Kangjia Zhao,He Li,Jintao Chen,Linyu Yang,Zhongyi Wang,Tiancheng Zhao,Jianwei Yin
关键词: Artificial intelligence, intelligence is rapidly, rapidly encroaching, service regulation, intelligent service regulation
中文关键词: 人工智能、智能正在迅速、迅速入侵、服务监管、智能服务监管
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

Abstract:Artificial intelligence is rapidly encroaching on the field of service regulation. This work presents the design principles behind HORAE, a unified specification language to model multimodal regulation rules across a diverse set of domains. We show how HORAE facilitates an intelligent service regulation pipeline by further exploiting a fine-tuned large language model named HORAE that automates the HORAE modeling process, thereby yielding an end-to-end framework for fully automated intelligent service regulation.
摘要:人工智能正在迅速入侵服务监管领域。这项工作展示了HORAE背后的设计原则,HORAE是一种统一的规范语言,用于对不同领域的多模态监管规则进行建模。我们展示了HORAE如何通过进一步利用名为HORAE的微调大型语言模型来促进智能服务监管管道,该模型可自动化HORAE建模过程,从而为完全自动化的智能服务监管提供端到端框架。

[NLP-107] Anna Karenina Strikes Again: Pre-Trained LLM Embeddings May Favor High-Performing Learners
[NLP-107] 安娜·卡列尼娜(Anna Karenina)再次出击:预训练的LLM嵌入可能有利于高绩效学习者

链接: https://arxiv.org/abs/2406.06599
作者: Abigail Gurin Schleifer,Beata Beigman Klebanov,Moriah Ariely,Giora Alexandron
关键词: pedagogically meaningful information, captures pedagogically meaningful, pre-trained LLM embeddings, student responses, responses to open-ended
中文关键词: 具有教学意义的信息、捕获具有教学意义的、预训练的LLM嵌入、学生回答、对开放式的回答
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: 9 pages (not including bibliography), Appendix and 10 tables. Accepted to the 19th Workshop on Innovative Use of NLP for Building Educational Applications, Co-located with NAACL 2024

Abstract:Unsupervised clustering of student responses to open-ended questions into behavioral and cognitive profiles using pre-trained LLM embeddings is an emerging technique, but little is known about how well this captures pedagogically meaningful information. We investigate this in the context of student responses to open-ended questions in biology, which were previously analyzed and clustered by experts into theory-driven Knowledge Profiles (KPs). Comparing these KPs to ones discovered by purely data-driven clustering techniques, we report poor discoverability of most KPs, except for the ones including the correct answers. We trace this “discoverability bias” to the representations of KPs in the pre-trained LLM embeddings space.
摘要:使用预训练的LLM嵌入将学生对开放式问题的回答无监督地聚类为行为和认知画像是一种新兴技术,但人们对这种技术捕获具有教学意义的信息的能力知之甚少。我们在学生对生物学开放式问题的回答的背景下研究这一点,这些回答此前已由专家分析并聚类为理论驱动的知识画像(KP)。将这些KP与通过纯数据驱动的聚类技术发现的KP进行比较,我们发现除了包含正确答案的KP之外,大多数KP的可发现性较差。我们将这种“可发现性偏差”追溯到KP在预训练LLM嵌入空间中的表示。
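
The abstract above compares expert-defined Knowledge Profiles with clusters found by unsupervised methods. One way such a comparison could be quantified is sketched below: for each expert profile, take the best F1 score against any single cluster. This measure, the function name `best_match_f1`, and the toy labels are assumptions chosen for illustration; the abstract does not state which measure the authors use.

```python
from collections import Counter


def best_match_f1(profiles, clusters):
    """For each expert profile label, return its best F1 against any cluster."""
    pair_counts = Counter(zip(profiles, clusters))
    profile_sizes = Counter(profiles)
    cluster_sizes = Counter(clusters)
    scores = {}
    for profile in profile_sizes:
        best = 0.0
        for cluster in cluster_sizes:
            overlap = pair_counts[(profile, cluster)]
            if overlap == 0:
                continue
            # Precision: share of the cluster that belongs to this profile.
            precision = overlap / cluster_sizes[cluster]
            # Recall: share of the profile captured by this cluster.
            rec = overlap / profile_sizes[profile]
            best = max(best, 2 * precision * rec / (precision + rec))
        scores[profile] = best
    return scores


# Toy data: one expert profile label and one cluster id per student response.
profiles = ["correct", "correct", "correct", "partial", "partial", "naive", "naive"]
clusters = [0, 0, 0, 1, 2, 1, 2]
scores = best_match_f1(profiles, clusters)
for profile, score in scores.items():
    print(profile, round(score, 2))
```

In this toy data the "correct" profile maps cleanly onto one cluster while the other two profiles are split across clusters, which mirrors the kind of asymmetry the abstract calls a discoverability bias.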

[NLP-108] Qabas: An Open-Source Arabic Lexicographic Database
[NLP-108] Qabas:开源阿拉伯语词典数据库

链接: https://arxiv.org/abs/2406.06598
作者: Mustafa Jarrar,Tymaa Hammouda
关键词: NLP applications, designed for NLP, Arabic lexicon designed, Qabas, Arabic lexicon
中文关键词: NLP应用程序,为NLP设计,阿拉伯语词典设计,Qabas,阿拉伯语词典
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

Abstract:We present Qabas, a novel open-source Arabic lexicon designed for NLP applications. The novelty of Qabas lies in its synthesis of 110 lexicons. Specifically, Qabas lexical entries (lemmas) are assembled by linking lemmas from 110 lexicons. Furthermore, Qabas lemmas are also linked to 12 morphologically annotated corpora (about 2M tokens), making it the first Arabic lexicon to be linked to lexicons and corpora. Qabas was developed semi-automatically, utilizing a mapping framework and a web-based tool. Compared with other lexicons, Qabas stands as the most extensive Arabic lexicon, encompassing about 58K lemmas (45K nominal lemmas, 12.5K verbal lemmas, and 473 functional-word lemmas). Qabas is open-source and accessible online at this https URL.
摘要:我们介绍Qabas,这是一个为NLP应用程序设计的新型开源阿拉伯语词典。Qabas的新颖之处在于它综合了110个词典。具体来说,Qabas词汇条目(引理)是通过链接来自110个词典的引理来组装的。此外,Qabas引理还与12个形态注释的数据库(约200万个记号)相关联,使其成为第一个与词典和数据库相关联的阿拉伯语词典。Qabas是利用地图框架和基于网络的工具半自动开发的。与其他词典相比,Qabas是最广泛的阿拉伯语词典,包含约58 K个引理(45 K个名词引理、12.5 K个动词引理和473个功能词引理)。Qabas是开源的,可以通过httpsURL在线访问。

[NLP-109] Are Large Language Models the New Interface for Data Pipelines?
[NLP-109] 大型语言模型是数据管道的新接口吗?

链接: https://arxiv.org/abs/2406.06596
作者: Sylvio Barbon Junior,Paolo Ceravolo,Sven Groppe,Mustafa Jarrar,Samira Maghool,Florence Sèdes,Soror Sahri,Maurice Van Keulen
关键词: generate human communication, Large Language Models, Automated Machine Learning, term that encompasses, encompasses various types
中文关键词: 生成人类通信、大型语言模型、自动机器学习,涵盖各种类型的术语
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB)
备注:

Abstract:A Language Model is a term that encompasses various types of models designed to understand and generate human communication. Large Language Models (LLMs) have gained significant attention due to their ability to process text with human-like fluency and coherence, making them valuable for a wide range of data-related tasks fashioned as pipelines. The capabilities of LLMs in natural language understanding and generation, combined with their scalability, versatility, and state-of-the-art performance, enable innovative applications across various AI-related fields, including eXplainable Artificial Intelligence (XAI), Automated Machine Learning (AutoML), and Knowledge Graphs (KG). Furthermore, we believe these models can extract valuable insights and make data-driven decisions at scale, a practice commonly referred to as Big Data Analytics (BDA). In this position paper, we provide some discussions in the direction of unlocking synergies among these technologies, which can lead to more powerful and intelligent AI solutions, driving improvements in data pipelines across a wide range of applications and domains integrating humans, computers, and knowledge.
摘要:语言模型是一个包含各种类型的模型的术语,这些模型旨在理解和生成人类交流。大型语言模型(LLM)由于能够像人类一样流畅和连贯地处理文本,因此受到了极大的关注,这使得它们对于以管道形式形成的各种与数据相关的任务来说是有价值的。LLMS在自然语言理解和生成方面的能力,与其可扩展性、多功能性和最先进的性能相结合,使各种与人工智能相关的领域能够实现创新应用,包括可解释人工智能(XAI)、自动机器学习(AutoML)和知识图(KG)。此外,我们相信这些模型可以提取有价值的见解,并在规模上做出数据驱动的决策,这种做法通常称为大数据分析(BDA)。在这份立场文件中,我们提供了一些关于释放这些技术之间的协同效应的讨论,这些技术可以带来更强大和智能的人工智能解决方案,推动整合人类、计算机和知识的广泛应用程序和领域的数据管道的改进。

[NLP-110] Improve Mathematical Reasoning in Language Models by Automated Process Supervision
[NLP-110] 通过自动化过程监督改进语言模型中的数学推理

链接: https://arxiv.org/abs/2406.06592
作者: Liangchen Luo,Yinxiao Liu,Rosanne Liu,Samrat Phatale,Harsh Lara,Yunxuan Li,Lei Shu,Yun Zhu,Lei Meng,Jiao Sun,Abhinav Rastogi
关键词: Complex multi-step reasoning, solving mathematical problems, advanced large language, Complex multi-step, large language models
中文关键词: 复杂的多步推理,解决数学问题,高级大型语言,复杂的多步,大型语言模型
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 18 pages, 5 figures, 1 table

Abstract:Complex multi-step reasoning tasks, such as solving mathematical problems or generating code, remain a significant hurdle for even the most advanced large language models (LLMs). Verifying LLM outputs with an Outcome Reward Model (ORM) is a standard inference-time technique aimed at enhancing the reasoning performance of LLMs. However, this still proves insufficient for reasoning tasks with a lengthy or multi-hop reasoning chain, where the intermediate outcomes are neither properly rewarded nor penalized. Process supervision addresses this limitation by assigning intermediate rewards during the reasoning process. To date, the methods used to collect process supervision data have relied on either human annotation or per-step Monte Carlo estimation, both prohibitively expensive to scale, thus hindering the broad application of this technique. In response to this challenge, we propose a novel divide-and-conquer style Monte Carlo Tree Search (MCTS) algorithm named OmegaPRM for the efficient collection of high-quality process supervision data. This algorithm swiftly identifies the first error in the Chain of Thought (CoT) with binary search and balances the positive and negative examples, thereby ensuring both efficiency and quality. As a result, we are able to collect over 1.5 million process supervision annotations to train a Process Reward Model (PRM). Utilizing this fully automated process supervision alongside the weighted self-consistency algorithm, we have enhanced the instruction-tuned Gemini Pro model’s math reasoning performance, achieving a 69.4% success rate on the MATH benchmark, a 36% relative improvement from the 51% base model performance. Additionally, the entire process operates without any human intervention, making our method both financially and computationally cost-effective compared to existing methods.
摘要:复杂的多步推理任务,如解决数学问题或生成代码,即使对最先进的大型语言模型(LLM)来说也仍然是一个重大障碍。使用结果奖励模型(ORM)验证LLM输出是一种标准的推理时技术,旨在提高LLM的推理性能。然而,事实证明,这对于具有冗长或多跳推理链的推理任务来说仍然是不够的,在这种任务中,中间结果既不会得到适当的奖励,也不会受到惩罚。过程监督通过在推理过程中分配中间奖励来解决这一限制。到目前为止,用于收集过程监督数据的方法要么依赖于人工标注,要么依赖于逐步的蒙特卡罗估计,两者的扩展成本都高得令人望而却步,因此阻碍了这项技术的广泛应用。针对这一挑战,我们提出了一种新的分而治之式蒙特卡罗树搜索(MCTS)算法OmegaPRM,用于高效收集高质量的过程监督数据。该算法通过二分搜索快速识别思维链(CoT)中的第一个错误,并平衡正例和负例,从而兼顾效率和质量。由此,我们得以收集超过150万条过程监督标注来训练过程奖励模型(PRM)。将这种全自动的过程监督与加权自洽算法结合使用,我们提高了经指令微调的Gemini Pro模型的数学推理性能,在MATH基准测试中取得了69.4%的成功率,相对于基础模型51%的性能提升了36%。此外,整个过程无需任何人工干预,与现有方法相比,我们的方法在资金和计算成本上都更具效益。
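
The abstract's idea of locating the first error in a chain of thought by binary search can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the OmegaPRM implementation: `prefix_ok` is a hypothetical oracle standing in for whatever rollout-based check decides whether a prefix can still reach the correct answer, and the search assumes that once a prefix is judged wrong, every longer prefix is judged wrong too. The helper `relative_gain` simply re-derives the reported 36% relative improvement from the two quoted accuracies.

```python
from typing import Callable, Sequence

def first_error_index(steps: Sequence[str],
                      prefix_ok: Callable[[Sequence[str]], bool]) -> int:
    """Binary-search the first erroneous step; returns len(steps) if none is found.

    Assumes monotonicity: a wrong prefix stays wrong when extended.
    """
    lo, hi = 0, len(steps)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix_ok(steps[: mid + 1]):  # steps[0..mid] are still judged fine
            lo = mid + 1                 # first error, if any, lies to the right
        else:
            hi = mid                     # first error is at mid or earlier
    return lo

def relative_gain(new: float, base: float) -> float:
    """Relative improvement of `new` over `base`."""
    return (new - base) / base

# Hypothetical four-step chain whose third step (index 2) is wrong.
steps = ["parse problem", "set up equation", "sign error", "final answer"]
calls = []

def toy_oracle(prefix):
    calls.append(len(prefix))   # record how many oracle calls were needed
    return len(prefix) <= 2     # prefixes containing step 2 are judged wrong

idx = first_error_index(steps, toy_oracle)
print(idx, len(calls), round(relative_gain(69.4, 51.0), 2))
```

The point of the binary search is that the number of oracle calls grows with the logarithm of the chain length rather than linearly, which matters when each call is an expensive set of model rollouts.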

[NLP-111] Exploring Multilingual Large Language Models for Enhanced TNM classification of Radiology Report in lung cancer staging
[NLP-111] 探索多语言大语言模型以增强肺癌分期放射学报告的TNM分类

链接: https://arxiv.org/abs/2406.06591
作者: Hidetoshi Matsuo,Mizuho Nishio,Takaaki Matsunaga,Koji Fujimoto,Takamichi Murakami
关键词: remains underdeveloped due, Structured radiology reports, reports remains underdeveloped, Structured radiology, radiology reports
中文关键词: 由于结构化放射学报告,报告仍然欠发达,结构化放射学,放射学报告
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 16 pages, 3 figures

Abstract:Background: Structured radiology reports remain underdeveloped due to labor-intensive structuring and narrative-style reporting. Deep learning, particularly large language models (LLMs) like GPT-3.5, offers promise in automating the structuring of radiology reports in natural languages. However, although it has been reported that LLMs are less effective in languages other than English, their radiological performance has not been extensively studied. Purpose: This study aimed to investigate the accuracy of TNM classification based on radiology reports using GPT3.5-turbo (GPT3.5) and the utility of multilingual LLMs in both Japanese and English. Material and Methods: Utilizing GPT3.5, we developed a system to automatically generate TNM classifications from chest CT reports for lung cancer and evaluate its performance. We statistically analyzed the impact of providing full or partial TNM definitions in both languages using a Generalized Linear Mixed Model. Results: Highest accuracy was attained with full TNM definitions and radiology reports in English (M = 94%, N = 80%, T = 47%, and ALL = 36%). Providing definitions for each of the T, N, and M factors statistically improved their respective accuracies (T: odds ratio (OR) = 2.35, p < 0.001; N: OR = 1.94, p < 0.01; M: OR = 2.50, p < 0.001). Japanese reports exhibited decreased N and M accuracies (N accuracy: OR = 0.74 and M accuracy: OR = 0.21). Conclusion: This study underscores the potential of multilingual LLMs for automatic TNM classification in radiology reports. Even without additional model training, performance improvements were evident with the provided TNM definitions, indicating LLMs’ relevance in radiology contexts.
摘要:背景:由于劳动密集型的结构和叙事式的报道,结构化的放射学报告仍然不发达。深度学习,特别是像GPT-3.5这样的大型语言模型(LLM),在以自然语言自动构建放射学报告方面提供了希望。然而,尽管有报道称LLMS在英语以外的语言中效果较差,但其放射性能尚未得到广泛研究。目的:本研究旨在探讨基于GPT3.5-TURBO(GPT3.5)放射学报告的TNM分类的准确性以及日语和英语多语言LLM的使用情况。材料和方法:利用GPT3.5,我们开发了一个从肺癌胸部CT报告自动生成TNM分类的系统,并对其性能进行了评估。我们使用广义线性混合模型统计分析了用两种语言提供全部或部分TNM定义的影响。结果:完整的TNM定义和英文放射学报告的准确率最高(M=94%,N=80%,T=47%,ALL=36%)。对T、N和M因素中的每一个进行定义,在统计学上提高了它们各自的准确性(T:优势比(OR)=2.35,p 0.001;N:OR=1.94,p 0.01;M:OR=2.5,p 0.001)。日本报告显示N和M准确度降低(N准确度:OR=0.74,M准确度:OR=0.21)。结论:这项研究强调了多语言LLM在放射学报告中用于自动TNM分类的潜力。即使没有额外的模型培训,使用所提供的TNM定义,性能也有明显的改善,表明LLMS在放射学背景下的相关性。
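
The odds ratios quoted in the abstract act on odds, not directly on accuracy, so their effect depends on the starting accuracy. The sketch below shows the conversion as a worked formula. The baseline of 0.30 is a hypothetical value chosen only for illustration (the abstract does not report baseline accuracies), and `apply_odds_ratio` is a name introduced here.

```python
def apply_odds_ratio(baseline_prob: float, odds_ratio: float) -> float:
    """Return the probability obtained by multiplying the baseline odds by odds_ratio."""
    if not 0.0 < baseline_prob < 1.0:
        raise ValueError("baseline_prob must be strictly between 0 and 1")
    odds = baseline_prob / (1.0 - baseline_prob)  # probability -> odds
    new_odds = odds * odds_ratio                  # the odds ratio scales the odds
    return new_odds / (1.0 + new_odds)            # odds -> probability

# Hypothetical baseline accuracy of 0.30 combined with the reported OR for the T factor.
with_definitions = apply_odds_ratio(0.30, 2.35)
print(round(with_definitions, 3))
```

Under that assumed baseline, an odds ratio of 2.35 would lift accuracy to roughly 0.50. An odds ratio below 1, such as the 0.21 reported for M accuracy on Japanese reports, lowers it in the same way.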

[NLP-112] Are LLMs classical or nonmonotonic reasoners? Lessons from generics
[NLP-112] LLM是经典推理机还是非单调推理机?来自泛型句的启示

链接: https://arxiv.org/abs/2406.06590
作者: Alina Leidinger,Robert van Rooij,Ekaterina Shutova
关键词: Recent scholarship, supplied evidence, evidence of impressive, impressive performance, performance and flexible
中文关键词: 最近的奖学金、提供的证据、令人印象深刻的表现的证据、令人印象深刻的表现、表现和灵活性
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at ACL 2024 (main)

Abstract:Recent scholarship on reasoning in LLMs has supplied evidence of impressive performance and flexible adaptation to machine generated or human feedback. Nonmonotonic reasoning, crucial to human cognition for navigating the real world, remains a challenging, yet understudied task. In this work, we study nonmonotonic reasoning capabilities of seven state-of-the-art LLMs in one abstract and one commonsense reasoning task featuring generics, such as ‘Birds fly’, and exceptions, ‘Penguins don’t fly’ (see Fig. 1). While LLMs exhibit reasoning patterns in accordance with human nonmonotonic reasoning abilities, they fail to maintain stable beliefs on truth conditions of generics at the addition of supporting examples (‘Owls fly’) or unrelated information (‘Lions have manes’). Our findings highlight pitfalls in attributing human reasoning behaviours to LLMs, as well as assessing general capabilities, while consistent reasoning remains elusive.
摘要:最近关于LLM推理的学术研究提供了证据,表明其具有令人印象深刻的性能,并能灵活适应机器生成的反馈或人类反馈。非单调推理对于人类在现实世界中的认知至关重要,仍然是一项具有挑战性但研究不足的任务。在这项工作中,我们在一项抽象推理任务和一项常识推理任务中研究了七种最先进的LLM的非单调推理能力,这些任务涉及泛型句(如“鸟会飞”)和例外(如“企鹅不会飞”)(见图1)。虽然LLM表现出符合人类非单调推理能力的推理模式,但在添加支持示例(“猫头鹰会飞”)或不相关信息(“狮子有鬃毛”)时,它们无法对泛型句的真值条件保持稳定的信念。我们的研究结果凸显了将人类推理行为归因于LLM以及评估其一般能力时的陷阱,而一致的推理仍然难以实现。

[NLP-113] PatentEval: Understanding Errors in Patent Generation
[NLP-113] PatentEval:了解专利生成中的错误

链接: https://arxiv.org/abs/2406.06589
作者: You Zuo(ALMAnaCH),Kim Gerdes(LISN),Eric Villemonte de La Clergerie(ALMAnaCH),Benoît Sagot(ALMAnaCH)
关键词: comprehensive error typology, typology specifically designed, error typology specifically, introduce a comprehensive, comprehensive error
中文关键词: 全面的错误类型学,专门设计的类型学,专门的错误类型学,引入全面、全面的错误
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

Abstract:In this work, we introduce a comprehensive error typology specifically designed for evaluating two distinct tasks in machine-generated patent texts: claims-to-abstract generation, and the generation of the next claim given previous ones. We have also developed a benchmark, PatentEval, for systematically assessing language models in this context. Our study includes a comparative analysis, annotated by humans, of various models. These range from those specifically adapted during training for tasks within the patent domain to the latest general-purpose large language models (LLMs). Furthermore, we explored and evaluated some metrics to approximate human judgments in patent text evaluation, analyzing the extent to which these metrics align with expert assessments. These approaches provide valuable insights into the capabilities and limitations of current language models in the specialized field of patent text generation.
摘要:在这项工作中,我们引入了一种全面的错误类型学,专门为评估机器生成的专利文本中的两个不同任务而设计:权利要求到摘要的生成,以及在先前的权利要求的情况下生成下一个权利要求。我们还开发了一个基准PatentEval,用于在此背景下系统评估语言模型。我们的研究包括对各种模型进行的比较分析,并由人类注释。这些模型的范围从在专利领域任务培训期间专门调整的模型到最新的通用大型语言模型(LLM)。此外,我们探索和评估了一些指标来逼近专利文本评估中的人类判断,分析这些指标与专家评估的一致程度。这些方法为专利文本生成专业领域当前语言模型的能力和局限性提供了有价值的见解。

[NLP-114] Assessing the Emergent Symbolic Reasoning Abilities of Llama Large Language Models
[NLP-114] 评估Llama大型语言模型的涌现符号推理能力

链接: https://arxiv.org/abs/2406.06588
作者: Flavio Petruzzellis,Alberto Testolin,Alessandro Sperduti
关键词: Large Language Models, Large Language, achieve impressive performance, achieve impressive, fluently with users
中文关键词: 大型语言模型,大型语言,实现令人印象深刻的性能,实现令人印象深刻的、流畅的用户
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
备注: Accepted at 33rd International Conference on Artificial Neural Networks (ICANN24)

Abstract:Large Language Models (LLMs) achieve impressive performance in a wide range of tasks, even if they are often trained with the only objective of chatting fluently with users. Among other skills, LLMs show emergent abilities in mathematical reasoning benchmarks, which can be elicited with appropriate prompting methods. In this work, we systematically investigate the capabilities and limitations of popular open-source LLMs on different symbolic reasoning tasks. We evaluate three models of the Llama 2 family on two datasets that require solving mathematical formulas of varying degrees of difficulty. We test a generalist LLM (Llama 2 Chat) as well as two fine-tuned versions of Llama 2 (MAmmoTH and MetaMath) specifically designed to tackle mathematical problems. We observe that both increasing the scale of the model and fine-tuning it on relevant tasks lead to significant performance gains. Furthermore, using fine-grained evaluation measures, we find that such performance gains are mostly observed with mathematical formulas of low complexity, which nevertheless often remain challenging even for the largest fine-tuned models.
摘要:大型语言模型(LLM)在广泛的任务中取得了令人印象深刻的表现,即使他们接受的培训通常只有一个目标,就是流利地与用户交谈。在其他技能中,LLMS在数学推理基准中表现出涌现能力,这种能力可以通过适当的提示方法来激发。在这项工作中,我们系统地调查了流行的开源LLMS在不同的符号推理任务上的能力和局限性。我们在两个数据集上评估了Llama 2家族的三个模型,这些数据集需要求解不同难度的数学公式。我们测试了一个通才的LLM(Llama2chat)以及两个专门为解决数学问题而设计的Llama2的微调版本(猛犸象和MetaMath)。我们观察到,增加模型的规模和在相关任务中对其进行微调都会导致显著的性能提升。此外,使用细粒度的评估措施,我们发现,这样的性能收益大多是通过低复杂性的数学公式观察到的,然而,即使对于最大的微调模型,这些公式仍然经常具有挑战性。

[NLP-115] Exploring Human-AI Perception Alignment in Sensory Experiences: Do LLMs Understand Textile Hand?
[NLP-115] 探索感官体验中的人类与人工智能感知一致性:LLM理解织物手感吗?

链接: https://arxiv.org/abs/2406.06587
作者: Shu Zhong,Elia Gatti,Youngjun Cho,Marianna Obrist
关键词: Aligning large language, large language models, Aligning large, language models, large language
中文关键词: 调整大型语言、大型语言模型、调整大型语言模型、大型语言
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

Abstract:Aligning the behaviour of large language models (LLMs) with human intent is critical for future AI. An important yet often overlooked aspect of this alignment is the perceptual alignment. Perceptual modalities like touch are more multifaceted and nuanced compared to other sensory modalities such as vision. This work investigates how well LLMs align with human touch experiences using the “textile hand” task. We created a “Guess What Textile” interaction in which participants were given two textile samples – a target and a reference – to handle. Without seeing them, participants described the differences between them to the LLM. Using these descriptions, the LLM attempted to identify the target textile by assessing similarity within its high-dimensional embedding space. Our results suggest that a degree of perceptual alignment exists, but that it varies significantly among different textile samples. For example, LLM predictions are well aligned for silk satin, but not for cotton denim. Moreover, participants did not perceive their textile experiences as closely matched by the LLM predictions. This is only the first exploration into perceptual alignment around touch, exemplified through textile hand. We discuss possible sources of this alignment variance, and how better human-AI perceptual alignment can benefit future everyday tasks.
摘要:使大型语言模型(LLM)的行为与人类意图保持一致对于未来的人工智能至关重要。这种对齐的一个重要但经常被忽视的方面是知觉对齐。与视觉等其他感官形式相比,触觉等知觉形式更具多面性和细微差别。这项工作调查了LLM与人类使用“纺织之手”任务的触摸体验有多好的一致性。我们创建了一个“猜猜纺织品”互动,参与者被给予两个纺织品样本–一个目标和一个参考–来处理。参与者在没有看到他们的情况下,向LLM描述了他们之间的差异。使用这些描述,LLM试图通过评估其高维嵌入空间内的相似性来识别目标纺织品。我们的结果表明,存在一定程度的知觉一致性,但在不同的纺织品样本之间存在显著差异。例如,LLM对真丝绸缎的预测很好地吻合了,但对棉质牛仔布的预测就不是这样。此外,参与者并没有感觉到他们的纺织经历与LLM的预测非常吻合。这只是对围绕触觉的知觉排列的第一次探索,通过纺织之手就是例证。我们讨论了这种比对差异的可能来源,以及更好的人类-人工智能感知比对如何使未来的日常任务受益。

[NLP-116] Bi-Chainer: Automated Large Language Models Reasoning with Bidirectional Chaining
[NLP-116] Bi-Chainer:利用双向链式推理实现大型语言模型的自动推理

链接: https://arxiv.org/abs/2406.06586
作者: Shuqi Liu,Bowei He,Linqi Song
关键词: Large Language Models, Large Language, Language Models, complex logical problems, solving complex logical
中文关键词: 大型语言模型,大型语言,语言模型,复杂逻辑问题,解决复杂逻辑
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted by ACL 2024

Abstract:Large Language Models (LLMs) have shown human-like reasoning abilities but still face challenges in solving complex logical problems. Existing unidirectional chaining methods, such as forward chaining and backward chaining, suffer from issues like low prediction accuracy and efficiency. To address these, we propose a bidirectional chaining method, Bi-Chainer, which dynamically switches to depth-first reasoning in the opposite reasoning direction when it encounters multiple branching options within the current direction. Thus, the intermediate reasoning results can be utilized as guidance to facilitate the reasoning process. We show that Bi-Chainer achieves sizable accuracy boosts over unidirectional chaining frameworks on four challenging logical reasoning datasets. Moreover, Bi-Chainer enhances the accuracy of intermediate proof steps and reduces the average number of inference calls, resulting in more efficient and accurate reasoning.
摘要:大型语言模型(LLM)已经表现出类似人类的推理能力,但在解决复杂逻辑问题方面仍然面临挑战。现有的单向链接方法,例如前向链接和后向链接,存在预测准确性和效率低等问题。为了解决这些问题,我们提出了一种双向链接方法Bi-Chainer,当当前方向内遇到多个分支选项时,该方法会动态切换到相反推理方向的深度优先推理。因此,中间推理结果可以用作指导以促进推理过程。我们表明,Bi-Chainer在四个具有挑战性的逻辑推理数据集上实现了相当大的准确性引导。此外,双链增强了中间证明步骤的准确性,并减少了推理调用的平均次数,从而实现更高效、更准确的推理。

[NLP-117] Evaluating the Efficacy of Large Language Models in Detecting Fake News: A Comparative Analysis
[NLP-117] 评估大型语言模型检测假新闻的功效:比较分析

链接: https://arxiv.org/abs/2406.06584
作者: Sahas Koka,Anthony Vuong,Anish Kataria
关键词: significant societal impacts, era increasingly influenced, artificial intelligence, societal impacts, era increasingly
中文关键词: 重大社会影响,时代影响越来越大,人工智能,社会影响,时代越来越大
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

Abstract:In an era increasingly influenced by artificial intelligence, the detection of fake news is crucial, especially in contexts like election seasons where misinformation can have significant societal impacts. This study evaluates the effectiveness of various LLMs in identifying and filtering fake news content. Utilizing a comparative analysis approach, we tested four large LLMs – GPT-4, Claude 3 Sonnet, Gemini Pro 1.0, and Mistral Large – and two smaller LLMs – Gemma 7B and Mistral 7B. By using fake news dataset samples from Kaggle, this research not only sheds light on the current capabilities and limitations of LLMs in fake news detection but also discusses the implications for developers and policymakers in enhancing AI-driven informational integrity.
摘要:在一个受人工智能影响日益严重的时代,假新闻的检测至关重要,尤其是在选举季等错误信息可能产生重大社会影响的背景下。本研究评估了各种LLM在识别和过滤虚假新闻内容方面的有效性。利用比较分析方法,我们测试了四个大型LLM-- GPT-4、Claude 3十四行诗、Gemini Pro 1.0和Mistral Large --以及两个较小的LLM-- Gemma 7 B和Mistral 7 B。通过使用Kaggle的假新闻数据集样本,这项研究不仅揭示了LLM在假新闻检测方面的当前能力和局限性,还讨论了开发人员和政策制定者在增强人工智能驱动的信息完整性方面的影响。

[NLP-118] Discrete Multimodal Transformers with a Pretrained Large Language Model for Mixed-Supervision Speech Processing
[NLP-118] 具有预训练大语言模型的离散多模式转换器用于混合监督语音处理

链接: https://arxiv.org/abs/2406.06582
作者: Viet Anh Trinh,Rosy Southwell,Yiwen Guan,Xinlu He,Zhiyong Wang,Jacob Whitehill
关键词: Recent work, seamlessly perform multiple, Multimodal Language Model, tokenization has paved, seamlessly perform
中文关键词: 最近的工作,无缝执行多个,多模式语言模型,标记化已经铺设,无缝执行
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
备注:

Abstract:Recent work on discrete speech tokenization has paved the way for models that can seamlessly perform multiple tasks across modalities, e.g., speech recognition, text to speech, speech to speech translation. Moreover, large language models (LLMs) pretrained from vast text corpora contain rich linguistic information that can improve accuracy in a variety of tasks. In this paper, we present a decoder-only Discrete Multimodal Language Model (DMLM), which can be flexibly applied to multiple tasks (ASR, T2S, S2TT, etc.) and modalities (text, speech, vision). We explore several critical aspects of discrete multi-modal models, including the loss function, weight initialization, mixed training supervision, and codebook. Our results show that DMLM benefits significantly, across multiple tasks and datasets, from a combination of supervised and unsupervised training. Moreover, for ASR, it benefits from initializing DMLM from a pretrained LLM, and from a codebook derived from Whisper activations.
摘要:最近关于离散语音标记化的工作已经为可以跨通道无缝执行多个任务的模型铺平了道路,例如,语音识别、文本到语音、语音到语音翻译。此外,从庞大的文本语料库中预训练的大型语言模型(LLM)包含了丰富的语言信息,可以提高各种任务的准确性。本文提出了一种只需译码的离散多模式语言模型(DMLM),它可以灵活地应用于多任务(ASR、T2S、S2TT等)。和方式(文本、语音、视觉)。我们探讨了离散多模式模型的几个关键方面,包括损失函数、权重初始化、混合训练监督和码本。我们的结果表明,DMLM从监督和非监督训练的组合中跨多个任务和数据集显著受益。此外,对于ASR,它受益于从预先训练的LLM初始化DMLM,以及从Whisper激活派生的码本。

[NLP-119] Set-Based Prompting: Provably Solving the Language Model Order Dependency Problem
[NLP-119] 基于集合的提示:可证明地解决语言模型的顺序依赖问题

链接: https://arxiv.org/abs/2406.06581
作者: Reid McIlroy-Young,Katrina Brown,Conlan Olson,Linjun Zhang,Cynthia Dwork
关键词: Large Language Models’, coherent textual outputs, generative language models, generative language, development of generative
中文关键词: 大型语言模型、连贯的文本输出、生成性语言模型、生成性语言、生成性语言的发展
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 29 pages, 27 figures, code this https URL

Abstract:The development of generative language models that can create long and coherent textual outputs via autoregression has led to a proliferation of uses and a corresponding sweep of analyses as researchers work to determine the limitations of this new paradigm. Unlike humans, these ‘Large Language Models’ (LLMs) are highly sensitive to small changes in their inputs, leading to unwanted inconsistency in their behavior. One problematic inconsistency when LLMs are used to answer multiple-choice questions or analyze multiple inputs is order dependency: the output of an LLM can (and often does) change significantly when sub-sequences are swapped, despite both orderings being semantically identical. In this paper we present Set-Based Prompting, a technique that guarantees the output of an LLM will not have order dependence on a specified set of sub-sequences. We show that this method provably eliminates order dependency, and that it can be applied to any transformer-based LLM to enable text generation that is unaffected by re-orderings. Delving into the implications of our method, we show that, despite our inputs being out of distribution, the impact on expected accuracy is small, where the expectation is over the order of uniformly chosen shuffling of the candidate responses, and usually significantly less in practice. Thus, Set-Based Prompting can be used as a ‘dropped-in’ method on fully trained models. Finally, we discuss how our method’s success suggests that other strong guarantees can be obtained on LLM performance via modifying the input representations.
摘要:能够通过自回归产生长而连贯的文本输出的生成式语言模型的发展,带来了用途的激增,也引发了相应的大量分析,研究人员正努力确定这一新范式的局限性。与人类不同,这些“大型语言模型”(LLM)对其输入的微小变化高度敏感,从而导致其行为出现不必要的不一致。当LLM被用来回答多项选择题或分析多个输入时,一个有问题的不一致性是顺序依赖:当子序列被交换时,LLM的输出可能(并且经常确实)显著改变,尽管两种顺序在语义上相同。在本文中,我们提出了Set-Based Prompting,这种技术保证LLM的输出不会对指定的一组子序列产生顺序依赖。我们证明了这种方法可证明地消除了顺序依赖,并且它可以应用于任何基于Transformer的LLM,以实现不受重新排序影响的文本生成。深入研究我们方法的影响,我们表明,尽管我们的输入属于分布外数据,但对期望准确率的影响很小(其中期望是对候选回答的均匀随机排列顺序取的),而且在实践中通常明显更小。因此,Set-Based Prompting可以作为一种“即插即用”的方法用于已完全训练的模型。最后,我们讨论了我们方法的成功如何表明,可以通过修改输入表示来获得对LLM性能的其他强保证。
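
The order dependency problem described in the abstract can be measured by asking the same multiple-choice question under every ordering of the options and checking how often the chosen option changes. The sketch below illustrates only that measurement, not Set-Based Prompting itself. Both models are hypothetical toys: `position_biased_model` favours whichever option comes first, and `length_only_model` ignores position entirely. `order_sensitivity` is a name introduced here.

```python
from itertools import permutations

def position_biased_model(question, options):
    """Hypothetical scorer: prefers longer options but gives the first slot a bonus."""
    scores = [len(opt) + (1.5 if i == 0 else 0.0) for i, opt in enumerate(options)]
    return options[scores.index(max(scores))]

def length_only_model(question, options):
    """Hypothetical order-invariant scorer: always picks the longest option."""
    return max(options, key=len)

def order_sensitivity(model, question, options):
    """Fraction of orderings whose answer differs from the most common answer."""
    answers = [model(question, list(p)) for p in permutations(options)]
    top_count = max(answers.count(a) for a in set(answers))
    return 1.0 - top_count / len(answers)

options = ["cat", "bird", "ox"]
biased = order_sensitivity(position_biased_model, "toy question", options)
invariant = order_sensitivity(length_only_model, "toy question", options)
print(biased, invariant)
```

A score of 0 means the answer never depends on the ordering, which is the property the paper says its method guarantees for the specified sub-sequences. Enumerating all permutations is only practical for a handful of options, so larger sets would need sampled shuffles instead.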

[NLP-120] Break the Chain: Large Language Models Can be Shortcut Reasoners
[NLP-120] 打破链条:大型语言模型可以成为捷径推理者

链接: https://arxiv.org/abs/2406.06580
作者: Mengru Ding,Hanmeng Liu,Zhizhang Fu,Jian Song,Wenbo Xie,Yue Zhang
关键词: high token consumption, Recent advancements, utilize complex modules, limited applicability, token consumption
中文关键词: 代币消耗高、最近的进步、利用复杂的模块、有限的适用性、代币消耗
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

Abstract:Recent advancements in Chain-of-Thought (CoT) reasoning utilize complex modules but are hampered by high token consumption, limited applicability, and challenges in reproducibility. This paper conducts a critical evaluation of CoT prompting, extending beyond arithmetic to include complex logical and commonsense reasoning tasks, areas where standard CoT methods fall short. We propose the integration of human-like heuristics and shortcuts into language models (LMs) through “break the chain” strategies. These strategies disrupt traditional CoT processes using controlled variables to assess their efficacy. Additionally, we develop innovative zero-shot prompting strategies that encourage the use of shortcuts, enabling LMs to quickly exploit reasoning clues and bypass detailed procedural steps. Our comprehensive experiments across various LMs, both commercial and open-source, reveal that LMs maintain effective performance with “break the chain” strategies. We also introduce ShortcutQA, a dataset specifically designed to evaluate reasoning through shortcuts, compiled from competitive tests optimized for heuristic reasoning tasks such as forward/backward reasoning and simplification. Our analysis confirms that ShortcutQA not only poses a robust challenge to LMs but also serves as an essential benchmark for enhancing reasoning efficiency in AI.
摘要:思想链(COT)推理的最新进展利用了复杂的模块,但受到高令牌消耗、有限的适用性和重复性方面的挑战的阻碍。本文对COT提示进行了批判性评估,从算术扩展到包括复杂的逻辑推理和常识推理任务,这是标准COT方法不足的领域。我们提出了通过“断链”策略将类似人类的启发式方法和捷径方法集成到语言模型中。这些策略使用受控变量来评估其有效性,从而扰乱了传统的COT流程。此外,我们开发了创新的零射提示策略,鼓励使用快捷方式,使LMS能够快速利用推理线索并绕过详细的程序步骤。我们对不同的LMS进行了全面的实验,包括商业的和开源的,结果表明LMS通过“断链”策略保持了有效的性能。我们还介绍了ShortutQA,这是一个专门为通过快捷方式评估推理而设计的数据集,它是从针对启发式推理任务(如前向/后向推理和简化)优化的竞争性测试中编译而成的。我们的分析证实了ShortutQA不仅对LMS构成了强大的挑战,而且也是提高人工智能推理效率的重要基准。

[NLP-121] From Redundancy to Relevance: Enhancing Explainability in Multimodal Large Language Models
[NLP-121] 从冗余到相关性:增强多模式大型语言模型的解释性

链接: https://arxiv.org/abs/2406.06579
作者: Xiaofeng Zhang,Chen Shen,Xiaosong Yuan,Shaotian Yan,Liang Xie,Wenxiao Wang,Chaochen Gu,Hao Tang,Jieping Ye
关键词: Large Vision Language, multimodal large language, large language models, Vision Language Models, popular Large Vision
中文关键词: 大视觉语言,多模式大语言,大语言模型,视觉语言模型,流行的大视觉
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

Abstract:Multimodal large language models have recently proliferated in seemingly endless variety. Most of the popular Large Vision Language Models (LVLMs) depend on sequential visual representation, where images are converted into hundreds or thousands of tokens before being input into the Large Language Model (LLM) along with language prompts. The black-box design hinders the interpretability of visual-language models, especially regarding more complex reasoning tasks. To explore the interaction process between image and text in complex reasoning tasks, we introduce the information flow method to visualize the interaction mechanism. By analyzing the dynamics of the information flow, we find that it appears to converge in the shallow layer. Further investigation revealed a redundancy of the image token in the shallow layer. Consequently, a truncation strategy was introduced to aggregate image tokens within these shallow layers. This approach has been validated through experiments across multiple models, yielding consistent improvements.
摘要:近年来,多通道大语言模型层出不穷,大多数流行的大视觉语言模型依赖于顺序的视觉表示,图像被转换成成百上千个符号,然后连同语言提示一起输入到大语言模型中。黑盒设计阻碍了视觉语言模型的可解释性,特别是对于更复杂的推理任务。为了探索复杂推理任务中图文交互的过程,我们引入信息流的方法来可视化图文交互机制。通过分析信息流的动态流动,我们发现信息流似乎在浅层收敛。进一步的研究发现,图像表征在浅层中存在冗余。因此,引入了一种截断策略来聚合这些浅层内的图像标记。该方法已通过跨多个模型的实验进行了验证,并产生了一致的改进。

[NLP-122] SMS Spam Detection and Classification to Combat Abuse in Telephone Networks Using Natural Language Processing
[NLP-122] 使用自然语言处理对短信垃圾邮件进行检测和分类以打击电话网络中的滥用行为

链接: https://arxiv.org/abs/2406.06578
作者: Dare Azeez Oyeyemi,Adebola K. Ojo
关键词: Short Message Service, SMS spam, service due, SMS, mobile phones
中文关键词: 短信服务、短信垃圾邮件、服务到期、短信、手机
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 13 pages, 8 figures, 3 tables

Abstract:In the modern era, mobile phones have become ubiquitous, and Short Message Service (SMS) has grown to become a multi-million-dollar service due to the widespread adoption of mobile devices and the millions of people who use SMS daily. However, SMS spam has also become a pervasive problem that endangers users’ privacy and security through phishing and fraud. Despite numerous spam filtering techniques, there is still a need for a more effective solution to address this problem [1]. This research addresses the pervasive issue of SMS spam, which poses threats to users’ privacy and security. Despite existing spam filtering techniques, the high false-positive rate persists as a challenge. The study introduces a novel approach utilizing Natural Language Processing (NLP) and machine learning models, particularly BERT (Bidirectional Encoder Representations from Transformers), for SMS spam detection and classification. Data preprocessing techniques, such as stop word removal and tokenization, are applied, along with feature extraction using BERT. Machine learning models, including SVM, Logistic Regression, Naive Bayes, Gradient Boosting, and Random Forest, are integrated with BERT for differentiating spam from ham messages. Evaluation results revealed that the Naïve Bayes classifier + BERT model achieves the highest accuracy at 97.31% with the fastest execution time of 0.3 seconds on the test dataset. This approach demonstrates a notable enhancement in spam detection efficiency and a low false-positive rate. The developed model presents a valuable solution to combat SMS spam, ensuring faster and more accurate detection. This model not only safeguards users’ privacy but also assists network providers in effectively identifying and blocking SMS spam messages.
摘要:在现代,移动电话已经变得无处不在,由于移动设备的广泛使用和数百万人每天使用短信,短信服务(SMS)已经成长为一项价值数百万美元的服务。然而,垃圾短信也成为了一个普遍存在的问题,通过钓鱼和诈骗危及用户的隐私和安全。尽管有许多垃圾信息过滤技术,但仍然需要一个更有效的解决方案来解决这个问题[1]。本研究针对普遍存在的、对用户隐私和安全构成威胁的垃圾短信问题。尽管已有垃圾信息过滤技术,但高误报率仍然是一个挑战。该研究提出了一种利用自然语言处理(NLP)和机器学习模型,特别是BERT(来自Transformers的双向编码器表示)来检测和分类垃圾短信的新方法。研究应用了停用词去除和分词等数据预处理技术,并使用BERT进行特征提取。包括支持向量机、Logistic回归、朴素贝叶斯、梯度提升和随机森林在内的机器学习模型与BERT相结合,用于区分垃圾短信和正常短信。评估结果表明,朴素贝叶斯分类器+BERT模型的准确率最高,达到97.31%,在测试数据集上的执行时间最快,为0.3秒。该方法显著提高了垃圾短信的检测效率,并具有较低的误报率。所开发的模型为打击垃圾短信提供了一个有价值的解决方案,确保了更快、更准确的检测。该模型不仅保护了用户的隐私,还帮助网络提供商有效地识别和拦截垃圾短信。
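
To make the spam-versus-ham classification step concrete, here is a minimal sketch of a multinomial Naive Bayes classifier. It deliberately swaps the paper's BERT features for plain bag-of-words counts so that it runs with the standard library alone, and the four training messages are hypothetical toy data. It should be read as an illustration of the classifier family, not as the authors' pipeline.

```python
import math
from collections import Counter

def train_naive_bayes(messages, labels):
    """Count words per class; returns (word_counts, class_counts, vocab)."""
    class_counts = Counter(labels)
    word_counts = {label: Counter() for label in class_counts}
    for text, label in zip(messages, labels):
        word_counts[label].update(text.lower().split())  # whitespace tokenization
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    return word_counts, class_counts, vocab

def predict(model, text):
    """Return the label with the highest log-posterior under add-one smoothing."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total)          # log prior
        denom = sum(word_counts[label].values()) + len(vocab)  # smoothed denominator
        for word in text.lower().split():
            if word in vocab:                                  # skip unseen words
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy training set.
messages = [
    "win a free prize now",
    "free cash offer click now",
    "are we meeting for lunch",
    "see you at home tonight",
]
labels = ["spam", "spam", "ham", "ham"]
model = train_naive_bayes(messages, labels)
print(predict(model, "free prize now"), predict(model, "see you at lunch"))
```

Naive Bayes needs only a single pass of counting to train and a handful of additions to predict, which is consistent with the short execution time the abstract reports for its Naive Bayes variant.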

[NLP-123] RAG-based Crowdsourcing Task Decomposition via Masked Contrastive Learning with Prompts
[NLP-123] 基于RAG的众包任务分解:通过带提示的掩码对比学习

链接: https://arxiv.org/abs/2406.06577
作者: Jing Yang,Xiao Wang,Yu Zhao,Yuhang Liu,Fei-Yue Wang
关键词: complex tasks relies, critical technology, technology in social, leverages an extensive, extensive and boundless
中文关键词: 复杂的任务依赖,关键技术,社会技术,利用广泛的、广泛的、无限的
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 13 pages, 9 figures

Abstract:Crowdsourcing is a critical technology in social manufacturing, which leverages an extensive and boundless reservoir of human resources to handle a wide array of complex tasks. The successful execution of these complex tasks relies on task decomposition (TD) and allocation, with the former being a prerequisite for the latter. Recently, pre-trained language models (PLMs)-based methods have garnered significant attention. However, they are constrained to handling straightforward common-sense tasks due to their inherent restrictions involving limited and difficult-to-update knowledge as well as the presence of hallucinations. To address these issues, we propose a retrieval-augmented generation-based crowdsourcing framework that reimagines TD as event detection from the perspective of natural language understanding. However, the existing detection methods fail to distinguish differences between event types and always depend on heuristic rules and external semantic analyzing tools. Therefore, we present a Prompt-Based Contrastive learning framework for TD (PBCT), which incorporates a prompt-based trigger detector to overcome dependence. Additionally, trigger-attentive sentinel and masked contrastive learning are introduced to provide varying attention to trigger and contextual features according to different event types. Experiment results demonstrate the competitiveness of our method in both supervised and zero-shot detection. A case study on printed circuit board manufacturing is showcased to validate its adaptability to unknown professional domains.
摘要:众包是社会制造中的一项关键技术,它利用广泛和无限的人力资源来处理各种复杂的任务。这些复杂任务的成功执行依赖于任务分解和分配,而任务分解和分配是后者的前提。最近,基于预训练语言模型(PLM)的方法得到了极大的关注。然而,由于他们固有的限制,涉及有限和难以更新的知识以及幻觉的存在,他们被限制在处理简单的常识任务上。为了解决这些问题,我们提出了一个基于检索增强生成的众包框架,该框架从自然语言理解的角度将TD重新想象为事件检测。然而,现有的检测方法不能区分事件类型之间的差异,往往依赖于启发式规则和外部语义分析工具。因此,我们提出了一种基于提示的TD对比学习框架(PBCT),该框架结合了一个基于提示的触发检测器来克服依赖。此外,引入了触发注意哨兵和掩蔽对比学习,以根据不同的事件类型提供对触发和语境特征的不同关注。实验结果表明,该方法在监督检测和零镜头检测方面都具有较强的竞争力。以印制电路板制造为例,验证了该方法对未知专业领域的适应性。

[NLP-124] OccamLLM: Fast and Exact Language Model Arithmetic in a Single Step
[NLP-124] OccamLLM:一步完成快速精确的语言模型算法

Link: https://arxiv.org/abs/2406.06576
Authors: Owen Dugan, Donato Manuel Jimenez Beneto, Charlotte Loh, Zhuo Chen, Rumen Dangovski, Marin Soljačić
Keywords: Large Language Models, accurately performing complex, Large Language, complex arithmetic operations, performing complex arithmetic
Keywords (Chinese): 大型语言模型,准确执行复杂的、大型语言的、复杂的算术运算,执行复杂的算术
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Despite significant advancements in text generation and reasoning, Large Language Models (LLMs) still face challenges in accurately performing complex arithmetic operations. To achieve accurate calculations, language model systems often enable LLMs to generate code for arithmetic operations. However, this approach compromises speed and security and, if finetuning is involved, risks the language model losing prior capabilities. We propose a framework that enables exact arithmetic in a single autoregressive step, providing faster, more secure, and more interpretable LLM systems with arithmetic capabilities. We use the hidden states of an LLM to control a symbolic architecture which performs arithmetic. Our implementation using Llama 3 8B Instruct with OccamNet as a symbolic model (OccamLlama) achieves 100% accuracy on single arithmetic operations ($+, -, \times, \div, \sin, \cos, \log, \exp, \sqrt{\cdot}$), outperforming GPT 4o and on par with GPT 4o using a code interpreter. OccamLlama also outperforms both Llama 3 8B Instruct and GPT 3.5 Turbo on multistep reasoning problems involving challenging arithmetic, thus enabling small LLMs to match the arithmetic performance of even much larger models. We will make our code public shortly.
摘要:尽管在文本生成和推理方面有了很大的进步,但大语言模型(LLM)在准确执行复杂的算术运算方面仍然面临挑战。为了实现准确的计算,语言模型系统通常让LLM生成用于算术运算的代码。然而,这种方法会牺牲速度和安全性,如果涉及微调,语言模型还可能失去先前的能力。我们提出了一种框架,能够在单个自回归步骤中实现精确的算术运算,从而提供更快、更安全、更可解释且具备算术能力的LLM系统。我们使用LLM的隐藏状态来控制执行算术运算的符号体系结构。我们的实现使用Llama 3 8B Instruct,并以OccamNet为符号模型(OccamLlama),在单次算术运算($+, -, \times, \div, \sin, \cos, \log, \exp, \sqrt{\cdot}$)上达到100%的准确率,性能优于GPT 4o,与使用代码解释器的GPT 4o相当。在涉及具有挑战性的算术的多步推理问题上,OccamLlama还胜过Llama 3 8B Instruct和GPT 3.5 Turbo,从而使小型LLM能够与大得多的模型的算术性能相媲美。我们将很快公开我们的代码。

[NLP-125] Ask-EDA: A Design Assistant Empowered by LLM Hybrid RAG and Abbreviation De-hallucination
[NLP-125] Ask-EDA:由LLM Hybrid RAG和缩写去幻觉赋予动力的设计助理

Link: https://arxiv.org/abs/2406.06575
Authors: Luyao Shi, Michael Kazda, Bradley Sears, Nick Shropshire, Ruchir Puri
Keywords: Electronic design engineers, find relevant information, relevant information efficiently, Electronic design, verification and technology
Keywords (Chinese): 电子设计工程师,高效查找相关信息,相关信息,电子设计、验证和技术
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted paper at The First IEEE International Workshop on LLM-Aided Design, 2024 (LAD 24)

Abstract:Electronic design engineers are challenged to find relevant information efficiently for a myriad of tasks within design construction, verification and technology development. Large language models (LLM) have the potential to help improve productivity by serving as conversational agents that effectively function as subject-matter experts. In this paper we demonstrate Ask-EDA, a chat agent designed to serve as a 24x7 expert available to provide guidance to design engineers. Ask-EDA leverages LLM, hybrid retrieval augmented generation (RAG) and abbreviation de-hallucination (ADH) techniques to deliver more relevant and accurate responses. We curated three evaluation datasets, namely q2a-100, cmds-100 and abbr-100. Each dataset is tailored to assess a distinct aspect: general design question answering, design command handling and abbreviation resolution. We demonstrated that hybrid RAG offers over a 40% improvement in Recall on the q2a-100 dataset and over a 60% improvement on the cmds-100 dataset compared to not using RAG, while ADH yields over a 70% enhancement in Recall on the abbr-100 dataset. The evaluation results show that Ask-EDA can effectively respond to design-related inquiries.
摘要:电子设计工程师面临着在设计构建、验证和技术开发过程中如何有效地查找相关信息的挑战。大型语言模型(LLM)可以作为会话代理,有效地充当主题专家,从而帮助提高工作效率。在本文中,我们演示了一个聊天代理Ask-EDA,它被设计为24x7全天候的专家,可以为设计工程师提供指导。ASK-EDA利用LLM、混合检索增强生成(RAG)和缩写去幻觉(ADH)技术来提供更相关和准确的响应。我们整理了三个评估数据集,即q2a-100、cmds-100和abbr-100。每个数据集都是为评估一个不同的方面量身定做的:一般设计问题解答、设计命令处理和缩写解析。与不使用RAG相比,混合RAG在q2a-100数据集上的召回率提高了40%以上,在CMDS-100数据集上提高了60%以上,而ADH在abbr-100数据集上的召回率提高了70%以上。评估结果表明,ASK-EDA能够有效地响应与设计相关的查询。

[NLP-126] Towards Transparency: Exploring LLM Trainings Datasets through Visual Topic Modeling and Semantic Frame
[NLP-126] 迈向透明:通过视觉主题建模和语义框架探索LLM训练数据集

Link: https://arxiv.org/abs/2406.06574
Authors: Charles de Dampierre, Andrei Mogoutov, Nicolas Baumard
Keywords: behalf of humans, classifying things, everyday life, responsible for making, making many decisions
Keywords (Chinese): 代表人类,对事物进行分类,日常生活,负责制定,做出许多决定
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:LLMs are now responsible for making many decisions on behalf of humans: from answering questions to classifying things, they have become an important part of everyday life. While computation and model architecture have been rapidly expanding in recent years, the efforts towards curating training datasets are still in their beginnings. This underappreciation of training datasets has led LLMs to create biased and low-quality content. In order to solve that issue, we present Bunka, a software that leverages AI and Cognitive Science to improve the refinement of textual datasets. We show how Topic Modeling coupled with 2-dimensional Cartography can increase the transparency of datasets. We then show how the same Topic Modeling techniques can be applied to Preferences datasets to accelerate the fine-tuning process and increase the capacities of the model on different benchmarks. Lastly, we show how using Frame Analysis can give insights into existing biases in the training corpus. Overall, we argue that we need better tools to explore and increase the quality and transparency of LLMs training datasets.
摘要:LLM现在代表人类负责做出许多决策:从回答问题到对事物进行分类,它们已经成为日常生活中重要的一部分。虽然计算和模型架构近年来迅速扩大,但为管理训练数据集所作的努力仍处于起步阶段。这种对训练数据集的低估导致LLMS创造了有偏见和低质量的内容。为了解决这个问题,我们提出了Bunka,这是一个利用人工智能和认知科学来改进文本数据集的精化的软件。我们展示了主题建模与二维地图绘制相结合如何增加数据集的透明度。然后,我们展示了如何将相同的主题建模技术应用于偏好数据集,以加快微调过程,并增加模型在不同基准上的容量。最后,我们展示了如何使用框架分析来洞察训练语料库中存在的偏见。总体而言,我们认为我们需要更好的工具来探索和提高LLMS训练数据集的质量和透明度。

[NLP-127] MedFuzz: Exploring the Robustness of Large Language Models in Medical Question Answering
[NLP-127] MedFuzz:探索医学问题解答中大型语言模型的鲁棒性

Link: https://arxiv.org/abs/2406.06573
Authors: Robert Osazuwa Ness, Katie Matton, Hayden Helm, Sheng Zhang, Junaid Bajwa, Carey E. Priebe, Eric Horvitz
Keywords: Large language models, Large language, achieved impressive performance, LLM, language models
Keywords (Chinese): 大型语言模型,大型语言,取得了令人印象深刻的性能,LLM,语言模型
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 9 pages, 2 figures, 2 algorithms, appendix

Abstract:Large language models (LLM) have achieved impressive performance on medical question-answering benchmarks. However, high benchmark accuracy does not imply that the performance generalizes to real-world clinical settings. Medical question-answering benchmarks rely on assumptions consistent with quantifying LLM performance but that may not hold in the open world of the clinic. Yet LLMs learn broad knowledge that can help the LLM generalize to practical conditions regardless of unrealistic assumptions in celebrated benchmarks. We seek to quantify how well LLM medical question-answering benchmark performance generalizes when benchmark assumptions are violated. Specifically, we present an adversarial method that we call MedFuzz (for medical fuzzing). MedFuzz attempts to modify benchmark questions in ways aimed at confounding the LLM. We demonstrate the approach by targeting strong assumptions about patient characteristics presented in the MedQA benchmark. Successful “attacks” modify a benchmark item in ways that would be unlikely to fool a medical expert but nonetheless “trick” the LLM into changing from a correct to an incorrect answer. Further, we present a permutation test technique that can ensure a successful attack is statistically significant. We show how to use performance on a “MedFuzzed” benchmark, as well as individual successful attacks. The methods show promise at providing insights into the ability of an LLM to operate robustly in more realistic settings.
摘要:大型语言模型(LLM)在医学问答基准上取得了令人印象深刻的表现。然而,高基准准确率并不意味着性能适用于现实世界的临床设置。医学问题回答基准依赖于与量化LLM性能一致的假设,但这在诊所的开放世界中可能不成立。然而,LLM学习了广泛的知识,可以帮助LLM将其推广到实际情况,而不考虑著名基准中不切实际的假设。我们试图量化当违反基准假设时,LLM医疗问答基准性能的泛化程度。具体地说,我们提出了一种对抗性方法,我们称之为MedFuzz(用于医学模糊)。MedFuzz试图以混淆LLM的方式修改基准问题。我们通过针对MedQA基准中提出的关于患者特征的强烈假设来演示该方法。成功的“攻击”修改基准项目的方式不太可能愚弄医学专家,但仍然“诱骗”LLM将正确答案更改为不正确答案。此外,我们提出了一种置换测试技术,该技术可以确保成功的攻击具有统计意义。我们展示了如何使用“MedFuzze”基准测试的性能,以及个别成功的攻击。这些方法在洞察LLM在更现实的环境中稳健运行的能力方面表现出了希望。

[NLP-128] Graph Neural Network Enhanced Retrieval for Question Answering of LLMs
[NLP-128] 用于LLM问题解答的图神经网络增强检索

Link: https://arxiv.org/abs/2406.06572
Authors: Zijian Li, Qingyan Guo, Jiawei Shao, Lei Song, Jiang Bian, Jun Zhang, Rui Wang
Keywords: large language model, providing factual supports, revolutionized large language, Retrieval augmented generation, language model
Keywords (Chinese): 大型语言模型,提供事实支持,彻底改变大型语言,检索增强生成,语言模型
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: Under review

Abstract:Retrieval augmented generation has revolutionized large language model (LLM) outputs by providing factual supports. Nevertheless, it struggles to capture all the necessary knowledge for complex reasoning questions. Existing retrieval methods typically divide reference documents into passages, treating them in isolation. These passages, however, are often interrelated, such as passages that are contiguous or share the same keywords. Therefore, recognizing the relatedness is crucial for enhancing the retrieval process. In this paper, we propose a novel retrieval method, called GNN-Ret, which leverages graph neural networks (GNNs) to enhance retrieval by considering the relatedness between passages. Specifically, we first construct a graph of passages by connecting passages that are structure-related and keyword-related. A graph neural network (GNN) is then leveraged to exploit the relationships between passages and improve the retrieval of supporting passages. Furthermore, we extend our method to handle multi-hop reasoning questions using a recurrent graph neural network (RGNN), named RGNN-Ret. At each step, RGNN-Ret integrates the graphs of passages from previous steps, thereby enhancing the retrieval of supporting passages. Extensive experiments on benchmark datasets demonstrate that GNN-Ret achieves higher accuracy for question answering with a single query of LLMs than strong baselines that require multiple queries, and RGNN-Ret further improves accuracy and achieves state-of-the-art performance, with up to 10.4% accuracy improvement on the 2WikiMQA dataset.
摘要:检索增强生成通过提供事实支持使大型语言模型(LLM)输出发生了革命性的变化。然而,它很难捕捉到复杂推理问题的所有必要知识。现有的检索方法通常将参考文档分成多个段落,单独处理它们。然而,这些段落通常是相互关联的,例如连续的或共享相同关键字的段落。因此,认识到这种关联性对于提高检索过程至关重要。在本文中,我们提出了一种新的检索方法GNN-Ret,它利用图神经网络(GNN)通过考虑段落之间的相关性来增强检索能力。具体地说,我们首先通过连接与结构相关和关键字相关的段落来构建段落图。然后利用图神经网络(GNN)来利用段落之间的关系并改进支持段落的检索。此外,我们使用递归图神经网络(RGNN-Ret)将我们的方法扩展到处理多跳推理问题。在每个步骤中,RGNN-Ret整合了前面步骤中的段落图形,从而增强了对支持段落的检索。在基准数据集上的大量实验表明,GNN-Ret在LLMS的一次查询中的问答准确率高于需要多次查询的强基线,RGNN-Ret进一步提高了准确率并达到了最先进的性能,在2WikiMQA数据集上的准确率提高了10.4%。
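
The graph-construction step described in the GNN-Ret abstract (connecting passages that are structure-related or keyword-related) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the function name `build_passage_graph`, the `(doc_id, position, text)` passage layout, and the reading of "structure-related" as "adjacent passages in the same document" are all assumptions made here, and the keyword sets are assumed to come from some upstream extractor that the abstract does not specify.

```python
from itertools import combinations


def build_passage_graph(passages, keywords):
    """Return undirected edges (i, j) with i < j between related passages.

    passages: list of (doc_id, position, text) tuples (hypothetical layout).
    keywords: list of keyword sets, one per passage (assumed precomputed).
    """
    edges = set()
    for i, j in combinations(range(len(passages)), 2):
        doc_i, pos_i, _ = passages[i]
        doc_j, pos_j, _ = passages[j]
        # Assumption: "structure-related" means contiguous in the same document.
        structure_related = doc_i == doc_j and abs(pos_i - pos_j) == 1
        # "Keyword-related": the two passages share at least one keyword.
        keyword_related = bool(keywords[i] & keywords[j])
        if structure_related or keyword_related:
            edges.add((i, j))
    return edges
```

The resulting edge set would then serve as the input graph for a GNN that propagates relevance between neighbouring passages; that part is not sketched here.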

[NLP-129] SUBLLM: A Novel Efficient Architecture with Token Sequence Subsampling for LLM
[NLP-129] SUBLLM:一种用于LLM的具有令牌序列子采样的新型高效架构

Link: https://arxiv.org/abs/2406.06571
Authors: Quandong Wang, Yuxuan Yuan, Xiaoyu Yang, Ruike Zhang, Kang Zhao, Wei Liu, Jian Luan, Daniel Povey, Bin Wang
Keywords: Large Language Models, achieved remarkable success, Large Language, Language Models, major challenge
Keywords (Chinese): 大型语言模型,取得显着成功,大型语言,语言模型,重大挑战
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 9 pages, 3 figures, submitted to ECAI 2024

Abstract:While Large Language Models (LLMs) have achieved remarkable success in various fields, the efficiency of training and inference remains a major challenge. To address this issue, we propose SUBLLM, short for Subsampling-Upsampling-Bypass Large Language Model, an innovative architecture that extends the core decoder-only framework by incorporating subsampling, upsampling, and bypass modules. The subsampling modules are responsible for shortening the sequence, while the upsampling modules restore the sequence length, and the bypass modules enhance convergence. In comparison to LLaMA, the proposed SUBLLM exhibits significant enhancements in both training and inference speeds as well as memory usage, while maintaining competitive few-shot performance. During training, SUBLLM increases speeds by 26% and cuts memory by 10GB per GPU. In inference, it boosts speeds by up to 37% and reduces memory by 1GB per GPU. The training and inference speeds can be enhanced by 34% and 52% respectively when the context window is expanded to 8192. We shall release the source code of the proposed architecture in the published version.
摘要:虽然大语言模型在各个领域都取得了显著的成就,但训练和推理的效率仍然是一个重大挑战。为了解决这一问题,我们提出了SUBLLM,即子采样-上采样-旁路大语言模型的缩写,这是一个创新的体系结构,通过结合下采样、上采样和旁路模块来扩展仅用于核心解码器的框架。子采样模块负责缩短序列,上采样模块恢复序列长度,旁路模块增强收敛。与骆驼相比,所提出的SUBLLM在训练和推理速度以及内存使用方面都有显著的提高,同时保持了具有竞争力的少射性能。在培训期间,SUBLLM将速度提高了26%,并将每个GPU的内存减少了10 GB。根据推论,它将速度提高高达37%,并将每个GPU的内存减少1 GB。当上下文窗口扩展到8192时,训练速度和推理速度分别提高了34%和52%。我们将在发布的版本中发布建议架构的源代码。

[NLP-130] Review of Computational Epigraphy
[NLP-130] 计算铭文评论

Link: https://arxiv.org/abs/2406.06570
Authors: Vishal Kumar
Keywords: Computational Epigraphy refers, extracting text, Epigraphy refers, stone inscription, Traditional epigraphy methods
Keywords (Chinese): 计算铭文参考,提取文本,铭文参考,石碑,传统铭文方法
Categories: Computation and Language (cs.CL)

Abstract:Computational Epigraphy refers to the process of extracting text from stone inscription, transliteration, interpretation, and attribution with the aid of computational methods. Traditional epigraphy methods are time consuming, and tend to damage the stone inscriptions while extracting text. Additionally, interpretation and attribution are subjective and can vary between different epigraphers. However, using modern computation methods can not only be used to extract text, but also interpret and attribute the text in a robust way. We survey and document the existing computational methods that aid in the above-mentioned tasks in epigraphy.
摘要:计算铭文是指借助计算方法从石碑中提取文本、音译、解释和归因的过程。传统的铭文方法耗时,并且在提取文本时往往会损坏石雕。此外,解释和归因是主观的,并且不同的铭文家可能会有所不同。然而,使用现代计算方法不仅可以用于提取文本,还可以以稳健的方式解释文本和属性。我们调查并记录了现有的计算方法,这些方法有助于完成上述铭文学任务。

[NLP-131] Enhancing Clinical Documentation with Synthetic Data: Leveraging Generative Models for Improved Accuracy
[NLP-131] 利用合成数据增强临床文档:利用生成模型提高准确性

Link: https://arxiv.org/abs/2406.06569
Authors: Anjanava Biswas, Wrick Talukdar
Keywords: facilitating effective communication, facilitating effective, communication among providers, regulatory requirements, comprehensive clinical documentation
Keywords (Chinese): 促进有效沟通、促进有效、提供者之间的沟通、监管要求、全面的临床文档
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Accurate and comprehensive clinical documentation is crucial for delivering high-quality healthcare, facilitating effective communication among providers, and ensuring compliance with regulatory requirements. However, manual transcription and data entry processes can be time-consuming, error-prone, and susceptible to inconsistencies, leading to incomplete or inaccurate medical records. This paper proposes a novel approach to augment clinical documentation by leveraging synthetic data generation techniques to generate realistic and diverse clinical transcripts. We present a methodology that combines state-of-the-art generative models, such as Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), with real-world clinical transcript and other forms of clinical data to generate synthetic transcripts. These synthetic transcripts can then be used to supplement existing documentation workflows, providing additional training data for natural language processing models and enabling more accurate and efficient transcription processes. Through extensive experiments on a large dataset of anonymized clinical transcripts, we demonstrate the effectiveness of our approach in generating high-quality synthetic transcripts that closely resemble real-world data. Quantitative evaluation metrics, including perplexity scores and BLEU scores, as well as qualitative assessments by domain experts, validate the fidelity and utility of the generated synthetic transcripts. Our findings highlight synthetic data generation’s potential to address clinical documentation challenges, improving patient care, reducing administrative burdens, and enhancing healthcare system efficiency.
摘要:准确和全面的临床文档对于提供高质量的医疗保健、促进提供者之间的有效沟通以及确保遵守法规要求至关重要。然而,手动转录和数据录入过程可能非常耗时、容易出错,并且容易出现不一致,从而导致医疗记录不完整或不准确。本文提出了一种新的方法,通过利用合成数据生成技术来增强临床文档,以生成真实和多样化的临床记录。我们提出了一种方法,将最先进的生成模型,如生成性对抗网络(GANS)和变量自动编码器(VAES),与真实世界的临床记录和其他形式的临床数据相结合,生成合成记录。然后,可以使用这些合成抄本来补充现有的文档工作流程,为自然语言处理模型提供额外的训练数据,并实现更准确和高效的转录过程。通过在匿名临床记录的大型数据集上的广泛实验,我们证明了我们的方法在生成与真实世界数据非常相似的高质量合成记录方面的有效性。量化评估指标,包括困惑分数和BLEU分数,以及领域专家的定性评估,验证生成的合成成绩单的保真度和实用性。我们的发现突出了合成数据生成在解决临床文档挑战、改善患者护理、减轻管理负担和提高医疗保健系统效率方面的潜力。

[NLP-132] DHA: Learning Decoupled-Head Attention from Transformer Checkpoints via Adaptive Heads Fusion
[NLP-132] DHA:通过自适应头部融合从Transformer检查点学习解耦头注意力

Link: https://arxiv.org/abs/2406.06567
Authors: Yilong Chen, Linhao Zhang, Junyuan Shang, Zhenyu Zhang, Tingwen Liu, Shuohuan Wang, Yu Sun
Keywords: Large language models, Large language, parameters demonstrate impressive, demonstrate impressive performance, demonstrate impressive
Keywords (Chinese): 大型语言模型、大型语言、参数展示令人印象深刻、展示令人印象深刻的性能、展示令人印象深刻
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 10 pages, 9 figures, 3 tables

Abstract:Large language models (LLMs) with billions of parameters demonstrate impressive performance. However, the widely used Multi-Head Attention (MHA) in LLMs incurs substantial computational and memory costs during inference. While some efforts have optimized attention mechanisms by pruning heads or sharing parameters among heads, these methods often lead to performance degradation or necessitate substantial continued pre-training costs to restore performance. Based on the analysis of attention redundancy, we design a Decoupled-Head Attention (DHA) mechanism. DHA adaptively configures group sharing for key heads and value heads across various layers, achieving a better balance between performance and efficiency. Inspired by the observation of clustering similar heads, we propose to progressively transform the MHA checkpoint into the DHA model through linear fusion of similar head parameters step by step, retaining the parametric knowledge of the MHA checkpoint. We construct DHA models by transforming various scales of MHA checkpoints given target head budgets. Our experiments show that DHA remarkably requires a mere 0.25% of the original model’s pre-training budgets to achieve 97.6% of performance while saving 75% of KV cache. Compared to Group-Query Attention (GQA), DHA achieves a $5\times$ training acceleration, a maximum of 13.93% performance improvement under 0.01% pre-training budget, and 4% relative improvement under 0.05% pre-training budget.
摘要:具有数十亿个参数的大型语言模型(LLM)表现出令人印象深刻的性能。然而,LLM中广泛使用的多头注意力(MHA)在推理过程中会产生大量的计算和存储开销。虽然一些工作通过修剪注意力头或在头之间共享参数来优化注意力机制,但这些方法往往会导致性能下降,或者需要大量的继续预训练成本来恢复性能。在分析注意力冗余的基础上,我们设计了一种解耦头注意力(DHA)机制。DHA在各层自适应地配置键头(key head)和值头(value head)的分组共享,在性能和效率之间取得了更好的平衡。受相似头聚类现象的启发,我们提出通过对相似头参数逐步进行线性融合,将MHA检查点逐步转换为DHA模型,同时保留MHA检查点的参数知识。我们在给定目标头数预算的条件下,通过转换不同规模的MHA检查点来构建DHA模型。实验表明,DHA只需原始模型预训练预算的0.25%即可达到97.6%的性能,同时节省75%的KV缓存。与分组查询注意力(GQA)相比,DHA的训练速度提高了5倍,在0.01%的预训练预算下性能最多提高13.93%,在0.05%的预训练预算下相对提高4%。
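
The "linear fusion of similar head parameters" mentioned in the DHA abstract can be illustrated with a toy sketch. This is not the paper's procedure: the function name `fuse_heads`, the plain nested-list matrices, and the hand-picked fusion coefficients are assumptions for illustration only (the abstract describes the fusion as learned and progressive, which this sketch does not attempt).

```python
def fuse_heads(head_weights, fusion_coeffs):
    """Linearly combine same-shaped head weight matrices into one shared head.

    head_weights: list of matrices (lists of rows), one per similar head.
    fusion_coeffs: one scalar per head; hypothetical values, not learned here.
    """
    rows, cols = len(head_weights[0]), len(head_weights[0][0])
    fused = [[0.0] * cols for _ in range(rows)]
    for weight, coeff in zip(head_weights, fusion_coeffs):
        for r in range(rows):
            for c in range(cols):
                # Weighted sum over heads, element by element.
                fused[r][c] += coeff * weight[r][c]
    return fused
```

With equal coefficients this reduces to averaging the grouped heads; moving the coefficients gradually from a one-hot pattern toward shared values is one way to picture a step-by-step transition from an MHA checkpoint to shared key or value heads.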

[NLP-133] RAG Enabled Conversations about Household Electricity Monitoring
[NLP-133] RAG启用有关家庭电力监控的对话

Link: https://arxiv.org/abs/2406.06566
Authors: Carolina Fortuna, Vid Hanžel, Blaž Bertalanič
Keywords: Retrieval Augmented Generation, large language models, Augmented Generation, Llama to enhance, Retrieval Augmented
Keywords (Chinese): 检索增强生成、大型语言模型、增强生成、Lama增强、检索增强
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Submitted to ACM KDD 2024

Abstract:In this paper, we investigate the integration of Retrieval Augmented Generation (RAG) with large language models (LLMs) such as ChatGPT, Gemini, and Llama to enhance the accuracy and specificity of responses to complex questions about electricity datasets. Recognizing the limitations of LLMs in generating precise and contextually relevant answers due to their dependency on the patterns in training data rather than factual understanding, we propose a solution that leverages a specialized electricity knowledge graph. This approach facilitates the retrieval of accurate, real-time data which is then synthesized with the generative capabilities of LLMs. Our findings illustrate that the RAG approach not only reduces the incidence of incorrect information typically generated by LLMs but also significantly improves the quality of the output by grounding responses in verifiable data. This paper details our methodology, presents a comparative analysis of responses with and without RAG, and discusses the implications of our findings for future applications of AI in specialized sectors like energy data analysis.
摘要:在本文中,我们研究了检索增强生成(RAG)与大型语言模型(LLMS)的集成,如ChatGPT、Gemini和Llama,以提高对关于电力数据集的复杂问题的回答的准确性和特异性。认识到LLMS由于依赖于训练数据中的模式而不是事实理解而在生成准确和上下文相关的答案方面的局限性,我们提出了一种利用专门的电力知识图的解决方案。这种方法便于检索准确的实时数据,然后利用LLMS的生成能力合成这些数据。我们的发现表明,RAG方法不仅减少了LLMS通常产生的不正确信息的发生率,而且通过将响应固定在可验证的数据中,显着提高了输出的质量。本文详细介绍了我们的方法,对使用RAG和不使用RAG的响应进行了比较分析,并讨论了我们的发现对人工智能在能源数据分析等专门部门的未来应用的影响。

[NLP-134] MixEval: Deriving Wisdom of the Crowd from LLM Benchmark Mixtures
[NLP-134] MixEval:从LLM Benchmark Mixtures中汲取人群智慧

Link: https://arxiv.org/abs/2406.06565
Authors: Jinjie Ni, Fuzhao Xue, Xiang Yue, Yuntian Deng, Mahir Shah, Kabir Jain, Graham Neubig, Yang You
Keywords: Evaluating large language, Evaluating large, large language models, large language, Chatbot Arena
Keywords (Chinese): 评估大型语言,评估大型语言模型,大型语言,Chatbot Arena
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Evaluating large language models (LLMs) is challenging. Traditional ground-truth-based benchmarks fail to capture the comprehensiveness and nuance of real-world queries, while LLM-as-judge benchmarks suffer from grading biases and limited query quantity. Both of them may also become contaminated over time. User-facing evaluation, such as Chatbot Arena, provides reliable signals but is costly and slow. In this work, we propose MixEval, a new paradigm for establishing efficient, gold-standard LLM evaluation by strategically mixing off-the-shelf benchmarks. It bridges (1) comprehensive and well-distributed real-world user queries and (2) efficient and fairly-graded ground-truth-based benchmarks, by matching queries mined from the web with similar queries from existing benchmarks. Based on MixEval, we further build MixEval-Hard, which offers more room for model improvement. Our benchmarks’ advantages lie in (1) a 0.96 model ranking correlation with Chatbot Arena arising from the highly impartial query distribution and grading mechanism, (2) fast, cheap, and reproducible execution (6% of the time and cost of MMLU), and (3) dynamic evaluation enabled by the rapid and stable data update pipeline. We provide extensive meta-evaluation and analysis for our and existing LLM benchmarks to deepen the community’s understanding of LLM evaluation and guide future research directions.
摘要:评估大型语言模型(LLM)是具有挑战性的。传统的基于地面事实的基准测试不能捕捉真实世界查询的全面性和细微差别,而LLM-as-Screen基准测试存在评分偏差和查询数量有限的问题。随着时间的推移,它们也都可能受到污染。面向用户的评估,如聊天机器人竞技场,提供了可靠的信号,但成本高、速度慢。在这项工作中,我们提出了MixEval,一种新的范式,通过战略性地混合现有基准来建立有效的、黄金标准的LLM评估。它通过将从Web挖掘的查询与来自现有基准的类似查询进行匹配,将(1)全面且分布良好的真实世界用户查询与(2)高效且评级合理的基于地面事实的基准测试连接起来。在MixEval的基础上,进一步构建了MixEval-Hard,为模型的改进提供了更大的空间。我们的基准测试的优势在于(1)高度公正的查询分发和评分机制与Chatbot Arena的0.96模型排名相关性,(2)快速、廉价和可重复执行(MMLU的时间和成本的6%),以及(3)快速稳定的数据更新管道支持的动态评估。我们为我们和现有的LLM基准提供广泛的元评估和分析,以加深社区对LLM评估的理解,并指导未来的研究方向。

[NLP-135] Revolutionizing Large Language Model Training through Dynamic Parameter Adjustment
[NLP-135] 通过动态参数调整彻底改变大语言模型训练

Link: https://arxiv.org/abs/2406.06564
Authors: Kaiye Zhou, Shucheng Wang
Keywords: large language models, critically important, era of large, large language, demand for efficient
Keywords (Chinese): 大型语言模型,至关重要,大型语言时代,对高效的需求
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: This paper introduces an innovative parameter-efficient training method that dynamically switches parameters throughout the entire training period, achieving significant memory and computational savings

Abstract:In the era of large language models, the demand for efficient use of computational resources has become critically important. Although parameter-efficient fine-tuning techniques have achieved results comparable to full fine-tuning, their application during the pre-training phase poses significant challenges. Specifically, employing parameter-efficient strategies at the onset of pre-training can severely compromise efficiency, especially in larger models. In this paper, building upon the fine-tuning method LoRA, we introduce a novel parameter-efficient training technique that frequently alters trainable part of parameters, facilitating effective pre-training. Our method not only achieves memory reductions and computational overhead comparable to current state-of-the-art parameter-efficient algorithms during the pre-training phase but also maintains accuracy levels comparable to those of full pre-training. We provide both theoretical analyses and empirical evidence to demonstrate the effectiveness of our approach.
摘要:在大型语言模型的时代,高效使用计算资源的需求变得至关重要。尽管参数高效微调技术取得了与完全微调相当的结果,但在培训前阶段应用这些技术带来了巨大的挑战。具体地说,在预训练开始时采用参数高效策略可能会严重影响效率,特别是在较大的模型中。在本文中,我们在LORA微调方法的基础上,引入了一种新的参数高效训练技术,该技术频繁地改变参数的可训练部分,从而有助于有效的预训练。我们的方法不仅在预训练阶段实现了与当前最先进的参数高效算法相当的内存和计算开销,而且保持了与完全预训练相当的精度水平。我们提供了理论分析和经验证据来证明我们方法的有效性。

[NLP-136] Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models
[NLP-136] Skywork-MoE:深入研究混合专家语言模型的训练技术

Link: https://arxiv.org/abs/2406.06563
Authors: Tianwen Wei, Bo Zhu, Liang Zhao, Cheng Cheng, Biye Li, Weiwei Lü, Peng Cheng, Jianhao Zhang, Xiaoyu Zhang, Liang Zeng, Xiaokun Wang, Yutuan Ma, Rui Hu, Shuicheng Yan, Han Fang, Yahui Zhou
Keywords: large language model, training methodologies implemented, technical report, large language, billion parameters
Keywords (Chinese): 大型语言模型、实施的培训方法、技术报告、大型语言、十亿个参数
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:In this technical report, we introduce the training methodologies implemented in the development of Skywork-MoE, a high-performance mixture-of-experts (MoE) large language model (LLM) with 146 billion parameters and 16 experts. It is initialized from the pre-existing dense checkpoints of our Skywork-13B model. We explore the comparative effectiveness of upcycling versus training from scratch initializations. Our findings suggest that the choice between these two approaches should consider both the performance of the existing dense checkpoints and the MoE training budget. We highlight two innovative techniques: gating logit normalization, which improves expert diversification, and adaptive auxiliary loss coefficients, allowing for layer-specific adjustment of auxiliary loss coefficients. Our experimental results validate the effectiveness of these methods. Leveraging these techniques and insights, we trained our upcycled Skywork-MoE on a condensed subset of our SkyPile corpus. The evaluation results demonstrate that our model delivers strong performance across a wide range of benchmarks.
摘要:在这份技术报告中,我们介绍了Skywork-MoE开发过程中采用的训练方法。Skywork-MoE是一个具有1460亿个参数和16个专家的高性能混合专家(MoE)大型语言模型(LLM)。它从我们已有的Skywork-13B模型的稠密检查点初始化。我们探索了升级再利用(upcycling)与从头开始训练这两种初始化方式的相对有效性。我们的研究结果表明,在这两种方法之间进行选择时,应同时考虑现有稠密检查点的性能和MoE的训练预算。我们重点介绍了两种创新技术:门控logit归一化,它提高了专家的多样性;以及自适应辅助损失系数,它允许对辅助损失系数进行逐层调整。实验结果验证了这些方法的有效性。利用这些技术和见解,我们在SkyPile语料库的一个精简子集上训练了经升级再利用得到的Skywork-MoE。评估结果表明,我们的模型在广泛的基准上都有强劲的表现。
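
The abstract names "gating logit normalization" without giving a formula. One plausible reading, sketched below purely as an assumption, is to standardize the router logits for each token (zero mean, unit variance), multiply by a scale factor, and only then apply softmax. The function name `normalized_gate` and the parameters `scale` and `eps` are hypothetical and are not taken from the report.

```python
import math


def normalized_gate(logits, scale=1.0, eps=1e-6):
    """Softmax over standardized, rescaled gating logits (assumed form)."""
    mean = sum(logits) / len(logits)
    var = sum((z - mean) ** 2 for z in logits) / len(logits)
    std = math.sqrt(var) + eps  # eps guards against identical logits
    normed = [scale * (z - mean) / std for z in logits]
    # Numerically stable softmax.
    peak = max(normed)
    exps = [math.exp(z - peak) for z in normed]
    total = sum(exps)
    return [e / total for e in exps]
```

Under this reading, a larger `scale` gives a sharper distribution over experts regardless of the raw logit magnitudes, which is one way such a normalization could encourage more distinct expert selection.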

[NLP-137] Achieving Sparse Activation in Small Language Models
[NLP-137] 在小语言模型中实现稀疏激活

Link: https://arxiv.org/abs/2406.06562
Authors: Jifeng Song, Kai Huang, Xiangyu Yin, Boyuan Yang, Wei Gao
Keywords: Small Language Models, Large Language Models, Language Models, emerging Small Language, Small Language
Keywords (Chinese): 小语言模型,大语言模型,语言模型,新兴小语言,小语言
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 15 pages

Abstract:Sparse activation, which selectively activates only an input-dependent set of neurons in inference, is a useful technique to reduce the computing cost of Large Language Models (LLMs) without retraining or adaptation efforts. However, whether it can be applied to the recently emerging Small Language Models (SLMs) remains questionable, because SLMs are generally less over-parameterized than LLMs. In this paper, we aim to achieve sparse activation in SLMs. We first show that the existing sparse activation schemes in LLMs that build on neurons’ output magnitudes cannot be applied to SLMs, and activating neurons based on their attribution scores is a better alternative. Further, we demonstrated and quantified the large errors of existing attribution metrics when being used for sparse activation, due to the interdependency among attribution scores of neurons across different layers. Based on these observations, we proposed a new attribution metric that can provably correct such errors and achieve precise sparse activation. Experiments over multiple popular SLMs and datasets show that our approach can achieve 80% sparsification ratio with 5% model accuracy loss, comparable to the sparse activation achieved in LLMs. The source code is available at: this https URL.
摘要:稀疏激活,即在推理中选择性地激活一组依赖于输入的神经元,是一种有用的技术,可以在不需要重新训练或适应的情况下降低大型语言模型(LLM)的计算成本。然而,它是否可以应用于最近出现的小语言模型仍然是值得怀疑的,因为小语言模型通常比小语言模型的过度参数少。在本文中,我们的目标是在SLM中实现稀疏激活。我们首先证明了LLMS中现有的建立在神经元输出幅度上的稀疏激活方案不适用于SLM,而基于其属性分数来激活神经元是一种更好的选择。此外,由于不同层次神经元的归因分数之间的相互依赖,我们证明并量化了现有的属性度量在用于稀疏激活时的巨大误差。基于这些观察结果,我们提出了一种新的属性度量,该度量可以证明可以纠正这些错误,并实现精确的稀疏激活。在多个流行的SLM和数据集上的实验表明,我们的方法可以在模型精度损失5%的情况下获得80%的稀疏率,与LLMS的稀疏激活相媲美。源代码可在以下网址获得:This HTTPS URL。
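
Sparse activation as described in the abstract keeps only an input-dependent subset of neurons, chosen by a per-neuron score. The sketch below shows the generic top-k selection step only, assuming the scores (output magnitudes or attribution scores) are already computed. It does not implement the corrected attribution metric the paper proposes, and the name `sparse_mask` is hypothetical.

```python
def sparse_mask(scores, sparsity):
    """Return a 0/1 mask keeping the neurons with the largest |score|.

    sparsity: fraction of neurons to deactivate, e.g. 0.8 for 80%.
    """
    n_keep = max(1, round(len(scores) * (1 - sparsity)))
    ranked = sorted(range(len(scores)), key=lambda i: abs(scores[i]), reverse=True)
    keep = set(ranked[:n_keep])
    return [1 if i in keep else 0 for i in range(len(scores))]
```

For example, with five neurons and `sparsity=0.8`, only the single highest-scoring neuron stays active, which corresponds to the 80% sparsification ratio quoted in the abstract.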

[NLP-138] Brainstorming Brings Power to Large Language Models of Knowledge Reasoning
[NLP-138] 集思广益为知识推理的大型语言模型带来力量

Link: https://arxiv.org/abs/2406.06561
Authors: Zining Qin, Chenhao Wang, Huiling Qin, Weijia Jia
Keywords: Large Language Models, Large Language, language generation, demonstrated amazing capabilities, text comprehension
Keywords (Chinese): 大型语言模型、大型语言、语言生成、展示了惊人的能力、文本理解
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Large Language Models (LLMs) have demonstrated amazing capabilities in language generation, text comprehension, and knowledge reasoning. While a single powerful model can already handle multiple tasks, relying on a single perspective can lead to biased and unstable results. Recent studies have further improved the model’s reasoning ability on a wide range of tasks by introducing multi-model collaboration. However, models with different capabilities may produce conflicting answers on the same problem, and how to reasonably obtain the correct answer from multiple candidate models has become a challenging problem. In this paper, we propose the multi-model brainstorming based on prompt. It incorporates different models into a group for brainstorming, and after multiple rounds of reasoning elaboration and re-inference, a consensus answer is reached within the group. We conducted experiments on three different types of datasets, and demonstrate that the brainstorming can significantly improve the effectiveness in logical reasoning and fact extraction. Furthermore, we find that two small-parameter models can achieve accuracy approximating that of larger-parameter models through brainstorming, which provides a new solution for distributed deployment of LLMs.
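
The brainstorming loop described in the abstract (several models answer, share their reasoning, and re-infer until the group agrees) can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Model` callable signature, the stub models, the round limit, and the majority-vote fallback are all hypothetical and not taken from the paper.

```python
from collections import Counter
from typing import Callable, List, Tuple

# Hypothetical interface: a "model" takes (question, peer_reasonings) and
# returns (answer, reasoning). Real LLM calls would replace the stubs below.
Model = Callable[[str, List[str]], Tuple[str, str]]


def brainstorm(question: str, models: List[Model], max_rounds: int = 3) -> str:
    """Run rounds of re-inference until all models agree or rounds run out."""
    peer_reasonings: List[str] = []
    answers: List[str] = []
    for _ in range(max_rounds):
        results = [model(question, peer_reasonings) for model in models]
        answers = [answer for answer, _ in results]
        if len(set(answers)) == 1:  # consensus reached within the group
            return answers[0]
        # Share every model's reasoning with the group for the next round.
        peer_reasonings = [reasoning for _, reasoning in results]
    # Fallback when no consensus emerges (an assumption, not from the paper).
    return Counter(answers).most_common(1)[0][0]


def steady_model(question: str, peers: List[str]) -> Tuple[str, str]:
    # Stub that always gives the same answer and reasoning.
    return "4", "2 + 2 is 4 by counting"


def persuadable_model(question: str, peers: List[str]) -> Tuple[str, str]:
    # Stub that guesses wrongly at first, then adopts the group's view
    # once it has seen any peer reasoning.
    if peers:
        return "4", "agreeing with the group's counting argument"
    return "5", "a rough guess"


result = brainstorm("What is 2 + 2?", [steady_model, persuadable_model])
```

In this toy run the two stubs disagree in the first round and converge in the second, so the loop returns the shared answer without reaching the fallback.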

[NLP-139] Inverse Constitutional AI: Compressing Preferences into Principles

Link: https://arxiv.org/abs/2406.06560
Authors: Arduin Findeis, Timo Kaufmann, Eyke Hüllermeier, Samuel Albanie, Robert Mullins
Keywords: fine-tuning and evaluating, pairwise text preference, plays an important, important role, role in fine-tuning
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Feedback data plays an important role in fine-tuning and evaluating state-of-the-art AI models. Often pairwise text preferences are used: given two texts, human (or AI) annotators select the “better” one. Such feedback data is widely used to align models to human preferences (e.g., reinforcement learning from human feedback), or to rank models according to human preferences (e.g., Chatbot Arena). Despite its wide-spread use, prior work has demonstrated that human-annotated pairwise text preference data often exhibits unintended biases. For example, human annotators have been shown to prefer assertive over truthful texts in certain contexts. Models trained or evaluated on this data may implicitly encode these biases in a manner hard to identify. In this paper, we formulate the interpretation of existing pairwise text preference data as a compression task: the Inverse Constitutional AI (ICAI) problem. In constitutional AI, a set of principles (or constitution) is used to provide feedback and fine-tune AI models. The ICAI problem inverts this process: given a dataset of feedback, we aim to extract a constitution that best enables a large language model (LLM) to reconstruct the original annotations. We propose a corresponding initial ICAI algorithm and validate its generated constitutions quantitatively based on reconstructed annotations. Generated constitutions have many potential use-cases – they may help identify undesirable biases, scale feedback to unseen data or assist with adapting LLMs to individual user preferences. We demonstrate our approach on a variety of datasets: (a) synthetic feedback datasets with known underlying principles; (b) the AlpacaEval dataset of cross-annotated human feedback; and (c) the crowdsourced Chatbot Arena data set. We release the code for our algorithm and experiments at this https URL.

[NLP-140] Harnessing Business and Media Insights with Large Language Models

Link: https://arxiv.org/abs/2406.06559
Authors: Yujia Bao, Ankit Parag Shah, Neeru Narang, Jonathan Rivers, Rajeev Maksey, Lan Guan, Louise N. Barrere, Shelley Evenson, Rahul Basole, Connie Miao, Ankit Mehta, Fabien Boulay, Su Min Park, Natalie E. Pearson, Eldhose Joy, Tiger He, Sumiran Thakur, Koustav Ghosal, Josh On, Phoebe Morrison, Tim Major, Eva Siqi Wang, Gina Escobar, Jiaheng Wei, Tharindu Cyril Weerasooriya, Queena Song, Daria Lashkevich, Clare Chen, Gyuhak Kim, Dengpan Yin, Don Hejna, Mo Nomeli, Wei Wei
Keywords: introduces Fortune Analytics, Analytics Language Model, Fortune Analytics Language, paper introduces Fortune, Fortune Analytics
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:This paper introduces Fortune Analytics Language Model (FALM). FALM empowers users with direct access to comprehensive business analysis, including market trends, company performance metrics, and expert insights. Unlike generic LLMs, FALM leverages a curated knowledge base built from professional journalism, enabling it to deliver precise and in-depth answers to intricate business questions. Users can further leverage natural language queries to directly visualize financial data, generating insightful charts and graphs to understand trends across diverse business sectors clearly. FALM fosters user trust and ensures output accuracy through three novel methods: 1) Time-aware reasoning guarantees accurate event registration and prioritizes recent updates. 2) Thematic trend analysis explicitly examines topic evolution over time, providing insights into emerging business landscapes. 3) Content referencing and task decomposition enhance answer fidelity and data visualization accuracy. We conduct both automated and human evaluations, demonstrating FALM’s significant performance improvements over baseline methods while prioritizing responsible AI practices. These benchmarks establish FALM as a cutting-edge LLM in the business and media domains, with exceptional accuracy and trustworthiness.

[NLP-141] Enhancing Text Authenticity: A Novel Hybrid Approach for AI-Generated Text Detection

Link: https://arxiv.org/abs/2406.06558
Authors: Ye Zhang, Qian Leng, Mengran Zhu, Rui Ding, Yue Wu, Jintong Song, Yulu Gong
Keywords: Large Language Models, Large Language, Language Models, Stochastic Gradient Descent, Categorical Gradient Boosting
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:The rapid advancement of Large Language Models (LLMs) has ushered in an era where AI-generated text is increasingly indistinguishable from human-generated content. Detecting AI-generated text has become imperative to combat misinformation, ensure content authenticity, and safeguard against malicious uses of AI. In this paper, we propose a novel hybrid approach that combines traditional TF-IDF techniques with advanced machine learning models, including Bayesian classifiers, Stochastic Gradient Descent (SGD), Categorical Gradient Boosting (CatBoost), and 12 instances of Deberta-v3-large models. Our approach aims to address the challenges associated with detecting AI-generated text by leveraging the strengths of both traditional feature extraction methods and state-of-the-art deep learning models. Through extensive experiments on a comprehensive dataset, we demonstrate the effectiveness of our proposed method in accurately distinguishing between human and AI-generated text. Our approach achieves superior performance compared to existing methods. This research contributes to the advancement of AI-generated text detection techniques and lays the foundation for developing robust solutions to mitigate the challenges posed by AI-generated content.

[NLP-142] Enhancing Presentation Slide Generation by LLMs with a Multi-Staged End-to-End Approach

Link: https://arxiv.org/abs/2406.06556
Authors: Sambaran Bandyopadhyay, Himanshu Maheshwari, Anandhavelu Natarajan, Apoorv Saxena
Keywords: important task, multimodal elements, text and images, Generating presentation slides, long document
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Generating presentation slides from a long document with multimodal elements such as text and images is an important task. This is time consuming and needs domain expertise if done manually. Existing approaches for generating a rich presentation from a document are often semi-automatic or only put a flat summary into the slides ignoring the importance of a good narrative. In this paper, we address this research gap by proposing a multi-staged end-to-end model which uses a combination of LLM and VLM. We have experimentally shown that compared to applying LLMs directly with state-of-the-art prompting, our proposed multi-staged solution is better in terms of automated metrics and human evaluation.

[NLP-143] An Evaluation Benchmark for Autoformalization in Lean4

Link: https://arxiv.org/abs/2406.06555
Authors: Aryan Gulati, Devanshu Ladsaria, Shubhra Mishra, Jasdeep Sidhu, Brando Miranda
Keywords: Large Language Models, Large Language, Language Models, Large, Models
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Programming Languages (cs.PL)
Comments: To appear at ICLR 2024 as part of the Tiny Papers track

Abstract:Large Language Models (LLMs) hold the potential to revolutionize autoformalization. The introduction of Lean4, a mathematical programming language, presents an unprecedented opportunity to rigorously assess the autoformalization capabilities of LLMs. This paper introduces a novel evaluation benchmark designed for Lean4, applying it to test the abilities of state-of-the-art LLMs, including GPT-3.5, GPT-4, and Gemini Pro. Our comprehensive analysis reveals that, despite recent advancements, these LLMs still exhibit limitations in autoformalization, particularly in more complex areas of mathematics. These findings underscore the need for further development in LLMs to fully harness their potential in scientific research and development. This study not only benchmarks current LLM capabilities but also sets the stage for future enhancements in autoformalization.

[NLP-144] MM-KWS: Multi-modal Prompts for Multilingual User-defined Keyword Spotting

Link: https://arxiv.org/abs/2406.07310
Authors: Zhiqi Ai, Zhiyong Chen, Shugong Xu
Keywords: spotting leveraging multi-modal, leveraging multi-modal enrollments, user-defined keyword spotting, keyword spotting leveraging, approach to user-defined
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
Comments: Accepted at INTERSPEECH 2024

Abstract:In this paper, we propose MM-KWS, a novel approach to user-defined keyword spotting leveraging multi-modal enrollments of text and speech templates. Unlike previous methods that focus solely on either text or speech features, MM-KWS extracts phoneme, text, and speech embeddings from both modalities. These embeddings are then compared with the query speech embedding to detect the target keywords. To ensure the applicability of MM-KWS across diverse languages, we utilize a feature extractor incorporating several multilingual pre-trained models. Subsequently, we validate its effectiveness on Mandarin and English tasks. In addition, we have integrated advanced data augmentation tools for hard case mining to enhance MM-KWS in distinguishing confusable words. Experimental results on the LibriPhrase and WenetPhrase datasets demonstrate that MM-KWS outperforms prior methods significantly.

[NLP-145] Translating speech with just images

Link: https://arxiv.org/abs/2406.07133
Authors: Dan Oneata, Herman Kamper
Keywords: Visually grounded speech, Visually grounded, grounded speech models, speech models link, models link speech
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
Comments: Accepted at Interspeech 2024

Abstract:Visually grounded speech models link speech to images. We extend this connection by linking images to text via an existing image captioning system, and as a result gain the ability to map speech audio directly to text. This approach can be used for speech translation with just images by having the audio in a different language from the generated captions. We investigate such a system on a real low-resource language, Yorùbá, and propose a Yorùbá-to-English speech translation model that leverages pretrained components in order to be able to learn in the low-resource regime. To limit overfitting, we find that it is essential to use a decoding scheme that produces diverse image captions for training. Results show that the predicted translations capture the main semantics of the spoken audio, albeit in a simpler and shorter form.

[NLP-146] Fast Context-Biasing for CTC and Transducer ASR models with CTC-based Word Spotter

Link: https://arxiv.org/abs/2406.07096
Authors: Andrei Andrusenko, Aleksandr Laptev, Vladimir Bataev, Vitaly Lavrukhin, Boris Ginsburg
Keywords: contextualized Automatic Speech, Automatic Speech Recognition, Automatic Speech, contextualized Automatic, Accurate recognition
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
Comments: Accepted by Interspeech 2024

Abstract:Accurate recognition of rare and new words remains a pressing problem for contextualized Automatic Speech Recognition (ASR) systems. Most context-biasing methods involve modification of the ASR model or the beam-search decoding algorithm, complicating model reuse and slowing down inference. This work presents a new approach to fast context-biasing with CTC-based Word Spotter (CTC-WS) for CTC and Transducer (RNN-T) ASR models. The proposed method matches CTC log-probabilities against a compact context graph to detect potential context-biasing candidates. The valid candidates then replace their greedy recognition counterparts in corresponding frame intervals. A Hybrid Transducer-CTC model enables the CTC-WS application for the Transducer model. The results demonstrate a significant acceleration of the context-biasing recognition with a simultaneous improvement in F-score and WER compared to baseline methods. The proposed method is publicly available in the NVIDIA NeMo toolkit.

[NLP-147] LoRA-Whisper: Parameter-Efficient and Extensible Multilingual ASR

Link: https://arxiv.org/abs/2406.06619
Authors: Zheshu Song, Jianheng Zhuo, Yifan Yang, Ziyang Ma, Shixiong Zhang, Xie Chen
Keywords: automatic speech recognition, witnessed significant progress, Recent years, multilingual automatic speech, multilingual ASR
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 5 pages, 2 figures, conference

Abstract:Recent years have witnessed significant progress in multilingual automatic speech recognition (ASR), driven by the emergence of end-to-end (E2E) models and the scaling of multilingual datasets. Despite that, two main challenges persist in multilingual ASR: language interference and the incorporation of new languages without degrading the performance of the existing ones. This paper proposes LoRA-Whisper, which incorporates LoRA matrix into Whisper for multilingual ASR, effectively mitigating language interference. Furthermore, by leveraging LoRA and the similarities between languages, we can achieve better performance on new languages while upholding consistent performance on original ones. Experiments on a real-world task across eight languages demonstrate that our proposed LoRA-Whisper yields a relative gain of 18.5% and 23.0% over the baseline system for multilingual ASR and language expansion respectively.
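
As a rough illustration of how a LoRA update sits on top of a frozen weight matrix, the sketch below computes `W x + (alpha / r) * B (A x)` and keeps one low-rank pair per language. Keeping a separate adapter per language is an assumption made here for illustration (the abstract does not spell out the arrangement), and the class name, language tags, and toy numbers are all hypothetical rather than taken from Whisper or the paper.

```python
from typing import Dict, List, Tuple

Matrix = List[List[float]]
Vector = List[float]


def matvec(matrix: Matrix, vector: Vector) -> Vector:
    """Plain matrix-vector product, one output entry per matrix row."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


class LoRALinear:
    """A frozen linear layer with optional per-language low-rank updates."""

    def __init__(self, weight: Matrix, rank: int, alpha: float = 1.0) -> None:
        self.weight = weight  # shared base weights, never modified here
        self.rank = rank
        self.alpha = alpha
        self.adapters: Dict[str, Tuple[Matrix, Matrix]] = {}

    def add_language(self, lang: str, a: Matrix, b: Matrix) -> None:
        # a has shape (rank, in_dim); b has shape (out_dim, rank).
        self.adapters[lang] = (a, b)

    def forward(self, x: Vector, lang: str) -> Vector:
        base = matvec(self.weight, x)
        if lang not in self.adapters:
            return base  # languages without an adapter use the base layer
        a, b = self.adapters[lang]
        delta = matvec(b, matvec(a, x))  # low-rank update B (A x)
        scale = self.alpha / self.rank
        return [h + scale * d for h, d in zip(base, delta)]


# Toy example: a 2x2 identity layer plus a rank-1 adapter for one language.
layer = LoRALinear(weight=[[1.0, 0.0], [0.0, 1.0]], rank=1, alpha=1.0)
layer.add_language("new_lang", a=[[1.0, 1.0]], b=[[0.5], [0.0]])
base_out = layer.forward([2.0, 3.0], lang="en")
adapted_out = layer.forward([2.0, 3.0], lang="new_lang")
```

Because the base weights are untouched, adding an adapter for a new language leaves the outputs for existing languages unchanged, which is the property the abstract highlights for language expansion.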

Computer Vision

[CV-0] Blur-aware Spatio-temporal Sparse Transformer for Video Deblurring

Link: https://arxiv.org/abs/2406.07551
Authors: Huicong Zhang, Haozhe Xie, Hongxun Yao
Keywords: Video deblurring relies, video sequence, bidirectional feature propagation, Video deblurring, Video
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2024

Abstract:Video deblurring relies on leveraging information from other frames in the video sequence to restore the blurred regions in the current frame. Mainstream approaches employ bidirectional feature propagation, spatio-temporal transformers, or a combination of both to extract information from the video sequence. However, limitations in memory and computational resources constraints the temporal window length of the spatio-temporal transformer, preventing the extraction of longer temporal contextual information from the video sequence. Additionally, bidirectional feature propagation is highly sensitive to inaccurate optical flow in blurry frames, leading to error accumulation during the propagation process. To address these issues, we propose BSSTNet, Blur-aware Spatio-temporal Sparse Transformer Network. It introduces the blur map, which converts the originally dense attention into a sparse form, enabling a more extensive utilization of information throughout the entire video sequence. Specifically, BSSTNet (1) uses a longer temporal window in the transformer, leveraging information from more distant frames to restore the blurry pixels in the current frame. (2) introduces bidirectional feature propagation guided by blur maps, which reduces error accumulation caused by the blur frame. The experimental results demonstrate the proposed BSSTNet outperforms the state-of-the-art methods on the GoPro and DVD datasets.

[CV-1] An Image is Worth 32 Tokens for Reconstruction and Generation

Link: https://arxiv.org/abs/2406.07550
Authors: Qihang Yu, Mark Weber, Xueqing Deng, Xiaohui Shen, Daniel Cremers, Liang-Chieh Chen
Keywords: Recent advancements, advancements in generative, highlighted the crucial, crucial role, synthesis of high-resolution
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: A compact 1D Image Tokenization method, leading to SOTA generation performance while being substantially faster. Project page at this https URL

Abstract:Recent advancements in generative models have highlighted the crucial role of image tokenization in the efficient synthesis of high-resolution images. Tokenization, which transforms images into latent representations, reduces computational demands compared to directly processing pixels and enhances the effectiveness and efficiency of the generation process. Prior methods, such as VQGAN, typically utilize 2D latent grids with fixed downsampling factors. However, these 2D tokenizations face challenges in managing the inherent redundancies present in images, where adjacent regions frequently display similarities. To overcome this issue, we introduce Transformer-based 1-Dimensional Tokenizer (TiTok), an innovative approach that tokenizes images into 1D latent sequences. TiTok provides a more compact latent representation, yielding substantially more efficient and effective representations than conventional techniques. For example, a 256 x 256 x 3 image can be reduced to just 32 discrete tokens, a significant reduction from the 256 or 1024 tokens obtained by prior methods. Despite its compact nature, TiTok achieves competitive performance to state-of-the-art approaches. Specifically, using the same generator framework, TiTok attains 1.97 gFID, outperforming MaskGIT baseline significantly by 4.21 at ImageNet 256 x 256 benchmark. The advantages of TiTok become even more significant when it comes to higher resolution. At ImageNet 512 x 512 benchmark, TiTok not only outperforms state-of-the-art diffusion model DiT-XL/2 (gFID 2.74 vs. 3.04), but also reduces the image tokens by 64x, leading to 410x faster generation process. Our best-performing variant can significantly surpasses DiT-XL/2 (gFID 2.13 vs. 3.04) while still generating high-quality samples 74x faster.
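
The token counts quoted in the abstract can be checked with a little arithmetic. The sketch below assumes that the 256-token and 1024-token figures for prior methods correspond to 2D latent grids with downsampling factors of 16 and 8; the abstract gives only the totals, so those factors (and the helper name `grid_tokens`) are assumptions for illustration.

```python
def grid_tokens(image_size: int, downsample: int) -> int:
    """Number of tokens in a square 2D latent grid for a square image."""
    side = image_size // downsample
    return side * side


# Assumed downsampling factors for prior 2D tokenizers on a 256 x 256 image.
tokens_f16 = grid_tokens(256, 16)
tokens_f8 = grid_tokens(256, 8)

# TiTok's 1D sequence length for the same image, as stated in the abstract.
titok_tokens = 32

# How many times shorter the 1D sequence is than each 2D grid.
reduction_vs_f16 = tokens_f16 // titok_tokens
reduction_vs_f8 = tokens_f8 // titok_tokens
```

Under these assumptions the 1D sequence is 8 times shorter than a 16 x 16 grid and 32 times shorter than a 32 x 32 grid.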

[CV-2] Image and Video Tokenization with Binary Spherical Quantization

Link: https://arxiv.org/abs/2406.07548
Authors: Yue Zhao, Yuanjun Xiong, Philipp Krähenbühl
Keywords: Binary Spherical Quantization, Spherical Quantization, Binary Spherical, applies binary quantization, binary quantization
Subjects: Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: Tech report

Abstract:We propose a new transformer-based image and video tokenizer with Binary Spherical Quantization (BSQ). BSQ projects the high-dimensional visual embedding to a lower-dimensional hypersphere and then applies binary quantization. BSQ is (1) parameter-efficient without an explicit codebook, (2) scalable to arbitrary token dimensions, and (3) compact: compressing visual data by up to 100× with minimal distortion. Our tokenizer uses a transformer encoder and decoder with simple block-wise causal masking to support variable-length videos as input. The resulting BSQ-ViT achieves state-of-the-art visual reconstruction quality on image and video reconstruction benchmarks with 2.4× throughput compared to the best prior methods. Furthermore, by learning an autoregressive prior for adaptive arithmetic coding, BSQ-ViT achieves comparable results on video compression with state-of-the-art video compression standards. BSQ-ViT also enables masked language models to achieve competitive image synthesis quality to GAN- and diffusion-based methods.
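
The project-normalize-binarize pipeline described in the abstract can be sketched in a few lines. This is a minimal sketch under stated assumptions: the projection matrix is a fixed toy matrix rather than a learned layer, the `1 / sqrt(L)` scaling that keeps the quantized code on the unit sphere is assumed rather than quoted from the abstract, and all function names are hypothetical.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]
Vector = List[float]


def project(embedding: Vector, weights: Matrix) -> Vector:
    """Linear projection to a lower dimension (one output per weight row)."""
    return [sum(w * x for w, x in zip(row, embedding)) for row in weights]


def bsq_quantize(v: Vector) -> Tuple[Vector, List[int]]:
    """Normalize v onto the unit sphere, then binarize each coordinate."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    u = [x / norm for x in v]
    scale = 1.0 / math.sqrt(len(u))  # assumed scaling: code stays unit-norm
    # Coordinates that are exactly zero are mapped to the negative side here.
    code = [scale if x > 0 else -scale for x in u]
    bits = [1 if x > 0 else 0 for x in u]
    return code, bits


def bits_to_index(bits: List[int]) -> int:
    """Read the sign pattern as an integer token id (no stored codebook)."""
    index = 0
    for bit in bits:
        index = (index << 1) | bit
    return index


# Toy example: project a 4-d embedding to 2-d, then quantize.
toy_weights = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
low_dim = project([3.0, -4.0, 1.0, 2.0], toy_weights)
code, bits = bsq_quantize(low_dim)
token_id = bits_to_index(bits)
```

With `L` binary coordinates the implicit vocabulary has `2 ** L` entries, yet nothing beyond the projection has to be stored, which is one way to read the abstract's "without an explicit codebook".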

[CV-3] Zero-shot Image Editing with Reference Imitation

Link: https://arxiv.org/abs/2406.07547
Authors: Xi Chen, Yutong Feng, Mengting Chen, Yiyang Wang, Shilong Zhang, Yu Liu, Yujun Shen, Hengshuang Zhao
Keywords: Image editing serves, practical yet challenging, challenging task, diverse demands, hardest parts
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: this https URL

Abstract:Image editing serves as a practical yet challenging task considering the diverse demands from users, where one of the hardest parts is to precisely describe how the edited image should look like. In this work, we present a new form of editing, termed imitative editing, to help users exercise their creativity more conveniently. Concretely, to edit an image region of interest, users are free to directly draw inspiration from some in-the-wild references (e.g., some relative pictures come across online), without having to cope with the fit between the reference and the source. Such a design requires the system to automatically figure out what to expect from the reference to perform the editing. For this purpose, we propose a generative training framework, dubbed MimicBrush, which randomly selects two frames from a video clip, masks some regions of one frame, and learns to recover the masked regions using the information from the other frame. That way, our model, developed from a diffusion prior, is able to capture the semantic correspondence between separate images in a self-supervised manner. We experimentally show the effectiveness of our method under various test cases as well as its superiority over existing alternatives. We also construct a benchmark to facilitate further research.

[CV-4] Commonsense-T2I Challenge: Can Text-to-Image Generation Models Understand Commonsense?

Link: https://arxiv.org/abs/2406.07546
Authors: Xingyu Fu, Muyu He, Yujie Lu, William Yang Wang, Dan Roth
Keywords: evaluating the ability, lightbulb, real life, produce images, Abstract
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Text-to-Image Generation, Commonsense, Project Url: this https URL

Abstract:We present a novel task and benchmark for evaluating the ability of text-to-image(T2I) generation models to produce images that fit commonsense in real life, which we call Commonsense-T2I. Given two adversarial text prompts containing an identical set of action words with minor differences, such as “a lightbulb without electricity” v.s. “a lightbulb with electricity”, we evaluate whether T2I models can conduct visual-commonsense reasoning, e.g. produce images that fit “the lightbulb is unlit” vs. “the lightbulb is lit” correspondingly. Commonsense-T2I presents an adversarial challenge, providing pairwise text prompts along with expected outputs. The dataset is carefully hand-curated by experts and annotated with fine-grained labels, such as commonsense type and likelihood of the expected outputs, to assist analyzing model behavior. We benchmark a variety of state-of-the-art (sota) T2I models and surprisingly find that, there is still a large gap between image synthesis and real life photos–even the DALL-E 3 model could only achieve 48.92% on Commonsense-T2I, and the stable diffusion XL model only achieves 24.92% accuracy. Our experiments show that GPT-enriched prompts cannot solve this challenge, and we include a detailed analysis about possible reasons for such deficiency. We aim for Commonsense-T2I to serve as a high-quality evaluation benchmark for T2I commonsense checking, fostering advancements in real life image generation.

[CV-5] Situational Awareness Matters in 3D Vision Language Reasoning

Link: https://arxiv.org/abs/2406.07544
Authors: Yunze Man, Liang-Yan Gui, Yu-Xiong Wang
Keywords: developing household robots, vision language reasoning, complicated vision language, language reasoning tasks, vision language
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: CVPR 2024. Project Page: this https URL

Abstract:Being able to carry out complicated vision language reasoning tasks in 3D space represents a significant milestone in developing household robots and human-centered embodied AI. In this work, we demonstrate that a critical and distinct challenge in 3D vision language reasoning is situational awareness, which incorporates two key components: (1) The autonomous agent grounds its self-location based on a language prompt. (2) The agent answers open-ended questions from the perspective of its calculated position. To address this challenge, we introduce SIG3D, an end-to-end Situation-Grounded model for 3D vision language reasoning. We tokenize the 3D scene into sparse voxel representation and propose a language-grounded situation estimator, followed by a situated question answering module. Experiments on the SQA3D and ScanQA datasets show that SIG3D outperforms state-of-the-art models in situation estimation and question answering by a large margin (e.g., an enhancement of over 30% on situation estimation accuracy). Subsequent analysis corroborates our architectural design choices, explores the distinct functions of visual and textual tokens, and highlights the importance of situational awareness in the domain of 3D question answering.

[CV-6] Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning

Link: https://arxiv.org/abs/2406.07543
Authors: Chenyu Yang, Xizhou Zhu, Jinguo Zhu, Weijie Su, Junjie Wang, Xuan Dong, Wenhai Wang, Lewei Lu, Bin Li, Jie Zhou, Yu Qiao, Jifeng Dai
Keywords: interleaved image-text data, web-crawled image-text data, image-text data, vision model pre-training, Latent Compression Learning
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Recently, vision model pre-training has evolved from relying on manually annotated datasets to leveraging large-scale, web-crawled image-text data. Despite these advances, there is no pre-training method that effectively exploits the interleaved image-text data, which is very prevalent on the Internet. Inspired by the recent success of compression learning in natural language processing, we propose a novel vision model pre-training method called Latent Compression Learning (LCL) for interleaved image-text data. This method performs latent compression learning by maximizing the mutual information between the inputs and outputs of a causal attention model. The training objective can be decomposed into two basic tasks: 1) contrastive learning between visual representation and preceding context, and 2) generating subsequent text based on visual representation. Our experiments demonstrate that our method not only matches the performance of CLIP on paired pre-training datasets (e.g., LAION), but can also leverage interleaved pre-training data (e.g., MMC4) to learn robust visual representation from scratch, showcasing the potential of vision model pre-training with interleaved image-text data. Code is released at this https URL.

[CV-7] Ctrl-X: Controlling Structure and Appearance for Text-To-Image Generation Without Guidance

Link: https://arxiv.org/abs/2406.07540
Authors: Kuan Heng Lin, Sicheng Mo, Ben Klingher, Fangzhou Mu, Bolei Zhou
Keywords: Self-guidance bring fine-grained, Recent controllable generation, Diffusion Self-guidance bring, bring fine-grained spatial, controllable generation approaches
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 18 pages, 11 figures, see project page at this https URL

Abstract:Recent controllable generation approaches such as FreeControl and Diffusion Self-guidance bring fine-grained spatial and appearance control to text-to-image (T2I) diffusion models without training auxiliary modules. However, these methods optimize the latent embedding for each type of score function with longer diffusion steps, making the generation process time-consuming and limiting their flexibility and use. This work presents Ctrl-X, a simple framework for T2I diffusion controlling structure and appearance without additional training or guidance. Ctrl-X designs feed-forward structure control to enable the structure alignment with a structure image and semantic-aware appearance transfer to facilitate the appearance transfer from a user-input image. Extensive qualitative and quantitative experiments illustrate the superior performance of Ctrl-X on various condition inputs and model checkpoints. In particular, Ctrl-X supports novel structure and appearance control with arbitrary condition images of any modality, exhibits superior image quality and appearance transfer compared to existing works, and provides instant plug-and-play functionality to any T2I and text-to-video (T2V) diffusion model. See our project page for an overview of the results: this https URL

[CV-8] Autoregressive Pretraining with Mamba in Vision

Link: https://arxiv.org/abs/2406.07537
Authors: Sucheng Ren, Xianhang Li, Haoqin Tu, Feng Wang, Fangxun Shu, Lei Zhang, Jieru Mei, Linjie Yang, Peng Wang, Heng Wang, Alan Yuille, Cihang Xie
Keywords: recently developed state, developed state space, state space model, Mamba, range of tasks
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:The vision community has started to build with the recently developed state space model, Mamba, as the new backbone for a range of tasks. This paper shows that Mamba’s visual capability can be significantly enhanced through autoregressive pretraining, a direction not previously explored. Efficiency-wise, the autoregressive nature capitalizes on Mamba’s unidirectional recurrent structure, enabling faster overall training than other training strategies such as mask modeling. Performance-wise, autoregressive pretraining equips the Mamba architecture with markedly higher accuracy over its supervised-trained counterparts and, more importantly, successfully unlocks its scaling potential to large and even huge model sizes. For example, with autoregressive pretraining, a base-size Mamba attains 83.2% ImageNet accuracy, outperforming its supervised counterpart by 2.0%; our huge-size Mamba, the largest Vision Mamba to date, attains 85.0% ImageNet accuracy (85.5% when finetuned with 384×384 inputs), notably surpassing all other Mamba variants in vision. The code is available at this https URL.

[CV-9] Towards Fundamentally Scalable Model Selection: Asymptotically Fast Update and Selection

Link: https://arxiv.org/abs/2406.07536
Authors: Wenxiao Wang, Weiming Zhuang, Lingjuan Lyu
Keywords: deep learning technologies, model selection, model, isolated model embedding, selection
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Notes: 19 pages, 8 figures

Abstract:The advancement of deep learning technologies is bringing new models every day, motivating the study of scalable model selection. An ideal model selection scheme should minimally support two operations efficiently over a large pool of candidate models: update, which involves either adding a new candidate model or removing an existing candidate model, and selection, which involves locating highly performing models for a given task. However, previous solutions to model selection require high computational complexity for at least one of these two operations. In this work, we target fundamentally (more) scalable model selection that supports asymptotically fast update and asymptotically fast selection at the same time. Firstly, we define isolated model embedding, a family of model selection schemes supporting asymptotically fast update and selection: with respect to the number of candidate models m, the update complexity is O(1) and the selection consists of a single sweep over m vectors in addition to O(1) model operations. Isolated model embedding also implies several desirable properties for applications. Secondly, we present Standardized Embedder, an empirical realization of isolated model embedding. We assess its effectiveness by using it to select representations from a pool of 100 pre-trained vision models for classification tasks and measuring the performance gaps between the selected models and the best candidates with a linear probing protocol. Experiments suggest our realization is effective in selecting models with competitive performances and highlight isolated model embedding as a promising direction towards model selection that is fundamentally (more) scalable.
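
The complexity claim in this abstract (O(1) update, selection as one sweep over m vectors) can be made concrete with a toy pool. This is a hedged sketch under stated assumptions: the `ModelPool` class, the cosine scoring, and the idea that a task is summarized by a single vector are all hypothetical illustrations, not the paper's Standardized Embedder.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ModelPool:
    """Toy pool in which each model is embedded independently of the others."""

    def __init__(self):
        self.embeddings = {}  # model name -> fixed-length vector

    def add(self, name, vector):
        # O(1) in the pool size m: no other embedding is recomputed.
        self.embeddings[name] = list(vector)

    def remove(self, name):
        # Likewise O(1) in m.
        self.embeddings.pop(name, None)

    def select(self, task_vector, k=1):
        # One pass over the m stored vectors, keeping the k best scores.
        scored = ((cosine(vec, task_vector), name)
                  for name, vec in self.embeddings.items())
        return [name for _, name in heapq.nlargest(k, scored)]


pool = ModelPool()
pool.add("model_a", [1.0, 0.0])
pool.add("model_b", [0.0, 1.0])
pool.add("model_c", [0.7, 0.7])
best = pool.select([1.0, 0.1], k=1)
```

Because each embedding is stored in isolation, adding or removing a candidate never touches the others, which is the property that makes the update cost independent of m in this sketch.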

[CV-10] Hearing Anything Anywhere

Link: https://arxiv.org/abs/2406.07532
Authors: Mason Wang, Ryosuke Sawata, Samuel Clarke, Ruohan Gao, Shangzhe Wu, Jiajun Wu
Keywords: numerous Mixed Reality, Mixed Reality, Recent years, numerous Mixed, computer graphics
Categories: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Notes: CVPR 2024. The first two authors contributed equally. Project page: this https URL

Abstract:Recent years have seen immense progress in 3D computer vision and computer graphics, with emerging tools that can virtualize real-world 3D environments for numerous Mixed Reality (XR) applications. However, alongside immersive visual experiences, immersive auditory experiences are equally vital to our holistic perception of an environment. In this paper, we aim to reconstruct the spatial acoustic characteristics of an arbitrary environment given only a sparse set of (roughly 12) room impulse response (RIR) recordings and a planar reconstruction of the scene, a setup that is easily achievable by ordinary users. To this end, we introduce DiffRIR, a differentiable RIR rendering framework with interpretable parametric models of salient acoustic features of the scene, including sound source directivity and surface reflectivity. This allows us to synthesize novel auditory experiences through the space with any source audio. To evaluate our method, we collect a dataset of RIR recordings and music in four diverse, real environments. We show that our model outperforms state-of-the-art baselines on rendering monaural and binaural RIRs and music at unseen locations, and learns physically interpretable parameters characterizing acoustic properties of the sound source and surfaces in the scene.

[CV-11] Neural Gaffer: Relighting Any Object via Diffusion

Link: https://arxiv.org/abs/2406.07520
Authors: Haian Jin, Yuan Li, Fujun Luan, Yuanbo Xiangli, Sai Bi, Kai Zhang, Zexiang Xu, Jin Sun, Noah Snavely
Keywords: Single-image relighting, interplay between geometry, involves reasoning, complex interplay, Single-image
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Notes: Project Website: this https URL

Abstract:Single-image relighting is a challenging task that involves reasoning about the complex interplay between geometry, materials, and lighting. Many prior methods either support only specific categories of images, such as portraits, or require special capture conditions, like using a flashlight. Alternatively, some methods explicitly decompose a scene into intrinsic components, such as normals and BRDFs, which can be inaccurate or under-expressive. In this work, we propose a novel end-to-end 2D relighting diffusion model, called Neural Gaffer, that takes a single image of any object and can synthesize an accurate, high-quality relit image under any novel environmental lighting condition, simply by conditioning an image generator on a target environment map, without an explicit scene decomposition. Our method builds on a pre-trained diffusion model, and fine-tunes it on a synthetic relighting dataset, revealing and harnessing the inherent understanding of lighting present in the diffusion model. We evaluate our model on both synthetic and in-the-wild Internet imagery and demonstrate its advantages in terms of generalization and accuracy. Moreover, by combining with other generative methods, our model enables many downstream 2D tasks, such as text-based relighting and object insertion. Our model can also operate as a strong relighting prior for 3D tasks, such as relighting a radiance field.

[CV-12] Instant 3D Human Avatar Generation using Image Diffusion Models

Link: https://arxiv.org/abs/2406.07516
Authors: Nikos Kolotouros, Thiemo Alldieck, Enric Corona, Eduard Gabriel Bazavan, Cristian Sminchisescu
Keywords: high quality, present AvatarPopUp, input modalities, pose and shape, generated pose
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Project page: this https URL

Abstract:We present AvatarPopUp, a method for fast, high-quality 3D human avatar generation from different input modalities, such as images and text prompts, with control over the generated pose and shape. The common theme is the use of diffusion-based image generation networks that are specialized for each particular task, followed by a 3D lifting network. We purposefully decouple the generation from the 3D modeling, which allows us to leverage powerful image synthesis priors trained on billions of text-image pairs. We fine-tune latent diffusion networks with additional image conditioning to solve tasks such as image generation and back-view prediction, and to support qualitatively different multiple 3D hypotheses. Our partial fine-tuning approach allows us to adapt the networks for each task without inducing catastrophic forgetting. In our experiments, we demonstrate that our method produces accurate, high-quality 3D avatars with diverse appearance that respect the multimodal text, image, and body control signals. Our approach can produce a 3D model in as few as 2 seconds, a four orders of magnitude speedup w.r.t. the vast majority of existing methods, most of which solve only a subset of our tasks and with fewer controls, thus enabling applications that require the controlled 3D generation of human avatars at scale. The project website can be found at this https URL.

[CV-13] Understanding Visual Concepts Across Models

Link: https://arxiv.org/abs/2406.07506
Authors: Brandon Trabucco, Max Gurinas, Kyle Doherty, Ruslan Salakhutdinov
Keywords: Stable Diffusion, Large multimodal models, Large multimodal, Large, Diffusion can generate
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: Official code at: this https URL

Abstract:Large multimodal models such as Stable Diffusion can generate, detect, and classify new visual concepts after fine-tuning just a single word embedding. Do models learn similar words for the same concepts (i.e. orange-cat = orange + cat)? We conduct a large-scale analysis on three state-of-the-art models in text-to-image generation, open-set object detection, and zero-shot classification, and find that new word embeddings are model-specific and non-transferable. Across 4,800 new embeddings trained for 40 diverse visual concepts on four standard datasets, we find perturbations within an ε-ball to any prior embedding that generate, detect, and classify an arbitrary concept. When these new embeddings are spliced into new models, fine-tuning that targets the original model is lost. We show popular soft prompt-tuning approaches find these perturbative solutions when applied to visual concept learning tasks, and embeddings for visual concepts are not transferable. Code for reproducing our work is available at: this https URL.

[CV-14] Image Textualization: An Automatic Framework for Creating Accurate and Detailed Image Descriptions

Link: https://arxiv.org/abs/2406.07502
Authors: Renjie Pi, Jianshu Zhang, Jipeng Zhang, Rui Pan, Zhekai Chen, Tong Zhang
Keywords: Image, text-image retrieval, description datasets play, image descriptions, Image description datasets
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Abstract:Image description datasets play a crucial role in the advancement of various applications such as image understanding, text-to-image generation, and text-image retrieval. Currently, image description datasets primarily originate from two sources. One source is the scraping of image-text pairs from the web. Despite their abundance, these descriptions are often of low quality and noisy. The other is human labeling. Descriptions in datasets such as COCO are generally very short and lack details. Although detailed image descriptions can be annotated by humans, the high annotation cost limits the feasibility. These limitations underscore the need for more efficient and scalable methods to generate accurate and detailed image descriptions. In this paper, we propose an innovative framework termed Image Textualization (IT), which automatically produces high-quality image descriptions by leveraging existing multi-modal large language models (MLLMs) and multiple vision expert models in a collaborative manner, maximally converting the visual information into text. To address the current lack of benchmarks for detailed descriptions, we propose several benchmarks for comprehensive evaluation, which verify the quality of image descriptions created by our framework. Furthermore, we show that LLaVA-7B, benefiting from training on IT-curated descriptions, acquires an improved capability to generate richer image descriptions, substantially increasing the length and detail of its output with less hallucination.

[CV-15] SPIN: Spacecraft Imagery for Navigation

Link: https://arxiv.org/abs/2406.07500
Authors: Javier Montalvo, Juan Ignacio Bravo Pérez-Villar, Álvaro García-Martín, Pablo Carballeira, Jesús Bescós
Keywords: scarce due, costs and complexity, Data, Data acquired, navigation
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Data acquired in space operational conditions is scarce due to the costs and complexity of space operations. This poses a challenge to learning-based visual navigation algorithms employed in autonomous spacecraft navigation. Existing datasets, which largely depend on computer-simulated data, have partially filled this gap. However, the image generation tools they use are proprietary, which limits the evaluation of methods to unseen scenarios. Furthermore, these datasets provide limited ground-truth data, primarily focusing on the spacecraft’s translation and rotation relative to the camera. To address these limitations, we present SPIN (SPacecraft Imagery for Navigation), an open-source realistic spacecraft image generation tool for relative navigation between two spacecraft. SPIN provides a wide variety of ground-truth data and allows researchers to employ custom 3D models of satellites, define specific camera-relative poses, and adjust various settings such as camera parameters and environmental illumination conditions. For the task of spacecraft pose estimation, we compare the results of training with a SPIN-generated dataset against existing synthetic datasets. We show a 50% average error reduction in common testbed data (that simulates realistic space conditions). Both the SPIN tool (and source code) and our enhanced version of the synthetic datasets will be publicly released upon paper acceptance on GitHub this https URL.

[CV-16] Trim 3D Gaussian Splatting for Accurate Geometry Representation

Link: https://arxiv.org/abs/2406.07499
Authors: Lue Fan, Yuxue Yang, Minxing Li, Hongsheng Li, Zhaoxiang Zhang
Keywords: introduce Trim, Gaussian Splatting, Gaussian, Trim, Splatting
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Notes: Project page: this https URL

Abstract:In this paper, we introduce Trim 3D Gaussian Splatting (TrimGS) to reconstruct accurate 3D geometry from images. Previous arts for geometry reconstruction from 3D Gaussians mainly focus on exploring strong geometry regularization. Instead, from a fresh perspective, we propose to obtain accurate 3D geometry of a scene by Gaussian trimming, which selectively removes the inaccurate geometry while preserving accurate structures. To achieve this, we analyze the contributions of individual 3D Gaussians and propose a contribution-based trimming strategy to remove the redundant or inaccurate Gaussians. Furthermore, our experimental and theoretical analyses reveal that a relatively small Gaussian scale is a non-negligible factor in representing and optimizing the intricate details. Therefore the proposed TrimGS maintains relatively small Gaussian scales. In addition, TrimGS is also compatible with the effective geometry regularization strategies in previous arts. When combined with the original 3DGS and the state-of-the-art 2DGS, TrimGS consistently yields more accurate geometry and higher perceptual quality. Our project page is this https URL

[CV-17] ReduceFormer: Attention with Tensor Reduction by Summation

Link: https://arxiv.org/abs/2406.07488
Authors: John Yang, Le An, Su Inn Park
Keywords: tasks including vision, including vision, tasks including, Transformers have excelled, Abstract
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Transformers have excelled in many tasks including vision. However, efficient deployment of transformer models in low-latency or high-throughput applications is hindered by the computation in the attention mechanism which involves expensive operations such as matrix multiplication and Softmax. To address this, we introduce ReduceFormer, a family of models optimized for efficiency with the spirit of attention. ReduceFormer leverages only simple operations such as reduction and element-wise multiplication, leading to greatly simplified architecture and improved inference performance, with up to 37% reduction in latency and 44% improvement in throughput, while maintaining competitive accuracy comparable to other recent methods. The proposed model family is suitable for edge devices where compute resource and memory bandwidth are limited, as well as for cloud computing where high throughput is sought after.

[CV-18] GLAD: Towards Better Reconstruction with Global and Local Adaptive Diffusion Models for Unsupervised Anomaly Detection

Link: https://arxiv.org/abs/2406.07487
Authors: Hang Yao, Ming Liu, Haolin Wang, Zhicun Yin, Zifei Yan, Xiaopeng Hong, Wangmeng Zuo
Keywords: shown superior performance, standard Gaussian distribution, Diffusion models, Gaussian distribution, shown superior
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Due to the limitation “The abstract field cannot be longer than 1,920 characters”, the abstract here is shorter than that in the PDF file

Abstract:Diffusion models have shown superior performance on unsupervised anomaly detection tasks. Since trained with normal data only, diffusion models tend to reconstruct normal counterparts of test images with certain noises added. However, these methods treat all potential anomalies equally, which may cause two main problems. From the global perspective, the difficulty of reconstructing images with different anomalies is uneven. Therefore, instead of utilizing the same setting for all samples, we propose to predict a particular denoising step for each sample by evaluating the difference between image contents and the priors extracted from diffusion models. From the local perspective, reconstructing abnormal regions differs from normal areas even in the same image. Theoretically, the diffusion model predicts a noise for each step, typically following a standard Gaussian distribution. However, due to the difference between the anomaly and its potential normal counterpart, the predicted noise in abnormal regions will inevitably deviate from the standard Gaussian distribution. To this end, we propose introducing synthetic abnormal samples in training to encourage the diffusion models to break through the limitation of standard Gaussian distribution, and a spatial-adaptive feature fusion scheme is utilized during inference. With the above modifications, we propose a global and local adaptive diffusion model (abbreviated to GLAD) for unsupervised anomaly detection, which introduces appealing flexibility and achieves anomaly-free reconstruction while retaining as much normal information as possible. Extensive experiments are conducted on three commonly used anomaly detection datasets (MVTec-AD, MPDD, and VisA) and a printed circuit board dataset (PCB-Bank) we integrated, showing the effectiveness of the proposed method.

[CV-19] Comparing Deep Learning Models for Rice Mapping in Bhutan Using High Resolution Satellite Imagery

Link: https://arxiv.org/abs/2406.07482
Authors: Biplov Bhandari, Timothy Mayer
Keywords: including Remote Sensing-based, Remote Sensing-based knowledge, Remote Sensing-based, Bhutanese government, including Remote
Categories: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Machine Learning (cs.LG); Geophysics (physics.geo-ph)

Abstract:The Bhutanese government is increasing its utilization of technological approaches such as including Remote Sensing-based knowledge in their decision-making process. This study focuses on crop type and crop extent in Paro, one of the top rice-yielding districts in Bhutan, and employs publicly available NICFI high-resolution satellite imagery from Planet. Two Deep Learning (DL) approaches, point-based (DNN) and patch-based (U-Net), were used in conjunction with cloud-computing platforms. Three different models per DL approach (DNN and U-Net) were trained: 1) RGBN channels from Planet; 2) RGBN and elevation data (RGBNE); 3) RGBN and Sentinel-1 (S1) data (RGBNS), and RGBN with E and S1 data (RGBNES). From this comprehensive analysis, the U-Net displayed higher performance metrics across both model training and model validation efforts. Among the U-Net model sets, the RGBN, RGBNE, RGBNS, and RGBNES models had an F1-score of 0.8546, 0.8563, 0.8467, and 0.8500 respectively. An independent model evaluation was performed and found a high level of performance variation across all the metrics. For this independent model evaluation, the U-Net RGBN, RGBNE, RGBNS, and RGBNES models displayed F1-scores of 0.5935, 0.6154, 0.5882, and 0.6582, suggesting U-Net RGBNES as the best model. The study shows that the DL approaches can predict rice. Also, DL methods can be used with the survey-based approaches currently utilized by the Bhutan Department of Agriculture. Further, this study demonstrated the usage of regional land cover products such as SERVIR’s RLCMS as a weak label approach to capture different strata, addressing the class imbalance problem and improving the sampling design for DL application. Finally, preliminary model testing and comparisons showed that using additional features such as NDVI, EVI, and NDWI did not drastically improve model performance.

[CV-20] Image Neural Field Diffusion Models

Link: https://arxiv.org/abs/2406.07480
Authors: Yinbo Chen, Oliver Wang, Richard Zhang, Eli Shechtman, Xiaolong Wang, Michael Gharbi
Keywords: complex data distributions, model complex data, image neural, Diffusion models, image neural fields
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Project page: this https URL

Abstract:Diffusion models have shown an impressive ability to model complex data distributions, with several key advantages over GANs, such as stable training, better coverage of the training distribution’s modes, and the ability to solve inverse problems without extra training. However, most diffusion models learn the distribution of fixed-resolution images. We propose to learn the distribution of continuous images by training diffusion models on image neural fields, which can be rendered at any resolution, and show its advantages over fixed-resolution models. To achieve this, a key challenge is to obtain a latent space that represents photorealistic image neural fields. We propose a simple and effective method, inspired by several recent techniques but with key changes to make the image neural fields photorealistic. Our method can be used to convert existing latent diffusion autoencoders into image neural field autoencoders. We show that image neural field diffusion models can be trained using mixed-resolution image datasets, outperform fixed-resolution diffusion models followed by super-resolution models, and can solve inverse problems with conditions applied at different scales efficiently.

[CV-21] VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

Link: https://arxiv.org/abs/2406.07476
Authors: Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, Lidong Bing
Keywords: Video Large Language, Large Language Models, Large Language, enhance spatial-temporal modeling, Video Large
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Notes: ZC, SL, HZ, YX, and XL contributed equally to this project

Abstract:In this paper, we present VideoLLaMA 2, a set of Video Large Language Models (Video-LLMs) designed to enhance spatial-temporal modeling and audio understanding in video and audio-oriented tasks. Building upon its predecessor, VideoLLaMA 2 incorporates a tailor-made Spatial-Temporal Convolution (STC) connector, which effectively captures the intricate spatial and temporal dynamics of video data. Additionally, we integrate an Audio Branch into the model through joint training, thereby enriching the multimodal understanding capabilities of the model by seamlessly incorporating audio cues. Comprehensive evaluations on multiple-choice video question answering (MC-VQA), open-ended video question answering (OE-VQA), and video captioning (VC) tasks demonstrate that VideoLLaMA 2 consistently achieves competitive results among open-source models and even gets close to some proprietary models on several benchmarks. Furthermore, VideoLLaMA 2 exhibits reasonable improvements in audio-only and audio-video question-answering (AQA and OE-AVQA) benchmarks over existing models. These advancements underline VideoLLaMA 2’s superior performance in multimodal comprehension, setting a new standard for intelligent video analysis systems. All models are public to facilitate further research.

[CV-22] 4Real: Towards Photorealistic 4D Scene Generation via Video Diffusion Models

Link: https://arxiv.org/abs/2406.07472
Authors: Heng Yu, Chaoyang Wang, Peiye Zhuang, Willi Menapace, Aliaksandr Siarohin, Junli Cao, Laszlo A Jeni, Sergey Tulyakov, Hsin-Ying Lee
Keywords: Existing dynamic scene, synthetic object datasets, Existing dynamic, knowledge from pre-trained, scene generation
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Existing dynamic scene generation methods mostly rely on distilling knowledge from pre-trained 3D generative models, which are typically fine-tuned on synthetic object datasets. As a result, the generated scenes are often object-centric and lack photorealism. To address these limitations, we introduce a novel pipeline designed for photorealistic text-to-4D scene generation, discarding the dependency on multi-view generative models and instead fully utilizing video generative models trained on diverse real-world datasets. Our method begins by generating a reference video using the video generation model. We then learn the canonical 3D representation of the video using a freeze-time video, delicately generated from the reference video. To handle inconsistencies in the freeze-time video, we jointly learn a per-frame deformation to model these imperfections. We then learn the temporal deformation based on the canonical representation to capture dynamic interactions in the reference video. The pipeline facilitates the generation of dynamic scenes with enhanced photorealism and structural integrity, viewable from multiple perspectives, thereby setting a new standard in 4D scene generation.

[CV-23] OphNet: A Large-Scale Video Benchmark for Ophthalmic Surgical Workflow Understanding

Link: https://arxiv.org/abs/2406.07471
Authors: Ming Hu, Peng Xia, Lin Wang, Siyuan Yan, Feilong Tang, Zhongxing Xu, Yimin Luo, Kaimin Song, Jurgen Leitner, Xuelian Cheng, Jun Cheng, Chi Liu, Kaijing Zhou, Zongyuan Ge
Keywords: surgical workflow analysis, Surgical scene perception, advancing robotic surgery, surgical workflow, Surgical
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Version 1. arXiv admin note: text overlap with arXiv:2305.15701 by other authors

Abstract:Surgical scene perception via videos is critical for advancing robotic surgery, telesurgery, and AI-assisted surgery, particularly in ophthalmology. However, the scarcity of diverse and richly annotated video datasets has hindered the development of intelligent systems for surgical workflow analysis. Existing datasets for surgical workflow analysis, which typically face challenges such as small scale, a lack of diversity in surgery and phase categories, and the absence of time-localized annotations, limit the requirements for action understanding and model generalization validation in complex and diverse real-world surgical scenarios. To address this gap, we introduce OphNet, a large-scale, expert-annotated video benchmark for ophthalmic surgical workflow understanding. OphNet features: 1) A diverse collection of 2,278 surgical videos spanning 66 types of cataract, glaucoma, and corneal surgeries, with detailed annotations for 102 unique surgical phases and 150 granular operations; 2) It offers sequential and hierarchical annotations for each surgery, phase, and operation, enabling comprehensive understanding and improved interpretability; 3) Moreover, OphNet provides time-localized annotations, facilitating temporal localization and prediction tasks within surgical workflows. With approximately 205 hours of surgical videos, OphNet is about 20 times larger than the largest existing surgical workflow analysis benchmark. Our dataset and code have been made available at: this https URL.

[CV-24] Benchmarking Vision-Language Contrastive Methods for Medical Representation Learning

Link: https://arxiv.org/abs/2406.07450
Authors: Shuvendu Roy, Yasaman Parhizkar, Franklin Ogidi, Vahid Reza Khazaie, Michael Colacci, Ali Etemad, Elham Dolatabadi, Arash Afkanpour
Keywords: medical domain, perform a comprehensive, comprehensive benchmarking, multimodal medical representation, medical representation learning
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:We perform a comprehensive benchmarking of contrastive frameworks for learning multimodal representations in the medical domain. Through this study, we aim to answer the following research questions: (i) How transferable are general-domain representations to the medical domain? (ii) Is multimodal contrastive training sufficient, or does it benefit from unimodal training as well? (iii) What is the impact of feature granularity on the effectiveness of multimodal medical representation learning? To answer these questions, we investigate eight contrastive learning approaches under identical training setups, and train them on 2.8 million image-text pairs from four datasets, and evaluate them on 25 downstream tasks, including classification (zero-shot and linear probing), image-to-text and text-to-image retrieval, and visual question-answering. Our findings suggest a positive answer to the first question, a negative answer to the second question, and the benefit of learning fine-grained features. Finally, we make our code publicly available.

[CV-25] Beware of Aliases – Signal Preservation is Crucial for Robust Image Restoration

Link: https://arxiv.org/abs/2406.07435
Authors: Shashank Agnihotri, Julia Grabinski, Janis Keuper, Margret Keuper
Keywords: aggregating image content, responsible for aggregating, content from noisy, restore clean, Image restoration networks
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Notes: Tags: Adversarial attack, image restoration, image deblurring, frequency sampling

Abstract:Image restoration networks are usually comprised of an encoder and a decoder, responsible for aggregating image content from noisy, distorted data and restoring clean, undistorted images, respectively. Data aggregation as well as high-resolution image generation both usually come at the risk of involving aliases, i.e., standard architectures put their ability to reconstruct the model input in jeopardy to reach high PSNR values on validation data. The price to be paid is low model robustness. In this work, we show that simply providing alias-free paths in state-of-the-art reconstruction transformers supports improved model robustness at low costs on the restoration performance. We do so by proposing BOA-Restormer, a transformer-based image restoration model that executes downsampling and upsampling operations partly in the frequency domain to ensure alias-free paths along the entire model while potentially preserving all relevant high-frequency information.
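
The aliasing problem this abstract refers to can be shown on a 1D signal. The sketch below is a generic textbook illustration, not BOA-Restormer itself: the function names and the toy 16-sample signals are assumptions. It compares plain stride-2 subsampling with downsampling done in the frequency domain, where every component above the new Nyquist limit is discarded before the signal is shortened.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)) / n
            for t in range(n)]


def downsample_naive(x):
    """Keep every second sample; high frequencies fold back as aliases."""
    return x[::2]


def downsample_spectral(x):
    """Halve the length by cropping the spectrum (len(x) assumed divisible by 4)."""
    n = len(x)
    half = n // 2
    full = dft(x)
    cropped = [0j] * half  # the new Nyquist bin stays zero
    for k in range(half // 2):
        cropped[k] = full[k]                # low positive frequencies
    for k in range(1, half // 2):
        cropped[half - k] = full[n - k]     # low negative frequencies
    # Divide by 2 so sample amplitudes match the original signal.
    return [(value / 2).real for value in idft(cropped)]


n = 16
low = [math.cos(2 * math.pi * 1 * t / n) for t in range(n)]   # below new Nyquist
high = [math.cos(2 * math.pi * 6 * t / n) for t in range(n)]  # above new Nyquist

aliased = downsample_naive(high)       # shows up as a spurious low-frequency wave
filtered = downsample_spectral(high)   # removed instead of aliased
preserved = downsample_spectral(low)   # matches low[::2]
```

In this toy case the naive path turns the frequency-6 component into a full-amplitude frequency-2 component, while the spectral path drops it and leaves the low-frequency signal unchanged.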

[CV-26] Active Scout: Multi-Target Tracking Using Neural Radiance Fields in Dense Urban Environments

Link: https://arxiv.org/abs/2406.07431
Authors: Christopher D. Hsu, Pratik Chaudhari
Keywords: occluded urban environments, study pursuit-evasion games, highly occluded urban, urban environments, tall buildings
Categories: Multiagent Systems (cs.MA); Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 8 figures, 1 table

Abstract: We study pursuit-evasion games in highly occluded urban environments, e.g. tall buildings in a city, where a scout (quadrotor) tracks multiple dynamic targets on the ground. We show that we can build a neural radiance field (NeRF) representation of the city – online – using RGB and depth images from different vantage points. This representation is used to calculate the information gain to both explore unknown parts of the city and track the targets – thereby giving a completely first-principles approach to actively tracking dynamic targets. We demonstrate, using a custom-built simulator using Open Street Maps data of Philadelphia and New York City, that we can explore and locate 20 stationary targets within 300 steps. This is slower than a greedy baseline which does not use active perception. But for dynamic targets that actively hide behind occlusions, we show that our approach maintains, at worst, a tracking error of 200m; the greedy baseline can have a tracking error as large as 600m. We observe a number of interesting properties in the scout’s policies, e.g., it switches its attention to track a different target periodically, and as the quality of the NeRF representation improves over time, the scout also becomes better in terms of target tracking.

[CV-27] Visual Representation Learning with Stochastic Frame Prediction

Link: https://arxiv.org/abs/2406.07398
Authors: Huiwon Jang, Dongyoung Kim, Junsu Kim, Jinwoo Shin, Pieter Abbeel, Younggyo Seo
Keywords: Self-supervised learning, predicting future frames, promising direction, frame prediction, Self-supervised
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: International Conference on Machine Learning (ICML) 2024

Abstract:Self-supervised learning of image representations by predicting future frames is a promising direction but still remains a challenge. This is because of the under-determined nature of frame prediction; multiple potential futures can arise from a single current frame. To tackle this challenge, in this paper, we revisit the idea of stochastic video generation that learns to capture uncertainty in frame prediction and explore its effectiveness for representation learning. Specifically, we design a framework that trains a stochastic frame prediction model to learn temporal information between frames. Moreover, to learn dense information within each frame, we introduce an auxiliary masked image modeling objective along with a shared decoder architecture. We find this architecture allows for combining both objectives in a synergistic and compute-efficient manner. We demonstrate the effectiveness of our framework on a variety of tasks from video label propagation and vision-based robot learning domains, such as video segmentation, pose tracking, vision-based robotic locomotion, and manipulation tasks. Code is available on the project webpage: this https URL.

[CV-28] Deep Implicit Optimization for Robust and Flexible Image Registration

Link: https://arxiv.org/abs/2406.07361
Authors: Rohit Jena, Pratik Chaudhari, James C. Gee
Keywords: incorporate weak label, weak label supervision, DLIR methods forego, image registration due, tremendously successful
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments:

Abstract:Deep Learning in Image Registration (DLIR) methods have been tremendously successful in image registration due to their speed and ability to incorporate weak label supervision at training time. However, DLIR methods forego many of the benefits of classical optimization-based methods. The functional nature of deep networks do not guarantee that the predicted transformation is a local minima of the registration objective, the representation of the transformation (displacement/velocity field/affine) is fixed, and the networks are not robust to domain shift. Our method aims to bridge this gap between classical and learning methods by incorporating optimization as a layer in a deep network. A deep network is trained to predict multi-scale dense feature images that are registered using a black box iterative optimization solver. This optimal warp is then used to minimize image and label alignment errors. By implicitly differentiating end-to-end through an iterative optimization solver, our learned features are registration and label-aware, and the warp functions are guaranteed to be local minima of the registration objective in the feature space. Our framework shows excellent performance on in-domain datasets, and is agnostic to domain shift such as anisotropy and varying intensity profiles. For the first time, our method allows switching between arbitrary transformation representations (free-form to diffeomorphic) at test time with zero retraining. End-to-end feature learning also facilitates interpretability of features, and out-of-the-box promptability using additional label-fidelity terms at inference.

[CV-29] Toxic Memes: A Survey of Computational Perspectives on the Detection and Explanation of Meme Toxicities

Link: https://arxiv.org/abs/2406.07353
Authors: Delfina Sol Martinez Pandiani, Erik Tjong Kim Sang, Davide Ceolin
Keywords: spread toxic messages, Internet memes, toxic meme analysis, meme, toxic meme
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
Comments: 39 pages, 12 figures, 9 tables

Abstract:Internet memes, channels for humor, social commentary, and cultural expression, are increasingly used to spread toxic messages. Studies on the computational analyses of toxic memes have significantly grown over the past five years, and the only three surveys on computational toxic meme analysis cover only work published until 2022, leading to inconsistent terminology and unexplored trends. Our work fills this gap by surveying content-based computational perspectives on toxic memes, and reviewing key developments until early 2024. Employing the PRISMA methodology, we systematically extend the previously considered papers, achieving a threefold result. First, we survey 119 new papers, analyzing 158 computational works focused on content-based toxic meme analysis. We identify over 30 datasets used in toxic meme analysis and examine their labeling systems. Second, after observing the existence of unclear definitions of meme toxicity in computational works, we introduce a new taxonomy for categorizing meme toxicity types. We also note an expansion in computational tasks beyond the simple binary classification of memes as toxic or non-toxic, indicating a shift towards achieving a nuanced comprehension of toxicity. Third, we identify three content-based dimensions of meme toxicity under automatic study: target, intent, and conveyance tactics. We develop a framework illustrating the relationships between these dimensions and meme toxicities. The survey analyzes key challenges and recent trends, such as enhanced cross-modal reasoning, integrating expert and cultural knowledge, the demand for automatic toxicity explanations, and handling meme toxicity in low-resource languages. Also, it notes the rising use of Large Language Models (LLMs) and generative AI for detecting and generating toxic memes. Finally, it proposes pathways for advancing toxic meme detection and interpretation.

[CV-30] Global-Regularized Neighborhood Regression for Efficient Zero-Shot Texture Anomaly Detection

Link: https://arxiv.org/abs/2406.07333
Authors: Haiming Yao, Wei Luo, Yunkang Cao, Yiheng Zhang, Wenyong Yu, Weiming Shen
Keywords: finds widespread applications, detection finds widespread, finds widespread, widespread applications, GRNR
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submission to IEEE Transactions on Systems, Man, and Cybernetics: Systems

Abstract:Texture surface anomaly detection finds widespread applications in industrial settings. However, existing methods often necessitate gathering numerous samples for model training. Moreover, they predominantly operate within a close-set detection framework, limiting their ability to identify anomalies beyond the training dataset. To tackle these challenges, this paper introduces a novel zero-shot texture anomaly detection method named Global-Regularized Neighborhood Regression (GRNR). Unlike conventional approaches, GRNR can detect anomalies on arbitrary textured surfaces without any training data or cost. Drawing from human visual cognition, GRNR derives two intrinsic prior supports directly from the test texture image: local neighborhood priors characterized by coherent similarities and global normality priors featuring typical normal patterns. The fundamental principle of GRNR involves utilizing the two extracted intrinsic support priors for self-reconstructive regression of the query sample. This process employs the transformation facilitated by local neighbor support while being regularized by global normality support, aiming to not only achieve visually consistent reconstruction results but also preserve normality properties. We validate the effectiveness of GRNR across various industrial scenarios using eight benchmark datasets, demonstrating its superior detection performance without the need for training data. Remarkably, our method is applicable for open-set texture defect detection and can even surpass existing vanilla approaches that require extensive training.

[CV-31] Minimizing Energy Costs in Deep Learning Model Training: The Gaussian Sampling Approach

Link: https://arxiv.org/abs/2406.07332
Authors: Challapalli Phanindra Revanth, Sumohana S. Channappayya, C Krishna Mohan
Keywords: backpropagation consumes considerable, consumes considerable energy, consumes considerable, Computing the loss, backpropagation consumes
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Computing the loss gradient via backpropagation consumes considerable energy during deep learning (DL) model training. In this paper, we propose a novel approach to efficiently compute DL models’ gradients to mitigate the substantial energy overhead associated with backpropagation. Exploiting the over-parameterized nature of DL models and the smoothness of their loss landscapes, we propose a method called GradSamp for sampling gradient updates from a Gaussian distribution. Specifically, we update model parameters at a given epoch (chosen periodically or randomly) by perturbing the parameters (element-wise) from the previous epoch with Gaussian “noise”. The parameters of the Gaussian distribution are estimated using the error between the model parameter values from the two previous epochs. GradSamp not only streamlines gradient computation but also enables skipping entire epochs, thereby enhancing overall efficiency. We rigorously validate our hypothesis across a diverse set of standard and non-standard CNN and transformer-based models, spanning various computer vision tasks such as image classification, object detection, and image segmentation. Additionally, we explore its efficacy in out-of-distribution scenarios such as Domain Adaptation (DA), Domain Generalization (DG), and decentralized settings like Federated Learning (FL). Our experimental results affirm the effectiveness of GradSamp in achieving notable energy savings without compromising performance, underscoring its versatility and potential impact in practical DL applications.
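
Editor's illustration (not the authors' code): the update described in this abstract can be sketched roughly as follows. The function name `gaussian_sampled_update` is hypothetical, and fitting a single Gaussian (mean and standard deviation) to all element-wise parameter differences is an assumption; the abstract does not say how the distribution parameters are estimated.

```python
import random
import statistics

def gaussian_sampled_update(prev, prev2, rng):
    """Return parameters for a skipped epoch, sampled rather than backpropagated."""
    # Element-wise change between the two previous epochs.
    deltas = [a - b for a, b in zip(prev, prev2)]
    # Assumption: one Gaussian is fitted to all element-wise changes.
    mu = statistics.fmean(deltas)
    sigma = statistics.pstdev(deltas)
    # Perturb each parameter of the previous epoch with Gaussian noise.
    return [p + rng.gauss(mu, sigma) for p in prev]

rng = random.Random(0)
theta_prev2 = [0.50, -1.20, 0.30, 2.00]  # parameters two epochs ago (made-up values)
theta_prev = [0.45, -1.10, 0.28, 1.90]   # parameters one epoch ago (made-up values)
theta_next = gaussian_sampled_update(theta_prev, theta_prev2, rng)
print(theta_next)
```

In this sketch, if the parameters did not change between the two previous epochs, the fitted noise has zero mean and zero spread, so the sampled epoch leaves the parameters unchanged.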

[CV-32] Cinematic Gaussians: Real-Time HDR Radiance Fields with Depth of Field

Link: https://arxiv.org/abs/2406.07329
Authors: Chao Wang, Krzysztof Wolski, Bernhard Kerbl, Ana Serrano, Mojtaba Bemana, Hans-Peter Seidel, Karol Myszkowski, Thomas Leimkühler
Keywords: reconstructing complex scenes, reconstructing complex, Radiance field, Radiance field methods, typically represent scenes
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Abstract:Radiance field methods represent the state of the art in reconstructing complex scenes from multi-view photos. However, these reconstructions often suffer from one or both of the following limitations: First, they typically represent scenes in low dynamic range (LDR), which restricts their use to evenly lit environments and hinders immersive viewing experiences. Secondly, their reliance on a pinhole camera model, assuming all scene elements are in focus in the input images, presents practical challenges and complicates refocusing during novel-view synthesis. Addressing these limitations, we present a lightweight method based on 3D Gaussian Splatting that utilizes multi-view LDR images of a scene with varying exposure times, apertures, and focus distances as input to reconstruct a high-dynamic-range (HDR) radiance field. By incorporating analytical convolutions of Gaussians based on a thin-lens camera model as well as a tonemapping module, our reconstructions enable the rendering of HDR content with flexible refocusing capabilities. We demonstrate that our combined treatment of HDR and depth of field facilitates real-time cinematic rendering, outperforming the state of the art.

[CV-33] A Framework for Efficient Model Evaluation through Stratification Sampling and Estimation

Link: https://arxiv.org/abs/2406.07320
Authors: Riccardo Fogliato, Pratik Patil, Mathew Monfort, Pietro Perona
Keywords: critical and expensive, expensive task, task in machine, machine learning, Model
Categories: Computer Vision and Pattern Recognition (cs.CV); Applications (stat.AP)
Comments:

Abstract:Model performance evaluation is a critical and expensive task in machine learning and computer vision. Without clear guidelines, practitioners often estimate model accuracy using a one-time random selection of the data. However, by employing tailored sampling and estimation strategies, one can obtain more precise estimates and reduce annotation costs. In this paper, we propose a statistical framework for model evaluation that includes stratification, sampling, and estimation components. We examine the statistical properties of each component and evaluate their efficiency (precision). One key result of our work is that stratification via k-means clustering based on accurate predictions of model performance yields efficient estimators. Our experiments on computer vision datasets show that this method consistently provides more precise accuracy estimates than the traditional simple random sampling, even with substantial efficiency gains of 10x. We also find that model-assisted estimators, which leverage predictions of model accuracy on the unlabeled portion of the dataset, are generally more efficient than the traditional estimates based solely on the labeled data.
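
Editor's illustration (not the authors' code): a minimal sketch of a textbook stratified estimator of model accuracy, the kind of estimator this abstract builds on. The strata are assumed to be given and non-empty (the paper forms them with k-means on predicted performance, which is not reproduced here), each item is a 0/1 correctness label, and the function name `stratified_accuracy` is hypothetical.

```python
import random

def stratified_accuracy(strata, n_per_stratum, rng):
    """Estimate accuracy as the size-weighted mean of per-stratum sample means."""
    total = sum(len(s) for s in strata)
    estimate = 0.0
    for stratum in strata:
        # Annotate only a small random sample from each stratum.
        k = min(n_per_stratum, len(stratum))
        sample = rng.sample(stratum, k)
        stratum_mean = sum(sample) / k
        # Weight each stratum by its share of the full dataset.
        estimate += (len(stratum) / total) * stratum_mean
    return estimate

rng = random.Random(0)
# Two perfectly homogeneous strata: every item correct, every item wrong.
strata = [[1] * 100, [0] * 50]
est = stratified_accuracy(strata, n_per_stratum=5, rng=rng)
print(est)
```

The example uses fully homogeneous strata to show the intuition: when items within a stratum behave alike, a handful of annotations per stratum already recovers the overall accuracy, which is why stratifying on good predictions of model performance can reduce annotation cost.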

[CV-34] Embedded Graph Convolutional Networks for Real-Time Event Data Processing on SoC FPGAs

Link: https://arxiv.org/abs/2406.07318
Authors: Kamil Jeziorek, Piotr Wzorek, Krzysztof Blachut, Andrea Pinna, Tomasz Kryjak
Keywords: traditional video systems, swiftly evolving trend, evolving trend aimed, event cameras represents, graph convolutional networks
Categories: Computer Vision and Pattern Recognition (cs.CV); Hardware Architecture (cs.AR); Image and Video Processing (eess.IV)
Comments: Submitted to the IEEE Transactions on Circuits and Systems for Video Technology. This manuscript was first submitted for publication on March 31, 2024. It has since been revised twice: on May 22, 2024 and June 10, 2024

Abstract:The utilisation of event cameras represents an important and swiftly evolving trend aimed at addressing the constraints of traditional video systems. Particularly within the automotive domain, these cameras find significant relevance for their integration into embedded real-time systems due to lower latency and energy consumption. One effective approach to ensure the necessary throughput and latency for event processing systems is through the utilisation of graph convolutional networks (GCNs). In this study, we introduce a series of hardware-aware optimisations tailored for PointNet++, a GCN architecture designed for point cloud processing. The proposed techniques result in more than a 100-fold reduction in model size compared to Asynchronous Event-based GNN (AEGNN), one of the most recent works in the field, with a relatively small decrease in accuracy (2.3% for N-Caltech101 classification, 1.7% for N-Cars classification), thus following the TinyML trend. Based on software research, we designed a custom EFGCN (Event-Based FPGA-accelerated Graph Convolutional Network) and we implemented it on ZCU104 SoC FPGA platform, achieving a throughput of 13.3 million events per second (MEPS) and real-time partially asynchronous processing with a latency of 4.47 ms. We also address the scalability of the proposed hardware model to improve the obtained accuracy score. To the best of our knowledge, this study marks the first endeavour in accelerating PointNet++ networks on SoC FPGAs, as well as the first hardware architecture exploration of graph convolutional networks implementation for real-time continuous event data processing. We publish both software and hardware source code in an open repository: this https URL (will be published upon acceptance).

[CV-35] Fetch-A-Set: A Large-Scale OCR-Free Benchmark for Historical Document Retrieval

Link: https://arxiv.org/abs/2406.07315
Authors: Adrià Molina, Oriol Ramos Terrades, Josep Lladós
Keywords: comprehensive benchmark tailored, large-scale document retrieval, document retrieval systems, paper introduces, addressing the challenges
Categories: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint for the manuscript accepted for publication in the DAS2024 LNCS proceedings

Abstract:This paper introduces Fetch-A-Set (FAS), a comprehensive benchmark tailored for legislative historical document analysis systems, addressing the challenges of large-scale document retrieval in historical contexts. The benchmark comprises a vast repository of documents dating back to the XVII century, serving both as a training resource and an evaluation benchmark for retrieval systems. It fills a critical gap in the literature by focusing on complex extractive tasks within the domain of cultural heritage. The proposed benchmark tackles the multifaceted problem of historical document analysis, including text-to-image retrieval for queries and image-to-text topic extraction from document fragments, all while accommodating varying levels of document legibility. This benchmark aims to spur advancements in the field by providing baselines and data for the development and evaluation of robust historical document retrieval systems, particularly in scenarios characterized by wide historical spectrum.

[CV-36] OTO Planner: An Efficient Only Travelling Once Exploration Planner for Complex and Unknown Environments

Link: https://arxiv.org/abs/2406.07294
Authors: Bo Zhou, Chuanzhao Lu, Yan Pan, Fu Chen
Keywords: Autonomous exploration, Autonomous, repeated paths, cluttered environments, exploration
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Autonomous exploration in complex and cluttered environments is essential for various applications. However, there are many challenges due to the lack of global heuristic information. Existing exploration methods suffer from the repeated paths and considerable computational resource requirement in large-scale environments. To address the above issues, this letter proposes an efficient exploration planner that reduces repeated paths in complex environments, hence it is called “Only Travelling Once Planner”. OTO Planner includes fast frontier updating, viewpoint evaluation and viewpoint refinement. A selective frontier updating mechanism is designed, saving a large amount of computational resources. In addition, a novel viewpoint evaluation system is devised to reduce the repeated paths utilizing the enclosed sub-region detection. Besides, a viewpoint refinement approach is raised to concentrate the redundant viewpoints, leading to smoother paths. We conduct extensive simulation and real-world experiments to validate the proposed method. Compared to the state-of-the-art approach, the proposed method reduces the exploration time and movement distance by 10%-20% and improves the speed of frontier detection by 6-9 times.

[CV-37] Unsupervised Object Detection with Theoretical Guarantees

Link: https://arxiv.org/abs/2406.07284
Authors: Marian Longa, João F. Henriques
Keywords: Unsupervised object detection, deep neural networks, object detection, object detection methods, Unsupervised object
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Unsupervised object detection using deep neural networks is typically a difficult problem with few to no guarantees about the learned representation. In this work we present the first unsupervised object detection method that is theoretically guaranteed to recover the true object positions up to quantifiable small shifts. We develop an unsupervised object detection architecture and prove that the learned variables correspond to the true object positions up to small shifts related to the encoder and decoder receptive field sizes, the object sizes, and the widths of the Gaussians used in the rendering process. We perform detailed analysis of how the error depends on each of these variables and perform synthetic experiments validating our theoretical predictions up to a precision of individual pixels. We also perform experiments on CLEVR-based data and show that, unlike current SOTA object detection methods (SAM, CutLER), our method’s prediction errors always lie within our theoretical bounds. We hope that this work helps open up an avenue of research into object detection methods with theoretical guarantees.

[CV-38] Advancing Grounded Multimodal Named Entity Recognition via LLM-Based Reformulation and Box-Based Segmentation

Link: https://arxiv.org/abs/2406.07268
Authors: Jinyuan Li, Ziyan Li, Han Li, Jianfei Yu, Rui Xia, Di Sun, Gang Pan
Keywords: Grounded Multimodal Named, Named Entity Recognition, Multimodal Named Entity, identify named entities, Grounded Multimodal
Categories: Multimedia (cs.MM); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Extension of our Findings of EMNLP 2023 ACL 2024 paper

Abstract:Grounded Multimodal Named Entity Recognition (GMNER) task aims to identify named entities, entity types and their corresponding visual regions. GMNER task exhibits two challenging attributes: 1) The tenuous correlation between images and text on social media contributes to a notable proportion of named entities being ungroundable. 2) There exists a distinction between coarse-grained noun phrases used in similar tasks (e.g., phrase localization) and fine-grained named entities. In this paper, we propose RiVEG, a unified framework that reformulates GMNER into a joint MNER-VE-VG task by leveraging large language models (LLMs) as connecting bridges. This reformulation brings two benefits: 1) It enables us to optimize the MNER module for optimal MNER performance and eliminates the need to pre-extract region features using object detection methods, thus naturally addressing the two major limitations of existing GMNER methods. 2) The introduction of Entity Expansion Expression module and Visual Entailment (VE) module unifies Visual Grounding (VG) and Entity Grounding (EG). This endows the proposed framework with unlimited data and model scalability. Furthermore, to address the potential ambiguity stemming from the coarse-grained bounding box output in GMNER, we further construct the new Segmented Multimodal Named Entity Recognition (SMNER) task and corresponding Twitter-SMNER dataset aimed at generating fine-grained segmentation masks, and experimentally demonstrate the feasibility and effectiveness of using box prompt-based Segment Anything Model (SAM) to empower any GMNER model with the ability to accomplish the SMNER task. Extensive experiments demonstrate that RiVEG significantly outperforms SoTA methods on four datasets across the MNER, GMNER, and SMNER tasks.

[CV-39] Towards Realistic Data Generation for Real-World Super-Resolution

Link: https://arxiv.org/abs/2406.07255
Authors: Long Peng, Wenbo Li, Renjing Pei, Jingjing Ren, Yang Wang, Yang Cao, Zheng-Jun Zha
Keywords: Existing image super-resolution, complex real-world settings, real-world settings due, practical scenarios, Existing image
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Abstract:Existing image super-resolution (SR) techniques often fail to generalize effectively in complex real-world settings due to the significant divergence between training data and practical scenarios. To address this challenge, previous efforts have either manually simulated intricate physical-based degradations or utilized learning-based techniques, yet these approaches remain inadequate for producing large-scale, realistic, and diverse data simultaneously. In this paper, we introduce a novel Realistic Decoupled Data Generator (RealDGen), an unsupervised learning data generation framework designed for real-world super-resolution. We meticulously develop content and degradation extraction strategies, which are integrated into a novel content-degradation decoupled diffusion model to create realistic low-resolution images from unpaired real LR and HR images. Extensive experiments demonstrate that RealDGen excels in generating large-scale, high-quality paired data that mirrors real-world degradations, significantly advancing the performance of popular SR models on various real-world benchmarks.

[CV-40] Is One GPU Enough? Pushing Image Generation at Higher-Resolutions with Foundation Models

Link: https://arxiv.org/abs/2406.07251
Authors: Athanasios Tragakis, Marco Aversa, Chaitanya Kaul, Roderick Murray-Smith, Daniele Faccio
Keywords: generative framework, single GPU, framework to sample, higher resolutions, gigapixel image generation
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:In this work, we introduce Pixelsmith, a zero-shot text-to-image generative framework to sample images at higher resolutions with a single GPU. We are the first to show that it is possible to scale the output of a pre-trained diffusion model by a factor of 1000, opening the road for gigapixel image generation at no additional cost. Our cascading method uses the image generated at the lowest resolution as a baseline to sample at higher resolutions. For the guidance, we introduce the Slider, a tunable mechanism that fuses the overall structure contained in the first-generated image with enhanced fine details. At each inference step, we denoise patches rather than the entire latent space, minimizing memory demands such that a single GPU can handle the process, regardless of the image’s resolution. Our experimental results show that Pixelsmith not only achieves higher quality and diversity compared to existing techniques, but also reduces sampling time and artifacts. The code for our work is available at this https URL.

[CV-41] Needle In A Multimodal Haystack

Link: https://arxiv.org/abs/2406.07230
Authors: Weiyun Wang, Shuibo Zhang, Yiming Ren, Yuchen Duan, Tiantong Li, Shuo Liu, Mengkang Hu, Zhe Chen, Kaipeng Zhang, Lewei Lu, Xizhou Zhu, Ping Luo, Yu Qiao, Jifeng Dai, Wenqi Shao, Wenhai Wang
Keywords: multimodal large language, large language models, increasingly comprehensive, large language, multimodal
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:With the rapid advancement of multimodal large language models (MLLMs), their evaluation has become increasingly comprehensive. However, understanding long multimodal content, as a foundational ability for real-world applications, remains underexplored. In this work, we present Needle In A Multimodal Haystack (MM-NIAH), the first benchmark specifically designed to systematically evaluate the capability of existing MLLMs to comprehend long multimodal documents. Our benchmark includes three types of evaluation tasks: multimodal retrieval, counting, and reasoning. In each task, the model is required to answer the questions according to different key information scattered throughout the given multimodal document. Evaluating the leading MLLMs on MM-NIAH, we observe that existing models still have significant room for improvement on these tasks, especially on vision-centric evaluation. We hope this work can provide a platform for further research on long multimodal document comprehension and contribute to the advancement of MLLMs. Code and benchmark are released at this https URL.

[CV-42] Which Country Is This? Automatic Country Ranking of Street View Photos

Link: https://arxiv.org/abs/2406.07227
Authors: Tim Menzner, Jochen L. Leidner, Florian Mittag
Keywords: present Country Guesser, Google Street View, Street View image, Street View, Country Guesser
Categories: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Comments:

Abstract:In this demonstration, we present Country Guesser, a live system that guesses the country that a photo is taken in. In particular, given a Google Street View image, our federated ranking model uses a combination of computer vision, machine learning and text retrieval methods to compute a ranking of likely countries of the location shown in a given image from Street View. Interestingly, using text-based features to probe large pre-trained language models can assist to provide cross-modal supervision. We are not aware of previous country guessing systems informed by visual and textual features.

[CV-43] Open-World Human-Object Interaction Detection via Multi-modal Prompts

Link: https://arxiv.org/abs/2406.07221
Authors: Jie Yang, Bingliang Li, Ailing Zeng, Lei Zhang, Ruimao Zhang
Keywords: Prompt-based HOI detector, powerful Multi-modal Prompt-based, Multi-modal Prompt-based HOI, realizing HOI detection, HOI detector designed
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR24. arXiv admin note: text overlap with arXiv:2305.12252

Abstract: In this paper, we develop MP-HOI, a powerful Multi-modal Prompt-based HOI detector designed to leverage both textual descriptions for open-set generalization and visual exemplars for handling high ambiguity in descriptions, realizing HOI detection in the open world. Specifically, it integrates visual prompts into existing language-guided-only HOI detectors to handle situations where textual descriptions face difficulties in generalization and to address complex scenarios with high interaction ambiguity. To facilitate MP-HOI training, we build a large-scale HOI dataset named Magic-HOI, which gathers six existing datasets into a unified label space, forming over 186K images with 2.4K objects, 1.2K actions, and 20K HOI interactions. Furthermore, to tackle the long-tail issue within the Magic-HOI dataset, we introduce an automated pipeline for generating realistically annotated HOI images and present SynHOI, a high-quality synthetic HOI dataset containing 100K images. Leveraging these two datasets, MP-HOI optimizes the HOI task as a similarity learning process between multi-modal prompts and objects/interactions via a unified contrastive loss, to learn generalizable and transferable objects/interactions representations from large-scale data. MP-HOI could serve as a generalist HOI detector, surpassing the HOI vocabulary of existing expert models by more than 30 times. Concurrently, our results demonstrate that MP-HOI exhibits remarkable zero-shot capability in real-world scenarios and consistently achieves a new state-of-the-art performance across various benchmarks.

[CV-44] MS-Diffusion: Multi-subject Zero-shot Image Personalization with Layout Guidance

Link: https://arxiv.org/abs/2406.07209
Authors: X. Wang, Siming Fu, Qihan Huang, Wanggui He, Hao Jiang
Keywords: Recent advancements, dramatically enhanced, increased interest, textual prompts, photorealistic images
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Recent advancements in text-to-image generation models have dramatically enhanced the generation of photorealistic images from textual prompts, leading to an increased interest in personalized text-to-image applications, particularly in multi-subject scenarios. However, these advances are hindered by two main challenges: firstly, the need to accurately maintain the details of each referenced subject in accordance with the textual descriptions; and secondly, the difficulty in achieving a cohesive representation of multiple subjects in a single image without introducing inconsistencies. To address these concerns, our research introduces the MS-Diffusion framework for layout-guided zero-shot image personalization with multi-subjects. This innovative approach integrates grounding tokens with the feature resampler to maintain detail fidelity among subjects. With the layout guidance, MS-Diffusion further improves the cross-attention to adapt to the multi-subject inputs, ensuring that each subject condition acts on specific areas. The proposed multi-subject cross-attention orchestrates harmonious inter-subject compositions while preserving the control of texts. Comprehensive quantitative and qualitative experiments affirm that this method surpasses existing models in both image and text fidelity, promoting the development of personalized text-to-image generation.

[CV-45] Can Foundation Models Reliably Identify Spatial Hazards? A Case Study on Curb Segmentation

Link: https://arxiv.org/abs/2406.07202
Authors: Diwei Sheng, Giles Hamilton-Fletcher, Mahya Beheshti, Chen Feng, John-Ross Rizzo
Keywords: potential vehicular traffic, vehicular traffic hazards, delineate safe pedestrian, safe pedestrian zones, serve as vital
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 21 pages, 8 figures, submitted to Assistive Technology

Abstract:Curbs serve as vital borders that delineate safe pedestrian zones from potential vehicular traffic hazards. Curbs also represent a primary spatial hazard during dynamic navigation with significant stumbling potential. Such vulnerabilities are particularly exacerbated for persons with blindness and low vision (PBLV). Accurate visual-based discrimination of curbs is paramount for assistive technologies that aid PBLV with safe navigation in urban environments. Herein, we investigate the efficacy of curb segmentation for foundation models. We introduce the largest curb segmentation dataset to-date to benchmark leading foundation models. Our results show that state-of-the-art foundation models face significant challenges in curb segmentation. This is due to their high false-positive rates (up to 95%) with poor performance distinguishing curbs from curb-like objects or non-curb areas, such as sidewalks. In addition, the best-performing model averaged a 3.70-second inference time, underscoring problems in providing real-time assistance. In response, we propose solutions including filtered bounding box selections to achieve more accurate curb segmentation. Overall, despite the immediate flexibility of foundation models, their application for practical assistive technology applications still requires refinement. This research highlights the critical need for specialized datasets and tailored model training to address navigation challenges for PBLV and underscores implicit weaknesses in foundation models.

[CV-46] MeMSVD: Long-Range Temporal Structure Capturing Using Incremental SVD

Link: https://arxiv.org/abs/2406.07191
Authors: Ioanna Ntinou, Enrique Sanchez, Georgios Tzimiropoulos
Keywords: recognise human actions, long temporal windows, long-term video understanding, long temporal context, long temporal
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICIP 2024

Abstract:This paper is on long-term video understanding where the goal is to recognise human actions over long temporal windows (up to minutes long). In prior work, long temporal context is captured by constructing a long-term memory bank consisting of past and future video features which are then integrated into standard (short-term) video recognition backbones through the use of attention mechanisms. Two well-known problems related to this approach are the quadratic complexity of the attention operation and the fact that the whole feature bank must be stored in memory for inference. To address both issues, we propose an alternative to attention-based schemes which is based on a low-rank approximation of the memory obtained using Singular Value Decomposition. Our scheme has two advantages: (a) it reduces complexity by more than an order of magnitude, and (b) it is amenable to an efficient implementation for the calculation of the memory bases in an incremental fashion which does not require the storage of the whole feature bank in memory. The proposed scheme matches or surpasses the accuracy achieved by attention-based mechanisms while being memory-efficient. Through extensive experiments, we demonstrate that our framework generalises to different architectures and tasks, outperforming the state-of-the-art in three datasets.
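
To make the idea of an incrementally computed low-rank memory concrete, here is a minimal NumPy sketch of a standard incremental SVD update in the style of Brand's method. It assumes features are stored as columns; the names `init_basis` and `update_basis` are hypothetical, and the paper's actual update rule may differ.

```python
import numpy as np


def init_basis(first_chunk, rank):
    """SVD of the first chunk of features (one feature per column), truncated to `rank`."""
    u, s, _ = np.linalg.svd(first_chunk, full_matrices=False)
    return u[:, :rank], s[:rank]


def update_basis(u, s, chunk, rank):
    """Fold new feature columns into (u, s) without revisiting earlier features."""
    proj = u.T @ chunk              # component of the chunk inside the current subspace
    resid = chunk - u @ proj        # component outside the current subspace
    q, r = np.linalg.qr(resid)      # orthonormal basis for the new directions
    k = s.shape[0]
    c = chunk.shape[1]
    # Small matrix whose SVD yields the updated singular values and basis rotation.
    small = np.zeros((k + r.shape[0], k + c))
    small[:k, :k] = np.diag(s)
    small[:k, k:] = proj
    small[k:, k:] = r
    u_small, s_new, _ = np.linalg.svd(small, full_matrices=False)
    u_new = np.hstack([u, q]) @ u_small
    return u_new[:, :rank], s_new[:rank]
```

Only the `rank` basis vectors and singular values are kept between updates, so memory stays constant no matter how many feature chunks have been seen. This is the property the abstract contrasts with storing the whole feature bank.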

[CV-47] RGB-Sonar Tracking Benchmark and Spatial Cross-Attention Transformer Tracker

Link: https://arxiv.org/abs/2406.07189
Authors: Yunfeng Li, Bo Wang, Jiuran Sun, Xueyi Wu, Ye Li
Keywords: Vision camera, naturally complementary, underwater environment, RGB and sonar, spatial cross-attention
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Vision camera and sonar are naturally complementary in the underwater environment. Combining the information from two modalities will promote better observation of underwater targets. However, this problem has not received sufficient attention in previous research. Therefore, this paper introduces a new challenging RGB-Sonar (RGB-S) tracking task and investigates how to achieve efficient tracking of an underwater target through the interaction of RGB and sonar modalities. Specifically, we first propose an RGBS50 benchmark dataset containing 50 sequences and more than 87000 high-quality annotated bounding boxes. Experimental results show that the RGBS50 benchmark poses a challenge to currently popular SOT trackers. Second, we propose an RGB-S tracker called SCANet, which includes a spatial cross-attention module (SCAM) consisting of a novel spatial cross-attention layer and two independent global integration modules. The spatial cross-attention is used to overcome the problem of spatial misalignment between RGB and sonar images. Third, we propose a SOT data-based RGB-S simulation training method (SRST) to overcome the lack of RGB-S training datasets. It converts RGB images into sonar-like saliency images to construct pseudo-data pairs, enabling the model to learn the semantic structure of RGB-S-like data. Comprehensive experiments show that the proposed spatial cross-attention effectively achieves the interaction between RGB and sonar modalities and SCANet achieves state-of-the-art performance on the proposed benchmark. The code is available at this https URL.

[CV-48] RAD: A Comprehensive Dataset for Benchmarking the Robustness of Image Anomaly Detection

Link: https://arxiv.org/abs/2406.07176
Authors: Yuqi Cheng, Yunkang Cao, Rui Chen, Weiming Shen
Keywords: anomaly detection systems, anomaly detection, anomaly detection methods, image anomaly detection, Robust Anomaly Detection
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, 5 figures

Abstract:Robustness against noisy imaging is crucial for practical image anomaly detection systems. This study introduces a Robust Anomaly Detection (RAD) dataset with free views, uneven illuminations, and blurry collections to systematically evaluate the robustness of current anomaly detection methods. Specifically, RAD aims to identify foreign objects on working platforms as anomalies. The collection process incorporates various sources of imaging noise, such as viewpoint changes, uneven illuminations, and blurry collections, to replicate real-world inspection scenarios. Subsequently, we assess and analyze 11 state-of-the-art unsupervised and zero-shot methods on RAD. Our findings indicate that: 1) Variations in viewpoint, illumination, and blurring affect anomaly detection methods to varying degrees; 2) Methods relying on memory banks and assisted by synthetic anomalies demonstrate stronger robustness; 3) Effectively leveraging the general knowledge of foundational models is a promising avenue for enhancing the robustness of anomaly detection methods.

[CV-49] VoxNeuS: Enhancing Voxel-Based Neural Surface Reconstruction via Gradient Interpolation

Link: https://arxiv.org/abs/2406.07170
Authors: Sidun Liu, Peng Qiao, Zongxin Ye, Wenyu Li, Yong Dou
Keywords: Signed Distance Field, Distance Field, Signed Distance, learns a Signed, Neural Surface Reconstruction
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Neural Surface Reconstruction learns a Signed Distance Field (SDF) to reconstruct the 3D model from multi-view images. Previous works adopt voxel-based explicit representation to improve efficiency. However, they ignored the gradient instability of interpolation in the voxel grid, leading to degradation on convergence and smoothness. Besides, previous works entangled the optimization of geometry and radiance, which leads to the deformation of geometry to explain radiance, causing artifacts when reconstructing textured planes. In this work, we reveal that the instability of gradient comes from its discontinuity during trilinear interpolation, and propose to use the interpolated gradient instead of the original analytical gradient to eliminate the discontinuity. Based on gradient interpolation, we propose VoxNeuS, a lightweight surface reconstruction method for computational and memory efficient neural surface reconstruction. Thanks to the explicit representation, the gradients of the regularization terms, i.e. Eikonal and curvature loss, are directly solved, avoiding computation and memory-access overhead. Further, VoxNeuS adopts a geometry-radiance disentangled architecture to handle the geometry deformation from radiance optimization. The experimental results show that VoxNeuS achieves better reconstruction quality than previous works. The entire training process takes 15 minutes and less than 3 GB of memory on a single 2080ti GPU.
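
The gradient discontinuity mentioned in the abstract is easiest to see in one dimension, where trilinear interpolation reduces to linear interpolation. The sketch below is an illustrative 1D analogue, not the paper's implementation: the helper names are hypothetical, and the assumption that node gradients come from finite differences is ours.

```python
import math


def lerp_value(grid, x):
    """Linearly interpolate unit-spaced grid values at position x."""
    i = min(int(math.floor(x)), len(grid) - 2)
    t = x - i
    return (1 - t) * grid[i] + t * grid[i + 1]


def analytical_gradient(grid, x):
    """Derivative of the interpolant: constant inside a cell, so it jumps at cell borders."""
    i = min(int(math.floor(x)), len(grid) - 2)
    return grid[i + 1] - grid[i]


def node_gradients(grid):
    """Finite-difference gradient at each node (central inside, one-sided at the ends)."""
    n = len(grid)
    grads = []
    for i in range(n):
        lo, hi = max(i - 1, 0), min(i + 1, n - 1)
        grads.append((grid[hi] - grid[lo]) / (hi - lo))
    return grads


def interpolated_gradient(grid, x):
    """Interpolate the per-node gradients instead of differentiating the interpolant."""
    return lerp_value(node_gradients(grid), x)
```

For a grid sampled from x squared, `[0.0, 1.0, 4.0, 9.0]`, the analytical gradient steps from 1 to 3 when x crosses the node at 1, while the interpolated gradient passes through that node continuously.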

[CV-50] RecMoDiffuse: Recurrent Flow Diffusion for Human Motion Generation

Link: https://arxiv.org/abs/2406.07169
Authors: Mirgahney Mohamed, Harry Jake Cunningham, Marc P. Deisenroth, Lourdes Agapito
Keywords: Human motion generation, Human motion, computer animation, Human, paramount importance
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 20 pages, 6 figures

Abstract:Human motion generation has paramount importance in computer animation. It is a challenging generative temporal modelling task due to the vast possibilities of human motion, high human sensitivity to motion coherence and the difficulty of accurately generating fine-grained motions. Recently, diffusion methods have been proposed for human motion generation due to their high sample quality and expressiveness. However, generated sequences still suffer from motion incoherence, are limited to short durations and simpler motions, and take considerable time during inference. To address these limitations, we propose RecMoDiffuse: Recurrent Flow Diffusion, a new recurrent diffusion formulation for temporal modelling. Unlike previous work, which applies diffusion to the whole sequence without any temporal dependency and thereby makes temporal consistency inherently hard to achieve, our method explicitly enforces temporal constraints by means of normalizing flow models in the diffusion process and thereby extends diffusion to the temporal dimension. We demonstrate the effectiveness of RecMoDiffuse in the temporal modelling of human motion. Our experiments show that RecMoDiffuse achieves comparable results with state-of-the-art methods while generating coherent motion sequences and reducing the computational overhead in the inference stage.

[CV-51] FaceGPT: Self-supervised Learning to Chat about 3D Human Faces

Link: https://arxiv.org/abs/2406.07163
Authors: Haoran Wang, Mohit Mendiratta, Christian Theobalt, Adam Kortylewski
Keywords: Large Vision-Language Models, framework for Large, Large Vision-Language, Vision-Language Models, self-supervised learning framework
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:We introduce FaceGPT, a self-supervised learning framework for Large Vision-Language Models (VLMs) to reason about 3D human faces from images and text. Typical 3D face reconstruction methods are specialized algorithms that lack semantic reasoning capabilities. FaceGPT overcomes this limitation by embedding the parameters of a 3D morphable face model (3DMM) into the token space of a VLM, enabling the generation of 3D faces from both textual and visual inputs. FaceGPT is trained in a self-supervised manner as a model-based autoencoder from in-the-wild images. In particular, the hidden state of LLM is projected into 3DMM parameters and subsequently rendered as 2D face image to guide the self-supervised learning process via image-based reconstruction. Without relying on expensive 3D annotations of human faces, FaceGPT obtains a detailed understanding about 3D human faces, while preserving the capacity to understand general user instructions. Our experiments demonstrate that FaceGPT not only achieves high-quality 3D face reconstructions but also retains the ability for general-purpose visual instruction following. Furthermore, FaceGPT learns fully self-supervised to generate 3D faces based on complex textual inputs, which opens a new direction in human face analysis.

[CV-52] Benchmarking and Boosting Radiology Report Generation for 3D High-Resolution Medical Images

Link: https://arxiv.org/abs/2406.07146
Authors: Che Liu, Zhongwei Wan, Yuqi Wang, Hui Shen, Haozhe Wang, Kangyu Zheng, Mi Zhang, Rossella Arcucci
Keywords: broad clinical diagnostics, Automatic radiology report, Automatic radiology, writing by radiologists, significantly benefit
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Automatic radiology report generation can significantly benefit the labor-intensive process of report writing by radiologists, especially for 3D radiographs like CT scans, which are crucial for broad clinical diagnostics yet underexplored compared to 2D radiographs. Existing methods often handle 3D volumes either slice-wise or with aggressive downsampling due to current GPU memory limitations, which results in a loss of the inherent 3D nature and critical details. To overcome these issues, we introduce a novel framework that efficiently and effectively generates radiology reports for high-resolution (HR) 3D volumes, based on large language models (LLMs). Specifically, our framework utilizes low-resolution (LR) visual tokens as queries to mine information from HR tokens, preserving detailed HR information while reducing computational costs by only processing HR informed LR visual queries. Further benefiting the field, we curate and release BIMCV-RG, a new dataset with 5,328 HR 3D volumes and paired reports, establishing the first benchmarks for report generation from 3D HR medical images. Our method consistently surpasses existing methods on this benchmark across three different settings: normal-resolution, high-resolution inputs, and zero-shot domain transfer, all at an acceptable computational cost, trainable on a single A100-80G.
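
The core mechanism described here, low-resolution tokens acting as queries over high-resolution tokens, resembles a cross-attention step. The following is a stripped-down sketch under that assumption: it omits learned projections, multiple heads, and everything else a real model would have, and the name `cross_attend` is hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_attend(lr_queries, hr_tokens):
    """Each LR token attends over all HR tokens; returns one vector per LR token."""
    dim = len(hr_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for query in lr_queries:
        weights = softmax([dot(query, key) * scale for key in hr_tokens])
        # Weighted average of HR tokens: HR detail is summarised into the LR slot.
        outputs.append(
            [sum(w * key[j] for w, key in zip(weights, hr_tokens)) for j in range(dim)]
        )
    return outputs
```

The output length equals the number of LR queries rather than the number of HR tokens, which is why a downstream language model would only need to process the shorter sequence.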

[CV-53] T2S-GPT: Dynamic Vector Quantization for Autoregressive Sign Language Production from Text

Link: https://arxiv.org/abs/2406.07119
Authors: Aoxiong Yin, Haoyuan Li, Kai Shen, Siliang Tang, Yueting Zhuang
Keywords: sign language, sign language production, two-stage sign language, encodes sign language, autoregressively generates sign
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by ACL 2024

Abstract:In this work, we propose a two-stage sign language production (SLP) paradigm that first encodes sign language sequences into discrete codes and then autoregressively generates sign language from text based on the learned codebook. However, existing vector quantization (VQ) methods are fixed-length encodings, overlooking the uneven information density in sign language, which leads to under-encoding of important regions and over-encoding of unimportant regions. To address this issue, we propose a novel dynamic vector quantization (DVA-VAE) model that can dynamically adjust the encoding length based on the information density in sign language to achieve accurate and compact encoding. Then, a GPT-like model learns to generate code sequences and their corresponding durations from spoken language text. Extensive experiments conducted on the PHOENIX14T dataset demonstrate the effectiveness of our proposed method. To promote sign language research, we propose a new large German sign language dataset, PHOENIX-News, which contains 486 hours of sign language videos, audio, and transcription texts. Experimental analysis on PHOENIX-News shows that the performance of our model can be further improved by increasing the size of the training data. Our project homepage is this https URL.

[CV-54] Beyond Bare Queries: Open-Vocabulary Object Retrieval with 3D Scene Graph

Link: https://arxiv.org/abs/2406.07113
Authors: Sergey Linok, Tatiana Zemskova, Svetlana Ladanova, Roman Titkov, Dmitry Yudin
Keywords: Locating objects referred, natural language poses, Locating objects, autonomous agents, challenge for autonomous
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 9 pages, 4 figures, 4 tables

Abstract:Locating objects referred to in natural language poses a significant challenge for autonomous agents. Existing CLIP-based open-vocabulary methods successfully perform 3D object retrieval with simple (bare) queries but cannot cope with ambiguous descriptions that demand an understanding of object relations. To tackle this problem, we propose a modular approach called BBQ (Beyond Bare Queries), which constructs 3D scene spatial graph representation with metric edges and utilizes a large language model as a human-to-agent interface through our deductive scene reasoning algorithm. BBQ employs robust DINO-powered associations to form 3D objects, an advanced raycasting algorithm to project them to 2D, and a vision-language model to describe them as graph nodes. On Replica and ScanNet datasets, we show that the designed method accurately constructs 3D object-centric maps. We demonstrate that the quality of these maps is among the best for open-vocabulary 3D semantic segmentation when compared with other zero-shot methods. Also, we show that leveraging spatial relations is especially effective for scenes containing multiple entities of the same semantic class. On Sr3D and Nr3D benchmarks, our deductive approach demonstrates a significant improvement over other state-of-the-art methods in retrieving objects by complex queries. Considering our design solutions, we achieved a processing speed approximately three times faster than the closest analog. This promising performance enables our approach for usage in applied intelligent robotics projects. We make the code publicly available at this http URL.

[CV-55] NeRSP: Neural 3D Reconstruction for Reflective Objects with Sparse Polarized Images

Link: https://arxiv.org/abs/2406.07111
Authors: Yufei Han, Heng Guo, Koki Fukai, Hiroaki Santo, Boxin Shi, Fumio Okura, Zhanyu Ma, Yunpeng Jia
Keywords: present NeRSP, Reflective surface reconstruction, Reflective surfaces, Sparse Polarized images, reconstruction technique
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages

Abstract:We present NeRSP, a Neural 3D reconstruction technique for Reflective surfaces with Sparse Polarized images. Reflective surface reconstruction is extremely challenging as specular reflections are view-dependent and thus violate the multiview consistency for multiview stereo. On the other hand, sparse image inputs, as a practical capture setting, commonly cause incomplete or distorted results due to the lack of correspondence matching. This paper jointly handles the challenges from sparse inputs and reflective surfaces by leveraging polarized images. We derive photometric and geometric cues from the polarimetric image formation model and multiview azimuth consistency, which jointly optimize the surface geometry modeled via implicit neural representation. Based on the experiments on our synthetic and real datasets, we achieve the state-of-the-art surface reconstruction results with only 6 views as input.

[CV-56] AutoTVG: A New Vision-language Pre-training Paradigm for Temporal Video Grounding

Link: https://arxiv.org/abs/2406.07091
Authors: Xing Zhang, Jiaxi Gu, Haoyu Zhao, Shicong Wang, Hang Xu, Renjing Pei, Songcen Xu, Zuxuan Wu, Yu-Gang Jiang
Keywords: Temporal Video Grounding, aims to localize, language description, TVG, Video Grounding
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Technique Report

Abstract:Temporal Video Grounding (TVG) aims to localize a moment from an untrimmed video given the language description. Since the annotation of TVG is labor-intensive, TVG under limited supervision has received attention in recent years. The great success of vision-language pre-training guides TVG to follow the traditional “pre-training + fine-tuning” paradigm; however, the pre-training process would suffer from a lack of temporal modeling and fine-grained alignment due to the difference of data nature between pre-train and test. Besides, the large gap between pretext and downstream tasks makes zero-shot testing impossible for the pre-trained model. To avoid the drawbacks of the traditional paradigm, we propose AutoTVG, a new vision-language pre-training paradigm for TVG that enables the model to learn semantic alignment and boundary regression from automatically annotated untrimmed videos. To be specific, AutoTVG consists of a novel Captioned Moment Generation (CMG) module to generate captioned moments from untrimmed videos, and TVGNet with a regression head to predict localization results. Experimental results on Charades-STA and ActivityNet Captions show that, regarding zero-shot temporal video grounding, AutoTVG achieves highly competitive performance with in-distribution methods under out-of-distribution testing, and is superior to existing pre-training frameworks with much less training data.

[CV-57] RS-Agent: Automating Remote Sensing Tasks through Intelligent Agents

Link: https://arxiv.org/abs/2406.07089
Authors: Wenjia Xu, Zijian Yu, Yixu Wang, Jiuniu Wang, Mugen Peng
Keywords: Large Language Models, achieved great performance, Language Models, Large Language, Visual Language Models
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:An increasing number of models have achieved great performance in remote sensing tasks with the recent development of Large Language Models (LLMs) and Visual Language Models (VLMs). However, these models are constrained to basic vision and language instruction-tuning tasks, facing challenges in complex remote sensing applications. Additionally, these models lack specialized expertise in professional domains. To address these limitations, we propose an LLM-driven remote sensing intelligent agent named RS-Agent. Firstly, RS-Agent is powered by a large language model (LLM) that acts as its “Central Controller,” enabling it to understand and respond to various problems intelligently. Secondly, our RS-Agent integrates many high-performance remote sensing image processing tools, facilitating multi-tool and multi-turn conversations. Thirdly, our RS-Agent can answer professional questions by leveraging robust knowledge documents. We conducted experiments using several datasets, e.g., RSSDIVCS, RSVQA, and DOTAv1. The experimental results demonstrate that our RS-Agent delivers outstanding performance in many tasks, namely scene classification, visual question answering, and object counting.

[CV-58] CAT: Coordinating Anatomical-Textual Prompts for Multi-Organ and Tumor Segmentation

Link: https://arxiv.org/abs/2406.07085
Authors: Zhongzhen Huang, Yankai Jiang, Rongzhao Zhang, Shaoting Zhang, Xiaofan Zhang
Keywords: Existing promptable segmentation, segment relevant objects, imaging field primarily, Existing promptable, medical imaging field
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Existing promptable segmentation methods in the medical imaging field primarily consider either textual or visual prompts to segment relevant objects, yet they often fall short when addressing anomalies in medical images, like tumors, which may vary greatly in shape, size, and appearance. Recognizing the complexity of medical scenarios and the limitations of textual or visual prompts, we propose a novel dual-prompt schema that leverages the complementary strengths of visual and textual prompts for segmenting various organs and tumors. Specifically, we introduce CAT, an innovative model that Coordinates Anatomical prompts derived from 3D cropped images with Textual prompts enriched by medical domain knowledge. The model architecture adopts a general query-based design, where prompt queries facilitate segmentation queries for mask prediction. To synergize two types of prompts within a unified framework, we implement a ShareRefiner, which refines both segmentation and prompt queries while disentangling the two types of prompts. Trained on a consortium of 10 public CT datasets, CAT demonstrates superior performance in multiple segmentation tasks. Further validation on a specialized in-house dataset reveals the remarkable capacity of segmenting tumors across multiple cancer stages. This approach confirms that coordinating multimodal prompts is a promising avenue for addressing complex scenarios in the medical domain.

[CV-59] Unified Modeling Enhanced Multimodal Learning for Precision Neuro-Oncology

Link: https://arxiv.org/abs/2406.07078
Authors: Huahui Yi, Xiaofei Wang, Kang Li, Chao Li
Keywords: integrating histology images, Enhanced Multimodal Learning, Multimodal learning, molecular levels, oncology with comprehensive
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Multimodal learning, integrating histology images and genomics, promises to enhance precision oncology with comprehensive views at microscopic and molecular levels. However, existing methods may not sufficiently model the shared or complementary information for more effective integration. In this study, we introduce a Unified Modeling Enhanced Multimodal Learning (UMEML) framework that employs a hierarchical attention structure to effectively leverage shared and complementary features of both modalities of histology and genomics. Specifically, to mitigate unimodal bias from modality imbalance, we utilize a query-based cross-attention mechanism for prototype clustering in the pathology encoder. Our prototype assignment and modularity strategy are designed to align shared features and minimize modality gaps. An additional registration mechanism with learnable tokens is introduced to enhance cross-modal feature integration and robustness in multimodal unified modeling. Our experiments demonstrate that our method surpasses previous state-of-the-art approaches in glioma diagnosis and prognosis tasks, underscoring its superiority in precision neuro-oncology.

[CV-60] Benchmarking Trustworthiness of Multimodal Large Language Models: A Comprehensive Study

Link: https://arxiv.org/abs/2406.07057
Authors: Yichi Zhang, Yao Huang, Yitong Sun, Chang Liu, Zhe Zhao, Zhengwei Fang, Yifan Wang, Huanran Chen, Xiao Yang, Xingxing Wei, Hang Su, Yinpeng Dong, Jun Zhu
Keywords: Large Language Models, Multimodal Large Language, Large Language, significant trustworthiness challenges, face significant trustworthiness
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 100 pages, 84 figures, 33 tables

Abstract:Despite the superior capabilities of Multimodal Large Language Models (MLLMs) across diverse tasks, they still face significant trustworthiness challenges. Yet, current literature on the assessment of trustworthy MLLMs remains limited, lacking a holistic evaluation to offer thorough insights into future improvements. In this work, we establish MultiTrust, the first comprehensive and unified benchmark on the trustworthiness of MLLMs across five primary aspects: truthfulness, safety, robustness, fairness, and privacy. Our benchmark employs a rigorous evaluation strategy that addresses both multimodal risks and cross-modal impacts, encompassing 32 diverse tasks with self-curated datasets. Extensive experiments with 21 modern MLLMs reveal some previously unexplored trustworthiness issues and risks, highlighting the complexities introduced by the multimodality and underscoring the necessity for advanced methodologies to enhance their reliability. For instance, typical proprietary models still struggle with the perception of visually confusing images and are vulnerable to multimodal jailbreaking and adversarial attacks; MLLMs are more inclined to disclose privacy in text and reveal ideological and cultural biases even when paired with irrelevant images in inference, indicating that the multimodality amplifies the internal risks from base LLMs. Additionally, we release a scalable toolbox for standardized trustworthiness research, aiming to facilitate future advancements in this important field. Code and resources are publicly available at: this https URL.

[CV-61] DualMamba: A Lightweight Spectral-Spatial Mamba-Convolution Network for Hyperspectral Image Classification

Link: https://arxiv.org/abs/2406.07050
Authors: Jiamu Sheng, Jingyi Zhou, Jiong Wang, Peng Ye, Jiayuan Fan
Keywords: Hyperspectral image, crucial for Hyperspectral, complex spectral-spatial relations, spectral-spatial, Mamba
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:The effectiveness and efficiency of modeling complex spectral-spatial relations are both crucial for Hyperspectral image (HSI) classification. Most existing methods based on CNNs and transformers still suffer from heavy computational burdens and have room for improvement in capturing the global-local spectral-spatial feature representation. To this end, we propose a novel lightweight parallel design called lightweight dual-stream Mamba-convolution network (DualMamba) for HSI classification. Specifically, a parallel lightweight Mamba and CNN block are first developed to extract global and local spectral-spatial features. First, the cross-attention spectral-spatial Mamba module is proposed to leverage the global modeling of Mamba at linear complexity. Within this module, dynamic positional embedding is designed to enhance the spatial location information of visual sequences. The lightweight spectral/spatial Mamba blocks comprise an efficient scanning strategy and a lightweight Mamba design to efficiently extract global spectral-spatial features. And the cross-attention spectral-spatial fusion is designed to learn cross-correlation and fuse spectral-spatial features. Second, the lightweight spectral-spatial residual convolution module is proposed with lightweight spectral and spatial branches to extract local spectral-spatial features through residual learning. Finally, the adaptive global-local fusion is proposed to dynamically combine global Mamba features and local convolution features for a global-local spectral-spatial representation. Compared with state-of-the-art HSI classification methods, experimental results demonstrate that DualMamba achieves significant classification accuracy on three public HSI datasets and a superior reduction in model parameters and floating point operations (FLOPs).

[CV-62] 1st Place Solution for MeViS Track in CVPR 2024 PVUW Workshop: Motion Expression guided Video Segmentation

Link: https://arxiv.org/abs/2406.07043
Authors: Mingqi Gao,Jingnan Luo,Jinyu Yang,Jungong Han,Feng Zheng
Keywords: Motion Expression guided, guided Video Segmentation, video object segmentation, Expression guided Video, referring video object
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Motion Expression guided Video Segmentation (MeViS), as an emerging task, poses many new challenges to the field of referring video object segmentation (RVOS). In this technical report, we investigated and validated the effectiveness of static-dominant data and frame sampling on this challenging setting. Our solution achieves a J&F score of 0.5447 in the competition phase and ranks 1st in the MeViS track of the PVUW Challenge. The code is available at: this https URL.

[CV-63] EFFOcc: A Minimal Baseline for EFficient Fusion-based 3D Occupancy Network

Link: https://arxiv.org/abs/2406.07042
Authors: Yining Shi,Kun Jiang,Ke Wang,Kangan Qian,Yunlong Wang,Jiusi Li,Tuopu Wen,Mengmeng Yang,Yiliang Xu,Diange Yang
Keywords: rapidly rising challenging, rising challenging perception, challenging perception task, autonomous driving, driving scene
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: preprint under review

Abstract:3D occupancy prediction (Occ) is a rapidly rising and challenging perception task in autonomous driving that represents the driving scene as uniformly partitioned 3D voxel grids with semantics. Compared to 3D object detection, grid perception has the advantage of better recognizing irregularly shaped, unknown-category, or partially occluded general objects. However, existing 3D occupancy networks (occnets) are both computationally heavy and label-hungry. In terms of model complexity, occnets are commonly composed of heavy Conv3D modules or transformers on the voxel level. In terms of label annotation requirements, occnets are supervised with large-scale, expensive dense voxel labels. Model and data inefficiency, caused by excessive network parameters and label annotation requirements, severely hinders the onboard deployment of occnets. This paper proposes an efficient 3D occupancy network (EFFOcc) that targets minimal network complexity and label requirements while achieving state-of-the-art accuracy. EFFOcc uses only simple 2D operators and improves Occ accuracy to the state of the art on multiple large-scale benchmarks: Occ3D-nuScenes, Occ3D-Waymo, and OpenOccupancy-nuScenes. On the Occ3D-nuScenes benchmark, EFFOcc has only 18.4M parameters and achieves 50.46 mean IoU (mIoU); to our knowledge, it has the fewest parameters among related occnets. Moreover, we propose a two-stage active learning strategy to reduce the requirements for labelled data. Active EFFOcc trained with 6% labelled voxels achieves 47.19 mIoU, which is 95.7% of fully supervised performance. The proposed EFFOcc also supports improved vision-only occupancy prediction with the aid of region-decomposed distillation. Code and demo videos will be available at this https URL.

[CV-64] PanoSSC: Exploring Monocular Panoptic 3D Scene Reconstruction for Autonomous Driving

Link: https://arxiv.org/abs/2406.07037
Authors: Yining Shi,Jiusi Li,Kun Jiang,Ke Wang,Yunlong Wang,Mengmeng Yang,Diange Yang
Keywords: driving perception systems, Vision-centric occupancy networks, camera-only autonomous driving, autonomous driving perception, Vision-centric occupancy
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 3DV 2024

Abstract:Vision-centric occupancy networks, which represent the surrounding environment with uniform voxels with semantics, have become a new trend for safe driving of camera-only autonomous driving perception systems, as they are able to detect obstacles regardless of their shape and occlusion. Modern occupancy networks mainly focus on reconstructing visible voxels from object surfaces with voxel-wise semantic prediction. Usually, they suffer from inconsistent predictions of one object and mixed predictions for adjacent objects. These confusions may harm the safety of downstream planning modules. To this end, we investigate panoptic segmentation on 3D voxel scenarios and propose an instance-aware occupancy network, PanoSSC. We predict foreground objects and backgrounds separately and merge both in post-processing. For foreground instance grouping, we propose a novel 3D instance mask decoder that can efficiently extract individual objects. We unify geometric reconstruction, 3D semantic segmentation, and 3D instance segmentation into the PanoSSC framework and propose new metrics for evaluating panoptic voxels. Extensive experiments show that our method achieves competitive results on the SemanticKITTI semantic scene completion benchmark.

[CV-65] RS-DFM: A Remote Sensing Distributed Foundation Model for Diverse Downstream Tasks

Link: https://arxiv.org/abs/2406.07032
Authors: Zhechao Wang,Peirui Cheng,Pengju Tian,Yuchao Wang,Mingxin Chen,Shujing Duan,Zhirui Wang,Xinming Li,Xian Sun
Keywords: achieved notable success, Remote sensing, sensing lightweight foundation, Remote sensing lightweight, Sensing Distributed Foundation
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Remote sensing lightweight foundation models have achieved notable success in online perception within remote sensing. However, their capabilities are restricted to performing online inference solely based on their own observations and models, thus lacking a comprehensive understanding of large-scale remote sensing scenarios. To overcome this limitation, we propose a Remote Sensing Distributed Foundation Model (RS-DFM) based on generalized information mapping and interaction. This model can realize online collaborative perception across multiple platforms and various downstream tasks by mapping observations into a unified space and implementing a task-agnostic information interaction strategy. Specifically, we leverage the ground-based geometric prior of remote sensing oblique observations to transform the feature mapping from absolute depth estimation to relative depth estimation, thereby enhancing the model’s ability to extract generalized features across diverse heights and perspectives. Additionally, we present a dual-branch information compression module to decouple high-frequency and low-frequency feature information, achieving feature-level compression while preserving essential task-agnostic details. In support of our research, we create a multi-task simulation dataset named AirCo-MultiTasks for multi-UAV collaborative observation. We also conduct extensive experiments, including 3D object detection, instance segmentation, and trajectory prediction. The numerous results demonstrate that our RS-DFM achieves state-of-the-art performance across various downstream tasks.

[CV-66] LiSD: An Efficient Multi-Task Learning Framework for LiDAR Segmentation and Detection

Link: https://arxiv.org/abs/2406.07023
Authors: Jiahua Xu,Si Zuo,Chenfeng Wei,Wei Zhou
Keywords: object detection methodologies, autonomous driving, research of lidar-based, traffic participants, rapid proliferation
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:With the rapid proliferation of autonomous driving, there has been a heightened focus on the research of lidar-based 3D semantic segmentation and object detection methodologies, aiming to ensure the safety of traffic participants. In recent decades, learning-based approaches have emerged, demonstrating remarkable performance gains in comparison to conventional algorithms. However, the segmentation and detection tasks have traditionally been examined in isolation to achieve the best precision. To this end, we propose an efficient multi-task learning framework named LiSD which can address both segmentation and detection tasks, aiming to optimize the overall performance. Our proposed LiSD is a voxel-based encoder-decoder framework that contains a hierarchical feature collaboration module and a holistic information aggregation module. Different integration methods are adopted to keep sparsity in segmentation while densifying features for query initialization in detection. Besides, cross-task information is utilized in an instance-aware refinement module to obtain more accurate predictions. Experimental results on the nuScenes dataset and Waymo Open Dataset demonstrate the effectiveness of our proposed model. It is worth noting that LiSD achieves the state-of-the-art performance of 83.3% mIoU on the nuScenes segmentation benchmark for lidar-only methods.
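Several entries in this digest report segmentation quality as mean IoU (mIoU). As a reading aid, here is a minimal sketch of how that metric is commonly computed over flat label lists. The function name `mean_iou` and the toy labels are illustrative assumptions, and the exact protocol used by nuScenes, Waymo, or any paper above (ignored labels, dataset-level accumulation) may differ.

```python
def mean_iou(pred, target, num_classes):
    """Average per-class intersection-over-union over classes that occur.

    pred and target are equal-length sequences of integer class labels.
    Classes absent from both pred and target are skipped (an assumption;
    benchmarks differ on how they treat empty classes).
    """
    ious = []
    for c in range(num_classes):
        # Points where both prediction and ground truth are class c.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Points where either prediction or ground truth is class c.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example with three classes (hypothetical labels).
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
print(f"mIoU = {score:.3f}")
```

In this toy case the per-class IoUs are 1/3, 2/3, and 1/2, so the mean is 0.5.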

[CV-67] Eye-for-an-eye: Appearance Transfer with Semantic Correspondence in Diffusion Models

Link: https://arxiv.org/abs/2406.07008
Authors: Sooyeon Go,Kyungmook Choi,Minjung Shin,Youngjung Uh
Keywords: diffusion models, semantic correspondences, image synthesis, reference, reference image
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: project page: this https URL

Abstract:As pretrained text-to-image diffusion models have become a useful tool for image synthesis, people want to specify the results in various ways. In this paper, we introduce a method to produce results with the same structure of a target image but painted with colors from a reference image, i.e., appearance transfer, especially following the semantic correspondence between the result and the reference. E.g., the result wing takes color from the reference wing, not the reference head. Existing methods rely on the query-key similarity within self-attention layer, usually producing defective results. To this end, we propose to find semantic correspondences and explicitly rearrange the features according to the semantic correspondences. Extensive experiments show the superiority of our method in various aspects: preserving the structure of the target and reflecting the color from the reference according to the semantic correspondences, even when the two images are not aligned.

[CV-68] MIPI 2024 Challenge on Few-shot RAW Image Denoising: Methods and Results

Link: https://arxiv.org/abs/2406.07006
Authors: Xin Jin,Chunle Guo,Xiaoming Li,Zongsheng Yue,Chongyi Li,Shangchen Zhou,Ruicheng Feng,Yuekun Dai,Peiqing Yang,Chen Change Loy,Ruoqi Li,Chang Liu,Ziyi Wang,Yao Du,Jingjing Yang,Long Bao,Heng Sun,Xiangyu Kong,Xiaoxia Xing,Jinlong Wu,Yuanyang Xue,Hyunhee Park,Sejun Song,Changho Kim,Jingfan Tan,Wenhan Luo,Zikun Liu,Mingde Qiao,Junjun Jiang,Kui Jiang,Yao Xiao,Chuyang Sun,Jinhui Hu,Weijian Ruan,Yubo Dong,Kai Chen,Hyejeong Jo,Jiahao Qin,Bingjie Han,Pinle Qin,Rui Chai,Pengyuan Wang
Keywords: Few-shot RAW Image, RAW Image Denoising, advanced image sensors, camera systems, mobile intelligent photography
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2024 Mobile Intelligent Photography and Imaging (MIPI) Workshop: Few-shot RAW Image Denoising Challenge Report. Website: this https URL

Abstract:The increasing demand for computational photography and imaging on mobile platforms has led to the widespread development and integration of advanced image sensors with novel algorithms in camera systems. However, the scarcity of high-quality data for research and the rare opportunity for in-depth exchange of views from industry and academia constrain the development of mobile intelligent photography and imaging (MIPI). Building on the achievements of the previous MIPI Workshops held at ECCV 2022 and CVPR 2023, we introduce our third MIPI challenge including three tracks focusing on novel image sensors and imaging algorithms. In this paper, we summarize and review the Few-shot RAW Image Denoising track on MIPI 2024. In total, 165 participants were successfully registered, and 7 teams submitted results in the final testing phase. The developed solutions in this challenge achieved state-of-the-art performance on Few-shot RAW Image Denoising. More details of this challenge and the link to the dataset can be found at this https URL.

[CV-69] Teaching with Uncertainty: Unleashing the Potential of Knowledge Distillation in Object Detection

Link: https://arxiv.org/abs/2406.06999
Authors: Junfei Yi,Jianxu Mao,Tengfei Liu,Mingjie Li,Hanyu Gu,Hui Zhang,Xiaojun Chang,Yaonan Wang
Keywords: widely adopted, adopted and effective, object detection tasks, Knowledge, knowledge uncertainty
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Knowledge distillation (KD) is a widely adopted and effective method for compressing models in object detection tasks. Particularly, feature-based distillation methods have shown remarkable performance. Existing approaches often ignore the uncertainty in the teacher model’s knowledge, which stems from data noise and imperfect training. This limits the student model’s ability to learn latent knowledge, as it may overly rely on the teacher’s imperfect guidance. In this paper, we propose a novel feature-based distillation paradigm with knowledge uncertainty for object detection, termed “Uncertainty Estimation-Discriminative Knowledge Extraction-Knowledge Transfer (UET)”, which can seamlessly integrate with existing distillation methods. By leveraging the Monte Carlo dropout technique, we introduce knowledge uncertainty into the training process of the student model, facilitating deeper exploration of latent knowledge. Our method performs effectively during the KD process without requiring intricate structures or extensive computational resources. Extensive experiments validate the effectiveness of our proposed approach across various distillation strategies, detectors, and backbone architectures. Specifically, following our proposed paradigm, the existing FGD method achieves state-of-the-art (SoTA) performance, with ResNet50-based GFL achieving 44.1% mAP on the COCO dataset, surpassing the baselines by 3.9%.
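The abstract above relies on Monte Carlo dropout to estimate uncertainty in the teacher's knowledge. The following is a minimal, pure-Python sketch of the general MC-dropout idea only: keep dropout active at inference, run several stochastic passes, and read the spread of the outputs as an uncertainty estimate. It is not the UET pipeline; the toy single-unit "teacher", the function names, and all numbers are assumptions made for illustration.

```python
import random
import statistics


def dropout(values, p, rng):
    """Inverted dropout: zero each value with probability p, rescale survivors."""
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in values]


def teacher_forward(features, weights, p, rng):
    """Toy 'teacher' (hypothetical): one linear unit with dropout left on."""
    dropped = dropout(features, p, rng)
    return sum(w * x for w, x in zip(weights, dropped))


def mc_dropout(features, weights, p=0.5, passes=200, seed=0):
    """Run several stochastic forward passes.

    Returns the mean prediction and the population variance across passes;
    the variance serves as a simple uncertainty estimate.
    """
    rng = random.Random(seed)
    outputs = [teacher_forward(features, weights, p, rng) for _ in range(passes)]
    return statistics.fmean(outputs), statistics.pvariance(outputs)


# Hypothetical feature vector and weights.
mean, var = mc_dropout([1.0, 2.0, 3.0], [0.5, -0.25, 0.1])
print(f"mean prediction = {mean:.3f}, variance = {var:.3f}")
```

With `p=0.0` every pass is identical, so the variance collapses to zero; larger dropout rates spread the outputs and raise the estimated uncertainty.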

[CV-70] Hydra-MDP: End-to-end Multimodal Planning with Multi-target Hydra-Distillation

Link: https://arxiv.org/abs/2406.06978
Authors: Zhenxin Li,Kailin Li,Shihao Wang,Shiyi Lan,Zhiding Yu,Yishen Ji,Zhiqi Li,Ziyue Zhu,Jan Kautz,Zuxuan Wu,Yu-Gang Jiang,Jose M. Alvarez
Keywords: paradigm employing multiple, employing multiple teachers, paradigm employing, employing multiple, teacher-student model
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: The 1st place solution of End-to-end Driving at Scale at the CVPR 2024 Autonomous Grand Challenge

Abstract:We propose Hydra-MDP, a novel paradigm employing multiple teachers in a teacher-student model. This approach uses knowledge distillation from both human and rule-based teachers to train the student model, which features a multi-head decoder to learn diverse trajectory candidates tailored to various evaluation metrics. With the knowledge of rule-based teachers, Hydra-MDP learns how the environment influences the planning in an end-to-end manner instead of resorting to non-differentiable post-processing. This method achieves 1st place in the Navsim challenge, demonstrating significant improvements in generalization across diverse driving environments and conditions. Code will be available at this https URL.

[CV-71] RWKV-CLIP: A Robust Vision-Language Representation Learner

Link: https://arxiv.org/abs/2406.06973
Authors: Tiancheng Gu,Kaicheng Yang,Xiang An,Ziyong Feng,Dongnan Liu,Weidong Cai,Jiankang Deng
Keywords: Contrastive Language-Image Pre-training, Contrastive Language-Image, significantly improved performance, image-text pairs obtained, obtained from websites
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 10 figures

Abstract:Contrastive Language-Image Pre-training (CLIP) has significantly improved performance in various vision-language tasks by expanding the dataset with image-text pairs obtained from websites. This paper further explores CLIP from the perspectives of data and model architecture. To address the prevalence of noisy data and enhance the quality of large-scale image-text data crawled from the internet, we introduce a diverse description generation framework that can leverage Large Language Models (LLMs) to synthesize and refine content from web-based texts, synthetic captions, and detection tags. Furthermore, we propose RWKV-CLIP, the first RWKV-driven vision-language representation learning model that combines the effective parallel training of transformers with the efficient inference of RNNs. Comprehensive experiments across various model scales and pre-training datasets demonstrate that RWKV-CLIP is a robust and efficient vision-language representation learner; it achieves state-of-the-art performance in several downstream tasks, including linear probe, zero-shot classification, and zero-shot image-text retrieval. To facilitate future research, the code and pre-trained models are released at this https URL.

[CV-72] Generative Lifting of Multiview to 3D from Unknown Pose: Wrapping NeRF inside Diffusion

Link: https://arxiv.org/abs/2406.06972
Authors: Xin Yuan,Rana Hanocka,Michael Maire
Keywords: generative modeling problem, cast multiview reconstruction, Neural Radiance Field, Diffusion Probabilistic Model, modeling problem
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:We cast multiview reconstruction from unknown pose as a generative modeling problem. From a collection of unannotated 2D images of a scene, our approach simultaneously learns both a network to predict camera pose from 2D image input and the parameters of a Neural Radiance Field (NeRF) for the 3D scene. To drive learning, we wrap both the pose prediction network and NeRF inside a Denoising Diffusion Probabilistic Model (DDPM) and train the system via the standard denoising objective. Our framework requires the system to accomplish the task of denoising an input 2D image by predicting its pose and rendering the NeRF from that pose. Learning to denoise thus forces the system to concurrently learn the underlying 3D NeRF representation and a mapping from images to camera extrinsic parameters. To facilitate the latter, we design a custom network architecture to represent pose as a distribution, granting implicit capacity for discovering view correspondences when trained end-to-end for denoising alone. This technique allows our system to successfully build NeRFs, without pose knowledge, for challenging scenes where competing methods fail. At the conclusion of training, our learned NeRF can be extracted and used as a 3D scene model; our full system can be used to sample novel camera poses and generate novel-view images.

[CV-73] Dual Thinking and Perceptual Analysis of Deep Learning Models using Human Adversarial Examples

Link: https://arxiv.org/abs/2406.06967
Authors: Kailas Dayanandan,Anand Sinha,Brejesh Lall
Keywords: dual thinking framework, dual thinking, human vision, logical processing, thinking framework
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)

Abstract:The dual thinking framework considers fast, intuitive processing and slower, logical processing. The perception of dual thinking in vision requires images where inferences from intuitive and logical processing differ. We introduce an adversarial dataset to provide evidence for the dual thinking framework in human vision, which also aids in studying the qualitative behavior of deep learning models. Our study also addresses a major criticism of using classification models as computational models of human vision by using instance segmentation models that localize objects. The evidence underscores the importance of shape in identifying instances in human vision and shows that deep learning models lack an understanding of sub-structures, as indicated by errors related to the position and number of sub-components. Additionally, the similarity in errors made by models and intuitive human processing indicates that models only address intuitive thinking in human vision.

[CV-74] Evolving from Single-modal to Multi-modal Facial Deepfake Detection: A Survey

Link: https://arxiv.org/abs/2406.06965
Authors: Ping Liu,Qiqi Tao,Joey Tianyi Zhou
Keywords: artificial intelligence, addresses the critical, amidst the rapid, rapid advancements, advancements in artificial
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:This survey addresses the critical challenge of deepfake detection amidst the rapid advancements in artificial intelligence. As AI-generated media, including video, audio and text, become more realistic, the risk of misuse to spread misinformation and commit identity fraud increases. Focused on face-centric deepfakes, this work traces the evolution from traditional single-modality methods to sophisticated multi-modal approaches that handle audio-visual and text-visual scenarios. We provide comprehensive taxonomies of detection techniques, discuss the evolution of generative methods from auto-encoders and GANs to diffusion models, and categorize these technologies by their unique attributes. To our knowledge, this is the first survey of its kind. We also explore the challenges of adapting detection methods to new generative models and enhancing the reliability and robustness of deepfake detectors, proposing directions for future research. This survey offers a detailed roadmap for researchers, supporting the development of technologies to counter the deceptive use of AI in media creation, particularly facial forgery. A curated list of all related papers can be found at this https URL.

[CV-75] Stepwise Regression and Pre-trained Edge for Robust Stereo Matching

Link: https://arxiv.org/abs/2406.06953
Authors: Weiqing Xiao,Wei Zhao
Keywords: obtaining real samples, stereo matching, real-world applications, stereo matching methods, difficulty in obtaining
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Due to the difficulty in obtaining real samples and ground truth, the generalization performance and the fine-tuned performance are critical for the feasibility of stereo matching methods in real-world applications. However, the presence of substantial disparity distributions and density variations across different datasets presents significant challenges for the generalization and fine-tuning of the model. In this paper, we propose a novel stereo matching method, called SR-Stereo, which mitigates the distributional differences across different datasets by predicting the disparity clips and uses a loss weight related to the regression target scale to improve the accuracy of the disparity clips. Moreover, this stepwise regression architecture can be easily extended to existing iteration-based methods to improve the performance without changing the structure. In addition, to mitigate the edge blurring of the fine-tuned model on sparse ground truth, we propose Domain Adaptation Based on Pre-trained Edges (DAPE). Specifically, we use the predicted disparity and RGB image to estimate the edge map of the target domain image. The edge map is filtered to generate edge map background pseudo-labels, which together with the sparse ground truth disparity on the target domain are used as supervision to jointly fine-tune the pre-trained stereo matching model. These proposed methods are extensively evaluated on SceneFlow, KITTI, Middlebury 2014 and ETH3D. SR-Stereo achieves competitive disparity estimation performance and state-of-the-art cross-domain generalisation performance. Meanwhile, the proposed DAPE significantly improves the disparity estimation performance of fine-tuned models, especially in the textureless and detail regions.

[CV-76] Triple-domain Feature Learning with Frequency-aware Memory Enhancement for Moving Infrared Small Target Detection

Link: https://arxiv.org/abs/2406.06949
Authors: Weiwei Duan,Luping Ji,Shengjia Chen,Sicheng Zhu,Mao Ye
Keywords: Moving infrared small, detection presents significant, presents significant challenges, significant challenges due, Moving infrared
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: This paper has been submitted to IEEE TGRS and is under review

Abstract:Moving infrared small target detection presents significant challenges due to tiny target sizes and low contrast against backgrounds. Existing methods primarily focus on extracting target features only from the spatial-temporal domain. To further enhance feature representation, more information domains such as frequency are believed to be potentially valuable. To extend target feature learning, we propose a new Triple-domain Strategy (Tridos) with frequency-aware memory enhancement on the spatial-temporal domain. Our scheme effectively detaches and enhances frequency features through a local-global frequency-aware module based on the Fourier transform. Inspired by the human visual system, our memory enhancement aims to capture the target spatial relations between video frames. Furthermore, it encodes temporal dynamic motion features via differential learning and residual enhancing. Additionally, we design a residual compensation unit to reconcile possible cross-domain feature mismatches. To the best of our knowledge, Tridos is the first work to explore target feature learning comprehensively in spatial-temporal-frequency domains. Extensive experiments on three datasets (DAUB, ITSDT-15K, and IRDST) validate that our triple-domain learning scheme can be clearly superior to state-of-the-art methods. Source codes are available at this https URL.

[CV-77] Neural Visibility Field for Uncertainty-Driven Active Mapping

Link: https://arxiv.org/abs/2406.06948
Authors: Shangjie Xue,Jesse Dill,Pranay Mathur,Frank Dellaert,Panagiotis Tsiotra,Danfei Xu
Keywords: presents Neural Visibility, Neural Visibility Field, Neural Radiance Fields, paper presents Neural, Neural Visibility
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Accepted to CVPR 2024. More details can be found at this https URL

Abstract:This paper presents Neural Visibility Field (NVF), a novel uncertainty quantification method for Neural Radiance Fields (NeRF) applied to active mapping. Our key insight is that regions not visible in the training views lead to inherently unreliable color predictions by NeRF in those regions, resulting in increased uncertainty in the synthesized views. To address this, we propose to use Bayesian Networks to composite position-based field uncertainty into ray-based uncertainty in camera observations. Consequently, NVF naturally assigns higher uncertainty to unobserved regions, helping robots select the most informative next viewpoints. Extensive evaluations show that NVF excels not only in uncertainty quantification but also in scene reconstruction for active mapping, outperforming existing methods.

[CV-78] Sparse Bayesian Networks: Efficient Uncertainty Quantification in Medical Image Analysis

Link: https://arxiv.org/abs/2406.06946
Authors: Zeinab Abboud,Herve Lombaert,Samuel Kadoury
Keywords: Efficiently quantifying predictive, medical images remains, Efficiently quantifying, remains a challenge, Bayesian
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Efficiently quantifying predictive uncertainty in medical images remains a challenge. While Bayesian neural networks (BNN) offer predictive uncertainty, they require substantial computational resources to train. Although Bayesian approximations such as ensembles have shown promise, they still suffer from high training and inference costs. Existing approaches mainly address the costs of BNN inference post-training, with little focus on improving training efficiency and reducing parameter complexity. This study introduces a training procedure for a sparse (partial) Bayesian network. Our method selectively assigns a subset of parameters as Bayesian by assessing their deterministic saliency through gradient sensitivity analysis. The resulting network combines deterministic and Bayesian parameters, exploiting the advantages of both representations to achieve high task-specific performance and minimize predictive uncertainty. Demonstrated on multi-label ChestMNIST for classification and ISIC, LIDC-IDRI for segmentation, our approach achieves competitive performance and predictive uncertainty estimation by reducing Bayesian parameters by over 95%, significantly reducing computational expenses compared to fully Bayesian and ensemble methods.

[CV-79] Optimal Matrix-Mimetic Tensor Algebras via Variable Projection

Link: https://arxiv.org/abs/2406.06942
Authors: Elizabeth Newman,Katherine Keegan
Keywords: Recent advances, linear algebraic properties, properties for multilinear, obtain optimal representations, multilinear data analysis
Categories: Numerical Analysis (math.NA); Computer Vision and Pattern Recognition (cs.CV); Optimization and Control (math.OC)
Comments: 46 pages, 15 figures

Abstract:Recent advances in matrix-mimetic tensor frameworks have made it possible to preserve linear algebraic properties for multilinear data analysis and, as a result, to obtain optimal representations of multiway data. Matrix mimeticity arises from interpreting tensors as operators that can be multiplied, factorized, and analyzed analogous to matrices. Underlying the tensor operation is an algebraic framework parameterized by an invertible linear transformation. The choice of linear mapping is crucial to representation quality and, in practice, is made heuristically based on expected correlations in the data. However, in many cases, these correlations are unknown and common heuristics lead to suboptimal performance. In this work, we simultaneously learn optimal linear mappings and corresponding tensor representations without relying on prior knowledge of the data. Our new framework explicitly captures the coupling between the transformation and representation using variable projection. We preserve the invertibility of the linear mapping by learning orthogonal transformations with Riemannian optimization. We provide original theory of uniqueness of the transformation and convergence analysis of our variable-projection-based algorithm. We demonstrate the generality of our framework through numerical experiments on a wide range of applications, including financial index tracking, image compression, and reduced order modeling. We have published all the code related to this work at this https URL.

[CV-80] Synthetic Face Ageing: Evaluation Analysis and Facilitation of Age-Robust Facial Recognition Algorithms

Link: https://arxiv.org/abs/2406.06932
Authors: Wang Yao,Muhammad Ali Farooq,Joseph Lemley,Peter Corcoran
Keywords: public security bureaus, human aging factor, aging factor holds, factor holds significant, holds significant importance
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

Abstract:The ability to accurately recognize an individual’s face despite human ageing is of significant importance for various private and government sectors, such as customs and public security bureaus, passport offices, and national database systems. Therefore, developing a robust age-invariant face recognition system is crucial to address the challenges posed by ageing and maintain the reliability and accuracy of facial recognition technology. In this research work, the focus is to explore the feasibility of utilizing synthetic ageing data to improve the robustness of face recognition models, which can eventually help in recognizing people across broader age intervals. To achieve this, we first design a set of experiments to evaluate state-of-the-art synthetic ageing methods. In the next stage, we explore the effect of age intervals on a current deep learning-based face recognition algorithm by using synthetic ageing data as well as real ageing data to perform rigorous training and validation. Moreover, these synthetic ageing data have been used to facilitate face recognition algorithms. Experimental results show that the recognition rate of the model trained on synthetic ageing images is 3.33% higher than that of the baseline model when tested on images with an age gap of 40 years, which demonstrates the potential of synthetic ageing data to enhance the performance of age-invariant face recognition systems.

[CV-81] Explaining Representation Learning with Perceptual Components

Link: https://arxiv.org/abs/2406.06930
Authors: Yavuz Yarici, Kiran Kokilepersaud, Mohit Prabhushankar, Ghassan AlRegib
Keywords: clear semantic meaning, Self-supervised models create, lack clear semantic, Self-supervised models, semantic meaning
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 Pages, 3 Figures, Accepted to 2024 IEEE International Conference on Image Processing (ICIP), Abu Dhabi, United Arab Emirates (UAE). Date of Acceptance: June 6th, 2024

Abstract:Self-supervised models create representation spaces that lack clear semantic meaning. This interpretability problem of representations makes traditional explainability methods ineffective in this context. In this paper, we introduce a novel method to analyze representation spaces using three key perceptual components: color, shape, and texture. We employ selective masking of these components to observe changes in representations, resulting in distinct importance maps for each. In scenarios, where labels are absent, these importance maps provide more intuitive explanations as they are integral to the human visual system. Our approach enhances the interpretability of the representation space, offering explanations that resonate with human visual perception. We analyze how different training objectives create distinct representation spaces using perceptual components. Additionally, we examine the representation of images across diverse image domains, providing insights into the role of these components in different contexts.

[CV-82] AsyncDiff: Parallelizing Diffusion Models by Asynchronous Denoising

Link: https://arxiv.org/abs/2406.06911
Authors: Zigeng Chen, Xinyin Ma, Gongfan Fang, Zhenxiong Tan, Xinchao Wang
Keywords: garnered significant interest, great generative ability, garnered significant, significant interest, Diffusion models
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Work in progress. Project Page: this https URL

Abstract:Diffusion models have garnered significant interest from the community for their great generative ability across various applications. However, their typical multi-step sequential-denoising nature gives rise to high cumulative latency, thereby precluding the possibilities of parallel computation. To address this, we introduce AsyncDiff, a universal and plug-and-play acceleration scheme that enables model parallelism across multiple devices. Our approach divides the cumbersome noise prediction model into multiple components, assigning each to a different device. To break the dependency chain between these components, it transforms the conventional sequential denoising into an asynchronous process by exploiting the high similarity between hidden states in consecutive diffusion steps. Consequently, each component is facilitated to compute in parallel on separate devices. The proposed strategy significantly reduces inference latency while minimally impacting the generative quality. Specifically, for the Stable Diffusion v2.1, AsyncDiff achieves a 2.7x speedup with negligible degradation and a 4.0x speedup with only a slight reduction of 0.38 in CLIP Score, on four NVIDIA A5000 GPUs. Our experiments also demonstrate that AsyncDiff can be readily applied to video diffusion models with encouraging performances. The code is available at this https URL.

[CV-83] UVIS: Unsupervised Video Instance Segmentation

Link: https://arxiv.org/abs/2406.06908
Authors: Shuaiyi Huang, Saksham Suri, Kamal Gupta, Sai Saketh Rambhatla, Ser-nam Lim, Abhinav Shrivastava
Keywords: Video instance segmentation, instance segmentation requires, segmentation requires classifying, instance segmentation, Video instance
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR2024 Workshop

Abstract:Video instance segmentation requires classifying, segmenting, and tracking every object across video frames. Unlike existing approaches that rely on masks, boxes, or category labels, we propose UVIS, a novel Unsupervised Video Instance Segmentation (UVIS) framework that can perform video instance segmentation without any video annotations or dense label-based pretraining. Our key insight comes from leveraging the dense shape prior from the self-supervised vision foundation model DINO and the openset recognition ability from the image-caption supervised vision-language model CLIP. Our UVIS framework consists of three essential steps: frame-level pseudo-label generation, transformer-based VIS model training, and query-based tracking. To improve the quality of VIS predictions in the unsupervised setup, we introduce a dual-memory design. This design includes a semantic memory bank for generating accurate pseudo-labels and a tracking memory bank for maintaining temporal consistency in object tracks. We evaluate our approach on three standard VIS benchmarks, namely YoutubeVIS-2019, YoutubeVIS-2021, and Occluded VIS. Our UVIS achieves 21.1 AP on YoutubeVIS-2019 without any video annotations or dense pretraining, demonstrating the potential of our unsupervised VIS framework.

[CV-84] SignMusketeers: An Efficient Multi-Stream Approach for Sign Language Translation at Scale

Link: https://arxiv.org/abs/2406.06907
Authors: Shester Gueuwou, Xiaodan Du, Greg Shakhnarovich, Karen Livescu
Keywords: irrelevant visual differences, sign language, language video processing, written language translation, sign language video
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:A persistent challenge in sign language video processing, including the task of sign language to written language translation, is how we learn representations of sign language in an effective and efficient way that can preserve the important attributes of these languages, while remaining invariant to irrelevant visual differences. Informed by the nature and linguistics of signed languages, our proposed method focuses on just the most relevant parts in a signing video: the face, hands and body posture of the signer. However, instead of using pose estimation coordinates from off-the-shelf pose tracking models, which have inconsistent performance for hands and faces, we propose to learn the complex handshapes and rich facial expressions of sign languages in a self-supervised fashion. Our approach is based on learning from individual frames (rather than video sequences) and is therefore much more efficient than prior work on sign language pre-training. Compared to a recent model that established a new state of the art in sign language translation on the How2Sign dataset, our approach yields similar translation performance, using less than 3% of the compute.

[CV-85] Motion Consistency Model: Accelerating Video Diffusion with Disentangled Motion-Appearance Distillation

Link: https://arxiv.org/abs/2406.06890
Authors: Yuanhao Zhai, Kevin Lin, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Chung-Ching Lin, David Doermann, Junsong Yuan, Lijuan Wang
Keywords: video diffusion, video, diffusion, diffusion distillation, video diffusion distillation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Image diffusion distillation achieves high-fidelity generation with very few sampling steps. However, applying these techniques directly to video diffusion often results in unsatisfactory frame quality due to the limited visual quality in public video datasets. This affects the performance of both teacher and student video diffusion models. Our study aims to improve video diffusion distillation while improving frame appearance using abundant high-quality image data. We propose motion consistency model (MCM), a single-stage video diffusion distillation method that disentangles motion and appearance learning. Specifically, MCM includes a video consistency model that distills motion from the video teacher model, and an image discriminator that enhances frame appearance to match high-quality image data. This combination presents two challenges: (1) conflicting frame learning objectives, as video distillation learns from low-quality video frames while the image discriminator targets high-quality images; and (2) training-inference discrepancies due to the differing quality of video samples used during training and inference. To address these challenges, we introduce disentangled motion distillation and mixed trajectory distillation. The former applies the distillation objective solely to the motion representation, while the latter mitigates training-inference discrepancies by mixing distillation trajectories from both the low- and high-quality video domains. Extensive experiments show that our MCM achieves the state-of-the-art video diffusion distillation performance. Additionally, our method can enhance frame quality in video diffusion models, producing frames with high aesthetic scores or specific styles without corresponding video data.

[CV-86] Taxes Are All You Need: Integration of Taxonomical Hierarchy Relationships into the Contrastive Loss

Link: https://arxiv.org/abs/2406.06848
Authors: Kiran Kokilepersaud, Yavuz Yarici, Mohit Prabhushankar, Ghassan AlRegib
Keywords: supervised contrastive loss, representation learning process, contrastive loss, taxonomic hierarchy information, supervised contrastive
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted at IEEE International Conference on Image Processing

Abstract:In this work, we propose a novel supervised contrastive loss that enables the integration of taxonomic hierarchy information during the representation learning process. A supervised contrastive loss operates by enforcing that images with the same class label (positive samples) project closer to each other than images with differing class labels (negative samples). The advantage of this approach is that it directly penalizes the structure of the representation space itself. This enables greater flexibility with respect to encoding semantic concepts. However, the standard supervised contrastive loss only enforces semantic structure based on the downstream task (i.e. the class label). In reality, the class label is only one level of a hierarchy of different semantic relationships known as a taxonomy. For example, the class label is oftentimes the species of an animal, but between different classes there are higher order relationships such as all animals with wings being "birds". We show that by explicitly accounting for these relationships with a weighting penalty in the contrastive loss we can out-perform the supervised contrastive loss. Additionally, we demonstrate the adaptability of the notion of a taxonomy by integrating our loss into medical and noise-based settings that show performance improvements by as much as 7%.
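To make the idea of a taxonomy-aware weighting concrete, here is a minimal sketch of a supervised contrastive loss with one extra weight. The abstract does not give the paper's formulation, so everything specific here is an assumption for illustration: the function name `taxonomy_weighted_supcon`, the `sibling_weight` parameter, and the choice to down-weight negatives that share a superclass with the anchor. With `sibling_weight=1.0` the function reduces to the standard supervised contrastive loss.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def taxonomy_weighted_supcon(embeddings, labels, superclasses,
                             temperature=0.1, sibling_weight=0.5):
    """Supervised contrastive loss with a hypothetical taxonomy weight.

    Negatives that share the anchor's superclass ("siblings") contribute
    to the denominator with weight `sibling_weight` instead of 1.0.
    This weighting is an illustrative assumption, not the paper's loss.
    """
    n = len(embeddings)
    total, counted = 0.0, 0
    for i in range(n):
        positives = [p for p in range(n)
                     if p != i and labels[p] == labels[i]]
        if not positives:
            continue  # anchors without positives are skipped
        denom = 0.0
        for a in range(n):
            if a == i:
                continue
            sim = math.exp(dot(embeddings[i], embeddings[a]) / temperature)
            is_sibling = (labels[a] != labels[i]
                          and superclasses[a] == superclasses[i])
            denom += (sibling_weight if is_sibling else 1.0) * sim
        loss_i = 0.0
        for p in positives:
            num = math.exp(dot(embeddings[i], embeddings[p]) / temperature)
            loss_i -= math.log(num / denom)
        total += loss_i / len(positives)
        counted += 1
    return total / counted if counted else 0.0


# Toy data: two images of one bird species, one of another bird species,
# and one fish. Embeddings are unit vectors chosen by hand.
toy_embeddings = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [-1.0, 0.0]]
toy_labels = [0, 0, 1, 2]
toy_superclasses = ["bird", "bird", "bird", "fish"]
```

Calling the function twice on the toy data, once with `sibling_weight=1.0` and once with `0.5`, shows how the weight changes the penalty applied to taxonomically related negatives.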

[CV-87] Generalized W-Net: Arbitrary-style Chinese Character Synthesization

Link: https://arxiv.org/abs/2406.06847
Authors: Haochuan Jiang, Guanyu Yang, Fei Cheng, Kaizhu Huang
Keywords: Synthesizing Chinese characters, Synthesizing Chinese, Synthesizing, Adaptive Instance Normalization, Chinese characters
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Synthesizing Chinese characters with consistent style using few stylized examples is challenging. Existing models struggle to generate arbitrary style characters with limited examples. In this paper, we propose the Generalized W-Net, a novel class of W-shaped architectures that addresses this. By incorporating Adaptive Instance Normalization and introducing multi-content, our approach can synthesize Chinese characters in any desired style, even with limited examples. It handles seen and unseen styles during training and can generate new character contents. Experimental results demonstrate the effectiveness of our approach.
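The abstract names Adaptive Instance Normalization (AdaIN) as a building block. As a reminder of what that operation does in general, the sketch below applies the usual AdaIN formula to a single feature channel: the content features are normalized to zero mean and unit standard deviation, then rescaled and shifted to the style features' statistics. How the Generalized W-Net wires AdaIN into its architecture is not described in the abstract, so the function names and the flat-list representation here are assumptions for illustration only.

```python
import math


def mean_std(values, eps=1e-5):
    """Return the mean and (eps-stabilized) standard deviation of a channel."""
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)
    return mu, math.sqrt(var + eps)


def adain_channel(content, style, eps=1e-5):
    """AdaIN for one channel: sigma(style) * (content - mu(content)) / sigma(content) + mu(style)."""
    c_mu, c_sigma = mean_std(content, eps)
    s_mu, s_sigma = mean_std(style, eps)
    return [s_sigma * (c - c_mu) / c_sigma + s_mu for c in content]


# Example: the output keeps the content's ordering but takes on the
# style channel's mean and spread.
stylized = adain_channel([1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0])
```

In a real network this is applied independently to every channel of every sample, with the statistics taken over the spatial dimensions.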

[CV-88] HO-Cap: A Capture System and Dataset for 3D Reconstruction and Pose Tracking of Hand-Object Interaction

Link: https://arxiv.org/abs/2406.06843
Authors: Jikai Wang, Qifan Zhang, Yu-Wei Chao, Bowen Wen, Xiaohu Guo, Yu Xiang
Keywords: dataset named HO-Cap, data capture system, named HO-Cap, data capture, capture system
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:We introduce a data capture system and a new dataset named HO-Cap that can be used to study 3D reconstruction and pose tracking of hands and objects in videos. The capture system uses multiple RGB-D cameras and a HoloLens headset for data collection, avoiding the use of expensive 3D scanners or mocap systems. We propose a semi-automatic method to obtain annotations of shape and pose of hands and objects in the collected videos, which significantly reduces the required annotation time compared to manual labeling. With this system, we captured a video dataset of humans using objects to perform different tasks, as well as simple pick-and-place and handover of an object from one hand to the other, which can be used as human demonstrations for embodied AI and robot manipulation research. Our data capture setup and annotation framework can be used by the community to reconstruct 3D shapes of objects and human hands and track their poses in videos.

[CV-89] Adapters Strike Back

Link: https://arxiv.org/abs/2406.06820
Authors: Jan-Martin O. Steitz, Stefan Roth
Keywords: adapting trained transformer, trained transformer models, efficient and lightweight, adapting trained, trained transformer
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: To appear at CVPR 2024. Code: this https URL

Abstract:Adapters provide an efficient and lightweight mechanism for adapting trained transformer models to a variety of different tasks. However, they have often been found to be outperformed by other adaptation mechanisms, including low-rank adaptation. In this paper, we provide an in-depth study of adapters, their internal structure, as well as various implementation choices. We uncover pitfalls for using adapters and suggest a concrete, improved adapter architecture, called Adapter+, that not only outperforms previous adapter implementations but surpasses a number of other, more complex adaptation mechanisms in several challenging settings. Despite this, our suggested adapter is highly robust and, unlike previous work, requires little to no manual intervention when addressing a novel scenario. Adapter+ reaches state-of-the-art average accuracy on the VTAB benchmark, even without a per-task hyperparameter optimization.

[CV-90] Stable Neighbor Denoising for Source-free Domain Adaptive Segmentation

Link: https://arxiv.org/abs/2406.06813
Authors: Dong Zhao, Shuang Wang, Qi Zang, Licheng Jiao, Nicu Sebe, Zhun Zhong
Keywords: study source-free unsupervised, source-free unsupervised domain, source data, study source-free, source-free unsupervised
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 2024 Conference on Computer Vision and Pattern Recognition

Abstract:We study source-free unsupervised domain adaptation (SFUDA) for semantic segmentation, which aims to adapt a source-trained model to the target domain without accessing the source data. Many works have been proposed to address this challenging problem, among which uncertainty-based self-training is a predominant approach. However, without comprehensive denoising mechanisms, they still largely fall into biased estimates when dealing with different domains and confirmation bias. In this paper, we observe that pseudo-label noise is mainly contained in unstable samples in which the predictions of most pixels undergo significant variations during self-training. Inspired by this, we propose a novel mechanism to denoise unstable samples with stable ones. Specifically, we introduce the Stable Neighbor Denoising (SND) approach, which effectively discovers highly correlated stable and unstable samples by nearest neighbor retrieval and guides the reliable optimization of unstable samples by bi-level learning. Moreover, we compensate for the stable set by object-level object paste, which can further eliminate the bias caused by less learned classes. Our SND enjoys two advantages. First, SND does not require a specific segmentor structure, endowing its universality. Second, SND simultaneously addresses the issues of class, domain, and confirmation biases during adaptation, ensuring its effectiveness. Extensive experiments show that SND consistently outperforms state-of-the-art methods in various SFUDA semantic segmentation settings. In addition, SND can be easily integrated with other approaches, obtaining further improvements.

[CV-91] FlexLoc: Conditional Neural Networks for Zero-Shot Sensor Perspective Invariance in Object Localization with Distributed Multimodal Sensors

Link: https://arxiv.org/abs/2406.06796
Authors: Jason Wu, Ziqi Wang, Xiaomin Ouyang, Ho Lyun Jeong, Colin Samplawski, Lance Kaplan, Benjamin Marlin, Mani Srivastava
Keywords: assisted living, critical technology, applications ranging, ranging from navigation, navigation and surveillance
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO); Signal Processing (eess.SP)

Abstract:Localization is a critical technology for various applications ranging from navigation and surveillance to assisted living. Localization systems typically fuse information from sensors viewing the scene from different perspectives to estimate the target location while also employing multiple modalities for enhanced robustness and accuracy. Recently, such systems have employed end-to-end deep neural models trained on large datasets due to their superior performance and ability to handle data from diverse sensor modalities. However, such neural models are often trained on data collected from a particular set of sensor poses (i.e., locations and orientations). During real-world deployments, slight deviations from these sensor poses can result in extreme inaccuracies. To address this challenge, we introduce FlexLoc, which employs conditional neural networks to inject node perspective information to adapt the localization pipeline. Specifically, a small subset of model weights are derived from node poses at run time, enabling accurate generalization to unseen perspectives with minimal additional overhead. Our evaluations on a multimodal, multiview indoor tracking dataset showcase that FlexLoc improves the localization accuracy by almost 50% in the zero-shot case (no calibration data available) compared to the baselines. The source code of FlexLoc is available at this https URL.

[CV-92] MolX: Enhancing Large Language Models for Molecular Learning with A Multi-Modal Extension

Link: https://arxiv.org/abs/2406.06777
Authors: Khiem Le, Zhichun Guo, Kaiwen Dong, Xiaobao Huang, Bozhao Nan, Roshni Iyer, Xiangliang Zhang, Olaf Wiest, Wei Wang, Nitesh V. Chawla
Keywords: Large Language Models, natural language understanding, Large Language, strong task-handling capabilities, shown remarkable advancements
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Recently, Large Language Models (LLMs) with their strong task-handling capabilities have shown remarkable advancements across a spectrum of fields, moving beyond natural language understanding. However, their proficiency within the chemistry domain remains restricted, especially in solving professional molecule-related tasks. This challenge is attributed to their inherent limitations in comprehending molecules using only common textual representations, i.e., SMILES strings. In this study, we seek to enhance the ability of LLMs to comprehend molecules by designing and equipping them with a multi-modal external module, namely MolX. In particular, instead of directly using a SMILES string to represent a molecule, we utilize specific encoders to extract fine-grained features from both SMILES string and 2D molecular graph representations for feeding into an LLM. Moreover, a human-defined molecular fingerprint is incorporated to leverage its embedded domain knowledge. Then, to establish an alignment between MolX and the LLM’s textual input space, the whole model in which the LLM is frozen, is pre-trained with a versatile strategy including a diverse set of tasks. Extensive experimental evaluations demonstrate that our proposed method only introduces a small number of trainable parameters while outperforming baselines on various downstream molecule-related tasks ranging from molecule-to-text translation to retrosynthesis, with and without fine-tuning the LLM.

[CV-93] SeeFar: Satellite Agnostic Multi-Resolution Dataset for Geospatial Foundation Models

Link: https://arxiv.org/abs/2406.06776
Authors: James Lowman, Kelly Liu Zheng, Roydon Fraser, Jesse Van Griensven The, Mojtaba Valipour
Keywords: evolving collection, satellite, satellite imagery, data, satellites
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Work in Progress!

Abstract:SeeFar is an evolving collection of multi-resolution satellite images from public and commercial satellites. We specifically curated this dataset for training geospatial foundation models, unconstrained by satellite type. In recent years, advances in technology have made satellite imagery more accessible than ever. More earth-observing satellites have been launched in the last five years than in the previous fifty. Modern commercial satellites now offer up to 100 times the spatial resolution of public access satellites. However, the high cost and limited historical availability of commercial satellite imagery is a barrier to the training of foundational models, impacting what images can be used during inference. The SeeFar dataset represents a step towards training models that are satellite-agnostic by combining multi-resolution commercial and public access pre-processed images. This will enable users to utilize historical data alongside higher-resolution, more expensive satellite imagery, offering greater flexibility during inference. To achieve this, we describe a process for standardizing data from diverse satellite sources, normalizing different data formats, and aligning spectral bands to enhance interoperability. The SeeFar dataset includes images at a resolution of 384x384 pixels, spanning four spectral bands (Blue, Green, Red, and Near-Infrared) and expanding spatial resolutions (starting with 30, 10, 1.5, and 1.0 meters), all in cloud-optimized GeoTIFF format. It also provides consistent and comprehensive metadata to enhance data transparency and reliability. By aggregating data from multiple sources, SeeFar makes processed and consistent satellite data accessible to a wider range of users - from researchers to policymakers - fostering competition and innovation in satellite imagery analysis. The dataset is available at this http URL.

[CV-94] An Elliptic Kernel Unsupervised Autoencoder-Graph Convolutional Network Ensemble Model for Hyperspectral Unmixing

Link: https://arxiv.org/abs/2406.06742
Authors: Estefania Alfaro-Mejia, Carlos J Delgado, Vidya Manian
Keywords: analyze hyperspectral images, abundance maps, Spectral Unmixing, estimate abundance maps, abundance
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Comments: 13 pages, 13 figures, Transaction in Geoscience

Abstract:Spectral Unmixing is an important technique in remote sensing used to analyze hyperspectral images to identify endmembers and estimate abundance maps. Over the past few decades, performance of techniques for endmember extraction and fractional abundance map estimation have significantly improved. This article presents an ensemble model workflow called Autoencoder Graph Ensemble Model (AEGEM) designed to extract endmembers and fractional abundance maps. An elliptical kernel is applied to measure spectral distances, generating the adjacency matrix within the elliptical neighborhood. This information is used to construct an elliptical graph, with centroids as senders and remaining pixels within the geometry as receivers. The next step involves stacking abundance maps, senders, and receivers as inputs to a Graph Convolutional Network, which processes this input to refine abundance maps. Finally, an ensemble decision-making process determines the best abundance maps based on root mean square error metric. The proposed AEGEM is assessed with benchmark datasets such as Samson, Jasper, and Urban, outperforming results obtained by baseline algorithms. For the Samson dataset, AEGEM excels in three abundance maps: water, tree and soil yielding values of 0.081, 0.158, and 0.182, respectively. For the Jasper dataset, results are improved for the tree and water endmembers with values of 0.035 and 0.060 in that order, as well as for the mean average of the spectral angle distance metric 0.109. For the Urban dataset, AEGEM outperforms previous results for the abundance maps of roof and asphalt, achieving values of 0.135 and 0.240, respectively. Additionally, for the endmembers of grass and roof, AEGEM achieves values of 0.063 and 0.094.
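The results above are reported with two metrics, root mean square error for abundance maps and spectral angle distance for endmembers. For readers unfamiliar with them, the sketch below gives the textbook definitions on plain Python lists. The function names are illustrative, and the paper's exact evaluation protocol (for example, how maps are flattened or averaged) is not stated in the abstract, so this should not be read as the authors' evaluation code.

```python
import math


def rmse(estimated, reference):
    """Root mean square error between two equal-length sequences."""
    squared = sum((e - r) ** 2 for e, r in zip(estimated, reference))
    return math.sqrt(squared / len(reference))


def spectral_angle_distance(a, b):
    """Angle in radians between two spectra treated as vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Clamp to [-1, 1] to guard against floating-point overshoot.
    cosine = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.acos(cosine)
```

Lower is better for both: an RMSE of 0 means the estimated abundances match the reference exactly, and a spectral angle of 0 means the two spectra differ only by a scale factor.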

[CV-95] TRINS: Towards Multimodal Language Models that Can Read

Link: https://arxiv.org/abs/2406.06730
Authors: Ruiyi Zhang, Yanzhe Zhang, Jian Chen, Yufan Zhou, Jiuxiang Gu, Changyou Chen, Tong Sun
Keywords: shown remarkable proficiency, multimodal large language, shown remarkable, remarkable proficiency, large language
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: CVPR 2024

Abstract:Large multimodal language models have shown remarkable proficiency in understanding and editing images. However, a majority of these visually-tuned models struggle to comprehend the textual content embedded in images, primarily due to the limitation of training data. In this work, we introduce TRINS: a Text-Rich image INStruction dataset, with the objective of enhancing the reading ability of the multimodal large language model. TRINS is built upon LAION using hybrid data annotation strategies that include machine-assisted and human-assisted annotation processes. It contains 39,153 text-rich images, captions, and 102,437 questions. Specifically, we show that the number of words per annotation in TRINS is significantly longer than that of related datasets, providing new challenges. Furthermore, we introduce a simple and effective architecture, called a Language-vision Reading Assistant (LaRA), which is good at understanding textual content within images. LaRA outperforms existing state-of-the-art multimodal large language models on the TRINS dataset, as well as other classical benchmarks. Lastly, we conducted a comprehensive evaluation with TRINS on various text-rich image understanding and generation tasks, demonstrating its effectiveness.

[CV-96] Video-based Exercise Classification and Activated Muscle Group Prediction with Hybrid X3D-SlowFast Network

Link: https://arxiv.org/abs/2406.06703
Authors: Manvik Pasula, Pramit Saha
Keywords: group activation prediction, activation prediction, paper introduces, introduces a simple, muscle group activation
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 16 pages, 7 figures, submitted to IEEE Open Journal of the Computer Society

Abstract:This paper introduces a simple yet effective strategy for exercise classification and muscle group activation prediction (MGAP). These tasks have significant implications for personal fitness, facilitating more affordable, accessible, safer, and simpler exercise routines. This is particularly relevant for novices and individuals with disabilities. Previous research in the field is mostly dominated by the reliance on mounted sensors and a limited scope of exercises, reducing practicality for everyday use. Furthermore, existing MGAP methodologies suffer from a similar dependency on sensors and a restricted range of muscle groups, often excluding strength training exercises, which are pivotal for a comprehensive fitness regimen. Addressing these limitations, our research employs a video-based deep learning framework that encompasses a broad spectrum of exercises and muscle groups, including those vital for strength training. Utilizing the “Workout/Exercises Video” dataset, our approach integrates the X3D and SlowFast video activity recognition models in an effective way to enhance exercise classification and MGAP performance. Our findings demonstrate that this hybrid method obtained via weighted ensemble outperforms existing baseline models in accuracy. Pretrained models play a crucial role in enhancing overall performance, with optimal channel reduction values for the SlowFast model identified near 10. Through an ablation study that explores fine-tuning, we further elucidate the interrelation between the two tasks. Our composite model, a weighted-average ensemble of X3D and SlowFast, sets a new benchmark in both exercise classification and MGAP across all evaluated categories, offering a robust solution to the limitations of previous approaches.
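The composite model is described as a weighted-average ensemble of X3D and SlowFast. The sketch below shows what such an ensemble looks like at the output level, assuming each model produces one probability per class. The function names, the single scalar `weight_a`, and the example probabilities are hypothetical; the abstract does not state the weights the authors used or how they were chosen.

```python
def weighted_ensemble(probs_a, probs_b, weight_a=0.5):
    """Blend two per-class probability vectors with weights weight_a and 1 - weight_a."""
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must lie in [0, 1]")
    if len(probs_a) != len(probs_b):
        raise ValueError("both models must score the same classes")
    return [weight_a * pa + (1.0 - weight_a) * pb
            for pa, pb in zip(probs_a, probs_b)]


def predict(probs):
    """Index of the highest-scoring class."""
    return max(range(len(probs)), key=probs.__getitem__)


# Hypothetical outputs of the two models for one video and three exercises.
x3d_probs = [0.7, 0.2, 0.1]
slowfast_probs = [0.2, 0.5, 0.3]
blended = weighted_ensemble(x3d_probs, slowfast_probs, weight_a=0.6)
```

For a multi-label task such as muscle group activation, the same blending would apply per label, followed by a threshold instead of an argmax.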

[CV-97] PatchRefiner: Leveraging Synthetic Data for Real-Domain High-Resolution Monocular Metric Depth Estimation

Link: https://arxiv.org/abs/2406.06679
Authors: Zhenyu Li, Shariq Farooq Bhat, Peter Wonka
Keywords: metric single image, high-resolution real-domain inputs, single image depth, paper introduces PatchRefiner, depth estimation aimed
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:This paper introduces PatchRefiner, an advanced framework for metric single image depth estimation aimed at high-resolution real-domain inputs. While depth estimation is crucial for applications such as autonomous driving, 3D generative modeling, and 3D reconstruction, achieving accurate high-resolution depth in real-world scenarios is challenging due to the constraints of existing architectures and the scarcity of detailed real-world depth data. PatchRefiner adopts a tile-based methodology, reconceptualizing high-resolution depth estimation as a refinement process, which results in notable performance enhancements. Utilizing a pseudo-labeling strategy that leverages synthetic data, PatchRefiner incorporates a Detail and Scale Disentangling (DSD) loss to enhance detail capture while maintaining scale accuracy, thus facilitating the effective transfer of knowledge from synthetic to real-world data. Our extensive evaluations demonstrate PatchRefiner’s superior performance, significantly outperforming existing benchmarks on the Unreal4KStereo dataset by 18.1% in terms of the root mean squared error (RMSE) and showing marked improvements in detail accuracy and consistent scale estimation on diverse real-world datasets like CityScape, ScanNet++, and ETH3D.

[CV-98] SEE-2-SOUND: Zero-Shot Spatial Environment-to-Spatial Sound

Link: https://arxiv.org/abs/2406.06612
Authors: Rishit Dagli, Shivesh Prakash, Robert Wu, Houman Khosravani
Keywords: auditory sensory experiences, auditory sensory, spatial audio, Generating combined visual, audio
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Project Page: this https URL

Abstract:Generating combined visual and auditory sensory experiences is critical for the consumption of immersive content. Recent advances in neural generative models have enabled the creation of high-resolution content across multiple modalities such as images, text, speech, and videos. Despite these successes, there remains a significant gap in the generation of high-quality spatial audio that complements generated visual content. Furthermore, current audio generation models excel in either generating natural audio or speech or music but fall short in integrating spatial audio cues necessary for immersive experiences. In this work, we introduce SEE-2-SOUND, a zero-shot approach that decomposes the task into (1) identifying visual regions of interest; (2) locating these elements in 3D space; (3) generating mono-audio for each; and (4) integrating them into spatial audio. Using our framework, we demonstrate compelling results for generating spatial audio for high-quality videos, images, and dynamic images from the internet, as well as media generated by learned approaches.

[CV-99] Ameliorate Spurious Correlations in Dataset Condensation

Link: https://arxiv.org/abs/2406.06609
Authors: Justin Cui, Ruochen Wang, Yuanhao Xiong, Cho-Jui Hsieh
Keywords: facilitating downstream training, downstream training tasks, compressing large datasets, smaller synthetic counterparts, Dataset Condensation
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: ICML

Abstract:Dataset Condensation has emerged as a technique for compressing large datasets into smaller synthetic counterparts, facilitating downstream training tasks. In this paper, we study the impact of bias inside the original dataset on the performance of dataset condensation. With a comprehensive empirical evaluation on canonical datasets with color, corruption and background biases, we found that color and background biases in the original dataset will be amplified through the condensation process, resulting in a notable decline in the performance of models trained on the condensed dataset, while corruption bias is suppressed through the condensation process. To reduce bias amplification in dataset condensation, we introduce a simple yet highly effective approach based on a sample reweighting scheme utilizing kernel density estimation. Empirical results on multiple real-world and synthetic datasets demonstrate the effectiveness of the proposed method. Notably, on CMNIST with 5% bias-conflict ratio and IPC 50, our method achieves 91.5% test accuracy compared to 23.8% from vanilla DM, boosting the performance by 67.7%, whereas applying state-of-the-art debiasing method on the same dataset only achieves 53.7% accuracy. Our findings highlight the importance of addressing biases in dataset condensation and provide a promising avenue to address bias amplification in the process.
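
The abstract mentions sample reweighting based on kernel density estimation but does not spell out the scheme. As a rough illustration only, the sketch below assumes inverse-density weighting over a one-dimensional feature, so that rare (for example, bias-conflicting) samples receive larger weights. The function names, the Gaussian kernel, the bandwidth, and the inverse-density rule are all assumptions, not details from the paper.

```python
import math

def gaussian_kde(points, bandwidth):
    """Return a 1-D Gaussian kernel density estimate built from `points`."""
    def density(x):
        k = sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points)
        return k / (len(points) * bandwidth * math.sqrt(2 * math.pi))
    return density

def inverse_density_weights(features, bandwidth=0.5):
    """Weight each sample by 1/density, normalised to sum to 1 (assumed rule)."""
    density = gaussian_kde(features, bandwidth)
    raw = [1.0 / density(x) for x in features]
    total = sum(raw)
    return [w / total for w in raw]

# Hypothetical features: four samples cluster together, one is an outlier.
features = [0.0, 0.1, 0.2, 0.1, 3.0]
weights = inverse_density_weights(features)
print(weights)
```

In this toy setup the isolated sample at 3.0 ends up with the largest weight, which is the qualitative behaviour a debiasing reweighting scheme would aim for.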

[CV-100] 1-D CNN-Based Online Signature Verification with Federated Learning

Link: https://arxiv.org/abs/2406.06597
Authors: Lingfeng Zhang, Yuheng Guo, Yepeng Ding, Hiroyuki Sato
Keywords: Online signature verification, signature verification plays, Online signature, Convolutional Neural Networks, signature verification
Subjects: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 8 pages, 11 figures, 1 table

Abstract:Online signature verification plays a pivotal role in security infrastructures. However, conventional online signature verification models pose significant risks to data privacy, especially during training processes. To mitigate these concerns, we propose a novel federated learning framework that leverages 1-D Convolutional Neural Networks (CNN) for online signature verification. Furthermore, our experiments demonstrate the effectiveness of our framework regarding 1-D CNN and federated learning. Particularly, the experiment results highlight that our framework 1) minimizes local computational resources; 2) enhances transfer effects with substantial initialization data; 3) presents remarkable scalability. The centralized 1-D CNN model achieves an Equal Error Rate (EER) of 3.33% and an accuracy of 96.25%. Meanwhile, configurations with 2, 5, and 10 agents yield EERs of 5.42%, 5.83%, and 5.63%, along with accuracies of 95.21%, 94.17%, and 94.06%, respectively.
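
For readers unfamiliar with the Equal Error Rate (EER) quoted above, the sketch below shows one common way to estimate it from verification scores: sweep a decision threshold and take the point where the false acceptance rate and the false rejection rate are closest. The function name `equal_error_rate` and the toy scores are illustrative assumptions and are not taken from the paper.

```python
def equal_error_rate(genuine, forged):
    """Approximate the EER by sweeping thresholds over all observed scores.

    Higher scores are assumed to mean "more likely genuine".
    """
    best_gap, eer = float("inf"), None
    for t in sorted(set(genuine) | set(forged)):
        # False rejection: genuine samples scored below the threshold.
        frr = sum(s < t for s in genuine) / len(genuine)
        # False acceptance: forged samples scored at or above the threshold.
        far = sum(s >= t for s in forged) / len(forged)
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

genuine_scores = [0.9, 0.8, 0.75, 0.6, 0.4]  # hypothetical genuine signatures
forged_scores = [0.7, 0.5, 0.3, 0.2, 0.1]    # hypothetical forgeries
eer = equal_error_rate(genuine_scores, forged_scores)
print(eer)
```

With these toy scores the two error rates meet at a threshold of 0.6, where one genuine sample in five is rejected and one forgery in five is accepted.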

[CV-101] From Redundancy to Relevance: Enhancing Explainability in Multimodal Large Language Models

Link: https://arxiv.org/abs/2406.06579
Authors: Xiaofeng Zhang, Chen Shen, Xiaosong Yuan, Shaotian Yan, Liang Xie, Wenxiao Wang, Chaochen Gu, Hao Tang, Jieping Ye
Keywords: Large Vision Language, multimodal large language, large language models, Vision Language Models, popular Large Vision
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Recently, multimodal large language models have proliferated in seemingly endless variety. Most of the popular Large Vision Language Models (LVLMs) depend on sequential visual representation, where images are converted into hundreds or thousands of tokens before being input into the Large Language Model (LLM) along with language prompts. The black-box design hinders the interpretability of visual-language models, especially regarding more complex reasoning tasks. To explore the interaction process between image and text in complex reasoning tasks, we introduce the information flow method to visualize the interaction mechanism. By analyzing the dynamics of the information flow, we find that the information flow appears to converge in the shallow layer. Further investigation revealed a redundancy of the image token in the shallow layer. Consequently, a truncation strategy was introduced to aggregate image tokens within these shallow layers. This approach has been validated through experiments across multiple models, yielding consistent improvements.

[CV-102] MatFusion: A Generative Diffusion Model for SVBRDF Capture

Link: https://arxiv.org/abs/2406.06539
Authors: Sam Sartor, Pieter Peers
Keywords: SVBRDF diffusion backbone, SVBRDF diffusion, SVBRDF diffusion models, formulate SVBRDF estimation, SVBRDF
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Abstract:We formulate SVBRDF estimation from photographs as a diffusion task. To model the distribution of spatially varying materials, we first train a novel unconditional SVBRDF diffusion backbone model on a large set of 312,165 synthetic spatially varying material exemplars. This SVBRDF diffusion backbone model, named MatFusion, can then serve as a basis for refining a conditional diffusion model to estimate the material properties from a photograph under controlled or uncontrolled lighting. Our backbone MatFusion model is trained using only a loss on the reflectance properties, and therefore refinement can be paired with more expensive rendering methods without the need for backpropagation during training. Because the conditional SVBRDF diffusion models are generative, we can synthesize multiple SVBRDF estimates from the same input photograph from which the user can select the one that best matches the users’ expectation. We demonstrate the flexibility of our method by refining different SVBRDF diffusion models conditioned on different types of incident lighting, and show that for a single photograph under colocated flash lighting our method achieves equal or better accuracy than existing SVBRDF estimation methods.

[CV-103] Understanding attention-based encoder-decoder networks: a case study with chess scoresheet recognition

Link: https://arxiv.org/abs/2406.06538
Authors: Sergio Y. Hayashi, Nina S. T. Hirata
Keywords: Deep neural networks, Deep neural, complex prediction tasks, Deep, complex prediction
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: This work was accepted and published in the 2022 26th International Conference on Pattern Recognition (ICPR)

Abstract:Deep neural networks are largely used for complex prediction tasks. There is plenty of empirical evidence of their successful end-to-end training for a diversity of tasks. Success is often measured based solely on the final performance of the trained network, and explanations on when, why and how they work are less emphasized. In this paper we study encoder-decoder recurrent neural networks with attention mechanisms for the task of reading handwritten chess scoresheets. Rather than prediction performance, our concern is to better understand how learning occurs in this type of network. We characterize the task in terms of three subtasks, namely input-output alignment, sequential pattern recognition, and handwriting recognition, and experimentally investigate which factors affect their learning. We identify competition, collaboration and dependence relations between the subtasks, and argue that such knowledge might help one to better balance factors to properly train a network.

[CV-104] Utilizing Graph Generation for Enhanced Domain Adaptive Object Detection

Link: https://arxiv.org/abs/2406.06535
Authors: Mu Wang
Keywords: Object Detection involves, object detection models, Object Detection, transfer of object, Detection involves
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:The problem of Domain Adaptive in the field of Object Detection involves the transfer of object detection models from labeled source domains to unannotated target domains. Recent advancements in this field aim to address domain discrepancies by aligning pixel-pairs across domains within a non-Euclidean graphical space, thereby minimizing semantic distribution variance. Despite their remarkable achievements, these methods often use coarse semantic representations to model graphs, mainly due to ignoring non-informative elements and failing to focus on precise semantic alignment. Additionally, the generation of coarse graphs inherently introduces abnormal nodes, posing challenges and potentially biasing domain adaptation outcomes. Consequently, we propose a framework, which utilizes the Graph Generation to enhance the quality of DAOD (\method). Specifically, we introduce a Node Refinement module that utilizes a memory bank to reconstruct noisy sampled nodes while applying contrastive regularization to noisy features. To enhance semantic alignment, we propose separating domain-specific styles from category invariance encoded within graph covariances, which allows us to selectively remove domain-specific styles while preserving category-invariant information, thus facilitating more accurate semantic alignment across different domains. Furthermore, we propose a Graph Optimization adaptor, leveraging variational inference to mitigate the impact of abnormal nodes. Extensive experimentation across three adaptation benchmarks validates that \method achieves state-of-the-art performance in the task of unsupervised domain adaptation.

[CV-105] Compressed Meta-Optical Encoder for Image Classification

Link: https://arxiv.org/abs/2406.06534
Authors: Anna Wirth-Singh, Jinlin Xiang, Minho Choi, Johannes E. Fröch, Luocheng Huang, Shane Colburn, Eli Shlizerman, Arka Majumdar
Keywords: computer vision tasks, low-power image classification, achieve low-latency, low-power image, vision tasks
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Optics (physics.optics)

Abstract:Optical and hybrid convolutional neural networks (CNNs) recently have become of increasing interest to achieve low-latency, low-power image classification and computer vision tasks. However, implementing optical nonlinearity is challenging, and omitting the nonlinear layers in a standard CNN comes at a significant reduction in accuracy. In this work, we use knowledge distillation to compress modified AlexNet to a single linear convolutional layer and an electronic backend (two fully connected layers). We obtain comparable performance to a purely electronic CNN with five convolutional layers and three fully connected layers. We implement the convolution optically via engineering the point spread function of an inverse-designed meta-optic. Using this hybrid approach, we estimate a reduction in multiply-accumulate operations from 688M in a conventional electronic modified AlexNet to only 86K in the hybrid compressed network enabled by the optical frontend. This constitutes a four orders of magnitude reduction in latency and power consumption. Furthermore, we experimentally demonstrate that the classification accuracy of the system exceeds 93% on the MNIST dataset.
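
As a quick sanity check on the figures quoted above, the arithmetic below computes the ratio between the two multiply-accumulate (MAC) counts and expresses it in orders of magnitude. It only restates the numbers from the abstract; the variable names are illustrative.

```python
import math

macs_electronic = 688e6  # conventional electronic modified AlexNet (from the abstract)
macs_hybrid = 86e3       # hybrid compressed network (from the abstract)

ratio = macs_electronic / macs_hybrid
orders = math.log10(ratio)
print(f"reduction factor: {ratio:.0f}x, about {orders:.2f} orders of magnitude")
```

The MAC count drops by a factor of 8000, which is slightly under four orders of magnitude; the abstract's "four orders of magnitude" is therefore best read as a rounded figure.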

[CV-106] DERM12345: A Large Multisource Dermatoscopic Skin Lesion Dataset with 38 Subclasses

Link: https://arxiv.org/abs/2406.07426
Authors: Abdurrahim Yilmaz, Sirin Pekcan Yasar, Gulsum Gencoglan, Burak Temelkuran
Keywords: effective diagnostic tools, provide essential information, developing effective diagnostic, datasets provide essential, skin lesions
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 2 figures, 1 table

Abstract:Skin lesion datasets provide essential information for understanding various skin conditions and developing effective diagnostic tools. They aid the artificial intelligence-based early detection of skin cancer, facilitate treatment planning, and contribute to medical education and research. Published large datasets only partially cover the subclassifications of skin lesions. This limitation highlights the need for more expansive and varied datasets to reduce false predictions and help improve the failure analysis for skin lesions. This study presents a diverse dataset comprising 12,345 dermatoscopic images with 38 subclasses of skin lesions collected in Turkiye, which comprises different skin types in the transition zone between Europe and Asia. Each subgroup contains high-resolution photos and expert annotations, providing a strong and reliable basis for future research. The detailed analysis of each subgroup provided in this study facilitates targeted research endeavors and enhances the depth of understanding regarding the skin lesions. This dataset distinguishes itself through a diverse structure with 5 super classes, 15 main classes, 38 subclasses and its 12,345 high-resolution dermatoscopic images.

[CV-107] Triage of 3D pathology data via 2.5D multiple-instance learning to guide pathologist assessments

Link: https://arxiv.org/abs/2406.07061
Authors: Gan Gao, Andrew H. Song, Fiona Wang, David Brenes, Rui Wang, Sarah S.L. Chow, Kevin W. Bishop, Lawrence D. True, Faisal Mahmood, Jonathan T.C. Liu
Keywords: human tissue biopsies, current clinical practice, patient diagnoses based, Accurate patient diagnoses, number of thin
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR CVMI 2024

Abstract:Accurate patient diagnoses based on human tissue biopsies are hindered by current clinical practice, where pathologists assess only a limited number of thin 2D tissue slices sectioned from 3D volumetric tissue. Recent advances in non-destructive 3D pathology, such as open-top light-sheet microscopy, enable comprehensive imaging of spatially heterogeneous tissue morphologies, offering the feasibility to improve diagnostic determinations. A potential early route towards clinical adoption for 3D pathology is to rely on pathologists for final diagnosis based on viewing familiar 2D HE-like image sections from the 3D datasets. However, manual examination of the massive 3D pathology datasets is infeasible. To address this, we present CARP3D, a deep learning triage approach that automatically identifies the highest-risk 2D slices within 3D volumetric biopsy, enabling time-efficient review by pathologists. For a given slice in the biopsy, we estimate its risk by performing attention-based aggregation of 2D patches within each slice, followed by pooling of the neighboring slices to compute a context-aware 2.5D risk score. For prostate cancer risk stratification, CARP3D achieves an area under the curve (AUC) of 90.4% for triaging slices, outperforming methods relying on independent analysis of 2D sections (AUC=81.3%). These results suggest that integrating additional depth context enhances the model’s discriminative capabilities. In conclusion, CARP3D has the potential to improve pathologist diagnosis via accurate triage of high-risk slices within large-volume 3D pathology datasets.
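
The two-stage scoring described above (attention-based aggregation of patches within a slice, then pooling over neighboring slices) can be sketched roughly as follows. This is a simplified illustration, not the CARP3D implementation: it assumes each patch already carries a scalar risk and an attention logit, and it assumes plain mean pooling over a fixed neighborhood. The function names and the `radius` parameter are hypothetical.

```python
import math

def attention_pool(patch_risks, patch_logits):
    """Softmax-weighted average of per-patch risks within one slice."""
    m = max(patch_logits)  # subtract the max for numerical stability
    weights = [math.exp(a - m) for a in patch_logits]
    total = sum(weights)
    return sum(w * r for w, r in zip(weights, patch_risks)) / total

def risk_2p5d(slice_scores, radius=1):
    """Average each slice score with its neighbours along the depth axis."""
    out = []
    for i in range(len(slice_scores)):
        lo, hi = max(0, i - radius), min(len(slice_scores), i + radius + 1)
        window = slice_scores[lo:hi]
        out.append(sum(window) / len(window))
    return out

# Hypothetical volume: each slice is (patch risks, patch attention logits).
volume = [
    ([0.1, 0.2], [0.0, 0.0]),
    ([0.9, 0.3], [2.0, 0.0]),
    ([0.2, 0.4], [0.0, 0.0]),
]
slice_scores = [attention_pool(risks, logits) for risks, logits in volume]
scores_2p5d = risk_2p5d(slice_scores)
print(scores_2p5d)
```

Ranking slices by `scores_2p5d` rather than `slice_scores` is what lets depth context influence which slices are flagged for review.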

[CV-108] Predicting the risk of early-stage breast cancer recurrence using HE-stained tissue images

Link: https://arxiv.org/abs/2406.06650
Authors: Geongyu Lee, Joonho Lee, Tae-Yeong Kwak, Sun Woo Kim, Youngmee Kwon, Chungyeul Kim, Hyeyoon Chang
Keywords: early-stage breast cancer, Accurate prediction, selection of postoperative, postoperative treatment, treatment for patients
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 7 figures

Abstract:Accurate prediction of the likelihood of recurrence is important in the selection of postoperative treatment for patients with early-stage breast cancer. In this study, we investigated whether deep learning algorithms can predict patients’ risk of recurrence by analyzing the pathology images of their cancer histology. A total of 125 hematoxylin and eosin stained breast cancer whole slide images labeled with the risk prediction via genomics assays were used, and we obtained sensitivity of 0.857, 0.746, and 0.529 for predicting low, intermediate, and high risk, and specificity of 0.816, 0.803, and 0.972. When compared to the expert pathologist’s regional histology grade information, a Pearson’s correlation coefficient of 0.61 was obtained. When we checked the model learned through these studies through the class activation map, we found that it actually considered tubule formation and mitotic rate when predicting different risk groups.
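
The per-class sensitivity and specificity values reported above are one-vs-rest metrics. A minimal sketch of how such values are computed from predicted and true risk groups is given below; the labels are made up for illustration and do not come from the study.

```python
def sensitivity_specificity(y_true, y_pred, positive):
    """One-vs-rest sensitivity and specificity for the class `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical labels, for illustration only.
y_true = ["low", "low", "intermediate", "intermediate", "high", "high"]
y_pred = ["low", "intermediate", "intermediate", "intermediate", "high", "low"]
for cls in ("low", "intermediate", "high"):
    print(cls, sensitivity_specificity(y_true, y_pred, cls))
```

A class can show low sensitivity alongside high specificity, as the high-risk group does in the abstract (0.529 and 0.972), when the model rarely predicts that class.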

[CV-109] 2DQuant: Low-bit Post-Training Quantization for Image Super-Resolution

Link: https://arxiv.org/abs/2406.06649
Authors: Kai Liu, Haotong Qin, Yong Guo, Xin Yuan, Linghe Kong, Guihai Chen, Yulun Zhang
Keywords: Low-bit quantization, compressing image super-resolution, compact low-bit parameters, enjoy compact low-bit, edge deployment
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 9 pages, 6 figures. The code and models will be available at this https URL

Abstract:Low-bit quantization has become widespread for compressing image super-resolution (SR) models for edge deployment, which allows advanced SR models to enjoy compact low-bit parameters and efficient integer/bitwise constructions for storage compression and inference acceleration, respectively. However, it is notorious that low-bit quantization degrades the accuracy of SR models compared to their full-precision (FP) counterparts. Despite several efforts to alleviate the degradation, the transformer-based SR model still suffers severe degradation due to its distinctive activation distribution. In this work, we present a dual-stage low-bit post-training quantization (PTQ) method for image super-resolution, namely 2DQuant, which achieves efficient and accurate SR under low-bit quantization. The proposed method first investigates the weight and activation and finds that the distribution is characterized by coexisting symmetry and asymmetry, long tails. Specifically, we propose Distribution-Oriented Bound Initialization (DOBI), using different searching strategies to search a coarse bound for quantizers. To obtain refined quantizer parameters, we further propose Distillation Quantization Calibration (DQC), which employs a distillation approach to make the quantized model learn from its FP counterpart. Through extensive experiments on different bits and scaling factors, the performance of DOBI can reach the state-of-the-art (SOTA) while after stage two, our method surpasses existing PTQ in both metrics and visual effects. 2DQuant gains an increase in PSNR as high as 4.52dB on Set5 (x2) compared with SOTA when quantized to 2-bit and enjoys a 3.60x compression ratio and 5.08x speedup ratio. The code and models will be available at this https URL.

[CV-110] Interactive Generation of Laparoscopic Videos with Diffusion Models

Link: https://arxiv.org/abs/2406.06537
Authors: Ivan Iliash (1), Simeon Allmendinger (2), Felix Meissen (1), Niklas Kühl (2), Daniel Rückert (1) ((1) Technical University of Munich, (2) University of Bayreuth)
Keywords: synthetic visual data, visual data generation, benefiting surgical training, hold much promise, simulation environments
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 7 pages, 4 figures

Abstract:Generative AI, in general, and synthetic visual data generation, in particular, hold much promise for benefiting surgical training by providing photorealism to simulation environments. Current training methods primarily rely on reading materials and observing live surgeries, which can be time-consuming and impractical. In this work, we take a significant step towards improving the training process. Specifically, we use diffusion models in combination with a zero-shot video diffusion method to interactively generate realistic laparoscopic images and videos by specifying a surgical action through text and guiding the generation with tool positions through segmentation masks. We demonstrate the performance of our approach using the publicly available Cholec dataset family and evaluate the fidelity and factual correctness of our generated images using a surgical action recognition model as well as the pixel-wise F1-score for the spatial control of tool generation. We achieve an FID of 38.097 and an F1-score of 0.71.

Machine Learning

[LG-0] Image and Video Tokenization with Binary Spherical Quantization

Link: https://arxiv.org/abs/2406.07548
Authors: Yue Zhao, Yuanjun Xiong, Philipp Krähenbühl
Keywords: Binary Spherical Quantization, Spherical Quantization, Binary Spherical, applies binary quantization, binary quantization
Subjects: Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: Tech report

Abstract:We propose a new transformer-based image and video tokenizer with Binary Spherical Quantization (BSQ). BSQ projects the high-dimensional visual embedding to a lower-dimensional hypersphere and then applies binary quantization. BSQ is (1) parameter-efficient without an explicit codebook, (2) scalable to arbitrary token dimensions, and (3) compact: compressing visual data by up to 100× with minimal distortion. Our tokenizer uses a transformer encoder and decoder with simple block-wise causal masking to support variable-length videos as input. The resulting BSQ-ViT achieves state-of-the-art visual reconstruction quality on image and video reconstruction benchmarks with 2.4× throughput compared to the best prior methods. Furthermore, by learning an autoregressive prior for adaptive arithmetic coding, BSQ-ViT achieves comparable results on video compression with state-of-the-art video compression standards. BSQ-ViT also enables masked language models to achieve competitive image synthesis quality to GAN- and diffusion-based methods.
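
The quantization step described above (project to a lower dimension, normalise onto a hypersphere, binarise each coordinate) can be illustrated with a short sketch. It is a minimal reading of the abstract, not the authors' code: the function name `bsq_quantize`, the plain linear projection, and the choice of scaling the signs by 1/sqrt(L) so the code stays on the unit sphere are assumptions.

```python
import math

def bsq_quantize(embedding, projection):
    """Project to a lower dimension, normalise onto the unit sphere, then binarise."""
    # Linear projection: one output coordinate per row of `projection`.
    v = [sum(w * x for w, x in zip(row, embedding)) for row in projection]
    norm = math.sqrt(sum(c * c for c in v)) or 1.0
    u = [c / norm for c in v]  # point on the unit hypersphere
    scale = 1.0 / math.sqrt(len(u))
    code = [scale if c >= 0 else -scale for c in u]  # quantised vector, still unit-norm
    bits = [1 if c >= 0 else 0 for c in u]           # the token as L bits
    return code, bits

# Hypothetical 4-D embedding projected down to 2 dimensions.
embedding = [1.0, -2.0, 0.5, 3.0]
projection = [[1, 0, 0, 0], [0, 1, 0, 0]]
code, bits = bsq_quantize(embedding, projection)
print(code, bits)
```

Because every code is a corner of a hypercube inscribed in the sphere, the set of 2^L possible tokens is implicit in the sign pattern, which is consistent with the abstract's point that no explicit codebook is stored.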

[LG-1] Situational Awareness Matters in 3D Vision Language Reasoning

Link: https://arxiv.org/abs/2406.07544
Authors: Yunze Man, Liang-Yan Gui, Yu-Xiong Wang
Keywords: developing household robots, vision language reasoning, complicated vision language, language reasoning tasks, vision language
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: CVPR 2024. Project Page: this https URL

Abstract:Being able to carry out complicated vision language reasoning tasks in 3D space represents a significant milestone in developing household robots and human-centered embodied AI. In this work, we demonstrate that a critical and distinct challenge in 3D vision language reasoning is situational awareness, which incorporates two key components: (1) The autonomous agent grounds its self-location based on a language prompt. (2) The agent answers open-ended questions from the perspective of its calculated position. To address this challenge, we introduce SIG3D, an end-to-end Situation-Grounded model for 3D vision language reasoning. We tokenize the 3D scene into sparse voxel representation and propose a language-grounded situation estimator, followed by a situated question answering module. Experiments on the SQA3D and ScanQA datasets show that SIG3D outperforms state-of-the-art models in situation estimation and question answering by a large margin (e.g., an enhancement of over 30% on situation estimation accuracy). Subsequent analysis corroborates our architectural design choices, explores the distinct functions of visual and textual tokens, and highlights the importance of situational awareness in the domain of 3D question answering.

[LG-2] Cognitive Insights Across Languages: Enhancing Multimodal Interview Analysis

Link: https://arxiv.org/abs/2406.07542
Authors: David Ortiz-Perez, Jose Garcia-Rodriguez, David Tomás
Keywords: Mild Cognitive Impairment, individuals age, natural process, process that occurs, occurs as individuals
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: GitHub repository: this https URL

Abstract:Cognitive decline is a natural process that occurs as individuals age. Early diagnosis of anomalous decline is crucial for initiating professional treatment that can enhance the quality of life of those affected. To address this issue, we propose a multimodal model capable of predicting Mild Cognitive Impairment and cognitive scores. The TAUKADIAL dataset is used to conduct the evaluation, which comprises audio recordings of clinical interviews. The proposed model demonstrates the ability to transcribe and differentiate between languages used in the interviews. Subsequently, the model extracts audio and text features, combining them into a multimodal architecture to achieve robust and generalized results. Our approach involves in-depth research to implement various features obtained from the proposed modalities.

[LG-3] CDSA: Conservative Denoising Score-based Algorithm for Offline Reinforcement Learning

Link: https://arxiv.org/abs/2406.07541
Authors: Zeyuan Liu, Kai Yang, Xiu Li
Keywords: avoid overestimating rare, Distribution shift, offline reinforcement learning, unseen actions, major obstacle
Subjects: Machine Learning (cs.LG)

Abstract:Distribution shift is a major obstacle in offline reinforcement learning, which necessitates minimizing the discrepancy between the learned policy and the behavior policy to avoid overestimating rare or unseen actions. Previous conservative offline RL algorithms struggle to generalize to unseen actions, despite their success in learning good in-distribution policies. In contrast, we propose to use the gradient fields of the dataset density generated from a pre-trained offline RL algorithm to adjust the original actions. We decouple the conservatism constraints from the policy, so the approach can benefit a wide range of offline RL algorithms. As a consequence, we propose the Conservative Denoising Score-based Algorithm (CDSA), which utilizes the denoising score-based model to model the gradient of the dataset density, rather than the dataset density itself, and facilitates a more accurate and efficient method to adjust the action generated by the pre-trained policy in a deterministic and continuous MDP environment. In experiments, we show that our approach significantly improves the performance of baseline algorithms in D4RL datasets, and demonstrate the generalizability and plug-and-play capability of our model across different pre-trained offline RL policies in different tasks. We also validate that the agent exhibits greater risk aversion after employing our method while showcasing its ability to generalize effectively across diverse tasks.
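
The core idea of adjusting a policy's action along the gradient of the dataset density can be sketched as a few small gradient-ascent steps. This is a toy illustration under strong assumptions: the learned denoising score model is replaced by the analytic score of an isotropic Gaussian, state conditioning is omitted, and the names `adjust_action` and `gaussian_score`, the step size, and the number of steps are all hypothetical.

```python
def gaussian_score(mean, var):
    """Score (gradient of the log-density) of an isotropic Gaussian; stands in for a learned model."""
    return lambda a: [(m - ai) / var for ai, m in zip(a, mean)]

def adjust_action(action, score_fn, step_size=0.1, n_steps=10):
    """Nudge an action towards higher dataset density by following the score."""
    a = list(action)
    for _ in range(n_steps):
        grad = score_fn(a)
        a = [ai + step_size * gi for ai, gi in zip(a, grad)]
    return a

# Hypothetical setup: the dataset actions concentrate around the origin.
score = gaussian_score(mean=[0.0, 0.0], var=1.0)
original = [2.0, -1.0]
adjusted = adjust_action(original, score)
print(original, adjusted)
```

The policy itself is untouched; only its output is post-processed, which matches the abstract's description of decoupling the conservatism constraint from the policy.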

[LG-4] Ctrl-X: Controlling Structure and Appearance for Text-To-Image Generation Without Guidance

Link: https://arxiv.org/abs/2406.07540
Authors: Kuan Heng Lin, Sicheng Mo, Ben Klingher, Fangzhou Mu, Bolei Zhou
Keywords: Self-guidance bring fine-grained, Recent controllable generation, Diffusion Self-guidance bring, bring fine-grained spatial, controllable generation approaches
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 18 pages, 11 figures, see project page at this https URL

Abstract:Recent controllable generation approaches such as FreeControl and Diffusion Self-guidance bring fine-grained spatial and appearance control to text-to-image (T2I) diffusion models without training auxiliary modules. However, these methods optimize the latent embedding for each type of score function with longer diffusion steps, making the generation process time-consuming and limiting their flexibility and use. This work presents Ctrl-X, a simple framework for T2I diffusion controlling structure and appearance without additional training or guidance. Ctrl-X designs feed-forward structure control to enable the structure alignment with a structure image and semantic-aware appearance transfer to facilitate the appearance transfer from a user-input image. Extensive qualitative and quantitative experiments illustrate the superior performance of Ctrl-X on various condition inputs and model checkpoints. In particular, Ctrl-X supports novel structure and appearance control with arbitrary condition images of any modality, exhibits superior image quality and appearance transfer compared to existing works, and provides instant plug-and-play functionality to any T2I and text-to-video (T2V) diffusion model. See our project page for an overview of the results: this https URL

[LG-5] Towards Fundamentally Scalable Model Selection: Asymptotically Fast Update and Selection

Link: https://arxiv.org/abs/2406.07536
Authors: Wenxiao Wang, Weiming Zhuang, Lingjuan Lyu
Keywords: deep learning technologies, model selection, model, isolated model embedding, selection
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Comments: 19 pages, 8 figures

Abstract:The advancement of deep learning technologies is bringing new models every day, motivating the study of scalable model selection. An ideal model selection scheme should minimally support two operations efficiently over a large pool of candidate models: update, which involves either adding a new candidate model or removing an existing candidate model, and selection, which involves locating highly performing models for a given task. However, previous solutions to model selection require high computational complexity for at least one of these two operations. In this work, we target fundamentally (more) scalable model selection that supports asymptotically fast update and asymptotically fast selection at the same time. Firstly, we define isolated model embedding, a family of model selection schemes supporting asymptotically fast update and selection: With respect to the number of candidate models m, the update complexity is O(1) and the selection consists of a single sweep over m vectors in addition to O(1) model operations. Isolated model embedding also implies several desirable properties for applications. Secondly, we present Standardized Embedder, an empirical realization of isolated model embedding. We assess its effectiveness by using it to select representations from a pool of 100 pre-trained vision models for classification tasks and measuring the performance gaps between the selected models and the best candidates with a linear probing protocol. Experiments suggest our realization is effective in selecting models with competitive performances and highlight isolated model embedding as a promising direction towards model selection that is fundamentally (more) scalable.
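
The complexity claims above (constant-time update, selection as a single sweep over m vectors) can be made concrete with a toy registry. This is a sketch of the general idea only: it assumes each model has already been embedded as a fixed-length vector independently of the other models, and it assumes dot-product scoring against a task vector. The class name `ModelIndex` and the scoring rule are hypothetical and are not the paper's Standardized Embedder.

```python
import heapq

class ModelIndex:
    """Toy registry in which every model is embedded independently of the others."""

    def __init__(self):
        self._vectors = {}

    def add(self, name, vector):
        # Update: a dictionary insert, independent of how many models exist.
        self._vectors[name] = list(vector)

    def remove(self, name):
        # Update: a dictionary delete, likewise independent of the pool size.
        self._vectors.pop(name, None)

    def select(self, task_vector, k=1):
        # Selection: one pass over the stored vectors, scoring each against the task.
        def score(name):
            return sum(a * b for a, b in zip(self._vectors[name], task_vector))
        return heapq.nlargest(k, self._vectors, key=score)

index = ModelIndex()
index.add("model_a", [1.0, 0.0])  # hypothetical embeddings
index.add("model_b", [0.0, 1.0])
best = index.select([0.9, 0.1])
print(best)
```

Because adding or removing a model never touches the other stored vectors, the pool can grow without re-embedding existing candidates, which is the property the term "isolated" refers to.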

[LG-6] Hearing Anything Anywhere

Link: https://arxiv.org/abs/2406.07532
Authors: Mason Wang, Ryosuke Sawata, Samuel Clarke, Ruohan Gao, Shangzhe Wu, Jiajun Wu
Keywords: numerous Mixed Reality, Mixed Reality, Recent years, numerous Mixed, computer graphics
Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: CVPR 2024. The first two authors contributed equally. Project page: this https URL

Abstract:Recent years have seen immense progress in 3D computer vision and computer graphics, with emerging tools that can virtualize real-world 3D environments for numerous Mixed Reality (XR) applications. However, alongside immersive visual experiences, immersive auditory experiences are equally vital to our holistic perception of an environment. In this paper, we aim to reconstruct the spatial acoustic characteristics of an arbitrary environment given only a sparse set of (roughly 12) room impulse response (RIR) recordings and a planar reconstruction of the scene, a setup that is easily achievable by ordinary users. To this end, we introduce DiffRIR, a differentiable RIR rendering framework with interpretable parametric models of salient acoustic features of the scene, including sound source directivity and surface reflectivity. This allows us to synthesize novel auditory experiences through the space with any source audio. To evaluate our method, we collect a dataset of RIR recordings and music in four diverse, real environments. We show that our model outperforms state-of-the-art baselines on rendering monaural and binaural RIRs and music at unseen locations, and learns physically interpretable parameters characterizing acoustic properties of the sound source and surfaces in the scene.

[LG-7] MAP: Low-compute Model Merging with Amortized Pareto Fronts via Quadratic Approximation

Link: https://arxiv.org/abs/2406.07529
Authors: Lu Li,Tianyu Zhang,Zhiqi Bu,Suyuchen Wang,Huan He,Jie Fu,Yonghui Wu,Jiang Bian,Yong Chen,Yoshua Bengio
Keywords: combine multiple single-task, effective approach, approach to combine, Model, MAP
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Model merging has emerged as an effective approach to combine multiple single-task models, fine-tuned from the same pre-trained model, into a multitask model. This process typically involves computing a weighted average of the model parameters without any additional training. Existing model-merging methods focus on enhancing average task accuracy. However, interference and conflicts between the objectives of different tasks can lead to trade-offs during model merging. In real-world applications, a set of solutions with various trade-offs can be more informative, helping practitioners make decisions based on diverse preferences. In this paper, we introduce a novel low-compute algorithm, Model Merging with Amortized Pareto Front (MAP). MAP identifies a Pareto set of scaling coefficients for merging multiple models to reflect the trade-offs. The core component of MAP is approximating the evaluation metrics of the various tasks using a quadratic approximation surrogate model derived from a pre-selected set of scaling coefficients, enabling amortized inference. Experimental results on vision and natural language processing tasks show that MAP can accurately identify the Pareto front. To further reduce the required computation of MAP, we propose (1) a Bayesian adaptive sampling algorithm and (2) a nested merging scheme with multiple stages.
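
To make the notion of a Pareto set of scaling coefficients concrete, here is a minimal sketch of Pareto filtering over already-evaluated candidates. The coefficient pairs and accuracies are made up for illustration, and the sketch omits the quadratic surrogate that MAP uses to avoid evaluating every candidate directly.

```python
def dominates(a, b):
    """True if metrics a are at least as good as b on every task and better on one (higher is better)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(candidates):
    """candidates: list of (coefficients, metrics) pairs; returns the non-dominated pairs."""
    return [
        (coeffs, metrics)
        for coeffs, metrics in candidates
        if not any(dominates(other, metrics) for _, other in candidates)
    ]


# Made-up accuracies on two tasks for a few scaling-coefficient pairs.
candidates = [
    ((1.0, 0.0), (0.90, 0.40)),
    ((0.7, 0.3), (0.85, 0.70)),
    ((0.5, 0.5), (0.80, 0.65)),  # worse than (0.7, 0.3) on both tasks
    ((0.0, 1.0), (0.35, 0.92)),
]
front = pareto_front(candidates)
```

Each surviving entry represents a different trade-off between the two tasks; a practitioner would pick one according to their preference.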

[LG-8] QuickLLaMA: Query-aware Inference Acceleration for Large Language Models

Link: https://arxiv.org/abs/2406.07528
Authors: Jingyao Li,Han Shi,Xin Jiang,Zhenguo Li,Hong Xu,Jiaya Jia
Keywords: Large Language Models, Language Models, Large Language, capacity of Large, diverse fields
Categories: Machine Learning (cs.LG)
Comments:

Abstract:The capacity of Large Language Models (LLMs) to comprehend and reason over long contexts is pivotal for advancements in diverse fields. Yet, they still struggle with capturing long-distance dependencies within sequences to deeply understand semantics. To address this issue, we introduce Query-aware Inference for LLMs (Q-LLM), a system designed to process extensive sequences akin to human cognition. By focusing on memory data relevant to a given query, Q-LLM can accurately capture pertinent information within a fixed window size and provide precise answers to queries. It doesn’t require extra training and can be seamlessly integrated with any LLM. Q-LLM using LLaMA3 (QuickLLaMA) can read Harry Potter within 30s and accurately answer the questions. On the $\infty$-bench, Q-LLM improved by 7.17% compared to the current state-of-the-art on LLaMA3, and by 3.26% on Mistral. In the Needle-in-a-Haystack task, Q-LLM improved upon the current SOTA by 7.0% on Mistral and achieves 100% on LLaMA3. Our code can be found in this https URL.

[LG-9] Simple and Effective Masked Diffusion Language Models

Link: https://arxiv.org/abs/2406.07524
Authors: Subham Sekhar Sahoo,Marianne Arriola,Yair Schiff,Aaron Gokaslan,Edgar Marroquin,Justin T Chiu,Alexander Rush,Volodymyr Kuleshov
Keywords: generating high-quality images, prior work reports, significant performance gap, high-quality images, excel at generating
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:While diffusion models excel at generating high-quality images, prior work reports a significant performance gap between diffusion and autoregressive (AR) methods in language modeling. In this work, we show that simple masked discrete diffusion is more performant than previously thought. We apply an effective training recipe that improves the performance of masked diffusion models and derive a simplified, Rao-Blackwellized objective that results in additional improvements. Our objective has a simple form – it is a mixture of classical masked language modeling losses – and can be used to train encoder-only language models that admit efficient samplers, including ones that can generate arbitrary lengths of text semi-autoregressively like a traditional language model. On language modeling benchmarks, a range of masked diffusion models trained with modern engineering practices achieves a new state-of-the-art among diffusion models, and approaches AR perplexity. We release our code at: this https URL

[LG-10] Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling

Link: https://arxiv.org/abs/2406.07522
Authors: Liliang Ren,Yang Liu,Yadong Lu,Yelong Shen,Chen Liang,Weizhu Chen
Keywords: long-standing problem, Samba, Sliding Window Attention, infinite context length, Efficiently modeling sequences
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Efficiently modeling sequences with infinite context length has been a long-standing problem. Past works suffer from either the quadratic computation complexity or the limited extrapolation ability on length generalization. In this work, we present Samba, a simple hybrid architecture that layer-wise combines Mamba, a selective State Space Model (SSM), with Sliding Window Attention (SWA). Samba selectively compresses a given sequence into recurrent hidden states while still maintaining the ability to precisely recall memories with the attention mechanism. We scale Samba up to 3.8B parameters with 3.2T training tokens and show that Samba substantially outperforms the state-of-the-art models based on pure attention or SSMs on a wide range of benchmarks. When trained on 4K length sequences, Samba can be efficiently extrapolated to 256K context length with perfect memory recall and show improved token predictions up to 1M context length. As a linear-time sequence model, Samba enjoys a 3.73x higher throughput compared to Transformers with grouped-query attention when processing user prompts of 128K length, and 3.64x speedup when generating 64K tokens with unlimited streaming. A sample implementation of Samba is publicly available in this https URL.
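
The Sliding Window Attention component mentioned here restricts each position to a fixed number of recent positions. The sketch below builds such a causal mask in plain Python; the function name and the window size are illustrative assumptions, and this is not Samba's implementation.

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when query position i may attend to key position j."""
    return [
        [0 <= i - j < window for j in range(seq_len)]
        for i in range(seq_len)
    ]


# With a window of 2, each position sees itself and the previous position only.
mask = sliding_window_mask(seq_len=5, window=2)
```

Every row contains at most `window` allowed positions, so the attention cost grows linearly with sequence length rather than quadratically.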

[LG-11] Faster Spectral Density Estimation and Sparsification in the Nuclear Norm

Link: https://arxiv.org/abs/2406.07521
Authors: Yujia Jin,Ishani Karmarkar,Christopher Musco,Aaron Sidford,Apoorv Vikram Singh
Keywords: normalized adjacency matrix, epsilon, node undirected graph, node undirected, problem of estimating
Categories: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
Comments: Accepted for presentation at the Conference on Learning Theory (COLT) 2024

Abstract:We consider the problem of estimating the spectral density of the normalized adjacency matrix of an $n$-node undirected graph. We provide a randomized algorithm that, with $O(n\epsilon^{-2})$ queries to a degree and neighbor oracle and in $O(n\epsilon^{-3})$ time, estimates the spectrum up to $\epsilon$ accuracy in the Wasserstein-1 metric. This improves on previous state-of-the-art methods, including an $O(n\epsilon^{-7})$ time algorithm from [Braverman et al., STOC 2022] and, for sufficiently small $\epsilon$, a $2^{O(\epsilon^{-1})}$ time method from [Cohen-Steiner et al., KDD 2018]. To achieve this result, we introduce a new notion of graph sparsification, which we call nuclear sparsification. We provide an $O(n\epsilon^{-2})$-query and $O(n\epsilon^{-2})$-time algorithm for computing $O(n\epsilon^{-2})$-sparse nuclear sparsifiers. We show that this bound is optimal in both its sparsity and query complexity, and we separate our results from the related notion of additive spectral sparsification. Of independent interest, we show that our sparsification method also yields the first deterministic algorithm for spectral density estimation that scales linearly with $n$ (sublinear in the representation size of the graph).
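
The accuracy guarantee above is stated in the Wasserstein-1 metric. As a small illustration of what that metric measures, the sketch below computes it between two spectra with the same number of eigenvalues, where it reduces to the mean absolute difference of the sorted values. The "estimated" spectrum is made up; this is not the paper's estimation algorithm.

```python
def wasserstein1(spectrum_a, spectrum_b):
    """Wasserstein-1 distance between two equal-size spectra, each eigenvalue weighted 1/n."""
    if len(spectrum_a) != len(spectrum_b):
        raise ValueError("this sketch assumes spectra of equal size")
    a, b = sorted(spectrum_a), sorted(spectrum_b)
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Normalized adjacency spectrum of the triangle graph K_3, and a made-up estimate of it.
true_spectrum = [1.0, -0.5, -0.5]
estimate = [0.9, -0.4, -0.5]
error = wasserstein1(true_spectrum, estimate)
```

An estimate is within $\epsilon$ accuracy in this sense when `error` is at most $\epsilon$.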

[LG-12] Beyond Model Collapse: Scaling Up with Synthesized Data Requires Reinforcement

Link: https://arxiv.org/abs/2406.07515
Authors: Yunzhen Feng,Elvis Dohmatob,Pu Yang,Francois Charton,Julia Kempe
Keywords: Large Language Models, fine-tuning Large Language, Synthesized data, model collapse, increasingly considered
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:

Abstract:Synthesized data from generative models is increasingly considered as an alternative to human-annotated data for fine-tuning Large Language Models. This raises concerns about model collapse: a drop in performance of models fine-tuned on generated data. Considering that it is easier for both humans and machines to tell between good and bad examples than to generate high-quality samples, we investigate the use of feedback on synthesized data to prevent model collapse. We derive theoretical conditions under which a Gaussian mixture classification model can achieve asymptotically optimal performance when trained on feedback-augmented synthesized data, and provide supporting simulations for finite regimes. We illustrate our theoretical predictions on two practical problems: computing matrix eigenvalues with transformers and news summarization with large language models, which both undergo model collapse when trained on model-generated data. We show that training from feedback-augmented synthesized data, either by pruning incorrect predictions or by selecting the best of several guesses, can prevent model collapse, validating popular approaches like RLHF.

[LG-13] Flow Map Matching

Link: https://arxiv.org/abs/2406.07507
Authors: Nicholas M. Boffi,Michael S. Albergo,Eric Vanden-Eijnden
Keywords: trajectories push initial, push initial conditions, transport of measure, Generative models based, based on dynamical
Categories: Machine Learning (cs.LG); Dynamical Systems (math.DS)
Comments:

Abstract:Generative models based on dynamical transport of measure, such as diffusion models, flow matching models, and stochastic interpolants, learn an ordinary or stochastic differential equation whose trajectories push initial conditions from a known base distribution onto the target. While training is cheap, samples are generated via simulation, which is more expensive than one-step models like GANs. To close this gap, we introduce flow map matching – an algorithm that learns the two-time flow map of an underlying ordinary differential equation. The approach leads to an efficient few-step generative model whose step count can be chosen a-posteriori to smoothly trade off accuracy for computational expense. Leveraging the stochastic interpolant framework, we introduce losses for both direct training of flow maps and distillation from pre-trained (or otherwise known) velocity fields. Theoretically, we show that our approach unifies many existing few-step generative models, including consistency models, consistency trajectory models, progressive distillation, and neural operator approaches, which can be obtained as particular cases of our formalism. With experiments on CIFAR-10 and ImageNet 32x32, we show that flow map matching leads to high-quality samples with significantly reduced sampling cost compared to diffusion or stochastic interpolant methods.
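
A two-time flow map carries the state of an ODE at time s directly to its state at time t. The sketch below uses the toy equation dx/dt = -x, chosen here only because its flow map is known in closed form; in the paper the map is learned by a network, which this example does not attempt. It shows why the step count can be chosen after the fact: composing jumps over sub-intervals gives the same result as one jump.

```python
import math


def flow_map(x, s, t):
    """Exact two-time flow map of dx/dt = -x: carries the state at time s to time t."""
    return x * math.exp(-(t - s))


def euler(x, s, t, steps):
    """Baseline: simulate the same ODE with fixed-step forward Euler."""
    dt = (t - s) / steps
    for _ in range(steps):
        x += dt * (-x)
    return x


x0 = 2.0
one_jump = flow_map(x0, 0.0, 1.0)
two_jumps = flow_map(flow_map(x0, 0.0, 0.4), 0.4, 1.0)
simulated = euler(x0, 0.0, 1.0, steps=1000)
```

Here `one_jump` and `two_jumps` agree up to floating-point error, while the Euler baseline needs many small steps to reach comparable accuracy.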

[LG-14] Understanding Visual Concepts Across Models

Link: https://arxiv.org/abs/2406.07506
Authors: Brandon Trabucco,Max Gurinas,Kyle Doherty,Ruslan Salakhutdinov
Keywords: Stable Diffusion, Large multimodal models, Large multimodal, Large, Diffusion can generate
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Official code at: this https URL

Abstract:Large multimodal models such as Stable Diffusion can generate, detect, and classify new visual concepts after fine-tuning just a single word embedding. Do models learn similar words for the same concepts (i.e. orange-cat = orange + cat)? We conduct a large-scale analysis on three state-of-the-art models in text-to-image generation, open-set object detection, and zero-shot classification, and find that new word embeddings are model-specific and non-transferable. Across 4,800 new embeddings trained for 40 diverse visual concepts on four standard datasets, we find perturbations within an $\epsilon$-ball to any prior embedding that generate, detect, and classify an arbitrary concept. When these new embeddings are spliced into new models, fine-tuning that targets the original model is lost. We show popular soft prompt-tuning approaches find these perturbative solutions when applied to visual concept learning tasks, and embeddings for visual concepts are not transferable. Code for reproducing our work is available at: this https URL.

[LG-15] TextGrad: Automatic “Differentiation” via Text

Link: https://arxiv.org/abs/2406.07496
Authors: Mert Yuksekgonul,Federico Bianchi,Joseph Boen,Sheng Liu,Zhi Huang,Carlos Guestrin,James Zou
Keywords: orchestrating multiple large, systems orchestrating multiple, large language models, multiple large language, paradigm shift
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 41 pages, 6 figures

Abstract:AI is undergoing a paradigm shift, with breakthroughs achieved by systems orchestrating multiple large language models (LLMs) and other complex components. As a result, developing principled and automated optimization methods for compound AI systems is one of the most important new challenges. Neural networks faced a similar challenge in their early days until backpropagation and automatic differentiation transformed the field by making optimization turn-key. Inspired by this, we introduce TextGrad, a powerful framework performing automatic “differentiation” via text. TextGrad backpropagates textual feedback provided by LLMs to improve individual components of a compound AI system. In our framework, LLMs provide rich, general, natural language suggestions to optimize variables in computation graphs, ranging from code snippets to molecular structures. TextGrad follows PyTorch’s syntax and abstraction and is flexible and easy-to-use. It works out-of-the-box for a variety of tasks, where the users only provide the objective function without tuning components or prompts of the framework. We showcase TextGrad’s effectiveness and generality across a diverse range of applications, from question answering and molecule optimization to radiotherapy treatment planning. Without modifying the framework, TextGrad improves the zero-shot accuracy of GPT-4o in Google-Proof Question Answering from 51% to 55%, yields 20% relative performance gain in optimizing LeetCode-Hard coding problem solutions, improves prompts for reasoning, designs new druglike small molecules with desirable in silico binding, and designs radiation oncology treatment plans with high specificity. TextGrad lays a foundation to accelerate the development of the next generation of AI systems.

[LG-16] Towards Generalized Hydrological Forecasting using Transformer Models for 120-Hour Streamflow Prediction

Link: https://arxiv.org/abs/2406.07484
Authors: Bekir Z. Demiray,Ibrahim Demir
Keywords: locations in Iowa, Transformer model, diverse locations, explores the efficacy, Transformer
Categories: Machine Learning (cs.LG)
Comments: 20 pages, 5 figures

Abstract:This study explores the efficacy of a Transformer model for 120-hour streamflow prediction across 125 diverse locations in Iowa, US. Utilizing data from the preceding 72 hours, including precipitation, evapotranspiration, and discharge values, we developed a generalized model to predict future streamflow. Our approach contrasts with traditional methods that typically rely on location-specific models. We benchmarked the Transformer model’s performance against three deep learning models (LSTM, GRU, and Seq2Seq) and the Persistence approach, employing Nash-Sutcliffe Efficiency (NSE), Kling-Gupta Efficiency (KGE), Pearson’s r, and Normalized Root Mean Square Error (NRMSE) as metrics. The study reveals the Transformer model’s superior performance, maintaining higher median NSE and KGE scores and exhibiting the lowest NRMSE values. This indicates its capability to accurately simulate and predict streamflow, adapting effectively to varying hydrological conditions and geographical variances. Our findings underscore the Transformer model’s potential as an advanced tool in hydrological modeling, offering significant improvements over traditional and contemporary approaches.
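
For readers unfamiliar with the two hydrological metrics named above, the sketch below computes them from their standard definitions: NSE compares squared errors against the variance of the observations, and KGE combines correlation, variability ratio, and bias ratio. The function names and the sample discharge values are illustrative and are not taken from the study.

```python
import math


def nse(obs, sim):
    """Nash-Sutcliffe Efficiency: 1 is a perfect fit, 0 matches predicting the observed mean."""
    mean_obs = sum(obs) / len(obs)
    squared_error = sum((o - s) ** 2 for o, s in zip(obs, sim))
    variance_term = sum((o - mean_obs) ** 2 for o in obs)
    return 1 - squared_error / variance_term


def kge(obs, sim):
    """Kling-Gupta Efficiency from correlation r, variability ratio alpha, and bias ratio beta."""
    n = len(obs)
    mean_obs, mean_sim = sum(obs) / n, sum(sim) / n
    std_obs = math.sqrt(sum((o - mean_obs) ** 2 for o in obs) / n)
    std_sim = math.sqrt(sum((s - mean_sim) ** 2 for s in sim) / n)
    covariance = sum((o - mean_obs) * (s - mean_sim) for o, s in zip(obs, sim)) / n
    r = covariance / (std_obs * std_sim)
    alpha = std_sim / std_obs
    beta = mean_sim / mean_obs
    return 1 - math.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)


# Made-up hourly discharge values and a slightly biased simulation of them.
observed = [1.0, 2.0, 3.0, 4.0]
simulated = [1.1, 2.1, 3.1, 4.1]
scores = (nse(observed, simulated), kge(observed, simulated))
```

Both metrics have an upper bound of 1, so higher median NSE and KGE indicate a closer match to observed streamflow.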

[LG-17] Comparing Deep Learning Models for Rice Mapping in Bhutan Using High Resolution Satellite Imagery

Link: https://arxiv.org/abs/2406.07482
Authors: Biplov Bhandari,Timothy Mayer
Keywords: including Remote Sensing-based, Remote Sensing-based knowledge, Remote Sensing-based, Bhutanese government, including Remote
Categories: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Machine Learning (cs.LG); Geophysics (physics.geo-ph)
Comments:

Abstract:The Bhutanese government is increasing its utilization of technological approaches, such as Remote Sensing-based knowledge, in its decision-making process. This study focuses on crop type and crop extent in Paro, one of the top rice-yielding districts in Bhutan, and employs publicly available NICFI high-resolution satellite imagery from Planet. Two Deep Learning (DL) approaches, point-based (DNN) and patch-based (U-Net), were used in conjunction with cloud-computing platforms. Models were trained for each DL approach (DNN and U-Net) on the following inputs: 1) RGBN channels from Planet; 2) RGBN and elevation data (RGBNE); 3) RGBN and Sentinel-1 (S1) data (RGBNS), and RGBN with E and S1 data (RGBNES). From this comprehensive analysis, the U-Net displayed higher performance metrics across both model training and model validation efforts. Among the U-Net model sets, the RGBN, RGBNE, RGBNS, and RGBNES models had an F1-score of 0.8546, 0.8563, 0.8467, and 0.8500 respectively. An independent model evaluation was performed and found a high level of performance variation across all the metrics. For this independent model evaluation, the U-Net RGBN, RGBNE, RGBNES, and RGBN models displayed the F1-scores of 0.5935, 0.6154, 0.5882, and 0.6582, suggesting U-Net RGBNES as the best model. The study shows that the DL approaches can predict rice. Also, DL methods can be used with the survey-based approaches currently utilized by the Bhutan Department of Agriculture. Further, this study demonstrated the usage of regional land cover products such as SERVIR’s RLCMS as a weak label approach to capture different strata, addressing the class imbalance problem and improving the sampling design for DL application. Finally, preliminary model testing and comparisons showed that using additional features such as NDVI, EVI, and NDWI did not drastically improve model performance.

[LG-18] Partially Observed Trajectory Inference using Optimal Transport and a Dynamics Prior

Link: https://arxiv.org/abs/2406.07475
Authors: Anming Gu,Edward Chien,Kristjan Greenewald
Keywords: Trajectory inference seeks, seeks to recover, population from snapshots, trajectory inference problem, Trajectory inference
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 32 pages, 9 figures

Abstract:Trajectory inference seeks to recover the temporal dynamics of a population from snapshots of its (uncoupled) temporal marginals, i.e. where observed particles are not tracked over time. Lavenant et al. arXiv:2102.09204 addressed this challenging problem under a stochastic differential equation (SDE) model with a gradient-driven drift in the observed space, introducing a minimum entropy estimator relative to the Wiener measure. Chizat et al. arXiv:2205.07146 then provided a practical grid-free mean-field Langevin (MFL) algorithm using Schrödinger bridges. Motivated by the overwhelming success of observable state space models in the traditional paired trajectory inference problem (e.g. target tracking), we extend the above framework to a class of latent SDEs in the form of observable state space models. In this setting, we use partial observations to infer trajectories in the latent space under a specified dynamics model (e.g. the constant velocity/acceleration models from target tracking). We introduce PO-MFL to solve this latent trajectory inference problem and provide theoretical guarantees by extending the results of arXiv:2102.09204 to the partially observed setting. We leverage the MFL framework of arXiv:2205.07146, yielding an algorithm based on entropic OT between dynamics-adjusted adjacent time marginals. Experiments validate the robustness of our method and the exponential convergence of the MFL dynamics, and demonstrate significant outperformance over the latent-free method of arXiv:2205.07146 in key scenarios.

[LG-19] Multimodal Belief Prediction

Link: https://arxiv.org/abs/2406.07466
Authors: John Murzaku,Adil Soubki,Owen Rambow
Keywords: belief prediction task, words in context, level of commitment, interpret the meaning, understand cues
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: John Murzaku and Adil Soubki contributed equally to this work

Abstract:Recognizing a speaker’s level of commitment to a belief is a difficult task; humans do not only interpret the meaning of the words in context, but also understand cues from intonation and other aspects of the audio signal. Many papers and corpora in the NLP community have approached the belief prediction task using text-only approaches. We are the first to frame and present results on the multimodal belief prediction task. We use the CB-Prosody corpus (CBP), containing aligned text and audio with speaker belief annotations. We first report baselines and significant features using acoustic-prosodic features and traditional machine learning methods. We then present text and audio baselines for the CBP corpus fine-tuning on BERT and Whisper respectively. Finally, we present our multimodal architecture which fine-tunes on BERT and Whisper and uses multiple fusion methods, improving on both modalities alone.

[LG-20] Estimating the Hallucination Rate of Generative AI

Link: https://arxiv.org/abs/2406.07457
Authors: Andrew Jesson,Nicolas Beltran-Velez,Quentin Chu,Sweta Karlekar,Jannik Kossen,Yarin Gal,John P. Cunningham,David Blei
Keywords: in-context learning, rate for in-context, ICL, conditional generative model, CGM
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:This work is about estimating the hallucination rate for in-context learning (ICL) with Generative AI. In ICL, a conditional generative model (CGM) is prompted with a dataset and asked to make a prediction based on that dataset. The Bayesian interpretation of ICL assumes that the CGM is calculating a posterior predictive distribution over an unknown Bayesian model of a latent parameter and data. With this perspective, we define a hallucination as a generated prediction that has low probability under the true latent parameter. We develop a new method that takes an ICL problem – that is, a CGM, a dataset, and a prediction question – and estimates the probability that a CGM will generate a hallucination. Our method only requires generating queries and responses from the model and evaluating its response log probability. We empirically evaluate our method on synthetic regression and natural language ICL tasks using large language models.

[LG-21] fKAN: Fractional Kolmogorov-Arnold Networks with trainable Jacobi basis functions

Link: https://arxiv.org/abs/2406.07456
Authors: Alireza Afzal Aghaei
Keywords: neural network design, Recent advancements, fractional Jacobi functions, Fractional Kolmogorov-Arnold Network, neural network
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments:

Abstract:Recent advancements in neural network design have given rise to the development of Kolmogorov-Arnold Networks (KANs), which enhance speed, interpretability, and precision. This paper presents the Fractional Kolmogorov-Arnold Network (fKAN), a novel neural network architecture that incorporates the distinctive attributes of KANs with a trainable adaptive fractional-orthogonal Jacobi function as its basis function. By leveraging the unique mathematical properties of fractional Jacobi functions, including simple derivative formulas, non-polynomial behavior, and activity for both positive and negative input values, this approach ensures efficient learning and enhanced accuracy. The proposed architecture is evaluated across a range of tasks in deep learning and physics-informed deep learning. Precision is tested on synthetic regression data, image classification, image denoising, and sentiment analysis. Additionally, the performance is measured on various differential equations, including ordinary, partial, and fractional delay differential equations. The results demonstrate that integrating fractional Jacobi functions into KANs significantly improves training speed and performance across diverse fields and applications.

[LG-22] Reinforcement Learning from Human Feedback without Reward Inference: Model-Free Algorithm and Instance-Dependent Analysis

Link: https://arxiv.org/abs/2406.07455
Authors: Qining Zhang,Honghao Wei,Lei Ying
Keywords: episodic Markov decision, Markov decision process, study reinforcement learning, general trajectory-wise reward, trajectory-wise reward model
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:In this paper, we study reinforcement learning from human feedback (RLHF) under an episodic Markov decision process with a general trajectory-wise reward model. We developed a model-free RLHF best policy identification algorithm, called $\mathsf{BSAD}$, without explicit reward model inference, which is a critical intermediate step in the contemporary RLHF paradigms for training large language models (LLMs). The algorithm identifies the optimal policy directly from human preference information in a backward manner, employing a dueling bandit sub-routine that constantly duels actions to identify the superior one. $\mathsf{BSAD}$ adopts a reward-free exploration and best-arm-identification-like adaptive stopping criteria to equalize the visitation among all states in the same decision step while moving to the previous step as soon as the optimal action is identifiable, leading to a provable, instance-dependent sample complexity $\tilde{\mathcal{O}}(c_{\mathcal{M}} S A^3 H^3 M \log\frac{1}{\delta})$ which resembles the result in classic RL, where $c_{\mathcal{M}}$ is the instance-dependent constant and $M$ is the batch size. Moreover, $\mathsf{BSAD}$ can be transformed into an explore-then-commit algorithm with logarithmic regret and generalized to discounted MDPs using a frame-based approach. Our results show: (i) sample-complexity-wise, RLHF is not significantly harder than classic RL and (ii) end-to-end RLHF may deliver improved performance by avoiding pitfalls in reward inference such as overfitting and distribution shift.

[LG-23] An Optimism-based Approach to Online Evaluation of Generative Models

Link: https://arxiv.org/abs/2406.07451
Authors: Xiaoyan Hu,Ho-fung Leung,Farzan Farnia
Keywords: models typically target, Existing frameworks, comparing generative models, generative models typically, evaluating and comparing
Categories: Machine Learning (cs.LG)
Comments: arXiv version

Abstract:Existing frameworks for evaluating and comparing generative models typically target an offline setting, where the evaluator has access to full batches of data produced by the models. However, in many practical scenarios, the goal is to identify the best model using the fewest generated samples to minimize the costs of querying data from the models. Such an online comparison is challenging with current offline assessment methods. In this work, we propose an online evaluation framework to find the generative model that maximizes a standard assessment score among a group of available models. Our method uses an optimism-based multi-armed bandit framework to identify the model producing data with the highest evaluation score, quantifying the quality and diversity of generated data. Specifically, we study the online assessment of generative models based on the Fréchet Inception Distance (FID) and Inception Score (IS) metrics and propose the FID-UCB and IS-UCB algorithms leveraging the upper confidence bound approach in online learning. We prove sub-linear regret bounds for these algorithms and present numerical results on standard image datasets, demonstrating their effectiveness in identifying the score-maximizing generative model.
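
The optimism-based idea can be sketched with the classic UCB1 rule: query the model whose score estimate plus confidence bonus is highest. This is a generic illustration under strong assumptions (each query returns a bounded per-sample score in [0, 1], and the hypothetical models return fixed scores); the paper's FID-UCB and IS-UCB algorithms handle metrics that are not simple per-sample averages, which this sketch does not.

```python
import math


def ucb_select(score_fns, rounds):
    """Run UCB1 over named models; score_fns maps a name to a function returning a score in [0, 1]."""
    names = list(score_fns)
    pulls = {n: 0 for n in names}
    totals = {n: 0.0 for n in names}
    for t in range(1, rounds + 1):
        untried = [n for n in names if pulls[n] == 0]
        if untried:
            choice = untried[0]  # query every model once before using the bonus
        else:
            choice = max(
                names,
                key=lambda n: totals[n] / pulls[n]
                + math.sqrt(2 * math.log(t) / pulls[n]),
            )
        totals[choice] += score_fns[choice]()
        pulls[choice] += 1
    best = max(names, key=lambda n: totals[n] / pulls[n])
    return best, pulls


# Hypothetical models that return a fixed quality score per generated sample.
models = {"model_a": lambda: 0.2, "model_b": lambda: 0.8, "model_c": lambda: 0.5}
best, pulls = ucb_select(models, rounds=300)
```

The confidence bonus shrinks as a model is queried more often, so the sample budget concentrates on the models that still look competitive.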

[LG-24] Benchmarking Vision-Language Contrastive Methods for Medical Representation Learning

Link: https://arxiv.org/abs/2406.07450
Authors: Shuvendu Roy,Yasaman Parhizkar,Franklin Ogidi,Vahid Reza Khazaie,Michael Colacci,Ali Etemad,Elham Dolatabadi,Arash Afkanpour
Keywords: medical domain, perform a comprehensive, comprehensive benchmarking, multimodal medical representation, medical representation learning
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:We perform a comprehensive benchmarking of contrastive frameworks for learning multimodal representations in the medical domain. Through this study, we aim to answer the following research questions: (i) How transferable are general-domain representations to the medical domain? (ii) Is multimodal contrastive training sufficient, or does it benefit from unimodal training as well? (iii) What is the impact of feature granularity on the effectiveness of multimodal medical representation learning? To answer these questions, we investigate eight contrastive learning approaches under identical training setups, and train them on 2.8 million image-text pairs from four datasets, and evaluate them on 25 downstream tasks, including classification (zero-shot and linear probing), image-to-text and text-to-image retrieval, and visual question-answering. Our findings suggest a positive answer to the first question, a negative answer to the second question, and the benefit of learning fine-grained features. Finally, we make our code publicly available.

[LG-25] DeformTime: Capturing Variable Dependencies with Deformable Attention for Time Series Forecasting

Link: https://arxiv.org/abs/2406.07438
Authors: Yuxuan Shu,Vasileios Lampos
Keywords: deep learning approaches, learning approaches tend, approaches tend, tend to focus, focus on autoregressive
Categories: Machine Learning (cs.LG)
Comments: The code is available at this https URL

Abstract:In multivariate time series (MTS) forecasting, existing state-of-the-art deep learning approaches tend to focus on autoregressive formulations and overlook the information within exogenous indicators. To address this limitation, we present DeformTime, a neural network architecture that attempts to capture correlated temporal patterns from the input space, and hence, improve forecasting accuracy. It deploys two core operations performed by deformable attention blocks (DABs): learning dependencies across variables from different time steps (variable DAB), and preserving temporal dependencies in data from previous time steps (temporal DAB). Input data transformation is explicitly designed to enhance learning from the deformed series of information while passing through a DAB. We conduct extensive experiments on 6 MTS data sets, using previously established benchmarks as well as challenging infectious disease modelling tasks with more exogenous variables. The results demonstrate that DeformTime improves accuracy against previous competitive methods across the vast majority of MTS forecasting tasks, reducing the mean absolute error by 10% on average. Notably, performance gains remain consistent across longer forecasting horizons.

[LG-26] Beware of Aliases – Signal Preservation is Crucial for Robust Image Restoration

Link: https://arxiv.org/abs/2406.07435
Authors: Shashank Agnihotri,Julia Grabinski,Janis Keuper,Margret Keuper
Keywords: aggregating image content, responsible for aggregating, content from noisy, restore clean, Image restoration networks
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: Tags: Adversarial attack, image restoration, image deblurring, frequency sampling

Abstract:Image restoration networks are usually comprised of an encoder and a decoder, responsible for aggregating image content from noisy, distorted data and for restoring clean, undistorted images, respectively. Data aggregation as well as high-resolution image generation both usually come at the risk of involving aliases, i.e., standard architectures put their ability to reconstruct the model input in jeopardy to reach high PSNR values on validation data. The price to be paid is low model robustness. In this work, we show that simply providing alias-free paths in state-of-the-art reconstruction transformers supports improved model robustness at low costs on the restoration performance. We do so by proposing BOA-Restormer, a transformer-based image restoration model that executes downsampling and upsampling operations partly in the frequency domain to ensure alias-free paths along the entire model while potentially preserving all relevant high-frequency information.

[LG-27] GemNet: Menu-Based Strategy-Proof Multi-Bidder Auctions Through Deep Learning

Link: https://arxiv.org/abs/2406.07428
Authors: Tonghan Wang, Yanchen Jiang, David C. Parkes
Keywords: Differentiable economics, automated mechanism design, economics uses deep, deep learning, learning for automated
Subjects: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Differentiable economics uses deep learning for automated mechanism design. Despite strong progress, it has remained an open problem to learn multi-bidder, general, and fully strategy-proof (SP) auctions. We introduce GEneral Menu-based NETwork (GemNet), which significantly extends the menu-based approach of RochetNet [Dütting et al., 2023] to the multi-bidder setting. The challenge in achieving SP is to learn bidder-independent menus that are feasible, so that the optimal menu choices for each bidder do not over-allocate items when taken together (we call this menu compatibility). GemNet penalizes the failure of menu compatibility during training, and transforms learned menus after training through price changes, by considering a set of discretized bidder values and reasoning about Lipschitz smoothness to guarantee menu compatibility on the entire value space. This approach is general, leaving undisturbed trained menus that already satisfy menu compatibility and reducing to RochetNet for a single bidder. Mixed-integer linear programs are used for menu transforms and through a number of optimizations, including adaptive grids and methods to skip menu elements, we scale to large auction design problems. GemNet learns auctions with better revenue than affine maximization methods, achieves exact SP whereas previous general multi-bidder methods are approximately SP, and offers greatly enhanced interpretability.

[LG-28] Beyond ELBOs: A Large-Scale Evaluation of Variational Methods for Sampling

Link: https://arxiv.org/abs/2406.07423
Authors: Denis Blessing, Xiaogang Jia, Johannes Esslinger, Francisco Vargas, Gerhard Neumann
Keywords: Monte Carlo methods, Variational Inference, Monte Carlo, intractable probability distributions, probability distributions
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

Abstract:Monte Carlo methods, Variational Inference, and their combinations play a pivotal role in sampling from intractable probability distributions. However, current studies lack a unified evaluation framework, relying on disparate performance measures and limited method comparisons across diverse tasks, complicating the assessment of progress and hindering the decision-making of practitioners. In response to these challenges, our work introduces a benchmark that evaluates sampling methods using a standardized task suite and a broad range of performance criteria. Moreover, we study existing metrics for quantifying mode collapse and introduce novel metrics for this purpose. Our findings provide insights into strengths and weaknesses of existing sampling methods, serving as a valuable reference for future developments. The code is publicly available here.

[LG-29] Enhanced Gene Selection in Single-Cell Genomics: Pre-Filtering Synergy and Reinforced Optimization

Link: https://arxiv.org/abs/2406.07418
Authors: Weiliang Zhang, Zhen Meng, Dongjie Wang, Min Wu, Kunpeng Liu, Yuanchun Zhou, Meng Xiao
Keywords: interpret complex biological, biological data effectively, complex biological data, Recent advancements, genomics necessitate precision
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Genomics (q-bio.GN)
Comments: 25 pages

Abstract:Recent advancements in single-cell genomics necessitate precision in gene panel selection to interpret complex biological data effectively. Those methods aim to streamline the analysis of scRNA-seq data by focusing on the most informative genes that contribute significantly to the specific analysis task. Traditional selection methods, which often rely on expert domain knowledge, embedded machine learning models, or heuristic-based iterative optimization, are prone to biases and inefficiencies that may obscure critical genomic signals. Recognizing the limitations of traditional methods, we aim to transcend these constraints with a refined strategy. In this study, we introduce an iterative gene panel selection strategy that is applicable to clustering tasks in single-cell genomics. Our method uniquely integrates results from other gene selection algorithms, providing valuable preliminary boundaries or prior knowledge as initial guides in the search space to enhance the efficiency of our framework. Furthermore, we incorporate the stochastic nature of the exploration process in reinforcement learning (RL) and its capability for continuous optimization through reward-based feedback. This combination mitigates the biases inherent in the initial boundaries and harnesses RL’s adaptability to refine and target gene panel selection dynamically. To illustrate the effectiveness of our method, we conducted detailed comparative experiments, case studies, and visualization analysis.

[LG-30] Holistic Memory Diversification for Incremental Learning in Growing Graphs

Link: https://arxiv.org/abs/2406.07413
Authors: Ziyue Qiao, Junren Xiao, Qingqiang Sun, Meng Xiao, Hui Xiong
Keywords: increasingly complex tasks, paper addresses, addresses the challenge, increasingly complex, memory
Subjects: Machine Learning (cs.LG)

Abstract:This paper addresses the challenge of incremental learning in growing graphs with increasingly complex tasks. The goal is to continually train a graph model to handle new tasks while retaining its inference ability on previous tasks. Existing methods usually neglect the importance of memory diversity, which limits their ability to select high-quality memory from previous tasks and to retain broad previous knowledge within the scarce memory on graphs. To address that, we introduce a novel holistic Diversified Memory Selection and Generation (DMSG) framework for incremental learning in graphs, which first introduces a buffer selection strategy that considers both intra-class and inter-class diversities, employing an efficient greedy algorithm for sampling representative training nodes from graphs into memory buffers after learning each new task. Then, to adequately rememorize the knowledge preserved in the memory buffer when learning new tasks, we propose a diversified memory generation replay method. This method first utilizes a variational layer to generate the distribution of buffer node embeddings and sample synthesized ones for replaying. Furthermore, an adversarial variational embedding learning method and a reconstruction-based decoder are proposed to maintain the integrity and consolidate the generalization of the synthesized node embeddings, respectively. Finally, we evaluate our model on node classification tasks involving increasing class numbers. Extensive experimental results on publicly accessible datasets demonstrate the superiority of DMSG over state-of-the-art methods.

[LG-31] Private Geometric Median

Link: https://arxiv.org/abs/2406.07407
Authors: Mahdi Haghifam, Thomas Steinke, Jonathan Ullman
Keywords: study differentially private, Euclidean distances, excess error guarantee, geometric median, study differentially
Subjects: Machine Learning (cs.LG)
Comments: 36 pages

Abstract:In this paper, we study differentially private (DP) algorithms for computing the geometric median (GM) of a dataset: Given $n$ points, $x_1,\dots,x_n$ in $\mathbb{R}^d$, the goal is to find a point $\theta$ that minimizes the sum of the Euclidean distances to these points, i.e., $\sum_{i=1}^{n} \|\theta - x_i\|_2$. Off-the-shelf methods, such as DP-GD, require strong a priori knowledge locating the data within a ball of radius $R$, and the excess risk of the algorithm depends linearly on $R$. In this paper, we ask: can we design an efficient and private algorithm with an excess error guarantee that scales with the (unknown) radius containing the majority of the datapoints? Our main contribution is a pair of polynomial-time DP algorithms for the task of private GM with an excess error guarantee that scales with the effective diameter of the datapoints. Additionally, we propose an inefficient algorithm based on the inverse smooth sensitivity mechanism, which satisfies the more restrictive notion of pure DP. We complement our results with a lower bound and demonstrate the optimality of our polynomial-time algorithms in terms of sample complexity.
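
As a reader's aid, the geometric median objective defined in this abstract can be illustrated with a small non-private sketch. The code below uses the classical Weiszfeld iteration, which is a swapped-in textbook technique and not one of the paper's DP algorithms; the function names, the iteration count, and the tolerance `eps` are illustrative assumptions.

```python
import math


def objective(theta, points):
    """Sum of Euclidean distances from theta to every point (the GM objective)."""
    return sum(math.dist(theta, p) for p in points)


def geometric_median(points, iters=200, eps=1e-9):
    """Approximate the NON-private geometric median with Weiszfeld's iteration.

    Assumption: this is a generic illustration of the objective only; it gives
    no differential-privacy guarantee.
    """
    dim = len(points[0])
    # Start from the coordinate-wise mean.
    theta = [sum(p[k] for p in points) / len(points) for k in range(dim)]
    for _ in range(iters):
        # Each point is weighted by the inverse of its distance to theta;
        # the max() guards against division by zero on an exact hit.
        weights = [1.0 / max(math.dist(theta, p), eps) for p in points]
        total = sum(weights)
        new_theta = [
            sum(w * p[k] for w, p in zip(weights, points)) / total
            for k in range(dim)
        ]
        moved = math.dist(new_theta, theta)
        theta = new_theta
        if moved < eps:
            break
    return theta


if __name__ == "__main__":
    # Three clustered points and one far outlier (made-up toy data).
    pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (100.0, 100.0)]
    print(geometric_median(pts))
```

On toy data like this, the estimate stays near the cluster while the mean is dragged toward the outlier, which is one way to see why an error guarantee tied to the radius containing most of the points is a natural target.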

[LG-32] Enhancing Tabular Data Optimization with a Flexible Graph-based Reinforced Exploration Strategy

Link: https://arxiv.org/abs/2406.07404
Authors: Xiaohan Huang, Dongjie Wang, Zhiyuan Ning, Ziyue Qiao, Qingqing Long, Haowei Zhu, Min Wu, Yuanchun Zhou, Meng Xiao
Keywords: Tabular data optimization, machine learning tasks, downstream machine learning, Tabular data, optimization methods aim
Subjects: Machine Learning (cs.LG)
Comments: 17 pages

Abstract:Tabular data optimization methods aim to automatically find an optimal feature transformation process that generates high-value features and improves the performance of downstream machine learning tasks. Current frameworks for automated feature transformation rely on iterative sequence generation tasks, optimizing decision strategies through performance feedback from downstream tasks. However, these approaches fail to effectively utilize historical decision-making experiences and overlook potential relationships among generated features, thus limiting the depth of knowledge extraction. Moreover, the granularity of the decision-making process lacks dynamic backtracking capabilities for individual features, leading to insufficient adaptability when encountering inefficient pathways, adversely affecting overall robustness and exploration efficiency. To address the limitations observed in current automatic feature engineering frameworks, we introduce a novel method that utilizes a feature-state transformation graph to effectively preserve the entire feature transformation journey, where each node represents a specific transformation state. During exploration, three cascading agents iteratively select nodes and ideal mathematical operations to generate new transformation states. This strategy leverages the inherent properties of the graph structure, allowing for the preservation and reuse of valuable transformations. It also enables backtracking capabilities through graph pruning techniques, which can rectify inefficient transformation paths. To validate the efficacy and flexibility of our approach, we conducted comprehensive experiments and detailed case studies, demonstrating superior performance in diverse scenarios.

[LG-33] A Survey on Recent Random Walk-based Methods for Embedding Knowledge Graphs

Link: https://arxiv.org/abs/2406.07402
Authors: Elika Bozorgi, Sakher Khalil Alqaiidi, Afsaneh Shams, Hamid Reza Arabnia, Krzysztof Kochut
Keywords: social media platforms, Machine learning, deep learning, NLP methods, knowledge graphs
Subjects: Machine Learning (cs.LG)

Abstract:Machine learning, deep learning, and NLP methods on knowledge graphs are present in different fields and have important roles in various domains from self-driving cars to friend recommendations on social media platforms. However, to apply these methods to knowledge graphs, the data usually needs to be in an acceptable size and format. In fact, knowledge graphs normally have high dimensions and therefore we need to transform them to a low-dimensional vector space. An embedding is a low-dimensional space into which you can translate high dimensional vectors in a way that intrinsic features of the input data are preserved. In this review, we first explain knowledge graphs and their embedding and then review some of the random walk-based embedding methods that have been developed recently.
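
To make the idea of random walk-based embedding more concrete, here is a minimal sketch of the walk-sampling step that such methods typically start from. It is a generic uniform random walk in the spirit of DeepWalk-style approaches, not a method taken from the survey; the toy graph, the entity names, and all parameter names are hypothetical.

```python
import random


def random_walks(adjacency, walk_length, walks_per_node, seed=0):
    """Sample uniform random walks from every node of a directed graph.

    adjacency maps each node to a list of its out-neighbors. The resulting
    walks are usually fed to a sequence model (for example, skip-gram) to
    learn low-dimensional node vectors; that training step is omitted here.
    """
    rng = random.Random(seed)
    walks = []
    for start in adjacency:
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_length:
                neighbors = adjacency.get(walk[-1], [])
                if not neighbors:
                    break  # Dead end: stop this walk early.
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks


# Hypothetical toy knowledge graph with relation labels dropped.
toy_graph = {
    "Paris": ["France"],
    "France": ["Paris", "Europe"],
    "Europe": ["France"],
}

if __name__ == "__main__":
    for walk in random_walks(toy_graph, walk_length=4, walks_per_node=2):
        print(" -> ".join(walk))
```

Nodes that co-occur often in these walks end up with nearby vectors after training, which is how the walks preserve structural features of the graph in a low-dimensional space.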

[LG-34] Guiding LLM Temporal Logic Generation with Explicit Separation of Data and Control

Link: https://arxiv.org/abs/2406.07400
Authors: William Murphy, Nikolaus Holzer, Nathan Koenig, Leyi Cui, Raven Rothkopf, Feitong Qiao, Mark Santolucito
Keywords: Large Language Models, powerful tools, Language Models, temporal logic specification, Temporal logics
Subjects: Machine Learning (cs.LG); Logic in Computer Science (cs.LO)

Abstract:Temporal logics are powerful tools that are widely used for the synthesis and verification of reactive systems. The recent progress on Large Language Models (LLMs) has the potential to make the process of writing such specifications more accessible. However, writing specifications in temporal logics remains challenging for all but the most expert users. A key question in using LLMs for temporal logic specification engineering is to understand what kind of guidance is most helpful to the LLM and the users to easily produce specifications. Looking specifically at the problem of reactive program synthesis, we explore the impact of providing an LLM with guidance on the separation of control and data–making explicit for the LLM what functionality is relevant for the specification, and treating the remaining functionality as an implementation detail for a series of pre-defined functions and predicates. We present a benchmark set and find that this separation of concerns improves specification generation. Our benchmark provides a test set against which to verify future work in LLM generation of temporal logic specifications.

[LG-35] Redefining Automotive Radar Imaging: A Domain-Informed 1D Deep Learning Approach for High-Resolution and Efficient Performance

Link: https://arxiv.org/abs/2406.07399
Authors: Ruxin Zheng, Shunqiao Sun, Holger Caesar, Honglei Chen, Jian Li
Keywords: challenging weather conditions, autonomous vehicles, weather conditions, indispensable for perception, perception tasks
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP)

Abstract:Millimeter-wave (mmWave) radars are indispensable for perception tasks of autonomous vehicles, thanks to their resilience in challenging weather conditions. Yet, their deployment is often limited by insufficient spatial resolution for precise semantic scene interpretation. Classical super-resolution techniques adapted from optical imaging inadequately address the distinct characteristics of radar signal data. In response, our study redefines radar imaging super-resolution as a one-dimensional (1D) signal super-resolution spectra estimation problem by harnessing the radar signal processing domain knowledge, introducing innovative data normalization and a domain-informed signal-to-noise ratio (SNR)-guided loss function. Our tailored deep learning network for automotive radar imaging exhibits remarkable scalability, parameter efficiency and fast inference speed, alongside enhanced performance in terms of radar imaging quality and resolution. Extensive testing confirms that our SR-SPECNet sets a new benchmark in producing high-resolution radar range-azimuth images, outperforming existing methods across varied antenna configurations and dataset sizes. Source code and new radar dataset will be made publicly available online.

[LG-36] Visual Representation Learning with Stochastic Frame Prediction

Link: https://arxiv.org/abs/2406.07398
Authors: Huiwon Jang, Dongyoung Kim, Junsu Kim, Jinwoo Shin, Pieter Abbeel, Younggyo Seo
Keywords: Self-supervised learning, predicting future frames, promising direction, frame prediction, Self-supervised
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: International Conference on Machine Learning (ICML) 2024

Abstract:Self-supervised learning of image representations by predicting future frames is a promising direction but still remains a challenge. This is because of the under-determined nature of frame prediction; multiple potential futures can arise from a single current frame. To tackle this challenge, in this paper, we revisit the idea of stochastic video generation that learns to capture uncertainty in frame prediction and explore its effectiveness for representation learning. Specifically, we design a framework that trains a stochastic frame prediction model to learn temporal information between frames. Moreover, to learn dense information within each frame, we introduce an auxiliary masked image modeling objective along with a shared decoder architecture. We find this architecture allows for combining both objectives in a synergistic and compute-efficient manner. We demonstrate the effectiveness of our framework on a variety of tasks from video label propagation and vision-based robot learning domains, such as video segmentation, pose tracking, vision-based robotic locomotion, and manipulation tasks. Code is available on the project webpage: this https URL.

[LG-37] Robust Image Semantic Coding with Learnable CSI Fusion Masking over MIMO Fading Channels

Link: https://arxiv.org/abs/2406.07389
Authors: Bingyan Xie, Yongpeng Wu, Yuxuan Shi, Wenjun Zhang, Shuguang Cui, Merouane Debbah
Keywords: single-input single-output Gaussian, single-output Gaussian channels, widely-used multiple-input multiple-output, Rayleigh fading channels, achieving marvelous progress
Subjects: Information Theory (cs.IT); Machine Learning (cs.LG)
Comments: This paper has been accepted by IEEE Transactions on Wireless Communications

Abstract:Though achieving marvelous progress in various scenarios, existing semantic communication frameworks mainly consider single-input single-output Gaussian channels or Rayleigh fading channels, neglecting the widely-used multiple-input multiple-output (MIMO) channels, which hinders their application in practical systems. One common solution to combat MIMO fading is to utilize feedback MIMO channel state information (CSI). In this paper, we incorporate MIMO CSI into system designs from a new perspective and propose the learnable CSI fusion semantic communication (LCFSC) framework, where CSI is treated as side information by the semantic extractor to enhance the semantic coding. To avoid feature fusion due to abrupt combination of CSI with features, we present a non-invasive CSI fusion multi-head attention module inside the Swin Transformer. With the learned attention masking map determined by both source and channel states, a more robust attention distribution can be generated. Furthermore, the percentage of mask elements can be flexibly adjusted by the learnable mask ratio, which is produced based on conditional variational inference in an unsupervised manner. In this way, CSI-aware semantic coding is achieved through learnable CSI fusion masking. Experimental results demonstrate the superiority of LCFSC over traditional schemes and state-of-the-art Swin Transformer-based semantic communication frameworks in MIMO fading channels.

[LG-38] Machine Learning-Based Channel Prediction for RIS-assisted MIMO Systems With Channel Aging

Link: https://arxiv.org/abs/2406.07387
Authors: Nipuni Ginige, Arthur Sousa de Sena, Nurul Huda Mahmood, Nandana Rajatheva, Matti Latva-aho
Keywords: Reconfigurable intelligent surfaces, Reconfigurable intelligent, intelligent surfaces, performance of sixth-generation, promising technology
Subjects: Information Theory (cs.IT); Machine Learning (cs.LG); Signal Processing (eess.SP)

Abstract:Reconfigurable intelligent surfaces (RISs) have emerged as a promising technology to enhance the performance of sixth-generation (6G) and beyond communication systems. The passive nature of RISs and their large number of reflecting elements pose challenges to the channel estimation process. The associated complexity further escalates when the channel coefficients are fast-varying as in scenarios with user mobility. In this paper, we propose an extended channel estimation framework for RIS-assisted multiple-input multiple-output (MIMO) systems based on a convolutional neural network (CNN) integrated with an autoregressive (AR) predictor. The implemented framework is designed for identifying the aging pattern and predicting enhanced estimates of the wireless channels in correlated fast-fading environments. Insightful simulation results demonstrate that our proposed CNN-AR approach is robust to channel aging, exhibiting a high-precision estimation accuracy. The results also show that our approach can achieve high spectral efficiency and low pilot overhead compared to traditional methods.

[LG-39] World Models with Hints of Large Language Models for Goal Achieving

Link: https://arxiv.org/abs/2406.07381
Authors: Zeyuan Liu, Ziyu Huan, Xiyao Wang, Jiafei Lyu, Jian Tao, Xiu Li, Furong Huang, Huazhe Xu
Keywords: Reinforcement learning struggles, manual reward specification, Reinforcement learning, sparse goals due, learning struggles
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Reinforcement learning struggles in the face of long-horizon tasks and sparse goals due to the difficulty in manual reward specification. While existing methods address this by adding intrinsic rewards, they may fail to provide meaningful guidance in long-horizon decision-making tasks with large state and action spaces, lacking purposeful exploration. Inspired by human cognition, we propose a new multi-modal model-based RL approach named Dreaming with Large Language Models (DLLM). DLLM integrates the proposed hinting subgoals from the LLMs into the model rollouts to encourage goal discovery and reaching in challenging tasks. By assigning higher intrinsic rewards to samples that align with the hints outlined by the language model during model rollouts, DLLM guides the agent toward meaningful and efficient exploration. Extensive experiments demonstrate that the DLLM outperforms recent methods in various challenging, sparse-reward environments such as HomeGrid, Crafter, and Minecraft by 27.7%, 21.1%, and 9.9%, respectively.

[LG-40] Improving the realism of robotic surgery simulation through injection of learning-based estimated errors

Link: https://arxiv.org/abs/2406.07375
Authors: Juan Antonio Barragan, Hisashi Ishida, Adnan Munawar, Peter Kazanzides
Keywords: realistic simulation environments, physical robot, development of algorithms, algorithms for automation, automation of subtasks
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: 6 page paper

Abstract:The development of algorithms for automation of subtasks during robotic surgery can be accelerated by the availability of realistic simulation environments. In this work, we focus on one aspect of the realism of a surgical simulator, which is the positional accuracy of the robot. In current simulators, robots have perfect or near-perfect accuracy, which is not representative of their physical counterparts. We therefore propose a pair of neural networks, trained by data collected from a physical robot, to estimate both the controller error and the kinematic and non-kinematic error. These error estimates are then injected within the simulator to produce a simulated robot that has the characteristic performance of the physical robot. In this scenario, we believe it is sufficient for the estimated error used in the simulation to have a statistically similar distribution to the actual error of the physical robot. This is less stringent, and therefore more tenable, than the requirement for error compensation of a physical robot, where the estimated error should equal the actual error. Our results demonstrate that error injection reduces the mean position and orientation differences between the simulated and physical robots from 5.0 mm / 3.6 deg to 1.3 mm / 1.7 deg, respectively, which represents reductions by factors of 3.8 and 2.1.

[LG-41] When Linear Attention Meets Autoregressive Decoding: Towards More Effective and Efficient Linearized Large Language Models

Link: https://arxiv.org/abs/2406.07368
Authors: Haoran You, Yichao Fu, Zheng Wang, Amir Yazdanbakhsh, Yingyan (Celine) Lin
Keywords: Autoregressive Large Language, Large Language Models, Large Language, limited efficiency due, achieved impressive performance
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted by ICML 2024; 17 pages; 10 figures; 16 tables

Abstract:Autoregressive Large Language Models (LLMs) have achieved impressive performance in language tasks but face two significant bottlenecks: (1) quadratic complexity in the attention module as the number of tokens increases, and (2) limited efficiency due to the sequential processing nature of autoregressive LLMs during generation. While linear attention and speculative decoding offer potential solutions, their applicability and synergistic potential for enhancing autoregressive LLMs remain uncertain. We conduct the first comprehensive study on the efficacy of existing linear attention methods for autoregressive LLMs, integrating them with speculative decoding. We introduce an augmentation technique for linear attention that ensures compatibility with speculative decoding, enabling more efficient training and serving of LLMs. Extensive experiments and ablation studies involving seven existing linear attention models and five encoder/decoder-based LLMs consistently validate the effectiveness of our augmented linearized LLMs. Notably, our approach achieves up to a 6.67 reduction in perplexity on the LLaMA model and up to a 2× speedup during generation compared to prior linear attention methods. Codes and models are available at this https URL.
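
For readers unfamiliar with the first bottleneck named in this abstract, the sketch below shows a generic kernelized causal linear attention, in which the softmax score is replaced by a feature-map product so that each new token only updates running sums. It is a textbook-style illustration, not the augmentation proposed in the paper; the `elu(x) + 1` feature map and all function names are assumptions chosen for the example.

```python
import math


def feature_map(x):
    """elu(x) + 1, a common positive feature map (an assumption here)."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def causal_linear_attention(queries, keys, values):
    """Causal attention computed with running sums: linear in sequence length."""
    d_k, d_v = len(keys[0]), len(values[0])
    state = [[0.0] * d_v for _ in range(d_k)]  # running sum of phi(k) v^T
    norm = [0.0] * d_k                         # running sum of phi(k)
    outputs = []
    for q, k, v in zip(queries, keys, values):
        phi_k = feature_map(k)
        for a in range(d_k):
            norm[a] += phi_k[a]
            for b in range(d_v):
                state[a][b] += phi_k[a] * v[b]
        phi_q = feature_map(q)
        denom = sum(phi_q[a] * norm[a] for a in range(d_k))
        outputs.append(
            [sum(phi_q[a] * state[a][b] for a in range(d_k)) / denom
             for b in range(d_v)]
        )
    return outputs


def quadratic_reference(queries, keys, values):
    """Same attention computed pairwise (quadratic cost), for comparison."""
    outputs = []
    for i, q in enumerate(queries):
        phi_q = feature_map(q)
        scores = [
            sum(a * b for a, b in zip(phi_q, feature_map(keys[j])))
            for j in range(i + 1)
        ]
        total = sum(scores)
        outputs.append(
            [sum(s * values[j][b] for j, s in enumerate(scores)) / total
             for b in range(len(values[0]))]
        )
    return outputs
```

Both functions compute the same outputs; the first keeps only a fixed-size state per step, which is why this style of attention is attractive for token-by-token generation.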

[LG-42] Deep Implicit Optimization for Robust and Flexible Image Registration

Link: https://arxiv.org/abs/2406.07361
Authors: Rohit Jena, Pratik Chaudhari, James C. Gee
Keywords: incorporate weak label, weak label supervision, DLIR methods forego, image registration due, tremendously successful
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)

Abstract:Deep Learning in Image Registration (DLIR) methods have been tremendously successful in image registration due to their speed and ability to incorporate weak label supervision at training time. However, DLIR methods forego many of the benefits of classical optimization-based methods. The functional nature of deep networks does not guarantee that the predicted transformation is a local minimum of the registration objective, the representation of the transformation (displacement/velocity field/affine) is fixed, and the networks are not robust to domain shift. Our method aims to bridge this gap between classical and learning methods by incorporating optimization as a layer in a deep network. A deep network is trained to predict multi-scale dense feature images that are registered using a black box iterative optimization solver. This optimal warp is then used to minimize image and label alignment errors. By implicitly differentiating end-to-end through an iterative optimization solver, our learned features are registration and label-aware, and the warp functions are guaranteed to be local minima of the registration objective in the feature space. Our framework shows excellent performance on in-domain datasets, and is agnostic to domain shift such as anisotropy and varying intensity profiles. For the first time, our method allows switching between arbitrary transformation representations (free-form to diffeomorphic) at test time with zero retraining. End-to-end feature learning also facilitates interpretability of features, and out-of-the-box promptability using additional label-fidelity terms at inference.

[LG-43] AI Sandbagging: Language Models can Strategically Underperform on Evaluations

Link: https://arxiv.org/abs/2406.07358
Authors: Teun van der Weij, Felix Hofstätter, Ollie Jaffe, Samuel F. Brown, Francis Rhys Ward
Keywords: Trustworthy capability evaluations, Trustworthy capability, crucial for ensuring, key component, capability evaluations
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: We publish our code and results at this https URL

Abstract:Trustworthy capability evaluations are crucial for ensuring the safety of AI systems, and are becoming a key component of AI regulation. However, the developers of an AI system, or the AI system itself, may have incentives for evaluations to understate the AI’s actual capability. These conflicting interests lead to the problem of sandbagging – which we define as “strategic underperformance on an evaluation”. In this paper we assess sandbagging capabilities in contemporary language models (LMs). We prompt frontier LMs, like GPT-4 and Claude 3 Opus, to selectively underperform on dangerous capability evaluations, while maintaining performance on general (harmless) capability evaluations. Moreover, we find that models can be fine-tuned, on a synthetic dataset, to hide specific capabilities unless given a password. This behaviour generalizes to high-quality, held-out benchmarks such as WMDP. In addition, we show that both frontier and smaller models can be prompted, or password-locked, to target specific scores on a capability evaluation. Even more, we found that a capable password-locked model (Llama 3 70b) is reasonably able to emulate a less capable model (Llama 2 7b). Overall, our results suggest that capability evaluations are vulnerable to sandbagging. This vulnerability decreases the trustworthiness of evaluations, and thereby undermines important safety decisions regarding the development and deployment of advanced AI systems.

[LG-44] DR-RAG: Applying Dynamic Document Relevance to Retrieval-Augmented Generation for Question-Answering

Link: https://arxiv.org/abs/2406.07348
Authors: Zijian Hei, Weiling Wei, Wenjie Ou, Juyi Qiao, Junming Jiao, Zhiqing Zhu, Guowen Song
Keywords: Large Language Models, Language Models, Large Language, performance of Large, knowledge-intensive tasks
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)

Abstract:Retrieval-Augmented Generation (RAG) has significantly improved the performance of Large Language Models (LLMs) on knowledge-intensive tasks, such as Question-Answering (QA). RAG expands the query context by incorporating external knowledge bases to enhance the response accuracy. However, it would be inefficient to access LLMs multiple times for each query and unreliable to retrieve all the relevant documents by a single query. We find that even though there is low relevance between some critical documents and the query, it is possible to retrieve the remaining documents by combining parts of the documents with the query. To mine the relevance, a two-stage retrieval framework called Dynamic-Relevant Retrieval-Augmented Generation (DR-RAG) is proposed to improve document retrieval recall and the accuracy of answers while maintaining efficiency. Also, a small classifier is applied to two different selection strategies to determine the contribution of the retrieved documents to answering the query and retrieve the relatively relevant documents. Meanwhile, DR-RAG calls the LLMs only once, which significantly improves the efficiency of the experiment. The experimental results on multi-hop QA datasets show that DR-RAG can significantly improve the accuracy of the answers and achieve new progress in QA systems.

[LG-45] Transferring Knowledge from Large Foundation Models to Small Downstream Models

Link: https://arxiv.org/abs/2406.07337
Authors: Shikai Qiu,Boran Han,Danielle C. Maddix,Shuai Zhang,Yuyang Wang,Andrew Gordon Wilson
Keywords: larger foundation models, AFT, relevant knowledge, larger foundation, pre-trained
Subjects: Machine Learning (cs.LG)
Comments: ICML 2024. Code available at this https URL

Abstract:How do we transfer the relevant knowledge from ever larger foundation models into small, task-specific downstream models that can run at much lower costs? Standard transfer learning using pre-trained weights as the initialization transfers limited information and commits us to often massive pre-trained architectures. This procedure also precludes combining multiple pre-trained models that learn complementary information. To address these shortcomings, we introduce Adaptive Feature Transfer (AFT). Instead of transferring weights, AFT operates purely on features, thereby decoupling the choice of the pre-trained model from the smaller downstream model. Rather than indiscriminately compressing all pre-trained features, AFT adaptively transfers pre-trained features that are most useful for performing the downstream task, using a simple regularization that adds minimal overhead. Across multiple vision, language, and multi-modal datasets, AFT achieves significantly better downstream performance compared to alternatives with a similar computational cost. Furthermore, AFT reliably translates improvement in pre-trained models into improvement in downstream performance, even if the downstream model is over 50× smaller, and can effectively transfer complementary information learned by multiple pre-trained models.

[LG-46] Realistic Data Generation for 6D Pose Estimation of Surgical Instruments

Link: https://arxiv.org/abs/2406.07328
Authors: Juan Antonio Barragan,Jintan Zhang,Haoying Zhou,Adnan Munawar,Peter Kazanzides
Keywords: improve patient safety, robust perception algorithms, pose estimation, surgical, potential to improve
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: 6 pages

Abstract:Automation in surgical robotics has the potential to improve patient safety and surgical efficiency, but it is difficult to achieve due to the need for robust perception algorithms. In particular, 6D pose estimation of surgical instruments is critical to enable the automatic execution of surgical maneuvers based on visual feedback. In recent years, supervised deep learning algorithms have shown increasingly better performance at 6D pose estimation tasks; yet, their success depends on the availability of large amounts of annotated data. In household and industrial settings, synthetic data, generated with 3D computer graphics software, has been shown as an alternative to minimize annotation costs of 6D pose datasets. However, this strategy does not translate well to surgical domains as commercial graphics software have limited tools to generate images depicting realistic instrument-tissue interactions. To address these limitations, we propose an improved simulation environment for surgical robotics that enables the automatic generation of large and diverse datasets for 6D pose estimation of surgical instruments. Among the improvements, we developed an automated data generation pipeline and an improved surgical scene. To show the applicability of our system, we generated a dataset of 7.5k images with pose annotations of a surgical needle that was used to evaluate a state-of-the-art pose estimation network. The trained model obtained a mean translational error of 2.59mm on a challenging dataset that presented varying levels of occlusion. These results highlight our pipeline’s success in training and evaluating novel vision algorithms for surgical robotics applications.

[LG-47] 3D-Properties: Identifying Challenges in DPO and Charting a Path Forward

Link: https://arxiv.org/abs/2406.07327
Authors: Yuzi Yan,Yibo Miao,Jialian Li,Yipin Zhang,Jian Xie,Zhijie Deng,Dong Yan
Keywords: Direct Preference Optimization, Aligning large language, gained tremendous attention, straightforward Direct Preference, recently gained tremendous
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:Aligning large language models (LLMs) with human preference has recently gained tremendous attention, with the canonical yet costly RLHF-PPO and the simple and straightforward Direct Preference Optimization (DPO) as two examples. Despite its efficiency, DPO has rarely been used in state-of-the-art production-level LLMs, implying its potential pathologies. In this work, we revisit DPO with a comprehensive examination of its empirical efficacy and a systematic comparison with RLHF-PPO. We identify the 3D-properties of DPO’s learning outcomes: the Drastic drop in the likelihood of rejected responses, the Degradation into LLM unlearning, and the Dispersion effect on unseen responses, through experiments with both a carefully designed toy model and practical LLMs on tasks including mathematical problem-solving and instruction following. These findings inherently connect to some observations made by related works and we additionally contribute a plausible theoretical explanation for them. Accordingly, we propose easy regularization methods to mitigate the issues caused by the 3D-properties, improving the training stability and final performance of DPO. Our contributions also include an investigation into how the distribution of the paired preference data impacts the effectiveness of DPO. We hope this work could offer research directions to narrow the gap between reward-free preference learning methods and reward-based ones.
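
For readers unfamiliar with the objective under discussion, here is a minimal sketch of the standard per-example DPO loss. It assumes scalar sequence log-probabilities as inputs, the function and argument names are illustrative, and it does not implement the regularization methods this paper proposes. Because the loss depends only on the margin between the chosen and rejected log-ratios, it can fall even while the chosen response's likelihood falls, as long as the rejected response's likelihood falls faster.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    # Log-ratios of the policy against the frozen reference model.
    chosen_ratio = logp_chosen - ref_chosen
    rejected_ratio = logp_rejected - ref_rejected
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(margin)), written as softplus(-margin).
    return math.log1p(math.exp(-margin))


# Policy identical to the reference: the margin is zero.
initial = dpo_loss(-10.0, -10.0, -10.0, -10.0)

# Both likelihoods dropped, but the rejected one dropped much more.
later = dpo_loss(-12.0, -30.0, -10.0, -10.0)
```

Here `later` is smaller than `initial` even though the chosen response became less likely, which is one way to see how a sharp drop in rejected-response likelihood can dominate training.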

[LG-48] Beyond Training: Optimizing Reinforcement Learning Based Job Shop Scheduling Through Adaptive Action Sampling

Link: https://arxiv.org/abs/2406.07325
Authors: Constantin Waubert de Puiseau,Christian Dörpelkus,Jannik Peters,Hasan Tercan,Tobias Meisen
Keywords: Learned construction heuristics, trained DRL agents, Learned construction, recent years, increasingly competitive
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Presented Workshop Paper at ICAPS2024

Abstract:Learned construction heuristics for scheduling problems have become increasingly competitive with established solvers and heuristics in recent years. In particular, significant improvements have been observed in solution approaches using deep reinforcement learning (DRL). While much attention has been paid to the design of network architectures and training algorithms to achieve state-of-the-art results, little research has investigated the optimal use of trained DRL agents during inference. Our work is based on the hypothesis that, similar to search algorithms, the utilization of trained DRL agents should be dependent on the acceptable computational budget. We propose a simple yet effective parameterization, called δ-sampling, that manipulates the trained action vector to bias agent behavior towards exploration or exploitation during solution construction. By following this approach, we can achieve a more comprehensive coverage of the search space while still generating an acceptable number of solutions. In addition, we propose an algorithm for obtaining the optimal parameterization for such a given number of solutions and any given trained agent. Experiments extending existing training protocols for job shop scheduling problems with our inference method validate our hypothesis and result in the expected improvements of the generated solutions.

[LG-49] Rethinking the impact of noisy labels in graph classification: A utility and privacy perspective

Link: https://arxiv.org/abs/2406.07314
Authors: De Li,Xianxian Li,Zeming Gan,Qiyu Li,Bin Qu,Jinyan Wang
Keywords: achieved advanced results, graph classification, Graph, graph classification tasks, based on message-passing
Subjects: Machine Learning (cs.LG)

Abstract:Graph neural networks based on message-passing mechanisms have achieved advanced results in graph classification tasks. However, their generalization performance degrades when noisy labels are present in the training data. Most existing noisy labeling approaches focus on the visual domain or graph node classification tasks and analyze the impact of noisy labels only from a utility perspective. Unlike existing work, in this paper, we measure the effects of noise labels on graph classification from data privacy and model utility perspectives. We find that noise labels degrade the model’s generalization performance and enhance the ability of membership inference attacks on graph data privacy. To this end, we propose the robust graph neural network approach with noisy labeled graph classification. Specifically, we first accurately filter the noisy samples by high-confidence samples and the first feature principal component vector of each class. Then, the robust principal component vectors and the model output under data augmentation are utilized to achieve noise label correction guided by dual spatial information. Finally, supervised graph contrastive learning is introduced to enhance the embedding quality of the model and protect the privacy of the training graph data. The utility and privacy of the proposed method are validated by comparing twelve different methods on eight real graph classification datasets. Compared with the state-of-the-art methods, the RGLC method achieves at most and at least 7.8% and 0.8% performance gain at 30% noisy labeling rate, respectively, and reduces the accuracy of privacy attacks to below 60%.

[LG-50] BertaQA: How Much Do Language Models Know About Local Culture?

Link: https://arxiv.org/abs/2406.07302
Authors: Julen Etxaniz,Gorka Azkune,Aitor Soroa,Oier Lopez de Lacalle,Mikel Artetxe
Keywords: Large Language Models, exhibit extensive knowledge, Large Language, exhibit extensive, anglocentric subjects
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Large Language Models (LLMs) exhibit extensive knowledge about the world, but most evaluations have been limited to global or anglocentric subjects. This raises the question of how well these models perform on topics relevant to other cultures, whose presence on the web is not that prominent. To address this gap, we introduce BertaQA, a multiple-choice trivia dataset that is parallel in English and Basque. The dataset consists of a local subset with questions pertinent to the Basque culture, and a global subset with questions of broader interest. We find that state-of-the-art LLMs struggle with local cultural knowledge, even as they excel on global topics. However, we show that continued pre-training in Basque significantly improves the models’ performance on Basque culture, even when queried in English. To our knowledge, this is the first solid evidence of knowledge transfer from a low-resource to a high-resource language. Our analysis sheds light on the complex interplay between language and knowledge, and reveals that some prior findings do not fully hold when reassessed on local topics. Our dataset and evaluation code are available under open licenses at this https URL.

[LG-51] Multi-objective Reinforcement learning from AI Feedback

Link: https://arxiv.org/abs/2406.07295
Authors: Marcus Williams
Keywords: Multi-Objective Reinforcement Learning, Reinforcement Learning, presents Multi-Objective Reinforcement, paper presents Multi-Objective, Multi-Objective Reinforcement
Subjects: Machine Learning (cs.LG)

Abstract:This paper presents Multi-Objective Reinforcement Learning from AI Feedback (MORLAIF), a novel approach to improving the alignment and performance of language models trained using reinforcement learning from AI feedback (RLAIF). In contrast to standard approaches that train a single preference model to represent all human preferences, MORLAIF decomposes this task into multiple simpler principles, such as toxicity, factuality, and sycophancy. Separate preference models are trained for each principle using feedback from GPT-3.5-Turbo. These preference model scores are then combined using different scalarization functions to provide a reward signal for Proximal Policy Optimization (PPO) training of the target language model. Our experiments indicate that MORLAIF outperforms the standard RLAIF baselines and that MORLAIF can be used to align larger language models using smaller ones. Surprisingly, the choice of scalarization function does not appear to significantly impact the results.
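
As a rough illustration of what "combining preference model scores with a scalarization function" means, the sketch below shows two common choices: a weighted sum and a worst-case (minimum) rule. The abstract does not say which functions the paper uses, so these two, the score values, and all names are assumptions made purely for illustration.

```python
from typing import Dict, Optional


def linear_scalarization(scores: Dict[str, float],
                         weights: Optional[Dict[str, float]] = None) -> float:
    # Weighted sum of per-principle scores; uniform weights by default.
    if weights is None:
        weights = {name: 1.0 / len(scores) for name in scores}
    return sum(weights[name] * scores[name] for name in scores)


def worst_case_scalarization(scores: Dict[str, float]) -> float:
    # The reward is limited by the principle on which the response scores lowest.
    return min(scores.values())


# Hypothetical per-principle preference scores for one response.
scores = {"toxicity": 0.9, "factuality": 0.6, "sycophancy": 0.3}
reward_linear = linear_scalarization(scores)
reward_worst = worst_case_scalarization(scores)
```

Either scalar would then serve as the reward signal for PPO training of the target model.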

[LG-52] Joint Learning of Context and Feedback Embeddings in Spoken Dialogue

Link: https://arxiv.org/abs/2406.07291
Authors: Livia Qian,Gabriel Skantze
Keywords: feedback responses, play an important, Short feedback responses, important role, role in spoken
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Interspeech 2024

Abstract:Short feedback responses, such as backchannels, play an important role in spoken dialogue. So far, most of the modeling of feedback responses has focused on their timing, often neglecting how their lexical and prosodic form influence their contextual appropriateness and conversational function. In this paper, we investigate the possibility of embedding short dialogue contexts and feedback responses in the same representation space using a contrastive learning objective. In our evaluation, we primarily focus on how such embeddings can be used as a context-feedback appropriateness metric and thus for feedback response ranking in U.S. English dialogues. Our results show that the model outperforms humans given the same ranking task and that the learned embeddings carry information about the conversational function of feedback responses.

[LG-53] Efficient 3D Molecular Generation with Flow Matching and Scale Optimal Transport

Link: https://arxiv.org/abs/2406.07266
Authors: Ross Irwin,Alessandro Tibo,Jon-Paul Janet,Simon Olsson
Keywords: design ligands directly, gained prominence recently, Generative models, drug design, protein pockets
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: Preprint. Code to be released upon full publication

Abstract:Generative models for 3D drug design have gained prominence recently for their potential to design ligands directly within protein pockets. Current approaches, however, often suffer from very slow sampling times or generate molecules with poor chemical validity. Addressing these limitations, we propose Semla, a scalable E(3)-equivariant message passing architecture. We further introduce a molecular generation model, MolFlow, which is trained using flow matching along with scale optimal transport, a novel extension of equivariant optimal transport. Our model produces state-of-the-art results on benchmark datasets with just 100 sampling steps. Crucially, MolFlow samples high quality molecules with as few as 20 steps, corresponding to a two order-of-magnitude speed-up compared to state-of-the-art, without sacrificing performance. Furthermore, we highlight limitations of current evaluation methods for 3D generation and propose new benchmark metrics for unconditional molecular generators. Finally, using these new metrics, we compare our model’s ability to generate high quality samples against current approaches and further demonstrate MolFlow’s strong performance.

[LG-54] Active learning for affinity prediction of antibodies

Link: https://arxiv.org/abs/2406.07263
Authors: Alexandra Gessner,Sebastian W. Ober,Owen Vickery,Dino Oglić,Talip Uçar
Keywords: lead optimization campaigns, enhance antibody affinity, primary objective, lead optimization, optimization campaigns
Subjects: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM); Machine Learning (stat.ML)

Abstract:The primary objective of most lead optimization campaigns is to enhance the binding affinity of ligands. For large molecules such as antibodies, identifying mutations that enhance antibody affinity is particularly challenging due to the combinatorial explosion of potential mutations. When the structure of the antibody-antigen complex is available, relative binding free energy (RBFE) methods can offer valuable insights into how different mutations will impact the potency and selectivity of a drug candidate, thereby reducing the reliance on costly and time-consuming wet-lab experiments. However, accurately simulating the physics of large molecules is computationally intensive. We present an active learning framework that iteratively proposes promising sequences for simulators to evaluate, thereby accelerating the search for improved binders. We explore different modeling approaches to identify the most effective surrogate model for this task, and evaluate our framework both using pre-computed pools of data and in a realistic full-loop setting.

[LG-55] Scientific Computing with Large Language Models

Link: https://arxiv.org/abs/2406.07259
Authors: Christopher Culver,Peter Hicks,Mihailo Milenkovic,Sanjif Shanmugavelu,Tobias Becker
Keywords: provide an overview, emergence of large, large language models, scientific computing applications, large language
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 13 pages

Abstract:We provide an overview of the emergence of large language models for scientific computing applications. We highlight use cases that involve natural language processing of scientific documents and specialized languages designed to describe physical systems. For the former, chatbot style applications appear in medicine, mathematics and physics and can be used iteratively with domain experts for problem solving. We also review specialized languages within molecular biology, the languages of molecules, proteins, and DNA where language models are being used to predict properties and even create novel physical systems at much faster rates than traditional computing methods.

[LG-56] Hybrid Reinforcement Learning from Offline Observation Alone

Link: https://arxiv.org/abs/2406.07253
Authors: Yuda Song,J. Andrew Bagnell,Aarti Singh
Keywords: online interactive access, hybrid reinforcement learning, reinforcement learning setting, reinforcement learning, offline data
Subjects: Machine Learning (cs.LG)
Comments: 34 pages, 7 figures, published at ICML 2024

Abstract:We consider the hybrid reinforcement learning setting where the agent has access to both offline data and online interactive access. While Reinforcement Learning (RL) research typically assumes offline data contains complete action, reward and transition information, datasets with only state information (also known as observation-only datasets) are more general, abundant and practical. This motivates our study of the hybrid RL with observation-only offline dataset framework. While the task of competing with the best policy “covered” by the offline data can be solved if a reset model of the environment is provided (i.e., one that can be reset to any state), we show evidence of hardness when only given the weaker trace model (i.e., one can only reset to the initial states and must produce full traces through the environment), without further assumption of admissibility of the offline data. Under the admissibility assumptions – that the offline data could actually be produced by the policy class we consider – we propose the first algorithm in the trace model setting that provably matches the performance of algorithms that leverage a reset model. We also perform proof-of-concept experiments that suggest the effectiveness of our algorithm in practice.

[LG-57] Marginalization Consistent Mixture of Separable Flows for Probabilistic Irregular Time Series Forecasting

Link: https://arxiv.org/abs/2406.07246
Authors: Vijaya Krishna Yalavarthi,Randolf Scholz,Kiran Madhusudhanan,Stefan Born,Lars Schmidt-Thieme
Keywords: Gaussian Process Regression, Process Regression model, Process Regression, heavily under-researched area, Transformer-Attentional Copulas
Subjects: Machine Learning (cs.LG)

Abstract:Probabilistic forecasting models for joint distributions of targets in irregular time series are a heavily under-researched area in machine learning with, to the best of our knowledge, only three models researched so far: GPR, the Gaussian Process Regression model (Durichen et al., 2015); TACTiS, the Transformer-Attentional Copulas for Time Series (Drouin et al., 2022; Ashok et al., 2024); and ProFITi (Yalavarthi et al., 2024), a multivariate normalizing flow model based on invertible attention layers. While ProFITi, thanks to using multivariate normalizing flows, is the more expressive model with better predictive performance, we will show that it suffers from marginalization inconsistency: it does not guarantee that the marginal distributions of a subset of variables in its predictive distributions coincide with the directly predicted distributions of these variables. Also, TACTiS does not provide any guarantees for marginalization consistency. We develop a novel probabilistic irregular time series forecasting model, Marginalization Consistent Mixtures of Separable Flows (moses), that mixes several normalizing flows with (i) Gaussian Processes with full covariance matrix as source distributions and (ii) a separable invertible transformation, aiming to combine the expressivity of normalizing flows with the marginalization consistency of Gaussians. In experiments on four different datasets we show that moses outperforms other state-of-the-art marginalization consistent models and performs on par with ProFITi but, unlike ProFITi, guarantees marginalization consistency.
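
The property described in words above can be written compactly. The notation here is illustrative rather than taken from the paper: $\hat{p}$ is the model's predictive density over $K$ targets given the observed series $x$, $S$ is any subset of target indices, and $\bar{S}$ is its complement. The second line states the well-known fact that Gaussians satisfy the property, since marginalizing simply selects the corresponding sub-blocks of the mean and covariance.

```latex
% Marginalization consistency: integrating out the targets outside S
% must give the same density the model predicts when asked for S directly.
\[
  \int \hat{p}(y_1, \dots, y_K \mid x) \, \mathrm{d}y_{\bar{S}}
  \;=\; \hat{p}(y_S \mid x)
  \qquad \text{for every } S \subseteq \{1, \dots, K\}.
\]

% Gaussian case: if y \sim \mathcal{N}(\mu, \Sigma), then
\[
  y_S \sim \mathcal{N}(\mu_S, \Sigma_{SS}).
\]
```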

[LG-58] Let Go of Your Labels with Unsupervised Transfer

Link: https://arxiv.org/abs/2406.07236
Authors: Artyom Gadetsky,Yulun Jiang,Maria Brbic
Keywords: remarkable zero-shot transferability, enabled remarkable zero-shot, Foundation vision-language models, enabled remarkable, TURTLE
Subjects: Machine Learning (cs.LG)
Comments: ICML 2024 camera-ready

Abstract:Foundation vision-language models have enabled remarkable zero-shot transferability of the pre-trained representations to a wide range of downstream tasks. However, to solve a new task, zero-shot transfer still necessitates human guidance to define visual categories that appear in the data. Here, we show that fully unsupervised transfer emerges when searching for the labeling of a dataset that induces maximal margin classifiers in representation spaces of different foundation models. We present TURTLE, a fully unsupervised method that effectively employs this guiding principle to uncover the underlying labeling of a downstream dataset without any supervision and task-specific representation learning. We evaluate TURTLE on a diverse benchmark suite of 26 datasets and show that it achieves new state-of-the-art unsupervised performance. Furthermore, TURTLE, although being fully unsupervised, outperforms zero-shot transfer baselines on a wide range of datasets. In particular, TURTLE matches the average performance of CLIP zero-shot on 26 datasets by employing the same representation space, spanning a wide range of architectures and model sizes. By guiding the search for the underlying labeling using the representation spaces of two foundation models, TURTLE surpasses zero-shot transfer and unsupervised prompt tuning baselines, demonstrating the surprising power and effectiveness of unsupervised transfer.

[LG-59] OPFData: Large-scale datasets for AC optimal power flow with topological perturbations

Link: https://arxiv.org/abs/2406.07234
Authors: Sean Lovett,Miha Zgubic,Sofia Liguori,Sephora Madjiheurem,Hamish Tomlinson,Sophie Elster,Chris Apps,Sims Witherspoon,Luis Piloto
Keywords: optimal power flow, efficient and safe, safe planning, power flow problem, Solving
Subjects: Machine Learning (cs.LG)

Abstract:Solving the AC optimal power flow problem (AC-OPF) is critical to the efficient and safe planning and operation of power grids. Small efficiency improvements in this domain have the potential to lead to billions of dollars of cost savings, and significant reductions in emissions from fossil fuel generators. Recent work on data-driven solution methods for AC-OPF shows the potential for large speed improvements compared to traditional solvers; however, no large-scale open datasets for this problem exist. We present the largest readily-available collection of solved AC-OPF problems to date. This collection is orders of magnitude larger than existing readily-available datasets, allowing training of high-capacity data-driven models. Uniquely, it includes topological perturbations - a critical requirement for usage in realistic power grid operations. We hope this resource will spur the community to scale research to larger grid sizes with variable topology.

[LG-60] Improving Autoformalization using Type Checking

Link: https://arxiv.org/abs/2406.07222
Authors: Auguste Poiroux,Gail Weiss,Viktor Kunčak,Antoine Bosselut
Keywords: Large language models, translating natural language, automatically translating natural, Large language, natural language
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Large language models show promise for autoformalization, the task of automatically translating natural language into formal languages. However, current autoformalization methods remain limited. The last reported state-of-the-art performance on the ProofNet formalization benchmark for the Lean proof assistant, achieved using Codex for Lean 3, only showed successful formalization of 16.1% of informal statements. Similarly, our evaluation of GPT-4o for Lean 4 only produces successful translations 34.9% of the time. Our analysis shows that the performance of these models is largely limited by their inability to generate formal statements that successfully type-check (i.e., are syntactically correct and consistent with types) - with a whopping 86.6% of GPT-4o errors starting from a type-check failure. In this work, we propose a method to fix this issue through decoding with type-check filtering, where we initially sample a diverse set of candidate formalizations for an informal statement, then use the Lean proof assistant to filter out candidates that do not type-check. Using GPT-4o as a base model, and combining our method with self-consistency, we obtain a +18.3% absolute increase in formalization accuracy, and achieve a new state-of-the-art of 53.2% on ProofNet with Lean 4.
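
The sample-then-filter procedure in this abstract can be sketched as below. Everything here is a simplified assumption: `toy_type_checks` is a hypothetical stand-in for a real call to the Lean proof assistant, and self-consistency is approximated by a majority vote over whitespace-normalized strings, which may differ from how the paper groups equivalent candidates.

```python
from collections import Counter
from typing import Callable, Iterable, Optional


def toy_type_checks(statement: str) -> bool:
    # Hypothetical stand-in for a proof-assistant type check:
    # accept statements that start with "theorem" and have balanced parentheses.
    depth = 0
    for ch in statement:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0 and statement.lstrip().startswith("theorem")


def select_formalization(candidates: Iterable[str],
                         type_checks: Callable[[str], bool]) -> Optional[str]:
    # Step 1: discard sampled candidates that fail the type check.
    well_typed = [" ".join(c.split()) for c in candidates if type_checks(c)]
    if not well_typed:
        return None
    # Step 2: self-consistency, approximated as the most frequent survivor.
    return Counter(well_typed).most_common(1)[0][0]


candidates = [
    "theorem t (n : Nat) : n + 0 = n",
    "theorem t (n : Nat) :  n + 0 = n",
    "theorem t (n : Nat : n + 0 = n",
    "theorem t (n : Nat) : 0 + n = n",
]
best = select_formalization(candidates, toy_type_checks)
```

The third candidate is dropped by the filter, and the first two collapse to the same normalized string, so that formalization wins the vote.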

[LG-61] A Synthetic Dataset for Personal Attribute Inference

Link: https://arxiv.org/abs/2406.07217
Authors: Hanna Yukhymenko,Robin Staab,Mark Vero,Martin Vechev
Keywords: Large Language Models, powerful Large Language, Language Models, Large Language, powerful Large
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:Recently, powerful Large Language Models (LLMs) have become easily accessible to hundreds of millions of users worldwide. However, their strong capabilities and vast world knowledge do not come without associated privacy risks. In this work, we focus on the emerging privacy threat LLMs pose - the ability to accurately infer personal information from online texts. Despite the growing importance of LLM-based author profiling, research in this area has been hampered by a lack of suitable public datasets, largely due to ethical and privacy concerns associated with real personal data. In this work, we take two steps to address this problem: (i) we construct a simulation framework for the popular social media platform Reddit using LLM agents seeded with synthetic personal profiles; (ii) using this framework, we generate SynthPAI, a diverse synthetic dataset of over 7800 comments manually labeled for personal attributes. We validate our dataset with a human study showing that humans barely outperform random guessing on the task of distinguishing our synthetic comments from real ones. Further, we verify that our dataset enables meaningful personal attribute inference research by showing across 18 state-of-the-art LLMs that our synthetic comments allow us to draw the same conclusions as real-world data. Together, this indicates that our dataset and pipeline provide a strong and privacy-preserving basis for future research toward understanding and mitigating the inference-based privacy threats LLMs pose.

[LG-62] Semantic-Aware Spectrum Sharing in Internet of Vehicles Based on Deep Reinforcement Learning

Link: https://arxiv.org/abs/2406.07213
Authors: Zhiyu Shao,Qiong Wu,Pingyi Fan,Nan Cheng,Wen Chen,Jiangzhou Wang,Khaled B. Letaief
Keywords: high-speed mobile Internet, mobile Internet, investigate semantic communication, spectrum sharing, sharing
Subjects: Machine Learning (cs.LG)
Comments: This paper has been submitted to IEEE Journal. The source code has been released at: this https URL

Abstract:This work aims to investigate semantic communication in high-speed mobile Internet of vehicles (IoV) environments, with a focus on the spectrum sharing between vehicle-to-vehicle (V2V) and vehicle-to-infrastructure (V2I) communications. We specifically address spectrum scarcity and network traffic and then propose a semantic-aware spectrum sharing algorithm (SSS) based on the deep reinforcement learning (DRL) soft actor-critic (SAC) approach. Firstly, we delve into the extraction of semantic information. Secondly, we redefine metrics for semantic information in V2V and V2I spectrum sharing in IoV environments, introducing high-speed semantic spectrum efficiency (HSSE) and semantic transmission rate (HSR). Finally, we employ the SAC algorithm for decision optimization in V2V and V2I spectrum sharing based on semantic information. This optimization encompasses the optimal link of V2V and V2I sharing strategies, the transmission power for vehicles sending semantic information and the length of transmitted semantic symbols, aiming at maximizing HSSE of V2I and enhancing success rate of effective semantic information transmission (SRS) of V2V. Experimental results demonstrate that the SSS algorithm outperforms other baseline algorithms, including other traditional-communication-based spectrum sharing algorithms and spectrum sharing algorithm using other reinforcement learning approaches. The SSS algorithm exhibits a 15% increase in HSSE and approximately a 7% increase in SRS.

[LG-63] TernaryLLM: Ternarized Large Language Model

Link: https://arxiv.org/abs/2406.07177
Authors: Tianqi Chen,Zhe Li,Weixiang Xu,Zeyu Zhu,Dong Li,Lu Tian,Emad Barsoum,Peisong Wang,Jian Cheng
Keywords: Natural Language Processing, Large language models, Large language, Language Processing, achieved remarkable performance
Subjects: Machine Learning (cs.LG)

Abstract:Large language models (LLMs) have achieved remarkable performance on Natural Language Processing (NLP) tasks, but they are hindered by high computational costs and memory requirements. Ternarization, an extreme form of quantization, offers a solution by reducing memory usage and enabling energy-efficient floating-point additions. However, applying ternarization to LLMs faces challenges stemming from outliers in both weights and activations. In this work, observing asymmetric outliers and non-zero means in weights, we introduce Dual Learnable Ternarization (DLT), which enables both scales and shifts to be learnable. We also propose Outlier-Friendly Feature Knowledge Distillation (OFF) to recover the information lost in extremely low-bit quantization. The proposed OFF can incorporate semantic information and is insensitive to outliers. At the core of OFF is maximizing the mutual information between features in ternarized and floating-point models using cosine similarity. Extensive experiments demonstrate that our TernaryLLM surpasses previous low-bit quantization methods on the standard text generation and zero-shot benchmarks for different LLM families. Specifically, for one of the most powerful open-source models, LLaMA-3, our approach (W1.58A16) outperforms the previous state-of-the-art method (W2A16) by 5.8 in terms of perplexity on C4 and by 8.2% in terms of average accuracy on zero-shot tasks.
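
To make "ternarization with a scale and a shift" concrete, the sketch below maps each weight to a code in {-1, 0, +1} and reconstructs it as `scale * code + shift`. Three values per weight carry log2(3) ≈ 1.58 bits, which is where the W1.58 label comes from. In the paper both parameters are learned; here they are merely initialized from weight statistics using a threshold heuristic borrowed from earlier ternary-weight work, so the rule, the function names, and `threshold_ratio` are all assumptions.

```python
def ternarize(weights, threshold_ratio=0.7):
    # Shift: remove the non-zero mean so the codes are centered on zero.
    shift = sum(weights) / len(weights)
    centered = [w - shift for w in weights]
    # Threshold: weights close to the mean are mapped to code 0.
    delta = threshold_ratio * sum(abs(c) for c in centered) / len(centered)
    codes = [0 if abs(c) <= delta else (1 if c > 0 else -1) for c in centered]
    # Scale: average magnitude of the weights that kept a non-zero code.
    kept = [abs(c) for c, t in zip(centered, codes) if t != 0]
    scale = sum(kept) / len(kept) if kept else 0.0
    return codes, scale, shift


def dequantize(codes, scale, shift):
    # Reconstruct approximate weights from the ternary codes.
    return [scale * t + shift for t in codes]


weights = [0.9, 1.1, 1.0, 2.0, 0.0]
codes, scale, shift = ternarize(weights)
approx = dequantize(codes, scale, shift)
```

Without the shift, the weights clustered around 1.0 in this toy example could not be represented by a code of 0, which is the intuition behind handling non-zero means explicitly.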

[LG-64] Deep Learning-Based Approach for User Activity Detection with Grant-Free Random Access in Cell-Free Massive MIMO

Link: https://arxiv.org/abs/2406.07160
Authors: Ali Elkeshawy,Haïfa Farès,Amor Nafkha
Keywords: Modern wireless networks, Modern wireless, reliably support, support a wide, wide array
Categories: Machine Learning (cs.LG)

Abstract:Modern wireless networks must reliably support a wide array of connectivity demands, encompassing various user needs across diverse scenarios. Massive Machine-Type Communication (mMTC) is pivotal in these networks, particularly given the challenges posed by massive connectivity and sporadic device activation patterns. Traditional grant-based random access (GB-RA) protocols face limitations due to constrained orthogonal preamble resources. In response, the adoption of grant-free random access (GF-RA) protocols offers a promising solution. This paper explores the application of supervised machine learning models to tackle activity detection issues in scenarios where non-orthogonal preamble design is considered. We introduce a data-driven algorithm specifically designed for user activity detection in Cell-Free Massive Multiple-Input Multiple-Output (CF-mMIMO) networks operating under GF-RA protocols. Additionally, this study presents a novel clustering strategy that simplifies and enhances activity detection accuracy, assesses the resilience of the algorithm to input perturbations, and investigates the effects of adopting floating-to-fixed-point conversion on algorithm performance. Simulations conducted adhere to 3GPP standards, ensuring accurate channel modeling, and employ a deep learning approach to boost the detection capabilities of mMTC GF-RA devices. The results are compelling: the algorithm achieves an exceptional 99% accuracy rate, confirming its efficacy in real-world applications.

[LG-65] Failures Are Fated But Can Be Faded: Characterizing and Mitigating Unwanted Behaviors in Large-Scale Vision and Language Models

Link: https://arxiv.org/abs/2406.07145
Authors: Som Sagar,Aditya Taparia,Ransalu Senanayake
Keywords: large deep neural, deep neural networks, social biases, related to accuracy, neural networks
Categories: Machine Learning (cs.LG)
Comments: 25 pages, 35 figures

Abstract:In large deep neural networks that seem to perform surprisingly well on many tasks, we also observe a few failures related to accuracy, social biases, and alignment with human values, among others. Therefore, before deploying these models, it is crucial to characterize this failure landscape for engineers to debug and legislative bodies to audit models. Nevertheless, it is infeasible to exhaustively test for all possible combinations of factors that could lead to a model’s failure. In this paper, we introduce a post-hoc method that utilizes deep reinforcement learning to explore and construct the landscape of failure modes in pre-trained discriminative and generative models. With the aid of limited human feedback, we then demonstrate how to restructure the failure landscape to be more desirable by moving away from the discovered failure modes. We empirically show the effectiveness of the proposed method across common Computer Vision, Natural Language Processing, and Vision-Language tasks.

[LG-66] Identifiable Object-Centric Representation Learning via Probabilistic Slot Attention

Link: https://arxiv.org/abs/2406.07141
Authors: Avinash Kori,Francesco Locatello,Ainkaran Santhirasekaram,Francesca Toni,Ben Glocker,Fabio De Sousa Ribeiro
Keywords: Learning modular object-centric, Learning modular, systematic generalization, Learning, crucial for systematic
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Learning modular object-centric representations is crucial for systematic generalization. Existing methods show promising object-binding capabilities empirically, but theoretical identifiability guarantees remain relatively underdeveloped. Understanding when object-centric representations can theoretically be identified is crucial for scaling slot-based methods to high-dimensional images with correctness guarantees. To that end, we propose a probabilistic slot-attention algorithm that imposes an aggregate mixture prior over object-centric slot representations, thereby providing slot identifiability guarantees without supervision, up to an equivalence relation. We provide empirical verification of our theoretical identifiability result using both simple 2-dimensional data and high-resolution imaging datasets.

[LG-67] Logical Distillation of Graph Neural Networks

Link: https://arxiv.org/abs/2406.07126
Authors: Alexander Pluska,Pascal Welke,Thomas Gärtner,Sagar Malhotra
Keywords: Graph Neural Network, Neural Network, Graph Neural, learning on graphs, Network
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Under Review

Abstract:We present a logic based interpretable model for learning on graphs and an algorithm to distill this model from a Graph Neural Network (GNN). Recent results have shown connections between the expressivity of GNNs and the two-variable fragment of first-order logic with counting quantifiers (C2). We introduce a decision-tree based model which leverages an extension of C2 to distill interpretable logical classifiers from GNNs. We test our approach on multiple GNN architectures. The distilled models are interpretable, succinct, and attain similar accuracy to the underlying GNN. Furthermore, when the ground truth is expressible in C2, our approach outperforms the GNN.

[LG-68] CARACAS: vehiCular ArchitectuRe for detAiled Can Attacks Simulation

Link: https://arxiv.org/abs/2406.07125
Authors: Sadek Misto Kirdi,Nicola Scarano,Franco Oberti,Luca Mannella,Stefano Di Carlo,Alessandro Savino
Keywords: Controller Area Network, exploit network infrastructures, Controller Area, Area Network, Intrusion Detection Systems
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 6 pages, 8 figures, TrustAICyberSec workshop - IEEE ISCC 2024

Abstract:Modern vehicles are increasingly vulnerable to attacks that exploit network infrastructures, particularly the Controller Area Network (CAN) networks. To effectively counter such threats using contemporary tools like Intrusion Detection Systems (IDSs) based on data analysis and classification, large datasets of CAN messages become imperative. This paper delves into the feasibility of generating synthetic datasets by harnessing the modeling capabilities of simulation frameworks such as Simulink coupled with a robust representation of attack models to present CARACAS, a vehicular model, including component control via CAN messages and attack injection capabilities. CARACAS showcases the efficacy of this methodology, including a Battery Electric Vehicle (BEV) model, and focuses on attacks targeting torque control in two distinct scenarios.

[LG-69] CHARME: A chain-based reinforcement learning approach for the minor embedding problem

Link: https://arxiv.org/abs/2406.07124
Authors: Hoang M. Ngo,Nguyen H K. Do,Minh N. Vu,Tamer Kahveci,My T. Thai
Keywords: holds great potential, solving combinatorial optimization, optimization problems efficiently, minor embedding Problem, Quantum Annealing
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Quantum Annealing (QA) holds great potential for solving combinatorial optimization problems efficiently. However, the effectiveness of QA algorithms heavily relies on the embedding of problem instances, represented as logical graphs, into the quantum processing unit (QPU), whose topology takes the form of a limited-connectivity graph; this is known as the minor embedding problem. Existing methods for the minor embedding problem suffer from scalability issues when confronted with larger problem sizes. In this paper, we propose a novel approach utilizing Reinforcement Learning (RL) techniques to address the minor embedding problem, named CHARME. CHARME includes three key components: a Graph Neural Network (GNN) architecture for policy modeling, a state transition algorithm ensuring solution validity, and an order exploration strategy for effective training. Through comprehensive experiments on synthetic and real-world instances, we demonstrate the efficiency of our proposed order exploration strategy as well as our proposed RL framework, CHARME. In detail, CHARME yields superior solutions compared to fast embedding methods such as Minorminer and ATOM. Moreover, our method surpasses the OCT-based approach, known for its slower runtime but high-quality solutions, in several cases. In addition, our proposed exploration enhances the efficiency of the training of the CHARME framework by providing better solutions compared to the greedy strategy.

[LG-70] Augmenting Offline RL with Unlabeled Data

Link: https://arxiv.org/abs/2406.07117
Authors: Zhao Wang,Briti Gangopadhyay,Jia-Fong Yeh,Shingo Takamatsu
Keywords: offline Reinforcement Learning, Reinforcement Learning, conservative policy updates, offline Reinforcement, Recent advancements
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Recent advancements in offline Reinforcement Learning (Offline RL) have led to an increased focus on methods based on conservative policy updates to address the Out-of-Distribution (OOD) issue. These methods typically involve adding behavior regularization or modifying the critic learning objective, focusing primarily on states or actions with substantial dataset support. However, we challenge this prevailing notion by asserting that the absence of an action or state from a dataset does not necessarily imply its suboptimality. In this paper, we propose a novel approach to tackle the OOD problem. We introduce an offline RL teacher-student framework, complemented by a policy similarity measure. This framework enables the student policy to gain insights not only from the offline RL dataset but also from the knowledge transferred by a teacher policy. The teacher policy is trained using another dataset consisting of state-action pairs, which can be viewed as practical domain knowledge acquired without direct interaction with the environment. We believe this additional knowledge is key to effectively solving the OOD issue. This research represents a significant advancement in integrating a teacher-student network into the actor-critic framework, opening new avenues for studies on knowledge transfer in offline RL and effectively addressing the OOD challenge.

[LG-71] Advancing Tool-Augmented Large Language Models: Integrating Insights from Errors in Inference Trees

Link: https://arxiv.org/abs/2406.07115
Authors: Sijia Chen,Yibo Wang,Yi-Feng Wu,Qing-Guo Chen,Zhao Xu,Weihua Luo,Kaifu Zhang,Lijun Zhang
Keywords: intelligent agents interacting, Tool-augmented large language, tool-augmented LLMs compared, large language models, real world
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Tool-augmented large language models (LLMs) leverage tools, often in the form of APIs, to enhance their reasoning capabilities on complex tasks, thus taking on the role of intelligent agents interacting with the real world. The recently introduced ToolLLaMA model by Qin et al. [2024] utilizes the depth-first search-based decision tree (DFSDT) method for reasoning with 16000+ real-world APIs, which effectively improves the planning and inferencing performance of tool-augmented LLMs compared to traditional chain reasoning approaches. However, their approach only employs successful paths from decision trees (also called inference trees) for supervised fine-tuning (SFT) during training, which does not fully exploit the advantages of the tree of thought. In this study, we propose an inference trajectory optimization framework based on the preference data extracted from decision trees to address this limitation. We first introduce a novel method for constructing preference data from the tree of thought, capitalizing on the failed explorations previously overlooked in the trees. Specifically, we generate an effective step-wise preference dataset, named ToolPreference, for tool use based on the ToolBench dataset. In the subsequent training phase, we first fine-tune the LLM with tool-usage expert trajectories and then use these step-wise preference pairs for direct preference optimization (DPO) to update the policy of the LLM, resulting in our ToolPrefer-LLaMA (TP-LLaMA) model. Our experiments demonstrate that by obtaining insights from errors in inference trees, TP-LLaMA significantly outperforms the baselines across almost all test scenarios by a large margin and exhibits better generalization capabilities with unseen APIs. At the same time, TP-LLaMA has also demonstrated superior reasoning efficiency compared to the baselines, making it more suitable for complex tool-usage reasoning tasks.
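For readers unfamiliar with direct preference optimization, the per-pair objective that the abstract above relies on can be sketched as follows. This is the standard DPO loss for one (preferred, rejected) pair, not the TP-LLaMA training code; the function name, the argument names, and the default `beta` are assumptions chosen for illustration, and the log-probabilities would in practice be summed token log-likelihoods of each step under the policy and a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for a single preference pair.

    loss = -log(sigmoid(beta * margin)), where margin compares how much the
    policy prefers the chosen step over the rejected one, relative to the
    reference model. beta=0.1 is an illustrative default, not a value taken
    from the paper.
    """
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # -log(sigmoid(z)) == softplus(-z), written in a numerically stable form.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


if __name__ == "__main__":
    # Policy identical to the reference: the loss equals log(2).
    print(dpo_loss(-1.0, -2.0, -1.0, -2.0))
    # Policy raises the chosen step's log-probability: the loss drops.
    print(dpo_loss(-0.5, -2.0, -1.0, -2.0))
```

A failed branch of a decision tree supplies the rejected side of such a pair, which is how the previously discarded explorations become training signal.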

[LG-72] Agnostic Sharpness-Aware Minimization

Link: https://arxiv.org/abs/2406.07107
Authors: Van-Anh Nguyen,Quyen Tran,Tuan Truong,Thanh-Toan Do,Dinh Phung,Trung Le
Keywords: improving deep neural, deep neural network, Sharpness-aware minimization, neural network training, instrumental in improving
Categories: Machine Learning (cs.LG)
Comments: Under review

Abstract:Sharpness-aware minimization (SAM) has been instrumental in improving deep neural network training by minimizing both the training loss and the sharpness of the loss landscape, leading the model into flatter minima that are associated with better generalization properties. In another aspect, Model-Agnostic Meta-Learning (MAML) is a framework designed to improve the adaptability of models. MAML optimizes a set of meta-models that are specifically tailored for quick adaptation to multiple tasks with minimal fine-tuning steps and can generalize well with limited data. In this work, we explore the connection between SAM and MAML, particularly in terms of enhancing model generalization. We introduce Agnostic-SAM, a novel approach that combines the principles of both SAM and MAML. Agnostic-SAM adapts the core idea of SAM by optimizing the model towards wider local minima using training data, while concurrently maintaining low loss values on validation data. By doing so, it seeks flatter minima that are not only robust to small perturbations but also less vulnerable to data distributional shift problems. Our experimental results demonstrate that Agnostic-SAM significantly improves generalization over baselines across a range of datasets and under challenging conditions such as noisy labels and data limitation.
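The SAM update that the abstract above builds on can be sketched on a toy quadratic. This shows plain SAM only, not Agnostic-SAM (the validation-loss term is not modeled), and the loss function, learning rate `lr`, and radius `rho` are assumptions picked for illustration.

```python
import math


def loss(w):
    """Toy quadratic loss with one steep and one shallow direction."""
    return 0.5 * (w[0] ** 2 + 10.0 * w[1] ** 2)


def grad(w):
    """Analytic gradient of the toy loss."""
    return [w[0], 10.0 * w[1]]


def sam_step(w, lr=0.1, rho=0.05):
    """One plain SAM step: ascend to the worst nearby point, then descend.

    1. Perturb the weights by rho along the normalized gradient.
    2. Evaluate the gradient at the perturbed point.
    3. Apply that gradient to the original weights.
    """
    g = grad(w)
    norm = math.sqrt(sum(gi * gi for gi in g)) + 1e-12  # avoid division by zero
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    g_adv = grad(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]


if __name__ == "__main__":
    w = [1.0, 1.0]
    for _ in range(50):
        w = sam_step(w)
    print(w, loss(w))
```

Each step costs two gradient evaluations instead of one, which is the usual price of SAM-style methods.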

[LG-73] D-GRIL: End-to-End Topological Learning with 2-parameter Persistence

Link: https://arxiv.org/abs/2406.07100
Authors: Soham Mukherjee,Shreyas N. Samaga,Cheng Xin,Steve Oudot,Tamal K. Dey
Keywords: topological learning, persistence, persistence based vectorization, Abstract, technique called GRIL
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Algebraic Topology (math.AT)

Abstract:End-to-end topological learning using 1-parameter persistence is well-known. We show that the framework can be enhanced using 2-parameter persistence by adopting a recently introduced 2-parameter persistence based vectorization technique called GRIL. We establish a theoretical foundation of differentiating GRIL producing D-GRIL. We show that D-GRIL can be used to learn a bifiltration function on standard benchmark graph datasets. Further, we exhibit that this framework can be applied in the context of bio-activity prediction in drug discovery.

[LG-74] Leveraging Large Language Models for Efficient Failure Analysis in Game Development

Link: https://arxiv.org/abs/2406.07084
Authors: Leonardo Marini,Linus Gisslén,Alessandro Sestini
Keywords: early detection, final product, field of software, detection of bugs, bugs is vital
Categories: Machine Learning (cs.LG)
Comments: Published at CoG 2024

Abstract:In games, and more generally in the field of software development, early detection of bugs is vital to maintain a high quality of the final product. Automated tests are a powerful tool that can catch a problem earlier in development by executing periodically. As an example, when new code is submitted to the code base, a new automated test verifies these changes. However, identifying the specific change responsible for a test failure becomes harder when dealing with batches of changes – especially in the case of a large-scale project such as a AAA game, where thousands of people contribute to a single code base. This paper proposes a new approach to automatically identify which change in the code caused a test to fail. The method leverages Large Language Models (LLMs) to associate error messages with the corresponding code changes causing the failure. We investigate the effectiveness of our approach with quantitative and qualitative evaluations. Our approach reaches an accuracy of 71% in our newly created dataset, which comprises issues reported by developers at EA over a period of one year. We further evaluated our model through a user study to assess the utility and usability of the tool from a developer perspective, resulting in a significant reduction in time – up to 60% – spent investigating issues.

[LG-75] Efficient Mixture Learning in Black-Box Variational Inference

Link: https://arxiv.org/abs/2406.07083
Authors: Alexandra Hotti,Oskar Kviman,Ricky Molén,Víctor Elvira,Jens Lagergren
Keywords: black box variational, density estimation tasks, challenging density estimation, demonstrated impressive results, box variational inference
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: In Proceedings of the 41st International Conference on Machine Learning (ICML), Vienna, Austria

Abstract:Mixture variational distributions in black box variational inference (BBVI) have demonstrated impressive results in challenging density estimation tasks. However, currently scaling the number of mixture components can lead to a linear increase in the number of learnable parameters and a quadratic increase in inference time due to the evaluation of the evidence lower bound (ELBO). Our two key contributions address these limitations. First, we introduce the novel Multiple Importance Sampling Variational Autoencoder (MISVAE), which amortizes the mapping from input to mixture-parameter space using one-hot encodings. Fortunately, with MISVAE, each additional mixture component incurs a negligible increase in network parameters. Second, we construct two new estimators of the ELBO for mixtures in BBVI, enabling a tremendous reduction in inference time with marginal or even improved impact on performance. Collectively, our contributions enable scalability to hundreds of mixture components and provide superior estimation performance in shorter time, with fewer network parameters compared to previous Mixture VAEs. Experimenting with MISVAE, we achieve astonishing, SOTA results on MNIST. Furthermore, we empirically validate our estimators in other BBVI settings, including Bayesian phylogenetic inference, where we improve inference times for the SOTA mixture model on eight data sets.

[LG-76] Reading Miscue Detection in Primary School through Automatic Speech Recognition

Link: https://arxiv.org/abs/2406.07060
Authors: Lingyun Gao,Cristian Tejedor-Garcia,Helmer Strik,Catia Cucchiarini
Keywords: accessing reading exercises, Automatic Speech Recognition, Automatic reading diagnosis, reading diagnosis systems, feedback more easily
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Proc. INTERSPEECH 2024, 1-5 September 2024. Kos Island, Greece

Abstract:Automatic reading diagnosis systems can benefit both teachers, through more efficient scoring of reading exercises, and students, through easier access to reading exercises with feedback. However, there are limited studies on Automatic Speech Recognition (ASR) for child speech in languages other than English, and limited research on ASR-based reading diagnosis systems. This study investigates how efficiently state-of-the-art (SOTA) pretrained ASR models recognize the speech of native Dutch-speaking children and manage to detect reading miscues. We found that Hubert Large finetuned on Dutch speech achieves SOTA phoneme-level child speech recognition (PER at 23.1%), while Whisper (Faster Whisper Large-v2) achieves SOTA word-level performance (WER at 9.8%). Our findings suggest that Wav2Vec2 Large and Whisper are the two best ASR models for reading miscue detection. Specifically, Wav2Vec2 Large shows the highest recall at 0.83, whereas Whisper exhibits the highest precision at 0.52 and an F1 score of 0.52.
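The WER figure quoted in the abstract above is conventionally the word-level edit distance between the reference text and the ASR output, divided by the number of reference words. A minimal sketch of that definition follows; the function name and the example sentences are hypothetical, and real evaluations usually normalize casing and punctuation first, which is not done here. PER is the same computation over phoneme sequences instead of words.

```python
def word_error_rate(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over six words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```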

[LG-77] Benchmarking Trustworthiness of Multimodal Large Language Models: A Comprehensive Study

Link: https://arxiv.org/abs/2406.07057
Authors: Yichi Zhang,Yao Huang,Yitong Sun,Chang Liu,Zhe Zhao,Zhengwei Fang,Yifan Wang,Huanran Chen,Xiao Yang,Xingxing Wei,Hang Su,Yinpeng Dong,Jun Zhu
Keywords: Large Language Models, Multimodal Large Language, Large Language, significant trustworthiness challenges, face significant trustworthiness
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 100 pages, 84 figures, 33 tables

Abstract:Despite the superior capabilities of Multimodal Large Language Models (MLLMs) across diverse tasks, they still face significant trustworthiness challenges. Yet, current literature on the assessment of trustworthy MLLMs remains limited, lacking a holistic evaluation to offer thorough insights into future improvements. In this work, we establish MultiTrust, the first comprehensive and unified benchmark on the trustworthiness of MLLMs across five primary aspects: truthfulness, safety, robustness, fairness, and privacy. Our benchmark employs a rigorous evaluation strategy that addresses both multimodal risks and cross-modal impacts, encompassing 32 diverse tasks with self-curated datasets. Extensive experiments with 21 modern MLLMs reveal some previously unexplored trustworthiness issues and risks, highlighting the complexities introduced by the multimodality and underscoring the necessity for advanced methodologies to enhance their reliability. For instance, typical proprietary models still struggle with the perception of visually confusing images and are vulnerable to multimodal jailbreaking and adversarial attacks; MLLMs are more inclined to disclose privacy in text and reveal ideological and cultural biases even when paired with irrelevant images in inference, indicating that the multimodality amplifies the internal risks from base LLMs. Additionally, we release a scalable toolbox for standardized trustworthiness research, aiming to facilitate future advancements in this important field. Code and resources are publicly available at: this https URL.

[LG-78] TelecomRAG: Taming Telecom Standards with Retrieval Augmented Generation and LLMs

Link: https://arxiv.org/abs/2406.07053
Authors: Girma M. Yilma,Jose A. Ayala-Romero,Andres Garcia-Saavedra,Xavier Costa-Perez
Keywords: Large Language Models, Large Language, Language Models, immense potential, potential to transform
Categories: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
Comments: 7 pages, 2 figures, 3 tables

Abstract:Large Language Models (LLMs) have immense potential to transform the telecommunications industry. They could help professionals understand complex standards, generate code, and accelerate development. However, traditional LLMs struggle with the precision and source verification essential for telecom work. To address this, specialized LLM-based solutions tailored to telecommunication standards are needed. Retrieval-augmented generation (RAG) offers a way to create precise, fact-based answers. This paper proposes TelecomRAG, a framework for a Telecommunication Standards Assistant that provides accurate, detailed, and verifiable responses. Our implementation, using a knowledge base built from 3GPP Release 16 and Release 18 specification documents, demonstrates how this assistant surpasses generic LLMs, offering superior accuracy, technical depth, and verifiability, and thus significant value to the telecommunications field.

[LG-79] GridPE: Unifying Positional Encoding in Transformers with a Grid Cell-Inspired Framework

Link: https://arxiv.org/abs/2406.07049
Authors: Boyang Li,Yulin Wu,Nuoxian Huang
Keywords: Understanding spatial location, Understanding spatial, capability for modern, Understanding, grid
Categories: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)

Abstract:Understanding spatial location and relationships is a fundamental capability for modern artificial intelligence systems. Insights from human spatial cognition provide valuable guidance in this domain. Recent neuroscientific discoveries have highlighted the role of grid cells as a fundamental neural component for spatial representation, including distance computation, path integration, and scale discernment. In this paper, we introduce a novel positional encoding scheme inspired by Fourier analysis and the latest findings in computational neuroscience regarding grid cells. Assuming that grid cells encode spatial position through a summation of Fourier basis functions, we demonstrate the translational invariance of the grid representation during inner product calculations. Additionally, we derive an optimal grid scale ratio for multi-dimensional Euclidean spaces based on principles of biological efficiency. Utilizing these computational principles, we have developed a Grid-cell inspired Positional Encoding technique, termed GridPE, for encoding locations within high-dimensional spaces. We integrated GridPE into the Pyramid Vision Transformer architecture. Our theoretical analysis shows that GridPE provides a unifying framework for positional encoding in arbitrary high-dimensional spaces. Experimental results demonstrate that GridPE significantly enhances the performance of transformers, underscoring the importance of incorporating neuroscientific insights into the design of artificial intelligence systems.
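The translational invariance mentioned in the abstract above follows from a general property of complex Fourier features: the inner product of two encodings depends only on the difference between the two positions. The sketch below checks this numerically. It is not the GridPE implementation; the wave vectors in `WAVE_VECTORS` are arbitrary values chosen for illustration (the paper derives its own scale ratio), and the function names are hypothetical.

```python
import cmath

# Assumed wave vectors for a 2-D example: three directions 120 degrees apart
# plus one at a second scale. These values are illustrative only.
WAVE_VECTORS = [(1.0, 0.0), (-0.5, 0.8660254), (-0.5, -0.8660254), (2.0, 0.0)]


def fourier_encode(position, wave_vectors=WAVE_VECTORS):
    """Encode a position as the vector of exp(i * k . x) over all wave vectors k."""
    return [
        cmath.exp(1j * sum(k_d * x_d for k_d, x_d in zip(k, position)))
        for k in wave_vectors
    ]


def inner_product(u, v):
    """Hermitian inner product: sum of u_j * conj(v_j)."""
    return sum(a * b.conjugate() for a, b in zip(u, v))


def shifted(position, offset):
    """Translate a position by an offset."""
    return tuple(p + o for p, o in zip(position, offset))


if __name__ == "__main__":
    x, y, t = (0.3, 1.2), (-0.7, 0.4), (5.0, -2.5)
    before = inner_product(fourier_encode(x), fourier_encode(y))
    after = inner_product(fourier_encode(shifted(x, t)), fourier_encode(shifted(y, t)))
    print(abs(before - after))  # differs only by floating-point rounding
```

Each term of the sum is exp(i k·(x − y)), so a common shift of both positions cancels term by term.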

[LG-80] Integrating Domain Knowledge for handling Limited Data in Offline RL

Link: https://arxiv.org/abs/2406.07041
Authors: Briti Gangopadhyay,Zhao Wang,Jia-Fong Yeh,Shingo Takamatsu
Keywords: Offline Reinforcement Learning, Reinforcement Learning, Offline Reinforcement, real-world applications, compelling avenue
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:With the ability to learn from static datasets, Offline Reinforcement Learning (RL) emerges as a compelling avenue for real-world applications. However, state-of-the-art offline RL algorithms perform sub-optimally when confronted with limited data confined to specific regions within the state space. The performance degradation is attributed to the inability of offline RL algorithms to learn appropriate actions for rare or unseen observations. This paper proposes a novel domain knowledge-based regularization technique and adaptively refines the initial domain knowledge to considerably boost performance in limited data with partially omitted states. The key insight is that the regularization term mitigates erroneous actions for sparse samples and unobserved states covered by domain knowledge. Empirical evaluations on standard discrete environment datasets demonstrate a substantial average performance increase of at least 27% compared to existing offline RL algorithms operating on limited data.

[LG-81] Fairness-Aware Meta-Learning via Nash Bargaining

Link: https://arxiv.org/abs/2406.07029
Authors: Yi Zeng,Xuelin Yang,Li Chen,Cristian Canton Ferrer,Ming Jin,Michael I. Jordan,Ruoxi Jia
Keywords: sensitive-attributed validation set, adjust model parameters, model parameters based, machine learning, natural to adjust
Categories: Machine Learning (cs.LG)

Abstract:To address issues of group-level fairness in machine learning, it is natural to adjust model parameters based on specific fairness objectives over a sensitive-attributed validation set. Such an adjustment procedure can be cast within a meta-learning framework. However, naive integration of fairness goals via meta-learning can cause hypergradient conflicts for subgroups, resulting in unstable convergence and compromising model performance and fairness. To navigate this issue, we frame the resolution of hypergradient conflicts as a multi-player cooperative bargaining game. We introduce a two-stage meta-learning framework in which the first stage involves the use of a Nash Bargaining Solution (NBS) to resolve hypergradient conflicts and steer the model toward the Pareto front, and the second stage optimizes with respect to specific fairness goals. Our method is supported by theoretical results, notably a proof of the NBS for gradient aggregation free from linear independence assumptions, a proof of Pareto improvement, and a proof of monotonic improvement in validation loss. We also show empirical effects across various fairness objectives in six key fairness datasets and two image classification tasks.

[LG-82] Heterogeneous Learning Rate Scheduling for Neural Architecture Search on Long-Tailed Datasets

Link: https://arxiv.org/abs/2406.07028
Authors: Chenxia Tang
Keywords: Neural Architecture Search, Differentiable Architecture Search, Architecture Search, applying Neural Architecture, specifically the Differentiable
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:In this paper, we attempt to address the challenge of applying Neural Architecture Search (NAS) algorithms, specifically the Differentiable Architecture Search (DARTS), to long-tailed datasets where class distribution is highly imbalanced. We observe that traditional re-sampling and re-weighting techniques, which are effective in standard classification tasks, lead to performance degradation when combined with DARTS. To mitigate this, we propose a novel adaptive learning rate scheduling strategy tailored for the architecture parameters of DARTS when integrated with the Bilateral Branch Network (BBN) for handling imbalanced datasets. Our approach dynamically adjusts the learning rate of the architecture parameters based on the training epoch, preventing the disruption of well-trained representations in the later stages of training. Additionally, we explore the impact of branch mixing factors on the algorithm’s performance. Through extensive experiments on the CIFAR-10 dataset with an artificially induced long-tailed distribution, we demonstrate that our method achieves comparable accuracy to using DARTS alone. The experimental results also suggest that re-sampling methods inherently harm the performance of the DARTS algorithm. Our findings highlight the importance of careful data augmentation when applying DNAS to imbalanced learning scenarios.

[LG-83] Entropy-Reinforced Planning with Large Language Models for Drug Discovery

Link: https://arxiv.org/abs/2406.07025
Authors: Xuefeng Liu,Chih-chan Tien,Peng Ding,Songhao Jiang,Rick L. Stevens
Keywords: identify chemical compounds, possess specific pharmaceutical, specific pharmaceutical properties, drug discovery, identify chemical
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM); Machine Learning (stat.ML)
Comments: Published in ICML2024

Abstract:The objective of drug discovery is to identify chemical compounds that possess specific pharmaceutical properties toward a binding target. Existing large language models (LLMs) can achieve high token matching scores in terms of likelihood for molecule generation. However, relying solely on LLM decoding often results in the generation of molecules that are either invalid due to a single misused token, or suboptimal due to unbalanced exploration and exploitation as a consequence of the LLM's prior experience. Here we propose ERP, Entropy-Reinforced Planning for Transformer Decoding, which employs an entropy-reinforced planning algorithm to enhance the Transformer decoding process and strike a balance between exploitation and exploration. ERP aims to achieve improvements in multiple properties compared to direct sampling from the Transformer. We evaluated ERP on the SARS-CoV-2 virus (3CLPro) and human cancer cell target protein (RTCB) benchmarks and demonstrated that, in both benchmarks, ERP consistently outperforms the current state-of-the-art algorithm by 1-5 percent, and baselines by 5-10 percent, respectively. Moreover, such improvement is robust across Transformer models trained with different objectives. Finally, to further illustrate the capabilities of ERP, we tested our algorithm on three code generation benchmarks and outperformed the current state-of-the-art approach as well. Our code is publicly available at: this https URL.

[LG-84] Learning Discrete Latent Variable Structures with Tensor Rank Conditions

Link: https://arxiv.org/abs/2406.07020
Authors: Zhengming Chen, Ruichu Cai, Feng Xie, Jie Qiao, Anpeng Wu, Zijian Li, Zhifeng Hao, Kun Zhang
Keywords: Unobserved discrete data, uncovering data patterns, Unobserved discrete, scientific disciplines, latent variables
Categories: Machine Learning (cs.LG)

Abstract: Unobserved discrete data are ubiquitous in many scientific disciplines, and how to learn the causal structure of these latent variables is crucial for uncovering data patterns. Most studies focus on the linear latent variable model or impose strict constraints on latent structures, which fail to address cases in discrete data involving non-linear relationships or complex latent structures. To address this, we explore a tensor rank condition on contingency tables for an observed variable set $\mathbf{X}_p$, showing that the rank is determined by the minimum support of a specific conditional set (not necessarily in $\mathbf{X}_p$) that d-separates all variables in $\mathbf{X}_p$. By this, one can locate the latent variable through probing the rank on different sets of observed variables, and further identify the latent causal structure under some structure assumptions. We present the corresponding identification algorithm and conduct simulated experiments to verify the effectiveness of our method. In general, our results elegantly extend the identification boundary for causal discovery with discrete latent variables and expand the application scope of causal discovery with latent variables.

[LG-85] MoreauPruner: Robust Pruning of Large Language Models against Weight Perturbations

Link: https://arxiv.org/abs/2406.07017
Authors: Zixiao Wang, Jingwei Zhang, Wenqian Zhao, Farzan Farnia, Bei Yu
Keywords: few-shot gradient pruning, Few-shot gradient, Few-shot gradient methods, potential weight perturbations, regarded as static
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Abstract: Few-shot gradient methods have been extensively utilized in existing model pruning methods, where the model weights are regarded as static values and the effects of potential weight perturbations are not considered. However, the widely used large language models (LLMs) have several billion model parameters, which could increase the fragility of few-shot gradient pruning. In this work, we experimentally show that one-shot gradient pruning algorithms could lead to unstable results under perturbations to model weights. Even the minor error of switching between data formats bfloat16 and float16 could result in drastically different outcomes. To address such instabilities, we leverage optimization analysis and propose an LLM structural pruning method, called MoreauPruner, with provable robustness against weight perturbations. In MoreauPruner, the model weight importance is estimated based on the neural network’s Moreau envelope, which can be flexibly combined with $\ell_1$-norm regularization techniques to induce the sparsity required in the pruning task. We extensively evaluate the MoreauPruner algorithm on several well-known LLMs, including LLaMA-7B, LLaMA-13B, LLaMA3-8B, and Vicuna-7B. Our numerical results suggest the robustness of MoreauPruner against weight perturbations, and indicate MoreauPruner’s successful accuracy-based scores in comparison to several existing pruning methods. We have released the code at this https URL.
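
For readers unfamiliar with the Moreau envelope: for a function f and a parameter λ > 0 it is the smoothed function M(w) = min over u of f(u) + ||u − w||² / (2λ). The sketch below evaluates it for the toy case f(u) = |u|, where the minimizer is soft-thresholding and the envelope is the Huber function. This is a textbook example chosen for illustration; it is not MoreauPruner's importance estimate, and the function names are hypothetical.

```python
def prox_abs(x: float, lam: float) -> float:
    """Minimizer of |u| + (u - x)^2 / (2 * lam): soft-thresholding."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0


def moreau_env_abs(x: float, lam: float) -> float:
    """Moreau envelope of f(u) = |u| at x, evaluated at the minimizer."""
    u = prox_abs(x, lam)
    return abs(u) + (u - x) ** 2 / (2.0 * lam)


if __name__ == "__main__":
    for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
        print(x, moreau_env_abs(x, lam=1.0))
```

The envelope is quadratic near zero and linear far from it, so it stays smooth where |u| has a kink. That smoothing is what makes gradients of the envelope less sensitive to small perturbations of the input.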

[LG-86] DNN Partitioning, Task Offloading, and Resource Allocation in Dynamic Vehicular Networks: A Lyapunov-Guided Diffusion-Based Reinforcement Learning Approach

Link: https://arxiv.org/abs/2406.06986
Authors: Zhang Liu, Hongyang Du, Junzhe Lin, Zhibin Gao, Lianfen Huang, Seyyedali Hosseinalipour, Dusit Niyato
Keywords: Deep Neural Network, introduced Deep Neural, Artificial Intelligence, Neural Network, advancement of Artificial
Categories: Machine Learning (cs.LG)
Comments: 16 pages, 9 figures, and with extra appendix

Abstract:The rapid advancement of Artificial Intelligence (AI) has introduced Deep Neural Network (DNN)-based tasks to the ecosystem of vehicular networks. These tasks are often computation-intensive, requiring substantial computation resources, which are beyond the capability of a single vehicle. To address this challenge, Vehicular Edge Computing (VEC) has emerged as a solution, offering computing services for DNN-based tasks through resource pooling via Vehicle-to-Vehicle/Infrastructure (V2V/V2I) communications. In this paper, we formulate the problem of joint DNN partitioning, task offloading, and resource allocation in VEC as a dynamic long-term optimization. Our objective is to minimize the DNN-based task completion time while guaranteeing the system stability over time. To this end, we first leverage a Lyapunov optimization technique to decouple the original long-term optimization with stability constraints into a per-slot deterministic problem. Afterwards, we propose a Multi-Agent Diffusion-based Deep Reinforcement Learning (MAD2RL) algorithm, incorporating the innovative use of diffusion models to determine the optimal DNN partitioning and task offloading decisions. Furthermore, we integrate convex optimization techniques into MAD2RL as a subroutine to allocate computation resources, enhancing the learning efficiency. Through simulations under real-world movement traces of vehicles, we demonstrate the superior performance of our proposed algorithm compared to existing benchmark solutions.
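
The Lyapunov step mentioned in the abstract above, turning a long-term problem with stability constraints into one decision per time slot, is commonly done with a drift-plus-penalty rule. A minimal sketch under simplifying assumptions: a single backlog queue, a finite list of candidate actions, and a generic scalar cost standing in for task completion time. All names here are hypothetical and the paper's actual formulation is richer.

```python
def update_queue(backlog: float, arrival: float, service: float) -> float:
    """Standard queue dynamics: the backlog can never go negative."""
    return max(backlog + arrival - service, 0.0)


def drift_plus_penalty_choice(backlog, arrival, actions, v):
    """Pick the per-slot action minimizing v * cost + backlog * (arrival - service).

    actions: list of (cost, service) pairs.
    v: weight trading off cost against queue stability.
    """
    return min(actions, key=lambda a: v * a[0] + backlog * (arrival - a[1]))


if __name__ == "__main__":
    actions = [(1.0, 1.0), (3.0, 4.0)]  # (cost, service) pairs
    backlog = 0.0
    for slot in range(6):
        cost, service = drift_plus_penalty_choice(backlog, 2.0, actions, v=1.0)
        backlog = update_queue(backlog, 2.0, service)
        print(slot, cost, service, backlog)
```

When the backlog is small the rule prefers the cheap action; as the backlog grows, the queue term dominates and the rule switches to the action with more service. That switching is how a per-slot choice keeps the long-term system stable.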

[LG-87] On the Hölder Stability of Multiset and Graph Neural Networks

Link: https://arxiv.org/abs/2406.06984
Authors: Yair Davidson, Nadav Dym
Keywords: message passing neural, neural networks based, passing neural networks, graph isomorphism test, separate all distinct
Categories: Machine Learning (cs.LG)

Abstract: Famously, multiset neural networks based on sum-pooling can separate all distinct multisets, and as a result can be used by message passing neural networks (MPNNs) to separate all pairs of graphs that can be separated by the 1-WL graph isomorphism test. However, the quality of this separation may be very weak, to the extent that the embeddings of “separable” multisets and graphs might even be considered identical when using fixed finite precision. In this work, we propose to fully analyze the separation quality of multiset models and MPNNs via a novel adaptation of Lipschitz and Hölder continuity to parametric functions. We prove that common sum-based models are lower-Hölder continuous, with a Hölder exponent that decays rapidly with the network’s depth. Our analysis leads to adversarial examples of graphs which can be separated by three 1-WL iterations, but cannot be separated in practice by standard maximally powerful MPNNs. To remedy this, we propose two novel MPNNs with improved separation quality, one of which is lower Lipschitz continuous. We show these MPNNs can easily classify our adversarial examples, and compare favorably with standard MPNNs on standard graph learning tasks.
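
For orientation, the usual (non-parametric) notion of a lower-Hölder bound can be written as below. The symbols c, α, d_X, and d_Y are generic notation assumed here; the paper adapts the notion to parametric functions, so its exact definition may differ.

```latex
% f : (X, d_X) \to (Y, d_Y); c > 0 and \alpha > 0 are constants.
c \, d_X(x, x')^{\alpha} \;\le\; d_Y\bigl(f(x), f(x')\bigr)
\qquad \text{for all } x, x' \in X .
% With \alpha = 1 this is the lower-Lipschitz condition.
```

Read this way, a bound that only holds with a large exponent α lets embeddings of nearby but distinct inputs be extremely close, which matches the abstract's remark that such inputs can look identical in finite precision.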

[LG-88] AudioMarkBench: Benchmarking Robustness of Audio Watermarking

Link: https://arxiv.org/abs/2406.06979
Authors: Hongbin Liu, Moyang Guo, Zhengyuan Jiang, Lun Wang, Neil Zhenqiang Gong
Keywords: raises ethical concerns, synthetic speech, driven by advancements, raises ethical, impersonation and disinformation
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Abstract: The increasing realism of synthetic speech, driven by advancements in text-to-speech models, raises ethical concerns regarding impersonation and disinformation. Audio watermarking offers a promising solution via embedding human-imperceptible watermarks into AI-generated audio. However, the robustness of audio watermarking against common/adversarial perturbations remains understudied. We present AudioMarkBench, the first systematic benchmark for evaluating the robustness of audio watermarking against watermark removal and watermark forgery. AudioMarkBench includes a new dataset created from Common-Voice across languages, biological sexes, and ages, 3 state-of-the-art watermarking methods, and 15 types of perturbations. We benchmark the robustness of these methods against the perturbations in no-box, black-box, and white-box settings. Our findings highlight the vulnerabilities of current watermarking techniques and emphasize the need for more robust and fair audio watermarking solutions. Our dataset and code are publicly available at this https URL.

[LG-89] Cross-domain-aware Worker Selection with Training for Crowdsourced Annotation

Link: https://arxiv.org/abs/2406.06977
Authors: Yushi Sun, Jiachuan Wang, Peng Cheng, Libin Zheng, Lei Chen, Jian Yin
Keywords: draws incremental attention, crowdsourcing draws incremental, Annotation through crowdsourcing, incremental attention, crowdsourcing draws
Categories: Machine Learning (cs.LG); Databases (cs.DB)
Comments: Accepted by ICDE 2024

Abstract:Annotation through crowdsourcing draws incremental attention, which relies on an effective selection scheme given a pool of workers. Existing methods propose to select workers based on their performance on tasks with ground truth, while two important points are missed. 1) The historical performances of workers in other tasks. In real-world scenarios, workers need to solve a new task whose correlation with previous tasks is not well-known before the training, which is called cross-domain. 2) The dynamic worker performance as workers will learn from the ground truth. In this paper, we consider both factors in designing an allocation scheme named cross-domain-aware worker selection with training approach. Our approach proposes two estimation modules to both statistically analyze the cross-domain correlation and simulate the learning gain of workers dynamically. A framework with a theoretical analysis of the worker elimination process is given. To validate the effectiveness of our methods, we collect two novel real-world datasets and generate synthetic datasets. The experiment results show that our method outperforms the baselines on both real-world and synthetic datasets.

[LG-90] Discrete Dictionary-based Decomposition Layer for Structured Representation Learning

Link: https://arxiv.org/abs/2406.06976
Authors: Taewon Park, Hyun-Chul Kim, Minho Lee
Keywords: Neuro-symbolic neural networks, Tensor Product Representation, Neuro-symbolic neural, neural networks, structured TPR representations
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Neuro-symbolic neural networks have been extensively studied to integrate symbolic operations with neural networks, thereby improving systematic generalization. Specifically, Tensor Product Representation (TPR) framework enables neural networks to perform differentiable symbolic operations by encoding the symbolic structure of data within vector spaces. However, TPR-based neural networks often struggle to decompose unseen data into structured TPR representations, undermining their symbolic operations. To address this decomposition problem, we propose a Discrete Dictionary-based Decomposition (D3) layer designed to enhance the decomposition capabilities of TPR-based models. D3 employs discrete, learnable key-value dictionaries trained to capture symbolic features essential for decomposition operations. It leverages the prior knowledge acquired during training to generate structured TPR representations by mapping input data to pre-learned symbolic features within these dictionaries. D3 is a straightforward drop-in layer that can be seamlessly integrated into any TPR-based model without modifications. Our experimental results demonstrate that D3 significantly improves the systematic generalization of various TPR-based models while requiring fewer additional parameters. Notably, D3 outperforms baseline models on the synthetic task that demands the systematic decomposition of unseen combinatorial data.

[LG-91] Generative Lifting of Multiview to 3D from Unknown Pose: Wrapping NeRF inside Diffusion

Link: https://arxiv.org/abs/2406.06972
Authors: Xin Yuan, Rana Hanocka, Michael Maire
Keywords: generative modeling problem, cast multiview reconstruction, Neural Radiance Field, Diffusion Probabilistic Model, modeling problem
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:We cast multiview reconstruction from unknown pose as a generative modeling problem. From a collection of unannotated 2D images of a scene, our approach simultaneously learns both a network to predict camera pose from 2D image input, as well as the parameters of a Neural Radiance Field (NeRF) for the 3D scene. To drive learning, we wrap both the pose prediction network and NeRF inside a Denoising Diffusion Probabilistic Model (DDPM) and train the system via the standard denoising objective. Our framework requires the system accomplish the task of denoising an input 2D image by predicting its pose and rendering the NeRF from that pose. Learning to denoise thus forces the system to concurrently learn the underlying 3D NeRF representation and a mapping from images to camera extrinsic parameters. To facilitate the latter, we design a custom network architecture to represent pose as a distribution, granting implicit capacity for discovering view correspondences when trained end-to-end for denoising alone. This technique allows our system to successfully build NeRFs, without pose knowledge, for challenging scenes where competing methods fail. At the conclusion of training, our learned NeRF can be extracted and used as a 3D scene model; our full system can be used to sample novel camera poses and generate novel-view images.

[LG-92] Beyond the Norms: Detecting Prediction Errors in Regression Models

Link: https://arxiv.org/abs/2406.06968
Authors: Andres Altieri, Marco Romanelli, Georg Pichler, Florence Alberge, Pablo Piantanida
Keywords: detecting unreliable behavior, intrinsic variability, paper tackles, tackles the challenge, challenge of detecting
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: To appear as spotlight at ICML 2024. 36 pages, 4 figures

Abstract:This paper tackles the challenge of detecting unreliable behavior in regression algorithms, which may arise from intrinsic variability (e.g., aleatoric uncertainty) or modeling errors (e.g., model uncertainty). First, we formally introduce the notion of unreliability in regression, i.e., when the output of the regressor exceeds a specified discrepancy (or error). Then, using powerful tools for probabilistic modeling, we estimate the discrepancy density, and we measure its statistical diversity using our proposed metric for statistical dissimilarity. In turn, this allows us to derive a data-driven score that expresses the uncertainty of the regression outcome. We show empirical improvements in error detection for multiple regression tasks, consistently outperforming popular baseline approaches, and contributing to the broader field of uncertainty quantification and safe machine learning systems. Our code is available at this https URL.

[LG-93] Low Rank Multi-Dictionary Selection at Scale

Link: https://arxiv.org/abs/2406.06960
Authors: Boya Ma, Maxwell McNeil, Abram Magner, Petko Bogdanov
Keywords: framework represents signals, dictionary coding framework, predefined dictionary atoms, coding framework represents, predefined dictionary
Categories: Machine Learning (cs.LG)
Comments: Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, August 25–29, 2024, Barcelona, Spain

Abstract: The sparse dictionary coding framework represents signals as a linear combination of a few predefined dictionary atoms. It has been employed for images, time series, graph signals and recently for 2-way (or 2D) spatio-temporal data employing jointly temporal and spatial dictionaries. Large and over-complete dictionaries enable high-quality models, but also pose scalability challenges which are exacerbated in multi-dictionary settings. Hence, an important problem that we address in this paper is: How to scale multi-dictionary coding for large dictionaries and datasets? We propose a multi-dictionary atom selection technique for low-rank sparse coding named LRMDS. To enable scalability to large dictionaries and datasets, it progressively selects groups of row-column atom pairs based on their alignment with the data and performs convex relaxation coding via the corresponding sub-dictionaries. We demonstrate both theoretically and experimentally that when the data has a low-rank encoding with a sparse subset of the atoms, LRMDS is able to select them with strong guarantees under mild assumptions. Furthermore, we demonstrate the scalability and quality of LRMDS in both synthetic and real-world datasets and for a range of coding dictionaries. It achieves 3X to 10X speed-up compared to baselines, while obtaining up to two orders of magnitude improvement in representation quality on some of the real world datasets given a fixed target number of atoms.

[LG-94] Unleashing the Denoising Capability of Diffusion Prior for Solving Inverse Problems

Link: https://arxiv.org/abs/2406.06959
Authors: Jiawei Zhang, Jiaxin Zhuang, Cheng Jin, Gen Li, Yuantao Gu
Keywords: presenting innovative avenues, addressing inverse problems, inverse problems, presenting innovative, recent emergence
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:The recent emergence of diffusion models has significantly advanced the precision of learnable priors, presenting innovative avenues for addressing inverse problems. Since inverse problems inherently entail maximum a posteriori estimation, previous works have endeavored to integrate diffusion priors into the optimization frameworks. However, prevailing optimization-based inverse algorithms primarily exploit the prior information within the diffusion models while neglecting their denoising capability. To bridge this gap, this work leverages the diffusion process to reframe noisy inverse problems as a two-variable constrained optimization task by introducing an auxiliary optimization variable. By employing gradient truncation, the projection gradient descent method is efficiently utilized to solve the corresponding optimization problem. The proposed algorithm, termed ProjDiff, effectively harnesses the prior information and the denoising capability of a pre-trained diffusion model within the optimization framework. Extensive experiments on the image restoration tasks and source separation and partial generation tasks demonstrate that ProjDiff exhibits superior performance across various linear and nonlinear inverse problems, highlighting its potential for practical applications. Code is available at this https URL.
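
ProjDiff builds on projected gradient descent. As a reminder of that building block only, here is a minimal sketch on a toy problem: find the point in the unit ball closest to a target. The diffusion prior, the auxiliary variable, and gradient truncation from the paper are not modeled, and all names are hypothetical.

```python
import math


def project_unit_ball(x):
    """Euclidean projection onto the set {x : ||x|| <= 1}."""
    norm = math.sqrt(sum(v * v for v in x))
    if norm <= 1.0:
        return list(x)
    return [v / norm for v in x]


def projected_gradient_descent(target, steps=100, lr=0.1):
    """Minimize ||x - target||^2 subject to ||x|| <= 1."""
    x = [0.0] * len(target)
    for _ in range(steps):
        grad = [2.0 * (xi - ti) for xi, ti in zip(x, target)]
        x = [xi - lr * gi for xi, gi in zip(x, grad)]  # gradient step
        x = project_unit_ball(x)                       # restore feasibility
    return x


if __name__ == "__main__":
    print(projected_gradient_descent([3.0, 4.0]))
    print(projected_gradient_descent([0.3, 0.4]))
```

Each iteration alternates an unconstrained gradient step with a projection back onto the feasible set. For a target outside the ball the iterate settles on the boundary point in the target's direction; for a target inside the ball the constraint is inactive.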

[LG-95] ElasticRec: A Microservice-based Model Serving Architecture Enabling Elastic Resource Scaling for Recommendation Models

Link: https://arxiv.org/abs/2406.06955
Authors: Yujeong Choi, Jiin Kim, Minsoo Rhu
Keywords: datacenters has surged, increasing popularity, popularity of recommendation, resource, RecSys
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Abstract:With the increasing popularity of recommendation systems (RecSys), the demand for compute resources in datacenters has surged. However, the model-wise resource allocation employed in current RecSys model serving architectures falls short in effectively utilizing resources, leading to sub-optimal total cost of ownership. We propose ElasticRec, a model serving architecture for RecSys providing resource elasticity and high memory efficiency. ElasticRec is based on a microservice-based software architecture for fine-grained resource allocation, tailored to the heterogeneous resource demands of RecSys. Additionally, ElasticRec achieves high memory efficiency via our utility-based resource allocation. Overall, ElasticRec achieves an average 3.3x reduction in memory allocation size and 8.1x increase in memory utility, resulting in an average 1.6x reduction in deployment cost compared to state-of-the-art RecSys inference serving system.

[LG-96] Distributional MIPLIB: a Multi-Domain Library for Advancing ML-Guided MILP Methods

Link: https://arxiv.org/abs/2406.06954
Authors: Weimin Huang, Taoan Huang, Aaron M Ferber, Bistra Dilkina
Keywords: Integer Linear Programming, Mixed Integer Linear, Linear Programming, Integer Linear, modeling combinatorial optimization
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)

Abstract:Mixed Integer Linear Programming (MILP) is a fundamental tool for modeling combinatorial optimization problems. Recently, a growing body of research has used machine learning to accelerate MILP solving. Despite the increasing popularity of this approach, there is a lack of a common repository that provides distributions of similar MILP instances across different domains, at different hardness levels, with standardized test sets. In this paper, we introduce Distributional MIPLIB, a multi-domain library of problem distributions for advancing ML-guided MILP methods. We curate MILP distributions from existing work in this area as well as real-world problems that have not been used, and classify them into different hardness levels. It will facilitate research in this area by enabling comprehensive evaluation on diverse and realistic domains. We empirically illustrate the benefits of using Distributional MIPLIB as a research vehicle in two ways. We evaluate the performance of ML-guided variable branching on previously unused distributions to identify potential areas for improvement. Moreover, we propose to learn branching policies from a mix of distributions, demonstrating that mixed distributions achieve better performance compared to homogeneous distributions when there is limited data and generalize well to larger instances.

[LG-97] Non-autoregressive Personalized Bundle Generation

Link: https://arxiv.org/abs/2406.06925
Authors: Wenchuan Yang, Cheng Yang, Jichao Li, Yuejin Tan, Xin Lu, Chuan Shi
Keywords: numerous candidate items, receives increasing attention, personalized bundle generation, candidate items, receives increasing
Categories: Machine Learning (cs.LG); Information Retrieval (cs.IR)
Comments: Submitted to Information Processing & Management

Abstract: The personalized bundle generation problem, which aims to create a preferred bundle for a user from numerous candidate items, receives increasing attention in recommendation. However, existing works ignore the order-invariant nature of the bundle and adopt sequential modeling methods as the solution, which might introduce inductive bias and cause a large latency in prediction. To address this problem, we propose to perform the bundle generation via a non-autoregressive mechanism and design a novel encoder-decoder framework named BundleNAT, which can effectively output the targeted bundle in one shot without relying on any inherent order. In detail, instead of learning sequential dependency, we propose to adopt pre-training techniques and a graph neural network to fully embed user-based preference and item-based compatibility information, and use a self-attention based encoder to further extract global dependency patterns. We then design a permutation-equivariant decoding architecture that is able to directly output the desired bundle in a one-shot manner. Experiments on three real-world datasets from Youshu and Netease show the proposed BundleNAT significantly outperforms the current state-of-the-art methods on average by up to 35.92%, 10.97% and 23.67% absolute improvements in Precision, Precision+, and Recall, respectively.

[LG-98] Training Dynamics of Nonlinear Contrastive Learning Model in the High Dimensional Limit

Link: https://arxiv.org/abs/2406.06909
Authors: Lineghuan Meng, Chuang Wang
Keywords: single-layer nonlinear contrastive, nonlinear contrastive learning, contrastive learning model, letter presents, presents a high-dimensional
Categories: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (stat.ML)
Comments: 21 pages, 11 figures

Abstract: This letter presents a high-dimensional analysis of the training dynamics for a single-layer nonlinear contrastive learning model. The empirical distribution of the model weights converges to a deterministic measure governed by a McKean-Vlasov nonlinear partial differential equation (PDE). Under L2 regularization, this PDE reduces to a closed set of low-dimensional ordinary differential equations (ODEs), reflecting the evolution of the model performance during the training process. We analyze the fixed-point locations of the ODEs and their stability, unveiling several interesting findings. First, only the hidden variable’s second moment affects feature learnability at the state with uninformative initialization. Second, higher moments influence the probability of feature selection by controlling the attraction region, rather than affecting local stability. Finally, independent noise added in the data augmentation degrades performance, but negatively correlated noise can reduce the variance of gradient estimation, yielding better performance. Despite the simplicity of the analyzed model, it exhibits rich training dynamics, paving the way to understanding more complex mechanisms behind practical large models.

[LG-99] SignMusketeers: An Efficient Multi-Stream Approach for Sign Language Translation at Scale

Link: https://arxiv.org/abs/2406.06907
Authors: Shester Gueuwou, Xiaodan Du, Greg Shakhnarovich, Karen Livescu
Keywords: irrelevant visual differences, sign language, language video processing, written language translation, sign language video
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:A persistent challenge in sign language video processing, including the task of sign language to written language translation, is how we learn representations of sign language in an effective and efficient way that can preserve the important attributes of these languages, while remaining invariant to irrelevant visual differences. Informed by the nature and linguistics of signed languages, our proposed method focuses on just the most relevant parts in a signing video: the face, hands and body posture of the signer. However, instead of using pose estimation coordinates from off-the-shelf pose tracking models, which have inconsistent performance for hands and faces, we propose to learn the complex handshapes and rich facial expressions of sign languages in a self-supervised fashion. Our approach is based on learning from individual frames (rather than video sequences) and is therefore much more efficient than prior work on sign language pre-training. Compared to a recent model that established a new state of the art in sign language translation on the How2Sign dataset, our approach yields similar translation performance, using less than 3% of the compute.

[LG-100] Nonlinear time-series embedding by monotone variational inequality

Link: https://arxiv.org/abs/2406.06894
Authors: Jonathan Y. Zhou, Yao Xie
Keywords: motion capture, natural language, encounter collections, collections of sequential, nonlinear time series
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract: In the wild, we often encounter collections of sequential data such as electrocardiograms, motion capture, genomes, and natural language, and sequences may be multichannel or symbolic with nonlinear dynamics. We introduce a new method to learn low-dimensional representations of nonlinear time series without supervision and with provable recovery guarantees. The learned representation can be used for downstream machine-learning tasks such as clustering and classification. The method is based on the assumption that the observed sequences arise from a common domain, but each sequence obeys its own autoregressive model, and these models are related to each other through low-rank regularization. We cast the problem as a computationally efficient convex matrix parameter recovery problem using monotone Variational Inequality and encode the common domain assumption via a low-rank constraint across the learned representations, which can learn the geometry for the entire domain as well as faithful representations for the dynamics of each individual sequence using the domain information in totality. We show the competitive performance of our method against baselines on real-world time-series data and demonstrate its effectiveness for symbolic text modeling and RNA sequence clustering.

[LG-101] Tokenize features enhancing tables: the FT-TABPFN model for tabular classification

Link: https://arxiv.org/abs/2406.06891
Authors: Quangao Liu, Wei Yang, Chen Liang, Longlong Pang, Zhuozhang Zou
Keywords: determine model parameters, Prior-Data Fitted Networks, extensive training data, requires extensive training, Traditional methods
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Traditional methods for tabular classification usually rely on supervised learning from scratch, which requires extensive training data to determine model parameters. However, a novel approach called Prior-Data Fitted Networks (TabPFN) has changed this paradigm. TabPFN uses a 12-layer transformer trained on large synthetic datasets to learn universal tabular representations. This method enables fast and accurate predictions on new tasks with a single forward pass and no need for additional training. Although TabPFN has been successful on small datasets, it generally shows weaker performance when dealing with categorical features. To overcome this limitation, we propose FT-TabPFN, which is an enhanced version of TabPFN that includes a novel Feature Tokenization layer to better handle classification features. By fine-tuning it for downstream tasks, FT-TabPFN not only expands the functionality of the original model but also significantly improves its applicability and accuracy in tabular classification. Our full source code is available for community use and development.

[LG-102] PLUM: Preference Learning Plus Test Cases Yields Better Code Language Models

Link: https://arxiv.org/abs/2406.06887
Authors: Dylan Zhang, Shizhe Diao, Xueyan Zou, Hao Peng
Keywords: Instruction-finetuned code language, Instruction-finetuned code, programming tasks, shown promise, Instruction-finetuned
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Programming Languages (cs.PL); Software Engineering (cs.SE)

Abstract: Instruction-finetuned code language models (LMs) have shown promise in various programming tasks. They are trained, using a language modeling objective, on natural language instructions and gold code snippet pairs. Recent evidence suggests that these models, never exposed to incorrect solutions during training, often struggle to distinguish between correct and incorrect solutions. This observation raises our inquiry: Can preference learning, which trains models to prefer correct solutions over incorrect ones, help push the boundaries of code LMs even further? We propose PLUM, a novel preference learning framework augmented with test cases tailored for code LMs. PLUM aims to investigate the key success factors and potential benefits of preference learning in code LMs, which remain elusive despite its success in aligning LMs with human values. PLUM consists of three stages: (1) Generating test cases for natural language instructions, (2) sampling candidate solutions from the policy and evaluating them against the test cases to create a preference dataset, which is then used to (3) train the policy with a preference learning algorithm. Experiments demonstrate that PLUM substantially improves the performance of existing code LMs on established code generation benchmarks such as HumanEval (+) and MBPP (+), even for the state-of-the-art open-source language model CodeQwen-1.5-7B-Chat. PLUM complements the supervised fine-tuning (SFT) stage, demonstrating synergistic effects.
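
Stage (2) of the pipeline described above, turning sampled solutions and test cases into preference data, can be sketched roughly as follows. This is an illustration under simplifying assumptions: candidates are plain Python source strings, tests are (arguments, expected result) pairs, and every passing solution is paired with every failing one. The helper names are hypothetical, and a real system would run untrusted code in a sandbox rather than with `exec`.

```python
def passes_all(source: str, func_name: str, tests) -> bool:
    """Run one candidate against every test case; any error counts as failure.

    Warning: exec() on untrusted model output is unsafe outside a sandbox.
    """
    namespace = {}
    try:
        exec(source, namespace)
        fn = namespace[func_name]
        return all(fn(*args) == expected for args, expected in tests)
    except Exception:
        return False


def build_preference_pairs(candidates, func_name, tests):
    """Return (chosen, rejected) pairs: each passing vs. each failing candidate."""
    results = [(c, passes_all(c, func_name, tests)) for c in candidates]
    passed = [c for c, ok in results if ok]
    failed = [c for c, ok in results if not ok]
    return [(good, bad) for good in passed for bad in failed]


if __name__ == "__main__":
    candidates = [
        "def add(a, b):\n    return a + b\n",
        "def add(a, b):\n    return a - b\n",
    ]
    tests = [((1, 2), 3), ((0, 0), 0)]
    for chosen, rejected in build_preference_pairs(candidates, "add", tests):
        print("chosen:", chosen.strip().splitlines()[-1].strip())
        print("rejected:", rejected.strip().splitlines()[-1].strip())
```

The resulting pairs are the kind of input a preference learning algorithm consumes in stage (3). Which algorithm is used, and how pairs are filtered or weighted, is not specified in the abstract.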

[LG-103] FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion

Link: https://arxiv.org/abs/2406.06858
Authors: Liwen Chang, Wenlei Bao, Qi Hou, Chengquan Jiang, Ningxin Zheng, Xuanrun Zhang, Zuquan Song, Ziheng Jiang, Haibin Lin, Xin Liu
Keywords: Large deep learning, demonstrated strong ability, deep learning models, large models typically, range of applications
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments:

Abstract: Large deep learning models have demonstrated a strong ability to solve many tasks across a wide range of applications. Those large models typically require training and inference to be distributed. Tensor parallelism is a common technique that partitions the computation of an operation or layer across devices to overcome the memory capacity limitation of a single processor, and/or to accelerate computation to meet a certain latency requirement. However, this kind of parallelism introduces additional communication that might contribute a significant portion of overall runtime. This limits the scalability of the technique within a group of devices with high-speed interconnects, such as GPUs with NVLinks in a node. This paper proposes a novel method, Flux, to significantly hide communication latencies with dependent computations for GPUs. Flux over-decomposes communication and computation operations into much finer-grained operations and further fuses them into a larger kernel to effectively hide communication without compromising kernel efficiency. Flux can potentially overlap up to 96% of communication given a fused kernel. Overall, it can achieve up to 1.24x speedups for training over Megatron-LM on a cluster of 128 GPUs with various GPU generations and interconnects, and up to 1.66x and 1.30x speedups for prefill and decoding inference over vLLM on a cluster with 8 GPUs with various GPU generations and interconnects.

[LG-104] Sample Complexity Reduction via Policy Difference Estimation in Tabular Reinforcement Learning

Link: https://arxiv.org/abs/2406.06856
Authors: Adhyyan Narang, Andrew Wagenmaker, Lillian Ratliff, Kevin Jamieson
Keywords: pure exploration problem, tabular reinforcement learning, contextual bandits, reinforcement learning, identifying an epsilon-optimal
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 59 pages, 2 Figures

Abstract:In this paper, we study the non-asymptotic sample complexity for the pure exploration problem in contextual bandits and tabular reinforcement learning (RL): identifying an epsilon-optimal policy from a set of policies with high probability. Existing work in bandits has shown that it is possible to identify the best policy by estimating only the difference between the behaviors of individual policies, which can be substantially cheaper than estimating the behavior of each policy directly. However, the best-known complexities in RL fail to take advantage of this and instead estimate the behavior of each policy directly. Does it suffice to estimate only the differences in the behaviors of policies in RL? We answer this question positively for contextual bandits but in the negative for tabular RL, showing a separation between contextual bandits and RL. However, inspired by this, we show that it almost suffices to estimate only the differences in RL: if we can estimate the behavior of a single reference policy, it suffices to only estimate how any other policy deviates from this reference policy. We develop an algorithm which instantiates this principle and obtains, to the best of our knowledge, the tightest known bound on the sample complexity of tabular RL.

[LG-105] Compass: A Comprehensive Tool for Accurate and Efficient Molecular Docking in Inference and Fine-Tuning

Link: https://arxiv.org/abs/2406.06841
Authors: Ahmet Sarigun, Vedran Franke, Altuna Akalin
Keywords: Binding Affinity Energy, molecular docking, molecular strain energy, bioactivity noise characteristics, molecular
Categories: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments:

Abstract:While there has been discussion about noise levels in molecular docking datasets such as PDBBind, a thorough analysis of their physical/chemical and bioactivity noise characteristics is still lacking. PoseCheck addresses this issue by examining molecular strain energy, molecular-protein clashes, and interactions, but it is primarily created for de novo drug design. Another important metric in molecular docking, Binding Affinity Energy, is better assessed by the new empirical score function, AA-Score, which has demonstrated improved performance over existing methods. To tackle these challenges, we propose the COMPASS method, which integrates the PoseCheck and AA-Score modules. This approach evaluates dataset noise levels and the physical/chemical and bioactivity feasibility of docked molecules. Our analysis of the PDBBind dataset using COMPASS reveals significant noise in the ground truth data. Additionally, we incorporate COMPASS with the state-of-the-art molecular docking method, DiffDock, in inference mode to achieve efficient and accurate assessments of docked ligands. Finally, we propose a new paradigm to enhance model performance for molecular docking through fine-tuning and discuss the potential benefits of this approach. The source code is available publicly at this https URL.

[LG-106] Silent Signals Loud Impact: LLMs for Word-Sense Disambiguation of Coded Dog Whistles

Link: https://arxiv.org/abs/2406.06840
Authors: Julia Kruk, Michela Marchini, Rijul Ragu, Caleb Ziems, David Muchlinski, Diyi Yang
Keywords: United States politics, socioeconomic discrimination, Large Language Models, carries a secondary, secondary meaning
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); ACM classes: J.4; K.4.1; K.4.2
Comments: ACL 2024

Abstract: A dog whistle is a form of coded communication that carries a secondary meaning to specific audiences and is often weaponized for racial and socioeconomic discrimination. Dog whistling historically originated from United States politics, but in recent years has taken root in social media as a means of evading hate speech detection systems and maintaining plausible deniability. In this paper, we present an approach for word-sense disambiguation of dog whistles from standard speech using Large Language Models (LLMs), and leverage this technique to create a dataset of 16,550 high-confidence coded examples of dog whistles used in formal and informal communication. Silent Signals is the largest dataset of disambiguated dog whistle usage, created for applications in hate speech detection, neology, and political science. The dataset can be found at this https URL.

[LG-107] Stable Minima Cannot Overfit in Univariate ReLU Networks: Generalization by Large Step Sizes

Link: https://arxiv.org/abs/2406.06838
Authors: Dan Qiao, Kaiqi Zhang, Esha Singh, Daniel Soudry, Yu-Xiang Wang
Keywords: two-layer ReLU neural, ReLU neural networks, neural networks, univariate nonparametric regression, learning rate
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 51 pages

Abstract: We study the generalization of two-layer ReLU neural networks in a univariate nonparametric regression problem with noisy labels. This is a problem where kernels (e.g., NTK) are provably sub-optimal and benign overfitting does not happen, thus disqualifying existing theory for interpolating (0-loss, global optimal) solutions. We present a new theory of generalization for local minima that gradient descent with a constant learning rate can stably converge to. We show that gradient descent with a fixed learning rate $\eta$ can only find local minima that represent smooth functions with a certain weighted first order total variation bounded by $1/\eta - 1/2 + \widetilde{O}(\sigma + \sqrt{\mathrm{MSE}})$, where $\sigma$ is the label noise level, $\mathrm{MSE}$ is short for mean squared error against the ground truth, and $\widetilde{O}(\cdot)$ hides a logarithmic factor. Under mild assumptions, we also prove a nearly-optimal MSE bound of $\widetilde{O}(n^{-4/5})$ within the strict interior of the support of the $n$ data points. Our theoretical results are validated by extensive simulation that demonstrates large learning rate training induces sparse linear spline fits. To the best of our knowledge, we are the first to obtain generalization bound via minima stability in the non-interpolation case and the first to show ReLU NNs without regularization can achieve near-optimal rates in nonparametric regression.
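
For intuition on why a fixed learning rate restricts the minima gradient descent can settle into, the classical one-dimensional stability condition is a useful reference point: on f(x) = (a/2)·x², the update x ← (1 − η·a)·x contracts only when a < 2/η. The sketch below illustrates that textbook fact only. It is not the paper's total-variation bound, and the function name `gd_on_quadratic` and the chosen curvatures are illustrative assumptions.

```python
def gd_on_quadratic(curvature, lr, x0=1.0, steps=200):
    """Run gradient descent on f(x) = 0.5 * curvature * x**2 and return |x_final|.

    The gradient is curvature * x, so each step multiplies x by (1 - lr * curvature).
    """
    x = x0
    for _ in range(steps):
        x -= lr * curvature * x
    return abs(x)


if __name__ == "__main__":
    lr = 0.5                      # stability threshold on curvature is 2 / lr = 4
    flat = gd_on_quadratic(3.0, lr)   # factor |1 - 1.5| = 0.5 < 1: iterates shrink
    sharp = gd_on_quadratic(5.0, lr)  # factor |1 - 2.5| = 1.5 > 1: iterates blow up
    print(f"curvature 3.0 -> |x| = {flat:.3e}")
    print(f"curvature 5.0 -> |x| = {sharp:.3e}")
```

With a larger step size the threshold 2/η drops, so only flatter minima remain stable; the abstract's result can be read as a much sharper, function-space version of this kind of constraint.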

[LG-108] Personalized Binomial DAGs Learning with Network Structured Covariates

Link: https://arxiv.org/abs/2406.06829
Authors: Boxin Zhao, Weishi Wang, Dingyuan Zhu, Ziqi Liu, Dong Wang, Zhiqiang Zhang, Jun Zhou, Mladen Kolar
Keywords: Directed Acyclic Graphical, Acyclic Graphical, Directed Acyclic, characterized by Directed, DAG
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:The causal dependence in data is often characterized by Directed Acyclic Graphical (DAG) models, widely used in many areas. Causal discovery aims to recover the DAG structure using observational data. This paper focuses on causal discovery with multi-variate count data. We are motivated by real-world web visit data, recording individual user visits to multiple websites. Building a causal diagram can help understand user behavior in transitioning between websites, inspiring operational strategy. A challenge in modeling is user heterogeneity, as users with different backgrounds exhibit varied behaviors. Additionally, social network connections can result in similar behaviors among friends. We introduce personalized Binomial DAG models to address heterogeneity and network dependency between observations, which are common in real-world applications. To learn the proposed DAG model, we develop an algorithm that embeds the network structure into a dimension-reduced covariate, learns each node’s neighborhood to reduce the DAG search space, and explores the variance-mean relation to determine the ordering. Simulations show our algorithm outperforms state-of-the-art competitors in heterogeneous data. We demonstrate its practical usefulness on a real-world web visit dataset.

[LG-109] Locally Interdependent Multi-Agent MDP: Theoretical Framework for Decentralized Agents with Dynamic Dependencies

Link: https://arxiv.org/abs/2406.06823
Authors: Alex DeWeese, Guannan Qu
Keywords: Interdependent Multi-Agent MDP, Locally Interdependent Multi-Agent, dynamically varying dependencies, Locally Interdependent, dynamically varying
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Optimization and Control (math.OC)
Comments: Accepted to International Conference on Machine Learning 2024

Abstract: Many multi-agent systems in practice are decentralized and have dynamically varying dependencies. There has been a lack of attempts in the literature to analyze these systems theoretically. In this paper, we propose and theoretically analyze a decentralized model with dynamically varying dependencies called the Locally Interdependent Multi-Agent MDP. This model can represent problems in many disparate domains such as cooperative navigation, obstacle avoidance, and formation control. Despite the intractability that general partially observable multi-agent systems suffer from, we propose three closed-form policies that are theoretically near-optimal in this setting and are scalable to compute and store. Consequently, we reveal a fundamental property of Locally Interdependent Multi-Agent MDPs: the partially observable decentralized solution is exponentially close to the fully observable solution with respect to the visibility radius. We then discuss extensions of our closed-form policies to further improve tractability. We conclude by providing simulations to investigate some long-horizon behaviors of our closed-form policies.

[LG-110] Adapters Strike Back

Link: https://arxiv.org/abs/2406.06820
Authors: Jan-Martin O. Steitz, Stefan Roth
Keywords: adapting trained transformer, trained transformer models, efficient and lightweight, adapting trained, trained transformer
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: To appear at CVPR 2024. Code: this https URL

Abstract:Adapters provide an efficient and lightweight mechanism for adapting trained transformer models to a variety of different tasks. However, they have often been found to be outperformed by other adaptation mechanisms, including low-rank adaptation. In this paper, we provide an in-depth study of adapters, their internal structure, as well as various implementation choices. We uncover pitfalls for using adapters and suggest a concrete, improved adapter architecture, called Adapter+, that not only outperforms previous adapter implementations but surpasses a number of other, more complex adaptation mechanisms in several challenging settings. Despite this, our suggested adapter is highly robust and, unlike previous work, requires little to no manual intervention when addressing a novel scenario. Adapter+ reaches state-of-the-art average accuracy on the VTAB benchmark, even without a per-task hyperparameter optimization.

[LG-111] Conformal Prediction for Class-wise Coverage via Augmented Label Rank Calibration

Link: https://arxiv.org/abs/2406.06818
Authors: Yuanjie Shi, Subhankar Ghosh, Taha Belkhouja, Janardhan Rao Doppa, Yan Yan
Keywords: emerging uncertainty quantification, uncertainty quantification framework, Conformal prediction, conditional probability, prediction set
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Conformal prediction (CP) is an emerging uncertainty quantification framework that allows us to construct a prediction set to cover the true label with a pre-specified marginal or conditional probability. Although the valid coverage guarantee has been extensively studied for classification problems, CP often produces large prediction sets which may not be practically useful. This issue is exacerbated for the setting of class-conditional coverage on imbalanced classification tasks. This paper proposes the Rank Calibrated Class-conditional CP (RC3P) algorithm to reduce the prediction set sizes to achieve class-conditional coverage, where the valid coverage holds for each class. In contrast to the standard class-conditional CP (CCP) method that uniformly thresholds the class-wise conformity score for each class, the augmented label rank calibration step allows RC3P to selectively iterate this class-wise thresholding subroutine only for a subset of classes whose class-wise top-k error is small. We prove that agnostic to the classifier and data distribution, RC3P achieves class-wise coverage. We also show that RC3P reduces the size of prediction sets compared to the CCP method. Comprehensive experiments on multiple real-world datasets demonstrate that RC3P achieves class-wise coverage and 26.25% reduction in prediction set sizes on average.
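
The baseline this abstract contrasts against, class-conditional CP with one threshold per class, can be sketched in a few lines. This is a minimal split-conformal illustration of that baseline only, not RC3P: the score `1 - p(class)`, the function names `classwise_thresholds` and `prediction_set`, and the toy usage data are assumptions made for the example.

```python
import numpy as np


def classwise_thresholds(cal_probs, cal_labels, alpha):
    """Compute one nonconformity threshold per class from calibration data.

    ASSUMPTION: the nonconformity score of a sample with true class k is
    1 - predicted probability of k. For a class with n calibration samples the
    threshold is the ceil((n + 1) * (1 - alpha))-th smallest score; if that
    rank exceeds n (too few samples) the threshold stays +inf, so the class is
    always included.
    """
    n_classes = cal_probs.shape[1]
    thresholds = np.full(n_classes, np.inf)
    for k in range(n_classes):
        scores = np.sort(1.0 - cal_probs[cal_labels == k, k])
        n = scores.size
        rank = int(np.ceil((n + 1) * (1 - alpha)))
        if 1 <= rank <= n:
            thresholds[k] = scores[rank - 1]
    return thresholds


def prediction_set(probs, thresholds):
    """Return the class indices whose score 1 - probs[k] is within threshold k."""
    return np.flatnonzero(1.0 - probs <= thresholds)


if __name__ == "__main__":
    # Toy calibration set: three samples for each of two classes.
    cal_probs = np.array([[0.9, 0.1], [0.7, 0.3], [0.5, 0.5],
                          [0.2, 0.8], [0.4, 0.6], [0.6, 0.4]])
    cal_labels = np.array([0, 0, 0, 1, 1, 1])
    thresholds = classwise_thresholds(cal_probs, cal_labels, alpha=0.5)
    print(thresholds)
    print(prediction_set(np.array([0.75, 0.25]), thresholds))
```

Because each class is calibrated on its own samples, rare classes get loose (or infinite) thresholds, which is one way to see why prediction sets grow on imbalanced data as the abstract notes.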

[LG-112] On Learning what to Learn: heterogeneous observations of dynamics and establishing (possibly causal) relations among them

Link: https://arxiv.org/abs/2406.06812
Authors: David W. Sroczynski, Felix Dietrich, Eleni D. Koronaki, Ronen Talmon, Ronald R. Coifman, Erik Bollt, Ioannis G. Kevrekidis
Keywords: observation processes, function, observation, processes, desired function
Categories: Machine Learning (cs.LG); Dynamical Systems (math.DS); Numerical Analysis (math.NA)
Comments:

Abstract: Before we attempt to learn a function between two (sets of) observables of a physical process, we must first decide what the inputs and what the outputs of the desired function are going to be. Here we demonstrate two distinct, data-driven ways of initially deciding "the right quantities" to relate through such a function, and then proceed to learn it. This is accomplished by processing multiple simultaneous heterogeneous data streams (ensembles of time series) from observations of a physical system: multiple observation processes of the system. We thus determine (a) what subsets of observables are common between the observation processes (and therefore observable from each other, relatable through a function); and (b) what information is unrelated to these common observables, and therefore particular to each observation process, and not contributing to the desired function. Any data-driven function approximation technique can subsequently be used to learn the input-output relation, from k-nearest neighbors and Geometric Harmonics to Gaussian Processes and Neural Networks. Two particular "twists" of the approach are discussed. The first has to do with the identifiability of particular quantities of interest from the measurements. We now construct mappings from a single set of observations of one process to entire level sets of measurements of the process, consistent with this single set. The second attempts to relate our framework to a form of causality: if one of the observation processes measures "now", while the second observation process measures "in the future", the function to be learned among what is common across observation processes constitutes a dynamical model for the system evolution.

[LG-113] Learning Continually by Spectral Regularization

Link: https://arxiv.org/abs/2406.06811
Authors: Alex Lewandowski, Saurabh Kumar, Dale Schuurmans, András György, Marlos C. Machado
Keywords: Loss of plasticity, Continual learning, phenomenon where neural, difficult to train, learning
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Loss of plasticity is a phenomenon where neural networks become more difficult to train during the course of learning. Continual learning algorit