This blog post lists the latest papers fetched daily from the arXiv website, automatically updated every morning at around 11:30. Papers are grouped into five broad areas: NLP, CV, ML, AI, and IR. If you would like to receive the daily list by email, leave your email address in the comments; emails are likewise sent automatically at around 11:30 each day.



Contents

Overview (2024-06-06)

516 papers were updated today, including:

  • Natural Language Processing: 90 (Computation and Language, cs.CL)
  • Computer Vision: 110 (Computer Vision and Pattern Recognition, cs.CV)
  • Artificial Intelligence: 151 (Artificial Intelligence, cs.AI)
  • Machine Learning: 234 (Machine Learning, cs.LG)

Note: the per-category counts overlap, since a paper can be cross-listed under several categories.

Natural Language Processing

[NLP-0] Wings: Learning Multimodal LLMs without Text-only Forgetting

Link: https://arxiv.org/abs/2406.03496
Authors: Yi-Kai Zhang, Shiyin Lu, Yang Li, Yanqing Ma, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, De-Chuan Zhan, Han-Jia Ye
Keywords: large language models, Multimodal large language, trained LLM, language models, large language
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Multimodal large language models (MLLMs), initiated with a trained LLM, first align images with text and then fine-tune on multimodal mixed inputs. However, the MLLM catastrophically forgets the text-only instructions, which do not include images and can be addressed within the initial LLM. In this paper, we present Wings, a novel MLLM that excels in both text-only dialogues and multimodal comprehension. Analyzing MLLM attention in multimodal instructions reveals that text-only forgetting is related to the attention shifts from pre-image to post-image text. From that, we construct extra modules that act as the boosted learner to compensate for the attention shift. The complementary visual and textual learners, like “wings” on either side, are connected in parallel within each layer’s attention block. Initially, image and text inputs are aligned with visual learners operating alongside the main attention, balancing focus on visual elements. Textual learners are later collaboratively integrated with attention-based routing to blend the outputs of the visual and textual learners. We design the Low-Rank Residual Attention (LoRRA) to guarantee high efficiency for learners. Our experimental results demonstrate that Wings outperforms equally-scaled MLLMs in both text-only and visual question-answering tasks. On a newly constructed Interleaved Image-Text (IIT) benchmark, Wings exhibits superior performance from text-only-rich to multimodal-rich question-answering tasks.
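As a rough illustration of the low-rank learner idea (the shapes, rank, and scalar routing weight below are our own simplifications, not the paper's LoRRA implementation), a rank-r "wing" can run alongside the main attention and be added residually:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # standard scaled dot-product attention
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v

def lorra_learner(x, a_down, b_up):
    # low-rank "wing": project into a rank-r bottleneck, attend there,
    # then project back to the model dimension
    z = x @ a_down                     # (n, r)
    return attention(z, z, z) @ b_up   # (n, d)

rng = np.random.default_rng(0)
n, d, r = 4, 8, 2
x = rng.normal(size=(n, d))
a_down = rng.normal(size=(d, r))
b_up = rng.normal(size=(r, d))

main = attention(x, x, x)   # the trunk's own attention output
router = 0.3                # stand-in for the learned routing weight
y = main + router * lorra_learner(x, a_down, b_up)
print(y.shape)  # (4, 8)
```

In the paper the routing weight is itself attention-based and the learners sit in every layer; this toy only shows the parallel, residual placement.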

[NLP-1] Analyzing LLM Behavior in Dialogue Summarization: Unveiling Circumstantial Hallucination Trends

Link: https://arxiv.org/abs/2406.03487
Authors: Sanjana Ramprasad, Elisa Ferracane, Zachary C. Lipton
Keywords: Recent advancements, large language models, advancements in large, large language, considerably advanced
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted at ACL 2024

Abstract:Recent advancements in large language models (LLMs) have considerably advanced the capabilities of summarization systems. However, they continue to face concerns about hallucinations. While prior work has evaluated LLMs extensively in news domains, most evaluation of dialogue summarization has focused on BART-based models, leaving a gap in our understanding of their faithfulness. Our work benchmarks the faithfulness of LLMs for dialogue summarization, using human annotations and focusing on identifying and categorizing span-level inconsistencies. Specifically, we focus on two prominent LLMs: GPT-4 and Alpaca-13B. Our evaluation reveals subtleties as to what constitutes a hallucination: LLMs often generate plausible inferences, supported by circumstantial evidence in the conversation, that lack direct evidence, a pattern that is less prevalent in older models. We propose a refined taxonomy of errors, coining the category of “Circumstantial Inference” to bucket these LLM behaviors and release the dataset. Using our taxonomy, we compare the behavioral differences between LLMs and older fine-tuned models. Additionally, we systematically assess the efficacy of automatic error detection methods on LLM summaries and find that they struggle to detect these nuanced errors. To address this, we introduce two prompt-based approaches for fine-grained error detection that outperform existing metrics, particularly for identifying “Circumstantial Inference.”

[NLP-2] BIPED: Pedagogically Informed Tutoring System for ESL Education

Link: https://arxiv.org/abs/2406.03486
Authors: Soonwoo Kwon, Sojung Kim, Minju Park, Seunghyun Lee, Kyuseok Kim
Keywords: cost-efficient Conversational Intelligent, Intelligent Tutoring Systems, Conversational Intelligent Tutoring, Large Language Models, Large Language
Subjects: Computation and Language (cs.CL)
Comments: ACL 2024

Abstract:Large Language Models (LLMs) have a great potential to serve as readily available and cost-efficient Conversational Intelligent Tutoring Systems (CITS) for teaching L2 learners of English. Existing CITS, however, are designed to teach only simple concepts or lack the pedagogical depth necessary to address diverse learning strategies. To develop a more pedagogically informed CITS capable of teaching complex concepts, we construct a BIlingual PEDagogically-informed Tutoring Dataset (BIPED) of one-on-one, human-to-human English tutoring interactions. Through post-hoc analysis of the tutoring interactions, we come up with a lexicon of dialogue acts (34 tutor acts and 9 student acts), which we use to further annotate the collected dataset. Based on a two-step framework of first predicting the appropriate tutor act then generating the corresponding response, we implemented two CITS models using GPT-4 and SOLAR-KO, respectively. We experimentally demonstrate that the implemented models not only replicate the style of human teachers but also employ diverse and contextually appropriate pedagogical strategies.
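The two-step framework (predict a tutor act, then generate a response conditioned on it) can be sketched with stub components; the keyword rule and templates below are purely illustrative stand-ins for the fine-tuned models the paper uses:

```python
def predict_tutor_act(student_turn):
    # Step 1 (stub): pick a dialogue act for the tutor. A real system
    # would use a classifier trained on the annotated act lexicon;
    # this keyword rule is purely illustrative.
    return "Answer question" if "?" in student_turn else "Give feedback"

def generate_response(act, student_turn):
    # Step 2 (stub): generation conditioned on the predicted act.
    # In the paper this step is an LLM; here it is a template lookup,
    # and student_turn would condition a real generator.
    templates = {
        "Answer question": "Good question! Here is how it works: ...",
        "Give feedback": "Nice try - let me suggest a small correction: ...",
    }
    return templates[act]

turn = "What does 'albeit' mean?"
act = predict_tutor_act(turn)
print(act)                          # Answer question
print(generate_response(act, turn))
```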

[NLP-3] QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead

Link: https://arxiv.org/abs/2406.03482
Authors: Amir Zandieh, Majid Daliri, Insu Han
Keywords: Serving LLMs requires, requires substantial memory, LLMs requires substantial, Serving LLMs, requirements of Key-Value
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Performance (cs.PF)
Comments: 13 pages

Abstract: Serving LLMs requires substantial memory due to the storage requirements of Key-Value (KV) embeddings in the KV cache, which grows with sequence length. An effective approach to compress the KV cache is quantization. However, traditional quantization methods face significant memory overhead due to the need to store quantization constants (at least a zero point and a scale) in full precision per data block. Depending on the block size, this overhead can add 1 or 2 bits per quantized number. We introduce QJL, a new quantization approach that consists of a Johnson-Lindenstrauss (JL) transform followed by sign-bit quantization. In contrast to existing methods, QJL eliminates memory overheads by removing the need for storing quantization constants. We propose an asymmetric estimator for the inner product of two vectors and demonstrate that applying QJL to one vector and a standard JL transform without quantization to the other provides an unbiased estimator with minimal distortion. We have developed an efficient implementation of the QJL sketch and its corresponding inner product estimator, incorporating a lightweight CUDA kernel for optimized computation. When applied across various LLMs and NLP tasks to quantize the KV cache to only 3 bits, QJL demonstrates a more than fivefold reduction in KV cache memory usage without compromising accuracy, all while achieving faster runtime. Code is available at this https URL.
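The core asymmetric estimator is easy to reproduce in numpy. In this hedged sketch (the dimensions and the shared Gaussian matrix are demo choices, and the sketch dimension is made large only so the estimate is visibly accurate), the key side stores only sign bits plus its norm, while the query side keeps a full-precision JL transform:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 16, 20000   # embedding dim; sketch dim (large only for the demo)

# Gaussian JL matrix shared by the key and query sides
S = rng.normal(size=(m, d))

k = rng.normal(size=d); k /= np.linalg.norm(k)   # a key vector
q = rng.normal(size=d); q /= np.linalg.norm(q)   # a query vector

# key side: keep only 1 bit per sketch coordinate, plus the key's norm
key_bits = np.sign(S @ k)
key_norm = np.linalg.norm(k)

# query side: full-precision JL transform, no quantization
q_sketch = S @ q

# asymmetric inner-product estimator: unbiased for q . k
est = key_norm * np.sqrt(np.pi / 2) / m * (q_sketch @ key_bits)
print(abs(est - q @ k))   # small estimation error
```

Note that no per-block quantization constants (zero point, scale) are stored, which is exactly the overhead the method removes.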

[NLP-4] MODABS: Multi-Objective Learning for Dynamic Aspect-Based Summarization

Link: https://arxiv.org/abs/2406.03479
Authors: Xiaobo Guo, Soroush Vosoughi
Keywords: online content necessitates, content necessitates effective, aspect-based summarization stands, dynamic aspect-based summarization, rapid proliferation
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:The rapid proliferation of online content necessitates effective summarization methods, among which dynamic aspect-based summarization stands out. Unlike its traditional counterpart, which assumes a fixed set of known aspects, this approach adapts to the varied aspects of the input text. We introduce a novel multi-objective learning framework employing a Longformer-Encoder-Decoder for this task. The framework optimizes aspect number prediction, minimizes disparity between generated and reference summaries for each aspect, and maximizes dissimilarity across aspect-specific summaries. Extensive experiments show our method significantly outperforms baselines on three diverse datasets, largely due to the effective alignment of generated and reference aspect counts without sacrificing single-aspect summarization quality.
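A toy version of the three-objective loss might look as follows; the weights and embedding-space distances are our illustrative assumptions, not the authors' actual training objective:

```python
import numpy as np

def modabs_style_loss(n_pred, n_true, gen_embs, ref_embs, w=(1.0, 1.0, 0.5)):
    # 1) aspect-number prediction: penalize a wrong predicted count
    l_count = (n_pred - n_true) ** 2
    # 2) per-aspect generation: distance between each generated summary
    #    and its reference (embeddings stand in for token-level losses)
    l_gen = np.mean(np.sum((gen_embs - ref_embs) ** 2, axis=1))
    # 3) cross-aspect dissimilarity: penalize pairwise similarity so the
    #    aspect summaries do not collapse into one another
    sims = gen_embs @ gen_embs.T
    l_div = np.mean(sims[~np.eye(len(gen_embs), dtype=bool)])
    return w[0] * l_count + w[1] * l_gen + w[2] * l_div

rng = np.random.default_rng(0)
gen = rng.normal(size=(3, 4))   # 3 aspect summaries, toy 4-d embeddings
ref = rng.normal(size=(3, 4))
loss = modabs_style_loss(n_pred=3, n_true=3, gen_embs=gen, ref_embs=ref)
print(round(float(loss), 3))
```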

[NLP-5] Does your data spark joy? Performance gains from domain upsampling at the end of training

Link: https://arxiv.org/abs/2406.03476
Authors: Cody Blakeney, Mansheej Paul, Brett W. Larsen, Sean Owen, Jonathan Frankle
Keywords: large FLOP scales, amounts of CommonCrawl, large language models, domain-specific datasets, large amounts
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: The first three authors contributed equally

Abstract: Pretraining datasets for large language models (LLMs) have grown to trillions of tokens composed of large amounts of CommonCrawl (CC) web scrape along with smaller, domain-specific datasets. It is expensive to understand the impact of these domain-specific datasets on model capabilities as training at large FLOP scales is required to reveal significant changes to difficult and emergent benchmarks. Given the increasing cost of experimenting with pretraining data, how does one determine the optimal balance between the diversity in general web scrapes and the information density of domain-specific data? In this work, we show how to leverage the smaller domain-specific datasets by upsampling them relative to CC at the end of training to drive performance improvements on difficult benchmarks. This simple technique allows us to improve up to 6.90 pp on MMLU, 8.26 pp on GSM8K, and 6.17 pp on HumanEval relative to the base data mix for a 7B model trained for 1 trillion (T) tokens, thus rivaling Llama-2 (7B), a model trained for twice as long. We experiment with ablating the duration of domain upsampling from 5% to 30% of training and find that 10% to 20% is optimal for navigating the tradeoff between general language modeling capabilities and targeted benchmarks. We also use domain upsampling to characterize at scale the utility of individual datasets for improving various benchmarks by removing them during this final phase of training. This tool opens up the ability to experiment with the impact of different pretraining datasets at scale, but at an order of magnitude lower cost compared to full pretraining runs.
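The final-phase mixture change can be sketched as a simple reweighting; the dataset names, boost factor, and base fractions below are illustrative assumptions, not the paper's data mix:

```python
def upsampled_mix(base_weights, boost_domains, boost_factor):
    # multiply the sampling weight of the chosen domain datasets and
    # renormalize; this mixture is applied only during the final phase
    # of training, while the rest of training uses base_weights
    w = {name: wt * (boost_factor if name in boost_domains else 1.0)
         for name, wt in base_weights.items()}
    total = sum(w.values())
    return {name: wt / total for name, wt in w.items()}

# hypothetical base mixture dominated by CommonCrawl
base = {"common_crawl": 0.85, "math": 0.05, "code": 0.05, "wiki": 0.05}
final = upsampled_mix(base, boost_domains={"math", "code"}, boost_factor=4.0)
print(final["common_crawl"] < base["common_crawl"])  # True: CC share shrinks
```

The paper ablates applying such a shifted mixture over the last 5% to 30% of training and finds 10% to 20% works best.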

[NLP-6] Using Synchronic Definitions and Semantic Relations to Classify Semantic Change Types

Link: https://arxiv.org/abs/2406.03452
Authors: Pierluigi Cassotti, Stefano De Pascale, Nina Tahmasebi
Keywords: specialization and co-hyponymy, co-hyponymy transfer, semantic change types, abundant evidence, Semantic Change Detection
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:There is abundant evidence of the fact that the way words change their meaning can be classified in different types of change, highlighting the relationship between the old and new meanings (among which generalization, specialization and co-hyponymy transfer). In this paper, we present a way of detecting these types of change by constructing a model that leverages information both from synchronic lexical relations and definitions of word meanings. Specifically, we use synset definitions and hierarchy information from WordNet and test it on a digitized version of Blank’s (1997) dataset of semantic change types. Finally, we show how the sense relationships can improve models for both approximation of human judgments of semantic relatedness as well as binary Lexical Semantic Change Detection.

[NLP-7] What is the Best Way for ChatGPT to Translate Poetry?

Link: https://arxiv.org/abs/2406.03450
Authors: Shanshan Wang, Derek F. Wong, Jingming Yao, Lidia S. Chao
Keywords: historically faced significant, faced significant challenges, Large Language Models, historically faced, faced significant
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 19 pages, 1 figure. The paper has been accepted by ACL 2024 (Main Conference)

Abstract:Machine translation (MT) has historically faced significant challenges when applied to literary works, particularly in the domain of poetry translation. The advent of Large Language Models such as ChatGPT holds potential for innovation in this field. This study examines ChatGPT’s capabilities in English-Chinese poetry translation tasks, utilizing targeted prompts and small sample scenarios to ascertain optimal performance. Despite promising outcomes, our analysis reveals persistent issues in the translations generated by ChatGPT that warrant attention. To address these shortcomings, we propose an Explanation-Assisted Poetry Machine Translation (EAPMT) method, which leverages monolingual poetry explanation as a guiding information for the translation process. Furthermore, we refine existing evaluation criteria to better suit the nuances of modern poetry translation. We engaged a panel of professional poets for assessments, complemented evaluations by using GPT-4. The results from both human and machine evaluations demonstrate that our EAPMT method outperforms traditional translation methods of ChatGPT and the existing online systems. This paper validates the efficacy of our method and contributes a novel perspective to machine-assisted literary translation.
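An explanation-assisted prompt in the spirit of EAPMT might be constructed like this; the wording, function name, and example poem are our assumptions, not the paper's exact prompt:

```python
def eapmt_prompt(poem, explanation, src="English", tgt="Chinese"):
    # the monolingual explanation of the poem is injected as guiding
    # information for the translation step
    return (
        f"Translate the following {src} poem into {tgt}.\n"
        f"Use this explanation of the poem to preserve its meaning, "
        f"imagery, and tone:\n{explanation}\n\n"
        f"Poem:\n{poem}\n"
    )

prompt = eapmt_prompt(
    "The fog comes / on little cat feet.",
    "The poem likens fog to a quiet, soft-footed cat arriving unnoticed.",
)
print(prompt.splitlines()[0])  # Translate the following English poem into Chinese.
```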

[NLP-8] Pre-trained Large Language Models Use Fourier Features to Compute Addition

Link: https://arxiv.org/abs/2406.03445
Authors: Tianyi Zhou, Deqing Fu, Vatsal Sharan, Robin Jia
Keywords: exhibit impressive mathematical, mathematical reasoning capabilities, compute basic arithmetic, impressive mathematical reasoning, Pre-trained large language
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:Pre-trained large language models (LLMs) exhibit impressive mathematical reasoning capabilities, yet how they compute basic arithmetic, such as addition, remains unclear. This paper shows that pre-trained LLMs add numbers using Fourier features – dimensions in the hidden state that represent numbers via a set of features sparse in the frequency domain. Within the model, MLP and attention layers use Fourier features in complementary ways: MLP layers primarily approximate the magnitude of the answer using low-frequency features, while attention layers primarily perform modular addition (e.g., computing whether the answer is even or odd) using high-frequency features. Pre-training is crucial for this mechanism: models trained from scratch to add numbers only exploit low-frequency features, leading to lower accuracy. Introducing pre-trained token embeddings to a randomly initialized model rescues its performance. Overall, our analysis demonstrates that appropriate pre-trained representations (e.g., Fourier features) can unlock the ability of Transformers to learn precise mechanisms for algorithmic tasks.
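The arithmetic behind Fourier-feature addition can be demonstrated in isolation: encoding a number as a point on the unit circle turns addition into rotation, so a modular sum can be read off the combined phase. This toy (the single period T = 10 is our choice) only illustrates the math, not the model internals:

```python
import numpy as np

T = 10  # toy period; a set of such frequencies forms the sparse
        # frequency-domain representation described in the abstract

def encode(n):
    # Fourier feature: the number n as a point on the unit circle
    return np.exp(2j * np.pi * n / T)

def add_via_fourier(a, b):
    # multiplying phasors adds their angles, i.e. adds the numbers mod T,
    # mirroring the "modular addition" role of high-frequency features
    phase = np.angle(encode(a) * encode(b))
    return round(phase / (2 * np.pi) * T) % T

print(add_via_fourier(7, 8))  # 5, since (7 + 8) mod 10 = 5
```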

[NLP-9] Are language models rational? The case of coherence norms and belief revision

Link: https://arxiv.org/abs/2406.03442
Authors: Thomas Hofweber, Peter Hase, Elias Stengel-Eskin, Mohit Bansal
Keywords: machine learning models, Minimal Assent Connection, machine learning, norms, coherence norms
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Do norms of rationality apply to machine learning models, in particular language models? In this paper we investigate this question by focusing on a special subset of rational norms: coherence norms. We consider both logical coherence norms as well as coherence norms tied to the strength of belief. To make sense of the latter, we introduce the Minimal Assent Connection (MAC) and propose a new account of credence, which captures the strength of belief in language models. This proposal uniformly assigns strength of belief simply on the basis of model internal next token probabilities. We argue that rational norms tied to coherence do apply to some language models, but not to others. This issue is significant since rationality is closely tied to predicting and explaining behavior, and thus it is connected to considerations about AI safety and alignment, as well as understanding model behavior more generally.
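A minimal credence readout in the spirit of the MAC proposal could look like this; the tiny vocabulary, logits, and renormalization over assent/dissent tokens are our simplifying assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def credence(logits, vocab, yes_token="Yes", no_token="No"):
    # strength of belief in a statement, read off the model's next-token
    # distribution for an assent question and renormalized over Yes/No
    p = softmax(np.asarray(logits, dtype=float))
    p_yes = p[vocab.index(yes_token)]
    p_no = p[vocab.index(no_token)]
    return p_yes / (p_yes + p_no)

vocab = ["Yes", "No", "Maybe"]
logits = [2.0, 0.0, -1.0]   # hypothetical next-token logits
print(round(credence(logits, vocab), 2))
```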

[NLP-10] Cycles of Thought: Measuring LLM Confidence through Stable Explanations

Link: https://arxiv.org/abs/2406.03441
Authors: Evan Becker, Stefano Soatto
Keywords: high-risk machine learning, machine learning applications, high-risk machine, machine learning, learning applications
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:In many high-risk machine learning applications it is essential for a model to indicate when it is uncertain about a prediction. While large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, their overconfidence in incorrect responses is still a well-documented failure mode. Traditional methods for ML uncertainty quantification can be difficult to directly adapt to LLMs due to the computational cost of implementation and closed-source nature of many models. A variety of black-box methods have recently been proposed, but these often rely on heuristics such as self-verbalized confidence. We instead propose a framework for measuring an LLM’s uncertainty with respect to the distribution of generated explanations for an answer. While utilizing explanations is not a new idea in and of itself, by interpreting each possible model+explanation pair as a test-time classifier we can calculate a posterior answer distribution over the most likely of these classifiers. We demonstrate how a specific instance of this framework using explanation entailment as our classifier likelihood improves confidence score metrics (in particular AURC and AUROC) over baselines across five different datasets. We believe these results indicate that our framework is both a well-principled and effective way of quantifying uncertainty in LLMs.
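The posterior over answers can be sketched as marginalizing over sampled explanations; the numbers below are toy values, and the paper scores P(answer | explanation) with an entailment classifier rather than fixed constants:

```python
import numpy as np

def answer_posterior(expl_probs, answer_given_expl):
    # P(answer) = sum_e P(e) * P(answer | e): each sampled explanation e
    # acts as a test-time classifier over the candidate answers
    p_e = np.asarray(expl_probs, dtype=float)
    p_e = p_e / p_e.sum()
    cond = np.asarray(answer_given_expl, dtype=float)  # one row per e
    return p_e @ cond

# three sampled explanations, two candidate answers (A, B)
post = answer_posterior(
    [0.5, 0.3, 0.2],
    [[0.9, 0.1],
     [0.8, 0.2],
     [0.2, 0.8]],
)
print(post)  # posterior confidence over answers A and B
```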

[NLP-11] Automating Turkish Educational Quiz Generation Using Large Language Models

Link: https://arxiv.org/abs/2406.03397
Authors: Kamyar Zeinalipour, Yusuf Gökberk Keptiğ, Marco Maggini, Marco Gori
Keywords: Turkish educational, Turkish educational texts, Turkish, Crafting quizzes, Turkish educational content
Subjects: Computation and Language (cs.CL)
Comments: Accepted Paper for ISPR 2024

Abstract:Crafting quizzes from educational content is a pivotal activity that benefits both teachers and students by reinforcing learning and evaluating understanding. In this study, we introduce a novel approach to generate quizzes from Turkish educational texts, marking a pioneering endeavor in educational technology specifically tailored to the Turkish educational context. We present a specialized dataset, named the Turkish-Quiz-Instruct, comprising an extensive collection of Turkish educational texts accompanied by multiple-choice and short-answer quizzes. This research leverages the capabilities of Large Language Models (LLMs), including GPT-4-Turbo, GPT-3.5-Turbo, Llama-2-7b-chat-hf, and Llama-2-13b-chat-hf, to automatically generate quiz questions and answers from the Turkish educational content. Our work delineates the methodology for employing these LLMs in the context of Turkish educational material, thereby opening new avenues for automated Turkish quiz generation. The study not only demonstrates the efficacy of using such models for generating coherent and relevant quiz content but also sets a precedent for future research in the domain of automated educational content creation for languages other than English. The Turkish-Quiz-Instruct dataset is introduced as a valuable resource for researchers and practitioners aiming to explore the boundaries of educational technology and language-specific applications of LLMs in Turkish. By addressing the challenges of quiz generation in a non-English context specifically Turkish, this study contributes significantly to the field of Turkish educational technology, providing insights into the potential of leveraging LLMs for educational purposes across diverse linguistic landscapes.

[NLP-12] IrokoBench: A New Benchmark for African Languages in the Age of Large Language Models

Link: https://arxiv.org/abs/2406.03368
Authors: David Ifeoluwa Adelani, Jessica Ojo, Israel Abebe Azime, Jian Yun Zhuang, Jesujoba O. Alabi, Xuanli He, Millicent Ochieng, Sara Hooker, Andiswa Bukula, En-Shiun Annie Lee, Chiamaka Chukwuneke, Happy Buzaaba, Blessing Sibanda, Godson Kalipe, Jonathan Mukiibi, Salomon Kabongo, Foutse Yuehgoh, Mmasibidi Setaka, Lolwethu Ndolela, Nkiruka Odu, Rooweither Mabuya, Shamsuddeen Hassan Muhammad, Salomey Osei, Sokhar Samb, Tadesse Kebede Guge, Pontus Stenetorp
Keywords: remarkable capabilities remain, capabilities remain limited, adoption of Large, Large language models, African languages
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Under review

Abstract:Despite the widespread adoption of Large language models (LLMs), their remarkable capabilities remain limited to a few high-resource languages. Additionally, many low-resource languages (e.g. African languages) are often evaluated only on basic text classification tasks due to the lack of appropriate or comprehensive benchmarks outside of high-resource languages. In this paper, we introduce IrokoBench – a human-translated benchmark dataset for 16 typologically-diverse low-resource African languages covering three tasks: natural language inference~(AfriXNLI), mathematical reasoning~(AfriMGSM), and multi-choice knowledge-based QA~(AfriMMLU). We use IrokoBench to evaluate zero-shot, few-shot, and translate-test settings~(where test sets are translated into English) across 10 open and four proprietary LLMs. Our evaluation reveals a significant performance gap between high-resource languages~(such as English and French) and low-resource African languages. We observe a significant performance gap between open and proprietary models, with the highest performing open model, Aya-101 only at 58% of the best-performing proprietary model GPT-4o performance. Machine translating the test set to English before evaluation helped to close the gap for larger models that are English-centric, like LLaMa 3 70B. These findings suggest that more efforts are needed to develop and adapt LLMs for African languages.

[NLP-13] LLM-based Rewriting of Inappropriate Argumentation using Reinforcement Learning from Machine Feedback

Link: https://arxiv.org/abs/2406.03363
Authors: Timon Ziegenbein, Gabriella Skitalinskaya, Alireza Bayat Makou, Henning Wachsmuth
Keywords: social media platforms, Ensuring that online, online discussions, discussions are civil, civil and productive
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Ensuring that online discussions are civil and productive is a major challenge for social media platforms. Such platforms usually rely both on users and on automated detection tools to flag inappropriate arguments of other users, which moderators then review. However, this kind of post-hoc moderation is expensive and time-consuming, and moderators are often overwhelmed by the amount and severity of flagged content. Instead, a promising alternative is to prevent negative behavior during content creation. This paper studies how inappropriate language in arguments can be computationally mitigated. We propose a reinforcement learning-based rewriting approach that balances content preservation and appropriateness based on existing classifiers, prompting an instruction-finetuned large language model (LLM) as our initial policy. Unlike related style transfer tasks, rewriting inappropriate arguments allows deleting and adding content permanently. It is therefore tackled on document level rather than sentence level. We evaluate different weighting schemes for the reward function in both absolute and relative human assessment studies. Systematic experiments on non-parallel data provide evidence that our approach can mitigate the inappropriateness of arguments while largely preserving their content. It significantly outperforms competitive baselines, including few-shot learning, prompting, and humans.
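The reward balancing content preservation against appropriateness can be sketched as a weighted combination of the two classifier scores; the linear form and the weight are our illustrative assumptions, not the paper's exact reward:

```python
def rewrite_reward(content_sim, appropriateness, alpha=0.5):
    # both inputs are classifier scores in [0, 1]; alpha trades off
    # staying close to the original argument vs. fixing its tone
    return alpha * content_sim + (1 - alpha) * appropriateness

# a rewrite that fixes the tone but drifts slightly from the content
print(rewrite_reward(content_sim=0.8, appropriateness=0.9))
# a rewrite that keeps the content but remains inappropriate scores lower
print(rewrite_reward(content_sim=0.95, appropriateness=0.2))
```

In the paper this kind of scalar signal is what the reinforcement-learning step optimizes, starting from an instruction-finetuned LLM policy.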

[NLP-14] The Challenges of Evaluating LLM Applications: An Analysis of Automated, Human, and LLM-Based Approaches

Link: https://arxiv.org/abs/2406.03339
Authors: Bhashithe Abeysinghe, Ruhan Circi
Keywords: natural language generation, evaluation, transformer based Generative, human, natural language
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract: Chatbots have been an interesting application of natural language generation since its inception. With novel transformer-based Generative AI methods, building chatbots has become trivial. Chatbots which are targeted at specific domains such as medicine, psychology, and general information retrieval are implemented rapidly. This, however, should not distract from the need to evaluate the chatbot responses. Especially because the natural language generation community does not entirely agree upon how to effectively evaluate such applications. With this work we discuss the issue further with the increasingly popular LLM-based evaluations and how they correlate with human evaluations. Additionally, we introduce a comprehensive factored evaluation mechanism that can be utilized in conjunction with both human and LLM-based evaluations. We present the results of an experimental evaluation conducted using this scheme in one of our chatbot implementations, and subsequently compare automated, traditional human evaluation, factored human evaluation, and factored LLM evaluation. Results show that factor-based evaluation produces better insights on which aspects need to be improved in LLM applications and further strengthens the argument to use human evaluation in critical spaces where main functionality is not direct retrieval.
摘要:聊天机器人自问世以来一直是自然语言生成的一个有趣应用。有了新的基于Transformer的生成式人工智能方法,构建聊天机器人变得轻而易举。针对医学、心理学和通用信息检索等特定领域的聊天机器人得到了快速实现。然而,这不应分散对评估聊天机器人响应这一需求的注意力,特别是因为自然语言生成社区对于如何有效评估此类应用尚未完全达成一致。通过这项工作,我们进一步讨论了这一问题,包括日益流行的基于LLM的评估及其与人类评估的相关性。此外,我们引入了一种全面的因子化评估机制,可以与人工评估和基于LLM的评估结合使用。我们给出了在我们的一个聊天机器人实现中使用该方案进行实验评估的结果,并随后比较了自动化评估、传统人工评估、因子化人工评估和因子化LLM评估。结果表明,基于因子的评估能够更好地揭示LLM应用中哪些方面需要改进,并进一步支持在主要功能并非直接检索的关键场景中使用人工评估。
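上文的"因子化评估"核心是在保留逐因子明细的同时汇总一个总分,可用如下草图示意(因子名称与等权默认值均为我们的假设,并非论文中的真实设定):

```python
def factored_score(factor_scores, weights=None):
    """将各因子(如相关性、正确性)的评分加权汇总为总分,
    同时保留逐因子明细——这正是因子化评估优于单一总评之处(示意)。"""
    if weights is None:
        weights = {name: 1.0 for name in factor_scores}
    total = sum(weights[name] for name in factor_scores)
    overall = sum(weights[name] * factor_scores[name] for name in factor_scores) / total
    return overall, dict(factor_scores)
```

返回的明细部分使评估者(无论人工还是LLM)可以指出具体哪个因子拖低了总分。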

[NLP-15] The Good, the Bad, and the Hulk-like GPT: Analyzing Emotional Decisions of Large Language Models in Cooperation and Bargaining Games
[NLP-15] 《好人、坏人和绿巨人般的GPT》:分析合作与讨价还价博弈中大型语言模型的情感决策

链接: https://arxiv.org/abs/2406.03299
作者: Mikhail Mozikov,Nikita Severin,Valeria Bodishtianu,Maria Glushanina,Mikhail Baklashkin,Andrey V. Savchenko,Ilya Makarov
关键词: Large Language Models, understanding human interactions, important part, part of society, society modeling
中文关键词: 大型语言模型,理解人类互动,重要部分,社会的一部分,社会建模
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Behavior study experiments are an important part of society modeling and understanding human interactions. In practice, many behavioral experiments encounter challenges related to internal and external validity, reproducibility, and social bias due to the complexity of social interactions and cooperation in human user studies. Recent advances in Large Language Models (LLMs) have provided researchers with a new promising tool for the simulation of human behavior. However, existing LLM-based simulations operate under the unproven hypothesis that LLM agents behave similarly to humans as well as ignore a crucial factor in human decision-making: emotions. In this paper, we introduce a novel methodology and the framework to study both, the decision-making of LLMs and their alignment with human behavior under emotional states. Experiments with GPT-3.5 and GPT-4 on four games from two different classes of behavioral game theory showed that emotions profoundly impact the performance of LLMs, leading to the development of more optimal strategies. While there is a strong alignment between the behavioral responses of GPT-3.5 and human participants, particularly evident in bargaining games, GPT-4 exhibits consistent behavior, ignoring induced emotions for rationality decisions. Surprisingly, emotional prompting, particularly with 'anger' emotion, can disrupt the "superhuman" alignment of GPT-4, resembling human emotional responses.
摘要:行为研究实验是社会建模和理解人类互动的重要组成部分。在实践中,由于人类用户研究中社会互动和合作的复杂性,许多行为实验都会遇到与内部和外部效度、可复现性和社会偏见有关的挑战。大语言模型(LLM)的最新进展为研究人员提供了一种模拟人类行为的新的有前途的工具。然而,现有的基于LLM的模拟建立在一个未经证实的假设之上,即LLM智能体的行为与人类相似,并且忽略了人类决策中的一个关键因素:情绪。在本文中,我们提出了一种新的方法和框架,来研究情绪状态下LLM的决策及其与人类行为的一致性。GPT-3.5和GPT-4在来自两类行为博弈论的四个博弈上的实验表明,情绪深刻地影响LLM的表现,促使其发展出更优的策略。虽然GPT-3.5的行为反应与人类参与者之间有很强的一致性(在讨价还价博弈中尤为明显),但GPT-4表现出一致的行为,在理性决策中忽略被诱导的情绪。令人惊讶的是,情绪提示(尤其是"愤怒"情绪)可以打破GPT-4的"超人"对齐,使其更接近人类的情绪反应。

[NLP-16] SpikeLM: Towards General Spike-Driven Language Modeling via Elastic Bi-Spiking Mechanisms
[NLP-16] SpikeLM:通过弹性双尖峰机制实现通用尖峰驱动语言建模

链接: https://arxiv.org/abs/2406.03287
作者: Xingrun Xing,Zheng Zhang,Ziyi Ni,Shitao Xiao,Yiming Ju,Siqi Fan,Yequan Wang,Jiajun Zhang,Guoqi Li
关键词: energy-efficient artificial intelligence, artificial intelligence similar, spiking neural networks, bio-inspired spiking neural, event-driven sparsity
中文关键词: 节能人工智能、类似人工智能、尖峰神经网络、生物启发的尖峰神经、事件驱动的稀疏性
类目: Neural and Evolutionary Computing (cs.NE); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Towards energy-efficient artificial intelligence similar to the human brain, the bio-inspired spiking neural networks (SNNs) have advantages of biological plausibility, event-driven sparsity, and binary activation. Recently, large-scale language models exhibit promising generalization capability, making it a valuable issue to explore more general spike-driven models. However, the binary spikes in existing SNNs fail to encode adequate semantic information, placing technological challenges for generalization. This work proposes the first fully spiking mechanism for general language tasks, including both discriminative and generative ones. Different from previous spikes with 0,1 levels, we propose a more general spike formulation with bi-directional, elastic amplitude, and elastic frequency encoding, while still maintaining the addition nature of SNNs. In a single time step, the spike is enhanced by direction and amplitude information; in spike frequency, a strategy to control spike firing rate is well designed. We plug this elastic bi-spiking mechanism in language modeling, named SpikeLM. It is the first time to handle general language tasks with fully spike-driven models, which achieve much higher accuracy than previously possible. SpikeLM also greatly bridges the performance gap between SNNs and ANNs in language modeling. Our code is available at this https URL.
摘要:对于类似于人脑的高能效人工智能,生物启发的脉冲神经网络(SNN)具有生物真实性、事件驱动稀疏性和二进制激活等优点。近年来,大规模语言模型表现出良好的泛化能力,使得探索更通用的尖峰驱动模型成为一个有价值的问题。然而,现有SNN中的二进制尖峰不能编码足够的语义信息,这给泛化带来了技术挑战。这项工作提出了第一个针对一般语言任务的完全尖峰机制,包括辨别性任务和生成性任务。与以往0,1水平的尖峰不同,我们提出了一个更一般的尖峰公式,它具有双向、弹性幅度和弹性频率编码,同时仍然保持了SNN的加法性质。在单个时间步中,利用方向和幅度信息增强了棘波;在棘波频率上,设计了一种控制棘波放电率的策略。我们将这种弹性双尖峰机制插入语言建模中,称为SpikeLM。这是第一次使用完全尖峰驱动的模型来处理一般语言任务,这种模型实现了比以前可能的更高的精度。SpikeLM还极大地弥合了SNN和ANN在语言建模方面的性能差距。我们的代码可以在这个HTTPS URL上找到。
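摘要中"双向、弹性幅度"的脉冲编码可以粗略示意如下(仅为玩具实现,阈值与幅度参数为我们的假设;论文的完整机制还包括弹性频率编码与可训练的量化):

```python
def elastic_bi_spike(values, alpha=1.0, threshold=0.5):
    """双向、弹性幅度的三值脉冲编码示意:超过 +threshold 发放 +alpha,
    低于 -threshold 发放 -alpha,否则不发放(0)。
    alpha 对应"弹性幅度",可随训练学习;这里取固定值仅作演示。"""
    out = []
    for v in values:
        if v > threshold:
            out.append(alpha)
        elif v < -threshold:
            out.append(-alpha)
        else:
            out.append(0.0)
    return out
```

与传统 0/1 脉冲相比,这种三值、带幅度的发放在保持 SNN 加法特性的同时携带了方向与幅度信息。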

[NLP-17] FusionBench: A Comprehensive Benchmark of Deep Model Fusion
[NLP-17] FusionBench:深度模型融合的综合基准

链接: https://arxiv.org/abs/2406.03280
作者: Anke Tang,Li Shen,Yong Luo,Han Hu,Bo Du,Dacheng Tao
关键词: Deep model fusion, deep neural networks, model fusion techniques, model fusion, fusion techniques
中文关键词: 深度模型融合,深度神经网络,模型融合技术,模型融合,融合技术
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Project homepage: this https URL

点击查看摘要

Abstract:Deep model fusion is an emerging technique that unifies the predictions or parameters of several deep neural networks into a single model in a cost-effective and data-efficient manner. This enables the unified model to take advantage of the original models’ strengths, potentially exceeding their performance. Although a variety of deep model fusion techniques have been introduced, their evaluations tend to be inconsistent and often inadequate to validate their effectiveness and robustness against distribution shifts. To address this issue, we introduce FusionBench, which is the first comprehensive benchmark dedicated to deep model fusion. FusionBench covers a wide range of tasks, including open-vocabulary image classification, text classification, and text-to-text generation. Each category includes up to eight tasks with corresponding task-specific models, featuring both full fine-tuning and LoRA fine-tuning, as well as models of different sizes, to ensure fair and balanced comparisons of various multi-task model fusion techniques across different tasks, model scales, and fine-tuning strategies. We implement and evaluate a broad spectrum of deep model fusion techniques. These techniques range from model ensemble methods, which combine the predictions to improve the overall performance, to model merging, which integrates different models into a single one, and model mixing methods, which upscale or recombine the components of the original models. FusionBench now contains 26 distinct tasks, 74 fine-tuned models, and 16 fusion techniques, and we are committed to consistently expanding the benchmark with more tasks, models, and fusion techniques. In addition, we offer a well-documented set of resources and guidelines to aid researchers in understanding and replicating the benchmark results. Homepage this https URL
摘要:深度模型融合是一种新兴技术,它以经济且数据高效的方式将多个深度神经网络的预测或参数统一到一个模型中。这使得统一后的模型能够利用原始模型的优势,性能甚至可能超过它们。虽然已经提出了多种深度模型融合技术,但对它们的评估往往不一致,且常常不足以验证其有效性以及面对分布偏移时的稳健性。为了解决这个问题,我们引入了FusionBench,这是第一个致力于深度模型融合的全面基准。FusionBench涵盖了广泛的任务,包括开放词表图像分类、文本分类和文本到文本生成。每个类别最多包括八个任务及相应的特定任务模型,既有完全微调也有LoRA微调,并包含不同规模的模型,以确保在不同任务、模型规模和微调策略下对各种多任务模型融合技术进行公平、均衡的比较。我们实现并评估了广泛的深度模型融合技术,从组合多个模型预测以提高整体性能的模型集成方法,到将不同模型整合为单一模型的模型合并方法,再到对原始模型组件进行扩展或重组的模型混合方法。FusionBench目前包含26个不同的任务、74个微调模型和16种融合技术,我们致力于不断扩充基准中的任务、模型和融合技术。此外,我们提供了一套文档完善的资源和指南,以帮助研究人员理解和复现基准结果。主页:此HTTPS URL
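作为参照,FusionBench 所覆盖的"模型合并"一类方法中,最简单的基线是对同名参数做加权平均,可示意如下(用普通 dict/list 代替真实框架中的张量,权重为假设的等权默认):

```python
def merge_models(state_dicts, weights=None):
    """最简单的"模型合并"基线:对若干模型的同名参数做加权平均。
    state_dicts: list[dict[str, list[float]]],各模型需具有相同的参数名与形状。"""
    n = len(state_dicts)
    if weights is None:
        weights = [1.0 / n] * n
    merged = {}
    for name, first in state_dicts[0].items():
        merged[name] = [
            sum(w * sd[name][i] for w, sd in zip(weights, state_dicts))
            for i in range(len(first))
        ]
    return merged
```

摘要中提到的模型集成(组合预测)与模型混合(重组组件)则分别在输出端和结构端操作,不在此草图范围内。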

[NLP-18] Large Language Models as Evaluators for Recommendation Explanations
[NLP-18] 大型语言模型作为推荐解释的评估者

链接: https://arxiv.org/abs/2406.03248
作者: Xiaoyu Zhang,Yishan Li,Jiayin Wang,Bowen Sun,Weizhi Ma,Peijie Sun,Min Zhang
关键词: attracted significant attention, Natural Language Processing, academia and industry, explainability of recommender, recommender systems
中文关键词: 引起了人们的高度关注,自然语言处理、学术界和工业界,推荐器的解释性,推荐系统
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The explainability of recommender systems has attracted significant attention in academia and industry. Many efforts have been made for explainable recommendations, yet evaluating the quality of the explanations remains a challenging and unresolved issue. In recent years, leveraging LLMs as evaluators presents a promising avenue in Natural Language Processing tasks (e.g., sentiment classification, information extraction), as they perform strong capabilities in instruction following and common-sense reasoning. However, evaluating recommendation explanatory texts is different from these NLG tasks, as its criteria are related to human perceptions and are usually subjective. In this paper, we investigate whether LLMs can serve as evaluators of recommendation explanations. To answer the question, we utilize real user feedback on explanations given from previous work and additionally collect third-party annotations and LLM evaluations. We design and apply a 3-level meta evaluation strategy to measure the correlation between evaluator labels and the ground truth provided by users. Our experiments reveal that LLMs, such as GPT4, can provide comparable evaluations with appropriate prompts and settings. We also provide further insights into combining human labels with the LLM evaluation process and utilizing ensembles of multiple heterogeneous LLM evaluators to enhance the accuracy and stability of evaluations. Our study verifies that utilizing LLMs as evaluators can be an accurate, reproducible and cost-effective solution for evaluating recommendation explanation texts. Our code is available at this https URL.
摘要:推荐系统的可解释性问题引起了学术界和工业界的广泛关注。为提供可解释的推荐已经做出了许多努力,但评估解释的质量仍然是一个具有挑战性且悬而未决的问题。近年来,利用LLM作为评估者在自然语言处理任务(如情感分类、信息提取)中展现出良好前景,因为它们具备强大的指令跟随和常识推理能力。然而,评估推荐解释文本不同于这些NLG任务,因为其标准与人的感知有关,通常是主观的。在本文中,我们考察了LLM能否充当推荐解释的评估者。为了回答这个问题,我们利用了以往工作中对解释的真实用户反馈,并另外收集了第三方标注和LLM评估。我们设计并应用了一个三级元评估策略,来衡量评估者标签与用户提供的真实反馈之间的相关性。实验表明,GPT4等LLM可以在适当的提示和设置下提供可比的评估。我们还进一步探讨了如何将人工标签与LLM评估过程相结合,以及如何利用多个异构LLM评估者的集成来提高评估的准确性和稳定性。我们的研究验证了使用LLM作为评估者可以成为一种准确、可复现且高性价比的推荐解释文本评估方案。我们的代码可以在这个HTTPS URL上找到。
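文中的元评估以"评估者标签与用户真实反馈之间的相关性"为核心;最常用的相关性度量之一是皮尔逊相关系数,可用纯 Python 计算如下(示意实现,论文采用的具体相关性指标以原文为准):

```python
def pearson(xs, ys):
    """皮尔逊相关系数:衡量评估者打分 xs 与用户真实反馈 ys 的线性一致程度,
    取值 [-1, 1],越接近 1 表示评估者与用户越一致。"""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)
```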

[NLP-19] Document-level Claim Extraction and Decontextualisation for Fact-Checking
[NLP-19] 文档级声明提取和去背景化以进行事实核查

链接: https://arxiv.org/abs/2406.03239
作者: Zhenyun Deng,Michael Schlichtkrul,Andreas Vlachos
关键词: human fact-checkers, time-consuming task, task for human, claim extraction, extract check-worthy claims
中文关键词: 人类事实核查员、耗时的任务、人类任务、索赔提取、提取值得检查的索赔
类目: Computation and Language (cs.CL)
备注: Accepted to ACL 2024

点击查看摘要

Abstract:Selecting which claims to check is a time-consuming task for human fact-checkers, especially from documents consisting of multiple sentences and containing multiple claims. However, existing claim extraction approaches focus more on identifying and extracting claims from individual sentences, e.g., identifying whether a sentence contains a claim or the exact boundaries of the claim within a sentence. In this paper, we propose a method for document-level claim extraction for fact-checking, which aims to extract check-worthy claims from documents and decontextualise them so that they can be understood out of context. Specifically, we first recast claim extraction as extractive summarization in order to identify central sentences from documents, then rewrite them to include necessary context from the originating document through sentence decontextualisation. Evaluation with both automatic metrics and a fact-checking professional shows that our method is able to extract check-worthy claims from documents more accurately than previous work, while also improving evidence retrieval.
摘要:对于人类事实核查人员来说,选择要核查哪些声明是一项耗时的任务,特别是当文档由多个句子组成且包含多个声明时。然而,现有的声明提取方法更侧重于从单个句子中识别和提取声明,例如,识别句子中是否包含声明,或声明在句子中的确切边界。本文提出了一种用于事实核查的文档级声明提取方法,其目的是从文档中提取值得核查的声明,并对其进行去上下文化处理,使其在脱离原文上下文的情况下也能被理解。具体地说,我们首先将声明提取重新表述为抽取式摘要,以便从文档中识别中心句子,然后通过句子去上下文化对其进行改写,以纳入原始文档中的必要上下文。基于自动度量和事实核查专家的评估表明,我们的方法能够比以前的工作更准确地从文档中提取值得核查的声明,同时也改善了证据检索。
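将声明提取"重塑为抽取式摘要"的第一步是给句子的中心性打分;下面是一个基于词重叠(Jaccard)相似度的极简示意(真实系统会使用语义表示而非词面重叠,句子与参数均为演示用假设):

```python
def central_sentences(sentences, k=1):
    """抽取式摘要草图:以词重叠相似度衡量每个句子与其余句子的
    平均相似程度,选出最"中心"的 k 句作为候选声明。"""
    def jaccard(a, b):
        ta, tb = set(a.lower().split()), set(b.lower().split())
        return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

    scores = [sum(jaccard(s, t) for t in sentences if t is not s)
              for s in sentences]
    ranked = sorted(range(len(sentences)), key=lambda i: -scores[i])
    return [sentences[i] for i in ranked[:k]]
```

选出的中心句随后还需去上下文化改写(补入指代对象、时间地点等),才能脱离原文被理解。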

[NLP-20] Error-preserving Automatic Speech Recognition of Young English Learners Language
[NLP-20] 青少年英语学习者语言的保错自动语音识别

链接: https://arxiv.org/abs/2406.03235
作者: Janick Michot,Manuela Hürlimann,Jan Deriu,Luzia Sauer,Katsiaryna Mlynchyk,Mark Cieliebak
关键词: ASR, language, ASR systems, central skills, language learners
中文关键词: ASB、语言、ASB系统、中心技能、语言学习者
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at ACL 2024 Main Conference

点击查看摘要

Abstract:One of the central skills that language learners need to practice is speaking the language. Currently, students in school do not get enough speaking opportunities and lack conversational practice. Recent advances in speech technology and natural language processing allow for the creation of novel tools to practice their speaking skills. In this work, we tackle the first component of such a pipeline, namely, the automated speech recognition module (ASR), which faces a number of challenges: first, state-of-the-art ASR models are often trained on adult read-aloud data by native speakers and do not transfer well to young language learners’ speech. Second, most ASR systems contain a powerful language model, which smooths out errors made by the speakers. To give corrective feedback, which is a crucial part of language learning, the ASR systems in our setting need to preserve the errors made by the language learners. In this work, we build an ASR system that satisfies these requirements: it works on spontaneous speech by young language learners and preserves their errors. For this, we collected a corpus containing around 85 hours of English audio spoken by learners in Switzerland from grades 4 to 6 on different language learning tasks, which we used to train an ASR model. Our experiments show that our model benefits from direct fine-tuning on children’s voices and has a much higher error preservation rate than other models.
摘要:语言学习者需要练习的核心技能之一就是说语言。目前,在校学生没有足够的演讲机会,缺乏会话练习。语音技术和自然语言处理的最新进展使人们能够创造新的工具来练习他们的说话技能。在这项工作中,我们解决了这样一个管道的第一个组成部分,即自动语音识别模块(ASR),它面临着一些挑战:首先,最先进的ASR模型通常是由母语人士在成人朗读数据上进行训练的,不能很好地迁移到年轻语言学习者的语音中。其次,大多数ASR系统包含一个强大的语言模型,可以消除说话者的错误。为了给出纠正反馈,这是语言学习的关键部分,在我们的环境中,ASR系统需要保留语言学习者犯下的错误。在这项工作中,我们构建了一个ASR系统来满足这些要求:它处理年轻语言学习者的自发语音,并保留他们的错误。为此,我们收集了一个语料库,其中包含瑞士四到六年级学习者关于不同语言学习任务的大约85个小时的英语音频,我们用这些音频来训练ASR模型。实验表明,我们的模型受益于对儿童声音的直接微调,并且比其他模型具有更高的错误保留率。
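评估此类 ASR 系统绕不开词错误率(WER)的计算;下面给出标准的词级编辑距离实现作为参考(这是通用指标的示意实现,与论文自定义的"错误保留率"不同,后者以原文为准):

```python
def word_error_rate(reference, hypothesis):
    """词级编辑距离 / 参考词数。对"保留错误"的 ASR 而言,
    参考转写须是逐字含错的文本——学习者的错误不应被系统顺手纠正。"""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # 删除
                          d[i][j - 1] + 1,          # 插入
                          d[i - 1][j - 1] + cost)   # 替换
    return d[len(ref)][len(hyp)] / len(ref)
```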

[NLP-21] Linking Named Entities in Diderot's Encyclopédie to Wikidata
[NLP-21] 将狄德罗《百科全书》中的命名实体链接到维基数据

链接: https://arxiv.org/abs/2406.03221
作者: Pierre Nugues
关键词: century in Europe, Europe that aimed, reference work, work from XVIIIth
中文关键词: 欧洲世纪,欧洲的目标,参考作品,十八世纪的作品
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: 6 pages, 3 figures

点击查看摘要

Abstract:Diderot's Encyclopédie is a reference work from the XVIIIth century in Europe that aimed at collecting the knowledge of its era. Wikipedia has the same ambition with a much greater scope. However, the lack of digital connection between the two encyclopedias may hinder their comparison and the study of how knowledge has evolved. A key element of Wikipedia is Wikidata that backs the articles with a graph of structured data. In this paper, we describe the annotation of more than 10,300 of the Encyclopédie entries with Wikidata identifiers enabling us to connect these entries to the graph. We considered geographic and human entities. The Encyclopédie does not contain biographic entries as they mostly appear as subentries of locations. We extracted all the geographic entries and we completely annotated all the entries containing a description of human entities. This represents more than 2,600 links referring to locations or human entities. In addition, we annotated more than 9,500 entries having a geographic content only. We describe the annotation process as well as application examples. This resource is available at this https URL
提要:狄德罗的《百科全书》是十八世纪欧洲的一部参考书,旨在收集那个时代的知识。维基百科有着同样的抱负,但范围要大得多。然而,这两部百科全书之间缺乏数字连接,可能会阻碍对它们的比较以及对知识如何演变的研究。维基百科的一个关键元素是维基数据,它用结构化数据图支持文章。在这篇文章中,我们描述了用维基数据标识符对10,300多个《百科全书》条目的标注,从而使我们能够将这些条目连接到图中。我们考虑了地理实体和人物实体。《百科全书》不包含传记条目,因为它们大多以地点子条目的形式出现。我们提取了所有地理条目,并对包含人物实体描述的所有条目进行了完整的标注。这代表了2,600多个指向地点或人物实体的链接。此外,我们还标注了9,500多个仅包含地理内容的条目。我们描述了标注过程以及应用实例。此资源可通过以下HTTPS URL获得

[NLP-22] ChatLang-8: An LLM-Based Synthetic Data Generation Framework for Grammatical Error Correction
[NLP-22] ChatLang-8:基于LLM的语法错误纠正合成数据生成框架

链接: https://arxiv.org/abs/2406.03202
作者: Jeiyoon Park,Chanjun Park,Heuiseok Lim
关键词: explore and improve, LLMs to generate, grammatical error correction, Prompt Manager, GEC
中文关键词: 探索和改进、生成LLM、语法错误纠正、提示经理、GEC
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: preprint

点击查看摘要

Abstract:We explore and improve the capabilities of LLMs to generate data for grammatical error correction (GEC). When merely producing parallel sentences, their patterns are too simplistic to be valuable as a corpus. To address this issue, we propose an automated framework that includes a Subject Selector, Grammar Selector, Prompt Manager, and Evaluator. Additionally, we introduce a new dataset for GEC tasks, named ChatLang-8, which encompasses eight types of subject nouns and 23 types of grammar. It consists of 1 million pairs featuring human-like grammatical errors. Our experiments reveal that ChatLang-8 exhibits a more uniform pattern composition compared to existing GEC datasets. Furthermore, we observe improved model performance when using ChatLang-8 instead of existing GEC datasets. The experimental results suggest that our framework and ChatLang-8 are valuable resources for enhancing ChatGPT’s data generation capabilities.
摘要:我们探索并改进LLM为语法错误纠正(GEC)生成数据的能力。仅仅生成平行句子时,其模式过于简单,作为语料价值有限。为了解决这个问题,我们提出了一个自动化框架,其中包括主题选择器(Subject Selector)、语法选择器(Grammar Selector)、提示管理器(Prompt Manager)和评估器(Evaluator)。此外,我们还为GEC任务构建了一个新的数据集,名为ChatLang-8,其中包含8种主语名词和23种语法类型。它由100万个句对构成,包含与人类相似的语法错误。我们的实验表明,与现有的GEC数据集相比,ChatLang-8表现出更均匀的模式构成。此外,使用ChatLang-8代替现有GEC数据集时,我们观察到模型性能有所提高。实验结果表明,我们的框架和ChatLang-8是增强ChatGPT数据生成能力的宝贵资源。
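框架中"选择器 + 提示管理器"的组合方式可以粗略示意如下(模板措辞与函数名均为我们的假设,并非论文的原始提示):

```python
import random

def build_gec_prompt(subjects, grammar_types, seed=0):
    """主题选择器 + 语法选择器 + 提示管理器的极简串联示意:
    从候选主语名词与语法类型中各采样一个, 拼入数据生成模板。"""
    rng = random.Random(seed)  # 固定种子, 便于复现采样
    subject = rng.choice(subjects)
    grammar = rng.choice(grammar_types)
    prompt = (f"Write one sentence about {subject} that contains a "
              f"typical {grammar} error, then give the corrected sentence.")
    return prompt, subject, grammar
```

生成的提示交给 LLM 产出"错误句/正确句"对,评估器再过滤低质量样本——这是对论文流水线的简化理解。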

[NLP-23] Bayesian WeakS-to-Strong from Text Classification to Generation
[NLP-23] Bayesian WeakS-to-Strong从文本分类到生成

链接: https://arxiv.org/abs/2406.03199
作者: Ziyun Cui,Ziyang Zhang,Wen Wu,Guangzhi Sun,Chao Zhang
关键词: large language models, language models raise, supervise them weakly, large language, raise the question
中文关键词: 大语言模型,语言模型提出,监督它们较弱,大语言,提出问题
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Advances in large language models raise the question of how alignment techniques will adapt as models become increasingly complex and humans will only be able to supervise them weakly. Weak-to-Strong mimics such a scenario where weak model supervision attempts to harness the full capabilities of a much stronger model. This work extends Weak-to-Strong to WeakS-to-Strong by exploring an ensemble of weak models which simulate the variability in human opinions. Confidence scores are estimated using a Bayesian approach to guide the WeakS-to-Strong generalization. Furthermore, we extend the application of WeakS-to-Strong from text classification tasks to text generation tasks where more advanced strategies are investigated for supervision. Moreover, direct preference optimization is applied to advance the student model’s preference learning, beyond the basic learning framework of teacher forcing. Results demonstrate the effectiveness of the proposed approach for the reliability of a strong student model, showing potential for superalignment.
摘要:大型语言模型的进步提出了一个问题:随着模型变得越来越复杂,人类只能对其进行弱监督,对齐技术将如何适应。Weak-to-Strong模拟了这样一种场景,即弱模型的监督试图驾驭强大得多的模型的全部能力。这项工作通过探索一组模拟人类观点差异的弱模型,将Weak-to-Strong扩展为WeakS-to-Strong。我们使用贝叶斯方法估计置信度分数,以指导WeakS-to-Strong的泛化。此外,我们将WeakS-to-Strong的应用从文本分类任务扩展到文本生成任务,并研究了更高级的监督策略。此外,在教师强制的基本学习框架之外,还应用了直接偏好优化来推进学生模型的偏好学习。结果表明,所提出的方法能有效提高强学生模型的可靠性,显示出用于超对齐(superalignment)的潜力。
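对"弱模型集成 + 置信度加权"这一思路,最直观的简化是按置信度对各弱监督者给出的类别概率做加权平均(仅为示意,论文中的贝叶斯置信度估计要复杂得多):

```python
def weighted_soft_label(prob_dists, confidences):
    """以(外部估计的)置信度对多个弱监督者的类别概率分布做加权平均,
    得到供强学生模型学习的软标签。
    prob_dists: list[list[float]],每个子列表是一个弱模型的类别概率。"""
    total = sum(confidences)
    num_classes = len(prob_dists[0])
    return [
        sum(c * p[i] for c, p in zip(confidences, prob_dists)) / total
        for i in range(num_classes)
    ]
```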

[NLP-24] The Impossibility of Fair LLMs
[NLP-24] 公平LLM的不可能性

链接: https://arxiv.org/abs/2406.03198
作者: Jacy Anthis,Kristian Lum,Michael Ekstrand,Avi Feller,Alexander D’Amour,Chenhao Tan
关键词: large language models, language models, increasingly clear, large language, Gemini
中文关键词: 大语言模型,语言模型,越来越清晰,大语言,双子座
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)
备注: Presented at the 1st Human-Centered Evaluation and Auditing of Language Models (HEAL) workshop at CHI 2024

点击查看摘要

Abstract:The need for fair AI is increasingly clear in the era of general-purpose systems such as ChatGPT, Gemini, and other large language models (LLMs). However, the increasing complexity of human-AI interaction and its social impacts have raised questions of how fairness standards could be applied. Here, we review the technical frameworks that machine learning researchers have used to evaluate fairness, such as group fairness and fair representations, and find that their application to LLMs faces inherent limitations. We show that each framework either does not logically extend to LLMs or presents a notion of fairness that is intractable for LLMs, primarily due to the multitudes of populations affected, sensitive attributes, and use cases. To address these challenges, we develop guidelines for the more realistic goal of achieving fairness in particular use cases: the criticality of context, the responsibility of LLM developers, and the need for stakeholder participation in an iterative process of design and evaluation. Moreover, it may eventually be possible and even necessary to use the general-purpose capabilities of AI systems to address fairness challenges as a form of scalable AI-assisted alignment.
摘要:在ChatGPT、Gemini和其他大型语言模型(LLM)等通用系统的时代,对公平人工智能的需求日益明显。然而,人类与人工智能互动的日益复杂及其社会影响引发了如何应用公平标准的问题。在这里,我们回顾了机器学习研究人员用来评估公平性的技术框架,如组公平和公平表示,发现它们在LLMS中的应用面临固有的局限性。我们表明,每个框架要么在逻辑上没有扩展到LLMS,要么提出了一个对LLM来说难以处理的公平概念,这主要是由于受影响的人群、敏感属性和用例的众多。为了应对这些挑战,我们为在特定用例中实现公平的更现实目标制定了指导方针:上下文的临界性、LLM开发人员的责任以及利益相关者参与迭代设计和评估过程的必要性。此外,最终可能甚至有必要使用人工智能系统的通用能力来解决公平挑战,作为一种可扩展的人工智能辅助对齐形式。

[NLP-25] Missci: Reconstructing Fallacies in Misrepresented Science
[NLP-25] 米西:重建被歪曲的科学中的谬误

链接: https://arxiv.org/abs/2406.03181
作者: Max Glockner,Yufang Hou,Preslav Nakov,Iryna Gurevych
关键词: Health-related misinformation, social networks, networks can lead, lead to poor, poor decision-making
中文关键词: 与健康相关的错误信息、社交网络、网络可能会导致糟糕的决策
类目: Computation and Language (cs.CL)
备注: ACL 2024 (main)

点击查看摘要

Abstract:Health-related misinformation on social networks can lead to poor decision-making and real-world dangers. Such misinformation often misrepresents scientific publications and cites them as “proof” to gain perceived credibility. To effectively counter such claims automatically, a system must explain how the claim was falsely derived from the cited publication. Current methods for automated fact-checking or fallacy detection neglect to assess the (mis)used evidence in relation to misinformation claims, which is required to detect the mismatch between them. To address this gap, we introduce Missci, a novel argumentation theoretical model for fallacious reasoning together with a new dataset for real-world misinformation detection that misrepresents biomedical publications. Unlike previous fallacy detection datasets, Missci (i) focuses on implicit fallacies between the relevant content of the cited publication and the inaccurate claim, and (ii) requires models to verbalize the fallacious reasoning in addition to classifying it. We present Missci as a dataset to test the critical reasoning abilities of large language models (LLMs), that are required to reconstruct real-world fallacious arguments, in a zero-shot setting. We evaluate two representative LLMs and the impact of different levels of detail about the fallacy classes provided to the LLM via prompts. Our experiments and human evaluation show promising results for GPT 4, while also demonstrating the difficulty of this task.
摘要:社交网络上与健康相关的错误信息可能会导致糟糕的决策和现实世界的危险。这类错误信息经常歪曲科学出版物,并引用它们作为"证据"以获得表面上的可信度。为了自动地有效反驳这类主张,系统必须解释该主张是如何从所引用的出版物中被错误推导出来的。当前的自动事实核查或谬误检测方法忽略了结合错误信息主张来评估被(错误)使用的证据,而这是检测二者之间不匹配所必需的。为了弥补这一差距,我们引入了Missci,一个新颖的谬误推理论证理论模型,以及一个用于真实世界错误信息检测的新数据集,其中的错误信息歪曲了生物医学出版物。与以前的谬误检测数据集不同,Missci(i)专注于所引用出版物的相关内容与不准确主张之间的隐含谬误,并且(ii)要求模型除了对谬误推理进行分类外,还要用语言将其表述出来。我们将Missci作为一个数据集,在零样本设置下测试大型语言模型(LLM)重建真实世界谬误论证所需的批判推理能力。我们评估了两个具有代表性的LLM,以及通过提示向LLM提供不同详细程度的谬误类别信息所产生的影响。我们的实验和人工评估显示了GPT-4的良好结果,同时也证明了这项任务的难度。

[NLP-26] StatBot.Swiss: Bilingual Open Data Exploration in Natural Language
[NLP-26] StatBot.Swiss:自然语言中的双语开放数据探索

链接: https://arxiv.org/abs/2406.03170
作者: Farhad Nooralahzadeh,Yi Zhang,Ellery Smith,Sabine Maennel,Cyril Matthey-Doret,Raphaël de Fondville,Kurt Stockinger
关键词: Large Language Models, brought by Large, Language Models, Large Language, monolingual English datasets
中文关键词: 大型语言模型,由大型、语言模型、大型语言、单语英语数据集带来
类目: Computation and Language (cs.CL)
备注: This work is accepted at ACL Findings 2024

点击查看摘要

Abstract:The potential for improvements brought by Large Language Models (LLMs) in Text-to-SQL systems is mostly assessed on monolingual English datasets. However, LLMs’ performance for other languages remains vastly unexplored. In this work, we release the StatBot.Swiss dataset, the first bilingual benchmark for evaluating Text-to-SQL systems based on real-world applications. The StatBot.Swiss dataset contains 455 natural language/SQL-pairs over 35 big databases with varying level of complexity for both English and German. We evaluate the performance of state-of-the-art LLMs such as GPT-3.5-Turbo and mixtral-8x7b-instruct for the Text-to-SQL translation task using an in-context learning approach. Our experimental analysis illustrates that current LLMs struggle to generalize well in generating SQL queries on our novel bilingual dataset.
摘要:大型语言模型(LLM)在文本到SQL系统中带来的改进潜力主要是在单语英语数据集上评估的,LLM在其他语言上的表现仍远未被探索。在这项工作中,我们发布了StatBot.Swiss数据集,这是第一个用于评估基于真实应用的Text-to-SQL系统的双语基准。StatBot.Swiss数据集包含35个大型数据库上的455个自然语言/SQL对,涵盖英语和德语,复杂程度各不相同。我们使用上下文学习方法,评估了GPT-3.5-Turbo和mixtral-8x7b-instruct等最先进LLM在文本到SQL翻译任务中的性能。我们的实验分析表明,现有LLM在我们这个新的双语数据集上生成SQL查询时难以很好地泛化。
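论文采用的"上下文学习"评估方式,核心是把数据库模式和少量示例拼入提示;可示意如下(模式串、示例与格式均为演示用假设,实际提示以论文为准):

```python
def text_to_sql_prompt(schema, examples, question):
    """Text-to-SQL 的 few-shot(上下文学习)提示拼装示意:
    先给出表结构, 再给出若干"问题-SQL"示例, 最后留出待补全的 SQL。"""
    parts = [f"Schema: {schema}"]
    for q, sql in examples:
        parts.append(f"Question: {q}\nSQL: {sql}")
    parts.append(f"Question: {question}\nSQL:")
    return "\n\n".join(parts)
```

拼好的提示直接发给 GPT-3.5-Turbo 这类模型,模型补全最后一行之后的 SQL 即为预测结果。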

[NLP-27] CSS: Contrastive Semantic Similarity for Uncertainty Quantification of LLMs
[NLP-27] CSS:LLM不确定性量化的对比语义相似性

链接: https://arxiv.org/abs/2406.03158
作者: Shuang Ao,Stefan Rueger,Advaith Siddharthan
关键词: large language models, open challenge, impressive capability, capability of large, remains an open
中文关键词: 大型语言模型、开放的挑战、令人印象深刻的能力、巨大的能力、仍然是开放的
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: The paper is accepted by The Conference on Uncertainty in Artificial Intelligence (UAI), 2024

点击查看摘要

Abstract:Despite the impressive capability of large language models (LLMs), knowing when to trust their generations remains an open challenge. The recent literature on uncertainty quantification of natural language generation (NLG) utilises a conventional natural language inference (NLI) classifier to measure the semantic dispersion of LLMs responses. These studies employ logits of NLI classifier for semantic clustering to estimate uncertainty. However, logits represent the probability of the predicted class and barely contain feature information for potential clustering. Alternatively, CLIP (Contrastive Language-Image Pre-training) performs impressively in extracting image-text pair features and measuring their similarity. To extend its usability, we propose Contrastive Semantic Similarity, the CLIP-based feature extraction module to obtain similarity features for measuring uncertainty for text pairs. We apply this method to selective NLG, which detects and rejects unreliable generations for better trustworthiness of LLMs. We conduct extensive experiments with three LLMs on several benchmark question-answering datasets with comprehensive evaluation metrics. Results show that our proposed method performs better in estimating reliable responses of LLMs than comparable baselines. The code is available at this https URL.
摘要:尽管大型语言模型(LLM)能力惊人,但何时应该信任其生成内容仍然是一个开放的挑战。最近关于自然语言生成(NLG)不确定性量化的文献利用传统的自然语言推理(NLI)分类器来度量LLM响应的语义离散度。这些研究使用NLI分类器的logits进行语义聚类来估计不确定性。然而,logits表示预测类别的概率,几乎不包含可用于聚类的特征信息。另一方面,CLIP(对比语言-图像预训练)在提取图文对特征和度量它们的相似性方面表现出色。为了扩展其可用性,我们提出了对比语义相似度,这是一种基于CLIP的特征提取模块,用于获取相似度特征来度量文本对的不确定性。我们将这种方法应用于选择性NLG,即检测并拒绝不可靠的生成,以获得更好的LLM可信度。我们在几个具有综合评价指标的基准问答数据集上用三个LLM进行了广泛的实验。结果表明,我们提出的方法在估计LLM的可靠响应方面优于可比基线。代码可在此HTTPS URL上找到。
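用特征相似度量化不确定性的基本做法,是计算多次采样回答两两之间的余弦距离并取平均;可示意如下(embeddings 假定来自某个特征提取器,如摘要中的基于CLIP的模块,向量本身为演示假设):

```python
def cosine(u, v):
    """两个向量的余弦相似度。"""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

def semantic_dispersion(embeddings):
    """多次采样回答的特征向量之间的平均成对余弦距离:
    离散度越高, 说明回答彼此越不一致, 不确定性越大。"""
    n = len(embeddings)
    acc, pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            acc += 1.0 - cosine(embeddings[i], embeddings[j])
            pairs += 1
    return acc / pairs if pairs else 0.0
```

选择性NLG即在该离散度超过阈值时拒绝输出,这是对论文流程的简化理解。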

[NLP-28] Which Side Are You On? A Multi-task Dataset for End-to-End Argument Summarisation and Evaluation
[NLP-28] 你站在哪一边?用于端到端论点总结和评估的多任务数据集

链接: https://arxiv.org/abs/2406.03151
作者: Hao Li,Yuping Wu,Viktor Schlegel,Riza Batista-Navarro,Tharindu Madusanka,Iqra Zahid,Jiayan Zeng,Xiaochi Wang,Xinran He,Yizhi Li,Goran Nenadic
关键词: large language models, synthesise persuasive arguments, language models, recent advances, advances of large
中文关键词: 大型语言模型、综合有说服力的论点、语言模型、最近的进展、大型的进展
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Published on ACL 2024 Findings

点击查看摘要

Abstract:With the recent advances of large language models (LLMs), it is no longer infeasible to build an automated debate system that helps people to synthesise persuasive arguments. Previous work attempted this task by integrating multiple components. In our work, we introduce an argument mining dataset that captures the end-to-end process of preparing an argumentative essay for a debate, which covers the tasks of claim and evidence identification (Task 1 ED), evidence convincingness ranking (Task 2 ECR), argumentative essay summarisation and human preference ranking (Task 3 ASR) and metric learning for automated evaluation of resulting essays, based on human feedback along argument quality dimensions (Task 4 SQE). Our dataset contains 14k examples of claims that are fully annotated with the various properties supporting the aforementioned tasks. We evaluate multiple generative baselines for each of these tasks, including representative LLMs. We find, that while they show promising results on individual tasks in our benchmark, their end-to-end performance on all four tasks in succession deteriorates significantly, both in automated measures as well as in human-centred evaluation. This challenge presented by our proposed dataset motivates future research on end-to-end argument mining and summarisation. The repository of this project is available at this https URL
摘要:随着大型语言模型的发展,建立一个自动辩论系统来帮助人们综合有说服力的论点不再是不可行的。以前的工作试图通过集成多个组件来完成这项任务。在我们的工作中,我们引入了一个论点挖掘数据集,它捕获了为辩论准备议论文的端到端过程,其中包括主张和证据识别(任务1 ED)、证据说服力排名(任务2 ECR)、议论文摘要和人类偏好排名(任务3 ASR)以及基于人类沿论点质量维度的反馈对结果论文进行自动评估的度量学习(任务4 SQE)。我们的数据集包含14k个声明的示例,这些示例用支持上述任务的各种属性进行了完整的注释。我们评估了每项任务的多个生成基线,包括具有代表性的LLM。我们发现,尽管在我们的基准中,它们在个别任务上显示了令人振奋的结果,但它们在所有四项任务上的端到端性能连续显著恶化,无论是在自动测量方面,还是在以人为中心的评估方面。我们提出的数据集带来的这一挑战激励了未来对端到端论点挖掘和摘要的研究。此项目的存储库可在此HTTPS URL中找到

[NLP-29] Towards Real-world Scenario: Imbalanced New Intent Discovery
[NLP-29] 迈向现实世界场景:不平衡的新意图发现

链接: https://arxiv.org/abs/2406.03127
作者: Shun Zhang,Chaoran Yan,Jian Yang,Jiaheng Liu,Ying Mo,Jiaqi Bai,Tongliang Li,Zhoujun Li
关键词: utilizing limited labeled, previously undefined categories, massive unlabeled data, Intent Discovery, aims at detecting
中文关键词: 利用有限的标签、之前未定义的类别、大量未标签的数据,意图发现,旨在检测
类目: Computation and Language (cs.CL)
备注: ACL 2024

点击查看摘要

Abstract:New Intent Discovery (NID) aims at detecting known and previously undefined categories of user intent by utilizing limited labeled and massive unlabeled data. Most prior works often operate under the unrealistic assumption that the distribution of both familiar and new intent classes is uniform, overlooking the skewed and long-tailed distributions frequently encountered in real-world scenarios. To bridge the gap, our work introduces the imbalanced new intent discovery (i-NID) task, which seeks to identify familiar and novel intent categories within long-tailed distributions. A new benchmark (ImbaNID-Bench) comprised of three datasets is created to simulate the real-world long-tail distributions. ImbaNID-Bench ranges from broad cross-domain to specific single-domain intent categories, providing a thorough representation of practical use cases. Besides, a robust baseline model ImbaNID is proposed to achieve cluster-friendly intent representations. It includes three stages: model pre-training, generation of reliable pseudo-labels, and robust representation learning that strengthens the model performance to handle the intricacies of real-world data distributions. Our extensive experiments on previous benchmarks and the newly established benchmark demonstrate the superior performance of ImbaNID in addressing the i-NID task, highlighting its potential as a powerful baseline for uncovering and categorizing user intents in imbalanced and long-tailed distributions (code: this https URL).
摘要:新意图发现(NID)旨在利用有限的已标注数据和海量的未标注数据来检测已知和先前未定义的用户意图类别。大多数已有工作通常在不切实际的假设下进行,即熟悉的和新的意图类别分布是均匀的,忽略了现实世界场景中经常遇到的倾斜和长尾分布。为了弥合这一差距,我们的工作引入了不平衡新意图发现(i-NID)任务,该任务旨在在长尾分布中识别熟悉的和新的意图类别。我们创建了一个由三个数据集组成的新基准(ImbaNID-Bench)来模拟真实世界的长尾分布。ImbaNID-Bench 覆盖从广泛的跨域到特定的单域意图类别,全面刻画了实际用例。此外,我们还提出了一个稳健的基线模型 ImbaNID,以获得对聚类友好的意图表示。它包括三个阶段:模型预训练、生成可靠的伪标签,以及增强模型处理真实世界数据分布复杂性的稳健表示学习。我们在既有基准和新建基准上的大量实验表明,ImbaNID 在解决 i-NID 任务方面表现出色,凸显了它作为揭示和归类不平衡、长尾分布中用户意图的强大基线的潜力(代码见此https URL)。

[NLP-30] Space Decomposition for Sentence Embedding
[NLP-30] 句子嵌入的空间分解

链接: https://arxiv.org/abs/2406.03125
作者: Wuttikorn Ponwitayarat,Peerat Limkonchotiwat,Ekapol Chuangsuwanich,Sarana Nutanong
关键词: Determining sentence pair, NLP tasks, Determining sentence, sentence pair similarity, sentence pair
中文关键词: 确定句子对,NLP任务,确定句子,句子对相似性,句子对
类目: Computation and Language (cs.CL)
备注: ACL Finding 2024. The code and pre-trained models are available at this https URL

点击查看摘要

Abstract:Determining sentence pair similarity is crucial for various NLP tasks. A common technique to address this is typically evaluated on a continuous semantic textual similarity scale from 0 to 5. However, based on a linguistic observation in STS annotation guidelines, we found that the score in the range [4,5] indicates an upper-range sample, while the rest are lower-range samples. This necessitates a new approach to treating the upper-range and lower-range classes separately. In this paper, we introduce a novel embedding space decomposition method called MixSP utilizing a Mixture of Specialized Projectors, designed to distinguish and rank upper-range and lower-range samples accurately. The experimental results demonstrate that MixSP decreased the overlap representation between upper-range and lower-range classes significantly while outperforming competitors on STS and zero-shot benchmarks.
摘要:确定句子对相似度对于各种NLP任务至关重要。解决这个问题的常用技术通常根据0到5的连续语义文本相似度来评估。然而,根据STS注释指南中的语言观察,我们发现范围[4,5]中的分数表示上范围样本,而其余的则是下范围样本。这就需要一种新的方法来分别对待上层和下层阶级。本文中,我们介绍了一种新型的嵌入空间分解方法MixSP,该方法利用了专业投影仪的混合物,旨在准确地区分和排名上范围和下范围样本。实验结果表明,MixSP显着减少了上范围和下范围类别之间的重叠表示,同时在STS和零射击基准方面优于竞争对手。
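作为示意,下面给出 MixSP 中“专用投影器混合”思路的一个极简草图。注意这是假设性的实现:路由规则按摘要所述的 STS 分数区间([4, 5] 为高分区间)划分,投影矩阵维度与函数名均为演示用假设,并非论文的实际模型:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8  # 假设的句向量维度

# 两个“专用投影器”:一个处理高分区间(upper-range)样本,一个处理低分区间样本
W_upper = rng.normal(size=(DIM, DIM))
W_lower = rng.normal(size=(DIM, DIM))

def route(sts_score: float) -> str:
    """按 STS 标注规则路由:[4, 5] 为高分区间,其余为低分区间。"""
    return "upper" if sts_score >= 4.0 else "lower"

def project(embedding: np.ndarray, sts_score: float) -> np.ndarray:
    """对句向量应用路由器选中的专用投影器。"""
    W = W_upper if route(sts_score) == "upper" else W_lower
    return W @ embedding

x = rng.normal(size=DIM)
emb_upper = project(x, 4.5)   # 走 upper 投影器
emb_lower = project(x, 2.0)   # 走 lower 投影器
```

真实模型中路由器是可学习的,且两个区间的投影器会联合训练以减少表示重叠;上面仅演示“按分数区间分治”这一结构。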

[NLP-31] FragRel: Exploiting Fragment-level Relations in the External Memory of Large Language Models
[NLP-31] FragRel:利用大型语言模型外部存储中的碎片级关系

链接: https://arxiv.org/abs/2406.03092
作者: Xihang Yue,Linchao Zhu,Yi Yang
关键词: Large Language Models, Language Models, Large Language, recent studies explore, studies explore hierarchically
中文关键词: 大型语言模型,语言模型,大型语言,最近的研究探索,研究分层探索
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:To process contexts with unlimited length using Large Language Models (LLMs), recent studies explore hierarchically managing the long text. Only several text fragments are taken from the external memory and passed into the temporary working memory, i.e., LLM’s context window. However, existing approaches isolatedly handle the text fragments without considering their structural connections, thereby suffering limited capability on texts with intensive inter-relations, e.g., coherent stories and code repositories. This work attempts to resolve this by exploiting the fragment-level relations in external memory. First, we formulate the fragment-level relations and present several instantiations for different text types. Next, we introduce a relation-aware fragment assessment criteria upon previous independent fragment assessment. Finally, we present the fragment-connected Hierarchical Memory based LLM. We validate the benefits of involving these relations on long story understanding, repository-level code generation, and long-term chatting.
摘要:为了使用大语言模型处理无限长度的上下文,最近的研究探索了对长文本的分层管理。只有几个文本片段从外部存储器中取出并传递到临时工作存储器中,即LLM的上下文窗口。然而,现有的方法孤立地处理文本片段,而没有考虑它们的结构联系,从而限制了对具有强烈相互关系的文本的能力,例如连贯的故事和代码库。这项工作试图通过利用外部存储器中的片段级关系来解决这个问题。首先,我们描述了片断级别的关系,并针对不同的文本类型给出了几个实例。接下来,我们在前面的独立片段评估的基础上,引入了一种关系感知的片段评估标准。最后,我们提出了一种基于分段连接的层次记忆的LLM。我们验证了涉及这些关系在长故事理解、存储库级代码生成和长期聊天方面的好处。

[NLP-32] Cryptocurrency Frauds for Dummies: How ChatGPT introduces us to fraud?
[NLP-32] 加密货币傻瓜欺诈:ChatGPT如何将我们引入欺诈?

链接: https://arxiv.org/abs/2406.03079
作者: Wail Zellagui,Abdessamad Imine,Yamina Tadjeddine
关键词: versatile machine interlocutor, Recent advances, packed with knowledge, field of large, powerful and versatile
中文关键词: 多才多艺的机器对话者,最近的进步,知识渊博,领域大,强大和多才多艺
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: To be published in ACM journal “Digital Government: Research and Practice”

点击查看摘要

Abstract:Recent advances in the field of large language models (LLMs), particularly the ChatGPT family, have given rise to a powerful and versatile machine interlocutor, packed with knowledge and challenging our understanding of learning. This interlocutor is a double-edged sword: it can be harnessed for a wide variety of beneficial tasks, but it can also be used to cause harm. This study explores the complicated interaction between ChatGPT and the growing problem of cryptocurrency fraud. Although ChatGPT is known for its adaptability and ethical considerations when used for harmful purposes, we highlight the deep connection that may exist between ChatGPT and fraudulent actions in the volatile cryptocurrency ecosystem. Based on our categorization of cryptocurrency frauds, we show how to influence outputs, bypass ethical terms, and achieve specific fraud goals by manipulating ChatGPT prompts. Furthermore, our findings emphasize the importance of realizing that ChatGPT could be a valuable instructor even for novice fraudsters, as well as understanding and safely deploying complex language models, particularly in the context of cryptocurrency frauds. Finally, our study underlines the importance of using LLMs responsibly and ethically in the digital currency sector, identifying potential risks and resolving ethical issues. It should be noted that our work is not intended to encourage and promote fraud, but rather to raise awareness of the risks of fraud associated with the use of ChatGPT.
摘要:大型语言模型领域的最新进展,特别是ChatGPT家族,催生了一种功能强大、用途广泛的机器对话器,它充斥着知识,挑战着我们对学习的理解。这种对话者是一把双刃剑:它可以被用来完成各种有益的任务,但也可以用来造成伤害。这项研究探索了ChatGPT与日益严重的加密货币欺诈问题之间的复杂相互作用。尽管ChatGPT在用于有害目的时以其适应性和伦理考虑而闻名,但我们强调了ChatGPT与动荡的加密货币生态系统中的欺诈行为之间可能存在的深刻联系。根据我们对加密货币欺诈的分类,我们展示了如何通过操纵ChatGPT提示来影响输出、绕过伦理术语并实现特定的欺诈目标。此外,我们的发现强调了认识到ChatGPT可能是一个有价值的教师的重要性,即使对于初学者也是如此,以及理解和安全地部署复杂的语言模型,特别是在加密货币欺诈的背景下。最后,我们的研究强调了在数字货币领域负责任地、合乎道德地使用LLM、识别潜在风险和解决道德问题的重要性。应该指出的是,我们的工作并不是为了鼓励和促进欺诈,而是为了提高人们对使用ChatGPT相关欺诈风险的认识。

[NLP-33] Towards Detecting LLMs Hallucination via Markov Chain-based Multi-agent Debate Framework
[NLP-33] 迈向通过基于Markov链的多智能体辩论框架检测LLM幻觉

链接: https://arxiv.org/abs/2406.03075
作者: Xiaoxi Sun,Jinpeng Li,Yan Zhong,Dongyan Zhao,Rui Yan
关键词: large language models, language text generation, natural language text, language models, text generation
中文关键词: 大型语言模型、语言文本生成、自然语言文本、语言模型、文本生成
类目: Computation and Language (cs.CL)
备注: 18 pages, 3 figures

点击查看摘要

Abstract:The advent of large language models (LLMs) has facilitated the development of natural language text generation. It also poses unprecedented challenges, with content hallucination emerging as a significant concern. Existing solutions often involve expensive and complex interventions during the training process. Moreover, some approaches emphasize problem disassembly while neglecting the crucial validation process, leading to performance degradation or limited applications. To overcome these limitations, we propose a Markov Chain-based multi-agent debate verification framework to enhance hallucination detection accuracy in concise claims. Our method integrates the fact-checking process, including claim detection, evidence retrieval, and multi-agent verification. In the verification stage, we deploy multiple agents through flexible Markov Chain-based debates to validate individual claims, ensuring meticulous verification outcomes. Experimental results across three generative tasks demonstrate that our approach achieves significant improvements over baselines.
摘要:大语言模型的出现促进了自然语言文本生成的发展,但也带来了前所未有的挑战,其中内容幻觉成为一个重大问题。现有的解决方案往往涉及训练过程中昂贵而复杂的干预。此外,一些方法强调问题分解而忽略了关键的验证过程,导致性能下降或应用受限。为了克服这些局限,我们提出了一种基于马尔可夫链的多智能体辩论验证框架,以提高对简明声明的幻觉检测准确性。我们的方法整合了事实核查流程,包括声明检测、证据检索和多智能体验证。在验证阶段,我们通过灵活的基于马尔可夫链的辩论部署多个智能体来验证单个声明,确保验证结果严谨细致。在三个生成任务上的实验结果表明,我们的方法相比基线取得了显著的改进。
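“马尔可夫链式多智能体辩论”的核心结构可以用如下极简草图说明(假设性示例:智能体、证据评分与阈值均为演示用构造,并非论文的真实实现)。每轮裁决只依赖上一轮裁决(马尔可夫性),最终按阈值判定声明是否通过核查:

```python
# 每个“智能体”把 (声明的证据支持度, 上一轮裁决) 映射为新的裁决,取值 [0, 1]
def cautious_agent(claim_support: float, prior: float) -> float:
    # 将自身基于证据的评分与上一轮裁决取平均
    return 0.5 * claim_support + 0.5 * prior

def strict_agent(claim_support: float, prior: float) -> float:
    # 对证据支持不足的声明从严处理
    return min(claim_support, prior)

def markov_debate(claim_support: float, agents, init: float = 0.5) -> bool:
    """顺序地让各智能体复核:第 t 轮裁决只依赖第 t-1 轮裁决(马尔可夫性)。
    最终裁决达到阈值则认为声明通过核查(未检出幻觉)。"""
    verdict = init
    for agent in agents:
        verdict = agent(claim_support, verdict)
    return verdict >= 0.5

passed = markov_debate(0.9, [cautious_agent, strict_agent])   # 证据充分的声明
flagged = markov_debate(0.1, [cautious_agent, strict_agent])  # 证据薄弱的声明
```

真实框架中,智能体由 LLM 扮演,证据支持度来自检索到的证据,状态转移也更灵活;这里只演示“裁决沿链条逐轮传递”的骨架。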

[NLP-34] How Truncating Weights Improves Reasoning in Language Models
[NLP-34] 截断权重如何改善语言模型中的推理

链接: https://arxiv.org/abs/2406.03068
作者: Lei Chen,Joan Bruna,Alberto Bietti
关键词: generate fluent text, large language models, involve basic forms, large language, forms of logical
中文关键词: 生成流畅的文本、大型语言模型、涉及基本形式、大型语言、逻辑形式
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:In addition to the ability to generate fluent text in various languages, large language models have been successful at tasks that involve basic forms of logical “reasoning” over their context. Recent work found that selectively removing certain components from weight matrices in pre-trained models can improve such reasoning capabilities. We investigate this phenomenon further by carefully studying how certain global associations tend to be stored in specific weight components or Transformer blocks, in particular feed-forward layers. Such associations may hurt predictions in reasoning tasks, and removing the corresponding components may then improve performance. We analyze how this arises during training, both empirically and theoretically, on a two-layer Transformer trained on a basic reasoning task with noise, a toy associative memory model, and on the Pythia family of pre-trained models tested on simple reasoning tasks.
摘要:除了能够以多种语言生成流畅的文本之外,大型语言模型还成功地完成了涉及对其上下文进行基本形式逻辑“推理”的任务。最近的工作发现,有选择地从预训练模型的权重矩阵中删除某些成分可以提高此类推理能力。我们通过仔细研究某些全局关联倾向于如何存储在特定的权重成分或 Transformer 块(尤其是前馈层)中,进一步考察了这一现象。此类关联可能会损害推理任务中的预测,删除相应的成分则可能提高性能。我们从经验和理论两方面分析了这种现象在训练过程中是如何产生的,研究对象包括在带噪声的基本推理任务(一个玩具联想记忆模型)上训练的两层 Transformer,以及在简单推理任务上测试的 Pythia 系列预训练模型。
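“从权重矩阵中删除某些成分”这一操作本身可以用 SVD 给出一个通用草图(假设性示例:具体应删除哪一段奇异值成分取决于论文的分析,这里仅演示机制本身):

```python
import numpy as np

def zero_singular_band(W: np.ndarray, lo: int, hi: int) -> np.ndarray:
    """对权重矩阵做 SVD,并将第 lo 到 hi-1 个成分
    (按奇异值从大到小排序)置零,返回删除这些成分后的矩阵。"""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    S = S.copy()
    S[lo:hi] = 0.0
    return (U * S) @ Vt  # 按列缩放 U 等价于 U @ diag(S)

rng = np.random.default_rng(1)
W = rng.normal(size=(6, 6))
W_drop_top = zero_singular_band(W, 0, 1)   # 删除最大的一个奇异成分
```

在实际模型中,这一操作会施加在选定的前馈层权重上,然后在推理基准上比较删除前后的表现。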

[NLP-35] RadBARTsum: Domain Specific Adaption of Denoising Sequence-to-Sequence Models for Abstractive Radiology Report Summarization
[NLP-35] RadBARTsum:抽象放射学报告摘要的去噪序列到序列模型的特定领域调整

链接: https://arxiv.org/abs/2406.03062
作者: Jinge Wu,Abul Hasan,Honghan Wu
关键词: Radiology report summarization, doctors quickly identify, quickly identify clinically, identify clinically significant, clinically significant findings
中文关键词: 放射学报告总结,医生快速识别,临床快速识别,识别具有临床意义、具有临床意义的发现
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Radiology report summarization is a crucial task that can help doctors quickly identify clinically significant findings without the need to review detailed sections of reports. This study proposes RadBARTsum, a domain-specific and ontology facilitated adaptation of the BART model for abstractive radiology report summarization. The approach involves two main steps: 1) re-training the BART model on a large corpus of radiology reports using a novel entity masking strategy to improving biomedical domain knowledge learning, and 2) fine-tuning the model for the summarization task using the Findings and Background sections to predict the Impression section. Experiments are conducted using different masking strategies. Results show that the re-training process with domain knowledge facilitated masking improves performances consistently across various settings. This work contributes a domain-specific generative language model for radiology report summarization and a method for utilising medical knowledge to realise entity masking language model. The proposed approach demonstrates a promising direction of enhancing the efficiency of language models by deepening its understanding of clinical knowledge in radiology reports.
摘要:放射学报告摘要是一项至关重要的任务,它可以帮助医生快速识别具有临床意义的发现,而无需审阅报告的详细章节。本研究提出了 RadBARTsum,一种面向抽象式放射学报告摘要、由领域知识和本体辅助改编的 BART 模型。该方法包括两个主要步骤:1)使用一种新的实体掩蔽策略在大规模放射学报告语料库上重新训练 BART 模型,以改进生物医学领域知识学习;2)利用“发现”(Findings)和“背景”(Background)章节预测“印象”(Impression)章节,针对摘要任务微调模型。实验使用了不同的掩蔽策略。结果表明,借助领域知识辅助掩蔽的再训练过程在各种设置下都能一致地提高性能。这项工作为放射学报告摘要贡献了一种领域特定的生成式语言模型,并提供了一种利用医学知识实现实体掩蔽语言模型的方法。所提出的方法通过加深对放射学报告中临床知识的理解,展示了提高语言模型效率的一个有前途的方向。
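文中的“实体掩蔽”再训练策略可以用如下极简草图说明(假设:实体列表已由领域词典或 NER 给出;函数名、掩码符号与示例报告均为演示用假设):

```python
import re

def mask_entities(text: str, entities, mask_token: str = "[MASK]") -> str:
    """把报告中出现的领域实体替换为掩码符号,模拟去噪式预训练的输入破坏。
    先替换较长实体,避免短实体拆散长实体的匹配。"""
    for ent in sorted(entities, key=len, reverse=True):
        text = re.sub(re.escape(ent), mask_token, text)
    return text

report = "Mild cardiomegaly with small left pleural effusion."
masked = mask_entities(report, ["cardiomegaly", "pleural effusion"])
```

再训练时,模型以被掩蔽的报告为输入、原始报告为目标,从而被迫学习医学实体的上下文分布。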

[NLP-36] StreamSpeech: Simultaneous Speech-to-Speech Translation with Multi-task Learning
[NLP-36] StreamSpeech:具有多任务学习的语音同步翻译

链接: https://arxiv.org/abs/2406.03049
作者: Shaolei Zhang,Qingkai Fang,Shoutao Guo,Zhengrui Ma,Min Zhang,Yang Feng
关键词: streaming speech inputs, receiving streaming speech, outputs target speech, streaming speech, streaming speech translation
中文关键词: 流语音输入、接收流语音、输出目标语音、流语音、流语音翻译
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Accepted to ACL 2024 main conference, Project Page: this https URL

点击查看摘要

Abstract:Simultaneous speech-to-speech translation (Simul-S2ST, a.k.a streaming speech translation) outputs target speech while receiving streaming speech inputs, which is critical for real-time communication. Beyond accomplishing translation between speech, Simul-S2ST requires a policy to control the model to generate corresponding target speech at the opportune moment within speech inputs, thereby posing a double challenge of translation and policy. In this paper, we propose StreamSpeech, a direct Simul-S2ST model that jointly learns translation and simultaneous policy in a unified framework of multi-task learning. Adhering to a multi-task learning approach, StreamSpeech can perform offline and simultaneous speech recognition, speech translation and speech synthesis via an “All-in-One” seamless model. Experiments on CVSS benchmark demonstrate that StreamSpeech achieves state-of-the-art performance in both offline S2ST and Simul-S2ST tasks. Besides, StreamSpeech is able to present high-quality intermediate results (i.e., ASR or translation results) during simultaneous translation process, offering a more comprehensive real-time communication experience.
摘要:同步语音到语音翻译(SIMUL-S2ST,也称为流语音翻译)在输出目标语音的同时接收流语音输入,这对于实时通信至关重要。SIMUL-S2ST除了完成语音之间的翻译外,还需要一个策略来控制模型在语音输入的适当时刻生成相应的目标语音,从而提出了翻译和策略的双重挑战。在本文中,我们提出了StreamSpeech,一个直接的SIMUL-S2ST模型,在一个统一的多任务学习框架中联合学习翻译和同步策略。秉承多任务学习方式,StreamSpeech可通过无缝模型进行离线和同步语音识别、语音翻译和语音合成。在CVSS基准测试平台上的实验表明,StreamSpeech在离线S2ST和SIMUL-S2ST任务中都达到了最好的性能。此外,StreamSpeech能够在同声翻译过程中呈现高质量的中间结果(即ASR或翻译结果),提供更全面的实时交流体验。
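同步翻译中的“策略”概念可以用经典的 wait-k 策略直观说明(注意:StreamSpeech 学习的是自适应策略,wait-k 只是常见的固定策略基线,这里仅作示意):

```python
def wait_k_actions(src_len: int, tgt_len: int, k: int):
    """生成 wait-k 策略的读/写动作序列:先读入 k 个源 token,
    之后每写一个目标 token 再读一个源 token,源端读完则连续写。"""
    actions, read, written = [], 0, 0
    while written < tgt_len:
        if read < min(written + k, src_len):
            actions.append("READ")
            read += 1
        else:
            actions.append("WRITE")
            written += 1
    return actions

acts = wait_k_actions(src_len=4, tgt_len=4, k=2)
```

学习式策略(如本文方法)会根据输入内容动态决定每一步是 READ 还是 WRITE,而不是套用固定的 k。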

[NLP-37] From Tarzan to Tolkien: Controlling the Language Proficiency Level of LLMs for Content Generation
[NLP-37] 从Tarzan到Tolkien:控制LLM的语言熟练程度以实现内容生成

链接: https://arxiv.org/abs/2406.03030
作者: Ali Malik,Stephen Mayhew,Chris Piech,Klinton Bicknell
关键词: Large Language Models, fully proficient, problem of controlling, controlling the difficulty, difficulty level
中文关键词: 大型语言模型,完全精通,控制问题,控制难度,难度水平
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We study the problem of controlling the difficulty level of text generated by Large Language Models (LLMs) for contexts where end-users are not fully proficient, such as language learners. Using a novel framework, we evaluate the effectiveness of several key approaches for this task, including few-shot prompting, supervised finetuning, and reinforcement learning (RL), utilising both GPT-4 and open source alternatives like LLama2-7B and Mistral-7B. Our findings reveal a large performance gap between GPT-4 and the open source models when using prompt-based strategies. However, we show how to bridge this gap with a careful combination of finetuning and RL alignment. Our best model, CALM (CEFR-Aligned Language Model), surpasses the performance of GPT-4 and other strategies, at only a fraction of the cost. We further validate the quality of our results through a small-scale human study. Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG) Cite as: arXiv:2406.03030 [cs.CL] (or arXiv:2406.03030v1 [cs.CL] for this version) Journalreference: In Findings of the Association for Computational Linguistics (ACL 2024)
摘要:我们研究了在最终用户(例如语言学习者)尚未完全精通语言的场景下,控制大语言模型(LLM)生成文本难度的问题。使用一个新的框架,我们评估了这项任务的几种关键方法的有效性,包括少样本提示、有监督微调和强化学习(RL),同时使用 GPT-4 以及 LLama2-7B 和 Mistral-7B 等开源替代模型。我们的研究结果显示,在使用基于提示的策略时,GPT-4 和开源模型之间存在很大的性能差距。然而,我们展示了如何通过精心结合微调和 RL 对齐来弥合这一差距。我们最好的模型 CALM(CEFR 对齐语言模型)的性能超过了 GPT-4 和其他策略,而成本只是其一小部分。我们通过一项小规模的人工研究进一步验证了结果的质量。主题:Computation and Language (cs.CL); Machine Learning (cs.LG)。引用:arXiv:2406.03030 [cs.CL]。期刊参考:In Findings of the Association for Computational Linguistics (ACL 2024)。

[NLP-38] Unveiling Selection Biases: Exploring Order and Token Sensitivity in Large Language Models
[NLP-38] 揭露选择偏见:探索大型语言模型中的顺序和符号敏感性

链接: https://arxiv.org/abs/2406.03009
作者: Sheng-Lun Wei,Cheng-Kuang Wu,Hen-Hsen Huang,Hsin-Hsi Chen
关键词: Large Language Models, Large Language, Language Models, ordered sequence, investigate the phenomena
中文关键词: 大型语言模型,大型语言,语言模型,有序序列,调查现象
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted as a long findings paper at ACL 2024

点击查看摘要

Abstract:In this paper, we investigate the phenomena of “selection biases” in Large Language Models (LLMs), focusing on problems where models are tasked with choosing the optimal option from an ordered sequence. We delve into biases related to option order and token usage, which significantly impact LLMs’ decision-making processes. We also quantify the impact of these biases through an extensive empirical analysis across multiple models and tasks. Furthermore, we propose mitigation strategies to enhance model performance. Our key contributions are threefold: 1) Precisely quantifying the influence of option order and token on LLMs, 2) Developing strategies to mitigate the impact of token and order sensitivity to enhance robustness, and 3) Offering a detailed analysis of sensitivity across models and tasks, which informs the creation of more stable and reliable LLM applications for selection problems.
摘要:在本文中,我们研究了大型语言模型(LLM)中的“选择偏差”现象,重点关注模型需要从有序序列中选择最佳选项的问题。我们深入研究了与选项顺序和词元使用相关的偏差,这些偏差对 LLM 的决策过程产生了显著影响。我们还通过对多个模型和任务进行广泛的实证分析来量化这些偏差的影响。此外,我们还提出了缓解策略来提高模型性能。我们的主要贡献有三个方面:1)精确量化选项顺序和词元对 LLM 的影响;2)制定策略来减轻词元和顺序敏感性的影响,以增强稳健性;3)提供跨模型和任务的敏感性详细分析,为针对选择问题构建更稳定、更可靠的 LLM 应用提供依据。
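论文量化的“选项顺序敏感性”可以用一个简单探针来说明(假设性示例:这里用人为构造的玩具模型代替真实 LLM,一个只看内容,一个带有“首位偏差”):

```python
from itertools import permutations

def order_sensitivity(answer_fn, options) -> float:
    """在所有选项排列下询问模型,返回与多数选择不一致的排列比例:
    0 表示完全不受顺序影响,越大表示顺序偏差越强。"""
    picks = [answer_fn(list(perm)) for perm in permutations(options)]
    majority = max(set(picks), key=picks.count)
    return sum(p != majority for p in picks) / len(picks)

content_model = lambda opts: min(opts)      # 只看内容,不受顺序影响
position_model = lambda opts: opts[0]       # 总选第一个选项的“首位偏差”模型

stable = order_sensitivity(content_model, ["cat", "dog", "fox"])
biased = order_sensitivity(position_model, ["cat", "dog", "fox"])
```

对真实 LLM 做同样的排列实验,即可得到论文中这类偏差的量化指标。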

[NLP-39] DriVLMe: Enhancing LLM-based Autonomous Driving Agents with Embodied and Social Experiences
[NLP-39] DriVLMe:增强基于法学硕士的自动驾驶代理,具有良好的经验和社交体验

链接: https://arxiv.org/abs/2406.03008
作者: Yidong Huang,Jacob Sansom,Ziqiao Ma,Felix Gervits,Joyce Chai
关键词: real-world driving scenarios, Recent advancements, real-world driving, driving scenarios, foundation models
中文关键词: 现实世界驾驶场景、最新进展、现实世界驾驶、驾驶场景、基础模型
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: First Vision and Language for Autonomous Driving and Robotics Workshop (VLADR @ CVPR 2024)

点击查看摘要

Abstract:Recent advancements in foundation models (FMs) have unlocked new prospects in autonomous driving, yet the experimental settings of these studies are preliminary, over-simplified, and fail to capture the complexity of real-world driving scenarios in human environments. It remains under-explored whether FM agents can handle long-horizon navigation tasks with free-form dialogue and deal with unexpected situations caused by environmental dynamics or task changes. To explore the capabilities and boundaries of FMs faced with the challenges above, we introduce DriVLMe, a video-language-model-based agent to facilitate natural and effective communication between humans and autonomous vehicles that perceive the environment and navigate. We develop DriVLMe from both embodied experiences in a simulated environment and social experiences from real human dialogue. While DriVLMe demonstrates competitive performance in both open-loop benchmarks and closed-loop human studies, we reveal several limitations and challenges, including unacceptable inference time, imbalanced training data, limited visual understanding, challenges with multi-turn interactions, simplified language generation from robotic experiences, and difficulties in handling on-the-fly unexpected situations like environmental dynamics and task changes.
摘要:基础模型的最新进展为自动驾驶开辟了新的前景,但这些研究的实验设置都是初步的、过于简化的,无法捕捉到人类环境中真实世界驾驶场景的复杂性。FM代理是否能够通过自由对话处理长期导航任务,以及处理由环境动态或任务变化引起的意外情况,仍未得到充分探讨。为了探索FMS面对上述挑战的能力和边界,我们引入了DriVLMe,一个基于视频语言模型的代理,以促进人类与感知环境和导航的自动车辆之间的自然和有效沟通。我们开发DriVLMe既来自模拟环境中的具体体验,也来自真实人类对话的社会体验。虽然DriVLMe在开环基准和闭环人体研究中都展示了具有竞争力的表现,但我们揭示了一些限制和挑战,包括不可接受的推理时间、不平衡的训练数据、有限的视觉理解、多轮交互的挑战、从机器人经验生成简化的语言,以及在处理环境动态和任务变化等即时意外情况方面的困难。

[NLP-40] BadAgent: Inserting and Activating Backdoor Attacks in LLM Agents
[NLP-40] BadAgent:在LLM代理中插入并激活后门攻击

链接: https://arxiv.org/abs/2406.03007
作者: Yifei Wang,Dizhan Xue,Shengjie Zhang,Shengsheng Qian
关键词: powerful LLM-based intelligent, provide customized services, LLM-based intelligent agents, large language models, LLM agents
中文关键词: 强大的基于LLM的智能,提供定制服务,基于LLM的智能代理,大型语言模型,LLM代理
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注: Accepted by ACL 2024

点击查看摘要

Abstract:With the prosperity of large language models (LLMs), powerful LLM-based intelligent agents have been developed to provide customized services with a set of user-defined tools. State-of-the-art methods for constructing LLM agents adopt trained LLMs and further fine-tune them on data for the agent task. However, we show that such methods are vulnerable to our proposed backdoor attacks named BadAgent on various agent tasks, where a backdoor can be embedded by fine-tuning on the backdoor data. At test time, the attacker can manipulate the deployed LLM agents to execute harmful operations by showing the trigger in the agent input or environment. To our surprise, our proposed attack methods are extremely robust even after fine-tuning on trustworthy data. Though backdoor attacks have been studied extensively in natural language processing, to the best of our knowledge, we could be the first to study them on LLM agents that are more dangerous due to the permission to use external tools. Our work demonstrates the clear risk of constructing LLM agents based on untrusted LLMs or data. Our code is public at this https URL
摘要:随着大型语言模型的蓬勃发展,基于大型语言模型的智能代理被开发出来,通过一套用户自定义的工具来提供定制服务。构建LLM代理的最先进方法采用经过训练的LLM,并根据代理任务的数据进一步微调它们。然而,我们发现这些方法容易受到我们提出的针对各种代理任务的名为BadAgent的后门攻击,其中可以通过对后门数据进行微调来嵌入后门。在测试时,攻击者可以通过在代理输入或环境中显示触发器来操纵部署的LLM代理执行有害操作。令我们惊讶的是,即使在对可信数据进行微调之后,我们提出的攻击方法也非常健壮。虽然后门攻击在自然语言处理中已经得到了广泛的研究,但就我们所知,我们可能是第一个在LLM代理上研究它们的,这些代理由于被允许使用外部工具而更危险。我们的工作证明了基于不可信的LLM或数据构建LLM代理的明显风险。我们的代码在此HTTPS URL上是公开的

[NLP-41] Evaluation of data inconsistency for multi-modal sentiment analysis
[NLP-41] 多模式情绪分析的数据不一致性评估

链接: https://arxiv.org/abs/2406.03004
作者: Yufei Wang,Mengyue Wu
关键词: MSA involves analyzing, Emotion semantic inconsistency, MSA, MSA involves, sentiment
中文关键词: MSA涉及分析、情感语义不一致、MSA、MSA涉及、情感
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Emotion semantic inconsistency is an ubiquitous challenge in multi-modal sentiment analysis (MSA). MSA involves analyzing sentiment expressed across various modalities like text, audio, and videos. Each modality may convey distinct aspects of sentiment, due to subtle and nuanced expression of human beings, leading to inconsistency, which may hinder the prediction of artificial agents. In this work, we introduce a modality conflicting test set and assess the performance of both traditional multi-modal sentiment analysis models and multi-modal large language models (MLLMs). Our findings reveal significant performance degradation across traditional models when confronted with semantically conflicting data and point out the drawbacks of MLLMs when handling multi-modal emotion analysis. Our research presents a new challenge and offer valuable insights for the future development of sentiment analysis systems.
摘要:情感语义不一致是多模态情感分析(MSA)中普遍存在的挑战。MSA 涉及分析文本、音频和视频等各种模态所表达的情感。由于人类表达的微妙与细腻,每种模态可能传达情感的不同侧面,从而导致不一致,这可能会妨碍人工智能体的预测。在这项工作中,我们引入了一个模态冲突测试集,并评估了传统多模态情感分析模型和多模态大型语言模型(MLLM)的性能。我们的研究结果显示,当面对语义冲突的数据时,传统模型的性能会显著下降,并指出了 MLLM 在处理多模态情感分析时的缺陷。我们的研究提出了新的挑战,并为情感分析系统的未来发展提供了宝贵的见解。

[NLP-42] Readability-guided Idiom-aware Sentence Simplification (RISS) for Chinese
[NLP-42] 可读性引导的中文成语感知句子简化(RISS)

链接: https://arxiv.org/abs/2406.02974
作者: Jingshen Zhang,Xinglu Chen,Xinying Qiu,Zhimin Wang,Wenhe Feng
关键词: faces challenges due, Chinese sentence simplification, Readability-guided Idiom-aware Sentence, Readability-guided Paraphrase Selection, large-scale labeled parallel
中文关键词: 面临挑战,中文句子简化、可读性引导的习语感知句子、可读性引导的解释选择、大规模标签并行
类目: Computation and Language (cs.CL)
备注: Accepted to the 23rd China National Conference on Computational Linguistics (CCL 2024)

点击查看摘要

Abstract:Chinese sentence simplification faces challenges due to the lack of large-scale labeled parallel corpora and the prevalence of idioms. To address these challenges, we propose Readability-guided Idiom-aware Sentence Simplification (RISS), a novel framework that combines data augmentation techniques with lexcial simplification. RISS introduces two key components: (1) Readability-guided Paraphrase Selection (RPS), a method for mining high-quality sentence pairs, and (2) Idiom-aware Simplification (IAS), a model that enhances the comprehension and simplification of idiomatic expressions. By integrating RPS and IAS using multi-stage and multi-task learning strategies, RISS outperforms previous state-of-the-art methods on two Chinese sentence simplification datasets. Furthermore, RISS achieves additional improvements when fine-tuned on a small labeled dataset. Our approach demonstrates the potential for more effective and accessible Chinese text simplification.
摘要:由于缺乏大规模标注平行语料库以及习语的普遍存在,中文句子简化面临挑战。为了应对这些挑战,我们提出了可读性引导的习语感知句子简化(RISS),这是一种将数据增强技术与词汇简化相结合的新型框架。RISS 引入了两个关键组件:(1)可读性引导的复述选择(RPS),一种挖掘高质量句子对的方法;(2)习语感知简化(IAS),一种增强习语表达理解和简化的模型。通过使用多阶段和多任务学习策略集成 RPS 和 IAS,RISS 在两个中文句子简化数据集上的表现优于之前的最先进方法。此外,在小规模标注数据集上进行微调后,RISS 还能实现进一步的提升。我们的方法展示了实现更有效、更易读的中文文本简化的潜力。

[NLP-43] Filtered not Mixed: Stochastic Filtering-Based Online Gating for Mixture of Large Language Models
[NLP-43] 过滤而非混合:基于随机过滤的在线门控,用于混合大型语言模型

链接: https://arxiv.org/abs/2406.02969
作者: Raeid Saqur,Anastasis Kratsios,Florian Krach,Yannick Limmer,Jacob-Junqi Tian,John Willes,Blanka Horvath,Frank Rudzicz
关键词: Large Language Models, expert Large Language, pre-trained expert Large, Large Language, online time-series prediction
中文关键词: 大型语言模型,专家大型语言,预培训专家大型,大型语言,在线时间序列预测
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computational Finance (q-fin.CP); Mathematical Finance (q-fin.MF)
备注: 29 pages, 5 Appendix sections

点击查看摘要

Abstract:We propose MoE-F – a formalised mechanism for combining N pre-trained expert Large Language Models (LLMs) in online time-series prediction tasks by adaptively forecasting the best weighting of LLM predictions at every time step. Our mechanism leverages the conditional information in each expert’s running performance to forecast the best combination of LLMs for predicting the time series in its next step. Diverging from static (learned) Mixture of Experts (MoE) methods, MoE-F employs time-adaptive stochastic filtering techniques to combine experts. By framing the expert selection problem as a finite state-space, continuous-time Hidden Markov model (HMM), we can leverage the Wohman-Shiryaev filter. Our approach first constructs N parallel filters corresponding to each of the N individual LLMs. Each filter proposes its best combination of LLMs, given the information that they have access to. Subsequently, the N filter outputs are aggregated to optimize a lower bound for the loss of the aggregated LLMs, which can be optimized in closed-form, thus generating our ensemble predictor. Our contributions here are: (I) the MoE-F algorithm – deployable as a plug-and-play filtering harness, (II) theoretical optimality guarantees of the proposed filtering-based gating algorithm, and (III) empirical evaluation and ablative results using state of the art foundational and MoE LLMs on a real-world Financial Market Movement task where MoE-F attains a remarkable 17% absolute and 48.5% relative F1 measure improvement over the next best performing individual LLM expert.
摘要:我们提出了 MoE-F,一种形式化机制,用于在在线时间序列预测任务中组合 N 个预训练的专家大语言模型(LLM),通过在每个时间步自适应地预测 LLM 预测的最佳加权来实现。我们的机制利用每个专家运行表现中的条件信息,来预测下一步时间序列预测的最佳 LLM 组合。与静态(学习式)专家混合(MoE)方法不同,MoE-F 采用时间自适应的随机滤波技术来组合专家。通过将专家选择问题建模为有限状态空间、连续时间的隐马尔可夫模型(HMM),我们可以利用 Wohman-Shiryaev 滤波器。我们的方法首先构造 N 个并行滤波器,分别对应 N 个单独的 LLM。每个滤波器根据其可获得的信息,提出其最佳的 LLM 组合。随后,聚合 N 个滤波器的输出,以优化聚合 LLM 损失的一个下界,该下界可以以闭式形式优化,从而得到我们的集成预测器。我们的贡献包括:(I)MoE-F 算法,可作为即插即用的滤波装置部署;(II)所提出的基于滤波的门控算法的理论最优性保证;(III)使用最先进的基础 LLM 和 MoE LLM 在真实世界金融市场走势任务上的实证评估与消融结果,其中 MoE-F 相比表现次佳的单个 LLM 专家取得了显著的 17% 绝对、48.5% 相对的 F1 指标提升。
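MoE-F 的核心直觉是按各专家的近期表现自适应地加权组合预测,可以用如下草图近似说明(假设:这里用累计损失的 softmax 权重代替论文中的随机滤波后验,仅为示意,并非 Wohman-Shiryaev 滤波器本身):

```python
import math

def expert_weights(cum_losses, temperature: float = 1.0):
    """按累计损失为各专家 LLM 计算归一化权重:损失越低权重越高。"""
    scores = [math.exp(-loss / temperature) for loss in cum_losses]
    total = sum(scores)
    return [s / total for s in scores]

def ensemble_forecast(preds, cum_losses):
    """用上述权重对各专家的预测做凸组合,得到集成预测。"""
    return sum(w * p for w, p in zip(expert_weights(cum_losses), preds))

weights = expert_weights([0.2, 1.0, 3.0])                 # 三个专家的累计损失
forecast = ensemble_forecast([1.0, 0.0, -1.0], [0.2, 1.0, 3.0])
```

论文中的滤波机制在此基础上进一步利用 HMM 的状态转移结构,在线更新对“当前最优专家”的后验信念。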

[NLP-44] Docs2KG: Unified Knowledge Graph Construction from Heterogeneous Documents Assisted by Large Language Models
[NLP-44] Docs 2KG:在大型语言模型的辅助下从异类文档构建统一知识图

链接: https://arxiv.org/abs/2406.02962
作者: Qiang Sun,Yuanyi Luo,Wenxiao Zhang,Sirui Li,Jichunyang Li,Kai Niu,Xiangrui Kong,Wei Liu
关键词: accommodate heterogeneous formats, enterprise data reside, conservative estimate, heterogeneous formats, data
中文关键词: 适应异类格式、企业数据驻留、保守估计、异类格式、数据
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Even for a conservative estimate, 80% of enterprise data reside in unstructured files, stored in data lakes that accommodate heterogeneous formats. Classical search engines can no longer meet information seeking needs, especially when the task is to browse and explore for insight formulation. In other words, there are no obvious search keywords to use. Knowledge graphs, due to their natural visual appeals that reduce the human cognitive load, become the winning candidate for heterogeneous data integration and knowledge representation. In this paper, we introduce Docs2KG, a novel framework designed to extract multimodal information from diverse and heterogeneous unstructured documents, including emails, web pages, PDF files, and Excel files. Dynamically generates a unified knowledge graph that represents the extracted key information, Docs2KG enables efficient querying and exploration of document data lakes. Unlike existing approaches that focus on domain-specific data sources or pre-designed schemas, Docs2KG offers a flexible and extensible solution that can adapt to various document structures and content types. The proposed framework unifies data processing supporting a multitude of downstream tasks with improved domain interpretability. Docs2KG is publicly accessible at this https URL, and a demonstration video is available at this https URL. Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR) Cite as: arXiv:2406.02962 [cs.CL] (or arXiv:2406.02962v1 [cs.CL] for this version)
摘要:即使按保守估计,也有80%的企业数据以非结构化文件的形式存放在容纳异构格式的数据湖中。传统搜索引擎已无法满足信息检索需求,尤其当任务是通过浏览和探索来形成洞察时,换句话说,此时没有明显的搜索关键词可用。知识图凭借其天然的视觉吸引力减轻了人类的认知负荷,成为异构数据集成与知识表示的首选方案。在本文中,我们介绍了Docs2KG,这是一个新颖的框架,旨在从各种异构的非结构化文档(包括电子邮件、网页、PDF文件和Excel文件)中提取多模态信息。通过动态生成表示所提取关键信息的统一知识图,Docs2KG支持对文档数据湖的高效查询和探索。与专注于特定领域数据源或预先设计模式的现有方法不同,Docs2KG提供了一种灵活且可扩展的解决方案,能够适应各种文档结构和内容类型。所提出的框架统一了数据处理,支持大量下游任务,并提高了领域可解释性。Docs2KG可通过此HTTPS URL公开访问,演示视频可通过此HTTPS URL获得。

[NLP-45] Adversarial Moment-Matching Distillation of Large Language Models
[NLP-45] 大型语言模型的对抗性矩匹配蒸馏

链接: https://arxiv.org/abs/2406.02959
作者: Chen Jia
关键词: large language models, achieving practical benefits, Knowledge distillation, larger teacher model, language models
中文关键词: 大型语言模型,实现实际效益,知识提炼,大型教师模型,语言模型
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Knowledge distillation (KD) has been shown to be highly effective in guiding a student model with a larger teacher model and achieving practical benefits in improving the computational and memory efficiency for large language models (LLMs). State-of-the-art KD methods for LLMs mostly rely on minimizing explicit distribution distance between teacher and student probability predictions. Instead of optimizing these mandatory behaviour cloning objectives, we explore an imitation learning strategy for KD of LLMs. In particular, we minimize the imitation gap by matching the action-value moments of the teacher’s behavior from both on- and off-policy perspectives. To achieve this action-value moment-matching goal, we propose an adversarial training algorithm to jointly estimate the moment-matching distance and optimize the student policy to minimize it. Results from both task-agnostic instruction-following experiments and task-specific experiments demonstrate the effectiveness of our method and achieve new state-of-the-art performance.
摘要:知识蒸馏(KD)已被证明能有效地用较大的教师模型指导学生模型,并在提高大语言模型(LLM)的计算和内存效率方面带来实际收益。针对LLM的最新KD方法大多依赖于最小化教师与学生概率预测之间的显式分布距离。我们没有优化这些强制性的行为克隆目标,而是探索了一种用于LLM知识蒸馏的模仿学习策略。特别地,我们通过从同策略(on-policy)和异策略(off-policy)两个角度匹配教师行为的动作-价值矩来最小化模仿差距。为了实现动作-价值矩匹配的目标,我们提出了一种对抗训练算法,联合估计矩匹配距离,并优化学生策略使其最小化。任务无关的指令遵循实验和特定任务实验的结果都证明了该方法的有效性,并取得了新的最先进性能。
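下面用一个极简的 Python 片段示意"矩匹配距离"的含义(非论文官方实现;token 的"动作价值"与两种策略分布均为假设的玩具数据):

```python
import numpy as np

def moments(probs, values, ks=(1, 2)):
    """分布 probs 在取值 values 上的前若干阶矩 E[v^k]。"""
    return np.array([np.sum(probs * values**k) for k in ks])

def moment_matching_distance(p_teacher, p_student, values, ks=(1, 2)):
    """教师/学生分布在同一组动作价值上的矩差异之和(示意性度量)。"""
    diff = moments(p_teacher, values, ks) - moments(p_student, values, ks)
    return float(np.abs(diff).sum())

# 玩具示例:5 个候选 token 的"动作价值"与教师/学生两种策略分布
values = np.array([0.1, 0.5, 0.2, 0.9, 0.3])
p_teacher = np.array([0.1, 0.4, 0.1, 0.3, 0.1])
p_student = np.array([0.2, 0.2, 0.2, 0.2, 0.2])

d = moment_matching_distance(p_teacher, p_student, values)
```

论文中该距离由对抗训练联合估计并被学生策略最小化;这里仅演示"矩差异"这一核心量本身。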

[NLP-46] PrE-Text: Training Language Models on Private Federated Data in the Age of LLMs
[NLP-46] PrE-Text:LLM时代在私有联邦数据上训练语言模型

链接: https://arxiv.org/abs/2406.02958
作者: Charlie Hou,Akshat Shrivastava,Hongyuan Zhan,Rylan Conway,Trang Le,Adithya Sagar,Giulia Fanti,Daniel Lazar
关键词: training machine learning, On-device, On-device training, machine learning, training
中文关键词: 培训机器学习、设备上、设备上培训、机器学习、培训
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: ICML 2024 (Oral)

点击查看摘要

Abstract:On-device training is currently the most common approach for training machine learning (ML) models on private, distributed user data. Despite this, on-device training has several drawbacks: (1) most user devices are too small to train large models on-device, (2) on-device training is communication- and computation-intensive, and (3) on-device training can be difficult to debug and deploy. To address these problems, we propose Private Evolution-Text (PrE-Text), a method for generating differentially private (DP) synthetic textual data. First, we show that across multiple datasets, training small models (models that fit on user devices) with PrE-Text synthetic data outperforms small models trained on-device under practical privacy regimes ( \epsilon=1.29 , \epsilon=7.58 ). We achieve these results while using 9 \times fewer rounds, 6 \times less client computation per round, and 100 \times less communication per round. Second, finetuning large models on PrE-Text’s DP synthetic data improves large language model (LLM) performance on private data across the same range of privacy budgets. Altogether, these results suggest that training on DP synthetic data can be a better option than training a model on-device on private distributed data. Code is available at this https URL.
摘要:设备端训练是目前在私有、分布式用户数据上训练机器学习(ML)模型最常用的方法。尽管如此,设备端训练仍有几个缺点:(1)大多数用户设备太小,无法在设备上训练大模型;(2)设备端训练是通信和计算密集型的;(3)设备端训练难以调试和部署。为了解决这些问题,我们提出了私有进化文本(PrE-Text),一种生成差分隐私(DP)合成文本数据的方法。首先,我们证明了在多个数据集上,在实际隐私预算下(\epsilon=1.29、\epsilon=7.58),用PrE-Text合成数据训练的小模型(适合部署在用户设备上的模型)优于在设备端训练的小模型。我们在取得这些结果的同时,将轮数减少了9倍,每轮客户端计算量减少了6倍,每轮通信量减少了100倍。其次,在相同的隐私预算范围内,用PrE-Text的DP合成数据微调大模型可以提高大语言模型(LLM)在私有数据上的性能。总而言之,这些结果表明,在DP合成数据上训练可能是比在私有分布式数据上进行设备端训练更好的选择。代码可在此HTTPS URL上找到。
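Private Evolution 一类方法的核心步骤之一,是用私有样本对候选合成样本做最近邻投票,并在投票直方图上加噪声以满足差分隐私。下面是一个示意性草图(嵌入向量、噪声参数均为假设,非论文官方实现):

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_nn_histogram(private_embs, candidate_embs, noise_mult=1.0):
    """每条私有样本给最近的候选合成样本投一票,再加高斯噪声(示意)。"""
    votes = np.zeros(len(candidate_embs))
    for x in private_embs:
        dists = np.linalg.norm(candidate_embs - x, axis=1)
        votes[int(np.argmin(dists))] += 1.0
    # 加噪后的直方图可用于挑选下一轮要保留/变异的合成样本
    return votes + rng.normal(0.0, noise_mult, size=votes.shape)

# 玩具数据:20 条私有样本、5 条候选合成样本,嵌入维度 8
private_embs = rng.normal(size=(20, 8))
candidate_embs = rng.normal(size=(5, 8))
noisy_votes = dp_nn_histogram(private_embs, candidate_embs)
```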

[NLP-47] The Task-oriented Queries Benchmark (ToQB)
[NLP-47] 面向任务的查询基准(ToQB)

链接: https://arxiv.org/abs/2406.02943
作者: Keun Soo Yim
关键词: large language model, Natural Language Processing, Task-oriented queries, Task-oriented Queries Benchmark, order food
中文关键词: 大型语言模型、自然语言处理、面向任务的查询、面向任务的查询基准、点餐
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Neural and Evolutionary Computing (cs.NE)
备注: Data available on GitHub, this https URL

点击查看摘要

Abstract:Task-oriented queries (e.g., one-shot queries to play videos, order food, or call a taxi) are crucial for assessing the quality of virtual assistants, chatbots, and other large language model (LLM)-based services. However, a standard benchmark for task-oriented queries is not yet available, as existing benchmarks in the relevant NLP (Natural Language Processing) fields have primarily focused on task-oriented dialogues. Thus, we present a new methodology for efficiently generating the Task-oriented Queries Benchmark (ToQB) using existing task-oriented dialogue datasets and an LLM service. Our methodology involves formulating the underlying NLP task to summarize the original intent of a speaker in each dialogue, detailing the key steps to perform the devised NLP task using an LLM service, and outlining a framework for automating a major part of the benchmark generation process. Through a case study encompassing three domains (i.e., two single-task domains and one multi-task domain), we demonstrate how to customize the LLM prompts (e.g., omitting system utterances or speaker labels) for those three domains and characterize the generated task-oriented queries. The generated ToQB dataset is made available to the public. We further discuss new domains that can be added to ToQB by community contributors and its practical applications.
摘要:面向任务的查询(例如,播放视频、点餐或叫出租车的一次性查询)对于评估虚拟助手、聊天机器人和其他基于大型语言模型(LLM)的服务的质量至关重要。然而,目前还没有面向任务的查询的标准基准,因为相关自然语言处理领域的现有基准主要侧重于面向任务的对话。因此,我们提出了一种新的方法来使用现有的面向任务的对话数据集和LLM服务来高效地生成面向任务的查询基准(ToQB)。我们的方法包括制定基本的NLP任务以总结发言者在每次对话中的原始意图,详细说明使用LLM服务执行设计的NLP任务的关键步骤,以及概述用于自动化基准生成过程的主要部分的框架。通过一个包含三个领域(即两个单任务领域和一个多任务领域)的案例研究,我们展示了如何为这三个领域定制LLM提示(例如,省略系统话语或说话人标签),并对生成的面向任务的查询进行表征。生成的ToQB数据集向公众提供。我们进一步讨论了社区贡献者可以添加到ToQB的新域名及其实际应用。

[NLP-48] Pruner-Zero: Evolving Symbolic Pruning Metric from scratch for Large Language Models
[NLP-48] Pruner-Zero:从零开始为大型语言模型开发符号修剪指标

链接: https://arxiv.org/abs/2406.02924
作者: Peijie Dong,Lujun Li,Zhenheng Tang,Xiang Liu,Xinglin Pan,Qiang Wang,Xiaowen Chu
关键词: Large Language Models, face deployment challenges, deployment challenges due, Large Language, Language Models
中文关键词: 大型语言模型,面临部署挑战,部署挑战,大型语言,语言模型
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Neural and Evolutionary Computing (cs.NE)
备注: Accepted by ICML2024, 29 pages, 4 figures

点击查看摘要

Abstract:Despite the remarkable capabilities, Large Language Models (LLMs) face deployment challenges due to their extensive size. Pruning methods drop a subset of weights to accelerate, but many of them require retraining, which is prohibitively expensive and computationally demanding. Recently, post-training pruning approaches introduced novel metrics, enabling the pruning of LLMs without retraining. However, these metrics require the involvement of human experts and tedious trial and error. To efficiently identify superior pruning metrics, we develop an automatic framework for searching symbolic pruning metrics using genetic programming. In particular, we devise an elaborate search space encompassing the existing pruning metrics to discover the potential symbolic pruning metric. We propose an opposing operation simplification strategy to increase the diversity of the population. In this way, Pruner-Zero allows auto-generation of symbolic pruning metrics. Based on the searched results, we explore the correlation between pruning metrics and performance after pruning and summarize some principles. Extensive experiments on LLaMA and LLaMA-2 on language modeling and zero-shot tasks demonstrate that our Pruner-Zero obtains superior performance than SOTA post-training pruning methods. Code at: \urlthis https URL.
摘要:尽管大型语言模型(LLM)能力卓越,但其庞大的规模使其面临部署挑战。剪枝方法通过丢弃一部分权重来加速,但其中许多方法需要重新训练,而重新训练成本高昂且计算量巨大。最近,训练后剪枝方法引入了新颖的度量,使得无需重新训练即可剪枝LLM。然而,这些度量需要人类专家的参与以及繁琐的反复试错。为了高效地找出更优的剪枝度量,我们开发了一个使用遗传编程自动搜索符号剪枝度量的框架。特别地,我们设计了一个涵盖现有剪枝度量的精细搜索空间,以发现潜在的符号剪枝度量。我们提出了一种对立操作简化策略来增加种群的多样性。通过这种方式,Pruner-Zero可以自动生成符号剪枝度量。基于搜索结果,我们探索了剪枝度量与剪枝后性能之间的相关性,并总结了一些原则。在LLaMA和LLaMA-2的语言建模与零样本任务上的大量实验表明,我们的Pruner-Zero获得了优于SOTA训练后剪枝方法的性能。代码位于:此HTTPS URL。
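作为剪枝度量搜索空间中一个候选的示意,下面用一个类似 Wanda 的符号度量(|W| 与输入激活范数的逐元素乘积,仅为假设示例)演示"无需重新训练的训练后剪枝"流程:

```python
import numpy as np

def wanda_like_metric(W, x_norm):
    """候选符号剪枝度量之一:|W| 与各输入通道激活范数的乘积(示意)。"""
    return np.abs(W) * x_norm  # x_norm 形状 (in_dim,),按列广播

def prune_by_metric(W, scores, sparsity=0.5):
    """按度量得分置零最低的 sparsity 比例权重,无需重新训练。"""
    k = int(W.size * sparsity)
    threshold = np.sort(scores.flatten())[k]
    mask = scores >= threshold
    return W * mask, mask

rng = np.random.default_rng(1)
W = rng.normal(size=(4, 6))           # 玩具权重矩阵
x_norm = np.abs(rng.normal(size=(6,)))  # 每个输入通道的激活范数
scores = wanda_like_metric(W, x_norm)
W_pruned, mask = prune_by_metric(W, scores, sparsity=0.5)
```

Pruner-Zero 的贡献在于用遗传编程在由此类算子构成的表达式空间中自动搜索度量;这里只演示"给定一个符号度量后如何剪枝"这一步。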

[NLP-49] Text Injection for Neural Contextual Biasing
[NLP-49] 用于神经上下文偏置的文本注入

链接: https://arxiv.org/abs/2406.02921
作者: Zhong Meng,Zelin Wu,Rohit Prabhavalkar,Cal Peyser,Weiran Wang,Nanxin Chen,Tara N. Sainath,Bhuvana Ramabhadran
关键词: automatic speech recognition, effectively improves automatic, improves automatic speech, biasing effectively improves, speech recognition
中文关键词: 自动语音识别,有效提高自动性,提高自动语音,偏置有效提高,语音识别
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Audio and Speech Processing (eess.AS)
备注: 5 pages, 1 figure

点击查看摘要

Abstract:Neural contextual biasing effectively improves automatic speech recognition (ASR) for crucial phrases within a speaker’s context, particularly those that are infrequent in the training data. This work proposes contextual text injection (CTI) to enhance contextual ASR. CTI leverages not only the paired speech-text data, but also a much larger corpus of unpaired text to optimize the ASR model and its biasing component. Unpaired text is converted into speech-like representations and used to guide the model’s attention towards relevant bias phrases. Moreover, we introduce a contextual text-injected (CTI) minimum word error rate (MWER) training, which minimizes the expected WER caused by contextual biasing when unpaired text is injected into the model. Experiments show that CTI with 100 billion text sentences can achieve up to 43.3% relative WER reduction from a strong neural biasing model. CTI-MWER provides a further relative improvement of 23.5%.
摘要:神经上下文偏置有效地改善了说话者上下文中关键短语的自动语音识别(ASR),特别是那些在训练数据中不常见的短语。这项工作提出了上下文文本注入(CTI)来增强上下文ASR。CTI不仅利用配对的语音-文本数据,还利用规模大得多的非配对文本语料库来优化ASR模型及其偏置组件。非配对文本被转换为类似语音的表示,并用于引导模型将注意力集中到相关的偏置短语上。此外,我们引入了上下文文本注入(CTI)最小词错误率(MWER)训练,在将非配对文本注入模型时,最小化由上下文偏置引起的预期WER。实验表明,使用1000亿个文本句子的CTI可以在强神经偏置模型的基础上实现高达43.3%的相对WER降低。CTI-MWER进一步带来23.5%的相对改善。

[NLP-50] MultifacetEval: Multifaceted Evaluation to Probe LLMs in Mastering Medical Knowledge
[NLP-50] MultifacetEval:多方面评估,以探索LLM掌握医学知识的情况

链接: https://arxiv.org/abs/2406.02919
作者: Yuxuan Zhou,Xien Liu,Chen Ning,Ji Wu
关键词: Large language models, Large language, delivering notable performance, mastering medical knowledge, language models
中文关键词: 大型语言模型,大型语言,提供显着的性能,掌握医学知识,语言模型
类目: Computation and Language (cs.CL)
备注: Accepted by IJCAI 2024

点击查看摘要

Abstract:Large language models (LLMs) have excelled across domains, also delivering notable performance on the medical evaluation benchmarks, such as MedQA. However, there still exists a significant gap between the reported performance and the practical effectiveness in real-world medical scenarios. In this paper, we aim to explore the causes of this gap by employing a multifaceted examination schema to systematically probe the actual mastery of medical knowledge by current LLMs. Specifically, we develop a novel evaluation framework MultifacetEval to examine the degree and coverage of LLMs in encoding and mastering medical knowledge at multiple facets (comparison, rectification, discrimination, and verification) concurrently. Based on the MultifacetEval framework, we construct two multifaceted evaluation datasets: MultiDiseK (by producing questions from a clinical disease knowledge base) and MultiMedQA (by rephrasing each question from a medical benchmark MedQA into multifaceted questions). The experimental results on these multifaceted datasets demonstrate that the extent of current LLMs in mastering medical knowledge is far below their performance on existing medical benchmarks, suggesting that they lack depth, precision, and comprehensiveness in mastering medical knowledge. Consequently, current LLMs are not yet ready for application in real-world medical tasks. The codes and datasets are available at this https URL.
摘要:大型语言模型(LLM)在各个领域都表现出色,在医学评估基准(如MedQA)上也有亮眼的成绩。然而,报告的性能与真实医疗场景中的实际效果之间仍然存在显著差距。在本文中,我们旨在通过采用多方面的考查模式,系统地探查当前LLM对医学知识的实际掌握情况,以探究这种差距的成因。具体来说,我们开发了一个新颖的评估框架MultifacetEval,同时从多个方面(比较、纠正、判别和验证)考查LLM编码和掌握医学知识的程度与覆盖面。基于MultifacetEval框架,我们构建了两个多方面评估数据集:MultiDiseK(基于临床疾病知识库生成问题)和MultiMedQA(将医学基准MedQA中的每个问题改写为多方面问题)。在这些多方面数据集上的实验结果表明,当前LLM掌握医学知识的程度远低于其在现有医学基准上的表现,说明其在掌握医学知识方面缺乏深度、精确度和全面性。因此,当前的LLM尚未准备好应用于真实世界的医疗任务。代码和数据集可在此HTTPS URL上获得。

[NLP-51] Improving In-Context Learning with Prediction Feedback for Sentiment Analysis
[NLP-51] 通过情绪分析的预测反馈改善上下文学习

链接: https://arxiv.org/abs/2406.02911
作者: Hongling Xu,Qianlong Wang,Yice Zhang,Min Yang,Xi Zeng,Bing Qin,Ruifeng Xu
关键词: Large language models, Large language, achieved promising results, language models, in-context learning
中文关键词: 大型语言模型,大型语言,取得了有希望的结果,语言模型,上下文学习
类目: Computation and Language (cs.CL)
备注: Accepted by ACL 2024 (Findings)

点击查看摘要

Abstract:Large language models (LLMs) have achieved promising results in sentiment analysis through the in-context learning (ICL) paradigm. However, their ability to distinguish subtle sentiments still remains a challenge. Inspired by the human ability to adjust understanding via feedback, this paper enhances ICL by incorporating prior predictions and feedback, aiming to rectify sentiment misinterpretation of LLMs. Specifically, the proposed framework consists of three steps: (1) acquiring prior predictions of LLMs, (2) devising predictive feedback based on correctness, and (3) leveraging a feedback-driven prompt to refine sentiment understanding. Experimental results across nine sentiment analysis datasets demonstrate the superiority of our framework over conventional ICL methods, with an average F1 improvement of 5.95%.
摘要:大型语言模型(LLM)通过上下文学习(ICL)范式在情感分析方面取得了令人鼓舞的结果。然而,它们区分微妙情绪的能力仍然是一个挑战。受人类通过反馈调整理解的能力的启发,本文通过结合先前的预测和反馈来增强ICL,旨在纠正对LLM的情绪误解。具体来说,提出的框架由三个步骤组成:(1)获取LLM的先前预测,(2)基于正确性设计预测反馈,以及(3)利用反馈驱动的提示来完善情绪理解。九个情感分析数据集的实验结果证明了我们的框架优于传统ICL方法,F1平均改进为5.95%。
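下面的片段示意如何把"先前预测 + 正确性反馈"拼接进提示词(提示词措辞与字段命名均为假设,并非论文的原始模板):

```python
def build_feedback_prompt(text, prior_pred, gold_label=None):
    """把先前预测与正确性反馈拼入提示词,引导模型修正情感误判(示意)。"""
    if gold_label is None:
        feedback = "请重新审视上述预测是否抓住了细微的情感倾向。"
    elif prior_pred == gold_label:
        feedback = "先前预测是正确的。"
    else:
        feedback = f"先前预测有误,正确标签应为:{gold_label}。"
    return (
        f"句子:{text}\n"
        f"先前预测:{prior_pred}\n"
        f"反馈:{feedback}\n"
        "请结合反馈重新判断该句子的情感极性:"
    )

# 演示:对一个带双重否定的句子给出纠错反馈
prompt = build_feedback_prompt("这部电影不算难看。", "负面", gold_label="正面")
```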

[NLP-52] Open Grounded Planning: Challenges and Benchmark Construction
[NLP-52] 开放式接地规划:挑战和基准构建

链接: https://arxiv.org/abs/2406.02903
作者: Shiguang Guo,Ziliang Deng,Hongyu Lin,Yaojie Lu,Xianpei Han,Le Sun
关键词: increasingly drawn attention, open grounded planning, grounded planning, open grounded, planning
中文关键词: 日益引起关注,开放接地规划,接地规划,开放接地规划,规划
类目: Computation and Language (cs.CL)
备注: Accept to ACL 2024 main conference

点击查看摘要

Abstract:The emergence of large language models (LLMs) has increasingly drawn attention to the use of LLMs for human-like planning. Existing work on LLM-based planning either focuses on leveraging the inherent language generation capabilities of LLMs to produce free-style plans, or employs reinforcement learning approaches to learn decision-making for a limited set of actions within restricted environments. However, both approaches exhibit significant discrepancies from the open and executable requirements in real-world planning. In this paper, we propose a new planning task–open grounded planning. The primary objective of open grounded planning is to ask the model to generate an executable plan based on a variable action set, thereby ensuring the executability of the produced plan. To this end, we establishes a benchmark for open grounded planning spanning a wide range of domains. Then we test current state-of-the-art LLMs along with five planning approaches, revealing that existing LLMs and methods still struggle to address the challenges posed by grounded planning in open domains. The outcomes of this paper define and establish a foundational dataset for open grounded planning, and shed light on the potential challenges and future directions of LLM-based planning.
摘要:大型语言模型(LLM)的出现使得利用LLM进行类人规划日益受到关注。现有基于LLM的规划工作,要么专注于利用LLM固有的语言生成能力来产生自由形式的计划,要么采用强化学习方法,在受限环境中针对有限的动作集合学习决策。然而,这两种方法都与现实世界规划中开放且可执行的要求存在显著差距。本文提出了一种新的规划任务:开放接地规划(open grounded planning)。开放接地规划的主要目标是要求模型基于可变的动作集合生成可执行的计划,从而确保所生成计划的可执行性。为此,我们建立了一个跨越广泛领域的开放接地规划基准。随后,我们测试了当前最先进的LLM以及五种规划方法,结果表明现有的LLM和方法仍然难以应对开放领域中接地规划带来的挑战。本文的成果为开放接地规划定义并建立了一个基础数据集,并揭示了基于LLM的规划的潜在挑战与未来方向。
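"可执行性"约束可以用一个很简单的检查来示意:计划中的每一步都必须落在给定的可变动作集合内(动作名称为假设的玩具示例,并非基准中的真实动作):

```python
def check_plan_executable(plan, action_set):
    """检查生成计划中的每一步是否落在给定的可变动作集合内(示意)。"""
    invalid = [step for step in plan if step not in action_set]
    return len(invalid) == 0, invalid

# 玩具示例:最后一步不在动作集合内,计划整体不可执行
action_set = {"打开冰箱", "取出鸡蛋", "打蛋", "加热平底锅", "倒入蛋液"}
plan = ["打开冰箱", "取出鸡蛋", "打蛋", "召唤厨师"]
ok, invalid = check_plan_executable(plan, action_set)
```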

[NLP-53] S2GSL: Incorporating Segment to Syntactic Enhanced Graph Structure Learning for Aspect-based Sentiment Analysis
[NLP-53] S2GSL:融合片段感知与句法增强图结构学习的基于方面的情感分析

链接: https://arxiv.org/abs/2406.02902
作者: Bingfeng Chen,Qihan Ouyang,Yongqi Luo,Boyan Xu,Ruichu Cai,Zhifeng Hao
关键词: based Sentiment Analysis, Aspect based Sentiment, Previous graph-based approaches, Sentiment Analysis, static dependency trees
中文关键词: 基于情感分析、基于方面的情感、以前的基于图形的方法、情感分析、静态依赖树
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Previous graph-based approaches in Aspect based Sentiment Analysis(ABSA) have demonstrated impressive performance by utilizing graph neural networks and attention mechanisms to learn structures of static dependency trees and dynamic latent trees. However, incorporating both semantic and syntactic information simultaneously within complex global structures can introduce irrelevant contexts and syntactic dependencies during the process of graph structure learning, potentially resulting in inaccurate predictions. In order to address the issues above, we propose S ^2 GSL, incorporating Segment to Syntactic enhanced Graph Structure Learning for ABSA. Specifically,S ^2 GSL is featured with a segment-aware semantic graph learning and a syntax-based latent graph learning enabling the removal of irrelevant contexts and dependencies, respectively. We further propose a self-adaptive aggregation network that facilitates the fusion of two graph learning branches, thereby achieving complementarity across diverse structures. Experimental results on four benchmarks demonstrate the effectiveness of our framework.
摘要:在基于方面的情感分析(ABSA)中,以往基于图的方法利用图神经网络和注意力机制来学习静态依存树和动态潜在树的结构,表现出了令人印象深刻的性能。然而,在复杂的全局结构中同时纳入语义和句法信息,可能会在图结构学习过程中引入不相关的上下文和句法依存,从而导致不准确的预测。为了解决上述问题,我们提出了S^2GSL,将片段感知与句法增强的图结构学习相结合用于ABSA。具体地说,S^2GSL包含片段感知的语义图学习和基于句法的潜在图学习,分别能够去除不相关的上下文和依存关系。我们进一步提出了一种自适应聚合网络,促进两个图学习分支的融合,从而实现不同结构之间的互补。在四个基准上的实验结果证明了该框架的有效性。

[NLP-54] Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms
[NLP-54] 直接对齐算法中奖励模型过度优化的缩放定律

链接: https://arxiv.org/abs/2406.02900
作者: Rafael Rafailov,Yaswanth Chittepu,Ryan Park,Harshit Sikchi,Joey Hejna,Bradley Knox,Chelsea Finn,Scott Niekum
关键词: Large Language Models, Large Language, Human Feedback, success of Large, Reinforcement Learning
中文关键词: 大型语言模型、大型语言、人类反馈、大型成功、强化学习
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Reinforcement Learning from Human Feedback (RLHF) has been crucial to the recent success of Large Language Models (LLMs), however, it is often a complex and brittle process. In the classical RLHF framework, a reward model is first trained to represent human preferences, which is in turn used by an online reinforcement learning (RL) algorithm to optimize the LLM. A prominent issue with such methods is \emphreward over-optimization or \emphreward hacking, where performance as measured by the learned proxy reward model increases, but true quality plateaus or even deteriorates. Direct Alignment Algorithms (DAAs) like Direct Preference Optimization have emerged as alternatives to the classical RLHF pipeline by circumventing the reward modeling phase. However, although DAAs do not use a separate proxy reward model, they still commonly deteriorate from over-optimization. While the so-called reward hacking phenomenon is not well-defined for DAAs, we still uncover similar trends: at higher KL budgets, DAA algorithms exhibit similar degradation patterns to their classic RLHF counterparts. In particular, we find that DAA methods deteriorate not only across a wide range of KL budgets but also often before even a single epoch of the dataset is completed. Through extensive empirical experimentation, this work formulates and formalizes the reward over-optimization or hacking problem for DAAs and explores its consequences across objectives, training regimes, and model scales.
摘要:基于人类反馈的强化学习(RLHF)是近来大型语言模型(LLM)取得成功的关键,但它往往是一个复杂而脆弱的过程。在经典的RLHF框架中,首先训练一个奖励模型来表示人类偏好,再由在线强化学习(RL)算法利用它来优化LLM。这类方法的一个突出问题是奖励过度优化或奖励黑客(reward hacking):由学习到的代理奖励模型衡量的性能上升,而真实质量却停滞甚至恶化。直接对齐算法(DAA),如直接偏好优化(DPO),通过绕过奖励建模阶段,成为经典RLHF流程的替代方案。然而,尽管DAA不使用单独的代理奖励模型,它们仍然普遍会因过度优化而性能退化。虽然所谓的奖励黑客现象对DAA而言尚无明确定义,我们仍发现了类似的趋势:在较高的KL预算下,DAA算法表现出与经典RLHF方法相似的退化模式。特别是,我们发现DAA方法不仅在很宽的KL预算范围内退化,而且往往在数据集的一个epoch尚未完成时就已退化。通过广泛的实证实验,这项工作为DAA形式化地定义了奖励过度优化(黑客)问题,并探索了其在不同目标、训练方案和模型规模上的后果。
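在 DAA 的实验中,KL 预算是衡量策略偏离参考模型程度的核心量。下面用策略自身采样序列的逐 token 对数概率给出一个示意性的蒙特卡洛 KL 估计(玩具数据,非论文官方实现):

```python
import numpy as np

def sequence_kl(logp_policy, logp_ref):
    """在策略自身采样的序列上,KL(policy || ref) ≈ E[log p_policy - log p_ref]。
    输入形状为 (batch, seq_len) 的逐 token 对数概率(示意性估计)。"""
    return float(np.mean(np.sum(logp_policy - logp_ref, axis=-1)))

# 玩具数据:2 条序列、各 3 个 token 的对数概率
logp_policy = np.log(np.array([[0.5, 0.4, 0.6], [0.3, 0.7, 0.5]]))
logp_ref    = np.log(np.array([[0.4, 0.4, 0.5], [0.3, 0.6, 0.5]]))
kl_budget = sequence_kl(logp_policy, logp_ref)
```

策略越偏向自己更偏好的 token,该估计越大;论文观察到退化模式随此类 KL 预算升高而加重。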

[NLP-55] Language Model Can Do Knowledge Tracing: Simple but Effective Method to Integrate Language Model and Knowledge Tracing Task
[NLP-55] 语言模型可以进行知识跟踪:集成语言模型和知识跟踪任务的简单但有效的方法

链接: https://arxiv.org/abs/2406.02893
作者: Unggi Lee,Jiyeong Bae,Dohee Kim,Sookbun Lee,Jaekwon Park,Taekyung Ahn,Gunho Lee,Damji Stratton,Hyeoncheol Kim
关键词: modeling student knowledge, Knowledge Tracing, model-based Knowledge Tracing, critical task, task in online
中文关键词: 建模学生知识,知识追踪,基于模型的知识追踪,关键任务,在线任务
类目: Computation and Language (cs.CL)
备注: 11 pages, 5 figures, 3 tables

点击查看摘要

Abstract:Knowledge Tracing (KT) is a critical task in online learning for modeling student knowledge over time. Despite the success of deep learning-based KT models, which rely on sequences of numbers as data, most existing approaches fail to leverage the rich semantic information in the text of questions and concepts. This paper proposes Language model-based Knowledge Tracing (LKT), a novel framework that integrates pre-trained language models (PLMs) with KT methods. By leveraging the power of language models to capture semantic representations, LKT effectively incorporates textual information and significantly outperforms previous KT models on large benchmark datasets. Moreover, we demonstrate that LKT can effectively address the cold-start problem in KT by leveraging the semantic knowledge captured by PLMs. Interpretability of LKT is enhanced compared to traditional KT models due to its use of text-rich data. We conducted the local interpretable model-agnostic explanation technique and analysis of attention scores to interpret the model performance further. Our work highlights the potential of integrating PLMs with KT and paves the way for future research in KT domain.
摘要:知识追踪(KT)是在线学习中的一项关键任务,用于对学生随时间变化的知识进行建模。尽管基于深度学习的KT模型已经取得了成功,它依赖于数字序列作为数据,但大多数现有的方法无法利用问题和概念文本中丰富的语义信息。提出了一种基于语言模型的知识跟踪(LKT)框架,该框架将预训练的语言模型(PLM)与KT方法相结合。通过利用语言模型的能力来捕获语义表示,LKT有效地结合了文本信息,并在大型基准数据集上显著优于以前的KT模型。此外,我们还证明了LKT通过利用PLM获取的语义知识,可以有效地解决KT中的冷启动问题。由于LKT使用了丰富的文本数据,因此与传统的KT模型相比,它的可解释性得到了提高。我们采用了局部可解释的模型不可知性解释技术和注意力得分分析来进一步解释模型的性能。我们的工作突出了将PLM与KT相结合的潜力,并为KT领域的未来研究铺平了道路。

[NLP-56] HYDRA: Model Factorization Framework for Black-Box LLM Personalization
[NLP-56] HYDRA:黑盒LLM个性化的模型分解框架

链接: https://arxiv.org/abs/2406.02888
作者: Yuchen Zhuang,Haotian Sun,Yue Yu,Qifan Wang,Chao Zhang,Bo Dai
关键词: modern intelligent systems, delivering tailored experiences, critical research area, mining users’ behavioral, users’ behavioral history
中文关键词: 现代智能系统,提供量身定制的体验,关键研究领域,挖掘用户的行为,用户的行为历史
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 24 pages, 6 figures, work in progress

点击查看摘要

Abstract:Personalization has emerged as a critical research area in modern intelligent systems, focusing on mining users’ behavioral history and adapting to their preferences for delivering tailored experiences. Despite the remarkable few-shot capabilities exhibited by black-box large language models (LLMs), the inherent opacity of their model parameters presents significant challenges in aligning the generated output with individual expectations. Existing solutions have primarily focused on prompt design to incorporate user-specific profiles and behaviors; however, such approaches often struggle to generalize effectively due to their inability to capture shared knowledge among all users. To address these challenges, we propose HYDRA, a model factorization framework that captures both user-specific behavior patterns from historical data and shared general knowledge among all users to deliver personalized generation. In order to capture user-specific behavior patterns, we first train a reranker to prioritize the most useful information from top-retrieved relevant historical records. By combining the prioritized history with the corresponding query, we train an adapter to align the output with individual user-specific preferences, eliminating the reliance on access to inherent model parameters of black-box LLMs. Both the reranker and the adapter can be decomposed into a base model with multiple user-specific heads, resembling a hydra. The base model maintains shared knowledge across users, while the multiple personal heads capture user-specific preferences. Experimental results demonstrate that HYDRA outperforms existing state-of-the-art prompt-based methods by an average relative improvement of 9.01% across five diverse personalization tasks in the LaMP benchmark. Our implementation is available at this https URL.
摘要:个性化已成为现代智能系统中的一个关键研究领域,其重点是挖掘用户的行为历史并适应其偏好,以提供量身定制的体验。尽管黑盒大型语言模型(LLM)表现出了出色的少样本能力,但其模型参数固有的不透明性给使生成输出与个体期望对齐带来了巨大挑战。现有解决方案主要集中于通过提示设计来融入用户特定的画像和行为;然而,由于无法捕获所有用户之间的共享知识,此类方法通常难以有效泛化。为了应对这些挑战,我们提出了HYDRA,一个模型分解框架,它既从历史数据中捕获用户特定的行为模式,又捕获所有用户之间共享的通用知识,以实现个性化生成。为了捕获用户特定的行为模式,我们首先训练一个重排序器,从检索到的相关历史记录中优先选出最有用的信息。通过将按优先级排序的历史与相应查询相结合,我们训练一个适配器使输出与个体用户的偏好对齐,从而无需访问黑盒LLM的内部模型参数。重排序器和适配器都可以分解为带有多个用户特定头部的基础模型,形似九头蛇(hydra)。基础模型维护用户间的共享知识,而多个个性化头部捕获用户特定的偏好。实验结果表明,在LaMP基准的五个不同个性化任务上,HYDRA比现有最先进的基于提示的方法平均相对提升9.01%。我们的实现可通过此HTTPS URL获得。

[NLP-57] PLaD: Preference-based Large Language Model Distillation with Pseudo-Preference Pairs
[NLP-57] PLaD:具有伪偏好对的基于偏好的大型语言模型蒸馏

链接: https://arxiv.org/abs/2406.02886
作者: Rongzhi Zhang,Jiaming Shen,Tianqi Liu,Haorui Wang,Zhen Qin,Feng Han,Jialu Liu,Simon Baumgartner,Michael Bendersky,Chao Zhang
关键词: Large Language Models, exhibited impressive capabilities, vast parameter sizes, parameter sizes restrict, Large Language
中文关键词: 大型语言模型,表现出令人印象深刻的功能,巨大的参数大小,参数大小限制,大型语言
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Findings of ACL 2024

点击查看摘要

Abstract:Large Language Models (LLMs) have exhibited impressive capabilities in various tasks, yet their vast parameter sizes restrict their applicability in resource-constrained settings. Knowledge distillation (KD) offers a viable solution by transferring expertise from large teacher models to compact student models. However, traditional KD techniques face specific challenges when applied to LLMs, including restricted access to LLM outputs, significant teacher-student capacity gaps, and the inherited mis-calibration issue. In this work, we present PLaD, a novel preference-based LLM distillation framework. PLaD exploits the teacher-student capacity discrepancy to generate pseudo-preference pairs where teacher outputs are preferred over student outputs. Then, PLaD leverages a ranking loss to re-calibrate student’s estimation of sequence likelihood, which steers the student’s focus towards understanding the relative quality of outputs instead of simply imitating the teacher. PLaD bypasses the need for access to teacher LLM’s internal states, tackles the student’s expressivity limitations, and mitigates the student mis-calibration issue. Through extensive experiments on two sequence generation tasks and with various LLMs, we demonstrate the effectiveness of our proposed PLaD framework.
摘要:大型语言模型(LLM)在各种任务中表现出了令人印象深刻的能力,但其庞大的参数规模限制了它们在资源受限环境中的适用性。知识蒸馏(KD)通过将专业知识从大型教师模型迁移到紧凑的学生模型,提供了一个可行的解决方案。然而,传统的KD技术在应用于LLM时面临特定挑战,包括对LLM输出的访问受限、显著的师生容量差距,以及由此继承的校准失准问题。在这项工作中,我们提出了PLaD,一种新颖的基于偏好的LLM蒸馏框架。PLaD利用师生容量差异来生成伪偏好对,其中教师输出优于学生输出。然后,PLaD利用排序损失重新校准学生对序列似然的估计,引导学生关注输出的相对质量,而不是简单地模仿教师。PLaD无需访问教师LLM的内部状态,解决了学生表达能力的限制,并缓解了学生校准失准的问题。通过在两个序列生成任务和多种LLM上的大量实验,我们证明了所提出的PLaD框架的有效性。
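PLaD 中的排序损失思想可以用一个 DPO 风格的 logistic 排序损失来示意:学生模型应给"教师输出(偏好)"比"学生自身输出(次优)"更高的序列对数似然(下面的数值与具体损失形式均为假设,仅作示意):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def pseudo_preference_loss(logp_win, logp_lose):
    """对伪偏好对的 logistic 排序损失:-log σ(logp_win - logp_lose)。
    logp_win/logp_lose 是学生模型对偏好/次优序列的对数似然(示意)。"""
    return -math.log(sigmoid(logp_win - logp_lose))

# 学生把教师输出排得更高时,损失明显更小
low = pseudo_preference_loss(logp_win=-10.0, logp_lose=-14.0)
high = pseudo_preference_loss(logp_win=-14.0, logp_lose=-10.0)
```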

[NLP-58] Outdated Issue Aware Decoding for Factual Knowledge Editing
[NLP-58] 过时的问题意识解码以实现事实知识编辑

链接: https://arxiv.org/abs/2406.02882
作者: Zengkui Sun,Yijin Liu,Jiaan Wang,Fandong Meng,Jinan Xu,Yufeng Chen,Jie Zhou
关键词: received increasing attention, Editing has received, Knowledge Editing, edited knowledge, reasoning questions
中文关键词: 受到越来越多的关注,编辑收到,知识编辑,编辑知识,推理问题
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: ACL2024 Findings

点击查看摘要

Abstract:Recently, Knowledge Editing has received increasing attention, since it could update the specific knowledge from outdated ones in pretrained models without re-training. However, as pointed out by recent studies, existing related methods tend to merely memorize the superficial word composition of the edited knowledge, rather than truly learning and absorbing it. Consequently, on the reasoning questions, we discover that existing methods struggle to utilize the edited knowledge to reason the new answer, and tend to retain outdated responses, which are generated by the original models utilizing original knowledge. Nevertheless, the outdated responses are unexpected for the correct answers to reasoning questions, which we named as the outdated issue. To alleviate this issue, in this paper, we propose a simple yet effective decoding strategy, i.e., outDated ISsue aware deCOding (DISCO), to enhance the performance of edited models on reasoning questions. Specifically, we capture the difference in the probability distribution between the original and edited models. Further, we amplify the difference of the token prediction in the edited model to alleviate the outdated issue, and thus enhance the model performance w.r.t the edited knowledge. Experimental results suggest that applying DISCO could enhance edited models to reason, e.g., on reasoning questions, DISCO outperforms the prior SOTA method by 12.99 F1 scores, and reduces the ratio of the outdated issue to 5.78% on the zsRE dataset.
摘要:近年来,知识编辑受到越来越多的关注,因为它可以在无需重新训练的情况下,更新预训练模型中过时的特定知识。然而,正如最近的研究指出的,现有相关方法往往只是记住了被编辑知识的表面词语组合,而不是真正地学习和吸收它。因此,在推理问题上,我们发现现有方法难以利用编辑后的知识推理出新答案,而倾向于保留由原始模型利用原始知识生成的过时回答。然而,对于推理问题的正确答案而言,这些过时回答是不合期望的,我们将其命名为过时问题(outdated issue)。为了缓解这一问题,本文提出了一种简单而有效的解码策略,即过时问题感知解码(DISCO),以提高编辑后模型在推理问题上的性能。具体地说,我们捕捉原始模型与编辑后模型之间概率分布的差异,并放大编辑后模型中token预测的差异以缓解过时问题,从而提升模型在编辑知识上的性能。实验结果表明,应用DISCO可以增强编辑后模型的推理能力,例如在推理问题上,DISCO比先前的SOTA方法高出12.99个F1分,并在zsRE数据集上将过时问题的比例降低到5.78%。
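"捕捉并放大编辑后模型与原模型的分布差异"可以用一个对比解码式的 logits 调整来示意(具体公式与系数 alpha 为假设,非论文官方实现):

```python
import numpy as np

def disco_adjust(logits_edited, logits_original, alpha=1.0):
    """放大编辑后模型与原模型在 token 预测上的差异(示意):
    logits' = logits_edited + alpha * (logits_edited - logits_original)。"""
    return logits_edited + alpha * (logits_edited - logits_original)

# 玩具例子:token 0 是过时答案,token 2 是编辑后的新答案;
# 编辑后模型仍略偏向过时答案,但其相对原模型的"增量"集中在新答案上
logits_edited   = np.array([2.3, 0.5, 2.2])
logits_original = np.array([2.5, 0.5, 1.0])
adjusted = disco_adjust(logits_edited, logits_original, alpha=1.0)
```

调整前编辑后模型的 argmax 仍落在过时答案(token 0)上;放大差异后,argmax 转向新答案(token 2)。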

[NLP-59] LCS: A Language Converter Strategy for Zero-Shot Neural Machine Translation
[NLP-59] LCS:一种用于零样本神经机器翻译的语言转换器策略

链接: https://arxiv.org/abs/2406.02876
作者: Zengkui Sun,Yijin Liu,Fandong Meng,Jinan Xu,Yufeng Chen,Jie Zhou
关键词: Multilingual neural machine, models generally distinguish, neural machine translation, generally distinguish translation, machine translation models
中文关键词: 多语言神经机器,模型一般区分,神经机器翻译,一般区分翻译,机器翻译模型
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: ACL2024 Findings

点击查看摘要

Abstract:Multilingual neural machine translation models generally distinguish translation directions by the language tag (LT) in front of the source or target sentences. However, current LT strategies cannot indicate the desired target language as expected on zero-shot translation, i.e., the off-target issue. Our analysis reveals that the indication of the target language is sensitive to the placement of the target LT. For example, when placing the target LT on the decoder side, the indication would rapidly degrade along with decoding steps, while placing the target LT on the encoder side would lead to copying or paraphrasing the source input. To address the above issues, we propose a simple yet effective strategy named Language Converter Strategy (LCS). By introducing the target language embedding into the top encoder layers, LCS mitigates confusion in the encoder and ensures stable language indication for the decoder. Experimental results on MultiUN, TED, and OPUS-100 datasets demonstrate that LCS could significantly mitigate the off-target issue, with language accuracy up to 95.28%, 96.21%, and 85.35%, meanwhile outperforming the vanilla LT strategy by 3.07, 3.3, and 7.93 BLEU scores on zero-shot translation, respectively.
摘要:多语言神经机器翻译模型通常通过源句或目标句前面的语言标签(LT)来区分翻译方向。然而,目前的LT策略在零样本翻译中并不能像预期那样指示目标语言,即出现偏离目标(off-target)问题。我们的分析表明,目标语言的指示对目标LT的放置位置很敏感。例如,当将目标LT放在解码器侧时,该指示会随着解码步骤迅速退化;而将目标LT放在编码器侧则会导致复制或改写源输入。针对上述问题,我们提出了一种简单而有效的策略–语言转换器策略(LCS)。通过将目标语言嵌入引入顶层编码器层,LCS减轻了编码器中的混淆,并确保为解码器提供稳定的语言指示。在MultiUN、TED和OPUS-100数据集上的实验结果表明,LCS能够显著缓解偏离目标问题,语言准确率分别达到95.28%、96.21%和85.35%,同时在零样本翻译上分别比普通LT策略高出3.07、3.3和7.93个BLEU分。
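
LCS 的关键设计是只在编码器的若干顶层注入目标语言嵌入。下面用 NumPy 给出一个注入位置的极简示意(层的可调用表示与 top_k 取值均为本文假设,仅演示"在哪些层相加"这一点):

```python
import numpy as np

def encode_with_lcs(x, layers, tgt_lang_emb, top_k=2):
    # x: (seq_len, d_model) source embeddings; layers: encoder layers as
    # callables h -> h.  The target-language embedding is added only in
    # the top `top_k` layers, mirroring the Language Converter Strategy.
    h = x
    n = len(layers)
    for i, layer in enumerate(layers):
        if i >= n - top_k:
            h = h + tgt_lang_emb   # inject the language signal near the top
        h = layer(h)
    return h
```

用恒等层验证注入位置:三层编码器、top_k=2 时,语言嵌入恰好被累加两次;top_k=0 则完全不注入。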

[NLP-60] NUMCoT: Numerals and Units of Measurement in Chain-of-Thought Reasoning using Large Language Models
[NLP-60] NUMCoT:使用大型语言模型的思想链推理中的数字和测量单位

链接: https://arxiv.org/abs/2406.02864
作者: Ancheng Xu,Minghuan Tan,Lei Wang,Min Yang,Ruifeng Xu
关键词: Large Language Models, conjoined topics, topics in activities, activities of human, mutual effects
中文关键词: 大型语言模型、联合主题、活动主题、人类活动、相互影响
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Findings of ACL 2024

点击查看摘要

Abstract:Numeral systems and units of measurement are two conjoined topics in activities of human beings and have mutual effects with the languages expressing them. Currently, the evaluation of Large Language Models (LLMs) often involves mathematical reasoning, yet little attention is given to how minor changes in numbers or units can drastically alter the complexity of problems and the performance of LLMs. In this paper, we scrutinize existing LLMs on processing of numerals and units of measurement by constructing datasets with perturbations. We first anatomize the reasoning of math word problems to different sub-procedures like numeral conversions from language to numbers and measurement conversions based on units. Then we further annotate math word problems from ancient Chinese arithmetic works which are challenging in numerals and units of measurement. Experiments on perturbed datasets demonstrate that LLMs still encounter difficulties in handling numeral and measurement conversions.
摘要:数字系统和计量单位是人类活动中相互关联的两个话题,并与表达它们的语言相互影响。目前,对大型语言模型(LLM)的评估通常涉及数学推理,但很少有人关注数字或单位的微小变化会如何显著改变问题的复杂性和LLM的性能。在本文中,我们通过构造带有扰动的数据集,仔细审查了现有LLM对数字和计量单位的处理。我们首先将数学应用题的推理分解为不同的子过程,如从语言到数字的数字转换和基于单位的度量转换。然后,我们进一步标注了来自中国古代算术著作的数学应用题,它们在数字和计量单位方面具有挑战性。在扰动数据集上的实验表明,LLM在处理数字和度量转换时仍然遇到困难。
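
论文构造扰动数据集的思路之一是保持语义不变、仅改写数字与单位。下面是一个简化的单位改写草图(单位换算表与函数名均为本文假设,仅处理整数量):

```python
import re

# Illustrative conversion table, not the paper's actual resource.
UNIT_SCALE = {("km", "m"): 1000, ("kg", "g"): 1000, ("h", "min"): 60}

def perturb_units(problem, src, dst):
    # Rewrite every "<integer> <src>" mention as the equivalent amount in
    # <dst>, yielding a semantically identical but numerically perturbed
    # math word problem.
    scale = UNIT_SCALE[(src, dst)]
    repl = lambda m: f"{int(m.group(1)) * scale} {dst}"
    return re.sub(rf"(\d+)\s*{re.escape(src)}\b", repl, problem)
```

例如把"3 km"改写为"3000 m":题目的答案不变,但数字规模和单位都发生了扰动,可用于测试模型的稳健性。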

[NLP-61] LLM as a Scorer: The Impact of Output Order on Dialogue Evaluation
[NLP-61] LLM作为评分者:输出顺序对对话评估的影响

链接: https://arxiv.org/abs/2406.02863
作者: Yi-Pei Chen,KuanChao Chu,Hideki Nakayama
关键词: large language models, research investigates, investigates the effect, large language, language models
中文关键词: 大型语言模型,研究调查,调查效果,大型语言,语言模型
类目: Computation and Language (cs.CL)
备注: Presented in AAAI 2024 Spring Symposium. The first two authors contributed equally

点击查看摘要

Abstract:This research investigates the effect of prompt design on dialogue evaluation using large language models (LLMs). While LLMs are increasingly used for scoring various inputs, creating effective prompts for dialogue evaluation remains challenging due to model sensitivity and subjectivity in dialogue assessments. Our study experimented with different prompt structures, altering the sequence of output instructions and including explanatory reasons. We found that the order of presenting reasons and scores significantly influences LLMs’ scoring, with a “reason-first” approach yielding more comprehensive evaluations. This insight is crucial for enhancing the accuracy and consistency of LLM-based evaluations.
摘要:本研究调查了提示设计对使用大型语言模型(LLM)进行对话评估的影响。虽然LLM越来越多地被用于对各种输入进行评分,但由于模型敏感性和对话评估中的主观性,为对话评估设计有效的提示仍然具有挑战性。我们的研究实验了不同的提示结构,改变输出指令的顺序并加入解释性理由。我们发现,呈现理由和分数的顺序会显著影响LLM的评分,"理由优先"的方法可以产生更全面的评估。这一见解对于提高基于LLM的评估的准确性和一致性至关重要。
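
所谓输出顺序,可以用两个仅交换字段次序的提示模板来说明。以下草图中的字段措辞为本文假设,并非论文的原始提示:

```python
def build_eval_prompt(dialogue, reason_first=True):
    # Contrast the two output orders studied in the paper: "reason-first"
    # asks the model to explain before committing to a score.
    fields = ["Reason:", "Score (1-5):"] if reason_first else ["Score (1-5):", "Reason:"]
    return ("Evaluate the last response in the dialogue below.\n"
            f"{dialogue}\n"
            "Answer using exactly this order of fields:\n" + "\n".join(fields))
```

两个模板除字段顺序外完全相同,因此评分差异可归因于"先给理由还是先给分数"本身。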

[NLP-62] Xmodel-LM Technical Report
[NLP-62] Xmodel-LM技术报告

链接: https://arxiv.org/abs/2406.02856
作者: Yichuan Wang,Yang Liu,Yu Yan,Xucheng Huang,Ling Jiang
关键词: trillion tokens, compact and efficient, language model pre-trained, Chinese and English, introduce Xmodel-LM
中文关键词: 万亿代币,紧凑高效,语言模型预训练,中文和英文,引入Xmodel-LM
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduce Xmodel-LM, a compact and efficient 1.1B language model pre-trained on over 2 trillion tokens. Trained on our self-built dataset (Xdata), which balances Chinese and English corpora based on downstream task optimization, Xmodel-LM exhibits remarkable performance despite its smaller size. It notably surpasses existing open-source language models of similar scale. Our model checkpoints and code are publicly accessible on GitHub at this https URL.
摘要:我们介绍了Xmodel-LM,这是一个紧凑而高效的1.1B参数语言模型,在超过2万亿个词元上进行了预训练。Xmodel-LM在我们自建的数据集(Xdata)上训练,该数据集基于下游任务优化平衡了中英文语料。尽管规模较小,Xmodel-LM仍表现出出色的性能,明显优于现有的类似规模的开源语言模型。我们的模型检查点和代码可在GitHub上通过此https URL公开访问。

[NLP-63] Item-Language Model for Conversational Recommendation
[NLP-63] 对话式推荐的项语言模型

链接: https://arxiv.org/abs/2406.02844
作者: Li Yang,Anushya Subbiah,Hardik Patel,Judith Yue Li,Yanwei Song,Reza Mirghaderi,Vikram Aggarwal
关键词: complex dialogue understanding, Large-language Models, dialogue understanding, extremely successful, successful at tasks
中文关键词: 复杂的对话理解,大语言模型,对话理解,非常成功,成功完成任务
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注: 15 pages, 3 figures

点击查看摘要

Abstract:Large-language Models (LLMs) have been extremely successful at tasks like complex dialogue understanding, reasoning and coding due to their emergent abilities. These emergent abilities have been extended with multi-modality to include image, audio, and video capabilities. Recommender systems, on the other hand, have been critical for information seeking and item discovery needs. Recently, there have been attempts to apply LLMs for recommendations. One difficulty of current attempts is that the underlying LLM is usually not trained on the recommender system data, which largely contains user interaction signals and is often not publicly available. Another difficulty is user interaction signals often have a different pattern from natural language text, and it is currently unclear if the LLM training setup can learn more non-trivial knowledge from interaction signals compared with traditional recommender system methods. Finally, it is difficult to train multiple LLMs for different use-cases, and to retain the original language and reasoning abilities when learning from recommender system data. To address these three limitations, we propose an Item-Language Model (ILM), which is composed of an item encoder to produce text-aligned item representations that encode user interaction signals, and a frozen LLM that can understand those item representations with preserved pretrained knowledge. We conduct extensive experiments which demonstrate both the importance of the language-alignment and of user interaction knowledge in the item encoder.
摘要:大型语言模型(LLM)由于其涌现能力,在复杂对话理解、推理和编码等任务中取得了极大的成功。这些涌现能力已经扩展到多模态,包括图像、音频和视频能力。另一方面,推荐系统对于信息搜索和物品发现需求至关重要。最近,有人尝试将LLM应用于推荐。当前尝试的一个困难是,底层LLM通常没有在推荐系统数据上进行训练,这些数据主要包含用户交互信号,并且通常不公开。另一个困难是用户交互信号通常具有与自然语言文本不同的模式,目前尚不清楚与传统推荐系统方法相比,LLM训练设置能否从交互信号中学到更多非平凡的知识。最后,很难针对不同的用例训练多个LLM,也很难在从推荐系统数据学习时保留原有的语言和推理能力。为了解决这三个局限,我们提出了项-语言模型(ILM),它由两部分组成:一个项编码器,用于生成编码用户交互信号的文本对齐项表示;以及一个冻结的LLM,它能在保留预训练知识的同时理解这些项表示。我们进行了大量实验,证明了语言对齐和用户交互知识在项编码器中的重要性。

[NLP-64] Efficient Minimum Bayes Risk Decoding using Low-Rank Matrix Completion Algorithms
[NLP-64] 使用低阶矩阵完成算法的高效最小Bayes风险解码

链接: https://arxiv.org/abs/2406.02832
作者: Firas Trabelsi,David Vilar,Mara Finkelstein,Markus Freitag
关键词: Minimum Bayes Risk, Minimum Bayes, Bayes Risk, quadratic computational complexity, computational complexity limits
中文关键词: 最小Bayes风险、最小Bayes、Bayes风险、二次计算复杂性、计算复杂性限制
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Minimum Bayes Risk (MBR) decoding is a powerful decoding strategy widely used for text generation tasks, but its quadratic computational complexity limits its practical application. This paper presents a novel approach for approximating MBR decoding using matrix completion techniques, focusing on the task of machine translation. We formulate MBR decoding as a matrix completion problem, where the utility metric scores between candidate hypotheses and pseudo-reference translations form a low-rank matrix. First, we empirically show that the scores matrices indeed have a low-rank structure. Then, we exploit this by only computing a random subset of the scores and efficiently recover the missing entries in the matrix by applying the Alternating Least Squares (ALS) algorithm, thereby enabling a fast approximation of the MBR decoding process. Our experimental results on machine translation tasks demonstrate that the proposed method requires 1/16 utility metric computations compared to vanilla MBR decoding while achieving equal translation quality measured by COMET22 on the WMT22 dataset (ende and enru). We also benchmark our method against other approximation methods and we show gains in quality when comparing to them.
摘要:最小贝叶斯风险(MBR)解码是一种广泛用于文本生成任务的强大解码策略,但其二次方的计算复杂度限制了其实际应用。本文针对机器翻译任务,提出了一种利用矩阵补全技术近似MBR解码的新方法。我们将MBR解码表述为一个矩阵补全问题,其中候选假设与伪参考翻译之间的效用度量分数构成一个低秩矩阵。首先,我们通过实验证明分数矩阵确实具有低秩结构。然后,我们利用这一点,只计算分数的随机子集,并应用交替最小二乘(ALS)算法高效地恢复矩阵中缺失的条目,从而实现MBR解码过程的快速近似。我们在机器翻译任务上的实验结果表明,与普通的MBR解码相比,该方法只需1/16的效用度量计算,同时在WMT22数据集(ende和enru)上获得了以COMET22衡量的相同翻译质量。我们还将我们的方法与其他近似方法进行了基准比较,结果表明质量有所提升。
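
上述流程可以概括为:只对随机抽取的(假设, 伪参考)对打分,用交替最小二乘(ALS)补全其余条目,再按补全后的行均值选出最佳假设。以下是一个可运行的 NumPy 草图(秩、迭代次数与正则系数均为本文假设,并非论文的具体设置):

```python
import numpy as np

def als_complete(M, mask, rank=2, iters=50, lam=0.1):
    # Alternating least squares on the observed entries (mask == 1),
    # with ridge regularization for stability.
    n, m = M.shape
    rng = np.random.default_rng(0)
    U, V = rng.normal(size=(n, rank)), rng.normal(size=(m, rank))
    for _ in range(iters):
        for i in range(n):
            idx = mask[i] == 1
            A = V[idx].T @ V[idx] + lam * np.eye(rank)
            U[i] = np.linalg.solve(A, V[idx].T @ M[i, idx])
        for j in range(m):
            idx = mask[:, j] == 1
            A = U[idx].T @ U[idx] + lam * np.eye(rank)
            V[j] = np.linalg.solve(A, U[idx].T @ M[idx, j])
    return U @ V.T

def approx_mbr(score_fn, hypotheses, references, keep=0.5, seed=0):
    # Score only a random subset of (hypothesis, reference) pairs,
    # complete the rest with ALS, then pick the best hypothesis by
    # mean completed utility.
    rng = np.random.default_rng(seed)
    n, m = len(hypotheses), len(references)
    mask = (rng.random((n, m)) < keep).astype(int)
    M = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            if mask[i, j]:
                M[i, j] = score_fn(hypotheses[i], references[j])
    full = als_complete(M, mask)
    return int(full.mean(axis=1).argmax())
```

玩具例子里效用函数取秩为 1 的乘积形式,补全后的行均值排序与完整计算一致;实际应用中 score_fn 对应昂贵的神经效用度量。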

[NLP-65] Too Big to Fail: Larger Language Models are Disproportionately Resilient to Induction of Dementia-Related Linguistic Anomalies
[NLP-65] Too Big to Fail:更大的语言模型对痴呆症相关语言异常的诱导具有不成比例的弹性

链接: https://arxiv.org/abs/2406.02830
作者: Changye Li,Zhecheng Sheng,Trevor Cohen,Serguei Pakhomov
关键词: artificial neural networks, neural networks grow, grow in complexity, increasingly challenging, healthcare applications
中文关键词: 人工神经网络,神经网络不断发展,复杂性不断增加,越来越具有挑战性,医疗保健应用
类目: Computation and Language (cs.CL)
备注: Accepted to ACL 2024 findings

点击查看摘要

Abstract:As artificial neural networks grow in complexity, understanding their inner workings becomes increasingly challenging, which is particularly important in healthcare applications. The intrinsic evaluation metrics of autoregressive neural language models (NLMs), perplexity (PPL), can reflect how “surprised” an NLM model is at novel input. PPL has been widely used to understand the behavior of NLMs. Previous findings show that changes in PPL when masking attention layers in pre-trained transformer-based NLMs reflect linguistic anomalies associated with Alzheimer’s disease dementia. Building upon this, we explore a novel bidirectional attention head ablation method that exhibits properties attributed to the concepts of cognitive and brain reserve in human brain studies, which postulate that people with more neurons in the brain and more efficient processing are more resilient to neurodegeneration. Our results show that larger GPT-2 models require a disproportionately larger share of attention heads to be masked/ablated to display degradation of similar magnitude to masking in smaller models. These results suggest that the attention mechanism in transformer models may present an analogue to the notions of cognitive and brain reserve and could potentially be used to model certain aspects of the progression of neurodegenerative disorders and aging.
摘要:随着人工神经网络的复杂性增加,理解其内部工作原理变得越来越具有挑战性,这在医疗保健应用中尤为重要。自回归神经语言模型(NLM)的内在评价指标困惑度(PPL)可以反映NLM模型对新输入的"惊讶"程度。PPL已被广泛用于理解NLM的行为。先前的研究表明,在预训练的基于Transformer的NLM中屏蔽注意力层时,PPL的变化反映了与阿尔茨海默病痴呆相关的语言异常。在此基础上,我们探索了一种新的双向注意力头消融方法,其表现出的特性与人类大脑研究中认知储备和大脑储备的概念相对应;这些概念假设,大脑中神经元更多、处理更高效的人对神经退化更有韧性。我们的结果表明,较大的GPT-2模型需要被屏蔽/消融不成比例地更大份额的注意力头,才会表现出与较小模型在屏蔽下相似幅度的退化。这些结果表明,Transformer模型中的注意力机制可能与认知储备和大脑储备的概念存在类比,并有可能被用于模拟神经退行性疾病进展和衰老的某些方面。
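
该研究的测量逻辑依赖两个简单的量:困惑度(PPL)与被屏蔽注意力头的比例。下面给出两个极简的 NumPy 辅助函数作为示意(输入形状约定为本文假设,并非论文代码):

```python
import numpy as np

def perplexity(token_logprobs):
    # PPL = exp(mean negative log-likelihood); higher means the model is
    # more "surprised" by the input.
    return float(np.exp(-np.mean(token_logprobs)))

def mask_heads(attn_out, n_heads, frac):
    # Zero out the first `frac` share of heads in a concatenated
    # multi-head attention output of shape (seq_len, n_heads * head_dim).
    out = attn_out.copy()
    head_dim = out.shape[1] // n_heads
    k = int(frac * n_heads)
    out[:, : k * head_dim] = 0.0
    return out
```

将 mask_heads 依次作用于各层注意力输出并重新计算 PPL,即可近似复现论文中"屏蔽比例-PPL 退化"曲线的测量思路。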

[NLP-66] Exploring Robustness in Doctor-Patient Conversation Summarization: An Analysis of Out-of-Domain SOAP Notes
[NLP-66] 探索医患对话摘要中的鲁棒性:域外SOAP笔记分析

链接: https://arxiv.org/abs/2406.02826
作者: Yu-Wen Chen,Julia Hirschberg
关键词: Summarizing medical conversations, poses unique challenges, unique challenges due, collecting in-domain training, in-domain training data
中文关键词: 总结医疗对话,带来独特的挑战,独特的挑战,收集领域内培训,领域内培训数据
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Clinical NLP Workshop 2024

点击查看摘要

Abstract:Summarizing medical conversations poses unique challenges due to the specialized domain and the difficulty of collecting in-domain training data. In this study, we investigate the performance of state-of-the-art doctor-patient conversation generative summarization models on the out-of-domain data. We divide the summarization model of doctor-patient conversation into two configurations: (1) a general model, without specifying subjective (S), objective (O), assessment (A), and plan (P) notes; (2) a SOAP-oriented model that generates a summary with SOAP sections. We analyzed the limitations and strengths of the fine-tuning language model-based methods and GPTs on both configurations. We also conducted a Linguistic Inquiry and Word Count analysis to compare the SOAP notes from different datasets. The results exhibit a strong correlation for reference notes across different datasets, indicating that format mismatch (i.e., discrepancies in word distribution) is not the main cause of performance decline on out-of-domain data. Lastly, a detailed analysis of SOAP notes is included to provide insights into missing information and hallucinations introduced by the models.
摘要:由于领域的专业性以及收集领域内训练数据的困难,对医疗对话进行摘要面临独特的挑战。在这项研究中,我们考察了最先进的医患对话生成式摘要模型在域外数据上的性能。我们将医患对话的摘要模型分为两种配置:(1)通用模型,不区分主观(S)、客观(O)、评估(A)和计划(P)部分;(2)面向SOAP的模型,生成带有SOAP各部分的摘要。我们分析了基于微调语言模型的方法和GPT在这两种配置上的局限性和优势。我们还进行了语言查询与字数统计(LIWC)分析,以比较不同数据集的SOAP笔记。结果显示,不同数据集的参考笔记具有很强的相关性,表明格式不匹配(即词分布的差异)不是域外数据性能下降的主要原因。最后,我们对SOAP笔记进行了详细分析,以揭示模型引入的信息缺失和幻觉。

[NLP-67] Chain of Agents: Large Language Models Collaborating on Long-Context Tasks
[NLP-67] 代理链:大型语言模型在长上下文任务上协作

链接: https://arxiv.org/abs/2406.02818
作者: Yusen Zhang,Ruoxi Sun,Yanfei Chen,Tomas Pfister,Rui Zhang,Sercan Ö. Arik
关键词: Large Language Models, Addressing the challenge, effectively processing long, Language Models, Large Language
中文关键词: 大型语言模型,应对挑战,有效处理长时间,语言模型,大型语言
类目: Computation and Language (cs.CL)
备注: 19 pages, 6 figures

点击查看摘要

Abstract:Addressing the challenge of effectively processing long contexts has become a critical issue for Large Language Models (LLMs). Two common strategies have emerged: 1) reducing the input length, such as retrieving relevant chunks by Retrieval-Augmented Generation (RAG), and 2) expanding the context window limit of LLMs. However, both strategies have drawbacks: input reduction has no guarantee of covering the part with needed information, while window extension struggles with focusing on the pertinent information for solving the task. To mitigate these limitations, we propose Chain-of-Agents (CoA), a novel framework that harnesses multi-agent collaboration through natural language to enable information aggregation and context reasoning across various LLMs over long-context tasks. CoA consists of multiple worker agents who sequentially communicate to handle different segmented portions of the text, followed by a manager agent who synthesizes these contributions into a coherent final output. CoA processes the entire input by interleaving reading and reasoning, and it mitigates long context focus issues by assigning each agent a short context. We perform comprehensive evaluation of CoA on a wide range of long-context tasks in question answering, summarization, and code completion, demonstrating significant improvements by up to 10% over strong baselines of RAG, Full-Context, and multi-agent LLMs.
摘要:如何有效处理长上下文已成为大型语言模型(LLM)的一个关键问题。目前有两种常见策略:1)减少输入长度,例如通过检索增强生成(RAG)检索相关文本块;2)扩大LLM的上下文窗口限制。然而,这两种策略都有缺点:减少输入无法保证覆盖包含所需信息的部分,而扩大窗口则难以聚焦于解决任务所需的相关信息。为了缓解这些限制,我们提出了智能体链(Chain-of-Agents,CoA),这是一种新框架,通过自然语言实现多智能体协作,从而在长上下文任务中跨各种LLM进行信息聚合和上下文推理。CoA由多个工作智能体组成,它们依次通信以处理文本的不同分段,随后由一个管理智能体将这些贡献合成为连贯的最终输出。CoA通过交替进行阅读和推理来处理整个输入,并通过为每个智能体分配较短的上下文来缓解长上下文聚焦问题。我们在问答、摘要和代码补全等一系列长上下文任务上对CoA进行了全面评估,结果表明其比RAG、全上下文和多智能体LLM的强基线最高提升10%。
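
CoA 的流程可以概括为如下骨架:工作智能体依次读取文本块并更新传递的"通信单元",管理智能体最后汇总作答。其中 llm(prompt) -> str 代表任意文本补全接口,提示词措辞为本文假设:

```python
def chain_of_agents(llm, document, query, chunk_size=1000):
    # Worker agents read successive chunks, each updating a running
    # "communication unit" (cu); a manager agent synthesizes the answer.
    chunks = [document[i:i + chunk_size] for i in range(0, len(document), chunk_size)]
    cu = ""
    for chunk in chunks:
        cu = llm(f"Previous notes: {cu}\nText: {chunk}\nQuestion: {query}\n"
                 "Update the notes with any evidence relevant to the question.")
    return llm(f"Notes: {cu}\nQuestion: {query}\nAnswer:")
```

这样每次 LLM 调用只看到一个短上下文加上前序笔记,而不是整篇长文档;对长度为 2500 的文档、块大小 1000,共需 3 次工作调用和 1 次管理调用。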

[NLP-68] ACCORD: Closing the Commonsense Measurability Gap
[NLP-68] ACCORD:缩小常识可测量性差距

链接: https://arxiv.org/abs/2406.02804
作者: François Roewer-Després,Jinyue Feng,Zining Zhu,Frank Rudzicz
关键词: large language models, multi-hop counterfactuals, ACCORD, language models
中文关键词: 大型语言模型、多跳反事实、ACCORD、语言模型
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: For leaderboard and dataset download, see this https URL For source code, see this https URL

点击查看摘要

Abstract:We present ACCORD, a framework and benchmark suite for disentangling the commonsense grounding and reasoning abilities of large language models (LLMs) through controlled, multi-hop counterfactuals. ACCORD introduces formal elements to commonsense reasoning to explicitly control and quantify reasoning complexity beyond the typical 1 or 2 hops. Uniquely, ACCORD can automatically generate benchmarks of arbitrary reasoning complexity, and so it scales with future LLM improvements. Benchmarking state-of-the-art LLMs – including GPT-4o (2024-05-13), Llama-3-70B-Instruct, and Mixtral-8x22B-Instruct-v0.1 – shows performance degrading to random chance with only moderate scaling, leaving substantial headroom for improvement. We release a leaderboard of the benchmark suite tested in this work, as well as code for automatically generating more complex benchmarks.
摘要:我们提出了ACCORD,这是一个框架和基准套件,用于通过受控的多跳反事实来解开大型语言模型(LLM)的常识基础能力和推理能力。ACCORD将形式化元素引入常识推理,以显式控制和量化超出典型1跳或2跳的推理复杂性。独特的是,ACCORD可以自动生成任意推理复杂性的基准,因此可以随未来LLM的改进而扩展。对最先进的LLM(包括GPT-4o(2024-05-13)、Llama-3-70B-Instruct和Mixtral-8x22B-Instruct-v0.1)进行的基准测试表明,只需适度提高复杂度,模型性能便退化至随机水平,留下了巨大的改进空间。我们发布了这项工作中测试的基准套件的排行榜,以及自动生成更复杂基准的代码。

[NLP-69] Promotional Language and the Adoption of Innovative Ideas in Science
[NLP-69] 宣传语言和科学中创新思想的采用

链接: https://arxiv.org/abs/2406.02798
作者: Hao Peng,Huilian Sophie Qiu,Henrik Barslund Fosse,Brian Uzzi
关键词: promotional language, innovative ideas communicated, Novo Nordisk Foundation, language, promotional
中文关键词: 宣传语言、传达的创新想法、诺和诺德基金会、语言、宣传
类目: Digital Libraries (cs.DL); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:How are the merits of innovative ideas communicated in science? Here we conduct semantic analyses of grant application success with a focus on scientific promotional language, which has been growing in frequency in many contexts and purportedly may convey an innovative idea’s originality and significance. Our analysis attempts to surmount limitations of prior studies by examining the full text of tens of thousands of both funded and unfunded grants from three leading public and private funding agencies: the NIH, the NSF, and the Novo Nordisk Foundation, one of the world’s largest private science foundations. We find a robust association between promotional language and the support and adoption of innovative ideas by funders and other scientists. First, the percentage of promotional language in a grant proposal is associated with up to a doubling of the grant’s probability of being funded. Second, a grant’s promotional language reflects its intrinsic level of innovativeness. Third, the percentage of promotional language predicts the expected citation and productivity impact of publications that are supported by funded grants. Lastly, a computer-assisted experiment that manipulates the promotional language in our data demonstrates how promotional language can communicate the merit of ideas through cognitive activation. With the incidence of promotional language in science steeply rising, and the pivotal role of grants in converting promising and aspirational ideas into solutions, our analysis provides empirical evidence that promotional language is associated with effectively communicating the merits of innovative scientific ideas.
摘要:创新思想的优点是如何在科学中传播的?在这里,我们对拨款申请的成功进行语义分析,重点放在科学宣传性语言上;这种语言在许多场景下出现的频率越来越高,据称可以传达一个创新想法的原创性和意义。我们的分析试图通过检查来自三个主要公共和私人资助机构(NIH、NSF以及世界上最大的私人科学基金会之一诺和诺德基金会)的数万份获资助和未获资助的申请书全文,来克服先前研究的局限性。我们发现,宣传性语言与资助者和其他科学家对创新想法的支持和采用之间存在稳健的关联。首先,申请书中宣传性语言的百分比与申请获得资助的概率最高翻倍相关。其次,申请书的宣传性语言反映了其内在的创新水平。第三,宣传性语言的百分比可以预测受资助项目所支持的出版物的预期引用和生产率影响。最后,一个操纵我们数据中宣传性语言的计算机辅助实验展示了宣传性语言如何通过认知激活来传达想法的优点。随着宣传性语言在科学中的使用频率急剧上升,以及拨款在将有希望、有抱负的想法转化为解决方案方面的关键作用,我们的分析提供了经验证据,表明宣传性语言与有效传达创新科学想法的优点相关。
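
文中的核心自变量是申请书中宣传性语言所占的百分比。下面是一个极简的词频统计草图(词表仅作演示,并非论文实际使用的词典):

```python
# Hypothetical promotional lexicon, for illustration only.
PROMO_TERMS = {"novel", "unprecedented", "groundbreaking", "unique", "transformative"}

def promo_share(text):
    # Percentage of words drawn from the (hypothetical) promotional lexicon.
    words = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
    hits = sum(w in PROMO_TERMS for w in words)
    return 100.0 * hits / max(len(words), 1)
```

例如句子 "A novel and unique method." 中 5 个词里有 2 个宣传词,占比 40%;实际研究使用的是经过整理的完整词典与数万份申请书全文。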

[NLP-70] ArguMentor: Augmenting User Experiences with Counter-Perspectives
[NLP-70] ArguMentor:通过反观点增强用户体验

链接: https://arxiv.org/abs/2406.02795
作者: Priya Pitre,Kurt Luther
关键词: chambers in society, make them susceptible, susceptible to confirmation, confirmation bias, bias and echo
中文关键词: 社会中的房间,使它们容易受到确认、确认偏见、偏见和回声的影响
类目: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Opinion pieces often represent only one side of any story, which can influence users and make them susceptible to confirmation bias and echo chambers in society. Moreover, humans are also bad at reading long articles – often indulging in idle reading and re-reading. To solve this, we design ArguMentor, an end-to-end system that highlights claims in opinion pieces, generates counter-arguments for them using an LLM, and generates a context-based summary of the passage based on current events. It further enhances user interaction and understanding through additional features like QA bot, DebateMe and highlighting trigger windows. Our survey and results show that users can generate more counterarguments and on an average have more neutralized views after engaging with the system.
摘要:观点文章通常只代表任何故事的一面,这可能会影响用户,使他们容易受到确认偏见和社会回音室的影响。此外,人类也不擅长阅读长文章,经常漫不经心地读了又读。为了解决这个问题,我们设计了ArguMentor,这是一个端到端系统,它可以突出显示观点文章中的主张,使用LLM为其生成反论点,并基于时事生成文章的上下文摘要。它还通过问答机器人(QA bot)、DebateMe和高亮触发窗口等额外功能进一步增强用户的交互和理解。我们的调查和结果表明,用户在使用该系统后可以提出更多的反论点,平均而言观点也更加中立。

[NLP-71] Language Models can Infer Action Semantics for Classical Planners from Environment Feedback
[NLP-71] 语言模型可以从环境反馈推断经典规划者的动作语义

链接: https://arxiv.org/abs/2406.02791
作者: Wang Zhu,Ishika Singh,Robin Jia,Jesse Thomason
关键词: approaches guarantee finding, planning approaches guarantee, Large Language Models, Classical planning approaches, approaches guarantee
中文关键词: 方法保证发现,规划方法保证,大型语言模型,经典规划方法,方法保证
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Classical planning approaches guarantee finding a set of actions that can achieve a given goal state when possible, but require an expert to specify logical action semantics that govern the dynamics of the environment. Researchers have shown that Large Language Models (LLMs) can be used to directly infer planning steps based on commonsense knowledge and minimal domain information alone, but such plans often fail on execution. We bring together the strengths of classical planning and LLM commonsense inference to perform domain induction, learning and validating action pre- and post-conditions based on closed-loop interactions with the environment itself. We propose PSALM, which leverages LLM inference to heuristically complete partial plans emitted by a classical planner given partial domain knowledge, as well as to infer the semantic rules of the domain in a logical language based on environment feedback after execution. Our analysis on 7 environments shows that with just one expert-curated example plans, using LLMs as heuristic planners and rule predictors achieves lower environment execution steps and environment resets than random exploration while simultaneously recovering the underlying ground truth action semantics of the domain.
摘要:经典的规划方法确保在可能的情况下找到一组能够实现给定目标状态的动作,但需要专家指定管理环境动态的逻辑动作语义。研究人员已经证明,大型语言模型(LLM)可以用于仅基于常识和最小领域信息直接推断规划步骤,但此类计划在执行时往往会失败。我们结合了经典规划和LLM常识推理的优点,基于与环境本身的闭环交互,执行域归纳、学习和验证操作前置条件和后置条件。我们提出了PSALM,它利用LLM推理来启发式地完成给定部分领域知识的经典规划者发出的部分计划,并在执行后基于环境反馈以逻辑语言推断领域的语义规则。我们对7个环境的分析表明,在只有一个专家精选的示例计划的情况下,使用LLM作为启发式计划器和规则预测器,在恢复领域潜在的基本事实动作语义的同时,获得了比随机探索更少的环境执行步骤和环境重置。

[NLP-72] Disentangling Logic: The Role of Context in Large Language Model Reasoning Capabilities
[NLP-72] 理清逻辑:上下文在大型语言模型推理能力中的作用

链接: https://arxiv.org/abs/2406.02787
作者: Wenyue Hua,Kaijie Zhu,Lingyao Li,Lizhou Fan,Shuhang Lin,Mingyu Jin,Haochen Xue,Zelong Li,JinDong Wang,Yongfeng Zhang
关键词: systematically disentangle pure, disentangle pure logic, study intends, intends to systematically, systematically disentangle
中文关键词: 系统地解开纯粹,解开纯粹逻辑,研究意图,意图系统地、系统地解开
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 22 pages, 9 figures

点击查看摘要

Abstract:This study intends to systematically disentangle pure logic reasoning and text understanding by investigating the contrast across abstract and contextualized logical problems from a comprehensive set of domains. We explore whether LLMs demonstrate genuine reasoning capabilities across various domains when the underlying logical structure remains constant. We focus on two main questions (1) Can abstract logical problems alone accurately benchmark an LLM’s reasoning ability in real-world scenarios, disentangled from contextual support in practical settings? (2) Does fine-tuning LLMs on abstract logic problem generalize to contextualized logic problems and vice versa? To investigate these questions, we focus on standard propositional logic, specifically propositional deductive and abductive logic reasoning. In particular, we construct instantiated datasets for deductive and abductive reasoning with 4 levels of difficulty, encompassing 12 distinct categories or domains based on the categorization of Wikipedia. Our experiments aim to provide insights into disentangling context in logical reasoning and the true reasoning capabilities of LLMs and their generalization potential. The code and dataset are available at: this https URL.
摘要:本研究旨在通过考察来自一组广泛领域的抽象与情境化逻辑问题之间的对比,系统地理清纯逻辑推理和文本理解。我们探索当底层逻辑结构保持不变时,LLM是否在各个领域都表现出真正的推理能力。我们主要关注两个问题:(1)在脱离实际场景中上下文支持的情况下,仅靠抽象逻辑问题能否准确衡量LLM在现实世界场景中的推理能力?(2)在抽象逻辑问题上微调LLM能否泛化到情境化逻辑问题,反之亦然?为了研究这些问题,我们关注标准命题逻辑,特别是命题演绎和溯因逻辑推理。特别地,我们基于维基百科的分类,构建了具有4个难度级别的演绎和溯因推理实例化数据集,涵盖12个不同的类别或领域。我们的实验旨在深入了解逻辑推理中的上下文解缠、LLM的真实推理能力及其泛化潜力。代码和数据集可在此https URL获得。

[NLP-73] Aligning Large Language Models via Fine-grained Supervision
[NLP-73] 通过细粒度监督调整大型语言模型

链接: https://arxiv.org/abs/2406.02756
作者: Dehong Xu,Liang Qiu,Minseok Kim,Faisal Ladhak,Jaeyoung Do
关键词: Pre-trained large-scale language, producing coherent articles, Pre-trained large-scale, large-scale language models, excel at producing
中文关键词: 预先训练的大规模语言,产生连贯的文章,预先训练的大规模、大规模语言模型,擅长产生
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Pre-trained large-scale language models (LLMs) excel at producing coherent articles, yet their outputs may be untruthful, toxic, or fail to align with user expectations. Current approaches focus on using reinforcement learning with human feedback (RLHF) to improve model alignment, which works by transforming coarse human preferences of LLM outputs into a feedback signal that guides the model learning process. However, because this approach operates on sequence-level feedback, it lacks the precision to identify the exact parts of the output affecting user preferences. To address this gap, we propose a method to enhance LLM alignment through fine-grained token-level supervision. Specifically, we ask annotators to minimally edit less preferred responses within the standard reward modeling dataset to make them more favorable, ensuring changes are made only where necessary while retaining most of the original content. The refined dataset is used to train a token-level reward model, which is then used for training our fine-grained Proximal Policy Optimization (PPO) model. Our experiment results demonstrate that this approach can achieve up to an absolute improvement of 5.1% in LLM performance, in terms of win rate against the reference model, compared with the traditional PPO model.
摘要:经过预训练的大规模语言模型(LLM)擅长生成连贯的文章,但它们的输出可能不真实、有毒,或者与用户的期望不符。目前的方法侧重于使用基于人类反馈的强化学习(RLHF)来改善模型对齐,其做法是将人类对LLM输出的粗粒度偏好转换为指导模型学习过程的反馈信号。然而,由于这种方法作用于序列级反馈,它缺乏识别输出中究竟哪些部分影响用户偏好的精度。为了弥补这一差距,我们提出了一种通过细粒度词元级监督来增强LLM对齐的方法。具体地说,我们要求标注者在标准奖励建模数据集中对不太受欢迎的回复进行最小限度的编辑,使其更受欢迎,确保仅在必要处进行修改,同时保留大部分原始内容。改进后的数据集被用来训练词元级奖励模型,随后用于训练我们的细粒度近端策略优化(PPO)模型。实验结果表明,与传统的PPO模型相比,该方法在相对参考模型的胜率上可使LLM性能获得高达5.1%的绝对提升。
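
词元级监督可以从"编辑前/编辑后"回复对中恢复:被标注者最小编辑触及的词元记为 1,其余记为 0。下面用标准库 difflib 给出一个草图(标签约定为本文假设,并非论文的具体实现):

```python
import difflib

def edited_token_labels(original, edited):
    # Label each token of the less-preferred response: 1 if the
    # annotator's minimal edit touched it, 0 otherwise.  Such labels
    # could serve as targets for a token-level reward model.
    o, e = original.split(), edited.split()
    labels = [0] * len(o)
    matcher = difflib.SequenceMatcher(a=o, b=e)
    for tag, i1, i2, _, _ in matcher.get_opcodes():
        if tag != "equal":
            for i in range(i1, i2):
                labels[i] = 1
    return labels
```

例如把 "wrong" 改为 "correct" 时,只有该词元被标为 1,这正体现了"仅在必要处修改"的最小编辑假设。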

[NLP-74] RATT: A Thought Structure for Coherent and Correct LLM Reasoning
[NLP-74] RATT:连贯和正确的LLM推理的思想结构

链接: https://arxiv.org/abs/2406.02746
作者: Jinghan Zhang,Xiting Wang,Weijieying Ren,Lu Jiang,Dongjie Wang,Kunpeng Liu
关键词: Large Language Models, Large Language, Retrieval Augmented Thoughts, gain substantial reasoning, Augmented Thought Tree
中文关键词: 大型语言模型、大型语言、检索增强思想、获得实质性推理、增强思想树
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) gain substantial reasoning and decision-making capabilities from thought structures. However, existing methods such as Tree of Thought and Retrieval Augmented Thoughts often fall short in complex tasks due to the limitations of insufficient local retrieval of factual knowledge and inadequate global selection of strategies. These limitations make it challenging for these methods to balance factual accuracy and comprehensive logical optimization effectively. To address these limitations, we introduce the Retrieval Augmented Thought Tree (RATT), a novel thought structure that considers both overall logical soundness and factual correctness at each step of the thinking process. Specifically, at every point of a thought branch, RATT performs planning and lookahead to explore and evaluate multiple potential reasoning steps, and integrate the fact-checking ability of Retrieval-Augmented Generation (RAG) with LLM’s ability to assess overall strategy. Through this combination of factual knowledge and strategic feasibility, the RATT adjusts and integrates the thought tree structure to search for the most promising branches within the search space. This thought structure significantly enhances the model’s coherence in logical inference and efficiency in decision-making, and thus increases the limit of the capacity of LLM to generate reliable inferences and decisions based on thought structures. A broad range of experiments on different types of tasks showcases that the RATT structure significantly outperforms existing methods in factual correctness and logical coherence.
摘要:大型语言模型(LLM)从思维结构中获得了大量的推理和决策能力。然而,现有的方法(如思维树和检索增强思维)由于事实知识的局部检索不足和策略的全局选择不足,在复杂任务中往往表现欠佳。这些局限使这些方法难以有效地平衡事实准确性和全面的逻辑优化。为了解决这些局限,我们引入了检索增强思维树(RATT),这是一种新颖的思维结构,它在思维过程的每一步都同时考虑总体逻辑合理性和事实正确性。具体地说,在思维分支的每个节点上,RATT都执行规划和前瞻,以探索和评估多个潜在的推理步骤,并将检索增强生成(RAG)的事实核查能力与LLM评估整体策略的能力相结合。通过这种事实知识与策略可行性的结合,RATT调整并整合思维树结构,在搜索空间内寻找最有前途的分支。这种思维结构显著增强了模型逻辑推理的连贯性和决策的效率,从而提升了LLM基于思维结构生成可靠推理和决策的能力上限。对不同类型任务的广泛实验表明,RATT结构在事实正确性和逻辑连贯性方面显著优于现有方法。

[NLP-75] Textless Acoustic Model with Self-Supervised Distillation for Noise-Robust Expressive Speech-to-Speech Translation
[NLP-75] 用于噪声鲁棒的表现性语音到语音翻译的自监督蒸馏无文本声学模型

链接: https://arxiv.org/abs/2406.02733
作者: Min-Jae Hwang,Ilia Kulikov,Benjamin Peloquin,Hongyu Gong,Peng-Jen Chen,Ann Lee
关键词: textless acoustic model, textless acoustic, noise-robust expressive, acoustic model, propose a textless
中文关键词: 无文本声学模型,无文本声学,抗噪表现力,声学模型,提出无文本
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Accepted to ACL 2024 (findings)

点击查看摘要

Abstract:In this paper, we propose a textless acoustic model with a self-supervised distillation strategy for noise-robust expressive speech-to-speech translation (S2ST). Recently proposed expressive S2ST systems have achieved impressive expressivity preservation performances by cascading unit-to-speech (U2S) generator to the speech-to-unit translation model. However, these systems are vulnerable to the presence of noise in input speech, which is an assumption in real-world translation scenarios. To address this limitation, we propose a U2S generator that incorporates a distillation with no label (DINO) self-supervised training strategy into it’s pretraining process. Because the proposed method captures noise-agnostic expressivity representation, it can generate qualified speech even in noisy environment. Objective and subjective evaluation results verified that the proposed method significantly improved the performance of the expressive S2ST system in noisy environments while maintaining competitive performance in clean environments.
摘要:本文提出了一种基于自监督蒸馏策略的无文本声学模型,用于抗噪的表现性语音到语音翻译(S2ST)。最近提出的表现性S2ST系统通过将单元到语音(U2S)生成器级联到语音到单元翻译模型,获得了令人印象深刻的表现力保持性能。然而,这些系统容易受到输入语音中噪声的影响,而噪声在现实世界翻译场景中是普遍存在的。为了解决这一局限性,我们提出了一种U2S生成器,该生成器将无标签蒸馏(DINO)自监督训练策略融入到其预训练过程中。由于该方法捕获了与噪声无关的表现力表示,因此即使在噪声环境下也能生成合格的语音。客观和主观评价结果表明,该方法在保持在清洁环境中具有竞争力的性能的同时,显著提高了表现性S2ST系统在噪声环境中的性能。
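DINO式自监督蒸馏的一个核心部件是教师网络参数按学生网络参数做指数滑动平均(EMA)更新。下面用纯Python给出该更新规则的极简示意(动量取值与参数名均为示意性假设,并非论文实现):

```python
# DINO 风格自监督蒸馏中的教师 EMA 更新示意:
#   teacher <- m * teacher + (1 - m) * student
# 这里用字典模拟参数张量,m(动量)取值仅为演示。

def ema_update(teacher, student, m=0.9):
    """按动量 m 将教师参数向学生参数滑动平均。"""
    return {k: m * teacher[k] + (1 - m) * student[k] for k in teacher}

teacher = {"w": 1.0}
student = {"w": 0.0}
teacher = ema_update(teacher, student, m=0.9)
print(teacher)  # {'w': 0.9}
```

实际系统中该更新发生在每个训练步之后,且动量通常随训练进程逐渐趋近于 1。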

[NLP-76] Self-Control of LLM Behaviors by Compressing Suffix Gradient into Prefix Controller
[NLP-76] 通过将后缀梯度压缩到前缀控制器中实现LLM行为的自我控制

链接: https://arxiv.org/abs/2406.02721
作者: Min Cai,Yuchen Zhang,Shichang Zhang,Fan Yin,Difan Zou,Yisong Yue,Ziniu Hu
关键词: explicit human annotations, method utilizing suffix, large language models, utilizing suffix gradients, human annotations
中文关键词: 显式人类注释、利用后缀的方法、大型语言模型、利用后缀梯度、人类注释
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 41 pages, 12 figures, 61 tables; Website: this https URL

点击查看摘要

Abstract:We propose Self-Control, a novel method utilizing suffix gradients to control the behavior of large language models (LLMs) without explicit human annotations. Given a guideline expressed in suffix string and the model’s self-assessment of adherence, Self-Control computes the gradient of this self-judgment concerning the model’s hidden states, directly influencing the auto-regressive generation process towards desired behaviors. To enhance efficiency, we introduce Self-Control_prefix, a compact module that encapsulates the learned representations from suffix gradients into a Prefix Controller, facilitating inference-time control for various LLM behaviors. Our experiments demonstrate Self-Control’s efficacy across multiple domains, including emotional modulation, ensuring harmlessness, and enhancing complex reasoning. Especially, Self-Control_prefix enables a plug-and-play control and jointly controls multiple attributes, improving model outputs without altering model parameters or increasing inference-time costs.
摘要:我们提出了Self-Control,一种利用后缀梯度来控制大型语言模型(LLM)行为的新方法,无需显式的人类标注。给定以后缀字符串表达的行为准则以及模型对遵循程度的自我评估,Self-Control计算该自我判断关于模型隐藏状态的梯度,直接引导自回归生成过程朝向期望行为。为了提高效率,我们引入了Self-Control_prefix,这是一个紧凑的模块,它将从后缀梯度学到的表示封装到前缀控制器中,便于在推理时控制各种LLM行为。我们的实验证明了Self-Control在多个领域的有效性,包括情绪调节、确保无害和增强复杂推理。特别是,Self-Control_prefix支持即插即用控制并可联合控制多个属性,在不改变模型参数或增加推理时间成本的情况下改善模型输出。
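后缀梯度的核心思路,即对"自我判断"分数关于隐藏状态求梯度并沿其方向微调隐藏状态,可以用一个玩具标量函数作数值近似来示意(其中的"自评函数"、步长与迭代次数均为虚构假设,并非论文中LLM的真实自我评估):

```python
# Self-Control 后缀梯度思想的玩具示意:把"模型自评分数"看作隐藏状态 h 的
# 标量函数 f(h),用中心差分估计梯度,再让 h 沿梯度上升方向移动。
# judge 只是虚构的替代品,真实系统中该分数由 LLM 的自我评估给出。

def num_grad(f, h, eps=1e-5):
    g = []
    for i in range(len(h)):
        hp, hm = list(h), list(h)
        hp[i] += eps
        hm[i] -= eps
        g.append((f(hp) - f(hm)) / (2 * eps))
    return g

def self_control_step(f, h, lr=0.1):
    return [x + lr * gi for x, gi in zip(h, num_grad(f, h))]

judge = lambda h: -(h[0] - 1.0) ** 2 - (h[1] + 0.5) ** 2  # 玩具"合规分数"
h = [0.0, 0.0]
for _ in range(50):
    h = self_control_step(judge, h)
# h 逐步逼近该玩具评分函数的最优隐藏状态 [1.0, -0.5]
```

真实实现中梯度由自动微分给出,而论文的Self-Control_prefix进一步把这些梯度学到的表示压缩进一个前缀控制器,避免推理时反复求梯度。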

[NLP-77] Block Transformer: Global-to-Local Language Modeling for Fast Inference
[NLP-77] 块Transformer:用于快速推理的全球到本地语言建模

链接: https://arxiv.org/abs/2406.02657
作者: Namgyu Ho,Sangmin Bae,Taehyeon Kim,Hyunjik Jo,Yireun Kim,Tal Schuster,Adam Fisch,James Thorne,Se-Young Yun
关键词: adopts hierarchical, Block Transformer architecture, paper presents, Block Transformer, autoregressive transformers
中文关键词: 采用分层、块Transformer架构,论文提出、块变换器、自回归变换器
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 30 pages, 21 figures, 5 tables

点击查看摘要

Abstract:This paper presents the Block Transformer architecture which adopts hierarchical global-to-local modeling to autoregressive transformers to mitigate the inference bottlenecks of self-attention. To apply self-attention, the key-value (KV) cache of all previous sequences must be retrieved from memory at every decoding step. Thereby, this KV cache IO becomes a significant bottleneck in batch inference. We notice that these costs stem from applying self-attention on the global context, therefore we isolate the expensive bottlenecks of global modeling to lower layers and apply fast local modeling in upper layers. To mitigate the remaining costs in the lower layers, we aggregate input tokens into fixed size blocks and then apply self-attention at this coarse level. Context information is aggregated into a single embedding to enable upper layers to decode the next block of tokens, without global attention. Free of global attention bottlenecks, the upper layers can fully utilize the compute hardware to maximize inference throughput. By leveraging global and local modules, the Block Transformer architecture demonstrates 10-20x gains in inference throughput compared to vanilla transformers with equivalent perplexity. Our work introduces a new approach to optimize language model inference through novel application of global-to-local modeling. Code is available at this https URL.
摘要:本文提出了Block Transformer架构,它将分层的全局到局部建模引入自回归Transformer,以缓解自注意力的推理瓶颈。为了应用自注意力,必须在每个解码步骤从内存中检索所有先前序列的键值(KV)缓存。因此,这种KV缓存IO成为批量推理中的一个重要瓶颈。我们注意到,这些开销源于在全局上下文上应用自注意力,因此我们将全局建模的昂贵瓶颈隔离到较低层,并在较高层应用快速的局部建模。为了减少较低层中的剩余开销,我们将输入令牌聚合到固定大小的块中,然后在这个粗粒度级别上应用自注意力。上下文信息被聚合到单个嵌入中,使上层无需全局注意力即可解码下一个令牌块。摆脱全局注意力瓶颈后,上层可以充分利用计算硬件来最大化推理吞吐量。通过结合全局和局部模块,Block Transformer架构在困惑度相当的情况下,推理吞吐量比普通Transformer提高了10-20倍。我们的工作通过全局到局部建模的新应用,引入了一种优化语言模型推理的新方法。代码可在此https URL获取。
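摘要中"将输入令牌聚合为固定大小的块"这一步可以用均值池化作一个极简示意(均值池化只是示意性假设,论文中的块嵌入器可能采用不同的聚合方式):

```python
# 把 token 嵌入按固定 block_size 分块并做均值池化,得到供下层
# 全局注意力使用的块嵌入。这里用元组列表模拟嵌入矩阵。

def block_embed(token_embs, block_size):
    blocks = []
    for start in range(0, len(token_embs), block_size):
        chunk = token_embs[start:start + block_size]
        dim = len(chunk[0])
        blocks.append(tuple(sum(t[d] for t in chunk) / len(chunk)
                            for d in range(dim)))
    return blocks

tokens = [(1.0, 0.0), (3.0, 0.0), (0.0, 2.0), (0.0, 4.0)]
print(block_embed(tokens, block_size=2))  # [(2.0, 0.0), (0.0, 3.0)]
```

聚合后,序列长度缩短为原来的 1/block_size,下层全局注意力及其 KV 缓存的开销也随之按块数而非令牌数伸缩。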

[NLP-78] LOLAMEME: Logic Language Memory Mechanistic Framework
[NLP-78] LOLAMEME:逻辑语言记忆机制框架

链接: https://arxiv.org/abs/2406.02592
作者: Jay Desai,Xiaobo Guo,Srinivasan H. Sengamedu
关键词: Large Language Models, achieved superhuman breadth, Large Language, unprecedented depth, achieved superhuman
中文关键词: 大型语言模型,实现超人的广度,大型语言,前所未有的深度,实现超人
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: this https URL

点击查看摘要

Abstract:The performance of Large Language Models has achieved superhuman breadth with unprecedented depth. At the same time, the language models are mostly black box models and the underlying mechanisms for performance have been evaluated using synthetic or mechanistic schemes. We extend current mechanistic schemes to incorporate Logic, memory, and nuances of Language such as latent structure. The proposed framework is called LOLAMEME and we provide two instantiations of LOLAMEME: LoLa and MeMe languages. We then consider two generative language model architectures: transformer-based GPT-2 and convolution-based Hyena. We propose the hybrid architecture THEX and use the LOLAMEME framework to compare the three architectures. THEX outperforms GPT-2 and Hyena on select tasks.
摘要:大型语言模型的性能达到了超人的广度和前所未有的深度。与此同时,语言模型大多是黑盒模型,其性能背后的机制通常使用合成或机制化方案来评估。我们扩展了当前的机制化方案,以纳入逻辑、记忆以及潜在结构等语言的细微之处。所提出的框架称为LOLAMEME,我们提供了LOLAMEME的两个实例:LoLa和MeMe语言。然后我们考虑两种生成式语言模型架构:基于Transformer的GPT-2和基于卷积的Hyena。我们提出混合架构THEX,并使用LOLAMEME框架比较这三种架构。THEX在部分任务上优于GPT-2和Hyena。

[NLP-79] Are PPO-ed Language Models Hackable?
[NLP-79] PPO语言模型可以被黑客攻击吗?

链接: https://arxiv.org/abs/2406.02577
作者: Suraj Anand,David Getzen
关键词: remove undesirable behaviors, Numerous algorithms, undesirable behaviors, remove undesirable, Numerous
中文关键词: 删除不良行为,大量算法,不良行为,删除不良行为,大量
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注: 8 pages, 4 figures

点击查看摘要

Abstract:Numerous algorithms have been proposed to align language models to remove undesirable behaviors. However, the challenges associated with a very large state space and creating a proper reward function often result in various jailbreaks. Our paper aims to examine this effect of reward in the controlled setting of positive sentiment language generation. Instead of online training of a reward model based on human feedback, we employ a statically learned sentiment classifier. We also consider a setting where our model’s weights and activations are exposed to an end-user after training. We examine a pretrained GPT-2 through the lens of mechanistic interpretability before and after proximal policy optimization (PPO) has been applied to promote positive sentiment responses. Using these insights, we (1) attempt to “hack” the PPO-ed model to generate negative sentiment responses and (2) add a term to the reward function to try and alter `negative’ weights.
摘要:人们提出了许多对齐语言模型的算法来消除不良行为。然而,与非常大的状态空间以及构建恰当奖励函数相关的挑战常常导致各种越狱。我们的论文旨在在积极情绪语言生成这一受控设置中研究奖励的这种影响。我们采用静态学习的情感分类器,而不是基于人类反馈在线训练奖励模型。我们还考虑一种设置,即我们的模型的权重和激活在训练后暴露给最终用户。在应用近端策略优化(PPO)促进积极情绪回应之前和之后,我们通过机制可解释性的视角来检查预训练的GPT-2。基于这些洞察,我们(1)尝试"破解"经PPO训练的模型以生成负面情绪回应,以及(2)在奖励函数中添加一个项以尝试改变"负"权重。

[NLP-80] Cross-Modal Safety Alignment: Is textual unlearning all you need?
[NLP-80] 跨模式安全对齐:文本遗忘就是您所需要的一切吗?

链接: https://arxiv.org/abs/2406.02575
作者: Trishna Chakraborty,Erfan Shayegani,Zikui Cai,Nael Abu-Ghazaleh,M. Salman Asif,Yue Dong,Amit K. Roy-Chowdhury,Chengyu Song
关键词: Large Language Models, Supervised Fine-tuning, Human Feedback, Reinforcement Learning, Learning with Human
中文关键词: 大型语言模型、监督微调、人类反馈、强化学习、与人一起学习
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Recent studies reveal that integrating new modalities into Large Language Models (LLMs), such as Vision-Language Models (VLMs), creates a new attack surface that bypasses existing safety training techniques like Supervised Fine-tuning (SFT) and Reinforcement Learning with Human Feedback (RLHF). While further SFT and RLHF-based safety training can be conducted in multi-modal settings, collecting multi-modal training datasets poses a significant challenge. Inspired by the structural design of recent multi-modal models, where, regardless of the combination of input modalities, all inputs are ultimately fused into the language space, we aim to explore whether unlearning solely in the textual domain can be effective for cross-modality safety alignment. Our evaluation across six datasets empirically demonstrates the transferability – textual unlearning in VLMs significantly reduces the Attack Success Rate (ASR) to less than 8% and in some cases, even as low as nearly 2% for both text-based and vision-text-based attacks, alongside preserving the utility. Moreover, our experiments show that unlearning with a multi-modal dataset offers no potential benefits but incurs significantly increased computational demands, possibly up to 6 times higher.
摘要:最近的研究表明,将新的模态集成到大语言模型(LLM)中(如视觉语言模型(VLM)),会产生一个新的攻击面,绕过监督微调(SFT)和基于人类反馈的强化学习(RLHF)等现有安全训练技术。虽然可以在多模态环境中进一步开展基于SFT和RLHF的安全训练,但收集多模态训练数据集是一项重大挑战。受近期多模态模型结构设计的启发(无论输入模态如何组合,所有输入最终都会融合到语言空间中),我们旨在探索仅在文本领域进行遗忘是否能够有效实现跨模态安全对齐。我们在六个数据集上的评估经验性地证明了这种可迁移性:VLM中的文本遗忘在保持可用性的同时,将攻击成功率(ASR)显著降低到8%以下,在某些情况下,对于基于文本和基于视觉-文本的攻击甚至低至接近2%。此外,我们的实验表明,使用多模态数据集进行遗忘并没有带来潜在的好处,反而会显著增加计算需求,可能高达6倍。

[NLP-81] Sequence-to-sequence models in peer-to-peer learning: A practical application
[NLP-81] 点对点学习中的序列到序列模型:实际应用

链接: https://arxiv.org/abs/2406.02565
作者: Robert Šajina,Ivo Ipšić
关键词: Automatic Speech Recognition, Speech Recognition, based on LSTM, LSTM units, units for Automatic
中文关键词: 自动语音识别,语音识别,基于LSTM,LSTM单位,自动单位
类目: Sound (cs.SD); Computation and Language (cs.CL); Multiagent Systems (cs.MA); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:This paper explores the applicability of sequence-to-sequence (Seq2Seq) models based on LSTM units for Automatic Speech Recognition (ASR) task within peer-to-peer learning environments. Leveraging two distinct peer-to-peer learning methods, the study simulates the learning process of agents and evaluates their performance in ASR task using two different ASR datasets. In a centralized training setting, utilizing a scaled-down variant of the Deep Speech 2 model, a single model achieved a Word Error Rate (WER) of 84% when trained on the UserLibri dataset, and 38% when trained on the LJ Speech dataset. Conversely, in a peer-to-peer learning scenario involving 55 agents, the WER ranged from 87% to 92% for the UserLibri dataset, and from 52% to 56% for the LJ Speech dataset. The findings demonstrate the feasibility of employing Seq2Seq models in decentralized settings, albeit with slightly higher Word Error Rates (WER) compared to centralized training methods.
摘要:本文探讨了基于LSTM单元的序列到序列(Seq2Seq)模型在点对点学习环境中用于自动语音识别(ASR)任务的适用性。该研究利用两种不同的点对点学习方法,模拟了智能体的学习过程,并使用两个不同的ASR数据集评估了它们在ASR任务中的表现。在集中式训练环境中,利用Deep Speech 2模型的缩小变体,单个模型在UserLibri数据集上训练时的词错误率(WER)为84%,在LJ Speech数据集上训练时为38%。相反,在涉及55个智能体的点对点学习场景中,UserLibri数据集的WER范围为87%至92%,LJ Speech数据集的WER范围为52%至56%。研究结果证明了在去中心化环境中使用Seq2Seq模型的可行性,尽管与集中式训练方法相比,词错误率(WER)略高。

[NLP-82] 4D ASR: Joint Beam Search Integrating CTC, Attention, Transducer and Mask Predict Decoders
[NLP-82] 4D ASR:集成CTC、注意力、换能器和掩码预测解码器的联合波束搜索

链接: https://arxiv.org/abs/2406.02950
作者: Yui Sudo,Muhammad Shakeel,Yosuke Fukumoto,Brian Yan,Jiatong Shi,Yifan Peng,Shinji Watanabe
关键词: automatic speech recognition, connectionist temporal classification, neural network transducer, recurrent neural network, attention-based encoder-decoder
中文关键词: 自动语音识别、连接主义时态分类、神经网络转换器、循环神经网络、基于注意力的编码器-解码器
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
备注: submitted to IEEE/ACM Transactions on Audio Speech and Language Processing

点击查看摘要

Abstract:End-to-end automatic speech recognition (E2E-ASR) can be classified into several network architectures, such as connectionist temporal classification (CTC), recurrent neural network transducer (RNN-T), attention-based encoder-decoder, and mask-predict models. Each network architecture has advantages and disadvantages, leading practitioners to switch between these different models depending on application requirements. Instead of building separate models, we propose a joint modeling scheme where four decoders (CTC, RNN-T, attention, and mask-predict) share the same encoder – we refer to this as 4D modeling. The 4D model is trained using multitask learning, which will bring model regularization and maximize the model robustness thanks to their complementary properties. To efficiently train the 4D model, we introduce a two-stage training strategy that stabilizes multitask learning. In addition, we propose three novel one-pass beam search algorithms by combining three decoders (CTC, RNN-T, and attention) to further improve performance. These three beam search algorithms differ in which decoder is used as the primary decoder. We carefully evaluate the performance and computational tradeoffs associated with each algorithm. Experimental results demonstrate that the jointly trained 4D model outperforms the E2E-ASR models trained with only one individual decoder. Furthermore, we demonstrate that the proposed one-pass beam search algorithm outperforms the previously proposed CTC/attention decoding.
摘要:端到端自动语音识别(E2E-ASR)可以分为几种网络结构,如连接主义时序分类(CTC)、递归神经网络换能器(RNN-T)、基于注意力的编解码器和掩码预测模型。每种网络架构都有优缺点,因此从业者需要根据应用需求在这些不同的模型之间进行切换。我们没有建立单独的模型,而是提出了一种联合建模方案,其中四个解码器(CTC、RNN-T、注意力和掩码预测)共享相同的编码器,我们称之为4D建模。4D模型使用多任务学习进行训练,这将带来模型的正则化,并由于它们的互补特性而最大限度地提高模型的稳健性。为了有效地训练4D模型,我们引入了稳定多任务学习的两阶段训练策略。此外,我们还提出了三种新的单遍波束搜索算法,将三种解码器(CTC、RNN-T和注意力)结合在一起,以进一步提高性能。这三种波束搜索算法的不同之处在于使用哪个解码器作为主解码器。我们仔细评估了与每个算法相关的性能和计算权衡。实验结果表明,联合训练的4D模型优于仅用单个解码器训练的E2E-ASR模型。此外,我们还证明了所提出的单遍波束搜索算法优于先前提出的CTC/注意力解码方法。
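多解码器联合打分的基本形式是对各解码器的对数概率做加权求和,再在波束中保留得分最高的假设。下面是一个示意(权重、分数与字段名均为虚构假设,并非论文的具体打分公式):

```python
# 联合波束打分示意:每个假设带有来自 CTC / attention / transducer
# 三个解码器的对数概率,按权重加权求和后取 top-k。

def joint_beam(hyps, weights, beam=2):
    def score(h):
        return sum(weights[d] * h["scores"][d] for d in weights)
    return sorted(hyps, key=score, reverse=True)[:beam]

hyps = [
    {"text": "a", "scores": {"ctc": -1.0, "att": -0.5, "rnnt": -0.8}},
    {"text": "b", "scores": {"ctc": -0.2, "att": -2.0, "rnnt": -0.3}},
    {"text": "c", "scores": {"ctc": -3.0, "att": -3.0, "rnnt": -3.0}},
]
weights = {"ctc": 0.3, "att": 0.5, "rnnt": 0.2}
best = joint_beam(hyps, weights, beam=2)
print([h["text"] for h in best])  # ['a', 'b']
```

论文中三种单遍算法的区别在于哪个解码器充当主解码器来扩展假设,其余解码器只参与打分;这里的示意只展示了打分这一侧。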

[NLP-83] SYN2REAL: Leveraging Task Arithmetic for Mitigating Synthetic-Real Discrepancies in ASR Domain Adaptation
[NLP-83] SYN2REAL:利用任务算术缓解ASR领域自适应中的合成-真实差异

链接: https://arxiv.org/abs/2406.02925
作者: Hsuan Su,Hua Farn,Shang-Tse Chen,Hung-yi Lee
关键词: Recent advancements, large language models, advancements in large, large language, significantly impacted
中文关键词: 最近的进步、大型语言模型、大型语言的进步受到了显着的影响
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
备注:

点击查看摘要

Abstract:Recent advancements in large language models (LLMs) have introduced the ‘task vector’ concept, which has significantly impacted various domains but remains underexplored in speech recognition. This paper presents a novel ‘SYN2REAL’ task vector for domain adaptation in automatic speech recognition (ASR), specifically targeting text-only domains. Traditional fine-tuning on synthetic speech often results in performance degradation due to acoustic mismatches. To address this issue, we propose creating a ‘SYN2REAL’ vector by subtracting the parameter differences between models fine-tuned on real and synthetic speech. This vector effectively bridges the gap between the two domains. Experiments on the SLURP dataset demonstrate that our approach yields an average improvement of 11.15% in word error rate for unseen target domains, highlighting the potential of task vectors in enhancing speech domain adaptation.
摘要:大型语言模型(LLM)的最新进展引入了"任务向量"概念,该概念对各个领域产生了显著影响,但在语音识别中仍未得到充分探索。本文提出了一种新颖的"SYN2REAL"任务向量,用于自动语音识别(ASR)中的领域自适应,专门针对纯文本领域。在合成语音上进行传统微调通常会由于声学不匹配而导致性能下降。为了解决这个问题,我们建议用在真实语音上微调的模型参数减去在合成语音上微调的模型参数,来构造"SYN2REAL"向量。该向量有效地弥合了两个领域之间的差距。在SLURP数据集上的实验表明,对于未见过的目标领域,我们的方法使词错误率平均改善11.15%,凸显了任务向量在增强语音领域自适应方面的潜力。
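"任务向量"的算术非常直接:用在真实语音上微调的参数减去在合成语音上微调的参数得到SYN2REAL向量,再把它加回待适配的模型。下面以字典模拟参数给出示意(参数名、数值与缩放系数均为示意性假设):

```python
# SYN2REAL 任务向量示意:
#   vector = θ(真实语音微调) - θ(合成语音微调)
# 然后按系数 alpha 把向量施加到待适配模型上。

def task_vector(real_ft, syn_ft):
    return {k: real_ft[k] - syn_ft[k] for k in real_ft}

def apply_vector(params, vector, alpha=1.0):
    return {k: params[k] + alpha * vector[k] for k in params}

real_ft = {"w": 1.2, "b": 0.3}   # 玩具:在真实语音上微调后的权重
syn_ft  = {"w": 0.8, "b": 0.1}   # 玩具:在合成语音上微调后的权重

v = task_vector(real_ft, syn_ft)
adapted = apply_vector(syn_ft, v)  # 把合成域模型推向真实域
```

真实模型中这一减法对每个参数张量逐元素进行;alpha 等合并系数如何取值取决于具体实验设置。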

[NLP-84] Combining X-Vectors and Bayesian Batch Active Learning: Two-Stage Active Learning Pipeline for Speech Recognition
[NLP-84] 结合X-Vector和Bayesian批量主动学习:语音识别的两阶段主动学习管道

链接: https://arxiv.org/abs/2406.02566
作者: Ognjen Kundacina,Vladimir Vincan,Dragisa Miskovic
关键词: two-stage active learning, automatic speech recognition, Emphasizing a data-centric, active learning, pipeline for automatic
中文关键词: 两阶段主动学习、自动语音识别、强调以数据为中心的主动学习、自动流水线
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Emphasizing a data-centric AI approach, this paper introduces a novel two-stage active learning (AL) pipeline for automatic speech recognition (ASR), combining unsupervised and supervised AL methods. The first stage utilizes unsupervised AL by using x-vectors clustering for diverse sample selection from unlabeled speech data, thus establishing a robust initial dataset for the subsequent supervised AL. The second stage incorporates a supervised AL strategy, with a batch AL method specifically developed for ASR, aimed at selecting diverse and informative batches of samples. Here, sample diversity is also achieved using x-vectors clustering, while the most informative samples are identified using a Bayesian AL method tailored for ASR with an adaptation of Monte Carlo dropout to approximate Bayesian inference. This approach enables precise uncertainty estimation, thereby enhancing ASR model training with significantly reduced data requirements. Our method has shown superior performance compared to competing methods on homogeneous, heterogeneous, and OOD test sets, demonstrating that strategic sample selection and innovative Bayesian modeling can substantially optimize both labeling effort and data utilization in deep learning-based ASR applications.
摘要:本文强调以数据为中心的人工智能方法,结合无监督和有监督主动学习(AL)方法,提出了一种新颖的用于自动语音识别(ASR)的两阶段主动学习流水线。第一阶段通过使用x-vector聚类从未标注的语音数据中选择多样化的样本来实现无监督AL,从而为后续的有监督AL建立稳健的初始数据集。第二阶段结合了有监督AL策略,以及专门为ASR开发的批量AL方法,旨在选择多样化且信息量大的样本批次。这里,样本多样性同样通过x-vector聚类实现,而信息量最大的样本则通过为ASR量身定制的贝叶斯AL方法识别,该方法采用蒙特卡罗Dropout来近似贝叶斯推断。这种方法能够实现精确的不确定性估计,从而在显著减少数据需求的情况下增强ASR模型训练。我们的方法在同质、异质和分布外(OOD)测试集上均表现出优于竞争方法的性能,表明在基于深度学习的ASR应用中,策略性的样本选择和创新的贝叶斯建模可以显著优化标注工作量和数据利用率。
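第一阶段"用x-vector聚类选取多样化样本"可以这样示意:对语音嵌入做k-means,再从每个簇中取离质心最近的一条语音构成初始标注集。以下是一个纯Python玩具实现(聚类算法、簇数与数据均为示意性假设,并非论文所用配置):

```python
import random

# 玩具 k-means + 每簇取最近样本,模拟用 x-vector 聚类构造多样化初始标注集。

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(pts):
    n = len(pts)
    return tuple(sum(p[i] for p in pts) / n for i in range(len(pts[0])))

def kmeans(points, k, iters=20, seed=0):
    centers = random.Random(seed).sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[min(range(k), key=lambda c: dist2(p, centers[c]))].append(p)
        centers = [mean(g) if g else centers[i] for i, g in enumerate(groups)]
    return centers

def diverse_seed_set(xvectors, k):
    """每个簇取离质心最近的一条语音,保证初始标注集覆盖不同说话人。"""
    centers = kmeans(xvectors, k)
    return sorted({min(range(len(xvectors)), key=lambda i: dist2(xvectors[i], c))
                   for c in centers})

xvecs = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]  # 两簇玩具 x-vector
sel = diverse_seed_set(xvecs, k=2)
print(sel)  # 每簇各选出一条语音的下标
```

真实流水线中 x-vector 来自预训练的说话人嵌入网络,维度通常为数百维,聚类也会使用成熟的库实现。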

[NLP-85] A cost minimization approach to fix the vocabulary size in a tokenizer for an End-to-End ASR system
[NLP-85] 一种确定端到端ASR系统分词器中词汇量大小的成本最小化方法

链接: https://arxiv.org/abs/2406.02563
作者: Sunil Kumar Kopparapu,Ashish Panda
关键词: Unlike hybrid speech, hybrid speech recognition, Byte Pair Encoding, Unlike hybrid, speech recognition systems
中文关键词: 与混合语音不同,混合语音识别,字节对编码,与混合语音识别系统不同
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
备注: 5 pages, 4 figures

点击查看摘要

Abstract:Unlike hybrid speech recognition systems where the use of tokens was restricted to phones, biphones or triphones, the choice of tokens in the end-to-end ASR systems is derived from the text corpus of the training data. The use of tokenization algorithms like Byte Pair Encoding (BPE) and WordPiece is popular in identifying the tokens that are used in the overall training process of the speech recognition system. Popular toolkits, like ESPNet use a pre-defined vocabulary size (number of tokens) for these tokenization algorithms, but there is no discussion on how vocabulary size was derived. In this paper, we build a cost function, assuming the tokenization process to be a black-box to enable choosing the number of tokens which might most benefit building an end-to-end ASR. We show through experiments on LibriSpeech 100 hour set that the performance of an end-to-end ASR system improves when the number of tokens are chosen carefully.
摘要:与令牌的使用仅限于音素、双音素或三音素的混合语音识别系统不同,端到端ASR系统中的令牌选择是从训练数据的文本语料库中得出的。字节对编码(BPE)和WordPiece等分词算法常用于确定语音识别系统整个训练过程中所使用的令牌。ESPNet等流行工具包为这些分词算法使用预定义的词汇量大小(令牌数量),但没有讨论该词汇量大小是如何确定的。在本文中,我们构建了一个成本函数,将分词过程视为黑盒,以便选择最有利于构建端到端ASR的令牌数量。我们通过在LibriSpeech 100小时数据集上的实验表明,当仔细选择令牌数量时,端到端ASR系统的性能会得到改善。
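把分词器视为黑盒后,"选词表大小"就变成在候选大小上扫描一个成本函数并取最小值。下面的成本形式(平均令牌数加上与词表大小成正比的惩罚项)和数据纯属示意性假设,并非论文的实际成本函数:

```python
# 在候选词表大小上扫描成本并取 argmin。
# avg_tokens_per_utt[v]:用大小为 v 的分词器切分训练文本后,
# 每条语音对应的平均令牌数(此处为虚构数据)。

def choose_vocab_size(candidates, avg_tokens_per_utt, penalty=0.01):
    def cost(v):
        return avg_tokens_per_utt[v] + penalty * v
    return min(candidates, key=cost)

stats = {100: 42.0, 300: 30.0, 1000: 24.0, 5000: 21.5}
print(choose_vocab_size(stats, stats))  # 300
```

这种扫描只需对每个候选大小重新训练一次分词器并统计令牌数,无需训练ASR模型本身,因此代价很低。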

[NLP-86] Gated Low-rank Adaptation for personalized Code-Switching Automatic Speech Recognition on the low-spec devices
[NLP-86] 用于低规格设备上个性化语码转换自动语音识别的门控低秩自适应

链接: https://arxiv.org/abs/2406.02562
作者: Gwantae Kim,Bokyeung Lee,Donghyeon Kim,Hanseok Ko
关键词: speech recognition models, speech recognition, low-spec devices, CPU-only devices, recognition models
中文关键词: 语音识别模型、语音识别、低规格设备、纯MCU设备、识别模型
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Table 2 is revised

点击查看摘要

Abstract:In recent times, there has been a growing interest in utilizing personalized large models on low-spec devices, such as mobile and CPU-only devices. However, utilizing a personalized large model in the on-device is inefficient, and sometimes limited due to computational cost. To tackle the problem, this paper presents the weights separation method to minimize on-device model weights using parameter-efficient fine-tuning methods. Moreover, some people speak multiple languages in an utterance, as known as code-switching, the personalized ASR model is necessary to address such cases. However, current multilingual speech recognition models are limited to recognizing a single language within each utterance. To tackle this problem, we propose code-switching speech recognition models that incorporate fine-tuned monolingual and multilingual speech recognition models. Additionally, we introduce a gated low-rank adaptation(GLoRA) for parameter-efficient fine-tuning with minimal performance degradation. Our experiments, conducted on Korean-English code-switching datasets, demonstrate that fine-tuning speech recognition models for code-switching surpasses the performance of traditional code-switching speech recognition models trained from scratch. Furthermore, GLoRA enhances parameter-efficient fine-tuning performance compared to conventional LoRA.
摘要:最近,人们对在移动设备和仅有CPU的设备等低规格设备上使用个性化大模型越来越感兴趣。然而,在设备端使用个性化大模型效率很低,有时还会因为计算成本而受到限制。为了解决这一问题,本文提出了一种权重分离方法,利用参数高效的微调方法来最小化设备端模型的权重。此外,有些人会在一句话中混用多种语言,即所谓的语码转换,因此需要个性化的ASR模型来应对这种情况。然而,当前的多语言语音识别模型仅限于识别每个话语中的一种语言。为了解决这一问题,我们提出了结合微调的单语和多语语音识别模型的语码转换语音识别模型。此外,我们引入了门控低秩自适应(GLoRA),在参数高效微调的同时将性能损失降到最低。我们在韩语-英语语码转换数据集上进行的实验表明,针对语码转换微调的语音识别模型的性能优于传统的从头训练的语码转换语音识别模型。此外,与传统的LoRA相比,GLoRA提高了参数高效微调的性能。
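门控低秩自适应的基本形态可以写成 W_eff = W + g·(A·B),其中 A、B 为低秩矩阵,g 为门控标量。下面用纯Python小矩阵示意(形状、数值与门控形式均为示意性假设,并非论文实现):

```python
# 门控低秩更新示意:有效权重 = 冻结的基座权重 + gate * (A @ B)。
# A: d×r, B: r×d, r 远小于 d,可训练参数量约为 2*d*r + 1。

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def glora_weight(W, A, B, gate):
    delta = matmul(A, B)
    return [[W[i][j] + gate * delta[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]

W = [[1.0, 0.0], [0.0, 1.0]]      # 冻结的 2×2 基座权重
A = [[1.0], [0.0]]                # d×r, r=1
B = [[0.0, 2.0]]                  # r×d
print(glora_weight(W, A, B, gate=0.5))  # [[1.0, 1.0], [0.0, 1.0]]
```

设备端只需存储和更新小矩阵 A、B 与门控 g,这正是权重分离思想的落点:基座权重共享,个性化部分体积极小。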

[NLP-87] Less Peaky and More Accurate CTC Forced Alignment by Label Priors
[NLP-87] 通过标签先验实现更少尖峰、更准确的CTC强制对齐

链接: https://arxiv.org/abs/2406.02560
作者: Ruizhe Huang,Xiaohui Zhang,Zhaoheng Ni,Li Sun,Moto Hira,Jeff Hwang,Vimal Manohar,Vineel Pratap,Matthew Wiesner,Shinji Watanabe,Daniel Povey,Sanjeev Khudanpur
关键词: Connectionist temporal classification, Connectionist temporal, peaky output distributions, temporal classification, output distributions
中文关键词: 连接主义时态分类、连接主义时态、峰值输出分布、时态分类、输出分布
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted by ICASSP 2024. Github repo: this https URL

点击查看摘要

Abstract:Connectionist temporal classification (CTC) models are known to have peaky output distributions. Such behavior is not a problem for automatic speech recognition (ASR), but it can cause inaccurate forced alignments (FA), especially at finer granularity, e.g., phoneme level. This paper aims at alleviating the peaky behavior for CTC and improve its suitability for forced alignment generation, by leveraging label priors, so that the scores of alignment paths containing fewer blanks are boosted and maximized during training. As a result, our CTC model produces less peaky posteriors and is able to more accurately predict the offset of the tokens besides their onset. It outperforms the standard CTC model and a heuristics-based approach for obtaining CTC’s token offset timestamps by 12-40% in phoneme and word boundary errors (PBE and WBE) measured on the Buckeye and TIMIT data. Compared with the most widely used FA toolkit Montreal Forced Aligner (MFA), our method performs similarly on PBE/WBE on Buckeye, yet falls behind MFA on TIMIT. Nevertheless, our method has a much simpler training pipeline and better runtime efficiency. Our training recipe and pretrained model are released in TorchAudio.
摘要:众所周知,连接主义时序分类(CTC)模型具有尖峰的输出分布。这种行为对于自动语音识别(ASR)来说不是问题,但它可能导致不准确的强制对齐(FA),尤其是在更细的粒度上,例如音素级别。本文旨在通过利用标签先验来缓解CTC的尖峰行为,提高其对强制对齐生成的适用性,从而在训练过程中提升并最大化包含较少空白的对齐路径的分数。因此,我们的CTC模型产生的后验峰值更少,除了令牌的起始位置外,还能更准确地预测其偏移。在基于Buckeye和TIMIT数据测量的音素和单词边界错误(PBE和WBE)方面,它比标准CTC模型和基于启发式方法获得CTC令牌偏移时间戳的方案高出12%-40%。与应用最广泛的FA工具包Montreal Forced Aligner(MFA)相比,我们的方法在Buckeye上的PBE/WBE表现相近,但在TIMIT上落后于MFA。尽管如此,我们的方法具有更简单的训练流水线和更好的运行效率。我们的训练配方和预训练模型已在TorchAudio中发布。
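利用标签先验抑制尖峰的常见做法,是在帧级对数后验上减去按比例缩放的对数先验,再做强制对齐。下面是一个帧级玩具示意(先验缩放系数与各概率数值均为示意性假设,并非论文的精确公式):

```python
import math

# 标签先验重加权示意:log p'(y|x) ∝ log p(y|x) - α * log p(y)。
# 高频的 blank 先验大,被扣得多,从而缓解 CTC 的尖峰输出。

def apply_label_priors(log_post, log_priors, alpha=0.3):
    return [[lp - alpha * log_priors[i] for i, lp in enumerate(frame)]
            for frame in log_post]

# 玩具数据:两类 [blank, "a"];blank 先验 0.9,"a" 先验 0.1。
log_priors = [math.log(0.9), math.log(0.1)]
frame = [math.log(0.55), math.log(0.45)]   # 原本 blank 略占优
rescored = apply_label_priors([frame], log_priors)[0]
print(rescored[1] > rescored[0])  # 重加权后 "a" 反超 blank → True
```

直觉上,这一重加权把被 blank 过度占据的帧重新分配给真实标签,使对齐路径在时间上铺得更开,边界估计也随之更准。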

[NLP-88] PhoWhisper: Automatic Speech Recognition for Vietnamese
[NLP-88] PhoWhisper:越南语自动语音识别

链接: https://arxiv.org/abs/2406.02555
作者: Thanh-Thien Le,Linh The Nguyen,Dat Quoc Nguyen
关键词: automatic speech recognition, Vietnamese automatic speech, speech recognition, automatic speech, Vietnamese automatic
中文关键词: 自动语音识别,越南语自动语音,语音识别,自动语音,越南语自动
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
备注: Accepted to ICLR 2024 Tiny Papers Track

点击查看摘要

Abstract:We introduce PhoWhisper in five versions for Vietnamese automatic speech recognition. PhoWhisper’s robustness is achieved through fine-tuning the Whisper model on an 844-hour dataset that encompasses diverse Vietnamese accents. Our experimental study demonstrates state-of-the-art performances of PhoWhisper on benchmark Vietnamese ASR datasets. We have open-sourced PhoWhisper at: this https URL
摘要:我们推出了五个版本的PhoWhisper,用于越南语自动语音识别。PhoWhisper的稳健性是通过在包含不同越南口音的844小时数据集上微调Whisper模型来实现的。我们的实验研究展示了PhoWhisper在越南语基准ASR数据集上的最先进性能。我们已在此https URL开源PhoWhisper。

[NLP-89] Hear Me See Me Understand Me: Audio-Visual Autism Behavior Recognition
[NLP-89] 听到我、看到我、理解我:视听自闭症行为识别

链接: https://arxiv.org/abs/2406.02554
作者: Shijian Deng,Erin E. Kosloski,Siddhi Patel,Zeke A. Barnett,Yiyang Nan,Alexander Kaplan,Sisira Aarukapalli,William T. Doan,Matthew Wang,Harsh Singh,Pamela R. Rollins,Yapeng Tian
关键词: autism behavior recognition, essential aspect previously, aspect previously omitted, behavior recognition, audio-visual autism behavior
中文关键词: 自闭症行为识别,之前的基本方面,之前省略的方面,行为识别,视听自闭症行为
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:In this article, we introduce a novel problem of audio-visual autism behavior recognition, which includes social behavior recognition, an essential aspect previously omitted in AI-assisted autism screening research. We define the task at hand as one that is audio-visual autism behavior recognition, which uses audio and visual cues, including any speech present in the audio, to recognize autism-related behaviors. To facilitate this new research direction, we collected an audio-visual autism spectrum dataset (AV-ASD), currently the largest video dataset for autism screening using a behavioral approach. It covers an extensive range of autism-associated behaviors, including those related to social communication and interaction. To pave the way for further research on this new problem, we intensively explored leveraging foundation models and multimodal large language models across different modalities. Our experiments on the AV-ASD dataset demonstrate that integrating audio, visual, and speech modalities significantly enhances the performance in autism behavior recognition. Additionally, we explored the use of a post-hoc to ad-hoc pipeline in a multimodal large language model to investigate its potential to augment the model’s explanatory capability during autism behavior recognition. We will release our dataset, code, and pre-trained models.
摘要:在本文中,我们提出了视听自闭症行为识别这一新问题,其中包括社交行为识别,这是此前人工智能辅助自闭症筛查研究中被忽略的一个重要方面。我们将该任务定义为视听自闭症行为识别,即利用音频和视觉线索(包括音频中出现的任何语音)来识别与自闭症相关的行为。为了推动这一新的研究方向,我们收集了一个视听自闭症谱系数据集(AV-ASD),这是目前采用行为方法进行自闭症筛查的最大视频数据集。它涵盖了广泛的自闭症相关行为,包括与社交沟通和互动有关的行为。为了给这一新问题的进一步研究铺平道路,我们深入探索了在不同模态上利用基础模型和多模态大型语言模型。我们在AV-ASD数据集上的实验表明,整合音频、视觉和语音模态显著提高了自闭症行为识别的性能。此外,我们探索了在多模态大型语言模型中使用从post-hoc到ad-hoc的管道,以考察其在自闭症行为识别过程中增强模型解释能力的潜力。我们将发布我们的数据集、代码和预训练模型。

计算机视觉

[CV-0] Convolutional Neural Networks and Vision Transformers for Fashion MNIST Classification: A Literature Review

链接: https://arxiv.org/abs/2406.03478
作者: Sonia Bbouzidi,Ghazala Hcini,Imen Jdey,Fadoua Drira
关键词: Convolutional Neural Networks, Neural Networks, Convolutional Neural, Vision Transformers, Fashion MNIST dataset
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Our review explores the comparative analysis between Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) in the domain of image classification, with a particular focus on clothing classification within the e-commerce sector. Utilizing the Fashion MNIST dataset, we delve into the unique attributes of CNNs and ViTs. While CNNs have long been the cornerstone of image classification, ViTs introduce an innovative self-attention mechanism enabling nuanced weighting of different input data components. Historically, transformers have primarily been associated with Natural Language Processing (NLP) tasks. Through a comprehensive examination of existing literature, our aim is to unveil the distinctions between ViTs and CNNs in the context of image classification. Our analysis meticulously scrutinizes state-of-the-art methodologies employing both architectures, striving to identify the factors influencing their performance. These factors encompass dataset characteristics, image dimensions, the number of target classes, hardware infrastructure, and the specific architectures along with their respective top results. Our key goal is to determine the most appropriate architecture between ViT and CNN for classifying images in the Fashion MNIST dataset within the e-commerce industry, while taking into account specific conditions and needs. We highlight the importance of combining these two architectures with different forms to enhance overall performance. By uniting these architectures, we can take advantage of their unique strengths, which may lead to more precise and reliable models for e-commerce applications. CNNs are skilled at recognizing local patterns, while ViTs are effective at grasping overall context, making their combination a promising strategy for boosting image classification performance.

[CV-1] AD-H: Autonomous Driving with Hierarchical Agents

链接: https://arxiv.org/abs/2406.03474
作者: Zaibin Zhang,Shiyu Tang,Yuanhang Zhang,Talas Fu,Yifan Wang,Yang Liu,Dong Wang,Jing Shao,Lijun Wang,Huchuan Lu
关键词: employing MLLM-based agents, large language models, multimodal large language, recent works, dynamic environments
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Due to the impressive capabilities of multimodal large language models (MLLMs), recent works have focused on employing MLLM-based agents for autonomous driving in large-scale and dynamic environments. However, prevalent approaches often directly translate high-level instructions into low-level vehicle control signals, which deviates from the inherent language generation paradigm of MLLMs and fails to fully harness their emergent powers. As a result, the generalizability of these methods is highly restricted by autonomous driving datasets used during fine-tuning. To tackle this challenge, we propose to connect high-level instructions and low-level control signals with mid-level language-driven commands, which are more fine-grained than high-level instructions but more universal and explainable than control signals, and thus can effectively bridge the gap in between. We implement this idea through a hierarchical multi-agent driving system named AD-H, including a MLLM planner for high-level reasoning and a lightweight controller for low-level execution. The hierarchical design liberates the MLLM from low-level control signal decoding and therefore fully releases their emergent capability in high-level perception, reasoning, and planning. We build a new dataset with action hierarchy annotations. Comprehensive closed-loop evaluations demonstrate several key advantages of our proposed AD-H system. First, AD-H can notably outperform state-of-the-art methods in achieving exceptional driving performance, even exhibiting self-correction capabilities during vehicle operation, a scenario not encountered in the training dataset. Second, AD-H demonstrates superior generalization under long-horizon instructions and novel environmental conditions, significantly surpassing current state-of-the-art methods. We will make our data and code publicly accessible at this https URL

[CV-2] Polarization Wavefront Lidar: Learning Large Scene Reconstruction from Polarized Wavefronts

链接: https://arxiv.org/abs/2406.03461
作者: Dominik Scheuble,Chenyang Lei,Seung-Hwan Baek,Mario Bijelic,Felix Heide
关键词: cornerstone sensing modality, autonomous driving, cornerstone sensing, Conventional lidar sensors, large outdoor scenarios
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注: Accepted at CVPR 2024; Project Website: this https URL

点击查看摘要

Abstract:Lidar has become a cornerstone sensing modality for 3D vision, especially for large outdoor scenarios and autonomous driving. Conventional lidar sensors are capable of providing centimeter-accurate distance information by emitting laser pulses into a scene and measuring the time-of-flight (ToF) of the reflection. However, the polarization of the received light that depends on the surface orientation and material properties is usually not considered. As such, the polarization modality has the potential to improve scene reconstruction beyond distance measurements. In this work, we introduce a novel long-range polarization wavefront lidar sensor (PolLidar) that modulates the polarization of the emitted and received light. Departing from conventional lidar sensors, PolLidar allows access to the raw time-resolved polarimetric wavefronts. We leverage polarimetric wavefronts to estimate normals, distance, and material properties in outdoor scenarios with a novel learned reconstruction method. To train and evaluate the method, we introduce a simulated and real-world long-range dataset with paired raw lidar data, ground truth distance, and normal maps. We find that the proposed method improves normal and distance reconstruction by 53% mean angular error and 41% mean absolute error compared to existing shape-from-polarization (SfP) and ToF methods. Code and data are open-sourced at this https URL.

[CV-3] LW-DETR: A Transformer Replacement to YOLO for Real-Time Detection

链接: https://arxiv.org/abs/2406.03459
作者: Qiang Chen,Xiangbo Su,Xinyu Zhang,Jian Wang,Jiahui Chen,Yunpeng Shen,Chuchu Han,Ziliang Chen,Weixiang Xu,Fanrong Li,Shan Zhang,Kun Yao,Errui Ding,Gang Zhang,Jingdong Wang
关键词: light-weight detection transformer, real-time object detection, ViT encoder, detection transformer, light-weight detection
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In this paper, we present a light-weight detection transformer, LW-DETR, which outperforms YOLOs for real-time object detection. The architecture is a simple stack of a ViT encoder, a projector, and a shallow DETR decoder. Our approach leverages recent advanced techniques, such as training-effective techniques, e.g., improved loss and pretraining, and interleaved window and global attentions for reducing the ViT encoder complexity. We improve the ViT encoder by aggregating multi-level feature maps, and the intermediate and final feature maps in the ViT encoder, forming richer feature maps, and introduce window-major feature map organization for improving the efficiency of interleaved attention computation. Experimental results demonstrate that the proposed approach is superior over existing real-time detectors, e.g., YOLO and its variants, on COCO and other benchmark datasets. Code and models are available at (this https URL).

[CV-4] FILS: Self-Supervised Video Feature Prediction In Semantic Language Space

链接: https://arxiv.org/abs/2406.03447
作者: Mona Ahmadian,Frank Guerin,Andrew Gilbert
关键词: Language Space, approach for learning, learning semantic video, semantic Language Space, Space
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper demonstrates a self-supervised approach for learning semantic video representations. Recent vision studies show that a masking strategy for vision and natural language supervision has contributed to developing transferable visual pretraining. Our goal is to achieve a more semantic video representation by leveraging the text related to the video content during the pretraining in a fully self-supervised manner. To this end, we present FILS, a novel self-supervised video Feature prediction In semantic Language Space (FILS). The vision model can capture valuable structured information by correctly predicting masked feature semantics in language space. It is learned using a patch-wise video-text contrastive strategy, in which the text representations act as prototypes for transforming vision features into a language space, which are then used as targets for semantically meaningful feature prediction using our masked encoder-decoder structure. FILS demonstrates remarkable transferability on downstream action recognition tasks, achieving state-of-the-art on challenging egocentric datasets, like Epic-Kitchens, Something-SomethingV2, Charades-Ego, and EGTEA, using ViT-Base. Our efficient method requires less computation and smaller batches compared to previous works.

[CV-5] Text-to-Events: Synthetic Event Camera Streams from Conditional Text Input

链接: https://arxiv.org/abs/2406.03439
作者: Joachim Ott,Zuowen Wang,Shih-Chii Liu
关键词: require vision sensors, Event, event camera, advantageous for tasks, tasks that require
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Event cameras are advantageous for tasks that require vision sensors with low-latency and sparse output responses. However, the development of deep network algorithms using event cameras has been slow because of the lack of large labelled event camera datasets for network training. This paper reports a method for creating new labelled event datasets by using a text-to-X model, where X is one or multiple output modalities, in the case of this work, events. Our proposed text-to-events model produces synthetic event frames directly from text prompts. It uses an autoencoder which is trained to produce sparse event frames representing event camera outputs. By combining the pretrained autoencoder with a diffusion model architecture, the new text-to-events model is able to generate smooth synthetic event streams of moving objects. The autoencoder was first trained on an event camera dataset of diverse scenes. In the combined training with the diffusion model, the DVS gesture dataset was used. We demonstrate that the model can generate realistic event sequences of human gestures prompted by different text statements. The classification accuracy of the generated sequences, using a classifier trained on the real dataset, ranges from 42% to 92%, depending on the gesture group. The results demonstrate the capability of this method in synthesizing event datasets.

[CV-6] CattleFace-RGBT: RGB-T Cattle Facial Landmark Benchmark

链接: https://arxiv.org/abs/2406.03431
作者: Ethan Coffman,Reagan Clark,Nhat-Tan Bui,Trong Thang Pham,Beth Kegley,Jeremy G. Powell,Jiangchao Zhao,Ngan Le
关键词: Cattle Facial Landmark, Landmark dataset consisting, address this challenge, Facial Landmark, RGB-T image pairs
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:To address this challenge, we introduce CattleFace-RGBT, a RGB-T Cattle Facial Landmark dataset consisting of 2,300 RGB-T image pairs, a total of 4,600 images. Creating a landmark dataset is time-consuming, but AI-assisted annotation can help. However, applying AI to thermal images is challenging due to suboptimal results from direct thermal training and infeasible RGB-thermal alignment due to different camera views. Therefore, we opt to transfer models trained on RGB to thermal images and refine them using our AI-assisted annotation tool following a semi-automatic annotation approach. Accurately localizing facial key points on both RGB and thermal images enables us to not only discern the cattle’s respiratory signs but also measure temperatures to assess the animal’s thermal state. To the best of our knowledge, this is the first cattle facial landmark dataset on RGB-T images. We conduct benchmarking of the CattleFace-RGBT dataset across various backbone architectures, with the objective of establishing baselines for future research, analysis, and comparison. The dataset and models are at this https URL

[CV-7] Post-hoc Part-prototype Networks

链接: https://arxiv.org/abs/2406.03421
作者: Andong Tan,Fengtao Zhou,Hao Chen
关键词: Post-hoc explainability methods, Scott Oriole, characteristic Scott Oriole, Scott Oriole wing, explainability methods
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: ICML 2024

点击查看摘要

Abstract:Post-hoc explainability methods such as Grad-CAM are popular because they do not influence the performance of a trained model. However, they mainly reveal “where” a model looks at for a given input, but fail to explain “what” the model looks for (e.g., what is important to classify a bird image to a Scott Oriole?). Existing part-prototype networks leverage part-prototypes (e.g., characteristic Scott Oriole’s wing and head) to answer both “where” and “what”, but often under-perform their black box counterparts in accuracy. Therefore, a natural question is: can one construct a network that answers both “where” and “what” in a post-hoc manner to guarantee the model’s performance? To this end, we propose the first post-hoc part-prototype network via decomposing the classification head of a trained model into a set of interpretable part-prototypes. Concretely, we propose an unsupervised prototype discovery and refining strategy to obtain prototypes that can precisely reconstruct the classification head, yet being interpretable. Besides guaranteeing the performance, we show that our network offers more faithful explanations qualitatively and yields even better part-prototypes quantitatively than prior part-prototype networks.

[CV-8] CoFie: Learning Compact Neural Surface Representations with Coordinate Fields

链接: https://arxiv.org/abs/2406.03417
作者: Hanwen Jiang,Haitao Yang,Georgios Pavlakos,Qixing Huang
关键词: local geometry-aware neural, local, local shapes, geometry-aware neural surface, coordinate
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
*备注: Project page: this https URL

点击查看摘要

Abstract:This paper introduces CoFie, a novel local geometry-aware neural surface representation. CoFie is motivated by the theoretical analysis of local SDFs with quadratic approximation. We find that local shapes are highly compressive in an aligned coordinate frame defined by the normal and tangent directions of local shapes. Accordingly, we introduce Coordinate Field, which is a composition of coordinate frames of all local shapes. The Coordinate Field is optimizable and is used to transform the local shapes from the world coordinate frame to the aligned shape coordinate frame. It largely reduces the complexity of local shapes and benefits the learning of MLP-based implicit representations. Moreover, we introduce quadratic layers into the MLP to enhance expressiveness concerning local shape geometry. CoFie is a generalizable surface representation. It is trained on a curated set of 3D shapes and works on novel shape instances during testing. When using the same amount of parameters with prior works, CoFie reduces the shape error by 48% and 56% on novel instances of both training and unseen shape categories. Moreover, CoFie demonstrates comparable performance to prior works when using only 70% fewer parameters.
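
The aligned coordinate frame that CoFie relies on can be illustrated with a small sketch (the function names and the particular frame construction below are illustrative assumptions, not the paper's code): build an orthonormal frame from a local surface normal and express points in it, so that a local patch's geometry is decoupled from its world-space pose.

```python
import numpy as np

def local_frame(normal):
    """Build an orthonormal frame (tangent, bitangent, normal) from a
    surface normal. Rows of the returned matrix rotate world -> local."""
    n = normal / np.linalg.norm(normal)
    # pick a helper axis that is not parallel to n
    helper = np.array([1.0, 0.0, 0.0]) if abs(n[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
    t = np.cross(helper, n)
    t = t / np.linalg.norm(t)
    b = np.cross(n, t)
    return np.stack([t, b, n])

def to_local(points, origin, normal):
    """Express world-space points in the patch's aligned coordinate frame."""
    R = local_frame(normal)
    return (points - origin) @ R.T

# a point sitting one unit along the normal maps to local z = 1
pts = np.array([[0.0, 0.0, 1.0]])
local = to_local(pts, origin=np.zeros(3), normal=np.array([0.0, 0.0, 1.0]))
```

In this aligned frame the local z-axis always follows the surface normal, which is what makes local shapes "highly compressive" and easier for a shared MLP to fit.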

[CV-9] Interactive Text-to-Image Retrieval with Large Language Models: A Plug-and-Play Approach

链接: https://arxiv.org/abs/2406.03411
作者: Saehyung Lee,Sangwon Yu,Junsung Park,Jihun Yi,Sungroh Yoon
关键词: dialogue-form context query, primarily address, dialogue-form context, retrieval task, context query
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: To appear in ACL 2024 Main

点击查看摘要

Abstract:In this paper, we primarily address the issue of dialogue-form context query within the interactive text-to-image retrieval task. Our methodology, PlugIR, actively utilizes the general instruction-following capability of LLMs in two ways. First, by reformulating the dialogue-form context, we eliminate the necessity of fine-tuning a retrieval model on existing visual dialogue data, thereby enabling the use of any arbitrary black-box model. Second, we construct the LLM questioner to generate non-redundant questions about the attributes of the target image, based on the information of retrieval candidate images in the current context. This approach mitigates the issues of noisiness and redundancy in the generated questions. Beyond our methodology, we propose a novel evaluation metric, Best log Rank Integral (BRI), for a comprehensive assessment of the interactive retrieval system. PlugIR demonstrates superior performance compared to both zero-shot and fine-tuned baselines in various benchmarks. Additionally, the two methodologies comprising PlugIR can be flexibly applied together or separately in various situations. Our codes are available at this https URL.

[CV-10] Gaussian Representation for Deformable Image Registration

链接: https://arxiv.org/abs/2406.03394
作者: Jihe Li,Fabian Zhang,Xia Li,Tianhao Zhang,Ye Zhang,Joachim Buhmann
关键词: Deformable image registration, balance computational efficiency, Deformable image, task in radiotherapy, speed effectively
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Deformable image registration (DIR) is a fundamental task in radiotherapy, with existing methods often struggling to balance computational efficiency, registration accuracy, and speed effectively. We introduce a novel DIR approach employing parametric 3D Gaussian control points achieving a better tradeoff. It provides an explicit and flexible representation for spatial deformation fields between 3D volumetric medical images, producing a displacement vector field (DVF) across all volumetric positions. The movement of individual voxels is derived using linear blend skinning (LBS) through localized interpolation of transformations associated with neighboring Gaussians. This interpolation strategy not only simplifies the determination of voxel motions but also acts as an effective regularization technique. Our approach incorporates a unified optimization process through backpropagation, enabling iterative learning of both the parameters of the 3D Gaussians and their transformations. Additionally, the density of Gaussians is adjusted adaptively during the learning phase to accommodate varying degrees of motion complexity. We validated our approach on the 4D-CT lung DIR-Lab and cardiac ACDC datasets, achieving an average target registration error (TRE) of 1.06 mm within a much-improved processing time of 2.43 seconds for the DIR-Lab dataset over existing methods, demonstrating significant advancements in both accuracy and efficiency.
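
The linear blend skinning interpolation described above can be sketched in a few lines (a simplified, translation-only illustration with Gaussian kernel weights; the names and weighting scheme are our assumptions, not the authors' implementation):

```python
import numpy as np

def lbs_displacement(voxels, centers, translations, sigma=1.0):
    """Linear blend skinning sketch: each voxel's displacement is a
    weighted average of the translations of nearby Gaussian control
    points, with weights from an isotropic Gaussian kernel."""
    # pairwise squared distances, shape (num_voxels, num_gaussians)
    d2 = ((voxels[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    w = np.exp(-d2 / (2.0 * sigma**2))
    w = w / w.sum(axis=1, keepdims=True)  # normalize weights per voxel
    return w @ translations               # (num_voxels, 3) displacement field

voxels = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
centers = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
trans = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
dvf = lbs_displacement(voxels, centers, trans)
```

Each voxel is pulled mostly by its nearest control point, which is also why the interpolation acts as a built-in smoothness regularizer on the DVF.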

[CV-11] SelfReDepth: Self-Supervised Real-Time Depth Restoration for Consumer-Grade Sensors

链接: https://arxiv.org/abs/2406.03388
作者: Alexandre Duarte,Francisco Fernandes,João M. Pereira,Catarina Moreira,Jacinto C. Nascimento,Joaquim Jorge
关键词: consumer-grade sensors suffer, scene-specific sources, Depth, produced by consumer-grade, suffer from inaccurate
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注: 13pp, 5 figures, 1 table

点击查看摘要

Abstract:Depth maps produced by consumer-grade sensors suffer from inaccurate measurements and missing data from either system or scene-specific sources. Data-driven denoising algorithms can mitigate such problems. However, they require vast amounts of ground truth depth data. Recent research has tackled this limitation using self-supervised learning techniques, but it requires multiple RGB-D sensors. Moreover, most existing approaches focus on denoising single isolated depth maps or specific subjects of interest, highlighting a need for methods to effectively denoise depth maps in real-time dynamic environments. This paper extends state-of-the-art approaches for depth-denoising commodity depth devices, proposing SelfReDepth, a self-supervised deep learning technique for depth restoration, via denoising and hole-filling by inpainting full-depth maps captured with RGB-D sensors. The algorithm targets depth data in video streams, utilizing multiple sequential depth frames coupled with color data to achieve high-quality depth videos with temporal coherence. Finally, SelfReDepth is designed to be compatible with various RGB-D sensors and usable in real-time scenarios as a pre-processing step before applying other depth-dependent algorithms. Our results demonstrate our approach’s real-time performance on real-world datasets. They show that it outperforms state-of-the-art denoising and restoration performance at over 30fps on Commercial Depth Cameras, with potential benefits for augmented and mixed-reality applications.

[CV-12] A Flexible Recursive Network for Video Stereo Matching Based on Residual Estimation

链接: https://arxiv.org/abs/2406.03333
作者: Youchen Zhao,Guorong Luo,Hua Zhong,Haixiong Li
关键词: Residual Estimation Module, Multi-scale Residual Estimation, Disparity Optimization Module, Temporal Attention Module, residual estimation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Due to the high similarity of disparity between consecutive frames in video sequences, the area where disparity changes is defined as the residual map, which can be calculated. Based on this, we propose RecSM, a network based on residual estimation with a flexible recursive structure for video stereo matching. The RecSM network accelerates stereo matching using a Multi-scale Residual Estimation Module (MREM), which employs the temporal context as a reference and rapidly calculates the disparity for the current frame by computing only the residual values between the current and previous frames. To further reduce the error of estimated disparities, we use the Disparity Optimization Module (DOM) and Temporal Attention Module (TAM) to enforce constraints between each module, and together with MREM, form a flexible Stackable Computation Structure (SCS), which allows for the design of different numbers of SCS based on practical scenarios. Experimental results demonstrate that with a stack count of 3, RecSM achieves a 4x speed improvement compared to ACVNet, running at 0.054 seconds based on one NVIDIA RTX 2080TI GPU, with an accuracy decrease of only 0.7%. Code is available at this https URL.
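
The residual idea, reusing the previous frame's disparity and updating only where it changed, can be sketched as follows (a toy illustration; the threshold and names are hypothetical, and the real MREM predicts residuals with a network rather than receiving them as input):

```python
import numpy as np

def residual_update(prev_disp, residual, change_thresh=0.1):
    """Sketch of residual-based disparity update: keep the previous
    frame's disparity and add the estimated residual only in regions
    where the scene actually changed (|residual| above a threshold)."""
    changed = np.abs(residual) > change_thresh
    curr = prev_disp.copy()
    curr[changed] += residual[changed]
    return curr, changed

prev = np.array([[10.0, 10.0], [5.0, 5.0]])
res = np.array([[0.0, 2.0], [0.05, -1.0]])
curr, changed = residual_update(prev, res)
```

Because consecutive frames are highly similar, the changed region is typically small, which is the source of the reported speedup over recomputing disparity from scratch.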

[CV-13] Comparative Benchmarking of Failure Detection Methods in Medical Image Segmentation: Unveiling the Role of Confidence Aggregation

链接: https://arxiv.org/abs/2406.03323
作者: Maximilian Zenk,David Zimmerer,Fabian Isensee,Jeremias Traub,Tobias Norajitra,Paul F. Jäger,Klaus Maier-Hein
关键词: learning algorithms offering, recent deep learning, deep learning algorithms, Semantic segmentation, algorithms offering
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: This work has been submitted for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

点击查看摘要

Abstract:Semantic segmentation is an essential component of medical image analysis research, with recent deep learning algorithms offering out-of-the-box applicability across diverse datasets. Despite these advancements, segmentation failures remain a significant concern for real-world clinical applications, necessitating reliable detection mechanisms. This paper introduces a comprehensive benchmarking framework aimed at evaluating failure detection methodologies within medical image segmentation. Through our analysis, we identify the strengths and limitations of current failure detection metrics, advocating for the risk-coverage analysis as a holistic evaluation approach. Utilizing a collective dataset comprising five public 3D medical image collections, we assess the efficacy of various failure detection strategies under realistic test-time distribution shifts. Our findings highlight the importance of pixel confidence aggregation and we observe superior performance of the pairwise Dice score (Roy et al., 2019) between ensemble predictions, positioning it as a simple and robust baseline for failure detection in medical image segmentation. To promote ongoing research, we make the benchmarking framework available to the community.
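
The pairwise Dice baseline highlighted above can be sketched directly (a minimal NumPy illustration; the function names are ours, and Roy et al.'s formulation may differ in detail): the mean Dice agreement between all pairs of ensemble predictions serves as a per-image confidence score, with low agreement flagging a likely segmentation failure.

```python
import numpy as np
from itertools import combinations

def dice(a, b, eps=1e-8):
    """Dice overlap between two binary masks."""
    inter = np.logical_and(a, b).sum()
    return (2.0 * inter + eps) / (a.sum() + b.sum() + eps)

def pairwise_dice_confidence(masks):
    """Mean Dice over all pairs of ensemble predictions: low agreement
    between members suggests a likely segmentation failure."""
    scores = [dice(a, b) for a, b in combinations(masks, 2)]
    return float(np.mean(scores))

# three ensemble members that mostly agree -> fairly high confidence
m1 = np.array([[1, 1, 0], [0, 0, 0]])
m2 = np.array([[1, 1, 0], [0, 0, 0]])
m3 = np.array([[1, 0, 0], [0, 0, 0]])
score = pairwise_dice_confidence([m1, m2, m3])
```

Ranking images by this score and inspecting the lowest-scoring ones is the risk-coverage style of evaluation the benchmark advocates.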

[CV-14] Learning Visual Prompts for Guiding the Attention of Vision Transformers

链接: https://arxiv.org/abs/2406.03303
作者: Razieh Rezaei,Masoud Jalili Sabet,Jindong Gu,Daniel Rueckert,Philip Torr,Ashkan Khakzar
关键词: infuses visual information, predictions and tasks, specific predictions, Visual prompting infuses, prompting infuses visual
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Short version (4-pages) accepted as a spotlight paper at T4V workshop, CVPR 2024

点击查看摘要

Abstract:Visual prompting infuses visual information into the input image to adapt models toward specific predictions and tasks. Recently, manually crafted markers such as red circles are shown to guide the model to attend to a target region on the image. However, these markers only work on models trained with data containing those markers. Moreover, finding these prompts requires guesswork or prior knowledge of the domain on which the model is trained. This work circumvents manual design constraints by proposing to learn the visual prompts for guiding the attention of vision transformers. The learned visual prompt, added to any input image would redirect the attention of the pre-trained vision transformer to its spatial location on the image. Specifically, the prompt is learned in a self-supervised manner without requiring annotations and without fine-tuning the vision transformer. Our experiments demonstrate the effectiveness of the proposed optimization-based visual prompting strategy across various pre-trained vision encoders.

[CV-15] L-PR: Exploiting LiDAR Fiducial Marker for Unordered Low Overlap Multiview Point Cloud Registration

链接: https://arxiv.org/abs/2406.03298
作者: Yibo Liu,Jinjun Shan,Amaldev Haridevan,Shuo Zhang,Kejian Lin
关键词: Point cloud registration, point clouds, vision and robotics, applications in computer, computer vision
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注: 8 pages

点击查看摘要

Abstract:Point cloud registration is a prerequisite for many applications in computer vision and robotics. Most existing methods focus on pairwise registration of two point clouds with high overlap. Although there have been some methods for low overlap cases, they struggle in degraded scenarios. This paper introduces a novel framework named L-PR, designed to register unordered low overlap multiview point clouds leveraging LiDAR fiducial markers. We refer to them as LiDAR fiducial markers, but they are the same as the popular AprilTag and ArUco markers, thin sheets of paper that do not affect the 3D geometry of the environment. We first propose an improved adaptive threshold marker detection method to provide robust detection results when the viewpoints among point clouds change dramatically. Then, we formulate the unordered multiview point cloud registration problem as a maximum a-posteriori (MAP) problem and develop a framework consisting of two levels of graphs to address it. The first-level graph, constructed as a weighted graph, is designed to efficiently and optimally infer initial values of scan poses from the unordered set. The second-level graph is constructed as a factor graph. By globally optimizing the variables on the graph, including scan poses, marker poses, and marker corner positions, we tackle the MAP problem. We conduct qualitative and quantitative experiments to demonstrate that the proposed method exhibits superiority over competitors in four aspects: registration accuracy, instance reconstruction quality, localization accuracy, and robustness to the degraded scene. To benefit the community, we open-source our method and dataset at this https URL.

[CV-16] Text-to-Image Rectified Flow as Plug-and-Play Priors

链接: https://arxiv.org/abs/2406.03293
作者: Xiaofeng Yang,Cheng Chen,Xulei Yang,Fayao Liu,Guosheng Lin
关键词: Large-scale diffusion models, achieved remarkable performance, Large-scale diffusion, achieved remarkable, models
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Code: this https URL

点击查看摘要

Abstract:Large-scale diffusion models have achieved remarkable performance in generative tasks. Beyond their initial training applications, these models have proven their ability to function as versatile plug-and-play priors. For instance, 2D diffusion models can serve as loss functions to optimize 3D implicit models. Rectified flow, a novel class of generative models, enforces a linear progression from the source to the target distribution and has demonstrated superior performance across various domains. Compared to diffusion-based methods, rectified flow approaches surpass in terms of generation quality and efficiency, requiring fewer inference steps. In this work, we present theoretical and experimental evidence demonstrating that rectified flow based methods offer similar functionalities to diffusion models - they can also serve as effective priors. Besides the generative capabilities of diffusion priors, motivated by the unique time-symmetry properties of rectified flow models, a variant of our method can additionally perform image inversion. Experimentally, our rectified flow-based priors outperform their diffusion counterparts - the SDS and VSD losses - in text-to-3D generation. Our method also displays competitive performance in image inversion and editing.
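
The linear progression that rectified flow enforces can be written down concretely (a generic sketch of the rectified-flow coupling, not the paper's prior-based losses): samples move along the straight line x_t = (1 - t) * x0 + t * x1, and the velocity network regresses the constant direction x1 - x0.

```python
import numpy as np

def rectified_flow_pair(x0, x1, t):
    """Rectified flow couples source x0 and target x1 along a straight
    line: x_t = (1 - t) * x0 + t * x1. The regression target for the
    velocity network at any t is the constant direction x1 - x0."""
    x_t = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0
    return x_t, v_target

x0 = np.zeros(3)                 # e.g., a noise sample
x1 = np.array([2.0, 4.0, 6.0])   # e.g., a data sample
x_t, v = rectified_flow_pair(x0, x1, t=0.5)
```

The straight-line path is what permits few-step inference, and the time-symmetry of the constant velocity field is what the paper exploits for image inversion.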

[CV-17] VWise: A novel benchmark for evaluating scene classification for vehicular applications

链接: https://arxiv.org/abs/2406.03273
作者: Pedro Azevedo,Emanuella Araújo,Gabriel Pierre,Willams de Lima Costa,João Marcelo Teixeira,Valter Ferreira,Roberto Jones,Veronica Teichrieb
关键词: Current datasets, North America, America or Europe, Latin American, Latin American country
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Current datasets for vehicular applications are mostly collected in North America or Europe. Models trained or evaluated on these datasets might suffer from geographical bias when deployed in other regions. Specifically, for scene classification, a highway in a Latin American country differs drastically from an Autobahn, for example, both in design and maintenance levels. We propose VWise, a novel benchmark for road-type classification and scene classification tasks, in addition to tasks focused on external contexts related to vehicular applications in LatAm. We collected over 520 video clips covering diverse urban and rural environments across Latin American countries, annotated with six classes of road types. We also evaluated several state-of-the-art classification models in baseline experiments, obtaining over 84% accuracy. With this dataset, we aim to enhance research on vehicular tasks in Latin America.

[CV-18] Image Copy-Move Forgery Detection and Localization Scheme: How to Avoid Missed Detection and False Alarm

链接: https://arxiv.org/abs/2406.03271
作者: Li Jiang,Zhaowei Lu,Yuebing Gao,Yifan Wang
关键词: illegal purposes due, operation that replaces, illegal purposes, potential semantic, part
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Image copy-move is an operation that replaces one part of the image with another part of the same image, which can be used for illegal purposes due to the potential semantic changes. Recent studies have shown that keypoint-based algorithms achieved excellent and robust localization performance even when small or smooth tampered areas were involved. However, when the input image is low-resolution, most existing keypoint-based algorithms are difficult to generate sufficient keypoints, resulting in more missed detections. In addition, existing algorithms are usually unable to distinguish between Similar but Genuine Objects (SGO) images and tampered images, resulting in more false alarms. This is mainly due to the lack of further verification of the local homography matrix in the forgery localization stage. To tackle these problems, this paper firstly proposes an excessive keypoint extraction strategy to overcome missed detection. Subsequently, a group matching algorithm is used to speed up the matching of excessive keypoints. Finally, a new iterative forgery localization algorithm is introduced to quickly form pixel-level localization results while ensuring a lower false alarm. Extensive experimental results show that our scheme achieves superior performance to state-of-the-art algorithms in overcoming missed detection and false alarm. Our code is available at this https URL.

[CV-19] Deep Generative Models for Proton Zero Degree Calorimeter Simulations in ALICE CERN

链接: https://arxiv.org/abs/2406.03263
作者: Patryk Będkowski,Jan Dubiński,Kamil Deja,Przemysław Rokita
关键词: Large Hadron Collider, Simulating detector responses, Large Hadron, Hadron Collider, Simulating detector
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 8 pages, 3 figures, PP-RAI 2024 conference

点击查看摘要

Abstract:Simulating detector responses is a crucial part of understanding the inner-workings of particle collisions in the Large Hadron Collider at CERN. The current reliance on statistical Monte-Carlo simulations strains CERN’s computational grid, underscoring the urgency for more efficient alternatives. Addressing these challenges, recent proposals advocate for generative machine learning methods. In this study, we present an innovative deep learning simulation approach tailored for the proton Zero Degree Calorimeter in the ALICE experiment. Leveraging a Generative Adversarial Network model with Selective Diversity Increase loss, we directly simulate calorimeter responses. To enhance its capabilities in modeling a broad range of calorimeter response intensities, we expand the SDI-GAN architecture with additional regularization. Moreover, to improve the spatial fidelity of the generated data, we introduce an auxiliary regressor network. Our method offers a significant speedup compared to the traditional Monte-Carlo based approaches.

[CV-20] ADer: A Comprehensive Benchmark for Multi-class Visual Anomaly Detection

链接: https://arxiv.org/abs/2406.03262
作者: Jiangning Zhang,Haoyang He,Zhenye Gan,Qingdong He,Yuxuan Cai,Zhucun Xue,Yabiao Wang,Chengjie Wang,Lei Xie,Yong Liu
关键词: unsupervised learning paradigms, identify anomalous regions, increasing application demand, Visual anomaly detection, anomaly detection aims
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Visual anomaly detection aims to identify anomalous regions in images through unsupervised learning paradigms, with increasing application demand and value in fields such as industrial inspection and medical lesion detection. Despite significant progress in recent years, there is a lack of comprehensive benchmarks to adequately evaluate the performance of various mainstream methods across different datasets under the practical multi-class setting. The absence of standardized experimental setups can lead to potential biases in training epochs, resolution, and metric results, resulting in erroneous conclusions. This paper addresses this issue by proposing a comprehensive visual anomaly detection benchmark, ADer, which is a modular framework that is highly extensible for new methods. The benchmark includes multiple datasets from industrial and medical domains, implementing fifteen state-of-the-art methods and nine comprehensive metrics. Additionally, we have open-sourced the GPU-assisted ADEval package (this https URL) to address the slow evaluation problem of metrics like time-consuming mAU-PRO on large-scale data, significantly reducing evaluation time by more than 1000-fold. Through extensive experimental results, we objectively reveal the strengths and weaknesses of different methods and provide insights into the challenges and future directions of multi-class visual anomaly detection. We hope that ADer will become a valuable resource for researchers and practitioners in the field, promoting the development of more robust and generalizable anomaly detection systems. Full codes have been attached in Appendix and open-sourced at this https URL.

[CV-21] Prompt-based Visual Alignment for Zero-shot Policy Transfer

链接: https://arxiv.org/abs/2406.03250
作者: Haihan Gao,Rui Zhang,Qi Yi,Hantao Yao,Haochen Li,Jiaming Guo,Shaohui Peng,Yunkai Gao,QiCheng Wang,Xing Hu,Yuanbo Wen,Zihao Zhang,Zidong Du,Ling Li,Qi Guo,Yunji Chen
关键词: main obstacles, obstacles to applications, applications in reinforcement, Overfitting, reinforcement learning
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: This paper has been accepted by ICML2024

点击查看摘要

Abstract:Overfitting has become one of the main obstacles to applications of reinforcement learning (RL). Existing methods do not provide an explicit semantic constraint for the feature extractor, hindering the agent from learning a unified cross-domain representation and resulting in performance degradation on unseen domains. Besides, abundant data from multiple domains are needed. To address these issues, in this work, we propose prompt-based visual alignment (PVA), a robust framework to mitigate the detrimental domain bias in the image for zero-shot policy transfer. Inspired by the fact that a Visual-Language Model (VLM) can serve as a bridge to connect both text space and image space, we leverage the semantic information contained in a text sequence as an explicit constraint to train a visual aligner. Thus, the visual aligner can map images from multiple domains to a unified domain and achieve good generalization performance. To better depict semantic information, prompt tuning is applied to learn a sequence of learnable tokens. With explicit constraints of semantic information, PVA can learn a unified cross-domain representation under limited access to cross-domain data and achieves great zero-shot generalization ability in unseen domains. We verify PVA on a vision-based autonomous driving task with the CARLA simulator. Experiments show that the agent generalizes well on unseen domains under limited access to multi-domain data.

[CV-22] Global Clipper: Enhancing Safety and Reliability of Transformer-based Object Detection Models

链接: https://arxiv.org/abs/2406.03229
作者: Qutub Syed Sha,Michael Paulitsch,Karthik Pattabiraman,Korbinian Hagn,Fabian Oboril,Cornelius Buerkle,Kay-Ulrich Scholl,Gereon Hinz,Alois Knoll
关键词: transformer-based object detection, detection models progress, object detection models, expected to grow, object detection
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted at IJCAI-AISafety’24 Workshop

点击查看摘要

Abstract:As transformer-based object detection models progress, their impact in critical sectors like autonomous vehicles and aviation is expected to grow. Soft errors causing bit flips during inference have significantly impacted DNN performance, altering predictions. Traditional range restriction solutions for CNNs fall short for transformers. This study introduces the Global Clipper and Global Hybrid Clipper, effective mitigation strategies specifically designed for transformer-based models. It significantly enhances their resilience to soft errors and reduces faulty inferences to ~ 0%. We also detail extensive testing across over 64 scenarios involving two transformer models (DINO-DETR and Lite-DETR) and two CNN models (YOLOv3 and SSD) using three datasets, totalling approximately 3.3 million inferences, to assess model robustness comprehensively. Moreover, the paper explores unique aspects of attention blocks in transformers and their operational differences from CNNs.
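The range-restriction idea behind such clippers can be illustrated with a minimal sketch: record the global activation bounds during fault-free inference, then clamp activations at run time so a bit-flip-induced outlier cannot propagate. The function names and calibration procedure here are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def calibrate_bounds(activations):
    # Calibration pass: record the global min/max of fault-free activations.
    return float(activations.min()), float(activations.max())

def global_clip(activations, bounds):
    # Clamp every activation into the calibrated range so a corrupted
    # value cannot propagate through later layers.
    lo, hi = bounds
    return np.clip(activations, lo, hi)

# Fault-free activations observed during calibration.
clean = np.array([0.1, 0.5, -0.3, 0.9])
bounds = calibrate_bounds(clean)

# A soft error flips a high-order bit and produces a huge activation.
faulty = np.array([0.1, 1e30, -0.3, 0.9])
restored = global_clip(faulty, bounds)
```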

[CV-23] Interactive Image Selection and Training for Brain Tumor Segmentation Network

链接: https://arxiv.org/abs/2406.03225
作者: Matheus A. Cerqueira,Flávia Sprenger,Bernardo C. A. Teixeira,Alexandre X. Falcão
关键词: Medical image segmentation, Medical image, relevant problem, deep learning, Medical
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 5 pages, 4 figures, and 3 tables

点击查看摘要

Abstract:Medical image segmentation is a relevant problem, with deep learning as its leading approach. However, the need for a high volume of fully annotated images to train massive models can be a problem, especially for applications whose images present great diversity, such as brain tumors, which can occur in different sizes and shapes. In contrast, a recent methodology, Feature Learning from Image Markers (FLIM), has involved an expert in the learning loop, producing small networks that require few images to train the convolutional layers. In this work, we employ an interactive method for image selection and training based on FLIM, exploring the user's knowledge. The results demonstrated that with our methodology, we could choose a small set of images to train the encoder of a U-shaped network, obtaining performance equal to manual selection and even surpassing the same U-shaped network trained with backpropagation and all training images.

[CV-24] Searching Priors Makes Text-to-Video Synthesis Better

链接: https://arxiv.org/abs/2406.03215
作者: Haoran Cheng,Liang Peng,Linxuan Xia,Yuepeng Hu,Hengjia Li,Qinglin Lu,Xiaofei He,Boxi Wu
关键词: brought substantial progress, Significant advancements, brought substantial, substantial progress, Significant
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Significant advancements in video diffusion models have brought substantial progress to the field of text-to-video (T2V) synthesis. However, existing T2V synthesis models struggle to accurately generate complex motion dynamics, leading to a reduction in video realism. One possible solution is to collect massive data and train the model on it, but this would be extremely expensive. To alleviate this problem, in this paper, we reformulate the typical T2V generation process as a search-based generation pipeline. Instead of scaling up the model training, we employ existing videos as the motion prior database. Specifically, we divide the T2V generation process into two steps: (i) For a given prompt input, we search existing text-video datasets to find videos with text labels that closely match the prompt motions. We propose a tailored search algorithm that emphasizes object motion features. (ii) Retrieved videos are processed and distilled into motion priors to fine-tune a pre-trained base T2V model, followed by generating the desired videos using the input prompt. By utilizing the priors gleaned from the searched videos, we enhance the realism of the generated videos' motion. All operations can be finished on a single NVIDIA RTX 4090 GPU. We validate our method against state-of-the-art T2V models across diverse prompt inputs. The code will be public.
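Step (i), retrieving videos whose text labels closely match the prompt, can be caricatured with a toy token-overlap search. The scoring function and dataset layout below are invented for illustration; the paper's tailored search algorithm instead emphasizes object motion features:

```python
def motion_overlap(prompt, caption):
    # Toy relevance score: fraction of prompt tokens found in the caption.
    p, c = set(prompt.lower().split()), set(caption.lower().split())
    return len(p & c) / len(p)

def search_motion_priors(prompt, dataset, top_k=2):
    # Rank candidate videos by how closely their text labels match the prompt,
    # and keep the top_k as motion priors for fine-tuning.
    ranked = sorted(dataset, key=lambda item: motion_overlap(prompt, item[0]),
                    reverse=True)
    return [video for _, video in ranked[:top_k]]

dataset = [
    ("a dog running on the beach", "vid_001"),
    ("a car driving at night", "vid_002"),
    ("a dog jumping over a fence", "vid_003"),
]
priors = search_motion_priors("a dog running fast", dataset)
```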

[CV-25] Identification of Stone Deterioration Patterns with Large Multimodal Models

链接: https://arxiv.org/abs/2406.03207
作者: Daniele Corradetti,Jose Delgado Rodrigues
关键词: stone-based cultural heritage, cultural heritage sites, Large Multimodal Models, stone-based cultural, preserving cultural
类目: Computer Vision and Pattern Recognition (cs.CV); Computational Engineering, Finance, and Science (cs.CE)
*备注: 10 pages, 5 figures, submitted to Journal of Cultural Heritage

点击查看摘要

Abstract:The conservation of stone-based cultural heritage sites is a critical concern for preserving cultural and historical landmarks. With the advent of Large Multimodal Models, such as GPT-4omni (OpenAI), Claude 3 Opus (Anthropic), and Gemini 1.5 Pro (Google), it is becoming increasingly important to define the operational capabilities of these models. In this work, we systematically evaluate the abilities of the main foundational multimodal models to recognise and classify anomalies and deterioration patterns of stone elements that are useful in the practice of conservation and restoration of world heritage. After defining a taxonomy of the main stone deterioration patterns and anomalies, we asked the foundational models to identify a curated selection of 354 highly representative images of stone-built heritage, offering them a careful selection of labels to choose from. The results, which vary depending on the type of pattern, allowed us to identify the strengths and weaknesses of these models in the field of heritage conservation and restoration.

[CV-26] Writing Order Recovery in Complex and Long Static Handwriting

链接: https://arxiv.org/abs/2406.03194
作者: Moises Diaz,Gioele Crispo,Antonio Parziale,Angelo Marcelli,Miguel A. Ferrer
关键词: information for recognizers, powerful source, source of information, order, trajectory
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The order in which a trajectory is executed is a powerful source of information for recognizers. However, there is still no general approach for recovering the trajectory of complex and long handwriting from static images. Complex specimens can result in multiple pen-downs and in a high number of trajectory crossings yielding agglomerations of pixels (also known as clusters). While the scientific literature describes a wide range of approaches for recovering the writing order in handwriting, these approaches nevertheless lack a common evaluation metric. In this paper, we introduce a new system to estimate the order recovery of thinned static trajectories, which allows us to effectively resolve the clusters and select the order of the executed pen-downs. We evaluate how knowing the starting points of the pen-downs affects the quality of the recovered writing. Once the stability and sensitivity of the system are analyzed, we describe a series of experiments with three publicly available databases, showing competitive results in all cases. We expect the proposed system, whose code is made publicly available to the research community, to reduce potential confusion when the orders of complex trajectories are recovered, and this will in turn make the recovered trajectories viable for further applications, such as velocity estimation.

[CV-27] Situation Monitor: Diversity-Driven Zero-Shot Out-of-Distribution Detection using Budding Ensemble Architecture for Object Detection

链接: https://arxiv.org/abs/2406.03188
作者: Qutub Syed,Michael Paulitsch,Korbinian Hagn,Neslihan Kose Cihangir,Kay-Ulrich Scholl,Fabian Oboril,Gereon Hinz,Alois Knoll
关键词: introduce Situation Monitor, Budding Ensemble Architecture, safety-critical machine learning, machine learning applications, Situation Monitor utilizes
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Paper accepted at CVPR SAIAD Workshop

点击查看摘要

Abstract:We introduce Situation Monitor, a novel zero-shot Out-of-Distribution (OOD) detection approach for transformer-based object detection models to enhance reliability in safety-critical machine learning applications such as autonomous driving. The Situation Monitor utilizes the Diversity-based Budding Ensemble Architecture (DBEA) and increases the OOD performance by integrating a diversity loss into the training process on top of the budding ensemble architecture, detecting Far-OOD samples and minimizing false positives on Near-OOD samples. Moreover, utilizing the resulting DBEA increases the model’s OOD performance and improves the calibration of confidence scores, particularly concerning the intersection over union of the detected objects. The DBEA model achieves these advancements with a 14% reduction in trainable parameters compared to the vanilla model. This signifies a substantial improvement in efficiency without compromising the model’s ability to detect OOD instances and calibrate the confidence scores accurately.

[CV-28] Ouroboros3D: Image-to-3D Generation via 3D-aware Recursive Diffusion

链接: https://arxiv.org/abs/2406.03184
作者: Hao Wen,Zehuan Huang,Yaohui Wang,Xinyuan Chen,Yu Qiao,Lu Sheng
关键词: creation methods typically, methods typically involve, generating multi-view images, typically involve, involve a two-stage
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: See our project page at this https URL

点击查看摘要

Abstract:Existing single image-to-3D creation methods typically involve a two-stage process, first generating multi-view images, and then using these images for 3D reconstruction. However, training these two stages separately leads to significant data bias in the inference phase, thus affecting the quality of reconstructed results. We introduce a unified 3D generation framework, named Ouroboros3D, which integrates diffusion-based multi-view image generation and 3D reconstruction into a recursive diffusion process. In our framework, these two modules are jointly trained through a self-conditioning mechanism, allowing them to adapt to each other's characteristics for robust inference. During the multi-view denoising process, the multi-view diffusion model uses the 3D-aware maps rendered by the reconstruction module at the previous timestep as additional conditions. The recursive diffusion framework with 3D-aware feedback unites the entire process and improves geometric consistency. Experiments show that our framework outperforms the separation of these two stages and existing methods that combine them at the inference phase. Project page: this https URL

[CV-29] Geometric Localization of Homology Cycles

链接: https://arxiv.org/abs/2406.03183
作者: Amritendu Dhar,Vijay Natarajan,Abhishek Rathod
关键词: Computing, Computing an optimal, homology class, homology localization problem, homology
类目: Computational Geometry (cs.CG); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
*备注: To Appear in CCCG 2024 : Proc. 36th Canadian Conference on Computational Geometry

点击查看摘要

Abstract:Computing an optimal cycle in a given homology class, also referred to as the homology localization problem, is known to be an NP-hard problem in general. Furthermore, there is currently no known optimality criterion that localizes classes geometrically and admits a stability property under the setting of persistent homology. We present a geometric optimization of the cycles that is computable in polynomial time and is stable in an approximate sense. Tailoring our search criterion to different settings, we obtain various optimization problems like optimal homologous cycle, minimum homology basis, and minimum persistent homology basis. In practice, the (trivial) exact algorithm is computationally expensive despite having a worst case polynomial runtime. Therefore, we design approximation algorithms for the above problems and study their performance experimentally. These algorithms have reasonable runtimes for moderate sized datasets and the cycles computed by these algorithms are consistently of high quality as demonstrated via experiments on multiple datasets.

[CV-30] FAPNet: An Effective Frequency Adaptive Point-based Eye Tracker

链接: https://arxiv.org/abs/2406.03177
作者: Xiaopeng Lin,Hongwei Ren,Bojun Cheng
关键词: Eye tracking, event-based eye tracking, Eye Tracking Challenge, Eye, crucial for human-computer
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by CVPRW 2024 (AIS)

点击查看摘要

Abstract:Eye tracking is crucial for human-computer interaction in different domains. Conventional cameras encounter challenges such as power consumption and image quality during different eye movements, prompting the need for advanced solutions with ultra-fast, low-power, and accurate eye trackers. Event cameras, fundamentally designed to capture information about moving objects, exhibit low power consumption and high temporal resolution. This positions them as an alternative to traditional cameras in the realm of eye tracking. Nevertheless, existing event-based eye tracking networks neglect the pivotal sparse and fine-grained temporal information in events, resulting in unsatisfactory performance. Moreover, the energy-efficient features are further compromised by the use of excessively complex models, hindering efficient deployment on edge devices. In this paper, we utilize Point Cloud as the event representation to harness the high temporal resolution and sparse characteristics of events in eye tracking tasks. We rethink the point-based architecture PEPNet with preprocessing the long-term relationships between samples, leading to the innovative design of FAPNet. A frequency adaptive mechanism is designed to realize adaptive tracking according to the speed of the pupil movement and the Inter Sample LSTM module is introduced to utilize the temporal correlation between samples. In the Event-based Eye Tracking Challenge, we utilize vanilla PEPNet, which is the former work to achieve the p_10 accuracy of 97.95%. On the SEET synthetic dataset, FAPNet can achieve state-of-the-art while consuming merely 10% of the PEPNet’s computational resources. Notably, the computational demand of FAPNet is independent of the sensor’s spatial resolution, enhancing its applicability on resource-limited edge devices.

[CV-31] MMCL: Boosting Deformable DETR-Based Detectors with Multi-Class Min-Margin Contrastive Learning for Superior Prohibited Item Detection

链接: https://arxiv.org/abs/2406.03176
作者: Mingyuan Li,Tong Jia,Hui Lu,Bowen Ma,Hao Wang,Dongyue Chen
关键词: X-ray images lead, Min-Margin Contrastive Learning, Multi-Class Inter-Class Exclusion, Prohibited Item detection, Multi-Class Min-Margin Contrastive
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 14 pages, 6 figures

点击查看摘要

Abstract:Prohibited item detection in X-ray images is one of the most effective security inspection methods. However, unlike natural light images, the unique overlapping phenomena in X-ray images lead to the coupling of foreground and background features, thereby lowering the accuracy of general object detectors. Therefore, we propose a Multi-Class Min-Margin Contrastive Learning (MMCL) method that, by clarifying the category semantic information of content queries under the deformable DETR architecture, aids the model in extracting specific category foreground information from coupled features. Specifically, after grouping content queries by the number of categories, we employ the Multi-Class Inter-Class Exclusion (MIE) loss to push apart content queries from different groups. Concurrently, the Intra-Class Min-Margin Clustering (IMC) loss is utilized to attract content queries within the same group, while ensuring the preservation of necessary disparity. As training proceeds, the inherent Hungarian matching of the model progressively strengthens the alignment between each group of queries and the semantic features of their corresponding category of objects. This evolving coherence ensures a deep-seated grasp of category characteristics, consequently bolstering the anti-overlapping detection capabilities of models. MMCL is versatile and can be easily plugged into any deformable DETR-based model with dozens of lines of code. Extensive experiments on the PIXray and OPIXray datasets demonstrate that MMCL significantly enhances the performance of various state-of-the-art models without increasing complexity. The code has been released at this https URL.
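A hypothetical distance-based rendition of the two losses helps make the push/pull behaviour concrete. The pairwise Euclidean form and margin values below are assumptions for illustration; the paper's exact formulation may differ:

```python
import numpy as np

def mie_loss(queries, groups, margin=1.0):
    # Inter-class exclusion: penalize pairs from different groups whose
    # embeddings are closer than the margin (push them apart).
    loss, n = 0.0, 0
    for i in range(len(queries)):
        for j in range(i + 1, len(queries)):
            if groups[i] != groups[j]:
                d = np.linalg.norm(queries[i] - queries[j])
                loss += max(0.0, margin - d)
                n += 1
    return loss / max(n, 1)

def imc_loss(queries, groups, min_margin=0.1):
    # Intra-class min-margin clustering: pull same-group queries together,
    # but never closer than min_margin, preserving some disparity.
    loss, n = 0.0, 0
    for i in range(len(queries)):
        for j in range(i + 1, len(queries)):
            if groups[i] == groups[j]:
                d = np.linalg.norm(queries[i] - queries[j])
                loss += max(0.0, d - min_margin)
                n += 1
    return loss / max(n, 1)

# Three toy content queries: two in group 0, one in group 1.
q = np.array([[0.0, 0.0], [0.5, 0.0], [0.6, 0.0]])
g = [0, 0, 1]
```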

[CV-32] Dynamic 3D Gaussian Fields for Urban Areas

链接: https://arxiv.org/abs/2406.03175
作者: Tobias Fischer,Jonas Kulhanek,Samuel Rota Bulò,Lorenzo Porzi,Marc Pollefeys,Peter Kontschieder
关键词: novel-view synthesis, rendering speeds, speeds, dynamic urban areas, NVS
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Project page is available at this https URL

点击查看摘要

Abstract:We present an efficient neural 3D scene representation for novel-view synthesis (NVS) in large-scale, dynamic urban areas. Existing works are not well suited for applications like mixed-reality or closed-loop simulation due to their limited visual quality and non-interactive rendering speeds. Recently, rasterization-based approaches have achieved high-quality NVS at impressive speeds. However, these methods are limited to small-scale, homogeneous data, i.e. they cannot handle severe appearance and geometry variations due to weather, season, and lighting and do not scale to larger, dynamic areas with thousands of images. We propose 4DGF, a neural scene representation that scales to large-scale dynamic urban areas, handles heterogeneous input data, and substantially improves rendering speeds. We use 3D Gaussians as an efficient geometry scaffold while relying on neural fields as a compact and flexible appearance model. We integrate scene dynamics via a scene graph at global scale while modeling articulated motions on a local level via deformations. This decomposed approach enables flexible scene composition suitable for real-world applications. In experiments, we surpass the state-of-the-art by over 3 dB in PSNR and more than 200 times in rendering speed.

[CV-33] Sample-specific Masks for Visual Reprogramming-based Prompting

链接: https://arxiv.org/abs/2406.03150
作者: Chengyi Cai,Zesheng Ye,Lei Feng,Jianzhong Qi,Feng Liu
关键词: medical data prediction, tuning considerable parameters, Visual reprogramming, small-scale pattern added, classifier on ImageNet
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Visual reprogramming (VR) is a prompting technique that aims to re-purpose a pre-trained model (e.g., a classifier on ImageNet) to target tasks (e.g., medical data prediction) by learning a small-scale pattern added into input images instead of tuning considerable parameters within the model. The location of the pattern within input samples is usually determined by a pre-defined mask shared across all samples. In this paper, we show that the shared mask potentially limits VR’s generalization and increases its approximation error due to the lack of sample-level adaptation. Motivated by this finding, we design a new framework for VR called sample-specific multi-channel masks (SMM). Specifically, SMM employs a lightweight ConvNet and patch-wise interpolation to generate sample-specific three-channel masks instead of a shared and pre-defined mask. Since we generate different masks for individual samples, SMM is theoretically shown to reduce approximation error for the target tasks compared with existing state-of-the-art VR methods. We also empirically demonstrate its performance gain on both ResNet and ViT. The success of SMM further highlights the broader applicability of VR in leveraging the latent knowledge of pre-trained models for various target tasks. Our code is available at this https URL.
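The core mechanics of generating a sample-specific mask at low resolution and interpolating it patch-wise to image size can be sketched as follows. Nearest-neighbour upsampling stands in for the paper's lightweight-ConvNet-plus-interpolation pipeline, and all names are illustrative:

```python
import numpy as np

def upsample_mask(coarse, out_h, out_w):
    # Patch-wise (nearest-neighbour) interpolation of a coarse per-sample
    # mask up to full image resolution, channel by channel.
    c, h, w = coarse.shape
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return coarse[:, rows][:, :, cols]

def reprogram(image, pattern, coarse_mask):
    # Add the learned reprogramming pattern only where this sample's
    # mask allows it.
    mask = upsample_mask(coarse_mask, image.shape[1], image.shape[2])
    return image + mask * pattern

# A 3-channel 4x4 image, a learned pattern, and a per-sample 2x2 coarse mask.
img = np.zeros((3, 4, 4))
pat = np.ones((3, 4, 4))
coarse = np.array([[[1.0, 0.0], [0.0, 1.0]]] * 3)
out = reprogram(img, pat, coarse)
```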

[CV-34] Tiny models from tiny data: Textual and null-text inversion for few-shot distillation

链接: https://arxiv.org/abs/2406.03146
作者: Erik Landolsi,Fredrik Kahl
关键词: Few-shot image classification, involves classifying images, image classification involves, classification involves classifying, image classification
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 21 pages (9 main pages + references and appendix)

点击查看摘要

Abstract:Few-shot image classification involves classifying images using very few training examples. Recent vision foundation models show excellent few-shot transfer abilities, but are large and slow at inference. Using knowledge distillation, the capabilities of high-performing but slow models can be transferred to tiny, efficient models. However, common distillation methods require a large set of unlabeled data, which is not available in the few-shot setting. To overcome this lack of data, there has been a recent interest in using synthetic data. We expand on this work by presenting a novel diffusion model inversion technique (TINT) combining the diversity of textual inversion with the specificity of null-text inversion. Using this method in a few-shot distillation pipeline leads to state-of-the-art accuracy among small student models on popular benchmarks, while being significantly faster than prior work. This allows us to push even tiny models to high accuracy using only a tiny application-specific dataset, albeit relying on extra data for pre-training. Popular few-shot benchmarks involve evaluation over a large number of episodes, which is computationally cumbersome for methods involving synthetic data generation. Therefore, we also present a theoretical analysis on how the variance of the accuracy estimator depends on the number of episodes and query examples, and use these results to lower the computational effort required for method evaluation. In addition, to further motivate the use of generative models in few-shot distillation, we demonstrate that our method performs better compared to training on real data mined from the dataset used to train the diffusion model. Source code will be made available at this https URL. 
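The abstract's point that the accuracy estimator's variance depends on both the number of episodes and the number of query examples can be illustrated with a generic two-level sketch. This is not the paper's derivation: `task_var` models between-episode difficulty variation, and per-query outcomes are treated as Bernoulli with mean accuracy `p_mean`:

```python
import math

def accuracy_stderr(p_mean, task_var, n_episodes, n_queries):
    # Variance of the mean accuracy over episodes decomposes into a
    # between-episode term (task_var) and a within-episode binomial term
    # (p(1-p)/Q); averaging over E episodes divides both by E.
    var = (task_var + p_mean * (1.0 - p_mean) / n_queries) / n_episodes
    return math.sqrt(var)

# Doubling the query count shrinks only the within-episode term;
# doubling the episode count shrinks both.
se_base = accuracy_stderr(0.8, 0.01, 600, 15)
se_more_queries = accuracy_stderr(0.8, 0.01, 600, 30)
se_more_episodes = accuracy_stderr(0.8, 0.01, 1200, 15)
```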

[CV-35] ZeroPur: Succinct Training-Free Adversarial Purification

链接: https://arxiv.org/abs/2406.03143
作者: Xiuli Bi,Zonglin Yang,Bo Liu,Xiaodong Cun,Chi-Man Pun,Pietro Lio,Bin Xiao
关键词: unseen adversarial attacks, victim classifiers, kind of defense, defense technique, defend various unseen
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
*备注: 16 pages, 5 figures, under review

点击查看摘要

Abstract:Adversarial purification is a kind of defense technique that can defend various unseen adversarial attacks without modifying the victim classifier. Existing methods often depend on external generative models or cooperation between auxiliary functions and victim classifiers. However, retraining generative models, auxiliary functions, or victim classifiers relies on the domain of the fine-tuned dataset and is computation-consuming. In this work, we suppose that adversarial images are outliers of the natural image manifold and the purification process can be considered as returning them to this manifold. Following this assumption, we present a simple adversarial purification method without further training to purify adversarial images, called ZeroPur. ZeroPur contains two steps: given an adversarial example, Guided Shift obtains the shifted embedding of the adversarial example by the guidance of its blurred counterparts; after that, Adaptive Projection constructs a directional vector by this shifted embedding to provide momentum, projecting adversarial images onto the manifold adaptively. ZeroPur is independent of external models and requires no retraining of victim classifiers or auxiliary functions, relying solely on victim classifiers themselves to achieve purification. Extensive experiments on three datasets (CIFAR-10, CIFAR-100, and ImageNet-1K) using various classifier architectures (ResNet, WideResNet) demonstrate that our method achieves state-of-the-art robust performance. The code will be publicly available.
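A toy pixel-space caricature of the two steps: ZeroPur actually steers the classifier's embedding of the image using its blurred counterpart, which is omitted here, and the box blur, step count, and step size are assumptions:

```python
import numpy as np

def box_blur(img, k=3):
    # Parameter-free guidance: a simple box blur stands in for the
    # "blurred counterpart" that steers the purification.
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.zeros_like(img, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = padded[i:i + k, j:j + k].mean()
    return out

def zeropur_sketch(adv, n_steps=3, alpha=0.5):
    # Guided Shift: iteratively move the adversarial input toward its
    # blurred counterpart; the accumulated shift direction plays the role
    # of the projection back toward the natural-image manifold.
    x = adv.astype(float)
    for _ in range(n_steps):
        x = x + alpha * (box_blur(x) - x)
    return x

# A high-frequency (adversarial-noise-like) checkerboard gets smoothed out.
adv = (np.indices((8, 8)).sum(axis=0) % 2).astype(float)
purified = zeropur_sketch(adv)
```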

[CV-36] Enhanced Automotive Object Detection via RGB-D Fusion in a DiffusionDet Framework

链接: https://arxiv.org/abs/2406.03129
作者: Eliraz Orfaig,Inna Stainvas,Igal Bilik
关键词: Vision-based autonomous driving, autonomous driving requires, driving requires reliable, Vision-based autonomous, efficient object detection
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Vision-based autonomous driving requires reliable and efficient object detection. This work proposes a DiffusionDet-based framework that exploits data fusion from the monocular camera and depth sensor to provide the RGB and depth (RGB-D) data. Within this framework, ground truth bounding boxes are randomly reshaped as part of the training phase, allowing the model to learn the reverse diffusion process of noise addition. The system methodically enhances a randomly generated set of boxes at the inference stage, guiding them toward accurate final detections. By integrating the textural and color features from RGB images with the spatial depth information from the LiDAR sensors, the proposed framework employs a feature fusion that substantially enhances object detection of automotive targets. The 2.3 AP gain in detecting automotive targets is achieved through comprehensive experiments using the KITTI dataset. Specifically, the improved performance of the proposed approach in detecting small objects is demonstrated.

[CV-37] VQUNet: Vector Quantization U-Net for Defending Adversarial Attacks by Regularizing Unwanted Noise

链接: https://arxiv.org/abs/2406.03117
作者: Zhixun He,Mukesh Singhal
关键词: Deep Neural Networks, developing Artificial Intelligence, Deep Neural, Neural Networks, Artificial Intelligence
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 8 pages, 6 figures

点击查看摘要

Abstract:Deep Neural Networks (DNN) have become a promising paradigm when developing Artificial Intelligence (AI) and Machine Learning (ML) applications. However, DNN applications are vulnerable to fake data that are crafted with adversarial attack algorithms. Under adversarial attacks, the prediction accuracy of DNN applications suffers, making them unreliable. In order to defend against adversarial attacks, we introduce a novel noise-reduction procedure, Vector Quantization U-Net (VQUNet), to reduce adversarial noise and reconstruct data with high fidelity. VQUNet features a discrete latent representation learning through a multi-scale hierarchical structure for both noise reduction and data reconstruction. The empirical experiments show that the proposed VQUNet provides better robustness to the target DNN models, and it outperforms other state-of-the-art noise-reduction-based defense methods under various adversarial attacks for both Fashion-MNIST and CIFAR10 datasets. When there is no adversarial attack, the defense method has less than 1% accuracy degradation for both datasets.
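The discrete bottleneck at the heart of vector quantization, which absorbs small adversarial perturbations by snapping latent vectors to their nearest codebook entries, can be sketched as follows (the codebook and latents are toy values, not from the paper):

```python
import numpy as np

def vector_quantize(latents, codebook):
    # Snap each latent vector to its nearest codebook entry; small
    # off-manifold perturbations are absorbed by the discrete bottleneck.
    d = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = d.argmin(axis=1)
    return codebook[indices], indices

codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
# Slightly perturbed latents still map to the clean codes.
latents = np.array([[0.1, -0.05], [0.9, 1.2]])
quantized, idx = vector_quantize(latents, codebook)
```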

[CV-38] Enhancing 3D Lane Detection and Topology Reasoning with 2D Lane Priors

链接: https://arxiv.org/abs/2406.03105
作者: Han Li,Zehao Huang,Zitian Wang,Wenge Rong,Naiyan Wang,Si Liu
关键词: autonomous driving scenarios, driving scenarios, detecting the accurate, essential tasks, tasks in autonomous
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 20 pages, 9 figures, 6 tables

点击查看摘要

Abstract:3D lane detection and topology reasoning are essential tasks in autonomous driving scenarios, requiring not only detecting the accurate 3D coordinates of lane lines, but also reasoning about the relationship between lanes and traffic elements. Current vision-based methods, whether explicitly constructing BEV features or not, all establish the lane anchors/queries in 3D space while ignoring the 2D lane priors. In this study, we propose Topo2D, a novel framework based on Transformer, leveraging 2D lane instances to initialize 3D queries and 3D positional embeddings. Furthermore, we explicitly incorporate 2D lane features into the recognition of topology relationships among lane centerlines and between lane centerlines and traffic elements. Topo2D achieves 44.5% OLS on the multi-view topology reasoning benchmark OpenLane-V2 and 62.6% F-Score on the single-view 3D lane detection benchmark OpenLane, exceeding the performance of existing state-of-the-art methods.

[CV-39] EgoSurgery-Tool: A Dataset of Surgical Tool and Hand Detection from Egocentric Open Surgery Videos

链接: https://arxiv.org/abs/2406.03095
作者: Ryo Fujii,Hideo Saito,Hiroyuki Kajita
关键词: Surgical tool, Surgical, fundamental task, open surgery videos, open surgery
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Surgical tool detection is a fundamental task for understanding egocentric open surgery videos. However, detecting surgical tools presents significant challenges due to their highly imbalanced class distribution, similar shapes and similar textures, and heavy occlusion. The lack of a comprehensive large-scale dataset compounds these challenges. In this paper, we introduce EgoSurgery-Tool, an extension of the existing EgoSurgery-Phase dataset, which contains real open surgery videos captured using an egocentric camera attached to the surgeon’s head, along with phase annotations. EgoSurgery-Tool has been densely annotated with surgical tools and comprises over 49K surgical tool bounding boxes across 15 categories, constituting a large-scale surgical tool detection dataset. EgoSurgery-Tool also provides annotations for hand detection with over 46K hand-bounding boxes, capturing hand-object interactions that are crucial for understanding activities in egocentric open surgery. EgoSurgery-Tool is superior to existing datasets due to its larger scale, greater variety of surgical tools, more annotations, and denser scenes. We conduct a comprehensive analysis of EgoSurgery-Tool using nine popular object detectors to assess their effectiveness in both surgical tool and hand detection. The dataset will be released at this https URL.

[CV-40] Lossless Image Compression Using Multi-level Dictionaries: Binary Images

链接: https://arxiv.org/abs/2406.03087
作者: Samar Agnihotri,Renu Rameshan,Ritwik Ghosal
关键词: Lossless image compression, image compression, Lossless image, information loss compared, compression
类目: Information Theory (cs.IT); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 11 pages, 7 figures, and 5 tables

点击查看摘要

Abstract:Lossless image compression is required in various applications to reduce storage or transmission costs of images, while requiring the reconstructed images to have zero information loss compared to the original. Existing lossless image compression methods either have simple design but poor compression performance, or complex design, better performance, but with no performance guarantees. In our endeavor to develop a lossless image compression method with low complexity and guaranteed performance, we argue that compressibility of a color image is essentially derived from the patterns in its spatial structure, intensity variations, and color variations. Thus, we divide the overall design of a lossless image compression scheme into three parts that exploit corresponding redundancies. We further argue that the binarized version of an image captures its fundamental spatial structure and in this work, we propose a scheme for lossless compression of binary images. The proposed scheme first learns dictionaries of 16×16, 8×8, 4×4, and 2×2 square pixel patterns from various datasets of binary images. It then uses these dictionaries to encode binary images. These dictionaries have various interesting properties that are further exploited to construct an efficient scheme. Our preliminary results show that the proposed scheme consistently outperforms existing conventional and learning based lossless compression approaches, and provides, on average, as much as 1.5× better performance than a common general purpose lossless compression scheme (WebP), more than 3× better performance than a state of the art learning based scheme, and better performance than a specialized scheme for binary image compression (JBIG2).
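
The core dictionary-coding idea can be sketched in a few lines of Python. This is a toy illustration, not the paper's scheme: it uses a single 2×2 block size, a made-up image, and no entropy coding of the indices.

```python
# Toy illustration of dictionary-based lossless coding of a binary image:
# replace repeated square pixel patterns with indices into a learned dictionary.

def encode_binary_image(image, block=2):
    """Split a binary image into block x block tiles; return (dictionary, indices)."""
    h, w = len(image), len(image[0])
    assert h % block == 0 and w % block == 0, "toy sketch: sizes must divide evenly"
    dictionary, indices = [], []
    for r in range(0, h, block):
        for c in range(0, w, block):
            tile = tuple(tuple(image[r + i][c:c + block]) for i in range(block))
            if tile not in dictionary:
                dictionary.append(tile)
            indices.append(dictionary.index(tile))
    return dictionary, indices

def decode_binary_image(dictionary, indices, h, w, block=2):
    """Inverse of encode_binary_image: rebuild the image with zero loss."""
    image = [[0] * w for _ in range(h)]
    tiles_per_row = w // block
    for k, idx in enumerate(indices):
        r, c = (k // tiles_per_row) * block, (k % tiles_per_row) * block
        for i in range(block):
            for j in range(block):
                image[r + i][c + j] = dictionary[idx][i][j]
    return image

img = [[0, 0, 1, 1],
       [0, 0, 1, 1],
       [0, 0, 0, 0],
       [0, 0, 0, 0]]
d, idx = encode_binary_image(img)
assert decode_binary_image(d, idx, 4, 4) == img  # lossless round trip
```

Compression comes from the dictionary being much smaller than the number of tiles when patterns repeat; here 4 tiles are described by only 2 dictionary entries.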

[CV-41] Exploiting LMM-based knowledge for image classification tasks

链接: https://arxiv.org/abs/2406.03071
作者: Maria Tzelepi,Vasileios Mezaris
关键词: Large Multimodal Models, Large Multimodal, encoded in Large, image classification tasks, address image classification
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
*备注: Accepted for publication, 25th Int. Conf. on Engineering Applications of Neural Networks (EANN/EAAAI 2024), Corfu, Greece, June 2024. This is the “submitted manuscript”

点击查看摘要

Abstract:In this paper we address image classification tasks leveraging knowledge encoded in Large Multimodal Models (LMMs). More specifically, we use the MiniGPT-4 model to extract semantic descriptions for the images, in a multimodal prompting fashion. In the current literature, vision language models such as CLIP, among other approaches, are utilized as feature extractors, using only the image encoder, for solving image classification tasks. In this paper, we propose to additionally use the text encoder to obtain the text embeddings corresponding to the MiniGPT-4-generated semantic descriptions. Thus, we use both the image and text embeddings for solving the image classification task. The experimental evaluation on three datasets validates the improved classification performance achieved by exploiting LMM-based knowledge.
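
The fusion the authors describe, using both image and text embeddings for classification, can be sketched abstractly. This is a toy illustration: the embeddings, centroids, and class names below are made up, and no actual CLIP or MiniGPT-4 models are involved.

```python
# Toy sketch of combining an image embedding with the text embedding of an
# LMM-generated description, then classifying with nearest centroid.

import math

def concat_and_normalize(img_emb, txt_emb):
    v = img_emb + txt_emb            # feature concatenation
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def nearest_centroid(feature, centroids):
    """Return the label whose centroid has the highest dot product with feature."""
    return max(centroids, key=lambda lbl: sum(a * b for a, b in zip(feature, centroids[lbl])))

img_emb = [0.9, 0.1]                  # hypothetical image-encoder output
txt_emb = [0.2, 0.8]                  # hypothetical text-encoder output for the description
feat = concat_and_normalize(img_emb, txt_emb)

centroids = {"cat": [1.0, 0.0, 0.0, 1.0], "car": [0.0, 1.0, 1.0, 0.0]}
print(nearest_centroid(feat, centroids))
```

In the paper's setting the two halves would come from an image encoder and a text encoder respectively; the sketch only shows why the concatenated feature can carry complementary signal.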

[CV-42] A-Bench: Are LMMs Masters at Evaluating AI-generated Images?

链接: https://arxiv.org/abs/2406.03070
作者: Zicheng Zhang,Haoning Wu,Chunyi Li,Yingjie Zhou,Wei Sun,Xiongkuo Min,Zijian Chen,Xiaohong Liu,Weisi Lin,Guangtao Zhai
关键词: assess AI-generated images, efficiently assess AI-generated, AI-generated images, remains a critical, accurately and efficiently
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:How to accurately and efficiently assess AI-generated images (AIGIs) remains a critical challenge for generative models. Given the high costs and extensive time commitments required for user studies, many researchers have turned towards employing large multi-modal models (LMMs) as AIGI evaluators, the precision and validity of which are still questionable. Furthermore, traditional benchmarks often utilize mostly natural-captured content rather than AIGIs to test the abilities of LMMs, leading to a noticeable gap for AIGIs. Therefore, we introduce A-Bench in this paper, a benchmark designed to diagnose whether LMMs are masters at evaluating AIGIs. Specifically, A-Bench is organized under two key principles: 1) Emphasizing both high-level semantic understanding and low-level visual quality perception to address the intricate demands of AIGIs. 2) Various generative models are utilized for AIGI creation, and various LMMs are employed for evaluation, which ensures a comprehensive validation scope. Ultimately, 2,864 AIGIs from 16 text-to-image models are sampled, each paired with question-answers annotated by human experts, and tested across 18 leading LMMs. We hope that A-Bench will significantly enhance the evaluation process and promote the generation quality for AIGIs. The benchmark is available at this https URL.

[CV-43] Decision Boundary-aware Knowledge Consolidation Generates Better Instance-Incremental Learner

链接: https://arxiv.org/abs/2406.03065
作者: Qiang Nie,Weifu Fu,Yuhuan Lin,Jialin Li,Yifeng Zhou,Yong Liu,Lei Zhu,Chengjie Wang
关键词: Instance-incremental learning, IIL, IIL setting, learning continually, Instance-incremental
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: 14 pages

点击查看摘要

Abstract:Instance-incremental learning (IIL) focuses on learning continually with data of the same classes. Compared to class-incremental learning (CIL), IIL is seldom explored because it suffers less from catastrophic forgetting (CF). However, besides retaining knowledge, in real-world deployment scenarios where the class space is always predefined, continual and cost-effective model promotion with the potential unavailability of previous data is a more essential demand. Therefore, we first define a new and more practical IIL setting as promoting the model’s performance besides resisting CF with only new observations. Two issues have to be tackled in the new IIL setting: 1) the notorious catastrophic forgetting because of no access to old data, and 2) broadening the existing decision boundary to new observations because of concept drift. To tackle these problems, our key insight is to moderately broaden the decision boundary to failure cases while retaining the old boundary. Hence, we propose a novel decision boundary-aware distillation method that consolidates knowledge into the teacher to ease the student’s learning of new knowledge. We also establish benchmarks on the existing datasets Cifar-100 and ImageNet. Notably, extensive experiments demonstrate that the teacher model can be a better incremental learner than the student model, which overturns previous knowledge distillation-based methods that treat the student as the main role.

[CV-44] Adapter-X: A Novel General Parameter-Efficient Fine-Tuning Framework for Vision

链接: https://arxiv.org/abs/2406.03051
作者: Minglei Li,Peng Ye,Yongqi Huang,Lin Zhang,Tao Chen,Tong He,Jiayuan Fan,Wanli Ouyang
关键词: foundation models continue, popularity and size, Parameter-efficient fine-tuning, increasingly important, important as foundation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Parameter-efficient fine-tuning (PEFT) has become increasingly important as foundation models continue to grow in both popularity and size. Adapters have been particularly well received due to their potential for parameter reduction and adaptability across diverse tasks. However, striking a balance between high efficiency and robust generalization across tasks remains a challenge for adapter-based methods. We analyze existing methods and find that: 1) parameter sharing is the key to reducing redundancy; 2) more tunable parameters, dynamic allocation, and block-specific design are keys to improving performance. Unfortunately, no previous work considers all these factors. Inspired by this insight, we introduce a novel framework named Adapter-X. First, a Sharing Mixture of Adapters (SMoA) module is proposed to fulfill token-level dynamic allocation, increased tunable parameters, and inter-block sharing at the same time. Second, some block-specific designs like Prompt Generator (PG) are introduced to further enhance the ability of adaptation. Extensive experiments across 2D image and 3D point cloud modalities demonstrate that Adapter-X represents a significant milestone as it is the first to outperform full fine-tuning in both 2D image and 3D point cloud modalities with significantly fewer parameters, i.e., only 0.20% and 1.88% of original trainable parameters for 2D and 3D classification tasks. Our code will be publicly available.
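
The parameter-efficiency claim can be made concrete with back-of-the-envelope arithmetic for a generic bottleneck adapter. The dimensions below are illustrative (roughly ViT-Base scale), not Adapter-X's actual configuration, and the resulting fraction is not expected to reproduce the reported 0.20% figure.

```python
# Back-of-the-envelope arithmetic for adapter-style PEFT: what fraction of a
# backbone's parameters does a bottleneck adapter add?

def adapter_params(d_model, bottleneck):
    """Down-projection + up-projection (weights and biases) of one adapter."""
    return d_model * bottleneck + bottleneck + bottleneck * d_model + d_model

backbone_params = 86_000_000          # roughly ViT-Base scale, for illustration
per_layer = adapter_params(d_model=768, bottleneck=16)
total_adapter = per_layer * 12        # one adapter per transformer block

fraction = total_adapter / backbone_params
print(f"tunable fraction: {fraction:.2%}")
```

Even this naive configuration trains well under 1% of the backbone's weights, which is why adapter sharing and allocation (the paper's focus) matter more than raw adapter size.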

[CV-45] Giving each task what it needs – leveraging structured sparsity for tailored multi-task learning

链接: https://arxiv.org/abs/2406.03048
作者: Richa Upadhyay,Ronald Phlypo,Rajkumar Saini,Marcus Liwicki
关键词: Multi-task Learning, distinct feature representations, demands distinct feature, task demands distinct, LOMT models
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Every task demands distinct feature representations, ranging from low-level to high-level attributes, so it is vital to address the specific needs of each task, especially in the Multi-task Learning (MTL) framework. This work, therefore, introduces Layer-Optimized Multi-Task (LOMT) models that utilize structured sparsity to refine feature selection for individual tasks and enhance the performance of all tasks in a multi-task scenario. Structured or group sparsity systematically eliminates parameters from trivial channels and, eventually, entire layers within a convolution neural network during training. Consequently, the remaining layers provide the most optimal features for a given task. In this two-step approach, we subsequently leverage this sparsity-induced optimal layer information to build the LOMT models by connecting task-specific decoders to these strategically identified layers, deviating from conventional approaches that uniformly connect decoders at the end of the network. This tailored architecture optimizes the network, focusing on essential features while reducing redundancy. We validate the efficacy of the proposed approach on two datasets, i.e., NYU-v2 and CelebAMask-HD, for multiple heterogeneous tasks. A detailed performance analysis of the LOMT models, in contrast to the conventional MTL models, reveals that the LOMT models outperform for most task combinations. The excellent qualitative and quantitative outcomes highlight the effectiveness of employing structured sparsity for optimal layer (or feature) selection.
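
The structured (group) sparsity mechanism, scoring whole channels and dropping weak ones, can be sketched as follows. This is a toy illustration with made-up weights and threshold, not the authors' training procedure.

```python
# Toy sketch of structured (group) sparsity: score each output channel of a
# layer by the L2 norm of its weight group, and drop channels below a threshold.

import math

def channel_norms(weights):
    """weights: list of per-channel weight groups (flattened filters)."""
    return [math.sqrt(sum(w * w for w in group)) for group in weights]

def surviving_channels(weights, threshold):
    return [i for i, n in enumerate(channel_norms(weights)) if n >= threshold]

layer = [
    [0.5, -0.4, 0.3],    # channel 0: informative
    [0.01, 0.0, -0.02],  # channel 1: near-zero -> prunable
    [-0.6, 0.2, 0.1],    # channel 2: informative
]
print(surviving_channels(layer, threshold=0.1))
```

In training, a group penalty drives entire channel groups toward zero so this thresholding removes whole channels (or layers) rather than scattered weights, which is what lets the LOMT models pick per-task layers.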

[CV-46] Follow-Your-Pose v2: Multiple-Condition Guided Character Image Animation for Stable Pose Control

链接: https://arxiv.org/abs/2406.03035
作者: Jingyun Xue,Hongfa Wang,Qi Tian,Yue Ma,Andong Wang,Zhiyuan Zhao,Shaobo Min,Wenzhe Zhao,Kaihao Zhang,Heung-Yeung Shum,Wei Liu,Mengyang Liu,Wenhan Luo
关键词: social media platforms, Pose-controllable character video, Pose-controllable character, media platforms, high demand
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Pose-controllable character video generation is in high demand with extensive applications for fields such as automatic advertising and content creation on social media platforms. While existing character image animation methods using pose sequences and reference images have shown promising performance, they tend to struggle with incoherent animation in complex scenarios, such as multiple character animation and body occlusion. Additionally, current methods require large-scale high-quality videos with stable backgrounds and temporal consistency as training datasets; otherwise, their performance will greatly deteriorate. These two issues hinder the practical utilization of character image animation tools. In this paper, we propose a practical and robust framework Follow-Your-Pose v2, which can be trained on noisy open-sourced videos readily available on the internet. Multi-condition guiders are designed to address the challenges of background stability, body occlusion in multi-character generation, and consistency of character appearance. Moreover, to fill the gap of fair evaluation of multi-character pose animation, we propose a new benchmark comprising approximately 4,000 frames. Extensive experiments demonstrate that our approach outperforms state-of-the-art methods by a margin of over 35% across 2 datasets and on 7 metrics. Meanwhile, qualitative assessments reveal a significant improvement in the quality of generated video, particularly in scenarios involving complex backgrounds and body occlusion of multi-character, suggesting the superiority of our approach.

[CV-47] Instructing Prompt-to-Prompt Generation for Zero-Shot Learning

链接: https://arxiv.org/abs/2406.03032
作者: Man Liu,Huihui Bai,Feng Li,Chunjie Zhang,Yunchao Wei,Meng Wang,Tat-Seng Chua,Yao Zhao
关键词: classify unseen categories, Zero-shot learning, aims to explore, explore the semantic-visual, semantic-visual interactions
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Zero-shot learning (ZSL) aims to explore the semantic-visual interactions to discover comprehensive knowledge transferred from seen categories to classify unseen categories. Recently, prompt engineering has emerged in ZSL, demonstrating impressive potential as it enables the zero-shot transfer of diverse visual concepts to downstream tasks. However, these methods are still not well generalized to broad unseen domains. A key reason is that the fixed adaption of learnable prompts on seen domains makes it tend to over-emphasize the primary visual features observed during training. In this work, we propose a Prompt-to-Prompt generation methodology (P2P), which addresses this issue by further embracing the instruction-following technique to distill instructive visual prompts for comprehensive transferable knowledge discovery. The core of P2P is to mine semantic-related instruction from prompt-conditioned visual features and text instruction on modal-sharing semantic concepts and then inversely rectify the visual representations with the guidance of the learned instruction prompts. This enforces the compensation for missing visual details to primary contexts and further eliminates the cross-modal disparity, endowing unseen domain generalization. Through extensive experimental results, we demonstrate the efficacy of P2P in achieving superior performance over state-of-the-art methods.

[CV-48] Puzzle Pieces Picker: Deciphering Ancient Chinese Characters with Radical Reconstruction

链接: https://arxiv.org/abs/2406.03019
作者: Pengjie Wang,Kaile Zhang,Xinyu Wang,Shengwei Han,Yongge Liu,Lianwen Jin,Xiang Bai,Yuliang Liu
关键词: Oracle Bone Inscriptions, oldest existing forms, Oracle Bone, Bone Inscriptions, Puzzle Pieces Picker
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: ICDAR 2024

点击查看摘要

Abstract:Oracle Bone Inscriptions is one of the oldest existing forms of writing in the world. However, due to the great antiquity of the era, a large number of Oracle Bone Inscriptions (OBI) remain undeciphered, making it one of the global challenges in the field of paleography today. This paper introduces a novel approach, namely Puzzle Pieces Picker (P^3), to decipher these enigmatic characters through radical reconstruction. We deconstruct OBI into foundational strokes and radicals, then employ a Transformer model to reconstruct them into their modern counterparts, offering a groundbreaking solution to ancient script analysis. To further this endeavor, a new Ancient Chinese Character Puzzles (ACCP) dataset was developed, comprising an extensive collection of character images from seven key historical stages, annotated with detailed radical sequences. The experiments have showcased considerable promising insights, underscoring the potential and effectiveness of our approach in deciphering the intricacies of ancient Chinese scripts. Through this novel dataset and methodology, we aim to bridge the gap between traditional philology and modern document analysis techniques, offering new insights into the rich history of Chinese linguistic heritage.

[CV-49] DifAttack: Query-Efficient Black-Box Adversarial Attack via Hierarchical Disentangled Feature Space in Cross Domain

链接: https://arxiv.org/abs/2406.03017
作者: Jun Liu,Jiantao Zhou,Jiandian Zeng,Jinyu Tian
关键词: Attack Success Rate, Success Rate, high Attack Success, work investigates efficient, investigates efficient score-based
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: arXiv admin note: substantial text overlap with arXiv:2309.14585

点击查看摘要

Abstract:This work investigates efficient score-based black-box adversarial attacks with a high Attack Success Rate (ASR) and good generalizability. We design a novel attack method based on a hierarchical disentangled feature space and cross domain, called DifAttack++, which differs significantly from the existing ones operating over the entire feature space. Specifically, DifAttack++ firstly disentangles an image’s latent feature into an adversarial feature (AF) and a visual feature (VF) via an autoencoder equipped with our specially designed Hierarchical Decouple-Fusion (HDF) module, where the AF dominates the adversarial capability of an image, while the VF largely determines its visual appearance. We train such autoencoders for the clean and adversarial image domains respectively, meanwhile realizing feature disentanglement, by using pairs of clean images and their Adversarial Examples (AEs) generated from available surrogate models via white-box attack methods. Eventually, in the black-box attack stage, DifAttack++ iteratively optimizes the AF according to the query feedback from the victim model until a successful AE is generated, while keeping the VF unaltered. Extensive experimental results demonstrate that our method achieves superior ASR and query efficiency compared to SOTA methods, while exhibiting much better visual quality of AEs. The code is available at this https URL.
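
The query-feedback loop of a score-based black-box attack can be sketched in miniature: a toy random-search loop against a made-up linear "victim", not DifAttack++'s feature-space optimization.

```python
# Toy sketch of a score-based black-box attack loop: iteratively perturb a
# feature vector and keep changes that lower the victim's score for the true
# class, using only query feedback (no gradients from the victim).

import random

def victim_score(x):
    """Stand-in for the black-box model: score for the true class (higher = correct)."""
    w = [0.8, -0.3, 0.5]
    return sum(a * b for a, b in zip(w, x))

def black_box_attack(x, steps=200, eps=0.05, seed=0):
    rng = random.Random(seed)
    best = list(x)
    best_score = victim_score(best)
    for _ in range(steps):                       # each iteration costs one query
        cand = [v + rng.uniform(-eps, eps) for v in best]
        s = victim_score(cand)
        if s < best_score:                       # keep perturbations that hurt the victim
            best, best_score = cand, s
    return best, best_score

x0 = [1.0, 1.0, 1.0]
adv, s = black_box_attack(x0)
assert s < victim_score(x0)   # the attack lowered the true-class score
```

In the paper, the search happens only in the disentangled adversarial feature (AF) while the visual feature (VF) is frozen, which is what keeps query counts low and visual quality high; the sketch omits that disentanglement.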

[CV-50] Balancing Performance and Efficiency in Zero-shot Robotic Navigation

链接: https://arxiv.org/abs/2406.03015
作者: Dmytro Kuzmenko,Nadiya Shvai
关键词: Goal Navigation task, Object Goal Navigation, Vision-Language Frontier Maps, Frontier Maps, Goal Navigation
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注: Submitted to ICTERI 2024 Posters Track

点击查看摘要

Abstract:We present an optimization study of the Vision-Language Frontier Maps (VLFM) applied to the Object Goal Navigation task in robotics. Our work evaluates the efficiency and performance of various vision-language models, object detectors, segmentation models, and multi-modal comprehension and Visual Question Answering modules. Using the val-mini and val splits of Habitat-Matterport 3D dataset, we conduct experiments on a desktop with limited VRAM. We propose a solution that achieves a higher success rate (+1.55%) improving over the VLFM BLIP-2 baseline without substantial success-weighted path length loss while requiring 2.3 times less video memory. Our findings provide insights into balancing model performance and computational efficiency, suggesting effective deployment strategies for resource-limited environments.

[CV-51] DriVLMe: Enhancing LLM-based Autonomous Driving Agents with Embodied and Social Experiences

链接: https://arxiv.org/abs/2406.03008
作者: Yidong Huang,Jacob Sansom,Ziqiao Ma,Felix Gervits,Joyce Chai
关键词: real-world driving scenarios, Recent advancements, real-world driving, driving scenarios, foundation models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: First Vision and Language for Autonomous Driving and Robotics Workshop (VLADR @ CVPR 2024)

点击查看摘要

Abstract:Recent advancements in foundation models (FMs) have unlocked new prospects in autonomous driving, yet the experimental settings of these studies are preliminary, over-simplified, and fail to capture the complexity of real-world driving scenarios in human environments. It remains under-explored whether FM agents can handle long-horizon navigation tasks with free-form dialogue and deal with unexpected situations caused by environmental dynamics or task changes. To explore the capabilities and boundaries of FMs faced with the challenges above, we introduce DriVLMe, a video-language-model-based agent to facilitate natural and effective communication between humans and autonomous vehicles that perceive the environment and navigate. We develop DriVLMe from both embodied experiences in a simulated environment and social experiences from real human dialogue. While DriVLMe demonstrates competitive performance in both open-loop benchmarks and closed-loop human studies, we reveal several limitations and challenges, including unacceptable inference time, imbalanced training data, limited visual understanding, challenges with multi-turn interactions, simplified language generation from robotic experiences, and difficulties in handling on-the-fly unexpected situations like environmental dynamics and task changes.

[CV-52] EdgeSync: Faster Edge-model Updating via Adaptive Continuous Learning for Video Data Drift

链接: https://arxiv.org/abs/2406.03001
作者: Peng Zhao,Runchu Dong,Guiqin Wang,Cong Zhao
关键词: systems typically place, Real-time video analytics, analytics systems typically, typically place models, Real-time video
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Real-time video analytics systems typically place models with fewer weights on edge devices to reduce latency. The distribution of video content features may change over time for various reasons (e.g., lighting and weather changes), leading to accuracy degradation of existing models. To solve this problem, recent work proposes a framework that uses a remote server to continually train and adapt the lightweight model at the edge with the help of a complex model. However, existing analytics approaches leave two challenges untouched: firstly, the retraining task is compute-intensive, resulting in large model update delays; secondly, the new model may not fit well enough with the data distribution of the current video stream. To address these challenges, in this paper we present EdgeSync. EdgeSync filters samples by considering both timeliness and inference results, making training samples more relevant to the current video content while reducing the update delay. To improve the quality of training, EdgeSync also designs a training management module that can efficiently adjust the model training time and training order at runtime. Evaluating on real datasets with complex scenes, our method improves accuracy by about 3.4% compared to existing methods and about 10% compared to traditional means.
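
The sample-filtering idea, weighing timeliness against inference results, can be sketched as follows. The scoring rule, weights, and frame data are made up for illustration; this is not EdgeSync's actual design.

```python
# Toy sketch of timeliness-plus-inference-result sample filtering for
# continual retraining of an edge model.

def sample_score(age_seconds, confidence, half_life=60.0):
    """Favor recent frames (small age) the model is unsure about (low confidence)."""
    recency = 0.5 ** (age_seconds / half_life)   # exponential decay of timeliness
    uncertainty = 1.0 - confidence
    return recency * uncertainty

def select_training_samples(frames, k):
    """frames: list of (frame_id, age_seconds, confidence). Return top-k ids."""
    ranked = sorted(frames, key=lambda f: sample_score(f[1], f[2]), reverse=True)
    return [f[0] for f in ranked[:k]]

frames = [
    ("f1", 5.0, 0.95),    # recent but the model is already confident
    ("f2", 5.0, 0.40),    # recent and uncertain -> best retraining candidate
    ("f3", 300.0, 0.30),  # uncertain but stale
]
print(select_training_samples(frames, k=2))
```

Filtering like this keeps the retraining set small (shrinking the update delay) while biasing it toward frames that reflect the current content distribution.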

[CV-53] Quantifying Task Priority for Multi-Task Optimization

链接: https://arxiv.org/abs/2406.02996
作者: Wooseong Jeong,Kuk-Jin Yoon
关键词: single unified network, single unified, task priority, learn diverse tasks, tasks
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The goal of multi-task learning is to learn diverse tasks within a single unified network. As each task has its own unique objective function, conflicts emerge during training, resulting in negative transfer among them. Earlier research identified these conflicting gradients in shared parameters between tasks and attempted to realign them in the same direction. However, we prove that such optimization strategies lead to sub-optimal Pareto solutions due to their inability to accurately determine the individual contributions of each parameter across various tasks. In this paper, we propose the concept of task priority to evaluate parameter contributions across different tasks. To learn task priority, we identify the type of connections related to links between parameters influenced by task-specific losses during backpropagation. The strength of connections is gauged by the magnitude of parameters to determine task priority. Based on these, we present a new method named connection strength-based optimization for multi-task learning which consists of two phases. The first phase learns the task priority within the network, while the second phase modifies the gradients while upholding this priority. This ultimately leads to finding new Pareto optimal solutions for multiple tasks. Through extensive experiments, we show that our approach greatly enhances multi-task performance in comparison to earlier gradient manipulation methods.
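
The gradient-conflict phenomenon this abstract builds on, plus a crude magnitude-based stand-in for connection-strength priority, can be sketched as follows. The gradient values and the priority rule are toy illustrations, not the paper's method.

```python
# Toy sketch: two task gradients on shared parameters conflict when their dot
# product is negative, and per-parameter importance can be ranked by a crude
# |param| * gradient-magnitude score (a stand-in for "connection strength").

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def gradients_conflict(g1, g2):
    return dot(g1, g2) < 0

def parameter_priority(params, grads_per_task):
    """Rank shared parameters by |param| times total gradient magnitude across tasks."""
    strength = [abs(p) * sum(abs(g[i]) for g in grads_per_task)
                for i, p in enumerate(params)]
    return sorted(range(len(params)), key=lambda i: strength[i], reverse=True)

g_task_a = [0.5, -0.2, 0.1]
g_task_b = [-0.4, 0.3, 0.1]
print(gradients_conflict(g_task_a, g_task_b))          # True: negative transfer risk
print(parameter_priority([1.0, 0.1, 2.0], [g_task_a, g_task_b]))
```

The paper's point is that realigning conflicting gradients uniformly ignores such per-parameter differences; its second phase modifies gradients while respecting the learned priority.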

[CV-54] A Human-Annotated Video Dataset for Training and Evaluation of 360-Degree Video Summarization Methods

链接: https://arxiv.org/abs/2406.02991
作者: Ioannis Kontostathis,Evlampios Apostolidis,Vasileios Mezaris
关键词: summarization methods, content to concise, traditional devices, sets and smartphones, paper we introduce
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
*备注: Accepted for publication, 1st Int. Workshop on Video for Immersive Experiences (Video4IMX-2024) at ACM IMX 2024, Stockholm, Sweden, June 2024. This is the “accepted version”

点击查看摘要

Abstract:In this paper we introduce a new dataset for 360-degree video summarization: the transformation of 360-degree video content to concise 2D-video summaries that can be consumed via traditional devices, such as TV sets and smartphones. The dataset includes ground-truth human-generated summaries, that can be used for training and objectively evaluating 360-degree video summarization methods. Using this dataset, we train and assess two state-of-the-art summarization methods that were originally proposed for 2D-video summarization, to serve as a baseline for future comparisons with summarization methods that are specifically tailored to 360-degree video. Finally, we present an interactive tool that was developed to facilitate the data annotation process and can assist other annotation activities that rely on video fragment selection.

[CV-55] Predicting Genetic Mutation from Whole Slide Images via Biomedical-Linguistic Knowledge Enhanced Multi-label Classification

链接: https://arxiv.org/abs/2406.02990
作者: Gexin Huang,Chenfei Wu,Mingjie Li,Xiaojun Chang,Ling Chen,Ying Sun,Shen Zhao,Xiaodan Liang,Liang Lin
关键词: Predicting genetic mutations, training multiple binary, Predicting genetic, slide images, images is indispensable
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 16 pages, 8 figures, and 3 tables

点击查看摘要

Abstract:Predicting genetic mutations from whole slide images is indispensable for cancer diagnosis. However, existing work training multiple binary classification models faces two challenges: (a) Training multiple binary classifiers is inefficient and would inevitably lead to a class imbalance problem. (b) The biological relationships among genes are overlooked, which limits the prediction performance. To tackle these challenges, we innovatively design a Biological-knowledge enhanced PathGenomic multi-label Transformer (BPGT) to improve genetic mutation prediction performance. BPGT first establishes a novel gene encoder that constructs gene priors by two carefully designed modules: (a) A gene graph whose node features are the genes’ linguistic descriptions and the cancer phenotype, with edges modeled by genes’ pathway associations and mutation consistencies. (b) A knowledge association module that fuses linguistic and biomedical knowledge into gene priors by transformer-based graph representation learning, capturing the intrinsic relationships between different genes’ mutations. BPGT then designs a label decoder that finally performs genetic mutation prediction by two tailored modules: (a) A modality fusion module that firstly fuses the gene priors with critical regions in WSIs and obtains gene-wise mutation logits. (b) A comparative multi-label loss that emphasizes the inherent comparisons among mutation status to enhance the discrimination capabilities. Sufficient experiments on The Cancer Genome Atlas benchmark demonstrate that BPGT outperforms the state-of-the-art.
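
The gene-graph construction can be sketched abstractly: nodes are genes, edges encode pathway association, and one round of neighbor averaging mixes features between related genes. The genes, edges, and features below are made up, and this is not BPGT's transformer-based graph learning.

```python
# Toy sketch of a gene graph with one step of mean aggregation over an
# undirected graph given as edge pairs (a minimal message-passing round).

def message_pass(features, edges):
    neighbors = {g: [] for g in features}
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)
    out = {}
    for g, feat in features.items():
        pool = [features[n] for n in neighbors[g]] + [feat]   # include self
        out[g] = [sum(f[d] for f in pool) / len(pool) for d in range(len(feat))]
    return out

features = {"TP53": [1.0, 0.0], "EGFR": [0.0, 1.0], "KRAS": [0.5, 0.5]}
edges = [("TP53", "KRAS"), ("EGFR", "KRAS")]   # hypothetical pathway links
print(message_pass(features, edges))
```

After one round, each gene's representation blends information from pathway-associated genes, which is the intuition behind letting gene priors capture inter-gene mutation relationships.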

[CV-56] Enhancing Multimodal Large Language Models with Multi-instance Visual Prompt Generator for Visual Representation Enrichment

链接: https://arxiv.org/abs/2406.02987
作者: Wenliang Zhong,Wenyi Wu,Qi Li,Rob Barton,Boxin Du,Shioulin Sam,Karim Bouyarmane,Ismail Tutar,Junzhou Huang
关键词: Multimodal Large Language, Large Language Models, achieved SOTA performance, Multimodal Large, Language Models
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) have achieved SOTA performance in various visual language tasks by fusing the visual representations with LLMs leveraging some visual adapters. In this paper, we first establish that adapters using query-based Transformers, such as the Q-Former, are a simplified Multi-instance Learning method that does not consider instance heterogeneity/correlation. We then propose a general component termed Multi-instance Visual Prompt Generator (MIVPG) to incorporate enriched visual representations into LLMs by taking advantage of instance correlation between images or patches for the same sample. Quantitative evaluation on three public vision-language (VL) datasets from different scenarios shows that the proposed MIVPG improves over the Q-Former in main VL tasks.
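
The multi-instance-learning view of query-based adapters can be sketched as attention pooling over instance embeddings. This is a toy illustration with made-up patch embeddings and a made-up query, not the MIVPG module.

```python
# Toy sketch of multi-instance pooling: aggregate patch/instance embeddings
# into one bag representation with softmax attention weights.

import math

def attention_pool(instances, query):
    """instances: list of embedding vectors; query: a (learned) query vector."""
    scores = [sum(a * b for a, b in zip(query, inst)) for inst in instances]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(instances[0])
    pooled = [sum(w * inst[d] for w, inst in zip(weights, instances)) for d in range(dim)]
    return pooled, weights

patches = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
pooled, w = attention_pool(patches, query=[1.0, 0.0])
assert w[0] > w[1]              # instances aligned with the query dominate
```

The paper's argument is that a plain query-based adapter amounts to pooling of this kind without modeling how instances relate to each other, which MIVPG adds via instance correlation.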

[CV-57] Self-Supervised Skeleton Action Representation Learning: A Benchmark and Beyond

链接: https://arxiv.org/abs/2406.02978
作者: Jiahang Zhang,Lilang Lin,Shuai Yang,Jiaying Liu
关键词: meaningful prior representations, learn meaningful prior, meaningful prior, skeleton-based action understanding, label-efficient skeleton-based action
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Self-supervised learning (SSL), which aims to learn meaningful prior representations from unlabeled data, has been proven effective for label-efficient skeleton-based action understanding. Different from the image domain, skeleton data possesses sparser spatial structures and diverse representation forms, with the absence of background clues and the additional temporal dimension. This presents new challenges for the pretext task design of spatial-temporal motion representation learning. Recently, many endeavors have been made for skeleton-based SSL and remarkable progress has been achieved. However, a systematic and thorough review is still lacking. In this paper, we conduct, for the first time, a comprehensive survey on self-supervised skeleton-based action representation learning, where the literature is organized according to its pre-training pretext task methodologies. Following the taxonomy of context-based, generative learning, and contrastive learning approaches, we make a thorough review and benchmark of existing works and shed light on possible future directions. Our investigation demonstrates that most SSL works rely on a single paradigm, learning representations of a single level, and are evaluated on the action recognition task solely, which leaves the generalization power of skeleton SSL models under-explored. To this end, a novel and effective SSL method for skeletons is further proposed, which integrates multiple pretext tasks to jointly learn versatile representations of different granularity, substantially boosting the generalization capacity for different downstream tasks. Extensive experiments on three large-scale datasets demonstrate that the proposed method achieves superior generalization performance on various downstream tasks, including recognition, retrieval, detection, and few-shot learning.

[CV-58] Sparse Color-Code Net: Real-Time RGB-Based 6D Object Pose Estimation on Edge Devices

链接: https://arxiv.org/abs/2406.02977
作者: Xingjian Yang,Zhitao Yu,Ashis G. Banerjee
关键词: augmented reality applications, reality applications increasingly, applications increasingly rely, Sparse Color-Code Net, precise and efficient
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注: Accepted for publication in the Proceedings of the 2024 IEEE 20th International Conference on Automation Science and Engineering

点击查看摘要

Abstract:As robotics and augmented reality applications increasingly rely on precise and efficient 6D object pose estimation, real-time performance on edge devices is required for more interactive and responsive systems. Our proposed Sparse Color-Code Net (SCCN) embodies a clear and concise pipeline design to effectively address this requirement. SCCN performs pixel-level predictions on the target object in the RGB image, utilizing the sparsity of essential object geometry features to speed up the Perspective-n-Point (PnP) computation process. Additionally, it introduces a novel pixel-level geometry-based object symmetry representation that seamlessly integrates with the initial pose predictions, effectively addressing symmetric object ambiguities. SCCN notably achieves estimation rates of 19 frames per second (FPS) and 6 FPS on the benchmark LINEMOD dataset and the Occlusion LINEMOD dataset, respectively, on an NVIDIA Jetson AGX Xavier, while consistently maintaining high estimation accuracy at these rates.

[CV-59] DA-Flow: Dual Attention Normalizing Flow for Skeleton-based Video Anomaly Detection

链接: https://arxiv.org/abs/2406.02976
作者: Ruituo Wu,Yang Chen,Jian Xiao,Bing Li,Jicong Fan,Frédéric Dufaux,Ce Zhu,Yipeng Liu
关键词: temporal convolutional networks, graph convolutional networks, convolutional networks, Cooperation between temporal, skeleton-based video anomaly
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Cooperation between temporal convolutional networks (TCN) and graph convolutional networks (GCN) as a processing module has shown promising results in skeleton-based video anomaly detection (SVAD). However, to maintain a lightweight model with low computational and storage complexity, shallow GCN and TCN blocks are constrained by small receptive fields and a lack of cross-dimension interaction capture. To tackle this limitation, we propose a lightweight module called the Dual Attention Module (DAM) for capturing cross-dimension interaction relationships in spatio-temporal skeletal data. It employs the frame attention mechanism to identify the most significant frames and the skeleton attention mechanism to capture broader relationships across fixed partitions with minimal parameters and FLOPs. Furthermore, the proposed Dual Attention Normalizing Flow (DA-Flow) integrates the DAM as a post-processing unit after GCN within the normalizing flow framework. Simulations show that the proposed model is robust against noise and negative samples. Experimental results show that DA-Flow achieves performance competitive with or better than existing state-of-the-art (SOTA) methods in terms of the micro AUC metric with the fewest number of parameters. Moreover, we found that even without training, simply using random projection without dimensionality reduction on skeleton data enables substantial anomaly detection capabilities.

[CV-60] Event3DGS: Event-based 3D Gaussian Splatting for Fast Egomotion

链接: https://arxiv.org/abs/2406.02972
作者: Tianyi Xiong,Jiayi Wu,Botao He,Cornelia Fermuller,Yiannis Aloimonos,Heng Huang,Christopher A. Metzler
关键词: real-world robotic tasks, Gaussian splatting, Gaussian Splatting solely, leverages the advantage, novel-view synthesis
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The recent emergence of 3D Gaussian splatting (3DGS) leverages the advantage of explicit point-based representations, which significantly improves the rendering speed and quality of novel-view synthesis. However, 3D radiance field rendering in environments with high-dynamic motion or challenging illumination conditions remains problematic in real-world robotic tasks. The reason is that fast egomotion is prevalent in real-world robotic tasks, which induces motion blur, leading to inaccuracies and artifacts in the reconstructed structure. To alleviate this problem, we propose Event3DGS, the first method that learns Gaussian Splatting solely from raw event streams. By exploiting the high temporal resolution of event cameras and explicit point-based representation, Event3DGS can reconstruct high-fidelity 3D structures solely from the event streams under fast egomotion. Our sparsity-aware sampling and progressive training approaches allow for better reconstruction quality and consistency. To further enhance the fidelity of appearance, we explicitly incorporate the motion blur formation process into a differentiable rasterizer, which is used with a limited set of blurred RGB images to refine the appearance. Extensive experiments on multiple datasets validate the superior rendering quality of Event3DGS compared with existing approaches, with over 95% lower training time and rendering speeds that are orders of magnitude faster.

[CV-61] Adversarial Generation of Hierarchical Gaussians for 3D Generative Model

链接: https://arxiv.org/abs/2406.02968
作者: Sangeek Hyun,Jae-Pil Heo
关键词: Generative Adversarial Networks, ray casting-based volume, demanding rendering costs, casting-based volume rendering, incurs demanding rendering
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Project page: this https URL

点击查看摘要

Abstract:Most advances in 3D Generative Adversarial Networks (3D GANs) largely depend on ray casting-based volume rendering, which incurs demanding rendering costs. One promising alternative is rasterization-based 3D Gaussian Splatting (3D-GS), providing a much faster rendering speed and explicit 3D representation. In this paper, we exploit Gaussian as a 3D representation for 3D GANs by leveraging its efficient and explicit characteristics. However, in an adversarial framework, we observe that a naïve generator architecture suffers from training instability and lacks the capability to adjust the scale of Gaussians. This leads to model divergence and visual artifacts due to the absence of proper guidance for initialized positions of Gaussians and densification to manage their scales adaptively. To address these issues, we introduce a generator architecture with a hierarchical multi-scale Gaussian representation that effectively regularizes the position and scale of generated Gaussians. Specifically, we design a hierarchy of Gaussians where finer-level Gaussians are parameterized by their coarser-level counterparts; the position of finer-level Gaussians would be located near their coarser-level counterparts, and the scale would monotonically decrease as the level becomes finer, modeling both coarse and fine details of the 3D scene. Experimental results demonstrate that ours achieves a significantly faster rendering speed (x100) compared to state-of-the-art 3D consistent GANs with comparable 3D generation capability. Project page: this https URL.
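The coarse-to-fine parameterization described above can be sketched in a few lines: each finer-level Gaussian is positioned near its coarser-level parent, and its scale shrinks monotonically with depth. The offsets and the shrink factor below are illustrative assumptions, not the paper's learned values.

```python
def refine(parent_pos, parent_scale, offsets, shrink=0.5):
    # Each finer-level Gaussian is placed near its coarser-level parent,
    # and its scale decreases monotonically as the level becomes finer.
    return [([p + o for p, o in zip(parent_pos, off)], parent_scale * shrink)
            for off in offsets]

coarse = ([0.0, 0.0, 0.0], 1.0)  # (position, scale) of a coarse-level Gaussian
fine = refine(*coarse, offsets=[[0.1, 0.0, 0.0], [-0.1, 0.05, 0.0]])
print(fine)
```

In the actual generator the offsets would be predicted by the network; the point of the hierarchy is that positions and scales are regularized by construction rather than left free.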

[CV-62] Understanding the Impact of Negative Prompts: When and How Do They Take Effect?

链接: https://arxiv.org/abs/2406.02965
作者: Yuanhao Ban,Ruochen Wang,Tianyi Zhou,Minhao Cheng,Boqing Gong,Cho-Jui Hsieh
关键词: Stable Diffusion, conditional generation models, negative prompts, models like Stable, significant practical efficacy
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The concept of negative prompts, emerging from conditional generation models like Stable Diffusion, allows users to specify what to exclude from the generated images. Despite the widespread use of negative prompts, their intrinsic mechanisms remain largely unexplored. This paper presents the first comprehensive study to uncover how and when negative prompts take effect. Our extensive empirical analysis identifies two primary behaviors of negative prompts. Delayed Effect: The impact of negative prompts is observed after positive prompts render corresponding content. Deletion Through Neutralization: Negative prompts delete concepts from the generated image through a mutual cancellation effect in latent space with positive prompts. These insights reveal significant potential real-world applications; for example, we demonstrate that negative prompts can facilitate object inpainting with minimal alterations to the background via a simple adaptive algorithm. We believe our findings will offer valuable insights for the community in capitalizing on the potential of negative prompts.
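For context, negative prompts typically enter sampling through classifier-free guidance, where the negative-prompt noise prediction takes the place of the unconditional one, pushing the final estimate away from the negative concept. A minimal numeric sketch with made-up values (this is the standard guidance rule, not the paper's analysis code):

```python
def guided_noise(eps_positive, eps_negative, scale):
    # Classifier-free guidance with a negative prompt:
    # eps = eps_neg + scale * (eps_pos - eps_neg),
    # so the estimate moves away from the negative concept.
    return [n + scale * (p - n) for p, n in zip(eps_positive, eps_negative)]

eps_pos = [0.2, -0.1, 0.4]   # hypothetical noise prediction under the positive prompt
eps_neg = [0.5, 0.0, 0.4]    # hypothetical noise prediction under the negative prompt
print(guided_noise(eps_pos, eps_neg, scale=7.5))
```

The paper's "delayed effect" and "neutralization" findings concern how this cancellation plays out across denoising steps in latent space.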

[CV-63] AVFF: Audio-Visual Feature Fusion for Video Deepfake Detection

链接: https://arxiv.org/abs/2406.02951
作者: Trevine Oorloff,Surya Koppisetti,Nicolò Bonettini,Divyaraj Solanki,Ben Colman,Yaser Yacoob,Ali Shahriyari,Gaurav Bharaj
关键词: rapid growth, deepfake video content, generalizable methods, audio-visual, Audio-Visual Feature Fusion
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注: Accepted to CVPR 2024

点击查看摘要

Abstract:With the rapid growth in deepfake video content, we require improved and generalizable methods to detect them. Most existing detection methods either use uni-modal cues or rely on supervised training to capture the dissonance between the audio and visual modalities. While the former disregards the audio-visual correspondences entirely, the latter predominantly focuses on discerning audio-visual cues within the training corpus, thereby potentially overlooking correspondences that can help detect unseen deepfakes. We present Audio-Visual Feature Fusion (AVFF), a two-stage cross-modal learning method that explicitly captures the correspondence between the audio and visual modalities for improved deepfake detection. The first stage pursues representation learning via self-supervision on real videos to capture the intrinsic audio-visual correspondences. To extract rich cross-modal representations, we use contrastive learning and autoencoding objectives, and introduce a novel audio-visual complementary masking and feature fusion strategy. The learned representations are tuned in the second stage, where deepfake classification is pursued via supervised learning on both real and fake videos. Extensive experiments and analysis suggest that our novel representation learning paradigm is highly discriminative in nature. We report 98.6% accuracy and 99.1% AUC on the FakeAVCeleb dataset, outperforming the current audio-visual state-of-the-art by 14.9% and 9.9%, respectively.
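The first stage's contrastive objective pulls matching audio and visual embeddings together while pushing non-matching pairs apart. A hedged sketch of a symmetric InfoNCE-style loss of that kind (the toy 2-D embeddings are illustrative, and AVFF's exact loss and masking strategy are more involved):

```python
import math

def info_nce(audio, visual, temperature=0.1):
    # For each audio embedding, treat the same-index visual embedding as the
    # positive pair and all others as negatives; minimize -log softmax(match).
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    loss = 0.0
    for i, a in enumerate(audio):
        logits = [dot(a, v) / temperature for v in visual]
        log_denom = math.log(sum(math.exp(l) for l in logits))
        loss += -(logits[i] - log_denom)
    return loss / len(audio)

audio = [[1.0, 0.0], [0.0, 1.0]]    # toy audio embeddings
visual = [[0.9, 0.1], [0.1, 0.9]]   # toy visual embeddings (well aligned)
print(info_nce(audio, visual))
```

Well-aligned pairs drive this loss toward zero; on real deepfakes the learned correspondence breaks, which is the signal the second-stage classifier exploits.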

[CV-64] P2PFormer: A Primitive-to-polygon Method for Regular Building Contour Extraction from Remote Sensing Images

链接: https://arxiv.org/abs/2406.02930
作者: Tao Zhang,Shiqing Wei,Yikang Zhou,Muying Luo,Wenling You,Shunping Ji
关键词: Extracting building contours, remote sensing imagery, significant challenge due, regular building contours, Extracting building
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Extracting building contours from remote sensing imagery is a significant challenge due to buildings’ complex and diverse shapes, occlusions, and noise. Existing methods often struggle with irregular contours, rounded corners, and redundancy points, necessitating extensive post-processing to produce regular polygonal building contours. To address these challenges, we introduce a novel, streamlined pipeline that generates regular building contours without post-processing. Our approach begins with the segmentation of generic geometric primitives (which can include vertices, lines, and corners), followed by the prediction of their sequence. This allows for the direct construction of regular building contours by sequentially connecting the segmented primitives. Building on this pipeline, we developed P2PFormer, which utilizes a transformer-based architecture to segment geometric primitives and predict their order. To enhance the segmentation of primitives, we introduce a unique representation called group queries. This representation comprises a set of queries and a singular query position, which improve the focus on multiple midpoints of primitives and their efficient linkage. Furthermore, we propose an innovative implicit update strategy for the query position embedding aimed at sharpening the focus of queries on the correct positions and, consequently, enhancing the quality of primitive segmentation. Our experiments demonstrate that P2PFormer achieves new state-of-the-art performance on the WHU, CrowdAI, and WHU-Mix datasets, surpassing the previous SOTA PolyWorld by a margin of 2.7 AP and 6.5 AP75 on the largest CrowdAI dataset. We intend to make the code and trained weights publicly available to promote their use and facilitate further research.

[CV-65] Exploring Data Efficiency in Zero-Shot Learning with Diffusion Models

链接: https://arxiv.org/abs/2406.02929
作者: Zihan Ye,Shreyank N. Gowda,Xiaobo Jin,Xiaowei Huang,Haotian Xu,Yaochu Jin,Kaizhu Huang
关键词: identify unseen classes, aims to enable, enable classifiers, classifiers to identify, unseen classes
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Zero-Shot Learning (ZSL) aims to enable classifiers to identify unseen classes by enhancing data efficiency at the class level. This is achieved by generating image features from pre-defined semantics of unseen classes. However, most current approaches heavily depend on the number of samples from seen classes, i.e. they do not consider instance-level effectiveness. In this paper, we demonstrate that limited seen examples generally result in deteriorated performance of generative models. To overcome these challenges, we propose ZeroDiff, a Diffusion-based Generative ZSL model. This unified framework incorporates diffusion models to improve data efficiency at both the class and instance levels. Specifically, for instance-level effectiveness, ZeroDiff utilizes a forward diffusion chain to transform limited data into an expanded set of noised data. For class-level effectiveness, we design a two-branch generation structure that consists of a Diffusion-based Feature Generator (DFG) and a Diffusion-based Representation Generator (DRG). DFG focuses on learning and sampling the distribution of cross-entropy-based features, whilst DRG learns the supervised contrastive-based representation to boost the zero-shot capabilities of DFG. Additionally, we employ three discriminators to evaluate generated features from various aspects and introduce a Wasserstein-distance-based mutual learning loss to transfer knowledge among discriminators, thereby enhancing guidance for generation. Demonstrated through extensive experiments on three popular ZSL benchmarks, our ZeroDiff not only achieves significant improvements over existing ZSL methods but also maintains robust performance even with scarce training data. Code will be released upon acceptance.
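The forward diffusion chain mentioned above expands scarce seen-class features into many noised variants via the standard update x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise. A minimal sketch with an invented feature vector and schedule (not the paper's actual values):

```python
import math, random

def noised_feature(x0, alpha_bar, rng):
    # One step of the forward diffusion chain applied to a feature vector.
    return [math.sqrt(alpha_bar) * x + math.sqrt(1 - alpha_bar) * rng.gauss(0, 1)
            for x in x0]

rng = random.Random(0)
x0 = [0.5, -1.0, 2.0]  # a single seen-class feature vector (toy values)
expanded = [noised_feature(x0, a, rng) for a in (0.99, 0.9, 0.5)]
print(len(expanded), len(expanded[0]))
```

Each noise level yields a fresh training sample, which is how limited seen examples are stretched into a larger effective training set.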

[CV-66] Visual-Text Cross Alignment: Refining the Similarity Score in Vision-Language Models

链接: https://arxiv.org/abs/2406.02915
作者: Jinhao Li,Haopeng Li,Sarah Erfani,Lei Feng,James Bailey,Feng Liu
关键词: large language model, text descriptions generated, pre-trained vision-language model, finer text descriptions, query image
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 22 pages, 16 figures, published to ICML 2024

点击查看摘要

Abstract:It has recently been discovered that using a pre-trained vision-language model (VLM), e.g., CLIP, to align a whole query image with several finer text descriptions generated by a large language model can significantly enhance zero-shot performance. However, in this paper, we empirically find that the finer descriptions tend to align more effectively with local areas of the query image rather than the whole image, and then we theoretically validate this finding. Thus, we present a method called weighted visual-text cross alignment (WCA). This method begins with a localized visual prompting technique, designed to identify local visual areas within the query image. The local visual areas are then cross-aligned with the finer descriptions by creating a similarity matrix using the pre-trained VLM. To determine how well a query image aligns with each category, we develop a score function based on the weighted similarities in this matrix. Extensive experiments demonstrate that our method significantly improves zero-shot performance across various datasets, achieving results that are even comparable to few-shot learning methods.
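The scoring idea can be sketched concretely: cosine similarities between local image areas and finer text descriptions form a matrix, and the category score is a weighted sum over that matrix. The embeddings and the per-area weighting below are toy assumptions, not real CLIP features or the paper's exact weighting function:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def wca_score(local_areas, descriptions, weights):
    # Similarity matrix between local visual areas and finer descriptions,
    # aggregated with one weight per local area.
    sim = [[cosine(v, t) for t in descriptions] for v in local_areas]
    return sum(w * s for row, w in zip(sim, weights) for s in row)

locals_ = [[1.0, 0.0], [0.7, 0.7]]   # embeddings of two local crops
descs = [[0.9, 0.1], [0.5, 0.5]]     # embeddings of two finer descriptions
print(wca_score(locals_, descs, weights=[0.6, 0.4]))
```

The category whose descriptions yield the highest aggregated score is the zero-shot prediction.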

[CV-67] A Self-Supervised Denoising Strategy for Underwater Acoustic Camera Imageries

链接: https://arxiv.org/abs/2406.02914
作者: Xiaoteng Zhou,Katsunori Mizuno,Yilong Zhang
关键词: low-visibility marine environments, marine environments characterized, acoustic camera images, visual sensors capable, acoustic cameras serve
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注: 8 pages

点击查看摘要

Abstract:In low-visibility marine environments characterized by turbidity and darkness, acoustic cameras serve as visual sensors capable of generating high-resolution 2D sonar images. However, acoustic camera images are corrupted by complex noise and are difficult for downstream visual algorithms to ingest directly. This paper introduces a novel strategy for denoising acoustic camera images using deep learning techniques, which comprises two principal components: a self-supervised denoising framework and a fine feature-guided block. Additionally, the study explores the relationship between the level of image denoising and the improvement in feature-matching performance. Experimental results show that the proposed denoising strategy can effectively filter acoustic camera images without prior knowledge of the noise model. The denoising process is nearly end-to-end without complex parameter tuning and post-processing. It successfully removes noise while preserving fine feature details, thereby enhancing the performance of local feature matching.

[CV-68] Language-guided Detection and Mitigation of Unknown Dataset Bias

链接: https://arxiv.org/abs/2406.02889
作者: Zaiying Zhao,Soichiro Kumano,Toshihiko Yamasaki
关键词: training fair classifiers, significant problem, problem in training, training fair, prior knowledge
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Dataset bias is a significant problem in training fair classifiers. When attributes unrelated to classification exhibit strong biases towards certain classes, classifiers trained on such a dataset may overfit to these bias attributes, substantially reducing the accuracy for minority groups. Mitigation techniques can be categorized according to the availability of bias information (i.e., prior knowledge). Although scenarios with unknown biases are better suited for real-world settings, previous work in this field often suffers from a lack of interpretability regarding biases and lower performance. In this study, we propose a framework to identify potential biases as keywords without prior knowledge based on the partial occurrence in the captions. We further propose two debiasing methods: (a) handing over to an existing debiasing approach which requires prior knowledge by assigning pseudo-labels, and (b) employing data augmentation via text-to-image generative models, using acquired bias keywords as prompts. Despite its simplicity, experimental results show that our framework not only outperforms existing methods without prior knowledge, but also is even comparable with a method that assumes prior knowledge.

[CV-69] PosterLLaVa: Constructing a Unified Multi-modal Layout Generator with LLM

链接: https://arxiv.org/abs/2406.02884
作者: Tao Yang,Yingmin Luo,Zhongang Qi,Yang Wu,Ying Shan,Chang Wen Chen
关键词: requiring arranging, constraint-following manner, achieving automated graphic, keystone in achieving, arranging the position
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Layout generation is the keystone in achieving automated graphic design, requiring arranging the position and size of various multi-modal design elements in a visually pleasing and constraint-following manner. Previous approaches are either inefficient for large-scale applications or lack flexibility for varying design requirements. Our research introduces a unified framework for automated graphic layout generation, leveraging the multi-modal large language model (MLLM) to accommodate diverse design tasks. In contrast, our data-driven method employs structured text (JSON format) and visual instruction tuning to generate layouts under specific visual and textual constraints, including user-defined natural language specifications. We conducted extensive experiments and achieved state-of-the-art (SOTA) performance on public multi-modal layout generation benchmarks, demonstrating the effectiveness of our method. Moreover, recognizing existing datasets' limitations in capturing the complexity of real-world graphic designs, we propose two new datasets for much more challenging tasks (user-constrained generation and complicated poster), further validating our model's utility in real-life settings. Marked by its superior accessibility and adaptability, this approach further automates large-scale graphic design tasks. The code and datasets will be publicly available on this https URL.

[CV-70] Inv-Adapter: ID Customization Generation via Image Inversion and Lightweight Adapter

链接: https://arxiv.org/abs/2406.02881
作者: Peng Xing,Ning Wang,Jianbo Ouyang,Zechao Li
关键词: remarkable advancement, boosts the research, models significantly boosts, significantly boosts, model
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: technical report

点击查看摘要

Abstract:The remarkable advancement in text-to-image generation models significantly boosts the research in ID customization generation. However, existing personalization methods cannot simultaneously satisfy high fidelity and high-efficiency requirements. Their main bottleneck lies in the prompt image encoder, which produces weak alignment signals with the text-to-image model and significantly increased model size. Towards this end, we propose a lightweight Inv-Adapter, which first extracts diffusion-domain representations of ID images utilizing a pre-trained text-to-image model via DDIM image inversion, without additional image encoder. Benefiting from the high alignment of the extracted ID prompt features and the intermediate features of the text-to-image model, we then embed them efficiently into the base text-to-image model by carefully designing a lightweight attention adapter. We conduct extensive experiments to assess ID fidelity, generation loyalty, speed, and training parameters, all of which show that the proposed Inv-Adapter is highly competitive in ID customization generation and model scale.

[CV-71] Controllable Talking Face Generation by Implicit Facial Keypoints Editing

链接: https://arxiv.org/abs/2406.02880
作者: Dong Zhao,Jiaying Shi,Wenjun Li,Shudong Wang,Shenghui Xu,Zhaoming Pan
关键词: digital human research, garnered significant interest, Audio-driven talking face, Audio-driven talking, talking face generation
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Audio-driven talking face generation has garnered significant interest within the domain of digital human research. Existing methods are encumbered by intricate model architectures whose components are tightly dependent on each other, complicating the process of re-editing image or video inputs. In this work, we present ControlTalk, a talking face generation method to control face expression deformation based on driven audio, which can construct the head pose and facial expression including lip motion for both single image or sequential video inputs in a unified manner. By utilizing a pre-trained video synthesis renderer and proposing the lightweight adaptation, ControlTalk achieves precise and naturalistic lip synchronization while enabling quantitative control over mouth opening shape. Our experiments show that our method is superior to state-of-the-art performance on widely used benchmarks, including HDTF and MEAD. The parameterized adaptation demonstrates remarkable generalization capabilities, effectively handling expression deformation across same-ID and cross-ID scenarios, and extending its utility to out-of-domain portraits, regardless of languages.

[CV-72] Rethinking Guidance Information to Utilize Unlabeled Samples:A Label Encoding Perspective

链接: https://arxiv.org/abs/2406.02862
作者: Yulong Zhang,Yuan Yao,Shuhao Chen,Pengrong Jin,Yu Zhang,Jian Jin,Jiangang Lu
关键词: Empirical Risk Minimization, Risk Minimization, Empirical Risk, Entropy Minimization, Label-Encoding Risk Minimization
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to ICML 2024

点击查看摘要

Abstract:Empirical Risk Minimization (ERM) is fragile in scenarios with insufficient labeled samples. A vanilla extension of ERM to unlabeled samples is Entropy Minimization (EntMin), which employs the soft-labels of unlabeled samples to guide their learning. However, EntMin emphasizes prediction discriminability while neglecting prediction diversity. To alleviate this issue, in this paper, we rethink the guidance information to utilize unlabeled samples. By analyzing the learning objective of ERM, we find that the guidance information for labeled samples in a specific category is the corresponding label encoding. Inspired by this finding, we propose a Label-Encoding Risk Minimization (LERM). It first estimates the label encodings through prediction means of unlabeled samples and then aligns them with their corresponding ground-truth label encodings. As a result, the LERM ensures both prediction discriminability and diversity, and it can be integrated into existing methods as a plugin. Theoretically, we analyze the relationships between LERM and ERM as well as EntMin. Empirically, we verify the superiority of the LERM under several label-insufficient scenarios. The codes are available at this https URL.
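The core LERM step described above can be sketched on toy data: estimate each class's label encoding as the mean of unlabeled-sample predictions assigned to that class, then measure the gap to the ground-truth one-hot encoding. Grouping predictions by their argmax is a simplifying assumption for illustration:

```python
def lerm_gap(predictions, num_classes):
    # Group soft predictions by their argmax class, average them, and compare
    # each class mean against its one-hot ground-truth label encoding.
    sums = [[0.0] * num_classes for _ in range(num_classes)]
    counts = [0] * num_classes
    for p in predictions:
        c = max(range(num_classes), key=lambda k: p[k])
        counts[c] += 1
        sums[c] = [s + v for s, v in zip(sums[c], p)]
    gap = 0.0
    for c in range(num_classes):
        if counts[c] == 0:
            continue
        mean = [s / counts[c] for s in sums[c]]
        one_hot = [1.0 if k == c else 0.0 for k in range(num_classes)]
        gap += sum((m - o) ** 2 for m, o in zip(mean, one_hot))
    return gap

preds = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8]]  # toy soft predictions on unlabeled data
print(lerm_gap(preds, num_classes=2))
```

Minimizing this gap encourages confident predictions (discriminability) while keeping every class represented (diversity), which is the paper's stated advantage over plain entropy minimization.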

[CV-73] Zero-Shot Image Segmentation via Recursive Normalized Cut on Diffusion Features

链接: https://arxiv.org/abs/2406.02842
作者: Paul Couairon,Mustafa Shukor,Jean-Emmanuel Haugeard,Matthieu Cord,Nicolas Thome
关键词: domains including language, including language, emerged as powerful, powerful tools, domains including
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Foundation models have emerged as powerful tools across various domains including language, vision, and multimodal tasks. While prior works have addressed unsupervised image segmentation, they significantly lag behind supervised models. In this paper, we use a diffusion UNet encoder as a foundation vision encoder and introduce DiffCut, an unsupervised zero-shot segmentation method that solely harnesses the output features from the final self-attention block. Through extensive experimentation, we demonstrate that utilizing these diffusion features in a graph-based segmentation algorithm significantly outperforms previous state-of-the-art methods on zero-shot segmentation. Specifically, we leverage a recursive Normalized Cut algorithm that softly regulates the granularity of detected objects and produces well-defined segmentation maps that precisely capture intricate image details. Our work highlights the remarkably accurate semantic knowledge embedded within diffusion UNet encoders that could then serve as foundation vision encoders for downstream tasks. Project page at this https URL

[CV-74] Conditional Idempotent Generative Networks

链接: https://arxiv.org/abs/2406.02841
作者: Niccolò Ronchetti
关键词: Idempotent Generative Networks, Conditional Idempotent Generative, propose Conditional Idempotent, Generative Networks, Idempotent Generative
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 22 pages, 8 figures

点击查看摘要

Abstract:We propose Conditional Idempotent Generative Networks (CIGN), a novel approach that expands upon Idempotent Generative Networks (IGN) to enable conditional generation. While IGNs offer efficient single-pass generation, they lack the ability to control the content of the generated data. CIGNs address this limitation by incorporating conditioning mechanisms, allowing users to steer the generation process towards specific types of data. We establish the theoretical foundations for CIGNs, outlining their scope, loss function design, and evaluation metrics. We then present two potential architectures for implementing CIGNs: channel conditioning and filter conditioning. Finally, we discuss experimental results on the MNIST dataset, demonstrating the effectiveness of both approaches. Our findings pave the way for further exploration of CIGNs on larger datasets and with more powerful computing resources to determine the optimal implementation strategy.
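The idempotence property at the heart of IGN (and thus CIGN) is that applying the model twice changes nothing: f(f(z)) = f(z). A toy sketch using scalar "models" so the residual is checkable by hand; the real objective also includes reconstruction and tightness terms not shown here:

```python
def idempotence_residual(f, z_samples):
    # Mean squared violation of f(f(z)) == f(z) over a batch of latents.
    return sum((f(f(z)) - f(z)) ** 2 for z in z_samples) / len(z_samples)

identity = lambda x: x       # idempotent: applying twice changes nothing
halver = lambda x: 0.5 * x   # not idempotent: each application halves z

zs = [1.0, -2.0, 3.0]
print(idempotence_residual(identity, zs))
print(idempotence_residual(halver, zs))
```

Training drives this residual to zero so that a single forward pass already lands on the data manifold; conditioning in CIGN steers which region of that manifold the pass lands on.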

[CV-75] DREW : Towards Robust Data Provenance by Leveraging Error-Controlled Watermarking

链接: https://arxiv.org/abs/2406.02836
作者: Mehrdad Saberi,Vinu Sankar Sadasivan,Arman Zarei,Hessam Mahdavifar,Soheil Feizi
关键词: detecting AI-generated content, data ownership protection, applications including data, including data ownership, media forensics
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Identifying the origin of data is crucial for data provenance, with applications including data ownership protection, media forensics, and detecting AI-generated content. A standard approach involves embedding-based retrieval techniques that match query data with entries in a reference dataset. However, this method is not robust against benign and malicious edits. To address this, we propose Data Retrieval with Error-corrected codes and Watermarking (DREW). DREW randomly clusters the reference dataset, injects unique error-controlled watermark keys into each cluster, and uses these keys at query time to identify the appropriate cluster for a given sample. After locating the relevant cluster, embedding vector similarity retrieval is performed within the cluster to find the most accurate matches. The integration of error control codes (ECC) ensures reliable cluster assignments, enabling the method to perform retrieval on the entire dataset in case the ECC algorithm cannot detect the correct cluster with high confidence. This makes DREW maintain baseline performance, while also providing opportunities for performance improvements due to the increased likelihood of correctly matching queries to their origin when performing retrieval on a smaller subset of the dataset. Depending on the watermark technique used, DREW can provide substantial improvements in retrieval accuracy (up to 40% for some datasets and modification types) across multiple datasets and state-of-the-art embedding models (e.g., DinoV2, CLIP), making our method a promising solution for secure and reliable source identification. The code is available at this https URL
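DREW's retrieval flow can be sketched with a trivial repetition code standing in for the error-control code: each cluster's watermark key is encoded redundantly, the decoder majority-votes over corrupted bits, and retrieval is then narrowed to the decoded cluster. The keys and the single-bit "benign edit" below are invented for illustration; real ECC schemes are far stronger:

```python
def encode_key(bits, repeat=3):
    # Repetition code: each key bit is embedded 'repeat' times.
    return [b for b in bits for _ in range(repeat)]

def decode_key(codeword, repeat=3):
    # Majority vote per group recovers the key despite some bit flips.
    return [1 if sum(codeword[i:i + repeat]) * 2 > repeat else 0
            for i in range(0, len(codeword), repeat)]

key = [1, 0, 1]           # watermark key identifying cluster "101"
codeword = encode_key(key)
codeword[1] ^= 1          # a benign edit flips one embedded bit
print(decode_key(codeword))
```

Once the cluster is identified, embedding similarity search runs only within that cluster, which is where the accuracy gains over whole-dataset retrieval come from.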

[CV-76] DenoDet: Attention as Deformable Multi-Subspace Feature Denoising for Target Detection in SAR Images

链接: https://arxiv.org/abs/2406.02833
作者: Yimian Dai,Minrui Zou,Yuxuan Li,Xiang Li,Kang Ni,Jian Yang
关键词: Synthetic Aperture Radar, Synthetic Aperture, Aperture Radar, inherent speckle noise, SAR target detection
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Synthetic Aperture Radar (SAR) target detection has long been impeded by inherent speckle noise and the prevalence of diminutive, ambiguous targets. While deep neural networks have advanced SAR target detection, their intrinsic low-frequency bias and static post-training weights falter at suppressing coherent noise and preserving subtle details across heterogeneous terrains. Motivated by traditional SAR image denoising, we propose DenoDet, a network aided by an explicit frequency-domain transform to calibrate convolutional biases and pay more attention to high frequencies, forming a natural multi-scale subspace representation to detect targets from the perspective of multi-subspace denoising. We design TransDeno, a dynamic frequency-domain attention module that performs as a transform-domain soft thresholding operation, dynamically denoising across subspaces by preserving salient target signals and attenuating noise. To adaptively adjust the granularity of subspace processing, we also propose a deformable group fully-connected layer (DeGroFC) that dynamically varies the group conditioned on the input features. Without bells and whistles, our plug-and-play TransDeno sets state-of-the-art scores on multiple SAR target detection datasets. The code is available at this https URL.
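
TransDeno is described as a transform-domain soft-thresholding operation with learned, dynamic thresholds. A classical, static frequency-domain analogue of that idea can be sketched as follows; the threshold value and the additive-noise toy example are illustrative assumptions, not the paper's learned module:

```python
import numpy as np

def soft_threshold(x, tau):
    """Classical soft-thresholding: shrink magnitudes toward zero by tau."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def frequency_denoise(image, tau):
    """Denoise by soft-thresholding FFT magnitudes (a static stand-in for
    TransDeno's learned, per-subspace dynamic thresholds)."""
    spec = np.fft.fft2(image)
    mag, phase = np.abs(spec), np.angle(spec)
    mag = soft_threshold(mag, tau)
    return np.real(np.fft.ifft2(mag * np.exp(1j * phase)))

rng = np.random.default_rng(1)
clean = np.zeros((32, 32))
clean[12:20, 12:20] = 1.0                            # a bright "target"
noisy = clean + 0.3 * rng.normal(size=clean.shape)   # additive stand-in for speckle
denoised = frequency_denoise(noisy, tau=5.0)

err_noisy = np.mean((noisy - clean) ** 2)
err_denoised = np.mean((denoised - clean) ** 2)
```

Shrinking small transform coefficients (mostly noise) while keeping large ones (mostly target) is exactly the trade-off the learned module automates per subspace.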

[CV-77] Distilling Aggregated Knowledge for Weakly-Supervised Video Anomaly Detection

链接: https://arxiv.org/abs/2406.02831
作者: Jash Dalvi,Ali Dabouei,Gunjan Dhanuka,Min Xu
关键词: Video anomaly detection, anomaly detection aims, identifying abnormal events, automated models capable, Video anomaly
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Video anomaly detection aims to develop automated models capable of identifying abnormal events in surveillance videos. The benchmark setup for this task is extremely challenging due to: i) the limited size of the training sets, ii) weak supervision provided in terms of video-level labels, and iii) intrinsic class imbalance induced by the scarcity of abnormal events. In this work, we show that distilling knowledge from aggregated representations of multiple backbones into a relatively simple model achieves state-of-the-art performance. In particular, we develop a bi-level distillation approach along with a novel disentangled cross-attention-based feature aggregation network. Our proposed approach, DAKD (Distilling Aggregated Knowledge with Disentangled Attention), demonstrates superior performance compared to existing methods across multiple benchmark datasets. Notably, we achieve significant improvements of 1.36%, 0.78%, and 7.02% on the UCF-Crime, ShanghaiTech, and XD-Violence datasets, respectively.

[CV-78] ORACLE: Leveraging Mutual Information for Consistent Character Generation with LoRAs in Diffusion Models

链接: https://arxiv.org/abs/2406.02820
作者: Kiymet Akdemir,Pinar Yanardag
关键词: comic book artistry, promoting visual creativity, children literature, game development, book artistry
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Text-to-image diffusion models have recently taken center stage as pivotal tools in promoting visual creativity across an array of domains such as comic book artistry, children’s literature, game development, and web design. These models harness the power of artificial intelligence to convert textual descriptions into vivid images, thereby enabling artists and creators to bring their imaginative concepts to life with unprecedented ease. However, one of the significant hurdles that persist is the challenge of maintaining consistency in character generation across diverse contexts. Variations in textual prompts, even if minor, can yield vastly different visual outputs, posing a considerable problem in projects that require a uniform representation of characters throughout. In this paper, we introduce a novel framework designed to produce consistent character representations from a single text prompt across diverse settings. Through both quantitative and qualitative analyses, we demonstrate that our framework outperforms existing methods in generating characters with consistent visual identities, underscoring its potential to transform creative industries. By addressing the critical challenge of character consistency, we not only enhance the practical utility of these models but also broaden the horizons for artistic and creative expression.

[CV-79] LADI v2: Multi-label Dataset and Classifiers for Low-Altitude Disaster Imagery

链接: https://arxiv.org/abs/2406.02780
作者: Samuel Scheele,Katherine Picchione,Jeffrey Liu
关键词: promising tools, tools for supporting, operations following natural, Low Altitude Disaster, supporting emergency management
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:ML-based computer vision models are promising tools for supporting emergency management operations following natural disasters. Aerial photographs taken from small manned and unmanned aircraft can be available soon after a disaster and provide valuable information from multiple perspectives for situational awareness and damage assessment applications. However, emergency managers often face challenges finding the most relevant photos among the tens of thousands that may be taken after an incident. While ML-based solutions could enable more effective use of aerial photographs, there is still a lack of training data for imagery of this type from multiple perspectives and for multiple hazard types. To address this, we present the LADI v2 (Low Altitude Disaster Imagery version 2) dataset, a curated set of about 10,000 disaster images captured in the United States by the Civil Air Patrol (CAP) in response to federally-declared emergencies (2015-2023) and annotated for multi-label classification by trained CAP volunteers. We also provide two pretrained baseline classifiers and compare their performance to state-of-the-art vision-language models in multi-label classification. The data and code are released publicly to support the development of computer vision models for emergency management research and applications.

[CV-80] MeshVPR: Citywide Visual Place Recognition Using 3D Meshes

链接: https://arxiv.org/abs/2406.02776
作者: Gabriele Berton,Lorenz Junglas,Riccardo Zaccone,Thomas Pollok,Barbara Caputo,Carlo Masone
关键词: visual place recognition, recognition step based, localization step based, place recognition step, step based
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Website: this https URL

点击查看摘要

Abstract:Mesh-based scene representation offers a promising direction for simplifying large-scale hierarchical visual localization pipelines, combining a visual place recognition step based on global features (retrieval) and a visual localization step based on local features. While existing work demonstrates the viability of meshes for visual localization, the impact of using synthetic databases rendered from them in visual place recognition remains largely unexplored. In this work we investigate using dense 3D textured meshes for large-scale Visual Place Recognition (VPR) and identify a significant performance drop when using synthetic mesh-based databases compared to real-world images for retrieval. To address this, we propose MeshVPR, a novel VPR pipeline that utilizes a lightweight features alignment framework to bridge the gap between real-world and synthetic domains. MeshVPR leverages pre-trained VPR models and it is efficient and scalable for city-wide deployments. We introduce novel datasets with freely available 3D meshes and manually collected queries from Berlin, Paris, and Melbourne. Extensive evaluations demonstrate that MeshVPR achieves competitive performance with standard VPR pipelines, paving the way for mesh-based localization systems. Our contributions include the new task of citywide mesh-based VPR, the new benchmark datasets, MeshVPR, and a thorough analysis of open challenges. Data, code, and interactive visualizations are available at this https URL

[CV-81] Diffusion-Refined VQA Annotations for Semi-Supervised Gaze Following

链接: https://arxiv.org/abs/2406.02774
作者: Qiaomu Miao,Alexandros Graikos,Jingwei Zhang,Sounak Mondal,Minh Hoai,Dimitris Samaras
关键词: target coordinates annotated, gaze target coordinates, inherently ambiguous process, Visual Question Answering, target coordinates
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Training gaze following models requires a large number of images with gaze target coordinates annotated by human annotators, which is a laborious and inherently ambiguous process. We propose the first semi-supervised method for gaze following by introducing two novel priors to the task. We obtain the first prior using a large pretrained Visual Question Answering (VQA) model, where we compute Grad-CAM heatmaps by 'prompting' the VQA model with a gaze following question. These heatmaps can be noisy and not suited for use in training. The need to refine these noisy annotations leads us to incorporate a second prior. We utilize a diffusion model trained on limited human annotations and modify the reverse sampling process to refine the Grad-CAM heatmaps. By tuning the diffusion process we achieve a trade-off between the human annotation prior and the VQA heatmap prior, which retains the useful VQA prior information while exhibiting similar properties to the training data distribution. Our method outperforms simple pseudo-annotation generation baselines on the GazeFollow image dataset. More importantly, our pseudo-annotation strategy, applied to a widely used supervised gaze following model (VAT), reduces the annotation need by 50%. Our method also performs the best on the VideoAttentionTarget dataset.

[CV-82] Cyclic Sparse Training: Is it Enough?

链接: https://arxiv.org/abs/2406.02773
作者: Advait Gadhikar,Sree Harsha Nelaturu,Rebekka Burkholz
关键词: implicit regularization induced, repeated cyclic training, repeated cyclic, iterative pruning methods, cyclic training
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The success of iterative pruning methods in achieving state-of-the-art sparse networks has largely been attributed to improved mask identification and an implicit regularization induced by pruning. We challenge this hypothesis and instead posit that their repeated cyclic training schedules enable improved optimization. To verify this, we show that pruning at initialization is significantly boosted by repeated cyclic training, even outperforming standard iterative pruning methods. We conjecture that the dominant mechanism behind this is a better exploration of the loss landscape, leading to a lower training loss. However, at high sparsity, repeated cyclic training alone is not enough for competitive performance. A strong coupling between learnt parameter initialization and mask seems to be required. Standard methods obtain this coupling via expensive pruning-training iterations, starting from a dense network. To achieve this with sparse training instead, we propose SCULPT-ing, i.e., repeated cyclic training of any sparse mask followed by a single pruning step to couple the parameters and the mask, which is able to match the performance of state-of-the-art iterative pruning methods in the high sparsity regime at reduced computational cost.

[CV-83] Multi-layer Learnable Attention Mask for Multimodal Tasks

链接: https://arxiv.org/abs/2406.02761
作者: Wayner Barrios,SouYoung Jin
关键词: high computational demands, Learnable Attention Mask, diverse settings, varying granularity, high computational
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
*备注:

点击查看摘要

Abstract:While the Self-Attention mechanism in the Transformer model has proven to be effective in many domains, we observe that it is less effective in more diverse settings (e.g. multimodality) due to the varying granularity of each token and the high computational demands of lengthy sequences. To address the challenges, we introduce the Learnable Attention Mask (LAM), strategically designed to globally regulate attention maps and prioritize critical tokens within the sequence. Leveraging the Self-Attention module in a BERT-like transformer network, our approach adeptly captures associations between tokens. The extension of the LAM to a multi-layer version accommodates the varied information aspects embedded at each layer of the Transformer network. Comprehensive experimental validation on various datasets, such as MADv2, QVHighlights, ImageNet 1K, and MSRVTT, demonstrates the efficacy of the LAM, exemplifying its ability to enhance model performance while mitigating redundant computations. This pioneering approach presents a significant advancement in enhancing the understanding of complex scenarios, such as in movie understanding.
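
The core mechanism, a learnable mask that globally regulates the attention map, can be sketched as an additive term on the attention scores before the softmax; the shapes, the random inputs, and the token-pruning example below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def masked_attention(Q, K, V, learnable_mask):
    """Scaled dot-product attention with an additive, globally learnable mask
    over the attention map (a sketch of the LAM idea)."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1]) + learnable_mask
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))   # stable softmax
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V, w

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))   # 2 queries, dim 4
K = rng.normal(size=(3, 4))   # 3 key tokens
V = rng.normal(size=(3, 4))

mask = np.zeros((2, 3))
mask[:, 1] = -1e9             # a trained mask entry that suppresses token 1
out, w = masked_attention(Q, K, V, mask)
```

A strongly negative mask entry drives the corresponding attention weight to zero, which is how a learned mask can deprioritize low-value tokens across granularities.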

[CV-84] Story Generation from Visual Inputs: Techniques Related Tasks and Challenges

链接: https://arxiv.org/abs/2406.02748
作者: Daniel A. P. Oliveira,Eugénio Ribeiro,David Martins de Matos
关键词: Creating engaging narratives, digital media consumption, automated digital media, Creating engaging, assistive technologies
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Creating engaging narratives from visual data is crucial for automated digital media consumption, assistive technologies, and interactive entertainment. This survey covers methodologies used in the generation of these narratives, focusing on their principles, strengths, and limitations. The survey also covers tasks related to automatic story generation, such as image and video captioning, and visual question answering, as well as story generation without visual inputs. These tasks share common challenges with visual story generation and have served as inspiration for the techniques used in the field. We analyze the main datasets and evaluation metrics, providing a critical perspective on their limitations.

[CV-85] 3D-HGS: 3D Half-Gaussian Splatting

链接: https://arxiv.org/abs/2406.02720
作者: Haolin Li,Jinyang Liu,Mario Sznaier,Octavia Camps
关键词: computer vision, Neural Radiance Fields, Photo-realistic, Reconstruction, Gaussian Splatting
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
*备注: 9 pages, 6 figures

点击查看摘要

Abstract:Photo-realistic 3D Reconstruction is a fundamental problem in 3D computer vision. This domain has seen considerable advancements owing to the advent of recent neural rendering techniques. These techniques predominantly aim to focus on learning volumetric representations of 3D scenes and refining these representations via loss functions derived from rendering. Among these, 3D Gaussian Splatting (3D-GS) has emerged as a significant method, surpassing Neural Radiance Fields (NeRFs). 3D-GS uses parameterized 3D Gaussians for modeling both spatial locations and color information, combined with a tile-based fast rendering technique. Despite its superior rendering performance and speed, the use of 3D Gaussian kernels has inherent limitations in accurately representing discontinuous functions, notably at edges and corners for shape discontinuities, and across varying textures for color discontinuities. To address this problem, we propose to employ 3D Half-Gaussian (3D-HGS) kernels, which can be used as a plug-and-play kernel. Our experiments demonstrate their capability to improve the performance of current 3D-GS related methods and achieve state-of-the-art rendering performance on various datasets without compromising rendering speed.
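
One plausible reading of a half-Gaussian kernel is a Gaussian cut by a plane through its mean, so the two halves can carry different parameters at a shape or color discontinuity. A toy evaluation under that assumption (not the authors' exact parameterization) looks like:

```python
import numpy as np

def half_gaussian(x, mean, cov_inv, normal):
    """Evaluate an (unnormalized) 3D half-Gaussian: a Gaussian kernel cut by a
    plane through its mean, keeping only the side pointed to by `normal`.
    This is an illustrative guess at the kernel, not the paper's code."""
    d = x - mean
    gauss = np.exp(-0.5 * np.einsum('...i,ij,...j->...', d, cov_inv, d))
    side = (d @ normal) >= 0          # Heaviside along the cutting plane
    return gauss * side

mean = np.zeros(3)
cov_inv = np.eye(3)                   # isotropic unit Gaussian for simplicity
normal = np.array([1.0, 0.0, 0.0])    # cutting plane x = 0, keep x >= 0
pts = np.array([[0.5, 0.0, 0.0], [-0.5, 0.0, 0.0]])
vals = half_gaussian(pts, mean, cov_inv, normal)
```

The hard zero on one side is what lets a pair of half-kernels represent a discontinuity that a single smooth Gaussian cannot.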

[CV-86] Window to Wall Ratio Detection using SegFormer

链接: https://arxiv.org/abs/2406.02706
作者: Zoe De Simone,Sayandeep Biswas,Oscar Wu
关键词: Wall Ratios, assessing the energy, daylight and ventilation, key to assessing, Ratios
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Window to Wall Ratios (WWR) are key to assessing the energy, daylight, and ventilation performance of buildings. Studies have shown that window area has a large impact on building performance and simulation. However, the data needed to set up these environmental models and simulations are typically unavailable. Instead, a standard 40% WWR is assumed for all buildings. This paper leverages existing computer vision window detection methods to predict the WWR of buildings from external street view images using semantic segmentation, demonstrating the potential for adapting established computer vision techniques in architectural applications.
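
Once a facade has been semantically segmented, the WWR itself reduces to pixel counting; the label ids and toy facade below are made up for illustration:

```python
import numpy as np

# Hypothetical label ids for a facade segmentation map.
WALL, WINDOW = 1, 2

def window_to_wall_ratio(seg):
    """WWR = window area / total facade (wall + window) area, from pixel counts."""
    window_px = np.count_nonzero(seg == WINDOW)
    wall_px = np.count_nonzero(seg == WALL)
    facade_px = window_px + wall_px
    return window_px / facade_px if facade_px else 0.0

# Toy facade: a 10x10 wall containing a 4x5 window region.
seg = np.full((10, 10), WALL)
seg[2:6, 2:7] = WINDOW
wwr = window_to_wall_ratio(seg)   # 20 window pixels over 100 facade pixels
```

Real street-view images would additionally need perspective rectification before pixel areas approximate physical areas, which the segmentation-based pipeline must account for.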

[CV-87] Contrastive Language Video Time Pre-training

链接: https://arxiv.org/abs/2406.02631
作者: Hengyue Liu,Kyle Min,Hector A. Valdez,Subarna Tripathi
关键词: representations in long-form, introduce LAVITI, temporal representations, LAVITI, LAVITI aims
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: CVPR EgoVis Workshop 2024 extended abstract

点击查看摘要

Abstract:We introduce LAVITI, a novel approach to learning language, video, and temporal representations in long-form videos via contrastive learning. Different from pre-training on video-text pairs like EgoVLP, LAVITI aims to align language, video, and temporal features by extracting meaningful moments in untrimmed videos. Our model employs a set of learnable moment queries to decode clip-level visual, language, and temporal features. In addition to vision and language alignment, we introduce relative temporal embeddings (TE) to represent timestamps in videos, which enables contrastive learning of time. Significantly different from traditional approaches, the prediction of a particular timestamp is transformed by computing the similarity score between the predicted TE and all TEs. Furthermore, existing approaches for video understanding are mainly designed for short videos due to high computational complexity and memory footprint. Our method can be trained on the Ego4D dataset with only 8 NVIDIA RTX-3090 GPUs in a day. We validated our method on CharadesEgo action recognition, achieving state-of-the-art results.
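
The timestamp prediction described above, scoring the predicted temporal embedding against all TEs, can be sketched with random stand-in embeddings; the TE table, dimensions, and noise level are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 100, 32                      # number of timestamps, embedding dim

# Relative temporal embeddings, one per timestamp (random stand-ins here;
# in LAVITI these are learned representations of time).
te_table = rng.normal(size=(T, d))
te_table /= np.linalg.norm(te_table, axis=1, keepdims=True)

def predict_timestamp(pred_te):
    """Score every timestamp by similarity to the predicted TE, take the argmax."""
    pred_te = pred_te / np.linalg.norm(pred_te)
    scores = te_table @ pred_te     # similarity against *all* TEs
    return int(np.argmax(scores)), scores

# A model prediction close to the TE of timestamp 37.
pred = te_table[37] + 0.05 * rng.normal(size=d)
t_hat, scores = predict_timestamp(pred)
```

Casting timestamp regression as similarity over a TE table is what makes time contrastively learnable alongside vision and language features.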

[CV-88] A Novel Defense Against Poisoning Attacks on Federated Learning: LayerCAM Augmented with Autoencoder

链接: https://arxiv.org/abs/2406.02605
作者: Jingjing Zheng,Xin Yuan,Kai Li,Wei Ni,Eduardo Tovar,Jon Crowcroft
关键词: adopted Euclidean distance-based, widely adopted Euclidean, Euclidean distance-based detection, circumvent widely adopted, Recent attacks
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent attacks on federated learning (FL) can introduce malicious model updates that circumvent widely adopted Euclidean distance-based detection methods. This paper proposes a novel defense strategy, referred to as LayerCAM-AE, designed to counteract model poisoning in federated learning. The LayerCAM-AE puts forth a new Layer Class Activation Mapping (LayerCAM) integrated with an autoencoder (AE), significantly enhancing detection capabilities. Specifically, LayerCAM-AE generates a heat map for each local model update, which is then transformed into a more compact visual format. The autoencoder is designed to process the LayerCAM heat maps from the local model updates, improving their distinctiveness and thereby increasing the accuracy in spotting anomalous maps and malicious local models. To address the risk of misclassifications with LayerCAM-AE, a voting algorithm is developed, where a local model update is flagged as malicious if its heat maps are consistently suspicious over several rounds of communication. Extensive tests of LayerCAM-AE on the SVHN and CIFAR-100 datasets are performed under both Independent and Identically Distributed (IID) and non-IID settings in comparison with existing ResNet-50 and REGNETY-800MF defense models. Experimental results show that LayerCAM-AE increases detection rates (Recall: 1.0, Precision: 1.0, FPR: 0.0, Accuracy: 1.0, F1 score: 1.0, AUC: 1.0) and test accuracy in FL, surpassing the performance of both the ResNet-50 and REGNETY-800MF. Our code is available at: this https URL
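
The voting rule, flagging a client only when its heat maps are consistently suspicious across communication rounds, might look like the following sliding-window sketch; the window size and threshold are made-up parameters, not the paper's:

```python
from collections import defaultdict, deque

class SuspicionVoter:
    """Flag a client as malicious only if it looks suspicious in at least
    `threshold` of the last `window` communication rounds (illustrative rule)."""
    def __init__(self, window=5, threshold=3):
        self.window, self.threshold = window, threshold
        self.history = defaultdict(lambda: deque(maxlen=window))

    def update(self, client_id, suspicious_this_round):
        """Record one round's verdict and return the current malicious flag."""
        self.history[client_id].append(bool(suspicious_this_round))
        return sum(self.history[client_id]) >= self.threshold

voter = SuspicionVoter(window=5, threshold=3)
# Round-by-round verdicts from the heat-map/autoencoder stage for one client.
flags = [voter.update("client-7", s) for s in [True, False, True, True, False]]
```

Requiring consistency across rounds is what keeps a single misclassified heat map from ejecting a benign client.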

[CV-89] CoNO: Complex Neural Operator for Continous Dynamical Physical Systems

链接: https://arxiv.org/abs/2406.02597
作者: Karn Tiwari,N M Anoop Krishnan,A P Prathosh
关键词: infinite-dimensional functional spaces, Neural operators extend, operators extend data-driven, Complex Neural Operator, Fractional Fourier Transform
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
*备注: Under Review

点击查看摘要

Abstract:Neural operators extend data-driven models to map between infinite-dimensional functional spaces. While these operators perform effectively in either the time or frequency domain, their performance may be limited when applied to non-stationary spatial or temporal signals whose frequency characteristics change with time. Here, we introduce Complex Neural Operator (CoNO) that parameterizes the integral kernel using Fractional Fourier Transform (FrFT), better representing non-stationary signals in a complex-valued domain. Theoretically, we prove the universal approximation capability of CoNO. We perform an extensive empirical evaluation of CoNO on seven challenging partial differential equations (PDEs), including regular grids, structured meshes, and point clouds. Empirically, CoNO consistently attains state-of-the-art performance, showcasing an average relative gain of 10.9%. Further, CoNO exhibits superior performance, outperforming all other models in additional tasks such as zero-shot super-resolution and robustness to noise. CoNO also exhibits the ability to learn from small amounts of data – giving the same performance as the next best model with just 60% of the training data. Altogether, CoNO presents a robust and superior model for modeling continuous dynamical systems, providing a fillip to scientific machine learning.

[CV-90] Planetary Causal Inference: Implications for the Geography of Poverty

链接: https://arxiv.org/abs/2406.02584
作者: Kazuki Sakamoto,Connor T. Jerzak,Adel Daoud
关键词: Earth observation data, Earth observation, government-derived economic indicators, machine learning, living conditions
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Methodology (stat.ME); Machine Learning (stat.ML)
*备注: For a full list of the papers found in the quantitative literature search, see this https URL

点击查看摘要

Abstract:Earth observation data such as satellite imagery can, when combined with machine learning, have profound impacts on our understanding of the geography of poverty through the prediction of living conditions, especially where government-derived economic indicators are either unavailable or potentially untrustworthy. Recent work has progressed in using EO data not only to predict spatial economic outcomes, but also to explore cause and effect, an understanding which is critical for downstream policy analysis. In this review, we first document the growth of interest in EO-ML analyses in the causal space. We then trace the relationship between spatial statistics and EO-ML methods before discussing the four ways in which EO data has been used in causal ML pipelines – (1.) poverty outcome imputation for downstream causal analysis, (2.) EO image deconfounding, (3.) EO-based treatment effect heterogeneity, and (4.) EO-based transportability analysis. We conclude by providing a workflow for how researchers can incorporate EO data in causal ML analysis going forward.

[CV-91] Exploring the Potential of Polynomial Basis Functions in Kolmogorov-Arnold Networks: A Comparative Study of Different Groups of Polynomials

链接: https://arxiv.org/abs/2406.02583
作者: Seyd Teymoor Seydi
关键词: traditional spline-based methods, Kolmogorov-Arnold Network, KAN models, spline-based methods, KAN
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This paper presents a comprehensive survey of 18 distinct polynomials and their potential applications in Kolmogorov-Arnold Network (KAN) models as an alternative to traditional spline-based methods. The polynomials are classified into various groups based on their mathematical properties, such as orthogonal polynomials, hypergeometric polynomials, q-polynomials, Fibonacci-related polynomials, combinatorial polynomials, and number-theoretic polynomials. The study aims to investigate the suitability of these polynomials as basis functions in KAN models for complex tasks like handwritten digit classification on the MNIST dataset. The performance metrics of the KAN models, including overall accuracy, Kappa, and F1 score, are evaluated and compared. The Gottlieb-KAN model achieves the highest performance across all metrics, suggesting its potential as a suitable choice for the given task. However, further analysis and tuning of these polynomials on more complex datasets are necessary to fully understand their capabilities in KAN models. The source code for the implementation of these KAN models is available at this https URL .
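
A polynomial-basis KAN edge function amounts to a learnable linear combination of basis evaluations in place of a spline. A Chebyshev version (one of the orthogonal families surveyed; the Gottlieb family would slot in the same way) can be sketched as:

```python
import numpy as np

def chebyshev_basis(x, degree):
    """Evaluate Chebyshev polynomials T_0..T_degree at x via the recurrence
    T_{n+1}(x) = 2x T_n(x) - T_{n-1}(x). x is assumed scaled to [-1, 1]."""
    basis = [np.ones_like(x), x]
    for _ in range(2, degree + 1):
        basis.append(2 * x * basis[-1] - basis[-2])
    return np.stack(basis[: degree + 1], axis=-1)

def kan_edge(x, coeffs):
    """A single KAN edge function: a learnable combination of basis polynomials
    (coeffs would be trained; here they are set by hand)."""
    return chebyshev_basis(x, len(coeffs) - 1) @ coeffs

x = np.linspace(-1, 1, 5)
coeffs = np.array([0.0, 0.0, 1.0])      # picks out T_2(x) = 2x^2 - 1
y = kan_edge(x, coeffs)
```

Swapping the basis family changes only `chebyshev_basis`, which is why the survey can compare 18 polynomial groups within one KAN architecture.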

[CV-92] ShadowRefiner: Towards Mask-free Shadow Removal via Fast Fourier Transformer

链接: https://arxiv.org/abs/2406.02559
作者: Wei Dong,Han Zhou,Yuqiong Tian,Jingke Sun,Xiaohong Liu,Guangtao Zhai,Jun Chen
关键词: vision applications including, applications including object, including object detection, Fast Fourier Transformer, exhibit pronounced spatial
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by CVPR workshop 2024 (NTIRE 2024)

点击查看摘要

Abstract:Shadow-affected images often exhibit pronounced spatial discrepancies in color and illumination, consequently degrading various vision applications including object detection and segmentation systems. To effectively eliminate shadows in real-world images while preserving intricate details and producing visually compelling outcomes, we introduce a mask-free Shadow Removal and Refinement network (ShadowRefiner) via Fast Fourier Transformer. Specifically, the Shadow Removal module in our method aims to establish effective mappings between shadow-affected and shadow-free images via spatial and frequency representation learning. To mitigate the pixel misalignment and further improve the image quality, we propose a novel Fast-Fourier Attention based Transformer (FFAT) architecture, where an innovative attention mechanism is designed for meticulous refinement. Our method wins the championship in the Perceptual Track and achieves the second-best performance in the Fidelity Track of the NTIRE 2024 Image Shadow Removal Challenge. In addition, comprehensive experimental results demonstrate the compelling effectiveness of our proposed method. The code is publicly available: this https URL.

[CV-93] Towards Practical Single-shot Motion Synthesis

链接: https://arxiv.org/abs/2406.01136
作者: Konstantinos Roditakis,Spyridon Thermos,Nikolaos Zioulis
关键词: privacy concerns pose, Generative Adversarial Network, cold start, text prompts, computing resources
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
*备注: CVPR 2024, AI for 3D Generation Workshop, Project page: this https URL

点击查看摘要

Abstract:Despite the recent advances in so-called “cold start” generation from text prompts, such methods’ demands for data and computing resources, as well as the ambiguities around intellectual property and privacy, pose certain counterarguments to their utility. An interesting and relatively unexplored alternative has been the introduction of unconditional synthesis from a single sample, which has led to interesting generative applications. In this paper we focus on single-shot motion generation and more specifically on accelerating the training time of a Generative Adversarial Network (GAN). In particular, we tackle the challenge of GAN’s equilibrium collapse when using mini-batch training by carefully annealing the weights of the loss functions that prevent mode collapse. Additionally, we perform statistical analysis in the generator and discriminator models to identify correlations between training stages and enable transfer learning. Our improved GAN achieves competitive quality and diversity on the Mixamo benchmark when compared to the original GAN architecture and a single-shot diffusion model, while training up to 6.8x faster than the former and 1.75x faster than the latter. Finally, we demonstrate the ability of our improved GAN to mix and compose motion with a single forward pass. Project page available at this https URL.
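
The careful annealing of loss-term weights mentioned above could follow any smooth schedule; a cosine anneal with made-up endpoints (not the authors' actual schedule or weights) looks like:

```python
import math

def annealed_weight(step, total_steps, w_start=1.0, w_end=0.1):
    """Cosine-anneal a loss-term weight from w_start down to w_end over training.
    The endpoints and shape here are illustrative assumptions."""
    t = min(step / total_steps, 1.0)
    return w_end + 0.5 * (w_start - w_end) * (1 + math.cos(math.pi * t))

w0 = annealed_weight(0, 1000)        # start of training: full weight
w_mid = annealed_weight(500, 1000)   # midpoint: weight partly decayed
w_end = annealed_weight(1000, 1000)  # end of training: residual weight
```

Keeping the mode-collapse-preventing terms strong early and relaxing them later is one common way to stabilize mini-batch GAN training without permanently constraining the generator.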

[CV-94] Computation-Efficient Era: A Comprehensive Survey of State Space Models in Medical Image Analysis

链接: https://arxiv.org/abs/2406.03430
作者: Moein Heidari,Sina Ghorbani Kolahi,Sanaz Karimijafarbigloo,Bobby Azad,Afshin Bozorgpour,Soheila Hatami,Reza Azad,Ali Diba,Ulas Bagci,Dorit Merhof,Ilker Hacihaliloglu
关键词: recurrent neural networks, Mamba models, performing these tasks, Sequence modeling plays, plays a vital
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: This is the first version of our survey, and the paper is currently under review

点击查看摘要

Abstract:Sequence modeling plays a vital role across various domains, with recurrent neural networks being historically the predominant method of performing these tasks. However, the emergence of transformers has altered this paradigm due to their superior performance. Built upon these advances, transformers have conjoined CNNs as two leading foundational models for learning visual representations. However, transformers are hindered by the O(N^2) complexity of their attention mechanisms, while CNNs lack global receptive fields and dynamic weight allocation. State Space Models (SSMs), specifically the Mamba model with selection mechanisms and hardware-aware architecture, have garnered immense interest lately in sequential modeling and visual representation learning, challenging the dominance of transformers by providing infinite context lengths and offering substantial efficiency maintaining linear complexity in the input sequence. Capitalizing on the advances in computer vision, medical imaging has heralded a new epoch with Mamba models. Intending to help researchers navigate the surge, this survey seeks to offer an encyclopedic review of Mamba models in medical imaging. Specifically, we start with a comprehensive theoretical review forming the basis of SSMs, including Mamba architecture and its alternatives for sequence modeling paradigms in this context. Next, we offer a structured classification of Mamba models in the medical field and introduce a diverse categorization scheme based on their application, imaging modalities, and targeted organs. Finally, we summarize key challenges, discuss different future research directions of the SSMs in the medical domain, and propose several directions to fulfill the demands of this field. In addition, we have compiled the studies discussed in this paper along with their open-source implementations on our GitHub repository.
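
At the heart of SSMs is a linear recurrence whose cost grows linearly with sequence length, in contrast to attention's quadratic cost. A minimal scalar-input version (ignoring Mamba's selection mechanism and hardware-aware scan) is:

```python
import numpy as np

def ssm_scan(A, B, C, u):
    """Minimal discrete linear state-space recurrence underlying SSM layers:
    x_k = A x_{k-1} + B u_k,  y_k = C x_k.  One pass, linear in len(u)."""
    x = np.zeros(A.shape[0])
    ys = []
    for u_k in u:
        x = A @ x + B * u_k
        ys.append(C @ x)
    return np.array(ys)

# A 1-state "leaky accumulator": the output decays an impulse geometrically.
A = np.array([[0.5]])
B = np.array([1.0])
C = np.array([1.0])
y = ssm_scan(A, B, C, [1.0, 0.0, 0.0])
```

Mamba makes A, B, and C input-dependent (the selection mechanism) and replaces this Python loop with a parallel hardware-aware scan, but the state recurrence itself is the source of the linear complexity the abstract refers to.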

[CV-95] UnWave-Net: Unrolled Wavelet Network for Compton Tomography Image Reconstruction

链接: https://arxiv.org/abs/2406.03413
作者: Ishak Ayad,Cécilia Tarpau,Javier Cebeiro,Maï K. Nguyen
关键词: scan internal structures, typically involving collimation, Computed tomography, medical imaging technique, typically involving
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: This paper has been early accepted by MICCAI 2024

点击查看摘要

Abstract:Computed tomography (CT) is a widely used medical imaging technique to scan internal structures of a body, typically involving collimation and mechanical rotation. Compton scatter tomography (CST) presents an interesting alternative to conventional CT by leveraging Compton physics instead of collimation to gather information from multiple directions. While CST introduces new imaging opportunities with several advantages such as high sensitivity, compactness, and entirely fixed systems, image reconstruction remains an open problem due to the mathematical challenges of CST modeling. In contrast, deep unrolling networks have demonstrated potential in CT image reconstruction, despite their computationally intensive nature. In this study, we investigate the efficiency of unrolling networks for CST image reconstruction. To address the important computational cost required for training, we propose UnWave-Net, a novel unrolled wavelet-based reconstruction network. This architecture includes a non-local regularization term based on wavelets, which captures long-range dependencies within images and emphasizes the multi-scale components of the wavelet transform. We evaluate our approach using a CST of circular geometry which stays completely static during data acquisition, where UnWave-Net facilitates image reconstruction in the absence of a specific reconstruction formula. Our method outperforms existing approaches and achieves state-of-the-art performance in terms of SSIM and PSNR, and offers an improved computational efficiency compared to traditional unrolling networks.
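For readers unfamiliar with deep unrolling, the idea is to take a fixed number of iterations of a classical reconstruction algorithm and treat them as network layers whose parameters can be learned. The sketch below unrolls plain ISTA (a gradient step followed by soft-thresholding) for a generic sparse-recovery problem; it is a hand-set stand-in, not UnWave-Net's wavelet-regularized architecture, in which the thresholds and filters would be trainable per layer.

```python
import numpy as np

def soft_threshold(z, lam):
    """Proximal operator of the l1 norm: shrink values toward zero."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def unrolled_ista(A, y, n_layers, lam):
    """A fixed number of ISTA iterations for y ~ A x with sparse x.

    In a learned unrolled network, the step size, threshold, and often
    the operator itself become per-layer trainable weights.
    """
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_layers):
        x = soft_threshold(x + A.T @ (y - A @ x) / L, lam / L)
    return x

rng = np.random.default_rng(1)
A = rng.standard_normal((40, 100)) / np.sqrt(40)
x_true = np.zeros(100)
x_true[[3, 30, 77]] = [2.0, -1.5, 1.0]
x_hat = unrolled_ista(A, A @ x_true, n_layers=200, lam=0.05)
```

Replacing the identity sparsifying transform with a wavelet transform, as the paper does, changes only where the thresholding is applied.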

[CV-96] SuperFormer: Volumetric Transformer Architectures for MRI Super-Resolution

链接: https://arxiv.org/abs/2406.03359
作者: Cristhian Forigua,Maria Escobar,Pablo Arbelaez
关键词: Visual Transformers, Swin Transformer model, Magnetic Resonance Imaging, Swin Transformer, processing volumetric medical
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This paper presents a novel framework for processing volumetric medical information using Visual Transformers (ViTs). First, we extend the state-of-the-art Swin Transformer model to the 3D medical domain. Second, we propose a new approach for processing volumetric information and encoding position in ViTs for 3D applications. We instantiate the proposed framework and present SuperFormer, a volumetric transformer-based approach for Magnetic Resonance Imaging (MRI) Super-Resolution. Our method leverages the 3D information of the MRI domain and uses a local self-attention mechanism with a 3D relative positional encoding to recover anatomical details. In addition, our approach takes advantage of multi-domain information from volume and feature domains and fuses them to reconstruct the High-Resolution MRI. We perform an extensive validation on the Human Connectome Project dataset and demonstrate the superiority of volumetric transformers over 3D CNN-based methods. Our code and pretrained models are available at this https URL.

[CV-97] EngineBench: Flow Reconstruction in the Transparent Combustion Chamber III Optical Engine

链接: https://arxiv.org/abs/2406.03325
作者: Samuel J. Baker,Michael A. Hobley,Isabel Scherl,Xiaohang Fang,Felix C. P. Leach,Martin H. Davy
关键词: inside combustion machinery, high quality experimental, machine learning, oriented database, turbulent flows inside
类目: Fluid Dynamics (physics.flu-dyn); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We present EngineBench, the first machine learning (ML) oriented database to use high quality experimental data for the study of turbulent flows inside combustion machinery. Prior datasets for ML in fluid mechanics are synthetic or use overly simplistic geometries. EngineBench comprises real-world particle image velocimetry (PIV) data that captures the turbulent airflow patterns in a specially-designed optical engine. However, in PIV data from internal flows, such as from engines, it is often challenging to achieve a full field of view and large occlusions can be present. In order to design optimal combustion systems, insight into the turbulent flows in these obscured areas is needed, which can be provided via inpainting models. Here we propose a novel inpainting task using random edge gaps, a technique that emphasises realism by introducing occlusions at random sizes and orientations at the edges of the PIV images. We test five ML methods on random edge gaps using pixel-wise, vector-based, and multi-scale performance metrics. We find that UNet-based models are more accurate than the industry-norm non-parametric approach and the context encoder at this task on both small and large gap sizes. The dataset and inpainting task presented in this paper support the development of more general-purpose pre-trained ML models for engine design problems. The method comparisons allow for more informed selection of ML models for problems in experimental flow diagnostics. All data and code are publicly available at this https URL.
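The "random edge gaps" task can be emulated with a simple mask generator. The sketch below follows our own illustrative convention, not the EngineBench code: it places a rectangular occlusion of random size against a randomly chosen image edge, whereas the real benchmark also randomizes orientation.

```python
import numpy as np

def random_edge_gap(shape, rng, max_frac=0.4):
    """Return a boolean mask with a rectangular gap touching a random edge.

    True marks occluded pixels to be inpainted. Gap size and the chosen
    edge are random; real PIV occlusions also vary in orientation/shape.
    """
    h, w = shape
    mask = np.zeros(shape, dtype=bool)
    gh = rng.integers(1, int(h * max_frac) + 1)   # gap height
    gw = rng.integers(1, int(w * max_frac) + 1)   # gap width
    edge = rng.integers(4)
    if edge == 0:    # top
        x0 = rng.integers(0, w - gw + 1); mask[:gh, x0:x0 + gw] = True
    elif edge == 1:  # bottom
        x0 = rng.integers(0, w - gw + 1); mask[h - gh:, x0:x0 + gw] = True
    elif edge == 2:  # left
        y0 = rng.integers(0, h - gh + 1); mask[y0:y0 + gh, :gw] = True
    else:            # right
        y0 = rng.integers(0, h - gh + 1); mask[y0:y0 + gh, w - gw:] = True
    return mask

rng = np.random.default_rng(42)
m = random_edge_gap((64, 64), rng)
```

An inpainting model is then trained to reconstruct the velocity field under `m`, scoring only the occluded pixels.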

[CV-98] Generative Diffusion Models for Fast Simulations of Particle Collisions at CERN

链接: https://arxiv.org/abs/2406.03233
作者: Mikołaj Kita,Jan Dubiński,Przemysław Rokita,Kamil Deja
关键词: Large Hadron Collider, High Energy Physics, CERN Large Hadron, Energy Physics simulations, Hadron Collider
类目: Data Analysis, Statistics and Probability (physics.data-an); Computer Vision and Pattern Recognition (cs.CV); High Energy Physics - Experiment (hep-ex)
*备注:

点击查看摘要

Abstract:In High Energy Physics, simulations play a crucial role in unraveling the complexities of particle collision experiments within CERN’s Large Hadron Collider. Machine learning simulation methods have garnered attention as promising alternatives to traditional approaches. While existing methods mainly employ Variational Autoencoders (VAEs) or Generative Adversarial Networks (GANs), recent advancements highlight the efficacy of diffusion models as state-of-the-art generative machine learning methods. We present the first simulation for Zero Degree Calorimeter (ZDC) at the ALICE experiment based on diffusion models, achieving the highest fidelity compared to existing baselines. We perform an analysis of trade-offs between generation times and the simulation quality. The results indicate a significant potential of the latent diffusion model due to its rapid generation time.
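As background on the diffusion-model objective such simulators build on, the closed-form DDPM forward process and its noise-prediction loss can be sketched in a few lines. The schedule, the array shapes, and the trivial "predict zero" baseline below are illustrative assumptions, not details from the paper.

```python
import numpy as np

def linear_alpha_bar(T):
    """Cumulative product alpha_bar_t of (1 - beta_t) for a linear beta schedule."""
    betas = np.linspace(1e-4, 0.02, T)
    return np.cumprod(1.0 - betas)

def forward_noise(x0, t, alpha_bar, rng):
    """Sample x_t ~ q(x_t | x_0) in closed form; also return the noise eps.

    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps
    """
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps

rng = np.random.default_rng(0)
alpha_bar = linear_alpha_bar(1000)
x0 = rng.standard_normal((8, 44))   # stand-in for flattened detector responses
xt, eps = forward_noise(x0, t=999, alpha_bar=alpha_bar, rng=rng)

# Training minimizes mean((eps_pred - eps)**2) for a network eps_pred(xt, t);
# the trivial "predict zero" baseline gives a loss near E[eps^2] = 1.
loss = np.mean((np.zeros_like(eps) - eps) ** 2)
```

A latent diffusion model runs the same process in the latent space of an autoencoder, which is where the paper's speed advantage comes from.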

[CV-99] Multi-Task Multi-Scale Contrastive Knowledge Distillation for Efficient Medical Image Segmentation

链接: https://arxiv.org/abs/2406.03173
作者: Risab Biswas
关键词: image segmentation tasks, medical image segmentation, teacher model, segmentation tasks, specifically focusing
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: Master’s thesis

点击查看摘要

Abstract:This thesis aims to investigate the feasibility of knowledge transfer between neural networks for medical image segmentation tasks, specifically focusing on the transfer from a larger multi-task “Teacher” network to a smaller “Student” network. In the context of medical imaging, where the data volumes are often limited, leveraging knowledge from a larger pre-trained network could be useful. The primary objective is to enhance the performance of a smaller student model by incorporating knowledge representations acquired by a teacher model that adopts a multi-task pre-trained architecture trained on CT images, to a more resource-efficient student network, which can essentially be a smaller version of the same, trained on a mere 50% of the data than that of the teacher model. To facilitate knowledge transfer between the two models, we devised an architecture incorporating multi-scale feature distillation and supervised contrastive learning. Our study aims to improve the student model’s performance by integrating knowledge representations from the teacher model. We investigate whether this approach is particularly effective in scenarios with limited computational resources and limited training data availability. To assess the impact of multi-scale feature distillation, we conducted extensive experiments. We also conducted a detailed ablation study to determine whether it is essential to distil knowledge at various scales, including low-level features from encoder layers, for effective knowledge transfer. In addition, we examine different losses in the knowledge distillation process to gain insights into their effects on overall performance.
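The multi-scale feature distillation component can be illustrated with a simple loss: MSE between matched teacher and student feature maps at several scales. This is a hedged sketch of the general technique only; the thesis additionally uses supervised contrastive learning, and real implementations insert projection layers to match channel counts between the two networks.

```python
import numpy as np

def feature_distillation_loss(student_feats, teacher_feats, weights=None):
    """Multi-scale feature distillation: weighted MSE between matched maps.

    student_feats / teacher_feats: lists of arrays, one per scale
    (e.g. encoder outputs at decreasing resolution). In practice a 1x1
    projection first matches channel counts; here shapes must agree.
    """
    if weights is None:
        weights = [1.0] * len(student_feats)
    total = 0.0
    for w, s, t in zip(weights, student_feats, teacher_feats):
        total += w * np.mean((s - t) ** 2)
    return total / sum(weights)

rng = np.random.default_rng(0)
# Two scales of hypothetical (batch, channels, H, W) feature maps.
teacher = [rng.standard_normal((4, 16, 32, 32)),
           rng.standard_normal((4, 32, 16, 16))]
student = [f + 0.1 * rng.standard_normal(f.shape) for f in teacher]
loss = feature_distillation_loss(student, teacher)
```

The loss is zero when the student reproduces the teacher's features exactly, and here is close to the injected noise variance of 0.01.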

[CV-100] EpidermaQuant: Unsupervised detection and quantification of epidermal differentiation markers on H-DAB-stained images of reconstructed human epidermis

链接: https://arxiv.org/abs/2406.03103
作者: Dawid Zamojski,Agnieszka Gogler,Dorota Scieglinska,Michal Marczyk
关键词: histological analyses combined, reconstructed human epidermis, human epidermis generated, assessed using histological, histological analyses
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:The integrity of the reconstructed human epidermis generated in vitro could be assessed using histological analyses combined with immunohistochemical staining of keratinocyte differentiation markers. Computer-based analysis of scanned tissue saves the expert time and may improve the accuracy of quantification by eliminating interrater reliability issues. However, technical differences during the preparation and capture of stained images and the presence of multiple artifacts may influence the outcome of computational methods. Using a dataset with 598 unannotated images showing cross-sections of in vitro reconstructed human epidermis stained with DAB-based immunohistochemistry reaction to visualize 4 different keratinocyte differentiation marker proteins (filaggrin, keratin 10, Ki67, HSPA2) and counterstained with hematoxylin, we developed an unsupervised method for the detection and quantification of immunohistochemical staining. The proposed pipeline includes the following steps: (i) color normalization to reduce the variability of pixel intensity values in different samples; (ii) color deconvolution to acquire color channels of the stains used; (iii) morphological operations to find the background area of the image; (iv) automatic image rotation; and (v) finding markers of human epidermal differentiation with clustering. Also, we created a method to exclude images without DAB-stained areas. The most effective combination of methods includes: (i) Reinhard’s normalization; (ii) Ruifrok and Johnston color deconvolution method; (iii) proposed image rotation method based on boundary distribution of image intensity; (iv) k-means clustering using DAB stain intensity. These results should enhance the performance of quantitative analysis of protein markers in reconstructed human epidermis samples and enable comparison of their spatial distribution between different experimental conditions.
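Step (ii) of the pipeline, Ruifrok and Johnston color deconvolution, amounts to converting RGB to Beer-Lambert optical density and unmixing with the pseudo-inverse of a stain matrix. The sketch below demonstrates the round trip on a synthetic pixel; the stain vectors are illustrative values, not the calibrated hematoxylin/DAB vectors a real pipeline would estimate from the data.

```python
import numpy as np

def color_deconvolution(rgb, stain_matrix, bg=255.0):
    """Ruifrok-Johnston style stain unmixing.

    rgb: (..., 3) image with values in (0, bg].
    stain_matrix: (n_stains, 3), rows are unit optical-density stain vectors.
    Returns per-pixel stain concentrations of shape (..., n_stains).
    """
    od = -np.log10(np.clip(rgb, 1.0, bg) / bg)   # Beer-Lambert optical density
    return od @ np.linalg.pinv(stain_matrix)

# Illustrative (not calibrated) hematoxylin and DAB OD vectors, unit-normalized.
H = np.array([0.65, 0.70, 0.29]); H /= np.linalg.norm(H)
D = np.array([0.27, 0.57, 0.78]); D /= np.linalg.norm(D)
M = np.stack([H, D])

# Round trip: synthesize a pixel from known concentrations, then unmix.
c_true = np.array([0.8, 0.3])
rgb = 255.0 * 10.0 ** (-(c_true @ M))
c_hat = color_deconvolution(rgb, M)
```

Because the stain matrix has full row rank, the unmixing recovers the true concentrations exactly on noise-free data; DAB-channel intensity from this step is what feeds the k-means clustering in the paper's final stage.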

[CV-101] Phy-Diff: Physics-guided Hourglass Diffusion Model for Diffusion MRI Synthesis

链接: https://arxiv.org/abs/2406.03002
作者: Juanhua Zhang,Ruodan Yan,Alessandro Perelli,Xi Chen,Chao Li
关键词: high acquisition costs, Diffusion MRI, important neuroimaging technique, acquisition costs, important neuroimaging
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by MICCAI 2024

点击查看摘要

Abstract:Diffusion MRI (dMRI) is an important neuroimaging technique with high acquisition costs. Deep learning approaches have been used to enhance dMRI and predict diffusion biomarkers through undersampled dMRI. To generate more comprehensive raw dMRI, generative adversarial network based methods are proposed to include b-values and b-vectors as conditions, but they are limited by unstable training and less desirable diversity. The emerging diffusion model (DM) promises to improve generative performance. However, it remains challenging to include essential information in conditioning DM for more relevant generation, i.e., the physical principles of dMRI and white matter tract structures. In this study, we propose a physics-guided diffusion model to generate high-quality dMRI. Our model introduces the physical principles of dMRI into the noise evolution of the diffusion process and introduces a query-based conditional mapping within the diffusion model. In addition, to enhance the anatomical fine details of the generated images, we introduce the XTRACT atlas as a prior on white matter tracts by adopting an adapter technique. Our experiment results show that our method outperforms other state-of-the-art methods and has the potential to advance dMRI enhancement.

[CV-102] Radiomics-guided Multimodal Self-attention Network for Predicting Pathological Complete Response in Breast MRI

链接: https://arxiv.org/abs/2406.02936
作者: Jonghun Kim,Hyunjin Park
关键词: pathologic complete response, predicting pathologic complete, treatment customization, complete response, anti-cancer treatment
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: 5 pages, 5 figures, IEEE ISBI 2024 proceedings

点击查看摘要

Abstract:Breast cancer is the most prevalent cancer among women and predicting pathologic complete response (pCR) after anti-cancer treatment is crucial for patient prognosis and treatment customization. Deep learning has shown promise in medical imaging diagnosis, particularly when utilizing multiple imaging modalities to enhance accuracy. This study presents a model that predicts pCR in breast cancer patients using dynamic contrast-enhanced (DCE) magnetic resonance imaging (MRI) and apparent diffusion coefficient (ADC) maps. Radiomics features are established hand-crafted features of the tumor region and thus could be useful in medical image analysis. Our approach extracts features from both DCE MRI and ADC using an encoder with a self-attention mechanism, leveraging radiomics to guide feature extraction from tumor-related regions. Our experimental results demonstrate the superior performance of our model in predicting pCR compared to other baseline methods.

[CV-103] U-KAN Makes Strong Backbone for Medical Image Segmentation and Generation

链接: https://arxiv.org/abs/2406.02918
作者: Chenxin Li,Xinyu Liu,Wuyang Li,Cheng Wang,Hengyu Liu,Yixuan Yuan
关键词: visual applications, diffusion probability models, medical image segmentation, image segmentation, probability models
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: arXiv admin note: text overlap with arXiv:2405.14399 , arXiv:2203.04967 by other authors

点击查看摘要

Abstract:U-Net has become a cornerstone in various visual applications such as image segmentation and diffusion probability models. While numerous innovative designs and improvements have been introduced by incorporating transformers or MLPs, the networks remain limited to linear modeling of patterns and suffer from deficient interpretability. To address these challenges, our intuition is inspired by the impressive results of the Kolmogorov-Arnold Networks (KANs) in terms of accuracy and interpretability, which reshape the neural network learning via the stack of non-linear learnable activation functions derived from the Kolmogorov-Arnold representation theorem. Specifically, in this paper, we explore the untapped potential of KANs in improving backbones for vision tasks. We investigate, modify and re-design the established U-Net pipeline by integrating the dedicated KAN layers on the tokenized intermediate representation, termed U-KAN. Rigorous medical image segmentation benchmarks verify the superiority of U-KAN by higher accuracy even with less computation cost. We further delved into the potential of U-KAN as an alternative U-Net noise predictor in diffusion models, demonstrating its applicability in generating task-oriented model architectures. These endeavours unveil valuable insights and shed light on the prospect that, with U-KAN, you can make a strong backbone for medical image segmentation and generation. Project page: this https URL
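The KAN building block can be sketched compactly: instead of fixed activations on nodes, every edge carries its own learnable univariate function, and each output is a sum of these edge functions. The toy layer below uses a Gaussian-bump basis as a stand-in for the B-splines of the original KAN formulation; shapes and parameterization are our illustrative choices, not U-KAN's.

```python
import numpy as np

def kan_layer(x, coeffs, centers, width=0.5):
    """One KAN-style layer: y[b, j] = sum_i phi_{ij}(x[b, i]).

    Each edge (i, j) has its own learnable 1-D function, parameterized
    here as a linear combination of shared Gaussian bumps.

    x: (batch, d_in); coeffs: (d_in, d_out, n_basis); centers: (n_basis,)
    """
    # (batch, d_in, n_basis): the shared basis evaluated on each input dim
    basis = np.exp(-((x[..., None] - centers) / width) ** 2)
    # Sum over input dims i and basis k to get (batch, d_out)
    return np.einsum('bik,ijk->bj', basis, coeffs)

rng = np.random.default_rng(0)
centers = np.linspace(-2, 2, 8)
coeffs = 0.1 * rng.standard_normal((3, 5, 8))   # learnable in a real KAN
y = kan_layer(rng.standard_normal((4, 3)), coeffs, centers)
```

U-KAN drops layers of this kind onto the tokenized intermediate representation of a U-Net, with `coeffs` trained by backpropagation like any other weight tensor.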

[CV-104] Second-order differential operators stochastic differential equations and Brownian motions on embedded manifolds

链接: https://arxiv.org/abs/2406.02879
作者: Du Nguyen,Stefan Sommer
关键词: Riemannian Brownian motions, stochastic differential equation, Brownian motions, second-order differential operators, Riemannian Brownian
类目: Probability (math.PR); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Numerical Analysis (math.NA); Computation (stat.CO)
*备注:

点击查看摘要

Abstract:We specify the conditions when a manifold M embedded in an inner product space E is an invariant manifold of a stochastic differential equation (SDE) on E, linking it with the notion of second-order differential operators on M. When M is given a Riemannian metric, we derive a simple formula for the Laplace-Beltrami operator in terms of the gradient and Hessian on E and construct the Riemannian Brownian motions on M as solutions of conservative Stratonovich and Ito SDEs on E. We derive explicitly the SDE for Brownian motions on several important manifolds in applications, including left-invariant matrix Lie groups using embedded coordinates. Numerically, we propose three simulation schemes to solve SDEs on manifolds. In addition to the stochastic projection method, to simulate Riemannian Brownian motions, we construct a second-order tangent retraction of the Levi-Civita connection using a given E-tubular retraction. We also propose the retractive Euler-Maruyama method to solve an SDE, taking into account the second-order term of a tangent retraction. We provide software to implement the methods in the paper, including Brownian motions of the manifolds discussed. We verify numerically that on several compact Riemannian manifolds, the long-term limit of Brownian simulation converges to the uniform distributions, suggesting a method to sample Riemannian uniform distributions.
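The simplest of the three simulation ideas, the stochastic projection method, is easy to sketch for the unit sphere: take a tangent-space Euler-Maruyama step, then retract back onto the manifold by normalization. This illustrates the projection idea only, not the paper's second-order tangent retraction or retractive Euler-Maruyama schemes.

```python
import numpy as np

def projected_brownian_sphere(x0, n_steps, dt, rng):
    """Simulate Brownian motion on the unit sphere S^2 embedded in R^3.

    Each step: (i) project an ambient Gaussian increment onto the tangent
    plane at x, (ii) take the Euler-Maruyama step, (iii) retract back to
    the sphere by normalizing.
    """
    x = np.array(x0, dtype=float)
    path = [x.copy()]
    for _ in range(n_steps):
        dw = np.sqrt(dt) * rng.standard_normal(3)
        dw -= np.dot(dw, x) * x          # tangent-plane projection
        x = x + dw                       # ambient Euler-Maruyama step
        x /= np.linalg.norm(x)           # retraction onto the sphere
        path.append(x.copy())
    return np.array(path)

rng = np.random.default_rng(7)
path = projected_brownian_sphere([0.0, 0.0, 1.0], n_steps=500, dt=1e-3, rng=rng)
```

Every iterate lies exactly on the sphere by construction; the long-run distribution of such paths approaches the uniform distribution on S^2, consistent with the paper's numerical observation on compact manifolds.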

[CV-105] Neural Representations of Dynamic Visual Stimuli

链接: https://arxiv.org/abs/2406.02659
作者: Jacob Yeung,Andrew F. Luo,Gabriel Sarch,Margaret M. Henderson,Deva Ramanan,Michael J. Tarr
关键词: constantly changing visual, change in appearance, changing visual stimuli, shift and move, vary in distance
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Humans experience the world through constantly changing visual stimuli, where scenes can shift and move, change in appearance, and vary in distance. The dynamic nature of visual perception is a fundamental aspect of our daily lives, yet the large majority of research on object and scene processing, particularly using fMRI, has focused on static stimuli. While studies of static image perception are attractive due to their computational simplicity, they impose a strong non-naturalistic constraint on our investigation of human vision. In contrast, dynamic visual stimuli offer a more ecologically-valid approach but present new challenges due to the interplay between spatial and temporal information, making it difficult to disentangle the representations of stable image features and motion. To overcome this limitation, given dynamic inputs, we explicitly decouple the modeling of static image representations and motion representations in the human brain. Three results demonstrate the feasibility of this approach. First, we show that visual motion information as optical flow can be predicted (or decoded) from brain activity as measured by fMRI. Second, we show that this predicted motion can be used to realistically animate static images using a motion-conditioned video diffusion model (where the motion is driven by fMRI brain activity). Third, we show prediction in the reverse direction: existing video encoders can be fine-tuned to predict fMRI brain activity from video imagery, and can do so more effectively than image encoders. This foundational work offers a novel, extensible framework for interpreting how the human brain processes dynamic visual information.

[CV-106] Pancreatic Tumor Segmentation as Anomaly Detection in CT Images Using Denoising Diffusion Models

链接: https://arxiv.org/abs/2406.02653
作者: Reza Babaei,Samuel Cheng,Theresa Thai,Shangqing Zhao
关键词: pancreatic tumor detection, advances in medicine, remained a formidable, pancreatic tumor, tumor detection
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Despite the advances in medicine, cancer has remained a formidable challenge. Particularly in the case of pancreatic tumors, characterized by their diversity and late diagnosis, early detection poses a significant challenge crucial for effective treatment. The advancement of deep learning techniques, particularly supervised algorithms, has significantly propelled pancreatic tumor detection in the medical field. However, supervised deep learning approaches necessitate extensive labeled medical images for training, yet acquiring such annotations is both limited and costly. Conversely, weakly supervised anomaly detection methods, requiring only image-level annotations, have garnered interest. Existing methodologies predominantly hinge on generative adversarial networks (GANs) or autoencoder models, which can be complex to train, and these models may face difficulties in accurately preserving fine image details. This research presents a novel approach to pancreatic tumor detection, employing weak supervision anomaly detection through denoising diffusion algorithms. By incorporating a deterministic iterative process of adding and removing noise along with classifier guidance, the method enables seamless translation of images between diseased and healthy subjects, resulting in detailed anomaly maps without requiring complex training protocols and segmentation masks. This study explores denoising diffusion models as a recent advancement over traditional generative models like GANs, contributing to the field of pancreatic tumor detection. Recognizing the low survival rates of pancreatic cancer, this study emphasizes the need for continued research to leverage diffusion models’ efficiency in medical segmentation tasks.

[CV-107] A Brief Overview of Optimization-Based Algorithms for MRI Reconstruction Using Deep Learning

链接: https://arxiv.org/abs/2406.02626
作者: Wanyu Bian
关键词: Magnetic resonance imaging, high spatial resolution, exceptional soft tissue, soft tissue contrast, Magnetic resonance
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Magnetic resonance imaging (MRI) is renowned for its exceptional soft tissue contrast and high spatial resolution, making it a pivotal tool in medical imaging. The integration of deep learning algorithms offers significant potential for optimizing MRI reconstruction processes. Despite the growing body of research in this area, a comprehensive survey of optimization-based deep learning models tailored for MRI reconstruction has yet to be conducted. This review addresses this gap by presenting a thorough examination of the latest optimization-based algorithms in deep learning specifically designed for MRI reconstruction. The goal of this paper is to provide researchers with a detailed understanding of these advancements, facilitating further innovation and application within the MRI community.

[CV-108] EVAN: Evolutional Video Streaming Adaptation via Neural Representation

链接: https://arxiv.org/abs/2406.02557
作者: Mufan Liu,Le Yang,Yiling Xu,Ye-kui Wang,Jenq-Neng Hwang
关键词: limited adaptation capability, exhibiting limited adaptation, Adaptive bitrate, exhibiting limited, conventional codecs
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
*备注: accepted by ICME (conference)

点击查看摘要

Abstract:Adaptive bitrate (ABR) using conventional codecs cannot further modify the bitrate once a decision has been made, exhibiting limited adaptation capability. This may result in either overly conservative or overly aggressive bitrate selection, which could cause either inefficient utilization of the network bandwidth or frequent re-buffering, respectively. Neural representation for video (NeRV), which embeds the video content into neural network weights, allows video reconstruction with incomplete models. Specifically, the recovery of one frame can be achieved without relying on the decoding of adjacent frames. NeRV has the potential to provide high video reconstruction quality and, more importantly, pave the way for developing more flexible ABR strategies for video transmission. In this work, a new framework, named Evolutional Video streaming Adaptation via Neural representation (EVAN), which can adaptively transmit NeRV models based on soft actor-critic (SAC) reinforcement learning, is proposed. EVAN is trained with a more exploitative strategy and utilizes progressive playback to avoid re-buffering. Experiments showed that EVAN can outperform existing ABRs with 50% reduction in re-buffering and achieve nearly 20%.

[CV-109] Hear Me See Me Understand Me: Audio-Visual Autism Behavior Recognition

链接: https://arxiv.org/abs/2406.02554
作者: Shijian Deng,Erin E. Kosloski,Siddhi Patel,Zeke A. Barnett,Yiyang Nan,Alexander Kaplan,Sisira Aarukapalli,William T. Doan,Matthew Wang,Harsh Singh,Pamela R. Rollins,Yapeng Tian
关键词: autism behavior recognition, essential aspect previously, aspect previously omitted, behavior recognition, audio-visual autism behavior
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
*备注:

点击查看摘要

Abstract:In this article, we introduce a novel problem of audio-visual autism behavior recognition, which includes social behavior recognition, an essential aspect previously omitted in AI-assisted autism screening research. We define audio-visual autism behavior recognition as the task of using audio and visual cues, including any speech present in the audio, to recognize autism-related behaviors. To facilitate this new research direction, we collected an audio-visual autism spectrum dataset (AV-ASD), currently the largest video dataset for autism screening using a behavioral approach. It covers an extensive range of autism-associated behaviors, including those related to social communication and interaction. To pave the way for further research on this new problem, we intensively explored leveraging foundation models and multimodal large language models across different modalities. Our experiments on the AV-ASD dataset demonstrate that integrating audio, visual, and speech modalities significantly enhances the performance in autism behavior recognition. Additionally, we explored the use of a post-hoc to ad-hoc pipeline in a multimodal large language model to investigate its potential to augment the model’s explanatory capability during autism behavior recognition. We will release our dataset, code, and pre-trained models.

机器学习

[LG-0] Wings: Learning Multimodal LLMs without Text-only Forgetting

链接: https://arxiv.org/abs/2406.03496
作者: Yi-Kai Zhang,Shiyin Lu,Yang Li,Yanqing Ma,Qing-Guo Chen,Zhao Xu,Weihua Luo,Kaifu Zhang,De-Chuan Zhan,Han-Jia Ye
关键词: large language models, Multimodal large language, trained LLM, language models, large language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multimodal large language models (MLLMs), initiated with a trained LLM, first align images with text and then fine-tune on multimodal mixed inputs. However, the MLLM catastrophically forgets the text-only instructions, which do not include images and can be addressed within the initial LLM. In this paper, we present Wings, a novel MLLM that excels in both text-only dialogues and multimodal comprehension. Analyzing MLLM attention in multimodal instructions reveals that text-only forgetting is related to the attention shifts from pre-image to post-image text. From that, we construct extra modules that act as the boosted learner to compensate for the attention shift. The complementary visual and textual learners, like “wings” on either side, are connected in parallel within each layer’s attention block. Initially, image and text inputs are aligned with visual learners operating alongside the main attention, balancing focus on visual elements. Textual learners are later collaboratively integrated with attention-based routing to blend the outputs of the visual and textual learners. We design the Low-Rank Residual Attention (LoRRA) to guarantee high efficiency for learners. Our experimental results demonstrate that Wings outperforms equally-scaled MLLMs in both text-only and visual question-answering tasks. On a newly constructed Interleaved Image-Text (IIT) benchmark, Wings exhibits superior performance from text-only-rich to multimodal-rich question-answering tasks.

[LG-1] Grokking Modular Polynomials

链接: https://arxiv.org/abs/2406.03495
作者: Darshil Doshi,Tianyu He,Aritra Das,Andrey Gromov
关键词: Neural networks readily, modular arithmetic tasks, networks readily learn, modular, Multi-layer Perceptron
类目: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); High Energy Physics - Theory (hep-th); Number Theory (math.NT); Machine Learning (stat.ML)
*备注: 7+4 pages, 3 figures, 2 tables

点击查看摘要

Abstract:Neural networks readily learn a subset of the modular arithmetic tasks, while failing to generalize on the rest. This limitation persists regardless of the choice of architecture and training strategies. On the other hand, an analytical solution for the weights of Multi-layer Perceptron (MLP) networks that generalize on the modular addition task is known in the literature. In this work, we (i) extend the class of analytical solutions to include modular multiplication as well as modular addition with many terms. Additionally, we show that real networks trained on these datasets learn similar solutions upon generalization (grokking). (ii) We combine these “expert” solutions to construct networks that generalize on arbitrary modular polynomials. (iii) We hypothesize a classification of modular polynomials into learnable and non-learnable via neural networks training; and provide experimental evidence supporting our claims.
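A flavor of the periodic "expert" solutions such analyses build on: for modular addition, scoring each candidate c by a sum of cosines of 2*pi*k*(a + b - c)/p is maximized exactly at c = (a + b) mod p, since every cosine reaches 1 only when a + b - c vanishes modulo p. The frequencies below are arbitrary nonzero choices, and this is a hand-built illustration of the Fourier structure, not the paper's analytical MLP weights.

```python
import numpy as np

def modular_add_predict(a, b, p, freqs):
    """Predict (a + b) mod p by argmax over periodic candidate scores.

    score(c) = sum_k cos(2*pi*k*(a + b - c)/p) peaks uniquely at
    c = (a + b) mod p when p is prime and all frequencies are nonzero,
    mirroring the Fourier features found in grokked networks.
    """
    c = np.arange(p)
    phases = 2 * np.pi * np.outer(freqs, a + b - c) / p
    return int(np.argmax(np.cos(phases).sum(axis=0)))

p = 97                            # a prime modulus, as in grokking setups
freqs = np.array([1, 5, 13])      # any nonzero frequencies mod p work
pred = modular_add_predict(17, 88, p, freqs)
```

The prediction is exact for every input pair, which is the sense in which a network whose weights encode such cosines generalizes perfectly.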

[LG-2] Solving Poisson Equations using Neural Walk-on-Spheres

链接: https://arxiv.org/abs/2406.03494
作者: Hong Chul Nam,Julius Berner,Anima Anandkumar
关键词: neural PDE solver, PDE solver, neural PDE, Poisson equations, high-dimensional Poisson equations
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA); Machine Learning (stat.ML)
*备注: Accepted at ICML 2024

点击查看摘要

Abstract:We propose Neural Walk-on-Spheres (NWoS), a novel neural PDE solver for the efficient solution of high-dimensional Poisson equations. Leveraging stochastic representations and Walk-on-Spheres methods, we develop novel losses for neural networks based on the recursive solution of Poisson equations on spheres inside the domain. The resulting method is highly parallelizable and does not require spatial gradients for the loss. We provide a comprehensive comparison against competing methods based on PINNs, the Deep Ritz method, and (backward) stochastic differential equations. In several challenging, high-dimensional numerical examples, we demonstrate the superiority of NWoS in accuracy, speed, and computational costs. Compared to commonly used PINNs, our approach can reduce memory usage and errors by orders of magnitude. Furthermore, we apply NWoS to problems in PDE-constrained optimization and molecular dynamics to show its efficiency in practical applications.
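The classical Walk-on-Spheres estimator that NWoS turns into a neural training loss is itself a short Monte Carlo routine: repeatedly jump to a uniform point on the largest sphere around the current point that stays inside the domain, stop within epsilon of the boundary, and pay off the boundary value there. The sketch below verifies it on the Laplace equation on the unit disk with boundary data g(x, y) = x, whose harmonic extension is u(x, y) = x; it shows the stochastic representation only, not the paper's neural losses.

```python
import numpy as np

def walk_on_spheres(x0, boundary_g, dist_to_boundary, rng, eps=1e-3):
    """One Walk-on-Spheres sample of u(x0) for the Laplace equation.

    Brownian motion exits any ball uniformly over its surface, so each
    jump goes to a uniform point on the largest inscribed sphere.
    """
    x = np.array(x0, dtype=float)
    while True:
        r = dist_to_boundary(x)
        if r < eps:
            return boundary_g(x)          # pay off the boundary value
        d = rng.standard_normal(x.shape)
        x = x + r * d / np.linalg.norm(d)  # uniform point on the sphere

# Unit disk, boundary data g(x, y) = x; harmonic extension u(x, y) = x.
rng = np.random.default_rng(0)
dist = lambda x: 1.0 - np.linalg.norm(x)   # signed distance to the circle
g = lambda x: x[0]
est = np.mean([walk_on_spheres([0.3, 0.0], g, dist, rng)
               for _ in range(5000)])      # should be close to 0.3
```

NWoS replaces this pointwise Monte Carlo average with a loss on a neural surrogate, so the solution is amortized over the whole domain instead of re-estimated per query point.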

[LG-3] Highway Value Iteration Networks

链接: https://arxiv.org/abs/2406.03485
作者: Yuhui Wang,Weida Li,Francesco Faccio,Qingyuan Wu,Jürgen Schmidhuber
关键词: employing a differentiable, iteration networks, planning, planning module, VINs
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: ICML 2024

点击查看摘要

Abstract:Value iteration networks (VINs) enable end-to-end learning for planning tasks by employing a differentiable “planning module” that approximates the value iteration algorithm. However, long-term planning remains a challenge because training very deep VINs is difficult. To address this problem, we embed highway value iteration – a recent algorithm designed to facilitate long-term credit assignment – into the structure of VINs. This improvement augments the “planning module” of the VIN with three additional components: 1) an “aggregate gate,” which constructs skip connections to improve information flow across many layers; 2) an “exploration module,” crafted to increase the diversity of information and gradient flow in spatial dimensions; 3) a “filter gate” designed to ensure safe exploration. The resulting novel highway VIN can be trained effectively with hundreds of layers using standard backpropagation. In long-term planning tasks requiring hundreds of planning steps, deep highway VINs outperform both traditional VINs and several advanced, very deep NNs.
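For context, the "planning module" that VINs (and highway VINs) approximate is plain value iteration. A minimal sketch on a hypothetical 5-state chain (not an example from the paper):

```python
import numpy as np

# A 5-state deterministic chain: state 4 is an absorbing goal (value 0);
# stepping into the goal yields reward 1, every other transition yields 0.
n_states, goal, gamma = 5, 4, 0.9

def step(s, a):  # a = -1 (left) or +1 (right), deterministic moves
    s_next = min(max(s + a, 0), n_states - 1)
    reward = 1.0 if (s != goal and s_next == goal) else 0.0
    return s_next, reward

V = np.zeros(n_states)
for _ in range(100):  # Bellman backup: V(s) <- max_a [ r(s, a) + gamma * V(s') ]
    V_new = np.zeros(n_states)
    for s in range(n_states):
        if s == goal:
            continue  # absorbing goal keeps value 0
        candidates = []
        for a in (-1, +1):
            s_next, r = step(s, a)
            candidates.append(r + gamma * V[s_next])
        V_new[s] = max(candidates)
    V = V_new
# V converges to approximately [0.729, 0.81, 0.9, 1.0, 0.0]
```

A VIN unrolls a fixed number of such backups as differentiable network layers; the highway VIN's gates address the difficulty of backpropagating through hundreds of them.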

[LG-4] QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead

链接: https://arxiv.org/abs/2406.03482
作者: Amir Zandieh,Majid Daliri,Insu Han
关键词: Serving LLMs requires, requires substantial memory, LLMs requires substantial, Serving LLMs, requirements of Key-Value
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Performance (cs.PF)
*备注: 13 pages

点击查看摘要

Abstract:Serving LLMs requires substantial memory due to the storage requirements of Key-Value (KV) embeddings in the KV cache, which grows with sequence length. An effective approach to compress KV cache is quantization. However, traditional quantization methods face significant memory overhead due to the need to store quantization constants (at least a zero point and a scale) in full precision per data block. Depending on the block size, this overhead can add 1 or 2 bits per quantized number. We introduce QJL, a new quantization approach that consists of a Johnson-Lindenstrauss (JL) transform followed by sign-bit quantization. In contrast to existing methods, QJL eliminates memory overheads by removing the need for storing quantization constants. We propose an asymmetric estimator for the inner product of two vectors and demonstrate that applying QJL to one vector and a standard JL transform without quantization to the other provides an unbiased estimator with minimal distortion. We have developed an efficient implementation of the QJL sketch and its corresponding inner product estimator, incorporating a lightweight CUDA kernel for optimized computation. When applied across various LLMs and NLP tasks to quantize the KV cache to only 3 bits, QJL demonstrates a more than fivefold reduction in KV cache memory usage without compromising accuracy, all while achieving faster runtime. Codes are available at this https URL.
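The asymmetric estimator can be sketched in a few lines: project both vectors with a Gaussian JL transform, keep only the sign bit of one projection, and rescale. For Gaussian s, the identity E[sign(⟨s, x⟩)·⟨s, y⟩] = √(2/π)·⟨x, y⟩/‖x‖ makes the estimator unbiased. Below is a simplified NumPy sketch of that idea, not the paper's CUDA implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 8, 20000                      # data dimension, sketch dimension

x = rng.normal(size=d); x /= np.linalg.norm(x)
y = rng.normal(size=d); y /= np.linalg.norm(y)

S = rng.normal(size=(m, d))          # Gaussian JL projection

# Asymmetric estimator: only the sign bit of S @ x is kept (1 bit per
# sketch entry, no quantization constants to store), while y keeps its
# full-precision projection S @ y.
est = np.sqrt(np.pi / 2) * np.linalg.norm(x) / m * np.sum(np.sign(S @ x) * (S @ y))
true = float(x @ y)
```

In the KV-cache setting, x plays the role of a cached key (stored at 1 bit per sketch coordinate) and y the role of an incoming query, so attention scores can be approximated without dequantization constants.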

[LG-5] Convolutional Neural Networks and Vision Transformers for Fashion MNIST Classification: A Literature Review

链接: https://arxiv.org/abs/2406.03478
作者: Sonia Bbouzidi,Ghazala Hcini,Imen Jdey,Fadoua Drira
关键词: Convolutional Neural Networks, Neural Networks, Convolutional Neural, Vision Transformers, Fashion MNIST dataset
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Our review explores the comparative analysis between Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) in the domain of image classification, with a particular focus on clothing classification within the e-commerce sector. Utilizing the Fashion MNIST dataset, we delve into the unique attributes of CNNs and ViTs. While CNNs have long been the cornerstone of image classification, ViTs introduce an innovative self-attention mechanism enabling nuanced weighting of different input data components. Historically, transformers have primarily been associated with Natural Language Processing (NLP) tasks. Through a comprehensive examination of existing literature, our aim is to unveil the distinctions between ViTs and CNNs in the context of image classification. Our analysis meticulously scrutinizes state-of-the-art methodologies employing both architectures, striving to identify the factors influencing their performance. These factors encompass dataset characteristics, image dimensions, the number of target classes, hardware infrastructure, and the specific architectures along with their respective top results. Our key goal is to determine the most appropriate architecture between ViT and CNN for classifying images in the Fashion MNIST dataset within the e-commerce industry, while taking into account specific conditions and needs. We highlight the importance of combining these two architectures with different forms to enhance overall performance. By uniting these architectures, we can take advantage of their unique strengths, which may lead to more precise and reliable models for e-commerce applications. CNNs are skilled at recognizing local patterns, while ViTs are effective at grasping overall context, making their combination a promising strategy for boosting image classification performance.

[LG-6] Does your data spark joy? Performance gains from domain upsampling at the end of training

链接: https://arxiv.org/abs/2406.03476
作者: Cody Blakeney,Mansheej Paul,Brett W. Larsen,Sean Owen,Jonathan Frankle
关键词: large FLOP scales, amounts of CommonCrawl, large language models, domain-specific datasets, large amounts
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
*备注: The first three authors contributed equally

点击查看摘要

Abstract:Pretraining datasets for large language models (LLMs) have grown to trillions of tokens composed of large amounts of CommonCrawl (CC) web scrape along with smaller, domain-specific datasets. It is expensive to understand the impact of these domain-specific datasets on model capabilities, as training at large FLOP scales is required to reveal significant changes on difficult and emergent benchmarks. Given the increasing cost of experimenting with pretraining data, how does one determine the optimal balance between the diversity of general web scrapes and the information density of domain-specific data? In this work, we show how to leverage the smaller domain-specific datasets by upsampling them relative to CC at the end of training to drive performance improvements on difficult benchmarks. This simple technique allows us to improve up to 6.90 pp on MMLU, 8.26 pp on GSM8K, and 6.17 pp on HumanEval relative to the base data mix for a 7B model trained for 1 trillion (T) tokens, thus rivaling Llama-2 (7B), a model trained for twice as long. We experiment with ablating the duration of domain upsampling from 5% to 30% of training and find that 10% to 20% is optimal for navigating the tradeoff between general language modeling capabilities and targeted benchmarks. We also use domain upsampling to characterize at scale the utility of individual datasets for improving various benchmarks by removing them during this final phase of training. This tool opens up the ability to experiment with the impact of different pretraining datasets at scale, but at an order of magnitude lower cost compared to full pretraining runs.
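The recipe is essentially a change of sampling weights near the end of training. A hedged sketch of such a schedule (the domain names and weights are illustrative placeholders, not the paper's data mix):

```python
def data_mix(step, total_steps, base_mix, upsampled_mix, upsample_frac=0.15):
    """Sampling weights over data domains at a given training step: use the
    base mixture for the first (1 - upsample_frac) of training, then switch
    to the domain-upsampled mixture for the final stretch."""
    if step < (1.0 - upsample_frac) * total_steps:
        return base_mix
    return upsampled_mix

# Domain names and weights below are illustrative, not the paper's data mix.
base    = {"web": 0.85, "math": 0.05, "code": 0.10}
boosted = {"web": 0.55, "math": 0.20, "code": 0.25}

mix_early = data_mix(100, 1000, base, boosted)   # base mixture
mix_late  = data_mix(900, 1000, base, boosted)   # upsampled mixture
```

The paper's ablation over `upsample_frac` (5% to 30%) is what locates the 10%-20% sweet spot.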

[LG-7] Solving Differential Equations using Physics-Informed Deep Equilibrium Models

链接: https://arxiv.org/abs/2406.03472
作者: Bruno Machado Pacheco,Eduardo Camponogara
关键词: Deep Equilibrium Models, ordinary differential equations, Equilibrium Models, paper introduces Physics-Informed, Deep Equilibrium
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: Accepted at CASE 2024

点击查看摘要

Abstract:This paper introduces Physics-Informed Deep Equilibrium Models (PIDEQs) for solving initial value problems (IVPs) of ordinary differential equations (ODEs). Leveraging recent advancements in deep equilibrium models (DEQs) and physics-informed neural networks (PINNs), PIDEQs combine the implicit output representation of DEQs with physics-informed training techniques. We validate PIDEQs using the Van der Pol oscillator as a benchmark problem, demonstrating their efficiency and effectiveness in solving IVPs. Our analysis includes key hyperparameter considerations for optimizing PIDEQ performance. By bridging deep learning and physics-based modeling, this work advances computational techniques for solving IVPs, with implications for scientific computing and engineering applications.
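The DEQ half of PIDEQ evaluates its output implicitly, as a fixed point z* = f(z*) of a single layer. A minimal sketch of that fixed-point evaluation with a contraction mapping (the physics-informed training loss is omitted, and the weights here are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 4
W = rng.normal(size=(dim, dim))
W *= 0.5 / np.linalg.norm(W, 2)   # rescale to spectral norm 0.5: f is a contraction
b = rng.normal(size=dim)          # plays the role of the network input

def f(z):
    """One implicit DEQ layer; the model's output is the fixed point z* = f(z*)."""
    return np.tanh(W @ z + b)

z = np.zeros(dim)
for _ in range(200):              # plain Picard fixed-point iteration
    z = f(z)

residual = np.linalg.norm(z - f(z))   # ~0 at the equilibrium
```

Because tanh is 1-Lipschitz and W has spectral norm 0.5, the iteration converges geometrically; real DEQ implementations typically use faster root-finders (e.g., Anderson acceleration) and implicit differentiation for training.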

[LG-8] Node-wise Filtering in Graph Neural Networks: A Mixture of Experts Approach

链接: https://arxiv.org/abs/2406.03464
作者: Haoyu Han,Juanhui Li,Wei Huang,Xianfeng Tang,Hanqing Lu,Chen Luo,Hui Liu,Jiliang Tang
关键词: Graph Neural Networks, Neural Networks, diverse graph structural, Graph Neural, node classification tasks
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph Neural Networks (GNNs) have proven to be highly effective for node classification tasks across diverse graph structural patterns. Traditionally, GNNs employ a uniform global filter, typically a low-pass filter for homophilic graphs and a high-pass filter for heterophilic graphs. However, real-world graphs often exhibit a complex mix of homophilic and heterophilic patterns, rendering a single global filter approach suboptimal. In this work, we theoretically demonstrate that a global filter optimized for one pattern can adversely affect performance on nodes with differing patterns. To address this, we introduce a novel GNN framework Node-MoE that utilizes a mixture of experts to adaptively select the appropriate filters for different nodes. Extensive experiments demonstrate the effectiveness of Node-MoE on both homophilic and heterophilic graphs.
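The filters being mixed can be written down concretely. Below is a hedged NumPy sketch of node-wise mixing of a low-pass and a high-pass graph filter on a 4-node path graph; in Node-MoE the per-node gate would come from a learned expert-routing network, while here it is hand-set for illustration:

```python
import numpy as np

# A 4-node path graph 0-1-2-3 with one scalar feature per node.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
A_hat = A + np.eye(4)                            # add self-loops
D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt         # symmetrically normalized adjacency

X = np.array([[1.0], [0.0], [0.0], [1.0]])

low  = A_norm @ X           # low-pass filter: smooths each node toward its neighbors
high = X - A_norm @ X       # high-pass filter: emphasizes differences from neighbors

# Per-node gate in [0, 1]; Node-MoE would produce this with learned routing.
gate = np.array([[1.0], [0.2], [0.8], [0.0]])
out = gate * low + (1.0 - gate) * high
```

A node with gate 1 receives the purely homophilic (low-pass) signal and a node with gate 0 the purely heterophilic (high-pass) one, which is exactly the per-node filter selection a single global filter cannot express.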

[LG-9] Distributional Adversarial Loss

链接: https://arxiv.org/abs/2406.03458
作者: Saba Ahmadi,Siddharth Bhandari,Avrim Blum,Chen Dan,Prabhav Jain
关键词: adversarial loss, adversarial, adversarial attacks, major challenge, challenge in defending
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:A major challenge in defending against adversarial attacks is the enormous space of possible attacks that even a simple adversary might perform. To address this, prior work has proposed a variety of defenses that effectively reduce the size of this space. These include randomized smoothing methods that add noise to the input to take away some of the adversary's impact. Another approach is input discretization, which limits the adversary's possible number of actions. Motivated by these two approaches, we introduce a new notion of adversarial loss, which we call distributional adversarial loss, to unify these two forms of effectively weakening an adversary. In this notion, we assume for each original example the allowed adversarial perturbation set is a family of distributions (e.g., induced by a smoothing procedure), and the adversarial loss over each example is the maximum loss over all the associated distributions. The goal is to minimize the overall adversarial loss. We show generalization guarantees for our notion of adversarial loss in terms of the VC-dimension of the hypothesis class and the size of the set of allowed adversarial distributions associated with each input. We also investigate the role of randomness in achieving robustness against adversarial attacks in the methods described above. We show a general derandomization technique that preserves the extent of a randomized classifier's robustness against adversarial attacks. We corroborate the procedure experimentally by derandomizing the Random Projection Filters framework of Dong et al. (2023). Our procedure also improves the robustness of the model against various adversarial attacks.
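The definition reduces to a max of expectations. A toy sketch, with each allowed distribution represented by a finite set of sampled 0/1 losses (the values are illustrative, not from the paper):

```python
import numpy as np

def distributional_adv_loss(losses_per_distribution):
    """Distributional adversarial loss of one example: the maximum, over the
    allowed family of perturbation distributions, of the expected loss under
    that distribution (expectations approximated by sampled perturbations)."""
    expected = [np.mean(samples) for samples in losses_per_distribution]
    return max(expected)

# One example whose allowed set is a family of two smoothing distributions,
# each represented here by four sampled 0/1 losses (illustrative values):
family = [
    [0, 0, 1, 0],   # expected loss 0.25 under the first distribution
    [1, 0, 1, 0],   # expected loss 0.50 under the second distribution
]
loss = distributional_adv_loss(family)   # max(0.25, 0.5) = 0.5
```

Taking the max over expected (rather than pointwise worst-case) losses is what lets smoothing and discretization fit into the same framework.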

[LG-10] FILS: Self-Supervised Video Feature Prediction In Semantic Language Space

链接: https://arxiv.org/abs/2406.03447
作者: Mona Ahmadian,Frank Guerin,Andrew Gilbert
关键词: Language Space, approach for learning, learning semantic video, semantic Language Space, Space
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper demonstrates a self-supervised approach for learning semantic video representations. Recent vision studies show that a masking strategy for vision and natural language supervision has contributed to developing transferable visual pretraining. Our goal is to achieve a more semantic video representation by leveraging the text related to the video content during the pretraining in a fully self-supervised manner. To this end, we present FILS, a novel self-supervised video Feature prediction In semantic Language Space (FILS). The vision model can capture valuable structured information by correctly predicting masked feature semantics in language space. It is learned using a patch-wise video-text contrastive strategy, in which the text representations act as prototypes for transforming vision features into a language space, which are then used as targets for semantically meaningful feature prediction using our masked encoder-decoder structure. FILS demonstrates remarkable transferability on downstream action recognition tasks, achieving state-of-the-art on challenging egocentric datasets, like Epic-Kitchens, Something-SomethingV2, Charades-Ego, and EGTEA, using ViT-Base. Our efficient method requires less computation and smaller batches compared to previous works.

[LG-11] Pre-trained Large Language Models Use Fourier Features to Compute Addition

链接: https://arxiv.org/abs/2406.03445
作者: Tianyi Zhou,Deqing Fu,Vatsal Sharan,Robin Jia
关键词: exhibit impressive mathematical, mathematical reasoning capabilities, compute basic arithmetic, impressive mathematical reasoning, Pre-trained large language
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Pre-trained large language models (LLMs) exhibit impressive mathematical reasoning capabilities, yet how they compute basic arithmetic, such as addition, remains unclear. This paper shows that pre-trained LLMs add numbers using Fourier features – dimensions in the hidden state that represent numbers via a set of features sparse in the frequency domain. Within the model, MLP and attention layers use Fourier features in complementary ways: MLP layers primarily approximate the magnitude of the answer using low-frequency features, while attention layers primarily perform modular addition (e.g., computing whether the answer is even or odd) using high-frequency features. Pre-training is crucial for this mechanism: models trained from scratch to add numbers only exploit low-frequency features, leading to lower accuracy. Introducing pre-trained token embeddings to a randomly initialized model rescues its performance. Overall, our analysis demonstrates that appropriate pre-trained representations (e.g., Fourier features) can unlock the ability of Transformers to learn precise mechanisms for algorithmic tasks.
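The mechanism the paper describes can be illustrated with a hand-built toy example: encode each residue as an angle on the unit circle, so modular addition becomes phase rotation. The paper finds analogous sparse-frequency features inside pre-trained LLMs; the code below is only an illustration of the underlying arithmetic, not the model's actual circuit:

```python
import numpy as np

p = 113  # modulus

def encode(n):
    """Represent n by one Fourier feature: a point on the unit circle."""
    angle = 2.0 * np.pi * n / p
    return np.cos(angle) + 1j * np.sin(angle)

def add_mod(a, b):
    """Modular addition via phase rotation: multiplying the two unit-circle
    embeddings adds their angles; decode by reading the angle back and
    rounding to the nearest residue."""
    angle = np.angle(encode(a) * encode(b)) % (2.0 * np.pi)
    return round(angle * p / (2.0 * np.pi)) % p

result = add_mod(50, 90)   # (50 + 90) % 113 = 27
```

This is why high-frequency features are natural for the modular (even/odd-style) component of addition, while magnitude information lives in low frequencies.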

[LG-12] Cycles of Thought: Measuring LLM Confidence through Stable Explanations

链接: https://arxiv.org/abs/2406.03441
作者: Evan Becker,Stefano Soatto
关键词: high-risk machine learning, machine learning applications, high-risk machine, machine learning, learning applications
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In many high-risk machine learning applications it is essential for a model to indicate when it is uncertain about a prediction. While large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, their overconfidence in incorrect responses is still a well-documented failure mode. Traditional methods for ML uncertainty quantification can be difficult to directly adapt to LLMs due to the computational cost of implementation and closed-source nature of many models. A variety of black-box methods have recently been proposed, but these often rely on heuristics such as self-verbalized confidence. We instead propose a framework for measuring an LLM’s uncertainty with respect to the distribution of generated explanations for an answer. While utilizing explanations is not a new idea in and of itself, by interpreting each possible model+explanation pair as a test-time classifier we can calculate a posterior answer distribution over the most likely of these classifiers. We demonstrate how a specific instance of this framework using explanation entailment as our classifier likelihood improves confidence score metrics (in particular AURC and AUROC) over baselines across five different datasets. We believe these results indicate that our framework is both a well-principled and effective way of quantifying uncertainty in LLMs.
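A toy sketch of the aggregation step: weight each sampled answer by the likelihood of its explanation, normalize into a posterior over answers, and read off the top probability as the confidence. The likelihood values below are placeholders; the paper scores them with explanation entailment:

```python
def answer_confidence(samples):
    """Aggregate sampled (answer, likelihood) pairs into a posterior over
    answers; the confidence score is the top posterior probability."""
    posterior = {}
    for answer, likelihood in samples:
        posterior[answer] = posterior.get(answer, 0.0) + likelihood
    total = sum(posterior.values())
    posterior = {a: w / total for a, w in posterior.items()}
    best = max(posterior, key=posterior.get)
    return posterior, posterior[best]

# Three sampled explanations: two support "Paris", one supports "Lyon".
# The likelihood numbers are illustrative placeholders.
samples = [("Paris", 0.9), ("Paris", 0.7), ("Lyon", 0.4)]
posterior, confidence = answer_confidence(samples)   # confidence = 1.6 / 2.0 = 0.8
```

An answer backed by many mutually consistent, high-likelihood explanations thus receives high confidence, which is the stability intuition in the title.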

[LG-13] Transfer Learning for Latent Variable Network Models

链接: https://arxiv.org/abs/2406.03437
作者: Akhil Jalan,Arya Mazumdar,Soumendu Sundar Mukherjee,Purnamrita Sarkar
关键词: study transfer learning, latent variables, latent variable network, variable network models, edge data
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study transfer learning for estimation in latent variable network models. In our setting, the conditional edge probability matrices given the latent variables are represented by P for the source and Q for the target. We wish to estimate Q given two kinds of data: (1) edge data from a subgraph induced by an o(1) fraction of the nodes of Q, and (2) edge data from all of P. If the source P has no relation to the target Q, the estimation error must be Ω(1). However, we show that if the latent variables are shared, then vanishing error is possible. We give an efficient algorithm that utilizes the ordering of a suitably defined graph distance. Our algorithm achieves o(1) error and does not assume a parametric form on the source or target networks. Next, for the specific case of Stochastic Block Models we prove a minimax lower bound and show that a simple algorithm achieves this rate. Finally, we empirically demonstrate our algorithm's use on real-world and simulated graph transfer problems.

[LG-14] Unified PAC-Bayesian Study of Pessimism for Offline Policy Learning with Regularized Importance Sampling

链接: https://arxiv.org/abs/2406.03434
作者: Imad Aouali,Victor-Emmanuel Brunel,David Rohde,Anna Korba
关键词: Off-policy learning, risk estimator based, collect data, correct bias, OPL
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: Accepted at UAI 2024

点击查看摘要

Abstract:Off-policy learning (OPL) often involves minimizing a risk estimator based on importance weighting to correct bias from the logging policy used to collect data. However, this method can produce an estimator with a high variance. A common solution is to regularize the importance weights and learn the policy by minimizing an estimator with penalties derived from generalization bounds specific to the estimator. This approach, known as pessimism, has gained recent attention but lacks a unified framework for analysis. To address this gap, we introduce a comprehensive PAC-Bayesian framework to examine pessimism with regularized importance weighting. We derive a tractable PAC-Bayesian generalization bound that universally applies to common importance weight regularizations, enabling their comparison within a single framework. Our empirical results challenge common understanding, demonstrating the effectiveness of standard IW regularization techniques.
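The simplest importance-weight regularization covered by such bounds is clipping. A hedged sketch of the clipped IPS (inverse propensity scoring) estimator, with illustrative numbers that are not from the paper:

```python
import numpy as np

def clipped_ips(rewards, target_probs, logging_probs, tau):
    """Clipped importance-weighted (IPS) estimate of a target policy's value.
    Sample i was logged under probability logging_probs[i]; the target policy
    picks the same action with probability target_probs[i]. Clipping the
    weights at tau trades a small bias for a large variance reduction, one of
    the regularizations a PAC-Bayesian pessimism analysis can cover."""
    w = np.minimum(np.asarray(target_probs) / np.asarray(logging_probs), tau)
    return float(np.mean(w * np.asarray(rewards)))

# Illustrative logged data (not from the paper):
rewards  = [1.0, 0.0, 1.0, 1.0]
p_target = [0.9, 0.1, 0.8, 0.6]
p_log    = [0.5, 0.5, 0.5, 0.2]
value = clipped_ips(rewards, p_target, p_log, tau=2.0)
# raw weights are [1.8, 0.2, 1.6, 3.0]; the last is clipped to 2.0,
# giving (1.8 + 0.0 + 1.6 + 2.0) / 4 = 1.35
```

Other regularizations analyzed under the same framework (e.g., shrinkage or exponential smoothing of the weights) slot into the same place as `np.minimum` here.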

[LG-15] HelloFresh: LLM Evaluations on Streams of Real-World Human Editorial Actions across X Community Notes and Wikipedia edits

链接: https://arxiv.org/abs/2406.03428
作者: Tim Franzmeyer,Aleksandar Shtedritski,Samuel Albanie,Philip Torr,João F. Henriques,Jakob N. Foerster
关键词: machine learning, essential for driving, driving progress, progress in machine, Data
类目: Machine Learning (cs.LG)
*备注: ACL 2024 Findings

点击查看摘要

Abstract:Benchmarks have been essential for driving progress in machine learning. A better understanding of LLM capabilities on real world tasks is vital for safe development. Designing adequate LLM benchmarks is challenging: Data from real-world tasks is hard to collect, public availability of static evaluation data results in test data contamination and benchmark overfitting, and periodically generating new evaluation data is tedious and may result in temporally inconsistent results. We introduce HelloFresh, based on continuous streams of real-world data generated by intrinsically motivated human labelers. It covers recent events from X (formerly Twitter) community notes and edits of Wikipedia pages, mitigating the risk of test data contamination and benchmark overfitting. Any X user can propose an X note to add additional context to a misleading post (formerly tweet); if the community classifies it as helpful, it is shown with the post. Similarly, Wikipedia relies on community-based consensus, allowing users to edit articles or revert edits made by other users. Verifying whether an X note is helpful or whether a Wikipedia edit should be accepted are hard tasks that require grounding by querying the web. We backtest state-of-the-art LLMs supplemented with simple web search access and find that HelloFresh yields a temporally consistent ranking. To enable continuous evaluation on HelloFresh, we host a public leaderboard and periodically updated evaluation data at this https URL.

[LG-16] Robust Knowledge Distillation Based on Feature Variance Against Backdoored Teacher Model

链接: https://arxiv.org/abs/2406.03409
作者: Jinyin Chen,Xiaoming Zhao,Haibin Zheng,Xiao Li,Sheng Xiang,Haifeng Guo
关键词: deep neural networks, resource limited equipment, captured special attention, computing resource limited, student model
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Benefiting from well-trained deep neural networks (DNNs), model compression have captured special attention for computing resource limited equipment, especially edge devices. Knowledge distillation (KD) is one of the widely used compression techniques for edge deployment, by obtaining a lightweight student model from a well-trained teacher model released on public platforms. However, it has been empirically noticed that the backdoor in the teacher model will be transferred to the student model during the process of KD. Although numerous KD methods have been proposed, most of them focus on the distillation of a high-performing student model without robustness consideration. Besides, some research adopts KD techniques as effective backdoor mitigation tools, but they fail to perform model compression at the same time. Consequently, it is still an open problem to well achieve two objectives of robust KD, i.e., student model’s performance and backdoor mitigation. To address these issues, we propose RobustKD, a robust knowledge distillation that compresses the model while mitigating backdoor based on feature variance. Specifically, RobustKD distinguishes the previous works in three key aspects: (1) effectiveness: by distilling the feature map of the teacher model after detoxification, the main task performance of the student model is comparable to that of the teacher model; (2) robustness: by reducing the characteristic variance between the teacher model and the student model, it mitigates the backdoor of the student model under backdoored teacher model scenario; (3) generic: RobustKD still has good performance in the face of multiple data models (e.g., WRN 28-4, Pyramid-200) and diverse DNNs (e.g., ResNet50, MobileNet).
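RobustKD's feature-variance objective sits on top of standard knowledge distillation. For reference, a minimal sketch of the usual temperature-softened soft-label KD loss (Hinton-style), not the paper's detoxified feature-map objective:

```python
import numpy as np

def softmax(z, T):
    z = np.asarray(z, dtype=float) / T
    e = np.exp(z - z.max())          # subtract max for numerical stability
    return e / e.sum()

def kd_loss(student_logits, teacher_logits, T=4.0):
    """Standard soft-label distillation loss: KL divergence between the
    temperature-softened teacher and student output distributions."""
    p = softmax(teacher_logits, T)   # softened teacher targets
    q = softmax(student_logits, T)   # softened student predictions
    return float(np.sum(p * (np.log(p) - np.log(q))))

teacher = [3.0, 1.0, 0.2]
loss_matched = kd_loss(teacher, teacher)   # 0.0 when the student matches the teacher
```

RobustKD replaces the raw teacher targets with detoxified feature maps and adds a feature-variance term, precisely because mimicking a backdoored teacher too faithfully transfers the backdoor.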

[LG-17] Physics and geometry informed neural operator network with application to acoustic scattering

链接: https://arxiv.org/abs/2406.03407
作者: Siddharth Nair,Timothy F. Walsh,Greg Pickrell,Fabio Semperlotti
关键词: geometry informed neural, informed neural operator, neural operator network, scattered pressure field, geometry informed deep
类目: Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS); Computational Physics (physics.comp-ph)
*备注: 20 pages of main text, 9 figures

点击查看摘要

Abstract:In this paper, we introduce a physics and geometry informed neural operator network with application to the forward simulation of acoustic scattering. The development of geometry informed deep learning models capable of learning a solution operator for different computational domains is a problem of general importance for a variety of engineering applications. To this end, we propose a physics-informed deep operator network (DeepONet) capable of predicting the scattered pressure field for arbitrarily shaped scatterers using a geometric parameterization approach based on non-uniform rational B-splines (NURBS). This approach also results in parsimonious representations of non-trivial scatterer geometries. In contrast to existing physics-based approaches that require model re-evaluation when changing the computational domains, our trained model is capable of learning solution operator that can approximate physically-consistent scattered pressure field in just a few seconds for arbitrary rigid scatterer shapes; it follows that the computational time for forward simulations can improve (i.e. be reduced) by orders of magnitude in comparison to the traditional forward solvers. In addition, this approach can evaluate the scattered pressure field without the need for labeled training data. After presenting the theoretical approach, a comprehensive numerical study is also provided to illustrate the remarkable ability of this approach to simulate the acoustic pressure fields resulting from arbitrary combinations of arbitrary scatterer geometries. These results highlight the unique generalization capability of the proposed operator learning approach.

[LG-18] LncRNA-disease association prediction method based on heterogeneous information completion and convolutional neural network

链接: https://arxiv.org/abs/2406.03406
作者: Wen-Yu Xi,Juan Wang,Yu-Lin Zhang,Jin-Xing Liu,Yin-Lian Gao
关键词: complex human diseases, emerging research shows, emerging research, crucial research, series of complex
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:The emerging research shows that lncRNA has crucial research value in a series of complex human diseases. Therefore, the accurate identification of lncRNA-disease associations (LDAs) is very important for the early warning and treatment of diseases. However, most of the existing methods have limitations in identifying nonlinear LDAs, and it remains a huge challenge to predict new LDAs. In this paper, a deep learning model based on a heterogeneous network and a convolutional neural network (CNN) is proposed for lncRNA-disease association prediction, named HCNNLDA. First, a heterogeneous network containing lncRNA, disease, and miRNA nodes is constructed. The embedding matrix of a lncRNA-disease node pair is constructed according to various biological premises about lncRNAs, diseases, and miRNAs. Then, the low-dimensional feature representation is fully learned by the convolutional neural network. Finally, an XGBoost classifier model is trained to predict potential LDAs. HCNNLDA obtains a high AUC value of 0.9752 and an AUPR of 0.9740 under 5-fold cross-validation. The experimental results show that the proposed model performs better than several recent prediction models. Meanwhile, the effectiveness of HCNNLDA in identifying novel LDAs is further demonstrated by case studies of three diseases. In summary, HCNNLDA is a feasible computational model for predicting LDAs.

[LG-19] Amalgam: A Framework for Obfuscated Neural Network Training on the Cloud

链接: https://arxiv.org/abs/2406.03405
作者: Sifat Ut Taki,Spyridon Mastorakis
关键词: proprietary Neural Network, Neural Network, proprietary Neural, cloud service provider, service provider
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Training a proprietary Neural Network (NN) model with a proprietary dataset on the cloud comes at the risk of exposing the model architecture and the dataset to the cloud service provider. To tackle this problem, in this paper, we present an NN obfuscation framework, called Amalgam, to train NN models in a privacy-preserving manner in existing cloud-based environments. Amalgam achieves that by augmenting NN models and the datasets to be used for training with well-calibrated noise to “hide” both the original model architectures and training datasets from the cloud. After training, Amalgam extracts the original models from the augmented models and returns them to users. Our evaluation results with different computer vision and natural language processing models and datasets demonstrate that Amalgam: (i) introduces modest overheads into the training process without impacting its correctness, and (ii) does not affect the model’s accuracy.

[LG-20] ST-DPGAN: A Privacy-preserving Framework for Spatiotemporal Data Generation

链接: https://arxiv.org/abs/2406.03404
作者: Wei Shao,Rongyi Zhu,Cai Yang,Chandra Thapa,Muhammad Ejaz Ahmed,Seyit Camtepe,Rui Zhang,DuYong Kim,Hamid Menouar,Flora D. Salim
关键词: edge devices, financial transactions, wide range, range of edge, personal communication
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Spatiotemporal data is prevalent in a wide range of edge devices, such as those used in personal communication and financial transactions. Recent advancements have sparked a growing interest in integrating spatiotemporal analysis with large-scale language models. However, spatiotemporal data often contains sensitive information, making it unsuitable for open third-party access. To address this challenge, we propose a Graph-GAN-based model for generating privacy-protected spatiotemporal data. Our approach incorporates spatial and temporal attention blocks in the discriminator and a spatiotemporal deconvolution structure in the generator. These enhancements enable efficient training under Gaussian noise to achieve differential privacy. Extensive experiments conducted on three real-world spatiotemporal datasets validate the efficacy of our model. Our method provides a privacy guarantee while maintaining the data utility. The prediction model trained on our generated data maintains a competitive performance compared to the model trained on the original data.

[LG-21] Structure-based Drug Design Benchmark: Do 3D Methods Really Dominate?

链接: https://arxiv.org/abs/2406.03403
作者: Kangyu Zheng,Yingzhou Lu,Zaixi Zhang,Zhongwei Wan,Yao Ma,Marinka Zitnik,Tianfan Fu
关键词: deep generative models, deep generative, reinforcement learning, field of structure-based, main types
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:Currently, the field of structure-based drug design is dominated by three main types of algorithms: search-based algorithms, deep generative models, and reinforcement learning. While existing works have typically focused on comparing models within a single algorithmic category, cross-algorithm comparisons remain scarce. In this paper, to fill the gap, we establish a benchmark to evaluate the performance of sixteen models across these different algorithmic foundations by assessing the pharmaceutical properties of the generated molecules and their docking affinities with specified target proteins. We highlight the unique advantages of each algorithmic approach and offer recommendations for the design of future SBDD models. We emphasize that 1D/2D ligand-centric drug design methods can be used in SBDD by treating the docking function as a black-box oracle, which is typically neglected. The empirical results show that 1D/2D methods achieve competitive performance compared with 3D-based methods that use the 3D structure of the target protein explicitly. Also, AutoGrow4, a 2D molecular graph-based genetic algorithm, dominates SBDD in terms of optimization ability. The relevant code is available in this https URL.

[LG-22] Mixed-Precision Over-The-Air Federated Learning via Approximated Computing

链接: https://arxiv.org/abs/2406.03402
作者: Jinsheng Yuan,Zhuangkun Wei,Weisi Guo
关键词: distributed learning mechanism, privacy-preserving distributed learning, Federated Learning, extensively investigated, privacy-preserving distributed
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Over-the-Air Federated Learning (OTA-FL) has been extensively investigated as a privacy-preserving distributed learning mechanism. Realistic systems will see FL clients with diverse size, weight, and power configurations. A critical research gap in existing OTA-FL research is the assumption of homogeneous client computational bit precision. Indeed, many clients may exploit approximate computing (AxC) where bit precisions are adjusted for energy and computational efficiency. The dynamic distribution of bit precision updates amongst FL clients poses an open challenge for OTA-FL, as it is incompatible in the wireless modulation superposition space. Here, we propose an AxC-based OTA-FL framework of clients with multiple precisions, demonstrating the following innovations: (i) optimize the quantization-performance trade-off for both server and clients within the constraints of varying edge computing capabilities and learning accuracy requirements, and (ii) develop heterogeneous gradient resolution OTA-FL modulation schemes to ensure compatibility with physical layer OTA aggregation. Our findings indicate that we can design modulation schemes that enable AxC-based OTA-FL, which can achieve 50% faster and smoother server convergence and a performance enhancement for the lowest precision clients compared to a homogeneous precision approach. This demonstrates the great potential of our AxC-based OTA-FL approach in heterogeneous edge computing environments.
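To make the heterogeneous-precision setting concrete, here is a toy sketch (names and scheme are illustrative, not the paper's OTA modulation design): each client quantizes its update at its own bit width, then the server averages.

```python
def quantize(x, bits, lo=-1.0, hi=1.0):
    """Uniform quantization of a scalar to 2**bits levels on [lo, hi]."""
    levels = (1 << bits) - 1
    x = max(lo, min(hi, x))
    step = (hi - lo) / levels
    return lo + round((x - lo) / step) * step

def aggregate(client_updates, client_bits):
    """Quantize each client's update at its own precision, then average
    (a stand-in for the over-the-air superposition in the paper)."""
    n = len(client_updates)
    dim = len(client_updates[0])
    return [
        sum(quantize(u[i], b) for u, b in zip(client_updates, client_bits)) / n
        for i in range(dim)
    ]
```

The sketch shows why heterogeneity is a problem: a 1-bit client contributes a much coarser (and possibly biased) update than an 8-bit one, which the paper's modulation schemes are designed to reconcile at the physical layer.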

[LG-23] Methods for Class-Imbalanced Learning with Support Vector Machines: A Review and an Empirical Evaluation

链接: https://arxiv.org/abs/2406.03398
作者: Salim rezvani,Farhad Pourpanah,Chee Peng Lim,Q. M. Jonathan Wu
关键词: Support Vector Machine, Vector Machine, Support Vector, paper presents, presents a review
类目: Machine Learning (cs.LG)
*备注: Accepted in Soft Computing

点击查看摘要

Abstract:This paper presents a review on methods for class-imbalanced learning with the Support Vector Machine (SVM) and its variants. We first explain the structure of SVM and its variants and discuss their inefficiency in learning with class-imbalanced data sets. We introduce a hierarchical categorization of SVM-based models with respect to class-imbalanced learning. Specifically, we categorize SVM-based models into re-sampling, algorithmic, and fusion methods, and discuss the principles of the representative models in each category. In addition, we conduct a series of empirical evaluations to compare the performances of various representative SVM-based models in each category using benchmark imbalanced data sets, ranging from low to high imbalanced ratios. Our findings reveal that while algorithmic methods are less time-consuming owing to no data pre-processing requirements, fusion methods, which combine both re-sampling and algorithmic approaches, generally perform the best, but with a higher computational load. A discussion on research gaps and future research directions is provided.
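As a concrete instance of the re-sampling category the review describes, the sketch below implements the simplest such method, random oversampling of minority classes, which can be applied to the training set before fitting any SVM (illustrative, stdlib-only):

```python
import random
from collections import Counter

def random_oversample(X, y, seed=0):
    """Duplicate minority-class samples at random until every class
    matches the majority-class count (the simplest re-sampling method)."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for cls, cnt in counts.items():
        idx = [i for i, label in enumerate(y) if label == cls]
        for _ in range(target - cnt):
            j = rng.choice(idx)       # resample with replacement
            X_out.append(X[j])
            y_out.append(cls)
    return X_out, y_out
```

Algorithmic methods instead skip this pre-processing and modify the SVM objective itself (e.g. per-class misclassification costs), which is why the review finds them less time-consuming.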

[LG-24] Noisy Data Visualization using Functional Data Analysis

链接: https://arxiv.org/abs/2406.03396
作者: Haozhe Chen,Andres Felipe Duque Correa,Guy Wolf,Kevin R. Moon
关键词: exploratory data analysis, Empirical Intrinsic Geometry, important tool, tool in exploratory, Data
类目: Machine Learning (cs.LG); Functional Analysis (math.FA); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Data visualization via dimensionality reduction is an important tool in exploratory data analysis. However, when the data are noisy, many existing methods fail to capture the underlying structure of the data. The method called Empirical Intrinsic Geometry (EIG) was previously proposed for performing dimensionality reduction on high dimensional dynamical processes while theoretically eliminating all noise. However, implementing EIG in practice requires the construction of high-dimensional histograms, which suffer from the curse of dimensionality. Here we propose a new data visualization method called Functional Information Geometry (FIG) for dynamical processes that adapts the EIG framework while using approaches from functional data analysis to mitigate the curse of dimensionality. We experimentally demonstrate that the resulting method outperforms a variant of EIG designed for visualization in terms of capturing the true structure, hyperparameter robustness, and computational speed. We then use our method to visualize EEG brain measurements of sleep activity.

[LG-25] Author, Content or Sharers? Estimating Spread Dynamics with Bayesian Mixture Hawkes

链接: https://arxiv.org/abs/2406.03390
作者: Pio Calderon,Marian-Andrei Rizoiu
关键词: BMH model, BMH, Bayesian Mixture Hawkes, shaped by intertwining, BMH model outperforms
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注: accepted in the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD) 2024

点击查看摘要

Abstract:The spread of content on social media is shaped by intertwining factors on three levels: the source, the content itself, and the pathways of content spread. At the lowest level, the popularity of the sharing user determines its eventual reach. However, higher-level factors such as the nature of the online item and the credibility of its source also play crucial roles in determining how widely and rapidly the online item spreads. In this work, we propose the Bayesian Mixture Hawkes (BMH) model to jointly learn the influence of source, content and spread. We formulate the BMH model as a hierarchical mixture model of separable Hawkes processes, accommodating different classes of Hawkes dynamics and the influence of feature sets on these classes. We test the BMH model on two learning tasks, cold-start popularity prediction and temporal profile generalization performance, applying to two real-world retweet cascade datasets referencing articles from controversial and traditional media publishers. The BMH model outperforms the state-of-the-art models and predictive baselines on both datasets and utilizes cascade- and item-level information better than the alternatives. Lastly, we perform a counter-factual analysis where we apply the trained publisher-level BMH models to a set of article headlines and show that the effectiveness of headline writing style (neutral, clickbait, inflammatory) varies across publishers. The BMH model unveils differences in style effectiveness between controversial and reputable publishers, where we find clickbait to be notably more effective for reputable publishers as opposed to controversial ones, which links to the latter’s overuse of clickbait.

[LG-26] Learning Long Range Dependencies on Graphs via Random Walks

链接: https://arxiv.org/abs/2406.03386
作者: Dexiong Chen,Till Hendrik Schulz,Karsten Borgwardt
关键词: Message-passing graph neural, capturing local relationships, graph neural networks, neural networks, local relationships
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Message-passing graph neural networks (GNNs), while excelling at capturing local relationships, often struggle with long-range dependencies on graphs. Conversely, graph transformers (GTs) enable information exchange between all nodes but oversimplify the graph structure by treating them as a set of fixed-length vectors. This work proposes a novel architecture, NeuralWalker, that overcomes the limitations of both methods by combining random walks with message passing. NeuralWalker achieves this by treating random walks as sequences, allowing for the application of recent advances in sequence models in order to capture long-range dependencies within these walks. Based on this concept, we propose a framework that offers (1) more expressive graph representations through random walk sequences, (2) the ability to utilize any sequence model for capturing long-range dependencies, and (3) the flexibility by integrating various GNN and GT architectures. Our experimental evaluations demonstrate that NeuralWalker achieves significant performance improvements on 19 graph and node benchmark datasets, notably outperforming existing methods by up to 13% on the PascalVoc-SP and COCO-SP datasets. Code is available at this https URL.
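The first ingredient of NeuralWalker, treating random walks as sequences, can be sketched in a few lines. The walk sampler below is generic (uniform transitions, fixed length) and only illustrates the idea of turning a graph into sequence-model inputs, not the paper's architecture:

```python
import random

def random_walks(adj, walk_len=4, walks_per_node=2, seed=0):
    """Sample fixed-length uniform random walks from every node of a
    graph given as an adjacency list {node: [neighbors]}. Each walk is
    a sequence that a sequence model (RNN, SSM, transformer) can consume
    to capture dependencies beyond the local message-passing radius."""
    rng = random.Random(seed)
    walks = []
    for start in adj:
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_len:
                nbrs = adj[walk[-1]]
                if not nbrs:
                    break  # dead end: emit a shorter walk
                walk.append(rng.choice(nbrs))
            walks.append(walk)
    return walks
```

A long walk can connect nodes many hops apart in a single sequence, which is exactly how the method side-steps the locality of message passing.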

[LG-27] What Matters in Hierarchical Search for Combinatorial Reasoning Problems?

链接: https://arxiv.org/abs/2406.03361
作者: Michał Zawalski,Gracjan Góral,Michał Tyrolski,Emilia Wiśnios,Franciszek Budrowski,Łukasz Kuciński,Piotr Miłoś
关键词: notorious NP-hard tasks, Efficiently tackling combinatorial, Efficiently tackling, combinatorial reasoning problems, NP-hard tasks
类目: Machine Learning (cs.LG)
*备注: Accepted for Generative Models for Decision Making Workshop at ICLR 2024

点击查看摘要

Abstract:Efficiently tackling combinatorial reasoning problems, particularly the notorious NP-hard tasks, remains a significant challenge for AI research. Recent efforts have sought to enhance planning by incorporating hierarchical high-level search strategies, known as subgoal methods. While promising, their performance against traditional low-level planners is inconsistent, raising questions about their application contexts. In this study, we conduct an in-depth exploration of subgoal-planning methods for combinatorial reasoning. We identify the attributes pivotal for leveraging the advantages of high-level search: hard-to-learn value functions, complex action spaces, presence of dead ends in the environment, or using data collected from diverse experts. We propose a consistent evaluation methodology to achieve meaningful comparisons between methods and reevaluate the state-of-the-art algorithms.

[LG-28] Cooperative learning of Pl@ntNet's Artificial Intelligence algorithm: how does it work and how can we improve it?

链接: https://arxiv.org/abs/2406.03356
作者: Tanguy Lefort,Antoine Affouard,Benjamin Charlier,Jean-Christophe Lombardo,Mathias Chouet,Hervé Goëau,Joseph Salmon,Pierre Bonnet,Alexis Joly
关键词: Deep learning models, Deep learning, Deep, data, species identification rely
类目: Machine Learning (cs.LG); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:Deep learning models for plant species identification rely on large annotated datasets. The PlantNet system enables global data collection by allowing users to upload and annotate plant observations, leading to noisy labels due to diverse user skills. Achieving consensus is crucial for training, but the vast scale of collected data makes traditional label aggregation strategies challenging. Existing methods either retain all observations, resulting in noisy training data or selectively keep those with sufficient votes, discarding valuable information. Additionally, as many species are rarely observed, user expertise can not be evaluated as an inter-user agreement: otherwise, botanical experts would have a lower weight in the AI training step than the average user. Our proposed label aggregation strategy aims to cooperatively train plant identification AI models. This strategy estimates user expertise as a trust score per user based on their ability to identify plant species from crowdsourced data. The trust score is recursively estimated from correctly identified species given the current estimated labels. This interpretable score exploits botanical experts’ knowledge and the heterogeneity of users. Subsequently, our strategy removes unreliable observations but retains those with limited trusted annotations, unlike other approaches. We evaluate PlantNet’s strategy on a released large subset of the PlantNet database focused on European flora, comprising over 6M observations and 800K users. We demonstrate that estimating users’ skills based on the diversity of their expertise enhances labeling performance. Our findings emphasize the synergy of human annotation and data filtering in improving AI performance for a refined dataset. We explore incorporating AI-based votes alongside human input. This can further enhance human-AI interactions to detect unreliable observations.
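The recursive trust-score idea, alternating between trust-weighted labels and per-user agreement rates, can be sketched as a small EM-style loop. This is an illustrative reading of the abstract, not PlantNet's production algorithm; all names are hypothetical:

```python
def estimate_trust(votes, n_iter=5):
    """votes: {item: {user: label}}. Alternate between (a) trust-weighted
    majority labels per item and (b) per-user trust = fraction of that
    user's votes that agree with the current estimated labels."""
    users = {u for v in votes.values() for u in v}
    trust = {u: 1.0 for u in users}
    labels = {}
    for _ in range(n_iter):
        # (a) trust-weighted vote per item
        for item, v in votes.items():
            scores = {}
            for u, lab in v.items():
                scores[lab] = scores.get(lab, 0.0) + trust[u]
            labels[item] = max(scores, key=scores.get)
        # (b) trust = agreement rate with the current labels
        for u in users:
            cast = [(item, lab) for item, v in votes.items()
                    for uu, lab in v.items() if uu == u]
            agree = sum(1 for item, lab in cast if labels[item] == lab)
            trust[u] = agree / len(cast)
    return labels, trust
```

In this toy form, a user who systematically disagrees with the consensus ends up with low trust and correspondingly little weight in the next labeling pass, which is the mechanism that lets expert votes dominate without explicit inter-user agreement on rare species.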

[LG-29] Position: A Call to Action for a Human-Centered AutoML Paradigm

链接: https://arxiv.org/abs/2406.03348
作者: Marius Lindauer,Florian Karl,Anne Klier,Julia Moosbauer,Alexander Tornede,Andreas Mueller,Frank Hutter,Matthias Feurer,Bernd Bischl
关键词: Automated machine learning, configuring machine learning, efficiently configuring machine, machine learning, Automated machine
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Automated machine learning (AutoML) was formed around the fundamental objectives of automatically and efficiently configuring machine learning (ML) workflows, aiding the research of new ML algorithms, and contributing to the democratization of ML by making it accessible to a broader audience. Over the past decade, commendable achievements in AutoML have primarily focused on optimizing predictive performance. This focused progress, while substantial, raises questions about how well AutoML has met its broader, original goals. In this position paper, we argue that a key to unlocking AutoML’s full potential lies in addressing the currently underexplored aspect of user interaction with AutoML systems, including their diverse roles, expectations, and expertise. We envision a more human-centered approach in future AutoML research, promoting the collaborative design of ML systems that tightly integrates the complementary strengths of human expertise and AutoML methodologies.

[LG-30] Normalizing Flows for Conformal Regression

链接: https://arxiv.org/abs/2406.03346
作者: Nicolo Colombo
关键词: Conformal Prediction, calibrating its outputs, outputs on labeled, labeled data, Prediction
类目: Machine Learning (cs.LG); Probability (math.PR); Machine Learning (stat.ML)
*备注: To be presented at the 40th Conference on Uncertainty in Artificial Intelligence (UAI 2024). 13 pages, 2 figures

点击查看摘要

Abstract:Conformal Prediction (CP) algorithms estimate the uncertainty of a prediction model by calibrating its outputs on labeled data. The same calibration scheme usually applies to any model and data without modifications. The obtained prediction intervals are valid by construction but could be inefficient, i.e. unnecessarily big, if the prediction errors are not uniformly distributed over the input space. We present a general scheme to localize the intervals by training the calibration process. The standard prediction error is replaced by an optimized distance metric that depends explicitly on the object attributes. Learning the optimal metric is equivalent to training a Normalizing Flow that acts on the joint distribution of the errors and the inputs. Unlike the Error Re-weighting CP algorithm of Papadopoulos et al. (2008), the framework allows estimating the gap between nominal and empirical conditional validity. The approach is compatible with existing locally-adaptive CP strategies based on re-weighting the calibration samples and applies to any point-prediction model without retraining.
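For context, a plain split-conformal interval with a locally scaled nonconformity score looks as follows. The paper's contribution is learning the scaling (the difficulty estimate `d_i` below) with a Normalizing Flow; this stdlib sketch takes the difficulty estimates as given:

```python
import math

def conformal_interval(cal_errors, cal_difficulty, x_difficulty, alpha=0.1):
    """Split conformal prediction with locally scaled scores
    s_i = |err_i| / d_i on a calibration set. Returns the half-width of
    the (1 - alpha) interval at a test point with difficulty estimate
    x_difficulty: harder regions get proportionally wider intervals."""
    scores = sorted(abs(e) / d for e, d in zip(cal_errors, cal_difficulty))
    n = len(scores)
    # (1 - alpha) empirical quantile with the finite-sample correction
    k = min(n - 1, math.ceil((n + 1) * (1 - alpha)) - 1)
    return scores[k] * x_difficulty
```

With constant difficulty this reduces to the standard (non-localized) split-conformal interval, which is valid but equally wide everywhere.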

[LG-31] Feature Contamination: Neural Networks Learn Uncorrelated Features and Fail to Generalize

链接: https://arxiv.org/abs/2406.03345
作者: Tianren Zhang,Chujie Zhao,Guanyu Chen,Yizhou Jiang,Feng Chen
关键词: building robust machine, robust machine learning, critical for building, building robust, robust machine
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: ICML 2024

点击查看摘要

Abstract:Learning representations that generalize under distribution shifts is critical for building robust machine learning models. However, despite significant efforts in recent years, algorithmic advances in this direction have been limited. In this work, we seek to understand the fundamental difficulty of out-of-distribution generalization with deep neural networks. We first empirically show that perhaps surprisingly, even allowing a neural network to explicitly fit the representations obtained from a teacher network that can generalize out-of-distribution is insufficient for the generalization of the student network. Then, by a theoretical study of two-layer ReLU networks optimized by stochastic gradient descent (SGD) under a structured feature model, we identify a fundamental yet unexplored feature learning proclivity of neural networks, feature contamination: neural networks can learn uncorrelated features together with predictive features, resulting in generalization failure under distribution shifts. Notably, this mechanism essentially differs from the prevailing narrative in the literature that attributes the generalization failure to spurious correlations. Overall, our results offer new insights into the non-linear feature learning dynamics of neural networks and highlight the necessity of considering inductive biases in out-of-distribution generalization.

[LG-32] Tackling GenAI Copyright Issues: Originality Estimation and Genericization

链接: https://arxiv.org/abs/2406.03341
作者: Hiroaki Chiba-Okabe,Weijie J. Su
关键词: numerous lawsuits filed, significant copyright concerns, sparked significant copyright, leading to numerous, generative model
类目: Machine Learning (cs.LG); Methodology (stat.ME)
*备注: 15 pages, 6 figures

点击查看摘要

Abstract:The rapid progress of generative AI technology has sparked significant copyright concerns, leading to numerous lawsuits filed against AI developers. While some studies explore methods to mitigate copyright risks by steering the outputs of generative models away from those resembling copyrighted data, little attention has been paid to the question of how much of a resemblance is undesirable; more original or unique data are afforded stronger protection, and the threshold level of resemblance for constituting infringement is correspondingly lower. Here, leveraging this principle, we propose a genericization method that modifies the outputs of a generative model to make them more generic and less likely to infringe copyright. To achieve this, we introduce a metric for quantifying the level of originality of data in a manner that is consistent with the legal framework. This metric can be practically estimated by drawing samples from a generative model, which is then used for the genericization process. Experiments demonstrate that our genericization method successfully modifies the output of a text-to-image generative model so that it produces more generic, copyright-compliant images.

[LG-33] Identifying latent state transition in non-linear dynamical systems

链接: https://arxiv.org/abs/2406.03337
作者: Çağlar Hızlı,Çağatay Yıldız,Matthias Bethge,ST John,Pekka Marttinen
关键词: underlying lower-dimensional latent, lower-dimensional latent states, dynamical systems focused, dynamical systems, time evolutions
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:This work aims to improve generalization and interpretability of dynamical systems by recovering the underlying lower-dimensional latent states and their time evolutions. Previous work on disentangled representation learning within the realm of dynamical systems focused on the latent states, possibly with linear transition approximations. As such, they cannot identify nonlinear transition dynamics, and hence fail to reliably predict complex future behavior. Inspired by the advances in nonlinear ICA, we propose a state-space modeling framework in which we can identify not just the latent states but also the unknown transition function that maps the past states to the present. We introduce a practical algorithm based on variational auto-encoders and empirically demonstrate in realistic synthetic settings that we can (i) recover latent state dynamics with high accuracy, (ii) correspondingly achieve high future prediction accuracy, and (iii) adapt fast to new environments.

[LG-34] Reparameterization invariance in approximate Bayesian inference

链接: https://arxiv.org/abs/2406.03334
作者: Hrittik Roy,Marco Miani,Carl Henrik Ek,Philipp Hennig,Marvin Pförtner,Lukas Tatzel,Søren Hauberg
关键词: Current approximate posteriors, BNNs assign, exhibit a crucial, crucial limitation, Bayesian neural networks
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Current approximate posteriors in Bayesian neural networks (BNNs) exhibit a crucial limitation: they fail to maintain invariance under reparameterization, i.e. BNNs assign different posterior densities to different parametrizations of identical functions. This creates a fundamental flaw in the application of Bayesian principles, as it breaks the correspondence between uncertainty over the parameters and uncertainty over the parametrized function. In this paper, we investigate this issue in the context of the increasingly popular linearized Laplace approximation. Specifically, it has been observed that linearized predictives alleviate the common underfitting problems of the Laplace approximation. We develop a new geometric view of reparametrizations from which we explain the success of linearization. Moreover, we demonstrate that these reparameterization invariance properties can be extended to the original neural network predictive using a Riemannian diffusion process, giving a straightforward algorithm for approximate posterior sampling, which empirically improves posterior fit.

[LG-35] UDQL: Bridging The Gap between MSE Loss and The Optimal Value Function in Offline Reinforcement Learning

链接: https://arxiv.org/abs/2406.03324
作者: Yu Zhang,Rui Yu,Zhipeng Yao,Wenyuan Zhang,Jun Wang,Liming Zhang
关键词: achieved outstanding performance, offline reinforcement learning, Square Error, reinforcement learning, outstanding performance
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The Mean Square Error (MSE) is commonly utilized to estimate the solution of the optimal value function in the vast majority of offline reinforcement learning (RL) models and has achieved outstanding performance. However, we find that its principle can lead to an overestimation phenomenon for the value function. In this paper, we first theoretically analyze the overestimation phenomenon caused by MSE and provide a theoretical upper bound on the overestimation error. Furthermore, to address it, we propose a novel underestimated Bellman operator to counteract the overestimation and prove its contraction characteristics. Finally, we propose an offline RL algorithm based on the underestimated operator and a diffusion policy model. Extensive experimental results on D4RL tasks show that our method can outperform state-of-the-art offline RL algorithms, demonstrating that our theoretical analysis and underestimation approach are effective for offline RL tasks.
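The overestimation the paper targets can be demonstrated in a few lines: even with unbiased value estimates, taking a max over actions is biased upward. This is the classical illustration of the effect, not the paper's construction:

```python
import random

def max_overestimation(n_actions=10, n_trials=10000, noise=1.0, seed=0):
    """True action values are all 0; each estimate carries zero-mean
    Gaussian noise. E[max_a Qhat(a)] > 0 = max_a Q(a), so the greedy
    value is biased upward -- the effect an underestimated Bellman
    operator is designed to counteract."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        total += max(rng.gauss(0.0, noise) for _ in range(n_actions))
    return total / n_trials  # positive despite unbiased per-action noise
```

With 10 actions and unit noise the bias is roughly 1.5: the noisier the fitted values and the larger the action set, the worse the overestimation.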

[LG-36] Reproducibility study of FairAC

链接: https://arxiv.org/abs/2406.03314
作者: Gijs de Jong,Macha J. Meijer,Derck W. E. Prinzhorn,Harold Ruiter
关键词: Fair Attribute Completion, Completion on Graph, Graph with Missing, written by Guo, Fair Attribute
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 14 pages, 2 figures, accepted at TMLR

点击查看摘要

Abstract:This work aims to reproduce the findings of the paper “Fair Attribute Completion on Graph with Missing Attributes” written by Guo, Chu, and Li arXiv:2302.12977 by investigating the claims made in the paper. This paper suggests that the results of the original paper are reproducible and thus, the claims hold. However, the claim that FairAC is a generic framework for many downstream tasks is very broad and could therefore only be partially tested. Moreover, we show that FairAC is generalizable to various datasets and sensitive attributes and show evidence that the improvement in group fairness of the FairAC framework does not come at the expense of individual fairness. Lastly, the codebase of FairAC has been refactored and is now easily applicable for various datasets and models.

[LG-37] Embarrassingly Parallel GFlowNets

链接: https://arxiv.org/abs/2406.03288
作者: Tiago da Silva,Luiz Max Carvalho,Amauri Souza,Samuel Kaski,Diego Mesquita
关键词: compositional random variables, discrete compositional random, alternative to MCMC, MCMC sampling, random variables
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted to ICML 2024

点击查看摘要

Abstract:GFlowNets are a promising alternative to MCMC sampling for discrete compositional random variables. Training GFlowNets requires repeated evaluations of the unnormalized target distribution or reward function. However, for large-scale posterior sampling, this may be prohibitive since it incurs traversing the data several times. Moreover, if the data are distributed across clients, employing standard GFlowNets leads to intensive client-server communication. To alleviate both these issues, we propose embarrassingly parallel GFlowNet (EP-GFlowNet). EP-GFlowNet is a provably correct divide-and-conquer method to sample from product distributions of the form R(·) ∝ R_1(·) ⋯ R_N(·) – e.g., in parallel or federated Bayes, where each R_n is a local posterior defined on a data partition. First, in parallel, we train a local GFlowNet targeting each R_n and send the resulting models to the server. Then, the server learns a global GFlowNet by enforcing our newly proposed aggregating balance condition, requiring a single communication step. Importantly, EP-GFlowNets can also be applied to multi-objective optimization and model reuse. Our experiments illustrate EP-GFlowNet’s effectiveness on many tasks, including parallel Bayesian phylogenetics, multi-objective multiset, sequence generation, and federated Bayesian structure learning.
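On a small finite space the target product distribution can be computed exactly, which makes the object EP-GFlowNet approximates easy to see (toy sketch; real GFlowNet targets are far too large to enumerate, which is the whole point of the method):

```python
def product_distribution(local_rewards):
    """Combine local unnormalized rewards R_1, ..., R_N (dicts over the
    same discrete objects) into the normalized product distribution
    R(x) proportional to R_1(x)...R_N(x) -- the target the global
    EP-GFlowNet is trained to sample from."""
    objs = set(local_rewards[0])
    probs = {}
    for x in objs:
        p = 1.0
        for R in local_rewards:
            p *= R[x]  # each client contributes a factor
        probs[x] = p
    Z = sum(probs.values())
    return {x: p / Z for x, p in probs.items()}
```

In parallel Bayes each `R_n` is a local posterior on a data shard, so the product recovers (up to normalization) the posterior given all the data, with a single communication round.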

[LG-38] SpikeLM: Towards General Spike-Driven Language Modeling via Elastic Bi-Spiking Mechanisms

链接: https://arxiv.org/abs/2406.03287
作者: Xingrun Xing,Zheng Zhang,Ziyi Ni,Shitao Xiao,Yiming Ju,Siqi Fan,Yequan Wang,Jiajun Zhang,Guoqi Li
关键词: energy-efficient artificial intelligence, artificial intelligence similar, spiking neural networks, bio-inspired spiking neural, event-driven sparsity
类目: Neural and Evolutionary Computing (cs.NE); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Towards energy-efficient artificial intelligence similar to the human brain, bio-inspired spiking neural networks (SNNs) have the advantages of biological plausibility, event-driven sparsity, and binary activation. Recently, large-scale language models have exhibited promising generalization capability, making it a valuable issue to explore more general spike-driven models. However, the binary spikes in existing SNNs fail to encode adequate semantic information, posing technological challenges for generalization. This work proposes the first fully spiking mechanism for general language tasks, including both discriminative and generative ones. Different from previous spikes with {0, 1} levels, we propose a more general spike formulation with bi-directional, elastic amplitude, and elastic frequency encoding, while still maintaining the additive nature of SNNs. In a single time step, the spike is enhanced by direction and amplitude information; for spike frequency, a strategy to control the spike firing rate is designed. We plug this elastic bi-spiking mechanism into language modeling, named SpikeLM. It is the first time general language tasks have been handled with fully spike-driven models, which achieve much higher accuracy than previously possible. SpikeLM also greatly bridges the performance gap between SNNs and ANNs in language modeling. Our code is available at this https URL.
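One simplified reading of the bi-directional, elastic-amplitude encoding (an assumption on our part, not the paper's exact formulation) is ternary spikes {-a, 0, +a} with a data-dependent amplitude a:

```python
def bi_spike(x, threshold=0.5):
    """Encode activations into bi-directional spikes {-a, 0, +a}, where
    the elastic amplitude a is the mean |x| over firing units. A
    simplified, hypothetical reading of SpikeLM's elastic bi-spiking:
    direction carries the sign, amplitude is shared per layer."""
    fired = [v for v in x if abs(v) >= threshold]
    amp = sum(abs(v) for v in fired) / len(fired) if fired else 0.0
    return [amp if v >= threshold else -amp if v <= -threshold else 0.0
            for v in x]
```

The output is still additive (a shared scalar times {-1, 0, +1}), so SNN-style accumulation hardware remains applicable while each spike carries more information than a plain {0, 1} event.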

[LG-39] FusionBench: A Comprehensive Benchmark of Deep Model Fusion

链接: https://arxiv.org/abs/2406.03280
作者: Anke Tang,Li Shen,Yong Luo,Han Hu,Bo Do,Dacheng Tao
关键词: Deep model fusion, deep neural networks, model fusion techniques, model fusion, fusion techniques
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: Project homepage: this https URL

点击查看摘要

Abstract:Deep model fusion is an emerging technique that unifies the predictions or parameters of several deep neural networks into a single model in a cost-effective and data-efficient manner. This enables the unified model to take advantage of the original models’ strengths, potentially exceeding their performance. Although a variety of deep model fusion techniques have been introduced, their evaluations tend to be inconsistent and often inadequate to validate their effectiveness and robustness against distribution shifts. To address this issue, we introduce FusionBench, which is the first comprehensive benchmark dedicated to deep model fusion. FusionBench covers a wide range of tasks, including open-vocabulary image classification, text classification, and text-to-text generation. Each category includes up to eight tasks with corresponding task-specific models, featuring both full fine-tuning and LoRA fine-tuning, as well as models of different sizes, to ensure fair and balanced comparisons of various multi-task model fusion techniques across different tasks, model scales, and fine-tuning strategies. We implement and evaluate a broad spectrum of deep model fusion techniques. These techniques range from model ensemble methods, which combine the predictions to improve the overall performance, to model merging, which integrates different models into a single one, and model mixing methods, which upscale or recombine the components of the original models. FusionBench now contains 26 distinct tasks, 74 fine-tuned models, and 16 fusion techniques, and we are committed to consistently expanding the benchmark with more tasks, models, and fusion techniques. In addition, we offer a well-documented set of resources and guidelines to aid researchers in understanding and replicating the benchmark results. Homepage this https URL
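The simplest model-merging baseline in such benchmarks is weighted parameter averaging; below is a stdlib sketch with flat lists standing in for tensors (illustrative of the technique, not FusionBench's API):

```python
def merge_models(state_dicts, weights=None):
    """Merge models by weighted parameter averaging, the simplest of
    the model-merging techniques such benchmarks evaluate. state_dicts:
    list of {param_name: list_of_floats} with identical shapes."""
    n = len(state_dicts)
    weights = weights or [1.0 / n] * n  # uniform "model soup" by default
    merged = {}
    for name in state_dicts[0]:
        dim = len(state_dicts[0][name])
        merged[name] = [
            sum(w * sd[name][i] for w, sd in zip(weights, state_dicts))
            for i in range(dim)
        ]
    return merged
```

Unlike ensembling, merging yields a single model with no inference overhead, which is why it is evaluated separately from ensemble and mixing methods.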

[LG-40] Using GNN property predictors as molecule generators

链接: https://arxiv.org/abs/2406.03278
作者: Félix Therrien,Edward H. Sargent,Oleksandr Voznyy
关键词: computational discovery pipelines, accurately predict materials, discovery pipelines, Graph neural networks, neural networks
类目: Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
*备注: 7 pages, 2 figures, 2 tables

点击查看摘要

Abstract:Graph neural networks (GNNs) have emerged as powerful tools to accurately predict materials and molecular properties in computational discovery pipelines. In this article, we exploit the invertible nature of these neural networks to directly generate molecular structures with desired electronic properties. Starting from a random graph or an existing molecule, we perform a gradient ascent while holding the GNN weights fixed in order to optimize its input, the molecular graph, towards the target property. Valence rules are enforced strictly through a judicious graph construction. The method relies entirely on the property predictor; no additional training is required on molecular structures. We demonstrate the application of this method by generating molecules with specific DFT-verified energy gaps and octanol-water partition coefficients (logP). Our approach hits target properties with rates comparable to or better than state-of-the-art generative models while consistently generating more diverse molecules.
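The core trick, freezing the predictor and running gradient ascent on its input, can be sketched with a toy continuous input and finite-difference gradients (the paper works on molecular graphs with exact GNN gradients and strict valence constraints, which this sketch omits):

```python
def optimize_input(predict, x0, lr=0.1, steps=200, eps=1e-4):
    """Hold the predictor fixed and ascend its output with respect to
    the *input*. Gradients are estimated by central finite differences,
    standing in for backprop-through-the-input on a real GNN."""
    x = list(x0)
    for _ in range(steps):
        grad = []
        for i in range(len(x)):
            xp = list(x); xp[i] += eps
            xm = list(x); xm[i] -= eps
            grad.append((predict(xp) - predict(xm)) / (2 * eps))
        x = [xi + lr * g for xi, g in zip(x, grad)]  # ascend the property
    return x
```

No training on molecular structures is needed at any point: the property predictor alone drives the search, which is the method's central claim.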

[LG-41] Revisiting Scalable Hessian Diagonal Approximations for Applications in Reinforcement Learning

链接: https://arxiv.org/abs/2406.03276
作者: Mohamed Elsayed,Homayoon Farrahi,Felix Dangel,A. Rupam Mahmood
关键词: Hessian diagonals, approximating Hessian diagonals, information is valuable, applications but challenging, challenging to compute
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Published in the Proceedings of the 41st International Conference on Machine Learning (ICML 2024). Code is available at this https URL . arXiv admin note: substantial text overlap with arXiv:2210.11639

点击查看摘要

Abstract:Second-order information is valuable for many applications but challenging to compute. Several works focus on computing or approximating Hessian diagonals, but even this simplification introduces significant additional costs compared to computing a gradient. In the absence of efficient exact computation schemes for Hessian diagonals, we revisit an early approximation scheme proposed by Becker and LeCun (1989, BL89), which has a cost similar to gradients and appears to have been overlooked by the community. We introduce HesScale, an improvement over BL89, which adds negligible extra computation. On small networks, we find that this improvement is of higher quality than all alternatives, even those with theoretical guarantees, such as unbiasedness, while being much cheaper to compute. We use this insight in reinforcement learning problems where small networks are used and demonstrate HesScale in second-order optimization and scaling the step-size parameter. In our experiments, HesScale optimizes faster than existing methods and improves stability through step-size scaling. These findings are promising for scaling second-order methods in larger models in the future.

[LG-42] Deep Generative Models for Proton Zero Degree Calorimeter Simulations in ALICE CERN

链接: https://arxiv.org/abs/2406.03263
作者: Patryk Będkowski,Jan Dubiński,Kamil Deja,Przemysław Rokita
关键词: Large Hadron Collider, Simulating detector responses, Large Hadron, Hadron Collider, Simulating detector
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 8 pages, 3 figures, PP-RAI 2024 conference

点击查看摘要

Abstract:Simulating detector responses is a crucial part of understanding the inner workings of particle collisions in the Large Hadron Collider at CERN. The current reliance on statistical Monte-Carlo simulations strains CERN’s computational grid, underscoring the urgency for more efficient alternatives. Addressing these challenges, recent proposals advocate for generative machine learning methods. In this study, we present an innovative deep learning simulation approach tailored for the proton Zero Degree Calorimeter in the ALICE experiment. Leveraging a Generative Adversarial Network model with Selective Diversity Increase (SDI) loss, we directly simulate calorimeter responses. To enhance its capabilities in modeling a broad range of calorimeter response intensities, we expand the SDI-GAN architecture with additional regularization. Moreover, to improve the spatial fidelity of the generated data, we introduce an auxiliary regressor network. Our method offers a significant speedup compared to traditional Monte-Carlo based approaches.

[LG-43] On the Maximal Local Disparity of Fairness-Aware Classifiers

链接: https://arxiv.org/abs/2406.03255
作者: Jinqiu Jin,Haoxuan Li,Fuli Feng
关键词: trustworthy machine learning, machine learning algorithms, crucial aspect, development of trustworthy, trustworthy machine
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Fairness has become a crucial aspect in the development of trustworthy machine learning algorithms. Current fairness metrics to measure the violation of demographic parity have the following drawbacks: (i) the average difference of model predictions on two groups cannot reflect their distribution disparity, and (ii) the overall calculation along all possible predictions conceals the extreme local disparity at or around certain predictions. In this work, we propose a novel fairness metric called Maximal Cumulative ratio Disparity along varying Predictions’ neighborhood (MCDP), for measuring the maximal local disparity of the fairness-aware classifiers. To accurately and efficiently calculate the MCDP, we develop a provably exact and an approximate calculation algorithm that greatly reduces the computational complexity with low estimation error. We further propose a bi-level optimization algorithm using a differentiable approximation of the MCDP for improving the algorithmic fairness. Extensive experiments on both tabular and image datasets validate that our fair training algorithm can achieve superior fairness-accuracy trade-offs.
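The motivation that an average difference can hide local disparity is easy to demonstrate numerically. The sketch below computes the largest pointwise gap between two groups' empirical score CDFs. This is a simplified stand-in, not the paper's exact metric: MCDP uses cumulative ratio disparities over varying prediction neighborhoods, with dedicated exact and approximate algorithms.

```python
import numpy as np

def max_local_disparity(scores_a, scores_b, grid=None):
    """Largest pointwise gap between two groups' empirical score CDFs.

    A simplified stand-in for the paper's MCDP metric: it captures local
    disparity that the average difference of predictions cannot see.
    """
    if grid is None:
        grid = np.linspace(0.0, 1.0, 101)
    cdf_a = np.searchsorted(np.sort(scores_a), grid, side="right") / len(scores_a)
    cdf_b = np.searchsorted(np.sort(scores_b), grid, side="right") / len(scores_b)
    return float(np.abs(cdf_a - cdf_b).max())

# Same mean score (0.5) in both groups, so the *average* difference is zero,
# yet the score distributions are maximally different around 0.1 and 0.9.
group_a = np.full(1000, 0.5)
group_b = np.concatenate([np.full(500, 0.1), np.full(500, 0.9)])

print(round(abs(group_a.mean() - group_b.mean()), 6))  # → 0.0
print(max_local_disparity(group_a, group_b))           # → 0.5
```

Here the conventional demographic-parity gap (difference of means) reports no violation, while the local-disparity view exposes a gap of 0.5.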

[LG-44] Exploring Higher Order Structures in Graph Explanations

链接: https://arxiv.org/abs/2406.03253
作者: Akshit Sinha,Sreeram Vennam,Charu Sharma,Ponnurangam Kumaraguru
关键词: explaining predictions generated, Graph Neural Networks, Recent advancements, Neural Networks, graph learning contributed
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent advancements in graph learning have contributed to explaining predictions generated by Graph Neural Networks. However, existing methodologies often fall short when applied to real-world datasets. We introduce HOGE, a framework to capture higher-order structures using cell complexes, which excel at modeling higher-order relationships. In the real world, higher-order structures are ubiquitous, appearing in molecules and social networks, so our work significantly enhances the practical applicability of graph explanations. HOGE produces clearer and more accurate explanations compared to prior methods. Our method can be integrated with all existing graph explainers, ensuring seamless integration into current frameworks. We evaluate HOGE on GraphXAI benchmark datasets, where it achieves improved or comparable performance with minimal computational overhead. Ablation studies show that the observed performance gain can be attributed to the higher-order structures introduced by the cell complexes.

[LG-45] Near-field Beamforming for Extremely Large-scale MIMO Based on Unsupervised Deep Learning

链接: https://arxiv.org/abs/2406.03249
作者: Jiali Nie,Yuanhao Cui,Zhaohui Yang,Weijie Yuan,Xiaojun Jing
关键词: Extremely Large-scale Array, future communication systems, improving wireless systems’, Extremely Large-scale, Large-scale Array
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Extremely Large-scale Array (ELAA) is considered a frontier technology for future communication systems, pivotal in improving wireless systems’ rate and spectral efficiency. However, as ELAA employs a multitude of antennas operating at higher frequencies, users are typically situated in the near-field region where the spherical wavefront propagates. This inevitably leads to a significant increase in the overhead of beam training, requiring complex two-dimensional beam searching in both the angle domain and the distance domain. To address this problem, we propose a near-field beamforming method based on unsupervised deep learning. Our convolutional neural network efficiently extracts complex channel state information features by strategically selecting padding and kernel size. We optimize the beamformers to maximize achievable rates in a multi-user network without relying on predefined custom codebooks. Upon deployment, the model requires solely the input of pre-estimated channel state information to derive the optimal beamforming vector. Simulation results show that our proposed scheme can obtain stable beamforming gain compared with the baseline scheme. Furthermore, owing to the inherent traits of deep learning methodologies, this approach substantially diminishes the beam training costs in near-field regions.

[LG-46] Variational Pseudo Marginal Methods for Jet Reconstruction in Particle Physics

链接: https://arxiv.org/abs/2406.03242
作者: Hanming Yang,Antonio Khalil Moretti,Sebastian Macaluso,Philippe Chlenski,Christian A. Naesseth,Itsik Pe’er
关键词: provide vital insights, subatomic particles produced, Reconstructing jets, high-energy collisions, provide vital
类目: Machine Learning (cs.LG); Computation (stat.CO)
*备注:

点击查看摘要

Abstract:Reconstructing jets, which provide vital insights into the properties and histories of subatomic particles produced in high-energy collisions, is a main problem in data analyses in collider physics. This intricate task deals with estimating the latent structure of a jet (binary tree) and involves parameters such as particle energy, momentum, and types. While Bayesian methods offer a natural approach for handling uncertainty and leveraging prior knowledge, they face significant challenges due to the super-exponential growth of potential jet topologies as the number of observed particles increases. To address this, we introduce a Combinatorial Sequential Monte Carlo approach for inferring jet latent structures. As a second contribution, we leverage the resulting estimator to develop a variational inference algorithm for parameter learning. Building on this, we introduce a variational family using a pseudo-marginal framework for a fully Bayesian treatment of all variables, unifying the generative model with the inference process. We illustrate our method’s effectiveness through experiments using data generated with a collider physics generative model, highlighting superior speed and accuracy across a range of tasks.

[LG-47] Fine-Grained Causal Dynamics Learning with Quantization for Improving Robustness in Reinforcement Learning

链接: https://arxiv.org/abs/2406.03234
作者: Inwoo Hwang,Yunhyeok Kwak,Suhyung Choi,Byoung-Tak Zhang,Sanghack Lee
关键词: Causal dynamics learning, reinforcement learning, recently emerged, dynamics model, Causal
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: ICML 2024

点击查看摘要

Abstract:Causal dynamics learning has recently emerged as a promising approach to enhancing robustness in reinforcement learning (RL). Typically, the goal is to build a dynamics model that makes predictions based on the causal relationships among the entities. Despite the fact that causal connections often manifest only under certain contexts, existing approaches overlook such fine-grained relationships and lack a detailed understanding of the dynamics. In this work, we propose a novel dynamics model that infers fine-grained causal structures and employs them for prediction, leading to improved robustness in RL. The key idea is to jointly learn the dynamics model with a discrete latent variable that quantizes the state-action space into subgroups. This leads to recognizing meaningful context that displays sparse dependencies, where causal structures are learned for each subgroup throughout the training. Experimental results demonstrate the robustness of our method to unseen states and locally spurious correlations in downstream tasks where fine-grained causal reasoning is crucial. We further illustrate the effectiveness of our subgroup-based approach with quantization in discovering fine-grained causal relationships compared to prior methods.

[LG-48] CommonPower: Supercharging Machine Learning for Smart Grids

链接: https://arxiv.org/abs/2406.03231
作者: Michael Eichelbeck,Hannah Markgraf,Matthias Althoff
关键词: power system management, reinforcement learning, growing complexity, complexity of power, management has led
类目: Systems and Control (eess.SY); Machine Learning (cs.LG)
*备注: For the corresponding code repository, see this https URL

点击查看摘要

Abstract:The growing complexity of power system management has led to an increased interest in the use of reinforcement learning (RL). However, no tool for comprehensive and realistic benchmarking of RL in smart grids exists. One prerequisite for such a comparison is a safeguarding mechanism, since vanilla RL controllers cannot guarantee the satisfaction of system constraints. Other central requirements include flexible modeling of benchmarking scenarios, credible baselines, and the possibility to investigate the impact of forecast uncertainties. Our Python tool CommonPower is the first modular framework addressing these needs. CommonPower offers a unified interface for single-agent and multi-agent RL training algorithms and includes a built-in model predictive control approach based on a symbolic representation of the system equations. This makes it possible to combine model predictive controllers with RL controllers in the same system. Leveraging the symbolic system model, CommonPower facilitates the study of safeguarding strategies via the flexible formulation of safety layers. Furthermore, equipped with a generic forecasting interface, CommonPower constitutes a versatile tool significantly augmenting the exploration of safe RL controllers in smart grids on several dimensions.
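The safeguarding idea is simple at its core: a safety layer sits between the RL policy and the system, projecting any raw action back into the feasible set before it is applied. The sketch below shows a minimal box-constraint projection with illustrative bounds; CommonPower itself derives its safety layers from the symbolic system model rather than fixed bounds like these.

```python
def safeguard(action, lower, upper):
    """Project a raw RL action onto box constraints before applying it."""
    return [min(max(a, lo), hi) for a, lo, hi in zip(action, lower, upper)]

# Illustrative power setpoints (kW): the policy's second action violates
# its upper bound, so the safety layer clips it to the feasible value.
raw_action = [3.2, 9.5, -1.0]
lower = [0.0, 0.0, -2.0]
upper = [5.0, 5.0, 2.0]

safe = safeguard(raw_action, lower, upper)
print(safe)   # → [3.2, 5.0, -1.0]
```

With such a layer in place, a vanilla RL controller can be trained and deployed without ever emitting an infeasible action, which is the prerequisite for benchmarking RL on real grid scenarios.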

[LG-49] Defending Large Language Models Against Attacks With Residual Stream Activation Analysis

链接: https://arxiv.org/abs/2406.03230
作者: Amelia Kawasaki,Andrew Davis,Houssam Abbas
关键词: Large Language Models, Large Language, adoption of Large, Language Models, exemplified by OpenAI
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The widespread adoption of Large Language Models (LLMs), exemplified by OpenAI’s ChatGPT, brings to the forefront the imperative to defend against adversarial threats on these models. These attacks, which manipulate an LLM’s output by introducing malicious inputs, undermine the model’s integrity and the trust users place in its outputs. In response to this challenge, our paper presents an innovative defensive strategy, given white box access to an LLM, that harnesses residual activation analysis between transformer layers of the LLM. We apply an established methodology for analyzing distinctive activation patterns in the residual streams for a novel result of attack prompt classification. We curate multiple datasets to demonstrate how this method of classification has high accuracy across multiple types of attack scenarios, including our newly-created attack dataset. Furthermore, we enhance the model’s resilience by integrating safety fine-tuning techniques for LLMs in order to measure its effect on our capability to detect attacks. The results underscore the effectiveness of our approach in enhancing the detection and mitigation of adversarial inputs, advancing the security framework within which LLMs operate.
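The detection pipeline reduces to a standard supervised problem: collect residual-stream activations for benign and attack prompts, then fit a classifier on them. The sketch below uses synthetic Gaussian "activations" and a nearest-centroid detector as a hypothetical stand-in; the actual work extracts real activations between transformer layers of an LLM and applies an established activation-analysis methodology.

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic stand-ins for residual-stream activations at one layer: we
# assume benign and attack prompts form distinguishable clusters.
benign = rng.normal(0.0, 1.0, size=(200, 16))
attack = rng.normal(1.5, 1.0, size=(200, 16))

# Nearest-centroid detector fit on the "training" activations.
mu_benign = benign.mean(axis=0)
mu_attack = attack.mean(axis=0)

def is_attack(act):
    return np.linalg.norm(act - mu_attack) < np.linalg.norm(act - mu_benign)

# Held-out activations from the same two distributions.
test_benign = rng.normal(0.0, 1.0, size=(100, 16))
test_attack = rng.normal(1.5, 1.0, size=(100, 16))
correct = (sum(not is_attack(a) for a in test_benign)
           + sum(is_attack(a) for a in test_attack))
acc = correct / 200
print(acc > 0.9)   # → True
```

The toy clusters are well separated by construction; the paper's contribution is showing that real benign and attack prompts are likewise separable in the residual streams of an actual LLM.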

[LG-50] Global Clipper: Enhancing Safety and Reliability of Transformer-based Object Detection Models

链接: https://arxiv.org/abs/2406.03229
作者: Qutub Syed Sha,Michael Paulitsch,Karthik Pattabiraman,Korbinian Hagn,Fabian Oboril,Cornelius Buerkle,Kay-Ulrich Scholl,Gereon Hinz,Alois Knoll
关键词: transformer-based object detection, detection models progress, object detection models, expected to grow, object detection
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted at IJCAI-AISafety’24 Workshop

点击查看摘要

Abstract:As transformer-based object detection models progress, their impact in critical sectors like autonomous vehicles and aviation is expected to grow. Soft errors causing bit flips during inference have significantly impacted DNN performance, altering predictions. Traditional range restriction solutions for CNNs fall short for transformers. This study introduces the Global Clipper and Global Hybrid Clipper, effective mitigation strategies specifically designed for transformer-based models. It significantly enhances their resilience to soft errors and reduces faulty inferences to ~ 0%. We also detail extensive testing across over 64 scenarios involving two transformer models (DINO-DETR and Lite-DETR) and two CNN models (YOLOv3 and SSD) using three datasets, totalling approximately 3.3 million inferences, to assess model robustness comprehensively. Moreover, the paper explores unique aspects of attention blocks in transformers and their operational differences from CNNs.
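Range restriction itself is a one-line operation: profile the activation range on fault-free runs, then clamp anything outside it at inference time so a flipped bit cannot propagate an extreme value. The sketch below shows one simple clipping variant with hypothetical bounds; the paper's Global Clipper and Global Hybrid Clipper are mitigation strategies designed specifically for transformer-based detectors.

```python
import numpy as np

# Activation bounds profiled on fault-free runs (hypothetical values).
LOW, HIGH = -4.0, 4.0

def global_clipper(activations):
    # Values outside the profiled range are treated as corrupted by a
    # bit flip and clamped, so the fault cannot propagate downstream.
    return np.clip(activations, LOW, HIGH)

acts = np.array([0.3, -1.2, 2.5, 0.8])
faulty = acts.copy()
faulty[2] = 6.8e4   # a single flipped exponent bit can explode a float

clipped = global_clipper(faulty)
print(float(np.abs(clipped).max()))   # → 4.0
```

In-range activations pass through unchanged, so fault-free accuracy is preserved while out-of-range corruptions are bounded.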

[LG-51] Choice of PEFT Technique in Continual Learning: Prompt Tuning is Not All You Need

链接: https://arxiv.org/abs/2406.03216
作者: Martin Wistuba,Prabhu Teja Sivaprasad,Lukas Balles,Giovanni Zappella
关键词: combined pretrained Transformers, Recent Continual Learning, pretrained Transformers, Recent Continual, prompt tuning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recent Continual Learning (CL) methods have combined pretrained Transformers with prompt tuning, a parameter-efficient fine-tuning (PEFT) technique. We argue that the choice of prompt tuning in prior works was an undefended and unablated decision, which has been uncritically adopted by subsequent research, but warrants further research to understand its implications. In this paper, we conduct this research and find that the choice of prompt tuning as a PEFT method hurts the overall performance of the CL system. To illustrate this, we replace prompt tuning with LoRA in two state-of-the-art continual learning methods: Learning to Prompt and S-Prompts. These variants consistently achieve higher accuracy across a wide range of domain-incremental and class-incremental benchmarks, while being competitive in inference speed. Our work highlights a crucial argument: unexamined choices can hinder progress in the field, and rigorous ablations, such as the PEFT method, are required to drive meaningful adoption of CL techniques in real-world applications.
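For readers unfamiliar with the alternative the authors advocate: LoRA adapts a frozen weight matrix W by adding a trainable low-rank update BA, so each task or increment stores only 2dr parameters instead of d^2. A minimal numpy sketch of the mechanism (not the paper's full CL pipeline):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4                      # model width and LoRA rank (r << d)

W = rng.normal(size=(d, d))       # frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.01
B = np.zeros((d, r))              # B starts at zero: the adapter is a no-op

def forward(x):
    # Only A and B are trained for each task/increment; W never changes.
    return W @ x + B @ (A @ x)

x = rng.normal(size=d)
print(np.allclose(forward(x), W @ x))       # → True before any adaptation
print(2 * d * r, "trainable vs", d * d)     # → 512 trainable vs 4096
```

Because B is initialized to zero, the adapted model starts out identical to the pretrained one, and each new task only adds a small pair of matrices, which is what makes LoRA a drop-in replacement for prompt tuning in the CL methods studied here.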

[LG-52] Inferring the time-varying coupling of dynamical systems with temporal convolutional autoencoders

链接: https://arxiv.org/abs/2406.03212
作者: Josuan Calderon,Gordon J. Berman
关键词: complex dynamical systems, dynamical systems fail, introduce Temporal Autoencoders, non-linear and non-stationary, causality in complex
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:Most approaches for assessing causality in complex dynamical systems fail when the interactions between variables are inherently non-linear and non-stationary. Here we introduce Temporal Autoencoders for Causal Inference (TACI), a methodology that combines a new surrogate data metric for assessing causal interactions with a novel two-headed machine learning architecture to identify and measure the direction and strength of time-varying causal interactions. Through tests on both synthetic and real-world datasets, we demonstrate TACI’s ability to accurately quantify dynamic causal interactions across a variety of systems. Our findings display the method’s effectiveness compared to existing approaches and also highlight our approach’s potential to build a deeper understanding of the mechanisms that underlie time-varying interactions in physical and biological systems.

[LG-53] Challenges and Considerations in the Evaluation of Bayesian Causal Discovery

链接: https://arxiv.org/abs/2406.03209
作者: Amir Mohammad Karimi Mamaghan,Panagiotis Tigas,Karl Henrik Johansson,Yarin Gal,Yashas Annadani,Stefan Bauer
关键词: causal decision making, reliable causal decision, causal discovery, Representing uncertainty, Bayesian Causal Discovery
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Representing uncertainty in causal discovery is a crucial component for experimental design, and more broadly, for safe and reliable causal decision making. Bayesian Causal Discovery (BCD) offers a principled approach to encapsulating this uncertainty. Unlike non-Bayesian causal discovery, which relies on a single estimated causal graph and model parameters for assessment, evaluating BCD presents challenges due to the nature of its inferred quantity - the posterior distribution. As a result, the research community has proposed various metrics to assess the quality of the approximate posterior. However, there is, to date, no consensus on the most suitable metric(s) for evaluation. In this work, we reexamine this question by dissecting various metrics and understanding their limitations. Through extensive empirical evaluation, we find that many existing metrics fail to exhibit a strong correlation with the quality of approximation to the true posterior, especially in scenarios with low sample sizes where BCD is most desirable. We highlight the suitability (or lack thereof) of these metrics under two distinct factors: the identifiability of the underlying causal model and the quantity of available data. Both factors affect the entropy of the true posterior, indicating that the current metrics are less fitting in settings of higher entropy. Our findings underline the importance of a more nuanced evaluation of new methods by taking into account the nature of the true posterior, as well as guide and motivate the development of new evaluation procedures for this challenge.

[LG-54] Bayesian WeakS-to-Strong from Text Classification to Generation

链接: https://arxiv.org/abs/2406.03199
作者: Ziyun Cui,Ziyang Zhang,Wen Wu,Guangzhi Sun,Chao Zhang
关键词: large language models, language models raise, supervise them weakly, large language, raise the question
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Advances in large language models raise the question of how alignment techniques will adapt as models become increasingly complex and humans will only be able to supervise them weakly. Weak-to-Strong mimics such a scenario where weak model supervision attempts to harness the full capabilities of a much stronger model. This work extends Weak-to-Strong to WeakS-to-Strong by exploring an ensemble of weak models which simulate the variability in human opinions. Confidence scores are estimated using a Bayesian approach to guide the WeakS-to-Strong generalization. Furthermore, we extend the application of WeakS-to-Strong from text classification tasks to text generation tasks where more advanced strategies are investigated for supervision. Moreover, direct preference optimization is applied to advance the student model’s preference learning, beyond the basic learning framework of teacher forcing. Results demonstrate the effectiveness of the proposed approach for the reliability of a strong student model, showing potential for superalignment.
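The move from Weak-to-Strong to WeakS-to-Strong hinges on aggregating an ensemble of weak supervisors and downweighting targets on which they disagree. The toy sketch below uses across-model variance as a crude confidence signal; this heuristic is a simplification, not the paper's Bayesian confidence estimation.

```python
import numpy as np

# Hypothetical soft labels from an ensemble of weak supervisors (columns)
# for three samples (rows), simulating variability in human opinions.
weak_probs = np.array([
    [0.9, 0.8, 0.85],   # weak models agree: high confidence
    [0.6, 0.4, 0.5],    # weak models disagree: low confidence
    [0.1, 0.2, 0.15],
])

mean = weak_probs.mean(axis=1)
# Treat across-model variance as inverse confidence (a stand-in for the
# paper's Bayesian confidence scores).
conf = 1.0 / (1.0 + weak_probs.var(axis=1))

# Confidence-weighted targets for the strong student: uncertain weak
# labels are pulled toward the uninformative 0.5.
targets = conf * mean + (1 - conf) * 0.5
print(np.round(targets, 3))
```

The sample on which the weak models disagree ends up with a target at 0.5, so the strong student learns least from the supervision it can trust least.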

[LG-55] The Impossibility of Fair LLMs

链接: https://arxiv.org/abs/2406.03198
作者: Jacy Anthis,Kristian Lum,Michael Ekstrand,Avi Feller,Alexander D’Amour,Chenhao Tan
关键词: large language models, language models, increasingly clear, large language, Gemini
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)
*备注: Presented at the 1st Human-Centered Evaluation and Auditing of Language Models (HEAL) workshop at CHI 2024

点击查看摘要

Abstract:The need for fair AI is increasingly clear in the era of general-purpose systems such as ChatGPT, Gemini, and other large language models (LLMs). However, the increasing complexity of human-AI interaction and its social impacts have raised questions of how fairness standards could be applied. Here, we review the technical frameworks that machine learning researchers have used to evaluate fairness, such as group fairness and fair representations, and find that their application to LLMs faces inherent limitations. We show that each framework either does not logically extend to LLMs or presents a notion of fairness that is intractable for LLMs, primarily due to the multitudes of populations affected, sensitive attributes, and use cases. To address these challenges, we develop guidelines for the more realistic goal of achieving fairness in particular use cases: the criticality of context, the responsibility of LLM developers, and the need for stakeholder participation in an iterative process of design and evaluation. Moreover, it may eventually be possible and even necessary to use the general-purpose capabilities of AI systems to address fairness challenges as a form of scalable AI-assisted alignment.

[LG-56] Graph Neural Network Explanations are Fragile

链接: https://arxiv.org/abs/2406.03193
作者: Jiate Li,Meng Pang,Yun Dong,Jinyuan Jia,Binghui Wang
关键词: Explainable Graph Neural, Graph Neural Network, Neural Network, Explainable Graph, GNN explainers
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 17 pages, 64 figures

点击查看摘要

Abstract:Explainable Graph Neural Networks (GNNs) have emerged recently to foster trust in the use of GNNs. Existing GNN explainers are developed from various perspectives to enhance explanation performance. We take the first step in studying GNN explainers under adversarial attack: we find that an adversary slightly perturbing the graph structure can ensure that the GNN model makes correct predictions, while the GNN explainer yields a drastically different explanation on the perturbed graph. Specifically, we first formulate the attack problem under a practical threat model (i.e., the adversary has limited knowledge about the GNN explainer and a restricted perturbation budget). We then design two methods (one loss-based, the other deduction-based) to realize the attack. We evaluate our attacks on various GNN explainers, and the results show that these explainers are fragile.

[LG-57] Initialization-enhanced Physics-Informed Neural Network with Domain Decomposition (IDPINN)

链接: https://arxiv.org/abs/2406.03172
作者: Chenhao Si,Ming Yan
关键词: neural network framework, physics-informed neural network, improve prediction accuracy, physics-informed neural, enhancement of initialization
类目: Machine Learning (cs.LG)
*备注: 20 pages, 14 figures

点击查看摘要

Abstract:We propose a new physics-informed neural network framework, IDPINN, based on enhanced initialization and domain decomposition to improve prediction accuracy. We train a PINN using a small dataset to obtain an initial network structure, including the weight matrices and biases, which then initializes the PINN for each subdomain. Moreover, we leverage a smoothness condition on the interfaces to enhance prediction performance. We numerically evaluate IDPINN on several forward problems and demonstrate its benefits in terms of accuracy.

[LG-58] Topological Neural Networks go Persistent, Equivariant, and Continuous

链接: https://arxiv.org/abs/2406.03164
作者: Yogesh Verma,Amauri H Souza,Vikas Garg
关键词: Graph Neural Networks, Topological Neural Networks, Graph Neural, incorporate higher-order relational, enabling richer representations
类目: Machine Learning (cs.LG)
*备注: Accepted to ICML 2024

点击查看摘要

Abstract:Topological Neural Networks (TNNs) incorporate higher-order relational information beyond pairwise interactions, enabling richer representations than Graph Neural Networks (GNNs). Concurrently, topological descriptors based on persistent homology (PH) are being increasingly employed to augment the GNNs. We investigate the benefits of integrating these two paradigms. Specifically, we introduce TopNets as a broad framework that subsumes and unifies various methods in the intersection of GNNs/TNNs and PH such as (generalizations of) RePHINE and TOGL. TopNets can also be readily adapted to handle (symmetries in) geometric complexes, extending the scope of TNNs and PH to spatial settings. Theoretically, we show that PH descriptors can provably enhance the expressivity of simplicial message-passing networks. Empirically, (continuous and E(n)-equivariant extensions of) TopNets achieve strong performance across diverse tasks, including antibody design, molecular dynamics simulation, and drug property prediction.

[LG-59] Ethical considerations of use of hold-out sets in clinical prediction model management

链接: https://arxiv.org/abs/2406.03161
作者: Louis Chislett,Louis JM Aslett,Alisha R Davies,Catalina A Vallejos,James Liley
关键词: machine learning models, Clinical prediction models, Clinical prediction, machine learning, prediction models
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注:

点击查看摘要

Abstract:Clinical prediction models are statistical or machine learning models used to quantify the risk of a certain health outcome using patient data. These can then inform potential interventions on patients, causing an effect called performative prediction: predictions inform interventions which influence the outcome they were trying to predict, leading to a potential underestimation of risk in some patients if a model is updated on this data. One suggested resolution to this is the use of hold-out sets, in which a set of patients do not receive model-derived risk scores, such that a model can be safely retrained. We present an overview of clinical and research ethics regarding the potential implementation of hold-out sets for clinical prediction models in health settings. We focus on the ethical principles of beneficence, non-maleficence, autonomy and justice. We also discuss informed consent, clinical equipoise, and truth-telling. We present illustrative cases of potential hold-out set implementations and discuss statistical issues arising from different hold-out set sampling methods. We also discuss differences between hold-out sets and randomised control trials, in terms of ethics and statistical issues. Finally, we give practical recommendations for researchers interested in the use of hold-out sets for clinical prediction models.

[LG-60] Detecting Model Misspecification in Amortized Bayesian Inference with Neural Networks: An Extended Investigation

链接: https://arxiv.org/abs/2406.03154
作者: Marvin Schmitt,Paul-Christian Bürkner,Ullrich Köthe,Stefan T. Radev
关键词: efficient amortized Bayesian, amortized Bayesian inference, probabilistic deep learning, deep learning enable, learning enable efficient
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Extended version of the conference paper this https URL . arXiv admin note: text overlap with arXiv:2112.08866

点击查看摘要

Abstract:Recent advances in probabilistic deep learning enable efficient amortized Bayesian inference in settings where the likelihood function is only implicitly defined by a simulation program (simulation-based inference; SBI). But how faithful is such inference if the simulation represents reality somewhat inaccurately, that is, if the true system behavior at test time deviates from the one seen during training? We conceptualize the types of such model misspecification arising in SBI and systematically investigate how the performance of neural posterior approximators gradually deteriorates as a consequence, making inference results less and less trustworthy. To notify users about this problem, we propose a new misspecification measure that can be trained in an unsupervised fashion (i.e., without training data from the true distribution) and reliably detects model misspecification at test time. Our experiments clearly demonstrate the utility of our new measure both on toy examples with an analytical ground-truth and on representative scientific tasks in cell biology, cognitive decision making, disease outbreak dynamics, and computer vision. We show how the proposed misspecification test warns users about suspicious outputs, raises an alarm when predictions are not trustworthy, and guides model designers in their search for better simulators.
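The test-time warning mechanism can be caricatured in a few lines: characterize the summary statistics of simulator outputs seen during training, then flag observed data whose summary falls far outside that cloud. The sketch below uses a hypothetical one-dimensional summary and a simple z-score; the paper instead learns a summary space with a network and trains the misspecification measure in an unsupervised fashion.

```python
import numpy as np

rng = np.random.default_rng(1)

# Summary statistics of simulator outputs seen during training
# (assumed 1-D here for simplicity).
train_summaries = rng.normal(0.0, 1.0, size=5000)
mu, sigma = train_summaries.mean(), train_summaries.std()

def misspecification_score(observed_summary):
    # Distance of the observed data's summary from the training cloud;
    # large values signal that the simulator may not match reality.
    return abs(observed_summary - mu) / sigma

well_specified = 0.1    # summary of data the simulator explains well
misspecified = 6.0      # summary far outside anything seen in training

print(misspecification_score(well_specified) < 3.0)   # → True
print(misspecification_score(misspecified) > 3.0)     # → True
```

A score beyond some threshold raises the alarm that the posterior approximator is extrapolating and its inferences should not be trusted.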

[LG-61] Dynamic Spectral Clustering with Provable Approximation Guarantee

链接: https://arxiv.org/abs/2406.03152
作者: Steinar Laenen,He Sun
关键词: underlying cluster structure, dynamically evolving graphs, gradually change, paper studies clustering, dynamically evolving
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注: This work is accepted at the 41st International Conference on Machine Learning (ICML’24)

点击查看摘要

Abstract:This paper studies clustering algorithms for dynamically evolving graphs $\{G_t\}$, in which new edges (and potentially new vertices) are added to a graph, and the underlying cluster structure of the graph can gradually change. The paper proves that, under some mild condition on the cluster structure, the clusters of the final graph $G_T$ of $n_T$ vertices at time $T$ can be well approximated by a dynamic variant of the spectral clustering algorithm. The algorithm runs in amortised update time $O(1)$ and query time $o(n_T)$. Experimental studies on both synthetic and real-world datasets further confirm the practicality of our designed algorithm.
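The static computation being approximated is standard spectral clustering: form the graph Laplacian, take the eigenvector of the second-smallest eigenvalue (the Fiedler vector), and split vertices by its sign. The paper's contribution is maintaining this under edge insertions with amortised O(1) updates; the sketch below shows only the static step on a toy two-cluster graph.

```python
import numpy as np

# Two dense clusters (vertices 0-3 and 4-7) joined by a single bridge edge.
A = np.zeros((8, 8))
for block in (range(0, 4), range(4, 8)):
    for i in block:
        for j in block:
            if i != j:
                A[i, j] = 1.0
A[3, 4] = A[4, 3] = 1.0               # the sparse cut between the clusters

L = np.diag(A.sum(axis=1)) - A        # unnormalised graph Laplacian
eigvals, eigvecs = np.linalg.eigh(L)
fiedler = eigvecs[:, 1]               # eigenvector of 2nd-smallest eigenvalue
labels = (fiedler > 0).astype(int)

# The sign pattern of the Fiedler vector recovers the two clusters.
print(labels[:4], labels[4:])
```

Recomputing this eigendecomposition after every edge insertion costs far more than O(1) per update, which is exactly the gap the dynamic variant closes.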

[LG-62] Which Side Are You On? A Multi-task Dataset for End-to-End Argument Summarisation and Evaluation

链接: https://arxiv.org/abs/2406.03151
作者: Hao Li,Yuping Wu,Viktor Schlegel,Riza Batista-Navarro,Tharindu Madusanka,Iqra Zahid,Jiayan Zeng,Xiaochi Wang,Xinran He,Yizhi Li,Goran Nenadic
关键词: large language models, synthesise persuasive arguments, language models, recent advances, advances of large
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: Published on ACL 2024 Findings

点击查看摘要

Abstract:With the recent advances of large language models (LLMs), it is no longer infeasible to build an automated debate system that helps people to synthesise persuasive arguments. Previous work attempted this task by integrating multiple components. In our work, we introduce an argument mining dataset that captures the end-to-end process of preparing an argumentative essay for a debate, which covers the tasks of claim and evidence identification (Task 1 ED), evidence convincingness ranking (Task 2 ECR), argumentative essay summarisation and human preference ranking (Task 3 ASR) and metric learning for automated evaluation of resulting essays, based on human feedback along argument quality dimensions (Task 4 SQE). Our dataset contains 14k examples of claims that are fully annotated with the various properties supporting the aforementioned tasks. We evaluate multiple generative baselines for each of these tasks, including representative LLMs. We find that, while they show promising results on individual tasks in our benchmark, their end-to-end performance on all four tasks in succession deteriorates significantly, both in automated measures as well as in human-centred evaluation. This challenge presented by our proposed dataset motivates future research on end-to-end argument mining and summarisation. The repository of this project is available at this https URL

[LG-63] Sample-specific Masks for Visual Reprogramming-based Prompting

链接: https://arxiv.org/abs/2406.03150
作者: Chengyi Cai,Zesheng Ye,Lei Feng,Jianzhong Qi,Feng Liu
关键词: medical data prediction, tuning considerable parameters, Visual reprogramming, small-scale pattern added, classifier on ImageNet
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Visual reprogramming (VR) is a prompting technique that aims to re-purpose a pre-trained model (e.g., a classifier on ImageNet) to target tasks (e.g., medical data prediction) by learning a small-scale pattern added into input images instead of tuning considerable parameters within the model. The location of the pattern within input samples is usually determined by a pre-defined mask shared across all samples. In this paper, we show that the shared mask potentially limits VR’s generalization and increases its approximation error due to the lack of sample-level adaptation. Motivated by this finding, we design a new framework for VR called sample-specific multi-channel masks (SMM). Specifically, SMM employs a lightweight ConvNet and patch-wise interpolation to generate sample-specific three-channel masks instead of a shared and pre-defined mask. Since we generate different masks for individual samples, SMM is theoretically shown to reduce approximation error for the target tasks compared with existing state-of-the-art VR methods. We also empirically demonstrate its performance gain on both ResNet and ViT. The success of SMM further highlights the broader applicability of VR in leveraging the latent knowledge of pre-trained models for various target tasks. Our code is available at this https URL.

[LG-64] Aligning Transformers with Weisfeiler-Leman

链接: https://arxiv.org/abs/2406.03148
作者: Luis Müller,Christopher Morris
关键词: well-understood expressive power, offer theoretically well-understood, theoretically well-understood expressive, dimensional Weisfeiler, Graph neural network
类目: Machine Learning (cs.LG)
*备注: Accepted at ICML 2024

点击查看摘要

Abstract:Graph neural network architectures aligned with the k-dimensional Weisfeiler–Leman (k-WL) hierarchy offer theoretically well-understood expressive power. However, these architectures often fail to deliver state-of-the-art predictive performance on real-world graphs, limiting their practical utility. While recent works aligning graph transformer architectures with the k-WL hierarchy have shown promising empirical results, employing transformers for higher orders of k remains challenging due to a prohibitive runtime and memory complexity of self-attention as well as impractical architectural assumptions, such as an infeasible number of attention heads. Here, we advance the alignment of transformers with the k-WL hierarchy, showing stronger expressivity results for each k, making them more feasible in practice. In addition, we develop a theoretical framework that allows the study of established positional encodings such as Laplacian PEs and SPE. We evaluate our transformers on the large-scale PCQM4Mv2 dataset, showing competitive predictive performance with the state-of-the-art and demonstrating strong downstream performance when fine-tuning them on small-scale molecular datasets. Our code is available at this https URL.

[LG-65] Tiny models from tiny data: Textual and null-text inversion for few-shot distillation

链接: https://arxiv.org/abs/2406.03146
作者: Erik Landolsi,Fredrik Kahl
关键词: Few-shot image classification, involves classifying images, image classification involves, classification involves classifying, image classification
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 21 pages (9 main pages + references and appendix)

点击查看摘要

Abstract:Few-shot image classification involves classifying images using very few training examples. Recent vision foundation models show excellent few-shot transfer abilities, but are large and slow at inference. Using knowledge distillation, the capabilities of high-performing but slow models can be transferred to tiny, efficient models. However, common distillation methods require a large set of unlabeled data, which is not available in the few-shot setting. To overcome this lack of data, there has been a recent interest in using synthetic data. We expand on this work by presenting a novel diffusion model inversion technique (TINT) combining the diversity of textual inversion with the specificity of null-text inversion. Using this method in a few-shot distillation pipeline leads to state-of-the-art accuracy among small student models on popular benchmarks, while being significantly faster than prior work. This allows us to push even tiny models to high accuracy using only a tiny application-specific dataset, albeit relying on extra data for pre-training. Popular few-shot benchmarks involve evaluation over a large number of episodes, which is computationally cumbersome for methods involving synthetic data generation. Therefore, we also present a theoretical analysis on how the variance of the accuracy estimator depends on the number of episodes and query examples, and use these results to lower the computational effort required for method evaluation. In addition, to further motivate the use of generative models in few-shot distillation, we demonstrate that our method performs better compared to training on real data mined from the dataset used to train the diffusion model. Source code will be made available at this https URL. 

[LG-66] E(n) Equivariant Message Passing Cellular Networks

链接: https://arxiv.org/abs/2406.03145
作者: Veljko Kovac(1),Erik J. Bekkers(1, 2),Pietro Liò(3),Floor Eijkelboom(1, 2, 4) ((1) University of Amsterdam, (2) AMLab, (3) Department of Computer Science and Technology, University of Cambridge, (4) UvA-Bosch Delta Lab)
关键词: Equivariant Graph Neural, Equivariant Message Passing, Passing Cellular Networks, Message Passing Cellular, Graph Neural Networks
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper introduces E(n) Equivariant Message Passing Cellular Networks (EMPCNs), an extension of E(n) Equivariant Graph Neural Networks to CW-complexes. Our approach addresses two aspects of geometric message passing networks: 1) enhancing their expressiveness by incorporating arbitrary cells, and 2) achieving this in a computationally efficient way with a decoupled EMPCNs technique. We demonstrate that EMPCNs achieve close to state-of-the-art performance on multiple tasks without the need for steerability, including many-body predictions and motion capture. Moreover, ablation studies confirm that decoupled EMPCNs exhibit stronger generalization capabilities than their non-topologically informed counterparts. These findings show that EMPCNs can be used as a scalable and expressive framework for higher-order message passing in geometric and topological graphs.

[LG-67] On the Power of Randomization in Fair Classification and Representation

链接: https://arxiv.org/abs/2406.03142
作者: Sushant Agarwal,Amit Deshpande
关键词: fair machine learning, Fair, data distribution, unsupervised fair machine, Fair classification
类目: Machine Learning (cs.LG)
*备注: Appeared in ACM FAccT 2022

点击查看摘要

Abstract:Fair classification and fair representation learning are two important problems in supervised and unsupervised fair machine learning, respectively. Fair classification asks for a classifier that maximizes accuracy on a given data distribution subject to fairness constraints. Fair representation maps a given data distribution over the original feature space to a distribution over a new representation space such that all classifiers over the representation satisfy fairness. In this paper, we examine the power of randomization in both these problems to minimize the loss of accuracy that results when we impose fairness constraints. Previous work on fair classification has characterized the optimal fair classifiers on a given data distribution that maximize accuracy subject to fairness constraints, e.g., Demographic Parity (DP), Equal Opportunity (EO), and Predictive Equality (PE). We refine these characterizations to demonstrate when the optimal randomized fair classifiers can surpass their deterministic counterparts in accuracy. We also show how the optimal randomized fair classifier that we characterize can be obtained as a solution to a convex optimization problem. Recent work has provided techniques to construct fair representations for a given data distribution such that any classifier over this representation satisfies DP. However, the classifiers on these fair representations either come with no or weak accuracy guarantees when compared to the optimal fair classifier on the original data distribution. Extending our ideas for randomized fair classification, we improve on these works, and construct DP-fair, EO-fair, and PE-fair representations that have provably optimal accuracy and suffer no accuracy loss compared to the optimal DP-fair, EO-fair, and PE-fair classifiers respectively on the original data distribution.
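As a concrete illustration of why randomization helps, here is a small sketch (our own construction, not the paper's) of a randomized demographic-parity classifier: each group accepts deterministically above a score threshold and with a fractional probability exactly at the boundary, so every group's expected positive rate hits the target even when no deterministic threshold can:

```python
import numpy as np

def dp_accept_probs(scores, groups, target_rate):
    """Per-group acceptance probabilities: top scores accepted with
    probability 1, the boundary point with a fractional probability,
    so each group's expected positive rate equals target_rate exactly."""
    probs = np.zeros(len(scores))
    for g in np.unique(groups):
        idx = np.where(groups == g)[0]
        order = idx[np.argsort(-scores[idx])]       # best-scored first
        k = target_rate * len(idx)
        full, frac = int(k), k - int(k)
        probs[order[:full]] = 1.0
        if frac > 1e-12 and full < len(idx):
            probs[order[full]] = frac                # randomize on the boundary
    return probs

scores = np.array([0.9, 0.8, 0.4, 0.1, 0.7, 0.6, 0.5, 0.3, 0.2])
groups = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1])
probs = dp_accept_probs(scores, groups, 0.5)
```

With group sizes 4 and 5, a deterministic threshold cannot give both groups a positive rate of exactly 0.5; the fractional acceptance on one boundary point can.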

[LG-68] Continual Traffic Forecasting via Mixture of Experts

链接: https://arxiv.org/abs/2406.03140
作者: Sanghyun Lee,Chanyoung Park
关键词: networks undergo expansion, patterns continually evolve, traffic networks undergo, traffic patterns continually, evolve over time
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The real-world traffic networks undergo expansion through the installation of new sensors, implying that the traffic patterns continually evolve over time. Incrementally training a model on the newly added sensors would make the model forget the past knowledge, i.e., catastrophic forgetting, while retraining the model on the entire network to capture these changes is highly inefficient. To address these challenges, we propose a novel Traffic Forecasting Mixture of Experts (TFMoE) for traffic forecasting under evolving networks. The main idea is to segment the traffic flow into multiple homogeneous groups, and assign an expert model responsible for a specific group. This allows each expert model to concentrate on learning and adapting to a specific set of patterns, while minimizing interference between the experts during training, thereby preventing the dilution or replacement of prior knowledge, which is a major cause of catastrophic forgetting. Through extensive experiments on a real-world long-term streaming network dataset, PEMSD3-Stream, we demonstrate the effectiveness and efficiency of TFMoE. Our results showcase superior performance and resilience in the face of catastrophic forgetting, underscoring the effectiveness of our approach in dealing with continual learning for traffic flow forecasting in long-term streaming networks.
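The group-then-route idea can be sketched in a few lines. This is a deliberately trivial stand-in (grouping by mean traffic level, each "expert" being the group's mean profile) rather than TFMoE itself; all names here are our own:

```python
import numpy as np

def fit_moe(series, n_groups=2):
    """Segment traffic series into homogeneous groups by mean level and
    fit one trivial 'expert' (the group's mean profile) per group."""
    levels = series.mean(axis=1)
    order = np.argsort(levels)
    groups = np.array_split(order, n_groups)         # equal-size level groups
    centroids = np.array([levels[g].mean() for g in groups])
    experts = [series[g].mean(axis=0) for g in groups]
    return centroids, experts

def predict(level, centroids, experts):
    """Route a sensor (by its mean traffic level) to the nearest expert."""
    return experts[int(np.argmin(np.abs(centroids - level)))]

series = np.array([[0., 0, 0], [1, 1, 1], [10, 10, 10], [11, 11, 11]])
centroids, experts = fit_moe(series)
pred = predict(9.5, centroids, experts)
```

The point of the architecture is that a new sensor only updates the expert it routes to, leaving the other experts (and their learned patterns) untouched.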

[LG-69] Computational Limits of Low-Rank Adaptation (LoRA) for Transformer-Based Models

链接: https://arxiv.org/abs/2406.03136
作者: Jerry Yao-Chieh Hu,Maojiang Su,En-Jui Kuo,Zhao Song,Han Liu
关键词: finetuning transformer-based models, mathbf, fine-grained complexity theory, Exponential Time Hypothesis, Strong Exponential Time
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We study the computational limits of Low-Rank Adaptation (LoRA) update for finetuning transformer-based models using fine-grained complexity theory. Our key observation is that the existence of low-rank decompositions within the gradient computation of LoRA adaptation leads to possible algorithmic speedup. This allows us to (i) identify a phase transition behavior and (ii) prove the existence of nearly linear algorithms by controlling the LoRA update computation term by term, assuming the Strong Exponential Time Hypothesis (SETH). For the former, we identify a sharp transition in the efficiency of all possible rank-r LoRA update algorithms for transformers, based on specific norms resulting from the multiplications of the input sequence \mathbf{X}, pretrained weights \mathbf{W}^\star, and adapter matrices \alpha\mathbf{B}\mathbf{A}/r. Specifically, we derive a shared upper bound threshold for such norms and show that efficient (sub-quadratic) approximation algorithms of LoRA exist only below this threshold. For the latter, we prove the existence of nearly linear approximation algorithms for LoRA adaptation by utilizing the hierarchical low-rank structures of LoRA gradients and approximating the gradients with a series of chained low-rank approximations. To showcase our theory, we consider two practical scenarios: partial (e.g., only \mathbf{W}_V and \mathbf{W}_Q) and full adaptations (e.g., \mathbf{W}_Q, \mathbf{W}_V, and \mathbf{W}_K) of weights in attention heads.
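For readers less familiar with LoRA, the update being analyzed is y = xW^T + (α/r)·x(BA)^T, where only the low-rank factors A and B are trained. A minimal numpy sketch (standard LoRA, not the paper's complexity machinery; the dimensions are arbitrary) shows why the correction is cheap — it is applied factor-by-factor, never forming the d×d product BA:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 64, 4, 8.0

W = rng.standard_normal((d, d))   # frozen pretrained weight W*
A = rng.standard_normal((r, d))   # LoRA factor A (r x d)
B = np.zeros((d, r))              # LoRA factor B, zero-initialised

def lora_forward(x, W, A, B, alpha, r):
    """y = x W^T + (alpha/r) x (BA)^T, computed through the factors so the
    correction costs O(d*r) per token instead of O(d^2)."""
    return x @ W.T + (alpha / r) * ((x @ A.T) @ B.T)

x = rng.standard_normal((2, d))
y = lora_forward(x, W, A, B, alpha, r)
```

With B zero-initialised, the adapted model starts exactly at the pretrained one, and the trainable parameter count is 2dr = 512 versus d² = 4096 for a full update.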

[LG-70] MESS: Modern Electronic Structure Simulations

链接: https://arxiv.org/abs/2406.03121
作者: Hatem Helal,Andrew Fitzgibbon
关键词: quantitative scientific insights, provide quantitative scientific, Electronic structure simulation, atomistic scale, enabling advances
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Computational Physics (physics.comp-ph)
*备注:

点击查看摘要

Abstract:Electronic structure simulation (ESS) has been used for decades to provide quantitative scientific insights on an atomistic scale, enabling advances in chemistry, biology, and materials science, among other disciplines. Following standard practice in scientific computing, the software packages driving these studies have been implemented in compiled languages such as FORTRAN and C. However, the recent introduction of machine learning (ML) into these domains has meant that ML models must be coded in these languages, or that complex software bridges have to be built between ML models in Python and these large compiled software systems. This is in contrast with recent progress in modern ML frameworks which aim to optimise both ease of use and high performance by harnessing hardware acceleration of tensor programs defined in Python. We introduce MESS: a modern electronic structure simulation package implemented in JAX; porting the ESS code to the ML world. We outline the costs and benefits of following the software development practices used in ML for this important scientific workload. MESS shows significant speedups on widely available hardware accelerators and simultaneously opens a clear pathway towards combining ESS with ML. MESS is available at this https URL.

[LG-71] DEER: A Delay-Resilient Framework for Reinforcement Learning with Variable Delays

链接: https://arxiv.org/abs/2406.03102
作者: Bo Xia,Yilun Kong,Yongzhe Chang,Bo Yuan,Zhiheng Li,Xueqian Wang,Bin Liang
关键词: Classic reinforcement learning, frequently confronts challenges, Markov assumption, Classic reinforcement, tasks involving delays
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Classic reinforcement learning (RL) frequently confronts challenges in tasks involving delays, which cause a mismatch between received observations and subsequent actions, thereby deviating from the Markov assumption. Existing methods usually tackle this issue with end-to-end solutions using state augmentation. However, these black-box approaches often involve incomprehensible processes and redundant information in the information states, causing instability and potentially undermining the overall performance. To alleviate the delay challenges in RL, we propose DEER (Delay-resilient Encoder-Enhanced RL), a framework designed to effectively enhance the interpretability and address the random delay issues. DEER employs a pretrained encoder, trained on delay-free environment datasets, to map delayed states, along with their variable-length past action sequences resulting from different delays, into hidden states. In a variety of delayed scenarios, the trained encoder can seamlessly integrate with standard RL algorithms without requiring additional modifications and enhance the delay-solving capability by simply adapting the input dimension of the original algorithms. We evaluate DEER through extensive experiments on Gym and Mujoco environments. The results confirm that DEER is superior to state-of-the-art RL algorithms in both constant and random delay settings.
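The core intuition — recover the current state from a delayed observation plus the actions taken since — can be shown on a system whose dynamics are known. This is a hand-written scalar stand-in for what DEER's learned encoder does, with made-up dynamics s' = a·s + b·u:

```python
def delay_resilient_state(delayed_obs, pending_actions, a=0.9, b=0.1):
    """Roll a delayed observation forward through the variable-length
    action sequence executed since it was recorded, on a known scalar
    linear system s' = a*s + b*u (a toy stand-in for DEER's encoder)."""
    s = delayed_obs
    for u in pending_actions:
        s = a * s + b * u
    return s

# observation is 2 steps stale; two actions were taken in the meantime
est = delay_resilient_state(0.0, [1.0, 1.0])
```

The learned encoder generalizes this idea to unknown dynamics and arbitrary delay lengths, which is why it can be bolted onto a standard RL algorithm by just widening its input.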

[LG-72] Graph Convolutional Branch and Bound

链接: https://arxiv.org/abs/2406.03099
作者: Lorenzo Sciandra,Roberto Esposito,Andrea Cesare Grosso,Laura Sacerdote,Cristina Zucca
关键词: deep learning model, optimization pipeline, article demonstrates, demonstrates the effectiveness, effectiveness of employing
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: Submitted to European Journal of Operational Research

点击查看摘要

Abstract:This article demonstrates the effectiveness of employing a deep learning model in an optimization pipeline. Specifically, in a generic exact algorithm for an NP-hard problem, multiple heuristic criteria are usually used to guide the search for the optimum within the set of all feasible solutions. In this context, neural networks can be leveraged to rapidly acquire valuable information, enabling the identification of a more expedient path in this vast space. After an explanation of the tackled traveling salesman problem, the branch and bound implemented for its classical resolution is described. This algorithm is then compared with its hybrid version, termed “graph convolutional branch and bound”, that integrates the previous branch and bound with a graph convolutional neural network. The empirical results obtained highlight the efficacy of this approach, leading to conclusive findings and suggesting potential directions for future research.
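The skeleton being hybridized is an ordinary branch and bound for the TSP. In the sketch below (our own toy, not the paper's implementation), branches are ordered by a heuristic edge score — plain distance here, exactly the place where the paper would instead consult the graph convolutional network — and partial tours that already match the incumbent cost are pruned:

```python
def tsp_branch_and_bound(dist):
    """Depth-first branch and bound for the TSP on a distance matrix.
    The sort key below is the pluggable heuristic: nearest-city here,
    a learned GCN edge score in the hybrid version."""
    n = len(dist)
    best = {"cost": float("inf"), "tour": None}

    def branch(path, cost):
        if cost >= best["cost"]:
            return                                  # bound: prune subtree
        if len(path) == n:
            total = cost + dist[path[-1]][path[0]]  # close the tour
            if total < best["cost"]:
                best["cost"], best["tour"] = total, path[:]
            return
        last = path[-1]
        for nxt in sorted((j for j in range(n) if j not in path),
                          key=lambda j: dist[last][j]):
            branch(path + [nxt], cost + dist[last][nxt])

    branch([0], 0.0)
    return best["cost"], best["tour"]

# unit square: sides cost 1, diagonals cost 2; optimal tour costs 4
square = [[0, 1, 2, 1],
          [1, 0, 1, 2],
          [2, 1, 0, 1],
          [1, 2, 1, 0]]
cost, tour = tsp_branch_and_bound(square)
```

A better branch ordering finds good incumbents sooner, which tightens the bound earlier and prunes more of the search tree — the mechanism the GCN is meant to exploit.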

[LG-73] Enhancing the Resilience of Graph Neural Networks to Topological Perturbations in Sparse Graphs

链接: https://arxiv.org/abs/2406.03097
作者: Shuqi He,Jun Zhuang,Ding Wang,Luyao Peng,Jun Song
关键词: Graph neural networks, neural networks, extensively employed, label, Graph neural
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Graph neural networks (GNNs) have been extensively employed in node classification. Nevertheless, recent studies indicate that GNNs are vulnerable to topological perturbations, such as adversarial attacks and edge disruptions. Considerable efforts have been devoted to mitigating these challenges. For example, pioneering Bayesian methodologies, including GraphSS and LlnDT, incorporate Bayesian label transitions and topology-based label sampling to strengthen the robustness of GNNs. However, GraphSS is hindered by slow convergence, while LlnDT faces challenges in sparse graphs. To overcome these limitations, we propose a novel label inference framework, TraTopo, which combines topology-driven label propagation, Bayesian label transitions, and link analysis via random walks. TraTopo significantly surpasses its predecessors on sparse graphs by utilizing random walk sampling, specifically targeting isolated nodes for link prediction, thus enhancing its effectiveness in topological sampling contexts. Additionally, TraTopo employs a shortest-path strategy to refine link prediction, thereby reducing predictive overhead and improving label inference accuracy. Empirical evaluations highlight TraTopo’s superiority in node classification, significantly exceeding contemporary GCN models in accuracy.
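One ingredient shared by GraphSS, LlnDT, and TraTopo is label propagation over the graph topology. A plain (non-Bayesian) version is easy to sketch; everything here is a generic textbook construction, not TraTopo itself:

```python
import numpy as np

def propagate_labels(adj, labels, n_iter=100):
    """Iterative label propagation: unlabeled nodes repeatedly absorb the
    degree-normalised average of their neighbours' class scores, with the
    known labels clamped after every round."""
    n = adj.shape[0]
    classes = sorted({l for l in labels if l is not None})
    Y = np.zeros((n, len(classes)))
    fixed = [i for i, l in enumerate(labels) if l is not None]
    for i in fixed:
        Y[i, classes.index(labels[i])] = 1.0
    P = adj / np.maximum(adj.sum(axis=1, keepdims=True), 1e-12)
    F = Y.copy()
    for _ in range(n_iter):
        F = P @ F
        F[fixed] = Y[fixed]          # clamp the observed labels
    return [classes[int(k)] for k in F.argmax(axis=1)]

# path graph 0-1-2-3-4 with labels only at the two ends
adj = np.zeros((5, 5))
for i in range(4):
    adj[i, i + 1] = adj[i + 1, i] = 1.0
pred = propagate_labels(adj, ["A", None, None, None, "B"])
```

The failure mode motivating TraTopo is visible in this formulation: an isolated node has no neighbours to absorb from, which is why the framework adds random-walk-based link prediction before propagating.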

[LG-74] EgoSurgery-Tool: A Dataset of Surgical Tool and Hand Detection from Egocentric Open Surgery Videos

链接: https://arxiv.org/abs/2406.03095
作者: Ryo Fujii,Hideo Saito,Hiroyuki Kajita
关键词: Surgical tool, Surgical, fundamental task, open surgery videos, open surgery
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Surgical tool detection is a fundamental task for understanding egocentric open surgery videos. However, detecting surgical tools presents significant challenges due to their highly imbalanced class distribution, similar shapes and similar textures, and heavy occlusion. The lack of a comprehensive large-scale dataset compounds these challenges. In this paper, we introduce EgoSurgery-Tool, an extension of the existing EgoSurgery-Phase dataset, which contains real open surgery videos captured using an egocentric camera attached to the surgeon’s head, along with phase annotations. EgoSurgery-Tool has been densely annotated with surgical tools and comprises over 49K surgical tool bounding boxes across 15 categories, constituting a large-scale surgical tool detection dataset. EgoSurgery-Tool also provides annotations for hand detection with over 46K hand-bounding boxes, capturing hand-object interactions that are crucial for understanding activities in egocentric open surgery. EgoSurgery-Tool is superior to existing datasets due to its larger scale, greater variety of surgical tools, more annotations, and denser scenes. We conduct a comprehensive analysis of EgoSurgery-Tool using nine popular object detectors to assess their effectiveness in both surgical tool and hand detection. The dataset will be released at this https URL.

[LG-75] HASS: Hardware-Aware Sparsity Search for Dataflow DNN Accelerator

链接: https://arxiv.org/abs/2406.03088
作者: Zhewen Yu,Sudarshan Sreeram,Krish Agrawal,Junyi Wu,Alexander Montgomerie-Corcoran,Cheng Zhang,Jianyi Cheng,Christos-Savvas Bouganis,Yiren Zhao
关键词: Deep Neural Networks, Deep Neural, Neural Networks, learning hierarchical representations, hierarchical representations
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注: accepted to FPL2024

点击查看摘要

Abstract:Deep Neural Networks (DNNs) excel in learning hierarchical representations from raw data, such as images, audio, and text. To compute these DNN models with high performance and energy efficiency, these models are usually deployed onto customized hardware accelerators. Among various accelerator designs, dataflow architecture has shown promising performance due to its layer-pipelined structure and its scalability in data parallelism. Exploiting weights and activations sparsity can further enhance memory storage and computation efficiency. However, existing approaches focus on exploiting sparsity in non-dataflow accelerators, which cannot be applied onto dataflow accelerators because of the large hardware design space introduced. As such, this could miss opportunities to find an optimal combination of sparsity features and hardware designs. In this paper, we propose a novel approach to exploit unstructured weights and activations sparsity for dataflow accelerators, using software and hardware co-optimization. We propose a Hardware-Aware Sparsity Search (HASS) to systematically determine an efficient sparsity solution for dataflow accelerators. Over a set of models, we achieve an efficiency improvement ranging from 1.3\times to 4.2\times compared to existing sparse designs, which are either non-dataflow or non-hardware-aware. Particularly, the throughput of MobileNetV3 can be optimized to 4895 images per second. HASS is open-source: this https URL

[LG-76] Lossless Image Compression Using Multi-level Dictionaries: Binary Images

链接: https://arxiv.org/abs/2406.03087
作者: Samar Agnihotri,Renu Rameshan,Ritwik Ghosal
关键词: Lossless image compression, image compression, Lossless image, information loss compared, compression
类目: Information Theory (cs.IT); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 11 pages, 7 figures, and 5 tables

点击查看摘要

Abstract:Lossless image compression is required in various applications to reduce storage or transmission costs of images, while requiring the reconstructed images to have zero information loss compared to the original. Existing lossless image compression methods either have simple design but poor compression performance, or complex design, better performance, but with no performance guarantees. In our endeavor to develop a lossless image compression method with low complexity and guaranteed performance, we argue that compressibility of a color image is essentially derived from the patterns in its spatial structure, intensity variations, and color variations. Thus, we divide the overall design of a lossless image compression scheme into three parts that exploit corresponding redundancies. We further argue that the binarized version of an image captures its fundamental spatial structure and in this work, we propose a scheme for lossless compression of binary images. The proposed scheme first learns dictionaries of 16\times16, 8\times8, 4\times4, and 2\times2 square pixel patterns from various datasets of binary images. It then uses these dictionaries to encode binary images. These dictionaries have various interesting properties that are further exploited to construct an efficient scheme. Our preliminary results show that the proposed scheme consistently outperforms existing conventional and learning based lossless compression approaches, and provides, on average, as much as 1.5\times better performance than a common general purpose lossless compression scheme (WebP), more than 3\times better performance than a state of the art learning based scheme, and better performance than a specialized scheme for binary image compression (JBIG2).
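The dictionary-of-patches idea can be demonstrated with the trivial dictionary of all 2×2 binary patterns, where a patch's index is simply its bits packed into an integer (the paper's learned dictionaries replace this with frequently occurring patterns). A round-trip sketch, entirely our own construction:

```python
import numpy as np

def encode_blocks(img, b=2):
    """Code a binary image as a stream of dictionary indices, one per
    b x b patch; here the index is just the patch bits packed into an int."""
    h, w = img.shape
    return [int("".join(str(v) for v in img[i:i+b, j:j+b].flatten()), 2)
            for i in range(0, h, b) for j in range(0, w, b)]

def decode_blocks(codes, shape, b=2):
    """Invert encode_blocks: unpack each index back into a b x b patch."""
    out = np.zeros(shape, dtype=int)
    it = iter(codes)
    for i in range(0, shape[0], b):
        for j in range(0, shape[1], b):
            bits = format(next(it), f"0{b*b}b")
            out[i:i+b, j:j+b] = np.fromiter(map(int, bits), int).reshape(b, b)
    return out

img = np.array([[1, 0, 1, 1],
                [0, 0, 1, 1],
                [1, 1, 0, 0],
                [1, 1, 0, 1]])
codes = encode_blocks(img)
```

Compression then comes from entropy-coding the index stream: if a learned dictionary puts the common patterns (e.g., all-zeros and all-ones blocks) at short codewords, typical binary images shrink substantially.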

[LG-77] Task-Oriented Wireless Communications for Collaborative Perception in Intelligent Unmanned Systems

链接: https://arxiv.org/abs/2406.03086
作者: Sheng Zhou,Yukuan Jia,Ruiqing Mao,Zhaojun Nan,Yuxuan Sun,Zhisheng Niu
关键词: reliable environmental perception, intelligent unmanned systems, shown great potential, Collaborative Perception, environmental perception
类目: Multiagent Systems (cs.MA); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注: Accepted by IEEE Network Magazine

点击查看摘要

Abstract:Collaborative Perception (CP) has shown great potential to achieve more holistic and reliable environmental perception in intelligent unmanned systems (IUSs). However, implementing CP still faces key challenges due to the characteristics of the CP task and the dynamics of wireless channels. In this article, a task-oriented wireless communication framework is proposed to jointly optimize the communication scheme and the CP procedure. We first propose channel-adaptive compression and robust fusion approaches to extract and exploit the most valuable semantic information under wireless communication constraints. We then propose a task-oriented distributed scheduling algorithm to identify the best collaborators for CP under dynamic environments. The main idea is learning while scheduling, where the collaboration utility is effectively learned with low computation and communication overhead. Case studies are carried out in connected autonomous driving scenarios to verify the proposed framework. Finally, we identify several future research directions.

[LG-78] Exploring User Retrieval Integration towards Large Language Models for Cross-Domain Sequential Recommendation

链接: https://arxiv.org/abs/2406.03085
作者: Tingjia Shen,Hao Wang,Jiaqing Zhang,Sirui Zhao,Liangyue Li,Zulong Chen,Defu Lian,Enhong Chen
关键词: Cross-Domain Sequential Recommendation, long-standing cold-start issue, users’ sequential preferences, Sequential Recommendation, Large Language Model
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
*备注: 10 pages, 5 figures

点击查看摘要

Abstract:Cross-Domain Sequential Recommendation (CDSR) aims to mine and transfer users’ sequential preferences across different domains to alleviate the long-standing cold-start issue. Traditional CDSR models capture collaborative information through user and item modeling while overlooking valuable semantic information. Recently, Large Language Model (LLM) has demonstrated powerful semantic reasoning capabilities, motivating us to introduce them to better capture semantic information. However, introducing LLMs to CDSR is non-trivial due to two crucial issues: seamless information integration and domain-specific generation. To this end, we propose a novel framework named URLLM, which aims to improve the CDSR performance by exploring the User Retrieval approach and domain grounding on LLM simultaneously. Specifically, we first present a novel dual-graph sequential model to capture the diverse information, along with an alignment and contrastive learning method to facilitate domain knowledge transfer. Subsequently, a user retrieve-generation model is adopted to seamlessly integrate the structural information into LLM, fully harnessing its emergent inferencing ability. Furthermore, we propose a domain-specific strategy and a refinement module to prevent out-of-domain generation. Extensive experiments on Amazon demonstrated the information integration and domain-specific generation ability of URLLM in comparison to state-of-the-art baselines. Our code is available at this https URL

[LG-79] Learning Solutions of Stochastic Optimization Problems with Bayesian Neural Networks

链接: https://arxiv.org/abs/2406.03082
作者: Alan A. Lahoud,Erik Schaffernicht,Johannes A. Stork
关键词: parametrized Optimization Problems, Optimization Problems, parametrized Optimization, yield optimal decisions, inputs to yield
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Mathematical solvers use parametrized Optimization Problems (OPs) as inputs to yield optimal decisions. In many real-world settings, some of these parameters are unknown or uncertain. Recent research focuses on predicting the value of these unknown parameters using available contextual features, aiming to decrease decision regret by adopting end-to-end learning approaches. However, these approaches disregard prediction uncertainty and therefore make the mathematical solver susceptible to provide erroneous decisions in case of low-confidence predictions. We propose a novel framework that models prediction uncertainty with Bayesian Neural Networks (BNNs) and propagates this uncertainty into the mathematical solver with a Stochastic Programming technique. The differentiable nature of BNNs and differentiable mathematical solvers allow for two different learning approaches: In the Decoupled learning approach, we update the BNN weights to increase the quality of the predictions’ distribution of the OP parameters, while in the Combined learning approach, we update the weights aiming to directly minimize the expected OP’s cost function in a stochastic end-to-end fashion. We do an extensive evaluation using synthetic data with various noise properties and a real dataset, showing that decision regret is generally lower (better) with both proposed methods.
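The "propagate uncertainty into the solver" step can be illustrated with a sample-average stochastic program on a newsvendor-style OP: instead of plugging one point prediction into the solver, the decision minimises cost averaged over posterior samples of the uncertain parameter. The normal samples below stand in for BNN posterior draws; prices and the grid solver are our own simplifications:

```python
import numpy as np

def stochastic_decision(demand_samples, order_grid, price=3.0, cost=1.0):
    """Pick the order quantity minimising the sample-average cost
    c*q - p*E[min(q, D)] over posterior draws D of the uncertain demand."""
    def expected_cost(q):
        sold = np.minimum(q, demand_samples)
        return cost * q - price * sold.mean()
    return min(order_grid, key=expected_cost)

rng = np.random.default_rng(1)
samples = rng.normal(10.0, 1.0, size=5000)   # stand-in for BNN posterior draws
q_star = stochastic_decision(samples, range(0, 21))
```

A point-prediction pipeline would order exactly the predicted demand; the stochastic version instead lands near the critical fractile (price − cost)/price of the demand distribution, hedging against low-confidence predictions.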

[LG-80] Towards Federated Domain Unlearning: Verification Methodologies and Challenges

链接: https://arxiv.org/abs/2406.03078
作者: Kahou Tam,Kewei Xu,Li Li,Huazhu Fu
关键词: ensuring data privacy, Federated Learning, Federated Domain Unlearning, multiple entities, healthcare and finance
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 16 pages, 12 figures

点击查看摘要

Abstract:Federated Learning (FL) has evolved as a powerful tool for collaborative model training across multiple entities, ensuring data privacy in sensitive sectors such as healthcare and finance. However, the introduction of the Right to Be Forgotten (RTBF) poses new challenges, necessitating federated unlearning to delete data without full model retraining. Traditional FL unlearning methods, not originally designed with domain specificity in mind, inadequately address the complexities of multi-domain scenarios, often affecting the accuracy of models in non-targeted domains or leading to uniform forgetting across all domains. Our work presents the first comprehensive empirical study on Federated Domain Unlearning, analyzing the characteristics and challenges of current techniques in multi-domain contexts. We uncover that these methods falter, particularly because they neglect the nuanced influences of domain-specific data, which can lead to significant performance degradation and inaccurate model behavior. Our findings reveal that unlearning disproportionately affects the model’s deeper layers, erasing critical representational subspaces acquired during earlier training phases. In response, we propose novel evaluation methodologies tailored for Federated Domain Unlearning, aiming to accurately assess and verify domain-specific data erasure without compromising the model’s overall integrity and performance. This investigation not only highlights the urgent need for domain-centric unlearning strategies in FL but also sets a new precedent for evaluating and implementing these techniques effectively.

[LG-81] Local to Global: Learning Dynamics and Effect of Initialization for Transformers

链接: https://arxiv.org/abs/2406.03072
作者: Ashok Vardhan Makkuva,Marco Bondaschi,Chanakya Ekbote,Adway Girish,Alliot Nagle,Hyeji Kim,Michael Gastpar
关键词: revolutionized deep learning, recent years, transformer-based models, sequence modeling, models have revolutionized
类目: Machine Learning (cs.LG); Information Theory (cs.IT); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:In recent years, transformer-based models have revolutionized deep learning, particularly in sequence modeling. To better understand this phenomenon, there is a growing interest in using Markov input processes to study transformers. However, our current understanding in this regard remains limited, with many fundamental questions about how transformers learn Markov chains still unanswered. In this paper, we address this by focusing on first-order Markov chains and single-layer transformers, providing a comprehensive characterization of the learning dynamics in this context. Specifically, we prove that transformer parameters trained on next-token prediction loss can either converge to global or local minima, contingent on the initialization and the Markovian data properties, and we characterize the precise conditions under which this occurs. To the best of our knowledge, this is the first result of its kind highlighting the role of initialization. We further demonstrate that our theoretical findings are corroborated by empirical evidence. Based on these insights, we provide guidelines for the initialization of transformer parameters and demonstrate their effectiveness. Finally, we outline several open problems in this arena. Code is available at: https://anonymous.4open.science/r/Local-to-Global-C70B/.

[LG-82] How Truncating Weights Improves Reasoning in Language Models

链接: https://arxiv.org/abs/2406.03068
作者: Lei Chen,Joan Bruna,Alberto Bietti
关键词: generate fluent text, large language models, involve basic forms, large language, forms of logical
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:In addition to the ability to generate fluent text in various languages, large language models have been successful at tasks that involve basic forms of logical “reasoning” over their context. Recent work found that selectively removing certain components from weight matrices in pre-trained models can improve such reasoning capabilities. We investigate this phenomenon further by carefully studying how certain global associations tend to be stored in specific weight components or Transformer blocks, in particular feed-forward layers. Such associations may hurt predictions in reasoning tasks, and removing the corresponding components may then improve performance. We analyze how this arises during training, both empirically and theoretically, on a two-layer Transformer trained on a basic reasoning task with noise, a toy associative memory model, and on the Pythia family of pre-trained models tested on simple reasoning tasks.
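The basic mechanic the abstract describes, selectively removing components from a weight matrix, can be sketched with an SVD edit on a synthetic matrix. Note that the paper's actual rule for choosing which components to drop is more careful; everything below (the matrix shape and the rank-1 "global association") is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy feed-forward weight matrix: one strong rank-1 "global association"
# plus small noise components.
u = rng.normal(size=(64, 1))
v = rng.normal(size=(1, 64))
W = 10.0 * (u @ v) / 64.0 + 0.1 * rng.normal(size=(64, 64))

def drop_top_components(W, k):
    """Zero out the k largest singular directions of W."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    s = s.copy()
    s[:k] = 0.0
    return U @ np.diag(s) @ Vt

W_trunc = drop_top_components(W, k=1)

# The dominant association is gone: the top singular value shrinks a lot.
top_before = float(np.linalg.svd(W, compute_uv=False)[0])
top_after = float(np.linalg.svd(W_trunc, compute_uv=False)[0])
```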

[LG-83] Decision Boundary-aware Knowledge Consolidation Generates Better Instance-Incremental Learner

链接: https://arxiv.org/abs/2406.03065
作者: Qiang Nie,Weifu Fu,Yuhuan Lin,Jialin Li,Yifeng Zhou,Yong Liu,Lei Zhu,Chengjie Wang
关键词: Instance-incremental learning, IIL, IIL setting, learning continually, Instance-incremental
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: 14 pages

点击查看摘要

Abstract:Instance-incremental learning (IIL) focuses on learning continually with data of the same classes. Compared to class-incremental learning (CIL), the IIL is seldom explored because IIL suffers less from catastrophic forgetting (CF). However, besides retaining knowledge, in real-world deployment scenarios where the class space is always predefined, continual and cost-effective model promotion with the potential unavailability of previous data is a more essential demand. Therefore, we first define a new and more practical IIL setting as promoting the model's performance besides resisting CF with only new observations. Two issues have to be tackled in the new IIL setting: 1) the notorious catastrophic forgetting because of no access to old data, and 2) broadening the existing decision boundary to new observations because of concept drift. To tackle these problems, our key insight is to moderately broaden the decision boundary to failure cases while retaining the old boundary. Hence, we propose a novel decision boundary-aware distillation method that consolidates knowledge into the teacher to ease the student's learning of new knowledge. We also establish the benchmarks on existing datasets Cifar-100 and ImageNet. Notably, extensive experiments demonstrate that the teacher model can be a better incremental learner than the student model, which overturns previous knowledge distillation-based methods treating student as the main role.

[LG-84] Path-Specific Causal Reasoning for Fairness-aware Cognitive Diagnosis

链接: https://arxiv.org/abs/2406.03064
作者: Dacao Zhang,Kun Zhang,Le Wu,Mi Tian,Richang Hong,Meng Wang
关键词: predict students’ proficiency, students’ proficiency levels, Intelligent Education, Cognitive Diagnosis, sensitive information
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
*备注: Accepted by KDD'2024

点击查看摘要

Abstract:Cognitive Diagnosis (CD), which leverages student and exercise data to predict students' proficiency levels on different knowledge concepts, is one of the fundamental components in Intelligent Education. Due to the scarcity of student-exercise interaction data, most existing methods focus on making the best use of available data, such as exercise content and student information (e.g., educational context). Despite the great progress, the abuse of student sensitive information has not been paid enough attention. Due to the important position of CD in Intelligent Education, employing sensitive information when making diagnosis predictions will cause serious social issues. Moreover, data-driven neural networks are easily misled by the shortcut between input data and output prediction, exacerbating this problem. Therefore, it is crucial to eliminate the negative impact of sensitive information in CD models. In response, we argue that sensitive attributes of students can also provide useful information, and only the shortcuts directly related to the sensitive information should be eliminated from the diagnosis process. Thus, we employ causal reasoning and design a novel Path-Specific Causal Reasoning Framework (PSCRF) to achieve this goal. Specifically, we first leverage an encoder to extract features and generate embeddings for general information and sensitive information of students. Then, we design a novel attribute-oriented predictor to decouple the sensitive attributes, in which fairness-related sensitive features will be eliminated and other useful information will be retained. Finally, we design a multi-factor constraint to ensure fairness and diagnosis performance simultaneously. Extensive experiments over real-world datasets (e.g., PISA dataset) demonstrate the effectiveness of our proposed PSCRF.

[LG-85] Predicting unobserved climate time series data at distant areas via spatial correlation using reservoir computing

链接: https://arxiv.org/abs/2406.03061
作者: Shihori Koyama,Daisuke Inoue,Hiroaki Yoshida,Kazuyuki Aihara,Gouhei Tanaka
关键词: Collecting time series, Collecting time, data spatially distributed, impacts on ecosystems, data
类目: Machine Learning (cs.LG); Chaotic Dynamics (nlin.CD); Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Collecting time series data spatially distributed in many locations is often important for analyzing climate change and its impacts on ecosystems. However, comprehensive spatial data collection is not always feasible, requiring us to predict climate variables at some locations. This study focuses on the prediction of climatic elements, specifically near-surface temperature and pressure, at a target location apart from a data observation point. Our approach uses two prediction methods: reservoir computing (RC), known as a machine learning framework with low computational requirements, and vector autoregression models (VAR), recognized as a statistical method for analyzing time series data. Our results show that the accuracy of the predictions degrades with the distance between the observation and target locations. We quantitatively estimate the distance within which effective predictions are possible. We also find that in the context of climate data, a geographical distance is associated with data correlation, and a strong data correlation significantly improves the prediction accuracy with RC. In particular, RC outperforms VAR in predicting highly correlated data within the predictive range. These findings suggest that machine learning-based methods can be used more effectively to predict climatic elements in remote locations by assessing the distance to them from the data observation point in advance. Our study on low-cost and accurate prediction of climate variables has significant value for climate change strategies.
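A minimal echo state network in the spirit of the RC method above, predicting a phase-shifted synthetic series (a stand-in for a correlated distant location) from an observed one. The data generator, reservoir size, and hyperparameters are illustrative assumptions, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(2)

# Two synthetic "climate" series: the target is a lagged, noisy copy of
# the observed series, standing in for a correlated distant location.
t = np.arange(2000)
observed = np.sin(0.02 * t) + 0.1 * rng.normal(size=t.size)
target = np.sin(0.02 * (t - 25)) + 0.1 * rng.normal(size=t.size)

# Minimal echo state network: fixed random reservoir, trained readout.
n_res = 200
W_in = rng.uniform(-0.5, 0.5, size=n_res)
W = rng.normal(size=(n_res, n_res))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))  # spectral radius 0.9

states = np.zeros((t.size, n_res))
x = np.zeros(n_res)
for i in range(t.size):
    x = np.tanh(W_in * observed[i] + W @ x)
    states[i] = x

# Ridge-regression readout, fit after a washout period.
washout, split = 100, 1500
A = states[washout:split]
b = target[washout:split]
lam = 1e-2
W_out = np.linalg.solve(A.T @ A + lam * np.eye(n_res), A.T @ b)

pred = states[split:] @ W_out
rmse = float(np.sqrt(np.mean((pred - target[split:]) ** 2)))
```

Only the linear readout is trained, which is why RC is cheap: the random reservoir supplies the temporal memory needed to reconstruct the lagged target.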

[LG-86] Efficient Exploration of the Rashomon Set of Rule Set Models

链接: https://arxiv.org/abs/2406.03059
作者: Martino Ciaperoni,Han Xiao,Aristides Gionis
关键词: high-stakes decision making, increasingly complex predictive, drive high-stakes decision, Rashomon set, obtain interpretable predictions
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Today, as increasingly complex predictive models are developed, simple rule sets remain a crucial tool to obtain interpretable predictions and drive high-stakes decision making. However, a single rule set provides a partial representation of a learning task. An emerging paradigm in interpretable machine learning aims at exploring the Rashomon set of all models exhibiting near-optimal performance. Existing work on Rashomon-set exploration focuses on exhaustive search of the Rashomon set for particular classes of models, which can be a computationally challenging task. On the other hand, exhaustive enumeration leads to redundancy that often is not necessary, and a representative sample or an estimate of the size of the Rashomon set is sufficient for many applications. In this work, we propose, for the first time, efficient methods to explore the Rashomon set of rule set models with or without exhaustive search. Extensive experiments demonstrate the effectiveness of the proposed methods in a variety of scenarios.
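The epsilon-Rashomon idea can be sketched on a toy rule-set family: enumerate candidate rule sets, then keep every candidate within epsilon of the best accuracy. This brute-force enumeration is exactly what the paper aims to avoid for large model classes; the data and rule family here are illustrative assumptions:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(6)

# Toy binary data: the label depends on features 0 and 1; feature 2 is noise.
X = rng.integers(0, 2, size=(500, 3))
y = (X[:, 0] & X[:, 1]).astype(int)
y = y ^ (rng.uniform(size=500) < 0.05).astype(int)  # 5% label noise

# Candidate rule sets of the form "predict 1 iff all features in S are 1".
def accuracy(S):
    pred = np.all(X[:, list(S)] == 1, axis=1).astype(int)
    return float(np.mean(pred == y))

candidates = [S for r in (1, 2, 3) for S in combinations(range(3), r)]
accs = {S: accuracy(S) for S in candidates}
best = max(accs.values())

# Epsilon-Rashomon set: every candidate within eps of the best accuracy.
eps = 0.02
rashomon = [S for S, a in accs.items() if a >= best - eps]
```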

[LG-87] BWS: Best Window Selection Based on Sample Scores for Data Pruning across Broad Ranges

链接: https://arxiv.org/abs/2406.03057
作者: Hoyong Choi,Nohyun Ki,Hye Won Chung
关键词: Data subset selection, training neural networks, subset selection aims, addressing challenges, aims to find
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: ICML 2024

点击查看摘要

Abstract:Data subset selection aims to find a smaller yet informative subset of a large dataset that can approximate the full-dataset training, addressing challenges associated with training neural networks on large-scale datasets. However, existing methods tend to specialize in either high or low selection ratio regimes, lacking a universal approach that consistently achieves competitive performance across a broad range of selection ratios. We introduce a universal and efficient data subset selection method, Best Window Selection (BWS), by proposing a method to choose the best window subset from samples ordered based on their difficulty scores. This approach offers flexibility by allowing the choice of window intervals that span from easy to difficult samples. Furthermore, we provide an efficient mechanism for selecting the best window subset by evaluating its quality using kernel ridge regression. Our experimental results demonstrate the superior performance of BWS compared to other baselines across a broad range of selection ratios over datasets, including CIFAR-10/100 and ImageNet, and the scenarios involving training from random initialization or fine-tuning of pre-trained models.
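The core loop of window selection can be sketched as follows: order samples by a difficulty score, slide a fixed-size window over that ordering, and score each window by fitting kernel ridge regression on it. The data, the difficulty proxy, and the kernel settings are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy regression data; each sample's "difficulty" is its noise magnitude.
n, m = 300, 60
X = rng.uniform(-3.0, 3.0, size=(n, 1))
noise = rng.normal(size=n) * rng.uniform(0.0, 2.0, size=n)
y = np.sin(X[:, 0]) + noise
order = np.argsort(np.abs(noise))  # difficulty score: easy -> hard

def rbf_kernel(A, B, gamma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def window_quality(idx, lam=1e-2):
    """Fit kernel ridge regression on the window; score it on all data."""
    K = rbf_kernel(X[idx], X[idx])
    alpha = np.linalg.solve(K + lam * np.eye(len(idx)), y[idx])
    pred = rbf_kernel(X, X[idx]) @ alpha
    return -float(np.mean((pred - y) ** 2))  # higher is better

# Slide a fixed-size window over the difficulty-ordered samples and keep
# the best-scoring one.
starts = range(0, n - m + 1, 20)
best_start = max(starts, key=lambda s: window_quality(order[s:s + m]))
```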

[LG-88] Are Your Models Still Fair? Fairness Attacks on Graph Neural Networks via Node Injections

链接: https://arxiv.org/abs/2406.03052
作者: Zihan Luo,Hong Huang,Yongkang Zhou,Jiping Zhang,Nuo Chen
关键词: Graph Neural Networks, Neural Networks, Graph Neural, remarkable capabilities demonstrated, facing malicious adversarial
类目: Machine Learning (cs.LG)
*备注: 21 pages

点击查看摘要

Abstract:Despite the remarkable capabilities demonstrated by Graph Neural Networks (GNNs) in graph-related tasks, recent research has revealed the fairness vulnerabilities in GNNs when facing malicious adversarial attacks. However, all existing fairness attacks require manipulating the connectivity between existing nodes, which may be prohibited in reality. To this end, we introduce a Node Injection-based Fairness Attack (NIFA), exploring the vulnerabilities of GNN fairness in such a more realistic setting. In detail, NIFA first designs two insightful principles for node injection operations, namely the uncertainty-maximization principle and homophily-increase principle, and then optimizes injected nodes’ feature matrix to further ensure the effectiveness of fairness attacks. Comprehensive experiments on three real-world datasets consistently demonstrate that NIFA can significantly undermine the fairness of mainstream GNNs, even including fairness-aware GNNs, by injecting merely 1% of nodes. We sincerely hope that our work can stimulate increasing attention from researchers on the vulnerability of GNN fairness, and encourage the development of corresponding defense mechanisms.

[LG-89] Population Transformer: Learning Population-level Representations of Intracranial Activity

链接: https://arxiv.org/abs/2406.03044
作者: Geeling Chau,Christopher Wang,Sabera Talukder,Vighnesh Subramaniam,Saraswati Soedarmadji,Yisong Yue,Boris Katz,Andrei Barbu
关键词: neuroscience recording modality, intracranial neural recordings, learns population-level codes, recording modality, unlocking the benefits
类目: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注: 17 pages, 10 figures, submitted to NeurIPS 2024

点击查看摘要

Abstract:We present a self-supervised framework that learns population-level codes for intracranial neural recordings at scale, unlocking the benefits of representation learning for a key neuroscience recording modality. The Population Transformer (PopT) lowers the amount of data required for decoding experiments, while increasing accuracy, even on never-before-seen subjects and tasks. We address two key challenges in developing PopT: sparse electrode distribution and varying electrode location across patients. PopT stacks on top of pretrained representations and enhances downstream tasks by enabling learned aggregation of multiple spatially-sparse data channels. Beyond decoding, we interpret the pretrained PopT and fine-tuned models to show how it can be used to provide neuroscience insights learned from massive amounts of data. We release a pretrained PopT to enable off-the-shelf improvements in multi-channel intracranial data decoding and interpretability, and code is available at this https URL.

[LG-90] Optimal Multi-Fidelity Best-Arm Identification

链接: https://arxiv.org/abs/2406.03033
作者: Riccardo Poiani,Rémy Degenne,Emilie Kaufmann,Alberto Maria Metelli,Marcello Restelli
关键词: bandit best-arm identification, best-arm identification, multi-fidelity best-arm identification, highest mean reward, accuracy as fast
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:In bandit best-arm identification, an algorithm is tasked with finding the arm with highest mean reward with a specified accuracy as fast as possible. We study multi-fidelity best-arm identification, in which the algorithm can choose to sample an arm at a lower fidelity (less accurate mean estimate) for a lower cost. Several methods have been proposed for tackling this problem, but their optimality remains elusive, notably due to loose lower bounds on the total cost needed to identify the best arm. Our first contribution is a tight, instance-dependent lower bound on the cost complexity. The study of the optimization problem featured in the lower bound provides new insights to devise computationally efficient algorithms, and leads us to propose a gradient-based approach with asymptotically optimal cost complexity. We demonstrate the benefits of the new algorithm compared to existing methods in experiments. Our theoretical and empirical findings also shed light on an intriguing concept of optimal fidelity for each arm.
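To make the cost/fidelity trade-off concrete, here is a simple two-stage heuristic (explicitly not the paper's asymptotically optimal gradient-based algorithm): screen all arms at a cheap, noisier fidelity, then spend the expensive fidelity only on the survivors. All means, costs, and noise levels are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)

true_means = np.array([0.3, 0.5, 0.52, 0.8])
fid_cost = [1.0, 10.0]   # low fidelity is cheap, high fidelity expensive
fid_noise = [0.5, 0.1]   # ...and the cheap one is noisier
fid_bias = [0.05, 0.0]   # the cheap one may also be slightly off

def pull(arm, fid):
    bias = rng.uniform(-fid_bias[fid], fid_bias[fid])
    return true_means[arm] + bias + rng.normal(0.0, fid_noise[fid])

# Stage 1: screen every arm cheaply.
total_cost = 0.0
lo = np.zeros(4)
for arm in range(4):
    lo[arm] = np.mean([pull(arm, 0) for _ in range(100)])
    total_cost += 100 * fid_cost[0]

# Stage 2: spend the expensive fidelity only on the two survivors.
survivors = np.argsort(lo)[-2:]
hi = {}
for arm in survivors:
    hi[int(arm)] = np.mean([pull(arm, 1) for _ in range(50)])
    total_cost += 50 * fid_cost[1]

best_arm = max(hi, key=hi.get)
```

The interesting question the paper answers is how to allocate this budget optimally per arm and per fidelity, rather than with fixed stage sizes as above.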

[LG-91] From Tarzan to Tolkien: Controlling the Language Proficiency Level of LLMs for Content Generation

链接: https://arxiv.org/abs/2406.03030
作者: Ali Malik,Stephen Mayhew,Chris Piech,Klinton Bicknell
关键词: Large Language Models, fully proficient, problem of controlling, controlling the difficulty, difficulty level
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study the problem of controlling the difficulty level of text generated by Large Language Models (LLMs) for contexts where end-users are not fully proficient, such as language learners. Using a novel framework, we evaluate the effectiveness of several key approaches for this task, including few-shot prompting, supervised finetuning, and reinforcement learning (RL), utilising both GPT-4 and open source alternatives like LLama2-7B and Mistral-7B. Our findings reveal a large performance gap between GPT-4 and the open source models when using prompt-based strategies. However, we show how to bridge this gap with a careful combination of finetuning and RL alignment. Our best model, CALM (CEFR-Aligned Language Model), surpasses the performance of GPT-4 and other strategies, at only a fraction of the cost. We further validate the quality of our results through a small-scale human study. Journal reference: Findings of the Association for Computational Linguistics (ACL 2024).

[LG-92] Analyzing the Influence of Training Samples on Explanations

链接: https://arxiv.org/abs/2406.03012
作者: André Artelt,Barbara Hammer
关键词: constitutes a popular, explaining their decision-making, providing a counterfactual, popular method, method to analyze
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted at the Workshop on Explainable Artificial Intelligence (XAI) at IJCAI 2024. arXiv admin note: text overlap with arXiv:2402.08290

点击查看摘要

Abstract:EXplainable AI (XAI) constitutes a popular method to analyze the reasoning of AI systems by explaining their decision-making, e.g. providing a counterfactual explanation of how to achieve recourse. However, in cases such as unexpected explanations, the user might be interested in learning about the cause of this explanation – e.g. properties of the utilized training data that are responsible for the observed explanation. Under the umbrella of data valuation, first approaches have been proposed that estimate the influence of data samples on a given model. In this work, we take a slightly different stance, as we are interested in the influence of single samples on a model explanation rather than the model itself. Hence, we propose the novel problem of identifying training data samples that have a high influence on a given explanation (or related quantity) and investigate the particular case of differences in the cost of the recourse between protected groups. For this, we propose an algorithm that identifies such influential training samples.

[LG-93] BadAgent: Inserting and Activating Backdoor Attacks in LLM Agents

链接: https://arxiv.org/abs/2406.03007
作者: Yifei Wang,Dizhan Xue,Shengjie Zhang,Shengsheng Qian
关键词: powerful LLM-based intelligent, provide customized services, LLM-based intelligent agents, large language models, LLM agents
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: Accepted by ACL 2024

点击查看摘要

Abstract:With the prosperity of large language models (LLMs), powerful LLM-based intelligent agents have been developed to provide customized services with a set of user-defined tools. State-of-the-art methods for constructing LLM agents adopt trained LLMs and further fine-tune them on data for the agent task. However, we show that such methods are vulnerable to our proposed backdoor attacks named BadAgent on various agent tasks, where a backdoor can be embedded by fine-tuning on the backdoor data. At test time, the attacker can manipulate the deployed LLM agents to execute harmful operations by showing the trigger in the agent input or environment. To our surprise, our proposed attack methods are extremely robust even after fine-tuning on trustworthy data. Though backdoor attacks have been studied extensively in natural language processing, to the best of our knowledge, we could be the first to study them on LLM agents that are more dangerous due to the permission to use external tools. Our work demonstrates the clear risk of constructing LLM agents based on untrusted LLMs or data. Our code is public at this https URL

[LG-94] Residual Connections and Normalization Can Provably Prevent Oversmoothing in GNNs

链接: https://arxiv.org/abs/2406.02997
作者: Michael Scholkemper,Xinyi Wu,Ali Jadbabaie,Michael Schaub
关键词: standard design choices, graph neural networks, oversmoothing problem, Residual connections, neural networks
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Residual connections and normalization layers have become standard design choices for graph neural networks (GNNs), and were proposed as solutions to mitigate the oversmoothing problem in GNNs. However, how exactly these methods help alleviate the oversmoothing problem from a theoretical perspective is not well understood. In this work, we provide a formal and precise characterization of (linearized) GNNs with residual connections and normalization layers. We establish that (a) for residual connections, the incorporation of the initial features at each layer can prevent the signal from becoming too smooth, and determines the subspace of possible node representations; (b) batch normalization prevents a complete collapse of the output embedding space to a one-dimensional subspace through the individual rescaling of each column of the feature matrix. This results in the convergence of node representations to the top-k eigenspace of the message-passing operator; (c) moreover, we show that the centering step of a normalization layer, which can be understood as a projection, alters the graph signal in message-passing in such a way that relevant information can become harder to extract. We therefore introduce a novel, principled normalization layer called GraphNormv2 in which the centering step is learned such that it does not distort the original graph signal in an undesirable way. Experimental results confirm the effectiveness of our method.
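The effect described in (a) can be reproduced numerically: repeated linearized message passing collapses node features toward a constant, while a residual connection to the initial features keeps them apart. The graph, feature dimension, and mixing weight below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)

# Random graph with self-loops; P is the random-walk (row-stochastic)
# message-passing operator.
n = 20
A = (rng.uniform(size=(n, n)) < 0.4).astype(float)
A = np.maximum(A, A.T)
np.fill_diagonal(A, 1.0)
P = A / A.sum(1, keepdims=True)

X0 = rng.normal(size=(n, 8))  # initial node features

def smoothness(X):
    """Norm of node features around the mean node: ~0 means oversmoothed."""
    return float(np.linalg.norm(X - X.mean(0)))

# Linearized GNN: plain message passing vs. a residual connection that
# re-injects the initial features at every layer.
X_plain, X_res = X0.copy(), X0.copy()
for _ in range(50):
    X_plain = P @ X_plain
    X_res = 0.5 * (P @ X_res) + 0.5 * X0

s_plain = smoothness(X_plain)  # collapses toward 0
s_res = smoothness(X_res)      # stays bounded away from 0
```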

[LG-95] Quantifying Task Priority for Multi-Task Optimization

链接: https://arxiv.org/abs/2406.02996
作者: Wooseong Jeong,Kuk-Jin Yoon
关键词: single unified network, single unified, task priority, learn diverse tasks, tasks
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The goal of multi-task learning is to learn diverse tasks within a single unified network. As each task has its own unique objective function, conflicts emerge during training, resulting in negative transfer among them. Earlier research identified these conflicting gradients in shared parameters between tasks and attempted to realign them in the same direction. However, we prove that such optimization strategies lead to sub-optimal Pareto solutions due to their inability to accurately determine the individual contributions of each parameter across various tasks. In this paper, we propose the concept of task priority to evaluate parameter contributions across different tasks. To learn task priority, we identify the type of connections related to links between parameters influenced by task-specific losses during backpropagation. The strength of connections is gauged by the magnitude of parameters to determine task priority. Based on these, we present a new method named connection strength-based optimization for multi-task learning which consists of two phases. The first phase learns the task priority within the network, while the second phase modifies the gradients while upholding this priority. This ultimately leads to finding new Pareto optimal solutions for multiple tasks. Through extensive experiments, we show that our approach greatly enhances multi-task performance in comparison to earlier gradient manipulation methods.

[LG-96] Local vs. Global Interpretability: A Computational Complexity Perspective

链接: https://arxiv.org/abs/2406.02981
作者: Shahaf Bassan,Guy Amir,Guy Katz
关键词: recent years, studied extensively, extensively in recent, global, local
类目: Machine Learning (cs.LG); Computational Complexity (cs.CC); Logic in Computer Science (cs.LO)
*备注: To appear in ICML 2024

点击查看摘要

Abstract:The local and global interpretability of various ML models has been studied extensively in recent years. However, despite significant progress in the field, many known results remain informal or lack sufficient mathematical rigor. We propose a framework for bridging this gap, by using computational complexity theory to assess local and global perspectives of interpreting ML models. We begin by proposing proofs for two novel insights that are essential for our analysis: (1) a duality between local and global forms of explanations; and (2) the inherent uniqueness of certain global explanation forms. We then use these insights to evaluate the complexity of computing explanations, across three model types representing the extremes of the interpretability spectrum: (1) linear models; (2) decision trees; and (3) neural networks. Our findings offer insights into both the local and global interpretability of these models. For instance, under standard complexity assumptions such as P != NP, we prove that selecting global sufficient subsets in linear models is computationally harder than selecting local subsets. Interestingly, with neural networks and decision trees, the opposite is true: it is harder to carry out this task locally than globally. We believe that our findings demonstrate how examining explainability through a computational complexity lens can help us develop a more rigorous grasp of the inherent interpretability of ML models.

[LG-97] Tensor Polynomial Additive Model

链接: https://arxiv.org/abs/2406.02980
作者: Yang Chen,Ce Zhu,Jiani Liu,Yipeng Liu
关键词: interpretable machine learning, clarity and simplicity, interpretable machine, machine learning, TPAM
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Additive models can be used for interpretable machine learning for their clarity and simplicity. However, in classical models for high-order data, the vectorization operation disrupts the data structure, which may lead to degraded accuracy and increased computational complexity. To deal with these problems, we propose the tensor polynomial additive model (TPAM). It retains the multidimensional structure information of high-order inputs with tensor representation. The model parameter compression is achieved using a hierarchical and low-order symmetric tensor approximation. In this way, complex high-order feature interactions can be captured with fewer parameters. Moreover, TPAM preserves the inherent interpretability of additive models, facilitating transparent decision-making and the extraction of meaningful feature values. Additionally, leveraging TPAM's transparency and ability to handle higher-order features, it is used as a post-processing module for other interpretation models by introducing two variants for class activation maps. Experimental results on a series of datasets demonstrate that TPAM can enhance accuracy by up to 30%, and compression rate by up to 5 times, while maintaining a good interpretability.

[LG-98] Efficient User Sequence Learning for Online Services via Compressed Graph Neural Networks

链接: https://arxiv.org/abs/2406.02979
作者: Yucheng Wu,Liyue Chen,Yu Cheng,Shuai Chen,Jinyu Xu,Leye Wang
关键词: transaction detection mechanisms, fraudulent transaction detection, online fraudulent transaction, Graph Neural Networks, detection mechanisms
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted by IEEE ICWS 2024

点击查看摘要

Abstract:Learning representations of user behavior sequences is crucial for various online services, such as online fraudulent transaction detection mechanisms. Graph Neural Networks (GNNs) have been extensively applied to model sequence relationships, and extract information from similar sequences. While user behavior sequence data volume is usually huge for online applications, directly applying GNN models may lead to substantial computational overhead during both the training and inference stages and make it challenging to meet real-time requirements for online services. In this paper, we leverage graph compression techniques to alleviate the efficiency issue. Specifically, we propose a novel unified framework called ECSeq, to introduce graph compression techniques into relation modeling for user sequence representation learning. The key module of ECSeq is sequence relation modeling, which explores relationships among sequences to enhance sequence representation learning, and employs graph compression algorithms to achieve high efficiency and scalability. ECSeq also exhibits plug-and-play characteristics, seamlessly augmenting pre-trained sequence representation models without modifications. Empirical experiments on both sequence classification and regression tasks demonstrate the effectiveness of ECSeq. Specifically, with an additional training time of tens of seconds in total on 100,000+ sequences and inference time preserved within 10^-4 seconds/sample, ECSeq improves the prediction R@P_0.9 of the widely used LSTM by ~5%.

[LG-99] Filtered not Mixed: Stochastic Filtering-Based Online Gating for Mixture of Large Language Models

链接: https://arxiv.org/abs/2406.02969
作者: Raeid Saqur,Anastasis Kratsios,Florian Krach,Yannick Limmer,Jacob-Junqi Tian,John Willes,Blanka Horvath,Frank Rudzicz
关键词: Large Language Models, expert Large Language, pre-trained expert Large, Large Language, online time-series prediction
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computational Finance (q-fin.CP); Mathematical Finance (q-fin.MF)
*备注: 29 pages, 5 Appendix sections

点击查看摘要

Abstract:We propose MoE-F – a formalised mechanism for combining N pre-trained expert Large Language Models (LLMs) in online time-series prediction tasks by adaptively forecasting the best weighting of LLM predictions at every time step. Our mechanism leverages the conditional information in each expert’s running performance to forecast the best combination of LLMs for predicting the time series in its next step. Diverging from static (learned) Mixture of Experts (MoE) methods, MoE-F employs time-adaptive stochastic filtering techniques to combine experts. By framing the expert selection problem as a finite state-space, continuous-time Hidden Markov model (HMM), we can leverage the Wonham-Shiryaev filter. Our approach first constructs N parallel filters corresponding to each of the N individual LLMs. Each filter proposes its best combination of LLMs, given the information that they have access to. Subsequently, the N filter outputs are aggregated to optimize a lower bound for the loss of the aggregated LLMs, which can be optimized in closed-form, thus generating our ensemble predictor. Our contributions here are: (I) the MoE-F algorithm – deployable as a plug-and-play filtering harness, (II) theoretical optimality guarantees of the proposed filtering-based gating algorithm, and (III) empirical evaluation and ablative results using state-of-the-art foundational and MoE LLMs on a real-world Financial Market Movement task where MoE-F attains a remarkable 17% absolute and 48.5% relative F1 measure improvement over the next best performing individual LLM expert.
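As a rough intuition for the gating idea (this is not the paper's Wonham-Shiryaev filter, whose derivation is in the paper itself), experts with lower running loss can be given exponentially higher weight at each time step. A minimal softmax-based sketch, with all values illustrative:

```python
import math

def adaptive_expert_weights(running_losses, temperature=1.0):
    # Softmax over negative running losses: lower recent loss -> higher weight.
    # Toy stand-in for filtering-based gating; NOT the MoE-F algorithm itself.
    scores = [-l / temperature for l in running_losses]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def combine_predictions(predictions, weights):
    # Ensemble output as the weighted average of expert predictions.
    return sum(p * w for p, w in zip(predictions, weights))

# Three experts with running losses 0.2, 1.5, 3.0; the best one dominates.
w = adaptive_expert_weights([0.2, 1.5, 3.0])
pred = combine_predictions([1.0, 0.0, -1.0], w)
```

At each step the losses (and hence the weights) are updated, so the gating adapts over time rather than being learned once as in a static MoE.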

[LG-100] Adversarial Moment-Matching Distillation of Large Language Models

链接: https://arxiv.org/abs/2406.02959
作者: Chen Jia
关键词: large language models, achieving practical benefits, Knowledge distillation, larger teacher model, language models
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Knowledge distillation (KD) has been shown to be highly effective in guiding a student model with a larger teacher model and achieving practical benefits in improving the computational and memory efficiency for large language models (LLMs). State-of-the-art KD methods for LLMs mostly rely on minimizing explicit distribution distance between teacher and student probability predictions. Instead of optimizing these mandatory behaviour cloning objectives, we explore an imitation learning strategy for KD of LLMs. In particular, we minimize the imitation gap by matching the action-value moments of the teacher’s behavior from both on- and off-policy perspectives. To achieve this action-value moment-matching goal, we propose an adversarial training algorithm to jointly estimate the moment-matching distance and optimize the student policy to minimize it. Results from both task-agnostic instruction-following experiments and task-specific experiments demonstrate the effectiveness of our method and achieve new state-of-the-art performance.

[LG-101] PrE-Text: Training Language Models on Private Federated Data in the Age of LLMs

链接: https://arxiv.org/abs/2406.02958
作者: Charlie Hou,Akshat Shrivastava,Hongyuan Zhan,Rylan Conway,Trang Le,Adithya Sagar,Giulia Fanti,Daniel Lazar
关键词: training machine learning, On-device, On-device training, machine learning, training
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: ICML 2024 (Oral)

点击查看摘要

Abstract:On-device training is currently the most common approach for training machine learning (ML) models on private, distributed user data. Despite this, on-device training has several drawbacks: (1) most user devices are too small to train large models on-device, (2) on-device training is communication- and computation-intensive, and (3) on-device training can be difficult to debug and deploy. To address these problems, we propose Private Evolution-Text (PrE-Text), a method for generating differentially private (DP) synthetic textual data. First, we show that across multiple datasets, training small models (models that fit on user devices) with PrE-Text synthetic data outperforms small models trained on-device under practical privacy regimes (ε=1.29, ε=7.58). We achieve these results while using 9× fewer rounds, 6× less client computation per round, and 100× less communication per round. Second, finetuning large models on PrE-Text’s DP synthetic data improves large language model (LLM) performance on private data across the same range of privacy budgets. Altogether, these results suggest that training on DP synthetic data can be a better option than training a model on-device on private distributed data. Code is available at this https URL.

[LG-102] GraphAlign: Pretraining One Graph Neural Network on Multiple Graphs via Feature Alignment

链接: https://arxiv.org/abs/2406.02953
作者: Zhenyu Hou,Haozhan Li,Yukuo Cen,Jie Tang,Yuxiao Dong
关键词: holds considerable promise, Graph self-supervised learning, existing graph SSL, graph SSL, self-supervised learning
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph self-supervised learning (SSL) holds considerable promise for mining and learning with graph-structured data. Yet, a significant challenge in graph SSL lies in the feature discrepancy among graphs across different domains. In this work, we aim to pretrain one graph neural network (GNN) on a varied collection of graphs endowed with rich node features and subsequently apply the pretrained GNN to unseen graphs. We present a general GraphAlign method that can be seamlessly integrated into the existing graph SSL framework. To align feature distributions across disparate graphs, GraphAlign designs alignment strategies of feature encoding, normalization, alongside a mixture-of-feature-expert module. Extensive experiments show that GraphAlign empowers existing graph SSL frameworks to pretrain a unified and powerful GNN across multiple graphs, showcasing performance superiority on both in-domain and out-of-domain graphs.

[LG-103] Exploring Data Efficiency in Zero-Shot Learning with Diffusion Models

链接: https://arxiv.org/abs/2406.02929
作者: Zihan Ye,Shreyank N. Gowda,Xiaobo Jin,Xiaowei Huang,Haotian Xu,Yaochu Jin,Kaizhu Huang
关键词: identify unseen classes, aims to enable, enable classifiers, classifiers to identify, unseen classes
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Zero-Shot Learning (ZSL) aims to enable classifiers to identify unseen classes by enhancing data efficiency at the class level. This is achieved by generating image features from pre-defined semantics of unseen classes. However, most current approaches heavily depend on the number of samples from seen classes, i.e. they do not consider instance-level effectiveness. In this paper, we demonstrate that limited seen examples generally result in deteriorated performance of generative models. To overcome these challenges, we propose ZeroDiff, a Diffusion-based Generative ZSL model. This unified framework incorporates diffusion models to improve data efficiency at both the class and instance levels. Specifically, for instance-level effectiveness, ZeroDiff utilizes a forward diffusion chain to transform limited data into an expanded set of noised data. For class-level effectiveness, we design a two-branch generation structure that consists of a Diffusion-based Feature Generator (DFG) and a Diffusion-based Representation Generator (DRG). DFG focuses on learning and sampling the distribution of cross-entropy-based features, whilst DRG learns the supervised contrastive-based representation to boost the zero-shot capabilities of DFG. Additionally, we employ three discriminators to evaluate generated features from various aspects and introduce a Wasserstein-distance-based mutual learning loss to transfer knowledge among discriminators, thereby enhancing guidance for generation. Demonstrated through extensive experiments on three popular ZSL benchmarks, our ZeroDiff not only achieves significant improvements over existing ZSL methods but also maintains robust performance even with scarce training data. Code will be released upon acceptance.
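The instance-level trick above — expanding limited data via a forward diffusion chain — follows the standard DDPM forward process x_t = √(ᾱ_t)·x_0 + √(1−ᾱ_t)·ε with ε ~ N(0, I). A self-contained sketch (the schedule values are toy assumptions, not ZeroDiff's actual configuration):

```python
import math
import random

def forward_diffuse(x0, t, alphas_bar, rng):
    # Sample x_t ~ q(x_t | x_0) from the DDPM forward chain:
    # x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps, eps ~ N(0, I).
    abar = alphas_bar[t]
    return [math.sqrt(abar) * v + math.sqrt(1.0 - abar) * rng.gauss(0.0, 1.0)
            for v in x0]

T = 10
# Toy linear schedule of cumulative alphas (illustrative; real schedules differ).
alphas_bar = [1.0 - (t + 1) / (T + 1) for t in range(T)]
rng = random.Random(0)
x0 = [1.0, -2.0, 0.5]  # one scarce seen-class feature vector
# The single example is expanded into 4 noised training samples per timestep.
expanded = [forward_diffuse(x0, t, alphas_bar, rng) for t in range(T) for _ in range(4)]
```

Small t keeps samples close to the original feature; large t yields nearly pure noise, so one scarce example turns into a spectrum of training signals.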

[LG-104] Multivariate Physics-Informed Convolutional Autoencoder for Anomaly Detection in Power Distribution Systems with High Penetration of DERs

链接: https://arxiv.org/abs/2406.02927
作者: Mehdi Jabbari Zideh,Sarika Khushalani Solanki
关键词: system domain due, power system domain, data availability issues, cyber-physical events, availability issues
类目: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Despite the relentless progress of deep learning models in analyzing the system conditions under cyber-physical events, their abilities are limited in the power system domain due to data availability issues, cost of data acquisition, and lack of interpretation and extrapolation for the data beyond the training windows. In addition, the integration of distributed energy resources (DERs) such as wind and solar generations increases the complexities and nonlinear nature of power systems. Therefore, an interpretable and reliable methodology is of utmost need to increase the confidence of power system operators and their situational awareness for making reliable decisions. This has led to the development of physics-informed neural network (PINN) models as more interpretable, trustworthy, and robust models where the underlying principled laws are integrated into the training process of neural network models to achieve improved performance. This paper proposes a multivariate physics-informed convolutional autoencoder (PIConvAE) model to detect cyber anomalies in power distribution systems with unbalanced configurations and high penetration of DERs. The physical laws are integrated through a customized loss function that embeds the underlying Kirchhoff’s circuit laws into the training process of the autoencoder. The performance of the multivariate PIConvAE model is evaluated on two unbalanced power distribution grids, IEEE 123-bus system and a real-world feeder in Riverside, CA. The results show the exceptional performance of the proposed method in detecting various cyber anomalies in both systems. In addition, the model’s effectiveness is evaluated in data scarcity scenarios with different training data ratios. Finally, the model’s performance is compared with existing machine learning models where the PIConvAE model surpasses other models with considerably higher detection metrics.
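The "customized loss function" pattern — data term plus a penalty on a physical-law residual — can be sketched with Kirchhoff's current law at a single node. The λ weighting and the residual form here are illustrative assumptions, not the paper's implementation:

```python
def physics_informed_loss(x, x_hat, currents_in, currents_out, lam=1.0):
    # Data term: mean squared reconstruction error of the autoencoder.
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    # Physics term: Kirchhoff's current law residual at a node
    # (currents entering should equal currents leaving).
    kcl_residual = sum(currents_in) - sum(currents_out)
    return recon + lam * kcl_residual ** 2

# Measurements consistent with KCL incur only the reconstruction cost;
# a cyber anomaly that violates KCL inflates the physics term.
loss = physics_informed_loss([1.0, 2.0], [1.1, 1.9],
                             currents_in=[2.0, 1.0], currents_out=[3.0])
```

Training against such a loss pushes the autoencoder toward reconstructions that are physically plausible, which is what gives PINN-style models their interpretability edge.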

[LG-105] Pruner-Zero: Evolving Symbolic Pruning Metric from scratch for Large Language Models

链接: https://arxiv.org/abs/2406.02924
作者: Peijie Dong,Lujun Li,Zhenheng Tang,Xiang Liu,Xinglin Pan,Qiang Wang,Xiaowen Chu
关键词: Large Language Models, face deployment challenges, deployment challenges due, Large Language, Language Models
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Neural and Evolutionary Computing (cs.NE)
*备注: Accepted by ICML2024, 29 pages, 4 figures

点击查看摘要

Abstract:Despite the remarkable capabilities, Large Language Models (LLMs) face deployment challenges due to their extensive size. Pruning methods drop a subset of weights to accelerate, but many of them require retraining, which is prohibitively expensive and computationally demanding. Recently, post-training pruning approaches introduced novel metrics, enabling the pruning of LLMs without retraining. However, these metrics require the involvement of human experts and tedious trial and error. To efficiently identify superior pruning metrics, we develop an automatic framework for searching symbolic pruning metrics using genetic programming. In particular, we devise an elaborate search space encompassing the existing pruning metrics to discover the potential symbolic pruning metric. We propose an opposing operation simplification strategy to increase the diversity of the population. In this way, Pruner-Zero allows auto-generation of symbolic pruning metrics. Based on the searched results, we explore the correlation between pruning metrics and performance after pruning and summarize some principles. Extensive experiments on LLaMA and LLaMA-2 on language modeling and zero-shot tasks demonstrate that our Pruner-Zero obtains performance superior to SOTA post-training pruning methods. Code at: this https URL.
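One well-known point in the symbolic search space of pruning metrics is the |w|·‖x‖ score (from Wanda); Pruner-Zero searches over expressions like this rather than fixing one. A toy sketch of scoring and one-shot pruning, with all values and function names illustrative:

```python
def wanda_style_metric(weight_row, act_norms):
    # Score each weight as |w| * ||x||: a known symbolic pruning metric,
    # used here only as one example expression from the search space.
    return [abs(w) * n for w, n in zip(weight_row, act_norms)]

def prune_smallest(weight_row, scores, ratio):
    # Zero out the lowest-scoring fraction `ratio` of weights (no retraining).
    k = int(len(weight_row) * ratio)
    cutoff = sorted(scores)[k - 1] if k > 0 else float("-inf")
    return [0.0 if s <= cutoff else w for w, s in zip(weight_row, scores)]

weights = [0.5, -0.1, 0.8, 0.05]
act_norms = [1.0, 2.0, 0.5, 1.0]   # per-input activation norms (toy values)
scores = wanda_style_metric(weights, act_norms)
pruned = prune_smallest(weights, scores, ratio=0.5)
```

Genetic programming, as in the paper, would mutate and recombine the expression tree behind `wanda_style_metric` and keep whichever candidate prunes with the least quality loss.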

[LG-106] Text Injection for Neural Contextual Biasing

链接: https://arxiv.org/abs/2406.02921
作者: Zhong Meng,Zelin Wu,Rohit Prabhavalkar,Cal Peyser,Weiran Wang,Nanxin Chen,Tara N. Sainath,Bhuvana Ramabhadran
关键词: automatic speech recognition, effectively improves automatic, improves automatic speech, biasing effectively improves, speech recognition
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Audio and Speech Processing (eess.AS)
*备注: 5 pages, 1 figure

点击查看摘要

Abstract:Neural contextual biasing effectively improves automatic speech recognition (ASR) for crucial phrases within a speaker’s context, particularly those that are infrequent in the training data. This work proposes contextual text injection (CTI) to enhance contextual ASR. CTI leverages not only the paired speech-text data, but also a much larger corpus of unpaired text to optimize the ASR model and its biasing component. Unpaired text is converted into speech-like representations and used to guide the model’s attention towards relevant bias phrases. Moreover, we introduce a contextual text-injected (CTI) minimum word error rate (MWER) training, which minimizes the expected WER caused by contextual biasing when unpaired text is injected into the model. Experiments show that CTI with 100 billion text sentences can achieve up to 43.3% relative WER reduction from a strong neural biasing model. CTI-MWER provides a further relative improvement of 23.5%.

[LG-107] A comprehensive and FAIR comparison between MLP and KAN representations for differential equations and operator networks

链接: https://arxiv.org/abs/2406.02917
作者: Khemraj Shukla,Juan Diego Toscano,Zhicheng Wang,Zongren Zou,George Em Karniadakis
关键词: alternative representation model, Kolmogorov-Arnold Networks, deep operator networks, recently introduced, deep operator models
类目: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注:

点击查看摘要

Abstract:Kolmogorov-Arnold Networks (KANs) were recently introduced as an alternative representation model to MLP. Herein, we employ KANs to construct physics-informed machine learning models (PIKANs) and deep operator models (DeepOKANs) for solving differential equations for forward and inverse problems. In particular, we compare them with physics-informed neural networks (PINNs) and deep operator networks (DeepONets), which are based on the standard MLP representation. We find that although the original KANs based on the B-splines parameterization lack accuracy and efficiency, modified versions based on low-order orthogonal polynomials have comparable performance to PINNs and DeepONet although they still lack robustness as they may diverge for different random seeds or higher order orthogonal polynomials. We visualize their corresponding loss landscapes and analyze their learning dynamics using information bottleneck theory. Our study follows the FAIR principles so that other researchers can use our benchmarks to further advance this emerging topic.

[LG-108] Visual-Text Cross Alignment: Refining the Similarity Score in Vision-Language Models

链接: https://arxiv.org/abs/2406.02915
作者: Jinhao Li,Haopeng Li,Sarah Erfani,Lei Feng,James Bailey,Feng Liu
关键词: large language model, text descriptions generated, pre-trained vision-language model, finer text descriptions, query image
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 22 pages, 16 figures, published to ICML 2024

点击查看摘要

Abstract:It has recently been discovered that using a pre-trained vision-language model (VLM), e.g., CLIP, to align a whole query image with several finer text descriptions generated by a large language model can significantly enhance zero-shot performance. However, in this paper, we empirically find that the finer descriptions tend to align more effectively with local areas of the query image rather than the whole image, and then we theoretically validate this finding. Thus, we present a method called weighted visual-text cross alignment (WCA). This method begins with a localized visual prompting technique, designed to identify local visual areas within the query image. The local visual areas are then cross-aligned with the finer descriptions by creating a similarity matrix using the pre-trained VLM. To determine how well a query image aligns with each category, we develop a score function based on the weighted similarities in this matrix. Extensive experiments demonstrate that our method significantly improves zero-shot performance across various datasets, achieving results that are even comparable to few-shot learning methods.
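The score function over the patch-by-description similarity matrix can be sketched with cosine similarities and per-patch weights. The pooling rule below (weighted mean over all pairs) is an illustrative simplification of WCA, not the paper's exact formulation:

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def wca_score(patch_embs, text_embs, patch_weights):
    # Build the (patch x description) similarity matrix implicitly and
    # pool it with per-patch weights into a single category score.
    total, z = 0.0, 0.0
    for wgt, p in zip(patch_weights, patch_embs):
        for t in text_embs:
            total += wgt * cosine(p, t)
            z += wgt
    return total / z

patches = [[1.0, 0.0], [0.0, 1.0]]   # toy local-area embeddings
texts = [[1.0, 0.0]]                 # one finer text description
score = wca_score(patches, texts, patch_weights=[0.9, 0.1])
```

The category whose descriptions yield the highest pooled score wins, mirroring how the paper replaces whole-image alignment with weighted local alignment.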

[LG-109] Zeroth-Order Fine-Tuning of LLMs with Extreme Sparsity

链接: https://arxiv.org/abs/2406.02913
作者: Wentao Guo,Jikai Long,Yimeng Zeng,Zirui Liu,Xinyu Yang,Yide Ran,Jacob R. Gardner,Osbert Bastani,Christopher De Sa,Xiaodong Yu,Beidi Chen,Zhaozhuo Xu
关键词: fine-tuning Large Language, Large Language Models, Large Language, Zeroth-order optimization, fine-tuning Large
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Zeroth-order optimization (ZO) is a memory-efficient strategy for fine-tuning Large Language Models using only forward passes. However, the application of ZO fine-tuning in memory-constrained settings such as mobile phones and laptops is still challenging since full precision forward passes are infeasible. In this study, we address this limitation by integrating sparsity and quantization into ZO fine-tuning of LLMs. Specifically, we investigate the feasibility of fine-tuning an extremely small subset of LLM parameters using ZO. This approach allows the majority of un-tuned parameters to be quantized to accommodate the constraint of limited device memory. Our findings reveal that the pre-training process can identify a set of “sensitive parameters” that can guide the ZO fine-tuning of LLMs on downstream tasks. Our results demonstrate that fine-tuning 0.1% sensitive parameters in the LLM with ZO can outperform the full ZO fine-tuning performance, while offering wall-clock time speedup. Additionally, we show that ZO fine-tuning targeting these 0.1% sensitive parameters, combined with 4-bit quantization, enables efficient ZO fine-tuning of a Llama2-7B model on a GPU device with less than 8 GiB of memory and notably reduced latency.
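The forward-only gradient estimate behind ZO fine-tuning is the classic two-point (SPSA-style) estimator, here restricted to a few "sensitive" coordinates; the index set, step size, and toy loss are illustrative assumptions:

```python
import random

def zo_sparse_grad(loss_fn, theta, sensitive_idx, eps=1e-3, rng=None):
    # Perturb only the sensitive coordinates along one Gaussian direction z,
    # then form the estimate  (L(th + eps*z) - L(th - eps*z)) / (2*eps) * z.
    # Only two forward passes are needed; no backpropagation state is stored.
    rng = rng or random.Random(0)
    z = {i: rng.gauss(0.0, 1.0) for i in sensitive_idx}
    plus, minus = list(theta), list(theta)
    for i, zi in z.items():
        plus[i] += eps * zi
        minus[i] -= eps * zi
    scale = (loss_fn(plus) - loss_fn(minus)) / (2.0 * eps)
    grad = [0.0] * len(theta)
    for i, zi in z.items():
        grad[i] = scale * zi   # unbiased in expectation over z
    return grad

# Quadratic toy loss; only forward evaluations are ever used.
loss = lambda th: sum(v * v for v in th)
g = zo_sparse_grad(loss, [1.0, 2.0, 3.0], sensitive_idx=[0])
```

Untouched coordinates never receive updates, which is what lets the bulk of the model stay quantized in memory-constrained settings.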

[LG-110] Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms

链接: https://arxiv.org/abs/2406.02900
作者: Rafael Rafailov,Yaswanth Chittepu,Ryan Park,Harshit Sikchi,Joey Hejna,Bradley Knox,Chelsea Finn,Scott Niekum
关键词: Large Language Models, Large Language, Human Feedback, success of Large, Reinforcement Learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Reinforcement Learning from Human Feedback (RLHF) has been crucial to the recent success of Large Language Models (LLMs); however, it is often a complex and brittle process. In the classical RLHF framework, a reward model is first trained to represent human preferences, which is in turn used by an online reinforcement learning (RL) algorithm to optimize the LLM. A prominent issue with such methods is reward over-optimization, or reward hacking, where performance as measured by the learned proxy reward model increases, but true quality plateaus or even deteriorates. Direct Alignment Algorithms (DAAs) like Direct Preference Optimization have emerged as alternatives to the classical RLHF pipeline by circumventing the reward modeling phase. However, although DAAs do not use a separate proxy reward model, they still commonly deteriorate from over-optimization. While the so-called reward hacking phenomenon is not well-defined for DAAs, we still uncover similar trends: at higher KL budgets, DAA algorithms exhibit similar degradation patterns to their classic RLHF counterparts. In particular, we find that DAA methods deteriorate not only across a wide range of KL budgets but also often before even a single epoch of the dataset is completed. Through extensive empirical experimentation, this work formulates and formalizes the reward over-optimization or hacking problem for DAAs and explores its consequences across objectives, training regimes, and model scales.

[LG-111] A Bi-metric Framework for Fast Similarity Search

链接: https://arxiv.org/abs/2406.02891
作者: Haike Xu,Sandeep Silwal,Piotr Indyk
关键词: designing nearest neighbor, proxy metric, ground-truth metric, metric, neighbor data structures
类目: Information Retrieval (cs.IR); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We propose a new “bi-metric” framework for designing nearest neighbor data structures. Our framework assumes two dissimilarity functions: a ground-truth metric that is accurate but expensive to compute, and a proxy metric that is cheaper but less accurate. In both theory and practice, we show how to construct data structures using only the proxy metric such that the query procedure achieves the accuracy of the expensive metric, while only using a limited number of calls to both metrics. Our theoretical results instantiate this framework for two popular nearest neighbor search algorithms: DiskANN and Cover Tree. In both cases we show that, as long as the proxy metric used to construct the data structure approximates the ground-truth metric up to a bounded factor, our data structure achieves arbitrarily good approximation guarantees with respect to the ground-truth metric. On the empirical side, we apply the framework to the text retrieval problem with two dissimilarity functions evaluated by ML models with vastly different computational costs. We observe that for almost all data sets in the MTEB benchmark, our approach achieves a considerably better accuracy-efficiency tradeoff than the alternatives, such as re-ranking.
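The accuracy/cost tradeoff the framework formalizes can be sketched as a two-stage retrieval: shortlist with the cheap proxy, re-rank the shortlist with the expensive ground-truth metric. The actual paper builds DiskANN/Cover Tree indexes rather than the linear scan below; everything here is a toy illustration:

```python
def bi_metric_search(query, items, proxy, exact, k, overfetch=3):
    # Stage 1: shortlist k*overfetch candidates using the cheap proxy metric.
    shortlist = sorted(items, key=lambda x: proxy(query, x))[: k * overfetch]
    # Stage 2: re-rank only the shortlist with the expensive ground-truth
    # metric, so the number of expensive calls is bounded by k*overfetch.
    return sorted(shortlist, key=lambda x: exact(query, x))[:k]

items = [5.0, 1.0, 9.0, 2.0, 7.0]
top2 = bi_metric_search(3.0, items,
                        proxy=lambda q, x: abs(q - x),    # cheap metric
                        exact=lambda q, x: (q - x) ** 2,  # "expensive" metric
                        k=2, overfetch=2)
```

As long as the proxy preserves the neighborhood structure of the ground-truth metric up to a bounded factor, a modest overfetch recovers the exact top-k.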

[LG-112] Representation Learning For Efficient Deep Multi-Agent Reinforcement Learning

链接: https://arxiv.org/abs/2406.02890
作者: Dom Huh,Prasant Mohapatra
关键词: Sample efficiency remains, MARL, remains a key, key challenge, Latent Space Optimization
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Sample efficiency remains a key challenge in multi-agent reinforcement learning (MARL). A promising approach is to learn a meaningful latent representation space through auxiliary learning objectives alongside the MARL objective to aid in learning a successful control policy. In our work, we present MAPO-LSO (Multi-Agent Policy Optimization with Latent Space Optimization) which applies a form of comprehensive representation learning devised to supplement MARL training. Specifically, MAPO-LSO proposes a multi-agent extension of transition dynamics reconstruction and self-predictive learning that constructs a latent state optimization scheme that can be trivially extended to current state-of-the-art MARL algorithms. Empirical results demonstrate MAPO-LSO to show notable improvements in sample efficiency and learning performance compared to its vanilla MARL counterpart without any additional MARL hyperparameter tuning on a diverse suite of MARL tasks.

[LG-113] HYDRA: Model Factorization Framework for Black-Box LLM Personalization

链接: https://arxiv.org/abs/2406.02888
作者: Yuchen Zhuang,Haotian Sun,Yue Yu,Qifan Wang,Chao Zhang,Bo Dai
关键词: modern intelligent systems, delivering tailored experiences, critical research area, mining users’ behavioral, users’ behavioral history
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 24 pages, 6 figures, work in progress

点击查看摘要

Abstract:Personalization has emerged as a critical research area in modern intelligent systems, focusing on mining users’ behavioral history and adapting to their preferences for delivering tailored experiences. Despite the remarkable few-shot capabilities exhibited by black-box large language models (LLMs), the inherent opacity of their model parameters presents significant challenges in aligning the generated output with individual expectations. Existing solutions have primarily focused on prompt design to incorporate user-specific profiles and behaviors; however, such approaches often struggle to generalize effectively due to their inability to capture shared knowledge among all users. To address these challenges, we propose HYDRA, a model factorization framework that captures both user-specific behavior patterns from historical data and shared general knowledge among all users to deliver personalized generation. In order to capture user-specific behavior patterns, we first train a reranker to prioritize the most useful information from top-retrieved relevant historical records. By combining the prioritized history with the corresponding query, we train an adapter to align the output with individual user-specific preferences, eliminating the reliance on access to inherent model parameters of black-box LLMs. Both the reranker and the adapter can be decomposed into a base model with multiple user-specific heads, resembling a hydra. The base model maintains shared knowledge across users, while the multiple personal heads capture user-specific preferences. Experimental results demonstrate that HYDRA outperforms existing state-of-the-art prompt-based methods by an average relative improvement of 9.01% across five diverse personalization tasks in the LaMP benchmark. Our implementation is available at this https URL.
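The hydra-shaped factorization — one shared base plus many user-specific heads — can be sketched with toy linear maps. Parameter shapes, names, and the linear form are illustrative assumptions, not HYDRA's actual reranker/adapter architecture:

```python
def hydra_predict(x, shared_w, heads, user_id):
    # Shared base: captures knowledge common to all users (toy linear map).
    h = [shared_w * v for v in x]
    # User-specific head: adapts the shared representation to one user's
    # preferences; only this small head differs between users.
    head_w, head_b = heads[user_id]
    return sum(w * v for w, v in zip(head_w, h)) + head_b

heads = {"alice": ([1.0, 0.0], 0.0),   # per-user head parameters (toy)
         "bob":   ([0.0, 1.0], 1.0)}
ya = hydra_predict([2.0, 3.0], shared_w=0.5, heads=heads, user_id="alice")
yb = hydra_predict([2.0, 3.0], shared_w=0.5, heads=heads, user_id="bob")
```

The same input yields different outputs per user while the base parameters are trained once across everyone, which is the generalization benefit the paper argues prompt-only personalization lacks.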

[LG-114] Nonlinear Transformations Against Unlearnable Datasets

链接: https://arxiv.org/abs/2406.02883
作者: Thushari Hapuarachchi,Jing Lin,Kaiqi Xiong,Mohamed Rahouti,Gitte Ost
关键词: Automated scraping stands, Automated scraping, Tangent Generalization Attack, Neural Tangent Generalization, scraping stands
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Automated scraping stands out as a common method for collecting data in deep learning models without the authorization of data owners. Recent studies have begun to tackle the privacy concerns associated with this data collection method. Notable approaches include Deepconfuse, error-minimizing, error-maximizing (also known as adversarial poisoning), Neural Tangent Generalization Attack, synthetic, autoregressive, One-Pixel Shortcut, Self-Ensemble Protection, Entangled Features, Robust Error-Minimizing, Hypocritical, and TensorClog. The data generated by those approaches, called “unlearnable” examples, are meant to prevent deep learning models from “learning” them. In this research, we investigate and devise an effective nonlinear transformation framework and conduct extensive experiments to demonstrate that a deep neural network can effectively learn from the data/examples traditionally considered unlearnable produced by the above twelve approaches. The resulting approach improves the ability to break unlearnable data compared to the linear separable technique recently proposed by researchers. Specifically, our extensive experiments show that the improvement ranges from 0.34% to 249.59% for the unlearnable CIFAR10 datasets generated by those twelve data protection approaches, except for One-Pixel Shortcut. Moreover, the proposed framework achieves over 100% improvement of test accuracy for Autoregressive and REM approaches compared to the linear separable technique. Our findings suggest that these approaches are inadequate in preventing unauthorized uses of data in machine learning models. There is an urgent need to develop more robust protection mechanisms that effectively thwart an attacker from accessing data without proper authorization from the owners.

[LG-115] FedStaleWeight: Buffered Asynchronous Federated Learning with Fair Aggregation via Staleness Reweighting

链接: https://arxiv.org/abs/2406.02877
作者: Jeffrey Ma,Alan Tu,Yiling Chen,Vijay Janapa Reddi
关键词: Asynchronous Federated Learning, Federated Learning, harness decentralized data, endeavors to harness, preserving privacy
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Federated Learning (FL) endeavors to harness decentralized data while preserving privacy, facing challenges of performance, scalability, and collaboration. Asynchronous Federated Learning (AFL) methods have emerged as promising alternatives to their synchronous counterparts bounded by the slowest agent, yet they add additional challenges in convergence guarantees, fairness with respect to compute heterogeneity, and incorporation of staleness in aggregated updates. Specifically, AFL biases model training heavily towards agents who can produce updates faster, leaving slower agents behind, who often also have differently distributed data which is not learned by the global model. Naively upweighting introduces incentive issues, where true fast updating agents may falsely report updates at a slower speed to increase their contribution to model training. We introduce FedStaleWeight, an algorithm addressing fairness in aggregating asynchronous client updates by employing average staleness to compute fair re-weightings. FedStaleWeight reframes asynchronous federated learning aggregation as a mechanism design problem, devising a weighting strategy that incentivizes truthful compute speed reporting without favoring faster update-producing agents by upweighting agent updates based on staleness. Leveraging only observed agent update staleness, FedStaleWeight results in more equitable aggregation on a per-agent basis. We both provide theoretical convergence guarantees in the smooth, non-convex setting and empirically compare FedStaleWeight against the commonly used asynchronous FedBuff with gradient averaging, demonstrating how it achieves stronger fairness, expediting convergence to a higher global model accuracy. Finally, we provide an open-source test bench to facilitate exploration of buffered AFL aggregation strategies, fostering further research in asynchronous federated learning paradigms.
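The staleness-aware aggregation idea can be sketched by downweighting stale client updates before averaging. The exact reweighting function of FedStaleWeight differs; the 1/(1+αs) form and all values below only illustrate the mechanism of aggregating by staleness rather than by raw arrival speed:

```python
def staleness_weights(staleness, alpha=0.5):
    # Downweight very stale updates via 1/(1 + alpha*s), then normalize.
    # Toy stand-in for FedStaleWeight's average-staleness reweighting.
    raw = [1.0 / (1.0 + alpha * s) for s in staleness]
    z = sum(raw)
    return [r / z for r in raw]

def aggregate(updates, weights):
    # Weighted average of client updates (each a parameter vector).
    dim = len(updates[0])
    return [sum(w * u[i] for w, u in zip(weights, updates)) for i in range(dim)]

# Client 0 is fresh (staleness 0); client 1 is 4 rounds stale.
w = staleness_weights([0, 4])
new_global = aggregate([[1.0, 1.0], [3.0, -1.0]], w)
```

Because the weight depends only on observed staleness, not on a client's self-reported speed, there is no incentive to misreport compute speed, which is the mechanism-design point of the paper.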

[LG-116] Leveraging KANs For Enhanced Deep Koopman Operator Discovery

链接: https://arxiv.org/abs/2406.02875
作者: George Nehma,Madhur Tiwari
关键词: discovering Deep Koopman, linearizing nonlinear dynamics, MLP Deep Neural, Deep Koopman operators, MLP Neural Network
类目: Machine Learning (cs.LG); Dynamical Systems (math.DS); Applied Physics (physics.app-ph); Computational Physics (physics.comp-ph)
*备注: 6 pages, 4 figures, 2 tables

点击查看摘要

Abstract:Multi-layer perceptrons (MLPs) have been extensively utilized in discovering Deep Koopman operators for linearizing nonlinear dynamics. With the emergence of Kolmogorov-Arnold Networks (KANs) as a more efficient and accurate alternative to the MLP neural network, we compare the performance of each network type in the context of learning Koopman operators. In this work, we propose a KANs-based deep Koopman framework with applications to an orbital Two-Body Problem (2BP) and the pendulum for data-driven discovery of linear system dynamics. KANs were found to be superior in nearly all aspects of training: learning 31 times faster, being 15 times more parameter-efficient, and predicting 1.25 times more accurately than the MLP Deep Neural Networks (DNNs) in the case of the 2BP. Thus, KANs show potential as an efficient tool in the development of Deep Koopman Theory.

[LG-117] Combinatorial Optimization with Automated Graph Neural Networks

链接: https://arxiv.org/abs/2406.02872
作者: Yang Liu,Peng Zhang,Yang Gao,Chuan Zhou,Zhao Li,Hongyang Chen
关键词: maximum independent set, maximum cut, maximum independent, recent years, independent set
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 9 pages

点击查看摘要

Abstract:In recent years, graph neural networks (GNNs) have become increasingly popular for solving NP-hard combinatorial optimization (CO) problems, such as maximum cut and maximum independent set. The core idea behind these methods is to represent a CO problem as a graph and then use GNNs to learn the node/graph embedding with combinatorial information. Although these methods have achieved promising results, given a specific CO problem, the design of GNN architectures still requires heavy manual work with domain knowledge. Existing automated GNNs are mostly focused on traditional graph learning problems, making them inapplicable to solving NP-hard CO problems. To this end, we present a new class of AUTOmated GNNs for solving NP-hard problems, namely AutoGNP. We represent CO problems by GNNs and focus on two specific problems, i.e., mixed integer linear programming and quadratic unconstrained binary optimization. The idea of AutoGNP is to use graph neural architecture search algorithms to automatically find the best GNNs for a given NP-hard combinatorial optimization problem. Compared with existing graph neural architecture search algorithms, AutoGNP utilizes two-hop operators in the architecture search space. Moreover, AutoGNP utilizes simulated annealing and a strict early stopping policy to avoid local optimal solutions. Empirical results on benchmark combinatorial problems demonstrate the superiority of our proposed model.
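Quadratic unconstrained binary optimization (QUBO), one of the two problem classes targeted here, minimizes x^T Q x over binary x; the abstract also mentions simulated annealing as a way to escape local optima. A generic single-bit-flip annealing sketch for a tiny QUBO (this illustrates the annealing idea in general, not AutoGNP's architecture search procedure):

```python
import numpy as np

def simulated_annealing_qubo(Q, n_steps=2000, t0=1.0, t1=0.01, seed=0):
    """Minimize the QUBO objective x^T Q x over binary x using
    single-bit-flip simulated annealing with geometric cooling."""
    rng = np.random.default_rng(seed)
    n = Q.shape[0]
    x = rng.integers(0, 2, n)
    energy = x @ Q @ x
    best_x, best_e = x.copy(), energy
    for step in range(n_steps):
        t = t0 * (t1 / t0) ** (step / n_steps)  # geometric cooling schedule
        x_new = x.copy()
        x_new[rng.integers(n)] ^= 1             # flip one randomly chosen bit
        e_new = x_new @ Q @ x_new
        # Always accept improvements; accept uphill moves with Boltzmann prob.
        if e_new < energy or rng.random() < np.exp((energy - e_new) / t):
            x, energy = x_new, e_new
            if energy < best_e:
                best_x, best_e = x.copy(), energy
    return best_x, best_e

# Max-cut on a triangle encoded as a QUBO (energy = -cut size):
Q = np.array([[-2, 1, 1],
              [ 1,-2, 1],
              [ 1, 1,-2]])
x_best, e_best = simulated_annealing_qubo(Q)
print(x_best, e_best)  # any 1-vs-2 partition of the triangle, energy -2
```

The occasional acceptance of uphill moves (probability exp(-ΔE/t), shrinking as the temperature cools) is exactly what lets the search leave local minima that greedy bit-flipping would get stuck in.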

[LG-118] Oscillations enhance time-series prediction in reservoir computing with feedback

链接: https://arxiv.org/abs/2406.02867
作者: Yuji Kawai,Takashi Morita,Jihoon Park,Minoru Asada
关键词: minimal computational resources, machine learning framework, predict temporal data, modeling the brain, computational resources
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:Reservoir computing, a machine learning framework used for modeling the brain, can predict temporal data with few observations and minimal computational resources. However, it is difficult to accurately reproduce the long-term target time series because the reservoir system becomes unstable. This predictive capability is required for a wide variety of time-series processing, including predictions of motor timing and chaotic dynamical systems. This study proposes oscillation-driven reservoir computing (ODRC) with feedback, where oscillatory signals are fed into a reservoir network to stabilize the network activity and induce complex reservoir dynamics. The ODRC can reproduce long-term target time series more accurately than conventional reservoir computing methods in motor timing and chaotic time-series prediction tasks. Furthermore, it generates a time series similar to the target in the unexperienced period; that is, it can learn the abstract generative rules from limited observations. Given these significant improvements made by the simple and computationally inexpensive implementation, the ODRC would serve as a practical model of various time series data. Moreover, we discuss the biological implications of the ODRC, considering it as a model of neural oscillations and their cerebellar processors.
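The basic reservoir-computing pipeline (random recurrent network plus a trained linear readout) with an additional oscillatory drive can be sketched as follows. This is a plain teacher-forced echo state network with a sinusoidal input standing in for the ODRC's oscillatory signals; the paper's actual model additionally uses output feedback:

```python
import numpy as np

rng = np.random.default_rng(1)
n_res, T = 100, 1000

# Target: a simple periodic series the readout must reproduce.
target = np.sin(0.1 * np.arange(T + 1))

# Random recurrent reservoir, scaled to spectral radius < 1 for stability.
W = rng.normal(size=(n_res, n_res))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))
w_in  = rng.normal(size=n_res)   # input weights (previous target value)
w_osc = rng.normal(size=n_res)   # weights for the oscillatory drive

# Run the reservoir; the sinusoid plays the role of the stabilizing
# oscillation fed into the network in the ODRC idea.
states = np.zeros((T, n_res))
x = np.zeros(n_res)
for t in range(T):
    osc = np.sin(0.05 * t)
    x = np.tanh(W @ x + w_in * target[t] + w_osc * osc)
    states[t] = x

# Ridge-regression readout mapping reservoir state -> next target value.
lam = 1e-6
w_out = np.linalg.solve(states.T @ states + lam * np.eye(n_res),
                        states.T @ target[1:T + 1])
pred = states @ w_out
err = np.mean((pred - target[1:T + 1])**2)
print(err)  # small training error
```

Only the readout weights `w_out` are trained; the recurrent weights stay fixed, which is what keeps reservoir computing so computationally cheap.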

[LG-119] TSPDiffuser: Diffusion Models as Learned Samplers for Traveling Salesperson Path Planning Problems

链接: https://arxiv.org/abs/2406.02858
作者: Ryo Yonetani
关键词: data-driven path planner, salesperson path planning, paper presents TSPDiffuser, path planning problems, paper presents
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper presents TSPDiffuser, a novel data-driven path planner for traveling salesperson path planning problems (TSPPPs) in environments rich with obstacles. Given a set of destinations within obstacle maps, our objective is to efficiently find the shortest possible collision-free path that visits all the destinations. In TSPDiffuser, we train a diffusion model on a large collection of TSPPP instances and their respective solutions to generate plausible paths for unseen problem instances. The model can then be employed as a learned sampler to construct a roadmap that contains potential solutions with a small number of nodes and edges. This approach enables efficient and accurate estimation of traveling costs between destinations, effectively addressing the primary computational challenge in solving TSPPPs. Experimental evaluations with diverse synthetic and real-world indoor/outdoor environments demonstrate the effectiveness of TSPDiffuser over existing methods in terms of the trade-off between solution quality and computational time requirements.

[LG-120] Exact Conversion of In-Context Learning to Model Weights

链接: https://arxiv.org/abs/2406.02847
作者: Brian K Chen,Tianyang Hu,Hui Jin,Hwee Kuan Lee,Kenji Kawaguchi
关键词: powerful emergent property, attracted increasing attention, large language models, In-Context Learning, ICL
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted to ICML 2024

点击查看摘要

Abstract:In-Context Learning (ICL) has been a powerful emergent property of large language models that has attracted increasing attention in recent years. In contrast to regular gradient-based learning, ICL is highly interpretable and does not require parameter updates. In this paper, we show that, for linearized transformer networks, ICL can be made explicit and permanent through the inclusion of bias terms. We mathematically demonstrate the equivalence between a model with ICL demonstration prompts and the same model with the additional bias terms. Our algorithm (ICLCA) allows for exact conversion in an inexpensive manner. Existing methods are not exact and require expensive parameter updates. We demonstrate the efficacy of our approach through experiments that show the exact incorporation of ICL tokens into a linear transformer. We further suggest how our method can be adapted to achieve cheap approximate conversion of ICL tokens, even in regular transformer networks that are not linearized. Our experiments on GPT-2 show that, even though the conversion is only approximate, the model still gains valuable context from the included bias terms.
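For linear attention (no softmax), the equivalence the paper exploits can be checked numerically: prepending context tokens C changes the output only through a fixed matrix that can be folded into the layer permanently. This toy uses a single head with no output projection or positional encodings, a simplification of the linearized-transformer setting:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
C = rng.normal(size=(5, d))   # 5 in-context demonstration tokens
x = rng.normal(size=(1, d))   # query token

def linear_attention(tokens, query):
    # Linear attention: O = Q K^T V, with Q from the query and K, V from
    # all tokens (no softmax, so everything stays linear in the keys/values).
    Q, K, V = query @ Wq, tokens @ Wk, tokens @ Wv
    return Q @ K.T @ V

with_context = linear_attention(np.vstack([C, x]), x)
without_context = linear_attention(x, x)
# The entire effect of the context is the fixed matrix Wq Wk^T C^T C Wv,
# i.e. an input-dependent "bias" that can be merged into the weights.
bias_term = x @ (Wq @ Wk.T @ C.T @ C @ Wv)

print(np.allclose(with_context, without_context + bias_term))  # True
```

Expanding K^T V over the concatenated tokens gives Wk^T (C^T C + x^T x) Wv; the C^T C part is independent of the query, which is why the conversion can be exact and cheap for linearized transformers and only approximate for regular ones.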

[LG-121] Conditional Idempotent Generative Networks

链接: https://arxiv.org/abs/2406.02841
作者: Niccolò Ronchetti
关键词: Idempotent Generative Networks, Conditional Idempotent Generative, propose Conditional Idempotent, Generative Networks, Idempotent Generative
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 22 pages, 8 figures

点击查看摘要

Abstract:We propose Conditional Idempotent Generative Networks (CIGN), a novel approach that expands upon Idempotent Generative Networks (IGN) to enable conditional generation. While IGNs offer efficient single-pass generation, they lack the ability to control the content of the generated data. CIGNs address this limitation by incorporating conditioning mechanisms, allowing users to steer the generation process towards specific types of data. We establish the theoretical foundations for CIGNs, outlining their scope, loss function design, and evaluation metrics. We then present two potential architectures for implementing CIGNs: channel conditioning and filter conditioning. Finally, we discuss experimental results on the MNIST dataset, demonstrating the effectiveness of both approaches. Our findings pave the way for further exploration of CIGNs on larger datasets and with more powerful computing resources to determine the optimal implementation strategy.

[LG-122] Efficient Minimum Bayes Risk Decoding using Low-Rank Matrix Completion Algorithms

链接: https://arxiv.org/abs/2406.02832
作者: Firas Trabelsi,David Vilar,Mara Finkelstein,Markus Freitag
关键词: Minimum Bayes Risk, Minimum Bayes, Bayes Risk, quadratic computational complexity, computational complexity limits
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Minimum Bayes Risk (MBR) decoding is a powerful decoding strategy widely used for text generation tasks, but its quadratic computational complexity limits its practical application. This paper presents a novel approach for approximating MBR decoding using matrix completion techniques, focusing on the task of machine translation. We formulate MBR decoding as a matrix completion problem, where the utility metric scores between candidate hypotheses and pseudo-reference translations form a low-rank matrix. First, we empirically show that the scores matrices indeed have a low-rank structure. Then, we exploit this by only computing a random subset of the scores and efficiently recover the missing entries in the matrix by applying the Alternating Least Squares (ALS) algorithm, thereby enabling a fast approximation of the MBR decoding process. Our experimental results on machine translation tasks demonstrate that the proposed method requires 1/16 utility metric computations compared to vanilla MBR decoding while achieving equal translation quality measured by COMET22 on the WMT22 dataset (ende and enru). We also benchmark our method against other approximation methods and we show gains in quality when comparing to them.
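The ALS recovery step can be illustrated on synthetic data: build a low-rank "utility score" matrix, observe only a random quarter of its entries, and alternate least-squares solves for the two factors. A minimal sketch of the technique, not the paper's MT-specific pipeline:

```python
import numpy as np

def als_complete(M, mask, rank=3, n_iters=30, lam=1e-2, seed=0):
    """Recover a low-rank matrix from partial observations via Alternating
    Least Squares: M ~ U V^T, solving a small ridge-regression for each row
    of U and of V in turn, using only the observed entries (mask == True)."""
    rng = np.random.default_rng(seed)
    m, n = M.shape
    U = rng.normal(scale=0.1, size=(m, rank))
    V = rng.normal(scale=0.1, size=(n, rank))
    reg = lam * np.eye(rank)
    for _ in range(n_iters):
        for i in range(m):                       # update each row of U
            o = mask[i]
            U[i] = np.linalg.solve(V[o].T @ V[o] + reg, V[o].T @ M[i, o])
        for j in range(n):                       # update each row of V
            o = mask[:, j]
            V[j] = np.linalg.solve(U[o].T @ U[o] + reg, U[o].T @ M[o, j])
    return U @ V.T

# Synthetic rank-3 score matrix (candidates x pseudo-references), with only
# ~1/4 of the utility-metric scores actually computed.
rng = np.random.default_rng(1)
A = rng.normal(size=(60, 3)) @ rng.normal(size=(3, 40))
mask = rng.random(A.shape) < 0.25
A_hat = als_complete(A, mask, rank=3)
rel_err = np.linalg.norm(A_hat - A) / np.linalg.norm(A)
print(rel_err)  # small relative error despite ~75% missing entries
```

Once the full matrix is approximated, row sums give each candidate's expected utility and MBR decoding proceeds as usual, with only the observed fraction of metric computations actually paid for.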

[LG-123] Stochastic Diffusion: A Diffusion Probabilistic Model for Stochastic Time Series Forecasting

链接: https://arxiv.org/abs/2406.02827
作者: Yuansan Liu,Sudanthi Wijewickrema,Dongting Hu,Christofer Bester,Stephen O’Leary,James Bailey
关键词: stochastic time series, time series data, Recent innovations, time series, generative time series
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 15 pages, 4 figures

点击查看摘要

Abstract:Recent innovations in diffusion probabilistic models have paved the way for significant progress in image, text and audio generation, leading to their applications in generative time series forecasting. However, leveraging such abilities to model highly stochastic time series data remains a challenge. In this paper, we propose a novel Stochastic Diffusion (StochDiff) model which learns data-driven prior knowledge at each time step by utilizing the representational power of the stochastic latent spaces to model the variability of the multivariate time series data. The learnt prior knowledge helps the model to capture complex temporal dynamics and the inherent uncertainty of the data. This improves its ability to model highly stochastic time series data. Through extensive experiments on real-world datasets, we demonstrate the effectiveness of our proposed model on stochastic time series forecasting. Additionally, we showcase an application of our model for real-world surgical guidance, highlighting its potential to benefit the medical community.

[LG-124] Exploring Robustness in Doctor-Patient Conversation Summarization: An Analysis of Out-of-Domain SOAP Notes

链接: https://arxiv.org/abs/2406.02826
作者: Yu-Wen Chen,Julia Hirschberg
关键词: Summarizing medical conversations, poses unique challenges, unique challenges due, collecting in-domain training, in-domain training data
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: Clinical NLP Workshop 2024

点击查看摘要

Abstract:Summarizing medical conversations poses unique challenges due to the specialized domain and the difficulty of collecting in-domain training data. In this study, we investigate the performance of state-of-the-art doctor-patient conversation generative summarization models on out-of-domain data. We divide the summarization model of doctor-patient conversation into two configurations: (1) a general model, without specifying subjective (S), objective (O), assessment (A), and plan (P) notes; (2) a SOAP-oriented model that generates a summary with SOAP sections. We analyzed the limitations and strengths of the fine-tuned language model-based methods and GPTs on both configurations. We also conducted a Linguistic Inquiry and Word Count analysis to compare the SOAP notes from different datasets. The results exhibit a strong correlation for reference notes across different datasets, indicating that format mismatch (i.e., discrepancies in word distribution) is not the main cause of performance decline on out-of-domain data. Lastly, a detailed analysis of SOAP notes is included to provide insights into missing information and hallucinations introduced by the models.

[LG-125] ORACLE: Leveraging Mutual Information for Consistent Character Generation with LoRAs in Diffusion Models

链接: https://arxiv.org/abs/2406.02820
作者: Kiymet Akdemir,Pinar Yanardag
关键词: comic book artistry, promoting visual creativity, children literature, game development, book artistry
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Text-to-image diffusion models have recently taken center stage as pivotal tools in promoting visual creativity across an array of domains such as comic book artistry, children’s literature, game development, and web design. These models harness the power of artificial intelligence to convert textual descriptions into vivid images, thereby enabling artists and creators to bring their imaginative concepts to life with unprecedented ease. However, one of the significant hurdles that persist is the challenge of maintaining consistency in character generation across diverse contexts. Variations in textual prompts, even if minor, can yield vastly different visual outputs, posing a considerable problem in projects that require a uniform representation of characters throughout. In this paper, we introduce a novel framework designed to produce consistent character representations from a single text prompt across diverse settings. Through both quantitative and qualitative analyses, we demonstrate that our framework outperforms existing methods in generating characters with consistent visual identities, underscoring its potential to transform creative industries. By addressing the critical challenge of character consistency, we not only enhance the practical utility of these models but also broaden the horizons for artistic and creative expression.

[LG-126] Randomized Geometric Algebra Methods for Convex Neural Networks

链接: https://arxiv.org/abs/2406.02806
作者: Yifei Wang,Sungyoon Kim,Paul Chu,Indu Subramaniam,Mert Pilanci
关键词: generalizing randomized linear, hypercomplex vector spaces, introduce randomized algorithms, Clifford Geometric Algebra, randomized linear algebra
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We introduce randomized algorithms to Clifford’s Geometric Algebra, generalizing randomized linear algebra to hypercomplex vector spaces. This novel approach has many implications in machine learning, including training neural networks to global optimality via convex optimization. Additionally, we consider fine-tuning large language model (LLM) embeddings as a key application area, exploring the intersection of geometric algebra and modern AI techniques. In particular, we conduct a comparative analysis of the robustness of transfer learning via embeddings, such as OpenAI GPT models and BERT, using traditional methods versus our novel approach based on convex optimization. We test our convex optimization transfer learning method across a variety of case studies, employing different embeddings (GPT-4 and BERT embeddings) and different text classification datasets (IMDb, Amazon Polarity Dataset, and GLUE) with a range of hyperparameter settings. Our results demonstrate that convex optimization and geometric algebra not only enhances the performance of LLMs but also offers a more stable and reliable method of transfer learning via embeddings.

[LG-127] ACCORD: Closing the Commonsense Measurability Gap

链接: https://arxiv.org/abs/2406.02804
作者: François Roewer-Després,Jinyue Feng,Zining Zhu,Frank Rudzicz
关键词: large language models, multi-hop counterfactuals, ACCORD, language models, texttt
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: For leaderboard and dataset download, see this https URL For source code, see this https URL

点击查看摘要

Abstract:We present ACCORD, a framework and benchmark suite for disentangling the commonsense grounding and reasoning abilities of large language models (LLMs) through controlled, multi-hop counterfactuals. ACCORD introduces formal elements to commonsense reasoning to explicitly control and quantify reasoning complexity beyond the typical 1 or 2 hops. Uniquely, ACCORD can automatically generate benchmarks of arbitrary reasoning complexity, and so it scales with future LLM improvements. Benchmarking state-of-the-art LLMs – including GPT-4o (2024-05-13), Llama-3-70B-Instruct, and Mixtral-8x22B-Instruct-v0.1 – shows performance degrading to random chance with only moderate scaling, leaving substantial headroom for improvement. We release a leaderboard of the benchmark suite tested in this work, as well as code for automatically generating more complex benchmarks.

[LG-128] Auditing Privacy Mechanisms via Label Inference Attacks

链接: https://arxiv.org/abs/2406.02797
作者: Róbert István Busa-Fekete,Travis Dick,Claudio Gentile,Andrés Muñoz Medina,Adam Smith,Marika Swanberg
关键词: propose reconstruction advantage, label privatization mechanisms, reconstruction advantage measures, reconstruction advantage, audit label privatization
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:We propose reconstruction advantage measures to audit label privatization mechanisms. A reconstruction advantage measure quantifies the increase in an attacker’s ability to infer the true label of an unlabeled example when provided with a private version of the labels in a dataset (e.g., aggregate of labels from different users or noisy labels output by randomized response), compared to an attacker that only observes the feature vectors, but may have prior knowledge of the correlation between features and labels. We consider two such auditing measures: one additive, and one multiplicative. These incorporate previous approaches taken in the literature on empirical auditing and differential privacy. The measures allow us to place a variety of proposed privatization schemes – some differentially private, some not – on the same footing. We analyze these measures theoretically under a distributional model which encapsulates reasonable adversarial settings. We also quantify their behavior empirically on real and simulated prediction tasks. Across a range of experimental settings, we find that differentially private schemes dominate or match the privacy-utility tradeoff of more heuristic approaches.

[LG-129] Building Socially-Equitable Public Models

链接: https://arxiv.org/abs/2406.02790
作者: Yejia Liu,Jianyi Yang,Pengfei Li,Tongxin Li,Shaolei Ren
关键词: Public models offer, downstream agents, showcasing their proficiency, public model, models offer predictions
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注: Accepted by the ICML 2024

点击查看摘要

Abstract:Public models offer predictions to a variety of downstream tasks and have played a crucial role in various AI applications, showcasing their proficiency in accurate predictions. However, the exclusive emphasis on prediction accuracy may not align with the diverse end objectives of downstream agents. Recognizing the public model’s predictions as a service, we advocate for integrating the objectives of downstream agents into the optimization process. Concretely, to address performance disparities and foster fairness among heterogeneous agents in training, we propose a novel Equitable Objective. This objective, coupled with a policy gradient algorithm, is crafted to train the public model to produce a more equitable/uniform performance distribution across downstream agents, each with their unique concerns. Both theoretical analysis and empirical case studies have proven the effectiveness of our method in advancing performance equity across diverse downstream agents utilizing the public model for their decision-making. Codes and datasets are released at this https URL.

[LG-130] Private Stochastic Convex Optimization with Heavy Tails: Near-Optimality from Simple Reductions

链接: https://arxiv.org/abs/2406.02789
作者: Hilal Asi,Daogao Liu,Kevin Tian
关键词: stochastic convex optimization, sample Lipschitz constants, differentially private stochastic, private stochastic convex, Lipschitz constants
类目: Data Structures and Algorithms (cs.DS); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We study the problem of differentially private stochastic convex optimization (DP-SCO) with heavy-tailed gradients, where we assume a $k^{\text{th}}$-moment bound on the Lipschitz constants of sample functions rather than a uniform bound. We propose a new reduction-based approach that enables us to obtain the first optimal rates (up to logarithmic factors) in the heavy-tailed setting, achieving error $G_2 \cdot \frac{1}{\sqrt{n}} + G_k \cdot (\frac{\sqrt{d}}{n\epsilon})^{1-\frac{1}{k}}$ under $(\epsilon, \delta)$-approximate differential privacy, up to a mild $\mathrm{polylog}(\frac{1}{\delta})$ factor, where $G_2^2$ and $G_k^k$ are the $2^{\text{nd}}$ and $k^{\text{th}}$ moment bounds on sample Lipschitz constants, nearly matching a lower bound of [Lowy and Razaviyayn 2023]. We further give a suite of private algorithms in the heavy-tailed setting which improve upon our basic result under additional assumptions, including an optimal algorithm under a known-Lipschitz-constant assumption, a near-linear time algorithm for smooth functions, and an optimal linear time algorithm for smooth generalized linear models.

[LG-131] Disentangling Logic: The Role of Context in Large Language Model Reasoning Capabilities

链接: https://arxiv.org/abs/2406.02787
作者: Wenyue Hua,Kaijie Zhu,Lingyao Li,Lizhou Fan,Shuhang Lin,Mingyu Jin,Haochen Xue,Zelong Li,JinDong Wang,Yongfeng Zhang
关键词: systematically disentangle pure, disentangle pure logic, study intends, intends to systematically, systematically disentangle
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 22 pages, 9 figures

点击查看摘要

Abstract:This study intends to systematically disentangle pure logic reasoning and text understanding by investigating the contrast across abstract and contextualized logical problems from a comprehensive set of domains. We explore whether LLMs demonstrate genuine reasoning capabilities across various domains when the underlying logical structure remains constant. We focus on two main questions: (1) Can abstract logical problems alone accurately benchmark an LLM’s reasoning ability in real-world scenarios, disentangled from contextual support in practical settings? (2) Does fine-tuning LLMs on abstract logic problems generalize to contextualized logic problems and vice versa? To investigate these questions, we focus on standard propositional logic, specifically propositional deductive and abductive logic reasoning. In particular, we construct instantiated datasets for deductive and abductive reasoning with 4 levels of difficulty, encompassing 12 distinct categories or domains based on the categorization of Wikipedia. Our experiments aim to provide insights into disentangling context in logical reasoning and the true reasoning capabilities of LLMs and their generalization potential. The code and dataset are available at: this https URL.

[LG-132] LADI v2: Multi-label Dataset and Classifiers for Low-Altitude Disaster Imagery

链接: https://arxiv.org/abs/2406.02780
作者: Samuel Scheele,Katherine Picchione,Jeffrey Liu
关键词: promising tools, tools for supporting, operations following natural, Low Altitude Disaster, supporting emergency management
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:ML-based computer vision models are promising tools for supporting emergency management operations following natural disasters. Aerial photographs taken from small manned and unmanned aircraft can be available soon after a disaster and provide valuable information from multiple perspectives for situational awareness and damage assessment applications. However, emergency managers often face challenges finding the most relevant photos among the tens of thousands that may be taken after an incident. While ML-based solutions could enable more effective use of aerial photographs, there is still a lack of training data for imagery of this type from multiple perspectives and for multiple hazard types. To address this, we present the LADI v2 (Low Altitude Disaster Imagery version 2) dataset, a curated set of about 10,000 disaster images captured in the United States by the Civil Air Patrol (CAP) in response to federally-declared emergencies (2015-2023) and annotated for multi-label classification by trained CAP volunteers. We also provide two pretrained baseline classifiers and compare their performance to state-of-the-art vision-language models in multi-label classification. The data and code are released publicly to support the development of computer vision models for emergency management research and applications.

[LG-133] MS-IMAP – A Multi-Scale Graph Embedding Approach for Interpretable Manifold Learning

链接: https://arxiv.org/abs/2406.02778
作者: Shay Deutsch,Lionel Yelibi,Alex Tong Lin,Arjun Ravi Kannan
关键词: machine learning applications, Deriving meaningful representations, diverse machine learning, representations from complex, high-dimensional data
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deriving meaningful representations from complex, high-dimensional data in unsupervised settings is crucial across diverse machine learning applications. This paper introduces a framework for multi-scale graph network embedding based on spectral graph wavelets that employs a contrastive learning approach. A significant feature of the proposed embedding is its capacity to establish a correspondence between the embedding space and the input feature space which aids in deriving feature importance of the original features. We theoretically justify our approach and demonstrate that, in Paley-Wiener spaces on combinatorial graphs, the spectral graph wavelets operator offers greater flexibility and better control over smoothness properties compared to the Laplacian operator. We validate the effectiveness of our proposed graph embedding on a variety of public datasets through a range of downstream tasks, including clustering and unsupervised feature importance.
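The spectral graph wavelet operator at the heart of this construction is $\Psi_s = U\, g(s\Lambda)\, U^T$, where $L = U \Lambda U^T$ is the Laplacian eigendecomposition and $g$ is a band-pass kernel; the scale $s$ gives the multi-scale control mentioned above. A minimal sketch with the common kernel $g(x) = x e^{-x}$ (this illustrates the operator itself, not the MS-IMAP embedding pipeline):

```python
import numpy as np

def graph_wavelet_operator(adj, scale):
    """Spectral graph wavelet operator Psi_s = U g(s*Lambda) U^T for the
    band-pass kernel g(x) = x * exp(-x), built from the eigendecomposition
    of the combinatorial graph Laplacian L = D - A."""
    L = np.diag(adj.sum(1)) - adj
    lam, U = np.linalg.eigh(L)
    g = (scale * lam) * np.exp(-scale * lam)   # band-pass: g(0) = 0
    return U @ np.diag(g) @ U.T

# Path graph on 10 nodes; column j of Psi_s is a wavelet localized at node j.
n = 10
adj = np.zeros((n, n))
for i in range(n - 1):
    adj[i, i + 1] = adj[i + 1, i] = 1

psi_fine   = graph_wavelet_operator(adj, scale=0.5)  # small scale: local detail
psi_coarse = graph_wavelet_operator(adj, scale=5.0)  # large scale: broad structure
print(psi_fine.shape)  # (10, 10)
```

Because $g(0)=0$, constant signals are annihilated at every scale, and varying $s$ reshapes the spectral response; this is the extra flexibility over the fixed Laplacian operator that the paper's theory formalizes in Paley-Wiener spaces.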

[LG-134] Diagnostic Digital Twin for Anomaly Detection in Floating Offshore Wind Energy

链接: https://arxiv.org/abs/2406.02775
作者: Florian Stadtmann,Adil Rasheed
关键词: diagnostic digital twin, diagnostic digital, digital twin, digital, rising across industries
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:The demand for condition-based and predictive maintenance is rising across industries, especially for remote, high-value, and high-risk assets. In this article, the diagnostic digital twin concept is introduced, discussed, and implemented for a floating offshore turbine. A diagnostic digital twin is a virtual representation of an asset that combines real-time data and models to monitor damage, detect anomalies, and diagnose failures, thereby enabling condition-based and predictive maintenance. By applying diagnostic digital twins to offshore assets, unexpected failures can be alleviated, but the implementation can prove challenging. Here, a diagnostic digital twin is implemented for an operational floating offshore wind turbine. The asset is monitored through measurements. Unsupervised learning methods are employed to build a normal operation model, detect anomalies, and provide a fault diagnosis. Warnings and diagnoses are sent through text messages, and a more detailed diagnosis can be accessed in a virtual reality interface. The diagnostic digital twin successfully detected an anomaly with high confidence hours before a failure occurred. The paper concludes by discussing diagnostic digital twins in the broader context of offshore engineering. The presented approach can be generalized to other offshore assets to improve maintenance and increase the lifetime, efficiency, and sustainability of offshore assets.
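The "normal operation model + anomaly score" loop can be sketched very simply. The article's twin uses unsupervised learning on real turbine measurements; the version below is a generic stand-in that fits a Gaussian to healthy sensor data and flags samples by Mahalanobis distance, with the sensor values and threshold chosen purely for illustration:

```python
import numpy as np

# Hypothetical healthy operating data for two sensors (e.g. a vibration
# level around 10 and a temperature around 50) -- illustrative values only.
rng = np.random.default_rng(0)
healthy = rng.normal(loc=[10.0, 50.0], scale=[1.0, 2.0], size=(5000, 2))

# Normal-operation model: sample mean and covariance of healthy data.
mu = healthy.mean(0)
cov_inv = np.linalg.inv(np.cov(healthy.T))

def anomaly_score(x):
    d = x - mu
    return float(d @ cov_inv @ d)        # squared Mahalanobis distance

# Alarm threshold = 99.9th percentile of scores on the healthy data itself.
threshold = np.quantile([anomaly_score(x) for x in healthy], 0.999)

normal_sample = np.array([10.2, 49.0])
faulty_sample = np.array([16.0, 62.0])   # simultaneous vibration/temperature spike
print(anomaly_score(normal_sample) < threshold,
      anomaly_score(faulty_sample) > threshold)  # True True
```

In a deployed diagnostic twin this scoring runs continuously on streaming measurements, and a sustained threshold crossing triggers the warning messages and the more detailed diagnosis described above.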

[LG-135] Cyclic Sparse Training: Is it Enough?

链接: https://arxiv.org/abs/2406.02773
作者: Advait Gadhikar,Sree Harsha Nelaturu,Rebekka Burkholz
关键词: implicit regularization induced, repeated cyclic training, repeated cyclic, iterative pruning methods, cyclic training
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The success of iterative pruning methods in achieving state-of-the-art sparse networks has largely been attributed to improved mask identification and an implicit regularization induced by pruning. We challenge this hypothesis and instead posit that their repeated cyclic training schedules enable improved optimization. To verify this, we show that pruning at initialization is significantly boosted by repeated cyclic training, even outperforming standard iterative pruning methods. We conjecture that the dominant mechanism behind this is a better exploration of the loss landscape, leading to a lower training loss. However, at high sparsity, repeated cyclic training alone is not enough for competitive performance. A strong coupling between the learnt parameter initialization and the mask seems to be required. Standard methods obtain this coupling via expensive pruning-training iterations, starting from a dense network. To achieve this with sparse training instead, we propose SCULPT-ing, i.e., repeated cyclic training of any sparse mask followed by a single pruning step to couple the parameters and the mask, which is able to match the performance of state-of-the-art iterative pruning methods in the high sparsity regime at reduced computational cost.

[LG-136] Hyperbolic Benchmarking Unveils Network Topology-Feature Relationship in GNN Performance

链接: https://arxiv.org/abs/2406.02772
作者: Roya Aliakbarisani,Robert Jankowski,M. Ángeles Serrano,Marián Boguñá
关键词: Graph Neural Networks, Graph Neural, Neural Networks, malware detection, excelled in predicting
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph Neural Networks (GNNs) have excelled in predicting graph properties in various applications ranging from identifying trends in social networks to drug discovery and malware detection. With the abundance of new architectures and increased complexity, GNNs are becoming highly specialized when tested on a few well-known datasets. However, how the performance of GNNs depends on the topological and feature properties of graphs is still an open question. In this work, we introduce a comprehensive benchmarking framework for graph machine learning, focusing on the performance of GNNs across varied network structures. Utilizing the geometric soft configuration model in hyperbolic space, we generate synthetic networks with realistic topological properties and node feature vectors. This approach enables us to assess the impact of network properties, such as topology-feature correlation, degree distributions, local density of triangles (or clustering), and homophily, on the effectiveness of different GNN architectures. Our results highlight the dependency of model performance on the interplay between network structure and node features, providing insights for model selection in various scenarios. This study contributes to the field by offering a versatile tool for evaluating GNNs, thereby assisting in developing and selecting suitable models based on specific data characteristics.

[LG-137] Improved context-sensitive transformer model for inland vessel trajectory prediction

链接: https://arxiv.org/abs/2406.02771
作者: Kathrin Donandt,Karim Böttger,Dirk Söffker
关键词: vessel trajectory prediction, Physics-related and model-based, requires specific knowledge, model-based vessel trajectory, trajectory prediction
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Physics-related and model-based vessel trajectory prediction is highly accurate but requires specific knowledge of the vessel under consideration, which is not always practical. Machine learning-based trajectory prediction models do not require expert knowledge, but rely on the implicit knowledge extracted from massive amounts of data. Several deep learning (DL) methods for vessel trajectory prediction have recently been suggested. The DL models developed typically only process information about the (dis)location of vessels defined with respect to a global reference system. In the context of inland navigation, this can be problematic, since without knowledge of the limited navigable space, unrealistic trajectories are likely to be determined. If spatial constraints are introduced, e.g., by implementing an additional submodule to process map data, however, overall complexity increases. Instead of processing the vessel displacement information and the spatial information separately, the paper proposes merging both. Here, fairway-related and navigation-related displacement information are used directly. In this way, the previously proposed context-sensitive Classification Transformer (CSCT) shows an improved spatial awareness. Additionally, the CSCT is adapted to assess the model uncertainty by enabling dropout during inference. This approach is trained on different inland waterways to analyze its generalizability. As the improved CSCT obtains lower prediction errors and enables estimating the trustworthiness of each prediction, it is more suitable for safety-critical applications in inland navigation than previously developed models.

[LG-138] Short-term Inland Vessel Trajectory Prediction with Encoder-Decoder Models

链接: https://arxiv.org/abs/2406.02770
作者: Kathrin Donandt,Karim Böttger,Dirk Söffker
关键词: Accurate vessel trajectory, Accurate vessel, river specific features, Accurate, Deep learning-based prediction
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurate vessel trajectory prediction is necessary for safe and efficient navigation. Deep learning-based prediction models, especially encoder-decoders, are rarely applied to inland navigation specifically. Approaches from the maritime domain cannot directly be transferred to river navigation due to specific driving behavior influencing factors. Different encoder-decoder architectures, including a transformer encoder-decoder, are compared herein for predicting the next positions of inland vessels, given not only spatio-temporal information from AIS, but also river-specific features. The results show that the reformulation of the regression task as a classification problem and the inclusion of river-specific features yield the lowest displacement errors. The standard LSTM encoder-decoder outperforms the transformer encoder-decoder for the data considered, but is computationally more expensive. In this study, a transformer-based encoder-decoder model is applied to the problem of predicting the ship trajectory for the first time. Here, a feature vector using the river-specific context of navigation input parameters is established. Future studies can build on the proposed models, investigate the improvement of the computationally more efficient transformer, e.g. through further hyper-parameter optimization, and use additional river-specific information in the context representation to further increase prediction accuracy.
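
The "regression as classification" reformulation mentioned in the abstract can be sketched as binning a continuous displacement into classes and decoding a predicted class back to a bin center. Bin ranges and counts here are hypothetical, not the paper's configuration:

```python
def displacement_to_class(value, lo, hi, n_bins):
    # Clip, then map a continuous displacement onto one of n_bins classes.
    value = min(max(value, lo), hi)
    idx = int((value - lo) / (hi - lo) * n_bins)
    return min(idx, n_bins - 1)

def class_to_displacement(idx, lo, hi, n_bins):
    # Decode a predicted class back to its bin-center displacement.
    width = (hi - lo) / n_bins
    return lo + (idx + 0.5) * width

# A displacement of 0.0 in [-1, 1] with 4 bins falls in class 2,
# whose bin center is 0.25.
cls = displacement_to_class(0.0, -1.0, 1.0, 4)
```

The model then predicts a distribution over classes instead of a raw value, which the abstract reports yields lower displacement errors.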

[LG-139] Spatial and social situation-aware transformer-based trajectory prediction of autonomous systems

链接: https://arxiv.org/abs/2406.02767
作者: Kathrin Donandt,Dirk Söffker
关键词: Autonomous transportation systems, Autonomous transportation, social tensor, dislocate without collision, transportation systems
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Autonomous transportation systems such as road vehicles or vessels require the consideration of the static and dynamic environment to dislocate without collision. Anticipating the behavior of an agent in a given situation is required to adequately react to it in time. Developing deep learning-based models has become the dominant approach to motion prediction recently. The social environment is often considered through a CNN-LSTM-based sub-module processing a social tensor that includes information on the past trajectories of surrounding agents. For the proposed transformer-based trajectory prediction model, an alternative, computationally more efficient social tensor definition and processing is suggested. It considers the interdependencies between target and surrounding agents at each time step directly, instead of relying on information of the last hidden LSTM states of individually processed agents. A transformer-based sub-module, the Social Tensor Transformer, is integrated into the overall prediction model. It is responsible for enriching the target agent’s dislocation features with social interaction information obtained from the social tensor. For awareness of spatial limitations, dislocation features are defined in relation to the navigable area. This replaces additional, computationally expensive map processing sub-modules. An ablation study shows that, for longer prediction horizons, the deviation of the predicted trajectory from the ground truth is lower compared to a spatially and socially agnostic model. Even if the performance gain from a spatial-only to a spatial and social context-sensitive model is small in terms of common error measures, visualizing the results shows that the proposed model is in fact able to predict reactions to surrounding agents and explicitly allows interpretable behavior.

[LG-140] Discovering Dynamic Symbolic Policies with Genetic Programming

链接: https://arxiv.org/abs/2406.02765
作者: Sigur de Vries,Sander Keemink,Marcel van Gerven
关键词: Artificial intelligence, solve control problems, control, techniques are increasingly, increasingly being applied
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*备注: 22 pages including references and appendix, 4 figures, 1 algorithm, 5 tables

点击查看摘要

Abstract:Artificial intelligence (AI) techniques are increasingly being applied to solve control problems. However, control systems developed in AI are often black-box methods, in that it is not clear how and why they generate their outputs. A lack of transparency can be problematic for control tasks in particular, because it complicates the identification of biases or errors, which in turn negatively influences the user’s confidence in the system. To improve the interpretability and transparency in control systems, the black-box structure can be replaced with white-box symbolic policies described by mathematical expressions. Genetic programming offers a gradient-free method to optimise the structure of non-differentiable mathematical expressions. In this paper, we show that genetic programming can be used to discover symbolic control systems. This is achieved by learning a symbolic representation of a function that transforms observations into control signals. We consider both systems that implement static control policies without memory and systems that implement dynamic memory-based control policies. In the case of the latter, the discovered function becomes the state equation of a differential equation, which allows for evidence integration. Our results show that symbolic policies are discovered that perform comparably with black-box policies on a variety of control tasks. Furthermore, the additional value of the memory capacity in the dynamic policies is demonstrated on experiments where static policies fall short. Overall, we demonstrate that white-box symbolic policies can be optimised with genetic programming, while offering interpretability and transparency that black-box models lack.

[LG-141] Adaptive Preference Scaling for Reinforcement Learning with Human Feedback

链接: https://arxiv.org/abs/2406.02764
作者: Ilgee Hong,Zichong Li,Alexander Bukharin,Yixiao Li,Haoming Jiang,Tianbao Yang,Tuo Zhao
关键词: Reinforcement learning, human feedback, human preference data, prevalent approach, human
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Reinforcement learning from human feedback (RLHF) is a prevalent approach to align AI systems with human values by learning rewards from human preference data. Due to various reasons, however, such data typically takes the form of rankings over pairs of trajectory segments, which fails to capture the varying strengths of preferences across different pairs. In this paper, we propose a novel adaptive preference loss, underpinned by distributionally robust optimization (DRO), designed to address this uncertainty in preference strength. By incorporating an adaptive scaling parameter into the loss for each pair, our method increases the flexibility of the reward function. Specifically, it assigns small scaling parameters to pairs with ambiguous preferences, leading to more comparable rewards, and large scaling parameters to those with clear preferences for more distinct rewards. Computationally, our proposed loss function is strictly convex and univariate with respect to each scaling parameter, enabling its efficient optimization through a simple second-order algorithm. Our method is versatile and can be readily adapted to various preference optimization frameworks, including direct preference optimization (DPO). Our experiments with robotic control and natural language generation with large language models (LLMs) show that our method not only improves policy performance but also aligns reward function selection more closely with policy optimization, simplifying the hyperparameter tuning process.
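
The role of the per-pair scaling parameter can be sketched with a scaled log-sigmoid preference loss. This shows only the shape of the idea, not the paper's DRO-derived objective or its optimization of the scaling parameters:

```python
import math

def scaled_preference_loss(r_preferred, r_rejected, scale):
    # Log-sigmoid (Bradley-Terry-style) preference loss with a
    # per-pair scaling parameter: a small scale flattens the loss
    # (ambiguous preference, more comparable rewards), a large scale
    # sharpens it (clear preference, more distinct rewards).
    margin = r_preferred - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-scale * margin)))

# With the same positive reward margin, a clearly-preferred pair
# (large scale) incurs a much smaller loss than an ambiguous one.
ambiguous = scaled_preference_loss(1.0, 0.0, 0.1)
clear = scaled_preference_loss(1.0, 0.0, 5.0)
```

In the paper, each scale is itself optimized, and the loss is strictly convex and univariate in it, enabling a simple second-order update.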

[LG-142] Multi-layer Learnable Attention Mask for Multimodal Tasks

链接: https://arxiv.org/abs/2406.02761
作者: Wayner Barrios,SouYoung Jin
关键词: high computational demands, Learnable Attention Mask, diverse settings, varying granularity, high computational
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
*备注:

点击查看摘要

Abstract:While the Self-Attention mechanism in the Transformer model has proven to be effective in many domains, we observe that it is less effective in more diverse settings (e.g. multimodality) due to the varying granularity of each token and the high computational demands of lengthy sequences. To address the challenges, we introduce the Learnable Attention Mask (LAM), strategically designed to globally regulate attention maps and prioritize critical tokens within the sequence. Leveraging the Self-Attention module in a BERT-like transformer network, our approach adeptly captures associations between tokens. The extension of the LAM to a multi-layer version accommodates the varied information aspects embedded at each layer of the Transformer network. Comprehensive experimental validation on various datasets, such as MADv2, QVHighlights, ImageNet 1K, and MSRVTT, demonstrates the efficacy of the LAM, exemplifying its ability to enhance model performance while mitigating redundant computations. This pioneering approach presents a significant advancement in enhancing the understanding of complex scenarios, such as in movie understanding.
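
Mechanically, a learnable attention mask can be realized as an additive term on the raw attention scores before the softmax. A minimal single-row sketch (the mask values here are fixed for illustration; in LAM they would be learned parameters):

```python
import math

def masked_attention_weights(scores, mask):
    # Add a (learnable) mask to raw attention scores before softmax:
    # large negative mask entries suppress tokens, zeros pass through.
    logits = [s + m for s, m in zip(scores, mask)]
    mx = max(logits)                      # subtract max for stability
    exps = [math.exp(l - mx) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Three tokens with equal scores; the mask removes the third one,
# so the remaining two share the attention weight.
w = masked_attention_weights([0.0, 0.0, 0.0], [0.0, 0.0, -1e9])
```

The multi-layer version in the paper keeps one such mask per Transformer layer, so each layer can prioritize different tokens.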

[LG-143] Aligning Large Language Models via Fine-grained Supervision

链接: https://arxiv.org/abs/2406.02756
作者: Dehong Xu,Liang Qiu,Minseok Kim,Faisal Ladhak,Jaeyoung Do
关键词: Pre-trained large-scale language, producing coherent articles, Pre-trained large-scale, large-scale language models, excel at producing
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Pre-trained large-scale language models (LLMs) excel at producing coherent articles, yet their outputs may be untruthful, toxic, or fail to align with user expectations. Current approaches focus on using reinforcement learning with human feedback (RLHF) to improve model alignment, which works by transforming coarse human preferences of LLM outputs into a feedback signal that guides the model learning process. However, because this approach operates on sequence-level feedback, it lacks the precision to identify the exact parts of the output affecting user preferences. To address this gap, we propose a method to enhance LLM alignment through fine-grained token-level supervision. Specifically, we ask annotators to minimally edit less preferred responses within the standard reward modeling dataset to make them more favorable, ensuring changes are made only where necessary while retaining most of the original content. The refined dataset is used to train a token-level reward model, which is then used for training our fine-grained Proximal Policy Optimization (PPO) model. Our experiment results demonstrate that this approach can achieve up to an absolute improvement of 5.1% in LLM performance, in terms of win rate against the reference model, compared with the traditional PPO model.

[LG-144] Measuring Stochastic Data Complexity with Boltzmann Influence Functions

链接: https://arxiv.org/abs/2406.02745
作者: Nathan Ng,Roger Grosse,Marzyeh Ghassemi
关键词: crucial part, part of ensuring, ensuring reliability, distribution shifts, Estimating the uncertainty
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Estimating the uncertainty of a model’s prediction on a test point is a crucial part of ensuring reliability and calibration under distribution shifts. A minimum description length approach to this problem uses the predictive normalized maximum likelihood (pNML) distribution, which considers every possible label for a data point, and decreases confidence in a prediction if other labels are also consistent with the model and training data. In this work we propose IF-COMP, a scalable and efficient approximation of the pNML distribution that linearizes the model with a temperature-scaled Boltzmann influence function. IF-COMP can be used to produce well-calibrated predictions on test points as well as measure complexity in both labelled and unlabelled settings. We experimentally validate IF-COMP on uncertainty calibration, mislabel detection, and OOD detection tasks, where it consistently matches or beats strong baseline methods.

[LG-145] DPDR: Gradient Decomposition and Reconstruction for Differentially Private Deep Learning

链接: https://arxiv.org/abs/2406.02744
作者: Yixuan Liu,Li Xiong,Yuhan Liu,Yujie Gu,Ruixuan Liu,Hong Chen
关键词: Stochastic Gradients Descent, Differentially Private Stochastic, Private Stochastic Gradients, Private Stochastic, Gradients Descent
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 14 pages

点击查看摘要

Abstract:Differentially Private Stochastic Gradient Descent (DP-SGD) is a prominent paradigm for preserving privacy in deep learning. It ensures privacy by perturbing gradients with random noise calibrated to their entire norm at each training step. However, this perturbation suffers from sub-optimal performance: it repeatedly wastes privacy budget on the general converging direction shared among gradients from different batches, which we refer to as common knowledge, yet yields little information gain. Motivated by this, we propose a differentially private training framework with early gradient decomposition and reconstruction (DPDR), which enables more efficient use of the privacy budget. In essence, it boosts model utility by focusing on incremental information protection and recycling the privatized common knowledge learned from previous gradients at early training steps. Concretely, DPDR incorporates three steps. First, it disentangles common knowledge and incremental information in current gradients by decomposing them based on previous noisy gradients. Second, most privacy budget is spent on protecting incremental information for higher information gain. Third, the model is updated with the gradient reconstructed from recycled common knowledge and noisy incremental information. Theoretical analysis and extensive experiments show that DPDR outperforms state-of-the-art baselines on both convergence rate and accuracy.
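
The decomposition step can be sketched as projecting the current gradient onto the previous noisy gradient (the "common knowledge" direction) and keeping the orthogonal residual as the "incremental information". This is an illustrative projection, not DPDR's full procedure or its budget allocation:

```python
def decompose_gradient(grad, prev_noisy_grad):
    # Split grad into a component along the previous noisy gradient
    # ("common knowledge") and the orthogonal residual ("incremental
    # information"). Only the residual would then receive most of
    # the noise budget before reconstruction.
    dot = sum(g * p for g, p in zip(grad, prev_noisy_grad))
    norm_sq = sum(p * p for p in prev_noisy_grad)
    coef = dot / norm_sq if norm_sq > 0 else 0.0
    common = [coef * p for p in prev_noisy_grad]
    incremental = [g - c for g, c in zip(grad, common)]
    return common, incremental

# grad = [2, 1] against previous direction [1, 0]:
# common = [2, 0], incremental = [0, 1], and the two sum back to grad.
common, incremental = decompose_gradient([2.0, 1.0], [1.0, 0.0])
```

Reconstruction then adds the recycled (already privatized) common component to the freshly privatized incremental component.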

[LG-146] Tolerant Algorithms for Learning with Arbitrary Covariate Shift

链接: https://arxiv.org/abs/2406.02742
作者: Surbhi Goel,Abhishek Shetty,Konstantinos Stavropoulos,Arsen Vasilyan
关键词: potentially adversarially generated, adversarially generated test, generated test distribution, distribution shift, test distribution
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study the problem of learning under arbitrary distribution shift, where the learner is trained on a labeled set from one distribution but evaluated on a different, potentially adversarially generated test distribution. We focus on two frameworks: PQ learning [Goldwasser, A. Kalai, Y. Kalai, Montasser NeurIPS 2020], allowing abstention on adversarially generated parts of the test distribution, and TDS learning [Klivans, Stavropoulos, Vasilyan COLT 2024], permitting abstention on the entire test distribution if distribution shift is detected. All prior known algorithms either rely on learning primitives that are computationally hard even for simple function classes, or end up abstaining entirely even in the presence of a tiny amount of distribution shift. We address both these challenges for natural function classes, including intersections of halfspaces and decision trees, and standard training distributions, including Gaussians. For PQ learning, we give efficient learning algorithms, while for TDS learning, our algorithms can tolerate moderate amounts of distribution shift. At the core of our approach is an improved analysis of spectral outlier-removal techniques from learning with nasty noise. Our analysis can (1) handle arbitrarily large fraction of outliers, which is crucial for handling arbitrary distribution shifts, and (2) obtain stronger bounds on polynomial moments of the distribution after outlier removal, yielding new insights into polynomial regression under distribution shifts. Lastly, our techniques lead to novel results for tolerant testable learning [Rubinfeld and Vasilyan STOC 2023], and learning with nasty noise.

[LG-147] Long Range Propagation on Continuous-Time Dynamic Graphs

链接: https://arxiv.org/abs/2406.02740
作者: Alessio Gravina,Giulio Lovisotto,Claudio Gallicchio,Davide Bacciu,Claas Grohnfeldt
关键词: Learning Continuous-Time Dynamic, Continuous-Time Dynamic Graphs, irregularly sampled events, Dynamic Graphs, Learning Continuous-Time
类目: Machine Learning (cs.LG)
*备注: Accepted at ICML 2024 ( this https URL )

点击查看摘要

Abstract:Learning Continuous-Time Dynamic Graphs (C-TDGs) requires accurately modeling spatio-temporal information on streams of irregularly sampled events. While many methods have been proposed recently, we find that most message passing-, recurrent- or self-attention-based methods perform poorly on long-range tasks. These tasks require correlating information that occurred “far” away from the current event, either spatially (higher-order node information) or along the time dimension (events occurred in the past). To address long-range dependencies, we introduce Continuous-Time Graph Anti-Symmetric Network (CTAN). Grounded within the ordinary differential equations framework, our method is designed for efficient propagation of information. In this paper, we show how CTAN’s (i) long-range modeling capabilities are substantiated by theoretical findings and how (ii) its empirical performance on synthetic long-range benchmarks and real-world benchmarks is superior to other methods. Our results motivate CTAN’s ability to propagate long-range information in C-TDGs as well as the inclusion of long-range tasks as part of temporal graph models evaluation.
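
The ODE grounding behind CTAN-style stability can be sketched with an anti-symmetric weight matrix: W − Wᵀ has purely imaginary eigenvalues, which is what gives non-dissipative, long-range propagation. A single-state forward-Euler step, assuming a toy dense state rather than CTAN's actual message-passing formulation:

```python
import math

def antisymmetric_step(x, W, eps=0.1):
    # One forward-Euler step of dx/dt = tanh((W - W^T) x).
    # The anti-symmetric matrix A = W - W^T has zero diagonal and
    # purely imaginary eigenvalues, so the dynamics neither explode
    # nor dissipate information over long horizons.
    n = len(x)
    A = [[W[i][j] - W[j][i] for j in range(n)] for i in range(n)]
    new_x = [x[i] + eps * math.tanh(sum(A[i][j] * x[j] for j in range(n)))
             for i in range(n)]
    return new_x, A

new_x, A = antisymmetric_step([1.0, 2.0], [[0.3, 0.5], [0.1, 0.2]])
```

In the full model, a step of this kind propagates node states between irregularly sampled events on the dynamic graph.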

[LG-148] Synthetic Data Outliers: Navigating Identity Disclosure

链接: https://arxiv.org/abs/2406.02736
作者: Carolina Trindade,Luís Antunes,Tânia Carvalho,Nuno Moniz
关键词: deep learning models, Multiple synthetic data, data generation models, Multiple synthetic, synthetic data generation
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Multiple synthetic data generation models have emerged, among which deep learning models have become the vanguard due to their ability to capture the underlying characteristics of the original data. However, the resemblance of the synthetic to the original data raises important questions on the protection of individuals’ privacy. As synthetic data is perceived as a means to fully protect personal information, most current related work disregards the impact of re-identification risk. In particular, limited attention has been given to exploring outliers, despite their privacy relevance. In this work, we analyze the privacy of synthetic data with respect to outliers. Our main findings suggest that outlier re-identification via linkage attacks is feasible and easily achieved. Furthermore, additional safeguards such as differential privacy can prevent re-identification, albeit at the expense of data utility.

[LG-149] GEFL: Extended Filtration Learning for Graph Classification

链接: https://arxiv.org/abs/2406.02732
作者: Simon Zhang,Soham Mukherjee,Tamal K. Dey
关键词: Extended persistence, obtain global multiscale, persistence, Extended, analysis to obtain
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
*备注: 26 pages, 13 figures, Learning on Graphs Conference (LoG 2022)

点击查看摘要

Abstract:Extended persistence is a technique from topological data analysis to obtain global multiscale topological information from a graph. This includes information about connected components and cycles that are captured by the so-called persistence barcodes. We introduce extended persistence into a supervised learning framework for graph classification. Global topological information, in the form of a barcode with four different types of bars and their explicit cycle representatives, is combined into the model by the readout function which is computed by extended persistence. The entire model is end-to-end differentiable. We use a link-cut tree data structure and parallelism to lower the complexity of computing extended persistence, obtaining a speedup of more than 60x over the state-of-the-art for extended persistence computation. This makes extended persistence feasible for machine learning. We show that, under certain conditions, extended persistence surpasses both the WL[1] graph isomorphism test and 0-dimensional barcodes in terms of expressivity because it adds more global (topological) information. In particular, arbitrarily long cycles can be represented, which is difficult for finite receptive field message passing graph neural networks. Furthermore, we show the effectiveness of our method on real world datasets compared to many existing recent graph representation learning methods.

[LG-150] Temporal Graph Learning Recurrent Neural Network for Traffic Forecasting

链接: https://arxiv.org/abs/2406.02726
作者: Sanghyun Lee,Chanyoung Park
关键词: Accurate traffic flow, crucial research topic, Accurate traffic, traffic flow forecasting, traffic flow
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurate traffic flow forecasting is a crucial research topic in transportation management. However, it is a challenging problem due to rapidly changing traffic conditions, high nonlinearity of traffic flow, and complex spatial and temporal correlations of road networks. Most existing studies either try to capture the spatial dependencies between roads using the same semantic graph over different time steps, or assume all sensors on the roads are equally likely to be connected regardless of the distance between them. However, we observe that the spatial dependencies between roads indeed change over time, and two distant roads are not likely to be helpful to each other when predicting the traffic flow, both of which limit the performance of existing studies. In this paper, we propose Temporal Graph Learning Recurrent Neural Network (TGLRN) to address these problems. More precisely, to effectively model the nature of time series, we leverage Recurrent Neural Networks (RNNs) to dynamically construct a graph at each time step, thereby capturing the time-evolving spatial dependencies between roads (i.e., microscopic view). Simultaneously, we provide the Adaptive Structure Information to the model, ensuring that close and consecutive sensors are considered to be more important for predicting the traffic flow (i.e., macroscopic view). Furthermore, to endow TGLRN with robustness, we introduce an edge sampling strategy when constructing the graph at each time step, which eventually leads to further improvements on the model performance. Experimental results on four commonly used real-world benchmark datasets show the effectiveness of TGLRN.
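
The macroscopic-view constraint, that distant sensors should not be connected when predicting traffic flow, can be illustrated with a distance-cutoff adjacency. Note this is only a fixed-cutoff illustration with hypothetical 1-D positions; TGLRN itself constructs the graph dynamically at each time step with an RNN:

```python
def distance_adjacency(positions, cutoff):
    # Connect only sensors within `cutoff` of each other; distant
    # roads are unlikely to help each other's traffic prediction.
    n = len(positions)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and abs(positions[i] - positions[j]) <= cutoff:
                adj[i][j] = 1
    return adj

# Sensors at 0 and 1 are connected; the sensor at 10 is isolated.
adj = distance_adjacency([0.0, 1.0, 10.0], 2.0)
```

Rebuilding such an adjacency at every time step (the microscopic view) is what lets the model track time-evolving spatial dependencies.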

[LG-151] Optimal Rates for DP-SCO with a Single Epoch and Large Batches

链接: https://arxiv.org/abs/2406.02716
作者: Christopher A. Choquette-Choo,Arun Ganesh,Abhradeep Thakurta
关键词: gradient descent, machine learning, gradient, differentially private, algorithm
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:The most common algorithms for differentially private (DP) machine learning (ML) are all based on stochastic gradient descent, for example, DP-SGD. These algorithms achieve DP by treating each gradient as an independent private query. However, this independence can cause us to overpay in privacy loss because we don’t analyze the entire gradient trajectory. In this work, we propose a new DP algorithm, which we call Accelerated-DP-SRGD (DP stochastic recursive gradient descent), that enables us to break this independence and only pay for privacy in the gradient difference, i.e., in the new information at the current step. Our algorithm achieves the optimal DP-stochastic convex optimization (DP-SCO) error (up to polylog factors) using only a single epoch over the dataset, and converges at Nesterov’s accelerated rate. Our algorithm can be run in at most \sqrt{n} batch gradient steps with batch size at least \sqrt{n}, unlike prior work which required O(n) queries with mostly constant batch sizes. To achieve this, our algorithm combines three key ingredients, a variant of stochastic recursive gradients (SRG), accelerated gradient descent, and correlated noise generation from DP continual counting. Finally, we also show that our algorithm improves over existing SoTA on multi-class logistic regression on MNIST and CIFAR-10.

[LG-152] Self-Trained Model for ECG Complex Delineation

链接: https://arxiv.org/abs/2406.02711
作者: Aram Avetisyan,Nikolas Khachaturov,Ariana Asatryan,Shahane Tigranyan,Yury Markin
关键词: accurate diagnoses, plays a crucial, crucial role, role in assisting, assisting cardiologists
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Electrocardiogram (ECG) delineation plays a crucial role in assisting cardiologists with accurate diagnoses. Prior research studies have explored various methods, including the application of deep learning techniques, to achieve precise delineation. However, existing approaches face limitations primarily related to dataset size and robustness. In this paper, we introduce a dataset for ECG delineation and propose a novel self-trained method aimed at leveraging a vast amount of unlabeled ECG data. Our approach involves the pseudolabeling of unlabeled data using a neural network trained on our dataset. Subsequently, we train the model on the newly labeled samples to enhance the quality of delineation. We conduct experiments demonstrating that our dataset is a valuable resource for training robust models and that our proposed self-trained method improves the prediction quality of ECG delineation.

[LG-153] Window to Wall Ratio Detection using SegFormer

链接: https://arxiv.org/abs/2406.02706
作者: Zoe De Simone,Sayandeep Biswas,Oscar Wu
关键词: Wall Ratios, assessing the energy, daylight and ventilation, key to assessing, Ratios
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Window to Wall Ratios (WWR) are key to assessing the energy, daylight and ventilation performance of buildings. Studies have shown that window area has a large impact on building performance and simulation. However, data to set up these environmental models and simulations is typically not available. Instead, a standard 40% WWR is typically assumed for all buildings. This paper leverages existing computer vision window detection methods to predict the WWR of buildings from external street view images using semantic segmentation, demonstrating the potential of adapting established computer vision techniques to architectural applications.
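
Once a facade has been semantically segmented, the WWR reduces to a pixel count. A minimal sketch; the label ids and the list-of-lists mask format are hypothetical (real SegFormer outputs use a dataset-specific label palette):

```python
def window_to_wall_ratio(seg_mask, window_label=1, wall_label=2):
    # WWR from a segmentation mask: window pixels over total facade
    # (window + wall) pixels. Other labels (sky, street, ...) are
    # ignored.
    windows = sum(row.count(window_label) for row in seg_mask)
    walls = sum(row.count(wall_label) for row in seg_mask)
    total = windows + walls
    return windows / total if total else 0.0

# 2 window pixels out of 8 facade pixels -> WWR = 0.25
mask = [[1, 1, 2, 2],
        [2, 2, 2, 2]]
ratio = window_to_wall_ratio(mask)
```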

[LG-154] Operational Latent Spaces

链接: https://arxiv.org/abs/2406.02699
作者: Scott H. Hawley,Austin R. Tackett
关键词: semantically meaningful operations, support semantically meaningful, operational latent spaces, latent spaces, operational latent
类目: Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: 7 pages, 6 figures. Accepted to AES International Symposium on AI and the Musician

点击查看摘要

Abstract:We investigate the construction of latent spaces through self-supervised learning to support semantically meaningful operations. Analogous to operational amplifiers, these “operational latent spaces” (OpLaS) not only demonstrate semantic structure such as clustering but also support common transformational operations with inherent semantic meaning. Some operational latent spaces are found to have arisen “unintentionally” in the progress toward some (other) self-supervised learning objective, in which unintended but still useful properties are discovered among the relationships of points in the space. Other spaces may be constructed “intentionally” by developers stipulating certain kinds of clustering or transformations intended to produce the desired structure. We focus on the intentional creation of operational latent spaces via self-supervised learning, including the introduction of rotation operators via a novel “FiLMR” layer, which can be used to enable ring-like symmetries found in some musical constructions.
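
The rotation operators mentioned above can be illustrated by rotating consecutive pairs of latent dimensions, a norm-preserving transformation that induces ring-like symmetry. This shows only the rotation idea, not the FiLMR layer's actual parameterization:

```python
import math

def rotate_latent(z, theta):
    # Apply a 2-D rotation by angle theta to each consecutive pair
    # of latent dimensions. Rotations preserve the vector norm, so
    # repeated application traces out a ring in the latent space.
    out = list(z)
    c, s = math.cos(theta), math.sin(theta)
    for i in range(0, len(z) - 1, 2):
        a, b = z[i], z[i + 1]
        out[i] = c * a - s * b
        out[i + 1] = s * a + c * b
    return out

# Rotating [1, 0] by pi/2 gives (approximately) [0, 1].
rotated = rotate_latent([1.0, 0.0], math.pi / 2)
```

A musical octave equivalence, for example, could be encoded by choosing theta so that twelve applications return to the start.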

[LG-155] iQRL – Implicitly Quantized Representations for Sample-efficient Reinforcement Learning

链接: https://arxiv.org/abs/2406.02696
作者: Aidan Scannell,Kalle Kujanpää,Yi Zhao,Mohammadreza Nakhaei,Arno Solin,Joni Pajarinen
关键词: Quantized Reinforcement Learning, reinforcement learning, Learning, shown much promise, representation learning
类目: Machine Learning (cs.LG)
*备注: 9 pages, 11 figures

点击查看摘要

Abstract:Learning representations for reinforcement learning (RL) has shown much promise for continuous control. We propose an efficient representation learning method using only a self-supervised latent-state consistency loss. Our approach employs an encoder and a dynamics model to map observations to latent states and predict future latent states, respectively. We achieve high performance and prevent representation collapse by quantizing the latent representation such that the rank of the representation is empirically preserved. Our method, named iQRL: implicitly Quantized Reinforcement Learning, is straightforward, compatible with any model-free RL algorithm, and demonstrates excellent performance by outperforming other recently proposed representation learning methods in continuous control benchmarks from DeepMind Control Suite.

[LG-156] Block Transformer: Global-to-Local Language Modeling for Fast Inference

Link: https://arxiv.org/abs/2406.02657
Authors: Namgyu Ho, Sangmin Bae, Taehyeon Kim, Hyunjik Jo, Yireun Kim, Tal Schuster, Adam Fisch, James Thorne, Se-Young Yun
Keywords: adopts hierarchical, Block Transformer architecture, paper presents, Block Transformer, autoregressive transformers
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 30 pages, 21 figures, 5 tables


Abstract:This paper presents the Block Transformer architecture which adopts hierarchical global-to-local modeling to autoregressive transformers to mitigate the inference bottlenecks of self-attention. To apply self-attention, the key-value (KV) cache of all previous sequences must be retrieved from memory at every decoding step. Thereby, this KV cache IO becomes a significant bottleneck in batch inference. We notice that these costs stem from applying self-attention on the global context, therefore we isolate the expensive bottlenecks of global modeling to lower layers and apply fast local modeling in upper layers. To mitigate the remaining costs in the lower layers, we aggregate input tokens into fixed size blocks and then apply self-attention at this coarse level. Context information is aggregated into a single embedding to enable upper layers to decode the next block of tokens, without global attention. Free of global attention bottlenecks, the upper layers can fully utilize the compute hardware to maximize inference throughput. By leveraging global and local modules, the Block Transformer architecture demonstrates 10-20x gains in inference throughput compared to vanilla transformers with equivalent perplexity. Our work introduces a new approach to optimize language model inference through novel application of global-to-local modeling. Code is available at this https URL.
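
To make the coarse global stage concrete, here is a toy numpy sketch (not the authors' code; the mean-pooling choice and the sizes are illustrative assumptions) of aggregating tokens into fixed-size blocks before self-attention, and the resulting quadratic reduction in attention cost:

```python
import numpy as np

def block_aggregate(tokens: np.ndarray, block_size: int) -> np.ndarray:
    """Aggregate a (seq_len, dim) token sequence into coarse block
    embeddings by mean-pooling fixed-size blocks; the lower layers then
    attend over blocks instead of individual tokens."""
    seq_len, dim = tokens.shape
    assert seq_len % block_size == 0, "pad the sequence to a multiple of block_size"
    blocks = tokens.reshape(seq_len // block_size, block_size, dim)
    return blocks.mean(axis=1)

def attention_cost(seq_len: int) -> int:
    # The self-attention score matrix is quadratic in sequence length.
    return seq_len * seq_len

tokens = np.random.default_rng(0).normal(size=(512, 64))
coarse = block_aggregate(tokens, block_size=4)  # 512 tokens -> 128 blocks
full_cost = attention_cost(512)
block_cost = attention_cost(128)
```

With block size 4, the coarse attention touches a 16x smaller score matrix, which is the kind of saving the lower global layers exploit; the upper layers then decode locally within a block.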

[LG-157] kNN Classification of Malware Data Dependency Graph Features

Link: https://arxiv.org/abs/2406.02654
Authors: John Musgrave, Anca Ralescu
Keywords: Feature resolution impacts, make explainable inferences, resolution impacts, impacts the ability, make explainable
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:


Abstract:Feature resolution impacts the ability of classifiers to make explainable inferences when applied to malware classification. We explore classification based on features constructed from data dependency graphs, and present results from k-Nearest Neighbors (kNN) classifiers. Our study demonstrates that classification based on a novel feature representation not only yields high accuracy, but also increases explainability in inference, as features of data dependency are directly representative of program behavior. We present classification results using the Microsoft Kaggle 2015 malware dataset which was processed with a novel approach to feature extraction and representation. We show that non-parametric approaches to classification in the metric space are able to obtain classification accuracy of 87.5% when applied to multi-class classification in the Kaggle malware dataset. Additionally, similarity in the metric space can be calculated directly without prior training. Our results provide evidence that data dependency graphs accurately capture both semantic and structural information.
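
The non-parametric, training-free classification the abstract describes can be sketched in a few lines. This is not the paper's pipeline (the toy 2-D vectors below stand in for the data-dependency-graph features, which are hypothetical here); it only illustrates kNN by direct distance computation in a metric space:

```python
import numpy as np
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Non-parametric kNN: similarity in the metric space is computed
    directly from pairwise distances, with no prior training phase."""
    dists = np.linalg.norm(train_X - query, axis=1)  # Euclidean metric
    nearest = np.argsort(dists)[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy stand-ins for data-dependency-graph feature vectors (hypothetical).
train_X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.1], [4.9, 5.0]])
train_y = ["benign", "malware-A", "malware-B", "malware-B"]
label = knn_predict(train_X, train_y, np.array([5.2, 4.8]), k=3)
```

Because the distance computation is the whole inference step, each prediction is directly explainable by pointing at the k nearest labeled samples.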

[LG-158] RoutePlacer: An End-to-End Routability-Aware Placer with Graph Neural Network

Link: https://arxiv.org/abs/2406.02651
Authors: Yunbo Hou, Haoran Ye, Yingxue Zhang, Siyuan Xu, Guojie Song
Keywords: modern chip design, chip design, critical and challenging, challenging step, step of modern
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
Comments: Accepted at KDD 2024


Abstract:Placement is a critical and challenging step of modern chip design, with routability being an essential indicator of placement quality. Current routability-oriented placers typically apply an iterative two-stage approach, wherein the first stage generates a placement solution, and the second stage provides non-differentiable routing results to heuristically improve the solution quality. This method hinders jointly optimizing the routability aspect during placement. To address this problem, this work introduces RoutePlacer, an end-to-end routability-aware placement method. It trains RouteGNN, a customized graph neural network, to efficiently and accurately predict routability by capturing and fusing geometric and topological representations of placements. Well-trained RouteGNN then serves as a differentiable approximation of routability, enabling end-to-end gradient-based routability optimization. In addition, RouteGNN can improve two-stage placers as a plug-and-play alternative to external routers. Our experiments on DREAMPlace, an open-source AI4EDA platform, show that RoutePlacer can reduce Total Overflow by up to 16% while maintaining routed wirelength, compared to the state-of-the-art; integrating RouteGNN within two-stage placers leads to a 44% reduction in Total Overflow without compromising wirelength.

[LG-159] By Fair Means or Foul: Quantifying Collusion in a Market Simulation with Deep Reinforcement Learning

Link: https://arxiv.org/abs/2406.02650
Authors: Michael Schlechtinger, Damaris Kosack, Franz Krause, Heiko Paulheim
Keywords: utilizing Reinforcement Learning, Artificial Intelligence, Reinforcement Learning, rapidly evolving landscape, utilizing Reinforcement
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Preprint for IJCAI 2024


Abstract:In the rapidly evolving landscape of eCommerce, Artificial Intelligence (AI) based pricing algorithms, particularly those utilizing Reinforcement Learning (RL), are becoming increasingly prevalent. This rise has led to an inextricable pricing situation with the potential for market collusion. Our research employs an experimental oligopoly model of repeated price competition, systematically varying the environment to cover scenarios from basic economic theory to subjective consumer demand preferences. We also introduce a novel demand framework that enables the implementation of various demand models, allowing for a weighted blending of different models. In contrast to existing research in this domain, we aim to investigate the strategies and emerging pricing patterns developed by the agents, which may lead to a collusive outcome. Furthermore, we investigate a scenario where agents cannot observe their competitors’ prices. Finally, we provide a comprehensive legal analysis across all scenarios. Our findings indicate that RL-based AI agents converge to a collusive state characterized by the charging of supracompetitive prices, without necessarily requiring inter-agent communication. Implementing alternative RL algorithms, altering the number of agents or simulation settings, and restricting the scope of the agents’ observation space does not significantly impact the collusive market outcome behavior.

[LG-160] Exploring Effects of Hyperdimensional Vectors for Tsetlin Machines

Link: https://arxiv.org/abs/2406.02648
Authors: Vojtech Halenka, Ahmed K. Kadhim, Paul F. A. Clarke, Bimal Bhattarai, Rupsa Saha, Ole-Christoffer Granmo, Lei Jiao, Per-Arne Andersen
Keywords: Tsetlin machines, efficiency on Boolean, Boolean representations, Booleanizing complex data, high efficiency
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 9 pages, 17 figures


Abstract:Tsetlin machines (TMs) have been successful in several application domains, operating with high efficiency on Boolean representations of the input data. However, Booleanizing complex data structures such as sequences, graphs, images, signal spectra, chemical compounds, and natural language is not trivial. In this paper, we propose a hypervector (HV) based method for expressing arbitrarily large sets of concepts associated with any input data. Using a hyperdimensional space to build vectors drastically expands the capacity and flexibility of the TM. We demonstrate how images, chemical compounds, and natural language text are encoded according to the proposed method, and how the resulting HV-powered TM can achieve significantly higher accuracy and faster learning on well-known benchmarks. Our results open up a new research direction for TMs, namely how to expand and exploit the benefits of operating in hyperspace, including new booleanization strategies, optimization of TM inference and learning, as well as new TM applications.
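
A core hyperdimensional-computing operation behind such encodings is bundling: combining several binary hypervectors by per-bit majority vote so the result stays similar to every member of the set while remaining Boolean (and thus TM-compatible). A minimal numpy sketch, assuming random binary hypervectors as atomic concepts (the paper's actual encoding scheme is richer):

```python
import numpy as np

rng = np.random.default_rng(42)
D = 10_000  # dimensionality of the hyperdimensional space

def random_hv():
    # Random binary hypervector representing one atomic concept.
    return rng.integers(0, 2, size=D)

def bundle(hvs):
    # Per-bit majority vote: the bundled vector stays similar to each member.
    return (np.sum(hvs, axis=0) * 2 > len(hvs)).astype(int)

def similarity(a, b):
    # Fraction of matching bits: 1.0 = identical, ~0.5 = unrelated vectors.
    return float(np.mean(a == b))

a, b, c, unrelated = (random_hv() for _ in range(4))
concept_set = bundle([a, b, c])  # Boolean representation of the set {a, b, c}
```

The bundled vector agrees with each member on roughly 75% of bits but with an unrelated vector on only about 50%, which is what lets a TM operating on such vectors separate encoded concept sets.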

[LG-161] E-ICL: Enhancing Fine-Grained Emotion Recognition through the Lens of Prototype Theory

Link: https://arxiv.org/abs/2406.02642
Authors: Zhou Yang, Zhaochun Ren, Chenglong Ye, Yufeng Wang, Haizhou Sun, Chao Chen, Xiaofei Zhu, Yunbing Wu, Xiangwen Liao
Keywords: In-context learning, fine-grained emotion recognition, emotion, commonsense reasoning, knowledge acquisition
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 16 pages, 7 figures, 5 tables


Abstract:In-context learning (ICL) achieves remarkable performance in various domains such as knowledge acquisition, commonsense reasoning, and semantic understanding. However, its performance significantly deteriorates for emotion detection tasks, especially fine-grained emotion recognition. The underlying reasons for this remain unclear. In this paper, we identify the reasons behind ICL’s poor performance from the perspective of prototype theory and propose a method to address this issue. Specifically, we conduct extensive pilot experiments and find that ICL conforms to the prototype theory on fine-grained emotion recognition. Based on this theory, we uncover the following deficiencies in ICL: (1) It relies on prototypes (example-label pairs) that are semantically similar but emotionally inaccurate to predict emotions. (2) It is prone to interference from irrelevant categories, affecting the accuracy and robustness of the predictions. To address these issues, we propose an Emotion Context Learning method (E-ICL) on fine-grained emotion recognition. E-ICL relies on more emotionally accurate prototypes to predict categories by referring to emotionally similar examples with dynamic labels. Simultaneously, E-ICL employs an exclusionary emotion prediction strategy to avoid interference from irrelevant categories, thereby increasing its accuracy and robustness. Note that the entire process is accomplished with the assistance of a plug-and-play emotion auxiliary model, without additional training. Experiments on the fine-grained emotion datasets EDOS, Empathetic-Dialogues, EmpatheticIntent, and GoEmotions show that E-ICL achieves superior emotion prediction performance. Furthermore, even when the emotion auxiliary model used is lower than 10% of the LLMs, E-ICL can still boost the performance of LLMs by over 4% on multiple datasets.

[LG-162] EchoMamba4Rec: Harmonizing Bidirectional State Space Models with Spectral Filtering for Advanced Sequential Recommendation

Link: https://arxiv.org/abs/2406.02638
Authors: Yuda Wang, Xuxin He, Shengxin Zhu
Keywords: Sequential recommendation aims, historical user behaviors, Sequential recommendation, estimate dynamic user, dynamic user preferences
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: arXiv admin note: text overlap with arXiv:2403.03900 by other authors


Abstract:Sequential recommendation aims to estimate dynamic user preferences and sequential dependencies among historical user behaviors. Attention-based models have proven effective for sequential recommendation, but they suffer from inference inefficiency due to the quadratic computational complexity of attention mechanisms, particularly for long-range behavior sequences. Inspired by the recent success of state space models (SSMs) in control theory, which provide a robust framework for modeling and controlling dynamic systems, we present EchoMamba4Rec. Control theory emphasizes the use of SSMs for managing long-range dependencies and maintaining inferential efficiency through structured state matrices. EchoMamba4Rec leverages these control relationships in sequential recommendation and integrates bi-directional processing with frequency-domain filtering to capture complex patterns and dependencies in user interaction data more effectively. Our model benefits from the ability of state space models (SSMs) to learn and perform parallel computations, significantly enhancing computational efficiency and scalability. It features a bi-directional Mamba module that incorporates both forward and reverse Mamba components, leveraging information from both past and future interactions. Additionally, a filter layer operates in the frequency domain using learnable Fast Fourier Transform (FFT) and learnable filters, followed by an inverse FFT to refine item embeddings and reduce noise. We also integrate Gate Linear Units (GLU) to dynamically control information flow, enhancing the model’s expressiveness and training stability. Experimental results demonstrate that EchoMamba significantly outperforms existing models, providing more accurate and personalized recommendations.
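
The filter layer's FFT -> learnable filter -> inverse FFT pattern can be sketched with numpy. This is a simplified illustration, not the model's implementation: the filter weights here are fixed by hand, whereas in the real layer they are trained parameters:

```python
import numpy as np

def frequency_filter_layer(x: np.ndarray, filt: np.ndarray) -> np.ndarray:
    """FFT -> elementwise per-frequency filter -> inverse FFT, applied along
    the sequence axis of (seq_len, dim) item embeddings. In the model `filt`
    would be learnable; here it is fixed for illustration."""
    spec = np.fft.rfft(x, axis=0)                    # to the frequency domain
    spec *= filt[:, None]                            # per-frequency weights
    return np.fft.irfft(spec, n=x.shape[0], axis=0)  # back to the time domain

seq_len, dim = 8, 4
x = np.random.default_rng(1).normal(size=(seq_len, dim))
identity = np.ones(seq_len // 2 + 1)  # all-pass filter: output equals input
lowpass = identity.copy()
lowpass[2:] = 0.0                     # attenuate high frequencies (denoising)
y_same = frequency_filter_layer(x, identity)
y_smooth = frequency_filter_layer(x, lowpass)
```

An all-pass filter reproduces the embeddings exactly, while zeroing high-frequency bins yields a smoothed (noise-reduced) sequence of the same shape, which is the role the abstract attributes to the filter layer.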

[LG-163] Evidentially Calibrated Source-Free Time-Series Domain Adaptation with Temporal Imputation

Link: https://arxiv.org/abs/2406.02635
Authors: Peiliang Gong, Mohamed Ragab, Emadeldeen Eldele, Wenyu Zhang, Min Wu, Chuan-Sheng Foo, Daoqiang Zhang, Xiaoli Li, Zhenghua Chen
Keywords: time series SFDA, time series, Source-free domain adaptation, SFDA, series SFDA
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:


Abstract:Source-free domain adaptation (SFDA) aims to adapt a model pre-trained on a labeled source domain to an unlabeled target domain without access to source data, preserving the source domain’s privacy. While SFDA is prevalent in computer vision, it remains largely unexplored in time series analysis. Existing SFDA methods, designed for visual data, struggle to capture the inherent temporal dynamics of time series, hindering adaptation performance. This paper proposes MAsk And imPUte (MAPU), a novel and effective approach for time series SFDA. MAPU addresses the critical challenge of temporal consistency by introducing a novel temporal imputation task. This task involves randomly masking time series signals and leveraging a dedicated temporal imputer to recover the original signal within the learned embedding space, bypassing the complexities of noisy raw data. Notably, MAPU is the first method to explicitly address temporal consistency in the context of time series SFDA. Additionally, it offers seamless integration with existing SFDA methods, providing greater flexibility. We further introduce E-MAPU, which incorporates evidential uncertainty estimation to address the overconfidence issue inherent in softmax predictions. To achieve that, we leverage evidential deep learning to obtain a better-calibrated pre-trained model and adapt the target encoder to map out-of-support target samples to a new feature representation closer to the source domain’s support. This fosters better alignment, ultimately enhancing adaptation performance. Extensive experiments on five real-world time series datasets demonstrate that both MAPU and E-MAPU achieve significant performance gains compared to existing methods. These results highlight the effectiveness of our proposed approaches for tackling various time series domain adaptation problems.
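
The masking side of the temporal imputation task can be sketched as follows. Note the simplifications: MAPU trains a neural imputer and recovers the signal in the learned embedding space, whereas this toy version works directly on the raw signal and uses linear interpolation as a stand-in imputer:

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_signal(x: np.ndarray, mask_ratio: float = 0.3):
    """Randomly zero out a fraction of time steps; return the corrupted
    signal and the boolean mask (True = masked)."""
    mask = rng.random(x.shape[0]) < mask_ratio
    corrupted = x.copy()
    corrupted[mask] = 0.0
    return corrupted, mask

def imputation_loss(recovered, original, mask):
    # MSE only over the masked positions the imputer had to fill in.
    return float(np.mean((recovered[mask] - original[mask]) ** 2))

x = np.sin(np.linspace(0, 6, 200))  # toy smooth time series
corrupted, mask = mask_signal(x)
# Stand-in imputer: linear interpolation over the masked gaps.
recovered = corrupted.copy()
recovered[mask] = np.interp(np.flatnonzero(mask), np.flatnonzero(~mask), x[~mask])
loss = imputation_loss(recovered, x, mask)
```

Even this crude imputer beats leaving the masked positions at zero, which is the signal the self-supervised objective exploits: recovering masked temporal context forces the encoder to capture the series' dynamics.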

[LG-164] Edit Distance Robust Watermarks for Language Models

Link: https://arxiv.org/abs/2406.02633
Authors: Noah Golowich, Ankur Moitra
Keywords: detecting AI-generated text, problem of detecting, language model outputs, detecting AI-generated, AI-generated text
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:


Abstract:Motivated by the problem of detecting AI-generated text, we consider the problem of watermarking the output of language models with provable guarantees. We aim for watermarks which satisfy: (a) undetectability, a cryptographic notion introduced by Christ, Gunn, and Zamir (2024) which stipulates that it is computationally hard to distinguish watermarked language model outputs from the model’s actual output distribution; and (b) robustness to channels which introduce a constant fraction of adversarial insertions, substitutions, and deletions to the watermarked text. Earlier schemes could only handle stochastic substitutions and deletions, and thus we are aiming for a more natural and appealing robustness guarantee that holds with respect to edit distance. Our main result is a watermarking scheme which achieves both undetectability and robustness to edits when the alphabet size for the language model is allowed to grow as a polynomial in the security parameter. To derive such a scheme, we follow an approach introduced by Christ and Gunn (2024), which proceeds via first constructing pseudorandom codes satisfying undetectability and robustness properties analogous to those above; our key idea is to handle adversarial insertions and deletions by interpreting the symbols as indices into the codeword, which we call indexing pseudorandom codes. Additionally, our codes rely on weaker computational assumptions than used in previous work. Then we show that there is a generic transformation from such codes over large alphabets to watermarking schemes for arbitrary language models.

[LG-165] Redefining DDoS Attack Detection Using A Dual-Space Prototypical Network-Based Approach

Link: https://arxiv.org/abs/2406.02632
Authors: Fernando Martinez, Mariyam Mapkar, Ali Alfatemi, Mohamed Rahouti, Yufeng Xin, Kaiqi Xiong, Nasir Ghani
Keywords: Distributed Denial, Denial of Service, increasingly substantial cybersecurity, substantial cybersecurity threat, pose an increasingly
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Comments: 9 pages, The 33rd International Conference on Computer Communications and Networks (ICCCN 2024)


Abstract:Distributed Denial of Service (DDoS) attacks pose an increasingly substantial cybersecurity threat to organizations across the globe. In this paper, we introduce a new deep learning-based technique for detecting DDoS attacks, a paramount cybersecurity challenge with evolving complexity and scale. Specifically, we propose a new dual-space prototypical network that leverages a unique dual-space loss function to enhance detection accuracy for various attack patterns through geometric and angular similarity measures. This approach capitalizes on the strengths of representation learning within the latent space (a lower-dimensional representation of data that captures complex patterns for machine learning analysis), improving the model’s adaptability and sensitivity towards varying DDoS attack vectors. Our comprehensive evaluation spans multiple training environments, including offline training, simulated online training, and prototypical network scenarios, to validate the model’s robustness under diverse data abundance and scarcity conditions. The Multilayer Perceptron (MLP) with Attention, trained with our dual-space prototypical design over a reduced training set, achieves an average accuracy of 94.85% and an F1-Score of 94.71% across our tests, showcasing its effectiveness in dynamic and constrained real-world scenarios.
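
The prototypical-network idea with a blended geometric/angular score can be sketched as below. This is a simplified illustration with hypothetical 2-D embeddings and a hand-chosen blending weight; the paper's actual dual-space loss and training procedure are more elaborate:

```python
import numpy as np

def prototypes(X, y):
    # One prototype per class: the mean embedding of that class's samples.
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def dual_space_score(z, proto, alpha=0.5):
    """Blend a geometric term (negative Euclidean distance) with an angular
    term (cosine similarity); higher is a better match."""
    geo = -np.linalg.norm(z - proto)
    ang = float(z @ proto / (np.linalg.norm(z) * np.linalg.norm(proto)))
    return alpha * geo + (1 - alpha) * ang

# Hypothetical 2-D embeddings of labeled traffic flows.
X = np.array([[1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.1, 1.1]])
y = np.array(["benign", "benign", "ddos", "ddos"])
protos = prototypes(X, y)
query = np.array([0.05, 0.9])
pred = max(protos, key=lambda c: dual_space_score(query, protos[c]))
```

Classifying by nearest prototype is what lets such models handle data-scarce (few-shot) conditions: only the class means, not a decision boundary, must be estimated.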

[LG-166] SSNet: A Lightweight Multi-Party Computation Scheme for Practical Privacy-Preserving Machine Learning Service in the Cloud

Link: https://arxiv.org/abs/2406.02629
Authors: Shijin Duan, Chenghong Wang, Hongwu Peng, Yukui Luo, Wujie Wen, Caiwen Ding, Xiaolin Xu
Keywords: current MPC, MPC, current MPC frameworks, pivotal aspect, aspect of deep
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: 16 pages, 9 figures


Abstract:As privacy-preserving becomes a pivotal aspect of deep learning (DL) development, multi-party computation (MPC) has gained prominence for its efficiency and strong security. However, the practice of current MPC frameworks is limited, especially when dealing with large neural networks, exemplified by the prolonged execution time of 25.8 seconds for secure inference on ResNet-152. The primary challenge lies in the reliance of current MPC approaches on additive secret sharing, which incurs significant communication overhead with non-linear operations such as comparisons. Furthermore, additive sharing suffers from poor scalability on party size. In contrast, the evolving landscape of MPC necessitates accommodating a larger number of compute parties and ensuring robust performance against malicious activities or computational failures. In light of these challenges, we propose SSNet, which for the first time, employs Shamir’s secret sharing (SSS) as the backbone of MPC-based ML framework. We meticulously develop all framework primitives and operations for secure DL models tailored to seamlessly integrate with the SSS scheme. SSNet demonstrates the ability to scale up party numbers straightforwardly and embeds strategies to authenticate the computation correctness without incurring significant performance overhead. Additionally, SSNet introduces masking strategies designed to reduce communication overhead associated with non-linear operations. We conduct comprehensive experimental evaluations on commercial cloud computing infrastructure from Amazon AWS, as well as across diverse prevalent DNN models and datasets. SSNet demonstrates a substantial performance boost, achieving speed-ups ranging from 3x to 14x compared to SOTA MPC frameworks. Moreover, SSNet also represents the first framework that is evaluated on a five-party computation setup, in the context of secure DL inference. 

[LG-167] Progressive Inference: Explaining Decoder-Only Sequence Classification Models Using Intermediate Predictions

Link: https://arxiv.org/abs/2406.02625
Authors: Sanjay Kariyappa, Freddy Lécué, Saumitra Mishra, Christopher Pond, Daniele Magazzeni, Manuela Veloso
Keywords: paper proposes Progressive, proposes Progressive Inference, Progressive Inference, decoder-only Transformer model, proposes Progressive
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:


Abstract:This paper proposes Progressive Inference - a framework to compute input attributions to explain the predictions of decoder-only sequence classification models. Our work is based on the insight that the classification head of a decoder-only Transformer model can be used to make intermediate predictions by evaluating them at different points in the input sequence. Due to the causal attention mechanism, these intermediate predictions only depend on the tokens seen before the inference point, allowing us to obtain the model’s prediction on a masked input sub-sequence, with negligible computational overheads. We develop two methods to provide sub-sequence level attributions using this insight. First, we propose Single Pass-Progressive Inference (SP-PI), which computes attributions by taking the difference between consecutive intermediate predictions. Second, we exploit a connection with Kernel SHAP to develop Multi Pass-Progressive Inference (MP-PI). MP-PI uses intermediate predictions from multiple masked versions of the input to compute higher quality attributions. Our studies on a diverse set of models trained on text classification tasks show that SP-PI and MP-PI provide significantly better attributions compared to prior work.
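
The SP-PI attribution rule (the difference between consecutive intermediate predictions) is simple enough to sketch directly. The per-position scores below are hypothetical stand-ins for what the classification head would return at each point in the input sequence:

```python
def sp_pi_attributions(intermediate_preds, baseline=0.0):
    """Single Pass-Progressive Inference: the attribution of token t is the
    change in the intermediate prediction after seeing that token.
    intermediate_preds[t] is the class score evaluated at position t."""
    attrs = []
    prev = baseline
    for p in intermediate_preds:
        attrs.append(p - prev)
        prev = p
    return attrs

# Hypothetical intermediate class scores for a 5-token input; the jump at
# position 2 marks the token that swung the prediction.
preds = [0.1, 0.15, 0.7, 0.65, 0.8]
attrs = sp_pi_attributions(preds)
```

A useful property of this telescoping rule is that attributions sum to the final prediction (minus the baseline), so each token's share of the decision is accounted for exactly.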

[LG-168] Unelicitable Backdoors in Language Models via Cryptographic Transformer Circuits

Link: https://arxiv.org/abs/2406.02619
Authors: Andis Draguns, Andrew Gritsevskiy, Sumeet Ramesh Motwani, Charlie Rogers-Smith, Jeffrey Ladish, Christian Schroeder de Witt
Keywords: open-source language models, language models significantly, models significantly increases, downstream backdoor attacks, rapid proliferation
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: 10 pages, 5 figures


Abstract:The rapid proliferation of open-source language models significantly increases the risks of downstream backdoor attacks. These backdoors can introduce dangerous behaviours during model deployment and can evade detection by conventional cybersecurity monitoring systems. In this paper, we introduce a novel class of backdoors in autoregressive transformer models, that, in contrast to prior art, are unelicitable in nature. Unelicitability prevents the defender from triggering the backdoor, making it impossible to evaluate or detect ahead of deployment even if given full white-box access and using automated techniques, such as red-teaming or certain formal verification methods. We show that our novel construction is not only unelicitable thanks to using cryptographic techniques, but also has favourable robustness properties. We confirm these properties in empirical investigations, and provide evidence that our backdoors can withstand state-of-the-art mitigation strategies. Additionally, we expand on previous work by showing that our universal backdoors, while not completely undetectable in white-box settings, can be harder to detect than some existing designs. By demonstrating the feasibility of seamlessly integrating backdoors into transformer models, this paper fundamentally questions the efficacy of pre-deployment detection strategies. This offers new insights into the offence-defence balance in AI safety and security.

[LG-169] Adaptive Layer Splitting for Wireless LLM Inference in Edge Computing: A Model-Based Reinforcement Learning Approach

Link: https://arxiv.org/abs/2406.02616
Authors: Yuxuan Chen, Rongpeng Li, Xiaoxue Yu, Zhifeng Zhao, Honggang Zhang
Keywords: large language models, edge computing environments, large language, environments is critical, critical for enhancing
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:


Abstract:Optimizing the deployment of large language models (LLMs) in edge computing environments is critical for enhancing privacy and computational efficiency. Toward efficient wireless LLM inference in edge computing, this study comprehensively analyzes the impact of different splitting points in mainstream open-source LLMs. On this basis, this study introduces a framework taking inspiration from model-based reinforcement learning (MBRL) to determine the optimal splitting point across the edge and user equipment (UE). By incorporating a reward surrogate model, our approach significantly reduces the computational cost of frequent performance evaluations. Extensive simulations demonstrate that this method effectively balances inference performance and computational load under varying network conditions, providing a robust solution for LLM deployment in decentralized settings.

[LG-170] A hybrid numerical methodology coupling Reduced Order Modeling and Graph Neural Networks for non-parametric geometries: applications to structural dynamics problems

Link: https://arxiv.org/abs/2406.02615
Authors: Victor Matray (LMPS), Faisal Amlani (LMPS), Frédéric Feyel (LMPS), David Néron (LMPS)
Keywords: partial differential equations, governing complex physical, complex physical systems, time-domain partial differential, Graph Neural Networks
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Classical Physics (physics.class-ph)
Comments:


Abstract:This work introduces a new approach for accelerating the numerical analysis of time-domain partial differential equations (PDEs) governing complex physical systems. The methodology is based on a combination of a classical reduced-order modeling (ROM) framework and recently-introduced Graph Neural Networks (GNNs), where the latter is trained on highly heterogeneous databases of varying numerical discretization sizes. The proposed techniques are shown to be particularly suitable for non-parametric geometries, ultimately enabling the treatment of a diverse range of geometries and topologies. Performance studies are presented in an application context related to the design of aircraft seats and their corresponding mechanical responses to shocks, where the main motivation is to reduce the computational burden and enable the rapid design iteration for such problems that entail non-parametric geometries. The methods proposed here are straightforwardly applicable to other scientific or engineering problems requiring a large number of finite element-based numerical simulations, with the potential to significantly enhance efficiency while maintaining reasonable accuracy.

[LG-171] Frequency Enhanced Pre-training for Cross-city Few-shot Traffic Forecasting

Link: https://arxiv.org/abs/2406.02614
Authors: Zhanyu Liu, Jianrong Ding, Guanjie Zheng
Keywords: Intelligent Transportation Systems, Transportation Systems, Intelligent Transportation, cross-city few-shot forecasting, field of Intelligent
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by ECMLPKDD 2024 (Research Track)


Abstract:The field of Intelligent Transportation Systems (ITS) relies on accurate traffic forecasting to enable various downstream applications. However, developing cities often face challenges in collecting sufficient training traffic data due to limited resources and outdated infrastructure. Recognizing this obstacle, the concept of cross-city few-shot forecasting has emerged as a viable approach. While previous cross-city few-shot forecasting methods ignore the frequency similarity between cities, we have made an observation that the traffic data is more similar in the frequency domain between cities. Based on this fact, we propose a Frequency Enhanced Pre-training Framework for Cross-city Few-shot Forecasting (FEPCross). FEPCross has a pre-training stage and a fine-tuning stage. In the pre-training stage, we propose a novel Cross-Domain Spatial-Temporal Encoder that incorporates the information of the time and frequency domain and trains it with self-supervised tasks encompassing reconstruction and contrastive objectives. In the fine-tuning stage, we design modules to enrich training samples and maintain a momentum-updated graph structure, thereby mitigating the risk of overfitting to the few-shot training data. Empirical evaluations performed on real-world traffic datasets validate the exceptional efficacy of FEPCross, outperforming existing approaches of diverse categories and demonstrating characteristics that foster the progress of cross-city few-shot forecasting.
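
The motivating observation (traffic from different cities looks more alike in the frequency domain) can be illustrated with synthetic series. The two toy "cities" below are a hypothetical construction: the same daily rhythm, shifted in time, so raw similarity drops while the FFT magnitude spectra stay identical:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def freq_similarity(x, y):
    """Compare two traffic series via the cosine similarity of their FFT
    magnitude spectra, which ignores phase (i.e., time shifts)."""
    return cosine(np.abs(np.fft.rfft(x)), np.abs(np.fft.rfft(y)))

t = np.arange(168)  # one week of hourly readings
city_a = 10 + 5 * np.sin(2 * np.pi * t / 24)        # daily traffic rhythm
city_b = 10 + 5 * np.sin(2 * np.pi * (t - 3) / 24)  # same rhythm, 3h shift
raw_sim = cosine(city_a, city_b)
freq_sim = freq_similarity(city_a, city_b)
```

Because the shift only changes the phase of each frequency bin, the magnitude spectra match almost perfectly, illustrating why pre-training on frequency-domain features can transfer across cities whose raw series are misaligned.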

[LG-172] ACCO: Accumulate while you Communicate, Hiding Communications in Distributed LLM Training

Link: https://arxiv.org/abs/2406.02613
Authors: Adel Nabli (MLIA, Mila), Louis Fournier (MLIA), Pierre Erbacher (MLIA), Louis Serrano (MLIA), Eugene Belilovsky (Mila), Edouard Oyallon
Keywords: Large Language Models, Training Large Language, Large Language, Language Models, employing multiple GPUs
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:


Abstract:Training Large Language Models (LLMs) relies heavily on distributed implementations, employing multiple GPUs to compute stochastic gradients on model replicas in parallel. However, synchronizing gradients in data parallel settings induces a communication overhead increasing with the number of distributed workers, which can impede the efficiency gains of parallelization. To address this challenge, optimization algorithms reducing inter-worker communication have emerged, such as local optimization methods used in Federated Learning. While effective in minimizing communication overhead, these methods incur significant memory costs, hindering scalability: in addition to extra momentum variables, if communications are only allowed between multiple local optimization steps, then the optimizer’s states cannot be sharded among workers. In response, we propose ACcumulate while COmmunicate (ACCO), a memory-efficient optimization algorithm tailored for distributed training of LLMs. ACCO allows sharding optimizer states across workers, overlaps gradient computations and communications to conceal communication costs, and accommodates heterogeneous hardware. Our method relies on a novel technique to mitigate the one-step delay inherent in parallel execution of gradient computations and communications, eliminating the need for warmup steps and aligning with the training dynamics of standard distributed optimization while converging faster in terms of wall-clock time. We demonstrate the effectiveness of ACCO on several LLM training and fine-tuning tasks.

[LG-173] Is Data Valuation Learnable and Interpretable?

链接: https://arxiv.org/abs/2406.02612
作者: Ou Wu,Weiyao Zhu,Mengyang Li
关键词: data valuation, valuation, deep learning model, data, data valuation methods
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Measuring the value of individual samples is critical for many data-driven tasks, e.g., the training of a deep learning model. Recent literature has witnessed substantial efforts in developing data valuation methods. The primary data valuation methodology is based on the Shapley value from game theory, and various methods have been proposed along this path. Even though Shapley value-based valuation has a solid theoretical basis, it is entirely an experiment-based approach and no valuation model has been constructed so far. In addition, current data valuation methods ignore the interpretability of the output values, despite the fact that an interpretable data valuation method would be of great help for applications such as data pricing. This study aims to answer an important question: is data valuation learnable and interpretable? A learned valuation model has several desirable merits such as a fixed number of parameters and knowledge reusability. An interpretable data valuation model can explain why a sample is valuable or invaluable. To this end, two new data value modeling frameworks are proposed, in which a multi-layer perceptron (MLP) and a new regression tree are utilized as the specific base models for model training and interpretability, respectively. Extensive experiments are conducted on benchmark datasets. The experimental results provide a positive answer to the question. Our study opens up a new technical path for the assessment of data values. Large data valuation models can be built across many different data-driven tasks, which can promote the widespread application of data valuation.
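The "is valuation learnable?" question can be made concrete with a toy stand-in: fit a small parametric model that maps per-sample features to precomputed value scores, so valuation becomes a reusable function rather than a per-dataset experiment. A plain linear fit replaces the paper's MLP and regression tree here, purely for illustration; the features and value formula are invented.

```python
# Toy "learned valuation model": regress per-sample features onto value
# scores with plain SGD (a linear stand-in for the paper's MLP).

def fit_value_model(features, values, lr=0.05, epochs=300):
    w = [0.0] * len(features[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(features, values):
            pred = sum(wi * xi for wi, xi in zip(w, x)) + b
            err = pred - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

# Synthetic scores: value = 2*quality - noise_level (invented for the demo).
data = [([0.1, 0.9], -0.7), ([0.8, 0.2], 1.4),
        ([0.5, 0.5], 0.5), ([0.9, 0.1], 1.7)]
w, b = fit_value_model([x for x, _ in data], [y for _, y in data])

# The fitted model now scores an unseen sample without rerunning experiments.
estimate = sum(wi * xi for wi, xi in zip(w, [0.6, 0.4])) + b
```

The reusability the abstract highlights shows up in the last line: once fitted, valuing a new sample is a single forward pass.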

[LG-174] LOLA: LLM-Assisted Online Learning Algorithm for Content Experiments

链接: https://arxiv.org/abs/2406.02611
作者: Zikun Ye,Hema Yoganarasimhan,Yufeng Zheng
关键词: publishers require automated, Large Language Models, rapidly evolving digital, Online Learning Algorithm, LLM-Assisted Online Learning
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:In the rapidly evolving digital content landscape, media firms and news publishers require automated and efficient methods to enhance user engagement. This paper introduces the LLM-Assisted Online Learning Algorithm (LOLA), a novel framework that integrates Large Language Models (LLMs) with adaptive experimentation to optimize content delivery. Leveraging a large-scale dataset from Upworthy, which includes 17,681 headline A/B tests aimed at evaluating the performance of various headlines associated with the same article content, we first investigate three broad pure-LLM approaches: prompt-based methods, embedding-based classification models, and fine-tuned open-source LLMs. Our findings indicate that prompt-based approaches perform poorly, achieving no more than 65% accuracy in identifying the catchier headline among two options. In contrast, OpenAI-embedding-based classification models and fine-tuned Llama-3-8b models achieve comparable accuracy, around 82-84%, though still falling short of the performance of experimentation with sufficient traffic. We then introduce LOLA, which combines the best pure-LLM approach with the Upper Confidence Bound algorithm to adaptively allocate traffic and maximize clicks. Our numerical experiments on Upworthy data show that LOLA outperforms the standard A/B testing method (the current status quo at Upworthy), pure bandit algorithms, and pure-LLM approaches, particularly in scenarios with limited experimental traffic or numerous arms. Our approach is both scalable and broadly applicable to content experiments across a variety of digital settings where firms seek to optimize user engagement, including digital advertising and social media recommendations.
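The LLM-plus-bandit combination described above can be sketched in a few lines (a hypothetical simplification: the LLM score is used only to seed each arm's initial estimate, and the click rates and constants are invented, not Upworthy's):

```python
import math
import random

# Toy LOLA-style allocation: seed each headline's estimate with an "LLM
# score", then let the Upper Confidence Bound rule route traffic adaptively.

def ucb_allocate(true_ctrs, llm_scores, rounds=5000, c=0.5, seed=0):
    rng = random.Random(seed)
    n = [1] * len(true_ctrs)           # one pseudo-observation per arm
    mean = list(llm_scores)            # LLM prior as the starting estimate
    for t in range(1, rounds + 1):
        arm = max(range(len(true_ctrs)),
                  key=lambda a: mean[a] + c * math.sqrt(math.log(t + 1) / n[a]))
        reward = 1.0 if rng.random() < true_ctrs[arm] else 0.0
        n[arm] += 1
        mean[arm] += (reward - mean[arm]) / n[arm]  # incremental mean update
    return n

# Three candidate headlines with (hidden) click rates; the LLM prior is
# roughly right, so exploration concentrates on the best arm quickly.
pulls = ucb_allocate([0.02, 0.05, 0.15], llm_scores=[0.03, 0.06, 0.12])
```

The point of the sketch is the division of labor the paper describes: the LLM prior reduces wasted exploration early, while UCB still corrects a miscalibrated prior from observed clicks.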

[LG-175] Less is More: Pseudo-Label Filtering for Continual Test-Time Adaptation

链接: https://arxiv.org/abs/2406.02609
作者: Jiayao Tan,Fan Lyu,Chenggong Ni,Tingliang Feng,Fuyuan Hu,Zhang Zhang,Shaochuang Zhao,Liang Wang
关键词: Continual Test-Time Adaptation, Continual Test-Time, sequence of target, test phase, phase without accessing
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: arXiv admin note: text overlap with arXiv:2310.03335 by other authors

点击查看摘要

Abstract:Continual Test-Time Adaptation (CTTA) aims to adapt a pre-trained model to a sequence of target domains during the test phase without accessing the source data. To adapt to unlabeled data from unknown domains, existing methods rely on constructing pseudo-labels for all samples and updating the model through self-training. However, these pseudo-labels often involve noise, leading to insufficient adaptation. To improve the quality of pseudo-labels, we propose a pseudo-label selection method for CTTA, called Pseudo Labeling Filter (PLF). The key idea of PLF is to keep selecting appropriate thresholds for pseudo-labels and identify reliable ones for self-training. Specifically, we present three principles for setting thresholds during continuous domain learning, including initialization, growth and diversity. Based on these principles, we design Self-Adaptive Thresholding to filter pseudo-labels. Additionally, we introduce a Class Prior Alignment (CPA) method to encourage the model to make diverse predictions for unknown domain samples. Through extensive experiments, PLF outperforms current state-of-the-art methods, proving its effectiveness in CTTA.
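A minimal version of threshold-based pseudo-label filtering might look as follows. This is our simplification: the paper's Self-Adaptive Thresholding follows initialization, growth, and diversity principles, whereas here the threshold merely tracks a moving average of batch confidence; the labels and numbers are invented.

```python
# Toy pseudo-label filter: keep only predictions whose confidence clears the
# current threshold, then adapt the threshold toward the batch's mean
# confidence (a stand-in for the paper's Self-Adaptive Thresholding).

def filter_pseudo_labels(batch, threshold, momentum=0.9):
    """batch: list of (label, confidence) pairs from the current test stream."""
    kept = [(label, conf) for label, conf in batch if conf >= threshold]
    batch_mean = sum(conf for _, conf in batch) / len(batch)
    new_threshold = momentum * threshold + (1 - momentum) * batch_mean
    return kept, new_threshold

batch = [("cat", 0.95), ("dog", 0.40), ("cat", 0.80), ("dog", 0.55)]
kept, threshold = filter_pseudo_labels(batch, threshold=0.7)
```

Low-confidence predictions are dropped from self-training, and the threshold drifts with the target domain instead of staying fixed, which is the "less is more" idea in the title.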

[LG-176] Know Your Neighborhood: General and Zero-Shot Capable Binary Function Search Powered by Call Graphlets

链接: https://arxiv.org/abs/2406.02606
作者: Joshua Collyer,Tim Watson,Iain Phillips
关键词: graph neural network, malware analysis, vulnerability research, important problem, problem with applications
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Binary code similarity detection is an important problem with applications in areas like malware analysis, vulnerability research and plagiarism detection. This paper proposes a novel graph neural network architecture combined with a novel graph data representation called call graphlets. A call graphlet encodes the neighborhood around each function in a binary executable, capturing the local and global context through a series of statistical features. A specialized graph neural network model is then designed to operate on this graph representation, learning to map it to a feature vector that encodes semantic code similarities using deep metric learning. The proposed approach is evaluated across four distinct datasets covering different architectures, compiler toolchains, and optimization levels. Experimental results demonstrate that the combination of call graphlets and the novel graph neural network architecture achieves state-of-the-art performance compared to baseline techniques across cross-architecture, mono-architecture and zero shot tasks. In addition, our proposed approach also performs well when evaluated against an out-of-domain function inlining task. Overall, the work provides a general and effective graph neural network-based solution for conducting binary code similarity detection.
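On our reading of the abstract, a call graphlet is the neighborhood of a function inside the binary's call graph. The sketch below extracts a 1-hop neighborhood from a dict-based call graph; the function names and the plain-dict representation are ours, and the paper additionally attaches statistical features for the GNN.

```python
# Toy "call graphlet": for a target function, collect its immediate callers
# and callees from a call graph, i.e. the local neighborhood that a GNN
# would then featurize and embed.

def call_graphlet(call_graph, fn):
    callees = set(call_graph.get(fn, ()))
    callers = {caller for caller, outs in call_graph.items() if fn in outs}
    return {"center": fn, "callers": callers, "callees": callees}

# Hypothetical call graph: function -> list of functions it calls.
cg = {"main": ["parse", "run"], "run": ["parse", "log"], "parse": []}
g = call_graphlet(cg, "parse")
```

Two binaries compiled differently may still yield similar graphlets around the same logical function, which is what makes this representation useful for cross-architecture matching.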

[LG-177] A Novel Defense Against Poisoning Attacks on Federated Learning: LayerCAM Augmented with Autoencoder

链接: https://arxiv.org/abs/2406.02605
作者: Jingjing Zheng,Xin Yuan,Kai Li,Wei Ni,Eduardo Tovar,Jon Crowcroft
关键词: adopted Euclidean distance-based, widely adopted Euclidean, Euclidean distance-based detection, circumvent widely adopted, Recent attacks
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent attacks on federated learning (FL) can introduce malicious model updates that circumvent widely adopted Euclidean distance-based detection methods. This paper proposes a novel defense strategy, referred to as LayerCAM-AE, designed to counteract model poisoning in federated learning. The LayerCAM-AE puts forth a new Layer Class Activation Mapping (LayerCAM) integrated with an autoencoder (AE), significantly enhancing detection capabilities. Specifically, LayerCAM-AE generates a heat map for each local model update, which is then transformed into a more compact visual format. The autoencoder is designed to process the LayerCAM heat maps from the local model updates, improving their distinctiveness and thereby increasing the accuracy in spotting anomalous maps and malicious local models. To address the risk of misclassifications with LayerCAM-AE, a voting algorithm is developed, where a local model update is flagged as malicious if its heat maps are consistently suspicious over several rounds of communication. Extensive tests of LayerCAM-AE on the SVHN and CIFAR-100 datasets are performed under both Independent and Identically Distributed (IID) and non-IID settings in comparison with existing ResNet-50 and REGNETY-800MF defense models. Experimental results show that LayerCAM-AE increases detection rates (Recall: 1.0, Precision: 1.0, FPR: 0.0, Accuracy: 1.0, F1 score: 1.0, AUC: 1.0) and test accuracy in FL, surpassing the performance of both the ResNet-50 and REGNETY-800MF. Our code is available at: this https URL

[LG-178] Gated recurrent neural network with TPE Bayesian optimization for enhancing stock index prediction accuracy

链接: https://arxiv.org/abs/2406.02604
作者: Bivas Dinda
关键词: deep learning architectures, predicting future stock, abundant financial data, future stock prices, stock index price
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Computational Finance (q-fin.CP)
*备注: 23 pages, 9 figures, 12 tables

点击查看摘要

Abstract:The recent advancement of deep learning architectures, neural networks, and the combination of abundant financial data and powerful computers are transforming finance, leading us to develop an advanced method for predicting future stock prices. However, the accessibility of investment and trading at everyone’s fingertips has made the stock markets increasingly intricate and prone to volatility. The increased complexity and volatility of the stock market have driven demand for more models, which would effectively capture high volatility and non-linear behavior of the different stock prices. This study explored gated recurrent neural network (GRNN) algorithms such as LSTM (long short-term memory), GRU (gated recurrent unit), and hybrid models like GRU-LSTM, LSTM-GRU, with Tree-structured Parzen Estimator (TPE) Bayesian optimization for hyperparameter optimization (TPE-GRNN). The aim is to improve the prediction accuracy of the next day’s closing price of the NIFTY 50 index, a prominent Indian stock market index, using TPE-GRNN. A combination of eight influential factors is carefully chosen from fundamental stock data, technical indicators, crude oil price, and macroeconomic data to train the models for capturing the changes in the price of the index with the factors of the broader economy. Single-layer and multi-layer TPE-GRNN models have been developed. The models’ performance is evaluated using standard metrics like R², MAPE, and RMSE. The analysis of models’ performance reveals the impact of feature selection and hyperparameter optimization (HPO) in enhancing stock index price prediction accuracy. The results show that the MAPE of our proposed TPE-LSTM method is the lowest (best) with respect to all the previous models for stock index price prediction.
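TPE's core move, modeling where "good" hyperparameters lie and sampling new candidates from that region, can be caricatured in a few lines. This is a crude TPE-flavored sketch, not the real estimator (no kernel densities, and the objective is a made-up quadratic rather than an LSTM validation loss):

```python
import random

# Crude TPE-flavored search: keep the best quartile of tried values and
# sample new candidates near them, versus occasional uniform exploration.

def tpe_like_search(objective, low, high, trials=60, seed=0):
    rng = random.Random(seed)
    history = []  # (score, x), lower score is better
    for i in range(trials):
        if i < 10 or rng.random() < 0.3:
            x = rng.uniform(low, high)                    # explore uniformly
        else:
            good = sorted(history)[: max(1, len(history) // 4)]
            centre = rng.choice(good)[1]                  # sample near a good config
            x = min(high, max(low, rng.gauss(centre, (high - low) * 0.1)))
        history.append((objective(x), x))
    return min(history)

# Stand-in objective with its optimum at x = 3 (e.g. a learning-rate knob).
best_score, best_x = tpe_like_search(lambda x: (x - 3.0) ** 2, 0.0, 10.0)
```

In the paper's setting, `objective` would be the validation error of a GRNN trained with the candidate hyperparameters, which is exactly why sample-efficient search matters: each evaluation is a full training run.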

[LG-179] Distortion-free Watermarks are not Truly Distortion-free under Watermark Key Collisions

链接: https://arxiv.org/abs/2406.02603
作者: Yihan Wu,Ruibo Chen,Zhengmian Hu,Yanshuo Chen,Junfeng Guo,Hongyang Zhang,Heng Huang
关键词: Language model, random sampling process, watermarking techniques inject, random seed, key collisions
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Language model (LM) watermarking techniques inject a statistical signal into LM-generated content by substituting the random sampling process with pseudo-random sampling, using watermark keys as the random seed. Among these statistical watermarking approaches, distortion-free watermarks are particularly crucial because they embed watermarks into LM-generated content without compromising generation quality. However, one notable limitation of pseudo-random sampling compared to true-random sampling is that, under the same watermark keys (i.e., key collision), the results of pseudo-random sampling exhibit correlations. This limitation could potentially undermine the distortion-free property. Our studies reveal that key collisions are inevitable due to the limited availability of watermark keys, and existing distortion-free watermarks exhibit a significant distribution bias toward the original LM distribution in the presence of key collisions. Moreover, achieving a perfect distortion-free watermark is impossible as no statistical signal can be embedded under key collisions. To reduce the distribution bias caused by key collisions, we introduce a new family of distortion-free watermarks–beta-watermark. Experimental results support that the beta-watermark can effectively reduce the distribution bias under key collisions.
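The key-collision problem is easy to demonstrate with a toy seeded sampler (illustrative only, not the paper's watermark construction): two generations under the same watermark key are perfectly correlated, which is exactly the bias that cannot happen under true random sampling.

```python
import hashlib
import random

# Toy watermarked sampler: the watermark key seeds the pseudo-random choice
# of tokens, so reusing a key (a "key collision") reproduces the sample.

def watermarked_sample(vocab, key, n_tokens):
    seed = int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    return [rng.choice(vocab) for _ in range(n_tokens)]

vocab = list("abcdefgh")
a = watermarked_sample(vocab, key="k1", n_tokens=16)
b = watermarked_sample(vocab, key="k1", n_tokens=16)  # same key: collision
c = watermarked_sample(vocab, key="k2", n_tokens=16)  # fresh key
```

With a limited key pool, collisions like `a == b` are inevitable over many generations, which is the source of the distribution bias the paper quantifies.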

[LG-180] D-FaST: Cognitive Signal Decoding with Disentangled Frequency-Spatial-Temporal Attention

链接: https://arxiv.org/abs/2406.02602
作者: Weiguo Chen,Changjian Wang,Kele Xu,Yuan Yuan,Yanru Bai,Dongsong Zhang
关键词: Natural Language Processing, Cognitive Language Processing, Language Processing, Natural Language, progressively pivotal role
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 18 pages, 9 figures. Accepted by IEEE Transactions on Cognitive and Developmental Systems

点击查看摘要

Abstract:Cognitive Language Processing (CLP), situated at the intersection of Natural Language Processing (NLP) and cognitive science, plays a progressively pivotal role in the domains of artificial intelligence, cognitive intelligence, and brain science. Among the essential areas of investigation in CLP, Cognitive Signal Decoding (CSD) has made remarkable achievements, yet there still exist challenges related to insufficient global dynamic representation capability and deficiencies in multi-domain feature integration. In this paper, we introduce a novel paradigm for CLP referred to as Disentangled Frequency-Spatial-Temporal Attention (D-FaST). Specifically, we present a novel cognitive signal decoder that operates on disentangled frequency-space-time domain attention. This decoder encompasses three key components: frequency domain feature extraction employing multi-view attention, spatial domain feature extraction utilizing dynamic brain connection graph attention, and temporal feature extraction relying on local time sliding window attention. These components are integrated within a novel disentangled framework. Additionally, to encourage advancements in this field, we have created a new CLP dataset, MNRED. Subsequently, we conducted an extensive series of experiments, evaluating D-FaST’s performance on MNRED, as well as on publicly available datasets including ZuCo, BCIC IV-2A, and BCIC IV-2B. Our experimental results demonstrate that D-FaST outperforms existing methods significantly on both our dataset and traditional CSD datasets, establishing a state-of-the-art accuracy of 78.72% on MNRED, pushing the accuracy on ZuCo to 78.35%, on BCIC IV-2A to 74.85%, and on BCIC IV-2B to 76.81%.

[LG-181] Multimodal Deep Learning for Low-Resource Settings: A Vector Embedding Alignment Approach for Healthcare Applications

链接: https://arxiv.org/abs/2406.02601
作者: David Restrepo,Chenwei Wu,Sebastián Andrés Cajas,Luis Filipe Nakayama,Leo Anthony Celi,Diego M López
关键词: Large-scale multi-modal deep, Large-scale multi-modal, multimodal deep learning, deep learning, revolutionized domains
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large-scale multi-modal deep learning models have revolutionized domains such as healthcare, highlighting the importance of computational power. However, in resource-constrained regions like Low and Middle-Income Countries (LMICs), limited access to GPUs and data poses significant challenges, often leaving CPUs as the sole resource. To address this, we advocate for leveraging vector embeddings to enable flexible and efficient computational methodologies, democratizing multimodal deep learning across diverse contexts. Our paper investigates the efficiency and effectiveness of using vector embeddings from single-modal foundation models and multi-modal Vision-Language Models (VLMs) for multimodal deep learning in low-resource environments, particularly in healthcare. Additionally, we propose a simple yet effective inference-time method to enhance performance by aligning image-text embeddings. Comparing these approaches with traditional methods, we assess their impact on computational efficiency and model performance using metrics like accuracy, F1-score, inference time, training time, and memory usage across three medical modalities: BRSET (ophthalmology), HAM10000 (dermatology), and SatelliteBench (public health). Our findings show that embeddings reduce computational demands without compromising model performance. Furthermore, our alignment method improves performance in medical tasks. This research promotes sustainable AI practices by optimizing resources in constrained environments, highlighting the potential of embedding-based approaches for efficient multimodal learning. Vector embeddings democratize multimodal deep learning in LMICs, particularly in healthcare, enhancing AI adaptability in varied use cases.
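One cheap, CPU-only way to fuse image and text embeddings at inference time can be sketched as below. This is a hypothetical stand-in for the paper's alignment method (which is not specified in the abstract): both vectors are L2-normalized and averaged, and the toy 2-d embeddings are invented.

```python
import math

# Toy inference-time fusion of image and text embeddings: normalize each
# modality to unit length, then average (a simple alignment stand-in).

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def fuse(img_emb, txt_emb):
    img, txt = normalize(img_emb), normalize(txt_emb)
    return [(a + b) / 2 for a, b in zip(img, txt)]

fused = fuse([3.0, 4.0], [0.0, 2.0])
```

Because this runs on precomputed embeddings, a downstream classifier never touches the heavy foundation models, which is the resource argument the abstract makes for LMIC settings.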

[LG-182] Data Quality in Edge Machine Learning: A State-of-the-Art Survey

链接: https://arxiv.org/abs/2406.02600
作者: Mohammed Djameleddine Belgoumri,Mohamed Reda Bouadjenek,Sunil Aryal,Hakim Hacid
关键词: Data-driven Artificial Intelligence, Data-driven Artificial, Artificial Intelligence, Machine Learning, autonomous driving technologies
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: 31 pages, 5 figures

点击查看摘要

Abstract:Data-driven Artificial Intelligence (AI) systems trained using Machine Learning (ML) are shaping an ever-increasing (in size and importance) portion of our lives, including, but not limited to, recommendation systems, autonomous driving technologies, healthcare diagnostics, financial services, and personalized marketing. On the one hand, the outsized influence of these systems imposes a high standard of quality, particularly in the data used to train them. On the other hand, establishing and maintaining standards of Data Quality (DQ) becomes more challenging due to the proliferation of Edge Computing and Internet of Things devices, along with their increasing adoption for training and deploying ML models. The nature of the edge environment – characterized by limited resources, decentralized data storage, and processing – exacerbates data-related issues, making them more frequent, severe, and difficult to detect and mitigate. From these observations, it follows that DQ research for edge ML is a critical and urgent exploration track for the safety and robust usefulness of present and future AI systems. Despite this fact, DQ research for edge ML is still in its infancy. The literature on this subject remains fragmented and scattered across different research communities, with no comprehensive survey to date. Hence, this paper aims to fill this gap by providing a global view of the existing literature from multiple disciplines that can be grouped under the umbrella of DQ for edge ML. Specifically, we present a tentative definition of data quality in Edge computing, which we use to establish a set of DQ dimensions. We explore each dimension in detail, including existing solutions for mitigation.

[LG-183] Towards Learning Foundation Models for Heuristic Functions to Solve Pathfinding Problems

链接: https://arxiv.org/abs/2406.02598
作者: Vedant Khandelwal,Amit Sheth,Forest Agostinelli
关键词: computational science, natural sciences, found throughout robotics, Pathfinding problems, Concordance Correlation Coefficient
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Pathfinding problems are found throughout robotics, computational science, and natural sciences. Traditional methods to solve these require training deep neural networks (DNNs) for each new problem domain, consuming substantial time and resources. This study introduces a novel foundation model, leveraging deep reinforcement learning to train heuristic functions that seamlessly adapt to new domains without further fine-tuning. Building upon DeepCubeA, we enhance the model by providing the heuristic function with the domain’s state transition information, improving its adaptability. Utilizing a puzzle generator for the 15-puzzle action space variation domains, we demonstrate our model’s ability to generalize and solve unseen domains. We achieve a strong correlation between learned and ground truth heuristic values across various domains, as evidenced by robust R-squared and Concordance Correlation Coefficient metrics. These results underscore the potential of foundation models to establish new standards in efficiency and adaptability for AI-driven solutions in complex pathfinding problems.

[LG-184] CoNO: Complex Neural Operator for Continuous Dynamical Physical Systems

链接: https://arxiv.org/abs/2406.02597
作者: Karn Tiwari,N M Anoop Krishnan,A P Prathosh
关键词: infinite-dimensional functional spaces, Neural operators extend, operators extend data-driven, Complex Neural Operator, Fractional Fourier Transform
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
*备注: Under Review

点击查看摘要

Abstract:Neural operators extend data-driven models to map between infinite-dimensional functional spaces. While these operators perform effectively in either the time or frequency domain, their performance may be limited when applied to non-stationary spatial or temporal signals whose frequency characteristics change with time. Here, we introduce Complex Neural Operator (CoNO) that parameterizes the integral kernel using Fractional Fourier Transform (FrFT), better representing non-stationary signals in a complex-valued domain. Theoretically, we prove the universal approximation capability of CoNO. We perform an extensive empirical evaluation of CoNO on seven challenging partial differential equations (PDEs), including regular grids, structured meshes, and point clouds. Empirically, CoNO consistently attains state-of-the-art performance, showcasing an average relative gain of 10.9%. Further, CoNO exhibits superior performance, outperforming all other models in additional tasks such as zero-shot super-resolution and robustness to noise. CoNO also exhibits the ability to learn from small amounts of data – giving the same performance as the next best model with just 60% of the training data. Altogether, CoNO presents a robust and superior model for modeling continuous dynamical systems, providing a fillip to scientific machine learning.

[LG-185] Slow and Steady Wins the Race: Maintaining Plasticity with Hare and Tortoise Networks

链接: https://arxiv.org/abs/2406.02596
作者: Hojoon Lee,Hyeonseo Cho,Hyunseung Kim,Donghu Kim,Dugki Min,Jaegul Choo,Clare Lyle
关键词: Ash Adams, revisiting warm-starting experiments, experiments from Ash, Hare Tortoise, revisiting warm-starting
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: accepted to ICML 2024

点击查看摘要

Abstract:This study investigates the loss of generalization ability in neural networks, revisiting warm-starting experiments from Ash & Adams. Our empirical analysis reveals that common methods designed to enhance plasticity by maintaining trainability provide limited benefits to generalization. While reinitializing the network can be effective, it also risks losing valuable prior knowledge. To this end, we introduce Hare & Tortoise, inspired by the brain’s complementary learning system. Hare & Tortoise consists of two components: the Hare network, which rapidly adapts to new information analogously to the hippocampus, and the Tortoise network, which gradually integrates knowledge akin to the neocortex. By periodically reinitializing the Hare network to the Tortoise’s weights, our method preserves plasticity while retaining general knowledge. Hare & Tortoise can effectively maintain the network’s ability to generalize, which improves advanced reinforcement learning algorithms on the Atari-100k benchmark. The code is available at this https URL.
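The Hare/Tortoise mechanics reduce to two weight vectors: the tortoise tracks an exponential moving average of the hare, and the hare is periodically reset to the tortoise. The sketch below shows only that update rule (our simplification: the hare's weights are held fixed instead of being trained, and the constants are invented):

```python
# Minimal Hare & Tortoise update: slow EMA tracking plus periodic reset of
# the fast learner to the slow learner's weights.

def hare_tortoise_step(hare, tortoise, step, ema=0.999, reset_every=1000):
    # Tortoise slowly integrates the hare's (fast-adapting) weights.
    tortoise = [ema * t + (1 - ema) * h for t, h in zip(tortoise, hare)]
    if step % reset_every == 0:
        hare = list(tortoise)  # reinitialize the hare to restore plasticity
    return hare, tortoise

hare, tortoise = [1.0, -1.0], [0.0, 0.0]
for step in range(1, 2001):
    hare, tortoise = hare_tortoise_step(hare, tortoise, step)
```

The reset discards the hare's short-term drift while the EMA keeps the knowledge it accumulated, which is how the method avoids the all-or-nothing choice between full reinitialization and no reinitialization.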

[LG-186] Graph Neural Networks for Brain Graph Learning: A Survey

链接: https://arxiv.org/abs/2406.02594
作者: Xuexiong Luo,Jia Wu,Jian Yang,Shan Xue,Amin Beheshti,Quan Z. Sheng,David McAlpine,Paul Sowman,Alexis Giral,Philip S. Yu
关键词: Exploring the complex, complex structure, crucial for understanding, understanding its functionality, functionality and diagnosing
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 9 pages, 2 figures, IJCAI-2024

点击查看摘要

Abstract:Exploring the complex structure of the human brain is crucial for understanding its functionality and diagnosing brain disorders. Thanks to advancements in neuroimaging technology, a novel approach has emerged that involves modeling the human brain as a graph-structured pattern, with different brain regions represented as nodes and the functional relationships among these regions as edges. Moreover, graph neural networks (GNNs) have demonstrated a significant advantage in mining graph-structured data. Developing GNNs to learn brain graph representations for brain disorder analysis has recently gained increasing attention. However, there is a lack of systematic survey work summarizing current research methods in this domain. In this paper, we aim to bridge this gap by reviewing brain graph learning works that utilize GNNs. We first introduce the process of brain graph modeling based on common neuroimaging data. Subsequently, we systematically categorize current works based on the type of brain graph generated and the targeted research problems. To make this research accessible to a broader range of interested researchers, we provide an overview of representative methods and commonly used datasets, along with their implementation sources. Finally, we present our insights on future research directions. The repository of this survey is available at this https URL.

[LG-187] LOLAMEME: Logic Language Memory Mechanistic Framework

链接: https://arxiv.org/abs/2406.02592
作者: Jay Desai,Xiaobo Guo,Srinivasan H. Sengamedu
关键词: Large Language Models, achieved superhuman breadth, Large Language, unprecedented depth, achieved superhuman
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: this https URL

点击查看摘要

Abstract:The performance of Large Language Models has achieved superhuman breadth with unprecedented depth. At the same time, the language models are mostly black box models and the underlying mechanisms for performance have been evaluated using synthetic or mechanistic schemes. We extend current mechanistic schemes to incorporate Logic, memory, and nuances of Language such as latent structure. The proposed framework is called LOLAMEME and we provide two instantiations of LOLAMEME: the LoLa and MeMe languages. We then consider two generative language model architectures: transformer-based GPT-2 and convolution-based Hyena. We propose the hybrid architecture THEX and use the LOLAMEME framework to compare the three architectures. THEX outperforms GPT-2 and Hyena on select tasks.

[LG-188] Unveiling the Potential of AI for Nanomaterial Morphology Prediction

链接: https://arxiv.org/abs/2406.02591
作者: Ivan Dubrovsky,Andrei Dmitrenko,Aleksei Dmitrenko,Nikita Serov,Vladimir Vinogradov
关键词: complex experimental process, specific morphology remains, experimental process, industry sectors, Creation of nanomaterials
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Creation of nanomaterials with specific morphology remains a complex experimental process, even though there is a growing demand for these materials in various industry sectors. This study explores the potential of AI to predict the morphology of nanoparticles within the data availability constraints. For that, we first generated a new multi-modal dataset that is double the size of analogous studies. Then, we systematically evaluated performance of classical machine learning and large language models in prediction of nanomaterial shapes and sizes. Finally, we prototyped a text-to-image system, discussed the obtained empirical results, as well as the limitations and promises of existing approaches.

[LG-189] Capturing Climatic Variability: Using Deep Learning for Stochastic Downscaling

链接: https://arxiv.org/abs/2406.02587
作者: Kiri Daust,Adam Monahan
关键词: requires accurate local, changing climate requires, climate requires accurate, accurate local climate, Generative Adversarial Networks
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Submitted to Artificial Intelligence for the Earth Systems AMS Journal

点击查看摘要

Abstract:Adapting to the changing climate requires accurate local climate information, a computationally challenging problem. Recent studies have used Generative Adversarial Networks (GANs), a type of deep learning, to learn complex distributions and downscale climate variables efficiently. Capturing variability while downscaling is crucial for estimating uncertainty and characterising extreme events - critical information for climate adaptation. Since downscaling is an underdetermined problem, many fine-scale states are physically consistent with the coarse-resolution state. To quantify this ill-posed problem, downscaling techniques should be stochastic, able to sample realisations from a high-resolution distribution conditioned on low-resolution input. Previous stochastic downscaling attempts have found substantial underdispersion, with models failing to represent the full distribution. We propose approaches to improve the stochastic calibration of GANs in three ways: a) injecting noise inside the network, b) adjusting the training process to explicitly account for the stochasticity, and c) using a probabilistic loss metric. We tested our models first on a synthetic dataset with known distributional properties, and then on a realistic downscaling scenario, predicting high-resolution wind components from low-resolution climate covariates. Injecting noise, on its own, substantially improved the quality of conditional and full distributions in tests with synthetic data, but performed less well for wind field downscaling, where models remained underdispersed. For wind downscaling, we found that adjusting the training method and including the probabilistic loss improved calibration. The best model, with all three changes, showed much improved skill at capturing the full variability of the high-resolution distribution and thus at characterising extremes.

[LG-190] Contextual Counting: A Mechanistic Study of Transformers on a Quantitative Task

链接: https://arxiv.org/abs/2406.02585
作者: Siavash Golkar,Alberto Bietti,Mariel Pettee,Michael Eickenberg,Miles Cranmer,Keiya Hirashima,Geraud Krawezik,Nicholas Lourie,Michael McCabe,Rudy Morel,Ruben Ohana,Liam Holden Parker,Bruno Régaldo-Saint Blancard,Kyunghyun Cho,Shirley Ho
关键词: behavior remains crucial, revolutionized machine learning, diverse domains, remains crucial, high-stakes applications
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Transformers have revolutionized machine learning across diverse domains, yet understanding their behavior remains crucial, particularly in high-stakes applications. This paper introduces the contextual counting task, a novel toy problem aimed at enhancing our understanding of Transformers in quantitative and scientific contexts. This task requires precise localization and computation within datasets, akin to object detection or region-based scientific analysis. We present theoretical and empirical analysis using both causal and non-causal Transformer architectures, investigating the influence of various positional encodings on performance and interpretability. In particular, we find that causal attention is much better suited for the task, and that no positional embeddings lead to the best accuracy, though rotary embeddings are competitive and easier to train. We also show that out-of-distribution performance is tightly linked to which tokens the model uses as a bias term.
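
A minimal sketch of what a contextual-counting-style instance might look like: a binary sequence with one delimited region, labelled by the number of 1s inside it. The token scheme here is an assumption for illustration; the paper's exact task specification may differ.

```python
import random

def make_example(rng, length=20):
    # One toy instance: a 0/1 sequence with a single region delimited by
    # '[' and ']'; the label is the number of 1s inside the region.
    seq = [rng.choice("01") for _ in range(length)]
    i, j = sorted(rng.sample(range(length + 1), 2))
    seq.insert(j, "]")   # insert the right delimiter first...
    seq.insert(i, "[")   # ...so the earlier insertion shifts it correctly
    label = seq[seq.index("[") + 1 : seq.index("]")].count("1")
    return "".join(seq), label

rng = random.Random(0)
example, label = make_example(rng)
```

Solving this requires the model both to localize the region (find the delimiters) and to compute within it (count), which is what makes it a useful probe.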

[LG-191] Planetary Causal Inference: Implications for the Geography of Poverty

链接: https://arxiv.org/abs/2406.02584
作者: Kazuki Sakamoto,Connor T. Jerzak,Adel Daoud
关键词: Earth observation data, Earth observation, government-derived economic indicators, machine learning, living conditions
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Methodology (stat.ME); Machine Learning (stat.ML)
*备注: For a full list of the papers found in the quantitative literature search, see this https URL

点击查看摘要

Abstract:Earth observation data such as satellite imagery can, when combined with machine learning, have profound impacts on our understanding of the geography of poverty through the prediction of living conditions, especially where government-derived economic indicators are either unavailable or potentially untrustworthy. Recent work has progressed in using EO data not only to predict spatial economic outcomes, but also to explore cause and effect, an understanding which is critical for downstream policy analysis. In this review, we first document the growth of interest in EO-ML analyses in the causal space. We then trace the relationship between spatial statistics and EO-ML methods before discussing the four ways in which EO data has been used in causal ML pipelines – (1.) poverty outcome imputation for downstream causal analysis, (2.) EO image deconfounding, (3.) EO-based treatment effect heterogeneity, and (4.) EO-based transportability analysis. We conclude by providing a workflow for how researchers can incorporate EO data in causal ML analysis going forward.

[LG-192] Exploring the Potential of Polynomial Basis Functions in Kolmogorov-Arnold Networks: A Comparative Study of Different Groups of Polynomials

链接: https://arxiv.org/abs/2406.02583
作者: Seyd Teymoor Seydi
关键词: traditional spline-based methods, Kolmogorov-Arnold Network, KAN models, spline-based methods, KAN
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This paper presents a comprehensive survey of 18 distinct polynomials and their potential applications in Kolmogorov-Arnold Network (KAN) models as an alternative to traditional spline-based methods. The polynomials are classified into various groups based on their mathematical properties, such as orthogonal polynomials, hypergeometric polynomials, q-polynomials, Fibonacci-related polynomials, combinatorial polynomials, and number-theoretic polynomials. The study aims to investigate the suitability of these polynomials as basis functions in KAN models for complex tasks like handwritten digit classification on the MNIST dataset. The performance metrics of the KAN models, including overall accuracy, Kappa, and F1 score, are evaluated and compared. The Gottlieb-KAN model achieves the highest performance across all metrics, suggesting its potential as a suitable choice for the given task. However, further analysis and tuning of these polynomials on more complex datasets are necessary to fully understand their capabilities in KAN models. The source code for the implementation of these KAN models is available at this https URL .
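
A hedged sketch of the core idea, using the Chebyshev family as one of the orthogonal-polynomial groups surveyed: each KAN edge is a learnable univariate function expressed as a linear combination of polynomial basis functions. The degree, sizes, and initialization below are illustrative, not the paper's configuration.

```python
import numpy as np

def chebyshev_basis(x, degree):
    # Chebyshev polynomials T_0..T_degree at x (x in [-1, 1]),
    # built with the recurrence T_n = 2x T_{n-1} - T_{n-2}.
    T = [np.ones_like(x), x]
    for _ in range(2, degree + 1):
        T.append(2 * x * T[-1] - T[-2])
    return np.stack(T[: degree + 1], axis=-1)

class PolyEdge:
    # One KAN edge: a learnable univariate function given by polynomial
    # basis coefficients (normally trained; here just initialized).
    def __init__(self, degree, rng):
        self.coef = rng.standard_normal(degree + 1) * 0.1

    def __call__(self, x):
        return chebyshev_basis(x, len(self.coef) - 1) @ self.coef

rng = np.random.default_rng(0)
edge = PolyEdge(degree=4, rng=rng)
x = np.linspace(-1, 1, 5)
y = edge(x)
```

Swapping the basis (Gottlieb, Fibonacci-related, q-polynomials, etc.) only changes `chebyshev_basis`, which is what makes a comparative study across polynomial groups straightforward.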

[LG-193] Spatiotemporal Predictions of Toxic Urban Plumes Using Deep Learning

链接: https://arxiv.org/abs/2406.02582
作者: Yinan Wang,M. Giselle Fernández-Godino,Nipun Gunawardena,Donald D. Lucas,Xiaowei Yue
关键词: impact populated areas, Industrial accidents, release large amounts, chemical spills, populated areas
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注: 13 pages, 10 figures

点击查看摘要

Abstract:Industrial accidents, chemical spills, and structural fires can release large amounts of harmful materials that disperse into urban atmospheres and impact populated areas. Computer models are typically used to predict the transport of toxic plumes by solving fluid dynamical equations. However, these models can be computationally expensive due to the need for many grid cells to simulate turbulent flow and resolve individual buildings and streets. In emergency response situations, alternative methods are needed that can run quickly and adequately capture important spatiotemporal features. Here, we present a novel deep learning model called ST-GasNet that was inspired by the mathematical equations that govern the behavior of plumes as they disperse through the atmosphere. ST-GasNet learns the spatiotemporal dependencies from a limited set of temporal sequences of ground-level toxic urban plumes generated by a high-resolution large eddy simulation model. On independent sequences, ST-GasNet accurately predicts the late-time spatiotemporal evolution, given the early-time behavior as an input, even for cases when a building splits a large plume into smaller plumes. By incorporating large-scale wind boundary condition information, ST-GasNet achieves a prediction accuracy of at least 90% on test data for the entire prediction period.

[LG-194] Constrained or Unconstrained? Neural-Network-Based Equation Discovery from Data

链接: https://arxiv.org/abs/2406.02581
作者: Grant Norman,Jacqueline Wentz,Hemanth Kolla,Kurt Maute,Alireza Doostan
关键词: practitioners often rely, model systems, constrained optimization problem, neural network, neural network PDEs
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA); Machine Learning (stat.ML)
*备注: 28 pages, 18 figures

点击查看摘要

Abstract:Throughout many fields, practitioners often rely on differential equations to model systems. Yet, for many applications, the theoretical derivation of such equations and/or accurate resolution of their solutions may be intractable. Instead, recently developed methods, including those based on parameter estimation, operator subset selection, and neural networks, allow for the data-driven discovery of both ordinary and partial differential equations (PDEs), on a spectrum of interpretability. The success of these strategies is often contingent upon the correct identification of representative equations from noisy observations of state variables and, as importantly and intertwined with that, the mathematical strategies utilized to enforce those equations. Specifically, the latter has been commonly addressed via unconstrained optimization strategies. Representing the PDE as a neural network, we propose to discover the PDE by solving a constrained optimization problem and using an intermediate state representation similar to a Physics-Informed Neural Network (PINN). The objective function of this constrained optimization problem promotes matching the data, while the constraints require that the PDE is satisfied at several spatial collocation points. We present a penalty method and a widely used trust-region barrier method to solve this constrained optimization problem, and we compare these methods on numerical examples. Our results on the Burgers’ and the Korteweg-de Vries equations demonstrate that the latter constrained method outperforms the penalty method, particularly for higher noise levels or fewer collocation points. For both methods, we solve these discovered neural network PDEs with classical methods, such as finite difference methods, as opposed to PINNs-type methods relying on automatic differentiation. We briefly highlight other small, yet crucial, implementation details.
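
The penalty method being compared can be illustrated on a toy equality-constrained least-squares problem (not the PDE setting itself, where the objective is the data misfit and the constraints are PDE residuals at collocation points): the constraint is folded into the objective with a weight mu, and larger mu drives the constraint violation of the minimizer toward zero.

```python
import numpy as np

def penalty_minimizer(d, a, b, mu):
    # Minimizer of ||x - d||^2 + mu * (a @ x - b)^2, the penalty-method
    # relaxation of: min ||x - d||^2  s.t.  a @ x = b.
    # Setting the gradient to zero gives (I + mu a a^T) x = d + mu b a.
    n = len(d)
    return np.linalg.solve(np.eye(n) + mu * np.outer(a, a), d + mu * b * a)

d = np.array([2.0, 2.0])
a = np.array([1.0, 1.0])
b = 1.0
x_weak = penalty_minimizer(d, a, b, mu=1.0)     # loose penalty
x_strong = penalty_minimizer(d, a, b, mu=1e4)   # tight penalty
violation = lambda x: abs(a @ x - b)
```

The constraint is only satisfied in the limit mu → ∞, which is precisely the ill-conditioning that motivates comparing against methods (like the trust-region barrier approach) that treat the constraints exactly.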

[LG-195] Exploiting Chaotic Dynamics as Deep Neural Networks

链接: https://arxiv.org/abs/2406.02580
作者: Shuhong Liu,Nozomi Akashi,Qingyao Huang,Yasuo Kuniyoshi,Kohei Nakajima
关键词: complex dynamics arising, initial states, arising from nonlinearity, sensitivity to initial, presents complex dynamics
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Chaos presents complex dynamics arising from nonlinearity and a sensitivity to initial states. These characteristics suggest a depth of expressivity that underscores their potential for advanced computational applications. However, strategies to effectively exploit chaotic dynamics for information processing have largely remained elusive. In this study, we reveal that the essence of chaos can be found in various state-of-the-art deep neural networks. Drawing inspiration from this revelation, we propose a novel method that directly leverages chaotic dynamics for deep learning architectures. Our approach is systematically evaluated across distinct chaotic systems. In all instances, our framework presents superior results to conventional deep neural networks in terms of accuracy, convergence speed, and efficiency. Furthermore, we found an active role of transient chaos formation in our scheme. Collectively, this study offers a new path for the integration of chaos, which has long been overlooked in information processing, and provides insights into the prospective fusion of chaotic dynamics within the domains of machine learning and neuromorphic computation.
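
The sensitivity to initial states that this line of work builds on can be seen in the logistic map, a standard chaotic system (used here purely as an illustration, not one of the paper's evaluated systems): two trajectories starting 1e-9 apart decorrelate within a few dozen iterations.

```python
def logistic_map(x, r=4.0):
    # Logistic map; fully chaotic at r = 4.
    return r * x * (1.0 - x)

def trajectory(x0, steps, r=4.0):
    xs = [x0]
    for _ in range(steps):
        xs.append(logistic_map(xs[-1], r))
    return xs

a = trajectory(0.2, 40)
b = trajectory(0.2 + 1e-9, 40)   # initial state perturbed by 1e-9
gap_start = abs(a[1] - b[1])
gap_late = max(abs(x - y) for x, y in zip(a[20:], b[20:]))
```

The exponential growth of tiny perturbations is the expressivity the paper proposes to harness, rather than suppress, inside deep architectures.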

[LG-196] An Open-Source Framework for Efficient Numerically-Tailored Computations

链接: https://arxiv.org/abs/2406.02579
作者: Louis Ledoux,Marc Casas
关键词: numerically-tailored Matrix-Matrix Multiplications, Matrix-Matrix Multiplications, High Performance Computing, Sea Surface Height, facilitate efficient
类目: Mathematical Software (cs.MS); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: 6 pages, open-source

点击查看摘要

Abstract:We present a versatile open-source framework designed to facilitate efficient, numerically-tailored Matrix-Matrix Multiplications (MMMs). The framework offers two primary contributions: first, a fine-tuned, automated pipeline for arithmetic datapath generation, enabling highly customizable systolic MMM kernels; second, seamless integration of the generated kernels into user code, irrespective of the programming language employed, without necessitating modifications. The framework demonstrates a systematic enhancement in accuracy per energy cost across diverse High Performance Computing (HPC) workloads displaying a variety of numerical requirements, such as Artificial Intelligence (AI) inference and Sea Surface Height (SSH) computation. For AI inference, we consider a set of state-of-the-art neural network models, namely ResNet18, ResNet34, ResNet50, DenseNet121, DenseNet161, DenseNet169, and VGG11, in conjunction with two datasets, two computer formats, and 27 distinct intermediate arithmetic datapaths. Our approach consistently reduces energy consumption across all cases, with a notable example being the reduction by factors of 3.3× for IEEE754-32 and 1.4× for Bfloat16 during ImageNet inference with ResNet50. This is accomplished while maintaining accuracies of 82.3% and 86%, comparable to those achieved with conventional Floating-Point Units (FPUs). In the context of SSH computation, our method achieves fully-reproducible results using double-precision words, surpassing the accuracy of conventional double- and quad-precision arithmetic in FPUs. Our approach enhances SSH computation accuracy by a minimum of 5× and 27× compared to IEEE754-64 and IEEE754-128, respectively, resulting in 5.6× and 15.1× improvements in accuracy per power cost.
Journal reference: International Conference on Field Programmable Logic and Applications 2023. DOI: https://doi.org/10.1109/FPL60245.2023.00011

[LG-197] Pretrained Mobility Transformer: A Foundation Model for Human Mobility

链接: https://arxiv.org/abs/2406.02578
作者: Xinhua Wu,Haoyu He,Yanchao Wang,Qi Wang
关键词: Ubiquitous mobile devices, generating vast amounts, location-based service data, Ubiquitous mobile, utilize urban spaces
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Ubiquitous mobile devices are generating vast amounts of location-based service data that reveal how individuals navigate and utilize urban spaces in detail. In this study, we utilize these extensive, unlabeled sequences of user trajectories to develop a foundation model for understanding urban space and human mobility. We introduce the Pretrained Mobility Transformer (PMT), which leverages the transformer architecture to process user trajectories in an autoregressive manner, converting geographical areas into tokens and embedding spatial and temporal information within these representations. Experiments conducted in three U.S. metropolitan areas over a two-month period demonstrate PMT’s ability to capture underlying geographic and socio-demographic characteristics of regions. The proposed PMT excels across various downstream tasks, including next-location prediction, trajectory imputation, and trajectory generation. These results support PMT’s capability and effectiveness in decoding complex patterns of human mobility, offering new insights into urban spatial functionality and individual mobility preferences.
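
The "geographical areas into tokens" step can be sketched as simple grid-cell discretization. The grid origin, cell size, grid width, and duplicate-collapsing rule below are illustrative assumptions, not PMT's actual tokenizer.

```python
def latlon_to_token(lat, lon, cell_deg=0.01, lat0=40.0, lon0=-75.0, n_cols=1000):
    # Grid-cell token id; origin, cell size, and grid width are
    # hypothetical choices for illustration.
    row = int((lat - lat0) / cell_deg)
    col = int((lon - lon0) / cell_deg)
    return row * n_cols + col

def tokenize_trajectory(points):
    # Collapse consecutive visits to the same cell into one token,
    # so a stay in one place yields a single symbol in the sequence.
    tokens = []
    for lat, lon in points:
        t = latlon_to_token(lat, lon)
        if not tokens or tokens[-1] != t:
            tokens.append(t)
    return tokens

traj = [(40.001, -74.995), (40.002, -74.995), (40.015, -74.975)]
tokens = tokenize_trajectory(traj)
```

Once trajectories are token sequences, next-location prediction reduces to next-token prediction, which is what lets a transformer be trained autoregressively on them.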

[LG-198] Are PPO-ed Language Models Hackable?

链接: https://arxiv.org/abs/2406.02577
作者: Suraj Anand,David Getzen
关键词: remove undesirable behaviors, Numerous algorithms, undesirable behaviors, remove undesirable, Numerous
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 8 pages, 4 figures

点击查看摘要

Abstract:Numerous algorithms have been proposed to align language models to remove undesirable behaviors. However, the challenges associated with a very large state space and creating a proper reward function often result in various jailbreaks. Our paper aims to examine this effect of reward in the controlled setting of positive sentiment language generation. Instead of online training of a reward model based on human feedback, we employ a statically learned sentiment classifier. We also consider a setting where our model’s weights and activations are exposed to an end-user after training. We examine a pretrained GPT-2 through the lens of mechanistic interpretability before and after proximal policy optimization (PPO) has been applied to promote positive sentiment responses. Using these insights, we (1) attempt to “hack” the PPO-ed model to generate negative sentiment responses and (2) add a term to the reward function to try and alter “negative” weights.

[LG-199] Cross-Modal Safety Alignment: Is textual unlearning all you need?

链接: https://arxiv.org/abs/2406.02575
作者: Trishna Chakraborty,Erfan Shayegani,Zikui Cai,Nael Abu-Ghazaleh,M. Salman Asif,Yue Dong,Amit K. Roy-Chowdhury,Chengyu Song
关键词: Large Language Models, Supervised Fine-tuning, Human Feedback, Reinforcement Learning, Learning with Human
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent studies reveal that integrating new modalities into Large Language Models (LLMs), such as Vision-Language Models (VLMs), creates a new attack surface that bypasses existing safety training techniques like Supervised Fine-tuning (SFT) and Reinforcement Learning with Human Feedback (RLHF). While further SFT and RLHF-based safety training can be conducted in multi-modal settings, collecting multi-modal training datasets poses a significant challenge. Inspired by the structural design of recent multi-modal models, where, regardless of the combination of input modalities, all inputs are ultimately fused into the language space, we aim to explore whether unlearning solely in the textual domain can be effective for cross-modality safety alignment. Our evaluation across six datasets empirically demonstrates the transferability – textual unlearning in VLMs significantly reduces the Attack Success Rate (ASR) to less than 8% and in some cases, even as low as nearly 2% for both text-based and vision-text-based attacks, alongside preserving the utility. Moreover, our experiments show that unlearning with a multi-modal dataset offers no potential benefits but incurs significantly increased computational demands, possibly up to 6 times higher.

[LG-200] Resource-constrained Fairness

链接: https://arxiv.org/abs/2406.01290
作者: Sofie Goethals,Eoin Delaney,Brent Mittelstadt,Chris Russell
关键词: resources strongly constrains, strongly constrains, Access, Access to resources, resources strongly
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注:

点击查看摘要

Abstract:Access to resources strongly constrains the decisions we make. While we might wish to offer every student a scholarship, or schedule every patient for follow-up meetings with a specialist, limited resources mean that this is not possible. Existing tools for fair machine learning ignore these key constraints, with the majority of methods disregarding any finite resource limitations under which decisions are made. Our research introduces the concept of “resource-constrained fairness” and quantifies the cost of fairness within this framework. We demonstrate that the level of available resources significantly influences this cost, a factor that has been overlooked in previous evaluations.
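
A toy version of the idea: with a hard budget of k selections, a fairness rule that splits the budget across groups can be compared against unconstrained top-k selection, and the utility gap is one simple notion of the "cost of fairness" under limited resources. The equal-split rule below is an illustrative stand-in, not the paper's formal framework.

```python
def select_top_k(scores, k):
    # Unconstrained: spend the whole budget on the k highest scorers.
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]

def select_group_balanced(scores, groups, k):
    # Split the budget equally across groups (a simple demographic-
    # parity-style rule), taking each group's top scorers.
    per_group = k // len(set(groups))
    chosen = []
    for g in sorted(set(groups)):
        members = sorted((i for i in range(len(scores)) if groups[i] == g),
                         key=lambda i: -scores[i])
        chosen += members[:per_group]
    return chosen

scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
groups = ["a", "a", "a", "a", "b", "b"]
budget = 4
util = lambda idx: sum(scores[i] for i in idx)
cost_of_fairness = (util(select_top_k(scores, budget))
                    - util(select_group_balanced(scores, groups, budget)))
```

Shrinking or growing `budget` changes this gap, which is the kind of resource-dependence the abstract argues existing fairness evaluations overlook.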

[LG-201] Towards Practical Single-shot Motion Synthesis

链接: