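This post lists the latest papers fetched from arXiv each day. It is updated automatically every morning at around 11:30, and the papers are grouped into five broad areas: NLP, CV, ML, AI, and IR. To receive the daily list by email, leave your email address in the comments; the email is also sent automatically at around 11:30 each day.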

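Overview (2024-06-05)

524 papers were updated today, including:

  • Natural language processing: 93 papers (Computation and Language (cs.CL))
  • Computer vision: 106 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Artificial intelligence: 149 papers (Artificial Intelligence (cs.AI))
  • Machine learning: 209 papers (Machine Learning (cs.LG))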

Natural Language Processing

[NLP-0] To Believe or Not to Believe Your LLM

Link: https://arxiv.org/abs/2406.02543
Authors: Yasin Abbasi Yadkori, Ilja Kuzborskij, András György, Csaba Szepesvári
Keywords: large language models, explore uncertainty quantification, goal to identify, epistemic uncertainty, explore uncertainty
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:We explore uncertainty quantification in large language models (LLMs), with the goal to identify when uncertainty in responses given a query is large. We simultaneously consider both epistemic and aleatoric uncertainties, where the former comes from the lack of knowledge about the ground truth (such as about facts or the language), and the latter comes from irreducible randomness (such as multiple possible answers). In particular, we derive an information-theoretic metric that allows to reliably detect when only epistemic uncertainty is large, in which case the output of the model is unreliable. This condition can be computed based solely on the output of the model obtained simply by some special iterative prompting based on the previous responses. Such quantification, for instance, allows to detect hallucinations (cases when epistemic uncertainty is high) in both single- and multi-answer responses. This is in contrast to many standard uncertainty quantification strategies (such as thresholding the log-likelihood of a response) where hallucinations in the multi-answer case cannot be detected. We conduct a series of experiments which demonstrate the advantage of our formulation. Further, our investigations shed some light on how the probabilities assigned to a given output by an LLM can be amplified by iterative prompting, which might be of independent interest.
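
The abstract above contrasts its metric with simpler scores that cannot separate the two kinds of uncertainty. The following is a minimal sketch of such a naive score, the entropy of sampled answers. It is not the paper's information-theoretic metric, and all sample data and names are hypothetical. It only illustrates why a single spread-based score gives the same value for a query with many valid answers (aleatoric) and for a query where the model is guessing (epistemic).

```python
import math
from collections import Counter


def answer_entropy(samples):
    """Shannon entropy (in nats) of the empirical distribution over sampled answers."""
    counts = Counter(samples)
    total = len(samples)
    return sum(-(c / total) * math.log(c / total) for c in counts.values())


# Hypothetical samples, for illustration only.
factual_samples = ["Paris"] * 10                                   # one confident answer
open_ended_samples = ["London", "Leeds", "York", "Bath", "Hull"] * 2    # many valid answers
hallucinated_samples = ["1912", "1875", "1903", "1931", "1889"] * 2     # inconsistent guesses

for name, samples in [
    ("factual", factual_samples),
    ("open-ended", open_ended_samples),
    ("hallucinated", hallucinated_samples),
]:
    print(name, round(answer_entropy(samples), 3))
```

The open-ended and hallucinated cases receive identical scores here, which is the failure mode the abstract attributes to standard strategies in the multi-answer setting.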

[NLP-1] Parrot: Multilingual Visual Instruction Tuning

Link: https://arxiv.org/abs/2406.02539
Authors: Hai-Long Sun, Da-Wei Zhou, Yang Li, Shiyin Lu, Chao Yi, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, De-Chuan Zhan, Han-Jia Ye
Keywords: Large Language Models, Multimodal Large Language, artificial general intelligence, Language Models, Multimodal Large
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:The rapid development of Multimodal Large Language Models (MLLMs) like GPT-4V has marked a significant step towards artificial general intelligence. Existing methods mainly focus on aligning vision encoders with LLMs through supervised fine-tuning (SFT) to endow LLMs with multimodal abilities, making MLLMs’ inherent ability to react to multiple languages progressively deteriorate as the training process evolves. We empirically find that the imbalanced SFT datasets, primarily composed of English-centric image-text pairs, lead to significantly reduced performance in non-English languages. This is due to the failure of aligning the vision encoder and LLM with multilingual tokens during the SFT process. In this paper, we introduce Parrot, a novel method that utilizes textual guidance to drive visual token alignment at the language level. Parrot makes the visual tokens condition on diverse language inputs and uses Mixture-of-Experts (MoE) to promote the alignment of multilingual tokens. Specifically, to enhance non-English visual tokens alignment, we compute the cross-attention using the initial visual features and textual embeddings, the result of which is then fed into the MoE router to select the most relevant experts. The selected experts subsequently convert the initial visual tokens into language-specific visual tokens. Moreover, considering the current lack of benchmarks for evaluating multilingual capabilities within the field, we collect and make available a Massive Multilingual Multimodal Benchmark which includes 6 languages, 15 categories, and 12,000 questions, named as MMMB. Our method not only demonstrates state-of-the-art performance on multilingual MMBench and MMMB, but also excels across a broad range of multimodal tasks. Both the source code and the training dataset of Parrot will be made publicly available.

[NLP-2] TopViewRS: Vision-Language Models as Top-View Spatial Reasoners

Link: https://arxiv.org/abs/2406.02537
Authors: Chengzu Li, Caiqi Zhang, Han Zhou, Nigel Collier, Anna Korhonen, Ivan Vulić
Keywords: Top-view perspective denotes, large Vision-Language Models, perspective denotes, denotes a typical, vital for localization
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 9 pages, 3 figures, 3 tables (21 pages, 4 figures, 15 tables including references and appendices)

Abstract: Top-view perspective denotes a typical way in which humans read and reason over different types of maps, and it is vital for localization and navigation of humans as well as of 'non-human' agents, such as the ones backed by large Vision-Language Models (VLMs). Nonetheless, spatial reasoning capabilities of modern VLMs remain unattested and underexplored. In this work, we thus study their capability to understand and reason over spatial relations from the top view. The focus on top view also enables controlled evaluations at different granularity of spatial reasoning; we clearly disentangle different abilities (e.g., recognizing particular objects versus understanding their relative positions). We introduce the TopViewRS (Top-View Reasoning in Space) dataset, consisting of 11,384 multiple-choice questions with either realistic or semantic top-view map as visual input. We then use it to study and evaluate VLMs across 4 perception and reasoning tasks with different levels of complexity. Evaluation of 10 representative open- and closed-source VLMs reveals the gap of more than 50% compared to average human performance, and it is even lower than the random baseline in some cases. Although additional experiments show that Chain-of-Thought reasoning can boost model capabilities by 5.82% on average, the overall performance of VLMs remains limited. Our findings underscore the critical need for enhanced model capability in top-view spatial reasoning and set a foundation for further research towards human-level proficiency of VLMs in real-world multimodal tasks.

[NLP-3] Mitigate Position Bias in Large Language Models via Scaling a Single Dimension

Link: https://arxiv.org/abs/2406.02536
Authors: Yijiong Yu, Huiqiang Jiang, Xufang Luo, Qianhui Wu, Chin-Yew Lin, Dongsheng Li, Yuqing Yang, Yongfeng Huang, Lili Qiu
Keywords: Large Language Models, Large Language, robust generative abilities, excellent generalization capabilities, real-world scenarios due
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract: Large Language Models (LLMs) are increasingly applied in various real-world scenarios due to their excellent generalization capabilities and robust generative abilities. However, they exhibit position bias, also known as “lost in the middle”, a phenomenon that is especially pronounced in long-context scenarios, which indicates the placement of the key information in different positions of a prompt can significantly affect accuracy. This paper first explores the micro-level manifestations of position bias, concluding that attention weights are a micro-level expression of position bias. It further identifies that, in addition to position embeddings, the causal attention mask also contributes to position bias by creating position-specific hidden states. Based on these insights, we propose a method to mitigate position bias by scaling these positional hidden states. Experiments on the NaturalQuestions Multi-document QA, KV retrieval, LongBench and timeline reorder tasks, using various models including RoPE models, context window extended models, and Alibi models, demonstrate the effectiveness and generalizability of our approach. Our method can improve performance by up to 15.2% by modifying just one dimension of hidden states. Our code is available at this https URL.
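
The core operation described above, modifying a single dimension of the hidden states, can be sketched as follows. This is an illustration under assumptions rather than the authors' implementation: the function name, the chosen dimension index, and the scale factor are all hypothetical, and the abstract does not say how the dimension is identified.

```python
def scale_hidden_dimension(hidden_states, dim_index, factor):
    """Return a copy of hidden_states (seq_len x hidden_size, as nested lists)
    with one feature dimension multiplied by `factor`; other dimensions are untouched."""
    return [
        [value * factor if j == dim_index else value for j, value in enumerate(row)]
        for row in hidden_states
    ]


# Hypothetical hidden states for a 3-token sequence with hidden size 4.
hidden = [
    [0.1, 0.2, 5.0, 0.4],
    [0.3, 0.1, 6.0, 0.2],
    [0.2, 0.5, 7.0, 0.1],
]

# Assumption: dimension 2 is the position-dependent one; 0.5 is an arbitrary factor.
scaled = scale_hidden_dimension(hidden, dim_index=2, factor=0.5)
print(scaled)
```

In a real model the same scaling would be applied to tensors inside the forward pass; nested lists are used here only to keep the sketch dependency-free.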

[NLP-4] SpecExec: Massively Parallel Speculative Decoding for Interactive LLM Inference on Consumer Devices

Link: https://arxiv.org/abs/2406.02532
Authors: Ruslan Svirschevski, Avner May, Zhuoming Chen, Beidi Chen, Zhihao Jia, Max Ryabinin
Keywords: gain widespread adoption, language models gain, models gain widespread, large language models, widespread adoption
Categories: Computation and Language (cs.CL)
Comments: preprint. arXiv admin note: text overlap with arXiv:2312.17238 by other authors

Abstract:As large language models gain widespread adoption, running them efficiently becomes crucial. Recent works on LLM inference use speculative decoding to achieve extreme speedups. However, most of these works implicitly design their algorithms for high-end datacenter hardware. In this work, we ask the opposite question: how fast can we run LLMs on consumer machines? Consumer GPUs can no longer fit the largest available models (50B+ parameters) and must offload them to RAM or SSD. When running with offloaded parameters, the inference engine can process batches of hundreds or thousands of tokens at the same time as just one token, making it a natural fit for speculative decoding. We propose SpecExec (Speculative Execution), a simple parallel decoding method that can generate up to 20 tokens per target model iteration for popular LLM families. It utilizes the high spikiness of the token probabilities distribution in modern LLMs and a high degree of alignment between model output probabilities. SpecExec takes the most probable tokens continuation from the draft model to build a “cache” tree for the target model, which then gets validated in a single pass. Using SpecExec, we demonstrate inference of 50B+ parameter LLMs on consumer GPUs with RAM offloading at 4-6 tokens per second with 4-bit quantization or 2-3 tokens per second with 16-bit weights.
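
The abstract describes building a tree of the draft model's most probable continuations and validating it against the target model. A toy sketch of that idea follows, assuming greedy target decoding and bigram lookup tables standing in for both models. The tables, function names, and budget are hypothetical, and a real system would score all tree nodes in one batched target pass rather than calling the target step by step.

```python
import heapq

# Hypothetical bigram "models": the draft gives next-token probabilities,
# the target gives its single greedy next token.
DRAFT = {
    None: {"the": 0.9, "a": 0.1},
    "the": {"cat": 0.8, "dog": 0.2},
    "a": {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 0.9, "ran": 0.1},
    "dog": {"ran": 1.0},
}
TARGET = {None: "the", "the": "cat", "cat": "ran"}


def draft_next(prefix):
    return DRAFT.get(prefix[-1] if prefix else None, {})


def target_next(prefix):
    return TARGET.get(prefix[-1] if prefix else None, "<eos>")


def build_draft_tree(draft_next, budget):
    """Best-first expansion: keep the `budget` prefixes with the highest
    cumulative draft probability."""
    tree = set()
    frontier = [(-p, (tok,)) for tok, p in draft_next(()).items()]
    heapq.heapify(frontier)
    while frontier and len(tree) < budget:
        neg_p, prefix = heapq.heappop(frontier)
        tree.add(prefix)
        for tok, p in draft_next(prefix).items():
            heapq.heappush(frontier, (neg_p * p, prefix + (tok,)))
    return tree


def accept_from_tree(tree, target_next):
    """Follow the target's choices through the tree; the first token that
    leaves the tree is still emitted, since it comes from the target itself."""
    accepted = ()
    while True:
        candidate = accepted + (target_next(accepted),)
        if candidate not in tree:
            return candidate
        accepted = candidate


tree = build_draft_tree(draft_next, budget=4)
tokens = accept_from_tree(tree, target_next)
print(sorted(tree))
print(tokens)
```

The more of the target's path the tree covers, the more tokens each validation round yields, which is why a large token budget suits the offloading setting the abstract describes.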

[NLP-5] Scalable MatMul-free Language Modeling

Link: https://arxiv.org/abs/2406.02528
Authors: Rui-Jie Zhu, Yu Zhang, Ethan Sifferman, Tyler Sheaves, Yiqiao Wang, Dustin Richmond, Peng Zhou, Jason K. Eshraghian
Keywords: Matrix multiplication, large language models, typically dominates, large language, Matrix
Categories: Computation and Language (cs.CL)

Abstract: Matrix multiplication (MatMul) typically dominates the overall computational cost of large language models (LLMs). This cost only grows as LLMs scale to larger embedding dimensions and context lengths. In this work, we show that MatMul operations can be completely eliminated from LLMs while maintaining strong performance at billion-parameter scales. Our experiments show that our proposed MatMul-free models achieve performance on-par with state-of-the-art Transformers that require far more memory during inference at a scale up to at least 2.7B parameters. We investigate the scaling laws and find that the performance gap between our MatMul-free models and full precision Transformers narrows as the model size increases. We also provide a GPU-efficient implementation of this model which reduces memory usage by up to 61% over an unoptimized baseline during training. By utilizing an optimized kernel during inference, our model’s memory consumption can be reduced by more than 10x compared to unoptimized models. To properly quantify the efficiency of our architecture, we build a custom hardware solution on an FPGA which exploits lightweight operations beyond what GPUs are capable of. We processed billion-parameter scale models at 13W beyond human readable throughput, moving LLMs closer to brain-like efficiency. This work not only shows how far LLMs can be stripped back while still performing effectively, but also points at the types of operations future accelerators should be optimized for in processing the next generation of lightweight LLMs. Our code implementation is available at this https URL.
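
The abstract does not spell out how the multiplications are removed. One way a dense layer can avoid them, assumed here purely for illustration, is to restrict weights to {-1, 0, +1}, so that each output is a signed sum of inputs. The function name and data below are hypothetical.

```python
def ternary_matvec(weights, x):
    """Matrix-vector product for weights restricted to {-1, 0, +1},
    computed with additions and subtractions only."""
    out = []
    for row in weights:
        acc = 0.0
        for w, value in zip(row, x):
            if w == 1:
                acc += value
            elif w == -1:
                acc -= value
            elif w != 0:
                raise ValueError("weights must be -1, 0, or +1")
        out.append(acc)
    return out


# Hypothetical 2x3 ternary weight matrix and input vector.
weights = [
    [1, 0, -1],
    [0, 1, 1],
]
x = [2.0, 3.0, 5.0]
result = ternary_matvec(weights, x)
print(result)
```

Operations of this kind map to simple adders, which is consistent with the abstract's point that lightweight operations can be exploited on custom hardware such as an FPGA.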

[NLP-6] CheckEmbed: Effective Verification of LLM Solutions to Open-Ended Tasks

Link: https://arxiv.org/abs/2406.02524
Authors: Maciej Besta, Lorenzo Paleari, Ales Kubicek, Piotr Nyczyk, Robert Gerstenberger, Patrick Iff, Tomasz Lehmann, Hubert Niewiadomski, Torsten Hoefler
Keywords: Large Language Models, Large Language, Language Models, intricate open-ended tasks, revolutionizing various domains
Categories: Computation and Language (cs.CL)

Abstract:Large Language Models (LLMs) are revolutionizing various domains, yet verifying their answers remains a significant challenge, especially for intricate open-ended tasks such as consolidation, summarization, and extraction of knowledge. In this work, we propose CheckEmbed: an accurate, scalable, and simple LLM verification approach. CheckEmbed is driven by a straightforward yet powerful idea: in order to compare LLM solutions to one another or to the ground-truth, compare their corresponding answer-level embeddings obtained with a model such as GPT Text Embedding Large. This reduces a complex textual answer to a single embedding, facilitating straightforward, fast, and meaningful verification. We develop a comprehensive verification pipeline implementing the CheckEmbed methodology. The CheckEmbed pipeline also comes with metrics for assessing the truthfulness of the LLM answers, such as embedding heatmaps and their summaries. We show how to use these metrics for deploying practical engines that decide whether an LLM answer is satisfactory or not. We apply the pipeline to real-world document analysis tasks, including term extraction and document summarization, showcasing significant improvements in accuracy, cost-effectiveness, and runtime performance compared to existing token-, sentence-, and fact-level schemes such as BERTScore or SelfCheckGPT.
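
The idea of comparing answers through one embedding per answer can be sketched with cosine similarity. This is a minimal illustration, not the CheckEmbed pipeline: the three-dimensional vectors below are made up and stand in for embeddings that would come from a real embedding model, and cosine similarity is an assumed choice of comparison.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_matrix(embeddings):
    """Pairwise similarities between answer-level embeddings (the data behind a heatmap)."""
    return [[cosine_similarity(a, b) for b in embeddings] for a in embeddings]


# Hypothetical answer-level embeddings: two answers that agree, one that diverges.
answer_embeddings = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.0, 0.1, 0.9],
]

matrix = similarity_matrix(answer_embeddings)
for row in matrix:
    print([round(value, 3) for value in row])
```

Low off-diagonal values flag answers that disagree with the rest, which is the kind of signal the abstract's heatmaps and summaries are meant to expose.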

[NLP-7] Deterministic Reversible Data Augmentation for Neural Machine Translation

Link: https://arxiv.org/abs/2406.02517
Authors: Jiashu Yao, Heyan Huang, Zeming Liu, Yuhang Guo
Keywords: introduce semantic inconsistency, subword sampling procedures, effective data augmentation, data augmentation method, Reversible Data Augmentation
Categories: Computation and Language (cs.CL)
Comments: Findings of ACL 2024

Abstract:Data augmentation is an effective way to diversify corpora in machine translation, but previous methods may introduce semantic inconsistency between original and augmented data because of irreversible operations and random subword sampling procedures. To generate both symbolically diverse and semantically consistent augmentation data, we propose Deterministic Reversible Data Augmentation (DRDA), a simple but effective data augmentation method for neural machine translation. DRDA adopts deterministic segmentations and reversible operations to generate multi-granularity subword representations and pulls them closer together with multi-view techniques. With no extra corpora or model changes required, DRDA outperforms strong baselines on several translation tasks with a clear margin (up to 4.3 BLEU gain over Transformer) and exhibits good robustness in noisy, low-resource, and cross-domain datasets.

[NLP-8] Hiding Text in Large Language Models: Introducing Unconditional Token Forcing Confusion

Link: https://arxiv.org/abs/2406.02481
Authors: Jakub Hoscilowicz, Pawel Popiolek, Jan Rudkowski, Jedrzej Bieniasz, Artur Janicki
Keywords: Unconditional Token Forcing, LLM, artificially embed hidden, Unconditional Token, Token Forcing
Categories: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments: Work in progress. Code is available at this https URL

Abstract: With the help of simple fine-tuning, one can artificially embed hidden text into large language models (LLMs). This text is revealed only when triggered by a specific query to the LLM. Two primary applications are LLM fingerprinting and steganography. In the context of LLM fingerprinting, a unique text identifier (fingerprint) is embedded within the model to verify licensing compliance. In the context of steganography, the LLM serves as a carrier for hidden messages that can be disclosed through a designated trigger. Our work demonstrates that embedding hidden text in the LLM via fine-tuning, though seemingly secure due to the vast number of potential triggers (any sequence of characters or tokens could serve as a trigger), is susceptible to extraction through analysis of the LLM’s output decoding process. We propose a novel approach to extraction called Unconditional Token Forcing. It is premised on the hypothesis that iteratively feeding each token from the LLM’s vocabulary into the model should reveal sequences with abnormally high token probabilities, indicating potential embedded text candidates. Additionally, our experiments show that when the first token of a hidden fingerprint is used as an input, the LLM not only produces an output sequence with high token probabilities, but also repetitively generates the fingerprint itself. We also present a method to hide text in such a way that it is resistant to Unconditional Token Forcing, which we named Unconditional Token Forcing Confusion.
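
The extraction hypothesis above (feed each vocabulary token to the model and look for continuations with abnormally high token probabilities) can be sketched on a toy model. Everything here is hypothetical: the bigram table stands in for an LLM, and the step count and threshold are arbitrary. It illustrates the loop structure only, not the authors' code.

```python
# Hypothetical toy "model": next-token probabilities keyed by the last token.
# The tokens x7 -> q2 -> z9 play the role of an embedded fingerprint.
TOY_MODEL = {
    "the": {"a": 0.4, "x7": 0.3, "q2": 0.3},
    "a": {"the": 0.6, "z9": 0.4},
    "x7": {"q2": 0.99, "the": 0.01},
    "q2": {"z9": 0.99, "a": 0.01},
    "z9": {"x7": 0.99, "the": 0.01},
}


def next_probs(tokens):
    return TOY_MODEL.get(tokens[-1], {})


def greedy_continuation(next_probs, first_token, steps):
    """Greedy-decode `steps` tokens after `first_token`, recording each chosen probability."""
    tokens, probs = [first_token], []
    for _ in range(steps):
        distribution = next_probs(tokens)
        if not distribution:
            break
        token, prob = max(distribution.items(), key=lambda item: item[1])
        tokens.append(token)
        probs.append(prob)
    return tokens, probs


def suspicious_sequences(next_probs, vocab, steps, threshold):
    """Return continuations whose mean token probability reaches the threshold."""
    hits = []
    for token in vocab:
        tokens, probs = greedy_continuation(next_probs, token, steps)
        if probs and sum(probs) / len(probs) >= threshold:
            hits.append(tokens)
    return hits


hits = suspicious_sequences(next_probs, list(TOY_MODEL), steps=3, threshold=0.9)
for sequence in hits:
    print(sequence)
```

In this toy setup the flagged sequences also cycle back through the fingerprint tokens, loosely mirroring the repetition behaviour the abstract reports.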

[NLP-9] Analyzing Temporal Complex Events with Large Language Models? A Benchmark towards Temporal, Long Context Understanding

Link: https://arxiv.org/abs/2406.02472
Authors: Zhihan Zhang, Yixin Cao, Chenchen Ye, Yunshan Ma, Lizi Liao, Tat-Seng Chua
Keywords: complex events, Temporal Complex Event, digital landscape, landscape is rapidly, rapidly evolving
Categories: Computation and Language (cs.CL)
Comments: Accepted to ACL 2024

Abstract:The digital landscape is rapidly evolving with an ever-increasing volume of online news, emphasizing the need for swift and precise analysis of complex events. We refer to the complex events composed of many news articles over an extended period as Temporal Complex Event (TCE). This paper proposes a novel approach using Large Language Models (LLMs) to systematically extract and analyze the event chain within TCE, characterized by their key points and timestamps. We establish a benchmark, named TCELongBench, to evaluate the proficiency of LLMs in handling temporal dynamics and understanding extensive text. This benchmark encompasses three distinct tasks - reading comprehension, temporal sequencing, and future event forecasting. In the experiment, we leverage retrieval-augmented generation (RAG) method and LLMs with long context window to deal with lengthy news articles of TCE. Our findings indicate that models with suitable retrievers exhibit comparable performance with those utilizing long context window.

[NLP-10] Landscape-Aware Growing: The Power of a Little LAG

Link: https://arxiv.org/abs/2406.02469
Authors: Stefani Karp, Nikunj Saunshi, Sobhan Miryoosefi, Sashank J. Reddi, Sanjiv Kumar
Keywords: efficient pretraining paradigms, training Transformer-based models, Transformer-based models, training Transformer-based, increasing interest
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Abstract:Recently, there has been increasing interest in efficient pretraining paradigms for training Transformer-based models. Several recent approaches use smaller models to initialize larger models in order to save computation (e.g., stacking and fusion). In this work, we study the fundamental question of how to select the best growing strategy from a given pool of growing strategies. Prior works have extensively focused on loss- and/or function-preserving behavior at initialization or simply performance at the end of training. Instead, we identify that behavior at initialization can be misleading as a predictor of final performance and present an alternative perspective based on early training dynamics, which we call “landscape-aware growing (LAG)”. We perform extensive analysis of correlation of the final performance with performance in the initial steps of training and find early and more accurate predictions of the optimal growing strategy (i.e., with only a small “lag” after initialization). This perspective also motivates an adaptive strategy for gradual stacking.
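
The selection rule implied above, ranking growing strategies by their loss a short "lag" after initialization rather than at initialization, can be sketched as follows. The strategy names and loss curves are invented for illustration and are not results from the paper.

```python
def pick_strategy(loss_curves, step):
    """Return the strategy with the lowest loss at the given training step."""
    return min(loss_curves, key=lambda name: loss_curves[name][step])


# Hypothetical loss curves indexed by checkpoint (0 = right after growing).
loss_curves = {
    "function_preserving": [3.0, 2.9, 2.8, 2.75],
    "stacking": [5.0, 3.5, 2.6, 2.3],
}

choice_at_init = pick_strategy(loss_curves, step=0)   # judged at initialization
choice_with_lag = pick_strategy(loss_curves, step=2)  # judged after a small lag
best_at_end = pick_strategy(loss_curves, step=-1)     # judged at the end of training

print(choice_at_init, choice_with_lag, best_at_end)
```

With these made-up curves the choice at initialization disagrees with the final ranking while the choice after a small lag agrees with it, which is the situation the abstract describes.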

[NLP-11] Representations as Language: An Information-Theoretic Framework for Interpretability

Link: https://arxiv.org/abs/2406.02449
Authors: Henry Conklin, Kenny Smith
Keywords: Large scale neural, Large scale, neural models show, models show impressive, scale neural models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 6 pages, 3 Figures

Abstract: Large scale neural models show impressive performance across a wide array of linguistic tasks. Despite this they remain, largely, black-boxes - inducing vector-representations of their input that prove difficult to interpret. This limits our ability to understand what they learn, and when they learn it, or describe what kinds of representations generalise well out of distribution. To address this we introduce a novel approach to interpretability that looks at the mapping a model learns from sentences to representations as a kind of language in its own right. In doing so we introduce a set of information-theoretic measures that quantify how structured a model’s representations are with respect to its input, and when during training that structure arises. Our measures are fast to compute, grounded in linguistic theory, and can predict which models will generalise best based on their representations. We use these measures to describe two distinct phases of training a transformer: an initial phase of in-distribution learning which reduces task loss, then a second stage where representations become robust to noise. Generalisation performance begins to increase during this second phase, drawing a link between generalisation and robustness to noise. Finally we look at how model size affects the structure of the representational space, showing that larger models ultimately compress their representations more than their smaller counterparts.

[NLP-12] The Scandinavian Embedding Benchmarks: Comprehensive Assessment of Multilingual and Monolingual Text Embedding

Link: https://arxiv.org/abs/2406.02396
Authors: Kenneth Enevoldsen, Márton Kardos, Niklas Muennighoff, Kristoffer Laigaard Nielbo
Keywords: English text embeddings, English text, Scandinavian Embedding Benchmark, transitioned from evaluating, evaluating a handful
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:The evaluation of English text embeddings has transitioned from evaluating a handful of datasets to broad coverage across many tasks through benchmarks such as MTEB. However, this is not the case for multilingual text embeddings due to a lack of available benchmarks. To address this problem, we introduce the Scandinavian Embedding Benchmark (SEB). SEB is a comprehensive framework that enables text embedding evaluation for Scandinavian languages across 24 tasks, 10 subtasks, and 4 task categories. Building on SEB, we evaluate more than 26 models, uncovering significant performance disparities between public and commercial solutions not previously captured by MTEB. We open-source SEB and integrate it with MTEB, thus bridging the text embedding evaluation gap for Scandinavian languages.

[NLP-13] Multiple Choice Questions and Large Languages Models: A Case Study with Fictional Medical Data

Link: https://arxiv.org/abs/2406.02394
Authors: Maxime Griot, Jean Vanderdonckt, Demet Yuksel, Coralie Hemptinne
Keywords: ChatGPT demonstrate significant, demonstrate significant potential, Large Language Models, Large Language, ChatGPT demonstrate
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Large Language Models (LLMs) like ChatGPT demonstrate significant potential in the medical field, often evaluated using multiple-choice questions (MCQs) similar to those found on the USMLE. Despite their prevalence in medical education, MCQs have limitations that might be exacerbated when assessing LLMs. To evaluate the effectiveness of MCQs in assessing the performance of LLMs, we developed a fictional medical benchmark focused on a non-existent gland, the Glianorex. This approach allowed us to isolate the knowledge of the LLM from its test-taking abilities. We used GPT-4 to generate a comprehensive textbook on the Glianorex in both English and French and developed corresponding multiple-choice questions in both languages. We evaluated various open-source, proprietary, and domain-specific LLMs using these questions in a zero-shot setting. The models achieved average scores around 67%, with minor performance differences between larger and smaller models. Performance was slightly higher in English than in French. Fine-tuned medical models showed some improvement over their base versions in English but not in French. The uniformly high performance across models suggests that traditional MCQ-based benchmarks may not accurately measure LLMs’ clinical knowledge and reasoning abilities, instead highlighting their pattern recognition skills. This study underscores the need for more robust evaluation methods to better assess the true capabilities of LLMs in medical contexts.

[NLP-14] On the Intrinsic Self-Correction Capability of LLMs: Uncertainty and Latent Concept

Link: https://arxiv.org/abs/2406.02378
Authors: Guangliang Liu, Haitao Mao, Bochuan Cao, Zhiyu Xue, Kristen Johnson, Jiliang Tang, Rongrong Wang
Keywords: Large Language Models, Large Language, Language Models, self-correction, self-correction capability
Categories: Computation and Language (cs.CL)
Comments: 22 pages, 7 figures

Abstract:Large Language Models (LLMs) can improve their responses when instructed to do so, a capability known as self-correction. When these instructions lack specific details about the issues in the response, this is referred to as leveraging the intrinsic self-correction capability. The empirical success of self-correction can be found in various applications, e.g., text detoxification and social bias mitigation. However, leveraging this self-correction capability may not always be effective, as it has the potential to revise an initially correct response into an incorrect one. In this paper, we endeavor to understand how and why leveraging the self-correction capability is effective. We identify that appropriate instructions can guide LLMs to a convergence state, wherein additional self-correction steps do not yield further performance improvements. We empirically demonstrate that model uncertainty and activated latent concepts jointly characterize the effectiveness of self-correction. Furthermore, we provide a mathematical formulation indicating that the activated latent concept drives the convergence of the model uncertainty and self-correction performance. Our analysis can also be generalized to the self-correction behaviors observed in Vision-Language Models (VLMs). Moreover, we highlight that task-agnostic debiasing can benefit from our principle in terms of selecting effective fine-tuning samples. Such initial success demonstrates the potential extensibility for better instruction tuning and safety alignment.

[NLP-15] XRec: Large Language Models for Explainable Recommendation

Link: https://arxiv.org/abs/2406.02377
Authors: Qiyao Ma, Xubin Ren, Chao Huang
Keywords: navigate information overload, providing personalized recommendations, personalized recommendations aligned, users navigate information, Recommender systems
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:Recommender systems help users navigate information overload by providing personalized recommendations aligned with their preferences. Collaborative Filtering (CF) is a widely adopted approach, but while advanced techniques like graph neural networks (GNNs) and self-supervised learning (SSL) have enhanced CF models for better user representations, they often lack the ability to provide explanations for the recommended items. Explainable recommendations aim to address this gap by offering transparency and insights into the recommendation decision-making process, enhancing users’ understanding. This work leverages the language capabilities of Large Language Models (LLMs) to push the boundaries of explainable recommender systems. We introduce a model-agnostic framework called XRec, which enables LLMs to provide comprehensive explanations for user behaviors in recommender systems. By integrating collaborative signals and designing a lightweight collaborative adaptor, the framework empowers LLMs to understand complex patterns in user-item interactions and gain a deeper understanding of user preferences. Our extensive experiments demonstrate the effectiveness of XRec, showcasing its ability to generate comprehensive and meaningful explanations that outperform baseline approaches in explainable recommender systems. We open-source our model implementation at this https URL.

[NLP-16] Retaining Key Information under High Compression Ratios: Query-Guided Compressor for LLMs

Link: https://arxiv.org/abs/2406.02376
Authors: Zhiwei Cao, Qian Cao, Yu Lu, Ningxin Peng, Luyang Huang, Shanbo Cheng, Jinsong Su
Keywords: Large Language Models, Large Language, Language Models, Language, popularity of Large
Categories: Computation and Language (cs.CL)
Comments: Accepted to ACL 2024

Abstract:The growing popularity of Large Language Models has sparked interest in context compression for Large Language Models (LLMs). However, the performance of previous methods degrades dramatically as compression ratios increase, sometimes even falling to the closed-book level. This decline can be attributed to the loss of key information during the compression process. Our preliminary study supports this hypothesis, emphasizing the significance of retaining key information to maintain model performance under high compression ratios. As a result, we introduce Query-Guided Compressor (QGC), which leverages queries to guide the context compression process, effectively preserving key information within the compressed context. Additionally, we employ a dynamic compression strategy. We validate the effectiveness of our proposed QGC on the Question Answering task, including NaturalQuestions, TriviaQA, and HotpotQA datasets. Experimental results show that QGC can consistently perform well even at high compression ratios, which also offers significant benefits in terms of inference cost and throughput.

[NLP-17] Large Language Models Make Sample-Efficient Recommender Systems

Link: https://arxiv.org/abs/2406.02368
Authors: Jianghao Lin, Xinyi Dai, Rong Shan, Bo Chen, Ruiming Tang, Yong Yu, Weinan Zhang
Keywords: achieved remarkable progress, natural language processing, resembles human language, demonstrating remarkable abilities, recommender systems
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: Accepted by Frontier of Computer Science

Abstract:Large language models (LLMs) have achieved remarkable progress in the field of natural language processing (NLP), demonstrating remarkable abilities in producing text that resembles human language for various tasks. This opens up new opportunities for employing them in recommender systems (RSs). In this paper, we specifically examine the sample efficiency of LLM-enhanced recommender systems, which pertains to the model’s capacity to attain superior performance with a limited quantity of training data. Conventional recommendation models (CRMs) often need a large amount of training data because of the sparsity of features and interactions. Hence, we propose and verify our core viewpoint: Large Language Models Make Sample-Efficient Recommender Systems. We propose a simple yet effective framework (i.e., Laser) to validate the viewpoint from two aspects: (1) LLMs themselves are sample-efficient recommenders; and (2) LLMs, as feature generators and encoders, make CRMs more sample-efficient. Extensive experiments on two public datasets show that Laser requires only a small fraction of training samples to match or even surpass CRMs that are trained on the entire training set, demonstrating superior sample efficiency.

[NLP-18] Language Models Do Hard Arithmetic Tasks Easily and Hardly Do Easy Arithmetic Tasks

Link: https://arxiv.org/abs/2406.02356
Authors: Andrew Gambardella, Yusuke Iwasawa, Yutaka Matsuo
Keywords: large language models, perform arithmetic tasks, language models, practical debate, large language
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Abstract:The ability (and inability) of large language models (LLMs) to perform arithmetic tasks has been the subject of much theoretical and practical debate. We show that LLMs are frequently able to correctly and confidently predict the first digit of n-digit by m-digit multiplication tasks without using chain of thought reasoning, even though these tasks require compounding operations to solve. Simultaneously, LLMs in practice often fail to correctly or confidently predict the last digit of an n-digit by m-digit multiplication, a task equivalent to 1-digit by 1-digit multiplication which can be easily learned or memorized. We show that the latter task can be solved more robustly when the LLM is conditioned on all of the correct higher-order digits, which on average increases the confidence of the correct last digit on 5-digit by 5-digit multiplication tasks using Llama 2-13B by over 230% (0.13 to 0.43) and Mistral-7B by 150% (0.22 to 0.55).
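
The abstract above relies on two pieces of arithmetic: the last digit of an n-digit by m-digit product depends only on the last digits of the two factors, and the quoted confidence gains are relative increases. The following is a minimal sketch that checks both; the function names are hypothetical and are not taken from the paper.

```python
import random


def last_digit_of_product(a: int, b: int) -> int:
    """Return the last digit of a * b using only the last digits of a and b."""
    return ((a % 10) * (b % 10)) % 10


def relative_increase_percent(before: float, after: float) -> float:
    """Relative increase from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Check the last-digit identity on random 5-digit by 5-digit products.
random.seed(0)
for _ in range(1000):
    a = random.randint(10_000, 99_999)
    b = random.randint(10_000, 99_999)
    assert last_digit_of_product(a, b) == (a * b) % 10

# Confidence values quoted in the abstract.
llama_gain = relative_increase_percent(0.13, 0.43)    # Llama 2-13B
mistral_gain = relative_increase_percent(0.22, 0.55)  # Mistral-7B
print(f"Llama 2-13B: {llama_gain:.1f}%  Mistral-7B: {mistral_gain:.1f}%")
```

The sketch only verifies the arithmetic behind the claims; it says nothing about how the models were prompted or how confidence was measured.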

[NLP-19] LlamaCare: A Large Medical Language Model for Enhancing Healthcare Knowledge Sharing

Link: https://arxiv.org/abs/2406.02350
Authors: Maojun Sun
Keywords: shown amazing capabilities, Extended Classification Integration, memorization and present, Classification Integration, shown amazing
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Large language models (LLMs) have shown amazing capabilities in memorizing and presenting knowledge. However, when it comes to domain-specific knowledge and downstream tasks such as medical ones, general LLMs are often unable to give precise answers. In addition, when people want LLMs to answer classification questions, they usually go through instruction tuning first; however, LLMs do not always give a direct index of the categorization after instruction tuning. In this paper, we propose LlamaCare, a fine-tuned medical language model, and Extended Classification Integration (ECI), a module to handle classification problems of LLMs. Our contributions are: (i) We fine-tuned a large language model on medical knowledge with very low carbon emissions and achieved performance similar to ChatGPT using a 24G GPU. (ii) We solved the problem of redundant categorical answers and improved the performance of LLMs by proposing a new module called Extended Classification Integration. (iii) We released our processed data for one-shot and few-shot training for some benchmarks such as PubMedQA and USMLE Steps 1-3. Our method achieves results close to the state-of-the-art model on benchmarks while using fewer GPU resources than LLMs with the same number of parameters. Our models, code, and datasets can be found at this https URL.

[NLP-20] Linguistic Fingerprint in Transformer Models: How Language Variation Influences Parameter Selection in Irony Detection

Link: https://arxiv.org/abs/2406.02338
Authors: Michele Mastromattei, Fabio Massimo Zanzotto
Keywords: transformer model architectures, sentiment analysis, paper explores, explores the correlation, analysis and transformer
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:This paper explores the correlation between linguistic diversity, sentiment analysis and transformer model architectures. We aim to investigate how different English variations impact transformer-based models for irony detection. To conduct our study, we used the EPIC corpus to extract five diverse English variation-specific datasets and applied the KEN pruning algorithm on five different architectures. Our results reveal several similarities between optimal subnetworks, which provide insights into the linguistic variations that share strong resemblances and those that exhibit greater dissimilarities. We discovered that optimal subnetworks across models share at least 60% of their parameters, emphasizing the significance of parameter values in capturing and interpreting linguistic variations. This study highlights the inherent structural similarities between models trained on different variants of the same language and also the critical role of parameter values in capturing these nuances.

[NLP-21] Probing the Category of Verbal Aspect in Transformer Language Models

Link: https://arxiv.org/abs/2406.02335
Authors: Anisia Katinskaia, Roman Yangarber
Keywords: pretrained language models, investigate how pretrained, grammatical category, category of verbal, aspect
Categories: Computation and Language (cs.CL)

Abstract:We investigate how pretrained language models (PLM) encode the grammatical category of verbal aspect in Russian. Encoding of aspect in transformer LMs has not been studied previously in any language. A particular challenge is posed by “alternative contexts”: where either the perfective or the imperfective aspect is suitable grammatically and semantically. We perform probing using BERT and RoBERTa on alternative and non-alternative contexts. First, we assess the models’ performance on aspect prediction, via behavioral probing. Next, we examine the models’ performance when their contextual representations are substituted with counterfactual representations, via causal probing. These counterfactuals alter the value of the “boundedness” feature–a semantic feature, which characterizes the action in the context. Experiments show that BERT and RoBERTa do encode aspect–mostly in their final layers. The counterfactual interventions affect perfective and imperfective in opposite ways, which is consistent with grammar: perfective is positively affected by adding the meaning of boundedness, and vice versa. The practical implications of our probing results are that fine-tuning only the last layers of BERT on predicting aspect is faster and more effective than fine-tuning the whole model. The model has high predictive uncertainty about aspect in alternative contexts, which tend to lack explicit hints about the boundedness of the described action.

[NLP-22] Extended Mind Transformers

Link: https://arxiv.org/abs/2406.02332
Authors: Phoebe Klett, Thomas Ahle
Keywords: Pre-trained language models, long inputs quickly, Pre-trained language, demonstrate general intelligence, language models demonstrate
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Abstract:Pre-trained language models demonstrate general intelligence and common sense, but long inputs quickly become a bottleneck for memorizing information at inference time. We resurface a simple method, Memorizing Transformers (Wu et al., 2022), that gives the model access to a bank of pre-computed memories. We show that it is possible to fix many of the shortcomings of the original method, such as the need for fine-tuning, by critically assessing how positional encodings should be updated for the keys and values retrieved. This intuitive method uses the model’s own key/query system to select and attend to the most relevant memories at each generation step, rather than using external embeddings. We demonstrate the importance of external information being retrieved in a majority of decoder layers, contrary to previous work. We open source a new counterfactual long-range retrieval benchmark, and show that Extended Mind Transformers outperform today’s state of the art by 6% on average.

[NLP-23] Translation Deserves Better: Analyzing Translation Artifacts in Cross-lingual Visual Question Answering

Link: https://arxiv.org/abs/2406.02331
Authors: ChaeHun Park, Koanho Lee, Hyesu Lim, Jaeseok Kim, Junmo Park, Yu-Jung Heo, Du-Seong Chang, Jaegul Choo
Keywords: visual question answering, reliable visual question, Building a reliable, question answering, challenging problem
Categories: Computation and Language (cs.CL)
Comments: ACL 2024 Findings Accepted

Abstract:Building a reliable visual question answering (VQA) system across different languages is a challenging problem, primarily due to the lack of abundant samples for training. To address this challenge, recent studies have employed machine translation systems for the cross-lingual VQA task. This involves translating the evaluation samples into a source language (usually English) and using monolingual models (i.e., translate-test). However, our analysis reveals that translated texts contain unique characteristics distinct from human-written ones, referred to as translation artifacts. We find that these artifacts can significantly affect the models, confirmed by extensive experiments across diverse models, languages, and translation processes. In light of this, we present a simple data augmentation strategy that can alleviate the adverse impacts of translation artifacts.

[NLP-24] On Affine Homotopy between Language Encoders

Link: https://arxiv.org/abs/2406.02329
Authors: Robin SM Chan, Reda Boumasmoud, Anej Svete, Yuxin Ren, Qipeng Guo, Zhijing Jin, Shauli Ravfogel, Mrinmaya Sachan, Bernhard Schölkopf, Mennatallah El-Assady, Ryan Cotterell
Keywords: NLP tasks, functions that represent, text as vectors, represent text, integral component
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 10 pages

Abstract:Pre-trained language encoders – functions that represent text as vectors – are an integral component of many NLP tasks. We tackle a natural question in language encoder analysis: What does it mean for two encoders to be similar? We contend that a faithful measure of similarity needs to be intrinsic, that is, task-independent, yet still be informative of extrinsic similarity – the performance on downstream tasks. It is common to consider two encoders similar if they are homotopic, i.e., if they can be aligned through some transformation. In this spirit, we study the properties of affine alignment of language encoders and its implications on extrinsic similarity. We find that while affine alignment is fundamentally an asymmetric notion of similarity, it is still informative of extrinsic similarity. We confirm this on datasets of natural language representations. Beyond providing useful bounds on extrinsic similarity, affine intrinsic similarity also allows us to begin uncovering the structure of the space of pre-trained encoders by defining an order over them.

[NLP-25] Technical Language Processing for Telecommunications Specifications

Link: https://arxiv.org/abs/2406.02325
Authors: Felipe A. Rodriguez Y.
Keywords: Large Language Models, Large Language, Language Models, Generative Pre-Trained Transformer, real-world technical documentation
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Still not published

Abstract:Large Language Models (LLMs) are continuously being applied in a more diverse set of contexts. At their current state, however, even state-of-the-art LLMs such as Generative Pre-Trained Transformer 4 (GPT-4) have challenges when extracting information from real-world technical documentation without heavy preprocessing. One such area with real-world technical documentation is telecommunications engineering, which could greatly benefit from domain-specific LLMs. The unique format and overall structure of telecommunications internal specifications differ greatly from standard English and thus it is evident that the application of out-of-the-box Natural Language Processing (NLP) tools is not a viable option. In this article, we outline the limitations of out-of-the-box NLP tools for processing technical information generated by telecommunications experts, and expand the concept of Technical Language Processing (TLP) to the telecommunication domain. Additionally, we explore the effect of domain-specific LLMs in the work of Specification Engineers, emphasizing the potential benefits of adopting domain-specific LLMs to speed up the training of experts in different telecommunications fields.

[NLP-26] mCoT: Multilingual Instruction Tuning for Reasoning Consistency in Language Models

Link: https://arxiv.org/abs/2406.02301
Authors: Huiyuan Lai, Malvina Nissim
Keywords: Large language models, Large language, downstream tasks, recently emerged, powerful technique
Categories: Computation and Language (cs.CL)
Comments: Accepted to ACL 2024 main

Abstract:Large language models (LLMs) with Chain-of-thought (CoT) have recently emerged as a powerful technique for eliciting reasoning to improve various downstream tasks. As most research mainly focuses on English, with few explorations in a multilingual context, the question of how reliable this reasoning capability is in different languages is still open. To address it directly, we study multilingual reasoning consistency across multiple languages, using popular open-source LLMs. First, we compile the first large-scale multilingual math reasoning dataset, mCoT-MATH, covering eleven diverse languages. Then, we introduce multilingual CoT instruction tuning to boost reasoning capability across languages, thereby improving model consistency. While existing LLMs show substantial variation across the languages we consider, and especially low performance for lesser resourced languages, our 7B parameter model mCoT achieves impressive consistency across languages, and superior or comparable performance to close- and open-source models even of much larger sizes.

[NLP-27] Prompting Large Language Models with Human Error Markings for Self-Correcting Machine Translation

Link: https://arxiv.org/abs/2406.02267
Authors: Nathaniel Berger, Stefan Riezler, Miriam Exel, Matthias Huck
Keywords: general domain texts, large language models, unpaired language data, term translation quality, enhance term translation
Categories: Computation and Language (cs.CL)
Comments: To appear at The 25th Annual Conference of the European Association for Machine Translation (EAMT 2024)

Abstract:While large language models (LLMs) pre-trained on massive amounts of unpaired language data have reached the state-of-the-art in machine translation (MT) of general domain texts, post-editing (PE) is still required to correct errors and to enhance term translation quality in specialized domains. In this paper we present a pilot study of enhancing translation memories (TM) produced by PE (source segments, machine translations, and reference translations, henceforth called PE-TM) for the needs of correct and consistent term translation in technical domains. We investigate a light-weight two-step scenario where, at inference time, a human translator marks errors in the first translation step, and in a second step a few similar examples are extracted from the PE-TM to prompt an LLM. Our experiment shows that the additional effort of augmenting translations with human error markings guides the LLM to focus on a correction of the marked errors, yielding consistent improvements over automatic PE (APE) and MT from scratch.

[NLP-28] Enhancing Retrieval-Augmented LMs with a Two-stage Consistency Learning Compressor

Link: https://arxiv.org/abs/2406.02266
Authors: Chuankai Xu, Dongming Zhao, Bo Wang, Hanwen Xing
Keywords: retrieval-augmented language models, tasks remains challenging, document-based tasks remains, remains challenging, language model responses
Categories: Computation and Language (cs.CL)

Abstract:Despite the prevalence of retrieval-augmented language models (RALMs), the seamless integration of these models with retrieval mechanisms to enhance performance in document-based tasks remains challenging. While some post-retrieval processing Retrieval-Augmented Generation (RAG) methods have achieved success, most still lack the ability to distinguish pertinent from extraneous information, leading to potential inconsistencies and reduced precision in the generated output, which subsequently affects the truthfulness of the language model’s responses. To address these limitations, this work proposes a novel two-stage consistency learning approach for retrieved information compression in retrieval-augmented language models to enhance performance. By incorporating consistency learning, the aim is to generate summaries that maintain coherence and alignment with the intended semantic representations of a teacher model while improving faithfulness to the original retrieved documents. The proposed method is empirically validated across multiple datasets, demonstrating notable enhancements in precision and efficiency for question-answering tasks. It outperforms existing baselines and showcases the synergistic effects of combining contrastive and consistency learning paradigms within the retrieval-augmented generation framework.

[NLP-29] Understanding Retrieval Robustness for Retrieval-Augmented Image Captioning

Link: https://arxiv.org/abs/2406.02265
Authors: Wenyan Li, Jiaang Li, Rita Ramos, Raphael Tang, Desmond Elliott
Keywords: strong domain-transfer capabilities, Recent advancements, image captioning highlight, retrieving related captions, domain-transfer capabilities
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: 9 pages, long paper at ACL 2024

Abstract:Recent advancements in retrieval-augmented models for image captioning highlight the significance of retrieving related captions for efficient, lightweight models with strong domain-transfer capabilities. While these models demonstrate the success of retrieval augmentation, retrieval models are still far from perfect in practice. Retrieved information can sometimes mislead the model generation, negatively impacting performance. In this paper, we analyze the robustness of the SmallCap retrieval-augmented captioning model. Our analysis shows that SmallCap is sensitive to tokens that appear in the majority of the retrieved captions, and integrated gradients attribution shows that those tokens are likely copied into the final caption. Given these findings, we propose to train the model by sampling retrieved captions from more diverse sets. This reduces the probability that the model learns to copy majority tokens and improves both in-domain and cross-domain performance effectively.

[NLP-30] Modeling Emotional Trajectories in Written Stories Utilizing Transformers and Weakly-Supervised Learning
[NLP-30] 利用Transformer和弱监督学习对书面故事中的情感轨迹建模

链接: https://arxiv.org/abs/2406.02251
作者: Lukas Christ,Shahin Amiriparian,Manuel Milling,Ilhan Aslan,Björn W. Schuller
关键词: Telling stories, integral part, part of human, human communication, influence the affective
中文关键词: 讲故事、不可或缺的部分、人类的一部分、人类交流、影响情感
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to ACL 2024 Findings. arXiv admin note: text overlap with arXiv:2212.11382

Abstract:Telling stories is an integral part of human communication which can evoke emotions and influence the affective states of the audience. Automatically modeling emotional trajectories in stories has thus attracted considerable scholarly interest. However, as most existing works have been limited to unsupervised dictionary-based approaches, there is no benchmark for this task. We address this gap by introducing continuous valence and arousal labels for an existing dataset of children’s stories originally annotated with discrete emotion categories. We collect additional annotations for this data and map the categorical labels to the continuous valence and arousal space. For predicting the thus obtained emotionality signals, we fine-tune a DeBERTa model and improve upon this baseline via a weakly supervised learning approach. The best configuration achieves a Concordance Correlation Coefficient (CCC) of .8221 for valence and .7125 for arousal on the test set, demonstrating the efficacy of our proposed approach. A detailed analysis shows the extent to which the results vary depending on factors such as the author, the individual story, or the section within the story. In addition, we uncover the weaknesses of our approach by investigating examples that prove to be difficult to predict.
摘要:讲故事是人类交流中不可或缺的一部分,它能唤起情感,影响受众的情感状态。因此,对故事中的情感轨迹进行自动建模已经引起了相当大的学术兴趣。然而,由于大多数现有的工作都局限于基于词典的无监督方法,因此这一任务尚无基准。我们为一个现有的儿童故事数据集引入连续的效价(valence)和唤醒度(arousal)标签来填补这一空白,该数据集最初是用离散的情感类别标注的。我们为这些数据收集了额外的标注,并将类别标签映射到连续的效价和唤醒度空间。为了预测由此获得的情感信号,我们微调了DeBERTa模型,并通过弱监督学习方法改进了这一基线。最优配置在测试集上的一致性相关系数(CCC)在效价上达到0.8221,在唤醒度上达到0.7125,证明了我们所提出的方法的有效性。详细的分析显示了结果在多大程度上随作者、具体故事或故事中的章节等因素而变化。此外,我们通过考察难以预测的例子来揭示我们方法的弱点。
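
The Concordance Correlation Coefficient (CCC) reported in this abstract rewards predictions that agree with the gold labels in both correlation and scale. Below is a minimal pure-Python sketch of Lin's standard definition. The function name and the choice of population (biased) variance are assumptions for illustration; the abstract does not say which implementation the authors used.

```python
def concordance_correlation_coefficient(y_true, y_pred):
    """Lin's CCC: 2*cov / (var_true + var_pred + (mean_true - mean_pred)^2)."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    mean_true = sum(y_true) / n
    mean_pred = sum(y_pred) / n
    # Population (biased) variance and covariance, an assumption of this sketch.
    var_true = sum((t - mean_true) ** 2 for t in y_true) / n
    var_pred = sum((p - mean_pred) ** 2 for p in y_pred) / n
    covariance = sum(
        (t - mean_true) * (p - mean_pred) for t, p in zip(y_true, y_pred)
    ) / n
    denominator = var_true + var_pred + (mean_true - mean_pred) ** 2
    if denominator == 0:
        raise ValueError("CCC is undefined for two identical constant sequences")
    return 2 * covariance / denominator
```

Unlike Pearson correlation, a constant offset between predictions and labels lowers the score: shifting `[1, 2, 3]` by one gives a CCC of 4/7 even though the two sequences are perfectly correlated.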

[NLP-31] Description Boosting for Zero-Shot Entity and Relation Classification
[NLP-31] 零样本实体和关系分类的描述提升

链接: https://arxiv.org/abs/2406.02245
作者: Gabriele Picco,Leopold Fuchs,Marcos Martínez Galindo,Alberto Purpura,Vanessa López,Hoang Thanh Lam
关键词: annotate input text, input text data, leverage available external, external information, information of unseen
中文关键词: 注释输入文本、输入文本数据、利用可用的外部、外部信息、不可见的信息
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:

Abstract:Zero-shot entity and relation classification models leverage available external information of unseen classes – e.g., textual descriptions – to annotate input text data. Thanks to the minimum data requirement, Zero-Shot Learning (ZSL) methods have high value in practice, especially in applications where labeled data is scarce. Even though recent research in ZSL has demonstrated significant results, our analysis reveals that those methods are sensitive to provided textual descriptions of entities (or relations). Even a minor modification of descriptions can lead to a change in the decision boundary between entity (or relation) classes. In this paper, we formally define the problem of identifying effective descriptions for zero-shot inference. We propose a strategy for generating variations of an initial description, a heuristic for ranking them and an ensemble method capable of boosting the predictions of zero-shot models through description enhancement. Empirical results on four different entity and relation classification datasets show that our proposed method outperforms existing approaches and achieves new SOTA results on these datasets under the ZSL settings. The source code of the proposed solutions and the evaluation framework are open-sourced.
摘要:零样本实体和关系分类模型利用未见类别的可用外部信息(例如文本描述)来标注输入文本数据。由于对数据的要求极低,零样本学习(ZSL)方法具有很高的实用价值,特别是在标注数据稀缺的应用中。尽管最近的ZSL研究取得了显著的成果,但我们的分析表明,这些方法对所提供的实体(或关系)文本描述很敏感。即使对描述进行很小的修改,也可能导致实体(或关系)类别之间的决策边界发生变化。在本文中,我们形式化地定义了为零样本推理识别有效描述的问题。我们提出了一种生成初始描述变体的策略、一种对这些变体进行排序的启发式方法,以及一种能够通过描述增强来提升零样本模型预测的集成方法。在四个不同的实体和关系分类数据集上的实验结果表明,我们的方法优于现有的方法,并且在ZSL设置下在这些数据集上取得了新的SOTA结果。所提出方案的源代码和评估框架均已开源。

[NLP-32] Self-Modifying State Modeling for Simultaneous Machine Translation
[NLP-32] 机器同步翻译的自修改状态建模

链接: https://arxiv.org/abs/2406.02237
作者: Donglei Yu,Xiaomian Kang,Yuchen Liu,Yu Zhou,Chengqing Zong
关键词: generates target outputs, Simultaneous Machine Translation, receiving stream source, Simultaneous Machine, generates target
中文关键词: 生成目标输出,同时机器翻译,接收流源,同时机器,生成目标
类目: Computation and Language (cs.CL)
备注: Accepted to ACL 2024 main conference. 15 pages, 13 figures, 9 tables

Abstract:Simultaneous Machine Translation (SiMT) generates target outputs while receiving stream source inputs and requires a read/write policy to decide whether to wait for the next source token or generate a new target token, whose decisions form a decision path. Existing SiMT methods, which learn the policy by exploring various decision paths in training, face inherent limitations. These methods not only fail to precisely optimize the policy due to the inability to accurately assess the individual impact of each decision on SiMT performance, but also cannot sufficiently explore all potential paths because of their vast number. Besides, building decision paths requires unidirectional encoders to simulate streaming source inputs, which impairs the translation quality of SiMT models. To solve these issues, we propose Self-Modifying State Modeling (SM^2), a novel training paradigm for SiMT task. Without building decision paths, SM^2 individually optimizes decisions at each state during training. To precisely optimize the policy, SM^2 introduces Self-Modifying process to independently assess and adjust decisions at each state. For sufficient exploration, SM^2 proposes Prefix Sampling to efficiently traverse all potential states. Moreover, SM^2 ensures compatibility with bidirectional encoders, thus achieving higher translation quality. Experiments show that SM^2 outperforms strong baselines. Furthermore, SM^2 allows offline machine translation models to acquire SiMT ability with fine-tuning.
摘要:同步机器翻译(SiMT)在接收流式源输入的同时生成目标输出,需要一个读/写策略来决定是等待下一个源词元还是生成一个新的目标词元,这些决策构成一条决策路径。现有的SiMT方法通过在训练中探索各种决策路径来学习策略,面临着固有的局限性。由于无法准确评估每个决策对SiMT性能的单独影响,这些方法不仅无法精确地优化策略,而且由于路径数量庞大,也无法充分探索所有潜在的路径。此外,构建决策路径需要单向编码器来模拟流式源输入,这损害了SiMT模型的翻译质量。为了解决这些问题,我们提出了自修改状态建模(Self-Modifying State Modeling,SM^2),一种面向SiMT任务的新型训练范式。SM^2无需构建决策路径,而是在训练期间单独优化每个状态下的决策。为了精确地优化策略,SM^2引入了自修改过程,以独立评估和调整每个状态下的决策。为了充分探索,SM^2提出了前缀采样,以高效地遍历所有潜在状态。此外,SM^2确保了与双向编码器的兼容性,从而实现了更高的翻译质量。实验表明,SM^2的性能优于强基线。此外,SM^2允许离线机器翻译模型通过微调获得SiMT能力。

[NLP-33] FedMKT: Federated Mutual Knowledge Transfer for Large and Small Language Models
[NLP-33] FedMKT:大型和小型语言模型的联邦相互知识转移

链接: https://arxiv.org/abs/2406.02224
作者: Tao Fan,Guoqiang Ma,Yan Kang,Hanlin Gu,Lixin Fan,Qiang Yang
关键词: Recent research, small language models, language models, locally deployed homogeneous, large language models
中文关键词: 最近的研究、小型语言模型、语言模型、本地部署的同质、大型语言模型
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

Abstract:Recent research in federated large language models (LLMs) has primarily focused on enabling clients to fine-tune their locally deployed homogeneous LLMs collaboratively or on transferring knowledge from server-based LLMs to small language models (SLMs) at downstream clients. However, a significant gap remains in the simultaneous mutual enhancement of both the server’s LLM and clients’ SLMs. To bridge this gap, we propose FedMKT, a parameter-efficient federated mutual knowledge transfer framework for large and small language models. This framework is designed to adaptively transfer knowledge from the server’s LLM to clients’ SLMs while concurrently enriching the LLM with clients’ unique domain insights. We facilitate token alignment using minimum edit distance (MinED) and then selective mutual knowledge transfer between client-side SLMs and a server-side LLM, aiming to collectively enhance their performance. Through extensive experiments across three distinct scenarios, heterogeneous, homogeneous, and one-to-one, we evaluate the effectiveness of FedMKT using various public LLMs and SLMs on a range of NLP text generation tasks. Empirical results demonstrate significant performance improvements in clients’ SLMs with the aid of the LLM. Furthermore, the LLM optimized by FedMKT achieves a performance comparable to that achieved through direct fine-tuning based on clients’ data, highlighting the effectiveness and adaptability of FedMKT.
摘要:目前关于联邦大型语言模型(LLM)的研究主要集中在使客户端能够协作地微调其本地部署的同构LLM,或将知识从基于服务器的LLM迁移到下游客户端的小型语言模型(SLM)。然而,在服务器的LLM和客户端的SLM同时相互增强方面仍然存在显著的空白。为了弥补这一空白,我们提出了FedMKT,一种面向大小语言模型的参数高效的联邦相互知识迁移框架。该框架旨在自适应地将知识从服务器的LLM迁移到客户端的SLM,同时用客户端独特的领域洞察来丰富LLM。我们使用最小编辑距离(MinED)来实现词元对齐,然后在客户端SLM和服务器端LLM之间进行选择性的相互知识迁移,旨在共同提高它们的性能。通过在异构、同构和一对一三种不同场景下进行的广泛实验,我们使用各种公开的LLM和SLM,在一系列NLP文本生成任务上评估了FedMKT的有效性。实证结果表明,在LLM的帮助下,客户端SLM的性能有了显著的提高。此外,经FedMKT优化的LLM取得了与基于客户端数据直接微调相当的性能,突显了FedMKT的有效性和适应性。
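
The abstract says only that FedMKT aligns tokens with minimum edit distance (MinED); it does not describe the procedure. As a rough illustration of the idea, the sketch below computes the classic Levenshtein distance and maps a token from one tokenizer's vocabulary to its nearest neighbour in another. The helper `align_token` and the toy vocabulary in the usage note are hypothetical and are not taken from the paper.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using a rolling one-row dynamic programme."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,                      # delete char_a
                    current[j - 1] + 1,                   # insert char_b
                    previous[j - 1] + substitution_cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def align_token(token: str, target_vocab: list[str]) -> str:
    """Hypothetical helper: pick the target-vocabulary token closest to `token`."""
    return min(target_vocab, key=lambda candidate: edit_distance(token, candidate))
```

For example, `align_token("_hello", ["hello", "help", "world"])` returns `"hello"`, since a single deletion separates the two strings. Ties are resolved in favour of the earliest entry in `target_vocab`.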

[NLP-34] Why Only Text: Empowering Vision-and-Language Navigation with Multi-modal Prompts
[NLP-34] 为什么只有文本:通过多模态提示增强视觉和语言导航能力

链接: https://arxiv.org/abs/2406.02208
作者: Haodong Hong,Sen Wang,Zi Huang,Qi Wu,Jiajun Liu
关键词: employ textual instructions, Current, Prompts, employ textual, textual instructions
中文关键词: 使用文本指令,当前,提示,使用文本,文本指令
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: IJCAI 2024

Abstract:Current Vision-and-Language Navigation (VLN) tasks mainly employ textual instructions to guide agents. However, being inherently abstract, the same textual instruction can be associated with different visual signals, causing severe ambiguity and limiting the transfer of prior knowledge in the vision domain from the user to the agent. To fill this gap, we propose Vision-and-Language Navigation with Multi-modal Prompts (VLN-MP), a novel task augmenting traditional VLN by integrating both natural language and images in instructions. VLN-MP not only maintains backward compatibility by effectively handling text-only prompts but also consistently shows advantages with different quantities and relevance of visual prompts. Possible forms of visual prompts include both exact and similar object images, providing adaptability and versatility in diverse navigation scenarios. To evaluate VLN-MP under a unified framework, we implement a new benchmark that offers: (1) a training-free pipeline to transform textual instructions into multi-modal forms with landmark images; (2) diverse datasets with multi-modal instructions for different downstream tasks; (3) a novel module designed to process various image prompts for seamless integration with state-of-the-art VLN models. Extensive experiments on four VLN benchmarks (R2R, RxR, REVERIE, CVDN) show that incorporating visual prompts significantly boosts navigation performance. While maintaining efficiency with text-only prompts, VLN-MP enables agents to navigate in the pre-explore setting and outperform text-based models, showing its broader applicability.
摘要:当前的视觉与语言导航(VLN)任务主要使用文本指令来引导智能体。然而,文本指令本质上是抽象的,同一条文本指令可以与不同的视觉信号相关联,这会导致严重的歧义,并限制视觉领域的先验知识从用户向智能体的传递。为了填补这一空白,我们提出了多模态提示视觉与语言导航(VLN-MP),这是一种通过在指令中同时集成自然语言和图像来增强传统VLN的新任务。VLN-MP不仅通过有效地处理纯文本提示来保持向后兼容性,而且在视觉提示的数量和相关性不同的情况下始终显示出优势。视觉提示的可能形式包括完全一致的和相似的物体图像,从而在多样的导航场景中提供适应性和通用性。为了在统一的框架下评估VLN-MP,我们实现了一个新的基准,该基准提供:(1)一个无需训练的流程,用于将文本指令转换为带有地标图像的多模态形式;(2)面向不同下游任务、带有多模态指令的多样化数据集;(3)一个新模块,用于处理各种图像提示,以便与最先进的VLN模型无缝集成。在四个VLN基准(R2R、RxR、REVERIE、CVDN)上的广泛实验表明,加入视觉提示显著提高了导航性能。在保持纯文本提示下的效率的同时,VLN-MP使智能体能够在预探索设置中导航,并优于基于文本的模型,显示了其更广泛的适用性。

[NLP-35] A multilingual dataset for offensive language and hate speech detection for Hausa, Yoruba and Igbo languages
[NLP-35] 用于豪萨语、约鲁巴语和伊博语攻击性语言和仇恨言论检测的多语言数据集

链接: https://arxiv.org/abs/2406.02169
作者: Saminu Mohammad Aliyu,Gregory Maksha Wajiga,Muhammad Murtala
关键词: effective detection mechanisms, multilingual contexts, offensive language detection, online offensive language, offensive language necessitates
中文关键词: 有效的检测机制、多语言上下文、攻击性语言检测、在线攻击性语言、攻击性语言
类目: Computation and Language (cs.CL)
备注: 9 pages

Abstract:The proliferation of online offensive language necessitates the development of effective detection mechanisms, especially in multilingual contexts. This study addresses the challenge by developing and introducing novel datasets for offensive language detection in three major Nigerian languages: Hausa, Yoruba, and Igbo. We collected data from Twitter and manually annotated it to create datasets for each of the three languages, using native speakers. We used pre-trained language models to evaluate their efficacy in detecting offensive language in our datasets. The best-performing model achieved an accuracy of 90%. To further support research in offensive language detection, we plan to make the dataset and our models publicly available.
摘要:在线攻击性语言的激增需要开发有效的检测机制,尤其是在多语言环境中。这项研究通过开发和引入新型数据集来解决这一挑战,用于尼日利亚三种主要语言(豪萨语、约鲁巴语和伊博语)的攻击性语言检测。我们从Twitter收集数据,并手动注释,以使用母语者为三种语言中的每一种创建数据集。我们使用预先训练的语言模型来评估它们在检测数据集中冒犯性语言方面的功效。性能最好的模型实现了90%的准确率。为了进一步支持攻击性语言检测的研究,我们计划公开该数据集和我们的模型。

[NLP-36] Whistle: Data-Efficient Multilingual and Crosslingual Speech Recognition via Weakly Phonetic Supervision
[NLP-36] Whistle:通过弱语音监督实现数据高效的多语言和跨语言语音识别

链接: https://arxiv.org/abs/2406.02166
作者: Saierdaer Yusuyin,Te Ma,Hao Huang,Wenbo Zhao,Zhijian Ou
关键词: International Phonetic Alphabet, MCL-ASR, supervised pre-training, phonetic, self-supervised pre-training
中文关键词: 国际音标,MCL-ASR,有监督预训练,语音,自监督预训练
类目: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注:

Abstract:There exist three approaches for multilingual and crosslingual automatic speech recognition (MCL-ASR) - supervised pre-training with phonetic or graphemic transcription, and self-supervised pre-training. We find that pre-training with phonetic supervision has been underappreciated so far for MCL-ASR, while conceptually it is more advantageous for information sharing between different languages. This paper explores the approach of pre-training with weakly phonetic supervision towards data-efficient MCL-ASR, which is called Whistle. We relax the requirement of gold-standard human-validated phonetic transcripts, and obtain International Phonetic Alphabet (IPA) based transcription by leveraging the LanguageNet grapheme-to-phoneme (G2P) models. We construct a common experimental setup based on the CommonVoice dataset, called CV-Lang10, with 10 seen languages and 2 unseen languages. A set of experiments are conducted on CV-Lang10 to compare, as fairly as possible, the three approaches under the common setup for MCL-ASR. Experiments demonstrate the advantages of phoneme-based models (Whistle) for MCL-ASR, in terms of speech recognition for seen languages, crosslingual performance for unseen languages with different amounts of few-shot data, overcoming catastrophic forgetting, and training efficiency. It is found that when training data is more limited, phoneme supervision can achieve better results compared to subword supervision and self-supervision, thereby providing higher data-efficiency. To support reproducibility and promote future research along this direction, we will release the code, models and data for the whole pipeline of Whistle at this https URL upon publication.
摘要:多语言和跨语言自动语音识别(MCL-ASR)有三种方法:使用音素或字素转录的有监督预训练,以及自监督预训练。我们发现,到目前为止,基于音素监督的预训练在MCL-ASR中并没有得到足够的重视,而从概念上讲,它更有利于不同语言之间的信息共享。本文探讨了使用弱音素监督进行预训练以实现数据高效的MCL-ASR的方法,称为Whistle。我们放宽了对经人工验证的金标准音素转录的要求,利用LanguageNet字素到音素(G2P)模型获得基于国际音标(IPA)的转录。我们基于CommonVoice数据集构建了一个通用的实验设置,称为CV-Lang10,包含10种已见语言和2种未见语言。我们在CV-Lang10上进行了一组实验,在MCL-ASR的通用设置下尽可能公平地比较这三种方法。实验证明了基于音素的模型(Whistle)在MCL-ASR中的优势,体现在已见语言的语音识别、不同少样本数据量下未见语言的跨语言性能、克服灾难性遗忘以及训练效率等方面。研究发现,在训练数据较为有限的情况下,与子词监督和自监督相比,音素监督可以取得更好的结果,从而提供更高的数据效率。为了支持可复现性并推动沿这一方向的未来研究,我们将在论文发表后于此https URL发布Whistle整个流程的代码、模型和数据。

[NLP-37] Synergetic Event Understanding: A Collaborative Approach to Cross-Document Event Coreference Resolution with Large Language Models
[NLP-37] 协同事件理解:使用大型语言模型进行跨文档事件共指解析的协作方法

链接: https://arxiv.org/abs/2406.02148
作者: Qingkai Min,Qipeng Guo,Xiangkun Hu,Songfang Huang,Zheng Zhang,Yue Zhang
关键词: Cross-document event coreference, involves clustering event, event coreference resolution, clustering event mentions, Cross-document event
中文关键词: 跨文档事件共指,涉及集群事件、事件共指解析、集群事件提及、跨文档事件
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to ACL-24 Main

Abstract:Cross-document event coreference resolution (CDECR) involves clustering event mentions across multiple documents that refer to the same real-world events. Existing approaches utilize fine-tuning of small language models (SLMs) like BERT to address the compatibility among the contexts of event mentions. However, due to the complexity and diversity of contexts, these models are prone to learning simple co-occurrences. Recently, large language models (LLMs) like ChatGPT have demonstrated impressive contextual understanding, yet they encounter challenges in adapting to specific information extraction (IE) tasks. In this paper, we propose a collaborative approach for CDECR, leveraging the capabilities of both a universally capable LLM and a task-specific SLM. The collaborative strategy begins with the LLM accurately and comprehensively summarizing events through prompting. Then, the SLM refines its learning of event representations based on these insights during fine-tuning. Experimental results demonstrate that our approach surpasses the performance of both the large and small language models individually, forming a complementary advantage. Across various datasets, our approach achieves state-of-the-art performance, underscoring its effectiveness in diverse scenarios.
摘要:跨文档事件共指消解(CDECR)旨在对多个文档中指向同一现实世界事件的事件提及进行聚类。现有方法通过微调BERT等小型语言模型(SLM)来处理事件提及上下文之间的相容性。然而,由于上下文的复杂性和多样性,这些模型容易学到简单的共现关系。最近,像ChatGPT这样的大型语言模型(LLM)表现出了令人印象深刻的上下文理解能力,但它们在适应特定的信息抽取(IE)任务方面遇到了挑战。在本文中,我们提出了一种用于CDECR的协作方法,同时利用通用LLM和任务特定SLM的能力。协作策略首先由LLM通过提示准确而全面地总结事件。然后,SLM在微调过程中基于这些洞察改进其对事件表示的学习。实验结果表明,我们的方法超过了大型和小型语言模型各自单独使用时的性能,形成了优势互补。在多个数据集上,我们的方法都达到了最先进的性能,凸显了其在不同场景中的有效性。

[NLP-38] Reinforcement Tuning for Detecting Stances and Debunking Rumors Jointly with Large Language Models
[NLP-38] 利用大型语言模型联合进行立场检测和谣言揭穿的强化调优

链接: https://arxiv.org/abs/2406.02143
作者: Ruichao Yang,Wei Gao,Jing Ma,Hongzhan Lin,Bo Wang
关键词: Learning multi-task models, poses challenges due, verifying rumors poses, rumors poses challenges, jointly detecting stance
中文关键词: 学习多任务模型,提出应有的挑战,验证谣言构成,谣言构成挑战,共同检测立场
类目: Computation and Language (cs.CL)
备注: ACL 2024 (Findings)

Abstract:Learning multi-task models for jointly detecting stance and verifying rumors poses challenges due to the need for training data of stance at post level and rumor veracity at claim level, which are difficult to obtain. To address this issue, we leverage large language models (LLMs) as the foundation annotators for the joint stance detection (SD) and rumor verification (RV) tasks, dubbed as JSDRV. We introduce a novel reinforcement tuning framework to enhance the joint predictive capabilities of LLM-based SD and RV components. Specifically, we devise a policy for selecting LLM-annotated data at the two levels, employing a hybrid reward mechanism to choose high-quality labels for effective LLM fine-tuning on both tasks. Results demonstrate that JSDRV improves the capabilities of LLMs in the joint tasks, not only outperforming state-of-the-art methods but also generalizing to non-LLMs accommodated as task models.
摘要:学习用于联合检测立场和验证谣言的多任务模型面临挑战,因为需要帖子级别的立场训练数据和声明级别的谣言真实性训练数据,而这些数据很难获得。为了解决这个问题,我们利用大型语言模型(LLM)作为联合立场检测(SD)和谣言验证(RV)任务(称为JSDRV)的基础标注器。我们引入了一种新颖的强化调优框架,以增强基于LLM的SD和RV组件的联合预测能力。具体来说,我们设计了一种在这两个级别上选择LLM标注数据的策略,采用混合奖励机制来选择高质量的标签,以便对两项任务进行有效的LLM微调。结果表明,JSDRV提高了LLM在联合任务中的能力,不仅优于最先进的方法,而且还能推广到作为任务模型接入的非LLM模型。

[NLP-39] Robust Interaction-based Relevance Modeling for Online E-Commerce and LLM-based Retrieval
[NLP-39] 在线电子商务和基于LLM的检索的鲁棒基于交互的相关建模

链接: https://arxiv.org/abs/2406.02135
作者: Ben Chen,Huangyu Dai,Xiang Ma,Wen Jiang,Wei Ning
关键词: items selected closely, selected closely align, Semantic relevance calculation, items selected, selected closely
中文关键词: 紧密选择的项目,紧密排列的选择,语义相关性计算,选择的项目,紧密选择
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注: Accepted by ECML-PKDD’24 as Outstanding Paper. 8 pages, 2 figures, 7 tables

Abstract:Semantic relevance calculation is crucial for e-commerce search engines, as it ensures that the items selected closely align with customer intent. Inadequate attention to this aspect can detrimentally affect user experience and engagement. Traditional text-matching techniques are prevalent but often fail to capture the nuances of search intent accurately, so neural networks now have become a preferred solution to processing such complex text matching. Existing methods predominantly employ representation-based architectures, which strike a balance between high traffic capacity and low latency. However, they exhibit significant shortcomings in generalization and robustness when compared to interaction-based architectures. In this work, we introduce a robust interaction-based modeling paradigm to address these shortcomings. It encompasses 1) a dynamic length representation scheme for expedited inference, 2) a professional terms recognition method to identify subjects and core attributes from complex sentence structures, and 3) a contrastive adversarial training protocol to bolster the model’s robustness and matching capabilities. Extensive offline evaluations demonstrate the superior robustness and effectiveness of our approach, and online A/B testing confirms its ability to improve relevance in the same exposure position, resulting in more clicks and conversions. To the best of our knowledge, this method is the first interaction-based approach for large e-commerce search relevance calculation. Notably, we have deployed it for the entire search traffic on this http URL, the largest B2B e-commerce platform in the world.
摘要:语义相关性计算对于电子商务搜索引擎来说至关重要,因为它可以确保所选商品与客户的意图紧密一致。对这一方面重视不足可能会对用户体验和参与度造成不利影响。传统的文本匹配技术很流行,但往往不能准确地捕捉到搜索意图的细微差别,因此神经网络现在已经成为处理这种复杂文本匹配的首选解决方案。现有的方法主要采用基于表示的架构,它在高流量承载能力和低延迟之间取得了平衡。然而,与基于交互的架构相比,它们在泛化性和鲁棒性方面表现出明显的不足。在这项工作中,我们引入了一个鲁棒的基于交互的建模范式来解决这些不足。它包括:1)用于加速推理的动态长度表示方案;2)从复杂句子结构中识别主体和核心属性的专业术语识别方法;3)增强模型鲁棒性和匹配能力的对比对抗训练方案。广泛的离线评估证明了我们方法的卓越鲁棒性和有效性,在线A/B测试证实了它能够在相同的曝光位置提高相关性,从而带来更多的点击和转化。据我们所知,该方法是第一个用于大型电子商务搜索相关性计算的基于交互的方法。值得注意的是,我们已经将其部署到此http URL的全部搜索流量上,该平台是世界上最大的B2B电子商务平台。

[NLP-40] The current status of large language models in summarizing radiology report impressions
[NLP-40] 大型语言模型在总结放射学报告印象中的现状

链接: https://arxiv.org/abs/2406.02134
作者: Danqing Hu,Shanyuan Zhang,Qing Liu,Xiaofeng Zhu,Bing Liu
关键词: Large language models, language processing tasks, natural language processing, Large language, ChatGPT show excellent
中文关键词: 大型语言模型、语言处理任务、自然语言处理、大型语言、ChatGPT表现出色
类目: Computation and Language (cs.CL)
备注:

Abstract:Large language models (LLMs) like ChatGPT show excellent capabilities in various natural language processing tasks, especially for text generation. The effectiveness of LLMs in summarizing radiology report impressions remains unclear. In this study, we explore the capability of eight LLMs on the radiology report impression summarization. Three types of radiology reports, i.e., CT, PET-CT, and Ultrasound reports, are collected from Peking University Cancer Hospital and Institute. We use the report findings to construct the zero-shot, one-shot, and three-shot prompts with complete example reports to generate the impressions. Besides the automatic quantitative evaluation metrics, we define five human evaluation metrics, i.e., completeness, correctness, conciseness, verisimilitude, and replaceability, to evaluate the semantics of the generated impressions. Two thoracic surgeons (ZSY and LB) and one radiologist (LQ) compare the generated impressions with the reference impressions and score each impression under the five human evaluation metrics. Experimental results show that there is a gap between the generated impressions and reference impressions. Although the LLMs achieve comparable performance in completeness and correctness, the conciseness and verisimilitude scores are not very high. Using few-shot prompts can improve the LLMs’ performance in conciseness and verisimilitude, but the clinicians still think the LLMs can not replace the radiologists in summarizing the radiology impressions.
摘要:像ChatGPT这样的大型语言模型(LLM)在各种自然语言处理任务中表现出了优异的能力,尤其是在文本生成方面。LLM在总结放射学报告印象方面的有效性仍不清楚。在这项研究中,我们探索了八个LLM在放射学报告印象总结上的能力。三种类型的放射学报告,即CT、PET-CT和超声报告,收集自北京大学肿瘤医院和研究所。我们使用报告所见(findings)构建零样本、单样本和三样本提示,并配以完整的示例报告来生成印象。除了自动量化评价指标外,我们还定义了完整性、正确性、简洁性、逼真性和可替代性五个人工评价指标来评价所生成印象的语义。两名胸外科医生(ZSY和LB)和一名放射科医生(LQ)将生成的印象与参考印象进行比较,并根据五个人工评价指标对每个印象进行评分。实验结果表明,生成的印象与参考印象之间存在差距。虽然LLM在完整性和正确性方面取得了相当的表现,但简洁性和逼真性得分并不是很高。使用少样本提示可以提高LLM在简洁性和逼真性方面的表现,但临床医生仍然认为LLM在总结放射学印象方面不能取代放射科医生。

[NLP-41] Iteration Head: A Mechanistic Study of Chain-of-Thought
[NLP-41] 迭代头:思维链的机制性研究

链接: https://arxiv.org/abs/2406.02128
作者: Vivien Cabannes,Charles Arnal,Wassim Bouaziz,Alice Yang,Francois Charton,Julia Kempe
关键词: Large Language Models, improve Large Language, theoretical approximation power, Large Language, Language Models
中文关键词: 大型语言模型,提高大型语言,理论逼近能力,大型语言,语言模型
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

Abstract:Chain-of-Thought (CoT) reasoning is known to improve Large Language Models both empirically and in terms of theoretical approximation power. However, our understanding of the inner workings and conditions of apparition of CoT capabilities remains limited. This paper helps fill this gap by demonstrating how CoT reasoning emerges in transformers in a controlled and interpretable setting. In particular, we observe the appearance of a specialized attention mechanism dedicated to iterative reasoning, which we coined “iteration heads”. We track both the emergence and the precise working of these iteration heads down to the attention level, and measure the transferability of the CoT skills to which they give rise between tasks.
摘要:众所周知,思维链(CoT)推理可以在经验上和理论逼近能力方面改进大型语言模型。然而,我们对CoT能力的内部运作机制及其出现条件的了解仍然有限。本文通过展示CoT推理如何在受控且可解释的设置下在Transformer中涌现,来帮助填补这一空白。特别是,我们观察到一种专门用于迭代推理的注意力机制的出现,我们将其称为“迭代头”(iteration heads)。我们追踪这些迭代头的出现及其精确的工作方式,直至注意力层面,并衡量由它们产生的CoT技能在任务之间的可迁移性。

[NLP-42] Diver: Large Language Model Decoding with Span-Level Mutual Information Verification
[NLP-42] Diver:具有跨级互信息验证的大型语言模型解码

链接: https://arxiv.org/abs/2406.02120
作者: Jinliang Lu,Chen Wang,Jiajun Zhang
关键词: Large language models, shown impressive capabilities, Large language, language models, task-specific instructions
中文关键词: 大型语言模型,表现出令人印象深刻的能力,大型语言,语言模型,特定任务指令
类目: Computation and Language (cs.CL)
备注:

Abstract:Large language models (LLMs) have shown impressive capabilities in adapting to various tasks when provided with task-specific instructions. However, LLMs using standard decoding strategies often struggle with deviations from the inputs. Intuitively, compliant LLM outputs should reflect the information present in the input, which can be measured by point-wise mutual information (PMI) scores. Therefore, we propose Diver, a novel approach that enhances LLM Decoding through span-level PMI verification. During inference, Diver first identifies divergence steps that may lead to multiple candidate spans. Subsequently, it calculates the PMI scores by assessing the log-likelihood gains of the input if the candidate spans are generated. Finally, the optimal span is selected based on the PMI re-ranked output distributions. We evaluate our method across various downstream tasks, and empirical results demonstrate that Diver significantly outperforms existing decoding methods in both performance and versatility.
摘要:大型语言模型(LLM)在获得特定于任务的指令时,已经显示出适应各种任务的出色能力。然而,使用标准解码策略的LLM常常会出现偏离输入的问题。直观地说,合规的LLM输出应该反映输入中存在的信息,这可以通过逐点互信息(PMI)分数来衡量。因此,我们提出了Diver,一种通过片段级(span-level)PMI验证来增强LLM解码的新方法。在推理过程中,Diver首先识别可能产生多个候选片段的分歧步骤。随后,它通过评估生成候选片段后输入的对数似然增益来计算PMI分数。最后,基于经PMI重新排序的输出分布来选择最优片段。我们在多种下游任务上对我们的方法进行了评估,实验结果表明,Diver在性能和通用性方面都明显优于现有的解码方法。
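
The abstract describes the PMI score as the gain in the input's log-likelihood once a candidate span has been generated, i.e. PMI(span; input) = log p(input | span) - log p(input). The sketch below shows how such a score could re-rank candidate spans. The additive combination with the span's own log-probability, the `weight` parameter, and all numbers in the usage note are assumptions for illustration, not the authors' exact formulation.

```python
def pmi(logp_input_given_span: float, logp_input: float) -> float:
    """Pointwise mutual information as a log-likelihood gain of the input."""
    return logp_input_given_span - logp_input


def rerank_spans(candidates, logp_input: float, weight: float = 1.0) -> str:
    """Return the span with the best PMI-adjusted score.

    `candidates` is a list of (span, logp_span, logp_input_given_span) tuples,
    where the log-probabilities are assumed to come from the language model.
    The additive scoring rule is an assumption of this sketch.
    """
    best_span, best_score = None, float("-inf")
    for span, logp_span, logp_input_given_span in candidates:
        score = logp_span + weight * pmi(logp_input_given_span, logp_input)
        if score > best_score:
            best_span, best_score = span, score
    return best_span
```

With `logp_input = -10.0` and candidates `[("A", -1.0, -12.0), ("B", -1.5, -9.0)]`, plain likelihood prefers "A", but "B" makes the input more likely (PMI of +1.0 versus -2.0) and wins after re-ranking. Setting `weight=0.0` recovers ordinary likelihood-based selection.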

[NLP-43] UniOQA: A Unified Framework for Knowledge Graph Question Answering with Large Language Models
[NLP-43] UniOQA:基于大型语言模型的知识图谱问答统一框架

链接: https://arxiv.org/abs/2406.02110
作者: Zhuoyang Li,Liran Deng,Hui Liu,Qiaoqiao Liu,Junzhao Du
关键词: extensive Chinese open-domain, Chinese open-domain knowledge, extensive Chinese, Chinese open-domain, recent times
中文关键词: 广泛的中文开放领域,中文开放领域知识,广泛的中文,中文开放领域,近代
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 10 pages, 5 figures

Abstract:OwnThink stands as the most extensive Chinese open-domain knowledge graph introduced in recent times. Despite prior attempts in question answering over OwnThink (OQA), existing studies have faced limitations in model representation capabilities, posing challenges in further enhancing overall accuracy in question answering. In this paper, we introduce UniOQA, a unified framework that integrates two complementary parallel workflows. Unlike conventional approaches, UniOQA harnesses large language models (LLMs) for precise question answering and incorporates a direct-answer-prediction process as a cost-effective complement. Initially, to bolster representation capacity, we fine-tune an LLM to translate questions into the Cypher query language (CQL), tackling issues associated with restricted semantic understanding and hallucinations. Subsequently, we introduce the Entity and Relation Replacement algorithm to ensure the executability of the generated CQL. Concurrently, to augment overall accuracy in question answering, we further adapt the Retrieval-Augmented Generation (RAG) process to the knowledge graph. Ultimately, we optimize answer accuracy through a dynamic decision algorithm. Experimental findings illustrate that UniOQA notably advances SpCQL Logical Accuracy to 21.2% and Execution Accuracy to 54.9%, achieving the new state-of-the-art results on this benchmark. Through ablation experiments, we delve into the superior representation capacity of UniOQA and quantify its performance breakthrough.
摘要:OwnThink是近期推出的规模最大的中文开放领域知识图谱。尽管此前已有针对OwnThink的问答(OQA)尝试,但现有研究在模型表示能力方面存在局限,这给进一步提高问答的整体准确性带来了挑战。本文介绍了UniOQA,这是一个集成了两个互补的并行工作流的统一框架。与传统方法不同,UniOQA利用大型语言模型(LLM)进行精确的问答,并加入直接答案预测过程作为高性价比的补充。首先,为了增强表示能力,我们微调LLM以将问题转换为Cypher查询语言(CQL),解决语义理解受限和幻觉等问题。随后,我们引入了实体和关系替换算法来保证所生成CQL的可执行性。同时,为了提高问答的整体准确性,我们进一步将检索增强生成(RAG)过程适配到知识图谱上。最后,我们通过一种动态决策算法来优化答案的准确性。实验结果表明,UniOQA显著地将SpCQL逻辑准确率提高到21.2%,将执行准确率提高到54.9%,在该基准上取得了新的最先进结果。通过消融实验,我们深入探究了UniOQA优越的表示能力,并量化了其性能突破。

[NLP-44] MARS: Benchmarking the Metaphysical Reasoning Abilities of Language Models with a Multi-task Evaluation Dataset
[NLP-44] MARS:用多任务评估数据集对语言模型的形而上学推理能力进行基准测试

链接: https://arxiv.org/abs/2406.02106
作者: Weiqi Wang,Yangqiu Song
关键词: enable Large Language, Large Language Models, Large Language, enable Large, Language Models
中文关键词: 启用大型语言、大型语言模型、大型语言、启用大型、语言模型
类目: Computation and Language (cs.CL)
备注:

Abstract:To enable Large Language Models (LLMs) to function as conscious agents with generalizable reasoning capabilities, it is crucial that they possess the reasoning ability to comprehend situational changes (transitions) in distribution triggered by environmental factors or actions from other agents. Despite its fundamental significance, this ability remains underexplored due to the complexity of modeling infinite possible changes in an event and their associated distributions, coupled with the lack of benchmark data with situational transitions. Addressing these gaps, we propose a novel formulation of reasoning with distributional changes as a three-step discriminative process, termed as MetAphysical ReaSoning. We then introduce the first-ever benchmark, MARS, comprising three tasks corresponding to each step. These tasks systematically assess LLMs’ capabilities in reasoning the plausibility of (i) changes in actions, (ii) states caused by changed actions, and (iii) situational transitions driven by changes in action. Extensive evaluations with 20 (L)LMs of varying sizes and methods indicate that all three tasks in this process pose significant challenges, even for state-of-the-art LLMs and LMs after fine-tuning. Further analyses reveal potential causes for the underperformance of LLMs and demonstrate that pre-training them on large-scale conceptualization taxonomies can potentially enhance their metaphysical reasoning capabilities. Our data and models are publicly accessible at this https URL.
摘要:为了使大型语言模型(LLM)能够作为具有可泛化推理能力的有意识智能体发挥作用,关键是它们必须具备理解由环境因素或其他智能体的行为所触发的分布中的情景变化(转换)的推理能力。尽管这种能力具有根本性的意义,但由于对事件中无限可能的变化及其相关分布进行建模的复杂性,以及缺乏包含情景转换的基准数据,这种能力仍未得到充分探索。为了弥补这些空白,我们提出了一种新的形式化方法,将带有分布变化的推理表述为一个三步判别过程,称为形而上学推理(MetAphysical ReaSoning)。然后,我们提出了首个基准MARS,它包含与每个步骤相对应的三个任务。这些任务系统地评估了LLM在推理以下内容的合理性方面的能力:(i)动作的变化,(ii)由动作变化引起的状态,以及(iii)由动作变化驱动的情景转换。对20个不同规模和方法的(L)LM进行的广泛评估表明,这一过程中的所有三项任务都构成了重大挑战,即使对最先进的LLM和微调后的LM也是如此。进一步的分析揭示了LLM表现不佳的潜在原因,并表明在大规模概念化分类体系上对其进行预训练有可能提高其形而上学推理能力。我们的数据和模型可通过此 https URL 公开访问。

[NLP-45] Exploring Mathematical Extrapolation of Large Language Models with Synthetic Data
[NLP-45] 利用合成数据探索大型语言模型的数学外推

链接: https://arxiv.org/abs/2406.02100
作者: Haolong Li,Yu Ma,Yinqi Zhang,Chen Ye,Jie Chen
关键词: Large Language Models, Large Language, complex multi-step reasoning, language understanding, multi-step reasoning problems
中文关键词: 大型语言模型、大型语言、复杂的多步推理、语言理解、多步推理问题
类目: Computation and Language (cs.CL)
备注: Accepted by Findings of ACL 2024

Abstract:Large Language Models (LLMs) have shown excellent performance in language understanding, text generation, code synthesis, and many other tasks, while they still struggle in complex multi-step reasoning problems, such as mathematical reasoning. In this paper, through a newly proposed arithmetical puzzle problem, we show that the model can perform well on multi-step reasoning tasks via fine-tuning on high-quality synthetic data. Experimental results with the open-llama-3B model on three different test datasets show that not only the model can reach a zero-shot pass@1 at 0.44 on the in-domain dataset, it also demonstrates certain generalization capabilities on the out-of-domain datasets. Specifically, this paper has designed two out-of-domain datasets in the form of extending the numerical range and the composing components of the arithmetical puzzle problem separately. The fine-tuned models have shown encouraging performance on these two far more difficult tasks with the zero-shot pass@1 at 0.33 and 0.35, respectively.
摘要:大型语言模型(LLM)在语言理解、文本生成、代码合成等许多任务中表现出优异的性能,但在数学推理等复杂的多步推理问题中仍然举步维艰。本文通过一个新提出的算术谜题问题表明,模型经过高质量合成数据的微调后,可以很好地完成多步推理任务。在三个不同的测试数据集上使用open-llama-3B模型的实验结果表明,该模型不仅可以在域内数据集上达到0.44的零样本pass@1,而且在域外数据集上也表现出一定的泛化能力。具体地说,本文设计了两个域外数据集,分别扩展了算术谜题问题的数值范围和组成成分。微调后的模型在这两个难度大得多的任务上表现出令人鼓舞的性能,零样本pass@1分别为0.33和0.35。
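The abstract reports zero-shot pass@1 scores but does not say how they were computed. As a minimal sketch, assuming the commonly used unbiased pass@k estimator (n samples per problem, c of them correct), the metric can be computed as below. The function names `pass_at_k` and `mean_pass_at_k` are illustrative and do not come from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which are correct.

    Assumption: this is the widely used unbiased estimator
    1 - C(n - c, k) / C(n, k); the paper may compute pass@1 differently.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k: int = 1) -> float:
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For k = 1 the estimator reduces to the fraction of correct samples, c / n, so a benchmark-level pass@1 is simply the mean per-problem success rate.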

[NLP-46] LongSSM: On the Length Extension of State-space Models in Language Modelling
[NLP-46] LongSSM:关于语言建模中状态空间模型的长度扩展

链接: https://arxiv.org/abs/2406.02080
作者: Shida Wang
关键词: language modeling, Length extension, investigate the length-extension, Length, extension
中文关键词: 语言建模,长度扩展,研究长度扩展,长度,扩展
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Dynamical Systems (math.DS)
备注: 23 pages

Abstract:In this paper, we investigate the length-extension of state-space models (SSMs) in language modeling. Length extension involves training models on short sequences and testing them on longer ones. We show that state-space models trained with zero hidden states initialization have difficulty doing length extension. We explain this difficulty by pointing out the length extension is equivalent to polynomial extrapolation. Based on the theory, we propose a simple yet effective method - changing the hidden states initialization scheme - to improve the length extension. Moreover, our method shows that using long training sequence length is beneficial but not necessary to length extension. Changing the hidden state initialization enables the efficient training of long-memory model with a smaller training context length.
摘要:本文研究了语言建模中状态空间模型(SSM)的长度扩展。长度扩展涉及在短序列上训练模型并在长序列上测试它们。我们表明,用零隐藏状态初始化训练的状态空间模型很难进行长度扩展。我们通过指出长度扩展相当于多项式外推来解释这个困难。基于该理论,我们提出了一种简单而有效的方法,即改变隐藏状态初始化方案,来改善长度扩展。此外,我们的方法表明,使用长训练序列长度对于长度扩展是有益的,但不是必要的。改变隐藏状态初始化可以以更小的训练上下文长度高效训练长记忆模型。
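The abstract does not spell out the proposed initialization scheme, so the following is only a sketch of why the initial hidden state matters. It assumes a scalar linear state-space recurrence h_t = a·h_{t-1} + b·x_t with output y_t = c·h_t; the function `run_ssm` and its parameter values are hypothetical. Processing a short segment from a zero state discards the history, while starting from the state left by the preceding segment reproduces the long-sequence output exactly.

```python
def run_ssm(xs, a=0.9, b=1.0, c=1.0, h0=0.0):
    """Run a scalar linear SSM: h_t = a*h_{t-1} + b*x_t, y_t = c*h_t.

    h0 is the initial hidden state. Zero initialization corresponds to
    h0=0.0; any other value here is an illustrative assumption, not the
    paper's scheme.
    """
    h = h0
    ys = []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys, h


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
full_ys, _ = run_ssm(xs)                      # one long sequence
head_ys, h_mid = run_ssm(xs[:3])              # first segment, zero init
tail_carry, _ = run_ssm(xs[3:], h0=h_mid)     # second segment, carried state
tail_zero, _ = run_ssm(xs[3:])                # second segment, zero init
```

Here `head_ys + tail_carry` equals `full_ys`, whereas `tail_zero` differs, which illustrates how a model that only ever sees zero-initialized short sequences faces a different state distribution on longer inputs.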

[NLP-47] Assessing the Performance of Chinese Open Source Large Language Models in Information Extraction Tasks
[NLP-47] 评估中文开源大型语言模型在信息提取任务中的性能

链接: https://arxiv.org/abs/2406.02079
作者: Yida Cai,Hao Sun,Hsiu-Yuan Huang,Yunfang Wu
关键词: Natural Language Processing, facilitating seamless integration, extracting structured information, Named Entity Recognition, Language Processing
中文关键词: 自然语言处理、促进无缝集成、提取结构化信息、命名实体识别、语言处理
类目: Computation and Language (cs.CL)
备注:

Abstract:Information Extraction (IE) plays a crucial role in Natural Language Processing (NLP) by extracting structured information from unstructured text, thereby facilitating seamless integration with various real-world applications that rely on structured data. Despite its significance, recent experiments focusing on English IE tasks have shed light on the challenges faced by Large Language Models (LLMs) in achieving optimal performance, particularly in sub-tasks like Named Entity Recognition (NER). In this paper, we delve into a comprehensive investigation of the performance of mainstream Chinese open-source LLMs in tackling IE tasks, specifically under zero-shot conditions where the models are not fine-tuned for specific tasks. Additionally, we present the outcomes of several few-shot experiments to further gauge the capability of these models. Moreover, our study includes a comparative analysis between these open-source LLMs and ChatGPT, a widely recognized language model, on IE performance. Through meticulous experimentation and analysis, we aim to provide insights into the strengths, limitations, and potential enhancements of existing Chinese open-source LLMs in the domain of Information Extraction within the context of NLP.
摘要:信息抽取(IE)从非结构化文本中提取结构化信息,从而促进与依赖结构化数据的各种现实应用的无缝集成,在自然语言处理(NLP)中起着至关重要的作用。尽管它意义重大,但最近针对英语IE任务的实验揭示了大型语言模型(LLM)在实现最佳性能方面面临的挑战,特别是在命名实体识别(NER)等子任务中。在本文中,我们全面考察了主流中文开源LLM在处理IE任务时的性能,特别是在模型未针对特定任务进行微调的零样本条件下。此外,我们还给出了若干少样本实验的结果,以进一步衡量这些模型的能力。我们的研究还对这些开源LLM与广为人知的语言模型ChatGPT在IE性能上进行了比较分析。通过细致的实验和分析,我们旨在深入了解现有中文开源LLM在NLP背景下的信息抽取领域中的优势、局限性和潜在的改进方向。

[NLP-48] PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling
[NLP-48] PyramidKV:基于金字塔信息漏斗的动态KV缓存压缩

链接: https://arxiv.org/abs/2406.02069
作者: Zefan Cai,Yichi Zhang,Bofei Gao,Tianyu Liu,Keming Lu,Wayne Xiong,Yue Dong,Baobao Chang,Junjie Hu,Wen Xiao
关键词: flow inside large, inside large language, attention-based information flow, information flow inside, long context processing
中文关键词: 内部大流程,内部大语言,基于注意力的信息流,内部信息流,长上下文处理
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

Abstract:In this study, we investigate whether attention-based information flow inside large language models (LLMs) is aggregated through noticeable patterns for long context processing. Our observations reveal that LLMs aggregate information through Pyramidal Information Funneling, where attention scatters widely in lower layers, progressively consolidates within specific contexts, and ultimately focuses on critical tokens (a.k.a. massive activation or attention sink) in higher layers. Motivated by these insights, we developed PyramidKV, a novel and effective KV cache compression method. This approach dynamically adjusts the KV cache size across different layers, allocating more cache in lower layers and less in higher ones, diverging from traditional methods that maintain a uniform KV cache size. Our experimental evaluations, utilizing the LongBench benchmark, show that PyramidKV matches the performance of models with a full KV cache while retaining only 12% of the KV cache, thus significantly reducing memory usage. In scenarios emphasizing memory efficiency, where only 0.7% of the KV cache is maintained, PyramidKV surpasses other KV cache compression techniques, achieving up to a 20.5 absolute accuracy improvement on TREC.
摘要:在这项研究中,我们考察了大型语言模型(LLM)内部基于注意力的信息流在长上下文处理中是否通过明显的模式进行聚合。我们的观察表明,LLM通过金字塔信息漏斗(Pyramidal Information Funneling)聚合信息:注意力在较低层广泛分散,在特定上下文中逐渐集中,最终在较高层聚焦于关键token(也称为大规模激活或注意力汇聚点)。受此启发,我们开发了PyramidKV,一种新颖有效的KV缓存压缩方法。该方法跨不同层动态调整KV缓存大小,在较低层分配更多缓存,在较高层分配较少缓存,这与保持统一KV缓存大小的传统方法不同。我们基于LongBench基准的实验评估表明,PyramidKV在仅保留12%的KV缓存的情况下即可达到与完整KV缓存模型相当的性能,从而显著减少内存使用量。在强调内存效率、仅保留0.7%的KV缓存的场景中,PyramidKV超过了其他KV缓存压缩技术,在TREC上实现了高达20.5的绝对准确率提升。
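The abstract states only that more cache is allocated in lower layers and less in higher ones; it does not give the allocation rule. The sketch below assumes a linearly decreasing schedule around a fixed average budget. The function `pyramid_budgets` and its `spread` parameter are hypothetical names chosen for illustration, not the PyramidKV implementation.

```python
def pyramid_budgets(num_layers: int, avg_budget: int, spread: float = 0.5):
    """Split a fixed total KV-cache budget across layers in a pyramid shape.

    Layer 0 (lowest) gets about avg_budget * (1 + spread) entries and the
    top layer about avg_budget * (1 - spread); the total stays at
    num_layers * avg_budget. The linear schedule is an assumption.
    """
    if num_layers < 2:
        raise ValueError("need at least two layers")
    if not 0.0 <= spread < 1.0:
        raise ValueError("spread must be in [0, 1)")
    raw = [
        avg_budget * (1 + spread * (1 - 2 * i / (num_layers - 1)))
        for i in range(num_layers)
    ]
    budgets = [int(r) for r in raw]  # round down to whole cache entries
    # Hand entries lost to rounding back to the lowest layers first.
    leftover = num_layers * avg_budget - sum(budgets)
    for i in range(leftover):
        budgets[i % num_layers] += 1
    return budgets


budgets = pyramid_budgets(num_layers=4, avg_budget=100, spread=0.5)
```

With `spread=0.0` the schedule degenerates to the uniform per-layer cache size that the abstract contrasts against.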

[NLP-49] Alice in Wonderland: Simple Tasks Showing Complete Reasoning Breakdown in State-Of-the-Art Large Language Models
[NLP-49] 爱丽丝梦游仙境:在最先进的大型语言模型中展示推理完全崩溃的简单任务

链接: https://arxiv.org/abs/2406.02061
作者: Marianna Nezhurina,Lucia Cipolina-Kun,Mehdi Cherti,Jenia Jitsev
关键词: Large Language Models, exhibiting scaling laws, predict function improvement, Large Language, zero-shot manner
中文关键词: 大型语言模型,展现缩放定律,预测功能改进,大型语言,零射击方式
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: v1

Abstract:Large Language Models (LLMs) are often described as being instances of foundation models - that is, models that transfer strongly across various tasks and conditions in few-shot or zero-shot manner, while exhibiting scaling laws that predict function improvement when increasing the pre-training scale. These claims of excelling in different functions and tasks rely on measurements taken across various sets of standardized benchmarks showing high scores for such models. We demonstrate here a dramatic breakdown of function and reasoning capabilities of state-of-the-art models trained at the largest available scales which claim strong function, using a simple, short, conventional common sense problem formulated in concise natural language, easily solvable by humans. The breakdown is dramatic, as models also express strong overconfidence in their wrong solutions, while providing often non-sensical “reasoning”-like explanations akin to confabulations to justify and back up the validity of their clearly failed responses, making them sound plausible. Various standard interventions in an attempt to get the right solution, like various types of enhanced prompting, or urging the models to reconsider the wrong solutions again by multi-step re-evaluation, fail. We take these initial observations to the scientific and technological community to stimulate urgent re-assessment of the claimed capabilities of the current generation of LLMs. Such re-assessment also requires common action to create standardized benchmarks that would allow proper detection of such basic reasoning deficits that obviously manage to remain undiscovered by current state-of-the-art evaluation procedures and benchmarks. Code for reproducing experiments in the paper and raw experiments data can be found at this https URL
摘要:大型语言模型(LLM)通常被描述为基础模型的实例,即能够以少样本或零样本方式在各种任务和条件下强有力地迁移的模型,同时表现出缩放定律,可预测随着预训练规模增大而带来的功能提升。这些在不同功能和任务上表现出色的说法,依赖于在各种标准化基准上测得的高分。我们在此展示了在最大可用规模上训练、并声称具备强大功能的最先进模型在功能和推理能力上的严重崩溃,所用的只是一个以简洁自然语言表述、人类很容易解决的简单、简短的常识问题。这种崩溃十分严重,因为模型还对其错误答案表现出强烈的过度自信,同时经常给出毫无意义的、类似虚构的“推理”式解释,来为其明显失败的回答辩护,使其听起来似乎可信。各种试图获得正确答案的标准干预措施,如各类增强提示,或通过多步重新评估敦促模型重新考虑错误答案,均告失败。我们将这些初步观察提交给科学和技术界,以推动对当前一代LLM所声称的能力进行紧急重新评估。这种重新评估还需要共同行动,以创建标准化基准,使之能够恰当地检测出这类基本推理缺陷,而这些缺陷显然未被当前最先进的评估程序和基准发现。用于复现论文实验的代码和原始实验数据可在此 https URL 找到

[NLP-50] I've got the “Answer”! Interpretation of LLMs Hidden States in Question Answering
[NLP-50] 我得到了“答案”!问答中LLM隐藏状态的解读

链接: https://arxiv.org/abs/2406.02060
作者: Valeriya Goloviznina,Evgeny Kotelnikov
关键词: large language models, Interpretability and explainability, increasingly important, important in light, rapid development
中文关键词: 大型语言模型、可解释性和可解释性,越来越重要,在快速发展中重要
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted for NLDB-2024 conference

Abstract:Interpretability and explainability of AI are becoming increasingly important in light of the rapid development of large language models (LLMs). This paper investigates the interpretation of LLMs in the context of the knowledge-based question answering. The main hypothesis of the study is that correct and incorrect model behavior can be distinguished at the level of hidden states. The quantized models LLaMA-2-7B-Chat, Mistral-7B, Vicuna-7B and the MuSeRC question-answering dataset are used to test this hypothesis. The results of the analysis support the proposed hypothesis. We also identify the layers which have a negative effect on the model’s behavior. As a prospect of practical application of the hypothesis, we propose to train such “weak” layers additionally in order to improve the quality of the task solution.
摘要:随着大型语言模型(LLM)的快速发展,人工智能的可解释性与可说明性变得越来越重要。本文探讨了在基于知识的问答背景下对LLM的解释。该研究的主要假设是,正确和不正确的模型行为可以在隐藏状态的层面上加以区分。我们使用量化模型LLaMA-2-7B-Chat、Mistral-7B、Vicuna-7B和MuSeRC问答数据集来检验这一假设。分析结果支持所提出的假设。我们还确定了对模型行为产生负面影响的层。作为该假设实际应用的前景,我们建议对此类“弱”层进行额外训练,以提高任务求解的质量。

[NLP-51] Analyzing Social Biases in Japanese Large Language Models
[NLP-51] 分析日语大型语言模型中的社会偏见

链接: https://arxiv.org/abs/2406.02050
作者: Hitomi Yanaka,Han Namgi,Ryoma Kumon,Jie Lu,Masashi Takeshita,Ryo Sekizawa,Taisei Kato,Hiromi Arai
关键词: Large Language Models, Large Language, development of Large, social biases, Japanese LLMs
中文关键词: 大型语言模型、大型语言、大型发展、社会偏见、日语LLM
类目: Computation and Language (cs.CL)
备注:

Abstract:With the development of Large Language Models (LLMs), social biases in the LLMs have become a crucial issue. While various benchmarks for social biases have been provided across languages, the extent to which Japanese LLMs exhibit social biases has not been fully investigated. In this study, we construct the Japanese Bias Benchmark dataset for Question Answering (JBBQ) based on the English bias benchmark BBQ, and analyze social biases in Japanese LLMs. The results show that while current Japanese LLMs improve their accuracies on JBBQ by instruction-tuning, their bias scores become larger. In addition, augmenting their prompts with warning about social biases reduces the effect of biases in some models.
摘要:随着大型语言模型(LLM)的发展,LLM中的社会偏见已成为一个关键问题。虽然已有面向多种语言的社会偏见基准,但日语LLM表现出社会偏见的程度尚未得到充分研究。在本研究中,我们基于英语偏见基准BBQ构建了日语问答偏见基准数据集(JBBQ),并分析了日语LLM中的社会偏见。结果表明,虽然当前的日语LLM通过指令微调提高了在JBBQ上的准确率,但其偏见分数也随之变大。此外,在提示中加入有关社会偏见的警告,可以减少某些模型中偏见的影响。

[NLP-52] QROA: A Black-Box Query-Response Optimization Attack on LLMs
[NLP-52] QROA:对LLM的黑匣子查询响应优化攻击

链接: https://arxiv.org/abs/2406.02044
作者: Hussein Jawad,Nicolas J.-B. BRUNEL(LaMME)
关键词: Large Language Models, Large Language, Language Models, recent months, surged in popularity
中文关键词: 大型语言模型,大型语言,语言模型,近几个月受欢迎程度激增
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

Abstract:Large Language Models (LLMs) have surged in popularity in recent months, yet they possess concerning capabilities for generating harmful content when manipulated. This study introduces the Query-Response Optimization Attack (QROA), an optimization-based strategy designed to exploit LLMs through a black-box, query-only interaction. QROA adds an optimized trigger to a malicious instruction to compel the LLM to generate harmful content. Unlike previous approaches, QROA does not require access to the model’s logit information or any other internal data and operates solely through the standard query-response interface of LLMs. Inspired by deep Q-learning and Greedy coordinate descent, the method iteratively updates tokens to maximize a designed reward function. We tested our method on various LLMs such as Vicuna, Falcon, and Mistral, achieving an Attack Success Rate (ASR) over 80%. We also tested the model against Llama2-chat, the fine-tuned version of Llama2 designed to resist Jailbreak attacks, achieving good ASR with a suboptimal initial trigger seed. This study demonstrates the feasibility of generating jailbreak attacks against deployed LLMs in the public domain using black-box optimization methods, enabling more comprehensive safety testing of LLMs.
摘要:近几个月来,大型语言模型(LLM)大受欢迎,但它们在被操纵时具有生成有害内容的能力,令人担忧。这项研究介绍了查询-响应优化攻击(QROA),这是一种基于优化的策略,旨在通过黑盒、仅查询的交互来利用LLM。QROA向恶意指令添加经过优化的触发器,以迫使LLM生成有害内容。与以前的方法不同,QROA不需要访问模型的logit信息或任何其他内部数据,只通过LLM的标准查询-响应接口进行操作。受深度Q学习和贪婪坐标下降的启发,该方法迭代更新token以最大化所设计的奖励函数。我们在Vicuna、Falcon和Mistral等多种LLM上测试了我们的方法,取得了80%以上的攻击成功率(ASR)。我们还针对Llama2-chat测试了该方法,Llama2-chat是Llama2的微调版本,旨在抵抗越狱攻击,在初始触发种子并非最优的情况下仍取得了良好的ASR。这项研究论证了利用黑盒优化方法对公开部署的LLM生成越狱攻击的可行性,从而支持对LLM进行更全面的安全测试。

[NLP-53] Multimodal Reasoning with Multimodal Knowledge Graph
[NLP-53] 利用多模态知识图谱进行多模态推理

链接: https://arxiv.org/abs/2406.02030
作者: Junlin Lee,Yequan Wang,Jing Li,Min Zhang
关键词: large language models, Multimodal reasoning, Multimodal, knowledge, Multimodal Knowledge Graph
中文关键词: 大型语言模型、多模态推理、多模态、知识、多模态知识图谱
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

Abstract:Multimodal reasoning with large language models (LLMs) often suffers from hallucinations and the presence of deficient or outdated knowledge within LLMs. Some approaches have sought to mitigate these issues by employing textual knowledge graphs, but their singular modality of knowledge limits comprehensive cross-modal understanding. In this paper, we propose the Multimodal Reasoning with Multimodal Knowledge Graph (MR-MKG) method, which leverages multimodal knowledge graphs (MMKGs) to learn rich and semantic knowledge across modalities, significantly enhancing the multimodal reasoning capabilities of LLMs. In particular, a relation graph attention network is utilized for encoding MMKGs and a cross-modal alignment module is designed for optimizing image-text alignment. A MMKG-grounded dataset is constructed to equip LLMs with initial expertise in multimodal reasoning through pretraining. Remarkably, MR-MKG achieves superior performance while training on only a small fraction of parameters, approximately 2.25% of the LLM’s parameter size. Experimental results on multimodal question answering and multimodal analogy reasoning tasks demonstrate that our MR-MKG method outperforms previous state-of-the-art models.
摘要:使用大型语言模型(LLM)进行多模态推理时,常常会出现幻觉,以及LLM内部知识缺失或过时的问题。一些方法试图通过使用文本知识图谱来缓解这些问题,但其单一的知识模态限制了全面的跨模态理解。本文提出了基于多模态知识图谱的多模态推理方法(MR-MKG),该方法利用多模态知识图谱(MMKG)跨模态学习丰富的语义知识,显著提高了LLM的多模态推理能力。具体而言,我们利用关系图注意力网络对MMKG进行编码,并设计了跨模态对齐模块来优化图文对齐。我们构建了一个基于MMKG的数据集,通过预训练为LLM提供多模态推理的初始专业知识。值得注意的是,MR-MKG只需训练一小部分参数(约为LLM参数规模的2.25%)即可获得优越的性能。在多模态问答和多模态类比推理任务上的实验结果表明,我们的MR-MKG方法优于以往的最先进模型。

[NLP-54] Why Would You Suggest That? Human Trust in Language Model Responses
[NLP-54] 你为什么会这样建议?人类对语言模型响应的信任

链接: https://arxiv.org/abs/2406.02018
作者: Manasi Sharma,Ho Chit Siu,Rohan Paleja,Jaime D. Peña
关键词: Large Language Models, Large Language, creative decision-making scenarios, emergence of Large, Language Models
中文关键词: 大型语言模型、大型语言、创造性决策场景、大型语言模型的出现
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

Abstract:The emergence of Large Language Models (LLMs) has revealed a growing need for human-AI collaboration, especially in creative decision-making scenarios where trust and reliance are paramount. Through human studies and model evaluations on the open-ended News Headline Generation task from the LaMP benchmark, we analyze how the framing and presence of explanations affect user trust and model performance. Overall, we provide evidence that adding an explanation in the model response to justify its reasoning significantly increases self-reported user trust in the model when the user has the opportunity to compare various responses. Position and faithfulness of these explanations are also important factors. However, these gains disappear when users are shown responses independently, suggesting that humans trust all model responses, including deceptive ones, equitably when they are shown in isolation. Our findings urge future research to delve deeper into the nuanced evaluation of trust in human-machine teaming systems.
摘要:大型语言模型(LLM)的出现揭示了对人类与人工智能协作日益增长的需求,特别是在信任和依赖至关重要的创造性决策场景中。通过在LaMP基准的开放式新闻标题生成任务上进行的人类受试者研究和模型评估,我们分析了解释的呈现方式及其存在与否如何影响用户信任和模型性能。总体而言,我们提供的证据表明,当用户有机会比较多个响应时,在模型响应中添加解释以说明其推理依据,会显著提高用户自我报告的对模型的信任。这些解释的位置和忠实度也是重要因素。然而,当响应被单独展示给用户时,这些增益就消失了,这表明在孤立展示时,人类会同等地信任所有模型响应,包括具有欺骗性的响应。我们的发现促使未来的研究更深入地探讨人机协作系统中对信任的细致评估。

[NLP-55] Efficiently Train ASR Models that Memorize Less and Perform Better with Per-core Clipping
[NLP-55] 通过逐核裁剪高效训练记忆更少、性能更好的ASR模型

链接: https://arxiv.org/abs/2406.02004
作者: Lun Wang,Om Thakkar,Zhong Meng,Nicole Rafidi,Rohit Prabhavalkar,Arun Narayanan
关键词: automatic speech recognition, large-scale automatic speech, training large-scale automatic, speech recognition, Gradient clipping plays
中文关键词: 自动语音识别,大规模自动语音,训练大规模自动,语音识别,梯度剪辑播放
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

Abstract:Gradient clipping plays a vital role in training large-scale automatic speech recognition (ASR) models. It is typically applied to minibatch gradients to prevent gradient explosion, and to the individual sample gradients to mitigate unintended memorization. This work systematically investigates the impact of a specific granularity of gradient clipping, namely per-core clipping (PCC), across training a wide range of ASR models. We empirically demonstrate that PCC can effectively mitigate unintended memorization in ASR models. Surprisingly, we find that PCC positively influences ASR performance metrics, leading to improved convergence rates and reduced word error rates. To avoid tuning the additional hyperparameter introduced by PCC, we further propose a novel variant, adaptive per-core clipping (APCC), for streamlined optimization. Our findings highlight the multifaceted benefits of PCC as a strategy for robust, privacy-forward ASR model training.
摘要:梯度裁剪在训练大规模自动语音识别(ASR)模型中发挥着至关重要的作用。它通常应用于小批量梯度以防止梯度爆炸,并应用于单个样本梯度以减轻非预期的记忆。这项工作系统地研究了一种特定粒度的梯度裁剪,即逐核裁剪(PCC),在训练多种ASR模型时的影响。我们通过实验证明,PCC可以有效减轻ASR模型中的非预期记忆。令人惊讶的是,我们发现PCC对ASR性能指标有积极影响,提高了收敛速度并降低了词错误率。为了避免调整PCC引入的额外超参数,我们进一步提出了一种新的变体,即自适应逐核裁剪(APCC),以简化优化。我们的研究结果凸显了PCC作为一种稳健且注重隐私的ASR模型训练策略的多方面优势。

[NLP-56] Position Debiasing Fine-Tuning for Causal Perception in Long-Term Dialogue
[NLP-56] 面向长期对话因果感知的位置去偏微调

链接: https://arxiv.org/abs/2406.02002
作者: Shixuan Fan,Wei Wei,Wendi Li,Xian-Ling Mao,Wenfeng Xie,Dangyang Chen
关键词: extensive dialogue history, dialogue, relevant, human-like responses based, dialogue system
中文关键词: 广泛的对话历史、对话、相关的、基于类人的反应、对话系统
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to IJCAI 2024

Abstract:The core of the dialogue system is to generate relevant, informative, and human-like responses based on extensive dialogue history. Recently, the dialogue generation domain has seen mainstream adoption of large language models (LLMs), due to their powerful capability in generating utterances. However, there is a natural deficiency for such models, that is, inherent position bias, which may lead them to pay more attention to the nearby utterances instead of causally relevant ones, resulting in generating irrelevant and generic responses in long-term dialogue. To alleviate this problem, in this paper, we propose a novel method, named Causal Perception long-term Dialogue framework (CPD), which employs a perturbation-based causal variable discovery method to extract causally relevant utterances from the dialogue history and enhances model causal perception during fine-tuning. Specifically, a local-position awareness method is proposed in CPD for inter-sentence position correlation elimination, which helps models extract causally relevant utterances based on perturbations. Then, a causal-perception fine-tuning strategy is also proposed, to enhance the capability of discovering the causal invariant factors, by differently perturbing causally relevant and non-causally relevant ones for response generation. Experimental results on two datasets prove that our proposed method can effectively alleviate the position bias for multiple LLMs and achieve significant progress compared with existing baselines.
摘要:对话系统的核心是基于大量的对话历史生成相关的、信息丰富的、类似人类的回复。近年来,由于大型语言模型(LLM)强大的话语生成能力,对话生成领域已将其作为主流方案。然而,这类模型存在一个天然的缺陷,即固有的位置偏差,这可能导致它们更多地关注邻近的话语,而不是因果相关的话语,从而在长期对话中生成不相关的、泛泛的回复。为了缓解这一问题,本文提出了一种新的方法,称为因果感知长期对话框架(CPD),该方法采用基于扰动的因果变量发现方法从对话历史中提取因果相关的话语,并在微调过程中增强模型的因果感知。具体地说,CPD中提出了一种局部位置感知方法来消除句间位置相关性,帮助模型基于扰动提取因果相关的话语。此外,还提出了一种因果感知微调策略,通过对因果相关和非因果相关的话语施加不同的扰动来生成回复,从而提高发现因果不变因素的能力。在两个数据集上的实验结果表明,该方法可以有效缓解多个LLM的位置偏差,与现有基线相比取得了显著的进步。

[NLP-57] Personalized Topic Selection Model for Topic-Grounded Dialogue
[NLP-57] 面向主题对话的个性化主题选择模型

链接: https://arxiv.org/abs/2406.01988
作者: Shixuan Fan,Wei Wei,Xiaofei Wen,Xianling Mao,Jixiong Chen,Dangyang Chen
关键词: accomplish specific tasks, actively guide users, topic-guided conversations, increasingly popular
中文关键词: 完成特定任务,积极引导用户,主题引导对话,越来越受欢迎
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to ACL 2024 Findings

Abstract:Recently, the topic-grounded dialogue (TGD) system has become increasingly popular owing to its powerful capability to actively guide users to accomplish specific tasks through topic-guided conversations. Most existing works utilize side information (e.g., topics or personas) in isolation to enhance the topic selection ability. However, due to disregarding the noise within these auxiliary information sources and their mutual influence, current models tend to predict user-uninteresting and contextually irrelevant topics. To build a user-engaging and coherent dialogue agent, we propose a Personalized topic sElection model for Topic-grounded Dialogue, named PETD, which takes account of the interaction of side information to selectively aggregate such information for more accurately predicting subsequent topics. Specifically, we evaluate the correlation between global topics and personas and selectively incorporate the global topics aligned with user personas. Furthermore, we propose a contrastive learning based persona selector to filter out irrelevant personas under the constraint of lacking pertinent persona annotations. Throughout the selection and generation, diverse relevant side information is considered. Extensive experiments demonstrate that our proposed method can generate engaging and diverse responses, outperforming state-of-the-art baselines across various evaluation metrics.
摘要:近年来,基于话题的对话(TGD)系统凭借其通过话题引导的对话主动引导用户完成特定任务的强大能力,受到越来越多的关注。现有的大多数工作都孤立地使用辅助信息(例如话题或人物角色)来增强话题选择能力。然而,由于忽略了这些辅助信息源中的噪声及其相互影响,现有的模型倾向于预测用户不感兴趣且与上下文无关的话题。为了构建能够吸引用户且连贯的对话智能体,我们提出了一种面向话题对话的个性化话题选择模型,命名为PETD,它考虑辅助信息之间的相互作用,有选择地聚合这些信息,以更准确地预测后续话题。具体地说,我们评估全局话题和人物角色之间的相关性,并有选择地纳入与用户人物角色一致的全局话题。此外,我们还提出了一种基于对比学习的角色选择器,在缺乏相关角色标注的约束下过滤掉不相关的角色。在整个选择和生成过程中,考虑了各种相关的辅助信息。大量的实验表明,我们提出的方法可以生成引人入胜且多样化的回复,在各种评估指标上优于最先进的基线。

[NLP-58] RKLD: Reverse KL-Divergence-based Knowledge Distillation for Unlearning Personal Information in Large Language Models
[NLP-58] RKLD:基于反向KL散度的知识蒸馏,用于在大型语言模型中遗忘个人信息

链接: https://arxiv.org/abs/2406.01983
作者: Bichen Wang,Yuzhe Zi,Yixin Sun,Yanyan Zhao,Bing Qin
关键词: language model training, large language models, model training datasets, large language, training datasets
中文关键词: 语言模型训练、大型语言模型、模型训练数据集、大型语言、训练数据集
类目: Computation and Language (cs.CL)
备注: Work is in progress

Abstract:With the passage of the Right to Be Forgotten (RTBF) regulations and the scaling up of language model training datasets, research on model unlearning in large language models (LLMs) has become more crucial. Before the era of LLMs, machine unlearning research focused mainly on classification tasks in models with small parameters. In these tasks, the content to be forgotten or retained is clear and straightforward. However, as parameter sizes have grown and tasks have become more complex, balancing forget quality and model utility has become more challenging, especially in scenarios involving personal data instead of classification results. Existing methods based on gradient ascent and its variants often struggle with this balance, leading to unintended information loss or partial forgetting. To address this challenge, we propose RKLD, a novel Reverse KL-Divergence-based Knowledge Distillation unlearning algorithm for LLMs targeting the unlearning of personal information. Through RKLD, we achieve significant forget quality and effectively maintain the model utility in our experiments.
摘要:随着“被遗忘权”(RTBF)法规的通过和语言模型训练数据集的扩大,大型语言模型(LLM)中模型遗忘的研究变得更加重要。在LLM时代之前,机器遗忘的研究主要集中在小参数量模型的分类任务上。在这些任务中,要遗忘或保留的内容是明确和直接的。然而,随着参数规模的增加和任务变得更加复杂,平衡遗忘质量和模型效用变得更具挑战性,特别是在涉及个人数据而不是分类结果的场景中。现有的基于梯度上升及其变体的方法常常难以取得这种平衡,导致非预期的信息丢失或遗忘不彻底。为了应对这一挑战,我们提出了RKLD,一种新的基于反向KL散度的知识蒸馏遗忘算法,面向LLM中个人信息的遗忘。通过RKLD,我们在实验中取得了显著的遗忘质量,并有效地保持了模型效用。
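The abstract names reverse KL divergence as the distillation objective without writing it out. As a minimal sketch, assuming the usual distillation convention in which forward KL is KL(teacher ‖ student) and reverse KL is KL(student ‖ teacher), the two quantities can be compared on toy next-token distributions. The distributions and function names below are made up for illustration and are not taken from the paper.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same support.

    Requires q[i] > 0 wherever p[i] > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def forward_kl(teacher, student):
    """KL(teacher || student): penalizes the student for missing teacher mass."""
    return kl_divergence(teacher, student)


def reverse_kl(teacher, student):
    """KL(student || teacher): penalizes student mass where the teacher has little."""
    return kl_divergence(student, teacher)


# Toy next-token distributions over a three-token vocabulary (illustrative only).
teacher = [0.7, 0.2, 0.1]
student = [0.4, 0.4, 0.2]
fwd = forward_kl(teacher, student)
rev = reverse_kl(teacher, student)
```

The two directions generally give different values because KL divergence is asymmetric, which is why the choice of direction changes what the student is pushed to do.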

[NLP-59] Zyda: A 1.3T Dataset for Open Language Modeling
[NLP-59] Zyda:用于开放语言建模的1.3T数据集

链接: https://arxiv.org/abs/2406.01981
作者: Yury Tokpanov,Beren Millidge,Paolo Glorioso,Jonathan Pilault,Adam Ibrahim,James Whittington,Quentin Anthony
关键词: large language models, surged correspondingly, scaled dramatically, dramatically in recent, recent years
中文关键词: 大型语言模型、相应激增、急剧扩大、近年来急剧、近年来
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

Abstract:The size of large language models (LLMs) has scaled dramatically in recent years and their computational and data requirements have surged correspondingly. State-of-the-art language models, even at relatively smaller sizes, typically require training on at least a trillion tokens. This rapid advancement has eclipsed the growth of open-source datasets available for large-scale LLM pretraining. In this paper, we introduce Zyda (Zyphra Dataset), a dataset under a permissive license comprising 1.3 trillion tokens, assembled by integrating several major respected open-source datasets into a single, high-quality corpus. We apply rigorous filtering and deduplication processes, both within and across datasets, to maintain and enhance the quality derived from the original datasets. Our evaluations show that Zyda not only competes favorably with other open datasets like Dolma, FineWeb, and RefinedWeb, but also substantially improves the performance of comparable models from the Pythia suite. Our rigorous data processing methods significantly enhance Zyda’s effectiveness, outperforming even the best of its constituent datasets when used independently.
摘要:近年来,大型语言模型(LLM)的规模急剧扩大,其计算和数据需求也相应激增。最先进的语言模型,即使规模相对较小,通常也需要在至少一万亿个token上进行训练。这一快速发展使可用于大规模LLM预训练的开源数据集的增长相形见绌。在本文中,我们介绍了Zyda(Zyphra Dataset),这是一个采用宽松许可证的数据集,包含1.3万亿个token,通过将几个主要的知名开源数据集整合成一个单一的高质量语料库而构建。我们在数据集内部和数据集之间应用严格的过滤和去重流程,以保持和提高源自原始数据集的质量。我们的评估表明,Zyda不仅在与Dolma、FineWeb和RefinedWeb等其他开放数据集的比较中具有优势,而且显著提高了Pythia套件中同类模型的性能。我们严格的数据处理方法显著提高了Zyda的有效性,其表现甚至优于单独使用时最好的组成数据集。

[NLP-60] Conditional Language Learning with Context
[NLP-60] 有语境的条件语言学习

链接: https://arxiv.org/abs/2406.01976
作者: Xiao Zhang,Miao Li,Ji Wu
关键词: fitting raw text, sophisticated language understanding, language understanding skills, raw text, learn sophisticated language
中文关键词: 适合原始文本,复杂的语言理解,语言理解技能,原始文本,学习复杂的语言
类目: Computation and Language (cs.CL)
备注: To appear at the 41st International Conference on Machine Learning (ICML 2024)

Abstract:Language models can learn sophisticated language understanding skills from fitting raw text. They also unselectively learn useless corpus statistics and biases, especially during finetuning on domain-specific corpora. In this paper, we propose a simple modification to causal language modeling called conditional finetuning, which performs language modeling conditioned on a context. We show that a context can “explain away” certain corpus statistics and make the model avoid learning them. In this fashion, conditional finetuning achieves selective learning from a corpus, learning knowledge useful for downstream tasks while avoiding learning useless corpus statistics like topic biases. This selective learning effect leads to less forgetting and better stability-plasticity tradeoff in domain finetuning, potentially benefitting lifelong learning with language models.
摘要:语言模型可以通过拟合原始文本学习复杂的语言理解技能。它们也会不加选择地学习无用的语料库统计信息和偏差,尤其是在特定领域语料库上进行微调期间。在本文中,我们提出了一种对因果语言建模的简单修改,称为条件微调,它以上下文为条件进行语言建模。我们表明,上下文可以“解释掉”(explain away)某些语料库统计信息,使模型避免学习它们。通过这种方式,条件微调实现了对语料库的选择性学习,既学习对下游任务有用的知识,又避免学习主题偏差等无用的语料库统计信息。这种选择性学习效应可以减少领域微调中的遗忘,并带来更好的稳定性-可塑性权衡,可能有利于语言模型的终身学习。
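The abstract describes conditional finetuning only as language modeling conditioned on a context. One way this might be set up, sketched below under that assumption, is to prepend context tokens to each training text and exclude them from the loss, so the model is trained on the text given the context but never on the context itself. The sentinel `IGNORE_INDEX`, the helper names and the toy token ids are hypothetical; the value -100 merely mirrors a common default for ignored labels in deep-learning toolkits.

```python
IGNORE_INDEX = -100  # hypothetical sentinel marking positions excluded from the loss


def build_example(context_ids, text_ids):
    """Concatenate context and text; only text positions keep real labels."""
    input_ids = list(context_ids) + list(text_ids)
    labels = [IGNORE_INDEX] * len(context_ids) + list(text_ids)
    return input_ids, labels


def masked_mean_nll(token_nlls, labels):
    """Average per-token negative log-likelihood over unmasked positions only."""
    kept = [nll for nll, lab in zip(token_nlls, labels) if lab != IGNORE_INDEX]
    if not kept:
        raise ValueError("no unmasked positions to average over")
    return sum(kept) / len(kept)


# Toy ids: two context tokens followed by three text tokens.
input_ids, labels = build_example([1, 2], [3, 4, 5])
# Pretend per-token losses; the large values on the context are ignored.
loss = masked_mean_nll([9.0, 9.0, 1.0, 2.0, 3.0], labels)
```

Standard causal language modeling would average over all five positions; masking the context is what makes the objective conditional in this sketch.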

[NLP-61] Bileve: Securing Text Provenance in Large Language Models Against Spoofing with Bi-level Signature
[NLP-61] Bileve:通过双层签名保护大型语言模型中的文本出处,防止欺骗

Link: https://arxiv.org/abs/2406.01946
Authors: Tong Zhou, Xuandong Zhao, Xiaolin Xu, Shaolei Ren
Keywords: large language models, harmful content, forge harmful content, language models, machine-generated content
中文关键词: 大型语言模型、有害内容、伪造有害内容、语言模型、机器生成内容
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments:

Abstract:Text watermarks for large language models (LLMs) have been commonly used to identify the origins of machine-generated content, which is promising for assessing liability when combating deepfake or harmful content. While existing watermarking techniques typically prioritize robustness against removal attacks, unfortunately, they are vulnerable to spoofing attacks: malicious actors can subtly alter the meanings of LLM-generated responses or even forge harmful content, potentially misattributing blame to the LLM developer. To overcome this, we introduce a bi-level signature scheme, Bileve, which embeds fine-grained signature bits for integrity checks (mitigating spoofing attacks) as well as a coarse-grained signal to trace text sources when the signature is invalid (enhancing detectability) via a novel rank-based sampling strategy. Compared to conventional watermark detectors that only output binary results, Bileve can differentiate 5 scenarios during detection, reliably tracing text provenance and regulating LLMs. The experiments conducted on OPT-1.3B and LLaMA-7B demonstrate the effectiveness of Bileve in defeating spoofing attacks with enhanced detectability.
摘要:大型语言模型(LLM)的文本水印通常被用来识别机器生成内容的来源,这有望在打击深度伪造或有害内容时用于认定责任。虽然现有的水印技术通常优先考虑对移除攻击的鲁棒性,但遗憾的是,它们容易受到伪造(spoofing)攻击:恶意行为者可以巧妙地更改LLM生成的响应的含义,甚至伪造有害内容,从而可能将责任错误地归咎于LLM开发者。为了克服这一问题,我们提出了一种双层签名方案Bileve,它通过一种新颖的基于排序的采样策略,嵌入细粒度的签名比特用于完整性检查(缓解伪造攻击),同时嵌入粗粒度的信号以便在签名无效时追踪文本来源(增强可检测性)。与只输出二元结果的传统水印检测器相比,Bileve在检测过程中可以区分5种场景,从而可靠地追踪文本来源并规范LLM。在OPT-1.3B和LLaMA-7B上进行的实验证明了Bileve在抵御伪造攻击方面的有效性,并且可检测性有所增强。

[NLP-62] Enhancing Trust in LLMs: Algorithms for Comparing and Interpreting LLMs
[NLP-62] 增强对LLM的信任:比较和解释LLM的算法

Link: https://arxiv.org/abs/2406.01943
Authors: Nik Bear Brown
Keywords: Large Language Models, Large Language, Language Models, Word Error Rate, Character Error Rate
中文关键词: 大型语言模型、大型语言、语言模型、词错误率、字符错误率
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: An extensive survey of the literature specifying algorithms and techniques enhancing the trustworthiness and understanding of Large Language Models (LLMs)

Abstract:This paper surveys evaluation techniques to enhance the trustworthiness and understanding of Large Language Models (LLMs). As reliance on LLMs grows, ensuring their reliability, fairness, and transparency is crucial. We explore algorithmic methods and metrics to assess LLM performance, identify weaknesses, and guide development towards more trustworthy applications. Key evaluation metrics include Perplexity Measurement, NLP metrics (BLEU, ROUGE, METEOR, BERTScore, GLEU, Word Error Rate, Character Error Rate), Zero-Shot and Few-Shot Learning Performance, Transfer Learning Evaluation, Adversarial Testing, and Fairness and Bias Evaluation. We introduce innovative approaches like LLMMaps for stratified evaluation, Benchmarking and Leaderboards for competitive assessment, Stratified Analysis for in-depth understanding, Visualization of Bloom’s Taxonomy for cognitive level accuracy distribution, Hallucination Score for quantifying inaccuracies, Knowledge Stratification Strategy for hierarchical analysis, and Machine Learning Models for Hierarchy Generation. Human Evaluation is highlighted for capturing nuances that automated metrics may miss. These techniques form a framework for evaluating LLMs, aiming to enhance transparency, guide development, and establish user trust. Future papers will describe metric visualization and demonstrate each approach on practical examples.
摘要:本文综述了用于提高大型语言模型(LLM)可信度和可理解性的评估技术。随着对LLM的依赖日益增加,确保其可靠性、公平性和透明度至关重要。我们探讨了用于评估LLM性能、发现弱点并引导开发更值得信赖的应用的算法方法和指标。关键评估指标包括困惑度度量、自然语言处理指标(BLEU、ROUGE、METEOR、BERTScore、GLEU、词错误率、字符错误率)、零样本和少样本学习性能、迁移学习评估、对抗性测试以及公平性和偏见评估。我们介绍了一些创新方法,如用于分层评估的LLMMaps、用于竞争性评估的基准测试和排行榜、用于深入理解的分层分析、用于展示各认知层级准确率分布的布鲁姆分类法(Bloom’s Taxonomy)可视化、用于量化不准确性的幻觉分数、用于层次分析的知识分层策略,以及用于层次生成的机器学习模型。本文还强调了人工评估,因为它能捕捉自动化指标可能遗漏的细微差别。这些技术构成了一个评估LLM的框架,旨在提高透明度、指导开发并建立用户信任。后续论文将介绍指标可视化,并在实际示例中演示每种方法。
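
Two of the metrics listed in this abstract, Word Error Rate and Character Error Rate, are both edit distances normalised by the reference length. The sketch below shows the standard Levenshtein-based definition; the function names are illustrative and the survey itself may use a different tokenisation or normalisation.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def char_error_rate(reference, hypothesis):
    """Character-level edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
cer = char_error_rate("kitten", "sitting")
print(wer, cer)
```

Both rates can exceed 1.0 when the hypothesis is much longer than the reference, so they are error rates rather than accuracies.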

[NLP-63] Process-Driven Autoformalization in Lean 4
[NLP-63] Lean 4中过程驱动的自动形式化

Link: https://arxiv.org/abs/2406.01940
Authors: Jianqiao Lu, Zhengying Liu, Yingjia Wan, Yinya Huang, Haiming Wang, Zhicheng Yang, Jing Tang, Zhijiang Guo
Keywords: advancing mathematical reasoning, natural language mathematics, offers significant potential, mathematical reasoning
中文关键词: 推进数学推理、自然语言数学、提供巨大潜力、数学推理
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
Comments: 22 pages, 1 figure, 11 tables

Abstract:Autoformalization, the conversion of natural language mathematics into formal languages, offers significant potential for advancing mathematical reasoning. However, existing efforts are limited to formal languages with substantial online corpora and struggle to keep pace with rapidly evolving languages like Lean 4. To bridge this gap, we propose a new benchmark, Formalization for Lean 4, designed to evaluate the autoformalization capabilities of large language models (LLMs). This benchmark encompasses a comprehensive assessment of questions, answers, formal statements, and proofs. Additionally, we introduce a Process-Supervised Verifier (PSV) model that leverages the precise feedback from Lean 4 compilers to enhance autoformalization. Our experiments demonstrate that the PSV method improves autoformalization, enabling higher accuracy using less filtered training data. Furthermore, when fine-tuned with data containing detailed process information, PSV can leverage the data more effectively, leading to more significant improvements in autoformalization for Lean 4. Our dataset and code are available at this https URL.
摘要:自动形式化是指将自然语言数学转换为形式语言,它为推进数学推理提供了巨大的潜力。然而,现有的工作仅限于拥有大量在线语料的形式语言,难以跟上Lean 4这类快速发展的语言的步伐。为了弥补这一差距,我们提出了一个新的基准Formalization for Lean 4,旨在评估大型语言模型(LLM)的自动形式化能力。该基准涵盖对问题、答案、形式化陈述和证明的全面评估。此外,我们还引入了一个过程监督验证器(Process-Supervised Verifier,PSV)模型,它利用来自Lean 4编译器的精确反馈来增强自动形式化。我们的实验表明,PSV方法改进了自动形式化,使用经过较少过滤的训练数据即可获得更高的准确率。此外,在使用包含详细过程信息的数据进行微调时,PSV可以更有效地利用数据,从而在Lean 4的自动形式化方面取得更显著的改进。我们的数据集和代码可在此https URL获取。

[NLP-64] Optimal Transport Guided Correlation Assignment for Multimodal Entity Linking
[NLP-64] 面向多模态实体链接的最优传输引导关联分配

Link: https://arxiv.org/abs/2406.01934
Authors: Zefeng Zhang, Jiawei Sheng, Chuang Zhang, Yunzhi Liang, Wenyuan Zhang, Siqi Wang, Tingwen Liu
Keywords: Multimodal Entity Linking, Entity Linking, link ambiguous mentions, Multimodal Entity, aims to link
中文关键词: 多模态实体链接,实体链接,链接歧义提及,多模态实体,旨在链接
Categories: Computation and Language (cs.CL)
Comments: Findings of ACL 2024

Abstract:Multimodal Entity Linking (MEL) aims to link ambiguous mentions in multimodal contexts to entities in a multimodal knowledge graph. A pivotal challenge is to fully leverage multi-element correlations between mentions and entities to bridge the modality gap and enable fine-grained semantic matching. Existing methods attempt several local correlative mechanisms, relying heavily on automatically learned attention weights, which may over-concentrate on partial correlations. To mitigate this issue, we formulate the correlation assignment problem as an optimal transport (OT) problem, and propose a novel MEL framework, namely OT-MEL, with OT-guided correlation assignment. Thereby, we exploit the correlation between multimodal features to enhance multimodal fusion, and the correlation between mentions and entities to enhance fine-grained matching. To accelerate model prediction, we further leverage knowledge distillation to transfer OT assignment knowledge to the attention mechanism. Experimental results show that our model significantly outperforms previous state-of-the-art baselines and confirm the effectiveness of the OT-guided correlation assignment.
摘要:多模态实体链接(MEL)旨在将多模态上下文中的歧义提及链接到多模态知识图谱中的实体。一个关键挑战是充分利用提及与实体之间的多元素关联,以弥合模态鸿沟并实现细粒度的语义匹配。现有方法尝试了若干局部关联机制,但严重依赖自动学习的注意力权重,而这些权重可能过度集中于部分关联。为了缓解这一问题,我们将关联分配问题表述为最优传输(OT)问题,并提出了一种采用OT引导关联分配的新型MEL框架,即OT-MEL。由此,我们利用多模态特征之间的关联来增强多模态融合,并利用提及与实体之间的关联来增强细粒度匹配。为了加速模型预测,我们进一步利用知识蒸馏将OT分配知识迁移到注意力机制中。实验结果表明,我们的模型显著优于此前最先进的基线,并证实了OT引导关联分配的有效性。
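
The abstract frames correlation assignment as an optimal transport problem but does not name a solver. As a generic illustration only, the sketch below uses Sinkhorn iterations for entropy-regularised OT, a common choice that is assumed here rather than confirmed by the paper; the cost matrix and histograms are made-up toy values.

```python
import math


def sinkhorn(cost, a, b, reg=0.5, n_iters=200):
    """Entropy-regularised transport plan between histograms a and b.

    cost: n x m list of pairwise costs (e.g. 1 - similarity).
    a, b: non-negative weights summing to the same total.
    reg:  regularisation strength; larger values give a smoother plan.
    """
    n, m = len(a), len(b)
    # Gibbs kernel derived from the cost matrix.
    kernel = [[math.exp(-cost[i][j] / reg) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Rescale rows to match a, then columns to match b.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Toy example: 2 mention-side features and 3 entity-side features.
cost = [[0.0, 1.0, 1.0],
        [1.0, 0.0, 0.5]]
a = [0.5, 0.5]
b = [0.4, 0.3, 0.3]
plan = sinkhorn(cost, a, b)
for row in plan:
    print([round(x, 3) for x in row])
```

Each entry `plan[i][j]` can be read as a soft assignment weight between element `i` and element `j`. Unlike independently learned attention weights, its row and column sums are constrained to the given marginals.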

[NLP-65] Dishonesty in Helpful and Harmless Alignment
[NLP-65] 有益且无害对齐中的不诚实

Link: https://arxiv.org/abs/2406.01931
Authors: Youcheng Huang, Jingkun Tang, Duanyu Feng, Zheng Zhang, Wenqiang Lei, Jiancheng Lv, Anthony G. Cohn
Keywords: People tell lies, seeking rewards, People, Large language models, satisfy human preference
中文关键词: 人们撒谎,寻求回报,人们,大型语言模型,满足人类偏好
Categories: Computation and Language (cs.CL)
Comments:

Abstract:People tell lies when seeking rewards. Large language models (LLMs) are aligned to human values with reinforcement learning, where they get rewards if they satisfy human preferences. We find that this also induces dishonesty in helpful and harmless alignment, where LLMs tell lies in generating harmless responses. Using the latest interpreting tools, we detect dishonesty, show how LLMs can be harmful if their honesty is increased, and analyze such conflicts at the parameter level. Given these preliminaries and the hypothesis that reward-seeking stimulates dishonesty, we theoretically show that dishonesty can in turn decrease alignment performance, and we augment reward-seeking alignment with representation regularization. Extensive results, including GPT-4 annotated win rates, perplexities, and case studies, demonstrate that we can train more honest, helpful, and harmless LLMs. We will open-source all our code and results upon this paper’s acceptance.
摘要:人们在寻求回报时会撒谎。大型语言模型(LLM)通过强化学习与人类价值观对齐,只要满足人类偏好就会获得奖励。我们发现,这也会在有益且无害的对齐中诱发不诚实,即LLM在生成无害回答时撒谎。我们使用最新的可解释性工具检测不诚实行为,展示了LLM的诚实度提高后可能变得有害,并在参数层面分析了此类冲突。基于这些初步结果以及“追求奖励会刺激不诚实”的假设,我们从理论上表明,不诚实反过来会降低对齐性能,并通过表示正则化来增强以追求奖励为目标的对齐。包括GPT-4标注的胜率、困惑度和案例研究在内的大量结果表明,我们可以训练出更诚实、更有帮助且更无害的LLM。本文被接收后,我们将开源所有代码和结果。

[NLP-66] OTTAWA: Optimal TransporT Adaptive Word Aligner for Hallucination and Omission Translation Errors Detection
[NLP-66] OTTAWA:用于幻觉与遗漏翻译错误检测的最优传输自适应词对齐器

Link: https://arxiv.org/abs/2406.01919
Authors: Chenyang Huang, Abbas Ghaddar, Ivan Kobyzev, Mehdi Rezagholizadeh, Osmar R. Zaiane, Boxing Chen
Keywords: Machine Translation, omissions in Machine, system internal states, considerable attention, attention on detecting
中文关键词: 机器翻译、机器中的遗漏、系统内部状态、相当关注、注意检测
Categories: Computation and Language (cs.CL)
Comments: Accepted by ACL 2024 Findings

Abstract:Recently, there has been considerable attention on detecting hallucinations and omissions in Machine Translation (MT) systems. The two dominant approaches to tackle this task involve analyzing the MT system’s internal states or relying on the output of external tools, such as sentence similarity or MT quality estimators. In this work, we introduce OTTAWA, a novel Optimal Transport (OT)-based word aligner specifically designed to enhance the detection of hallucinations and omissions in MT systems. Our approach explicitly models the missing alignments by introducing a “null” vector, for which we propose a novel one-side constrained OT setting to allow an adaptive null alignment. Our approach yields competitive results compared to state-of-the-art methods across 18 language pairs on the HalOmi benchmark. In addition, it shows promising features, such as the ability to distinguish between both error types and perform word-level detection without accessing the MT system’s internal states.
摘要:近来,检测机器翻译(MT)系统中的幻觉和遗漏受到了广泛关注。解决这一任务的两种主流方法是分析MT系统的内部状态,或依赖外部工具(如句子相似度或MT质量估计器)的输出。在这项工作中,我们介绍了OTTAWA,一种新的基于最优传输(OT)的词对齐器,专为增强对MT系统中幻觉和遗漏的检测而设计。我们的方法通过引入一个“空”(null)向量来显式地对缺失的对齐进行建模,并为此提出了一种新的单边约束OT设置,以实现自适应的空对齐。在HalOmi基准的18个语言对上,我们的方法取得了与最先进方法相当的结果。此外,它还展现出一些很有前景的特性,例如能够区分两种错误类型,并且无需访问MT系统的内部状态即可进行词级检测。

[NLP-67] HPE-CogVLM: New Head Pose Grounding Task Exploration on Vision Language Model
[NLP-67] HPE-CogVLM:视觉语言模型上的新头部姿态定位任务探索

Link: https://arxiv.org/abs/2406.01914
Authors: Yu Tian, Tianqi Shao, Tsukasa Demizu, Xuyang Wu, Hsin-Tai Wu
Keywords: roll Euler angles, Head pose estimation, precise numerical output, Euler angles, roll Euler
中文关键词: 翻滚欧拉角、头部姿态估计、精确数值输出、欧拉角、翻滚欧拉
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:The head pose estimation (HPE) task requires a sophisticated understanding of 3D spatial relationships and precise numerical output of yaw, pitch, and roll Euler angles. Previous HPE studies are mainly based on non-large language models (Non-LLMs), which rely on close-up human heads cropped from the full image as inputs and lack robustness in real-world scenarios. In this paper, we present a novel framework to enhance the HPE prediction task by leveraging the visual grounding capability of CogVLM. CogVLM is a vision language model (VLM) with grounding capability of predicting object bounding boxes (BBoxes), which enables HPE training and prediction using full image information input. To integrate the HPE task into the VLM, we first cope with the catastrophic forgetting problem in large language models (LLMs) by investigating the rehearsal ratio in the data rehearsal method. Then, we propose and validate a LoRA layer-based model merging method, which keeps the integrity of parameters, to enhance the HPE performance in the framework. The results show our HPE-CogVLM achieves a 31.5% reduction in Mean Absolute Error for HPE prediction over the current Non-LLM based state-of-the-art in cross-dataset evaluation. Furthermore, we compare our LoRA layer-based model merging method with LoRA fine-tuning only and other merging methods in CogVLM. The results demonstrate our framework outperforms them in all HPE metrics.
摘要:头部姿态估计(HPE)任务需要对三维空间关系有深入的理解,并需要精确输出偏航、俯仰和翻滚欧拉角的数值。以往的HPE研究主要基于非大型语言模型(Non-LLM),这些模型依赖于从完整图像中裁剪出的人头特写作为输入,在真实场景中缺乏鲁棒性。在本文中,我们提出了一种新的框架,利用CogVLM的视觉定位(grounding)能力来增强HPE预测任务。CogVLM是一种视觉语言模型(VLM),具有预测目标边界框(BBox)的定位能力,这使得利用完整图像信息作为输入进行HPE训练和预测成为可能。为了将HPE任务集成到VLM中,我们首先通过研究数据回放(rehearsal)方法中的回放比例来应对大型语言模型(LLM)中的灾难性遗忘问题。然后,我们提出并验证了一种基于LoRA层的模型合并方法,该方法保持了参数的完整性,从而提高了框架中的HPE性能。结果表明,在跨数据集评估中,我们的HPE-CogVLM在HPE预测上的平均绝对误差比当前基于Non-LLM的最先进方法降低了31.5%。此外,我们将基于LoRA层的模型合并方法与仅使用LoRA微调的方法以及CogVLM中的其他合并方法进行了比较。结果表明,我们的框架在所有HPE指标上都优于它们。

[NLP-68] Explicitly Encoding Structural Symmetry is Key to Length Generalization in Arithmetic Tasks
[NLP-68] 显式编码结构对称性是算术任务中长度泛化的关键

Link: https://arxiv.org/abs/2406.01895
Authors: Mahdi Sabbaghi, George Pappas, Hamed Hassani, Surbhi Goel
Keywords: basic arithmetic tasks, code generation, language understanding, logical reasoning, basic arithmetic
中文关键词: 基本算术任务、代码生成、语言理解、逻辑推理、基本算术
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)
Comments: 32 pages, 16 figures

Abstract:Despite the success of Transformers on language understanding, code generation, and logical reasoning, they still fail to generalize over length on basic arithmetic tasks such as addition and multiplication. A major reason behind this failure is the vast difference in structure between numbers and text; for example, numbers are typically parsed from right to left, and there is a correspondence between digits at the same position across different numbers. In contrast, for text, such symmetries are quite unnatural. In this work, we propose to encode these semantics explicitly into the model via modified number formatting and custom positional encodings. Empirically, our method allows a Transformer trained on numbers with at most 5 digits for addition and multiplication to generalize up to 50-digit numbers, without using additional data for longer sequences. We further demonstrate that traditional absolute positional encodings (APE) fail to generalize to longer sequences, even when trained with augmented data that captures task symmetries. To elucidate the importance of explicitly encoding structure, we prove that explicit incorporation of structure via positional encodings is necessary for out-of-distribution generalization. Finally, we pinpoint other challenges inherent to length generalization beyond capturing symmetries, in particular complexity of the underlying task, and propose changes in the training distribution to address them.
摘要:尽管Transformer在语言理解、代码生成和逻辑推理方面取得了成功,但它们在加法和乘法等基本算术任务上仍然无法实现长度泛化。这一失败背后的一个主要原因是数字和文本在结构上的巨大差异;例如,数字通常是从右向左解析的,并且不同数字中处于相同位置的数位之间存在对应关系。相比之下,对于文本来说,这样的对称性很不自然。在这项工作中,我们提出通过修改数字格式和定制位置编码将这些语义显式地编码到模型中。实验表明,我们的方法使在最多5位数的加法和乘法上训练的Transformer能够泛化到50位数,而无需为更长的序列使用额外数据。我们进一步证明,传统的绝对位置编码(APE)无法泛化到更长的序列,即使使用能刻画任务对称性的增强数据进行训练也是如此。为了阐明显式编码结构的重要性,我们证明了通过位置编码显式引入结构对于分布外泛化是必要的。最后,我们指出了除刻画对称性之外长度泛化所固有的其他挑战,特别是底层任务的复杂性,并提出通过改变训练分布来应对这些挑战。

[NLP-69] Bi-DCSpell: A Bi-directional Detector-Corrector Interactive Framework for Chinese Spelling Check
[NLP-69] Bi-DCSpell:用于中文拼写检查的双向检测器-纠正器交互框架

Link: https://arxiv.org/abs/2406.01879
Authors: Haiming Wu, Hanqing Zhang, Richeng Xuan, Dawei Song
Keywords: Chinese Spelling Check, Spelling Check, Chinese Spelling, correct potentially misspelled, potentially misspelled characters
中文关键词: 中文拼写检查、拼写检查、中文拼写、纠正可能拼错的、可能拼错的字符
Categories: Computation and Language (cs.CL)
Comments: 12 pages, 6 figures

Abstract:Chinese Spelling Check (CSC) aims to detect and correct potentially misspelled characters in Chinese sentences. Naturally, it involves the detection and correction subtasks, which interact with each other dynamically. Such interactions are bi-directional, i.e., the detection result would help reduce the risk of over-correction and under-correction while the knowledge learnt from correction would help prevent false detection. Current CSC approaches are of two types: correction-only or single-directional detection-to-correction interactive frameworks. Nonetheless, they overlook the bi-directional interactions between detection and correction. This paper aims to fill the gap by proposing a Bi-directional Detector-Corrector framework for CSC (Bi-DCSpell). Notably, Bi-DCSpell contains separate detection and correction encoders, followed by a novel interactive learning module facilitating bi-directional feature interactions between detection and correction to improve each other’s representation learning. Extensive experimental results demonstrate a robust correction performance of Bi-DCSpell on widely used benchmarking datasets while possessing a satisfactory detection ability.
摘要:中文拼写检查(CSC)旨在检测和纠正中文句子中可能拼写错误的字符。它自然包含检测和纠正两个子任务,二者动态地相互作用。这种交互是双向的,即检测结果有助于降低过度纠正和纠正不足的风险,而从纠正中学到的知识有助于防止误检。当前的CSC方法有两类:仅纠正的框架,或从检测到纠正的单向交互框架。然而,它们忽略了检测与纠正之间的双向交互。为了填补这一空白,本文提出了一种面向CSC的双向检测器-纠正器框架(Bi-DCSpell)。值得注意的是,Bi-DCSpell包含独立的检测编码器和纠正编码器,其后是一个新的交互学习模块,该模块促进检测与纠正之间的双向特征交互,以改进彼此的表示学习。大量实验结果表明,Bi-DCSpell在广泛使用的基准数据集上具有稳健的纠正性能,同时具备令人满意的检测能力。

[NLP-70] GRAM: Generative Retrieval Augmented Matching of Data Schemas in the Context of Data Security
[NLP-70] GRAM:数据安全背景下数据模式的生成式检索增强匹配

Link: https://arxiv.org/abs/2406.01876
Authors: Xuanqing Liu, Luyang Kong, Runhui Wang, Patrick Song, Austin Nevins, Henrik Johnson, Nimish Amlathe, Davor Golac
Keywords: data ingestion process, Schema matching constitutes, contemporary database systems, constitutes a pivotal, pivotal phase
中文关键词: 数据摄取过程、模式匹配构成、当代数据库系统、构成关键、关键阶段
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: KDD 2024 Camera Ready; 11 pages, 8 figures

Abstract:Schema matching constitutes a pivotal phase in the data ingestion process for contemporary database systems. Its objective is to discern pairwise similarities between two sets of attributes, each associated with a distinct data table. This challenge emerges at the initial stages of data analytics, such as when incorporating a third-party table into existing databases to inform business insights. Given its significance in the realm of database systems, schema matching has been under investigation since the 2000s. This study revisits this foundational problem within the context of large language models. Adhering to increasingly stringent data security policies, our focus lies on the zero-shot and few-shot scenarios: the model should analyze only a minimal amount of customer data to execute the matching task, contrasting with the conventional approach of scrutinizing the entire data table. We emphasize that the zero-shot or few-shot assumption is imperative to safeguard the identity and privacy of customer data, even at the potential cost of accuracy. The capability to accurately match attributes under such stringent requirements distinguishes our work from previous literature in this domain.
摘要:模式匹配是当代数据库系统数据摄取过程中的一个关键阶段。它的目标是识别两组属性之间的成对相似性,每组属性各自关联一张不同的数据表。这一挑战出现在数据分析的初始阶段,例如将第三方表并入现有数据库以支持业务洞察时。鉴于模式匹配在数据库系统领域的重要性,自21世纪头十年以来它一直受到研究。本研究在大型语言模型的背景下重新审视了这一基础问题。为遵循日益严格的数据安全策略,我们的重点放在零样本和少样本场景上:模型应当只分析极少量的客户数据来执行匹配任务,这与仔细检查整张数据表的传统方法形成对比。我们强调,为了保护客户数据的身份和隐私,零样本或少样本假设是必不可少的,即使可能以准确性为代价。在如此严格的要求下准确匹配属性的能力,使我们的工作有别于该领域以往的文献。

[NLP-71] CR-UTP: Certified Robustness against Universal Text Perturbations
[NLP-71] CR-UTP:针对通用文本扰动的可认证鲁棒性

Link: https://arxiv.org/abs/2406.01873
Authors: Qian Lou, Xin Liang, Jiaqi Xue, Yancheng Zhang, Rui Xie, Mengxin Zheng
Keywords: Universal Text Perturbations, minor input variations, language model robustness, language model, language prediction
中文关键词: 通用文本扰动、微小输入变化、语言模型鲁棒性、语言模型、语言预测
Categories: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: Accepted by ACL Findings 2024

Abstract:It is imperative to ensure the stability of every prediction made by a language model; that is, a language model’s prediction should remain consistent despite minor input variations, like word substitutions. In this paper, we investigate the problem of certifying a language model’s robustness against Universal Text Perturbations (UTPs), which have been widely used in universal adversarial attacks and backdoor attacks. Existing certified robustness based on random smoothing has shown considerable promise in certifying the input-specific text perturbations (ISTPs), operating under the assumption that any random alteration of a sample’s clean or adversarial words would negate the impact of sample-wise perturbations. However, with UTPs, masking only the adversarial words can eliminate the attack. A naive method is to simply increase the masking ratio and the likelihood of masking attack tokens, but it leads to a significant reduction in both certified accuracy and the certified radius due to input corruption by extensive masking. To solve this challenge, we introduce a novel approach, the superior prompt search method, designed to identify a superior prompt that maintains higher certified accuracy under extensive masking. Additionally, we theoretically motivate why ensembles are a particularly suitable choice as base prompts for random smoothing. We refer to this method as the superior prompt ensembling technique. We also empirically confirm this technique, obtaining state-of-the-art results in multiple settings. These methodologies, for the first time, enable high certified accuracy against both UTPs and ISTPs. The source code of CR-UTP is available at this https URL.
摘要:必须确保语言模型所做的每个预测的稳定性;也就是说,即使输入有微小变化(如词语替换),语言模型的预测也应保持一致。在本文中,我们研究了认证语言模型对通用文本扰动(UTP)的鲁棒性的问题,UTP已被广泛用于通用对抗攻击和后门攻击。现有的基于随机平滑的可认证鲁棒性在认证特定于输入的文本扰动(ISTP)方面展现了相当大的前景,其假设是:对样本中干净词或对抗词的任何随机改动都会抵消逐样本扰动的影响。然而,对于UTP,只有掩蔽对抗词才能消除攻击。一种朴素的方法是直接提高掩蔽率以及掩蔽攻击词元的可能性,但大量掩蔽会破坏输入,导致认证准确率和认证半径都显著降低。为了解决这一挑战,我们引入了一种新方法,即优质提示搜索方法,旨在找到在大量掩蔽下仍能保持较高认证准确率的优质提示。此外,我们从理论上说明了为什么集成特别适合作为随机平滑的基础提示。我们将该方法称为优质提示集成技术。我们还通过实验验证了这一技术,在多种设置下取得了最先进的结果。这些方法首次实现了同时针对UTP和ISTP的高认证准确率。CR-UTP的源代码可在此https URL获取。

[NLP-72] #EpiTwitter: Public Health Messaging During the COVID-19 Pandemic
[NLP-72] #EpiTwitter:COVID-19大流行期间的公共卫生信息

Link: https://arxiv.org/abs/2406.01866
Authors: Ashwin Rao, Nazanin Sabri, Siyi Guo, Louiqa Raschid, Kristina Lerman
Keywords: social media serving, Effective communication, health crises, crises is critical, moral language
中文关键词: 社交媒体服务、有效沟通、健康危机、危机至关重要、道德语言
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
Comments:

Abstract:Effective communication during health crises is critical, with social media serving as a key platform for public health experts (PHEs) to engage with the public. However, it also amplifies pseudo-experts promoting contrarian views. Despite its importance, the role of emotional and moral language in PHEs’ communication during COVID-19 remains underexplored. This study examines how PHEs and pseudo-experts communicated on Twitter during the pandemic, focusing on emotional and moral language and their engagement with political elites. Analyzing tweets from 489 PHEs and 356 pseudo-experts from January 2020 to January 2021, alongside public responses, we identified key priorities and differences in messaging strategy. PHEs prioritize masking, healthcare, education, and vaccines, using positive emotional language like optimism. In contrast, pseudo-experts discuss therapeutics and lockdowns more frequently, employing negative emotions like pessimism and disgust. Negative emotional and moral language tends to drive engagement, but positive language from PHEs fosters positivity in public responses. PHEs exhibit liberal partisanship, expressing more positivity towards liberals and negativity towards conservative elites, while pseudo-experts show conservative partisanship. These findings shed light on the polarization of COVID-19 discourse and underscore the importance of strategic use of emotional and moral language by experts to mitigate polarization and enhance public trust.
摘要:卫生危机期间的有效沟通至关重要,社交媒体是公共卫生专家(PHE)与公众互动的关键平台。然而,它也放大了宣扬反主流观点的伪专家的声音。尽管情感和道德语言很重要,但它在新冠疫情期间PHE沟通中的作用仍未得到充分探索。本研究考察了PHE和伪专家在疫情期间如何在Twitter上进行沟通,重点关注情感和道德语言以及他们与政治精英的互动。通过分析2020年1月至2021年1月期间489名PHE和356名伪专家的推文以及公众的回应,我们确定了传播策略中的关键优先事项和差异。PHE优先关注口罩佩戴、医疗保健、教育和疫苗,并使用乐观等积极的情感语言。相比之下,伪专家更频繁地讨论治疗方法和封锁,并使用悲观和厌恶等负面情绪。消极的情感和道德语言往往会推动参与度,但来自PHE的积极语言会促进公众回应中的积极情绪。PHE表现出自由派的党派倾向,对自由派表达更多积极情绪,对保守派精英表达更多消极情绪,而伪专家则表现出保守派的党派倾向。这些发现揭示了新冠疫情话语的两极分化,并强调了专家策略性地使用情感和道德语言以缓解两极分化、增强公众信任的重要性。

[NLP-73] Towards Effective Time-Aware Language Representation: Exploring Enhanced Temporal Understanding in Language Models
[NLP-73] 迈向有效的时间感知语言表示:探索语言模型中增强的时间理解

Link: https://arxiv.org/abs/2406.01863
Authors: Jiexin Wang, Adam Jatowt, Yi Cai
Keywords: Natural Language Processing, field of Natural, Language Processing, Natural Language, increasingly crucial
中文关键词: 自然语言处理,自然领域,语言处理,自然语言,越来越重要
Categories: Computation and Language (cs.CL)
Comments:

Abstract:In the evolving field of Natural Language Processing, understanding the temporal context of text is increasingly crucial. This study investigates methods to incorporate temporal information during pre-training, aiming to achieve effective time-aware language representation for improved performance on time-related tasks. In contrast to common pre-trained models like BERT, which rely on synchronic document collections such as BookCorpus and Wikipedia, our research introduces BiTimeBERT 2.0, a novel language model pre-trained on a temporal news article collection. BiTimeBERT 2.0 utilizes this temporal news collection, focusing on three innovative pre-training objectives: Time-Aware Masked Language Modeling (TAMLM), Document Dating (DD), and Time-Sensitive Entity Replacement (TSER). Each objective targets a unique aspect of temporal information. TAMLM is designed to enhance the understanding of temporal contexts and relations, DD integrates document timestamps as chronological markers, and TSER focuses on the temporal dynamics of “Person” entities, recognizing their inherent temporal significance. The experimental results consistently demonstrate that BiTimeBERT 2.0 outperforms models like BERT and other existing pre-trained models, achieving substantial gains across a variety of downstream NLP tasks and applications where time plays a pivotal role.
摘要:在不断发展的自然语言处理领域,理解文本的时间语境变得越来越重要。本研究探讨了在预训练中融入时间信息的方法,旨在获得有效的时间感知语言表示,以提升在时间相关任务上的表现。BERT等常见的预训练模型依赖于BookCorpus和Wikipedia等共时(synchronic)文档集,与之不同,我们的研究引入了BiTimeBERT 2.0,这是一个在时序新闻文章集上预训练的新型语言模型。BiTimeBERT 2.0利用这一时序新闻集合,重点关注三个创新的预训练目标:时间感知掩码语言建模(TAMLM)、文档日期判定(DD)和时间敏感实体替换(TSER)。每个目标针对时间信息的一个独特方面。TAMLM旨在增强对时间语境和时间关系的理解,DD将文档时间戳作为时间顺序标记加以整合,而TSER则关注“人物”(Person)实体的时间动态,承认其固有的时间意义。实验结果一致表明,BiTimeBERT 2.0优于BERT及其他现有预训练模型,在时间起关键作用的各种下游NLP任务和应用中取得了显著提升。

[NLP-74] Eliciting the Priors of Large Language Models using Iterated In-Context Learning
[NLP-74] 使用迭代上下文学习引出大型语言模型的先验

Link: https://arxiv.org/abs/2406.01860
Authors: Jian-Qiao Zhu, Thomas L. Griffiths
Keywords: Large Language Models, Language Models, Large Language, decisions is critical, increasingly deployed
中文关键词: 大型语言模型,语言模型,大型语言,决策至关重要,部署越来越多
Categories: Computation and Language (cs.CL)
Comments:

Abstract:As Large Language Models (LLMs) are increasingly deployed in real-world settings, understanding the knowledge they implicitly use when making decisions is critical. One way to capture this knowledge is in the form of Bayesian prior distributions. We develop a prompt-based workflow for eliciting prior distributions from LLMs. Our approach is based on iterated learning, a Markov chain Monte Carlo method in which successive inferences are chained in a way that supports sampling from the prior distribution. We validated our method in settings where iterated learning has previously been used to estimate the priors of human participants – causal learning, proportion estimation, and predicting everyday quantities. We found that priors elicited from GPT-4 qualitatively align with human priors in these settings. We then used the same method to elicit priors from GPT-4 for a variety of speculative events, such as the timing of the development of superhuman AI.
摘要:随着大型语言模型(LLM)越来越多地部署在现实世界环境中,了解它们在做决策时隐式使用的知识至关重要。刻画这些知识的一种方式是采用贝叶斯先验分布的形式。我们开发了一个基于提示的工作流程,用于从LLM中引出先验分布。我们的方法基于迭代学习,这是一种马尔可夫链蒙特卡罗方法,其中连续的推断被串联起来,从而支持从先验分布中采样。我们在此前曾用迭代学习估计人类参与者先验的若干场景中验证了我们的方法,包括因果学习、比例估计和日常数量预测。我们发现,在这些场景中从GPT-4引出的先验与人类先验在定性上一致。随后,我们用同样的方法从GPT-4引出了关于各种推测性事件的先验,例如超人类AI出现的时间。
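
The iterated-learning idea in this abstract can be illustrated with a toy Bayesian learner in place of an LLM. In the sketch below, each step generates data from the current hypothesis and then samples a new hypothesis from the posterior; for a learner that samples from its posterior, this chain is a Gibbs sampler whose stationary distribution over hypotheses is the learner's prior. The Beta-Bernoulli learner and all parameter values are illustrative assumptions, not the paper's prompting workflow.

```python
import random


def iterated_learning(a, b, n_obs=5, n_steps=20000, seed=0):
    """Run an iterated-learning chain for a Beta(a, b)-prior coin learner.

    Each step: data are drawn from the current hypothesis theta, then the
    next theta is sampled from the posterior Beta(a + heads, b + tails).
    """
    rng = random.Random(seed)
    theta = 0.5  # arbitrary starting hypothesis
    samples = []
    for _ in range(n_steps):
        # The previous learner's hypothesis generates the data.
        heads = sum(rng.random() < theta for _ in range(n_obs))
        # The next learner samples a hypothesis from its posterior.
        theta = rng.betavariate(a + heads, b + n_obs - heads)
        samples.append(theta)
    return samples


samples = iterated_learning(a=2.0, b=5.0)
chain_mean = sum(samples) / len(samples)
prior_mean = 2.0 / (2.0 + 5.0)
print(round(chain_mean, 3), round(prior_mean, 3))
```

The chain never sees the prior directly, yet its long-run samples reflect it. This is the property that lets iterated learning recover a prior from a learner whose prior is only implicit.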

[NLP-75] TruthEval: A Dataset to Evaluate LLM Truthfulness and Reliability
[NLP-75] TruthEval:评估LLM真实性和可靠性的数据集

Link: https://arxiv.org/abs/2406.01855
Authors: Aisha Khatun, Daniel G. Brown
Keywords: Large Language Model, Large Language, Language Model, existing benchmarks proving, areas of research
中文关键词: 大型语言模型、大型语言、语言模型、现有基准证明、研究领域
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Model (LLM) evaluation is currently one of the most important areas of research, with existing benchmarks proving to be insufficient and not completely representative of LLMs’ various capabilities. We present a curated collection of challenging statements on sensitive topics for LLM benchmarking called TruthEval. These statements were curated by hand and contain known truth values. The categories were chosen to distinguish LLMs’ abilities from their stochastic nature. We perform some initial analyses using this dataset and find several instances of LLMs failing in simple tasks showing their inability to understand simple questions.
摘要:大型语言模型(LLM)评估是目前最重要的研究领域之一,现有基准已被证明不够充分,不能完全代表LLM的各种能力。我们提出了TruthEval,这是一个精心整理的、围绕敏感主题的挑战性陈述集合,用于LLM基准测试。这些陈述由人工整理,并带有已知的真值。选择这些类别是为了将LLM的能力与其随机性区分开来。我们使用该数据集进行了一些初步分析,发现了若干LLM在简单任务上失败的实例,表明它们无法理解简单的问题。

[NLP-76] An Open Multilingual System for Scoring Readability of Wikipedia
[NLP-76] 维基百科可读性评分的开放式多语言系统

Link: https://arxiv.org/abs/2406.01835
Authors: Mykola Trokhymovych, Indira Sen, Martin Gerlach
Keywords: freely accessible knowledge, Wikipedia, accessible knowledge, largest platform, platform for open
中文关键词: 自由获取的知识,维基百科,无障碍知识,最大的平台,开放的平台
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:With over 60M articles, Wikipedia has become the largest platform for open and freely accessible knowledge. While it has more than 15B monthly visits, its content is believed to be inaccessible to many readers due to the lack of readability of its text. However, previous investigations of the readability of Wikipedia have been restricted to English only, and there are currently no systems supporting the automatic readability assessment of the 300+ languages in Wikipedia. To bridge this gap, we develop a multilingual model to score the readability of Wikipedia articles. To train and evaluate this model, we create a novel multilingual dataset spanning 14 languages, by matching articles from Wikipedia to simplified Wikipedia and online children’s encyclopedias. We show that our model performs well in a zero-shot scenario, yielding a ranking accuracy of more than 80% across 14 languages and improving upon previous benchmarks. These results demonstrate the applicability of the model at scale for languages in which there is no ground-truth data available for model fine-tuning. Furthermore, we provide the first overview on the state of readability in Wikipedia beyond English.
摘要:维基百科拥有6000多万篇文章,已成为最大的开放和免费获取知识的平台。虽然它每月的访问量超过150亿次,但由于文本可读性不足,其内容被认为对许多读者来说难以理解。然而,以前对维基百科可读性的研究仅限于英文,目前还没有系统支持对维基百科300多种语言进行自动可读性评估。为了弥补这一差距,我们开发了一个多语言模型来对维基百科文章的可读性进行评分。为了训练和评估这个模型,我们将维基百科中的文章与简化版维基百科和在线儿童百科全书进行匹配,创建了一个涵盖14种语言的新型多语言数据集。我们表明,我们的模型在零样本场景中表现良好,在14种语言上的排序准确率超过80%,并优于之前的基准。这些结果表明,对于没有可用于模型微调的真值标注数据的语言,该模型可以大规模应用。此外,我们还首次概述了英语以外的维基百科的可读性状况。
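
摘要中提到的排序准确率可以用一个小例子来说明。下面只是一个最小示意,并非作者的实现:假设(原文未说明)每个样本是一对得分,分别来自原始文章和对应的简化版文章,且得分越高表示越难读;函数名 `ranking_accuracy` 和示例数值都是为说明而虚构的。

```python
def ranking_accuracy(score_pairs):
    """计算成对排序准确率。

    score_pairs: [(原始文章得分, 简化版文章得分), ...]
    假设得分越高表示越难读,因此原始文章得分应高于简化版。
    """
    if not score_pairs:
        raise ValueError("score_pairs 不能为空")
    # 统计模型把原始文章排在简化版之前(判为更难读)的样本对数量
    correct = sum(1 for original, simplified in score_pairs if original > simplified)
    return correct / len(score_pairs)


# 虚构的示例得分:4 对文章中有 3 对排序正确
example_pairs = [(0.9, 0.2), (0.4, 0.6), (0.7, 0.1), (0.5, 0.3)]
print(ranking_accuracy(example_pairs))
```

按这种定义,随机打分的期望准确率约为50%,可作为理解摘要中"超过80%"这一数字的参照。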

[NLP-77] Contextualized Sequence Likelihood: Enhanced Confidence Scores for Natural Language Generation
[NLP-77] 上下文化序列似然:自然语言生成的增强置信度分数

链接: https://arxiv.org/abs/2406.01806
作者: Zhen Lin,Shubhendu Trivedi,Jimeng Sun
关键词: large language models, numerous natural language, natural language generation, language models, large language
中文关键词: 大型语言模型、众多自然语言、自然语言生成、语言模型、大型语言
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

Abstract:The advent of large language models (LLMs) has dramatically advanced the state-of-the-art in numerous natural language generation tasks. For LLMs to be applied reliably, it is essential to have an accurate measure of their confidence. Currently, the most commonly used confidence score function is the likelihood of the generated sequence, which, however, conflates semantic and syntactic components. For instance, in question-answering (QA) tasks, an awkward phrasing of the correct answer might result in a lower probability prediction. Additionally, different tokens should be weighted differently depending on the context. In this work, we propose enhancing the predicted sequence probability by assigning different weights to various tokens using attention values elicited from the base LLM. By employing a validation set, we can identify the relevant attention heads, thereby significantly improving the reliability of the vanilla sequence probability confidence measure. We refer to this new score as the Contextualized Sequence Likelihood (CSL). CSL is easy to implement, fast to compute, and offers considerable potential for further improvement with task-specific prompts. Across several QA datasets and a diverse array of LLMs, CSL has demonstrated significantly higher reliability than state-of-the-art baselines in predicting generation quality, as measured by the AUROC or AUARC.
摘要:大型语言模型(LLM)的出现极大地推进了众多自然语言生成任务的最先进水平。要想可靠地应用LLM,必须准确衡量其置信度。目前,最常用的置信度得分函数是生成序列的似然,然而,它混淆了语义和句法成分。例如,在问答(QA)任务中,正确答案的笨拙措辞可能会导致较低的概率预测。此外,不同的词元(token)应该根据上下文赋予不同的权重。在这项工作中,我们建议利用从基础LLM获得的注意力值为各个词元分配不同的权重,从而增强预测的序列概率。通过使用验证集,我们可以识别相关的注意力头,从而显著提高原始序列概率置信度度量的可靠性。我们将这个新分数称为上下文化序列似然(CSL)。CSL易于实现,计算速度快,并且借助特定于任务的提示还有相当大的进一步改进潜力。在多个QA数据集和多种LLM上,以AUROC或AUARC衡量,CSL在预测生成质量方面表现出比最先进的基线显著更高的可靠性。
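
摘要描述的核心想法是按权重对各词元的对数概率加权,而不是一视同仁地相加。下面是一个示意性的草图,不是论文中的公式:假设(原文未给出细节)每个词元已有一个非负的注意力权重,并把权重归一化为和为1;函数名和示例数值均为虚构。

```python
def sequence_log_likelihood(token_logprobs):
    """普通序列似然:各词元对数概率之和。"""
    return sum(token_logprobs)


def weighted_sequence_score(token_logprobs, attention_weights):
    """按(假设已给出的)注意力权重对词元对数概率做加权平均。"""
    if len(token_logprobs) != len(attention_weights):
        raise ValueError("对数概率与权重的长度必须一致")
    total = sum(attention_weights)
    if total <= 0:
        raise ValueError("权重之和必须为正")
    # 归一化权重后加权求和;权重大的词元对置信度影响更大
    return sum(w / total * lp for w, lp in zip(attention_weights, token_logprobs))


# 虚构示例:第二个词元概率很低,但权重也较小
logprobs = [-1.0, -3.0]
print(sequence_log_likelihood(logprobs))
print(weighted_sequence_score(logprobs, [3.0, 1.0]))
```

当所有权重相等时,该分数退化为长度归一化的对数似然,这也是选择加权平均而非加权求和的原因。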

[NLP-78] AI-based Classification of Customer Support Tickets: State of the Art and Implementation with AutoML
[NLP-78] 基于AI的客户支持工单分类:最新技术水平及基于AutoML的实现

链接: https://arxiv.org/abs/2406.01789
作者: Mario Truss,Stephan Boehm
关键词: shortening resolution time, improve customer support, customer inquiries, crucial to improve, shortening resolution
中文关键词: 缩短解决时间、改善客户支持、客户询问、改进至关重要、缩短解决方案
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注:

Abstract:Automation of support ticket classification is crucial to improve customer support performance and shortening resolution time for customer inquiries. This research aims to test the applicability of automated machine learning (AutoML) as a technology to train a machine learning model (ML model) that can classify support tickets. The model evaluation conducted in this research shows that AutoML can be used to train ML models with good classification performance. Moreover, this paper fills a research gap by providing new insights into developing AI solutions without a dedicated professional by utilizing AutoML, which makes this technology more accessible for companies without specialized AI departments and staff.
摘要:支持工单分类的自动化对于提高客户支持绩效和缩短客户咨询的解决时间至关重要。这项研究旨在测试自动机器学习(AutoML)作为一种技术的适用性,即用它训练能够对支持工单进行分类的机器学习模型(ML模型)。本研究中进行的模型评估表明,AutoML可以用于训练具有良好分类性能的ML模型。此外,本文还填补了一项研究空白:针对如何在没有专职专业人员的情况下利用AutoML开发人工智能解决方案提供了新的见解,这使得没有专门人工智能部门和员工的公司更容易使用这项技术。

[NLP-79] OLoRA: Orthonormal Low-Rank Adaptation of Large Language Models
[NLP-79] OLoRA:大型语言模型的标准正交低秩适应

链接: https://arxiv.org/abs/2406.01775
作者: Kerim Büyükakyüz
关键词: generating human-like text, enabling unprecedented capabilities, human-like text, advent of large, unprecedented capabilities
中文关键词: 生成类人文本,实现前所未有的能力,类人文本,巨大、前所未有的能力的出现
类目: Computation and Language (cs.CL)
备注: 10 pages, 5 figures

Abstract:The advent of large language models (LLMs) has revolutionized natural language processing, enabling unprecedented capabilities in understanding and generating human-like text. However, the computational cost and convergence times associated with fine-tuning these models remain significant challenges. Low-Rank Adaptation (LoRA) has emerged as a promising method to mitigate these issues by introducing efficient fine-tuning techniques with a reduced number of trainable parameters. In this paper, we present OLoRA, an enhancement to the LoRA method that leverages orthonormal matrix initialization through QR decomposition. OLoRA significantly accelerates the convergence of LLM training while preserving the efficiency benefits of LoRA, such as the number of trainable parameters and GPU memory footprint. Our empirical evaluations demonstrate that OLoRA not only converges faster but also exhibits improved performance compared to standard LoRA across a variety of language modeling tasks. This advancement opens new avenues for more efficient and accessible fine-tuning of LLMs, potentially enabling broader adoption and innovation in natural language applications.
摘要:大型语言模型(LLM)的出现使自然语言处理发生了革命性的变化,在理解和生成类人文本方面带来了前所未有的能力。然而,微调这些模型所需的计算成本和收敛时间仍然是重大挑战。低秩适应(LoRA)通过引入可训练参数更少的高效微调技术,成为缓解这些问题的一种有前途的方法。在本文中,我们提出了OLoRA,它是对LoRA方法的一种改进,通过QR分解实现标准正交矩阵初始化。OLoRA显著加快了LLM训练的收敛速度,同时保留了LoRA的效率优势,例如可训练参数的数量和GPU内存占用。我们的实证评估表明,与标准LoRA相比,OLoRA不仅收敛速度更快,而且在各种语言建模任务中表现出更好的性能。这一进步为更高效、更易用的LLM微调开辟了新的途径,有望推动自然语言应用的更广泛采用和创新。
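
摘要只说明OLoRA借助QR分解得到标准正交的初始化矩阵,没有给出具体步骤。下面的草图仅演示"从一个矩阵得到列标准正交的Q因子"这一步,用纯Python的Gram-Schmidt正交化代替库函数(在数学上对应瘦QR分解中的Q);它不是作者的初始化流程,函数名 `orthonormal_columns` 和示例矩阵均为假设。

```python
import math


def orthonormal_columns(matrix):
    """对矩阵的各列做(改进的)Gram-Schmidt正交化,返回列标准正交的矩阵。

    matrix 以"行的列表"表示,并假设其各列线性无关。
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    columns = [[matrix[i][j] for i in range(n_rows)] for j in range(n_cols)]
    q_columns = []
    for column in columns:
        w = list(column)
        for q in q_columns:
            # 减去 w 在已有正交方向 q 上的投影
            projection = sum(wi * qi for wi, qi in zip(w, q))
            w = [wi - projection * qi for wi, qi in zip(w, q)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        if norm < 1e-12:
            raise ValueError("矩阵的列线性相关")
        q_columns.append([wi / norm for wi in w])
    # 转回"行的列表"形式
    return [[q_columns[j][i] for j in range(n_cols)] for i in range(n_rows)]


# 虚构的 3x2 示例矩阵
Q = orthonormal_columns([[1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
for row in Q:
    print(row)
```

实际使用时通常会直接调用数值库提供的QR分解;这里手写正交化只是为了让示例不依赖第三方库。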

[NLP-80] LLMs Beyond English: Scaling the Multilingual Capability of LLMs with Cross-Lingual Feedback
[NLP-80] 超越英语的LLM:通过跨语言反馈扩展LLM的多语言能力

链接: https://arxiv.org/abs/2406.01771
作者: Wen Lai,Mohsen Mesgar,Alexander Fraser
关键词: large language models, democratize large language, models capable, democratize large, imperative to make
中文关键词: 大型语言模型,大型语言民主化,模型有能力,大型民主化,势在必行
类目: Computation and Language (cs.CL)
备注: Accepted to Findings of ACL 2024. The code, datasets, and models are publicly available at this https URL

Abstract:To democratize large language models (LLMs) to most natural languages, it is imperative to make these models capable of understanding and generating texts in many languages, in particular low-resource ones. While recent multilingual LLMs demonstrate remarkable performance in such capabilities, these LLMs still support a limited number of human languages due to the lack of training data for low-resource languages. Moreover, these LLMs are not yet aligned with human preference for downstream tasks, which is crucial for the success of LLMs in English. In this paper, we introduce xLLaMA-100 and xBLOOM-100 (collectively xLLMs-100), which scale the multilingual capabilities of LLaMA and BLOOM to 100 languages. To do so, we construct two datasets: a multilingual instruction dataset including 100 languages, which represents the largest language coverage to date, and a cross-lingual human feedback dataset encompassing 30 languages. We perform multilingual instruction tuning on the constructed instruction data and further align the LLMs with human feedback using the DPO algorithm on our cross-lingual human feedback dataset. We evaluate the multilingual understanding and generating capabilities of xLLMs-100 on five multilingual benchmarks. Experimental results show that xLLMs-100 consistently outperforms its peers across the benchmarks by considerable margins, defining a new state-of-the-art multilingual LLM that supports 100 languages.
摘要:要将大型语言模型(LLM)普及到大多数自然语言,必须使这些模型能够理解和生成多种语言的文本,特别是低资源语言。虽然最近的多语言LLM在这方面表现出色,但由于缺乏低资源语言的训练数据,这些LLM支持的人类语言数量仍然有限。此外,这些LLM在下游任务上尚未与人类偏好对齐,而这种对齐对LLM在英语上的成功至关重要。在本文中,我们介绍了xLLaMA-100和xBLOOM-100(统称为xLLMs-100),它们将LLaMA和BLOOM的多语言能力扩展到100种语言。为此,我们构建了两个数据集:一个包含100种语言的多语言指令数据集,这是迄今为止最大的语言覆盖范围;以及一个涵盖30种语言的跨语言人类反馈数据集。我们在构建的指令数据上进行多语言指令微调,并在我们的跨语言人类反馈数据集上使用DPO算法进一步将LLM与人类反馈对齐。我们在五个多语言基准上评估了xLLMs-100的多语言理解和生成能力。实验结果表明,xLLMs-100在各项基准上始终以相当大的优势优于同类模型,确立了一个支持100种语言的最先进的多语言LLM。

[NLP-81] Towards Harnessing Large Language Models for Comprehension of Conversational Grounding
[NLP-81] 利用大型语言模型来理解对话基础

链接: https://arxiv.org/abs/2406.01749
作者: Kristiina Jokinen,Phillip Schneider,Taiga Mori
关键词: establishing mutual knowledge, large language models, collaborative mechanism, mechanism for establishing, establishing mutual
中文关键词: 建立相互知识、大型语言模型、协作机制、建立机制、建立相互
类目: Computation and Language (cs.CL)
备注: Accepted to IWSDS 2024

Abstract:Conversational grounding is a collaborative mechanism for establishing mutual knowledge among participants engaged in a dialogue. This experimental study analyzes information-seeking conversations to investigate the capabilities of large language models in classifying dialogue turns related to explicit or implicit grounding and predicting grounded knowledge elements. Our experimental results reveal challenges encountered by large language models in the two tasks and discuss ongoing research efforts to enhance large language model-based conversational grounding comprehension through pipeline architectures and knowledge bases. These initiatives aim to develop more effective dialogue systems that are better equipped to handle the intricacies of grounded knowledge in conversations.
摘要:对话基础是一种用于在参与对话的参与者之间建立相互知识的协作机制。这项实验研究分析了寻求信息的对话,以调查大型语言模型在对与显式或隐式基础相关的对话回合进行分类以及预测基础知识元素方面的能力。我们的实验结果揭示了大型语言模型在这两项任务中遇到的挑战,并讨论了正在进行的研究工作,以通过管道架构和知识库增强基于大型语言模型的对话基础理解。这些举措旨在开发更有效的对话系统,以便更好地处理对话中基础知识的复杂性。

[NLP-82] Rotation and Permutation for Advanced Outlier Management and Efficient Quantization of LLMs
[NLP-82] 用于高级离群值管理和LLM高效量化的旋转和置换

链接: https://arxiv.org/abs/2406.01721
作者: Haokun Lin,Haobo Xu,Yichen Wu,Jingzhi Cui,Yingtao Zhang,Linzhan Mou,Linqi Song,Zhenan Sun,Ying Wei
关键词: Quantizing large language, presents significant challenges, Quantizing large, large language models, solving Normal Outliers-activations
中文关键词: 量化大型语言,提出了重大挑战,量化大型,大型语言模型,解决正常离群值激活
类目: Computation and Language (cs.CL)
备注: 26 pages, 13 figures

Abstract:Quantizing large language models (LLMs) presents significant challenges, primarily due to outlier activations that compromise the efficiency of low-bit representation. Traditional approaches mainly focus on solving Normal Outliers, that is, activations with consistently high magnitudes across all tokens. However, these techniques falter when dealing with Massive Outliers, which are significantly higher in value and often cause substantial performance losses during low-bit quantization. In this study, we propose DuQuant, an innovative quantization strategy employing rotation and permutation transformations to more effectively eliminate both types of outliers. Initially, DuQuant constructs rotation matrices informed by specific outlier dimensions, redistributing these outliers across adjacent channels within different rotation blocks. Subsequently, a zigzag permutation is applied to ensure a balanced distribution of outliers among blocks, minimizing block-wise variance. An additional rotation further enhances the smoothness of the activation landscape, thereby improving model performance. DuQuant streamlines the quantization process and demonstrates superior outlier management, achieving top-tier results in multiple tasks with various LLM architectures even under 4-bit weight-activation quantization. Our code is available at this https URL.
摘要:对大型语言模型(LLM)进行量化面临重大挑战,主要原因是离群激活值损害了低比特表示的效率。传统方法主要致力于解决正常离群值(Normal Outliers),即在所有词元上幅度始终较高的激活值。然而,这些技术在处理巨量离群值(Massive Outliers)时会失效,后者的数值要高得多,并且在低比特量化期间经常导致大量性能损失。在这项研究中,我们提出了DuQuant,一种创新的量化策略,使用旋转和置换变换来更有效地消除这两种类型的离群值。首先,DuQuant依据特定的离群值维度构造旋转矩阵,将这些离群值重新分配到不同旋转块内的相邻通道上。随后,应用之字形(zigzag)置换来确保离群值在各块之间均衡分布,从而最小化块间方差。额外的一次旋转进一步增强了激活分布的平滑度,从而提升模型性能。DuQuant简化了量化过程,并展示了出色的离群值管理能力,即使在4比特权重-激活量化下,也能在多种LLM架构的多项任务中取得顶级结果。我们的代码可以在这个HTTPS URL上找到。

[NLP-83] TimeCMA: Towards LLM-Empowered Time Series Forecasting via Cross-Modality Alignment
[NLP-83] TimeCMA:通过跨模态对齐实现LLM赋能的时间序列预测

链接: https://arxiv.org/abs/2406.01638
作者: Chenxi Liu,Qianxiong Xu,Hao Miao,Sun Yang,Lingzheng Zhang,Cheng Long,Ziyue Li,Rui Zhao
关键词: scalable mobile sensing, time series, series, time, time series forecasting
中文关键词: 可扩展移动传感、时间序列、序列、时间、时间序列预测
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

Abstract:The widespread adoption of scalable mobile sensing has led to large amounts of time series data for real-world applications. A fundamental application is multivariate time series forecasting (MTSF), which aims to predict future time series values based on historical observations. Existing MTSF methods suffer from limited parameterization and small-scale training data. Recently, Large language models (LLMs) have been introduced in time series, which achieve promising forecasting performance but incur heavy computational costs. To solve these challenges, we propose TimeCMA, an LLM-empowered framework for time series forecasting with cross-modality alignment. We design a dual-modality encoding module with two branches, where the time series encoding branch extracts relatively low-quality yet pure embeddings of time series through an inverted Transformer. In addition, the LLM-empowered encoding branch wraps the same time series as prompts to obtain high-quality yet entangled prompt embeddings via a Pre-trained LLM. Then, we design a cross-modality alignment module to retrieve high-quality and pure time series embeddings from the prompt embeddings. Moreover, we develop a time series forecasting module to decode the aligned embeddings while capturing dependencies among multiple variables for forecasting. Notably, we tailor the prompt to encode sufficient temporal information into a last token and design the last token embedding storage to reduce computational costs. Extensive experiments on real data offer insight into the accuracy and efficiency of the proposed framework.
摘要:可扩展移动传感的广泛应用为现实世界的应用带来了大量的时间序列数据。一个基本的应用是多变量时间序列预测(MTSF),其目的是根据历史观测来预测未来的时间序列值。现有的MTSF方法受限于有限的参数量和小规模的训练数据。最近,大型语言模型(LLM)被引入到时间序列领域,它们取得了有前景的预测性能,但计算成本很高。为了应对这些挑战,我们提出了TimeCMA,这是一个通过跨模态对齐进行时间序列预测的LLM赋能框架。我们设计了一个具有两个分支的双模态编码模块,其中时间序列编码分支通过倒置Transformer(inverted Transformer)提取质量相对较低但纯净的时间序列嵌入。此外,LLM赋能的编码分支将相同的时间序列包装为提示,通过预训练的LLM获得高质量但相互纠缠的提示嵌入。然后,我们设计了一个跨模态对齐模块,从提示嵌入中检索出高质量且纯净的时间序列嵌入。此外,我们开发了一个时间序列预测模块,在解码对齐后的嵌入的同时,捕获多个变量之间的依赖关系以进行预测。值得注意的是,我们对提示进行了定制,将足够的时间信息编码到最后一个词元中,并设计了最后一个词元的嵌入存储,以降低计算成本。在真实数据上的大量实验展示了所提出框架的准确性和效率。

[NLP-84] On Overcoming Miscalibrated Conversational Priors in LLM-based Chatbots
[NLP-84] 关于克服基于LLM的聊天机器人中错误校准的对话先验

链接: https://arxiv.org/abs/2406.01633
作者: Christine Herlihy,Jennifer Neville,Tobias Schnabel,Adith Swaminathan
关键词:
中文关键词:
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Preprint of UAI’24 conference publication

[NLP-85] Judgement Citation Retrieval using Contextual Similarity
[NLP-85] 利用上下文相似度进行判决引文检索

链接: https://arxiv.org/abs/2406.01609
作者: Akshat Mohan Dasula,Hrushitha Tigulla,Preethika Bhukya
关键词: demanded manual effort, keyword-based search applications, understanding legal jargon, Legal case descriptions, intricate case descriptions
中文关键词: 需要手动工作、基于关键字的搜索应用程序、理解法律行话、法律案件描述、复杂的案件描述
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 14 pages, 16 images, Submitted to Multimedia Tools and Applications Springer journal

Abstract:Traditionally in the domain of legal research, the retrieval of pertinent citations from intricate case descriptions has demanded manual effort and keyword-based search applications that mandate expertise in understanding legal jargon. Legal case descriptions hold pivotal information for legal professionals and researchers, necessitating more efficient and automated approaches. We propose a methodology that combines natural language processing (NLP) and machine learning techniques to enhance the organization and utilization of legal case descriptions. This approach revolves around the creation of textual embeddings with the help of state-of-art embedding models. Our methodology addresses two primary objectives: unsupervised clustering and supervised citation retrieval, both designed to automate the citation extraction process. Although the proposed methodology can be used for any dataset, we employed the Supreme Court of The United States (SCOTUS) dataset, yielding remarkable results. Our methodology achieved an impressive accuracy rate of 90.9%. By automating labor-intensive processes, we pave the way for a more efficient, time-saving, and accessible landscape in legal research, benefiting legal professionals, academics, and researchers.
摘要:传统上,在法律研究领域,从错综复杂的案例描述中检索相关引文需要手动工作和基于关键字的搜索应用程序,这些应用程序要求具有理解法律术语的专业知识。法律案例描述为法律专业人员和研究人员保存了关键信息,需要更高效和自动化的方法。我们提出了一种结合自然语言处理(NLP)和机器学习技术的方法来增强法律案例描述的组织和利用。这种方法围绕着在最先进的嵌入模型的帮助下创建文本嵌入。我们的方法解决了两个主要目标:无监督聚类和监督引文检索,两者都旨在自动化引文提取过程。虽然建议的方法可以用于任何数据集,但我们使用了美国最高法院(SCOTUS)的数据集,产生了显著的结果。我们的方法达到了令人印象深刻的准确率90.9%。通过自动化劳动密集型流程,我们为法律研究中更高效、更节省时间和更容易获得的环境铺平了道路,使法律专业人员、学者和研究人员受益。

[NLP-86] Detecting Deceptive Dark Patterns in E-commerce Platforms
[NLP-86] 检测电子商务平台中的欺骗性黑暗模式

链接: https://arxiv.org/abs/2406.01608
作者: Arya Ramteke,Sankalp Tembhurne,Gunesh Sonawane,Ratnmala N. Bhimanpallewar
关键词: deceptive user interfaces, user interfaces employed, manipulate user behavior, Dark patterns, deceptive user
中文关键词: 欺骗性用户界面、使用的用户界面、操纵用户行为、黑暗模式、欺骗性用户
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注:

Abstract:Dark patterns are deceptive user interfaces employed by e-commerce websites to manipulate user’s behavior in a way that benefits the website, often unethically. This study investigates the detection of such dark patterns. Existing solutions include UIGuard, which uses computer vision and natural language processing, and approaches that categorize dark patterns based on detectability or utilize machine learning models trained on datasets. We propose combining web scraping techniques with fine-tuned BERT language models and generative capabilities to identify dark patterns, including outliers. The approach scrapes textual content, feeds it into the BERT model for detection, and leverages BERT’s bidirectional analysis and generation abilities. The study builds upon research on automatically detecting and explaining dark patterns, aiming to raise awareness and protect consumers.
摘要:黑暗模式是电子商务网站采用的欺骗性用户界面,以使网站受益的方式操纵用户行为,通常是不道德的。这项研究调查了这种黑暗模式的检测。现有的解决方案包括使用计算机视觉和自然语言处理的UIGuard,以及根据可检测性对黑暗模式进行分类或利用在数据集上训练的机器学习模型的方法。我们建议将网络抓取技术与微调的BERT语言模型和生成能力相结合,以识别黑暗模式,包括异常情形。该方法抓取文本内容,将其输入BERT模型进行检测,并利用BERT的双向分析和生成能力。该研究建立在自动检测和解释黑暗模式的已有研究之上,旨在提高公众意识并保护消费者。

[NLP-87] Recent advances in text embedding: A Comprehensive Review of Top-Performing Methods on the MTEB Benchmark
[NLP-87] 文本嵌入的最新进展:MTEB基准上表现最佳方法的全面回顾

链接: https://arxiv.org/abs/2406.01607
作者: Hongliu Cao
关键词: academic fields due, natural language processing, Text embedding methods, language processing tasks, universal text embeddings
中文关键词: 学术领域,自然语言处理,文本嵌入方法,语言处理任务,通用文本嵌入
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 45 pages

Abstract:Text embedding methods have become increasingly popular in both industrial and academic fields due to their critical role in a variety of natural language processing tasks. The significance of universal text embeddings has been further highlighted with the rise of Large Language Models (LLMs) applications such as Retrieval-Augmented Systems (RAGs). While previous models have attempted to be general-purpose, they often struggle to generalize across tasks and domains. However, recent advancements in training data quantity, quality and diversity; synthetic data generation from LLMs as well as using LLMs as backbones encourage great improvements in pursuing universal text embeddings. In this paper, we provide an overview of the recent advances in universal text embedding models with a focus on the top performing text embeddings on Massive Text Embedding Benchmark (MTEB). Through detailed comparison and analysis, we highlight the key contributions and limitations in this area, and propose potentially inspiring future research directions.
摘要:文本嵌入方法因其在各种自然语言处理任务中的重要作用而日益受到工业界和学术界的青睐。随着检索增强系统(RAG)等大型语言模型(LLM)应用的兴起,通用文本嵌入的重要性进一步凸显。虽然以前的模型试图做到通用,但它们往往难以跨任务和领域泛化。然而,最近在训练数据的数量、质量和多样性、利用LLM生成合成数据以及使用LLM作为主干网络方面的进展,推动了通用文本嵌入的巨大进步。本文综述了通用文本嵌入模型的最新进展,重点介绍了在海量文本嵌入基准(MTEB)上表现最好的文本嵌入。通过详细的比较和分析,我们突出了这一领域的主要贡献和局限性,并提出了可能具有启发意义的未来研究方向。

[NLP-88] SymTax: Symbiotic Relationship and Taxonomy Fusion for Effective Citation Recommendation
[NLP-88] SymTax:共生关系和分类学融合,实现有效的引文推荐

链接: https://arxiv.org/abs/2406.01606
作者: Karan Goyal,Mayank Goel,Vikram Goyal,Mukesh Mohania
关键词: Citing pertinent literature, Citing pertinent, scientific document, pertinent literature, literature is pivotal
中文关键词: 引用相关文献,引用相关的科学文献,相关的文献,文献是关键
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted in ACL 2024

Abstract:Citing pertinent literature is pivotal to writing and reviewing a scientific document. Existing techniques mainly focus on the local context or the global context for recommending citations but fail to consider the actual human citation behaviour. We propose SymTax, a three-stage recommendation architecture that considers both the local and the global context, and additionally the taxonomical representations of query-candidate tuples and the Symbiosis prevailing amongst them. SymTax learns to embed the infused taxonomies in the hyperbolic space and uses hyperbolic separation as a latent feature to compute query-candidate similarity. We build a novel and large dataset ArSyTa containing 8.27 million citation contexts and describe the creation process in detail. We conduct extensive experiments and ablation studies to demonstrate the effectiveness and design choice of each module in our framework. Also, combinatorial analysis from our experiments sheds light on the choice of language models (LMs) and fusion embedding, and the inclusion of section heading as a signal. Our proposed module that captures the symbiotic relationship alone leads to performance gains of 26.66% and 39.25% in Recall@5 w.r.t. SOTA on ACL-200 and RefSeer datasets, respectively. The complete framework yields a gain of 22.56% in Recall@5 w.r.t. SOTA on our proposed dataset. The code and dataset are available at this https URL
摘要:引用相关文献是撰写和审阅科学文献的关键。现有技术在推荐引文时主要关注局部上下文或全局上下文,而没有考虑人类的实际引文行为。我们提出了SymTax,一种三阶段推荐架构,它同时考虑局部和全局上下文,此外还考虑查询-候选元组的分类学表示以及它们之间普遍存在的共生关系。SymTax学习将注入的分类法嵌入到双曲空间中,并使用双曲分离作为潜在特征来计算查询-候选相似度。我们构建了一个包含827万个引文上下文的新颖的大型数据集ArSyTa,并详细描述了其创建过程。我们进行了广泛的实验和消融研究,以证明我们框架中每个模块的有效性和设计选择。此外,我们实验中的组合分析揭示了语言模型(LM)和融合嵌入的选择,以及将章节标题作为信号纳入的作用。仅凭我们提出的捕获共生关系的模块,在ACL-200和RefSeer数据集上,Recall@5相对于SOTA就分别提升了26.66%和39.25%。在我们提出的数据集上,完整框架的Recall@5相对于SOTA提升了22.56%。代码和数据集可在此HTTPS URL中找到。
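
摘要以Recall@5报告结果。下面用一个小例子说明该指标的常见算法,仅作示意:假设(原文未说明)对每个查询,Recall@k定义为前k个推荐中命中的相关文献数除以相关文献总数;函数名和文献编号均为虚构,与论文数据无关。

```python
def recall_at_k(ranked_candidates, relevant, k=5):
    """计算单个查询的 Recall@k。

    ranked_candidates: 按得分从高到低排序的候选文献编号
    relevant: 该查询真正应引用的文献编号集合
    """
    if k <= 0:
        raise ValueError("k 必须为正整数")
    relevant = set(relevant)
    if not relevant:
        raise ValueError("relevant 不能为空")
    # 前 k 个候选与相关文献的交集大小即命中数
    hits = len(set(ranked_candidates[:k]) & relevant)
    return hits / len(relevant)


# 虚构示例:两篇相关文献中,只有 p1 排进了前 5
ranking = ["p3", "p7", "p1", "p9", "p4", "p2"]
print(recall_at_k(ranking, {"p1", "p2"}, k=5))
```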

[NLP-89] Language-Universal Speech Attributes Modeling for Zero-Shot Multilingual Spoken Keyword Recognition
[NLP-89] 面向零样本多语言口语关键词识别的语言通用语音属性建模

链接: https://arxiv.org/abs/2406.02488
作者: Hao Yen,Pin-Jui Ku,Sabato Marco Siniscalchi,Chin-Hui Lee
关键词: self-supervised pre-trained model, automatic spoken keyword, spoken keyword recognition, universal speech attributes, manner and place
中文关键词: 自我监督预训练模型、自动口语关键词、口语关键词识别、通用语音属性、方式和地点
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
备注:

Abstract:We propose a novel language-universal approach to end-to-end automatic spoken keyword recognition (SKR) leveraging upon (i) a self-supervised pre-trained model, and (ii) a set of universal speech attributes (manner and place of articulation). Specifically, Wav2Vec2.0 is used to generate robust speech representations, followed by a linear output layer to produce attribute sequences. A non-trainable pronunciation model then maps sequences of attributes into spoken keywords in a multilingual setting. Experiments on the Multilingual Spoken Words Corpus show comparable performances to character- and phoneme-based SKR in seen languages. The inclusion of domain adversarial training (DAT) improves the proposed framework, outperforming both character- and phoneme-based SKR approaches with 13.73% and 17.22% relative word error rate (WER) reduction in seen languages, and achieves 32.14% and 19.92% WER reduction for unseen languages in zero-shot settings.
摘要:我们提出了一种新颖的语言通用方法来实现端到端自动口语关键词识别(SKR),该方法利用(i)自监督预训练模型,和(ii)一组通用语音属性(发音方式和发音部位)。具体来说,先用Wav2Vec2.0生成鲁棒的语音表示,再经线性输出层生成属性序列。然后,不可训练的发音模型在多语言环境中将属性序列映射为口语关键词。在Multilingual Spoken Words Corpus上的实验显示,在已见语言中,其性能与基于字符和音素的SKR相当。引入领域对抗训练(DAT)改进了所提出的框架,使其优于基于字符和音素的SKR方法,在已见语言中相对词错误率(WER)分别降低13.73%和17.22%,并在零样本设置下对未见语言实现了32.14%和19.92%的WER降低。
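
摘要中的百分比是相对词错误率降低,而不是绝对差值。下面的小例子演示这一换算,仅作示意:公式为(基线WER - 新WER)/ 基线WER;函数名和示例中的WER数值都是虚构的,并非论文报告的结果。

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """返回相对 WER 降低比例:(基线 - 新值) / 基线。"""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer 必须为正")
    return (baseline_wer - new_wer) / baseline_wer


# 虚构示例:WER 从 20.0% 降到 17.0%,绝对降低 3 个百分点
reduction = relative_wer_reduction(20.0, 17.0)
print(f"{reduction:.2%}")
```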

[NLP-90] SimulTron: On-Device Simultaneous Speech to Speech Translation
[NLP-90] SimulTron:设备上同步语音到语音翻译

链接: https://arxiv.org/abs/2406.02133
作者: Alex Agranovich,Eliya Nachmani,Oleg Rybakov,Yifan Ding,Ye Jia,Nadav Bar,Heiga Zen,Michelle Tadmor Ramanovich
关键词: enabling fluid conversations, holds the promise, conversations across languages, promise of breaking, breaking down communication
中文关键词: 实现流畅的对话,拥有跨语言对话的承诺,打破沟通的承诺
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
备注:

Abstract:Simultaneous speech-to-speech translation (S2ST) holds the promise of breaking down communication barriers and enabling fluid conversations across languages. However, achieving accurate, real-time translation through mobile devices remains a major challenge. We introduce SimulTron, a novel S2ST architecture designed to tackle this task. SimulTron is a lightweight direct S2ST model that uses the strengths of the Translatotron framework while incorporating key modifications for streaming operation, and an adjustable fixed delay. Our experiments show that SimulTron surpasses Translatotron 2 in offline evaluations. Furthermore, real-time evaluations reveal that SimulTron improves upon the performance achieved by Translatotron 1. Additionally, SimulTron achieves superior BLEU scores and latency compared to previous real-time S2ST method on the MuST-C dataset. Significantly, we have successfully deployed SimulTron on a Pixel 7 Pro device, showing its potential for simultaneous S2ST on-device.
摘要:同步语音到语音翻译(S2ST)有望打破沟通障碍,实现跨语言的流畅对话。然而,通过移动设备实现准确、实时的翻译仍然是一个重大挑战。我们介绍了SimulTron,一种旨在解决这一问题的新型S2ST架构。SimulTron是一个轻量级的直接S2ST模型,它利用了Translatotron框架的优势,同时加入了面向流式运行的关键修改,以及可调整的固定延迟。我们的实验表明,SimulTron在离线评估中超过了Translatotron 2。此外,实时评估表明,SimulTron的性能优于Translatotron 1。此外,在MuST-C数据集上,SimulTron取得了比以前的实时S2ST方法更好的BLEU分数和延迟。值得注意的是,我们已经成功地在Pixel 7 Pro设备上部署了SimulTron,显示了其在设备端进行同步S2ST的潜力。

[NLP-91] Phonetic Enhanced Language Modeling for Text-to-Speech Synthesis
[NLP-91] 用于文本到语音合成的语音增强语言建模

链接: https://arxiv.org/abs/2406.02009
作者: Kun Zhou,Shengkui Zhao,Yukun Ma,Chong Zhang,Hao Wang,Dianwen Ng,Chongjia Ni,Nguyen Trung Hieu,Jia Qi Yip,Bin Ma
关键词: Recent language model-based, frameworks demonstrate scalability, in-context learning capabilities, Recent language, frameworks demonstrate
中文关键词: 最近的基于语言模型的框架展示了可扩展性、上下文学习能力,最近的语言框架展示了
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
备注: Accepted by Interspeech 2024

Abstract:Recent language model-based text-to-speech (TTS) frameworks demonstrate scalability and in-context learning capabilities. However, they suffer from robustness issues due to the accumulation of errors in speech unit predictions during autoregressive language modeling. In this paper, we propose a phonetic enhanced language modeling method to improve the performance of TTS models. We leverage self-supervised representations that are phonetically rich as the training target for the autoregressive language model. Subsequently, a non-autoregressive model is employed to predict discrete acoustic codecs that contain fine-grained acoustic details. The TTS model focuses solely on linguistic modeling during autoregressive training, thereby reducing the error propagation that occurs in non-autoregressive training. Both objective and subjective evaluations validate the effectiveness of our proposed method.
摘要:最近的基于语言模型的文本到语音(TTS)框架展示了可扩展性和上下文学习能力。然而,由于自回归语言建模期间语音单元预测中的错误积累,它们面临鲁棒性问题。本文提出了一种语音学增强的语言建模方法来提高TTS模型的性能。我们利用语音信息丰富的自监督表示作为自回归语言模型的训练目标。随后,采用非自回归模型来预测包含细粒度声学细节的离散声学编解码器。TTS模型在自回归训练期间仅关注语言建模,从而减少非自回归训练中发生的错误传播。客观和主观评估都验证了我们提出的方法的有效性。

[NLP-92] Unveiling Hidden Factors: Explainable AI for Feature Boosting in Speech Emotion Recognition
[NLP-92] 揭开隐藏因素:用于语音情感识别特征增强的可解释人工智能

链接: https://arxiv.org/abs/2406.01624
作者: Alaa Nfissi,Wassim Bouachir,Nizar Bouguila,Brian Mishara
关键词: gained significant attention, significant attention due, Speech emotion recognition, SER, SER systems
中文关键词: 引起了极大的关注,引起了极大的关注,语音情感识别,SER,SER系统
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
备注: Published in: Springer Nature International Journal of Applied Intelligence (2024)

Abstract:Speech emotion recognition (SER) has gained significant attention due to its several application fields, such as mental health, education, and human-computer interaction. However, the accuracy of SER systems is hindered by high-dimensional feature sets that may contain irrelevant and redundant information. To overcome this challenge, this study proposes an iterative feature boosting approach for SER that emphasizes feature relevance and explainability to enhance machine learning model performance. Our approach involves meticulous feature selection and analysis to build efficient SER systems. In addressing our main problem through model explainability, we employ a feature evaluation loop with Shapley values to iteratively refine feature sets. This process strikes a balance between model performance and transparency, which enables a comprehensive understanding of the model’s predictions. The proposed approach offers several advantages, including the identification and removal of irrelevant and redundant features, leading to a more effective model. Additionally, it promotes explainability, facilitating comprehension of the model’s predictions and the identification of crucial features for emotion determination. The effectiveness of the proposed method is validated on the SER benchmarks of the Toronto emotional speech set (TESS), Berlin Database of Emotional Speech (EMO-DB), Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), and Surrey Audio-Visual Expressed Emotion (SAVEE) datasets, outperforming state-of-the-art methods. These results highlight the potential of the proposed technique in developing accurate and explainable SER systems. To the best of our knowledge, this is the first work to incorporate model explainability into an SER framework.
摘要:语音情感识别(SER)因其在心理健康、教育、人机交互等领域的应用而备受关注。然而,高维特征集可能包含无关和冗余的信息,这阻碍了SER系统的准确性。为了克服这一挑战,本研究提出了一种面向SER的迭代特征增强方法,强调特征相关性和可解释性,以提高机器学习模型的性能。我们的方法包括细致的特征选择和分析,以构建高效的SER系统。在通过模型可解释性解决我们的主要问题时,我们使用带有Shapley值的特征评估循环来迭代地精炼特征集。这一过程在模型性能和透明度之间取得了平衡,从而能够全面理解模型的预测。该方法提供了几个优点,包括识别和去除不相关和冗余的特征,从而产生更有效的模型。此外,它还提高了可解释性,有助于理解模型的预测,并识别对情感判定至关重要的特征。在多伦多情感语音集(TESS)、柏林情感语音数据库(EMO-DB)、Ryerson情感语音和歌曲视听数据库(RAVDESS)和萨里视听表达情感(SAVEE)数据集这些SER基准上验证了该方法的有效性,其表现优于最先进的方法。这些结果突出了所提出的技术在开发准确和可解释的SER系统方面的潜力。据我们所知,这是将模型可解释性纳入SER框架的第一项工作。
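
摘要提到用Shapley值评估特征的重要性。下面是一个精确计算Shapley值的最小草图,仅适用于特征数很少的情形(复杂度随特征数指数增长),并非作者的特征评估循环:假设存在一个价值函数 `value_fn`,输入特征子集、输出模型得分;示例中的特征名和可加的得分函数都是虚构的。

```python
from itertools import combinations
from math import factorial


def shapley_values(features, value_fn):
    """按定义枚举所有子集,精确计算每个特征的 Shapley 值。"""
    n = len(features)
    result = {}
    for feature in features:
        others = [name for name in features if name != feature]
        total = 0.0
        for size in range(n):
            # 大小为 size 的子集对应的 Shapley 权重
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                base = frozenset(subset)
                # 把该特征加入子集后带来的边际贡献
                total += weight * (value_fn(base | {feature}) - value_fn(base))
        result[feature] = total
    return result


# 虚构的可加得分函数:每个特征独立贡献固定增益
gains = {"pitch": 0.3, "energy": 0.1, "mfcc": 0.2}


def additive_value(subset):
    return sum(gains[name] for name in subset)


print(shapley_values(list(gains), additive_value))
```

对可加的得分函数,每个特征的Shapley值恰好等于它自身的增益,这可以用来检查实现是否正确;真实模型中特征之间存在交互,通常需要采样近似。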

计算机视觉

[CV-0] VHS: High-Resolution Iterative Stereo Matching with Visual Hull Priors

Link: https://arxiv.org/abs/2406.02552
Authors: Markus Plack, Hannah Dröge, Leif Van Holland, Matthias B. Hullin
Keywords: present a stereo-matching, memory-efficient technique, stereo-matching method, correlation computation, high-resolution images
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:We present a stereo-matching method for depth estimation from high-resolution images using visual hulls as priors, and a memory-efficient technique for the correlation computation. Our method uses object masks extracted from supplementary views of the scene to guide the disparity estimation, effectively reducing the search space for matches. This approach is specifically tailored to stereo rigs in volumetric capture systems, where an accurate depth plays a key role in the downstream reconstruction task. To enable training and regression at high resolutions targeted by recent systems, our approach extends a sparse correlation computation into a hybrid sparse-dense scheme suitable for application in leading recurrent network architectures. We evaluate the performance-efficiency trade-off of our method compared to state-of-the-art methods, and demonstrate the efficacy of the visual hull guidance. In addition, we propose a training scheme for a further reduction of memory requirements during optimization, facilitating training on high-resolution data.

[CV-1] Dreamguider: Improved Training free Diffusion-based Conditional Generation

Link: https://arxiv.org/abs/2406.02549
Authors: Nithin Gopalakrishnan Nair, Vishal M Patel
Keywords: training-free conditional generation, conditional generation, formidable tool, tool for training-free, training-free conditional
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Diffusion models have emerged as a formidable tool for training-free conditional generation. However, a key hurdle in inference-time guidance techniques is the need for compute-heavy backpropagation through the diffusion network for estimating the guidance direction. Moreover, these techniques often require handcrafted parameter tuning on a case-by-case basis. Although some recent works have introduced minimal compute methods for linear inverse problems, a generic lightweight guidance solution to both linear and non-linear guidance problems is still missing. To this end, we propose Dreamguider, a method that enables inference-time guidance without compute-heavy backpropagation through the diffusion network. The key idea is to regulate the gradient flow through a time-varying factor. Moreover, we propose an empirical guidance scale that works for a wide variety of tasks, hence removing the need for handcrafted parameter tuning. We further introduce an effective lightweight augmentation strategy that significantly boosts the performance during inference-time guidance. We present experiments using Dreamguider on multiple tasks across multiple datasets and models to show the effectiveness of the proposed modules. To facilitate further research, we will make the code public after the review process.

[CV-2] Open-YOLO 3D: Towards Fast and Accurate Open-Vocabulary 3D Instance Segmentation

Link: https://arxiv.org/abs/2406.02548
Authors: Mohamed El Amine Boudjoghra, Angela Dai, Jean Lahoud, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan, Fahad Shahbaz Khan
Keywords: show strong promise, high computation requirements, segmentation show strong, Recent works, high computation cost
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Recent works on open-vocabulary 3D instance segmentation show strong promise, but at the cost of slow inference speed and high computation requirements. This high computation cost is typically due to their heavy reliance on 3D clip features, which require computationally expensive 2D foundation models like Segment Anything (SAM) and CLIP for multi-view aggregation into 3D. As a consequence, this hampers their applicability in many real-world applications that require both fast and accurate predictions. To this end, we propose a fast yet accurate open-vocabulary 3D instance segmentation approach, named Open-YOLO 3D, that effectively leverages only 2D object detection from multi-view RGB images for open-vocabulary 3D instance segmentation. We address this task by generating class-agnostic 3D masks for objects in the scene and associating them with text prompts. We observe that the projection of class-agnostic 3D point cloud instances already holds instance information; thus, using SAM might only result in redundancy that unnecessarily increases the inference time. We empirically find that a better performance of matching text prompts to 3D masks can be achieved in a faster fashion with a 2D object detector. We validate our Open-YOLO 3D on two benchmarks, ScanNet200 and Replica, under two scenarios: (i) with ground truth masks, where labels are required for given object proposals, and (ii) with class-agnostic 3D proposals generated from a 3D proposal network. Our Open-YOLO 3D achieves state-of-the-art performance on both datasets while obtaining up to ∼16× speedup compared to the best existing method in literature. On ScanNet200 val. set, our Open-YOLO 3D achieves mean average precision (mAP) of 24.7% while operating at 22 seconds per scene. Code and model are available at this http URL.

[CV-3] Leveraging Visual Tokens for Extended Text Contexts in Multi-Modal Learning

Link: https://arxiv.org/abs/2406.02547
Authors: Alex Jinpeng Wang, Linjie Li, Yiqi Lin, Min Li, Lijuan Wang, Mike Zheng Shou
Keywords: substantial GPU memory, in-context text, in-context text length, multimodal model due, computational costs
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages. The website is this https URL

Abstract:Training models with longer in-context lengths is a significant challenge for multimodal models due to substantial GPU memory and computational costs. This exploratory study does not present state-of-the-art models; rather, it introduces an innovative method designed to increase in-context text length in multi-modality large language models (MLLMs) efficiently. We present Visualized In-Context Text Processing (VisInContext), which processes long in-context text using visual tokens. This technique significantly reduces GPU memory usage and floating point operations (FLOPs) for both the training and inference stages. For instance, our method expands the pre-training in-context text length from 256 to 2048 tokens with nearly the same FLOPs for a 56 billion parameter MOE model. Experimental results demonstrate that a model trained with VisInContext delivers superior performance on common downstream benchmarks for in-context few-shot evaluation. Additionally, VisInContext is complementary to existing methods for increasing in-context text length and enhances document understanding capabilities, showing great potential in document QA tasks and sequential document retrieval.

[CV-4] Enhancing Temporal Consistency in Video Editing by Reconstructing Videos with 3D Gaussian Splatting

Link: https://arxiv.org/abs/2406.02541
Authors: Inkyu Shin, Qihang Yu, Xiaohui Shen, In So Kweon, Kuk-Jin Yoon, Liang-Chieh Chen
Keywords: Recent advancements, achieving high temporal, video, high temporal consistency, Gaussian Splatting
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Recent advancements in zero-shot video diffusion models have shown promise for text-driven video editing, but challenges remain in achieving high temporal consistency. To address this, we introduce Video-3DGS, a 3D Gaussian Splatting (3DGS)-based video refiner designed to enhance temporal consistency in zero-shot video editors. Our approach utilizes a two-stage 3D Gaussian optimizing process tailored for editing dynamic monocular videos. In the first stage, Video-3DGS employs an improved version of COLMAP, referred to as MC-COLMAP, which processes original videos using a Masked and Clipped approach. For each video clip, MC-COLMAP generates the point clouds for dynamic foreground objects and complex backgrounds. These point clouds are utilized to initialize two sets of 3D Gaussians (Frg-3DGS and Bkg-3DGS) aiming to represent foreground and background views. Both foreground and background views are then merged with a 2D learnable parameter map to reconstruct full views. In the second stage, we leverage the reconstruction ability developed in the first stage to impose the temporal constraints on the video diffusion model. To demonstrate the efficacy of Video-3DGS on both stages, we conduct extensive experiments across two related tasks: Video Reconstruction and Video Editing. Video-3DGS trained with 3k iterations significantly improves video reconstruction quality (+3 PSNR, +7 PSNR increase) and training efficiency (x1.9, x4.5 times faster) over NeRF-based and 3DGS-based state-of-the-art methods on the DAVIS dataset, respectively. Moreover, it enhances video editing by ensuring temporal consistency across 58 dynamic monocular videos.

[CV-5] ViDiT-Q: Efficient and Accurate Quantization of Diffusion Transformers for Image and Video Generation

Link: https://arxiv.org/abs/2406.02540
Authors: Tianchen Zhao, Tongcheng Fang, Enshu Liu, Wan Rui, Widyadewi Soedarmadji, Shiyao Li, Zinan Lin, Guohao Dai, Shengen Yan, Huazhong Yang, Xuefei Ning, Yu Wang
Keywords: exhibited remarkable performance, generating realistic images, textual instructions, Diffusion transformers, quantizing diffusion transformers
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL

Abstract:Diffusion transformers (DiTs) have exhibited remarkable performance in visual generation tasks, such as generating realistic images or videos based on textual instructions. However, larger model sizes and multi-frame processing for video generation lead to increased computational and memory costs, posing challenges for practical deployment on edge devices. Post-Training Quantization (PTQ) is an effective method for reducing memory costs and computational complexity. When quantizing diffusion transformers, we find that applying existing diffusion quantization methods designed for U-Net faces challenges in preserving quality. After analyzing the major challenges for quantizing diffusion transformers, we design an improved quantization scheme, ViDiT-Q (Video and Image Diffusion Transformer Quantization), to address these issues. Furthermore, we identify that highly sensitive layers and timesteps hinder quantization for lower bit-widths. To tackle this, we improve ViDiT-Q with a novel metric-decoupled mixed-precision quantization method (ViDiT-Q-MP). We validate the effectiveness of ViDiT-Q across a variety of text-to-image and video models. While baseline quantization methods fail at W8A8 and produce unreadable content at W4A8, ViDiT-Q achieves lossless W8A8 quantization. ViDiT-Q-MP achieves W4A8 with negligible visual quality degradation, resulting in a 2.5x memory optimization and a 1.5x latency speedup.
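
For readers unfamiliar with the notation, W8A8 means 8-bit weights and 8-bit activations, and W4A8 means 4-bit weights with 8-bit activations. The sketch below shows plain symmetric per-tensor quantization, the kind of generic baseline that post-training quantization starts from. It is not the ViDiT-Q scheme, and the sample weights are made up for illustration.

```python
def quantize(values, bits=8):
    """Symmetric per-tensor quantization: map floats to signed integers."""
    qmax = 2 ** (bits - 1) - 1              # 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Map the integers back to approximate floats."""
    return [x * scale for x in q]

weights = [0.02, -0.51, 0.33, 1.27, -0.90]   # hypothetical weight values
q8, s8 = quantize(weights, bits=8)
q4, s4 = quantize(weights, bits=4)

# Worst-case reconstruction error at each bit-width.
err8 = max(abs(w - r) for w, r in zip(weights, dequantize(q8, s8)))
err4 = max(abs(w - r) for w, r in zip(weights, dequantize(q4, s4)))
```

The rounding error is bounded by half the scale, and the scale grows as the bit-width shrinks, which is one reason lower bit-widths are harder to reach without quality loss.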

[CV-6] Parrot: Multilingual Visual Instruction Tuning

Link: https://arxiv.org/abs/2406.02539
Authors: Hai-Long Sun, Da-Wei Zhou, Yang Li, Shiyin Lu, Chao Yi, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, De-Chuan Zhan, Han-Jia Ye
Keywords: Large Language Models, Multimodal Large Language, artificial general intelligence, Language Models, Multimodal Large
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:The rapid development of Multimodal Large Language Models (MLLMs) like GPT-4V has marked a significant step towards artificial general intelligence. Existing methods mainly focus on aligning vision encoders with LLMs through supervised fine-tuning (SFT) to endow LLMs with multimodal abilities, making MLLMs’ inherent ability to react to multiple languages progressively deteriorate as the training process evolves. We empirically find that the imbalanced SFT datasets, primarily composed of English-centric image-text pairs, lead to significantly reduced performance in non-English languages. This is due to the failure of aligning the vision encoder and LLM with multilingual tokens during the SFT process. In this paper, we introduce Parrot, a novel method that utilizes textual guidance to drive visual token alignment at the language level. Parrot makes the visual tokens condition on diverse language inputs and uses Mixture-of-Experts (MoE) to promote the alignment of multilingual tokens. Specifically, to enhance non-English visual tokens alignment, we compute the cross-attention using the initial visual features and textual embeddings, the result of which is then fed into the MoE router to select the most relevant experts. The selected experts subsequently convert the initial visual tokens into language-specific visual tokens. Moreover, considering the current lack of benchmarks for evaluating multilingual capabilities within the field, we collect and make available a Massive Multilingual Multimodal Benchmark which includes 6 languages, 15 categories, and 12,000 questions, named as MMMB. Our method not only demonstrates state-of-the-art performance on multilingual MMBench and MMMB, but also excels across a broad range of multimodal tasks. Both the source code and the training dataset of Parrot will be made publicly available.

[CV-7] TopViewRS: Vision-Language Models as Top-View Spatial Reasoners

Link: https://arxiv.org/abs/2406.02537
Authors: Chengzu Li, Caiqi Zhang, Han Zhou, Nigel Collier, Anna Korhonen, Ivan Vulić
Keywords: Top-view perspective denotes, large Vision-Language Models, perspective denotes, denotes a typical, vital for localization
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 9 pages, 3 figures, 3 tables (21 pages, 4 figures, 15 tables including references and appendices)

Abstract:Top-view perspective denotes a typical way in which humans read and reason over different types of maps, and it is vital for localization and navigation of humans as well as of ‘non-human’ agents, such as the ones backed by large Vision-Language Models (VLMs). Nonetheless, spatial reasoning capabilities of modern VLMs remain unattested and underexplored. In this work, we thus study their capability to understand and reason over spatial relations from the top view. The focus on top view also enables controlled evaluations at different granularity of spatial reasoning; we clearly disentangle different abilities (e.g., recognizing particular objects versus understanding their relative positions). We introduce the TopViewRS (Top-View Reasoning in Space) dataset, consisting of 11,384 multiple-choice questions with either realistic or semantic top-view map as visual input. We then use it to study and evaluate VLMs across 4 perception and reasoning tasks with different levels of complexity. Evaluation of 10 representative open- and closed-source VLMs reveals a gap of more than 50% compared to average human performance, and it is even lower than the random baseline in some cases. Although additional experiments show that Chain-of-Thought reasoning can boost model capabilities by 5.82% on average, the overall performance of VLMs remains limited. Our findings underscore the critical need for enhanced model capability in top-view spatial reasoning and set a foundation for further research towards human-level proficiency of VLMs in real-world multimodal tasks.

[CV-8] Enhancing 2D Representation Learning with a 3D Prior

Link: https://arxiv.org/abs/2406.02535
Authors: Mehmet Aygün, Prithviraj Dhar, Zhicheng Yan, Oisin Mac Aodha, Rakesh Ranjan
Keywords: fundamental task, task in computer, data, Learning
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Learning robust and effective representations of visual data is a fundamental task in computer vision. Traditionally, this is achieved by training models with labeled data which can be expensive to obtain. Self-supervised learning attempts to circumvent the requirement for labeled data by learning representations from raw unlabeled visual data alone. However, unlike humans who obtain rich 3D information from their binocular vision and through motion, the majority of current self-supervised methods are tasked with learning from monocular 2D image collections. This is noteworthy as it has been demonstrated that shape-centric visual processing is more robust compared to texture-biased automated methods. Inspired by this, we propose a new approach for strengthening existing self-supervised methods by explicitly enforcing a strong 3D structural prior directly into the model during training. Through experiments, across a range of datasets, we demonstrate that our 3D aware representations are more robust compared to conventional self-supervised baselines.

[CV-9] SatSplatYOLO: 3D Gaussian Splatting-based Virtual Object Detection Ensembles for Satellite Feature Recognition

Link: https://arxiv.org/abs/2406.02533
Authors: Van Minh Nguyen, Emma Sandidge, Trupti Mahendrakar, Ryan T. White
Keywords: active debris removal, On-orbit servicing, inspection of spacecraft, debris removal, active debris
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Missions such as on-orbit servicing (OOS), inspection of spacecraft, and active debris removal (ADR) require precise rendezvous and proximity operations in the vicinity of non-cooperative, possibly unknown, resident space objects. Safety concerns with manned missions and lag times with ground-based control necessitate complete autonomy. In this article, we present an approach for mapping geometries and high-confidence detection of components of unknown, non-cooperative satellites on orbit. We implement accelerated 3D Gaussian splatting to learn a 3D representation of the satellite, render virtual views of the target, and ensemble the YOLOv5 object detector over the virtual views, resulting in reliable, accurate, and precise satellite component detections. The full pipeline is capable of running on-board and stands to enable downstream machine intelligence tasks necessary for autonomous guidance, navigation, and control tasks.

[CV-10] DDGS-CT: Direction-Disentangled Gaussian Splatting for Realistic Volume Rendering

Link: https://arxiv.org/abs/2406.02518
Authors: Zhongpai Gao, Benjamin Planche, Meng Zheng, Xiao Chen, Terrence Chen, Ziyan Wu
Keywords: physics-based Monte Carlo, Digitally reconstructed radiographs, heavy physics-based Monte, Monte Carlo methods, Monte Carlo
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Abstract:Digitally reconstructed radiographs (DRRs) are simulated 2D X-ray images generated from 3D CT volumes, widely used in preoperative settings but limited in intraoperative applications due to computational bottlenecks, especially for accurate but heavy physics-based Monte Carlo methods. While analytical DRR renderers offer greater efficiency, they overlook anisotropic X-ray image formation phenomena, such as Compton scattering. We present a novel approach that marries realistic physics-inspired X-ray simulation with efficient, differentiable DRR generation using 3D Gaussian splatting (3DGS). Our direction-disentangled 3DGS (DDGS) method separates the radiosity contribution into isotropic and direction-dependent components, approximating complex anisotropic interactions without intricate runtime simulations. Additionally, we adapt the 3DGS initialization to account for tomography data properties, enhancing accuracy and efficiency. Our method outperforms state-of-the-art techniques in image accuracy. Furthermore, our DDGS shows promise for intraoperative applications and inverse problems such as pose registration, delivering superior registration accuracy and runtime performance compared to analytical DRR methods.

[CV-11] V-Express: Conditional Dropout for Progressive Training of Portrait Video Generation

Link: https://arxiv.org/abs/2406.02511
Authors: Cong Wang, Kuan Tian, Jun Zhang, Yonghang Guan, Feng Luo, Fei Shen, Zhiwei Jiang, Qing Gu, Xiao Han, Wei Yang
Keywords: increasingly prevalent, portrait video generation, portrait, signals, reference image
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:In the field of portrait video generation, the use of single images to generate portrait videos has become increasingly prevalent. A common approach involves leveraging generative models to enhance adapters for controlled generation. However, control signals (e.g., text, audio, reference image, pose, depth map, etc.) can vary in strength. Among these, weaker conditions often struggle to be effective due to interference from stronger conditions, posing a challenge in balancing these conditions. In our work on portrait video generation, we identified audio signals as particularly weak, often overshadowed by stronger signals such as facial pose and reference image. However, direct training with weak signals often leads to difficulties in convergence. To address this, we propose V-Express, a simple method that balances different control signals through the progressive training and the conditional dropout operation. Our method gradually enables effective control by weak conditions, thereby achieving generation capabilities that simultaneously take into account the facial pose, reference image, and audio. The experimental results demonstrate that our method can effectively generate portrait videos controlled by audio. Furthermore, a potential solution is provided for the simultaneous and effective use of conditions of varying strengths.
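
One plausible reading of the conditional dropout operation mentioned above is that, during training, each control signal is randomly withheld with its own probability, so the model cannot lean only on the strong signals. The sketch below illustrates that idea. The drop rates, signal names, and file names are hypothetical, and the abstract does not state the actual schedule used by V-Express.

```python
import random

def apply_conditional_dropout(conditions, drop_probs, rng):
    """Return a copy of `conditions` where each signal is replaced by None
    (i.e. dropped) with its own probability."""
    kept = {}
    for name, value in conditions.items():
        dropped = rng.random() < drop_probs.get(name, 0.0)
        kept[name] = None if dropped else value
    return kept

# Hypothetical rates: drop the strong signals often, never drop the weak audio signal.
DROP_PROBS = {"reference_image": 0.5, "pose": 0.5, "audio": 0.0}

rng = random.Random(0)
batch = {"reference_image": "ref.png", "pose": "pose.npy", "audio": "clip.wav"}
samples = [apply_conditional_dropout(batch, DROP_PROBS, rng) for _ in range(1000)]

# Count how often each signal survived across the simulated training steps.
audio_kept = sum(s["audio"] is not None for s in samples)
pose_kept = sum(s["pose"] is not None for s in samples)
```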

[CV-12] CamCo: Camera-Controllable 3D-Consistent Image-to-Video Generation

Link: https://arxiv.org/abs/2406.02509
Authors: Dejia Xu, Weili Nie, Chao Liu, Sifei Liu, Jan Kautz, Zhangyang Wang, Arash Vahdat
Keywords: Recently video diffusion, expressive generative tools, content creation readily, high-quality video content, video content creation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Recently video diffusion models have emerged as expressive generative tools for high-quality video content creation readily available to general users. However, these models often do not offer precise control over camera poses for video generation, limiting the expression of cinematic language and user control. To address this issue, we introduce CamCo, which allows fine-grained Camera pose Control for image-to-video generation. We equip a pre-trained image-to-video generator with accurately parameterized camera pose input using Plücker coordinates. To enhance 3D consistency in the videos produced, we integrate an epipolar attention module in each attention block that enforces epipolar constraints to the feature maps. Additionally, we fine-tune CamCo on real-world videos with camera poses estimated through structure-from-motion algorithms to better synthesize object motion. Our experiments show that CamCo significantly improves 3D consistency and camera control capabilities compared to previous models while effectively generating plausible object motion. Project page: this https URL
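
Plücker coordinates, mentioned above as the camera pose parameterization, describe a ray by its direction and its moment (the cross product of a point on the ray with the direction). The sketch below builds that 6D embedding for a single ray. The ordering (moment first, then direction) is one common convention and an assumption here, since the abstract does not say which layout CamCo uses.

```python
import math

def cross(a, b):
    """Cross product of two 3D vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def plucker_ray(origin, direction):
    """Return the 6D Plücker embedding (moment, unit direction) of a ray."""
    norm = math.sqrt(sum(c * c for c in direction))
    d = tuple(c / norm for c in direction)
    m = cross(origin, d)   # moment: the same for any point chosen along the ray
    return m + d

# A ray through the camera centre (0, 0, 1) pointing along +x.
ray = plucker_ray(origin=(0.0, 0.0, 1.0), direction=(2.0, 0.0, 0.0))

# Sliding the origin along the ray leaves the embedding unchanged.
same_ray = plucker_ray(origin=(3.0, 0.0, 1.0), direction=(1.0, 0.0, 0.0))
```

Because the embedding does not depend on which point of the ray is used, it gives every pixel a unique, pose-dependent code that a network can consume directly.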

[CV-13] Guiding a Diffusion Model with a Bad Version of Itself

Link: https://arxiv.org/abs/2406.02507
Authors: Tero Karras, Miika Aittala, Tuomas Kynkäänniemi, Jaakko Lehtinen, Timo Aila, Samuli Laine
Keywords: results align, primary axes, axes of interest, interest in image-generating, class label
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)

Abstract:The primary axes of interest in image-generating diffusion models are image quality, the amount of variation in the results, and how well the results align with a given condition, e.g., a class label or a text prompt. The popular classifier-free guidance approach uses an unconditional model to guide a conditional model, leading to simultaneously better prompt alignment and higher-quality images at the cost of reduced variation. These effects seem inherently entangled, and thus hard to control. We make the surprising observation that it is possible to obtain disentangled control over image quality without compromising the amount of variation by guiding generation using a smaller, less-trained version of the model itself rather than an unconditional model. This leads to significant improvements in ImageNet generation, setting record FIDs of 1.01 for 64x64 and 1.25 for 512x512, using publicly available networks. Furthermore, the method is also applicable to unconditional diffusion models, drastically improving their quality.
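
Classifier-free guidance extrapolates from the unconditional prediction through the conditional one. The idea in the abstract above can be read as keeping that extrapolation but swapping the unconditional model for a smaller, less-trained version of the main model. The sketch below shows the arithmetic on plain lists; the prediction vectors and the guidance weight are made-up numbers, and a real sampler would apply this at every denoising step.

```python
def guide(main_pred, guiding_pred, weight):
    """Extrapolate from the guiding prediction through the main prediction.

    weight = 1 returns the main prediction unchanged; weight > 1 pushes the
    result further away from the guiding model's output.
    """
    return [g + weight * (m - g) for m, g in zip(main_pred, guiding_pred)]

main = [0.8, -0.2, 0.1]   # hypothetical denoiser output of the full model
weak = [0.5, 0.0, 0.1]    # hypothetical output of a smaller, less-trained model
guided = guide(main, weak, weight=2.0)
```

Where the two models agree (the last component here), guidance changes nothing; where they disagree, the result moves further in the direction the stronger model prefers.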

[CV-14] An Open-Source Tool for Mapping War Destruction at Scale in Ukraine using Sentinel-1 Time Series

Link: https://arxiv.org/abs/2406.02506
Authors: Olivier Dietrich, Torben Peters, Vivien Sainte Fare Garnot, Valerie Sticher, Thao Ton-That Whelan, Konrad Schindler, Jan Dirk Wegner
Keywords: effectively assist populations, Access to detailed, effectively assist, assist populations, populations most affected
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Access to detailed war impact assessments is crucial for humanitarian organizations to effectively assist populations most affected by armed conflicts. However, maintaining a comprehensive understanding of the situation on the ground is challenging, especially in conflicts that cover vast territories and extend over long periods. This study presents a scalable and transferable method for estimating war-induced damage to buildings. We first train a machine learning model to output pixel-wise probability of destruction from Synthetic Aperture Radar (SAR) satellite image time series, leveraging existing, manual damage assessments as ground truth and cloud-based geospatial analysis tools for large-scale inference. We further post-process these assessments using open building footprints to obtain a final damage estimate per building. We introduce an accessible, open-source tool that allows users to adjust the confidence interval based on their specific requirements and use cases. Our approach enables humanitarian organizations and other actors to rapidly screen large geographic regions for war impacts. We provide two publicly accessible dashboards: a Ukraine Damage Explorer to dynamically view our pre-computed estimates, and a Rapid Damage Mapping Tool to easily run our method and produce custom maps.

[CV-15] GenS: Generalizable Neural Surface Reconstruction from Multi-View Images

Link: https://arxiv.org/abs/2406.02495
Authors: Rui Peng, Xiaodong Gu, Luyang Tang, Shihe Shen, Fanqi Yu, Ronggang Wang
Keywords: signed distance function, Combining the signed, differentiable volume rendering, distance function, signed distance
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: NeurIPS 2023 Accepted

Abstract:Combining the signed distance function (SDF) and differentiable volume rendering has emerged as a powerful paradigm for surface reconstruction from multi-view images without 3D supervision. However, current methods are impeded by requiring long-time per-scene optimizations and cannot generalize to new scenes. In this paper, we present GenS, an end-to-end generalizable neural surface reconstruction model. Unlike coordinate-based methods that train a separate network for each scene, we construct a generalized multi-scale volume to directly encode all scenes. Compared with existing solutions, our representation is more powerful and can recover high-frequency details while maintaining global smoothness. Meanwhile, we introduce a multi-scale feature-metric consistency to impose the multi-view consistency in a more discriminative multi-scale feature space, which is robust to the failures of the photometric consistency. Moreover, the learnable feature can be self-enhanced to continuously improve the matching accuracy and mitigate aggregation ambiguity. Furthermore, we design a view contrast loss to force the model to be robust to those regions covered by few viewpoints through distilling the geometric prior from dense input to sparse input. Extensive experiments on popular benchmarks show that our model can generalize well to new scenes and outperform existing state-of-the-art methods, even those employing ground-truth depth supervision. Code is available at this https URL.

[CV-16] Stable-Pose: Leveraging Transformers for Pose-Guided Text-to-Image Generation

Link: https://arxiv.org/abs/2406.02485
Authors: Jiajun Wang, Morteza Ghahremani, Yitong Li, Björn Ommer, Christian Wachinger
Keywords: generating high-quality visual, high-quality visual content, shown impressive performance, shown impressive, generating high-quality
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Controllable text-to-image (T2I) diffusion models have shown impressive performance in generating high-quality visual content through the incorporation of various conditions. Current methods, however, exhibit limited performance when guided by skeleton human poses, especially in complex pose conditions such as side or rear perspectives of human figures. To address this issue, we present Stable-Pose, a novel adapter model that introduces a coarse-to-fine attention masking strategy into a vision Transformer (ViT) to gain accurate pose guidance for T2I models. Stable-Pose is designed to adeptly handle pose conditions within pre-trained Stable Diffusion, providing a refined and efficient way of aligning pose representation during image synthesis. We leverage the query-key self-attention mechanism of ViTs to explore the interconnections among different anatomical parts in human pose skeletons. Masked pose images are used to smoothly refine the attention maps based on target pose-related features in a hierarchical manner, transitioning from coarse to fine levels. Additionally, our loss function is formulated to allocate increased emphasis to the pose region, thereby augmenting the model’s precision in capturing intricate pose details. We assessed the performance of Stable-Pose across five public datasets under a wide range of indoor and outdoor human pose scenarios. Stable-Pose achieved an AP score of 57.1 in the LAION-Human dataset, marking around 13% improvement over the established technique ControlNet. The project link and code are available at this https URL.

[CV-17] DL-KDD: Dual-Light Knowledge Distillation for Action Recognition in the Dark

Link: https://arxiv.org/abs/2406.02468
Authors: Chi-Jui Chang, Oscar Tai-Yuan Chen, Vincent S. Tseng
Keywords: video, Human action recognition, action recognition, computer vision, original
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Human action recognition in dark videos is a challenging task for computer vision. Recent research focuses on applying dark enhancement methods to improve the visibility of the video. However, such video processing results in the loss of critical information in the original (un-enhanced) video. Conversely, traditional two-stream methods are capable of learning information from both original and processed videos, but it can lead to a significant increase in the computational cost during the inference phase in the task of video classification. To address these challenges, we propose a novel teacher-student video classification framework, named Dual-Light KnowleDge Distillation for Action Recognition in the Dark (DL-KDD). This framework enables the model to learn from both original and enhanced video without introducing additional computational cost during inference. Specifically, DL-KDD utilizes the strategy of knowledge distillation during training. The teacher model is trained with enhanced video, and the student model is trained with both the original video and the soft target generated by the teacher model. This teacher-student framework allows the student model to predict action using only the original input video during inference. In our experiments, the proposed DL-KDD framework outperforms state-of-the-art methods on the ARID, ARID V1.5, and Dark-48 datasets. We achieve the best performance on each dataset and up to a 4.18% improvement on Dark-48, using only original video inputs, thus avoiding the use of two-stream framework or enhancement modules for inference. We further validate the effectiveness of the distillation strategy in ablative experiments. The results highlight the advantages of our knowledge distillation framework in dark human action recognition.
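
The teacher-student setup described above trains the student on both the hard labels and the teacher's soft targets. A minimal sketch of the standard temperature-scaled distillation loss is given below. It is the textbook formulation rather than the exact objective of DL-KDD, and the logits, `alpha`, and `temperature` values are assumptions for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, label,
                      alpha=0.5, temperature=2.0):
    """alpha * cross-entropy(hard label) + (1 - alpha) * T^2 * KL(teacher || student)."""
    hard = -math.log(softmax(student_logits)[label])
    t_soft = softmax(teacher_logits, temperature)
    s_soft = softmax(student_logits, temperature)
    soft = sum(t * math.log(t / s) for t, s in zip(t_soft, s_soft) if t > 0)
    return alpha * hard + (1 - alpha) * temperature ** 2 * soft

# Hypothetical logits: the teacher saw the enhanced video, the student the original.
teacher = [2.0, 0.0, -1.0]
student = [0.0, 2.0, -1.0]
loss = distillation_loss(student, teacher, label=0)
```

At inference time only the student is run, which is why this kind of training adds no cost at prediction time.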

[CV-18] An Empirical Study into Clustering of Unseen Datasets with Self-Supervised Encoders

Link: https://arxiv.org/abs/2406.02465
Authors: Scott C. Lowe, Joakim Bruslund Haurum, Sageev Oore, Thomas B. Moeslund, Graham W. Taylor
Keywords: pretrained models generalize, models, models generalize, Abstract, pretrained
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Can pretrained models generalize to new datasets without any retraining? We deploy pretrained image models on datasets they were not trained for, and investigate whether their embeddings form meaningful clusters. Our suite of benchmarking experiments use encoders pretrained solely on ImageNet-1k with either supervised or self-supervised training techniques, deployed on image datasets that were not seen during training, and clustered with conventional clustering algorithms. This evaluation provides new insights into the embeddings of self-supervised models, which prioritize different features to supervised models. Supervised encoders typically offer more utility than SSL encoders within the training domain, and vice-versa far outside of it, however, fine-tuned encoders demonstrate the opposite trend. Clustering provides a way to evaluate the utility of self-supervised learned representations orthogonal to existing methods such as kNN. Additionally, we find the silhouette score when measured in a UMAP-reduced space is highly correlated with clustering performance, and can therefore be used as a proxy for clustering performance on data with no ground truth labels. Our code implementation is available at this https URL.
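
The silhouette score mentioned in this abstract needs no ground-truth labels, only the points and their cluster assignments. The sketch below is a plain-Python version of the standard definition with Euclidean distance. It omits the UMAP reduction step the authors apply first, and the function name is an assumption made for illustration.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)
```

Scores near 1 indicate compact, well-separated clusters, and negative scores indicate points that sit closer to another cluster than to their own.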

[CV-19] Learning Image Priors through Patch-based Diffusion Models for Solving Inverse Problems

Link: https://arxiv.org/abs/2406.02462
Authors: Jason Hu, Bowen Song, Xiaojian Xu, Liyue Shen, Jeffrey A. Fessler
Keywords: underlying data distribution, process is computationally, computationally expensive, expensive and requires, requires lots
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Diffusion models can learn strong image priors from underlying data distribution and use them to solve inverse problems, but the training process is computationally expensive and requires lots of data. Such bottlenecks prevent most existing works from being feasible for high-dimensional and high-resolution data such as 3D images. This paper proposes a method to learn an efficient data prior for the entire image by training diffusion models only on patches of images. Specifically, we propose a patch-based position-aware diffusion inverse solver, called PaDIS, where we obtain the score function of the whole image through scores of patches and their positional encoding and utilize this as the prior for solving inverse problems. First of all, we show that this diffusion model achieves an improved memory efficiency and data efficiency while still maintaining the capability to generate entire images via positional encoding. Additionally, the proposed PaDIS model is highly flexible and can be plugged in with different diffusion inverse solvers (DIS). We demonstrate that the proposed PaDIS approach enables solving various inverse problems in both natural and medical image domains, including CT reconstruction, deblurring, and superresolution, given only patch-based priors. Notably, PaDIS outperforms previous DIS methods trained on entire image priors in the case of limited training data, demonstrating the data efficiency of our proposed approach by learning patch-based prior.

[CV-20] RoomTex: Texturing Compositional Indoor Scenes via Iterative Inpainting

Link: https://arxiv.org/abs/2406.02461
Authors: Qi Wang, Ruijie Lu, Xudong Xu, Jingbo Wang, Michael Yu Wang, Bo Dai, Gang Zeng, Dan Xu
Keywords: advancement of diffusion, diffusion models, models has pushed, pushed the boundary, object generation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The advancement of diffusion models has pushed the boundary of text-to-3D object generation. While it is straightforward to composite objects into a scene with reasonable geometry, it is nontrivial to texture such a scene perfectly due to style inconsistency and occlusions between objects. To tackle these problems, we propose a coarse-to-fine 3D scene texturing framework, referred to as RoomTex, to generate high-fidelity and style-consistent textures for untextured compositional scene meshes. In the coarse stage, RoomTex first unwraps the scene mesh to a panoramic depth map and leverages ControlNet to generate a room panorama, which is regarded as the coarse reference to ensure the global texture consistency. In the fine stage, based on the panoramic image and perspective depth maps, RoomTex will refine and texture every single object in the room iteratively along a series of selected camera views, until this object is completely painted. Moreover, we propose to maintain superior alignment between RGB and depth spaces via subtle edge detection methods. Extensive experiments show our method is capable of generating high-quality and diverse room textures, and more importantly, supporting interactive fine-grained texture control and flexible scene editing thanks to our inpainting-based framework and compositional mesh input. Our project page is available at this https URL.

[CV-21] Generative Active Learning for Long-tailed Instance Segmentation

Link: https://arxiv.org/abs/2406.02435
Authors: Muzhi Zhu, Chengxiang Fan, Hao Chen, Yang Liu, Weian Mao, Xiaogang Xu, Chunhua Shen
Keywords: large-scale language-image generative, gained widespread attention, language-image generative models, generated data, large-scale language-image
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICML 2024

Abstract:Recently, large-scale language-image generative models have gained widespread attention and many works have utilized generated data from these models to further enhance the performance of perception tasks. However, not all generated data can positively impact downstream models, and these methods do not thoroughly explore how to better select and utilize generated data. On the other hand, there is still a lack of research oriented towards active learning on generated data. In this paper, we explore how to perform active learning specifically for generated data in the long-tailed instance segmentation task. Subsequently, we propose BSGAL, a new algorithm that online estimates the contribution of the generated data based on gradient cache. BSGAL can handle unlimited generated data and complex downstream segmentation tasks effectively. Experiments show that BSGAL outperforms the baseline approach and effectually improves the performance of long-tailed segmentation. Our code can be found at this https URL.

[CV-22] CoNav: A Benchmark for Human-Centered Collaborative Navigation

Link: https://arxiv.org/abs/2406.02425
Authors: Changhao Li, Xinyu Sun, Peihao Chen, Jugang Fan, Zixu Wang, Yanxia Liu, Jinhui Zhu, Chuang Gan, Mingkui Tan
Keywords: robot intelligently assists, Human-robot collaboration, appealing objective, human, robot intelligently
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

Abstract:Human-robot collaboration, in which the robot intelligently assists the human with the upcoming task, is an appealing objective. To achieve this goal, the agent needs to be equipped with a fundamental collaborative navigation ability, where the agent should reason human intention by observing human activities and then navigate to the human’s intended destination in advance of the human. However, this vital ability has not been well studied in previous literature. To fill this gap, we propose a collaborative navigation (CoNav) benchmark. Our CoNav tackles the critical challenge of constructing a 3D navigation environment with realistic and diverse human activities. To achieve this, we design a novel LLM-based humanoid animation generation framework, which is conditioned on both text descriptions and environmental context. The generated humanoid trajectory obeys the environmental context and can be easily integrated into popular simulators. We empirically find that the existing navigation methods struggle in CoNav task since they neglect the perception of human intention. To solve this problem, we propose an intention-aware agent for reasoning both long-term and short-term human intention. The agent predicts navigation action based on the predicted intention and panoramic observation. The emergent agent behavior including observing humans, avoiding human collision, and navigation reveals the efficiency of the proposed datasets and agents.

[CV-23] Decoupling of neural network calibration measures

Link: https://arxiv.org/abs/2406.02411
Authors: Dominik Werner Wolf, Prasannavenkatesh Balaji, Alexander Braun, Markus Ulrich
Keywords: autonomous driving systems, safeguarding autonomous driving, deep neural networks, Uncertainty Calibration Error, neural network calibration
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to the German Conference on Pattern Recognition (GCPR) 2024

Abstract:A lot of effort is currently invested in safeguarding autonomous driving systems, which heavily rely on deep neural networks for computer vision. We investigate the coupling of different neural network calibration measures with a special focus on the Area Under the Sparsification Error curve (AUSE) metric. We elaborate on the well-known inconsistency in determining optimal calibration using the Expected Calibration Error (ECE) and we demonstrate similar issues for the AUSE, the Uncertainty Calibration Score (UCS), as well as the Uncertainty Calibration Error (UCE). We conclude that the current methodologies leave a degree of freedom, which prevents a unique model calibration for the homologation of safety-critical functionalities. Furthermore, we propose the AUSE as an indirect measure for the residual uncertainty, which is irreducible for a fixed network architecture and is driven by the stochasticity in the underlying data generation process (aleatoric contribution) as well as the limitation in the hypothesis space (epistemic contribution).
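
For readers unfamiliar with the Expected Calibration Error (ECE) discussed in this abstract, the sketch below computes it with equal-width confidence bins. One commonly cited source of ambiguity in ECE is the choice of binning; whether that is the specific inconsistency the authors analyze is not stated in the abstract. The function name and the default of 10 bins are assumptions made for illustration.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins over the confidence range [0, 1].

    confidences: predicted confidence of the chosen class for each sample.
    correct: booleans, True where the prediction matched the label.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))

    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, h in bucket if h) / len(bucket)
        # Each bin contributes its |accuracy - confidence| gap, weighted by size.
        ece += len(bucket) / n * abs(accuracy - avg_conf)
    return ece
```

As a usage example, the two predictions `[0.55, 0.95]` with outcomes `[True, False]` give an ECE of about 0.25 with a single bin and about 0.70 with ten bins, so the same model can look better or worse calibrated depending only on the binning.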

[CV-24] WE-GS: An In-the-wild Efficient 3D Gaussian Representation for Unconstrained Photo Collections

Link: https://arxiv.org/abs/2406.02407
Authors: Yuze Wang, Junyi Wang, Yue Qi
Keywords: unconstrained photo collections, computer graphics, challenging in computer, photo collections, Gaussian Splatting
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Our project page is available at this https URL

Abstract:Novel View Synthesis (NVS) from unconstrained photo collections is challenging in computer graphics. Recently, 3D Gaussian Splatting (3DGS) has shown promise for photorealistic and real-time NVS of static scenes. Building on 3DGS, we propose an efficient point-based differentiable rendering framework for scene reconstruction from photo collections. Our key innovation is a residual-based spherical harmonic coefficients transfer module that adapts 3DGS to varying lighting conditions and photometric post-processing. This lightweight module can be pre-computed and ensures efficient gradient propagation from rendered images to 3D Gaussian attributes. Additionally, we observe that the appearance encoder and the transient mask predictor, the two most critical parts of NVS from unconstrained photo collections, can be mutually beneficial. We introduce a plug-and-play lightweight spatial attention module to simultaneously predict transient occluders and latent appearance representation for each image. After training and preprocessing, our method aligns with the standard 3DGS format and rendering pipeline, facilitating seamlessly integration into various 3DGS applications. Extensive experiments on diverse datasets show our approach outperforms existing approaches on the rendering quality of novel view and appearance synthesis with high converge and rendering speed.

[CV-25] GrootVL: Tree Topology is All You Need in State Space Model

Link: https://arxiv.org/abs/2406.02395
Authors: Yicheng Xiao, Lin Song, Shaoli Huang, Jiangshan Wang, Siyu Song, Yixiao Ge, Xiu Li, Ying Shan
Keywords: employing recursively propagated, comparable to Transformer, recursively propagated features, employing recursively, superior efficiency
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: The code is available at this https URL

Abstract:The state space models, employing recursively propagated features, demonstrate strong representation capabilities comparable to Transformer models and superior efficiency. However, constrained by the inherent geometric constraints of sequences, it still falls short in modeling long-range dependencies. To address this issue, we propose the GrootVL network, which first dynamically generates a tree topology based on spatial relationships and input features. Then, feature propagation is performed based on this graph, thereby breaking the original sequence constraints to achieve stronger representation capabilities. Additionally, we introduce a linear complexity dynamic programming algorithm to enhance long-range interactions without increasing computational cost. GrootVL is a versatile multimodal framework that can be applied to both visual and textual tasks. Extensive experiments demonstrate that our method significantly outperforms existing structured state space models on image classification, object detection and segmentation. Besides, by fine-tuning large language models, our approach achieves consistent improvements in multiple textual tasks at minor training cost.

[CV-26] Low-Rank Adaption on Transformer-based Oriented Object Detector for Satellite Onboard Processing of Remote Sensing Images

Link: https://arxiv.org/abs/2406.02385
Authors: Xinyang Pu, Feng Xu
Keywords: Deep learning models, remote sensing images, Deep learning, conserving communication resources, satellite onboard real-time
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Deep learning models in satellite onboard enable real-time interpretation of remote sensing images, reducing the need for data transmission to the ground and conserving communication resources. As satellite numbers and observation frequencies increase, the demand for satellite onboard real-time image interpretation grows, highlighting the expanding importance and development of this technology. However, updating the extensive parameters of models deployed on the satellites for spaceborne object detection model is challenging due to the limitations of uplink bandwidth in wireless satellite communications. To address this issue, this paper proposes a method based on parameter-efficient fine-tuning technology with low-rank adaptation (LoRA) module. It involves training low-rank matrix parameters and integrating them with the original model’s weight matrix through multiplication and summation, thereby fine-tuning the model parameters to adapt to new data distributions with minimal weight updates. The proposed method combines parameter-efficient fine-tuning with full fine-tuning in the parameter update strategy of the oriented object detection algorithm architecture. This strategy enables model performance improvements close to full fine-tuning effects with minimal parameter updates. In addition, low rank approximation is conducted to pick an optimal rank value for LoRA matrices. Extensive experiments verify the effectiveness of the proposed method. By fine-tuning and updating only 12.4% of the model’s total parameters, it is able to achieve 97% to 100% of the performance of full fine-tuning models. Additionally, the reduced number of trainable parameters accelerates model training iterations and enhances the generalization and robustness of the oriented object detection model. The source code is available at: this https URL.
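
The "multiplication and summation" step in this abstract corresponds to the standard LoRA merge, W' = W + (alpha / r) · B · A, where only the small matrices B and A need to be trained and transmitted. The sketch below shows that merge and the resulting per-layer parameter saving in plain Python. The helper names, the scaling factor `alpha`, and the example layer sizes are assumptions made for illustration; the paper's 12.4% figure comes from its own mixed update strategy and is not reproduced here.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def merge_lora(weight, lora_b, lora_a, alpha):
    """Return W + (alpha / r) * B @ A, where r is the LoRA rank.

    weight: d_out x d_in, lora_b: d_out x r, lora_a: r x d_in.
    """
    rank = len(lora_a)
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


def lora_parameter_fraction(d_out, d_in, rank):
    """Fraction of a layer's parameters that LoRA trains (and uplinks)."""
    return rank * (d_out + d_in) / (d_out * d_in)
```

For a hypothetical 1024 x 1024 layer with rank 8, `lora_parameter_fraction(1024, 1024, 8)` is 0.015625, so roughly 1.6% of that layer's weights would need to cross the uplink.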

[CV-27] Learning to Edit Visual Programs with Self-Supervision

Link: https://arxiv.org/abs/2406.02383
Authors: R. Kenny Jones, Renhao Zhang, Aditya Ganeshan, Daniel Ritchie
Keywords: design a system, system that learns, edit network, edit, visual programs
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
Comments:

Abstract:We design a system that learns how to edit visual programs. Our edit network consumes a complete input program and a visual target. From this input, we task our network with predicting a local edit operation that could be applied to the input program to improve its similarity to the target. In order to apply this scheme for domains that lack program annotations, we develop a self-supervised learning approach that integrates this edit network into a bootstrapped finetuning loop along with a network that predicts entire programs in one-shot. Our joint finetuning scheme, when coupled with an inference procedure that initializes a population from the one-shot model and evolves members of this population with the edit network, helps to infer more accurate visual programs. Over multiple domains, we experimentally compare our method against the alternative of using only the one-shot model, and find that even under equal search-time budgets, our editing-based paradigm provides significant advantages.

[CV-28] EUFCC-340K: A Faceted Hierarchical Dataset for Metadata Annotation in GLAM Collections

Link: https://arxiv.org/abs/2406.02380
Authors: Francesc Net, Marc Folia, Pep Casals, Andrew D. Bagdanov, Lluis Gomez
Keywords: automatic metadata annotation, Art Architecture Thesaurus, Europeana portal, domain of Galleries, address the challenges
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 23 pages, 13 figures

Abstract:In this paper, we address the challenges of automatic metadata annotation in the domain of Galleries, Libraries, Archives, and Museums (GLAMs) by introducing a novel dataset, EUFCC340K, collected from the Europeana portal. Comprising over 340,000 images, the EUFCC340K dataset is organized across multiple facets: Materials, Object Types, Disciplines, and Subjects, following a hierarchical structure based on the Art Architecture Thesaurus (AAT). We developed several baseline models, incorporating multiple heads on a ConvNeXT backbone for multi-label image tagging on these facets, and fine-tuning a CLIP model with our image text pairs. Our experiments to evaluate model robustness and generalization capabilities in two different test scenarios demonstrate the utility of the dataset in improving multi-label classification tools that have the potential to alleviate cataloging tasks in the cultural heritage sector.

[CV-29] FedDr+: Stabilizing Dot-regression with Global Feature Distillation for Federated Learning

Link: https://arxiv.org/abs/2406.02355
Authors: Seongyoon Kim, Minchan Jeong, Sungnyun Kim, Sungwoo Cho, Sumyeong Ahn, Se-Young Yun
Keywords: Federated Learning, non-iid data distribution, pivotal framework, non-iid data, Federated
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments:

Abstract:Federated Learning (FL) has emerged as a pivotal framework for the development of effective global models (global FL) or personalized models (personalized FL) across clients with heterogeneous, non-iid data distribution. A key challenge in FL is client drift, where data heterogeneity impedes the aggregation of scattered knowledge. Recent studies have tackled the client drift issue by identifying significant divergence in the last classifier layer. To mitigate this divergence, strategies such as freezing the classifier weights and aligning the feature extractor accordingly have proven effective. Although the local alignment between classifier and feature extractor has been studied as a crucial factor in FL, we observe that it may lead the model to overemphasize the observed classes within each client. Thus, our objectives are twofold: (1) enhancing local alignment while (2) preserving the representation of unseen class samples. This approach aims to effectively integrate knowledge from individual clients, thereby improving performance for both global and personalized FL. To achieve this, we introduce a novel algorithm named FedDr+, which empowers local model alignment using dot-regression loss. FedDr+ freezes the classifier as a simplex ETF to align the features and improves aggregated global models by employing a feature distillation mechanism to retain information about unseen/missing classes. Consequently, we provide empirical evidence demonstrating that our algorithm surpasses existing methods that use a frozen classifier to boost alignment across the diverse distribution.

[CV-30] CADE: Cosine Annealing Differential Evolution for Spiking Neural Network

Link: https://arxiv.org/abs/2406.02349
Authors: Runhua Jiang, Guodong Du, Shuyang Yu, Yifei Guo, Sim Kuan Goh, Ho-Kin Tang
Keywords: Spiking neural networks, Spiking Element Wise, energy-efficient artificial intelligence, Annealing Differential Evolution, Cosine Annealing Differential
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Spiking neural networks (SNNs) have gained prominence for their potential in neuromorphic computing and energy-efficient artificial intelligence, yet optimizing them remains a formidable challenge for gradient-based methods due to their discrete, spike-based computation. This paper attempts to tackle the challenges by introducing Cosine Annealing Differential Evolution (CADE), designed to modulate the mutation factor (F) and crossover rate (CR) of differential evolution (DE) for the SNN model, i.e., Spiking Element Wise (SEW) ResNet. Extensive empirical evaluations were conducted to analyze CADE. CADE showed a balance in exploring and exploiting the search space, resulting in accelerated convergence and improved accuracy compared to existing gradient-based and DE-based methods. Moreover, an initialization method based on a transfer learning setting was developed, pretraining on a source dataset (i.e., CIFAR-10) and fine-tuning the target dataset (i.e., CIFAR-100), to improve population diversity. It was found to further enhance CADE for SNN. Remarkably, CADE elevates the performance of the highest accuracy SEW model by an additional 0.52 percentage points, underscoring its effectiveness in fine-tuning and enhancing SNNs. These findings emphasize the pivotal role of a scheduler for F and CR adjustment, especially for DE-based SNN. Source Code on Github: this https URL.
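
To make the role of a scheduler for F and CR concrete, the sketch below pairs a generic cosine-annealing schedule with the textbook DE/rand/1/bin trial-vector step. It is an illustration under stated assumptions, not the CADE algorithm: the abstract does not give the schedule's start and end values, its direction, or the DE variant used, so those choices and all function names here are hypothetical.

```python
import math
import random


def cosine_anneal(step, total_steps, start, end):
    """Move smoothly from start to end along half a cosine period."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return end + 0.5 * (start - end) * (1.0 + math.cos(math.pi * progress))


def de_trial_vector(population, target_idx, f, cr, rng):
    """Build one DE/rand/1/bin trial vector for population[target_idx].

    f is the mutation factor, cr the crossover rate, rng a random.Random.
    """
    candidates = [i for i in range(len(population)) if i != target_idx]
    r1, r2, r3 = rng.sample(candidates, 3)
    target = population[target_idx]
    forced = rng.randrange(len(target))  # at least one gene comes from the mutant
    trial = []
    for j in range(len(target)):
        mutant_j = population[r1][j] + f * (population[r2][j] - population[r3][j])
        take_mutant = rng.random() < cr or j == forced
        trial.append(mutant_j if take_mutant else target[j])
    return trial


def scheduled_parameters(generation, total_generations):
    """Illustrative schedule: anneal F from 0.9 to 0.1 and CR from 0.9 to 0.3."""
    f = cosine_anneal(generation, total_generations, 0.9, 0.1)
    cr = cosine_anneal(generation, total_generations, 0.9, 0.3)
    return f, cr
```

With a decreasing schedule like this one, early generations take large, exploratory mutation steps and later generations take small, exploitative ones, which is one way to obtain the exploration-exploitation balance the abstract reports.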

[CV-31] Flash Diffusion: Accelerating Any Conditional Diffusion Model for Few Steps Image Generation

Link: https://arxiv.org/abs/2406.02347
Authors: Clement Chadebec, Onur Tasar, Eyal Benaroche, Benjamin Aubin
Keywords: pre-trained diffusion models, Flash Diffusion, versatile distillation method, diffusion models, pre-trained diffusion
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 16 pages + 16 pages appendices

Abstract: In this paper, we propose an efficient, fast, and versatile distillation method to accelerate the generation of pre-trained diffusion models: Flash Diffusion. The method reaches state-of-the-art performances in terms of FID and CLIP-Score for few steps image generation on the COCO2014 and COCO2017 datasets, while requiring only several GPU hours of training and fewer trainable parameters than existing methods. In addition to its efficiency, the versatility of the method is also exposed across several tasks such as text-to-image, inpainting, face-swapping, super-resolution and using different backbones such as UNet-based denoisers (SD1.5, SDXL) or DiT (Pixart-α), as well as adapters. In all cases, the method allowed to reduce drastically the number of sampling steps while maintaining very high-quality image generation. The official implementation is available at this https URL.

[CV-32] Progressive Confident Masking Attention Network for Audio-Visual Segmentation

Link: https://arxiv.org/abs/2406.02345
Authors: Yuxuan Wang, Feng Dong, Jinchao Zhu
Keywords: typically occur simultaneously, signals typically occur, occur simultaneously, typically occur, humans possess
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: 10 pages, 9 figures, submitted to IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY

Abstract:Audio and visual signals typically occur simultaneously, and humans possess an innate ability to correlate and synchronize information from these two modalities. Recently, a challenging problem known as Audio-Visual Segmentation (AVS) has emerged, intending to produce segmentation maps for sounding objects within a scene. However, the methods proposed so far have not sufficiently integrated audio and visual information, and the computational costs have been extremely high. Additionally, the outputs of different stages have not been fully utilized. To facilitate this research, we introduce a novel Progressive Confident Masking Attention Network (PMCANet). It leverages attention mechanisms to uncover the intrinsic correlations between audio signals and visual frames. Furthermore, we design an efficient and effective cross-attention module to enhance semantic perception by selecting query tokens. This selection is determined through confidence-driven units based on the network’s multi-stage predictive outputs. Experiments demonstrate that our network outperforms other AVS methods while requiring less computational resources.

[CV-33] Cluster-Aware Similarity Diffusion for Instance Retrieval

Link: https://arxiv.org/abs/2406.02343
Authors: Jifei Luo, Hantao Yao, Changsheng Xu
Keywords: Diffusion-based re-ranking, performing similarity propagation, nearest neighbor graph, common method, similarity
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Diffusion-based re-ranking is a common method used for retrieving instances by performing similarity propagation in a nearest neighbor graph. However, existing techniques that construct the affinity graph based on pairwise instances can lead to the propagation of misinformation from outliers and other manifolds, resulting in inaccurate results. To overcome this issue, we propose a novel Cluster-Aware Similarity (CAS) diffusion for instance retrieval. The primary concept of CAS is to conduct similarity diffusion within local clusters, which can reduce the influence from other manifolds explicitly. To obtain a symmetrical and smooth similarity matrix, our Bidirectional Similarity Diffusion strategy introduces an inverse constraint term to the optimization objective of local cluster diffusion. Additionally, we have optimized a Neighbor-guided Similarity Smoothing approach to ensure similarity consistency among the local neighbors of each instance. Evaluations in instance retrieval and object re-identification validate the effectiveness of the proposed CAS, our code is publicly available.

[CV-34] Continual Unsupervised Out-of-Distribution Detection

Link: https://arxiv.org/abs/2406.02327
Authors: Lars Doorenbos, Raphael Sznitman, Pablo Márquez-Neila
Keywords: testing data, OOD, learning models excel, aligns with testing, Deep learning models
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Deep learning models excel when the data distribution during training aligns with testing data. Yet, their performance diminishes when faced with out-of-distribution (OOD) samples, leading to great interest in the field of OOD detection. Current approaches typically assume that OOD samples originate from an unconcentrated distribution complementary to the training distribution. While this assumption is appropriate in the traditional unsupervised OOD (U-OOD) setting, it proves inadequate when considering the place of deployment of the underlying deep learning model. To better reflect this real-world scenario, we introduce the novel setting of continual U-OOD detection. To tackle this new setting, we propose a method that starts from a U-OOD detector, which is agnostic to the OOD distribution, and slowly updates during deployment to account for the actual OOD distribution. Our method uses a new U-OOD scoring function that combines the Mahalanobis distance with a nearest-neighbor approach. Furthermore, we design a confidence-scaled few-shot OOD detector that outperforms previous methods. We show our method greatly improves upon strong baselines from related fields.

[CV-35] Optimised ProPainter for Video Diminished Reality Inpainting

Link: https://arxiv.org/abs/2406.02287
Authors: Pengze Li, Lihao Liu, Carola-Bibiane Schönlieb, Angelica I Aviles-Rivero
Keywords: inpainting technique optimised, DREAMING Challenge, refined video inpainting, video inpainting technique, Reality for Emerging
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ISBI 2024

Abstract:In this paper, part of the DREAMING Challenge - Diminished Reality for Emerging Applications in Medicine through Inpainting, we introduce a refined video inpainting technique optimised from the ProPainter method to meet the specialised demands of medical imaging, specifically in the context of oral and maxillofacial surgery. Our enhanced algorithm employs the zero-shot ProPainter, featuring optimized parameters and pre-processing, to adeptly manage the complex task of inpainting surgical video sequences, without requiring any training process. It aims to produce temporally coherent and detail-rich reconstructions of occluded regions, facilitating clearer views of operative fields. The efficacy of our approach is evaluated using comprehensive metrics, positioning it as a significant advancement in the application of diminished reality for medical purposes.

[CV-36] Understanding Retrieval Robustness for Retrieval-Augmented Image Captioning

Link: https://arxiv.org/abs/2406.02265
Authors: Wenyan Li, Jiaang Li, Rita Ramos, Raphael Tang, Desmond Elliott
Keywords: strong domain-transfer capabilities, Recent advancements, image captioning highlight, retrieving related captions, domain-transfer capabilities
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: 9 pages, long paper at ACL 2024

Abstract:Recent advancements in retrieval-augmented models for image captioning highlight the significance of retrieving related captions for efficient, lightweight models with strong domain-transfer capabilities. While these models demonstrate the success of retrieval augmentation, retrieval models are still far from perfect in practice. Retrieved information can sometimes mislead the model generation, negatively impacting performance. In this paper, we analyze the robustness of the SmallCap retrieval-augmented captioning model. Our analysis shows that SmallCap is sensitive to tokens that appear in the majority of the retrieved captions, and integrated gradients attribution shows that those tokens are likely copied into the final caption. Given these findings, we propose to train the model by sampling retrieved captions from more diverse sets. This reduces the probability that the model learns to copy majority tokens and improves both in-domain and cross-domain performance effectively.

[CV-37] Image contrast enhancement based on the Schrödinger operator spectrum

Link: https://arxiv.org/abs/2406.02264
Authors: Juan M. Vargas, Taous-Meriem Laleg-Kirati
Keywords: dimensional Schrödinger operator, Schrödinger operator, dimensional Schrödinger, gamma, enhancement method based
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: This study proposes a novel image contrast enhancement method based on image projection onto the squared eigenfunctions of the two dimensional Schrödinger operator. This projection depends on a design parameter γ which is proposed to control the pixel intensity during image reconstruction. The performance of the proposed method is investigated through its application to color images. The selection of γ values is performed using k-means, which helps preserve the image spatial adjacency information. Furthermore, multi-objective optimization using the Non-dominated Sorting Genetic Algorithm II (NSGA-II) is proposed to select the optimal values of γ and the semi-classical parameter h from the 2DSCSA. The results demonstrate the effectiveness of the proposed method for enhancing image contrast while preserving the inherent characteristics of the original image, producing the desired enhancement with almost no artifacts.

[CV-38] M3DM-NR: RGB-3D Noisy-Resistant Industrial Anomaly Detection via Multimodal Denoising

Link: https://arxiv.org/abs/2406.02263
Authors: Chengjie Wang,Haokun Zhu,Jinlong Peng,Yue Wang,Ran Yi,Yunsheng Wu,Lizhuang Ma,Jiangning Zhang
Keywords: Existing industrial anomaly, pristine RGB images, Existing industrial, Suspected Anomaly Map, anomaly detection
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Abstract:Existing industrial anomaly detection methods primarily concentrate on unsupervised learning with pristine RGB images. Yet, both RGB and 3D data are crucial for anomaly detection, and the datasets are seldom completely clean in practical scenarios. To address the above challenges, this paper initially delves into RGB-3D multi-modal noisy anomaly detection, proposing a novel noise-resistant M3DM-NR framework that leverages the strong multi-modal discriminative capabilities of CLIP. M3DM-NR consists of three stages: Stage-I introduces the Suspected References Selection module to filter a few normal samples from the training dataset, using the multimodal features extracted by the Initial Feature Extraction, and a Suspected Anomaly Map Computation module to generate a suspected anomaly map to focus on abnormal regions as reference. Stage-II uses the suspected anomaly maps of the reference samples as reference, and inputs image, point cloud, and text information to achieve denoising of the training samples through intra-modal comparison and multi-scale aggregation operations. Finally, Stage-III proposes the Point Feature Alignment, Unsupervised Feature Fusion, Noise Discriminative Coreset Selection, and Decision Layer Fusion modules to learn the pattern of the training dataset, enabling anomaly detection and segmentation while filtering out noise. Extensive experiments show that M3DM-NR outperforms state-of-the-art methods in 3D-RGB multi-modal noisy anomaly detection.

[CV-39] PuFace: Defending against Facial Cloaking Attacks for Facial Recognition Models

Link: https://arxiv.org/abs/2406.02253
Authors: Jing Wen
Keywords: add invisible perturbation, facial recognition models, recently proposed facial, attacks add invisible, unauthorized facial recognition
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*Comments:

Abstract:The recently proposed facial cloaking attacks add invisible perturbation (cloaks) to facial images to protect users from being recognized by unauthorized facial recognition models. However, we show that the “cloaks” are not robust enough and can be removed from images. This paper introduces PuFace, an image purification system leveraging the generalization ability of neural networks to diminish the impact of cloaks by pushing the cloaked images towards the manifold of natural (uncloaked) images before the training process of facial recognition models. Specifically, we devise a purifier that takes all the training images including both cloaked and natural images as input and generates the purified facial images close to the manifold where natural images lie. To meet the defense goal, we propose to train the purifier on particularly amplified cloaked images with a loss function that combines image loss and feature loss. Our empirical experiment shows PuFace can effectively defend against two state-of-the-art facial cloaking attacks and reduces the attack success rate from 69.84% to 7.61% on average without degrading the normal accuracy for various facial recognition models. Moreover, PuFace is a model-agnostic defense mechanism that can be applied to any facial recognition model without modifying the model structure.

[CV-40] I4VGen: Image as Stepping Stone for Text-to-Video Generation

Link: https://arxiv.org/abs/2406.02230
Authors: Xiefan Guo,Jinlin Liu,Miaomiao Cui,Di Huang
Keywords: limited video-text datasets, video-text datasets, diversity due, complexity of spatio-temporal, spatio-temporal modeling
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments: Project page: this https URL

Abstract:Text-to-video generation has lagged behind text-to-image synthesis in quality and diversity due to the complexity of spatio-temporal modeling and limited video-text datasets. This paper presents I4VGen, a training-free and plug-and-play video diffusion inference framework, which enhances text-to-video generation by leveraging robust image techniques. Specifically, following text-to-image-to-video, I4VGen decomposes the text-to-video generation into two stages: anchor image synthesis and anchor image-guided video synthesis. Correspondingly, a well-designed generation-selection pipeline is employed to obtain a visually realistic and semantically faithful anchor image, and an innovative Noise-Invariant Video Score Distillation Sampling is incorporated to animate the image into a dynamic video, followed by a video regeneration process to refine the video. This inference strategy effectively mitigates the prevalent issue of non-zero terminal signal-to-noise ratio. Extensive evaluations show that I4VGen not only produces videos with higher visual realism and textual fidelity but also integrates seamlessly into existing image-to-video diffusion models, thereby improving overall video quality.

[CV-41] SMCL: Saliency Masked Contrastive Learning for Long-tailed Recognition

Link: https://arxiv.org/abs/2406.02223
Authors: Sanglee Park,Seung-won Hwang,Jungmin So
Keywords: Real-world data, high imbalance, Real-world, contrastive learning, classes
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*Comments: accepted at ICASSP 2023

Abstract:Real-world data often follow a long-tailed distribution with a high imbalance in the number of samples between classes. The problem with training from imbalanced data is that some background features, common to all classes, can be unobserved in classes with scarce samples. As a result, this background correlates to biased predictions into "major" classes. In this paper, we propose saliency masked contrastive learning, a new method that uses saliency masking and contrastive learning to mitigate the problem and improve the generalizability of a model. Our key idea is to mask the important part of an image using saliency detection and use contrastive learning to move the masked image towards minor classes in the feature space, so that background features present in the masked image are no longer correlated with the original class. Experiment results show that our method achieves state-of-the-art level performance on benchmark long-tailed datasets.
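
The masking step described in this abstract can be pictured with a minimal sketch. It is not the authors' implementation: the helper `mask_salient`, the `threshold` and `fill` parameters, and the toy saliency map are all assumptions made for illustration, and the contrastive-learning step is not shown.

```python
def mask_salient(image, saliency, threshold=0.5, fill=0.0):
    """Replace pixels whose saliency exceeds `threshold` with `fill`.

    `image` and `saliency` are equally sized 2D lists of floats.
    Hypothetical helper; the paper's saliency detector is not specified here.
    """
    return [
        [fill if s > threshold else px for px, s in zip(img_row, sal_row)]
        for img_row, sal_row in zip(image, saliency)
    ]


# Toy 2x2 image: the top row is the salient object, the bottom row is background.
image = [[0.9, 0.8], [0.1, 0.2]]
saliency = [[0.95, 0.7], [0.1, 0.3]]
masked = mask_salient(image, saliency)  # only background pixels survive
```

After masking, only background pixels remain, which is the kind of image the abstract proposes to move towards minor classes in the feature space.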

[CV-42] Why Only Text: Empowering Vision-and-Language Navigation with Multi-modal Prompts

Link: https://arxiv.org/abs/2406.02208
Authors: Haodong Hong,Sen Wang,Zi Huang,Qi Wu,Jiajun Liu
Keywords: employ textual instructions, Current, Prompts, employ textual, textual instructions
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*Comments: IJCAI 2024

Abstract:Current Vision-and-Language Navigation (VLN) tasks mainly employ textual instructions to guide agents. However, being inherently abstract, the same textual instruction can be associated with different visual signals, causing severe ambiguity and limiting the transfer of prior knowledge in the vision domain from the user to the agent. To fill this gap, we propose Vision-and-Language Navigation with Multi-modal Prompts (VLN-MP), a novel task augmenting traditional VLN by integrating both natural language and images in instructions. VLN-MP not only maintains backward compatibility by effectively handling text-only prompts but also consistently shows advantages with different quantities and relevance of visual prompts. Possible forms of visual prompts include both exact and similar object images, providing adaptability and versatility in diverse navigation scenarios. To evaluate VLN-MP under a unified framework, we implement a new benchmark that offers: (1) a training-free pipeline to transform textual instructions into multi-modal forms with landmark images; (2) diverse datasets with multi-modal instructions for different downstream tasks; (3) a novel module designed to process various image prompts for seamless integration with state-of-the-art VLN models. Extensive experiments on four VLN benchmarks (R2R, RxR, REVERIE, CVDN) show that incorporating visual prompts significantly boosts navigation performance. While maintaining efficiency with text-only prompts, VLN-MP enables agents to navigate in the pre-explore setting and outperform text-based models, showing its broader applicability.

[CV-43] Can CLIP help CLIP in learning 3D?

Link: https://arxiv.org/abs/2406.02202
Authors: Cristian Sbrolli,Matteo Matteucci
Keywords: explore an alternative, textual descriptions, leverage CLIP knowledge, enhance contrastive, alternative approach
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*Comments:

Abstract:In this study, we explore an alternative approach to enhance contrastive text-image-3D alignment in the absence of textual descriptions for 3D objects. We introduce two unsupervised methods, I2I and (I2L)^2, which leverage CLIP knowledge about textual and 2D data to compute the neural perceived similarity between two 3D samples. We employ the proposed methods to mine 3D hard negatives, establishing a multimodal contrastive pipeline with hard negative weighting via a custom loss function. We train on different configurations of the proposed hard negative mining approach, and we evaluate the accuracy of our models in 3D classification and on the cross-modal retrieval benchmark, testing image-to-shape and shape-to-image retrieval. Results demonstrate that our approach, even without explicit text alignment, achieves comparable or superior performance on zero-shot and standard 3D classification, while significantly improving both image-to-shape and shape-to-image retrieval compared to previous methods.

[CV-44] GraVITON: Graph based garment warping with attention guided inversion for Virtual-tryon

Link: https://arxiv.org/abs/2406.02184
Authors: Sanhita Pathak,Vinay Kaushik,Brejesh Lall
Keywords: rapidly evolving field, improving customer experiences, computer vision, human body, rapidly evolving
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments: 18 pages, 7 Figures and 6 Tables

Abstract:Virtual try-on, a rapidly evolving field in computer vision, is transforming e-commerce by improving customer experiences through precise garment warping and seamless integration onto the human body. Existing methods such as TPS and flow address garment warping but overlook the finer contextual details. In this paper, we introduce a novel graph-based warping technique which emphasizes the value of context in garment flow. Our graph-based warping module generates a warped garment as well as a coarse person image, which is utilised by a simple refinement network to give a coarse virtual try-on image. The proposed work exploits a latent diffusion model to generate the final try-on, treating garment transfer as an inpainting task. The diffusion model is conditioned with decoupled cross-attention-based inversion of visual and textual information. We introduce an occlusion-aware warping constraint that generates a dense warped garment without any holes or occlusion. Our method, validated on the VITON-HD and Dresscode datasets, shows considerable qualitative and quantitative improvement over the state of the art in garment warping, texture preservation, and overall realism.

[CV-45] Radar Spectra-Language Model for Automotive Scene Parsing

Link: https://arxiv.org/abs/2406.02158
Authors: Mariia Pushkareva,Yuri Feldman,Csaba Domokos,Kilian Rambach,Dotan Di Castro
Keywords: Radar, radar spectra, low cost, sensors are low, spectra
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*Comments:

Abstract:Radar sensors are low cost, long-range, and weather-resilient. Therefore, they are widely used for driver assistance functions, and are expected to be crucial for the success of autonomous driving in the future. In many perception tasks only pre-processed radar point clouds are considered. In contrast, radar spectra are a raw form of radar measurements and contain more information than radar point clouds. However, radar spectra are rather difficult to interpret. In this work, we aim to explore the semantic information contained in spectra in the context of automated driving, thereby moving towards better interpretability of radar spectra. To this end, we create a radar spectra-language model, allowing us to query radar spectra measurements for the presence of scene elements using free text. We overcome the scarcity of radar spectra data by matching the embedding space of an existing vision-language model (VLM). Finally, we explore the benefit of the learned representation for scene parsing, and obtain improvements in free space segmentation and object detection merely by injecting the spectra embedding into a baseline model.

[CV-46] Analyzing the Feature Extractor Networks for Face Image Synthesis

Link: https://arxiv.org/abs/2406.02153
Authors: Erdi Sarıtaş,Hazım Kemal Ekenel
Keywords: Generative Adversarial Networks, Advancements like Generative, Generative Adversarial, Adversarial Networks, Networks have attracted
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments: Accepted at 18th International Conference on Automatic Face and Gesture Recognition (FG) on 1st SD-FGA Workshop 2024

Abstract:Advancements like Generative Adversarial Networks have attracted the attention of researchers toward face image synthesis to generate ever more realistic images. Thereby, the need for evaluation criteria to assess the realism of the generated images has become apparent. While FID utilized with InceptionV3 is one of the primary choices for benchmarking, concerns about InceptionV3’s limitations for face images have emerged. This study investigates the behavior of diverse feature extractors (InceptionV3, CLIP, DINOv2, and ArcFace) across a variety of metrics (FID, KID, Precision/Recall). The FFHQ dataset is used as the target domain, while the CelebA-HQ dataset and synthetic datasets generated using StyleGAN2 and Projected FastGAN serve as the source domains. Experiments include in-depth analysis of the features: L2 normalization, model attention during extraction, and domain distributions in the feature space. We aim to give valuable insights into the behavior of feature extractors for evaluating face image synthesis methodologies. The code is publicly available at this https URL.
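
For readers unfamiliar with FID, the metric is the Fréchet distance between two Gaussians fitted to feature vectors of real and generated images. The sketch below is a deliberately simplified one-dimensional version, where the distance reduces to (μ1 − μ2)² + (σ1 − σ2)². The function name and the toy feature values are assumptions for illustration; real FID operates on high-dimensional features from an extractor such as InceptionV3 and needs a matrix square root of the covariance product.

```python
import statistics


def frechet_distance_1d(real, generated):
    """Fréchet distance between Gaussians fitted to two scalar feature samples.

    One-dimensional simplification: (mu_r - mu_g)^2 + (sd_r - sd_g)^2.
    """
    mu_r, mu_g = statistics.fmean(real), statistics.fmean(generated)
    # Population standard deviation of each sample.
    sd_r, sd_g = statistics.pstdev(real), statistics.pstdev(generated)
    return (mu_r - mu_g) ** 2 + (sd_r - sd_g) ** 2


# Identical samples give zero distance; a pure mean shift of 3 gives 3^2.
same = frechet_distance_1d([0.0, 2.0], [0.0, 2.0])
shifted = frechet_distance_1d([0.0, 2.0], [3.0, 5.0])
```

Because the score depends entirely on the features fed into it, swapping the feature extractor changes what the metric measures, which is the question this paper studies.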

[CV-47] UA-Track: Uncertainty-Aware End-to-End 3D Multi-Object Tracking

Link: https://arxiv.org/abs/2406.02147
Authors: Lijun Zhou,Tao Tang,Pengkun Hao,Zihang He,Kalok Ho,Shuo Gu,Wenbo Hou,Zhihui Hao,Haiyang Sun,Kun Zhan,Peng Jia,Xianpeng Lang,Xiaodan Liang
Keywords: autonomous driving perception, plays a crucial, driving perception, multiple object tracking, crucial role
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Abstract:3D multiple object tracking (MOT) plays a crucial role in autonomous driving perception. Recent end-to-end query-based trackers simultaneously detect and track objects, which have shown promising potential for the 3D MOT task. However, existing methods overlook the uncertainty issue, which refers to the lack of precise confidence about the state and location of tracked objects. Uncertainty arises owing to various factors during motion observation by cameras, especially occlusions and the small size of target objects, resulting in an inaccurate estimation of the object’s position, label, and identity. To this end, we propose an Uncertainty-Aware 3D MOT framework, UA-Track, which tackles the uncertainty problem from multiple aspects. Specifically, we first introduce an Uncertainty-aware Probabilistic Decoder to capture the uncertainty in object prediction with probabilistic attention. Secondly, we propose an Uncertainty-guided Query Denoising strategy to further enhance the training process. We also utilize Uncertainty-reduced Query Initialization, which leverages predicted 2D object location and depth information to reduce query uncertainty. As a result, our UA-Track achieves state-of-the-art performance on the nuScenes benchmark, i.e., 66.3% AMOTA on the test split, surpassing the previous best end-to-end solution by a significant margin of 8.9% AMOTA.

[CV-48] Analyzing the Effect of Combined Degradations on Face Recognition

Link: https://arxiv.org/abs/2406.02142
Authors: Erdi Sarıtaş,Hazım Kemal Ekenel
Keywords: controlled environments, typically trained, trained on large, collected from controlled, real-world
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments: Accepted at 18th International Conference on Automatic Face and Gesture Recognition (FG) on 2nd PrivAAL Workshop 2024

Abstract:A face recognition model is typically trained on large datasets of images that may be collected from controlled environments. This results in performance discrepancies when applied to real-world scenarios due to the domain gap between clean and in-the-wild images. Therefore, some researchers have investigated the robustness of these models by analyzing synthetic degradations. Yet, existing studies have mostly focused on single degradation factors, which may not fully capture the complexity of real-world degradations. This work addresses this problem by analyzing the impact of both single and combined degradations using a real-world degradation pipeline extended with under/over-exposure conditions. We use the LFW dataset for our experiments and assess the model’s performance based on verification accuracy. Results reveal that single and combined degradations show dissimilar model behavior. The combined effect of degradation significantly lowers performance even if its single effect is negligible. This work emphasizes the importance of accounting for real-world complexity to assess the robustness of face recognition models in real-world settings. The code is publicly available at this https URL.

[CV-49] Domain Game: Disentangle Anatomical Feature for Single Domain Generalized Segmentation

Link: https://arxiv.org/abs/2406.02125
Authors: Hao Chen,Hongrun Zhang,U Wang Chan,Rui Yin,Xiaofei Wang,Chao Li
Keywords: Single domain generalization, domain generalization aims, Single domain, generalization aims, generalization problem
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Abstract:Single domain generalization aims to address the challenge of out-of-distribution generalization with only one source domain available. Feature disentanglement is a classic solution for this purpose, where the extracted task-related feature is presumed to be resilient to domain shift. However, the absence of references from other domains in a single-domain scenario poses significant uncertainty in feature disentanglement (ill-posedness). In this paper, we propose a new framework, named Domain Game, to perform better feature disentangling for medical image segmentation, based on the observation that diagnostically relevant features are more sensitive to geometric transformations, whilst domain-specific features will probably remain invariant to such operations. In Domain Game, a set of randomly transformed images derived from a single source image is strategically encoded into two separate feature sets to represent diagnostic features and domain-specific features, respectively, and we apply forces to pull or repel them in the feature space, accordingly. Results from cross-site test domain evaluation showcase approximately an 11.8% performance boost in prostate segmentation and around 10.5% in brain tumor segmentation compared to the second-best method.

[CV-50] FaceCom: Towards High-fidelity 3D Facial Shape Completion via Optimization and Inpainting Guidance

Link: https://arxiv.org/abs/2406.02074
Authors: Yinglong Li,Hongyu Wu,Xiaogang Wang,Qingzhao Qin,Yijiao Zhao,Yong wang,Aimin Hao
Keywords: arbitrary forms, delivers high-fidelity results, delivers high-fidelity, shape completion, facial
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments: accepted to CVPR2024

Abstract:We propose FaceCom, a method for 3D facial shape completion, which delivers high-fidelity results for incomplete facial inputs of arbitrary forms. Unlike end-to-end shape completion methods based on point clouds or voxels, our approach relies on a mesh-based generative network that is easy to optimize, enabling it to handle shape completion for irregular facial scans. We first train a shape generator on a mixed 3D facial dataset containing 2405 identities. Based on the incomplete facial input, we fit complete faces using an optimization approach under image inpainting guidance. The completion results are refined through a post-processing step. FaceCom demonstrates the ability to effectively and naturally complete facial scan data with varying missing regions and degrees of missing areas. Our method can be used in medical prosthetic fabrication and the registration of deficient scanning data. Our experimental results demonstrate that FaceCom achieves exceptional performance in fitting and shape completion tasks. The code is available at this https URL.

[CV-51] Advancing Generalized Transfer Attack with Initialization Derived Bilevel Optimization and Dynamic Sequence Truncation

Link: https://arxiv.org/abs/2406.02064
Authors: Yaohua Liu,Jiaxin Gao,Xuan Liu,Xianghao Jiao,Xin Fan,Risheng Liu
Keywords: Transfer attacks generate, generate significant interest, real-world black-box applications, crafting transferable adversarial, attacks generate significant
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
*Comments: Accepted by IJCAI 2024. 10 pages

Abstract:Transfer attacks generate significant interest for real-world black-box applications by crafting transferable adversarial examples through surrogate models. However, existing works essentially directly optimize the single-level objective w.r.t. the surrogate model, which always leads to poor interpretability of the attack mechanism and limited generalization performance over unknown victim models. In this work, we propose the BilEvel Transfer AttacK (BETAK) framework by establishing an initialization derived bilevel optimization paradigm, which explicitly reformulates the nested constraint relationship between the Upper-Level (UL) pseudo-victim attacker and the Lower-Level (LL) surrogate attacker. Algorithmically, we introduce the Hyper Gradient Response (HGR) estimation as an effective feedback for the transferability over pseudo-victim attackers, and propose the Dynamic Sequence Truncation (DST) technique to dynamically adjust the back-propagation path for HGR and reduce computational overhead simultaneously. Meanwhile, we conduct detailed algorithmic analysis and provide convergence guarantee to support non-convexity of the LL surrogate attacker. Extensive evaluations demonstrate substantial improvement of BETAK (e.g., 53.41% increase of attack success rates against IncRes-v2_ens) against different victims and defense methods in targeted and untargeted attack scenarios. The source code is available at this https URL.

[CV-52] OpenGaussian: Towards Point-Level 3D Gaussian-based Open Vocabulary Understanding

Link: https://arxiv.org/abs/2406.02058
Authors: Yanmin Wu,Jiarui Meng,Haijie Li,Chenming Wu,Yahao Shi,Xinhua Cheng,Chen Zhao,Haocheng Feng,Errui Ding,Jingdong Wang,Jian Zhang
Keywords: Gaussian Splatting, paper introduces OpenGaussian, point-level open vocabulary, open vocabulary understanding, open vocabulary
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*Comments: technical report, 15 pages

Abstract:This paper introduces OpenGaussian, a method based on 3D Gaussian Splatting (3DGS) capable of 3D point-level open vocabulary understanding. Our primary motivation stems from observing that existing 3DGS-based open vocabulary methods mainly focus on 2D pixel-level parsing. These methods struggle with 3D point-level tasks due to weak feature expressiveness and inaccurate 2D-3D feature associations. To ensure robust feature presentation and 3D point-level understanding, we first employ SAM masks without cross-frame associations to train instance features with 3D consistency. These features exhibit both intra-object consistency and inter-object distinction. Then, we propose a two-stage codebook to discretize these features from coarse to fine levels. At the coarse level, we consider the positional information of 3D points to achieve location-based clustering, which is then refined at the fine level. Finally, we introduce an instance-level 3D-2D feature association method that links 3D points to 2D masks, which are further associated with 2D CLIP features. Extensive experiments, including open vocabulary-based 3D object selection, 3D point cloud understanding, click-based 3D object selection, and ablation studies, demonstrate the effectiveness of our proposed method. Project page: this https URL

[CV-53] Leveraging Predicate and Triplet Learning for Scene Graph Generation

Link: https://arxiv.org/abs/2406.02038
Authors: Jiankai Li,Yunhong Wang,Xiefan Guo,Ruijie Yang,Weixin Li
Keywords: Scene Graph Generation, Graph Generation, Scene Graph, visual scenes, textless subject
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments: CVPR 2024

Abstract:Scene Graph Generation (SGG) aims to identify entities and predict the relationship triplets <subject, predicate, object> in visual scenes. Given the prevalence of large visual variations of subject-object pairs even in the same predicate, it can be quite challenging to model and refine predicate representations directly across such pairs, which is however a common strategy adopted by most existing SGG methods. We observe that visual variations within the identical triplet are relatively small and certain relation cues are shared in the same type of triplet, which can potentially facilitate the relation learning in SGG. Moreover, for the long-tail problem widely studied in SGG task, it is also crucial to deal with the limited types and quantity of triplets in tail predicates. Accordingly, in this paper, we propose a Dual-granularity Relation Modeling (DRM) network to leverage fine-grained triplet cues besides the coarse-grained predicate ones. DRM utilizes contexts and semantics of predicate and triplet with Dual-granularity Constraints, generating compact and balanced representations from two perspectives to facilitate relation recognition. Furthermore, a Dual-granularity Knowledge Transfer (DKT) strategy is introduced to transfer variation from head predicates/triplets to tail ones, aiming to enrich the pattern diversity of tail classes to alleviate the long-tail problem. Extensive experiments demonstrate the effectiveness of our method, which establishes new state-of-the-art performance on Visual Genome, Open Image, and GQA datasets. Our code is available at this https URL

[CV-54] Multi-Scale Direction-Aware Network for Infrared Small Target Detection

Link: https://arxiv.org/abs/2406.02037
Authors: Jinmiao Zhao,Zelin Shi,Chuang Yu,Yunpeng Liu
Keywords: high-frequency directional features, target detection faces, high-frequency directional, small target detection, Infrared small target
Categories: Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Abstract:Infrared small target detection faces the problem that it is difficult to effectively separate the background and the target. Existing deep learning-based methods focus on appearance features and ignore high-frequency directional features. Therefore, we propose a multi-scale direction-aware network (MSDA-Net), which is the first attempt to integrate the high-frequency directional features of infrared small targets as domain prior knowledge into neural networks. Specifically, an innovative multi-directional feature awareness (MDFA) module is constructed, which fully utilizes the prior knowledge of targets and emphasizes the focus on high-frequency directional features. On this basis, combined with the multi-scale local relation learning (MLRL) module, a multi-scale direction-aware (MSDA) module is further constructed. The MSDA module promotes the full extraction of local relations at different scales and the full perception of key features in different directions. Meanwhile, a high-frequency direction injection (HFDI) module without training parameters is constructed to inject the high-frequency directional information of the original image into the network. This helps guide the network to pay attention to detailed information such as target edges and shapes. In addition, we propose a feature aggregation (FA) structure that aggregates multi-level features to solve the problem of small targets disappearing in deep feature maps. Furthermore, a lightweight feature alignment fusion (FAF) module is constructed, which can effectively alleviate the pixel offset existing in multi-level feature map fusion. Extensive experimental results show that our MSDA-Net achieves state-of-the-art (SOTA) results on the public NUDT-SIRST, SIRST and IRSTD-1k datasets.

[CV-55] Inference Attacks in Machine Learning as a Service: A Taxonomy Review and Promising Directions

Link: https://arxiv.org/abs/2406.02027
Authors: Feng Wu,Lei Cui,Shaowen Yao,Shui Yu
Keywords: brought people concerns, inference attacks, inference, brought people, people concerns
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Abstract:The prosperity of machine learning has also raised concerns about data privacy. Among the threats, inference attacks can cause privacy breaches in various MLaaS scenarios and in both the model training and prediction phases. Specifically, inference attacks can perform privacy inference on undisclosed target training sets based on outputs of the target model, including but not limited to statistics, membership, semantics, data representation, etc. For instance, an attacker may infer whether the target data has the characteristics of AIDS. In addition, the rapid development of the machine learning community in recent years, especially the surge of model types and application scenarios, has further stimulated research on inference attacks. Thus, studying inference attacks and analyzing them in depth is urgent and significant. However, there is still a gap in the systematic discussion of inference attacks from taxonomy, global perspective, attack, and defense perspectives. This survey provides an in-depth and comprehensive review of inference attacks and corresponding countermeasures in ML-as-a-service based on taxonomy and the latest research. Without compromising researchers’ intuition, we first propose the 3MP taxonomy based on the community research status, trying to normalize the confusing naming system of inference attacks. Also, we analyze the pros and cons of each type of inference attack, their workflow, countermeasures, and how they interact with other attacks. In the end, we point out several promising directions for researchers from a more comprehensive and novel perspective.

[CV-56] MetaMixer Is All You Need

Link: https://arxiv.org/abs/2406.02021
Authors: Seokju Yun,Dongheon Lee,Youngmin Ro
Keywords: revolutionized the landscape, Transformer, Feed-Forward Network, network design, FFN
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments: Code: this https URL

Abstract:Transformer, composed of self-attention and Feed-Forward Network, has revolutionized the landscape of network design across various vision tasks. FFN is a versatile operator seamlessly integrated into nearly all AI models to effectively harness rich representations. Recent works also show that FFN functions like key-value memories. Thus, akin to the query-key-value mechanism within self-attention, FFN can be viewed as a memory network, where the input serves as query and the two projection weights operate as keys and values, respectively. We hypothesize that the importance lies in query-key-value framework itself rather than in self-attention. To verify this, we propose converting self-attention into a more FFN-like efficient token mixer with only convolutions while retaining query-key-value framework, namely FFNification. Specifically, FFNification replaces query-key and attention coefficient-value interactions with large kernel convolutions and adopts GELU activation function instead of softmax. The derived token mixer, FFNified attention, serves as key-value memories for detecting locally distributed spatial patterns, and operates in the opposite dimension to the ConvNeXt block within each corresponding sub-operation of the query-key-value framework. Building upon the above two modules, we present a family of Fast-Forward Networks. Our FFNet achieves remarkable performance improvements over previous state-of-the-art methods across a wide range of tasks. The strong and general performance of our proposed method validates our hypothesis and leads us to introduce MetaMixer, a general mixer architecture that does not specify sub-operations within the query-key-value framework. We show that using only simple operations like convolution and GELU in the MetaMixer can achieve superior performance.
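
The key-value memory reading of an FFN mentioned in this abstract can be made concrete with a small sketch. It shows only the generic view (input as query, first projection rows as keys, second projection rows as values, GELU in place of softmax) and is not the paper's convolutional FFNification; the function names and the toy weights are assumptions for illustration.

```python
import math


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def ffn_as_memory(query, keys, values):
    """Read from a key-value memory with a two-layer FFN (biases omitted).

    query:  list of d floats (the input token).
    keys:   m rows of d floats (rows of the first projection).
    values: m rows of d_out floats (rows of the second projection).
    """
    # Match the query against every key, then gate the scores with GELU
    # (self-attention would apply softmax at this point).
    coeffs = [gelu(sum(q * k for q, k in zip(query, key))) for key in keys]
    d_out = len(values[0])
    # The output is the coefficient-weighted sum of the value rows.
    return [sum(c * v[j] for c, v in zip(coeffs, values)) for j in range(d_out)]


# The query matches the first key only, so only the first value row is read out.
out = ffn_as_memory([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 3.0]])
```

Written this way, the FFN has the same query, key, and value roles as self-attention, except that the keys and values are learned weights rather than projections of other tokens.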

[CV-57] Bayesian Mesh Optimization for Graph Neural Networks to Enhance Engineering Performance Prediction

Link: https://arxiv.org/abs/2406.01996
Authors: Jangseop Park, Namwoo Kang
Keywords: replace computationally expensive, leveraging design variables, computationally expensive simulations, design variables, widely employed
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: 17 pages, 8 figures, 3 tables

Abstract:In engineering design, surrogate models are widely employed to replace computationally expensive simulations by leveraging design variables and geometric parameters from computer-aided design (CAD) models. However, these models often lose critical information when simplified to lower dimensions and face challenges in parameter definition, especially with the complex 3D shapes commonly found in industrial datasets. To address these limitations, we propose a Bayesian graph neural network (GNN) framework for a 3D deep-learning-based surrogate model that predicts engineering performance by directly learning geometric features from CAD using mesh representation. Our framework determines the optimal size of mesh elements through Bayesian optimization, resulting in a high-accuracy surrogate model. Additionally, it effectively handles the irregular and complex structures of 3D CADs, which differ significantly from the regular and uniform pixel structures of 2D images typically used in deep learning. Experimental results demonstrate that the quality of the mesh significantly impacts the prediction accuracy of the surrogate model, with an optimally sized mesh achieving superior performance. We compare the performance of models based on various 3D representations such as voxel, point cloud, and graph, and evaluate the computational costs of Monte Carlo simulation and Bayesian optimization methods to find the optimal mesh size. We anticipate that our proposed framework has the potential to be applied to mesh-based simulations across various engineering fields, leveraging physics-based information commonly used in computer-aided engineering.

[CV-58] 3D Imaging of Complex Specular Surfaces by Fusing Polarimetric and Deflectometric Information

Link: https://arxiv.org/abs/2406.01994
Authors: Jiazhang Wang, Oliver Cossairt, Florian Willomitzer
Keywords: poses major challenges, Accurate and fast, optical measurement principles, poses major, major challenges
Categories: Computer Vision and Pattern Recognition (cs.CV); Optics (physics.optics)

Abstract:Accurate and fast 3D imaging of specular surfaces still poses major challenges for state-of-the-art optical measurement principles. Frequently used methods, such as phase-measuring deflectometry (PMD) or shape-from-polarization (SfP), rely on strong assumptions about the measured objects, limiting their generalizability in broader application areas like medical imaging, industrial inspection, virtual reality, or cultural heritage analysis. In this paper, we introduce a measurement principle that utilizes a novel technique to effectively encode and decode the information contained in a light field reflected off a specular surface. We combine polarization cues from SfP with geometric information obtained from PMD to resolve all arising ambiguities in the 3D measurement. Moreover, our approach removes the unrealistic orthographic imaging assumption for SfP, which significantly improves the respective results. We showcase our new technique by demonstrating single-shot and multi-shot measurements on complex-shaped specular surfaces, displaying an evaluated accuracy of surface normals below 0.6°.

[CV-59] Dealing with All-stage Missing Modality: Towards A Universal Model with Robust Reconstruction and Personalization

Link: https://arxiv.org/abs/2406.01987
Authors: Yunpeng Zhao, Cheng Chen, Qing You Pang, Quanzheng Li, Carol Tang, Beng-Ti Ang, Yueming Jin
Keywords: Addressing missing modalities, Addressing missing, missing modalities presents, presents a critical, critical challenge
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Addressing missing modalities presents a critical challenge in multimodal learning. Current approaches focus on developing models that can handle modality-incomplete inputs during inference, assuming that the full set of modalities is available for all the data during training. This reliance on full-modality data for training limits the use of abundant modality-incomplete samples that are often encountered in practical settings. In this paper, we propose a robust universal model with modality reconstruction and model personalization, which can effectively tackle the missing modality at both training and testing stages. Our method leverages a multimodal masked autoencoder to reconstruct the missing modality and masked patches simultaneously, incorporating an innovative distribution approximation mechanism to fully utilize both modality-complete and modality-incomplete data. The reconstructed modalities then contribute to our designed data-model co-distillation scheme to guide the model learning in the presence of missing modalities. Moreover, we propose a CLIP-driven hyper-network to personalize partial model parameters, enabling the model to adapt to each distinct missing modality scenario. Our method has been extensively validated on two brain tumor segmentation benchmarks. Experimental results demonstrate the promising performance of our method, which consistently exceeds previous state-of-the-art approaches under the all-stage missing modality settings with different missing ratios. Code will be available.

[CV-60] Can Dense Connectivity Benefit Outlier Detection? An Odyssey with NAS

Link: https://arxiv.org/abs/2406.01975
Authors: Hao Fu, Tunhou Zhang, Hai Li, Yiran Chen
Keywords: Convolutional Neural Networks, real world applications, Recent advances, Neural Networks, deployment of Convolutional
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Recent advances in Out-of-Distribution (OOD) detection are the driving force behind safe and reliable deployment of Convolutional Neural Networks (CNNs) in real-world applications. However, existing studies focus on OOD detection through confidence score and deep generative model-based methods, without considering the impact of DNN structures, especially dense connectivity in architecture fabrications. In addition, existing outlier detection approaches exhibit high variance in generalization performance, lacking stability and confidence in evaluating and ranking different outlier detectors. In this work, we propose a novel paradigm, Dense Connectivity Search of Outlier Detector (DCSOD), that automatically explores the dense connectivity of CNN architectures on the near-OOD detection task using Neural Architecture Search (NAS). We introduce a hierarchical search space containing versatile convolution operators and dense connectivity, allowing a flexible exploration of CNN architectures with diverse connectivity patterns. To improve the quality of evaluation on OOD detection during search, we propose evolving distillation based on our multi-view feature learning explanation. Evolving distillation stabilizes training for OOD detection evaluation, which in turn improves the quality of the search. We thoroughly examine DCSOD on CIFAR benchmarks under the OOD detection protocol. Experimental results show that DCSOD achieves remarkable performance over widely used architectures and previous NAS baselines. Notably, DCSOD achieves state-of-the-art (SOTA) performance on the CIFAR benchmark, with an AUROC improvement of ~1.0%.
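Since the result above is reported in AUROC, here is a minimal sketch of how that metric can be computed for an OOD detector. It uses the standard pairwise definition: the probability that a randomly chosen OOD sample gets a higher outlier score than a randomly chosen in-distribution sample. The scores are made up and the function name `auroc` is an assumption. Nothing here comes from the DCSOD code.

```python
def auroc(in_scores, ood_scores):
    """AUROC as the fraction of (OOD, in-distribution) pairs ranked correctly.

    A pair counts as correct when the OOD sample has the higher outlier score;
    ties count as half a win.
    """
    wins = 0.0
    for o in ood_scores:
        for i in in_scores:
            if o > i:
                wins += 1.0
            elif o == i:
                wins += 0.5
    return wins / (len(in_scores) * len(ood_scores))


# Hypothetical outlier scores (higher means "more likely OOD").
in_scores = [0.1, 0.2, 0.3, 0.6]
ood_scores = [0.4, 0.7, 0.9]
value = auroc(in_scores, ood_scores)
print(round(value, 4))  # 11 of the 12 pairs are ranked correctly -> 0.9167
```

The quadratic pairwise loop is used for clarity. For large evaluation sets, a sort-based computation gives the same value more efficiently.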

[CV-61] The Crystal Ball Hypothesis in diffusion models: Anticipating object positions from initial noise

Link: https://arxiv.org/abs/2406.01970
Authors: Yuanhao Ban, Ruochen Wang, Tianyi Zhou, Boqing Gong, Cho-Jui Hsieh, Minhao Cheng
Keywords: achieved remarkable success, Diffusion models, rarely explored, initial noise, models have achieved
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Diffusion models have achieved remarkable success in text-to-image generation tasks; however, the role of initial noise has been rarely explored. In this study, we identify specific regions within the initial noise image, termed trigger patches, that play a key role for object generation in the resulting images. Notably, these patches are “universal” and can be generalized across various positions, seeds, and prompts. To be specific, extracting these patches from one noise and injecting them into another noise leads to object generation in targeted areas. We identify these patches by analyzing the dispersion of object bounding boxes across generated images, leading to the development of a posterior analysis technique. Furthermore, we create a dataset consisting of Gaussian noises labeled with bounding boxes corresponding to the objects appearing in the generated images and train a detector that identifies these patches from the initial noise. To explain the formation of these patches, we reveal that they are outliers in Gaussian noise, and follow distinct distributions through two-sample tests. Finally, we find the misalignment between prompts and the trigger patch patterns can result in unsuccessful image generations. The study proposes a reject-sampling strategy to obtain optimal noise, aiming to improve prompt adherence and positional diversity in image generation.
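The reject-sampling idea mentioned at the end of this abstract can be sketched generically: keep drawing initial noise until a scoring function accepts it. The abstract does not say how the authors score a noise sample, so `patch_score` below is a purely hypothetical stand-in (a real pipeline would presumably use the trained trigger-patch detector). The names `reject_sample` and `draw_noise` and the threshold are also assumptions.

```python
import random


def reject_sample(draw, score, threshold, max_tries=1000):
    """Draw candidates until one scores at least `threshold`.

    Returns the accepted candidate and the number of draws it took.
    """
    for attempt in range(1, max_tries + 1):
        candidate = draw()
        if score(candidate) >= threshold:
            return candidate, attempt
    raise RuntimeError("no candidate accepted within max_tries")


rng = random.Random(0)  # fixed seed so the sketch is reproducible


def draw_noise():
    # A flat 16-element Gaussian "noise image" (toy size).
    return [rng.gauss(0.0, 1.0) for _ in range(16)]


def patch_score(noise):
    # Hypothetical stand-in for a trigger-patch detector: the largest
    # magnitude inside a target region (here, the first four elements).
    return max(abs(x) for x in noise[:4])


noise, tries = reject_sample(draw_noise, patch_score, threshold=1.5)
print(tries, round(patch_score(noise), 3))
```

The accepted noise would then be passed to the diffusion sampler as its starting point. The cost of rejection is only the cost of drawing and scoring noise, which is cheap compared with a full generation.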

[CV-62] Exploring Real World Map Change Generalization of Prior-Informed HD Map Prediction Models

Link: https://arxiv.org/abs/2406.01961
Authors: Samuel M. Bateman, Ning Xu, H. Charles Zhao, Yael Ben Shalom, Vince Gong, Greg Long, Will Maddern
Keywords: Building and maintaining, maintaining High-Definition, represents a large, large barrier, autonomous vehicle deployment
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2024, Workshop on Autonomous Driving

Abstract:Building and maintaining High-Definition (HD) maps represents a large barrier to autonomous vehicle deployment. This, along with advances in modern online map detection models, has sparked renewed interest in the online mapping problem. However, effectively predicting online maps at a high enough quality to enable safe, driverless deployments remains a significant challenge. Recent work on these models proposes training robust online mapping systems using low quality map priors with synthetic perturbations in an attempt to simulate out-of-date HD map priors. In this paper, we investigate how models trained on these synthetically perturbed map priors generalize to performance on deployment-scale, real world map changes. We present a large-scale experimental study to determine which synthetic perturbations are most useful in generalizing to real world HD map changes, evaluated using multiple years of real-world autonomous driving data. We show there is still a substantial sim2real gap between synthetic prior perturbations and observed real-world changes, which limits the utility of current prior-informed HD map prediction models.

[CV-63] Enhance Image-to-Image Generation with LLaVA Prompt and Negative Prompt

Link: https://arxiv.org/abs/2406.01956
Authors: Zhicheng Ding, Panfeng Li, Qikai Yang, Siyang Li
Keywords: Vision Assistant, Large Language, Language and Vision, approach to enhance, paper presents
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by 2024 5th International Conference on Information Science, Parallel and Distributed Systems

Abstract:This paper presents a novel approach to enhance image-to-image generation by leveraging the multimodal capabilities of the Large Language and Vision Assistant (LLaVA). We propose a framework where LLaVA analyzes input images and generates textual descriptions, hereinafter LLaVA-generated prompts. These prompts, along with the original image, are fed into the image-to-image generation pipeline. This enriched representation guides the generation process towards outputs that exhibit a stronger resemblance to the input image. Extensive experiments demonstrate the effectiveness of LLaVA-generated prompts in promoting image similarity. We observe a significant improvement in the visual coherence between the generated and input images compared to traditional methods. Future work will explore fine-tuning LLaVA prompts for increased control over the creative process. By providing more specific details within the prompts, we aim to achieve a delicate balance between faithfulness to the original image and artistic expression in the generated outputs.

[CV-64] Plug-and-Play Diffusion Distillation

Link: https://arxiv.org/abs/2406.01954
Authors: Yi-Ting Hsiao, Siavash Khodadadeh, Kevin Duarte, Wei-An Lin, Hui Qu, Mingi Kwon, Ratheesh Kalarot
Keywords: shown tremendous results, shown tremendous, Diffusion models, Diffusion, model
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2024

Abstract:Diffusion models have shown tremendous results in image generation. However, due to the iterative nature of the diffusion process and its reliance on classifier-free guidance, inference times are slow. In this paper, we propose a new distillation approach for guided diffusion models in which an external lightweight guide model is trained while the original text-to-image model remains frozen. We show that our method reduces the inference computation of classifier-free guided latent-space diffusion models by almost half, and only requires 1% trainable parameters of the base model. Furthermore, once trained, our guide model can be applied to various fine-tuned, domain-specific versions of the base diffusion model without the need for additional training: this “plug-and-play” functionality drastically improves inference computation while maintaining the visual fidelity of generated images. Empirically, we show that our approach is able to produce visually appealing results and achieve a comparable FID score to the teacher with as few as 8 to 16 steps.
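To see why removing classifier-free guidance at inference roughly halves the computation, it helps to write out the standard guidance rule and count denoiser calls. The sketch below shows only the usual classifier-free guidance combination and a call count. It is not the paper's guide model, and it assumes the guide's own cost is small next to the base model. The function names and numbers are illustrative.

```python
def cfg_combine(eps_uncond, eps_cond, scale):
    """Standard classifier-free guidance: move from the unconditional
    prediction toward the conditional one by a factor of `scale`."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def denoiser_calls(steps, passes_per_step):
    # Total forward passes of the base denoiser over one sampling run.
    return steps * passes_per_step


# Toy two-element noise predictions (made-up values).
eps = cfg_combine([0.0, 1.0], [1.0, 1.0], scale=7.5)
print(eps)  # [7.5, 1.0]

# Guided sampling needs a conditional and an unconditional pass per step.
standard = denoiser_calls(steps=16, passes_per_step=2)
# A distilled setup that needs a single base-model pass per step.
distilled = denoiser_calls(steps=16, passes_per_step=1)
print(standard, distilled)  # 32 16
```

The saving is "almost" half and not exactly half because the lightweight guide model still adds some per-step cost, which this count ignores.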

[CV-65] Nutrition Estimation for Dietary Management: A Transformer Approach with Depth Sensing

Link: https://arxiv.org/abs/2406.01938
Authors: Zhengyi Kwan, Wei Zhang, Zhengkui Wang, Aik Beng Ng, Simon See
Keywords: Nutrition estimation, health and well-being, crucial for effective, effective dietary management, effective dietary
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: 10 pages

Abstract:Nutrition estimation is crucial for effective dietary management and overall health and well-being. Existing methods often struggle with sub-optimal accuracy and can be time-consuming. In this paper, we propose NuNet, a transformer-based network designed for nutrition estimation that utilizes both RGB and depth information from food images. We have designed and implemented a multi-scale encoder and decoder, along with two types of feature fusion modules, specialized for estimating five nutritional factors. These modules effectively balance the efficiency and effectiveness of feature extraction with flexible usage of our customized attention mechanisms and fusion strategies. Our experimental study shows that NuNet outperforms its variants and existing solutions significantly for nutrition estimation. It achieves an error rate of 15.65%, the lowest known to us, largely due to our multi-scale architecture and fusion modules. This research holds practical values for dietary management with huge potential for transnational research and deployment and could inspire other applications involving multiple data types with varying degrees of importance.

[CV-66] Detecting Endangered Marine Species in Autonomous Underwater Vehicle Imagery Using Point Annotations and Few-Shot Learning

Link: https://arxiv.org/abs/2406.01932
Authors: Heather Doig, Oscar Pizarro, Jacquomo Monk, Stefan Williams
Keywords: Autonomous Underwater Vehicles, Underwater Vehicles, Autonomous Underwater, common marine species, marine species
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 7 pages, 5 figures. Submitted to the 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2024)

Abstract:One use of Autonomous Underwater Vehicles (AUVs) is the monitoring of habitats associated with threatened, endangered and protected marine species, such as the handfish of Tasmania, Australia. Seafloor imagery collected by AUVs can be used to identify individuals within their broader habitat context, but the sheer volume of imagery collected can overwhelm efforts to locate rare or cryptic individuals. Machine learning models can be used to identify the presence of a particular species in images using a trained object detector, but the lack of training examples reduces detection performance, particularly for rare species that may only have a small number of examples in the wild. In this paper, inspired by recent work in few-shot learning, images and annotations of common marine species are exploited to enhance the ability of the detector to identify rare and cryptic species. Annotated images of six common marine species are used in two ways. Firstly, the common species are used in a pre-training step to allow the backbone to create rich features for marine species. Secondly, a copy-paste operation is used with the common species images to augment the training data. While annotations for more common marine species are available in public datasets, they are often in point format, which is unsuitable for training an object detector. A popular semantic segmentation model efficiently generates bounding box annotations for training from the available point annotations. Our proposed framework is applied to AUV images of handfish, increasing average precision by up to 48% compared to baseline object detection training. This approach can be applied to other objects with low numbers of annotations and promises to increase the ability to actively monitor threatened, endangered and protected species.

[CV-67] CODE: Contrasting Self-generated Description to Combat Hallucination in Large Multi-modal Models

Link: https://arxiv.org/abs/2406.01920
Authors: Junho Kim, Hyunjun Kim, Yeonju Kim, Yong Man Ro
Keywords: Large Multi-modal Models, Large Multi-modal, recently demonstrated remarkable, demonstrated remarkable abilities, visual context understanding
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Project page: this https URL

Abstract:Large Multi-modal Models (LMMs) have recently demonstrated remarkable abilities in visual context understanding and coherent response generation. However, alongside these advancements, the issue of hallucinations has emerged as a significant challenge, producing erroneous responses that are unrelated to the visual contents. In this paper, we introduce a novel contrastive-based decoding method, COuntering DEscription Contrastive Decoding (CODE), which leverages self-generated descriptions as contrasting references during the decoding phase of LMMs to address hallucination issues. CODE utilizes the comprehensive descriptions from model itself as visual counterpart to correct and improve response alignment with actual visual content. By dynamically adjusting the information flow and distribution of next-token predictions in the LMM’s vocabulary, CODE enhances the coherence and informativeness of generated responses. Extensive experiments demonstrate that our method significantly reduces hallucinations and improves cross-modal consistency across various benchmarks and cutting-edge LMMs. Our method provides a simple yet effective decoding strategy that can be integrated to existing LMM frameworks without additional training.
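Contrastive decoding in general adjusts next-token logits by amplifying what the main branch predicts and subtracting what a contrasting branch predicts. The sketch below shows that generic rule only. It is not the CODE method itself, whose dynamic adjustment is not specified in the abstract. The fixed weight `alpha`, the function names, and the toy logits are assumptions.

```python
import math


def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_logits(main, contrast, alpha):
    """Generic contrastive decoding: boost the main branch and subtract
    the contrasting branch, both weighted by `alpha`."""
    return [(1.0 + alpha) * a - alpha * b for a, b in zip(main, contrast)]


# Toy vocabulary of three tokens (made-up logits).
# main:     logits conditioned on the actual image
# contrast: logits conditioned on a self-generated description instead
main = [2.0, 1.9, 0.1]
contrast = [0.5, 1.9, 0.1]
adjusted = contrastive_logits(main, contrast, alpha=1.0)
probs = softmax(adjusted)
print([round(p, 3) for p in probs])
```

In this toy case tokens 0 and 1 are nearly tied under the main branch alone. Token 1 is equally likely under the description-only branch, so the contrast leaves it unchanged while token 0, which only the image supports, pulls clearly ahead.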

[CV-68] GOMAA-Geo: GOal Modality Agnostic Active Geo-localization

Link: https://arxiv.org/abs/2406.01917
Authors: Anindya Sarkar, Srikumar Sastry, Aleksis Pirinen, Chongjie Zhang, Nathan Jacobs, Yevgeniy Vorobeychik
Keywords: sequence of visual, find a target, visual cues observed, goal, AGL task
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 23 pages, 17 figures

Abstract:We consider the task of active geo-localization (AGL) in which an agent uses a sequence of visual cues observed during aerial navigation to find a target specified through multiple possible modalities. This could emulate a UAV involved in a search-and-rescue operation navigating through an area, observing a stream of aerial images as it goes. The AGL task is associated with two important challenges. Firstly, an agent must deal with a goal specification in one of multiple modalities (e.g., through a natural language description) while the search cues are provided in other modalities (aerial imagery). The second challenge is limited localization time (e.g., limited battery life, urgency) so that the goal must be localized as efficiently as possible, i.e. the agent must effectively leverage its sequentially observed aerial views when searching for the goal. To address these challenges, we propose GOMAA-Geo - a goal modality agnostic active geo-localization agent - for zero-shot generalization between different goal modalities. Our approach combines cross-modality contrastive learning to align representations across modalities with supervised foundation model pretraining and reinforcement learning to obtain highly effective navigation and localization policies. Through extensive evaluations, we show that GOMAA-Geo outperforms alternative learnable approaches and that it generalizes across datasets - e.g., to disaster-hit areas without seeing a single disaster scenario during training - and goal modalities - e.g., to ground-level imagery or textual descriptions, despite only being trained with goals specified as aerial views. Code and models are publicly available at this https URL.

[CV-69] FastLGS: Speeding up Language Embedded Gaussians with Feature Grid Mapping

Link: https://arxiv.org/abs/2406.01916
Authors: Yuzhou Ji, He Zhu, Junshu Tang, Wuyi Liu, Zhizhong Zhang, Yuan Xie, Lizhuang Ma, Xin Tan
Keywords: scene understanding applications, semantically interactive radiance, interactive radiance field, automated real-world, scene understanding
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:The semantically interactive radiance field has always been an appealing task for its potential to facilitate user-friendly and automated real-world 3D scene understanding applications. However, it is a challenging task to achieve high quality, efficiency and zero-shot ability at the same time with semantics in radiance fields. In this work, we present FastLGS, an approach that supports real-time open-vocabulary query within 3D Gaussian Splatting (3DGS) under high resolution. We propose the semantic feature grid to save multi-view CLIP features which are extracted based on Segment Anything Model (SAM) masks, and map the grids to low dimensional features for semantic field training through 3DGS. Once trained, we can restore pixel-aligned CLIP embeddings through feature grids from rendered features for open-vocabulary queries. Comparisons with other state-of-the-art methods prove that FastLGS can achieve the first place performance concerning both speed and accuracy, where FastLGS is 98x faster than LERF and 4x faster than LangSplat. Meanwhile, experiments show that FastLGS is adaptive and compatible with many downstream tasks, such as 3D segmentation and 3D object inpainting, which can be easily applied to other 3D manipulation systems.

[CV-70] HPE-CogVLM: New Head Pose Grounding Task Exploration on Vision Language Model

Link: https://arxiv.org/abs/2406.01914
Authors: Yu Tian, Tianqi Shao, Tsukasa Demizu, Xuyang Wu, Hsin-Tai Wu
Keywords: roll Euler angles, Head pose estimation, precise numerical output, Euler angles, roll Euler
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:The head pose estimation (HPE) task requires a sophisticated understanding of 3D spatial relationships and precise numerical output of yaw, pitch, and roll Euler angles. Previous HPE studies are mainly based on Non-large language models (Non-LLMs), which rely on close-up human heads cropped from the full image as inputs and lack robustness in real-world scenarios. In this paper, we present a novel framework to enhance the HPE prediction task by leveraging the visual grounding capability of CogVLM. CogVLM is a vision language model (VLM) with grounding capability of predicting object bounding boxes (BBoxes), which enables HPE training and prediction using full image information input. To integrate the HPE task into the VLM, we first cope with the catastrophic forgetting problem in large language models (LLMs) by investigating the rehearsal ratio in the data rehearsal method. Then, we propose and validate a LoRA layer-based model merging method, which keeps the integrity of parameters, to enhance the HPE performance in the framework. The results show our HPE-CogVLM achieves a 31.5% reduction in Mean Absolute Error for HPE prediction over the current Non-LLM based state-of-the-art in cross-dataset evaluation. Furthermore, we compare our LoRA layer-based model merging method with LoRA fine-tuning only and other merging methods in CogVLM. The results demonstrate our framework outperforms them in all HPE metrics.

[CV-71] ProGEO: Generating Prompts through Image-Text Contrastive Learning for Visual Geo-localization

Link: https://arxiv.org/abs/2406.01906
Authors: Chen Mao, Jingqi Hu
Keywords: computer vision tasks, augmented reality, autonomous driving, visual geo-localization datasets, Visual Geo-localization
Categories: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)

Abstract:Visual Geo-localization (VG) refers to the process to identify the location described in query images, which is widely applied in robotics field and computer vision tasks, such as autonomous driving, metaverse, augmented reality, and SLAM. In fine-grained images lacking specific text descriptions, directly applying pure visual methods to represent neighborhood features often leads to the model focusing on overly fine-grained features, unable to fully mine the semantic information in the images. Therefore, we propose a two-stage training method to enhance visual performance and use contrastive learning to mine challenging samples. We first leverage the multi-modal description capability of CLIP (Contrastive Language-Image Pretraining) to create a set of learnable text prompts for each geographic image feature to form vague descriptions. Then, by utilizing dynamic text prompts to assist the training of the image encoder, we enable the image encoder to learn better and more generalizable visual features. This strategy of applying text to purely visual tasks addresses the challenge of using multi-modal models for geographic images, which often suffer from a lack of precise descriptions, making them difficult to utilize widely. We validate the effectiveness of the proposed strategy on several large-scale visual geo-localization datasets, and our method achieves competitive results on multiple visual geo-localization datasets. Our code and model are available at this https URL.

[CV-72] Follow-Your-Emoji: Fine-Controllable and Expressive Freestyle Portrait Animation

Link: https://arxiv.org/abs/2406.01900
Authors: Yue Ma, Hongyu Liu, Hongfa Wang, Heng Pan, Yingqing He, Junkun Yuan, Ailing Zeng, Chengfei Cai, Heung-Yeung Shum, Wei Liu, Qifeng Chen
Keywords: target landmark sequences, reference portrait, diffusion-based framework, portrait, reference
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL

Abstract:We present Follow-Your-Emoji, a diffusion-based framework for portrait animation, which animates a reference portrait with target landmark sequences. The main challenge of portrait animation is to preserve the identity of the reference portrait and transfer the target expression to this portrait while maintaining temporal consistency and fidelity. To address these challenges, Follow-Your-Emoji equipped the powerful Stable Diffusion model with two well-designed technologies. Specifically, we first adopt a new explicit motion signal, namely expression-aware landmark, to guide the animation process. We discover this landmark can not only ensure the accurate motion alignment between the reference portrait and target motion during inference but also increase the ability to portray exaggerated expressions (i.e., large pupil movements) and avoid identity leakage. Then, we propose a facial fine-grained loss to improve the model’s ability of subtle expression perception and reference portrait appearance reconstruction by using both expression and facial masks. Accordingly, our method demonstrates significant performance in controlling the expression of freestyle portraits, including real humans, cartoons, sculptures, and even animals. By leveraging a simple and effective progressive generation strategy, we extend our model to stable long-term animation, thus increasing its potential application value. To address the lack of a benchmark for this field, we introduce EmojiBench, a comprehensive benchmark comprising diverse portrait images, driving videos, and landmarks. We show extensive evaluations on EmojiBench to verify the superiority of Follow-Your-Emoji.

[CV-73] SVASTIN: Sparse Video Adversarial Attack via Spatio-Temporal Invertible Neural Networks

Link: https://arxiv.org/abs/2406.01894
Authors: Yi Pan, Jun-Jie Huang, Zihan Chen, Wentao Zhao, Ziyue Wang
Keywords: Robust and imperceptible, Invertible Neural Networks, adversarial video attack, Spatio-Temporal Invertible Neural, imperceptible adversarial video
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Robust and imperceptible adversarial video attack is challenging due to the spatial and temporal characteristics of videos. The existing video adversarial attack methods mainly take a gradient-based approach and generate adversarial videos with noticeable perturbations. In this paper, we propose a novel Sparse Adversarial Video Attack via Spatio-Temporal Invertible Neural Networks (SVASTIN) to generate adversarial videos through spatio-temporal feature space information exchanging. It consists of a Guided Target Video Learning (GTVL) module to balance the perturbation budget and optimization speed and a Spatio-Temporal Invertible Neural Network (STIN) module to perform spatio-temporal feature space information exchanging between a source video and the target feature tensor learned by the GTVL module. Extensive experiments on UCF-101 and Kinetics-400 demonstrate that our proposed SVASTIN can generate adversarial examples with higher imperceptibility and a higher fooling rate than the state-of-the-art methods. Code is available at this https URL.

[CV-74] Rank-based No-reference Quality Assessment for Face Swapping

Link: https://arxiv.org/abs/2406.01884
Authors: Xinghui Zhou, Wenbo Zhou, Tianyi Wei, Shen Chen, Taiping Yao, Shouhong Ding, Weiming Zhang, Nenghai Yu
Keywords: rapid technological advancements, prominent research area, image processing due, image quality assessment, Face swapping
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 5 figures

Abstract:Face swapping has become a prominent research area in computer vision and image processing due to rapid technological advancements. The metric of measuring the quality in most face swapping methods relies on several distances between the manipulated images and the source image, or the target image, i.e., there are suitable known reference face images. Therefore, there is still a gap in accurately assessing the quality of face interchange in reference-free scenarios. In this study, we present a novel no-reference image quality assessment (NR-IQA) method specifically designed for face swapping, addressing this issue by constructing a comprehensive large-scale dataset, implementing a method for ranking image quality based on multiple facial attributes, and incorporating a Siamese network based on interpretable qualitative comparisons. Our model demonstrates the state-of-the-art performance in the quality assessment of swapped faces, providing coarse- and fine-grained. Enhanced by this metric, an improved face-swapping model achieved a more advanced level with respect to expressions and poses. Extensive experiments confirm the superiority of our method over existing general no-reference image quality assessment metrics and the latest metric of facial image quality assessment, making it well suited for evaluating face swapping images in real-world scenarios.

[CV-75] Fruit Classification System with Deep Learning and Neural Architecture Search

Link: https://arxiv.org/abs/2406.01869
Authors: Christine Dewi, Dhananjay Thiruvady, Nayyar Zaidi
Keywords: identification process involves, process involves analyzing, fruit identification process, visual characteristics, identification process
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:The fruit identification process involves analyzing and categorizing different types of fruits based on their visual characteristics. This activity can be achieved using a range of methodologies, encompassing manual examination, conventional computer vision methodologies, and more sophisticated methodologies employing machine learning and deep learning. Our study identified a total of 15 distinct categories of fruit, consisting of class Avocado, Banana, Cherry, Apple Braeburn, Apple golden 1, Apricot, Grape, Kiwi, Mango, Orange, Papaya, Peach, Pineapple, Pomegranate and Strawberry. Neural Architecture Search (NAS) is a technological advancement employed within the realm of deep learning and artificial intelligence, to automate conceptualizing and refining neural network topologies. NAS aims to identify neural network structures that are highly suitable for tasks, such as the detection of fruits. Our suggested model with 99.98% mAP increased the detection performance of the preceding research study that used Fruit datasets. In addition, after the completion of the study, a comparative analysis was carried out to assess the findings in conjunction with those of another research that is connected to the topic. When compared to the findings of earlier studies, the detector that was proposed exhibited higher performance in terms of both its accuracy and its precision.

[CV-76] MoLA: Motion Generation and Editing with Latent Diffusion Enhanced by Adversarial Training

Link: https://arxiv.org/abs/2406.01867
Authors: Kengo Uchida, Takashi Shibuya, Yuhta Takida, Naoki Murata, Shusuke Takahashi, Yuki Mitsufuji
Keywords: latent diffusion model, diffusion model, editing tasks, quality and speed, editing
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 6 figures

Abstract:In motion generation, controllability as well as generation quality and speed is becoming more and more important. There are various motion editing tasks, such as in-betweening, upper body editing, and path-following, but existing methods perform motion editing with a data-space diffusion model, which is slow in inference compared to a latent diffusion model. In this paper, we propose MoLA, which provides fast and high-quality motion generation and also can deal with multiple editing tasks in a single framework. For high-quality and fast generation, we employ a variational autoencoder and latent diffusion model, and improve the performance with adversarial training. In addition, we apply a training-free guided generation framework to achieve various editing tasks with motion control inputs. We quantitatively show the effectiveness of adversarial learning in text-to-motion generation, and demonstrate the applicability of our editing framework to multiple editing tasks in the motion domain.

[CV-77] L-MAGIC: Language Model Assisted Generation of Images with Coherence

Link: https://arxiv.org/abs/2406.01843
Authors: Zhipeng Cai, Matthias Mueller, Reiner Birkl, Diana Wofk, Shao-Yen Tseng, JunDa Cheng, Gabriela Ben-Melech Stan, Vasudev Lal, Michael Paulitsch
Keywords: single input image, input image remains, generative AI breakthroughs, key challenge, current era
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: accepted to CVPR 2024

Abstract:In the current era of generative AI breakthroughs, generating panoramic scenes from a single input image remains a key challenge. Most existing methods use diffusion-based iterative or simultaneous multi-view inpainting. However, the lack of global scene layout priors leads to subpar outputs with duplicated objects (e.g., multiple beds in a bedroom) or requires time-consuming human text inputs for each view. We propose L-MAGIC, a novel method leveraging large language models for guidance while diffusing multiple coherent views of 360 degree panoramic scenes. L-MAGIC harnesses pre-trained diffusion and language models without fine-tuning, ensuring zero-shot performance. The output quality is further enhanced by super-resolution and multi-view fusion techniques. Extensive experiments demonstrate that the resulting panoramic scenes feature better scene layouts and perspective view rendering quality compared to related works, with 70% preference in human evaluations. Combined with conditional diffusion models, L-MAGIC can accept various input modalities, including but not limited to text, depth maps, sketches, and colored scripts. Applying depth estimation further enables 3D point cloud generation and dynamic scene exploration with fluid camera motion. Code is available at this https URL. The video presentation is available at this https URL.

[CV-78] Boosting Vision-Language Models with Transduction

Link: https://arxiv.org/abs/2406.01837
Authors: Maxime Zanella, Benoît Gérin, Ismail Ben Ayed
Keywords: boost predictive accuracy, predictive accuracy, powerful paradigm, paradigm that leverages, leverages the structure
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Transduction is a powerful paradigm that leverages the structure of unlabeled data to boost predictive accuracy. We present TransCLIP, a novel and computationally efficient transductive approach designed for Vision-Language Models (VLMs). TransCLIP is applicable as a plug-and-play module on top of popular inductive zero- and few-shot models, consistently improving their performances. Our new objective function can be viewed as a regularized maximum-likelihood estimation, constrained by a KL divergence penalty that integrates the text-encoder knowledge and guides the transductive learning process. We further derive an iterative Block Majorize-Minimize (BMM) procedure for optimizing our objective, with guaranteed convergence and decoupled sample-assignment updates, yielding computationally efficient transduction for large-scale datasets. We report comprehensive evaluations, comparisons, and ablation studies that demonstrate: (i) Transduction can greatly enhance the generalization capabilities of inductive pretrained zero- and few-shot VLMs; (ii) TransCLIP substantially outperforms standard transductive few-shot learning methods relying solely on vision features, notably due to the KL-based language constraint.

[CV-79] FacAID: A Transformer Model for Neuro-Symbolic Facade Reconstruction

Link: https://arxiv.org/abs/2406.01829
Authors: Aleksander Płocharski, Jan Swidzinski, Joanna Porter-Sobieraj, Przemyslaw Musialski
Keywords: custom-designed split grammar, split grammar, custom-designed split, segmented facade structures, semi-complex split grammar
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
Comments: 11 pages, 10 figures, preprint

Abstract:We introduce a neuro-symbolic transformer-based model that converts flat, segmented facade structures into procedural definitions using a custom-designed split grammar. To facilitate this, we first develop a semi-complex split grammar tailored for architectural facades and then generate a dataset comprising facades alongside their corresponding procedural representations. This dataset is used to train our transformer model to convert segmented, flat facades into the procedural language of our grammar. During inference, the model applies this learned transformation to new facade segmentations, providing a procedural representation that users can adjust to generate varied facade designs. This method not only automates the conversion of static facade images into dynamic, editable procedural formats but also enhances design flexibility, allowing for easy modifications and variations by architects and designers. Our approach sets a new standard in facade design by combining the precision of procedural generation with the adaptability of neuro-symbolic learning.

[CV-80] Finding Lottery Tickets in Vision Models via Data-driven Spectral Foresight Pruning

Link: https://arxiv.org/abs/2406.01820
Authors: Leonardo Iurada, Marco Ciccone, Tatiana Tommasi
Keywords: Neural Tangent Kernel, Recent advances, deep learning models, neural network pruning, memory demands
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted CVPR 2024 - this https URL

Abstract:Recent advances in neural network pruning have shown how it is possible to reduce the computational costs and memory demands of deep learning models before training. We focus on this framework and propose a new pruning at initialization algorithm that leverages the Neural Tangent Kernel (NTK) theory to align the training dynamics of the sparse network with that of the dense one. Specifically, we show how the usually neglected data-dependent component in the NTK’s spectrum can be taken into account by providing an analytical upper bound to the NTK’s trace obtained by decomposing neural networks into individual paths. This leads to our Path eXclusion (PX), a foresight pruning method designed to preserve the parameters that mostly influence the NTK’s trace. PX is able to find lottery tickets (i.e. good paths) even at high sparsity levels and largely reduces the need for additional training. When applied to pre-trained models it extracts subnetworks directly usable for several downstream tasks, resulting in performance comparable to those of the dense counterpart but with substantial cost and computational savings. Code available at: this https URL

[CV-81] Deep asymmetric mixture model for unsupervised cell segmentation

Link: https://arxiv.org/abs/2406.01815
Authors: Yang Nan, Guang Yang
Keywords: Automated cell segmentation, Automated cell, drug discovery, laborious and subjective, Deep Gaussian mixture
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 pages, 3 figures

Abstract:Automated cell segmentation has become increasingly crucial for disease diagnosis and drug discovery, as manual delineation is excessively laborious and subjective. To address this issue with limited manual annotation, researchers have developed semi/unsupervised segmentation approaches. Among these approaches, the Deep Gaussian mixture model plays a vital role due to its capacity to facilitate complex data distributions. However, these models assume that the data follows symmetric normal distributions, which is inapplicable for data that is asymmetrically distributed. These models also suffer from weak generalization capacity and are sensitive to outliers. To address these issues, this paper presents a novel asymmetric mixture model for unsupervised cell segmentation. This asymmetric mixture model is built by aggregating certain multivariate Gaussian mixture models with log-likelihood and self-supervised-based optimization functions. The proposed asymmetric mixture model outperforms (nearly 2-30% gain in dice coefficient, p<0.05) the existing state-of-the-art unsupervised models on cell segmentation, including Segment Anything.
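
The abstract above reports its gains as a Dice coefficient. For readers unfamiliar with the metric, here is a minimal sketch of how Dice is usually computed for two binary masks. The function name `dice_coefficient`, the flat 0/1 mask representation, and the empty-mask convention are assumptions made for illustration; this is not the paper's evaluation code.

```python
def dice_coefficient(pred, target):
    """Dice = 2 * |A ∩ B| / (|A| + |B|) for two binary masks (flat 0/1 sequences)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    # Pixels marked as foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Total foreground pixels across the two masks.
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


pred = [1, 1, 0, 0, 1, 0]    # hypothetical predicted mask
target = [1, 0, 0, 0, 1, 1]  # hypothetical ground-truth mask
print(dice_coefficient(pred, target))
```

Here the masks overlap on two pixels and each has three foreground pixels, so the score is 2·2/(3+3) = 2/3.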

[CV-82] The Empirical Impact of Forgetting and Transfer in Continual Visual Odometry

Link: https://arxiv.org/abs/2406.01797
Authors: Paolo Cudrano, Xiaoyu Luo, Matteo Matteucci
Keywords: embodied agents increases, continues to advance, continuously-learning embodied agents, adaptive and continuously-learning, realm of assistance
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Accepted to CoLLAs 2024

Abstract:As robotics continues to advance, the need for adaptive and continuously-learning embodied agents increases, particularly in the realm of assistance robotics. Quick adaptability and long-term information retention are essential to operate in dynamic environments typical of humans’ everyday lives. A lifelong learning paradigm is thus required, but it is scarcely addressed by current robotics literature. This study empirically investigates the impact of catastrophic forgetting and the effectiveness of knowledge transfer in neural networks trained continuously in an embodied setting. We focus on the task of visual odometry, which holds primary importance for embodied agents in enabling their self-localization. We experiment on the simple continual scenario of discrete transitions between indoor locations, akin to a robot navigating different apartments. In this regime, we observe initial satisfactory performance with high transferability between environments, followed by a specialization phase where the model prioritizes current environment-specific knowledge at the expense of generalization. Conventional regularization strategies and increased model capacity prove ineffective in mitigating this phenomenon. Rehearsal is instead mildly beneficial but with the addition of a substantial memory cost. Incorporating action information, as commonly done in embodied settings, facilitates quicker convergence but exacerbates specialization, making the model overly reliant on its motion expectations and less adept at correctly interpreting visual cues. These findings emphasize the open challenges of balancing adaptation and memory retention in lifelong robotics and contribute valuable insights into the application of a lifelong paradigm on embodied agents.

[CV-83] Hybrid-Learning Video Moment Retrieval across Multi-Domain Labels

Link: https://arxiv.org/abs/2406.01791
Authors: Weitong Cai, Jiabo Huang, Shaogang Gong
Keywords: text query description, untrimmed raw video, Video moment retrieval, query description, Video moment
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by BMVC2022

Abstract:Video moment retrieval (VMR) is to search for a visual temporal moment in an untrimmed raw video by a given text query description (sentence). Existing studies either start from collecting exhaustive frame-wise annotations on the temporal boundary of target moments (fully-supervised), or learn with only the video-level video-text pairing labels (weakly-supervised). The former is poor in generalisation to unknown concepts and/or novel scenes due to restricted dataset scale and diversity under expensive annotation costs; the latter is subject to visual-textual mis-correlations from incomplete labels. In this work, we introduce a new approach called hybrid-learning video moment retrieval to solve the problem by knowledge transfer through adapting the video-text matching relationships learned from a fully-supervised source domain to a weakly-labelled target domain when they do not share a common label space. Our aim is to explore shared universal knowledge between the two domains in order to improve model learning in the weakly-labelled target domain. Specifically, we introduce a multiplE branch Video-text Alignment model (EVA) that performs cross-modal (visual-textual) matching information sharing and multi-modal feature alignment to optimise domain-invariant visual and textual features as well as per-task discriminative joint video-text representations. Experiments show EVA’s effectiveness in exploring temporal segment annotations in a source domain to help learn video moment retrieval without temporal labels in a target domain.

[CV-84] Reproducibility Study on Adversarial Attacks Against Robust Transformer Trackers

Link: https://arxiv.org/abs/2406.01765
Authors: Fatemeh Nourilenjan Nokabadi, Jean-François Lalonde, Christian Gagné
Keywords: demonstrated strong performance, transformer trackers, trackers, demonstrated strong, transformer
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Published in Transactions on Machine Learning Research (05/2024): this https URL

Abstract:New transformer networks have been integrated into object tracking pipelines and have demonstrated strong performance on the latest benchmarks. This paper focuses on understanding how transformer trackers behave under adversarial attacks and how different attacks perform on tracking datasets as their parameters change. We conducted a series of experiments to evaluate the effectiveness of existing adversarial attacks on object trackers with transformer and non-transformer backbones. We experimented on 7 different trackers, including 3 that are transformer-based, and 4 which leverage other architectures. These trackers are tested against 4 recent attack methods to assess their performance and robustness on VOT2022ST, UAV123 and GOT10k datasets. Our empirical study focuses on evaluating adversarial robustness of object trackers based on bounding box versus binary mask predictions, and attack methods at different levels of perturbations. Interestingly, our study found that altering the perturbation level may not significantly affect the overall object tracking results after the attack. Similarly, the sparsity and imperceptibility of the attack perturbations may remain stable against perturbation level shifts. By applying a specific attack on all transformer trackers, we show that new transformer trackers having a stronger cross-attention modeling achieve a greater adversarial robustness on tracking datasets, such as VOT2022ST and GOT10k. Our results also indicate the necessity for new attack methods to effectively tackle the latest types of transformer trackers. The codes necessary to reproduce this study are available at this https URL.

[CV-85] An approximation-based approach versus an AI one for the study of CT images of abdominal aorta aneurysms

Link: https://arxiv.org/abs/2406.01764
Authors: Lucrezia Rinelli, Arianna Travaglini, Nicolò Vescera, Gianluca Vinti
Keywords: abdominal aortic aneurysm, tools of Approximation, Artificial Intelligence, Approximation Theory, based on tools
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 28 pages

Abstract:This study evaluates two approaches applied to computed tomography (CT) images of patients with abdominal aortic aneurysm: one deterministic, based on tools of Approximation Theory, and one based on Artificial Intelligence. Both aim to segment the basal CT images to extract the patent area of the aortic vessel, in order to propose an alternative to nephrotoxic contrast agents for diagnosing this pathology. While the deterministic approach employs sampling Kantorovich operators and the theory behind them, leveraging the reconstruction and enhancement capabilities of these operators applied to images, the artificial intelligence-based approach relies on a U-net neural network. The results obtained from testing the two methods have been compared numerically and visually to assess their performances, demonstrating that both models yield accurate results.

[CV-86] Learning-to-Cache: Accelerating Diffusion Transformer via Layer Caching

Link: https://arxiv.org/abs/2406.01733
Authors: Xinyin Ma, Gongfan Fang, Michael Bi Mi, Xinchao Wang
Keywords: recently demonstrated unprecedented, demonstrated unprecedented generative, unprecedented generative capabilities, recently demonstrated, demonstrated unprecedented
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Code is available at this https URL

Abstract:Diffusion Transformers have recently demonstrated unprecedented generative capabilities for various tasks. The encouraging results, however, come with the cost of slow inference, since each denoising step requires inference on a transformer model with a large scale of parameters. In this study, we make an interesting and somewhat surprising observation: the computation of a large proportion of layers in the diffusion transformer, through introducing a caching mechanism, can be readily removed even without updating the model parameters. In the case of U-ViT-H/2, for example, we may remove up to 93.68% of the computation in the cache steps (46.84% for all steps), with less than 0.01 drop in FID. To achieve this, we introduce a novel scheme, named Learning-to-Cache (L2C), that learns to conduct caching in a dynamic manner for diffusion transformers. Specifically, by leveraging the identical structure of layers in transformers and the sequential nature of diffusion, we explore redundant computations between timesteps by treating each layer as the fundamental unit for caching. To address the challenge of the exponential search space in deep models for identifying layers to cache and remove, we propose a novel differentiable optimization objective. An input-invariant yet timestep-variant router is then optimized, which can finally produce a static computation graph. Experimental results show that L2C largely outperforms samplers such as DDIM and DPM-Solver, alongside prior cache-based methods at the same inference speed.
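
To make the idea of layer-level caching across timesteps more concrete, here is a toy sketch. It assumes residual-style layers (`h = h + layer(h)`) and a fixed, hand-written reuse schedule; the names `denoise` and `reuse` are hypothetical, and the learned router described in the abstract is not reproduced here.

```python
def denoise(layers, x, num_steps, reuse):
    """Run `num_steps` passes over `layers`, reusing cached layer outputs where allowed.

    reuse[t][i] is True when layer i may reuse its output from an earlier step
    at step t (a "cache step" for that layer). Returns the final value and the
    number of layer evaluations actually performed.
    """
    cache = [None] * len(layers)
    calls = 0
    for t in range(num_steps):
        h = x
        for i, layer in enumerate(layers):
            if reuse[t][i] and cache[i] is not None:
                out = cache[i]   # skip the computation, reuse the stale output
            else:
                out = layer(h)   # compute and refresh the cache
                cache[i] = out
                calls += 1
            h = h + out          # assumed residual connection
        x = h
    return x, calls


# Two toy "layers" and four steps; every second step is a full cache step.
layers = [lambda h: 0.5 * h, lambda h: -0.1 * h]
schedule = [[False, False], [True, True], [False, False], [True, True]]
no_cache = [[False, False]] * 4

_, cached_calls = denoise(layers, 1.0, 4, schedule)
_, full_calls = denoise(layers, 1.0, 4, no_cache)
print(cached_calls, full_calls)
```

In this toy schedule all computation is skipped in the cache steps, which are half of all steps, so half of the total layer evaluations disappear. The abstract's figures have the same shape: 93.68% removed in cache steps corresponds to 46.84% over all steps when half of the steps are cache steps.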

[CV-87] Model for Peanuts: Hijacking ML Models without Training Access is Possible

Link: https://arxiv.org/abs/2406.01708
Authors: Mahmoud Ghorbel, Halima Bouzidi, Ioan Marius Bilasco, Ihsen Alouani
Keywords: Machine Learning, deployment of Machine, Model, Model hijacking, invasion of privacy
Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 17 pages, 14 figures, 7 tables

Abstract:The massive deployment of Machine Learning (ML) models has been accompanied by the emergence of several attacks that threaten their trustworthiness and raise ethical and societal concerns such as invasion of privacy, discrimination risks, and lack of accountability. Model hijacking is one of these attacks, where the adversary aims to hijack a victim model to execute a different task than its original one. Model hijacking can cause accountability and security risks since a hijacked model owner can be framed for having their model offering illegal or unethical services. Prior state-of-the-art works consider model hijacking as a training time attack, whereby an adversary requires access to the ML model training to execute their attack. In this paper, we consider a stronger threat model where the attacker has no access to the training phase of the victim model. Our intuition is that ML models, typically over-parameterized, might (unintentionally) learn more than the task they are trained for. We propose a simple approach for model hijacking at inference time named SnatchML to classify unknown input samples using distance measures in the latent space of the victim model to previously known samples associated with the hijacking task classes. SnatchML empirically shows that benign pre-trained models can execute tasks that are semantically related to the initial task. Surprisingly, this can be true even for hijacking tasks unrelated to the original task. We also explore different methods to mitigate this risk. We first propose a novel approach we call meta-unlearning, designed to help the model unlearn a potentially malicious task while training on the original task dataset. We also provide insights on over-parameterization as one possible inherent factor that makes model hijacking easier, and we accordingly propose a compression-based countermeasure against this attack.
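
The abstract describes classifying unknown inputs by their latent-space distance to known samples of the hijacking task. A minimal sketch of that general idea follows, using a nearest-centroid rule with Euclidean distance. The `embed` function is a hypothetical stand-in for the victim model's latent features, and the nearest-centroid choice is an assumption; the paper may use other distance measures.

```python
import math


def embed(x):
    # Hypothetical stand-in for the victim model's latent representation;
    # in practice this would be an intermediate activation of the deployed model.
    return [x[0] + x[1], x[0] - x[1]]


def centroid(vectors):
    # Component-wise mean of a list of equal-length vectors.
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def classify(x, class_centroids):
    # Assign x to the hijacking-task class whose centroid is closest in latent space.
    z = embed(x)
    return min(class_centroids, key=lambda label: math.dist(z, class_centroids[label]))


# A few known samples per hijacking-task class (made-up data).
known = {
    "task_a": [[0.0, 0.1], [0.2, 0.0]],
    "task_b": [[1.0, 1.1], [0.9, 1.0]],
}
centroids = {label: centroid([embed(x) for x in xs]) for label, xs in known.items()}

print(classify([0.1, 0.1], centroids))
print(classify([1.0, 0.9], centroids))
```

No training of the victim model is involved: only forward passes to obtain features and a distance comparison, which is what makes an inference-time attack of this kind possible.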

[CV-88] Few-Shot Classification of Interactive Activities of Daily Living (InteractADL)

Link: https://arxiv.org/abs/2406.01662
Authors: Zane Durante, Robathan Harries, Edward Vendrow, Zelun Luo, Yuta Kyuragi, Kazuki Kozuka, Li Fei-Fei, Ehsan Adeli
Keywords: Daily Living, Activities of Daily, including assistive robots, applications including assistive, Understanding Activities
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Understanding Activities of Daily Living (ADLs) is a crucial step for different applications including assistive robots, smart homes, and healthcare. However, to date, few benchmarks and methods have focused on complex ADLs, especially those involving multi-person interactions in home environments. In this paper, we propose a new dataset and benchmark, InteractADL, for understanding complex ADLs that involve interaction between humans (and objects). Furthermore, complex ADLs occurring in home environments comprise a challenging long-tailed distribution due to the rarity of multi-person interactions, and pose fine-grained visual recognition tasks due to the presence of semantically and visually similar classes. To address these issues, we propose a novel method for fine-grained few-shot video classification called Name Tuning that enables greater semantic separability by learning optimal class name vectors. We show that Name Tuning can be combined with existing prompt tuning strategies to learn the entire input text (rather than only learning the prompt or class names) and demonstrate improved performance for few-shot classification on InteractADL and 4 other fine-grained visual classification benchmarks. For transparency and reproducibility, we release our code at this https URL.

[CV-89] Proxy Denoising for Source-Free Domain Adaptation

Link: https://arxiv.org/abs/2406.01658
Authors: Song Tang, Wenxin Su, Mao Ye, Jianwei Zhang, Xiatian Zhu
Keywords: Source-free Domain Adaptation, pre-trained source model, unlabeled target domain, source data, Source-free Domain
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Source-free Domain Adaptation (SFDA) aims to adapt a pre-trained source model to an unlabeled target domain with no access to the source data. Inspired by the success of pre-trained large vision-language (ViL) models in many other applications, the latest SFDA methods have also validated the benefit of ViL models by leveraging their predictions as pseudo supervision. However, we observe that ViL’s predictions could be noisy and inaccurate at an unknown rate, potentially introducing additional negative effects during adaptation. To address this thus-far ignored challenge, in this paper, we introduce a novel Proxy Denoising (ProDe) approach. Specifically, we leverage the ViL model as a proxy to facilitate the adaptation process towards the latent domain-invariant space. Critically, we design a proxy denoising mechanism for correcting ViL’s predictions. This is grounded on a novel proxy confidence theory by modeling elegantly the domain adaptation effect of the proxy’s divergence against the domain-invariant space. To capitalize on the corrected proxy, we further derive a mutual knowledge distilling regularization. Extensive experiments show that our ProDe significantly outperforms the current state-of-the-art alternatives under both conventional closed-set setting and the more challenging open-set, partial-set and generalized SFDA settings. The code will be released soon.

[CV-90] An Empirical Study of Excitation and Aggregation Design Adaptions in CLIP4Clip for Video-Text Retrieval

Link: https://arxiv.org/abs/2406.01604
Authors: Xiaolun Jing, Genke Yang, Jian Chu
Keywords: clip retrieval task, video clip retrieval, video-text retrieval domain, clip retrieval, frame representations aggregation
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: 20 pages

Abstract:The CLIP4Clip model, transferred from CLIP, has been the de facto standard for solving the video clip retrieval task from frame-level input, triggering the surge of CLIP4Clip-based models in the video-text retrieval domain. In this work, we rethink the inherent limitation of the widely-used mean pooling operation in frame feature aggregation and investigate the adaptions of excitation and aggregation design for discriminative video representation generation. We present a novel excitation-and-aggregation design, including (1) an excitation module for capturing non-mutually-exclusive relationships among frame features and achieving frame-wise feature recalibration, and (2) an aggregation module applied to learn exclusiveness used for frame representation aggregation. Similarly, we employ the cascade of sequential module and aggregation design to generate discriminative video representation in the sequential type. Besides, we adopt the excitation design in the tight type to obtain representative frame features for multi-modal interaction. The proposed modules are evaluated on three benchmark datasets of MSR-VTT, ActivityNet and DiDeMo, achieving MSR-VTT (43.9 R@1), ActivityNet (44.1 R@1) and DiDeMo (31.0 R@1). They outperform the CLIP4Clip results by +1.2% (+0.5%), +4.5% (+1.9%) and +9.5% (+2.7%) relative (absolute) improvements, demonstrating the superiority of our proposed excitation and aggregation designs. We hope our work will serve as an alternative for frame representation aggregation and facilitate future research.
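
The abstract quotes both relative and absolute R@1 improvements, which are easy to mix up. The short check below recovers the relative figures from the reported scores and absolute gains. The baseline values are not stated in the abstract; they are inferred here as reported score minus absolute gain, which is an assumption.

```python
# (reported R@1, absolute gain over CLIP4Clip), both taken from the abstract.
reported = {
    "MSR-VTT": (43.9, 0.5),
    "ActivityNet": (44.1, 1.9),
    "DiDeMo": (31.0, 2.7),
}


def relative_gain(new_score, absolute_gain):
    # Relative gain in percent, measured against the inferred baseline score.
    baseline = new_score - absolute_gain
    return 100.0 * absolute_gain / baseline


for name, (score, gain) in reported.items():
    print(f"{name}: inferred baseline {score - gain:.1f}, "
          f"relative gain {relative_gain(score, gain):.1f}%")
```

Rounded to one decimal place, the computed relative gains agree with the +1.2%, +4.5% and +9.5% quoted in the abstract.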

[CV-91] D2E-An Autonomous Decision-making Dataset involving Driver States and Human Evaluation

Link: https://arxiv.org/abs/2406.01598
Authors: Zehong Ke, Yanbo Jiang, Yuning Wang, Hao Cheng, Jinhao Li, Jianqiang Wang
Keywords: deep learning technology, datasets greatly influenced, learning technology, model performance, advancement of deep
Categories: Computer Vision and Pattern Recognition (cs.CV); Databases (cs.DB); Robotics (cs.RO)
Comments: Submit for ITSC 2024

Abstract:With the advancement of deep learning technology, data-driven methods are increasingly used in the decision-making of autonomous driving, and the quality of datasets greatly influences model performance. Although current datasets have made significant progress in the collection of vehicle and environment data, emphasis on human-end data including the driver states and human evaluation is not sufficient. In addition, existing datasets consist mostly of simple scenarios such as car following, resulting in low interaction levels. In this paper, we introduce the Driver to Evaluation dataset (D2E), an autonomous decision-making dataset that contains data on driver states, vehicle states, environmental situations, and evaluation scores from human reviewers, covering a comprehensive process of vehicle decision-making. Apart from regular agents and surrounding environment information, we not only collect driver factor data including first-person view videos, physiological signals, and eye attention data, but also provide subjective rating scores from 40 human volunteers. The dataset is a mix of driving simulator scenes and real-road ones. High-interaction situations are designed and filtered to ensure behavior diversity. Through data organization, analysis, and preprocessing, D2E contains over 1100 segments of interactive driving case data ranging from human driver factors to evaluation results, supporting the development of data-driven decision-making related algorithms.

[CV-92] End-to-End Rate-Distortion Optimized 3D Gaussian Representation

Link: https://arxiv.org/abs/2406.01597
Authors: Henan Wang, Hanxin Zhu, Tianyu He, Runsen Feng, Jiajun Deng, Jiang Bian, Zhibo Chen
Keywords: Gaussian Splatting, representation and image, image rendering, emerging technique, technique with remarkable
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Abstract:3D Gaussian Splatting (3DGS) has become an emerging technique with remarkable potential in 3D representation and image rendering. However, the substantial storage overhead of 3DGS significantly impedes its practical applications. In this work, we formulate the compact 3D Gaussian learning as an end-to-end Rate-Distortion Optimization (RDO) problem and propose RDO-Gaussian that can achieve flexible and continuous rate control. RDO-Gaussian addresses two main issues that exist in current schemes: 1) Different from prior endeavors that minimize the rate under the fixed distortion, we introduce dynamic pruning and entropy-constrained vector quantization (ECVQ) that optimize the rate and distortion at the same time. 2) Previous works treat the colors of each Gaussian equally, while we model the colors of different regions and materials with learnable numbers of parameters. We verify our method on both real and synthetic scenes, showcasing that RDO-Gaussian greatly reduces the size of 3D Gaussian over 40x, and surpasses existing methods in rate-distortion performance.
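
For context, rate-distortion optimization is commonly written as a Lagrangian that trades reconstruction error against storage cost. The form below is the textbook objective, given only as a sketch; the symbols θ, D, R and λ are generic notation chosen here and are not taken from the paper, whose exact loss may differ.

```latex
\min_{\theta}\; J(\theta) = D(\theta) + \lambda\, R(\theta), \qquad \lambda \ge 0
```

Here D is the distortion (for example, rendering error), R is the rate (the size of the compressed representation), and λ sets the trade-off: a larger λ favors smaller models at the cost of quality. Optimizing both terms jointly contrasts with minimizing the rate under a fixed distortion, which is the distinction the abstract draws.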

[CV-93] EditWorld: Simulating World Dynamics for Instruction-Following Image Editing

Link: https://arxiv.org/abs/2405.14785
Authors: Ling Yang,Bohan Zeng,Jiaming Liu,Hong Li,Minghao Xu,Wentao Zhang,Shuicheng Yan
Keywords: Diffusion models, improved the performance, image editing, editing, Diffusion
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Project: this https URL

Abstract:Diffusion models have significantly improved the performance of image editing. Existing methods realize various approaches to achieve high-quality image editing, including but not limited to text control, dragging operation, and mask-and-inpainting. Among these, instruction-based editing stands out for its convenience and effectiveness in following human instructions across diverse scenarios. However, it still focuses on simple editing operations like adding, replacing, or deleting, and falls short of understanding aspects of world dynamics that convey the realistic dynamic nature in the physical world. Therefore, this work, EditWorld, introduces a new editing task, namely world-instructed image editing, which defines and categorizes the instructions grounded in various world scenarios. We curate a new image editing dataset with world instructions using a set of large pretrained models (e.g., GPT-3.5, Video-LLava and SDXL). To enable sufficient simulation of world dynamics for image editing, our EditWorld trains a model on the curated dataset, and improves instruction-following ability with a designed post-edit strategy. Extensive experiments demonstrate our method significantly outperforms existing editing methods in this new task. Our dataset and code will be available at this https URL

[CV-94] RealCompo: Balancing Realism and Compositionality Improves Text-to-Image Diffusion Models

Link: https://arxiv.org/abs/2402.12908
Authors: Xinchen Zhang,Ling Yang,Yaqi Cai,Zhaochen Yu,Kai-Ni Wang,Jiake Xie,Ye Tian,Minkai Xu,Yong Tang,Yujiu Yang,Bin Cui
Keywords: achieved remarkable advancements, image diffusion models, spatial-aware image diffusion, Diffusion models, image diffusion
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Project: this https URL

Abstract:Diffusion models have achieved remarkable advancements in text-to-image generation. However, existing models still have many difficulties when faced with multiple-object compositional generation. In this paper, we propose RealCompo, a new training-free and transferred-friendly text-to-image generation framework, which aims to leverage the respective advantages of text-to-image models and spatial-aware image diffusion models (e.g., layout, keypoints and segmentation maps) to enhance both realism and compositionality of the generated images. An intuitive and novel balancer is proposed to dynamically balance the strengths of the two models in denoising process, allowing plug-and-play use of any model without extra training. Extensive experiments show that our RealCompo consistently outperforms state-of-the-art text-to-image models and spatial-aware image diffusion models in multiple-object compositional generation while keeping satisfactory realism and compositionality of the generated images. Notably, our RealCompo can be seamlessly extended with a wide range of spatial-aware image diffusion models and stylized diffusion models. Our code is available at: this https URL

[CV-95] Mastering Text-to-Image Diffusion: Recaptioning Planning and Generating with Multimodal LLMs

Link: https://arxiv.org/abs/2401.11708
Authors: Ling Yang,Zhaochen Yu,Chenlin Meng,Minkai Xu,Stefano Ermon,Bin Cui
Keywords: exhibit exceptional performance, exceptional performance, Diffusion models, Plan and Generate, RPG
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: ICML 2024. Project: this https URL

Abstract:Diffusion models have exhibited exceptional performance in text-to-image generation and editing. However, existing methods often face challenges when handling complex text prompts that involve multiple objects with multiple attributes and relationships. In this paper, we propose a brand new training-free text-to-image generation/editing framework, namely Recaption, Plan and Generate (RPG), harnessing the powerful chain-of-thought reasoning ability of multimodal LLMs to enhance the compositionality of text-to-image diffusion models. Our approach employs the MLLM as a global planner to decompose the process of generating complex images into multiple simpler generation tasks within subregions. We propose complementary regional diffusion to enable region-wise compositional generation. Furthermore, we integrate text-guided image generation and editing within the proposed RPG in a closed-loop fashion, thereby enhancing generalization ability. Extensive experiments demonstrate our RPG outperforms state-of-the-art text-to-image diffusion models, including DALL-E 3 and SDXL, particularly in multi-category object composition and text-image semantic alignment. Notably, our RPG framework exhibits wide compatibility with various MLLM architectures (e.g., MiniGPT-4) and diffusion backbones (e.g., ControlNet). Our code is available at: this https URL

[CV-96] Improving Diffusion-Based Image Synthesis with Context Prediction

Link: https://arxiv.org/abs/2401.02015
Authors: Ling Yang,Jingwei Liu,Shenda Hong,Zhilong Zhang,Zhilin Huang,Zheming Cai,Wentao Zhang,Bin Cui
Keywords: dramatically promoted image, quality and diversity, class of generative, dramatically promoted, unprecedented quality
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted by NeurIPS 2023

Abstract:Diffusion models are a new class of generative models, and have dramatically promoted image generation with unprecedented quality and diversity. Existing diffusion models mainly try to reconstruct the input image from a corrupted one with a pixel-wise or feature-wise constraint along spatial axes. However, such point-based reconstruction may fail to make each predicted pixel/feature fully preserve its neighborhood context, impairing diffusion-based image synthesis. As a powerful source of automatic supervisory signal, context has been well studied for learning representations. Inspired by this, we for the first time propose ConPreDiff to improve diffusion-based image synthesis with context prediction. We explicitly reinforce each point to predict its neighborhood context (i.e., multi-stride features/tokens/pixels) with a context decoder at the end of diffusion denoising blocks in the training stage, and remove the decoder for inference. In this way, each point can better reconstruct itself by preserving its semantic connections with neighborhood context. This new paradigm of ConPreDiff can generalize to arbitrary discrete and continuous diffusion backbones without introducing extra parameters in the sampling procedure. Extensive experiments are conducted on unconditional image generation, text-to-image generation and image inpainting tasks. Our ConPreDiff consistently outperforms previous methods and achieves new SOTA text-to-image generation results on MS-COCO, with a zero-shot FID score of 6.21.

[CV-97] Enhancing predictive imaging biomarker discovery through treatment effect analysis

Link: https://arxiv.org/abs/2406.02534
Authors: Shuhan Xiao,Lukas Klein,Jens Petersen,Philipp Vollmuth,Paul F. Jaeger,Klaus H. Maier-Hein
Keywords: individual treatment effectiveness, Identifying predictive biomarkers, forecast individual treatment, Identifying predictive, predictive imaging biomarkers
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 19 pages, 12 figures

Abstract:Identifying predictive biomarkers, which forecast individual treatment effectiveness, is crucial for personalized medicine and informs decision-making across diverse disciplines. These biomarkers are extracted from pre-treatment data, often within randomized controlled trials, and have to be distinguished from prognostic biomarkers, which are independent of treatment assignment. Our study focuses on the discovery of predictive imaging biomarkers, aiming to leverage pre-treatment images to unveil new causal relationships. Previous approaches relied on labor-intensive handcrafted or manually derived features, which may introduce biases. In response, we present a new task of discovering predictive imaging biomarkers directly from the pre-treatment images to learn relevant image features. We propose an evaluation protocol for this task to assess a model’s ability to identify predictive imaging biomarkers and differentiate them from prognostic ones. It employs statistical testing and a comprehensive analysis of image feature attribution. We explore the suitability of deep learning models originally designed for estimating the conditional average treatment effect (CATE) for this task, which previously have been primarily assessed for the precision of CATE estimation, overlooking the evaluation of imaging biomarker discovery. Our proof-of-concept analysis demonstrates promising results in discovering and validating predictive imaging biomarkers from synthetic outcomes and real-world image datasets.

[CV-98] ReLUs Are Sufficient for Learning Implicit Neural Representations

Link: https://arxiv.org/abs/2406.02529
Authors: Joseph Shenouda,Yamin Zhou,Robert D. Nowak
Keywords: Rectified Linear Unit, Linear Unit, Rectified Linear, learning implicit neural, employ the Rectified
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted to ICML 2024

Abstract:Motivated by the growing theoretical understanding of neural networks that employ the Rectified Linear Unit (ReLU) as their activation function, we revisit the use of ReLU activation functions for learning implicit neural representations (INRs). Inspired by second order B-spline wavelets, we incorporate a set of simple constraints to the ReLU neurons in each layer of a deep neural network (DNN) to remedy the spectral bias. This in turn enables its use for various INR tasks. Empirically, we demonstrate that, contrary to popular belief, one can learn state-of-the-art INRs based on a DNN composed of only ReLU neurons. Next, by leveraging recent theoretical works which characterize the kinds of functions ReLU neural networks learn, we provide a way to quantify the regularity of the learned function. This offers a principled approach to selecting the hyperparameters in INR architectures. We substantiate our claims through experiments in signal representation, super resolution, and computed tomography, demonstrating the versatility and effectiveness of our method. The code for all experiments can be found at this https URL.

[CV-99] Fairness Evolution in Continual Learning for Medical Imaging

Link: https://arxiv.org/abs/2406.02480
Authors: Marina Ceccon,Davide Dalle Pezze,Alessandro Fabris,Gian Antonio Susto
Keywords: achieving remarkable results, made significant strides, Deep Learning, Chest X-ray images, recent years
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Deep Learning (DL) has made significant strides in various medical applications in recent years, achieving remarkable results. In the field of medical imaging, DL models can assist doctors in disease diagnosis by classifying pathologies in Chest X-ray images. However, training on new data to expand model capabilities and adapt to distribution shifts is a notable challenge these models face. Continual Learning (CL) has emerged as a solution to this challenge, enabling models to adapt to new data while retaining knowledge gained from previous experiences. Previous studies have analyzed the behavior of CL strategies in medical imaging regarding classification performance. However, when considering models that interact with sensitive information, such as in the medical domain, it is imperative to disaggregate the performance of socially salient groups. Indeed, DL algorithms can exhibit biases against certain sub-populations, leading to discrepancies in predictive performance across different groups identified by sensitive attributes such as age, race/ethnicity, sex/gender, and socioeconomic status. In this study, we go beyond the typical assessment of classification performance in CL and study bias evolution over successive tasks with domain-specific fairness metrics. Specifically, we evaluate the CL strategies using the well-known CheXpert (CXP) and ChestX-ray14 (NIH) datasets. We consider a class incremental scenario of five tasks with 12 pathologies. We evaluate the Replay, Learning without Forgetting (LwF), LwF Replay, and Pseudo-Label strategies. LwF and Pseudo-Label exhibit optimal classification performance, but when including fairness metrics in the evaluation, it is clear that Pseudo-Label is less biased. For this reason, this strategy should be preferred when considering real-world scenarios in which it is crucial to consider the fairness of the model.

[CV-100] Inpainting Pathology in Lumbar Spine MRI with Latent Diffusion

Link: https://arxiv.org/abs/2406.02477
Authors: Colin Hansen,Simas Glinskis,Ashwin Raju,Micha Kornreich,JinHyeong Park,Jayashri Pawar,Richard Herzog,Li Zhang,Benjamin Odry
Keywords: imbalanced datasets due, expert annotations, automated diagnosis, diagnosis in radiology, radiology suffer
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Data driven models for automated diagnosis in radiology suffer from insufficient and imbalanced datasets due to low representation of pathology in a population and the cost of expert annotations. Datasets can be bolstered through data augmentation. However, even when utilizing a full suite of transformations during model training, typical data augmentations do not address variations in human anatomy. An alternative direction is to synthesize data using generative models, which can potentially craft datasets with specific attributes. While this holds promise, commonly used generative models such as Generative Adversarial Networks may inadvertently produce anatomically inaccurate features. On the other hand, diffusion models, which offer greater stability, tend to memorize training data, raising concerns about privacy and generative diversity. Alternatively, inpainting has the potential to augment data through directly inserting pathology in medical images. However, this approach introduces a new challenge: accurately merging the generated pathological features with the surrounding anatomical context. While inpainting is a well established method for addressing simple lesions, its application to pathologies that involve complex structural changes remains relatively unexplored. We propose an efficient method for inpainting pathological features onto healthy anatomy in MRI through voxelwise noise scheduling in a latent diffusion model. We evaluate the method’s ability to insert disc herniation and central canal stenosis in lumbar spine sagittal T2 MRI, and it achieves superior Frechet Inception Distance compared to state-of-the-art methods.

[CV-101] IterMask2: Iterative Unsupervised Anomaly Segmentation via Spatial and Frequency Masking for Brain Lesions in MRI

Link: https://arxiv.org/abs/2406.02422
Authors: Ziyun Liang,Xiaoqing Guo,J. Alison Noble,Konstantinos Kamnitsas
Keywords: Unsupervised anomaly segmentation, Unsupervised anomaly, healthy subjects, anomaly segmentation approaches, pathology segmentation train
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Unsupervised anomaly segmentation approaches to pathology segmentation train a model on images of healthy subjects, that they define as the ‘normal’ data distribution. At inference, they aim to segment any pathologies in new images as ‘anomalies’, as they exhibit patterns that deviate from those in ‘normal’ training data. Prevailing methods follow the ‘corrupt-and-reconstruct’ paradigm. They intentionally corrupt an input image, reconstruct it to follow the learned ‘normal’ distribution, and subsequently segment anomalies based on reconstruction error. Corrupting an input image, however, inevitably leads to suboptimal reconstruction even of normal regions, causing false positives. To alleviate this, we propose a novel iterative spatial mask-refining strategy IterMask2. We iteratively mask areas of the image, reconstruct them, and update the mask based on reconstruction error. This iterative process progressively adds information about areas that are confidently normal as per the model. The increasing content guides reconstruction of nearby masked areas, improving reconstruction of normal tissue under these areas, reducing false positives. We also use high-frequency image content as an auxiliary input to provide additional structural information for masked areas. This further improves reconstruction error of normal in comparison to anomalous areas, facilitating segmentation of the latter. We conduct experiments on several brain lesion datasets and demonstrate effectiveness of our method. Code is available at: this https URL

[CV-102] Multi-target stain normalization for histology slides

Link: https://arxiv.org/abs/2406.02077
Authors: Desislav Ivanov,Carlo Alberto Barbano,Marco Grangetto
Keywords: Traditional staining normalization, single representative reference, staining normalization approaches, diverse staining patterns, representative reference image
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Traditional staining normalization approaches, e.g. Macenko, typically rely on the choice of a single representative reference image, which may not adequately account for the diverse staining patterns of datasets collected in practical scenarios. In this study, we introduce a novel approach that leverages multiple reference images to enhance robustness against stain variation. Our method is parameter-free and can be adopted in existing computational pathology pipelines with no significant changes. We evaluate the effectiveness of our method through experiments using a deep-learning pipeline for automatic nuclei segmentation on colorectal images. Our results show that by leveraging multiple reference images, better results can be achieved when generalizing to external data, where the staining can widely differ from the training set.

[CV-103] Choroidal Vessel Segmentation on Indocyanine Green Angiography Images via Human-in-the-Loop Labeling

Link: https://arxiv.org/abs/2406.01993
Authors: Ruoyu Chen(1),Ziwei Zhao(1),Mayinuer Yusufu(4 and 5),Xianwen Shang(1),Danli Shi(1 and 2),Mingguang He(1,2 and 3) ((1) School of Optometry, The Hong Kong Polytechnic University, Kowloon, Hong Kong SAR, China. (2) Research Centre for SHARP Vision, The Hong Kong Polytechnic University, Kowloon, Hong Kong SAR, China. (3) Centre for Eye and Vision Research (CEVR), 17W Hong Kong Science Park, Hong Kong SAR, China. (4) Centre for Eye Research Australia, Royal Victorian Eye and Ear Hospital, East Melbourne, Australia. (5) Department of Surgery (Ophthalmology), The University of Melbourne, Melbourne, Australia)
Keywords: medical image processing, ICGA, degree view ICGA, field of medical, choroidal
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 25 pages, 4 figures

Abstract:Human-in-the-loop (HITL) strategy has been recently introduced into the field of medical image processing. Indocyanine green angiography (ICGA) stands as a well-established examination for visualizing choroidal vasculature and detecting chorioretinal diseases. However, the intricate nature of choroidal vascular networks makes large-scale manual segmentation of ICGA images challenging. Thus, the study aims to develop a high-precision choroidal vessel segmentation model with limited labor using HITL framework. We utilized a multi-source ICGA dataset, including 55 degree view and ultra-widefield ICGA (UWF-ICGA) images for model development. The choroidal vessel network was pre-segmented by a pre-trained vessel segmentation model, and then manually modified by two ophthalmologists. Choroidal vascular diameter, density, complexity, tortuosity, and branching angle were automatically quantified based on the segmentation. We finally conducted four cycles of HITL. One hundred and fifty 55 degree view ICGA images were used for the first three cycles (50 images per cycle), and twenty UWF-ICGA images for the last cycle. The average time needed to manually correct a pre-segmented ICGA image per cycle reduced from 20 minutes to 1 minute. High segmentation accuracy has been achieved on both 55 degree view ICGA and UWF-ICGA images. Additionally, the multi-dimensional choroidal vascular parameters were significantly associated with various chorioretinal diseases. Our study not only demonstrated the feasibility of the HITL strategy in improving segmentation performance with reduced manual labeling, but also innovatively introduced several risk predictors for choroidal abnormalities.

[CV-104] QuST: QuPath Extension for Integrative Whole Slide Image and Spatial Transcriptomics Analysis

Link: https://arxiv.org/abs/2406.01613
Authors: Chao-Hui Huang
Keywords: WSI analysis, WSI analysis tools, DL-based WSI analysis, analysis, WSI
Categories: Quantitative Methods (q-bio.QM); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Abstract:Recently, various technologies have been introduced into digital pathology, including artificial intelligence (AI) driven methods, in both areas of pathological whole slide image (WSI) analysis and spatial transcriptomics (ST) analysis. AI-driven WSI analysis utilizes the power of deep learning (DL) and expands the field of view for histopathological image analysis. On the other hand, ST bridges the gap between tissue spatial analysis and biological signals, offering the possibility to understand the spatial biology. However, a major bottleneck in DL-based WSI analysis is the preparation of training patterns, as hematoxylin & eosin (H&E) staining does not provide direct biological evidence, such as gene expression, for determining the category of a biological component. On the other hand, as of now, the resolution in ST is far beyond that of WSI, resulting in the challenge of further spatial analysis. Although various WSI analysis tools, including QuPath, have cited the use of WSI analysis tools in the context of ST analysis, their usage is primarily focused on initial image analysis, with other tools being utilized for more detailed transcriptomic analysis. As a result, the information hidden beneath WSI has not yet been fully utilized to support ST analysis. To bridge this gap, we introduce QuST, a QuPath extension designed to bridge the gap between H&E WSI and ST analysis tasks. In this paper, we highlight the importance of integrating DL-based WSI analysis and ST analysis in understanding disease biology and the challenges in integrating these modalities due to differences in data formats and analytical methods. The QuST source code is hosted on GitHub and documentation is available at this https URL.

[CV-105] An Enhanced Encoder-Decoder Network Architecture for Reducing Information Loss in Image Semantic Segmentation

Link: https://arxiv.org/abs/2406.01605
Authors: Zijun Gao,Qi Wang,Taiyuan Mei,Xiaohan Cheng,Yun Zi,Haowei Yang
Keywords: commonly encounters significant, encounters significant information, architecture commonly encounters, significant information loss, sampling process
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The traditional SegNet architecture commonly encounters significant information loss during the sampling process, which detrimentally affects its accuracy in image semantic segmentation tasks. To counter this challenge, we introduce an innovative encoder-decoder network structure enhanced with residual connections. Our approach employs a multi-residual connection strategy designed to preserve the intricate details across various image scales more effectively, thus minimizing the information loss inherent to down-sampling procedures. Additionally, to enhance the convergence rate of network training and mitigate sample imbalance issues, we have devised a modified cross-entropy loss function incorporating a balancing factor. This modification optimizes the distribution between positive and negative samples, thus improving the efficiency of model training. Experimental evaluations of our model demonstrate a substantial reduction in information loss and improved accuracy in semantic segmentation. Notably, our proposed network architecture demonstrates a substantial improvement in the finely annotated mean Intersection over Union (mIoU) on the dataset compared to the conventional SegNet. The proposed network structure not only reduces operational costs by decreasing manual inspection needs but also scales up the deployment of AI-driven image analysis across different sectors.
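
The abstract does not give the form of its modified cross-entropy loss. As a rough illustration of what a "balancing factor" can look like, the sketch below weights positive samples by `beta` and negative samples by `1 - beta` in a binary cross-entropy. The function names, the weighting scheme, and the inverse-frequency choice of `beta` are all assumptions made for this example, not the paper's definition.

```python
import math

def balanced_bce(probs, labels, beta, eps=1e-12):
    """Mean binary cross-entropy with weight beta on positives and 1 - beta on negatives.

    Assumed form for illustration; the paper's exact loss is not stated in the abstract.
    """
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(beta * y * math.log(p) + (1.0 - beta) * (1 - y) * math.log(1.0 - p))
    return total / len(probs)

def inverse_frequency_beta(labels):
    """Hypothetical choice of beta: the fraction of negative samples."""
    return 1.0 - sum(labels) / len(labels)

if __name__ == "__main__":
    labels = [1, 0, 0, 0]            # positives are the rare class
    probs = [0.3, 0.2, 0.1, 0.4]     # predicted probability of the positive class
    beta = inverse_frequency_beta(labels)
    print(beta, balanced_bce(probs, labels, beta))
```

With rare positives, `beta` is close to 1, so an error on a positive pixel contributes more to the loss than the same error on a negative pixel.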

Machine Learning

[LG-0] Learning to grok: Emergence of in-context learning and skill composition in modular arithmetic tasks

Link: https://arxiv.org/abs/2406.02550
Authors: Tianyu He,Darshil Doshi,Aritra Das,Andrey Gromov
Keywords: Large language models, Large language, training set, learning and skill, Large
Categories: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); High Energy Physics - Theory (hep-th); Machine Learning (stat.ML)
Comments: 21 pages, 19 figures

Abstract:Large language models can solve tasks that were not present in the training set. This capability is believed to be due to in-context learning and skill composition. In this work, we study the emergence of in-context learning and skill composition in a collection of modular arithmetic tasks. Specifically, we consider a finite collection of linear modular functions $z = a\,x + b\,y \;\mathrm{mod}\; p$ labeled by the vector $(a, b) \in \mathbb{Z}_p^2$. We use some of these tasks for pre-training and the rest for out-of-distribution testing. We empirically show that a GPT-style transformer exhibits a transition from in-distribution to out-of-distribution generalization as the number of pre-training tasks increases. We find that the smallest model capable of out-of-distribution generalization requires two transformer blocks, while for deeper models, the out-of-distribution generalization phase is transient, necessitating early stopping. Finally, we perform an interpretability study of the pre-trained models, revealing the highly structured representations in both phases; and discuss the learnt algorithm.
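
To make the task family concrete, here is a small sketch of how such a collection of linear modular tasks could be enumerated and split. It only follows what the abstract states (tasks labeled by a pair (a, b) modulo p, some used for pre-training and the rest held out). The prime `p = 7`, the split size, the random seed, and the `(x, y, z)` example format are assumptions chosen for illustration.

```python
import random

def make_tasks(p):
    """All linear modular tasks, each labeled by a pair (a, b) with entries in 0..p-1."""
    return [(a, b) for a in range(p) for b in range(p)]

def evaluate(task, x, y, p):
    """Compute z = a*x + b*y mod p for the task (a, b)."""
    a, b = task
    return (a * x + b * y) % p

def split_tasks(tasks, n_pretrain, seed=0):
    """Shuffle the tasks and hold out the remainder for out-of-distribution testing."""
    rng = random.Random(seed)
    shuffled = tasks[:]
    rng.shuffle(shuffled)
    return shuffled[:n_pretrain], shuffled[n_pretrain:]

def in_context_sequence(task, p, n_examples, rng):
    """Draw (x, y, z) triples from one task, as few-shot examples for a prompt."""
    seq = []
    for _ in range(n_examples):
        x, y = rng.randrange(p), rng.randrange(p)
        seq.append((x, y, evaluate(task, x, y, p)))
    return seq

if __name__ == "__main__":
    p = 7
    tasks = make_tasks(p)
    pretrain, ood = split_tasks(tasks, n_pretrain=32)
    print(len(tasks), len(pretrain), len(ood))
    print(in_context_sequence(pretrain[0], p, 3, random.Random(1)))
```

A model that has only seen the pre-training tasks must infer the unseen pair (a, b) from the in-context examples to answer a held-out task, which is the out-of-distribution setting the abstract describes.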

[LG-1] Robust and highly scalable estimation of directional couplings from time-shifted signals

Link: https://arxiv.org/abs/2406.02545
Authors: Luca Ambrogioni,Louis Rouillard,Demian Wassermann
Keywords: central methodological challenge, systems biology, biology and economics, estimation of directed, central methodological
Categories: Machine Learning (cs.LG)
Comments:

Abstract:The estimation of directed couplings between the nodes of a network from indirect measurements is a central methodological challenge in scientific fields such as neuroscience, systems biology and economics. Unfortunately, the problem is generally ill-posed due to the possible presence of unknown delays in the measurements. In this paper, we offer a solution of this problem by using a variational Bayes framework, where the uncertainty over the delays is marginalized in order to obtain conservative coupling estimates. To overcome the well-known overconfidence of classical variational methods, we use a hybrid-VI scheme where the (possibly flat or multimodal) posterior over the measurement parameters is estimated using a forward KL loss while the (nearly convex) conditional posterior over the couplings is estimated using the highly scalable gradient-based VI. In our ground-truth experiments, we show that the network provides reliable and conservative estimates of the couplings, greatly outperforming similar methods such as regression DCM.

[LG-2] To Believe or Not to Believe Your LLM

Link: https://arxiv.org/abs/2406.02543
Authors: Yasin Abbasi Yadkori,Ilja Kuzborskij,András György,Csaba Szepesvári
Keywords: large language models, explore uncertainty quantification, goal to identify, epistemic uncertainty, explore uncertainty
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:We explore uncertainty quantification in large language models (LLMs), with the goal to identify when uncertainty in responses given a query is large. We simultaneously consider both epistemic and aleatoric uncertainties, where the former comes from the lack of knowledge about the ground truth (such as about facts or the language), and the latter comes from irreducible randomness (such as multiple possible answers). In particular, we derive an information-theoretic metric that allows to reliably detect when only epistemic uncertainty is large, in which case the output of the model is unreliable. This condition can be computed based solely on the output of the model obtained simply by some special iterative prompting based on the previous responses. Such quantification, for instance, allows to detect hallucinations (cases when epistemic uncertainty is high) in both single- and multi-answer responses. This is in contrast to many standard uncertainty quantification strategies (such as thresholding the log-likelihood of a response) where hallucinations in the multi-answer case cannot be detected. We conduct a series of experiments which demonstrate the advantage of our formulation. Further, our investigations shed some light on how the probabilities assigned to a given output by an LLM can be amplified by iterative prompting, which might be of independent interest.

[LG-3] Loki: Low-Rank Keys for Efficient Sparse Attention

Link: https://arxiv.org/abs/2406.02542
Authors: Prajwal Singhania,Siddharth Singh,Shwai He,Soheil Feizi,Abhinav Bhatele
Keywords: long sequence lengths, memory costs involved, large language models, large language, expensive in terms
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Inference on large language models can be expensive in terms of the compute and memory costs involved, especially when long sequence lengths are used. In particular, the self-attention mechanism used in such models contributes significantly to these costs, which has resulted in several recent works that propose sparse attention approximations for inference. In this work, we propose to approximate the self-attention computation by focusing on the dimensionality of key vectors computed in the attention block. Our analysis reveals that the key vectors lie in a significantly lower-dimensional space, consistently across several datasets and models. Exploiting this observation, we propose Loki, a novel sparse attention method that ranks and selects tokens in the KV-cache based on attention scores computed in low-dimensional space. Our evaluations show that Loki is able to maintain the efficacy of the models better than other popular approximation methods, while speeding up the attention computation due to reduced data movement (load/store) and compute costs.
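
The idea of ranking cached tokens with low-dimensional scores can be sketched in a few lines. This is not the authors' implementation: it assumes an orthonormal `basis` is already available (for instance from an offline PCA of key vectors, which is itself an assumption), and it assumes that exact attention is then computed over only the selected tokens. The names `sparse_attention`, `d`, and `k` are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project(vec, basis):
    """Coordinates of vec in an orthonormal basis given as a list of row vectors."""
    return [dot(vec, b) for b in basis]

def sparse_attention(q, keys, values, basis, d, k):
    """Rank tokens by scores in the first d basis directions, then attend exactly over the top k."""
    q_rot = project(q, basis)
    keys_rot = [project(key, basis) for key in keys]
    # Approximate scores use only the leading d coordinates of query and keys.
    approx = [dot(q_rot[:d], kr[:d]) for kr in keys_rot]
    top = sorted(range(len(keys)), key=lambda i: approx[i], reverse=True)[:k]
    # Exact scaled dot-product attention restricted to the selected tokens.
    scale = 1.0 / math.sqrt(len(q))
    scores = [dot(q, keys[i]) * scale for i in top]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    dim_v = len(values[0])
    out = [sum(w / z * values[i][j] for w, i in zip(weights, top)) for j in range(dim_v)]
    return out, sorted(top)

if __name__ == "__main__":
    identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]  # stand-in for a learned basis
    keys = [[1, 0, 0], [0, 1, 0], [5, 0, 0.1], [-1, 0, 0]]
    values = [[1.0], [2.0], [3.0], [4.0]]
    print(sparse_attention([1, 0, 0], keys, values, identity, d=1, k=2))
```

In a real system the keys would be stored already rotated into the basis, so the ranking step reads only `d` numbers per cached token; that reduced data movement is where the abstract locates the speedup.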

[LG-4] Parrot: Multilingual Visual Instruction Tuning

Link: https://arxiv.org/abs/2406.02539
Authors: Hai-Long Sun, Da-Wei Zhou, Yang Li, Shiyin Lu, Chao Yi, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, De-Chuan Zhan, Han-Jia Ye
Keywords: Large Language Models, Multimodal Large Language, artificial general intelligence, Language Models, Multimodal Large
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:The rapid development of Multimodal Large Language Models (MLLMs) like GPT-4V has marked a significant step towards artificial general intelligence. Existing methods mainly focus on aligning vision encoders with LLMs through supervised fine-tuning (SFT) to endow LLMs with multimodal abilities, making MLLMs’ inherent ability to react to multiple languages progressively deteriorate as the training process evolves. We empirically find that the imbalanced SFT datasets, primarily composed of English-centric image-text pairs, lead to significantly reduced performance in non-English languages. This is due to the failure of aligning the vision encoder and LLM with multilingual tokens during the SFT process. In this paper, we introduce Parrot, a novel method that utilizes textual guidance to drive visual token alignment at the language level. Parrot makes the visual tokens condition on diverse language inputs and uses Mixture-of-Experts (MoE) to promote the alignment of multilingual tokens. Specifically, to enhance non-English visual tokens alignment, we compute the cross-attention using the initial visual features and textual embeddings, the result of which is then fed into the MoE router to select the most relevant experts. The selected experts subsequently convert the initial visual tokens into language-specific visual tokens. Moreover, considering the current lack of benchmarks for evaluating multilingual capabilities within the field, we collect and make available a Massive Multilingual Multimodal Benchmark which includes 6 languages, 15 categories, and 12,000 questions, named as MMMB. Our method not only demonstrates state-of-the-art performance on multilingual MMBench and MMMB, but also excels across a broad range of multimodal tasks. Both the source code and the training dataset of Parrot will be made publicly available.

[LG-5] TopViewRS: Vision-Language Models as Top-View Spatial Reasoners

Link: https://arxiv.org/abs/2406.02537
Authors: Chengzu Li, Caiqi Zhang, Han Zhou, Nigel Collier, Anna Korhonen, Ivan Vulić
Keywords: Top-view perspective denotes, large Vision-Language Models, perspective denotes, denotes a typical, vital for localization
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 9 pages, 3 figures, 3 tables (21 pages, 4 figures, 15 tables including references and appendices)

Abstract:Top-view perspective denotes a typical way in which humans read and reason over different types of maps, and it is vital for localization and navigation of humans as well as of ‘non-human’ agents, such as the ones backed by large Vision-Language Models (VLMs). Nonetheless, spatial reasoning capabilities of modern VLMs remain unattested and underexplored. In this work, we thus study their capability to understand and reason over spatial relations from the top view. The focus on top view also enables controlled evaluations at different granularity of spatial reasoning; we clearly disentangle different abilities (e.g., recognizing particular objects versus understanding their relative positions). We introduce the TopViewRS (Top-View Reasoning in Space) dataset, consisting of 11,384 multiple-choice questions with either a realistic or a semantic top-view map as visual input. We then use it to study and evaluate VLMs across 4 perception and reasoning tasks with different levels of complexity. Evaluation of 10 representative open- and closed-source VLMs reveals a gap of more than 50% compared to average human performance, and it is even lower than the random baseline in some cases. Although additional experiments show that Chain-of-Thought reasoning can boost model capabilities by 5.82% on average, the overall performance of VLMs remains limited. Our findings underscore the critical need for enhanced model capability in top-view spatial reasoning and set a foundation for further research towards human-level proficiency of VLMs in real-world multimodal tasks.

[LG-6] Mitigate Position Bias in Large Language Models via Scaling a Single Dimension

Link: https://arxiv.org/abs/2406.02536
Authors: Yijiong Yu, Huiqiang Jiang, Xufang Luo, Qianhui Wu, Chin-Yew Lin, Dongsheng Li, Yuqing Yang, Yongfeng Huang, Lili Qiu
Keywords: Large Language Models, Large Language, robust generative abilities, excellent generalization capabilities, real-world scenarios due
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:Large Language Models (LLMs) are increasingly applied in various real-world scenarios due to their excellent generalization capabilities and robust generative abilities. However, they exhibit position bias, also known as “lost in the middle”, a phenomenon that is especially pronounced in long-context scenarios, which indicates that the placement of the key information in different positions of a prompt can significantly affect accuracy. This paper first explores the micro-level manifestations of position bias, concluding that attention weights are a micro-level expression of position bias. It further identifies that, in addition to position embeddings, the causal attention mask also contributes to position bias by creating position-specific hidden states. Based on these insights, we propose a method to mitigate position bias by scaling these positional hidden states. Experiments on the NaturalQuestions Multi-document QA, KV retrieval, LongBench and timeline reorder tasks, using various models including RoPE models, context-window-extended models, and Alibi models, demonstrate the effectiveness and generalizability of our approach. Our method can improve performance by up to 15.2% by modifying just one dimension of hidden states. Our code is available at this https URL.

[LG-7] RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots

Link: https://arxiv.org/abs/2406.02523
Authors: Soroush Nasiriany, Abhiram Maddukuri, Lance Zhang, Adeet Parikh, Aaron Lo, Abhishek Joshi, Ajay Mandlekar, Yuke Zhu
Keywords: Artificial Intelligence, Recent advancements, advancements in Artificial, largely been propelled, Recent
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: RSS 2024

Abstract:Recent advancements in Artificial Intelligence (AI) have largely been propelled by scaling. In Robotics, scaling is hindered by the lack of access to massive robot datasets. We advocate using realistic physical simulation as a means to scale environments, tasks, and datasets for robot learning methods. We present RoboCasa, a large-scale simulation framework for training generalist robots in everyday environments. RoboCasa features realistic and diverse scenes focusing on kitchen environments. We provide thousands of 3D assets across over 150 object categories and dozens of interactable furniture and appliances. We enrich the realism and diversity of our simulation with generative AI tools, such as object assets from text-to-3D models and environment textures from text-to-image models. We design a set of 100 tasks for systematic evaluation, including composite tasks generated by the guidance of large language models. To facilitate learning, we provide high-quality human demonstrations and integrate automated trajectory generation methods to substantially enlarge our datasets with minimal human burden. Our experiments show a clear scaling trend in using synthetically generated robot data for large-scale imitation learning and show great promise in harnessing simulation data in real-world tasks. Videos and open-source code are available at this https URL

[LG-8] Uncertainty of Joint Neural Contextual Bandit

Link: https://arxiv.org/abs/2406.02515
Authors: Hongbo Guo, Zheqing Zhu
Keywords: Contextual bandit learning, neural contextual bandit, Contextual bandit, contextual bandit solution, large-scale recommendation systems
Subjects: Machine Learning (cs.LG)

Abstract:Contextual bandit learning is increasingly favored in modern large-scale recommendation systems. To better utilize the contextual information and available user or item features, neural networks have been integrated to enhance contextual bandit learning, which has triggered significant interest from both academia and industry. However, a major challenge arises when implementing a disjoint neural contextual bandit solution in large-scale recommendation systems, where each item or user may correspond to a separate bandit arm. The huge number of items to recommend poses a significant hurdle for real-world production deployment. This paper focuses on a joint neural contextual bandit solution which serves all recommending items in one single model. The output consists of a predicted reward $\mu$, an uncertainty $\sigma$ and a hyper-parameter $\alpha$ which balances exploitation and exploration, e.g., $\mu + \alpha \sigma$. The tuning of the parameter $\alpha$ is typically heuristic and complex in practice due to its stochastic nature. To address this challenge, we provide both theoretical analysis and experimental findings regarding the uncertainty $\sigma$ of the joint neural contextual bandit model. Our analysis reveals that $\sigma$ demonstrates an approximate square root relationship with the size of the last hidden layer $F$ and an inverse square root relationship with the amount of training data $N$, i.e., $\sigma \propto \sqrt{\frac{F}{N}}$. The experiments, conducted with real industrial data, align with the theoretical analysis, help in understanding model behavior and assist hyper-parameter tuning during both offline training and online deployment.
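The exploration score and the stated scaling law can be illustrated with a short sketch. It is an assumption-laden illustration, not the paper's procedure: `ucb_score` simply evaluates the quoted expression, and `rescale_alpha` (a hypothetical helper) shows one possible use of the relationship, namely adjusting the exploration weight so that the bonus stays roughly constant when the hidden-layer size F or the data size N changes.

```python
import math


def ucb_score(mu, sigma, alpha):
    """Exploration-adjusted score: predicted reward plus alpha times uncertainty."""
    return mu + alpha * sigma


def rescale_alpha(alpha_ref, f_ref, n_ref, f_new, n_new):
    """Rescale alpha so alpha * sigma stays roughly constant.

    Assumes sigma is proportional to sqrt(F / N), as stated in the abstract;
    the idea of holding the bonus constant is an illustrative choice.
    """
    sigma_ratio = math.sqrt((f_new / n_new) / (f_ref / n_ref))
    return alpha_ref / sigma_ratio
```

For example, under this assumption, quadrupling the training data at a fixed layer size halves the uncertainty, so `rescale_alpha` doubles the exploration weight to compensate.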

[LG-9] Fairness-Optimized Synthetic EHR Generation for Arbitrary Downstream Predictive Tasks

Link: https://arxiv.org/abs/2406.02510
Authors: Mirza Farhan Bin Tarek, Raphael Poulain, Rahmatollah Beheshti
Keywords: key focus area, addressing fairness concerns, focus area, aspects of ensuring, ensuring the responsible
Subjects: Machine Learning (cs.LG)

Abstract:Among various aspects of ensuring the responsible design of AI tools for healthcare applications, addressing fairness concerns has been a key focus area. Specifically, given the wide spread of electronic health record (EHR) data and their huge potential to inform a wide range of clinical decision support tasks, improving fairness in this category of health AI tools is of key importance. While such a broad problem (that is, mitigating fairness in EHR-based AI models) has been tackled using various methods, task- and model-agnostic methods are noticeably rare. In this study, we aimed to target this gap by presenting a new pipeline that generates synthetic EHR data, which is not only consistent with (faithful to) the real EHR data but also can reduce the fairness concerns (defined by the end-user) in the downstream tasks, when combined with the real data. We demonstrate the effectiveness of our proposed pipeline across various downstream tasks and two different EHR datasets. Our proposed pipeline can add a widely applicable and complementary tool to the existing toolbox of methods to address fairness in health AI applications such as those modifying the design of a downstream model. The codebase for our project is available at this https URL

[LG-10] Guiding a Diffusion Model with a Bad Version of Itself

Link: https://arxiv.org/abs/2406.02507
Authors: Tero Karras, Miika Aittala, Tuomas Kynkäänniemi, Jaakko Lehtinen, Timo Aila, Samuli Laine
Keywords: results align, primary axes, axes of interest, interest in image-generating, class label
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)

Abstract:The primary axes of interest in image-generating diffusion models are image quality, the amount of variation in the results, and how well the results align with a given condition, e.g., a class label or a text prompt. The popular classifier-free guidance approach uses an unconditional model to guide a conditional model, leading to simultaneously better prompt alignment and higher-quality images at the cost of reduced variation. These effects seem inherently entangled, and thus hard to control. We make the surprising observation that it is possible to obtain disentangled control over image quality without compromising the amount of variation by guiding generation using a smaller, less-trained version of the model itself rather than an unconditional model. This leads to significant improvements in ImageNet generation, setting record FIDs of 1.01 for 64x64 and 1.25 for 512x512, using publicly available networks. Furthermore, the method is also applicable to unconditional diffusion models, drastically improving their quality.
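A minimal sketch of the guidance step follows. The abstract does not give the formula, so this assumes the same linear extrapolation that classifier-free guidance uses, with the weaker model's output taking the place of the unconditional model's output. The function name and the plain-list representation of denoiser outputs are illustrative choices.

```python
def guided_output(main_out, guide_out, w):
    """Extrapolate from the guiding model's output toward the main model's output.

    w = 1 returns the main model's prediction unchanged; w > 1 pushes the
    result further away from the weaker guiding model.
    """
    return [g + w * (m - g) for m, g in zip(main_out, guide_out)]
```

With `w = 0` the result equals the guiding model's output, so the weight can be read as how strongly the sample is steered away from the weaker model's errors.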

[LG-11] Demystifying the Compression of Mixture-of-Experts Through a Unified Framework

Link: https://arxiv.org/abs/2406.02500
Authors: Shwai He, Daize Dong, Liang Ding, Ang Li
Keywords: Scaling large language, size poses significant, poses significant challenges, Scaling large, model size poses
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Scaling large language models has revolutionized the performance across diverse domains, yet the continual growth in model size poses significant challenges for real-world deployment. The Mixture of Experts (MoE) approach addresses this by dynamically selecting and activating only a subset of experts, significantly reducing computational costs while maintaining high performance. However, MoE introduces potential redundancy (e.g., parameters) and extra costs (e.g., communication overhead). Despite numerous compression techniques developed for mitigating the redundancy in dense models, the compression of MoE remains under-explored. We first bridge this gap with a cutting-edge unified framework that not only seamlessly integrates mainstream compression methods but also helps systematically understand MoE compression. This framework approaches compression from two perspectives: Expert Slimming, which compresses individual experts, and Expert Trimming, which removes structured modules. Within this framework, we explore the optimization space unexplored by existing methods, and further introduce aggressive Expert Trimming techniques, i.e., Layer Drop and Block Drop, to eliminate redundancy at larger scales. Based on these insights, we present a comprehensive recipe to guide practitioners in compressing MoE effectively. Extensive experimental results demonstrate the effectiveness of the compression methods under our framework and the proposed recipe, achieving a 6.05x speedup and only 20.0GB memory usage while maintaining over 92% of performance on Mixtral-8x7B.

[LG-12] Dropout MPC: An Ensemble Neural MPC Approach for Systems with Learned Dynamics

Link: https://arxiv.org/abs/2406.02497
Authors: Spyridon Syntakas, Kostas Vlachos
Keywords: neural MPC strategies, neural MPC, ensemble neural MPC, neural MPC algorithm, Model Predictive Control
Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)

Abstract:Neural networks are lately more and more often being used in the context of data-driven control, as an approximate model of the true system dynamics. Model Predictive Control (MPC) adopts this practise leading to neural MPC strategies. This raises a question of whether the trained neural network has converged and generalized in a way that the learned model encapsulates an accurate approximation of the true dynamic model of the system, thus making it a reliable choice for model-based control, especially for disturbed and uncertain systems. To tackle that, we propose Dropout MPC, a novel sampling-based ensemble neural MPC algorithm that employs the Monte-Carlo dropout technique on the learned system model. The closed loop is based on an ensemble of predictive controllers, that are used simultaneously at each time-step for trajectory optimization. Each member of the ensemble influences the control input, based on a weighted voting scheme, thus by employing different realizations of the learned system dynamics, neural control becomes more reliable by design. An additional strength of the method is that it offers by design a way to estimate future uncertainty, leading to cautious control. While the method aims in general at uncertain systems with complex dynamics, where models derived from first principles are hard to infer, to showcase the application we utilize data gathered in the laboratory from a real mobile manipulator and employ the proposed algorithm for the navigation of the robot in simulation.

[LG-13] Kolmogorov-Arnold Networks for Time Series: Bridging Predictive Power and Interpretability

Link: https://arxiv.org/abs/2406.02496
Authors: Kunpeng Xu, Lifei Chen, Shengrui Wang
Keywords: MIT team, groundbreaking model recently, model recently proposed, Kolmogorov-Arnold Networks, representing a revolutionary
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Kolmogorov-Arnold Networks (KAN) is a groundbreaking model recently proposed by the MIT team, representing a revolutionary approach with the potential to be a game-changer in the field. This innovative concept has rapidly garnered worldwide interest within the AI community. Inspired by the Kolmogorov-Arnold representation theorem, KAN utilizes spline-parametrized univariate functions in place of traditional linear weights, enabling them to dynamically learn activation patterns and significantly enhancing interpretability. In this paper, we explore the application of KAN to time series forecasting and propose two variants: T-KAN and MT-KAN. T-KAN is designed to detect concept drift within time series and can explain the nonlinear relationships between predictions and previous time steps through symbolic regression, making it highly interpretable in dynamically changing environments. MT-KAN, on the other hand, improves predictive performance by effectively uncovering and leveraging the complex relationships among variables in multivariate time series. Experiments validate the effectiveness of these approaches, demonstrating that T-KAN and MT-KAN significantly outperform traditional methods in time series forecasting tasks, not only enhancing predictive accuracy but also improving model interpretability. This research opens new avenues for adaptive forecasting models, highlighting the potential of KAN as a powerful and interpretable tool in predictive analytics.

[LG-14] Ai-Sampler: Adversarial Learning of Markov kernels with involutive maps

Link: https://arxiv.org/abs/2406.02490
Authors: Evgenii Egorov, Ricardo Valperga, Efstratios Gavves
Keywords: Monte Carlo methods, chain Monte Carlo, Markov chain Monte, Monte Carlo, complicated probability distributions
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Markov chain Monte Carlo methods have become popular in statistics as versatile techniques to sample from complicated probability distributions. In this work, we propose a method to parameterize and train transition kernels of Markov chains to achieve efficient sampling and good mixing. This training procedure minimizes the total variation distance between the stationary distribution of the chain and the empirical distribution of the data. Our approach leverages involutive Metropolis-Hastings kernels constructed from reversible neural networks that ensure detailed balance by construction. We find that reversibility also implies $C_2$-equivariance of the discriminator function, which can be used to restrict its function space.

[LG-15] A Temporal Kolmogorov-Arnold Transformer for Time Series Forecasting

Link: https://arxiv.org/abs/2406.02486
Authors: Remi Genet, Hugo Inzirillo
Keywords: Capturing complex temporal, multivariate data streams, Temporal Kolmogorov-Arnold Networks, complex temporal patterns, Temporal Fusion Transformer
Subjects: Machine Learning (cs.LG)

Abstract:Capturing complex temporal patterns and relationships within multivariate data streams is a difficult task. We propose the Temporal Kolmogorov-Arnold Transformer (TKAT), a novel attention-based architecture designed to address this task using Temporal Kolmogorov-Arnold Networks (TKANs). Inspired by the Temporal Fusion Transformer (TFT), TKAT emerges as a powerful encoder-decoder model tailored to handle tasks in which the observed part of the features is more important than the a priori known part. This new architecture combined the theoretical foundation of the Kolmogorov-Arnold representation with the power of transformers. TKAT aims to simplify the complex dependencies inherent in time series, making them more “interpretable”. The use of transformer architecture in this framework allows us to capture long-range dependencies through self-attention mechanisms.

[LG-16] Applying Fine-Tuned LLMs for Reducing Data Needs in Load Profile Analysis

Link: https://arxiv.org/abs/2406.02479
Authors: Yi Hu, Hyeonjin Kim, Kai Ye, Ning Lu
Keywords: Large Language Models, fine-tuned Large Language, Large Language, utilizing fine-tuned Large, minimize data requirements
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP); Systems and Control (eess.SY)

Abstract:This paper presents a novel method for utilizing fine-tuned Large Language Models (LLMs) to minimize data requirements in load profile analysis, demonstrated through the restoration of missing data in power system load profiles. A two-stage fine-tuning strategy is proposed to adapt a pre-trained LLM, i.e., GPT-3.5, for missing data restoration tasks. Through empirical evaluation, we demonstrate the effectiveness of the fine-tuned model in accurately restoring missing data, achieving comparable performance to state-of-the-art specifically designed models such as BERT-PIN. Key findings include the importance of prompt engineering and the optimal utilization of fine-tuning samples, highlighting the efficiency of few-shot learning in transferring knowledge from general user cases to specific target users. Furthermore, the proposed approach demonstrates notable cost-effectiveness and time efficiency compared to training models from scratch, making it a practical solution for scenarios with limited data availability and computing resources. This research has significant potential for application to other power system load profile analysis tasks. Consequently, it advances the use of LLMs in power system analytics, offering promising implications for enhancing the resilience and efficiency of power distribution systems.

[LG-17] Landscape-Aware Growing: The Power of a Little LAG

Link: https://arxiv.org/abs/2406.02469
Authors: Stefani Karp, Nikunj Saunshi, Sobhan Miryoosefi, Sashank J. Reddi, Sanjiv Kumar
Keywords: efficient pretraining paradigms, training Transformer-based models, Transformer-based models, training Transformer-based, increasing interest
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)

Abstract:Recently, there has been increasing interest in efficient pretraining paradigms for training Transformer-based models. Several recent approaches use smaller models to initialize larger models in order to save computation (e.g., stacking and fusion). In this work, we study the fundamental question of how to select the best growing strategy from a given pool of growing strategies. Prior works have extensively focused on loss- and/or function-preserving behavior at initialization or simply performance at the end of training. Instead, we identify that behavior at initialization can be misleading as a predictor of final performance and present an alternative perspective based on early training dynamics, which we call “landscape-aware growing (LAG)”. We perform extensive analysis of correlation of the final performance with performance in the initial steps of training and find early and more accurate predictions of the optimal growing strategy (i.e., with only a small “lag” after initialization). This perspective also motivates an adaptive strategy for gradual stacking.

[LG-18] An Empirical Study into Clustering of Unseen Datasets with Self-Supervised Encoders

Link: https://arxiv.org/abs/2406.02465
Authors: Scott C. Lowe, Joakim Bruslund Haurum, Sageev Oore, Thomas B. Moeslund, Graham W. Taylor
Keywords: pretrained models generalize, models, models generalize, Abstract, pretrained
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Can pretrained models generalize to new datasets without any retraining? We deploy pretrained image models on datasets they were not trained for, and investigate whether their embeddings form meaningful clusters. Our suite of benchmarking experiments uses encoders pretrained solely on ImageNet-1k with either supervised or self-supervised training techniques, deployed on image datasets that were not seen during training, and clustered with conventional clustering algorithms. This evaluation provides new insights into the embeddings of self-supervised models, which prioritize different features to supervised models. Supervised encoders typically offer more utility than SSL encoders within the training domain, and vice versa far outside of it; however, fine-tuned encoders demonstrate the opposite trend. Clustering provides a way to evaluate the utility of self-supervised learned representations orthogonal to existing methods such as kNN. Additionally, we find the silhouette score when measured in a UMAP-reduced space is highly correlated with clustering performance, and can therefore be used as a proxy for clustering performance on data with no ground truth labels. Our code implementation is available at this https URL.

[LG-19] Meta-Learners for Partially-Identified Treatment Effects Across Multiple Environments

Link: https://arxiv.org/abs/2406.02464
Authors: Jonas Schweisthal, Dennis Frauen, Mihaela van der Schaar, Stefan Feuerriegel
Keywords: Estimating the conditional, average treatment effect, conditional average treatment, personalized medicine, conditional average
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: Accepted at ICML 2024

Abstract:Estimating the conditional average treatment effect (CATE) from observational data is relevant for many applications such as personalized medicine. Here, we focus on the widespread setting where the observational data come from multiple environments, such as different hospitals, physicians, or countries. Furthermore, we allow for violations of standard causal assumptions, namely, overlap within the environments and unconfoundedness. To this end, we move away from point identification and focus on partial identification. Specifically, we show that current assumptions from the literature on multiple environments allow us to interpret the environment as an instrumental variable (IV). This allows us to adapt bounds from the IV literature for partial identification of CATE by leveraging treatment assignment mechanisms across environments. Then, we propose different model-agnostic learners (so-called meta-learners) to estimate the bounds that can be used in combination with arbitrary machine learning models. We further demonstrate the effectiveness of our meta-learners across various experiments using both simulated and real-world data. Finally, we discuss the applicability of our meta-learners to partial identification in instrumental variable settings, such as randomized controlled trials with non-compliance.

[LG-20] Offline Bayesian Aleatoric and Epistemic Uncertainty Quantification and Posterior Value Optimisation in Finite-State MDPs

Link: https://arxiv.org/abs/2406.02456
Authors: Filippo Valdettaro, A. Aldo Faisal
Keywords: Markov Decision Processes, quantifying Bayesian uncertainty, finite-state Markov Decision, Decision Processes, Markov Decision
Subjects: Machine Learning (cs.LG)
Comments: 19 pages, 13 figures, 40th Conference on Uncertainty in Artificial Intelligence (UAI 2024)

Abstract:We address the challenge of quantifying Bayesian uncertainty and incorporating it in offline use cases of finite-state Markov Decision Processes (MDPs) with unknown dynamics. Our approach provides a principled method to disentangle epistemic and aleatoric uncertainty, and a novel technique to find policies that optimise Bayesian posterior expected value without relying on strong assumptions about the MDP’s posterior distribution. First, we utilise standard Bayesian reinforcement learning methods to capture the posterior uncertainty in MDP parameters based on available data. We then analytically compute the first two moments of the return distribution across posterior samples and apply the law of total variance to disentangle aleatoric and epistemic uncertainties. To find policies that maximise posterior expected value, we leverage the closed-form expression for value as a function of policy. This allows us to propose a stochastic gradient-based approach for solving the problem. We illustrate the uncertainty quantification and Bayesian posterior value optimisation performance of our agent in simple, interpretable gridworlds and validate it through ground-truth evaluations on synthetic MDPs. Finally, we highlight the real-world impact and computational scalability of our method by applying it to the AI Clinician problem, which recommends treatment for patients in intensive care units and has emerged as a key use case of finite-state MDPs with offline data. We discuss the challenges that arise with Bayesian modelling of larger scale MDPs while demonstrating the potential to apply our methods rooted in Bayesian decision theory into the real world. We make our code available at this https URL .
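The variance decomposition mentioned in this abstract can be sketched as below. This is an illustration under stated assumptions rather than the authors' code: it takes as given, for each posterior sample of the MDP parameters, the mean and variance of the return (the abstract says these moments are computed analytically), and it weights all posterior samples equally. The function and argument names are hypothetical.

```python
from statistics import fmean, pvariance


def decompose_return_variance(cond_means, cond_vars):
    """Split total return variance via the law of total variance.

    cond_means[i] and cond_vars[i] are the mean and variance of the return
    under the i-th posterior sample of the MDP parameters.
    """
    # Aleatoric part: expected within-MDP variance, E[Var(G | theta)].
    aleatoric = fmean(cond_vars)
    # Epistemic part: variance of the expected return across MDPs, Var(E[G | theta]).
    epistemic = pvariance(cond_means)
    return aleatoric, epistemic, aleatoric + epistemic
```

For instance, two posterior samples with return means 1.0 and 3.0 and return variances 0.5 and 1.5 give an aleatoric term of 1.0, an epistemic term of 1.0, and a total variance of 2.0.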

[LG-21] A Generalized Apprenticeship Learning Framework for Modeling Heterogeneous Student Pedagogical Strategies

Link: https://arxiv.org/abs/2406.02450
Authors: Md Mirajul Islam, Xi Yang, John Hostetter, Adittya Soukarjya Saha, Min Chi
Keywords: Intelligent Tutoring Systems, Tutoring Systems, Intelligent Tutoring, Deep Reinforcement Learning, environments like Intelligent
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:A key challenge in e-learning environments like Intelligent Tutoring Systems (ITSs) is to induce effective pedagogical policies efficiently. While Deep Reinforcement Learning (DRL) often suffers from sample inefficiency and reward function design difficulty, Apprenticeship Learning (AL) algorithms can overcome them. However, most AL algorithms cannot handle heterogeneity, as they assume all demonstrations are generated with a homogeneous policy driven by a single reward function. Moreover, the AL algorithms that do consider heterogeneity often cannot generalize to large continuous state spaces and only work with discrete states. In this paper, we propose an expectation-maximization (EM)-EDM, a general AL framework to induce effective pedagogical policies from given optimal or near-optimal demonstrations, which are assumed to be driven by heterogeneous reward functions. We compare the effectiveness of the policies induced by our proposed EM-EDM against four AL-based baselines and two policies induced by DRL on two different but related tasks that involve pedagogical action prediction. Our overall results showed that, for both tasks, EM-EDM outperforms the four AL baselines across all performance metrics and the two DRL baselines. This suggests that EM-EDM can effectively model complex student pedagogical decision-making processes through the ability to manage a large, continuous state space and adapt to handle diverse and heterogeneous reward functions with very few given demonstrations.

[LG-22] Reducing Bias in Federated Class-Incremental Learning with Hierarchical Generative Prototypes

Link: https://arxiv.org/abs/2406.02447
Authors: Riccardo Salami, Pietro Buzzega, Matteo Mosconi, Mattia Verasani, Simone Calderara
Keywords: Federated Continual Learning, safeguarding data privacy, Continual Learning, Federated Learning, aims at unburdening
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Federated Learning (FL) aims at unburdening the training of deep models by distributing computation across multiple devices (clients) while safeguarding data privacy. On top of that, Federated Continual Learning (FCL) also accounts for data distribution evolving over time, mirroring the dynamic nature of real-world environments. In this work, we shed light on the Incremental and Federated biases that naturally emerge in FCL. While the former is a known problem in Continual Learning, stemming from the prioritization of recently introduced classes, the latter (i.e., the bias towards local distributions) remains relatively unexplored. Our proposal constrains both biases in the last layer by efficiently fine-tuning a pre-trained backbone using learnable prompts, resulting in clients that produce less biased representations and more biased classifiers. Therefore, instead of solely relying on parameter aggregation, we also leverage generative prototypes to effectively balance the predictions of the global model. Our method improves on the current State Of The Art, providing an average increase of +7.9% in accuracy.

[LG-23] Coresets for Multiple $\ell_p$ Regression

Link: https://arxiv.org/abs/2406.02432
Authors: David P. Woodruff, Taisuke Yasuda
Keywords: data analytic tasks, solving downstream data, downstream data analytic
Categories: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: ICML 2024

Abstract:A coreset of a dataset with $n$ examples and $d$ features is a weighted subset of examples that is sufficient for solving downstream data analytic tasks. Nearly optimal constructions of coresets for least squares and $\ell_p$ linear regression with a single response are known in prior work. However, for multiple $\ell_p$ regression where there can be $m$ responses, there are no known constructions with size sublinear in $m$. In this work, we construct coresets of size $\tilde O(\varepsilon^{-2}d)$ for $p<2$ and $\tilde O(\varepsilon^{-p}d^{p/2})$ for $p>2$ independently of $m$ (i.e., dimension-free) that approximate the multiple $\ell_p$ regression objective at every point in the domain up to $(1\pm\varepsilon)$ relative error. If we only need to preserve the minimizer subject to a subspace constraint, we improve these bounds by an $\varepsilon$ factor for all $p>1$. All of our bounds are nearly tight. We give two applications of our results. First, we settle the number of uniform samples needed to approximate $\ell_p$ Euclidean power means up to a $(1+\varepsilon)$ factor, showing that $\tilde\Theta(\varepsilon^{-2})$ samples for $p = 1$, $\tilde\Theta(\varepsilon^{-1})$ samples for $1 < p < 2$, and $\tilde\Theta(\varepsilon^{1-p})$ samples for $p>2$ is tight, answering a question of Cohen-Addad, Saulpic, and Schwiegelshohn. Second, we show that for $1<p<2$, every matrix has a subset of $\tilde O(\varepsilon^{-1}k)$ rows which spans a $(1+\varepsilon)$-approximately optimal $k$-dimensional subspace for $\ell_p$ subspace approximation, which is also nearly optimal.

[LG-24] Reweighted Solutions for Weighted Low Rank Approximation

Link: https://arxiv.org/abs/2406.02431
Authors: David P. Woodruff, Taisuke Yasuda
Keywords: computationally challenging primitive, statistical analysis, signal processing, important yet computationally, computationally challenging
Categories: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: ICML 2024

Abstract:Weighted low rank approximation (WLRA) is an important yet computationally challenging primitive with applications ranging from statistical analysis and model compression to signal processing. To cope with the NP-hardness of this problem, prior work considers heuristics, bicriteria, or fixed parameter tractable algorithms. In this work, we introduce a new relaxed solution to WLRA which outputs a matrix that is not necessarily low rank, but can be stored using very few parameters and gives provable approximation guarantees when the weight matrix has low rank. Our central idea is to use the weight matrix itself to reweight a low rank solution, which gives an extremely simple algorithm with remarkable empirical performance in applications to model compression and on synthetic datasets. Our algorithm also gives nearly optimal communication complexity bounds for a natural distributed problem associated with WLRA, for which we show matching communication lower bounds. Together, our communication complexity bounds show that the rank of the weight matrix provably parameterizes the communication complexity of WLRA. We also obtain the first relative error guarantees for feature selection with a weighted objective.

[LG-25] Harnessing Neural Unit Dynamics for Effective and Scalable Class-Incremental Learning

Link: https://arxiv.org/abs/2406.02428
Authors: Depeng Li, Tianqi Wang, Junwei Chen, Wei Dai, Zhigang Zeng
Keywords: non-stationary data streams, aims to train, Class-incremental learning, non-stationary data, data streams
Categories: Machine Learning (cs.LG)
Comments: Accepted to ICML 2024

Abstract:Class-incremental learning (CIL) aims to train a model to learn new classes from non-stationary data streams without forgetting old ones. In this paper, we propose a new kind of connectionist model by tailoring neural unit dynamics that adapt the behavior of neural networks for CIL. In each training session, it introduces a supervisory mechanism to guide network expansion whose growth size is compactly commensurate with the intrinsic complexity of a newly arriving task. This constructs a near-minimal network while allowing the model to expand its capacity when it cannot sufficiently hold new classes. At inference time, it automatically reactivates the required neural units to retrieve knowledge and leaves the remaining units inactive to prevent interference. We name our model AutoActivator, which is effective and scalable. To gain insights into the neural unit dynamics, we theoretically analyze the model’s convergence property via a universal approximation theorem on learning sequential mappings, which is under-explored in the CIL community. Experiments show that our method achieves strong CIL performance in rehearsal-free and minimal-expansion settings with different backbones.

[LG-26] Contextual Dynamic Pricing: Algorithms, Optimality, and Local Differential Privacy Constraints

Link: https://arxiv.org/abs/2406.02424
Authors: Zifeng Zhao, Feiyu Jiang, Yi Yu
Keywords: sequentially arriving consumers, unknown demand model, firm sells products, sequentially arriving, dynamic pricing
Categories: Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)
Comments:

Abstract:We study the contextual dynamic pricing problem where a firm sells products to $T$ sequentially arriving consumers that behave according to an unknown demand model. The firm aims to maximize its revenue, i.e. minimize its regret over a clairvoyant that knows the model in advance. The demand model is a generalized linear model (GLM), allowing for a stochastic feature vector in $\mathbb{R}^d$ that encodes product and consumer information. We first show that the optimal regret upper bound is of order $\sqrt{dT}$, up to a logarithmic factor, improving upon existing upper bounds in the literature by a $\sqrt{d}$ factor. This sharper rate is materialised by two algorithms: a confidence bound-type (supCB) algorithm and an explore-then-commit (ETC) algorithm. A key insight of our theoretical result is an intrinsic connection between dynamic pricing and the contextual multi-armed bandit problem with many arms based on a careful discretization. We further study contextual dynamic pricing under the local differential privacy (LDP) constraints. In particular, we propose a stochastic gradient descent based ETC algorithm that achieves an optimal regret upper bound of order $d\sqrt{T}/\epsilon$, up to a logarithmic factor, where $\epsilon>0$ is the privacy parameter. The regret upper bounds with and without LDP constraints are accompanied by newly constructed minimax lower bounds, which further characterize the cost of privacy. Extensive numerical experiments and a real data application on online lending are conducted to illustrate the efficiency and practical value of the proposed algorithms in dynamic pricing.

[LG-27] Representing Piecewise-Linear Functions by Functions with Minimal Arity

Link: https://arxiv.org/abs/2406.02421
Authors: Christoph Koutschan, Anton Ponomarchuk, Josef Schicho
Keywords: continuous piecewise-linear function, Representing piecewise linear, continuous piecewise-linear, linear combination
Categories: Discrete Mathematics (cs.DM); Machine Learning (cs.LG); Symbolic Computation (cs.SC)
Comments:

Abstract:Any continuous piecewise-linear function $F\colon \mathbb{R}^n\to \mathbb{R}$ can be represented as a linear combination of $\max$ functions of at most $n+1$ affine-linear functions. In our previous paper ["Representing piecewise linear functions by functions with small arity", AAECC, 2023], we showed that this upper bound of $n+1$ arguments is tight. In the present paper, we extend this result by establishing a correspondence between the function $F$ and the minimal number of arguments that are needed in any such decomposition. We show that the tessellation of the input space $\mathbb{R}^n$ induced by the function $F$ has a direct connection to the number of arguments in the $\max$ functions.
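
As a small illustration of the representation described in the abstract, the sketch below takes the case n = 1, where each max term may use at most n + 1 = 2 affine arguments. The "hat" function and both function names are examples chosen here, not taken from the paper.

```python
def hat(x):
    """Direct piecewise definition: 0 outside [0, 2], peak of 1 at x = 1."""
    if x <= 0 or x >= 2:
        return 0.0
    return x if x <= 1 else 2 - x


def hat_as_max_combination(x):
    """Same function as a linear combination of max terms.

    Each term is a max of two affine functions of x (one of them the
    constant 0), i.e. at most n + 1 = 2 arguments for n = 1.
    """
    return max(x, 0.0) - 2 * max(x - 1, 0.0) + max(x - 2, 0.0)


# Compare the two definitions on a grid of points in [-3, 5].
grid = [i / 10 for i in range(-30, 51)]
max_gap = max(abs(hat(x) - hat_as_max_combination(x)) for x in grid)
print(max_gap)
```

The two definitions agree up to floating-point rounding, which is the one-dimensional version of the statement that n + 1 arguments per max term suffice.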

[LG-28] Improved Modelling of Federated Datasets using Mixtures-of-Dirichlet-Multinomials

Link: https://arxiv.org/abs/2406.02416
Authors: Jonathan Scott, Áine Cahill
Keywords: orders of magnitude, magnitude slower, slower than standard, standard centralized training, training
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments:

Abstract:In practice, training using federated learning can be orders of magnitude slower than standard centralized training. This severely limits the amount of experimentation and tuning that can be done, making it challenging to obtain good performance on a given task. Server-side proxy data can be used to run training simulations, for instance for hyperparameter tuning. This can greatly speed up the training pipeline by reducing the number of tuning runs to be performed overall on the true clients. However, it is challenging to ensure that these simulations accurately reflect the dynamics of the real federated training. In particular, the proxy data used for simulations often comes as a single centralized dataset without a partition into distinct clients, and partitioning this data in a naive way can lead to simulations that poorly reflect real federated training. In this paper we address the challenge of how to partition centralized data in a way that reflects the statistical heterogeneity of the true federated clients. We propose a fully federated, theoretically justified, algorithm that efficiently learns the distribution of the true clients and observe improved server-side simulations when using the inferred distribution to create simulated clients from the centralized data.
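
To make the partitioning idea concrete, here is a rough sketch of drawing one simulated client's label counts from a mixture of Dirichlet-multinomials. It is not the paper's algorithm: the mixture weights and concentration parameters below are made-up placeholders (the paper learns the client distribution in a federated way), and the helper names are hypothetical.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw a probability vector from Dirichlet(alpha) via gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def simulate_client_label_counts(mixture_weights, alphas, n_examples, rng):
    """Pick a mixture component, then draw Dirichlet-multinomial label counts."""
    component = rng.choices(range(len(alphas)), weights=mixture_weights, k=1)[0]
    probs = sample_dirichlet(alphas[component], rng)
    labels = rng.choices(range(len(probs)), weights=probs, k=n_examples)
    counts = [0] * len(probs)
    for label in labels:
        counts[label] += 1
    return counts


rng = random.Random(0)
# Placeholder parameters: two client "types" over three classes.
mixture_weights = [0.7, 0.3]
alphas = [[5.0, 1.0, 1.0], [0.5, 0.5, 4.0]]
counts = simulate_client_label_counts(mixture_weights, alphas, 100, rng)
print(counts)
```

Centralized examples could then be assigned to the simulated client by sampling that many examples per class, so that client label distributions vary in the way the fitted mixture describes.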

[LG-29] GrootVL: Tree Topology is All You Need in State Space Model

Link: https://arxiv.org/abs/2406.02395
Authors: Yicheng Xiao, Lin Song, Shaoli Huang, Jiangshan Wang, Siyu Song, Yixiao Ge, Xiu Li, Ying Shan
Keywords: employing recursively propagated, comparable to Transformer, recursively propagated features, employing recursively, superior efficiency
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: The code is available at this https URL

Abstract:State space models, which employ recursively propagated features, demonstrate strong representation capabilities comparable to Transformer models and superior efficiency. However, constrained by the inherent geometry of sequences, they still fall short in modeling long-range dependencies. To address this issue, we propose the GrootVL network, which first dynamically generates a tree topology based on spatial relationships and input features. Then, feature propagation is performed based on this graph, thereby breaking the original sequence constraints to achieve stronger representation capabilities. Additionally, we introduce a linear complexity dynamic programming algorithm to enhance long-range interactions without increasing computational cost. GrootVL is a versatile multimodal framework that can be applied to both visual and textual tasks. Extensive experiments demonstrate that our method significantly outperforms existing structured state space models on image classification, object detection and segmentation. Besides, by fine-tuning large language models, our approach achieves consistent improvements in multiple textual tasks at minor training cost.

[LG-30] Multiple Choice Questions and Large Languages Models: A Case Study with Fictional Medical Data

Link: https://arxiv.org/abs/2406.02394
Authors: Maxime Griot, Jean Vanderdonckt, Demet Yuksel, Coralie Hemptinne
Keywords: ChatGPT demonstrate significant, demonstrate significant potential, Large Language Models, Large Language, ChatGPT demonstrate
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Large Language Models (LLMs) like ChatGPT demonstrate significant potential in the medical field, often evaluated using multiple-choice questions (MCQs) similar to those found on the USMLE. Despite their prevalence in medical education, MCQs have limitations that might be exacerbated when assessing LLMs. To evaluate the effectiveness of MCQs in assessing the performance of LLMs, we developed a fictional medical benchmark focused on a non-existent gland, the Glianorex. This approach allowed us to isolate the knowledge of the LLM from its test-taking abilities. We used GPT-4 to generate a comprehensive textbook on the Glianorex in both English and French and developed corresponding multiple-choice questions in both languages. We evaluated various open-source, proprietary, and domain-specific LLMs using these questions in a zero-shot setting. The models achieved average scores around 67%, with minor performance differences between larger and smaller models. Performance was slightly higher in English than in French. Fine-tuned medical models showed some improvement over their base versions in English but not in French. The uniformly high performance across models suggests that traditional MCQ-based benchmarks may not accurately measure LLMs’ clinical knowledge and reasoning abilities, instead highlighting their pattern recognition skills. This study underscores the need for more robust evaluation methods to better assess the true capabilities of LLMs in medical contexts.

[LG-31] Learning to Edit Visual Programs with Self-Supervision

Link: https://arxiv.org/abs/2406.02383
Authors: R. Kenny Jones, Renhao Zhang, Aditya Ganeshan, Daniel Ritchie
Keywords: design a system, system that learns, edit network, edit, visual programs
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
Comments:

Abstract:We design a system that learns how to edit visual programs. Our edit network consumes a complete input program and a visual target. From this input, we task our network with predicting a local edit operation that could be applied to the input program to improve its similarity to the target. In order to apply this scheme for domains that lack program annotations, we develop a self-supervised learning approach that integrates this edit network into a bootstrapped finetuning loop along with a network that predicts entire programs in one-shot. Our joint finetuning scheme, when coupled with an inference procedure that initializes a population from the one-shot model and evolves members of this population with the edit network, helps to infer more accurate visual programs. Over multiple domains, we experimentally compare our method against the alternative of using only the one-shot model, and find that even under equal search-time budgets, our editing-based paradigm provides significant advantages.

[LG-32] Finding NeMo: Localizing Neurons Responsible For Memorization in Diffusion Models

Link: https://arxiv.org/abs/2406.02366
Authors: Dominik Hintersdorf, Lukas Struppek, Kristian Kersting, Adam Dziedzic, Franziska Boenisch
Keywords: produce very detailed, detailed and high-quality, Diffusion models, high-quality images, copyrighted training images
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Preprint

Abstract:Diffusion models (DMs) produce very detailed and high-quality images. Their power results from extensive training on large amounts of data, usually scraped from the internet without proper attribution or consent from content creators. Unfortunately, this practice raises privacy and intellectual property concerns, as DMs can memorize and later reproduce their potentially sensitive or copyrighted training images at inference time. Prior efforts prevent this issue by either changing the input to the diffusion process, thereby preventing the DM from generating memorized samples during inference, or removing the memorized data from training altogether. While those are viable solutions when the DM is developed and deployed in a secure and constantly monitored environment, they hold the risk of adversaries circumventing the safeguards and are not effective when the DM itself is publicly released. To solve the problem, we introduce NeMo, the first method to localize memorization of individual data samples down to the level of neurons in DMs’ cross-attention layers. Through our experiments, we make the intriguing finding that in many cases, single neurons are responsible for memorizing particular training samples. By deactivating these memorization neurons, we can avoid the replication of training data at inference time, increase the diversity in the generated outputs, and mitigate the leakage of private and copyrighted data. In this way, our NeMo contributes to a more responsible deployment of DMs.

[LG-33] Temporal Graph Rewiring with Expander Graphs

Link: https://arxiv.org/abs/2406.02362
Authors: Katarina Petrović, Shenyang Huang, Farimah Poursafaei, Petar Veličković
Keywords: Evolving relations, Graph Neural Networks, Temporal Graph Rewiring, Graph rewiring, temporal graphs
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI); Machine Learning (stat.ML)
Comments: 10 pages, 2 figures

Abstract:Evolving relations in real-world networks are often modelled by temporal graphs. Graph rewiring techniques have been utilised on Graph Neural Networks (GNNs) to improve expressiveness and increase model performance. In this work, we propose Temporal Graph Rewiring (TGR), the first approach for graph rewiring on temporal graphs. TGR enables communication between temporally distant nodes in a continuous time dynamic graph by utilising expander graph propagation to construct a message passing highway for message passing between distant nodes. Expander graphs are suitable candidates for rewiring as they help overcome the oversquashing problem often observed in GNNs. On the public tgbl-wiki benchmark, we show that TGR improves the performance of a widely used TGN model by a significant margin. Our code repository is accessible at https://anonymous.4open.science/r/TGR-254C.

[LG-34] Using Self-supervised Learning Can Improve Model Fairness

Link: https://arxiv.org/abs/2406.02361
Authors: Sofia Yfantidou, Dimitris Spathis, Marios Constantinides, Athena Vakali, Daniele Quercia, Fahim Kawsar
Keywords: facto training paradigm, Self-supervised learning, data and labels, facto training, training paradigm
Categories: Machine Learning (cs.LG)
Comments: arXiv admin note: text overlap with arXiv:2401.01640

Abstract:Self-supervised learning (SSL) has become the de facto training paradigm of large models, where pre-training is followed by supervised fine-tuning using domain-specific data and labels. Despite demonstrating comparable performance with supervised methods, comprehensive efforts to assess SSL’s impact on machine learning fairness (i.e., performing equally on different demographic breakdowns) are lacking. Hypothesizing that SSL models would learn more generic, hence less biased representations, this study explores the impact of pre-training and fine-tuning strategies on fairness. We introduce a fairness assessment framework for SSL, comprising five stages: defining dataset requirements, pre-training, fine-tuning with gradual unfreezing, assessing representation similarity conditioned on demographics, and establishing domain-specific evaluation processes. We evaluate our method’s generalizability on three real-world human-centric datasets (i.e., MIMIC, MESA, and GLOBEM) by systematically comparing hundreds of SSL and fine-tuned models on various dimensions spanning from the intermediate representations to appropriate evaluation metrics. Our findings demonstrate that SSL can significantly improve model fairness while maintaining performance on par with supervised methods, exhibiting up to a 30% increase in fairness with minimal loss in performance through self-supervision. We posit that such differences can be attributed to representation dissimilarities found between the best- and the worst-performing demographics across models, which are up to 13x greater for protected attributes with larger performance discrepancies between segments.

[LG-35] The complexity of approximate (coarse) correlated equilibrium for incomplete information games

Link: https://arxiv.org/abs/2406.02357
Authors: Binghui Peng, Aviad Rubinstein
Keywords: approximate correlated equilibrium, approximate correlated, incomplete information games, correlated equilibrium
Categories: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
Comments:

Abstract:We study the iteration complexity of decentralized learning of approximate correlated equilibria in incomplete information games. On the negative side, we prove that in extensive-form games, assuming $\mathsf{PPAD} \not\subset \mathsf{TIME}(n^{\mathsf{polylog}(n)})$, any polynomial-time learning algorithm must take at least $2^{\log_2^{1-o(1)}(|\mathcal{I}|)}$ iterations to converge to the set of $\epsilon$-approximate correlated equilibria, where $|\mathcal{I}|$ is the number of nodes in the game and $\epsilon > 0$ is an absolute constant. This nearly matches, up to the $o(1)$ term, the algorithms of [PR’24, DDFG’24] for learning $\epsilon$-approximate correlated equilibrium, and resolves an open question of Anagnostides, Kalavasis, Sandholm, and Zampetakis [AKSZ’24]. Our lower bound holds even for the easier solution concept of $\epsilon$-approximate coarse correlated equilibrium. On the positive side, we give uncoupled dynamics that reach $\epsilon$-approximate correlated equilibria of a Bayesian game in polylogarithmic iterations, without any dependence on the number of types. This demonstrates a separation between Bayesian games and extensive-form games.

[LG-36] Language Models Do Hard Arithmetic Tasks Easily and Hardly Do Easy Arithmetic Tasks

Link: https://arxiv.org/abs/2406.02356
Authors: Andrew Gambardella, Yusuke Iwasawa, Yutaka Matsuo
Keywords: large language models, perform arithmetic tasks, language models, practical debate, large language
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Abstract:The ability (and inability) of large language models (LLMs) to perform arithmetic tasks has been the subject of much theoretical and practical debate. We show that LLMs are frequently able to correctly and confidently predict the first digit of n-digit by m-digit multiplication tasks without using chain of thought reasoning, despite these tasks requiring compounding operations to solve. Simultaneously, LLMs in practice often fail to correctly or confidently predict the last digit of an n-digit by m-digit multiplication, a task equivalent to 1-digit by 1-digit multiplication which can be easily learned or memorized. We show that the latter task can be solved more robustly when the LLM is conditioned on all of the correct higher-order digits, which on average increases the confidence of the correct last digit on 5-digit by 5-digit multiplication tasks using Llama 2-13B by over 230% (0.13 to 0.43) and Mistral-7B by 150% (0.22 to 0.55).
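
The two arithmetic facts the abstract relies on can be checked with a short sketch: the last digit of a product depends only on the last digits of the factors, and the quoted percentages are relative increases in confidence. The function names are illustrative only.

```python
import random


def last_digit_of_product(a, b):
    """Last digit of a * b, computed from the last digits of a and b alone."""
    return ((a % 10) * (b % 10)) % 10


def relative_increase(before, after):
    """Relative change, e.g. 0.5 means a 50% increase."""
    return (after - before) / before


# Check the last-digit shortcut on random 5-digit by 5-digit products.
rng = random.Random(0)
pairs = [(rng.randint(10000, 99999), rng.randint(10000, 99999)) for _ in range(1000)]
all_match = all(last_digit_of_product(a, b) == (a * b) % 10 for a, b in pairs)
print(all_match)

# Confidence figures quoted in the abstract.
llama_gain = relative_increase(0.13, 0.43)    # a bit over 2.3, i.e. over 230%
mistral_gain = relative_increase(0.22, 0.55)  # about 1.5, i.e. 150%
print(llama_gain, mistral_gain)
```

This is why the last-digit task is described as equivalent to 1-digit by 1-digit multiplication: a lookup over the 100 digit pairs is enough.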

[LG-37] FedDr+: Stabilizing Dot-regression with Global Feature Distillation for Federated Learning

Link: https://arxiv.org/abs/2406.02355
Authors: Seongyoon Kim, Minchan Jeong, Sungnyun Kim, Sungwoo Cho, Sumyeong Ahn, Se-Young Yun
Keywords: Federated Learning, non-iid data distribution, pivotal framework, non-iid data, Federated
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments:

Abstract:Federated Learning (FL) has emerged as a pivotal framework for the development of effective global models (global FL) or personalized models (personalized FL) across clients with heterogeneous, non-iid data distribution. A key challenge in FL is client drift, where data heterogeneity impedes the aggregation of scattered knowledge. Recent studies have tackled the client drift issue by identifying significant divergence in the last classifier layer. To mitigate this divergence, strategies such as freezing the classifier weights and aligning the feature extractor accordingly have proven effective. Although the local alignment between classifier and feature extractor has been studied as a crucial factor in FL, we observe that it may lead the model to overemphasize the observed classes within each client. Thus, our objectives are twofold: (1) enhancing local alignment while (2) preserving the representation of unseen class samples. This approach aims to effectively integrate knowledge from individual clients, thereby improving performance for both global and personalized FL. To achieve this, we introduce a novel algorithm named FedDr+, which empowers local model alignment using dot-regression loss. FedDr+ freezes the classifier as a simplex ETF to align the features and improves aggregated global models by employing a feature distillation mechanism to retain information about unseen/missing classes. Consequently, we provide empirical evidence demonstrating that our algorithm surpasses existing methods that use a frozen classifier to boost alignment across the diverse distribution.
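
For readers unfamiliar with the two ingredients named in the abstract, the sketch below builds a simplex ETF classifier and evaluates a dot-regression-style loss. It is an illustration under assumptions, not the FedDr+ implementation: the loss form (half the squared gap between the cosine to the true-class vector and 1) is assumed from the common formulation with unit-norm class vectors, and the feature distillation term is omitted.

```python
import math


def simplex_etf(num_classes):
    """Rows are unit vectors with pairwise dot product -1 / (K - 1)."""
    k = num_classes
    scale = math.sqrt(k / (k - 1))
    return [
        [scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
        for i in range(k)
    ]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def dot_regression_loss(feature, class_vector):
    """Assumed form: 0.5 * (cos(feature, class_vector) - 1)^2.

    class_vector is taken to be unit norm (a frozen ETF row), so only the
    feature needs normalising.
    """
    cosine = dot(feature, class_vector) / math.sqrt(dot(feature, feature))
    return 0.5 * (cosine - 1.0) ** 2


etf = simplex_etf(4)
aligned = [2.0 * x for x in etf[0]]            # feature pointing at class 0
loss_aligned = dot_regression_loss(aligned, etf[0])
loss_misaligned = dot_regression_loss(aligned, etf[1])
print(loss_aligned, loss_misaligned)
```

Because the classifier rows are fixed, the loss only pulls each feature toward the vector of its own class, which is the local alignment the abstract refers to.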

[LG-38] Label-wise Aleatoric and Epistemic Uncertainty Quantification

Link: https://arxiv.org/abs/2406.02354
Authors: Yusuf Sale, Paul Hofman, Timo Löhr, Lisa Wimmer, Thomas Nagler, Eyke Hüllermeier
Keywords: classification tasks based, classification tasks, tasks based, uncertainty, uncertainty quantification
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Uncertainty in Artificial Intelligence. arXiv admin note: substantial text overlap with arXiv:2401.00276

Abstract:We present a novel approach to uncertainty quantification in classification tasks based on label-wise decomposition of uncertainty measures. This label-wise perspective allows uncertainty to be quantified at the individual class level, thereby improving cost-sensitive decision-making and helping understand the sources of uncertainty. Furthermore, it allows to define total, aleatoric, and epistemic uncertainty on the basis of non-categorical measures such as variance, going beyond common entropy-based measures. In particular, variance-based measures address some of the limitations associated with established methods that have recently been discussed in the literature. We show that our proposed measures adhere to a number of desirable properties. Through empirical evaluation on a variety of benchmark data sets – including applications in the medical domain where accurate uncertainty quantification is crucial – we establish the effectiveness of label-wise uncertainty quantification.

[LG-39] System-Aware Neural ODE Processes for Few-Shot Bayesian Optimization

Link: https://arxiv.org/abs/2406.02352
Authors: Jixiang Qing, Becky D Langdon, Robert M Lee, Behrang Shafei, Mark van der Wilk, Calvin Tsay, Ruth Misener
Keywords: Neural ODE Processes, ordinary differential equations, unknown ordinary differential, ODE Processes, dynamical systems governed
Categories: Machine Learning (cs.LG)
Comments:

Abstract:We consider the problem of optimizing initial conditions and timing in dynamical systems governed by unknown ordinary differential equations (ODEs), where evaluating different initial conditions is costly and there are constraints on observation times. To identify the optimal conditions within several trials, we introduce a few-shot Bayesian Optimization (BO) framework based on the system’s prior information. At the core of our approach is the System-Aware Neural ODE Processes (SANODEP), an extension of Neural ODE Processes (NODEP) designed to meta-learn ODE systems from multiple trajectories using a novel context embedding block. Additionally, we propose a multi-scenario loss function specifically for optimization purposes. Our two-stage BO framework effectively incorporates search space constraints, enabling efficient optimization of both initial conditions and observation timings. We conduct extensive experiments showcasing SANODEP’s potential for few-shot BO. We also explore SANODEP’s adaptability to varying levels of prior information, highlighting the trade-off between prior flexibility and model fitting accuracy.

[LG-40] AMOSL: Adaptive Modality-wise Structure Learning in Multi-view Graph Neural Networks For Enhanced Unified Representation

Link: https://arxiv.org/abs/2406.02348
Authors: Peiyu Liang, Hongchang Gao, Xubin He
Keywords: Graph Neural Networks, Multi-view Graph Neural, Neural Networks, existing methods assume, overlook real-world discrepancies
Categories: Machine Learning (cs.LG)
Comments:

Abstract:While Multi-view Graph Neural Networks (MVGNNs) excel at leveraging diverse modalities for learning object representation, existing methods assume identical local topology structures across modalities, which overlooks real-world discrepancies. This causes MVGNNs to struggle with modality fusion and representation denoising. To address these issues, we propose adaptive modality-wise structure learning (AMoSL). AMoSL captures node correspondences between modalities via optimal transport and learns them jointly with the graph embedding. To enable efficient end-to-end training, we employ an efficient solution for the resulting complex bilevel optimization problem. Furthermore, AMoSL adapts to downstream tasks through unsupervised learning on inter-modality distances. The effectiveness of AMoSL is demonstrated by its ability to train more accurate graph classifiers on six benchmark datasets.

[LG-41] Flash Diffusion: Accelerating Any Conditional Diffusion Model for Few Steps Image Generation

Link: https://arxiv.org/abs/2406.02347
Authors: Clement Chadebec, Onur Tasar, Eyal Benaroche, Benjamin Aubin
Keywords: pre-trained diffusion models, Flash Diffusion, versatile distillation method, diffusion models, pre-trained diffusion
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 16 pages + 16 pages appendices

Abstract:In this paper, we propose an efficient, fast, and versatile distillation method to accelerate the generation of pre-trained diffusion models: Flash Diffusion. The method reaches state-of-the-art performance in terms of FID and CLIP-Score for few-step image generation on the COCO2014 and COCO2017 datasets, while requiring only several GPU hours of training and fewer trainable parameters than existing methods. In addition to its efficiency, the versatility of the method is demonstrated across several tasks such as text-to-image, inpainting, face-swapping, and super-resolution, and with different backbones such as UNet-based denoisers (SD1.5, SDXL) or DiT (Pixart-α), as well as adapters. In all cases, the method drastically reduces the number of sampling steps while maintaining very high-quality image generation. The official implementation is available at this https URL.

[LG-42] Progressive Confident Masking Attention Network for Audio-Visual Segmentation

Link: https://arxiv.org/abs/2406.02345
Authors: Yuxuan Wang, Feng Dong, Jinchao Zhu
Keywords: typically occur simultaneously, signals typically occur, occur simultaneously, typically occur, humans possess
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: 10 pages, 9 figures, submitted to IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY

Abstract:Audio and visual signals typically occur simultaneously, and humans possess an innate ability to correlate and synchronize information from these two modalities. Recently, a challenging problem known as Audio-Visual Segmentation (AVS) has emerged, intending to produce segmentation maps for sounding objects within a scene. However, the methods proposed so far have not sufficiently integrated audio and visual information, and the computational costs have been extremely high. Additionally, the outputs of different stages have not been fully utilized. To facilitate this research, we introduce a novel Progressive Confident Masking Attention Network (PMCANet). It leverages attention mechanisms to uncover the intrinsic correlations between audio signals and visual frames. Furthermore, we design an efficient and effective cross-attention module to enhance semantic perception by selecting query tokens. This selection is determined through confidence-driven units based on the network’s multi-stage predictive outputs. Experiments demonstrate that our network outperforms other AVS methods while requiring less computational resources.

[LG-43] Incorporating Navigation Context into Inland Vessel Trajectory Prediction: A Gaussian Mixture Model and Transformer Approach

Link: https://arxiv.org/abs/2406.02344
Authors: Kathrin Donandt, Dirk Söffker
Keywords: Automatic Identification System, Automatic Identification, Identification System, machine learning approaches, System to represent
Categories: Machine Learning (cs.LG)
Comments: To be published in Proceedings of the 27th International Conference on Information Fusion (FUSION 2024)

Abstract:Using data sources beyond the Automatic Identification System to represent the context a vessel is navigating in and consequently improve situation awareness is still rare in machine learning approaches to vessel trajectory prediction (VTP). In inland shipping, where vessel movement is constrained within fairways, navigational context information is indispensable. In this contribution targeting inland VTP, Gaussian Mixture Models (GMMs) are applied, on a fused dataset of AIS and discharge measurements, to generate multi-modal distribution curves, capturing typical lateral vessel positioning in the fairway and dislocation speeds along the waterway. By sampling the probability density curves of the GMMs, feature vectors are derived which are used, together with spatio-temporal vessel features and fairway geometries, as input to a VTP transformer model. The incorporation of these distribution features of both the current and forthcoming navigation context improves prediction accuracy. The superiority of the model over a previously proposed transformer model for inland VTP is shown. The novelty lies in the provision of preprocessed, statistics-based features representing the conditioned spatial context, rather than relying on the model to extract relevant features for the VTP task from contextual data. Oversimplification of the complexity of inland navigation patterns by assuming a single typical route or selecting specific clusters prior to model application is avoided by giving the model access to the entire distribution information. The methodology’s generalizability is demonstrated through the usage of data of 3 distinct river sections. It can be integrated into an interaction-aware prediction framework, where insights into the positioning of the actual vessel behavior in the overall distribution at the current location and discharge can enhance trajectory prediction accuracy.

[LG-44] Cluster-Aware Similarity Diffusion for Instance Retrieval

Link: https://arxiv.org/abs/2406.02343
Authors: Jifei Luo, Hantao Yao, Changsheng Xu
Keywords: Diffusion-based re-ranking, performing similarity propagation, nearest neighbor graph, common method, similarity
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Diffusion-based re-ranking is a common method used for retrieving instances by performing similarity propagation in a nearest neighbor graph. However, existing techniques that construct the affinity graph based on pairwise instances can lead to the propagation of misinformation from outliers and other manifolds, resulting in inaccurate results. To overcome this issue, we propose a novel Cluster-Aware Similarity (CAS) diffusion for instance retrieval. The primary concept of CAS is to conduct similarity diffusion within local clusters, which can reduce the influence from other manifolds explicitly. To obtain a symmetrical and smooth similarity matrix, our Bidirectional Similarity Diffusion strategy introduces an inverse constraint term to the optimization objective of local cluster diffusion. Additionally, we have optimized a Neighbor-guided Similarity Smoothing approach to ensure similarity consistency among the local neighbors of each instance. Evaluations in instance retrieval and object re-identification validate the effectiveness of the proposed CAS. Our code is publicly available.
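
The baseline that this abstract starts from, similarity propagation over a nearest neighbor graph, can be sketched in a few lines. The sketch below is not the CAS method: it is a generic random-walk-with-restart iteration, and the function name `diffuse`, the toy affinity matrix `W`, and the values of `alpha` and `iters` are all illustrative assumptions.

```python
def diffuse(affinity, query, alpha=0.8, iters=200):
    """Iterate f <- alpha * S f + (1 - alpha) * y on a row-normalised graph."""
    n = len(affinity)
    # Row-normalise the affinity matrix so every non-empty row sums to 1.
    S = []
    for row in affinity:
        total = sum(row)
        S.append([w / total if total > 0 else 0.0 for w in row])
    f = list(query)
    for _ in range(iters):
        f = [
            alpha * sum(S[i][j] * f[j] for j in range(n)) + (1 - alpha) * query[i]
            for i in range(n)
        ]
    return f


# Hypothetical toy graph: nodes 0-1 and 2-3 form two tight pairs,
# joined only by one weak edge between node 0 and node 2.
W = [
    [0.0, 1.0, 0.1, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.1, 0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
]
y = [1.0, 0.0, 0.0, 0.0]  # node 0 is the query

scores = diffuse(W, y)
ranking = sorted(range(len(scores)), key=lambda i: -scores[i])
print(ranking)
```

Even in this toy graph some score leaks across the weak edge into the second pair, which is the kind of cross-manifold propagation the abstract says cluster-local diffusion is meant to limit.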

[LG-45] Polynomial-Augmented Neural Networks (PANNs) with Weak Orthogonality Constraints for Enhanced Function and PDE Approximation

Link: https://arxiv.org/abs/2406.02336
Authors: Madison Cooley, Shandian Zhe, Robert M. Kirby, Varun Shankar
Keywords: polynomial-augmented neural networks, deep neural networks, neural networks, combines deep neural, present polynomial-augmented neural
Categories: Machine Learning (cs.LG)
Comments:

Abstract:We present polynomial-augmented neural networks (PANNs), a novel machine learning architecture that combines deep neural networks (DNNs) with a polynomial approximant. PANNs combine the strengths of DNNs (flexibility and efficiency in higher-dimensional approximation) with those of polynomial approximation (rapid convergence rates for smooth functions). To aid in both stable training and enhanced accuracy over a variety of problems, we present (1) a family of orthogonality constraints that impose mutual orthogonality between the polynomial and the DNN within a PANN; (2) a simple basis pruning approach to combat the curse of dimensionality introduced by the polynomial component; and (3) an adaptation of a polynomial preconditioning strategy to both DNNs and polynomials. We test the resulting architecture for its polynomial reproduction properties, ability to approximate both smooth functions and functions of limited smoothness, and as a method for the solution of partial differential equations (PDEs). Through these experiments, we demonstrate that PANNs offer superior approximation properties to DNNs for both regression and the numerical solution of PDEs, while also offering enhanced accuracy over both polynomial and DNN-based regression (each) when regressing functions with limited smoothness.

[LG-46] Towards Neural Architecture Search for Transfer Learning in 6G Networks

Link: https://arxiv.org/abs/2406.02333
Authors: Adam Orucu, Farnaz Moradi, Masoumeh Ebrahimi, Andreas Johnsson
Keywords: reducing energy consumption, optimizing performance, reducing energy, energy consumption, complexity and heterogeneity
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:The future 6G network is envisioned to be AI-native, and as such, ML models will be pervasive in support of optimizing performance, reducing energy consumption, and in coping with increasing complexity and heterogeneity. A key challenge is automating the process of finding optimal model architectures satisfying stringent requirements stemming from varying tasks, dynamicity and available resources in the infrastructure and deployment positions. In this paper, we describe and review the state-of-the-art in Neural Architecture Search and Transfer Learning and their applicability in networking. Further, we identify open research challenges and set directions with a specific focus on three main requirements with elements unique to the future network, namely combining NAS and TL, multi-objective search, and tabular data. Finally, we outline and discuss both near-term and long-term work ahead.

[LG-47] Extended Mind Transformers

Link: https://arxiv.org/abs/2406.02332
Authors: Phoebe Klett, Thomas Ahle
Keywords: Pre-trained language models, long inputs quickly, Pre-trained language, demonstrate general intelligence, language models demonstrate
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:Pre-trained language models demonstrate general intelligence and common sense, but long inputs quickly become a bottleneck for memorizing information at inference time. We resurface a simple method, Memorizing Transformers (Wu et al., 2022), that gives the model access to a bank of pre-computed memories. We show that it is possible to fix many of the shortcomings of the original method, such as the need for fine-tuning, by critically assessing how positional encodings should be updated for the keys and values retrieved. This intuitive method uses the model’s own key/query system to select and attend to the most relevant memories at each generation step, rather than using external embeddings. We demonstrate the importance of external information being retrieved in a majority of decoder layers, contrary to previous work. We open source a new counterfactual long-range retrieval benchmark, and show that Extended Mind Transformers outperform today’s state of the art by 6% on average.

[LG-48] On Affine Homotopy between Language Encoders

Link: https://arxiv.org/abs/2406.02329
Authors: Robin SM Chan, Reda Boumasmoud, Anej Svete, Yuxin Ren, Qipeng Guo, Zhijing Jin, Shauli Ravfogel, Mrinmaya Sachan, Bernhard Schölkopf, Mennatallah El-Assady, Ryan Cotterell
Keywords: NLP tasks, functions that represent, text as vectors, represent text, integral component
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 10 pages

Abstract:Pre-trained language encoders – functions that represent text as vectors – are an integral component of many NLP tasks. We tackle a natural question in language encoder analysis: What does it mean for two encoders to be similar? We contend that a faithful measure of similarity needs to be intrinsic, that is, task-independent, yet still be informative of extrinsic similarity – the performance on downstream tasks. It is common to consider two encoders similar if they are homotopic, i.e., if they can be aligned through some transformation. In this spirit, we study the properties of affine alignment of language encoders and its implications on extrinsic similarity. We find that while affine alignment is fundamentally an asymmetric notion of similarity, it is still informative of extrinsic similarity. We confirm this on datasets of natural language representations. Beyond providing useful bounds on extrinsic similarity, affine intrinsic similarity also allows us to begin uncovering the structure of the space of pre-trained encoders by defining an order over them.

[LG-49] Continual Unsupervised Out-of-Distribution Detection

Link: https://arxiv.org/abs/2406.02327
Authors: Lars Doorenbos, Raphael Sznitman, Pablo Márquez-Neila
Keywords: testing data, OOD, learning models excel, aligns with testing, Deep learning models
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Deep learning models excel when the data distribution during training aligns with testing data. Yet, their performance diminishes when faced with out-of-distribution (OOD) samples, leading to great interest in the field of OOD detection. Current approaches typically assume that OOD samples originate from an unconcentrated distribution complementary to the training distribution. While this assumption is appropriate in the traditional unsupervised OOD (U-OOD) setting, it proves inadequate when considering the place of deployment of the underlying deep learning model. To better reflect this real-world scenario, we introduce the novel setting of continual U-OOD detection. To tackle this new setting, we propose a method that starts from a U-OOD detector, which is agnostic to the OOD distribution, and slowly updates during deployment to account for the actual OOD distribution. Our method uses a new U-OOD scoring function that combines the Mahalanobis distance with a nearest-neighbor approach. Furthermore, we design a confidence-scaled few-shot OOD detector that outperforms previous methods. We show our method greatly improves upon strong baselines from related fields.
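
As a rough illustration of the two ingredients named in this abstract, the sketch below computes a Mahalanobis distance to the training mean and a distance to the k-th nearest training point for 2-D features. How the paper actually combines them is not stated here, so the plain sum in `ood_score` is only a placeholder assumption, and all function names and the toy data are hypothetical.

```python
import math


def fit_gaussian_2d(points):
    """Return the mean and (population) covariance entries of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    return (mx, my), (sxx, sxy, syy)


def mahalanobis_2d(x, mean, cov):
    """Mahalanobis distance using the closed-form inverse of a 2x2 covariance."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy  # assumed non-zero in this sketch
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    q = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(max(q, 0.0))


def knn_distance(x, points, k=1):
    """Euclidean distance from x to its k-th nearest training point."""
    return sorted(math.dist(x, p) for p in points)[k - 1]


def ood_score(x, points, k=1):
    """Higher means more out-of-distribution. The sum is a placeholder rule."""
    mean, cov = fit_gaussian_2d(points)
    return mahalanobis_2d(x, mean, cov) + knn_distance(x, points, k)


train = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
print(ood_score((0.5, 0.5), train), ood_score((3.0, 3.0), train))
```

The Mahalanobis term captures the global shape of the training distribution, while the nearest-neighbor term reacts to local gaps in it, which is one plausible reason for pairing the two.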

[LG-50] A Survey of Transformer Enabled Time Series Synthesis

Link: https://arxiv.org/abs/2406.02322
Authors: Alexander Sommers, Logan Cummins, Sudip Mittal, Shahram Rahimi, Maria Seale, Joseph Jaboure, Thomas Arnold
Keywords: neural network continuing, transformer neural network, received much attention, image and language, neural network
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Generative AI has received much attention in the image and language domains, with the transformer neural network continuing to dominate the state of the art. Application of these models to time series generation is less explored, however, and is of great utility to machine learning, privacy preservation, and explainability research. The present survey identifies this gap at the intersection of the transformer, generative AI, and time series data, and reviews works in this sparsely populated subdomain. The reviewed works show great variety in approach, and have not yet converged on a conclusive answer to the problems the domain poses. GANs, diffusion models, state space models, and autoencoders were all encountered alongside or surrounding the transformers which originally motivated the survey. While too open a domain to offer conclusive insights, the works surveyed are quite suggestive, and several recommendations for best practice, and suggestions of valuable future work, are provided.

[LG-51] PeFAD: A Parameter-Efficient Federated Framework for Time Series Anomaly Detection

Link: https://arxiv.org/abs/2406.02318
Authors: Ronghui Xu, Hao Miao, Senzhang Wang, Philip S. Yu, Jianxin Wang
Keywords: mobile sensing techniques, time series, time series data, time series anomaly, series anomaly detection
Categories: Machine Learning (cs.LG); Databases (cs.DB); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: Accepted by SIGKDD 2024 (Research Track)

Abstract:With the proliferation of mobile sensing techniques, huge amounts of time series data are generated and accumulated in various domains, fueling plenty of real-world applications. In this setting, time series anomaly detection is practically important. It endeavors to identify deviant samples from the normal sample distribution in time series. Existing approaches generally assume that all the time series are available at a central location. However, we are witnessing the decentralized collection of time series due to the deployment of various edge devices. To bridge the gap between the decentralized time series data and the centralized anomaly detection algorithms, and in light of increasing privacy concerns, we propose a Parameter-efficient Federated Anomaly Detection framework named PeFAD. PeFAD is the first to employ a pre-trained language model (PLM) as the body of the client's local model, which can benefit from its cross-modality knowledge transfer capability. To reduce the communication overhead and local model adaptation cost, we propose a parameter-efficient federated training module such that clients only need to fine-tune small-scale parameters and transmit them to the server for update. PeFAD utilizes a novel anomaly-driven mask selection strategy to mitigate the impact of neglected anomalies during training. A knowledge distillation operation on a synthetic privacy-preserving dataset that is shared by all the clients is also proposed to address the data heterogeneity issue across clients. We conduct extensive evaluations on four real datasets, where PeFAD outperforms existing state-of-the-art baselines by up to 28.74%.

[LG-52] Generative Conditional Distributions by Neural (Entropic) Optimal Transport

Link: https://arxiv.org/abs/2406.02317
Authors: Bao Nguyen, Binh Nguyen, Hieu Trung Nguyen, Viet Anh Nguyen
Keywords: multiple instances, conditional distributions, conditional distribution learning, Learning conditional distributions, desired outcome
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 15 pages, 8 figures

Abstract:Learning conditional distributions is challenging because the desired outcome is not a single distribution but multiple distributions that correspond to multiple instances of the covariates. We introduce a novel neural entropic optimal transport method designed to effectively learn generative models of conditional distributions, particularly in scenarios characterized by limited sample sizes. Our method relies on the minimax training of two neural networks: a generative network parametrizing the inverse cumulative distribution functions of the conditional distributions and another network parametrizing the conditional Kantorovich potential. To prevent overfitting, we regularize the objective function by penalizing the Lipschitz constant of the network output. Our experiments on real-world datasets show the effectiveness of our algorithm compared to state-of-the-art conditional distribution learning techniques. Our implementation can be found at this https URL.

[LG-53] An Independence-promoting Loss for Music Generation with Language Models

Link: https://arxiv.org/abs/2406.02315
Authors: Jean-Marie Lemercier, Simon Rouard, Jade Copet, Yossi Adi, Alexandre Déffosez
Keywords: language modeling rely, discrete latent space, latent space learnt, Music generation schemes, joint distribution
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: Accepted to ICML 2024

Abstract:Music generation schemes using language modeling rely on a vocabulary of audio tokens, generally provided as codes in a discrete latent space learnt by an auto-encoder. Multi-stage quantizers are often employed to produce these tokens, so the decoding strategy used for token prediction must be adapted to account for multiple codebooks: either it should model the joint distribution over all codebooks, or fit the product of the codebook marginal distributions. Modelling the joint distribution requires a costly increase in the number of auto-regressive steps, while fitting the product of the marginals yields an inexact model unless the codebooks are mutually independent. In this work, we introduce an independence-promoting loss to regularize the auto-encoder used as the tokenizer in language models for music generation. The proposed loss is a proxy for mutual information based on the maximum mean discrepancy principle, applied in reproducing kernel Hilbert spaces. Our criterion is simple to implement and train, and it is generalizable to other multi-stream codecs. We show that it reduces the statistical dependence between codebooks during auto-encoding. This leads to an increase in the generated music quality when modelling the product of the marginal distributions, while generating audio much faster than the joint distribution model.
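
To make the "joint versus product of marginals" comparison concrete, here is a minimal sketch for two streams of discrete codes. It assumes a delta kernel (1 if two code pairs are identical, 0 otherwise), under which the squared maximum mean discrepancy reduces to a sum of squared probability differences. This counting version is not differentiable and is not the paper's training loss; the function name `independence_penalty` and the toy inputs are hypothetical.

```python
from collections import Counter


def independence_penalty(codes_a, codes_b):
    """Squared MMD (delta kernel) between the empirical joint distribution
    of two code streams and the product of their empirical marginals."""
    n = len(codes_a)
    joint = Counter(zip(codes_a, codes_b))
    marg_a = Counter(codes_a)
    marg_b = Counter(codes_b)
    total = 0.0
    for a in marg_a:
        for b in marg_b:
            p_joint = joint.get((a, b), 0) / n
            p_prod = (marg_a[a] / n) * (marg_b[b] / n)
            total += (p_joint - p_prod) ** 2
    return total


# Codebooks that vary independently versus codebooks that always agree.
independent = independence_penalty([0, 0, 1, 1], [0, 1, 0, 1])
dependent = independence_penalty([0, 0, 1, 1], [0, 0, 1, 1])
print(independent, dependent)
```

The penalty is zero exactly when the empirical joint factorizes, which is the condition under which fitting the product of the marginals stops being an approximation.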

[LG-54] Disentangled Representation via Variational AutoEncoder for Continuous Treatment Effect Estimation

Link: https://arxiv.org/abs/2406.02310
Authors: Ruijing Cui, Jianbin Sun, Bingyu He, Kewei Yang, Bingfeng Ge
Keywords: holds significant practical, significant practical importance, estimation holds significant, assessment domains, holds significant
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Continuous treatment effect estimation holds significant practical importance across various decision-making and assessment domains, such as healthcare and the military. However, current methods for estimating dose-response curves hinge on balancing the entire representation by treating all covariates as confounding variables. Although various approaches disentangle covariates into different factors for treatment effect estimation, they are confined to binary treatment settings. Moreover, observational data are often tainted with non-causal noise information that is imperceptible to humans. Hence, in this paper, we propose a novel Dose-Response curve estimator based on a disentangled covariate representation learned via a Variational AutoEncoder (DRVAE). Our model is dedicated to disentangling covariates into instrumental factors, confounding factors, adjustment factors, and external noise factors, thereby facilitating the estimation of treatment effects under continuous treatment settings by balancing the disentangled confounding factors. Extensive results on synthetic and semi-synthetic datasets demonstrate that our model outperforms the current state-of-the-art methods.

[LG-55] Effects of Exponential Gaussian Distribution on (Double Sampling) Randomized Smoothing

Link: https://arxiv.org/abs/2406.02309
Authors: Youwei Shu, Xi Xiao, Derui Wang, Yuxin Cao, Siji Chen, Jason Xue, Linyi Li, Bo Li
Keywords: Sampling Randomized Smoothing, Randomized Smoothing, Double Sampling Randomized, method providing robustness, Exponential Standard Gaussian
Categories: Machine Learning (cs.LG)
Comments: ICML 2024 Poster

Abstract:Randomized Smoothing (RS) is currently a scalable certified defense method providing robustness certification against adversarial examples. Although significant progress has been achieved in providing defenses against $\ell_p$ adversaries, the interaction between the smoothing distribution and the robustness certification still remains vague. In this work, we comprehensively study the effect of two families of distributions, named Exponential Standard Gaussian (ESG) and Exponential General Gaussian (EGG) distributions, on Randomized Smoothing and Double Sampling Randomized Smoothing (DSRS). We derive an analytic formula for ESG's certified radius, which converges to the original formula of RS as the dimension $d$ increases. Additionally, we prove that EGG can provide tighter constant factors than DSRS in providing $\Omega(\sqrt{d})$ lower bounds of the $\ell_2$ certified radius, and thus further addresses the curse of dimensionality in RS. Our experiments on real-world datasets confirm our theoretical analysis of the ESG distributions, that they provide almost the same certification under different exponents $\eta$ for both RS and DSRS. In addition, EGG

[LG-56] Learning-Rate-Free Stochastic Optimization over Riemannian Manifolds

Link: https://arxiv.org/abs/2406.02296
Authors: Daniel Dodd, Louis Sharrock, Christopher Nemeth
Keywords: recent years, interest in gradient-based, optimization over Riemannian, Riemannian manifolds, gradient-based optimization
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: ICML 2024

Abstract:In recent years, interest in gradient-based optimization over Riemannian manifolds has surged. However, a significant challenge lies in the reliance on hyperparameters, especially the learning rate, which requires meticulous tuning by practitioners to ensure convergence at a suitable rate. In this work, we introduce innovative learning-rate-free algorithms for stochastic optimization over Riemannian manifolds, eliminating the need for hand-tuning and providing a more robust and user-friendly approach. We establish high probability convergence guarantees that are optimal, up to logarithmic factors, compared to the best-known optimally tuned rate in the deterministic setting. Our approach is validated through numerical experiments, demonstrating competitive performance against learning-rate-dependent algorithms.

[LG-57] How to Explore with Belief: State Entropy Maximization in POMDPs

Link: https://arxiv.org/abs/2406.02295
Authors: Riccardo Zamboni, Duilio Cirino, Marcello Restelli, Mirco Mutti
Keywords: inducing high entropy, Recent works, policy inducing high, state entropy maximization, works have studied
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent works have studied state entropy maximization in reinforcement learning, in which the agent's objective is to learn a policy inducing high entropy over state visitation (Hazan et al., 2019). They typically assume full observability of the state of the system, so that the entropy of the observations is maximized. In practice, the agent may only get partial observations, e.g., a robot perceiving the state of a physical space through proximity sensors and cameras. A significant mismatch between the entropy over observations and true states of the system can arise in those settings. In this paper, we address the problem of entropy maximization over the true states with a decision policy conditioned on partial observations only. The latter is a generalization of POMDPs, which is intractable in general. We develop a memory and computationally efficient policy gradient method to address a first-order relaxation of the objective defined on belief states, providing various formal characterizations of approximation gaps, the optimization landscape, and the hallucination problem. This paper aims to generalize state entropy maximization to more realistic domains that meet the challenges of applications.
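
The mismatch between entropy over observations and entropy over true states can be seen in a toy example. The sketch below assumes a hypothetical environment with four states that alias onto two observations; the mapping `observe` and both trajectories are made up for illustration and do not come from the paper.

```python
import math
from collections import Counter


def entropy(samples):
    """Empirical Shannon entropy (in nats) of a sequence of discrete samples."""
    n = len(samples)
    return -sum((c / n) * math.log(c / n) for c in Counter(samples).values())


# Hypothetical aliasing: states 0 and 2 look like "a", states 1 and 3 like "b".
observe = {0: "a", 1: "b", 2: "a", 3: "b"}

narrow = [0, 1, 0, 1, 0, 1, 0, 1]  # visits only half of the state space
broad = [0, 1, 2, 3, 0, 1, 2, 3]   # visits every state equally often

for name, states in (("narrow", narrow), ("broad", broad)):
    observations = [observe[s] for s in states]
    print(name, entropy(observations), entropy(states))
```

Both trajectories reach the maximum observation entropy of log 2, yet only the second one reaches the maximum state entropy of log 4, so an agent that maximizes observation entropy alone cannot tell them apart.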

[LG-58] Smaller Batches Bigger Gains? Investigating the Impact of Batch Sizes on Reinforcement Learning Based Real-World Production Scheduling

Link: https://arxiv.org/abs/2406.02294
Authors: Arthur Müller, Felix Grumbach, Matthia Sabatelli
Keywords: Reinforcement Learning, batch sizes, task in manufacturing, essential task, real-world production line
Categories: Machine Learning (cs.LG)
Comments: This paper was accepted at the ETFA 2024 conference

Abstract:Production scheduling is an essential task in manufacturing, with Reinforcement Learning (RL) emerging as a key solution. In a previous work, RL was utilized to solve an extended permutation flow shop scheduling problem (PFSSP) for a real-world production line with two stages, linked by a central buffer. The RL agent was trained to sequence equally sized product batches to minimize setup efforts and idle times. However, the substantial impact caused by varying the size of these product batches has not yet been explored. In this follow-up study, we investigate the effects of varying batch sizes, exploring both the quality of solutions and the training dynamics of the RL agent. The results demonstrate that it is possible to methodically identify reasonable boundaries for the batch size. These boundaries are determined on one side by the increasing sample complexity associated with smaller batch sizes, and on the other side by the decreasing flexibility of the agent when dealing with larger batch sizes. This gives the practitioner the ability to make an informed decision regarding the selection of an appropriate batch size. Moreover, we introduce and investigate two new curriculum learning strategies to enable training with small batch sizes. The findings of this work offer the potential for application in several industrial use cases with comparable scheduling problems.

[LG-59] An Axiomatic Approach to Loss Aggregation and an Adapted Aggregating Algorithm

Link: https://arxiv.org/abs/2406.02292
Authors: Armando J. Cabrera Pacheco, Rabanus Derr, Robert C. Williamson
Keywords: risk minimization framework, expected risk minimization, Supervised learning, minimization framework, expected risk
Categories: Machine Learning (cs.LG)
Comments: 31 pages

Abstract:Supervised learning has gone beyond the expected risk minimization framework. Central to most of these developments is the introduction of more general aggregation functions for losses incurred by the learner. In this paper, we turn towards online learning under expert advice. Via easily justified assumptions we characterize a set of reasonable loss aggregation functions as quasi-sums. Based upon this insight, we suggest a variant of the Aggregating Algorithm tailored to these more general aggregation functions. This variant inherits most of the nice theoretical properties of the AA, such as recovery of Bayes’ updating and a time-independent bound on quasi-sum regret. Finally, we argue that generalized aggregations express the attitude of the learner towards losses.
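
The abstract does not define a quasi-sum. Assuming the usual functional-equations meaning, an aggregation of the form f^{-1}(f(l_1) + ... + f(l_n)) for a continuous, strictly monotone f, a minimal sketch looks like this; the helper name `quasi_sum` and the example generators are illustrative choices, not taken from the paper.

```python
import math


def quasi_sum(losses, f, f_inv):
    """Aggregate losses as f_inv(sum of f(loss)); f is assumed to be
    continuous and strictly monotone on the range of the losses."""
    return f_inv(sum(f(loss) for loss in losses))


# f(x) = x recovers the ordinary sum of losses.
plain = quasi_sum([1.0, 2.0, 3.0], lambda x: x, lambda s: s)

# f(x) = x^2 on non-negative losses weights large losses more heavily.
quadratic = quasi_sum([3.0, 4.0], lambda x: x * x, math.sqrt)

print(plain, quadratic)
```

Different choices of f change how strongly large individual losses dominate the aggregate, which is one way to read the abstract's remark that the aggregation expresses the learner's attitude towards losses.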

[LG-60] A Study of Optimizations for Fine-tuning Large Language Models

Link: https://arxiv.org/abs/2406.02290
Authors: Arjun Singh, Nikhil Pandey, Anup Shirgaonkar, Pavan Manoj, Vijay Aski
Keywords: Fine-tuning, popular choice, Low Rank Adaptation, specific applications, context length
Categories: Machine Learning (cs.LG)
Comments: 10 pages, 4 figures

Abstract:Fine-tuning large language models is a popular choice among users trying to adapt them for specific applications. However, fine-tuning these models is a demanding task because the user has to examine several factors, such as resource budget, runtime, model size and context length, among others. A specific challenge is that fine-tuning is memory intensive, imposing constraints on the required hardware memory and the context length of training data that can be handled. In this work, we share a detailed study on a variety of fine-tuning optimizations across different fine-tuning scenarios. In particular, we assess Gradient Checkpointing, Low Rank Adaptation, DeepSpeed's ZeRO Redundancy Optimizer and Flash Attention. With a focus on memory and runtime, we examine the impact of different optimization combinations on GPU memory usage and execution runtime during the fine-tuning phase. We provide recommendations on the best default optimization for balancing memory and runtime across diverse model sizes. We share effective strategies for fine-tuning very large models with tens or hundreds of billions of parameters and enabling large context lengths during fine-tuning. Furthermore, we propose appropriate optimization mixtures for fine-tuning under GPU resource limitations.

[LG-61] Test-Time Regret Minimization in Meta Reinforcement Learning

Link: https://arxiv.org/abs/2406.02282
Authors: Mirco Mutti, Aviv Tamar
Keywords: Meta reinforcement learning, Meta reinforcement, reinforcement learning sets, test task efficiently, reinforcement learning
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Meta reinforcement learning sets a distribution over a set of tasks on which the agent can train at will, then is asked to learn an optimal policy for any test task efficiently. In this paper, we consider a finite set of tasks modeled through Markov decision processes with various dynamics. We assume to have endured a long training phase, from which the set of tasks is perfectly recovered, and we focus on regret minimization against the optimal policy in the unknown test task. Under a separation condition that states the existence of a state-action pair revealing a task against another, Chen et al. (2022) show that $O(M^2 \log(H))$ regret can be achieved, where $M$ and $H$ are the number of tasks in the set and the number of test episodes, respectively. In our first contribution, we demonstrate that the latter rate is nearly optimal by developing a novel lower bound for test-time regret minimization under separation, showing that a linear dependence with $M$ is unavoidable. Then, we present a family of stronger yet reasonable assumptions beyond separation, which we call strong identifiability, enabling algorithms achieving fast rates $\log(H)$ and sublinear dependence with $M$ simultaneously. Our paper provides a new understanding of the statistical barriers of test-time regret minimization and when fast rates can be achieved.

[LG-62] Analyzing the Benefits of Prototypes for Semi-Supervised Category Learning

Link: https://arxiv.org/abs/2406.02268
Authors: Liyi Zhang, Logan Nelson, Thomas L. Griffiths
Keywords: levels of abstraction, typical members, members to remembering, remembering all observed, observed exemplars
Categories: Machine Learning (cs.LG)
Comments: 7 pages, 3 figures

Abstract:Categories can be represented at different levels of abstraction, from prototypes focused on the most typical members to remembering all observed exemplars of the category. These representations have been explored in the context of supervised learning, where stimuli are presented with known category labels. We examine the benefits of prototype-based representations in a less-studied domain: semi-supervised learning, where agents must form unsupervised representations of stimuli before receiving category labels. We study this problem in a Bayesian unsupervised learning model called a variational auto-encoder, and we draw on recent advances in machine learning to implement a prior that encourages the model to use abstract prototypes to represent data. We apply this approach to image datasets and show that forming prototypes can improve semi-supervised category learning. Additionally, we study the latent embeddings of the models and show that these prototypes allow the models to form clustered representations without supervision, contributing to their success in downstream categorization performance.

[LG-63] Reinforcement Learning with Lookahead Information

Link: https://arxiv.org/abs/2406.02258
Authors: Nadav Merlis
Keywords: study reinforcement learning, study reinforcement, current state, state before deciding, deciding which action
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:We study reinforcement learning (RL) problems in which agents observe the reward or transition realizations at their current state before deciding which action to take. Such observations are available in many applications, including transactions, navigation and more. When the environment is known, previous work shows that this lookahead information can drastically increase the collected reward. However, outside of specific applications, existing approaches for interacting with unknown environments are not well-adapted to these observations. In this work, we close this gap and design provably-efficient learning algorithms able to incorporate lookahead information. To achieve this, we perform planning using the empirical distribution of the reward and transition observations, in contrast to vanilla approaches that only rely on estimated expectations. We prove that our algorithms achieve tight regret versus a baseline that also has access to lookahead information - linearly increasing the amount of collected reward compared to agents that cannot handle lookahead information.
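The difference between planning with estimated expectations and planning with the full reward distribution can be seen in a toy, single-step setting. The sketch below is not the paper's algorithm: the function names are hypothetical, and it assumes a single state, a known environment, and independent discrete reward distributions per action, each given as a list of (reward, probability) pairs.

```python
from itertools import product


def value_without_lookahead(reward_dists):
    """Best expected reward when the agent must commit before seeing rewards.

    reward_dists: one list of (reward, probability) pairs per action (assumed format).
    """
    return max(sum(p * r for r, p in dist) for dist in reward_dists)


def value_with_reward_lookahead(reward_dists):
    """Expected reward when the agent sees every action's realized reward first.

    Enumerates all joint outcomes (rewards assumed independent across actions)
    and takes the best realized reward in each outcome.
    """
    total = 0.0
    for outcome in product(*reward_dists):
        prob = 1.0
        for _, p in outcome:
            prob *= p
        total += prob * max(r for r, _ in outcome)
    return total
```

For two actions that each pay 0 or 1 with probability 0.5, the first function returns 0.5 and the second returns 0.75, which is why working only with expected rewards undervalues lookahead in this toy case.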

[LG-64] Description Boosting for Zero-Shot Entity and Relation Classification

Link: https://arxiv.org/abs/2406.02245
Authors: Gabriele Picco, Leopold Fuchs, Marcos Martínez Galindo, Alberto Purpura, Vanessa López, Hoang Thanh Lam
Keywords: annotate input text, input text data, leverage available external, external information, information of unseen
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:

Abstract:Zero-shot entity and relation classification models leverage available external information of unseen classes – e.g., textual descriptions – to annotate input text data. Thanks to the minimum data requirement, Zero-Shot Learning (ZSL) methods have high value in practice, especially in applications where labeled data is scarce. Even though recent research in ZSL has demonstrated significant results, our analysis reveals that those methods are sensitive to provided textual descriptions of entities (or relations). Even a minor modification of descriptions can lead to a change in the decision boundary between entity (or relation) classes. In this paper, we formally define the problem of identifying effective descriptions for zero shot inference. We propose a strategy for generating variations of an initial description, a heuristic for ranking them and an ensemble method capable of boosting the predictions of zero-shot models through description enhancement. Empirical results on four different entity and relation classification datasets show that our proposed method outperform existing approaches and achieve new SOTA results on these datasets under the ZSL settings. The source code of the proposed solutions and the evaluation framework are open-sourced.

[LG-65] On the Limitations of Fractal Dimension as a Measure of Generalization

Link: https://arxiv.org/abs/2406.02234
Authors: Charlie Tan, Inés García-Redondo, Qiquan Wang, Michael M. Bronstein, Anthea Monod
Keywords: central open problem, Bounding and predicting, overparameterized neural networks, neural networks remains, theoretical machine learning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Dynamical Systems (math.DS); Machine Learning (stat.ML)
Comments: 17 pages, 6 figures

Abstract:Bounding and predicting the generalization gap of overparameterized neural networks remains a central open problem in theoretical machine learning. Neural network optimization trajectories have been proposed to possess fractal structure, leading to bounds and generalization measures based on notions of fractal dimension on these trajectories. Prominently, both the Hausdorff dimension and the persistent homology dimension have been proposed to correlate with generalization gap, thus serving as a measure of generalization. This work performs an extended evaluation of these topological generalization measures. We demonstrate that fractal dimension fails to predict generalization of models trained from poor initializations. We further identify that the \ell^2 norm of the final parameter iterate, one of the simplest complexity measures in learning theory, correlates more strongly with the generalization gap than these notions of fractal dimension. Finally, our study reveals the intriguing manifestation of model-wise double descent in persistent homology-based generalization measures. This work lays the ground for a deeper investigation of the causal relationships between fractal geometry, topological data analysis, and neural network optimization.

[LG-66] SMCL: Saliency Masked Contrastive Learning for Long-tailed Recognition

Link: https://arxiv.org/abs/2406.02223
Authors: Sanglee Park, Seung-won Hwang, Jungmin So
Keywords: Real-world data, high imbalance, Real-world, contrastive learning, classes
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted at ICASSP 2023

Abstract:Real-world data often follow a long-tailed distribution with a high imbalance in the number of samples between classes. The problem with training from imbalanced data is that some background features, common to all classes, can be unobserved in classes with scarce samples. As a result, this background correlates to biased predictions into "major" classes. In this paper, we propose saliency masked contrastive learning, a new method that uses saliency masking and contrastive learning to mitigate the problem and improve the generalizability of a model. Our key idea is to mask the important part of an image using saliency detection and use contrastive learning to move the masked image towards minor classes in the feature space, so that background features present in the masked image are no longer correlated with the original class. Experiment results show that our method achieves state-of-the-art level performance on benchmark long-tailed datasets.

[LG-67] SLTrain: a sparse plus low-rank approach for parameter and memory efficient pretraining

Link: https://arxiv.org/abs/2406.02214
Authors: Andi Han, Jiaxiang Li, Wei Huang, Mingyi Hong, Akiko Takeda, Pratik Jawanpuria, Bamdev Mishra
Keywords: Large language models, shown impressive capabilities, Large language, shown impressive, impressive capabilities
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Large language models (LLMs) have shown impressive capabilities across various tasks. However, training LLMs from scratch requires significant computational power and extensive memory capacity. Recent studies have explored low-rank structures on weights for efficient fine-tuning in terms of parameters and memory, either through low-rank adaptation or factorization. While effective for fine-tuning, low-rank structures are generally less suitable for pretraining because they restrict parameters to a low-dimensional subspace. In this work, we propose to parameterize the weights as a sum of low-rank and sparse matrices for pretraining, which we call SLTrain. The low-rank component is learned via matrix factorization, while for the sparse component, we employ a simple strategy of uniformly selecting the sparsity support at random and learning only the non-zero entries with the fixed support. While being simple, the random fixed-support sparse learning strategy significantly enhances pretraining when combined with low-rank learning. Our results show that SLTrain adds minimal extra parameters and memory costs compared to pretraining with low-rank parameterization, yet achieves substantially better performance, which is comparable to full-rank training. Remarkably, when combined with quantization and per-layer updates, SLTrain can reduce memory requirements by up to 73% when pretraining the LLaMA 7B model.
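The sparse plus low-rank parameterization described in this abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the Gaussian initialization with scale 0.02, and the flat-index encoding of the support are all hypothetical choices. The weight is W = BA + S, where the support of S is drawn uniformly at random once and then kept fixed, so only B, A, and the non-zero values of S would be trained.

```python
import random


def make_sltrain_weight(d_out, d_in, rank, density, seed=0):
    """Create low-rank factors B, A and a sparse term with a fixed random support."""
    rng = random.Random(seed)
    # Low-rank factors: B is d_out x rank, A is rank x d_in (init scale is an assumption).
    B = [[rng.gauss(0.0, 0.02) for _ in range(rank)] for _ in range(d_out)]
    A = [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(rank)]
    # Sparse support: flat indices chosen uniformly at random, never updated afterwards.
    n_nonzero = int(density * d_out * d_in)
    support = rng.sample(range(d_out * d_in), n_nonzero)
    values = [rng.gauss(0.0, 0.02) for _ in support]
    return B, A, support, values


def dense_weight(B, A, support, values):
    """Materialize W = B A + S as a dense list of rows."""
    d_out, rank, d_in = len(B), len(A), len(A[0])
    W = [[sum(B[i][k] * A[k][j] for k in range(rank)) for j in range(d_in)]
         for i in range(d_out)]
    for flat, v in zip(support, values):
        W[flat // d_in][flat % d_in] += v
    return W


def trainable_parameter_count(B, A, values):
    """Parameters that would be learned: both factors plus the non-zero sparse entries."""
    return len(B) * len(B[0]) + len(A) * len(A[0]) + len(values)
```

With d_out = 8, d_in = 6, rank = 2, and density = 0.25, this stores 16 + 12 + 12 = 40 trainable values instead of the 48 entries of a dense matrix; the savings grow with larger layers and smaller rank and density.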

[LG-68] Rectifying Reinforcement Learning for Reward Matching

Link: https://arxiv.org/abs/2406.02213
Authors: Haoran He, Emmanuel Bengio, Qingpeng Cai, Ling Pan
Keywords: Generative Flow Network, Flow Network, Generative Flow, unnormalized reward function, probabilistic framework
Categories: Machine Learning (cs.LG)
Comments:

Abstract:The Generative Flow Network (GFlowNet) is a probabilistic framework in which an agent learns a stochastic policy and flow functions to sample objects with probability proportional to an unnormalized reward function. GFlowNets share a strong resemblance to reinforcement learning (RL), that typically aims to maximize reward, due to their sequential decision-making processes. Recent works have studied connections between GFlowNets and maximum entropy (MaxEnt) RL, which modifies the standard objective of RL agents by learning an entropy-regularized objective. However, a critical theoretical gap persists: despite the apparent similarities in their sequential decision-making nature, a direct link between GFlowNets and standard RL has yet to be discovered, while bridging this gap could further unlock the potential of both fields. In this paper, we establish a new connection between GFlowNets and policy evaluation for a uniform policy. Surprisingly, we find that the resulting value function for the uniform policy has a close relationship to the flows in GFlowNets. Leveraging these insights, we further propose a novel rectified policy evaluation (RPE) algorithm, which achieves the same reward-matching effect as GFlowNets, offering a new perspective. We compare RPE, MaxEnt RL, and GFlowNets in a number of benchmarks, and show that RPE achieves competitive results compared to previous approaches. This work sheds light on the previously unexplored connection between (non-MaxEnt) RL and GFlowNets, potentially opening new avenues for future research in both fields.

[LG-69] The Deep Latent Space Particle Filter for Real-Time Data Assimilation with Uncertainty Quantification

Link: https://arxiv.org/abs/2406.02204
Authors: Nikolaj T. Mücke, Sander M. Bohté, Cornelis W. Oosterlee
Keywords: observations are fused, fused with simulations, simulations to obtain, state and parameters, Latent Space
Categories: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:In Data Assimilation, observations are fused with simulations to obtain an accurate estimate of the state and parameters for a given physical system. Combining data with a model, however, while accurately estimating uncertainty, is computationally expensive and infeasible to run in real-time for complex systems. Here, we present a novel particle filter methodology, the Deep Latent Space Particle filter or D-LSPF, that uses neural network-based surrogate models to overcome this computational challenge. The D-LSPF enables filtering in the low-dimensional latent space obtained using Wasserstein AEs with modified vision transformer layers for dimensionality reduction and transformers for parameterized latent space time stepping. As we demonstrate on three test cases, including leak localization in multi-phase pipe flow and seabed identification for fully nonlinear water waves, the D-LSPF runs orders of magnitude faster than a high-fidelity particle filter and 3-5 times faster than alternative methods while being up to an order of magnitude more accurate. The D-LSPF thus enables real-time data assimilation with uncertainty quantification for physical systems.

[LG-70] Fast and Scalable Multi-Kernel Encoder Classifier

Link: https://arxiv.org/abs/2406.02189
Authors: Cencheng Shen
Keywords: leveraging recent progress, graph embedding techniques, viewing kernel matrices, generalized graphs, paper introduces
Categories: Machine Learning (cs.LG)
Comments: 12 pages main + 3 pages appendix

Abstract:This paper introduces a new kernel-based classifier by viewing kernel matrices as generalized graphs and leveraging recent progress in graph embedding techniques. The proposed method facilitates fast and scalable kernel matrix embedding, and seamlessly integrates multiple kernels to enhance the learning process. Our theoretical analysis offers a population-level characterization of this approach using random variables. Empirically, our method demonstrates superior running time compared to standard approaches such as support vector machines and two-layer neural network, while achieving comparable classification accuracy across various simulated and real datasets.

[LG-71] DNCs Require More Planning Steps

Link: https://arxiv.org/abs/2406.02187
Authors: Yara Shamshoum, Nitzan Hodos, Yuval Sieradzki, Assaf Schuster
Keywords: machine learning models, complex algorithmic problems, machine learning, Differentiable Neural Computer, learning models
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Many recent works use machine learning models to solve various complex algorithmic problems. However, these models attempt to reach a solution without considering the problem’s required computational complexity, which can be detrimental to their ability to solve it correctly. In this work we investigate the effect of computational time and memory on generalization of implicit algorithmic solvers. To do so, we focus on the Differentiable Neural Computer (DNC), a general problem solver that also lets us reason directly about its usage of time and memory. In this work, we argue that the number of planning steps the model is allowed to take, which we call “planning budget”, is a constraint that can cause the model to generalize poorly and hurt its ability to fully utilize its external memory. We evaluate our method on Graph Shortest Path, Convex Hull, Graph MinCut and Associative Recall, and show how the planning budget can drastically change the behavior of the learned algorithm, in terms of learned time complexity, training time, stability and generalization to inputs larger than those seen during training.

[LG-72] On The Statistical Representation Properties Of The Perturb-Softmax And The Perturb-Argmax Probability Distributions

Link: https://arxiv.org/abs/2406.02180
Authors: Hedda Cohen Indelman, Tamir Hazan
Keywords: learning discrete tokens, learning discrete structures, learning discrete, Gumbel-Softmax probability distribution, Gumbel-Argmax probability distribution
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:

Abstract:The Gumbel-Softmax probability distribution allows learning discrete tokens in generative learning, while the Gumbel-Argmax probability distribution is useful in learning discrete structures in discriminative learning. Despite the efforts invested in optimizing these probability models, their statistical properties are under-explored. In this work, we investigate their representation properties and determine for which families of parameters these probability distributions are complete, i.e., can represent any probability distribution, and minimal, i.e., can represent a probability distribution uniquely. We rely on convexity and differentiability to determine these statistical conditions and extend this framework to general probability models, such as Gaussian-Softmax and Gaussian-Argmax. We experimentally validate the qualities of these extensions, which enjoy a faster convergence rate. We conclude the analysis by identifying two sets of parameters that satisfy these assumptions and thus admit a complete and minimal representation. Our contribution is theoretical with supporting practical evaluation.
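For readers unfamiliar with the two distributions named here, the sketch below shows how samples are typically drawn with Gumbel perturbations. It is a generic illustration, not code from the paper: the function names and the `temperature` parameter are hypothetical. It relies on the standard Gumbel-max fact that the argmax of the parameters plus independent standard Gumbel noise is distributed according to the softmax of the parameters.

```python
import math
import random


def sample_gumbel(rng):
    """Draw standard Gumbel noise via the inverse CDF: g = -log(-log(u))."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep u strictly inside (0, 1)
    return -math.log(-math.log(u))


def perturb_argmax(theta, rng):
    """Gumbel-Argmax: index of the largest perturbed parameter (a discrete sample)."""
    perturbed = [t + sample_gumbel(rng) for t in theta]
    return max(range(len(theta)), key=perturbed.__getitem__)


def perturb_softmax(theta, rng, temperature=1.0):
    """Gumbel-Softmax: a point on the simplex that relaxes the discrete sample."""
    perturbed = [(t + sample_gumbel(rng)) / temperature for t in theta]
    m = max(perturbed)  # subtract the max for numerical stability
    exps = [math.exp(p - m) for p in perturbed]
    z = sum(exps)
    return [e / z for e in exps]


def softmax(theta):
    """Plain softmax, used to check the argmax sampling frequencies."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]
```

Drawing many samples with `perturb_argmax` and comparing the empirical frequencies with `softmax(theta)` is a quick sanity check; lowering `temperature` in `perturb_softmax` pushes the relaxed samples toward one-hot vectors.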

[LG-73] One-Shot Federated Learning with Bayesian Pseudocoresets

Link: https://arxiv.org/abs/2406.02177
Authors: Tim d’Hondt, Mykola Pechenizkiy, Robert Peharz
Keywords: Optimization-based techniques, high dimensional model, dimensional model parameters, techniques for federated, high dimensional
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (stat.ML)
Comments: 10 pages

Abstract:Optimization-based techniques for federated learning (FL) often come with prohibitive communication cost, as high dimensional model parameters need to be communicated repeatedly between server and clients. In this paper, we follow a Bayesian approach allowing to perform FL with one-shot communication, by solving the global inference problem as a product of local client posteriors. For models with multi-modal likelihoods, such as neural networks, a naive application of this scheme is hampered, since clients will capture different posterior modes, causing a destructive collapse of the posterior on the server side. Consequently, we explore approximate inference in the function-space representation of client posteriors, hence suffering less or not at all from multi-modality. We show that distributed function-space inference is tightly related to learning Bayesian pseudocoresets and develop a tractable Bayesian FL algorithm on this insight. We show that this approach achieves prediction performance competitive to state-of-the-art while showing a striking reduction in communication cost of up to two orders of magnitude. Moreover, due to its Bayesian nature, our method also delivers well-calibrated uncertainty estimates.

[LG-74] AROMA: Preserving Spatial Structure for Latent PDE Modeling with Local Neural Fields

Link: https://arxiv.org/abs/2406.02176
Authors: Louis Serrano, Thomas X Wang, Etienne Le Naour, Jean-Noël Vittaut, Patrick Gallinari
Keywords: Attentive Reduced Order, Reduced Order Model, Attentive Reduced, Model with Attention, Reduced Order
Categories: Machine Learning (cs.LG)
Comments:

Abstract:We present AROMA (Attentive Reduced Order Model with Attention), a framework designed to enhance the modeling of partial differential equations (PDEs) using local neural fields. Our flexible encoder-decoder architecture can obtain smooth latent representations of spatial physical fields from a variety of data types, including irregular-grid inputs and point clouds. This versatility eliminates the need for patching and allows efficient processing of diverse geometries. The sequential nature of our latent representation can be interpreted spatially and permits the use of a conditional transformer for modeling the temporal dynamics of PDEs. By employing a diffusion-based formulation, we achieve greater stability and enable longer rollouts compared to conventional MSE training. AROMA’s superior performance in simulating 1D and 2D equations underscores the efficacy of our approach in capturing complex dynamical behaviors.

[LG-75] Branches: A Fast Dynamic Programming and Branch & Bound Algorithm for Optimal Decision Trees

Link: https://arxiv.org/abs/2406.02175
Authors: Ayman Chaouki, Jesse Read, Albert Bifet
Keywords: Interpretable Machine Learning, Decision Tree Learning, Machine Learning, Interpretable Machine, formidable optimization challenge
Categories: Machine Learning (cs.LG)
Comments: This preprint is currently under review

Abstract:Decision Tree Learning is a fundamental problem for Interpretable Machine Learning, yet it poses a formidable optimization challenge. Despite numerous efforts dating back to the early 1990’s, practical algorithms have only recently emerged, primarily leveraging Dynamic Programming (DP) and Branch & Bound (B&B) techniques. These breakthroughs led to the development of two distinct approaches. Algorithms like DL8.5 and MurTree operate on the space of nodes (or branches), they are very fast, but do not penalise complex Decision Trees, i.e. they do not solve for sparsity. On the other hand, algorithms like OSDT and GOSDT operate on the space of Decision Trees, they solve for sparsity but at the detriment of speed. In this work, we introduce Branches, a novel algorithm that integrates the strengths of both paradigms. Leveraging DP and B&B, Branches achieves exceptional speed while also solving for sparsity. Central to its efficiency is a novel analytical bound enabling substantial pruning of the search space. Theoretical analysis demonstrates that Branches has lower complexity compared to state-of-the-art methods, a claim validated through extensive empirical evaluation. Our results illustrate that Branches not only greatly outperforms existing approaches in terms of speed and number of iterations, it also consistently yields optimal Decision Trees.

[LG-76] Learning the Hodgkin-Huxley Model with Operator Learning Techniques

Link: https://arxiv.org/abs/2406.02173
Authors: Edoardo Centofanti, Massimiliano Ghiotto, Luca F. Pavarino
Keywords: Fourier Neural Operator, Wavelet Neural Operator, Huxley ionic model, time-dependent applied current, operator learning architectures
Categories: Numerical Analysis (math.NA); Machine Learning (cs.LG)
Comments: 24 pages, 8 figures

Abstract:We construct and compare three operator learning architectures, DeepONet, Fourier Neural Operator, and Wavelet Neural Operator, in order to learn the operator mapping a time-dependent applied current to the transmembrane potential of the Hodgkin-Huxley ionic model. The underlying non-linearity of the Hodgkin-Huxley dynamical system, the stiffness of its solutions, and the threshold dynamics depending on the intensity of the applied current, are some of the challenges to address when exploiting artificial neural networks to learn this class of complex operators. By properly designing these operator learning techniques, we demonstrate their ability to effectively address these challenges, achieving a relative L2 error as low as 1.4% in learning the solutions of the Hodgkin-Huxley ionic model.

[LG-77] SaVeR: Optimal Data Collection Strategy for Safe Policy Evaluation in Tabular MDP

Link: https://arxiv.org/abs/2406.02165
Authors: Subhojyoti Mukherjee, Josiah P. Hanna, Robert Nowak
Keywords: Markov decision processes, tabular Markov decision, tabular Markov, Markov decision, policy evaluation
Categories: Machine Learning (cs.LG)
Comments:

Abstract:In this paper, we study safe data collection for the purpose of policy evaluation in tabular Markov decision processes (MDPs). In policy evaluation, we are given a target policy and asked to estimate the expected cumulative reward it will obtain. Policy evaluation requires data and we are interested in the question of what behavior policy should collect the data for the most accurate evaluation of the target policy. While prior work has considered behavior policy selection, in this paper, we additionally consider a safety constraint on the behavior policy. Namely, we assume there exists a known default policy that incurs a particular expected cost when run and we enforce that the cumulative cost of all behavior policies ran is better than a constant factor of the cost that would be incurred had we always run the default policy. We first show that there exists a class of intractable MDPs where no safe oracle algorithm with knowledge about problem parameters can efficiently collect data and satisfy the safety constraints. We then define the tractability condition for an MDP such that a safe oracle algorithm can efficiently collect data and using that we prove the first lower bound for this setting. We then introduce an algorithm SaVeR for this problem that approximates the safe oracle algorithm and bound the finite-sample mean squared error of the algorithm while ensuring it satisfies the safety constraint. Finally, we show in simulations that SaVeR produces low MSE policy evaluation while satisfying the safety constraint.

[LG-78] Radar Spectra-Language Model for Automotive Scene Parsing

Link: https://arxiv.org/abs/2406.02158
Authors: Mariia Pushkareva, Yuri Feldman, Csaba Domokos, Kilian Rambach, Dotan Di Castro
Keywords: Radar, radar spectra, low cost, sensors are low, spectra
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Radar sensors are low cost, long-range, and weather-resilient. Therefore, they are widely used for driver assistance functions, and are expected to be crucial for the success of autonomous driving in the future. In many perception tasks only pre-processed radar point clouds are considered. In contrast, radar spectra are a raw form of radar measurements and contain more information than radar point clouds. However, radar spectra are rather difficult to interpret. In this work, we aim to explore the semantic information contained in spectra in the context of automated driving, thereby moving towards better interpretability of radar spectra. To this end, we create a radar spectra-language model, allowing us to query radar spectra measurements for the presence of scene elements using free text. We overcome the scarcity of radar spectra data by matching the embedding space of an existing vision-language model (VLM). Finally, we explore the benefit of the learned representation for scene parsing, and obtain improvements in free space segmentation and object detection merely by injecting the spectra embedding into a baseline model.

[LG-79] Almost linear time differentially private release of synthetic graphs

Link: https://arxiv.org/abs/2406.02156
Authors: Jingcheng Liu, Jalaj Upadhyay, Zongrui Zou
Keywords: score function defined, large non-convex set, exponentially large non-convex, score function, non-convex set
Categories: Cryptography and Security (cs.CR); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
Comments:

Abstract:In this paper, we give an almost linear time and space algorithms to sample from an exponential mechanism with an \ell_1-score function defined over an exponentially large non-convex set. As a direct result, on input an n vertex m edges graph G, we present the first \widetilde{O}(m) time and O(m) space algorithms for differentially privately outputting an n vertex O(m) edges synthetic graph that approximates all the cuts and the spectrum of G. These are the first private algorithms for releasing synthetic graphs that nearly match this task’s time and space complexity in the non-private setting while achieving the same (or better) utility as the previous works in the more practical sparse regime. Additionally, our algorithms can be extended to private graph analysis under continual observation.

[LG-80] Activation Bottleneck: Sigmoidal Neural Networks Cannot Forecast a Straight Line

Link: https://arxiv.org/abs/2406.02146
Authors: Maximilian Toller, Hussain Hussain, Bernhard C Geiger
Keywords: bounded image, hidden layers, activation, activation bottleneck, LSTM and GRU
Categories: Machine Learning (cs.LG)
Comments:

Abstract:A neural network has an activation bottleneck if one of its hidden layers has a bounded image. We show that networks with an activation bottleneck cannot forecast unbounded sequences such as straight lines, random walks, or any sequence with a trend: The difference between prediction and ground truth becomes arbitrary large, regardless of the training procedure. Widely-used neural network architectures such as LSTM and GRU suffer from this limitation. In our analysis, we characterize activation bottlenecks and explain why they prevent sigmoidal networks from learning unbounded sequences. We experimentally validate our findings and discuss modifications to network architectures which mitigate the effects of activation bottlenecks.
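The core argument can be illustrated with a one-unit toy network. The sketch below is an assumption-laden illustration, not the paper's experimental setup: the function names and the weights `w1`, `b1`, `w2`, `b2` are hypothetical. Because the tanh hidden unit maps into [-1, 1], the output can never exceed |w2| + |b2| in magnitude, so the error against the straight line y = t grows without bound as t increases, whatever weights training produced.

```python
import math


def bottleneck_forecast(t, w1=0.5, b1=0.0, w2=10.0, b2=0.0):
    """One-hidden-unit network whose hidden layer (tanh) has a bounded image."""
    return w2 * math.tanh(w1 * t + b1) + b2


def output_bound(w2=10.0, b2=0.0):
    """Upper bound on |output|, which follows from |tanh(x)| <= 1."""
    return abs(w2) + abs(b2)


def forecast_errors(horizon):
    """Absolute error against the straight line y = t for t = 0, ..., horizon - 1."""
    return [abs(t - bottleneck_forecast(t)) for t in range(horizon)]
```

Changing the weights only moves the bound; for any fixed weights, the error at time t is at least t minus `output_bound(w2, b2)`, which is the sense in which the difference between prediction and ground truth becomes arbitrarily large.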

[LG-81] Optimality of Matrix Mechanism on \ell_p^p-metric

Link: https://arxiv.org/abs/2406.02140
Authors: Jingcheng Liu, Jalaj Upadhyay, Zongrui Zou
Keywords: differential privacy, answering linear queries, error metric, ell, linear queries
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:

Abstract:In this paper, we introduce the \ell_p^p-error metric (for p \geq 2) when answering linear queries under the constraint of differential privacy. We characterize such an error under (\epsilon,\delta)-differential privacy. Before this paper, tight characterization in the hardness of privately answering linear queries was known under \ell_2^2-error metric (Edmonds et al., STOC 2020) and \ell_p^2-error metric for unbiased mechanisms (Nikolov and Tang, ITCS 2024). As a direct consequence of our results, we give tight bounds on answering prefix sum and parity queries under differential privacy for all constant p in terms of the \ell_p^p error, generalizing the bounds in Henzinger et al. (SODA 2023) for p=2.

[LG-82] CondTSF: One-line Plugin of Dataset Condensation for Time Series Forecasting

Link: https://arxiv.org/abs/2406.02131
Authors: Jianrong Ding, Zhanyu Liu, Guanjie Zheng, Haiming Jin, Linghe Kong
Keywords: lower training costs, training deep neural, Dataset condensation, model trained, deep neural networks
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 23 pages, 13 figures

Abstract:Dataset condensation is a newborn technique that generates a small dataset that can be used in training deep neural networks to lower training costs. The objective of dataset condensation is to ensure that the model trained with the synthetic dataset can perform comparably to the model trained with full datasets. However, existing methods predominantly concentrate on classification tasks, posing challenges in their adaptation to time series forecasting (TS-forecasting). This challenge arises from disparities in the evaluation of synthetic data. In classification, the synthetic data is considered well-distilled if the model trained with the full dataset and the model trained with the synthetic dataset yield identical labels for the same input, regardless of variations in output logits distribution. Conversely, in TS-forecasting, the effectiveness of synthetic data distillation is determined by the distance between predictions of the two models. The synthetic data is deemed well-distilled only when all data points within the predictions are similar. Consequently, TS-forecasting has a more rigorous evaluation methodology compared to classification. To mitigate this gap, we theoretically analyze the optimization objective of dataset condensation for TS-forecasting and propose a new one-line plugin of dataset condensation designated as Dataset Condensation for Time Series Forecasting (CondTSF) based on our analysis. Plugging CondTSF into previous dataset condensation methods facilitates a reduction in the distance between the predictions of the model trained with the full dataset and the model trained with the synthetic dataset, thereby enhancing performance. We conduct extensive experiments on eight commonly used time series datasets. CondTSF consistently improves the performance of all previous dataset condensation methods across all datasets, particularly at low condensing ratios.

[LG-83] Iteration Head: A Mechanistic Study of Chain-of-Thought

Link: https://arxiv.org/abs/2406.02128
Authors: Vivien Cabannes, Charles Arnal, Wassim Bouaziz, Alice Yang, Francois Charton, Julia Kempe
Keywords: Large Language Models, improve Large Language, theoretical approximation power, Large Language, Language Models
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Chain-of-Thought (CoT) reasoning is known to improve Large Language Models both empirically and in terms of theoretical approximation power. However, our understanding of the inner workings of CoT capabilities and of the conditions under which they appear remains limited. This paper helps fill this gap by demonstrating how CoT reasoning emerges in transformers in a controlled and interpretable setting. In particular, we observe the appearance of a specialized attention mechanism dedicated to iterative reasoning, which we call “iteration heads”. We track both the emergence and the precise working of these iteration heads down to the attention level, and measure the transferability of the CoT skills to which they give rise between tasks.

[LG-84] CityLight: A Universal Model Towards Real-world City-scale Traffic Signal Control Coordination

Link: https://arxiv.org/abs/2406.02126
Authors: Jinwei Zeng, Chao Yu, Xinyi Yang, Wenxuan Ao, Jian Yuan, Yong Li, Yu Wang, Huazhong Yang
Keywords: promising low-cost measure, Traffic signal control, existing road infrastructure, affecting existing road, enhance transportation efficiency
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments:

Abstract:Traffic signal control (TSC) is a promising low-cost measure to enhance transportation efficiency without affecting existing road infrastructure. While various reinforcement learning-based TSC methods have been proposed and experimentally outperform conventional rule-based methods, none of them has been deployed in the real world. An essential gap lies in the oversimplification of the scenarios in terms of intersection heterogeneity and road network intricacy. To make TSC applicable in urban traffic management, we target TSC coordination in city-scale high-authenticity road networks, aiming to solve the three unique and important challenges: city-level scalability, heterogeneity of real-world intersections, and effective coordination among intricate neighbor connections. Since optimizing multiple agents in a parameter-sharing paradigm can boost the training efficiency and help achieve scalability, we propose our method, CityLight, based on the well-acknowledged optimization framework, parameter-sharing MAPPO. To ensure the unified policy network can learn to fit large-scale heterogeneous intersections and tackle the intricate between-neighbor coordination, CityLight proposes a universal representation module that consists of two key designs: heterogeneous intersection alignment and neighborhood impact alignment for coordination. To further boost coordination, CityLight adopts neighborhood-integrated rewards to transition from achieving local optimal to global optimal. Extensive experiments on datasets with hundreds to tens of thousands of real-world intersections and authentic traffic demands validate the surprising effectiveness and generalizability of CityLight, with an overall performance gain of 11.66% and a 22.59% improvement in transfer scenarios in terms of throughput.

[LG-85] Kernel vs. Kernel: Exploring How the Data Structure Affects Neural Collapse

Link: https://arxiv.org/abs/2406.02105
Authors: Vignesh Kothapalli, Tom Tirer
Keywords: training error point, error point, training neural network, vast amount, amount of literature
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (stat.ML)
Comments: 34 pages, 14 figures

Abstract:Recently, a vast amount of literature has focused on the “Neural Collapse” (NC) phenomenon, which emerges when training neural network (NN) classifiers beyond the zero training error point. The core component of NC is the decrease in the within class variability of the network’s deepest features, dubbed as NC1. The theoretical works that study NC are typically based on simplified unconstrained features models (UFMs) that mask any effect of the data on the extent of collapse. In this paper, we provide a kernel-based analysis that does not suffer from this limitation. First, given a kernel function, we establish expressions for the traces of the within- and between-class covariance matrices of the samples’ features (and consequently an NC1 metric). Then, we turn to focus on kernels associated with shallow NNs. First, we consider the NN Gaussian Process kernel (NNGP), associated with the network at initialization, and the complement Neural Tangent Kernel (NTK), associated with its training in the “lazy regime”. Interestingly, we show that the NTK does not represent more collapsed features than the NNGP for prototypical data models. As NC emerges from training, we then consider an alternative to NTK: the recently proposed adaptive kernel, which generalizes NNGP to model the feature mapping learned from the training data. Contrasting our NC1 analysis for these two kernels enables gaining insights into the effect of data distribution on the extent of collapse, which are empirically aligned with the behavior observed with practical training of NNs.
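To illustrate the quantity involved, here is a minimal sketch of an NC1-style measure built from the traces of the within- and between-class covariance matrices. The exact metric used in the paper is not given in the abstract, so the trace ratio below and the function name `nc1_trace_ratio` are assumptions.

```python
def nc1_trace_ratio(features, labels):
    """Return tr(Sigma_W) / tr(Sigma_B) for a list of feature vectors.

    Assumed form of an NC1-style metric: smaller values mean features of the
    same class sit closer to their class mean relative to the spread of the
    class means. Requires at least two distinct class means.
    """
    n, dim = len(features), len(features[0])
    global_mean = [sum(f[j] for f in features) / n for j in range(dim)]

    by_class = {}
    for f, y in zip(features, labels):
        by_class.setdefault(y, []).append(f)

    within, between = 0.0, 0.0
    for members in by_class.values():
        m = len(members)
        mean = [sum(f[j] for f in members) / m for j in range(dim)]
        # Trace of a covariance matrix equals the summed squared deviations.
        within += sum((f[j] - mean[j]) ** 2 for f in members for j in range(dim))
        between += m * sum((mean[j] - global_mean[j]) ** 2 for j in range(dim))
    return (within / n) / (between / n)


spread = nc1_trace_ratio([[0, 0], [0, 2], [4, 0], [4, 2]], [0, 0, 1, 1])
collapsed = nc1_trace_ratio([[0, 1], [0, 1], [4, 1], [4, 1]], [0, 0, 1, 1])
```

In the second call every feature coincides with its class mean, so the within-class trace and the ratio are both zero, which corresponds to full collapse.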

[LG-86] MaskSR: Masked Language Model for Full-band Speech Restoration

Link: https://arxiv.org/abs/2406.02092
Authors: Xu Li, Qirui Wang, Xiaoyu Liu
Keywords: diverse set, Speech, Speech restoration aims, MaskSR, restoration aims
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
Comments: Accepted by INTERSPEECH 2024. Demo page: this https URL

Abstract:Speech restoration aims at restoring high quality speech in the presence of a diverse set of distortions. Although several deep learning paradigms have been studied for this task, the power of the recently emerging language models has not been fully explored. In this paper, we propose MaskSR, a masked language model capable of restoring full-band 44.1 kHz speech jointly considering noise, reverb, clipping, and low bandwidth. MaskSR works with discrete acoustic tokens extracted using a pre-trained neural codec. During training, MaskSR is optimized to predict randomly masked tokens extracted from the high quality target speech, conditioned on the corrupted speech with various distortions. During inference, MaskSR reconstructs the target speech tokens with efficient iterative sampling. Extensive experiments show that MaskSR obtains competitive results on both the full-band speech restoration task and also on sub-tasks compared with a wide range of models.

[LG-87] FightLadder: A Benchmark for Competitive Multi-Agent Reinforcement Learning

Link: https://arxiv.org/abs/2406.02081
Authors: Wenzhe Li, Zihan Ding, Seth Karten, Chi Jin
Keywords: Recent advances, competitive MARL research, reinforcement learning, heavily rely, advances in reinforcement
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: ICML 2024

Abstract:Recent advances in reinforcement learning (RL) heavily rely on a variety of well-designed benchmarks, which provide environmental platforms and consistent criteria to evaluate existing and novel algorithms. Specifically, in multi-agent RL (MARL), a plethora of benchmarks based on cooperative games have spurred the development of algorithms that improve the scalability of cooperative multi-agent systems. However, for the competitive setting, a lightweight and open-sourced benchmark with challenging gaming dynamics and visual inputs has not yet been established. In this work, we present FightLadder, a real-time fighting game platform, to empower competitive MARL research. Along with the platform, we provide implementations of state-of-the-art MARL algorithms for competitive games, as well as a set of evaluation metrics to characterize the performance and exploitability of agents. We demonstrate the feasibility of this platform by training a general agent that consistently defeats 12 built-in characters in single-player mode, and expose the difficulty of training a non-exploitable agent without human knowledge and demonstrations in two-player mode. FightLadder provides meticulously designed environments to address critical challenges in competitive MARL research, aiming to catalyze a new era of discovery and advancement in the field. Videos and code at this https URL.

[LG-88] LongSSM: On the Length Extension of State-space Models in Language Modelling

Link: https://arxiv.org/abs/2406.02080
Authors: Shida Wang
Keywords: language modeling, Length extension, investigate the length-extension, Length, extension
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Dynamical Systems (math.DS)
Comments: 23 pages

Abstract:In this paper, we investigate the length extension of state-space models (SSMs) in language modeling. Length extension involves training models on short sequences and testing them on longer ones. We show that state-space models trained with zero hidden-state initialization have difficulty doing length extension. We explain this difficulty by pointing out that length extension is equivalent to polynomial extrapolation. Based on the theory, we propose a simple yet effective method - changing the hidden-state initialization scheme - to improve length extension. Moreover, our method shows that using a long training sequence length is beneficial but not necessary for length extension. Changing the hidden-state initialization enables the efficient training of long-memory models with a smaller training context length.
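As a rough illustration of where the hidden-state initialization enters, consider a scalar linear recurrence. This is a toy sketch under assumed parameters `a`, `b`, `c`, not the initialization scheme proposed in the paper.

```python
def run_ssm(inputs, a, b, c, h0=0.0):
    """Scalar linear state-space recurrence: h_t = a*h_{t-1} + b*x_t, y_t = c*h_t."""
    h = h0
    outputs = []
    for x in inputs:
        h = a * h + b * x
        outputs.append(c * h)
    return outputs


# With a zero initial state the output drifts for several steps, so what the
# model sees on short sequences differs from what it sees on long ones.
from_zero = run_ssm([1.0, 1.0, 1.0], a=0.5, b=1.0, c=1.0)

# Starting from a non-zero state (here the fixed point b*x/(1-a) = 2.0)
# removes that length-dependent transient in this toy case.
from_fixed_point = run_ssm([1.0, 1.0, 1.0], a=0.5, b=1.0, c=1.0, h0=2.0)
```

The only point of the example is that `h0` changes the outputs at every position, which is why the choice of initialization during training can matter for sequences longer than those seen in training.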

[LG-89] ReLU-KAN: New Kolmogorov-Arnold Networks that Only Need Matrix Addition, Dot Multiplication, and ReLU

Link: https://arxiv.org/abs/2406.02075
Authors: Qi Qiu, Tao Zhu, Helin Gong, Liming Chen, Huansheng Ning
Keywords: parallel computing capability, restricted parallel computing, Rectified Linear Unit, suffer from restricted, capability on GPUs
Categories: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments:

Abstract:Limited by the complexity of basis function (B-spline) calculations, Kolmogorov-Arnold Networks (KAN) suffer from restricted parallel computing capability on GPUs. This paper proposes a novel ReLU-KAN implementation that inherits the core idea of KAN. By adopting ReLU (Rectified Linear Unit) and point-wise multiplication, we simplify the design of KAN’s basis function and optimize the computation process for efficient CUDA computing. The proposed ReLU-KAN architecture can be readily implemented on existing deep learning frameworks (e.g., PyTorch) for both inference and training. Experimental results demonstrate that ReLU-KAN achieves a 20x speedup compared to traditional KAN with 4-layer networks. Furthermore, ReLU-KAN exhibits a more stable training process with superior fitting ability while preserving the “catastrophic forgetting avoidance” property of KAN. You can get the code in this https URL
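One way a bell-shaped basis function can be built from ReLU and point-wise multiplication alone is sketched below. The abstract does not state the formula, so the specific form, the normalization constant, and the names `relu_basis`, `s`, and `e` are assumptions for illustration.

```python
def relu(x):
    return max(0.0, x)


def relu_basis(x, s, e):
    """Bump supported on [s, e], built only from ReLU and multiplication.

    Assumed form: (ReLU(e - x) * ReLU(x - s))^2 * 16 / (e - s)^4.
    The constant scales the peak at the midpoint of [s, e] to 1.
    """
    norm = 16.0 / (e - s) ** 4
    return (relu(e - x) * relu(x - s)) ** 2 * norm


peak = relu_basis(0.5, s=0.0, e=1.0)     # midpoint of the support
left = relu_basis(-0.1, s=0.0, e=1.0)    # outside the support
right = relu_basis(1.2, s=0.0, e=1.0)    # outside the support
```

Because the function uses only subtraction, ReLU, and element-wise products, a whole grid of such bumps can be evaluated as batched tensor operations, which is the kind of GPU-friendly computation the abstract describes.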

[LG-90] Preference Optimization for Molecule Synthesis with Conditional Residual Energy-based Models

Link: https://arxiv.org/abs/2406.02066
Authors: Songtao Liu, Hanjun Dai, Yue Zhao, Peng Liu
Keywords: drug discovery, synthesis through machine, machine learning, fundamental problems, problems in drug
Categories: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
Comments: Accepted by ICML 2024 (Oral)

Abstract:Molecule synthesis through machine learning is one of the fundamental problems in drug discovery. Current data-driven strategies employ one-step retrosynthesis models and search algorithms to predict synthetic routes in a top-bottom manner. Despite their effective performance, these strategies face limitations in synthetic route generation because they greedily select the next molecule set without any lookahead. Furthermore, existing strategies cannot control the generation of synthetic routes based on possible criteria such as material costs, yields, and step count. In this work, we propose a general and principled framework via conditional residual energy-based models (EBMs) that focuses on the quality of the entire synthetic route based on the specific criteria. By incorporating an additional energy-based function into our probabilistic model, our proposed algorithm can enhance the quality of the most probable synthetic routes (with higher probabilities) generated by various strategies in a plug-and-play fashion. Extensive experiments demonstrate that our framework can consistently boost performance across various strategies and outperforms the previous state-of-the-art top-1 accuracy by a margin of 2.5%. Code is available at this https URL.

[LG-91] Advancing Generalized Transfer Attack with Initialization Derived Bilevel Optimization and Dynamic Sequence Truncation

Link: https://arxiv.org/abs/2406.02064
Authors: Yaohua Liu, Jiaxin Gao, Xuan Liu, Xianghao Jiao, Xin Fan, Risheng Liu
Keywords: Transfer attacks generate, generate significant interest, real-world black-box applications, crafting transferable adversarial, attacks generate significant
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by IJCAI 2024. 10 pages

Abstract:Transfer attacks generate significant interest for real-world black-box applications by crafting transferable adversarial examples through surrogate models. However, existing works essentially directly optimize the single-level objective w.r.t. the surrogate model, which always leads to poor interpretability of the attack mechanism and limited generalization performance over unknown victim models. In this work, we propose the BilEvel Transfer AttacK (BETAK) framework by establishing an initialization derived bilevel optimization paradigm, which explicitly reformulates the nested constraint relationship between the Upper-Level (UL) pseudo-victim attacker and the Lower-Level (LL) surrogate attacker. Algorithmically, we introduce the Hyper Gradient Response (HGR) estimation as an effective feedback for the transferability over pseudo-victim attackers, and propose the Dynamic Sequence Truncation (DST) technique to dynamically adjust the back-propagation path for HGR and reduce computational overhead simultaneously. Meanwhile, we conduct detailed algorithmic analysis and provide convergence guarantee to support non-convexity of the LL surrogate attacker. Extensive evaluations demonstrate substantial improvement of BETAK (e.g., 53.41% increase of attack success rates against IncRes-v2_ens) against different victims and defense methods in targeted and untargeted attack scenarios. The source code is available at this https URL.

[LG-92] Alice in Wonderland: Simple Tasks Showing Complete Reasoning Breakdown in State-Of-the-Art Large Language Models

Link: https://arxiv.org/abs/2406.02061
Authors: Marianna Nezhurina, Lucia Cipolina-Kun, Mehdi Cherti, Jenia Jitsev
Keywords: Large Language Models, exhibiting scaling laws, predict function improvement, Large Language, zero-shot manner
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: v1

Abstract:Large Language Models (LLMs) are often described as being instances of foundation models - that is, models that transfer strongly across various tasks and conditions in few-shot or zero-shot manner, while exhibiting scaling laws that predict function improvement when increasing the pre-training scale. These claims of excelling in different functions and tasks rely on measurements taken across various sets of standardized benchmarks showing high scores for such models. We demonstrate here a dramatic breakdown of function and reasoning capabilities of state-of-the-art models trained at the largest available scales which claim strong function, using a simple, short, conventional common sense problem formulated in concise natural language, easily solvable by humans. The breakdown is dramatic, as models also express strong overconfidence in their wrong solutions, while often providing nonsensical “reasoning”-like explanations akin to confabulations to justify and back up the validity of their clearly failed responses, making them sound plausible. Various standard interventions in an attempt to get the right solution, like various types of enhanced prompting, or urging the models to reconsider the wrong solutions again by multi-step re-evaluation, fail. We take these initial observations to the scientific and technological community to stimulate urgent re-assessment of the claimed capabilities of the current generation of LLMs. Such re-assessment also requires common action to create standardized benchmarks that would allow proper detection of such basic reasoning deficits that obviously manage to remain undiscovered by current state-of-the-art evaluation procedures and benchmarks. Code for reproducing experiments in the paper and raw experiments data can be found at this https URL

[LG-93] Graph Adversarial Diffusion Convolution

Link: https://arxiv.org/abs/2406.02059
Authors: Songtao Liu, Jinghui Chen, Tianfan Fu, Lu Lin, Marinka Zitnik, Dinghao Wu
Keywords: Graph Signal Denoising, Signal Denoising, Graph Diffusion Convolution, Graph Signal, Adversarial Diffusion Convolution
Categories: Machine Learning (cs.LG)
Comments: Accepted by ICML 2024

Abstract:This paper introduces a min-max optimization formulation for the Graph Signal Denoising (GSD) problem. In this formulation, we first maximize the second term of GSD by introducing perturbations to the graph structure based on Laplacian distance and then minimize the overall loss of the GSD. By solving the min-max optimization problem, we derive a new variant of the Graph Diffusion Convolution (GDC) architecture, called Graph Adversarial Diffusion Convolution (GADC). GADC differs from GDC by incorporating an additional term that enhances robustness against adversarial attacks on the graph structure and noise in node features. Moreover, GADC improves the performance of GDC on heterophilic graphs. Extensive experiments demonstrate the effectiveness of GADC across various datasets. Code is available at this https URL.
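For orientation, the graph signal denoising objective and a min-max variant of the kind the abstract describes can be written as follows. The abstract gives no formulas, so the symbols (noisy features X, denoised signal F, Laplacian L, weight λ, perturbation budget ε) and the exact form of the constraint are assumptions.

```latex
% Standard graph signal denoising (GSD), assuming a symmetric graph
% Laplacian L, noisy node features X, and a trade-off weight \lambda > 0:
\min_{F} \; \|F - X\|_F^2 + \lambda \, \operatorname{tr}\!\left(F^\top L F\right)

% Setting the gradient 2(F - X) + 2\lambda L F to zero gives the closed form:
F^\star = (I + \lambda L)^{-1} X

% Min-max variant sketched from the abstract: an adversary perturbs the
% Laplacian to L' within an assumed budget \epsilon so as to maximize the
% second (smoothness) term, and F then minimizes the overall loss:
\min_{F} \; \max_{L' :\, \|L' - L\| \le \epsilon} \;
  \|F - X\|_F^2 + \lambda \, \operatorname{tr}\!\left(F^\top L' F\right)
```

The first term keeps F close to the observed features and the second penalizes variation across edges, so perturbing the Laplacian attacks exactly the part of the loss that depends on graph structure.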

[LG-94] Tabular and Deep Learning for the Whittle Index

Link: https://arxiv.org/abs/2406.02057
Authors: Francisco Robledo Relaño (LMAP, UPPA, UPV/EHU), Vivek Borkar (EE-IIT), Urtzi Ayesta (IRIT-RMESS, UPV/EHU, CNRS), Konstantin Avrachenkov (Inria)
Keywords: Restless Multi-Armed Bandit, Multi-Armed Bandit Problems, remarkably good performance, guaranteed asymptotic optimality, Whittle index policy
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: ACM Transactions on Modeling and Performance Evaluation of Computing Systems, 2024

Abstract:The Whittle index policy is a heuristic that has shown remarkably good performance (with guaranteed asymptotic optimality) when applied to the class of problems known as Restless Multi-Armed Bandit Problems (RMABPs). In this paper we present QWI and QWINN, two reinforcement learning algorithms, respectively tabular and deep, to learn the Whittle index for the total discounted criterion. The key feature is the use of two time-scales, a faster one to update the state-action Q-values, and a relatively slower one to update the Whittle indices. In our main theoretical result we show that QWI, which is a tabular implementation, converges to the real Whittle indices. We then present QWINN, an adaptation of the QWI algorithm using neural networks to compute the Q-values on the faster time-scale, which is able to extrapolate information from one state to another and scales naturally to large state-space environments. For QWINN, we show that all local minima of the Bellman error are locally stable equilibria, which is the first result of its kind for DQN-based schemes. Numerical computations show that QWI and QWINN converge faster than the standard Q-learning algorithm, neural-network based approximate Q-learning and other state-of-the-art algorithms.

[LG-95] CAP: A Context-Aware Neural Predictor for NAS

Link: https://arxiv.org/abs/2406.02056
Authors: Han Ji, Yuqi Feng, Yanan Sun
Keywords: time-consuming performance evaluation, performance evaluation stage, architectures, annotated architectures, Neural predictors
Categories: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments: Accepted by IJCAI24

Abstract:Neural predictors are effective in boosting the time-consuming performance evaluation stage in neural architecture search (NAS), owing to their direct estimation of unseen architectures. Despite the effectiveness, training a powerful neural predictor with fewer annotated architectures remains a huge challenge. In this paper, we propose a context-aware neural predictor (CAP) which only needs a few annotated architectures for training based on the contextual information from the architectures. Specifically, the input architectures are encoded into graphs and the predictor infers the contextual structure around the nodes inside each graph. Then, enhanced by the proposed context-aware self-supervised task, the pre-trained predictor can obtain expressive and generalizable representations of architectures. Therefore, only a few annotated architectures are sufficient for training. Experimental results in different search spaces demonstrate the superior performance of CAP compared with state-of-the-art neural predictors. In particular, CAP can rank architectures precisely at the budget of only 172 annotated architectures in NAS-Bench-101. Moreover, CAP can help find promising architectures in both NAS-Bench-101 and DARTS search spaces on the CIFAR-10 dataset, serving as a useful navigator for NAS to explore the search space efficiently.

[LG-96] PETRA: Parallel End-to-end Training with Reversible Architectures

Link: https://arxiv.org/abs/2406.02052
Authors: Stéphane Rivaud (MLIA), Louis Fournier (MLIA), Thomas Pumir, Eugene Belilovsky (MILA), Michael Eickenberg, Edouard Oyallon
Keywords:
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

[LG-97] QROA: A Black-Box Query-Response Optimization Attack on LLMs

Link: https://arxiv.org/abs/2406.02044
Authors: Hussein Jawad, Nicolas J.-B. BRUNEL (LaMME)
Keywords: Large Language Models, Large Language, Language Models, recent months, surged in popularity
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Large Language Models (LLMs) have surged in popularity in recent months, yet they possess concerning capabilities for generating harmful content when manipulated. This study introduces the Query-Response Optimization Attack (QROA), an optimization-based strategy designed to exploit LLMs through a black-box, query-only interaction. QROA adds an optimized trigger to a malicious instruction to compel the LLM to generate harmful content. Unlike previous approaches, QROA does not require access to the model’s logit information or any other internal data and operates solely through the standard query-response interface of LLMs. Inspired by deep Q-learning and Greedy coordinate descent, the method iteratively updates tokens to maximize a designed reward function. We tested our method on various LLMs such as Vicuna, Falcon, and Mistral, achieving an Attack Success Rate (ASR) over 80%. We also tested the model against Llama2-chat, the fine-tuned version of Llama2 designed to resist Jailbreak attacks, achieving good ASR with a suboptimal initial trigger seed. This study demonstrates the feasibility of generating jailbreak attacks against deployed LLMs in the public domain using black-box optimization methods, enabling more comprehensive safety testing of LLMs.

[LG-98] DFA-GNN: Forward Learning of Graph Neural Networks by Direct Feedback Alignment

Link: https://arxiv.org/abs/2406.02040
Authors: Gongpei Zhao, Tao Wang, Congyan Lang, Yi Jin, Yidong Li, Haibin Ling
Keywords: backpropagation algorithm playing, Graph neural networks, graph data, Graph, neural networks
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Graph neural networks are recognized for their strong performance across various applications, with the backpropagation algorithm playing a central role in the development of most GNN models. However, despite its effectiveness, BP has limitations that challenge its biological plausibility and affect the efficiency, scalability and parallelism of training neural networks for graph-based tasks. While several non-BP training algorithms, such as the direct feedback alignment, have been successfully applied to fully-connected and convolutional network components for handling Euclidean data, directly adapting these non-BP frameworks to manage non-Euclidean graph data in GNN models presents significant challenges. These challenges primarily arise from the violation of the i.i.d. assumption in graph data and the difficulty in accessing prediction errors for all samples (nodes) within the graph. To overcome these obstacles, in this paper we propose DFA-GNN, a novel forward learning framework tailored for GNNs with a case study of semi-supervised learning. The proposed method breaks the limitations of BP by using a dedicated forward training mechanism. Specifically, DFA-GNN extends the principles of DFA to adapt to graph data and unique architecture of GNNs, which incorporates the information of graph topology into the feedback links to accommodate the non-Euclidean characteristics of graph data. Additionally, for semi-supervised graph learning tasks, we developed a pseudo error generator that spreads residual errors from training data to create a pseudo error for each unlabeled node. These pseudo errors are then utilized to train GNNs using DFA. Extensive experiments on 10 public benchmarks reveal that our learning framework outperforms not only previous non-BP methods but also the standard BP methods, and it exhibits excellent robustness against various types of noise and attacks.

[LG-99] A Unifying Framework for Action-Conditional Self-Predictive Reinforcement Learning

Link: https://arxiv.org/abs/2406.02035
Authors: Khimya Khetarpal, Zhaohan Daniel Guo, Bernardo Avila Pires, Yunhao Tang, Clare Lyle, Mark Rowland, Nicolas Heess, Diana Borsa, Arthur Guez, Will Dabney
Keywords: challenge for Reinforcement, Reinforcement Learning, self-predictive representation learning, crucial challenge, Learning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Learning a good representation is a crucial challenge for Reinforcement Learning (RL) agents. Self-predictive learning provides means to jointly learn a latent representation and dynamics model by bootstrapping from future latent representations (BYOL). Recent work has developed theoretical insights into these algorithms by studying a continuous-time ODE model for self-predictive representation learning under the simplifying assumption that the algorithm depends on a fixed policy (BYOL-Π); this assumption is at odds with practical instantiations of such algorithms, which explicitly condition their predictions on future actions. In this work, we take a step towards bridging the gap between theory and practice by analyzing an action-conditional self-predictive objective (BYOL-AC) using the ODE framework, characterizing its convergence properties and highlighting important distinctions between the limiting solutions of the BYOL-Π and BYOL-AC dynamics. We show how the two representations are related by a variance equation. This connection leads to a novel variance-like action-conditional objective (BYOL-VAR) and its corresponding ODE. We unify the study of all three objectives through two complementary lenses: a model-based perspective, where each objective is shown to be equivalent to a low-rank approximation of certain dynamics, and a model-free perspective, which establishes relationships between the objectives and their respective value, Q-value, and advantage function. Our empirical investigations, encompassing both linear function approximation and Deep RL environments, demonstrate that BYOL-AC is better overall in a variety of different settings.

[LG-100] Inference Attacks in Machine Learning as a Service: A Taxonomy, Review, and Promising Directions

Link: https://arxiv.org/abs/2406.02027
Authors: Feng Wu, Lei Cui, Shaowen Yao, Shui Yu
Keywords: brought people concerns, inference attacks, inference, brought people, people concerns
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The prosperity of machine learning has also raised concerns about data privacy. Among the resulting threats, inference attacks can cause privacy breaches in various MLaaS scenarios and in both the model training and prediction phases. Specifically, inference attacks can perform privacy inference on undisclosed target training sets based on outputs of the target model, including but not limited to statistics, membership, semantics, data representation, etc. For instance, an attacker may infer whether the target data has the characteristics of AIDS. In addition, the rapid development of the machine learning community in recent years, especially the surge of model types and application scenarios, has further stimulated research on inference attacks. Thus, studying inference attacks and analyzing them in depth is urgent and significant. However, there is still a gap in the systematic discussion of inference attacks from taxonomy, global perspective, attack, and defense perspectives. This survey provides an in-depth and comprehensive review of inference attacks and corresponding countermeasures in ML-as-a-service based on taxonomy and the latest research. Without compromising researchers’ intuition, we first propose the 3MP taxonomy based on the community research status, trying to normalize the confusing naming system of inference attacks. Also, we analyze the pros and cons of each type of inference attack, their workflow, countermeasure, and how they interact with other attacks. In the end, we point out several promising directions for researchers from a more comprehensive and novel perspective.

[LG-101] Verifying the Generalization of Deep Learning to Out-of-Distribution Domains

Link: https://arxiv.org/abs/2406.02024
Authors: Guy Amir, Osher Maayan, Tom Zelazny, Guy Katz, Michael Schapira
Keywords: play a crucial, crucial role, field of machine, Deep neural networks, Deep
Categories: Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
Comments: To appear in the Journal of Automated Reasoning (JAR), 2024. arXiv admin note: substantial text overlap with arXiv:2302.05745

Abstract:Deep neural networks (DNNs) play a crucial role in the field of machine learning, demonstrating state-of-the-art performance across various application domains. However, despite their success, DNN-based models may occasionally exhibit challenges with generalization, i.e., may fail to handle inputs that were not encountered during training. This limitation is a significant challenge when it comes to deploying deep learning for safety-critical tasks, as well as in real-world settings characterized by substantial variability. We introduce a novel approach for harnessing DNN verification technology to identify DNN-driven decision rules that exhibit robust generalization to previously unencountered input domains. Our method assesses generalization within an input domain by measuring the level of agreement between independently trained deep neural networks for inputs in this domain. We also efficiently realize our approach by using off-the-shelf DNN verification engines, and extensively evaluate it on both supervised and unsupervised DNN benchmarks, including a deep reinforcement learning (DRL) system for Internet congestion control – demonstrating the applicability of our approach for real-world settings. Moreover, our research introduces a fresh objective for formal verification, offering the prospect of mitigating the challenges linked to deploying DNN-driven systems in real-world scenarios.
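The agreement idea can be illustrated with a small sampling-based sketch. Note the substitution: the paper relies on formal DNN verification engines, whereas the code below only estimates disagreement empirically by random sampling, and the two threshold "models" are hypothetical stand-ins for independently trained networks.

```python
import random


def disagreement_rate(models, sample_input, n_samples=1000, seed=0):
    """Estimate how often the models' decisions differ on an input domain.

    This is an empirical estimate, not a formal guarantee: a verifier would
    search the whole domain for a disagreeing input instead of sampling.
    """
    rng = random.Random(seed)
    disagreements = 0
    for _ in range(n_samples):
        x = sample_input(rng)
        decisions = {model(x) for model in models}
        if len(decisions) > 1:
            disagreements += 1
    return disagreements / n_samples


# Hypothetical stand-ins for two independently trained decision rules.
def model_a(x):
    return x > 0.5


def model_b(x):
    return x > 0.6


# Domain where both rules behave identically: low disagreement.
familiar = disagreement_rate([model_a, model_b], lambda rng: rng.uniform(0.0, 0.4))
# Domain that falls between the two thresholds: high disagreement.
shifted = disagreement_rate([model_a, model_b], lambda rng: rng.uniform(0.5, 0.6))
```

High disagreement on a domain is treated as a warning sign that the learned decision rules may not generalize there, which mirrors the criterion described in the abstract.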

[LG-102] MetaMixer Is All You Need

Link: https://arxiv.org/abs/2406.02021
Authors: Seokju Yun, Dongheon Lee, Youngmin Ro
Keywords: revolutionized the landscape, Transformer, Feed-Forward Network, network design, FFN
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Code: this https URL

Abstract:Transformer, composed of self-attention and Feed-Forward Network, has revolutionized the landscape of network design across various vision tasks. FFN is a versatile operator seamlessly integrated into nearly all AI models to effectively harness rich representations. Recent works also show that FFN functions like key-value memories. Thus, akin to the query-key-value mechanism within self-attention, FFN can be viewed as a memory network, where the input serves as query and the two projection weights operate as keys and values, respectively. We hypothesize that the importance lies in query-key-value framework itself rather than in self-attention. To verify this, we propose converting self-attention into a more FFN-like efficient token mixer with only convolutions while retaining query-key-value framework, namely FFNification. Specifically, FFNification replaces query-key and attention coefficient-value interactions with large kernel convolutions and adopts GELU activation function instead of softmax. The derived token mixer, FFNified attention, serves as key-value memories for detecting locally distributed spatial patterns, and operates in the opposite dimension to the ConvNeXt block within each corresponding sub-operation of the query-key-value framework. Building upon the above two modules, we present a family of Fast-Forward Networks. Our FFNet achieves remarkable performance improvements over previous state-of-the-art methods across a wide range of tasks. The strong and general performance of our proposed method validates our hypothesis and leads us to introduce MetaMixer, a general mixer architecture that does not specify sub-operations within the query-key-value framework. We show that using only simple operations like convolution and GELU in the MetaMixer can achieve superior performance.

[LG-103] On the Mode-Seeking Properties of Langevin Dynamics

Link: https://arxiv.org/abs/2406.02017
Authors: Xiwei Cheng, Kexin Fu, Farzan Farnia
Keywords: Langevin Dynamics, Chained Langevin Dynamics, Langevin Dynamics framework, score-based generative modeling, interpreting score-based generative
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:The Langevin Dynamics framework, which aims to generate samples from the score function of a probability distribution, is widely used for analyzing and interpreting score-based generative modeling. While the convergence behavior of Langevin Dynamics under unimodal distributions has been extensively studied in the literature, in practice the data distribution could consist of multiple distinct modes. In this work, we investigate Langevin Dynamics in producing samples from multimodal distributions and theoretically study its mode-seeking properties. We prove that under a variety of sub-Gaussian mixtures, Langevin Dynamics is unlikely to find all mixture components within a sub-exponential number of steps in the data dimension. To reduce the mode-seeking tendencies of Langevin Dynamics, we propose Chained Langevin Dynamics, which divides the data vector into patches of constant size and generates every patch sequentially conditioned on the previous patches. We perform a theoretical analysis of Chained Langevin Dynamics by reducing it to sampling from a constant-dimensional distribution. We present the results of several numerical experiments on synthetic and real image datasets, supporting our theoretical results on the iteration complexities of sample generation from mixture distributions using the chained and vanilla Langevin Dynamics. The code is available at this https URL.
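
The mode-seeking behaviour can be seen in a one-dimensional toy version of the setting. The sketch below is not the paper's high-dimensional analysis and does not implement Chained Langevin Dynamics. It runs plain unadjusted Langevin updates on an equal-weight mixture of two unit-variance Gaussians centred at -6 and +6, and the step size, chain length, and starting point are all assumptions chosen for illustration.

```python
import math
import random


def score(x, m):
    """Score (d/dx log p) of the mixture 0.5*N(-m, 1) + 0.5*N(m, 1)."""
    return -x + m * math.tanh(m * x)


def langevin_chain(x0, m, step, n_steps, seed=0):
    """Unadjusted Langevin: x <- x + step * score(x) + sqrt(2 * step) * noise."""
    rng = random.Random(seed)
    x, path = x0, []
    for _ in range(n_steps):
        x = x + step * score(x, m) + math.sqrt(2.0 * step) * rng.gauss(0.0, 1.0)
        path.append(x)
    return path


# Start inside the right-hand mode and record whether the chain ever crosses zero.
path = langevin_chain(x0=6.0, m=6.0, step=0.01, n_steps=2000)
visited_left_mode = any(x < 0.0 for x in path)
print("visited the mode at -6:", visited_left_mode)
```

With well-separated modes the chain is expected to stay in the mode where it started, even though both modes carry equal probability mass.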

[LG-104] Parameterizing Federated Continual Learning for Reproducible Research

Link: https://arxiv.org/abs/2406.02015
Authors: Bart Cox, Jeroen Galjaard, Aditya Shankar, Jérémie Decouchant, Lydia Y. Chen
Keywords: Federated Continual Learning, ever-evolving environments, Continual Learning, systems evolve, Learning
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: Preprint: Accepted at the 1st WAFL (Workshop on Advancements in Federated Learning) workshop, ECML-PKDD 2023

Abstract:Federated Learning (FL) systems evolve in heterogeneous and ever-evolving environments that challenge their performance. Under real deployments, the learning tasks of clients can also evolve with time, which calls for the integration of methodologies such as Continual Learning. To enable research reproducibility, we propose a set of experimental best practices that precisely capture and emulate complex learning scenarios. Our framework, Freddie, is the first entirely configurable framework for Federated Continual Learning (FCL), and it can be seamlessly deployed on a large number of machines thanks to the use of Kubernetes and containerization. We demonstrate the effectiveness of Freddie on two use cases, (i) large-scale FL on CIFAR100 and (ii) heterogeneous task sequence on FCL, which highlight unaddressed performance challenges in FCL scenarios.

[LG-105] Mamba as Decision Maker: Exploring Multi-scale Sequence Modeling in Offline Reinforcement Learning

Link: https://arxiv.org/abs/2406.02013
Authors: Jiahang Cao, Qiang Zhang, Ziqing Wang, Jiaxu Wang, Hao Cheng, Yecheng Shao, Wen Zhao, Gang Han, Yijie Guo, Renjing Xu
Keywords: achieving significant success, offline reinforcement learning, Markov Decision Process, Decision Transformer, Mamba Decision Maker
Subjects: Machine Learning (cs.LG)
Comments: 16 pages, 5 figures

Abstract:Sequential modeling has demonstrated remarkable capabilities in offline reinforcement learning (RL), with Decision Transformer (DT) being one of the most notable representatives, achieving significant success. However, RL trajectories possess unique properties to be distinguished from the conventional sequence (e.g., text or audio): (1) local correlation, where the next states in RL are theoretically determined solely by current states and actions based on the Markov Decision Process (MDP), and (2) global correlation, where each step’s features are related to long-term historical information due to the time-continuous nature of trajectories. In this paper, we propose a novel action sequence predictor, named Mamba Decision Maker (MambaDM), where Mamba is expected to be a promising alternative for sequence modeling paradigms, owing to its efficient modeling of multi-scale dependencies. In particular, we introduce a novel mixer module that proficiently extracts and integrates both global and local features of the input sequence, effectively capturing interrelationships in RL datasets. Extensive experiments demonstrate that MambaDM achieves state-of-the-art performance in Atari and OpenAI Gym datasets. Furthermore, we empirically investigate the scaling laws of MambaDM, finding that increasing model size does not bring performance improvement, but scaling the dataset amount by 2x for MambaDM can obtain up to 33.7% score improvement on Atari dataset. This paper delves into the sequence modeling capabilities of MambaDM in the RL domain, paving the way for future advancements in robust and efficient decision-making systems. Our code will be available at this https URL.

[LG-106] Bayesian Mesh Optimization for Graph Neural Networks to Enhance Engineering Performance Prediction

Link: https://arxiv.org/abs/2406.01996
Authors: Jangseop Park, Namwoo Kang
Keywords: replace computationally expensive, leveraging design variables, computationally expensive simulations, design variables, widely employed
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: 17 pages, 8 figures, 3 tables

Abstract:In engineering design, surrogate models are widely employed to replace computationally expensive simulations by leveraging design variables and geometric parameters from computer-aided design (CAD) models. However, these models often lose critical information when simplified to lower dimensions and face challenges in parameter definition, especially with the complex 3D shapes commonly found in industrial datasets. To address these limitations, we propose a Bayesian graph neural network (GNN) framework for a 3D deep-learning-based surrogate model that predicts engineering performance by directly learning geometric features from CAD using mesh representation. Our framework determines the optimal size of mesh elements through Bayesian optimization, resulting in a high-accuracy surrogate model. Additionally, it effectively handles the irregular and complex structures of 3D CADs, which differ significantly from the regular and uniform pixel structures of 2D images typically used in deep learning. Experimental results demonstrate that the quality of the mesh significantly impacts the prediction accuracy of the surrogate model, with an optimally sized mesh achieving superior performance. We compare the performance of models based on various 3D representations such as voxel, point cloud, and graph, and evaluate the computational costs of Monte Carlo simulation and Bayesian optimization methods to find the optimal mesh size. We anticipate that our proposed framework has the potential to be applied to mesh-based simulations across various engineering fields, leveraging physics-based information commonly used in computer-aided engineering.

[LG-107] What Improves the Generalization of Graph Transformers? A Theoretical Dive into the Self-attention and Positional Encoding

Link: https://arxiv.org/abs/2406.01977
Authors: Hongkang Li, Meng Wang, Tengfei Ma, Sijia Liu, Zaixi Zhang, Pin-Yu Chen
Keywords: graph learning tasks, Graph Transformers, shallow Graph Transformer, recently emerged, powerful architecture
Subjects: Machine Learning (cs.LG)
Comments: ICML 2024

Abstract:Graph Transformers, which incorporate self-attention and positional encoding, have recently emerged as a powerful architecture for various graph learning tasks. Despite their impressive performance, the complex non-convex interactions across layers and the recursive graph structure have made it challenging to establish a theoretical foundation for learning and generalization. This study introduces the first theoretical investigation of a shallow Graph Transformer for semi-supervised node classification, comprising a self-attention layer with relative positional encoding and a two-layer perceptron. Focusing on a graph data model with discriminative nodes that determine node labels and non-discriminative nodes that are class-irrelevant, we characterize the sample complexity required to achieve a desirable generalization error by training with stochastic gradient descent (SGD). This paper provides the quantitative characterization of the sample complexity and number of iterations for convergence dependent on the fraction of discriminative nodes, the dominant patterns, and the initial model errors. Furthermore, we demonstrate that self-attention and positional encoding enhance generalization by making the attention map sparse and promoting the core neighborhood during training, which explains the superior feature representation of Graph Transformers. Our theoretical results are supported by empirical experiments on synthetic and real-world benchmarks.

[LG-108] Can Dense Connectivity Benefit Outlier Detection? An Odyssey with NAS

Link: https://arxiv.org/abs/2406.01975
Authors: Hao Fu, Tunhou Zhang, Hai Li, Yiran Chen
Keywords: Convolutional Neural Networks, real world applications, Recent advances, Neural Networks, deployment of Convolutional
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent advances in Out-of-Distribution (OOD) detection are the driving force behind safe and reliable deployment of Convolutional Neural Networks (CNNs) in real-world applications. However, existing studies focus on OOD detection through confidence score and deep generative model-based methods, without considering the impact of DNN structures, especially dense connectivity in architecture fabrications. In addition, existing outlier detection approaches exhibit high variance in generalization performance, lacking stability and confidence in evaluating and ranking different outlier detectors. In this work, we propose a novel paradigm, Dense Connectivity Search of Outlier Detector (DCSOD), that automatically explores the dense connectivity of CNN architectures on the near-OOD detection task using Neural Architecture Search (NAS). We introduce a hierarchical search space containing versatile convolution operators and dense connectivity, allowing a flexible exploration of CNN architectures with diverse connectivity patterns. To improve the quality of evaluation on OOD detection during search, we propose evolving distillation based on our multi-view feature learning explanation. Evolving distillation stabilizes training for OOD detection evaluation and thus improves the quality of search. We thoroughly examine DCSOD on CIFAR benchmarks under the OOD detection protocol. Experimental results show that DCSOD achieves remarkable performance over widely used architectures and previous NAS baselines. Notably, DCSOD achieves state-of-the-art (SOTA) performance on the CIFAR benchmark, with an AUROC improvement of ~1.0%.

[LG-109] Multiway Multislice PHATE: Visualizing Hidden Dynamics of RNNs through Training

Link: https://arxiv.org/abs/2406.01969
Authors: Jiancheng Xie, Lou C. Kohler Voinov, Noga Mudrik, Gal Mishne, Adam Charles
Keywords: Recurrent neural networks, sequential data analysis, Recurrent neural, data analysis, boxes of computation
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Recurrent neural networks (RNNs) are a widely used tool for sequential data analysis; however, they are still often seen as black boxes of computation. Understanding the functional principles of these networks is critical to developing ideal model architectures and optimization strategies. Previous studies typically only emphasize the network representation post-training, overlooking their evolution process throughout training. Here, we present Multiway Multislice PHATE (MM-PHATE), a novel method for visualizing the evolution of RNNs' hidden states. MM-PHATE is a graph-based embedding using structured kernels across the multiple dimensions spanned by RNNs: time, training epoch, and units. We demonstrate on various datasets that MM-PHATE uniquely preserves hidden representation community structure among units and identifies information processing and compression phases during training. The embedding allows users to look under the hood of RNNs across training and provides an intuitive and comprehensive strategy for understanding the network's internal dynamics and drawing conclusions, e.g., on why and how one model outperforms another or how a specific architecture might impact an RNN's learning ability.

[LG-110] DrEureka: Language Model Guided Sim-To-Real Transfer

Link: https://arxiv.org/abs/2406.01967
Authors: Yecheng Jason Ma, William Liang, Hung-Ju Wang, Sam Wang, Yuke Zhu, Linxi Fan, Osbert Bastani, Dinesh Jayaraman
Keywords: Transferring policies learned, Transferring policies, acquiring robot skills, Large Language Models, skills at scale
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Robotics: Science and Systems (RSS) 2024. Project website and open-source code: this https URL

Abstract:Transferring policies learned in simulation to the real world is a promising strategy for acquiring robot skills at scale. However, sim-to-real approaches typically rely on manual design and tuning of the task reward function as well as the simulation physics parameters, rendering the process slow and human-labor intensive. In this paper, we investigate using Large Language Models (LLMs) to automate and accelerate sim-to-real design. Our LLM-guided sim-to-real approach, DrEureka, requires only the physics simulation for the target task and automatically constructs suitable reward functions and domain randomization distributions to support real-world transfer. We first demonstrate that our approach can discover sim-to-real configurations that are competitive with existing human-designed ones on quadruped locomotion and dexterous manipulation tasks. Then, we showcase that our approach is capable of solving novel robot tasks, such as quadruped balancing and walking atop a yoga ball, without iterative manual design.

[LG-111] Certifiably Byzantine-Robust Federated Conformal Prediction

Link: https://arxiv.org/abs/2406.01960
Authors: Mintong Kang, Zhen Lin, Jimeng Sun, Cao Xiao, Bo Li
Keywords: shown impressive capacity, constructing statistically rigorous, machine learning models, exchangeable data samples, statistically rigorous prediction
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted to ICML 2024

Abstract:Conformal prediction has shown impressive capacity in constructing statistically rigorous prediction sets for machine learning models with exchangeable data samples. The siloed datasets, coupled with the escalating privacy concerns related to local data sharing, have inspired recent innovations extending conformal prediction into federated environments with distributed data samples. However, this framework for distributed uncertainty quantification is susceptible to Byzantine failures. A minor subset of malicious clients can significantly compromise the practicality of coverage guarantees. To address this vulnerability, we introduce a novel framework Rob-FCP, which executes robust federated conformal prediction, effectively countering malicious clients capable of reporting arbitrary statistics with the conformal calibration process. We theoretically provide the conformal coverage bound of Rob-FCP in the Byzantine setting and show that the coverage of Rob-FCP is asymptotically close to the desired coverage level. We also propose a malicious client number estimator to tackle a more challenging setting where the number of malicious clients is unknown to the defender and theoretically show its effectiveness. We empirically demonstrate the robustness of Rob-FCP against diverse proportions of malicious clients under a variety of Byzantine attacks on five standard benchmark and real-world healthcare datasets.
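
For readers new to the topic, the calibration step that malicious clients could corrupt looks roughly like the following in the ordinary, single-site case. This is a sketch of standard split conformal prediction only. It does not implement Rob-FCP or any federated or Byzantine-robust aggregation, and the calibration scores are made-up numbers.

```python
import math


def conformal_quantile(scores, alpha):
    """Finite-sample threshold: the ceil((n + 1) * (1 - alpha))-th smallest score."""
    n = len(scores)
    k = math.ceil((n + 1) * (1.0 - alpha))
    if k > n:
        return math.inf  # too few calibration points for this coverage level
    return sorted(scores)[k - 1]


def prediction_interval(point_prediction, threshold):
    """Interval of all y whose absolute residual is at most the threshold."""
    return (point_prediction - threshold, point_prediction + threshold)


# Hypothetical calibration residuals |y - prediction| on held-out data.
calibration_scores = [i / 10 for i in range(1, 11)]  # 0.1, 0.2, ..., 1.0
threshold = conformal_quantile(calibration_scores, alpha=0.2)
interval = prediction_interval(3.0, threshold)
print(threshold, interval)
```

Because the threshold is an order statistic of the reported scores, a few clients reporting arbitrary values can shift it, which is the weakness the abstract describes.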

[LG-112] A Comparative Study of Sampling Methods with Cross-Validation in the FedHome Framework

Link: https://arxiv.org/abs/2406.01950
Authors: Arash Ahmadi, Sarah S. Sharif, Yaser M. Banad
Keywords: Stratified K-fold cross-validation, paper presents, presents a comparative, comparative study, study of sampling
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 11 Figures

Abstract:This paper presents a comparative study of sampling methods within the FedHome framework, designed for personalized in-home health monitoring. FedHome leverages federated learning (FL) and generative convolutional autoencoders (GCAE) to train models on decentralized edge devices while prioritizing data privacy. A notable challenge in this domain is the class imbalance in health data, where critical events such as falls are underrepresented, adversely affecting model performance. To address this, the research evaluates six oversampling techniques using Stratified K-fold cross-validation: SMOTE, Borderline-SMOTE, Random OverSampler, SMOTE-Tomek, SVM-SMOTE, and SMOTE-ENN. These methods are tested on FedHome’s public implementation over 200 training rounds with and without stratified K-fold cross-validation. The findings indicate that SMOTE-ENN achieves the most consistent test accuracy, with a standard deviation range of 0.0167-0.0176, demonstrating stable performance compared to other samplers. In contrast, SMOTE and SVM-SMOTE exhibit higher variability in performance, as reflected by their wider standard deviation ranges of 0.0157-0.0180 and 0.0155-0.0180, respectively. Similarly, the Random OverSampler method shows a significant deviation range of 0.0155-0.0176. SMOTE-Tomek, with a deviation range of 0.0160-0.0175, also shows greater stability but not as much as SMOTE-ENN. This finding highlights the potential of SMOTE-ENN to enhance the reliability and accuracy of personalized health monitoring systems within the FedHome framework.

[LG-113] Data-Driven Approaches for Thrust Prediction in Underwater Flapping Fin Propulsion Systems

Link: https://arxiv.org/abs/2406.01947
Authors: Julian Lee, Kamal Viswanath, Alisha Sharma, Jason Geder, Ravi Ramamurti, Marius D. Pruessner
Keywords: require high maneuverability, Flapping-fin underwater vehicle, propulsion systems provide, Flapping-fin underwater, high maneuverability
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: 9 pages, 11 figures, AAAI 2021 Fall Series Symposium on Science-Guided AI

Abstract:Flapping-fin underwater vehicle propulsion systems provide an alternative to propeller-driven systems in situations that involve a constrained environment or require high maneuverability. Testing new configurations through experiments or high-fidelity simulations is an expensive process, slowing development of new systems. This is especially true when introducing new fin geometries. In this work, we propose machine learning approaches for thrust prediction given the system's fin geometries and kinematics. We introduce data-efficient fin shape parameterization strategies that enable our network to predict thrust profiles for unseen fin geometries given limited fin shapes in input data. In addition to faster development of systems, generalizable surrogate models offer fast, accurate predictions that could be used on an unmanned underwater vehicle control system.

[LG-114] Process-Driven Autoformalization in Lean 4

Link: https://arxiv.org/abs/2406.01940
Authors: Jianqiao Lu, Zhengying Liu, Yingjia Wan, Yinya Huang, Haiming Wang, Zhicheng Yang, Jing Tang, Zhijiang Guo
Keywords: advancing mathematical reasoning, natural language mathematics, offers significant potential, mathematical reasoning
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
Comments: 22 pages, 1 figure, 11 tables

Abstract:Autoformalization, the conversion of natural language mathematics into formal languages, offers significant potential for advancing mathematical reasoning. However, existing efforts are limited to formal languages with substantial online corpora and struggle to keep pace with rapidly evolving languages like Lean 4. To bridge this gap, we propose a new benchmark Formalization for Lean 4 (FormL4) designed to evaluate the autoformalization capabilities of large language models (LLMs). This benchmark encompasses a comprehensive assessment of questions, answers, formal statements, and proofs. Additionally, we introduce a Process-Supervised Verifier (PSV) model that leverages the precise feedback from Lean 4 compilers to enhance autoformalization. Our experiments demonstrate that the PSV method improves autoformalization, enabling higher accuracy using less filtered training data. Furthermore, when fine-tuned with data containing detailed process information, PSV can leverage the data more effectively, leading to more significant improvements in autoformalization for Lean 4. Our dataset and code are available at this https URL.

[LG-115] Speeding up Policy Simulation in Supply Chain RL

Link: https://arxiv.org/abs/2406.01939
Authors: Vivek Farias, Joren Gijsbrechts, Aryan Khojandi, Tianyi Peng, Andrew Zheng
Keywords: Simulating a single, dynamical system, policy, single, core bottleneck
Subjects: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments:

Abstract:Simulating a single trajectory of a dynamical system under some state-dependent policy is a core bottleneck in policy optimization algorithms. The many inherently serial policy evaluations that must be performed in a single simulation constitute the bulk of this bottleneck. To wit, in applying policy optimization to supply chain optimization (SCO) problems, simulating a single month of a supply chain can take several hours. We present an iterative algorithm for policy simulation, which we dub Picard Iteration. This scheme carefully assigns policy evaluation tasks to independent processes. Within an iteration, a single process evaluates the policy only on its assigned tasks while assuming a certain 'cached' evaluation for other tasks; the cache is updated at the end of the iteration. Implemented on GPUs, this scheme admits batched evaluation of the policy on a single trajectory. We prove that the structure afforded by many SCO problems allows convergence in a small number of iterations, independent of the horizon. We demonstrate practical speedups of 400x on large-scale SCO problems even with a single GPU, and also demonstrate practical efficacy in other RL environments.
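
The cached-evaluation idea can be sketched on a toy inventory problem. The code below is a simplified fixed-point reading of the description above, not the authors' algorithm or task assignment scheme: the `policy`, the `step` dynamics, and the demand sequence are hypothetical, and the "batched" policy call is an ordinary list comprehension rather than a GPU kernel.

```python
def policy(inventory):
    """Hypothetical order-up-to rule: refill to 10 when stock drops below 4."""
    return 10 - inventory if inventory < 4 else 0


def step(inventory, order, demand):
    """Toy dynamics: receive the order, serve demand, never go negative."""
    return max(inventory + order - demand, 0)


def sequential_rollout(inv0, demands):
    """Reference simulation: one policy call per time step, strictly in order."""
    inv, actions = inv0, []
    for demand in demands:
        action = policy(inv)
        actions.append(action)
        inv = step(inv, action, demand)
    return actions


def picard_rollout(inv0, demands):
    """Iterate on a cache of actions until it reproduces itself."""
    horizon = len(demands)
    cached = [0] * horizon  # initial guess for every action
    iterations = 0
    for _ in range(horizon):
        iterations += 1
        # Roll the state forward using cached actions only (no policy calls).
        states = [inv0]
        for t in range(horizon - 1):
            states.append(step(states[t], cached[t], demands[t]))
        # All policy evaluations are now independent and could run as one batch.
        new_actions = [policy(s) for s in states]
        if new_actions == cached:
            break
        cached = new_actions
    return cached, iterations


demands = [3, 5, 2, 6, 1, 4, 7, 2]
reference = sequential_rollout(5, demands)
picard_actions, n_iters = picard_rollout(5, demands)
print(reference == picard_actions, n_iters)
```

Each pass fixes at least one more leading action, so the cache matches the sequential rollout after at most as many passes as there are time steps. The paper's claim is that problem structure makes far fewer passes sufficient in practice.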

[LG-116] Generating Synthetic Net Load Data with Physics-informed Diffusion Model

Link: https://arxiv.org/abs/2406.01913
Authors: Shaorong Zhang, Yuanbin Cheng, Nanpeng Yu
Keywords: physics-informed diffusion model, diffusion model, addressing the challenges, privacy concerns, generating synthetic net
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:This paper presents a novel physics-informed diffusion model for generating synthetic net load data, addressing the challenges of data scarcity and privacy concerns. The proposed framework embeds physical models within denoising networks, offering a versatile approach that can be readily generalized to unforeseen scenarios. A conditional denoising neural network is designed to jointly train the parameters of the transition kernel of the diffusion model and the parameters of the physics-informed function. Utilizing the real-world smart meter data from Pecan Street, we validate the proposed method and conduct a thorough numerical study comparing its performance with state-of-the-art generative models, including generative adversarial networks, variational autoencoders, normalizing flows, and a well calibrated baseline diffusion model. A comprehensive set of evaluation metrics is used to assess the accuracy and diversity of the generated synthetic net load data. The numerical study results demonstrate that the proposed physics-informed diffusion model outperforms state-of-the-art models across all quantitative metrics, yielding at least 20% improvement.

[LG-117] A Global Geometric Analysis of Maximal Coding Rate Reduction

Link: https://arxiv.org/abs/2406.01909
Authors: Peng Wang, Huikang Liu, Druv Pai, Yaodong Yu, Zhihui Zhu, Qing Qu, Yi Ma
Keywords: deep network architectures, coding rate reduction, drawing increasing attention, highly effective deep, effective deep network
Subjects: Machine Learning (cs.LG)
Comments: 43 pages, 9 figures. This work has been accepted for publication in the Proceedings of the 41st International Conference on Machine Learning (ICML 2024)

Abstract:The maximal coding rate reduction (MCR²) objective for learning structured and compact deep representations is drawing increasing attention, especially after its recent usage in the derivation of fully explainable and highly effective deep network architectures. However, it lacks a complete theoretical justification: only the properties of its global optima are known, and its global landscape has not been studied. In this work, we give a complete characterization of the properties of all its local and global optima, as well as other types of critical points. Specifically, we show that each (local or global) maximizer of the MCR² problem corresponds to a low-dimensional, discriminative, and diverse representation, and furthermore, each critical point of the objective is either a local maximizer or a strict saddle point. Such a favorable landscape makes MCR² a natural choice of objective for learning diverse and discriminative representations via first-order optimization methods. To validate our theoretical findings, we conduct extensive experiments on both synthetic and real data sets.
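
For reference, the coding rate reduction objective as it is commonly written in the earlier MCR² literature is reproduced below. The abstract itself does not state the formula, and the paper may analyze a regularized or otherwise modified variant, so the notation here is an assumption: Z is the d × n matrix of features, Π_j is the diagonal membership matrix of class j, and ε is the allowed distortion.

```latex
\Delta R(Z, \Pi, \epsilon)
  = \frac{1}{2} \log\det\!\Big(I + \frac{d}{n\,\epsilon^{2}}\, Z Z^{\top}\Big)
  - \sum_{j=1}^{k} \frac{\operatorname{tr}(\Pi_j)}{2n}
    \log\det\!\Big(I + \frac{d}{\operatorname{tr}(\Pi_j)\,\epsilon^{2}}\, Z \Pi_j Z^{\top}\Big)
```

The first term rewards features that span a large volume overall, while the second penalizes the volume occupied by each class, which is why maximizers are described as both diverse and discriminative.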

[LG-118] PDHG-Unrolled Learning-to-Optimize Method for Large-Scale Linear Programming

Link: https://arxiv.org/abs/2406.01908
Authors: Bingheng Li, Linxin Yang, Yupeng Chen, Senmiao Wang, Qian Chen, Haitao Mao, Yao Ma, Akang Wang, Tian Ding, Jiliang Tang, Ruoyu Sun
Keywords: large-scale linear programming, Solving large-scale linear, power systems, finance and logistics, linear programming
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: Accepted by ICML 2024

Abstract:Solving large-scale linear programming (LP) problems is an important task in various areas such as communication networks, power systems, finance and logistics. Recently, two distinct approaches have emerged to expedite LP solving: (i) First-order methods (FOMs); (ii) Learning to optimize (L2O). In this work, we propose an FOM-unrolled neural network (NN) called PDHG-Net, and propose a two-stage L2O method to solve large-scale LP problems. The new architecture PDHG-Net is designed by unrolling the recently emerged PDHG method into a neural network, combined with channel-expansion techniques borrowed from graph neural networks. We prove that the proposed PDHG-Net can recover PDHG algorithm, thus can approximate optimal solutions of LP instances with a polynomial number of neurons. We propose a two-stage inference approach: first use PDHG-Net to generate an approximate solution, and then apply PDHG algorithm to further improve the solution. Experiments show that our approach can significantly accelerate LP solving, achieving up to a 3× speedup compared to FOMs for large-scale LP problems.
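
To make the unrolled method concrete, here is a minimal sketch of plain PDHG (primal-dual hybrid gradient) on a tiny LP in the standard form min cᵀx subject to Ax = b, x ≥ 0. It shows only the classical iteration that such a network would unroll. It contains no learned weights or channel expansion, and the problem data and step sizes are assumptions picked so that tau * sigma * ||A||² < 1.

```python
def matvec(A, x):
    """Compute A @ x for a list-of-rows matrix."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def rmatvec(A, y):
    """Compute A^T @ y for a list-of-rows matrix."""
    return [sum(A[i][j] * y[i] for i in range(len(A))) for j in range(len(A[0]))]


def pdhg_lp(A, b, c, tau, sigma, iters):
    """PDHG for min c^T x s.t. A x = b, x >= 0, starting from zeros."""
    x = [0.0] * len(c)
    y = [0.0] * len(b)
    for _ in range(iters):
        aty = rmatvec(A, y)
        # Primal step: gradient move on c - A^T y, then projection onto x >= 0.
        x_new = [max(0.0, xj - tau * (cj - gj)) for xj, cj, gj in zip(x, c, aty)]
        # Dual step uses the extrapolated point 2 * x_new - x.
        x_bar = [2.0 * xn - xo for xn, xo in zip(x_new, x)]
        ax = matvec(A, x_bar)
        y = [yi + sigma * (bi - axi) for yi, bi, axi in zip(y, b, ax)]
        x = x_new
    return x, y


# Toy problem: maximize x1 + 2*x2 subject to x1 + x2 <= 4 (slack variable x3).
A = [[1.0, 1.0, 1.0]]
b = [4.0]
c = [-1.0, -2.0, 0.0]
x_opt, y_opt = pdhg_lp(A, b, c, tau=0.5, sigma=0.5, iters=2000)
print([round(v, 4) for v in x_opt], [round(v, 4) for v in y_opt])
```

Each iteration needs only matrix-vector products and a projection, which is what makes the method easy to express as network layers and attractive for large instances.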

[LG-119] Bifurcated Generative Flow Networks

Link: https://arxiv.org/abs/2406.01901
Authors: Chunhui Li, Cheng-Hao Liu, Dianbo Liu, Qingpeng Cai, Ling Pan
Keywords: Generative Flow Networks, diverse objects proportionally, Generative Flow, Flow Networks, learning stochastic policies
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Generative Flow Networks (GFlowNets), a new family of probabilistic samplers, have recently emerged as a promising framework for learning stochastic policies that generate high-quality and diverse objects proportionally to their rewards. However, existing GFlowNets often suffer from low data efficiency due to the direct parameterization of edge flows or reliance on backward policies that may struggle to scale up to large action spaces. In this paper, we introduce Bifurcated GFlowNets (BN), a novel approach that employs a bifurcated architecture to factorize the flows into separate representations for state flows and edge-based flow allocation. This factorization enables BN to learn more efficiently from data and better handle large-scale problems while maintaining the convergence guarantee. Through extensive experiments on standard evaluation benchmarks, we demonstrate that BN significantly improves learning efficiency and effectiveness compared to strong baselines.

[LG-120] Cross-Domain Graph Data Scaling: A Showcase with Diffusion Models

Link: https://arxiv.org/abs/2406.01899
Authors: Wenzhuo Tang, Haitao Mao, Danial Dervovic, Ivan Brugere, Saumitra Mishra, Yuying Xie, Jiliang Tang
Keywords: data scaling behavior, diffusion model, natural language, language and images, images benefit
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Models for natural language and images benefit from data scaling behavior: the more data fed into the model, the better they perform. This ‘better with more’ phenomenon enables the effectiveness of large-scale pre-training on vast amounts of data. However, current graph pre-training methods struggle to scale up data due to heterogeneity across graphs. To achieve effective data scaling, we aim to develop a general model that is able to capture diverse data patterns of graphs and can be utilized to adaptively help the downstream tasks. To this end, we propose UniAug, a universal graph structure augmentor built on a diffusion model. We first pre-train a discrete diffusion model on thousands of graphs across domains to learn the graph structural patterns. In the downstream phase, we provide adaptive enhancement by conducting graph structure augmentation with the help of the pre-trained diffusion model via guided generation. By leveraging the pre-trained diffusion model for structure augmentation, we consistently achieve performance improvements across various downstream tasks in a plug-and-play manner. To the best of our knowledge, this study represents the first demonstration of a data-scaling graph structure augmentor on graphs across domains.

[LG-121] Explicitly Encoding Structural Symmetry is Key to Length Generalization in Arithmetic Tasks

Link: https://arxiv.org/abs/2406.01895
Authors: Mahdi Sabbaghi, George Pappas, Hamed Hassani, Surbhi Goel
Keywords: basic arithmetic tasks, code generation, language understanding, logical reasoning, basic arithmetic
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)
Comments: 32 pages, 16 figures

Abstract:Despite the success of Transformers on language understanding, code generation, and logical reasoning, they still fail to generalize over length on basic arithmetic tasks such as addition and multiplication. A major reason behind this failure is the vast difference in structure between numbers and text. For example, numbers are typically parsed from right to left, and there is a correspondence between digits at the same position across different numbers. In contrast, for text, such symmetries are quite unnatural. In this work, we propose to encode these semantics explicitly into the model via modified number formatting and custom positional encodings. Empirically, our method allows a Transformer trained on numbers with at most 5 digits for addition and multiplication to generalize up to 50-digit numbers, without using additional data for longer sequences. We further demonstrate that traditional absolute positional encodings (APE) fail to generalize to longer sequences, even when trained with augmented data that captures task symmetries. To elucidate the importance of explicitly encoding structure, we prove that explicit incorporation of structure via positional encodings is necessary for out-of-distribution generalization. Finally, we pinpoint other challenges inherent to length generalization beyond capturing symmetries, in particular complexity of the underlying task, and propose changes in the training distribution to address them.
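As a rough illustration of what "modified number formatting" could look like, the sketch below writes each operand least-significant digit first and zero-pads it to a fixed width, so that digits of equal place value sit at the same offset in every number. The abstract does not spell out the exact format, so this scheme is an assumption, and `format_addition` is a hypothetical helper rather than the authors' code.

```python
def format_addition(a: int, b: int, width: int) -> str:
    """Render 'a+b=sum' with every number zero-padded and digit-reversed.

    Assumption: least-significant-digit-first order with a fixed width,
    so digits of equal place value share the same offset in each number.
    """
    if a < 0 or b < 0:
        raise ValueError("only non-negative operands are handled in this sketch")

    def rev(n: int, w: int) -> str:
        # Zero-pad to w digits, then reverse so the units digit comes first.
        return str(n).zfill(w)[::-1]

    # The sum may need one extra digit for the final carry.
    return f"{rev(a, width)}+{rev(b, width)}={rev(a + b, width + 1)}"


print(format_addition(57, 968, width=4))
```

With `width=4`, the call above prints `7500+8690=52010`: 57 becomes `0057` and then `7500`, and the sum 1025 becomes `01025` and then `52010`. Reading left to right, a model then meets the digits in the same order in which carries propagate.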

[LG-122] GRAM: Generative Retrieval Augmented Matching of Data Schemas in the Context of Data Security

Link: https://arxiv.org/abs/2406.01876
Authors: Xuanqing Liu, Luyang Kong, Runhui Wang, Patrick Song, Austin Nevins, Henrik Johnson, Nimish Amlathe, Davor Golac
Keywords: data ingestion process, Schema matching constitutes, contemporary database systems, constitutes a pivotal, pivotal phase
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: KDD 2024 Camera Ready; 11 pages, 8 figures

Abstract:Schema matching constitutes a pivotal phase in the data ingestion process for contemporary database systems. Its objective is to discern pairwise similarities between two sets of attributes, each associated with a distinct data table. This challenge emerges at the initial stages of data analytics, such as when incorporating a third-party table into existing databases to inform business insights. Given its significance in the realm of database systems, schema matching has been under investigation since the 2000s. This study revisits this foundational problem within the context of large language models. Adhering to increasingly stringent data security policies, our focus lies on the zero-shot and few-shot scenarios: the model should analyze only a minimal amount of customer data to execute the matching task, contrasting with the conventional approach of scrutinizing the entire data table. We emphasize that the zero-shot or few-shot assumption is imperative to safeguard the identity and privacy of customer data, even at the potential cost of accuracy. The capability to accurately match attributes under such stringent requirements distinguishes our work from previous literature in this domain.

[LG-123] CR-UTP: Certified Robustness against Universal Text Perturbations

Link: https://arxiv.org/abs/2406.01873
Authors: Qian Lou, Xin Liang, Jiaqi Xue, Yancheng Zhang, Rui Xie, Mengxin Zheng
Keywords: Universal Text Perturbations, minor input variations, language model robustness, language model, language prediction
Categories: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: Accepted by ACL Findings 2024

Abstract:It is imperative to ensure the stability of every prediction made by a language model; that is, a language model’s prediction should remain consistent despite minor input variations, like word substitutions. In this paper, we investigate the problem of certifying a language model’s robustness against Universal Text Perturbations (UTPs), which have been widely used in universal adversarial attacks and backdoor attacks. Existing certified robustness based on random smoothing has shown considerable promise in certifying the input-specific text perturbations (ISTPs), operating under the assumption that any random alteration of a sample’s clean or adversarial words would negate the impact of sample-wise perturbations. However, with UTPs, masking only the adversarial words can eliminate the attack. A naive method is to simply increase the masking ratio and the likelihood of masking attack tokens, but it leads to a significant reduction in both certified accuracy and the certified radius due to input corruption by extensive masking. To solve this challenge, we introduce a novel approach, the superior prompt search method, designed to identify a superior prompt that maintains higher certified accuracy under extensive masking. Additionally, we theoretically motivate why ensembles are a particularly suitable choice as base prompts for random smoothing. We call this method the superior prompt ensembling technique. We also empirically confirm this technique, obtaining state-of-the-art results in multiple settings. These methodologies, for the first time, enable high certified accuracy against both UTPs and ISTPs. The source code of CR-UTP is available at this https URL.
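The random-smoothing idea that the abstract builds on can be sketched as a majority vote over randomly masked copies of the input. This is a generic illustration only: it performs no certification and includes neither the prompt search nor the prompt ensembling that the paper proposes. The names `smoothed_predict` and `toy_classify`, the `[MASK]` token, and all default values are assumptions made for this example.

```python
import random
from collections import Counter
from typing import Callable, List, Sequence


def smoothed_predict(
    tokens: Sequence[str],
    classify: Callable[[List[str]], str],
    mask_ratio: float = 0.3,
    num_samples: int = 100,
    mask_token: str = "[MASK]",
    seed: int = 0,
) -> str:
    """Majority vote of `classify` over randomly masked copies of `tokens`."""
    rng = random.Random(seed)
    num_masked = int(round(mask_ratio * len(tokens)))
    votes: Counter = Counter()
    for _ in range(num_samples):
        # Choose which positions to hide in this sample.
        masked_positions = set(rng.sample(range(len(tokens)), num_masked))
        masked = [
            mask_token if i in masked_positions else tok
            for i, tok in enumerate(tokens)
        ]
        votes[classify(masked)] += 1
    return votes.most_common(1)[0][0]


def toy_classify(tokens: List[str]) -> str:
    # Hypothetical stand-in for a language model: a keyword lookup.
    return "pos" if any(t in {"great", "fun"} for t in tokens) else "neg"


sentence = "the movie was great fun and great acting".split()
print(smoothed_predict(sentence, toy_classify))
```

Raising `mask_ratio` makes it more likely that an adversarial token is hidden in any given sample, but it also removes more of the clean input, which is the trade-off the abstract describes.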

[LG-124] Understanding Stochastic Natural Gradient Variational Inference

Link: https://arxiv.org/abs/2406.01870
Authors: Kaiwen Wu, Jacob R. Gardner
Keywords: popular posterior inference, posterior inference method, probabilistic models, popular posterior, method with applications
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: ICML 2024

Abstract:Stochastic natural gradient variational inference (NGVI) is a popular posterior inference method with applications in various probabilistic models. Despite its wide usage, little is known about the non-asymptotic convergence rate in the stochastic setting. We aim to lessen this gap and provide a better understanding. For conjugate likelihoods, we prove the first $\mathcal{O}(\frac{1}{T})$ non-asymptotic convergence rate of stochastic NGVI. The complexity is no worse than stochastic gradient descent (a.k.a. black-box variational inference) and the rate likely has better constant dependency that leads to faster convergence in practice. For non-conjugate likelihoods, we show that stochastic NGVI with the canonical parameterization implicitly optimizes a non-convex objective. Thus, a global convergence rate of $\mathcal{O}(\frac{1}{T})$ is unlikely without some significant new understanding of optimizing the ELBO using natural gradients.

[LG-125] Neural Green’s Operators for Parametric Partial Differential Equations

Link: https://arxiv.org/abs/2406.01857
Authors: Hugo Melchers, Joost Prins, Michael Abdelmalik
Keywords: partial differential equations, introduces neural Green, work introduces neural, operator network architecture, linear partial differential
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA)

Abstract:This work introduces neural Green’s operators (NGOs), a novel neural operator network architecture that learns the solution operator for a parametric family of linear partial differential equations (PDEs). Our construction of NGOs is derived directly from the Green’s formulation of such a solution operator. Similar to deep operator networks (DeepONets) and variationally mimetic operator networks (VarMiONs), NGOs constitute an expansion of the solution to the PDE in terms of basis functions, which are returned from one sub-network, contracted with coefficients, which are returned from another sub-network. However, in accordance with the Green’s formulation, NGOs accept weighted averages of the input functions, rather than sampled values thereof, as is the case in DeepONets and VarMiONs. Application of NGOs to canonical linear parametric PDEs shows that, while they remain competitive with DeepONets, VarMiONs and Fourier neural operators when testing on data that lie within the training distribution, they robustly generalize when testing on finer-scale data generated outside of the training distribution. Furthermore, we show that the explicit representation of the Green’s function that is returned by NGOs enables the construction of effective preconditioners for numerical solvers for PDEs.

[LG-126] Multi-Agent Reinforcement Learning Meets Leaf Sequencing in Radiotherapy

Link: https://arxiv.org/abs/2406.01853
Authors: Riqiang Gao, Florin C. Ghesu, Simon Arberet, Shahab Basiri, Esa Kuusela, Martin Kraus, Dorin Comaniciu, Ali Kamen
Keywords: contemporary radiotherapy planning, key module leaf, module leaf sequencing, radiotherapy planning, optimization-based approaches
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: Accepted by ICML 2024

Abstract:In contemporary radiotherapy planning (RTP), leaf sequencing, a key module, is predominantly addressed by optimization-based approaches. In this paper, we propose a novel deep reinforcement learning (DRL) model, termed Reinforced Leaf Sequencer (RLS), in a multi-agent framework for leaf sequencing. The RLS model offers improvements to time-consuming iterative optimization steps via large-scale training and can control movement patterns through the design of reward mechanisms. We have conducted experiments on four datasets with four metrics and compared our model with a leading optimization sequencer. Our findings reveal that the proposed RLS model can achieve reduced fluence reconstruction errors and potentially faster convergence when integrated into an optimization planner. Additionally, RLS has shown promising results in a full artificial intelligence RTP pipeline. We hope this pioneering multi-agent RL leaf sequencer can foster future research on machine learning for RTP.

[LG-127] Non-uniformity is All You Need: Efficient and Timely Encrypted Traffic Classification With ECHO

Link: https://arxiv.org/abs/2406.01852
Authors: Shilo Daum, Tal Shapira, David Hay, Anat Bremler-Barr
Keywords: Hyperparameter Optimization, effective approach, approach to classifying, crucial for network, classification
Categories: Networking and Internet Architecture (cs.NI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Abstract:With 95% of Internet traffic now encrypted, an effective approach to classifying this traffic is crucial for network security and management. This paper introduces ECHO, a novel optimization process for ML/DL-based encrypted traffic classification. ECHO targets both classification time and memory utilization and incorporates two innovative techniques. The first component, HO (Hyperparameter Optimization of binnings), aims at creating efficient traffic representations. While previous research often uses representations that map packet sizes and packet arrival times to fixed-sized bins, we show that non-uniform binnings are significantly more efficient. These non-uniform binnings are derived by employing a hyperparameter optimization algorithm in the training stage. HO significantly improves accuracy given a required representation size, or, equivalently, achieves comparable accuracy using smaller representations. Then, we introduce EC (Early Classification of traffic), which enables faster classification using a cascade of classifiers adapted for different exit times, where classification is based on the level of confidence. EC reduces the average classification latency by up to 90%. Remarkably, this method not only maintains classification accuracy but also, in certain cases, improves it. Using three publicly available datasets, we demonstrate that the combined method, Early Classification with Hyperparameter Optimization (ECHO), leads to a significant improvement in classification efficiency.
Submission history: [v1] Mon, 3 Jun 2024 23:54:48 UTC, from David Hay.
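The early-classification (EC) idea can be sketched as a cascade in which each stage sees a longer prefix of the flow and the first sufficiently confident stage decides. This is a minimal sketch under stated assumptions: the stage layout, the thresholds, and the `toy_classifier` with its confidence rule are all hypothetical, and the HO binning step is not shown.

```python
from typing import Callable, List, Sequence, Tuple

# A stage is (number of packets to observe, classifier, confidence threshold).
# Each classifier maps a packet-size prefix to (label, confidence in [0, 1]).
Classifier = Callable[[Sequence[int]], Tuple[str, float]]
Stage = Tuple[int, Classifier, float]


def classify_early(packet_sizes: Sequence[int], stages: List[Stage]) -> Tuple[str, int]:
    """Return (label, packets_used) from the first sufficiently confident stage."""
    if not stages:
        raise ValueError("at least one stage is required")
    label, used = "", 0
    for num_packets, classifier, threshold in stages:
        prefix = packet_sizes[:num_packets]
        label, confidence = classifier(prefix)
        used = len(prefix)
        if confidence >= threshold:
            break  # confident enough: exit early
    # If no stage was confident, the last stage's answer is returned.
    return label, used


def toy_classifier(prefix: Sequence[int]) -> Tuple[str, float]:
    # Hypothetical rule: large mean packet size suggests video traffic.
    if not prefix:
        return "unknown", 0.0
    mean = sum(prefix) / len(prefix)
    label = "video" if mean > 800 else "chat"
    return label, min(1.0, abs(mean - 800) / 800)


stages = [(2, toy_classifier, 0.7), (4, toy_classifier, 0.7), (8, toy_classifier, 0.0)]
print(classify_early([1400, 1500, 1450, 1480], stages))
print(classify_early([700, 900, 820, 790, 300, 200], stages))
```

In this toy run the first flow exits after two packets, while the second is ambiguous early on and falls through to the final stage, whose threshold of 0.0 guarantees that an answer is always returned.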

[LG-128] Learning the Target Network in Function Space

Link: https://arxiv.org/abs/2406.01838
Authors: Kavosh Asadi, Yao Liu, Shoham Sabach, Ming Yin, Rasool Fakoor
Keywords: reinforcement learning, Abstract, setting, networks, task
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted to International Conference on Machine Learning (ICML24)

Abstract:We focus on the task of learning the value function in the reinforcement learning (RL) setting. This task is often solved by updating a pair of online and target networks while ensuring that the parameters of these two networks are equivalent. We propose Lookahead-Replicate (LR), a new value-function approximation algorithm that is agnostic to this parameter-space equivalence. Instead, the LR algorithm is designed to maintain an equivalence between the two networks in the function space. This value-based equivalence is obtained by employing a new target-network update. We show that LR leads to a convergent behavior in learning the value function. We also present empirical results demonstrating that LR-based target-network updates significantly improve deep RL on the Atari benchmark.

[LG-129] CAFO: Feature-Centric Explanation on Time Series Classification

Link: https://arxiv.org/abs/2406.01833
Authors: Jaeho Kim, Seok-Ju Hahn, Yoontae Hwang, Junghye Lee, Seulki Lee
Keywords: intricate temporal dynamics, multivariate time series, MTS, high-dimensional nature, intricate temporal
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted to KDD 2024 Research Track

Abstract:In multivariate time series (MTS) classification, finding the important features (e.g., sensors) for model performance is crucial yet challenging due to the complex, high-dimensional nature of MTS data, intricate temporal dynamics, and the necessity for domain-specific interpretations. Current explanation methods for MTS mostly focus on time-centric explanations, apt for pinpointing important time periods but less effective in identifying key features. This limitation underscores the pressing need for a feature-centric approach, a vital yet often overlooked perspective that complements time-centric analysis. To bridge this gap, our study introduces a novel feature-centric explanation and evaluation framework for MTS, named CAFO (Channel Attention and Feature Orthogonalization). CAFO employs a convolution-based approach with channel attention mechanisms, incorporating a depth-wise separable channel attention module (DepCA) and a QR decomposition-based loss for promoting feature-wise orthogonality. We demonstrate that this orthogonalization enhances the separability of attention distributions, thereby refining and stabilizing the ranking of feature importance. This improvement in feature-wise ranking enhances our understanding of feature explainability in MTS. Furthermore, we develop metrics to evaluate global and class-specific feature importance. Our framework’s efficacy is validated through extensive empirical analyses on two major public benchmarks and real-world datasets, both synthetic and self-collected, specifically designed to highlight class-wise discriminative features. The results confirm CAFO’s robustness and informative capacity in assessing feature importance in MTS classification tasks. This study not only advances the understanding of feature-centric explanations in MTS but also sets a foundation for future explorations in feature-centric explanations.

[LG-130] FacAID: A Transformer Model for Neuro-Symbolic Facade Reconstruction

Link: https://arxiv.org/abs/2406.01829
Authors: Aleksander Płocharski, Jan Swidzinski, Joanna Porter-Sobieraj, Przemyslaw Musialski
Keywords: custom-designed split grammar, split grammar, custom-designed split, segmented facade structures, semi-complex split grammar
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
Comments: 11 pages, 10 figures, preprint

Abstract:We introduce a neuro-symbolic transformer-based model that converts flat, segmented facade structures into procedural definitions using a custom-designed split grammar. To facilitate this, we first develop a semi-complex split grammar tailored for architectural facades and then generate a dataset comprising facades alongside their corresponding procedural representations. This dataset is used to train our transformer model to convert segmented, flat facades into the procedural language of our grammar. During inference, the model applies this learned transformation to new facade segmentations, providing a procedural representation that users can adjust to generate varied facade designs. This method not only automates the conversion of static facade images into dynamic, editable procedural formats but also enhances the design flexibility, allowing for easy modifications and variations by architects and designers. Our approach sets a new standard in facade design by combining the precision of procedural generation with the adaptability of neuro-symbolic learning.

[LG-131] EMOE: Expansive Matching of Experts for Robust Uncertainty Based Rejection

Link: https://arxiv.org/abs/2406.01825
Authors: Yunni Qu (1), James Wellnitz (2), Alexander Tropsha (2), Junier Oliva (1) ((1) Department of Computer Science, University of North Carolina at Chapel Hill, (2) Eshelman School of Pharmacy, University of North Carolina at Chapel Hill)
Keywords: uncertainty based rejection, Expansive Matching, prediction and uncertainty, extrapolatory pseudo-labeling, utilizes support-expanding
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Expansive Matching of Experts (EMOE) is a novel method that utilizes support-expanding, extrapolatory pseudo-labeling to improve prediction and uncertainty based rejection on out-of-distribution (OOD) points. We propose an expansive data augmentation technique that generates OOD instances in a latent space, and an empirical trial based approach to filter out augmented expansive points for pseudo-labeling. EMOE utilizes a diverse set of multiple base experts as pseudo-labelers on the augmented data to improve OOD performance through a shared MLP with multiple heads (one per expert). We demonstrate that EMOE achieves superior performance compared to state-of-the-art methods on tabular data.

[LG-132] Causal Discovery with Fewer Conditional Independence Tests

Link: https://arxiv.org/abs/2406.01823
Authors: Kirankumar Shiragur, Jiaqi Zhang, Caroline Uhler
Keywords: understanding causal relationships, causal graph, causal, graph, underlying causal graph
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME); Machine Learning (stat.ML)

Abstract:Many questions in science center around the fundamental problem of understanding causal relationships. However, most constraint-based causal discovery algorithms, including the well-celebrated PC algorithm, often incur an exponential number of conditional independence (CI) tests, posing limitations in various applications. Addressing this, our work focuses on characterizing what can be learned about the underlying causal graph with a reduced number of CI tests. We show that it is possible to learn a coarser representation of the hidden causal graph with a polynomial number of tests. This coarser representation, named Causal Consistent Partition Graph (CCPG), comprises a partition of the vertices and a directed graph defined over its components. CCPG satisfies consistency of orientations and additional constraints which favor finer partitions. Furthermore, it reduces to the underlying causal graph when the causal graph is identifiable. As a consequence, our results offer the first efficient algorithm for recovering the true causal graph with a polynomial number of tests, in special cases where the causal graph is fully identifiable through observational data and potentially additional interventions.

[LG-133] In-Context Learning of Physical Properties: Few-Shot Adaptation to Out-of-Distribution Molecular Graphs

Link: https://arxiv.org/abs/2406.01808
Authors: Grzegorz Kaszuba, Amirhossein D. Naghdi, Dario Massa, Stefanos Papanikolaou, Andrzej Jaszkiewicz, Piotr Sankowski
Keywords: Large language models, Large language, language models manifest, manifest the ability, ability of few-shot
Categories: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
Comments: 12 pages, 4 figures

Abstract:Large language models manifest the ability of few-shot adaptation to a sequence of provided examples. This behavior, known as in-context learning, allows for performing nontrivial machine learning tasks during inference only. In this work, we address the question: can we leverage in-context learning to predict out-of-distribution materials properties? However, this would not be possible for structure property prediction tasks unless an effective method is found to pass atomic-level geometric features to the transformer model. To address this problem, we employ a compound model in which GPT-2 acts on the output of geometry-aware graph neural networks to adapt in-context information. To demonstrate our model’s capabilities, we partition the QM9 dataset into sequences of molecules that share a common substructure and use them for in-context learning. This approach significantly improves the performance of the model on out-of-distribution examples, surpassing that of general graph neural network models.

[LG-134] TabMDA: Tabular Manifold Data Augmentation for Any Classifier using Transformers with In-context Subsetting

Link: https://arxiv.org/abs/2406.01805
Authors: Andrei Margeloiu, Adrián Bazaga, Nikola Simidjievski, Pietro Liò, Mateja Jamnik
Keywords: critical domains, large quantities, data, challenging to acquire, acquire in large
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Tabular data is prevalent in many critical domains, yet it is often challenging to acquire in large quantities. This scarcity usually results in poor performance of machine learning models on such data. Data augmentation, a common strategy for performance improvement in vision and language tasks, typically underperforms for tabular data due to the lack of explicit symmetries in the input space. To overcome this challenge, we introduce TabMDA, a novel method for manifold data augmentation on tabular data. This method utilises a pre-trained in-context model, such as TabPFN, to map the data into a manifold space. TabMDA performs label-invariant transformations by encoding the data multiple times with varied contexts. This process explores the manifold of the underlying in-context models, thereby enlarging the training dataset. TabMDA is a training-free method, making it applicable to any classifier. We evaluate TabMDA on five standard classifiers and observe significant performance improvements across various tabular datasets. Our results demonstrate that TabMDA provides an effective way to leverage information from pre-trained in-context models to enhance the performance of downstream classifiers.

[LG-135] Online Control in Population Dynamics

Link: https://arxiv.org/abs/2406.01799
Authors: Noah Golowich, Elad Hazan, Zhou Lu, Dhruv Rohatgi, Y. Jennifer Sun
Keywords: evolutionary game theory, early sociological works, including biology, sociological works, evolutionary game
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)

Abstract:The study of population dynamics originated with early sociological works (Malthus, 1872) but has since extended into many fields, including biology, epidemiology, evolutionary game theory, and economics. Most studies on population dynamics focus on the problem of prediction rather than control. Existing mathematical models for population control are often restricted to specific, noise-free dynamics, while real-world population changes can be complex and adversarial. To address this gap, we propose a new framework based on the paradigm of online control. We first characterize a set of linear dynamical systems that can naturally model evolving populations. We then give an efficient gradient-based controller for these systems, with near-optimal regret bounds with respect to a broad class of linear policies. Our empirical evaluations demonstrate the effectiveness of the proposed algorithm for population control even in non-linear models such as SIR and replicator dynamics.

[LG-136] Towards the Transferability of Rewards Recovered via Regularized Inverse Reinforcement Learning

Link: https://arxiv.org/abs/2406.01793
Authors: Andreas Schlaginhaufen, Maryam Kamgarpour
Keywords: Inverse reinforcement learning, Inverse reinforcement, transition laws, optimal policy, aims to infer
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

Abstract:Inverse reinforcement learning (IRL) aims to infer a reward from expert demonstrations, motivated by the idea that the reward, rather than the policy, is the most succinct and transferable description of a task [Ng et al., 2000]. However, the reward corresponding to an optimal policy is not unique, making it unclear if an IRL-learned reward is transferable to new transition laws in the sense that its optimal policy aligns with the optimal policy corresponding to the expert’s true reward. Past work has addressed this problem only under the assumption of full access to the expert’s policy, guaranteeing transferability when learning from two experts with the same reward but different transition laws that satisfy a specific rank condition [Rolland et al., 2022]. In this work, we show that the conditions developed under full access to the expert’s policy cannot guarantee transferability in the more practical scenario where we have access only to demonstrations of the expert. Instead of a binary rank condition, we propose principal angles as a more refined measure of similarity and dissimilarity between transition laws. Based on this, we then establish two key results: 1) a sufficient condition for transferability to any transition laws when learning from at least two experts with sufficiently different transition laws, and 2) a sufficient condition for transferability to local changes in the transition law when learning from a single expert. Furthermore, we also provide a probably approximately correct (PAC) algorithm and an end-to-end analysis for learning transferable rewards from demonstrations of multiple experts.

[LG-137] AI-based Classification of Customer Support Tickets: State of the Art and Implementation with AutoML

Link: https://arxiv.org/abs/2406.01789
Authors: Mario Truss, Stephan Boehm
Keywords: shortening resolution time, improve customer support, customer inquiries, crucial to improve, shortening resolution
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)

Abstract:Automation of support ticket classification is crucial for improving customer support performance and shortening resolution time for customer inquiries. This research aims to test the applicability of automated machine learning (AutoML) as a technology to train a machine learning model (ML model) that can classify support tickets. The model evaluation conducted in this research shows that AutoML can be used to train ML models with good classification performance. Moreover, this paper fills a research gap by providing new insights into developing AI solutions without a dedicated professional by utilizing AutoML, which makes this technology more accessible for companies without specialized AI departments and staff.

[LG-138] Multi-agent assignment via state augmented reinforcement learning

Link: https://arxiv.org/abs/2406.01782
Authors: Leopoldo Agorio, Sean Van Alen, Miguel Calvo-Fullana, Santiago Paternain, Juan Andres Bazerque
Keywords: constrained reinforcement learning, standard regularization techniques, reinforcement learning, emphasizing the inadequacy, multi-agent assignment problem
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: 12 pages, 3 figures, 6th Annual Conference on Learning for Dynamics and Control

Abstract:We address the conflicting requirements of a multi-agent assignment problem through constrained reinforcement learning, emphasizing the inadequacy of standard regularization techniques for this purpose. Instead, we resort to a state augmentation approach in which the oscillation of dual variables is exploited by agents to alternate between tasks. In addition, we coordinate the actions of the multiple agents acting on their local states through these multipliers, which are gossiped through a communication network, eliminating the need to access other agent states. By these means, we propose a distributed multi-agent assignment protocol with theoretical feasibility guarantees that we corroborate in a monitoring numerical experiment.

[LG-139] DEFT: Efficient Finetuning of Conditional Diffusion Models by Learning the Generalised h-transform

Link: https://arxiv.org/abs/2406.01781
Authors: Alexander Denker, Francisco Vargas, Shreyas Padhy, Kieran Didi, Simon Mathis, Vincent Dutordoir, Riccardo Barbano, Emile Mathieu, Urszula Julia Komorowska, Pietro Lio
Keywords: Generative modelling paradigms, modelling paradigms based, denoising diffusion processes, Generative modelling, inverse problems
Categories: Machine Learning (cs.LG)
Comments: arXiv admin note: text overlap with arXiv:2312.09236

Abstract:Generative modelling paradigms based on denoising diffusion processes have emerged as a leading candidate for conditional sampling in inverse problems. In many real-world applications, we often have access to large, expensively trained unconditional diffusion models, which we aim to exploit for improving conditional sampling. Most recent approaches are motivated heuristically and lack a unifying framework, obscuring connections between them. Further, they often suffer from issues such as being very sensitive to hyperparameters, being expensive to train or needing access to weights hidden behind a closed API. In this work, we unify conditional training and sampling using the mathematically well-understood Doob’s h-transform. This new perspective allows us to unify many existing methods under a common umbrella. Under this framework, we propose DEFT (Doob’s h-transform Efficient FineTuning), a new approach for conditional generation that simply fine-tunes a very small network to quickly learn the conditional h-transform, while keeping the larger unconditional network unchanged. DEFT is much faster than existing baselines while achieving state-of-the-art performance across a variety of linear and non-linear benchmarks. On image reconstruction tasks, we achieve speedups of up to $1.6\times$, while having the best perceptual quality on natural images and reconstruction performance on medical images.

[LG-140] Efficient Data Distribution Estimation for Accelerated Federated Learning

Link: https://arxiv.org/abs/2406.01774
Authors: Yuanli Wang, Lei Huang
Keywords: privacy-preserving machine learning, machine learning paradigm, Federated Learning, machine learning, learning paradigm
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)

Abstract: Federated Learning (FL) is a privacy-preserving machine learning paradigm where a global model is trained in-situ across a large number of distributed edge devices. These systems are often comprised of millions of user devices and only a subset of available devices can be used for training in each epoch. Designing a device selection strategy is challenging, given that devices are highly heterogeneous in both their system resources and training data. This heterogeneity makes device selection crucial for timely model convergence and sufficient model accuracy. To tackle the FL client heterogeneity problem, various client selection algorithms have been developed, showing promising performance improvement in terms of model coverage and accuracy. In this work, we study the overhead of client selection algorithms in a large-scale FL environment. Then we propose an efficient data distribution summary calculation algorithm to reduce the overhead in a real-world large-scale FL environment. The evaluation shows that our proposed solution could achieve up to 30x reduction in data summary time, and up to 360x reduction in clustering time.

[LG-141] How Does Gradient Descent Learn Features – A Local Analysis for Regularized Two-Layer Neural Networks

Link: https://arxiv.org/abs/2406.01766
Authors: Mo Zhou, Rong Ge
Keywords: feature learning, neural networks, major advantages, learning, neural
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract: The ability to learn useful features is one of the major advantages of neural networks. Although recent works show that neural networks can operate in a neural tangent kernel (NTK) regime that does not allow feature learning, many works also demonstrate the potential for neural networks to go beyond the NTK regime and perform feature learning. Recently, a line of work highlighted the feature learning capabilities of the early stages of gradient-based training. In this paper we consider another mechanism for feature learning via gradient descent through a local convergence analysis. We show that once the loss is below a certain threshold, gradient descent with a carefully regularized objective will capture ground-truth directions. Our results demonstrate that feature learning not only happens at the initial gradient steps, but can also occur towards the end of training.

[LG-142] Non-Asymptotic Analysis for Single-Loop (Natural) Actor-Critic with Compatible Function Approximation

Link: https://arxiv.org/abs/2406.01762
Authors: Yudan Wang, Yue Wang, Yi Zhou, Shaofeng Zou
Keywords: approximate gradient direction, reinforcement learning, Markovian sample trajectory, compatible function approximation, single Markovian sample
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: ICML 2024

Abstract: Actor-critic (AC) is a powerful method for learning an optimal policy in reinforcement learning, where the critic uses algorithms, e.g., temporal difference (TD) learning with function approximation, to evaluate the current policy and the actor updates the policy along an approximate gradient direction using information from the critic. This paper provides the tightest non-asymptotic convergence bounds for both the AC and natural AC (NAC) algorithms. Specifically, existing studies show that AC converges to an $\epsilon+\varepsilon_{\text{critic}}$ neighborhood of stationary points with the best known sample complexity of $\mathcal{O}(\epsilon^{-2})$ (up to a log factor), and NAC converges to an $\epsilon+\varepsilon_{\text{critic}}+\sqrt{\varepsilon_{\text{actor}}}$ neighborhood of the global optimum with the best known sample complexity of $\mathcal{O}(\epsilon^{-3})$, where $\varepsilon_{\text{critic}}$ is the approximation error of the critic and $\varepsilon_{\text{actor}}$ is the approximation error induced by the insufficient expressive power of the parameterized policy class. This paper analyzes the convergence of both AC and NAC algorithms with compatible function approximation. Our analysis eliminates the term $\varepsilon_{\text{critic}}$ from the error bounds while still achieving the best known sample complexities. Moreover, we focus on the challenging single-loop setting with a single Markovian sample trajectory. Our major technical novelty lies in analyzing the stochastic bias due to policy-dependent and time-varying compatible function approximation in the critic, and handling the non-ergodicity of the MDP due to the single Markovian sample trajectory. Numerical results are also provided in the appendix.

[LG-143] Position: Cracking the Code of Cascading Disparity Towards Marginalized Communities

Link: https://arxiv.org/abs/2406.01757
Authors: Golnoosh Farnadi, Mohammad Havaei, Negar Rostamzadeh
Keywords: holds immense promise, amplify existing risks, models holds immense, leaving marginalized communities, marginalized communities
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 14 pages, 1 figure

Abstract: The rise of foundation models holds immense promise for advancing AI, but this progress may amplify existing risks and inequalities, leaving marginalized communities behind. In this position paper, we argue that disparities towards marginalized communities (in performance, representation, privacy, robustness, interpretability and safety) are not isolated concerns but rather interconnected elements of a cascading disparity phenomenon. We contrast foundation models with traditional models and highlight the potential for exacerbated disparity against marginalized communities. Moreover, we emphasize the unique threat of cascading impacts in foundation models, where interconnected disparities can trigger long-lasting negative consequences, specifically to the people on the margin. We define marginalized communities within the machine learning context and explore the multifaceted nature of disparities. We analyze the sources of these disparities, tracing them from data creation, training and deployment procedures to highlight the complex technical and socio-technical landscape. To address this pressing crisis, we conclude with a set of calls to action to mitigate disparity at its source.

[LG-144] Sparser Better Deeper Stronger: Improving Sparse Training with Exact Orthogonal Initialization

Link: https://arxiv.org/abs/2406.01755
Authors: Aleksandra Irena Nowak, Łukasz Gniecki, Filip Szatkowski, Jacek Tabor
Keywords: achieving remarkable results, train sparse models, models from scratch, achieving remarkable, recent years
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: ICML 2024

Abstract: Static sparse training aims to train sparse models from scratch, achieving remarkable results in recent years. A key design choice is given by the sparse initialization, which determines the trainable sub-network through a binary mask. Existing methods mainly select such a mask based on a predefined dense initialization. Such an approach may not efficiently leverage the mask’s potential impact on the optimization. An alternative direction, inspired by research into dynamical isometry, is to introduce orthogonality in the sparse subnetwork, which helps in stabilizing the gradient signal. In this work, we propose Exact Orthogonal Initialization (EOI), a novel sparse orthogonal initialization scheme based on composing random Givens rotations. Contrary to other existing approaches, our method provides exact (not approximated) orthogonality and enables the creation of layers with arbitrary densities. We demonstrate the superior effectiveness and efficiency of EOI through experiments, consistently outperforming common sparse initialization techniques. Our method enables training highly sparse 1000-layer MLP and CNN networks without residual connections or normalization techniques, emphasizing the crucial role of weight initialization in static sparse training alongside sparse mask selection. The code is available at this https URL
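
Editor's sketch (not the authors' EOI code): the abstract says the scheme composes random Givens rotations. The snippet below illustrates only that general idea with NumPy. It starts from the identity and applies a few random plane rotations, which gives a matrix that is sparse and orthogonal up to floating-point error. The function name `random_givens_orthogonal`, the uniform sampling of index pairs and angles, and the rotation count are assumptions made for illustration; how EOI actually picks rotations to hit a target density is not described in the abstract.

```python
import numpy as np

def random_givens_orthogonal(n, num_rotations, seed=0):
    """Compose random Givens rotations, starting from the identity (illustrative only)."""
    rng = np.random.default_rng(seed)
    w = np.eye(n)
    for _ in range(num_rotations):
        # Pick two distinct rows and a random rotation angle (assumed sampling scheme).
        i, j = rng.choice(n, size=2, replace=False)
        theta = rng.uniform(0.0, 2.0 * np.pi)
        c, s = np.cos(theta), np.sin(theta)
        row_i, row_j = w[i].copy(), w[j].copy()
        # Left-multiplying by a Givens rotation only mixes rows i and j.
        w[i] = c * row_i - s * row_j
        w[j] = s * row_i + c * row_j
    return w

w = random_givens_orthogonal(n=8, num_rotations=6)
density = np.count_nonzero(w) / w.size
orthogonality_error = np.abs(w @ w.T - np.eye(8)).max()
```

Each rotation is itself orthogonal, so the product stays orthogonal no matter how many are applied; fewer rotations leave more zeros, which is the sense in which the number of rotations trades off against density in this toy version.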

[LG-145] Optimizing the Optimal Weighted Average: Efficient Distributed Sparse Classification

Link: https://arxiv.org/abs/2406.01753
Authors: Fred Lu, Ryan R. Curtin, Edward Raff, Francis Ferraro, James Holt
Keywords: increasingly large datasets, inter-machine communication costs, data dimensionality increases, optimizing linear models, large datasets
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (stat.ML)
Comments: Under review

Abstract:While distributed training is often viewed as a solution to optimizing linear models on increasingly large datasets, inter-machine communication costs of popular distributed approaches can dominate as data dimensionality increases. Recent work on non-interactive algorithms shows that approximate solutions for linear models can be obtained efficiently with only a single round of communication among machines. However, this approximation often degenerates as the number of machines increases. In this paper, building on the recent optimal weighted average method, we introduce a new technique, ACOWA, that allows an extra round of communication to achieve noticeably better approximation quality with minor runtime increases. Results show that for sparse distributed logistic regression, ACOWA obtains solutions that are more faithful to the empirical risk minimizer and attain substantially higher accuracy than other distributed algorithms.

[LG-146] Learning-to-Cache: Accelerating Diffusion Transformer via Layer Caching

Link: https://arxiv.org/abs/2406.01733
Authors: Xinyin Ma, Gongfan Fang, Michael Bi Mi, Xinchao Wang
Keywords: recently demonstrated unprecedented, demonstrated unprecedented generative, unprecedented generative capabilities, recently demonstrated, demonstrated unprecedented
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Code is available at this https URL

Abstract: Diffusion Transformers have recently demonstrated unprecedented generative capabilities for various tasks. The encouraging results, however, come at the cost of slow inference, since each denoising step requires inference on a transformer model with a large number of parameters. In this study, we make an interesting and somewhat surprising observation: the computation of a large proportion of layers in the diffusion transformer, through introducing a caching mechanism, can be readily removed even without updating the model parameters. In the case of U-ViT-H/2, for example, we may remove up to 93.68% of the computation in the cache steps (46.84% for all steps), with less than 0.01 drop in FID. To achieve this, we introduce a novel scheme, named Learning-to-Cache (L2C), that learns to conduct caching in a dynamic manner for diffusion transformers. Specifically, by leveraging the identical structure of layers in transformers and the sequential nature of diffusion, we explore redundant computations between timesteps by treating each layer as the fundamental unit for caching. To address the challenge of the exponential search space in deep models for identifying layers to cache and remove, we propose a novel differentiable optimization objective. An input-invariant yet timestep-variant router is then optimized, which can finally produce a static computation graph. Experimental results show that L2C largely outperforms samplers such as DDIM and DPM-Solver, alongside prior cache-based methods at the same inference speed.

[LG-147] Federated Learning-based Collaborative Wideband Spectrum Sensing and Scheduling for UAVs in UTM Systems

Link: https://arxiv.org/abs/2406.01727
Authors: Sravan Reddy Chintareddy, Keenan Roach, Kenny Cheung, Morteza Hashemi
Keywords: wideband spectrum sensing, opportunistically utilize detected, opportunistically utilize, collaborative wideband spectrum, unmanned aerial vehicles
Categories: Machine Learning (cs.LG); Multiagent Systems (cs.MA); Signal Processing (eess.SP)
Comments: This is a preprint version submitted to IEEE Transactions on Machine learning in Communications and Networking. arXiv admin note: text overlap with arXiv:2308.05036

Abstract: In this paper, we propose a data-driven framework for collaborative wideband spectrum sensing and scheduling for networked unmanned aerial vehicles (UAVs), which act as the secondary users (SUs) to opportunistically utilize detected “spectrum holes”. Our overall framework consists of three main stages. Firstly, in the model training stage, we explore dataset generation in a multi-cell environment and training a machine learning (ML) model using the federated learning (FL) architecture. Unlike the existing studies on FL for wireless that presume datasets are readily available for training, we propose a novel architecture that directly integrates wireless dataset generation, which involves capturing I/Q samples from over-the-air signals in a multi-cell environment, into the FL training process. Secondly, in the collaborative spectrum inference stage, we propose a collaborative spectrum fusion strategy that is compatible with the unmanned aircraft system traffic management (UTM) ecosystem. Finally, in the spectrum scheduling stage, we leverage reinforcement learning (RL) solutions to dynamically allocate the detected spectrum holes to the secondary users. To evaluate the proposed methods, we establish a comprehensive simulation framework that generates a near-realistic synthetic dataset using MATLAB LTE toolbox by incorporating base-station (BS) locations in a chosen area of interest, performing ray-tracing, and emulating the primary users’ channel usage in terms of I/Q samples. This evaluation methodology provides a flexible framework to generate large spectrum datasets that could be used for developing ML/AI-based spectrum management solutions for aerial devices.

[LG-148] Model for Peanuts: Hijacking ML Models without Training Access is Possible

Link: https://arxiv.org/abs/2406.01708
Authors: Mahmoud Ghorbel, Halima Bouzidi, Ioan Marius Bilasco, Ihsen Alouani
Keywords: Machine Learning, deployment of Machine, Model, Model hijacking, invasion of privacy
Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 17 pages, 14 figures, 7 tables

Abstract: The massive deployment of Machine Learning (ML) models has been accompanied by the emergence of several attacks that threaten their trustworthiness and raise ethical and societal concerns such as invasion of privacy, discrimination risks, and lack of accountability. Model hijacking is one of these attacks, where the adversary aims to hijack a victim model to execute a different task than its original one. Model hijacking can cause accountability and security risks since a hijacked model owner can be framed for having their model offer illegal or unethical services. Prior state-of-the-art works consider model hijacking as a training-time attack, whereby an adversary requires access to the ML model training to execute their attack. In this paper, we consider a stronger threat model where the attacker has no access to the training phase of the victim model. Our intuition is that ML models, typically over-parameterized, might (unintentionally) learn more than the task they are trained for. We propose a simple approach for model hijacking at inference time, named SnatchML, which classifies unknown input samples using distance measures in the latent space of the victim model to previously known samples associated with the hijacking task classes. SnatchML empirically shows that benign pre-trained models can execute tasks that are semantically related to the initial task. Surprisingly, this can be true even for hijacking tasks unrelated to the original task. We also explore different methods to mitigate this risk. We first propose a novel approach we call meta-unlearning, designed to help the model unlearn a potentially malicious task while training on the original task dataset. We also provide insights on over-parameterization as one possible inherent factor that makes model hijacking easier, and we accordingly propose a compression-based countermeasure against this attack.
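
Editor's sketch (not the SnatchML implementation): the abstract describes classifying inputs by their latent-space distance to known reference samples. The snippet below shows one simple reading of that idea, a nearest-centroid rule over feature vectors. The `latent` function is a hypothetical stand-in for a frozen model's feature extractor, and the toy data, the Euclidean distance, and the use of class centroids are all assumptions; the paper may use different distance measures or compare against individual samples.

```python
import numpy as np

def latent(x, weights):
    # Hypothetical stand-in for a frozen model's feature extractor.
    return np.tanh(x @ weights)

def nearest_centroid_predict(queries, ref_x, ref_y, weights):
    """Label each query with the class whose latent centroid is closest."""
    ref_z = latent(ref_x, weights)
    classes = np.unique(ref_y)
    # One centroid per class, computed from the labelled reference samples.
    centroids = np.stack([ref_z[ref_y == c].mean(axis=0) for c in classes])
    q = latent(queries, weights)
    # Pairwise Euclidean distances, shape (num_queries, num_classes).
    dists = np.linalg.norm(q[:, None, :] - centroids[None, :, :], axis=2)
    return classes[dists.argmin(axis=1)]

weights = np.eye(2)  # toy "model": identity projection followed by tanh
ref_x = np.array([[-2.0, -2.0], [-2.5, -1.5], [2.0, 2.0], [1.5, 2.5]])
ref_y = np.array([0, 0, 1, 1])
queries = np.array([[1.5, 1.8], [-1.0, -3.0]])
predictions = nearest_centroid_predict(queries, ref_x, ref_y, weights)
```

No parameter of the stand-in model is updated here; only its outputs are read, which matches the inference-time setting the abstract describes.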

[LG-149] Demystifying Platform Requirements for Diverse LLM Inference Use Cases

Link: https://arxiv.org/abs/2406.01698
Authors: Abhimanyu Bambhaniya, Ritik Raj, Geonhwa Jeong, Souvik Kundu, Sudarshan Srinivasan, Midhilesh Elavazhagan, Madhu Kumar, Tushar Krishna
Keywords: outperforming human experts, shown remarkable performance, human experts, shown remarkable, wide range
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments: 12 Pages, this https URL

Abstract: Large language models (LLMs) have shown remarkable performance across a wide range of applications, often outperforming human experts. However, deploying these parameter-heavy models efficiently for diverse inference use cases requires carefully designed hardware platforms with ample computing, memory, and network resources. With LLM deployment scenarios and models evolving at breakneck speed, the hardware requirements to meet SLOs remain an open research question. In this work, we present an analytical tool, GenZ, to study the relationship between LLM inference performance and various platform design parameters. Our analysis provides insights into configuring platforms for different LLM workloads and use cases. We quantify the platform requirements to support SOTA LLMs like LLaMA and GPT-4 under diverse serving settings. Furthermore, we project the hardware capabilities needed to enable future LLMs potentially exceeding hundreds of trillions of parameters. The trends and insights derived from GenZ can guide AI engineers deploying LLMs as well as computer architects designing next-generation hardware accelerators and platforms. Ultimately, this work sheds light on the platform design considerations for unlocking the full potential of large language models across a spectrum of applications. The source code is available at this https URL .

[LG-150] A Diffusion Model Framework for Unsupervised Neural Combinatorial Optimization

Link: https://arxiv.org/abs/2406.01661
Authors: Sebastian Sanokowski, Sepp Hochreiter, Sebastian Lehner
Keywords: including Combinatorial Optimization, exact sample likelihoods, Combinatorial Optimization, intractable distributions, distributions over discrete
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Discrete Mathematics (cs.DM); Machine Learning (stat.ML)
Comments: Accepted at ICML 2024

Abstract:Learning to sample from intractable distributions over discrete sets without relying on corresponding training data is a central problem in a wide range of fields, including Combinatorial Optimization. Currently, popular deep learning-based approaches rely primarily on generative models that yield exact sample likelihoods. This work introduces a method that lifts this restriction and opens the possibility to employ highly expressive latent variable models like diffusion models. Our approach is conceptually based on a loss that upper bounds the reverse Kullback-Leibler divergence and evades the requirement of exact sample likelihoods. We experimentally validate our approach in data-free Combinatorial Optimization and demonstrate that our method achieves a new state-of-the-art on a wide range of benchmark problems.

[LG-151] Self-Improving Robust Preference Optimization

Link: https://arxiv.org/abs/2406.01660
Authors: Eugene Choi, Arash Ahmadian, Matthieu Geist, Olivier Pietquin, Mohammad Gheshlaghi Azar
Keywords: offline RLHF methods, offline RLHF, Preference Optimization SRPO, extremely successful, successful in aligning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:

Abstract: Both online and offline RLHF methods such as PPO and DPO have been extremely successful in aligning AI with human preferences. Despite their success, the existing methods suffer from a fundamental problem: their optimal solution is highly task-dependent (i.e., not robust to out-of-distribution (OOD) tasks). Here we address this challenge by proposing Self-Improving Robust Preference Optimization (SRPO), a practical and mathematically principled offline RLHF framework that is completely robust to changes in the task. The key idea of SRPO is to cast the problem of learning from human preferences as a self-improvement process, which can be mathematically expressed in terms of a min-max objective that aims at joint optimization of the self-improvement policy and the generative policy in an adversarial fashion. The solution for this optimization problem is independent of the training task and thus it is robust to its changes. We then show that this objective can be re-expressed in the form of a non-adversarial offline loss which can be optimized using standard supervised optimization techniques at scale without any need for a reward model or online inference. We show the effectiveness of SRPO in terms of AI Win-Rate (WR) against human (GOLD) completions. In particular, when SRPO is evaluated on the OOD XSUM dataset, it outperforms the celebrated DPO by a clear margin of 15% after 5 self-revisions, achieving WR of 90%.

[LG-152] TinySV: Speaker Verification in TinyML with On-device Learning

Link: https://arxiv.org/abs/2406.01655
Authors: Massimo Pavan, Gioele Mombelli, Francesco Sinacori, Manuel Roveri
Keywords: gained huge momentum, execute machine learning, Tiny Speaker Verification, TinyML learning algorithms, learning algorithms
Categories: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments:

Abstract: TinyML is a novel area of machine learning that gained huge momentum in the last few years thanks to the ability to execute machine learning algorithms on tiny devices (such as Internet-of-Things or embedded systems). Interestingly, research in this area focused on the efficient execution of the inference phase of TinyML models on tiny devices, while very few solutions for on-device learning of TinyML models are available in the literature due to the relevant overhead introduced by the learning algorithms. The aim of this paper is to introduce a new type of adaptive TinyML solution that can be used in tasks, such as the presented Tiny Speaker Verification (TinySV), that must be tackled with an on-device learning algorithm. Achieving this goal required (i) reducing the memory and computational demand of TinyML learning algorithms, and (ii) designing a TinyML learning algorithm operating with few and possibly unlabelled training data. The proposed TinySV solution relies on a two-layer hierarchical TinyML solution comprising Keyword Spotting and Adaptive Speaker Verification modules. We evaluated the effectiveness and efficiency of the proposed TinySV solution on a dataset collected expressly for the task and tested the proposed solution on a real-world IoT device (Infineon PSoC 62S2 Wi-Fi BT Pioneer Kit).

[LG-153] CoLa-DCE – Concept-guided Latent Diffusion Counterfactual Explanations

Link: https://arxiv.org/abs/2406.01649
Authors: Franz Motzkus, Christian Hellert, Ute Schmid
Keywords: Recent advancements, practical implementations, advancements in generative, introduced novel prospects, prospects and practical
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME)
Comments:

Abstract:Recent advancements in generative AI have introduced novel prospects and practical implementations. Especially diffusion models show their strength in generating diverse and, at the same time, realistic features, positioning them well for generating counterfactual explanations for computer vision models. Answering “what if” questions of what needs to change to make an image classifier change its prediction, counterfactual explanations align well with human understanding and consequently help in making model behavior more comprehensible. Current methods succeed in generating authentic counterfactuals, but lack transparency as feature changes are not directly perceivable. To address this limitation, we introduce Concept-guided Latent Diffusion Counterfactual Explanations (CoLa-DCE). CoLa-DCE generates concept-guided counterfactuals for any classifier with a high degree of control regarding concept selection and spatial conditioning. The counterfactuals comprise an increased granularity through minimal feature changes. The reference feature visualization ensures better comprehensibility, while the feature localization provides increased transparency of “where” changed “what”. We demonstrate the advantages of our approach in minimality and comprehensibility across multiple image classification models and datasets and provide insights into how our CoLa-DCE explanations help comprehend model errors like misclassification cases.

[LG-154] An Analysis under a Unified Formulation of Learning Algorithms with Output Constraints

Link: https://arxiv.org/abs/2406.01647
Authors: Mooho Song, Jay-Yoon Lee
Keywords: Neural networks, produce nonsensical results, produce nonsensical, nonsensical results, Neural
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract: Neural networks (NN) perform well in diverse tasks, but sometimes produce nonsensical results to humans. Most NN models “solely” learn from (input, output) pairs, occasionally conflicting with human knowledge. Many studies indicate injecting human knowledge by reducing output constraints during training can improve model performance and reduce constraint violations. While there have been several attempts to compare different existing algorithms under the same programming framework, nonetheless, there has been no previous work that categorizes learning algorithms with output constraints in a unified manner. Our contributions are as follows: (1) We categorize the previous studies based on three axes: type of constraint loss used (e.g. probabilistic soft logic, REINFORCE), exploration strategy of constraint-violating examples, and integration mechanism of learning signals from main task and constraint. (2) We propose new algorithms to integrate the information of main task and constraint injection, inspired by continual-learning algorithms. (3) Furthermore, we propose the $H\beta$-score as a metric for considering the main task metric and constraint violation simultaneously. To provide a thorough analysis, we examine all the algorithms on three NLP tasks: natural language inference (NLI), synthetic transduction examples (STE), and semantic role labeling (SRL). We explore and reveal the key factors of various algorithms associated with achieving high $H\beta$-scores.

[LG-155] iKAN: Global Incremental Learning with KAN for Human Activity Recognition Across Heterogeneous Datasets

Link: https://arxiv.org/abs/2406.01646
Authors: Mengxi Liu, Sizhen Bian, Bo Zhou, Paul Lukowicz
Keywords: human activity recognition, wearable sensor human, sensor human activity, activity recognition, challenges simultaneously
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Comments: This work is submitted to Ubicomp/ISWC24 and is under review

Abstract:This work proposes an incremental learning (IL) framework for wearable sensor human activity recognition (HAR) that tackles two challenges simultaneously: catastrophic forgetting and non-uniform inputs. The scalable framework, iKAN, pioneers IL with Kolmogorov-Arnold Networks (KAN) to replace multi-layer perceptrons as the classifier that leverages the local plasticity and global stability of splines. To adapt KAN for HAR, iKAN uses task-specific feature branches and a feature redistribution layer. Unlike existing IL methods that primarily adjust the output dimension or the number of classifier nodes to adapt to new tasks, iKAN focuses on expanding the feature extraction branches to accommodate new inputs from different sensor modalities while maintaining consistent dimensions and the number of classifier outputs. Continual learning across six public HAR datasets demonstrated the iKAN framework’s incremental learning performance, with a last performance of 84.9% (weighted F1 score) and an average incremental performance of 81.34%, which significantly outperforms the two existing incremental learning methods, such as EWC (51.42%) and experience replay (59.92%).

[LG-156] FNP: Fourier Neural Processes for Arbitrary-Resolution Data Assimilation

Link: https://arxiv.org/abs/2406.01645
Authors: Kun Chen, Tao Chen, Peng Ye, Hao Chen, Kang Chen, Tao Han, Wanli Ouyang, Lei Bai
Keywords: modern global medium-range, global medium-range weather, medium-range weather forecasting, weather forecasting systems, AI-based data assimilation
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract: Data assimilation is a vital component in modern global medium-range weather forecasting systems to obtain the best estimation of the atmospheric state by combining the short-term forecast and observations. Recently, AI-based data assimilation approaches have attracted increasing attention for their significant advantages over traditional techniques in terms of computational consumption. However, existing AI-based data assimilation methods can only handle observations with a specific resolution, lacking the compatibility and generalization ability to assimilate observations with other resolutions. Considering that complex real-world observations often have different resolutions, we propose the Fourier Neural Processes (FNP) for arbitrary-resolution data assimilation in this paper. Leveraging the efficiency of the designed modules and flexible structure of neural processes, FNP achieves state-of-the-art results in assimilating observations with varying resolutions, and also exhibits increasing advantages over the counterparts as the resolution and the amount of observations increase. Moreover, our FNP trained on a fixed resolution can directly handle the assimilation of observations with out-of-distribution resolutions and the observational information reconstruction task without additional fine-tuning, demonstrating its excellent generalization ability across data resolutions as well as across tasks.

[LG-157] TimeCMA: Towards LLM-Empowered Time Series Forecasting via Cross-Modality Alignment

Link: https://arxiv.org/abs/2406.01638
Authors: Chenxi Liu, Qianxiong Xu, Hao Miao, Sun Yang, Lingzheng Zhang, Cheng Long, Ziyue Li, Rui Zhao
Keywords: scalable mobile sensing, time series, series, time, time series forecasting
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract: The widespread adoption of scalable mobile sensing has led to large amounts of time series data for real-world applications. A fundamental application is multivariate time series forecasting (MTSF), which aims to predict future time series values based on historical observations. Existing MTSF methods suffer from limited parameterization and small-scale training data. Recently, large language models (LLMs) have been introduced in time series, which achieve promising forecasting performance but incur heavy computational costs. To solve these challenges, we propose TimeCMA, an LLM-empowered framework for time series forecasting with cross-modality alignment. We design a dual-modality encoding module with two branches, where the time series encoding branch extracts relatively low-quality yet pure embeddings of time series through an inverted Transformer. In addition, the LLM-empowered encoding branch wraps the same time series as prompts to obtain high-quality yet entangled prompt embeddings via a pre-trained LLM. Then, we design a cross-modality alignment module to retrieve high-quality and pure time series embeddings from the prompt embeddings. Moreover, we develop a time series forecasting module to decode the aligned embeddings while capturing dependencies among multiple variables for forecasting. Notably, we tailor the prompt to encode sufficient temporal information into the last token and design the last token embedding storage to reduce computational costs. Extensive experiments on real data offer insight into the accuracy and efficiency of the proposed framework.

[LG-158] On Overcoming Miscalibrated Conversational Priors in LLM-based Chatbots

Link: https://arxiv.org/abs/2406.01633
Authors: Christine Herlihy,Jennifer Neville,Tobias Schnabel,Adith Swaminathan
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Notes: Preprint of UAI’24 conference publication

[LG-159] An LLM-based Recommender System Environment

Link: https://arxiv.org/abs/2406.01631
Authors: Nathan Corecco,Giorgio Piatti,Luca A. Lanzendörfer,Flint Xiaofeng Fan,Roger Wattenhofer
Keywords: discovering relevant content, optimize long-term rewards, Reinforcement learning, recommender systems due, recommender systems
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)

Abstract:Reinforcement learning (RL) has gained popularity in the realm of recommender systems due to its ability to optimize long-term rewards and guide users in discovering relevant content. However, the successful implementation of RL in recommender systems is challenging because of several factors, including the limited availability of online data for training on-policy methods. This scarcity requires expensive human interaction for online model training. Furthermore, the development of effective evaluation frameworks that accurately reflect the quality of models remains a fundamental challenge in recommender systems. To address these challenges, we propose a comprehensive framework for synthetic environments that simulate human behavior by harnessing the capabilities of large language models (LLMs). We complement our framework with in-depth ablation studies and demonstrate its effectiveness with experiments on movie and book recommendations. By utilizing LLMs as synthetic users, this work introduces a modular and novel framework for training RL-based recommender systems. The software, including the RL environment, is publicly available.

[LG-160] System-2 Recommenders: Disentangling Utility and Engagement in Recommendation Systems via Temporal Point-Processes

Link: https://arxiv.org/abs/2406.01611
Authors: Arpit Agarwal,Nicolas Usunier,Alessandro Lazaric,Maximilian Nickel
Keywords: modern human experience, important part, modern human, human experience, experience whose influence
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG); Machine Learning (stat.ML)
Notes: Accepted at FAccT’24

Abstract:Recommender systems are an important part of the modern human experience whose influence ranges from the food we eat to the news we read. Yet, there is still debate as to what extent recommendation platforms are aligned with the user goals. A core issue fueling this debate is the challenge of inferring a user utility based on engagement signals such as likes, shares, watch time etc., which are the primary metric used by platforms to optimize content. This is because users utility-driven decision-processes (which we refer to as System-2), e.g., reading news that are relevant for them, are often confounded by their impulsive decision-processes (which we refer to as System-1), e.g., spend time on click-bait news. As a result, it is difficult to infer whether an observed engagement is utility-driven or impulse-driven. In this paper we explore a new approach to recommender systems where we infer user utility based on their return probability to the platform rather than engagement signals. Our intuition is that users tend to return to a platform in the long run if it creates utility for them, while pure engagement-driven interactions that do not add utility, may affect user return in the short term but will not have a lasting effect. We propose a generative model in which past content interactions impact the arrival rates of users based on a self-exciting Hawkes process. These arrival rates to the platform are a combination of both System-1 and System-2 decision processes. The System-2 arrival intensity depends on the utility and has a long lasting effect, while the System-1 intensity depends on the instantaneous gratification and tends to vanish rapidly. We show analytically that given samples it is possible to disentangle System-1 and System-2 and allow content optimization based on user utility. We conduct experiments on synthetic data to demonstrate the effectiveness of our approach.
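
To make the two-timescale idea in this abstract concrete, here is a minimal sketch of a self-exciting arrival intensity with one fast-decaying and one slow-decaying exponential kernel. It is an illustration under assumptions, not the authors' model: the function name `arrival_intensity`, the exponential kernel shape, and every parameter value are hypothetical choices made for this example.

```python
import math

def arrival_intensity(t, interactions, base_rate=0.05,
                      s1_weight=1.0, s1_decay=5.0,
                      s2_weight=0.2, s2_decay=0.05):
    """Toy Hawkes-style arrival intensity at time t.

    interactions: list of (time, gratification, utility) tuples describing
    past content interactions. All weights and decay rates are hypothetical.
    """
    system1 = 0.0  # impulse-driven excitation, vanishes quickly
    system2 = 0.0  # utility-driven excitation, long-lasting
    for t_i, gratification, utility in interactions:
        if t_i >= t:
            continue  # only past interactions excite the process
        lag = t - t_i
        system1 += s1_weight * gratification * math.exp(-s1_decay * lag)
        system2 += s2_weight * utility * math.exp(-s2_decay * lag)
    return base_rate + system1 + system2

# One purely gratifying interaction versus one purely useful interaction.
clickbait = [(0.0, 1.0, 0.0)]
useful = [(0.0, 0.0, 1.0)]
for horizon in (0.1, 1.0, 10.0):
    print(horizon,
          arrival_intensity(horizon, clickbait),
          arrival_intensity(horizon, useful))
```

With these assumed parameters, the click-bait interaction raises the return rate more at short horizons, while only the useful interaction still has a visible effect at long horizons. That asymmetry is what makes the two components separable in principle.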

[LG-161] Judgement Citation Retrieval using Contextual Similarity

Link: https://arxiv.org/abs/2406.01609
Authors: Akshat Mohan Dasula,Hrushitha Tigulla,Preethika Bhukya
Keywords: demanded manual effort, keyword-based search applications, understanding legal jargon, Legal case descriptions, intricate case descriptions
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Notes: 14 pages, 16 images, Submitted to Multimedia Tools and Applications Springer journal

Abstract:Traditionally in the domain of legal research, the retrieval of pertinent citations from intricate case descriptions has demanded manual effort and keyword-based search applications that mandate expertise in understanding legal jargon. Legal case descriptions hold pivotal information for legal professionals and researchers, necessitating more efficient and automated approaches. We propose a methodology that combines natural language processing (NLP) and machine learning techniques to enhance the organization and utilization of legal case descriptions. This approach revolves around the creation of textual embeddings with the help of state-of-art embedding models. Our methodology addresses two primary objectives: unsupervised clustering and supervised citation retrieval, both designed to automate the citation extraction process. Although the proposed methodology can be used for any dataset, we employed the Supreme Court of The United States (SCOTUS) dataset, yielding remarkable results. Our methodology achieved an impressive accuracy rate of 90.9%. By automating labor-intensive processes, we pave the way for a more efficient, time-saving, and accessible landscape in legal research, benefiting legal professionals, academics, and researchers.

[LG-162] Privacy-preserving recommender system using the data collaboration analysis for distributed datasets

Link: https://arxiv.org/abs/2406.01603
Authors: Tomoya Yanagi,Shunnosuke Ikeda,Noriyoshi Sukegawa,Yuichi Takano
Keywords: provide high-quality recommendations, multiple datasets held, integrate multiple datasets, recommendations for users, order to provide
Categories: Information Retrieval (cs.IR); Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Abstract:In order to provide high-quality recommendations for users, it is desirable to share and integrate multiple datasets held by different parties. However, when sharing such distributed datasets, we need to protect personal and confidential information contained in the datasets. To this end, we establish a framework for privacy-preserving recommender systems using the data collaboration analysis of distributed datasets. Numerical experiments with two public rating datasets demonstrate that our privacy-preserving method for rating prediction can improve the prediction accuracy for distributed datasets. This study opens up new possibilities for privacy-preserving techniques in recommender systems.

[LG-163] Backpropogation-Free Multi-modal On-Device Model Adaptation via Cloud-Device Collaboration

Link: https://arxiv.org/abs/2406.01601
Authors: Wei Ji,Li Li,Zheqi Lv,Wenqiao Zhang,Mengze Li,Zhen Wan,Wenqiang Lei,Roger Zimmermann
Keywords: personalized device-aware services, amass copious personalized, increasingly interconnected world, continually amass copious, copious personalized multi-modal
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:In our increasingly interconnected world, where intelligent devices continually amass copious personalized multi-modal data, a pressing need arises to deliver high-quality, personalized device-aware services. However, this endeavor presents a multifaceted challenge to prevailing artificial intelligence (AI) systems primarily rooted in the cloud. As these systems grapple with shifting data distributions between the cloud and devices, the traditional approach of fine-tuning-based adaptation (FTA) suffers from the following issues: the costly and time-consuming data annotation required by FTA and the looming risk of model overfitting. To surmount these challenges, we introduce a Universal On-Device Multi-modal Model Adaptation Framework, revolutionizing on-device model adaptation by striking a balance between efficiency and effectiveness. The framework features the Fast Domain Adaptor (FDA) hosted in the cloud, providing tailored parameters for the Lightweight Multi-modal Model on devices. To enhance adaptability across multi-modal tasks, the AnchorFrame Distribution Reasoner (ADR) minimizes communication costs. Our contributions, encapsulated in the Cloud-Device Collaboration Multi-modal Parameter Generation (CDC-MMPG) framework, represent a pioneering solution for on-Device Multi-modal Model Adaptation (DMMA). Extensive experiments validate the efficiency and effectiveness of our method, particularly in video question answering and retrieval tasks, driving forward the integration of intelligent devices into our daily lives.

[LG-164] EditWorld: Simulating World Dynamics for Instruction-Following Image Editing

Link: https://arxiv.org/abs/2405.14785
Authors: Ling Yang,Bohan Zeng,Jiaming Liu,Hong Li,Minghao Xu,Wentao Zhang,Shuicheng Yan
Keywords: Diffusion models, improved the performance, image editing, editing, Diffusion
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: Project: this https URL

Abstract:Diffusion models have significantly improved the performance of image editing. Existing methods realize various approaches to achieve high-quality image editing, including but not limited to text control, dragging operation, and mask-and-inpainting. Among these, instruction-based editing stands out for its convenience and effectiveness in following human instructions across diverse scenarios. However, it still focuses on simple editing operations like adding, replacing, or deleting, and falls short of understanding aspects of world dynamics that convey the realistic dynamic nature in the physical world. Therefore, this work, EditWorld, introduces a new editing task, namely world-instructed image editing, which defines and categorizes the instructions grounded by various world scenarios. We curate a new image editing dataset with world instructions using a set of large pretrained models (e.g., GPT-3.5, Video-LLava and SDXL). To enable sufficient simulation of world dynamics for image editing, our EditWorld trains model in the curated dataset, and improves instruction-following ability with designed post-edit strategy. Extensive experiments demonstrate our method significantly outperforms existing editing methods in this new task. Our dataset and code will be available at this https URL

[LG-165] RealCompo: Balancing Realism and Compositionality Improves Text-to-Image Diffusion Models

Link: https://arxiv.org/abs/2402.12908
Authors: Xinchen Zhang,Ling Yang,Yaqi Cai,Zhaochen Yu,Kai-Ni Wang,Jiake Xie,Ye Tian,Minkai Xu,Yong Tang,Yujiu Yang,Bin Cui
Keywords: achieved remarkable advancements, image diffusion models, spatial-aware image diffusion, Diffusion models, image diffusion
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: Project: this https URL

Abstract:Diffusion models have achieved remarkable advancements in text-to-image generation. However, existing models still have many difficulties when faced with multiple-object compositional generation. In this paper, we propose RealCompo, a new training-free and transferred-friendly text-to-image generation framework, which aims to leverage the respective advantages of text-to-image models and spatial-aware image diffusion models (e.g., layout, keypoints and segmentation maps) to enhance both realism and compositionality of the generated images. An intuitive and novel balancer is proposed to dynamically balance the strengths of the two models in denoising process, allowing plug-and-play use of any model without extra training. Extensive experiments show that our RealCompo consistently outperforms state-of-the-art text-to-image models and spatial-aware image diffusion models in multiple-object compositional generation while keeping satisfactory realism and compositionality of the generated images. Notably, our RealCompo can be seamlessly extended with a wide range of spatial-aware image diffusion models and stylized diffusion models. Our code is available at: this https URL

[LG-166] Mastering Text-to-Image Diffusion: Recaptioning Planning and Generating with Multimodal LLMs

Link: https://arxiv.org/abs/2401.11708
Authors: Ling Yang,Zhaochen Yu,Chenlin Meng,Minkai Xu,Stefano Ermon,Bin Cui
Keywords: exhibit exceptional performance, exceptional performance, Diffusion models, Plan and Generate, RPG
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: ICML 2024. Project: this https URL

Abstract:Diffusion models have exhibited exceptional performance in text-to-image generation and editing. However, existing methods often face challenges when handling complex text prompts that involve multiple objects with multiple attributes and relationships. In this paper, we propose a brand new training-free text-to-image generation/editing framework, namely Recaption, Plan and Generate (RPG), harnessing the powerful chain-of-thought reasoning ability of multimodal LLMs to enhance the compositionality of text-to-image diffusion models. Our approach employs the MLLM as a global planner to decompose the process of generating complex images into multiple simpler generation tasks within subregions. We propose complementary regional diffusion to enable region-wise compositional generation. Furthermore, we integrate text-guided image generation and editing within the proposed RPG in a closed-loop fashion, thereby enhancing generalization ability. Extensive experiments demonstrate our RPG outperforms state-of-the-art text-to-image diffusion models, including DALL-E 3 and SDXL, particularly in multi-category object composition and text-image semantic alignment. Notably, our RPG framework exhibits wide compatibility with various MLLM architectures (e.g., MiniGPT-4) and diffusion backbones (e.g., ControlNet). Our code is available at: this https URL

[LG-167] Improving Diffusion-Based Image Synthesis with Context Prediction

Link: https://arxiv.org/abs/2401.02015
Authors: Ling Yang,Jingwei Liu,Shenda Hong,Zhilong Zhang,Zhilin Huang,Zheming Cai,Wentao Zhang,Bin Cui
Keywords: dramatically promoted image, quality and diversity, class of generative, dramatically promoted, unprecedented quality
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: Accepted by NeurIPS 2023

Abstract:Diffusion models are a new class of generative models, and have dramatically promoted image generation with unprecedented quality and diversity. Existing diffusion models mainly try to reconstruct input image from a corrupted one with a pixel-wise or feature-wise constraint along spatial axes. However, such point-based reconstruction may fail to make each predicted pixel/feature fully preserve its neighborhood context, impairing diffusion-based image synthesis. As a powerful source of automatic supervisory signal, context has been well studied for learning representations. Inspired by this, we for the first time propose ConPreDiff to improve diffusion-based image synthesis with context prediction. We explicitly reinforce each point to predict its neighborhood context (i.e., multi-stride features/tokens/pixels) with a context decoder at the end of diffusion denoising blocks in training stage, and remove the decoder for inference. In this way, each point can better reconstruct itself by preserving its semantic connections with neighborhood context. This new paradigm of ConPreDiff can generalize to arbitrary discrete and continuous diffusion backbones without introducing extra parameters in sampling procedure. Extensive experiments are conducted on unconditional image generation, text-to-image generation and image inpainting tasks. Our ConPreDiff consistently outperforms previous methods and achieves new SOTA text-to-image generation results on MS-COCO, with a zero-shot FID score of 6.21.

[LG-168] Differentially Private Densest Subgraph Detection

Link: https://arxiv.org/abs/2105.13287
Authors: Dung Nguyen,Anil Vullikanti
Keywords: Densest subgraph detection, Densest subgraph, fundamental graph mining, densest subgraph problem, Densest
Categories: Data Structures and Algorithms (cs.DS); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Notes: Accepted by ICML 2021

Abstract:Densest subgraph detection is a fundamental graph mining problem, with a large number of applications. There has been a lot of work on efficient algorithms for finding the densest subgraph in massive networks. However, in many domains, the network is private, and returning a densest subgraph can reveal information about the network. Differential privacy is a powerful framework to handle such settings. We study the densest subgraph problem in the edge privacy model, in which the edges of the graph are private. We present the first sequential and parallel differentially private algorithms for this problem. We show that our algorithms have an additive approximation guarantee. We evaluate our algorithms on a large number of real-world networks, and observe a good privacy-accuracy tradeoff when the network has high density.
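
For readers unfamiliar with the underlying problem, the sketch below shows the classic non-private greedy peeling heuristic (Charikar's 2-approximation), which repeatedly removes a minimum-degree node and keeps the densest intermediate subgraph, with density measured as edges divided by nodes. This is background for the problem statement only. It adds no privacy noise and should not be read as the differentially private algorithms the paper presents. The function name `greedy_densest_subgraph` and the example graph are assumptions for this example.

```python
from collections import defaultdict

def greedy_densest_subgraph(edges):
    """Non-private greedy peeling; returns (node_set, density = |E| / |V|)."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)
    nodes = set(adj)
    num_edges = sum(len(nbrs) for nbrs in adj.values()) // 2
    best_nodes, best_density = set(nodes), 0.0
    while nodes:
        density = num_edges / len(nodes)
        if density > best_density:
            best_nodes, best_density = set(nodes), density
        # Peel a node of minimum remaining degree (linear scan for simplicity).
        v = min(nodes, key=lambda n: len(adj[n]))
        for w in adj[v]:
            adj[w].discard(v)
        num_edges -= len(adj[v])
        nodes.remove(v)
        del adj[v]
    return best_nodes, best_density

# A 4-clique on nodes 0..3 with a short path 3-4-5 attached.
example_edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]
dense_nodes, dense_value = greedy_densest_subgraph(example_edges)
print(sorted(dense_nodes), dense_value)
```

The linear scan for the minimum-degree node keeps the sketch short at the cost of quadratic time; a bucket queue is the usual choice for massive networks.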

[LG-169] Enhancing predictive imaging biomarker discovery through treatment effect analysis

Link: https://arxiv.org/abs/2406.02534
Authors: Shuhan Xiao,Lukas Klein,Jens Petersen,Philipp Vollmuth,Paul F. Jaeger,Klaus H. Maier-Hein
Keywords: individual treatment effectiveness, Identifying predictive biomarkers, forecast individual treatment, Identifying predictive, predictive imaging biomarkers
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Notes: 19 pages, 12 figures

Abstract:Identifying predictive biomarkers, which forecast individual treatment effectiveness, is crucial for personalized medicine and informs decision-making across diverse disciplines. These biomarkers are extracted from pre-treatment data, often within randomized controlled trials, and have to be distinguished from prognostic biomarkers, which are independent of treatment assignment. Our study focuses on the discovery of predictive imaging biomarkers, aiming to leverage pre-treatment images to unveil new causal relationships. Previous approaches relied on labor-intensive handcrafted or manually derived features, which may introduce biases. In response, we present a new task of discovering predictive imaging biomarkers directly from the pre-treatment images to learn relevant image features. We propose an evaluation protocol for this task to assess a model’s ability to identify predictive imaging biomarkers and differentiate them from prognostic ones. It employs statistical testing and a comprehensive analysis of image feature attribution. We explore the suitability of deep learning models originally designed for estimating the conditional average treatment effect (CATE) for this task, which previously have been primarily assessed for the precision of CATE estimation, overlooking the evaluation of imaging biomarker discovery. Our proof-of-concept analysis demonstrates promising results in discovering and validating predictive imaging biomarkers from synthetic outcomes and real-world image datasets.

[LG-170] ReLUs Are Sufficient for Learning Implicit Neural Representations

Link: https://arxiv.org/abs/2406.02529
Authors: Joseph Shenouda,Yamin Zhou,Robert D. Nowak
Keywords: Rectified Linear Unit, Linear Unit, Rectified Linear, learning implicit neural, employ the Rectified
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Notes: Accepted to ICML 2024

Abstract:Motivated by the growing theoretical understanding of neural networks that employ the Rectified Linear Unit (ReLU) as their activation function, we revisit the use of ReLU activation functions for learning implicit neural representations (INRs). Inspired by second order B-spline wavelets, we incorporate a set of simple constraints to the ReLU neurons in each layer of a deep neural network (DNN) to remedy the spectral bias. This in turn enables its use for various INR tasks. Empirically, we demonstrate that, contrary to popular belief, one can learn state-of-the-art INRs based on a DNN composed of only ReLU neurons. Next, by leveraging recent theoretical works which characterize the kinds of functions ReLU neural networks learn, we provide a way to quantify the regularity of the learned function. This offers a principled approach to selecting the hyperparameters in INR architectures. We substantiate our claims through experiments in signal representation, super resolution, and computed tomography, demonstrating the versatility and effectiveness of our method. The code for all experiments can be found at this https URL.

[LG-171] Inpainting Pathology in Lumbar Spine MRI with Latent Diffusion

Link: https://arxiv.org/abs/2406.02477
Authors: Colin Hansen,Simas Glinskis,Ashwin Raju,Micha Kornreich,JinHyeong Park,Jayashri Pawar,Richard Herzog,Li Zhang,Benjamin Odry
Keywords: imbalanced datasets due, expert annotations, automated diagnosis, diagnosis in radiology, radiology suffer
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:Data driven models for automated diagnosis in radiology suffer from insufficient and imbalanced datasets due to low representation of pathology in a population and the cost of expert annotations. Datasets can be bolstered through data augmentation. However, even when utilizing a full suite of transformations during model training, typical data augmentations do not address variations in human anatomy. An alternative direction is to synthesize data using generative models, which can potentially craft datasets with specific attributes. While this holds promise, commonly used generative models such as Generative Adversarial Networks may inadvertently produce anatomically inaccurate features. On the other hand, diffusion models, which offer greater stability, tend to memorize training data, raising concerns about privacy and generative diversity. Alternatively, inpainting has the potential to augment data through directly inserting pathology in medical images. However, this approach introduces a new challenge: accurately merging the generated pathological features with the surrounding anatomical context. While inpainting is a well established method for addressing simple lesions, its application to pathologies that involve complex structural changes remains relatively unexplored. We propose an efficient method for inpainting pathological features onto healthy anatomy in MRI through voxelwise noise scheduling in a latent diffusion model. We evaluate the method’s ability to insert disc herniation and central canal stenosis in lumbar spine sagittal T2 MRI, and it achieves superior Frechet Inception Distance compared to state-of-the-art methods.

[LG-172] Meta-Designing Quantum Experiments with Language Models

Link: https://arxiv.org/abs/2406.02470
Authors: Sören Arlt,Haonan Duan,Felix Li,Sang Michael Xie,Yuhuai Wu,Mario Krenn
Keywords: Artificial Intelligence, advance scientific discovery, significantly advance scientific, human capabilities, potential to significantly
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Notes: 10+3 pages, 5 figures

Abstract:Artificial Intelligence (AI) has the potential to significantly advance scientific discovery by finding solutions beyond human capabilities. However, these super-human solutions are often unintuitive and require considerable effort to uncover underlying principles, if possible at all. Here, we show how a code-generating language model trained on synthetic data can not only find solutions to specific problems but can create meta-solutions, which solve an entire class of problems in one shot and simultaneously offer insight into the underlying design principles. Specifically, for the design of new quantum physics experiments, our sequence-to-sequence transformer architecture generates interpretable Python code that describes experimental blueprints for a whole class of quantum systems. We discover general and previously unknown design rules for infinitely large classes of quantum states. The ability to automatically generate generalized patterns in readable computer code is a crucial step toward machines that help discover new scientific understanding – one of the central aims of physics.

[LG-173] Machine learning Hubbard parameters with equivariant neural networks

Link: https://arxiv.org/abs/2406.02457
Authors: Martin Uhrin,Austin Zadoks,Luca Binci,Nicola Marzari,Iurii Timrov
Keywords: accurately describe complex, extended Hubbard functionals, describe complex materials, rare-earth elements, framework to accurately
Categories: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)

Abstract:Density-functional theory with extended Hubbard functionals (DFT+ U + V ) provides a robust framework to accurately describe complex materials containing transition-metal or rare-earth elements. It does so by mitigating self-interaction errors inherent to semi-local functionals which are particularly pronounced in systems with partially-filled d and f electronic states. However, achieving accuracy in this approach hinges upon the accurate determination of the on-site U and inter-site V Hubbard parameters. In practice, these are obtained either by semi-empirical tuning, requiring prior knowledge, or, more correctly, by using predictive but expensive first-principles calculations. Here, we present a machine learning model based on equivariant neural networks which uses atomic occupation matrices as descriptors, directly capturing the electronic structure, local chemical environment, and oxidation states of the system at hand. We target here the prediction of Hubbard parameters computed self-consistently with iterative linear-response calculations, as implemented in density-functional perturbation theory (DFPT), and structural relaxations. Remarkably, when trained on data from 11 materials spanning various crystal structures and compositions, our model achieves mean absolute relative errors of 3% and 5% for Hubbard U and V parameters, respectively. By circumventing computationally expensive DFT or DFPT self-consistent protocols, our model significantly expedites the prediction of Hubbard parameters with negligible computational overhead, while approaching the accuracy of DFPT. Moreover, owing to its robust transferability, the model facilitates accelerated materials discovery and design via high-throughput calculations, with relevance for various technological applications.

[LG-174] Contextual Optimization under Covariate Shift: A Robust Approach by Intersecting Wasserstein Balls

Link: https://arxiv.org/abs/2406.02426
Authors: Tianyu Wang,Ningyuan Chen,Chun Wang
Keywords: decision-maker observes historical, observes historical samples, decision-maker observes, uncertain variables, knowing their joint
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG)

Abstract:In contextual optimization, a decision-maker observes historical samples of uncertain variables and associated concurrent covariates, without knowing their joint distribution. Given an additional covariate observation, the goal is to choose a decision that minimizes some operational costs. A prevalent issue here is covariate shift, where the marginal distribution of the new covariate differs from historical samples, leading to decision performance variations with nonparametric or parametric estimators. To address this, we propose a distributionally robust approach that uses an ambiguity set by the intersection of two Wasserstein balls, each centered on typical nonparametric or parametric distribution estimators. Computationally, we establish the tractable reformulation of this distributionally robust optimization problem. Statistically, we provide guarantees for our Wasserstein ball intersection approach under covariate shift by analyzing the measure concentration of the estimators. Furthermore, to reduce computational complexity, we employ a surrogate objective that maintains similar generalization guarantees. Through synthetic and empirical case studies on income prediction and portfolio optimization, we demonstrate the strong empirical performance of our proposed models.

[LG-175] IterMask2: Iterative Unsupervised Anomaly Segmentation via Spatial and Frequency Masking for Brain Lesions in MRI

Link: https://arxiv.org/abs/2406.02422
Authors: Ziyun Liang,Xiaoqing Guo,J. Alison Noble,Konstantinos Kamnitsas
Keywords: Unsupervised anomaly segmentation, Unsupervised anomaly, healthy subjects, anomaly segmentation approaches, pathology segmentation train
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:Unsupervised anomaly segmentation approaches to pathology segmentation train a model on images of healthy subjects, that they define as the ‘normal’ data distribution. At inference, they aim to segment any pathologies in new images as ‘anomalies’, as they exhibit patterns that deviate from those in ‘normal’ training data. Prevailing methods follow the ‘corrupt-and-reconstruct’ paradigm. They intentionally corrupt an input image, reconstruct it to follow the learned ‘normal’ distribution, and subsequently segment anomalies based on reconstruction error. Corrupting an input image, however, inevitably leads to suboptimal reconstruction even of normal regions, causing false positives. To alleviate this, we propose a novel iterative spatial mask-refining strategy IterMask2. We iteratively mask areas of the image, reconstruct them, and update the mask based on reconstruction error. This iterative process progressively adds information about areas that are confidently normal as per the model. The increasing content guides reconstruction of nearby masked areas, improving reconstruction of normal tissue under these areas, reducing false positives. We also use high-frequency image content as an auxiliary input to provide additional structural information for masked areas. This further improves reconstruction error of normal in comparison to anomalous areas, facilitating segmentation of the latter. We conduct experiments on several brain lesion datasets and demonstrate effectiveness of our method. Code is available at: this https URL

[LG-176] Neural Thermodynamic Integration: Free Energies from Energy-based Diffusion Models

Link: https://arxiv.org/abs/2406.02313
Authors: Bálint Máté,François Fleuret,Tristan Bereau
Keywords: estimating free-energy differences, Thermodynamic integration, interpolating conformational ensembles, offers a rigorous, estimating free-energy
Categories: Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG)

Abstract:Thermodynamic integration (TI) offers a rigorous method for estimating free-energy differences by integrating over a sequence of interpolating conformational ensembles. However, TI calculations are computationally expensive and typically limited to coupling a small number of degrees of freedom due to the need to sample numerous intermediate ensembles with sufficient conformational-space overlap. In this work, we propose to perform TI along an alchemical pathway represented by a trainable neural network, which we term Neural TI. Critically, we parametrize a time-dependent Hamiltonian interpolating between the interacting and non-interacting systems, and optimize its gradient using a denoising-diffusion objective. The ability of the resulting energy-based diffusion model to sample all intermediate ensembles, allows us to perform TI from a single reference calculation. We apply our method to Lennard-Jones fluids, where we report accurate calculations of the excess chemical potential, demonstrating that Neural TI is capable of coupling hundreds of degrees of freedom at once.
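
As a reminder of what plain thermodynamic integration computes, the toy example below estimates the free-energy difference between two 1D harmonic wells by averaging dU/dλ in each intermediate ensemble and integrating over λ with the trapezoid rule. It is a textbook sketch under assumptions: the harmonic potential, the linear interpolation of the spring constant, and all names and parameter values are illustrative. It samples each ensemble exactly from a Gaussian, so it involves no neural network or diffusion model and is not the Neural TI method itself.

```python
import math
import random

def thermodynamic_integration(k0=1.0, k1=4.0, kT=1.0,
                              n_lambda=21, n_samples=20000, seed=0):
    """Estimate Delta F for U_lambda(x) = 0.5 * k(lambda) * x**2,
    where k(lambda) = (1 - lambda) * k0 + lambda * k1 (toy assumption)."""
    rng = random.Random(seed)
    lambdas = [i / (n_lambda - 1) for i in range(n_lambda)]
    means = []
    for lam in lambdas:
        k = (1.0 - lam) * k0 + lam * k1
        sigma = math.sqrt(kT / k)  # the Boltzmann distribution of x is Gaussian
        total = 0.0
        for _ in range(n_samples):
            x = rng.gauss(0.0, sigma)
            total += 0.5 * (k1 - k0) * x * x  # dU/dlambda evaluated at x
        means.append(total / n_samples)
    # Trapezoid rule over the ensemble averages <dU/dlambda>_lambda.
    h = lambdas[1] - lambdas[0]
    return h * (sum(means) - 0.5 * (means[0] + means[-1]))

estimate = thermodynamic_integration()
exact = 0.5 * math.log(4.0)  # analytic result: 0.5 * kT * ln(k1 / k0)
print(estimate, exact)
```

Even in this trivial case every intermediate λ needs its own batch of samples. That per-ensemble cost is what the abstract identifies as the bottleneck of conventional TI.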

[LG-177] Node-Level Topological Representation Learning on Point Clouds

链接: https://arxiv.org/abs/2406.02300
作者: Vincent P. Grande,Michael T. Schaub
关键词: Topological Data Analysis, Data Analysis, Euler Transform give, extract powerful topological, Persistent Homology
类目: Algebraic Topology (math.AT); Computational Geometry (cs.CG); Machine Learning (cs.LG)
*备注: 30 pages, 10 figures, comments welcome

点击查看摘要

Abstract:Topological Data Analysis (TDA) allows us to extract powerful topological and higher-order information on the global shape of a data set or point cloud. Tools like Persistent Homology or the Euler Transform give a single complex description of the global structure of the point cloud. However, common machine learning applications like classification require point-level information and features to be available. In this paper, we bridge this gap and propose a novel method to extract node-level topological features from complex point clouds using discrete variants of concepts from algebraic topology and differential geometry. We verify the effectiveness of these topological point features (TOPF) on both synthetic and real-world data and study their robustness under noise.

[LG-178] Solving Partial Differential Equations in Different Domains by Operator Learning method Based on Boundary Integral Equations

链接: https://arxiv.org/abs/2406.02298
作者: Bin Meng,Yutong Lu,Ying Jiang
关键词: Deep Operator Network, partial differential equations, boundary integral equations, Boundary Integral Type, article explores operator
类目: Mathematical Physics (math-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This article explores operator learning models that can deduce solutions to partial differential equations (PDEs) on arbitrary domains without requiring retraining. We introduce two innovative models rooted in boundary integral equations (BIEs): the Boundary Integral Type Deep Operator Network (BI-DeepONet) and the Boundary Integral Trigonometric Deep Operator Neural Network (BI-TDONet), which are crafted to address PDEs across diverse domains. Once fully trained, these BIE-based models adeptly predict the solutions of PDEs in any domain without the need for additional training. BI-TDONet notably enhances its performance by employing the singular value decomposition (SVD) of bounded linear operators, allowing for the efficient distribution of input functions across its modules. Furthermore, to tackle the issue of function sampling values that do not effectively capture oscillatory and impulse signal characteristics, trigonometric coefficients are utilized as both inputs and outputs in BI-TDONet. Our numerical experiments robustly support and confirm the efficacy of this theoretical framework.

[LG-179] Composite Quantile Regression With XGBoost Using the Novel Arctan Pinball Loss

链接: https://arxiv.org/abs/2406.02293
作者: Laurens Sluijterman,Frank Kreuwel,Eric Cator,Tom Heskes
关键词: composite quantile regression, pinball loss, loss, loss function, arctan pinball loss
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 24 pages, 9 figures

点击查看摘要

Abstract:This paper explores the use of XGBoost for composite quantile regression. XGBoost is a highly popular model renowned for its flexibility, efficiency, and capability to deal with missing data. The optimization uses a second order approximation of the loss function, complicating the use of loss functions with a zero or vanishing second derivative. Quantile regression – a popular approach to obtain conditional quantiles when point estimates alone are insufficient – unfortunately uses such a loss function, the pinball loss. Existing workarounds are typically inefficient and can result in severe quantile crossings. In this paper, we present a smooth approximation of the pinball loss, the arctan pinball loss, that is tailored to the needs of XGBoost. Specifically, contrary to other smooth approximations, the arctan pinball loss has a relatively large second derivative, which makes it more suitable to use in the second order approximation. Using this loss function enables the simultaneous prediction of multiple quantiles, which is more efficient and results in far fewer quantile crossings.
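
The second-derivative issue can be illustrated with a small sketch. The standard pinball loss below has a zero second derivative away from the origin. The smoothed variant replaces the step function with an arctan, which is one plausible reading of an "arctan pinball loss"; the exact form and constants used in the paper are not given in the abstract, so `smooth_pinball` and the smoothing parameter `s` are assumptions.

```python
import math

def pinball(u, tau):
    """Standard pinball loss for the residual u = y - prediction."""
    return u * (tau - (1.0 if u < 0 else 0.0))

def smooth_pinball(u, tau, s=0.1):
    """Illustrative smoothing: the step 1[u < 0] becomes 0.5 - arctan(u / s) / pi."""
    return u * (tau - 0.5 + math.atan(u / s) / math.pi)

def smooth_pinball_grad_hess(u, tau, s=0.1):
    """First and second derivative of smooth_pinball with respect to u."""
    z = u / s
    grad = tau - 0.5 + math.atan(z) / math.pi + z / (math.pi * (1.0 + z * z))
    hess = 2.0 / (math.pi * s * (1.0 + z * z) ** 2)   # strictly positive for all u
    return grad, hess

print(pinball(1.0, 0.9), smooth_pinball(1.0, 0.9, s=1e-4))
print(smooth_pinball_grad_hess(0.5, 0.9))
```

A custom boosting objective would need these derivatives with respect to the prediction rather than the residual, which flips the sign of the gradient and leaves the second derivative unchanged.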

[LG-180] Towards Supervised Performance on Speaker Verification with Self-Supervised Learning by Leveraging Large-Scale ASR Models

链接: https://arxiv.org/abs/2406.02285
作者: Victor Miara,Theo Lepage,Reda Dehak
关键词: shown promising results, Recent advancements, Speaker Verification, Self-Supervised Learning, shown promising
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
*备注: accepted at INTERSPEECH 2024

点击查看摘要

Abstract:Recent advancements in Self-Supervised Learning (SSL) have shown promising results in Speaker Verification (SV). However, narrowing the performance gap with supervised systems remains an ongoing challenge. Several studies have observed that speech representations from large-scale ASR models contain valuable speaker information. This work explores the limitations of fine-tuning these models for SV using an SSL contrastive objective in an end-to-end approach. Then, we propose a framework to learn speaker representations in an SSL context by fine-tuning a pre-trained WavLM with a supervised loss using pseudo-labels. Initial pseudo-labels are derived from an SSL DINO-based model and are iteratively refined by clustering the model embeddings. Our method achieves 0.99% EER on VoxCeleb1-O, establishing the new state-of-the-art on self-supervised SV. As this performance is close to our supervised baseline of 0.94% EER, this contribution is a step towards supervised performance on SV with SSL.

[LG-181] A KL-based Analysis Framework with Applications to Non-Descent Optimization Methods

链接: https://arxiv.org/abs/2406.02273
作者: Junwen Qiu,Bohao Ma,Xiao Li,Andre Milzarek
关键词: nonconvex scenarios based, methodologies in nonconvex, nonconvex scenarios, optimization methodologies, Kurdyka-Lojasiewicz property
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注: 29 pages

点击查看摘要

Abstract:We propose a novel analysis framework for non-descent-type optimization methodologies in nonconvex scenarios based on the Kurdyka-Lojasiewicz property. Our framework allows covering a broad class of algorithms, including those commonly employed in stochastic and distributed optimization. Specifically, it enables the analysis of first-order methods that lack a sufficient descent property and do not require access to full (deterministic) gradient information. We leverage this framework to establish, for the first time, iterate convergence and the corresponding rates for the decentralized gradient method and federated averaging under mild assumptions. Furthermore, based on the new analysis techniques, we show the convergence of the random reshuffling and stochastic gradient descent method without necessitating typical a priori bounded iterates assumptions.

[LG-182] Graph Neural Networks Do Not Always Oversmooth

链接: https://arxiv.org/abs/2406.02269
作者: Bastian Epping,Alexandre René,Moritz Helias,Michael T. Schaub
关键词: processing relational data, Graph neural networks, data in applications, emerged as powerful, powerful tools
类目: Machine Learning (stat.ML); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph neural networks (GNNs) have emerged as powerful tools for processing relational data in applications. However, GNNs suffer from the problem of oversmoothing, the property that the features of all nodes exponentially converge to the same vector over layers, prohibiting the design of deep GNNs. In this work we study oversmoothing in graph convolutional networks (GCNs) by using their Gaussian process (GP) equivalence in the limit of infinitely many hidden features. By generalizing methods from conventional deep neural networks (DNNs), we can describe the distribution of features at the output layer of deep GCNs in terms of a GP: as expected, we find that typical parameter choices from the literature lead to oversmoothing. The theory, however, allows us to identify a new, non-oversmoothing phase: if the initial weights of the network have sufficiently large variance, GCNs do not oversmooth, and node features remain informative even at large depth. We demonstrate the validity of this prediction in finite-size GCNs by training a linear classifier on their output. Moreover, using the linearization of the GCN GP, we generalize the concept of propagation depth of information from DNNs to GCNs. This propagation depth diverges at the transition between the oversmoothing and non-oversmoothing phases. We test the predictions of our approach and find good agreement with finite-size GCNs. Initializing GCNs near the transition to the non-oversmoothing phase, we obtain networks which are both deep and expressive.

[LG-183] MidiCaps – A large-scale MIDI dataset with text captions

链接: https://arxiv.org/abs/2406.02255
作者: Jan Melechovsky,Abhinaba Roy,Dorien Herremans
关键词: Generative models guided, Generative models, prompts are increasingly, MIDI, Instrument Digital Interface
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Multimedia (cs.MM); Sound (cs.SD)
*备注: Under review

点击查看摘要

Abstract:Generative models guided by text prompts are becoming increasingly popular. However, no text-to-MIDI models currently exist, mostly due to the lack of a captioned MIDI dataset. This work aims to enable research that combines LLMs with symbolic music by presenting the first large-scale MIDI dataset with text captions that is openly available: MidiCaps. MIDI (Musical Instrument Digital Interface) files are a widely used format for encoding musical information. Their structured format captures the nuances of musical composition and has practical applications for music producers, composers, musicologists, as well as performers. Inspired by recent advancements in captioning techniques applied to various domains, we present a large-scale curated dataset of over 168k MIDI files accompanied by textual descriptions. Each MIDI caption succinctly describes the musical content, encompassing tempo, chord progression, time signature, instruments present, genre and mood, thereby facilitating multi-modal exploration and analysis. The dataset contains a mix of various genres, styles, and complexities, offering a rich source for training and evaluating models for tasks such as music information retrieval, music understanding and cross-modal translation. We provide detailed statistics about the dataset and have assessed the quality of the captions in an extensive listening study. We anticipate that this resource will stimulate further research in the intersection of music and natural language processing, fostering advancements in both fields.

[LG-184] Riemannian coordinate descent algorithms on matrix manifolds

链接: https://arxiv.org/abs/2406.02225
作者: Andi Han,Pratik Jawanpuria,Bamdev Mishra
关键词: machine learning applications, machine learning, naturally formulated, Riemannian optimization, optimization problems
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Many machine learning applications are naturally formulated as optimization problems on Riemannian manifolds. The main idea behind Riemannian optimization is to maintain the feasibility of the variables while moving along a descent direction on the manifold. This results in updating all the variables at every iteration. In this work, we provide a general framework for developing computationally efficient coordinate descent (CD) algorithms on matrix manifolds that allows updating only a few variables at every iteration while adhering to the manifold constraint. In particular, we propose CD algorithms for various manifolds such as Stiefel, Grassmann, (generalized) hyperbolic, symplectic, and symmetric positive (semi)definite. While the cost per iteration of the proposed CD algorithms is low, we further develop a more efficient variant via a first-order approximation of the objective function. We analyze their convergence and complexity, and empirically illustrate their efficacy in several applications.
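
As a rough illustration of updating only a few variables while staying on the manifold, the sketch below applies a Givens rotation to two columns of a matrix with orthonormal columns. This is a generic textbook construction, not necessarily the update rule proposed in the paper; `givens_update` and `gram` are hypothetical helper names.

```python
import math

def givens_update(X, i, j, theta):
    """Return a copy of X with columns i and j rotated by angle theta."""
    c, s = math.cos(theta), math.sin(theta)
    Y = [row[:] for row in X]
    for r in range(len(X)):
        xi, xj = X[r][i], X[r][j]
        Y[r][i] = c * xi - s * xj   # only columns i and j change
        Y[r][j] = s * xi + c * xj
    return Y

def gram(X):
    """Return X^T X as a list of lists."""
    n_rows, n_cols = len(X), len(X[0])
    return [[sum(X[r][a] * X[r][b] for r in range(n_rows))
             for b in range(n_cols)] for a in range(n_cols)]

X = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
Y = givens_update(X, 0, 2, 0.7)
print(gram(Y))   # stays (numerically) the identity, so Y still has orthonormal columns
```

Because the update is a right-multiplication by a rotation, the constraint X^T X = I holds after every step without a separate retraction or projection.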

[LG-185] On the Recoverability of Causal Relations from Temporally Aggregated I.I.D. Data

链接: https://arxiv.org/abs/2406.02191
作者: Shunxing Fan,Mingming Gong,Kun Zhang
关键词: general setting, effect of temporal, causal discovery, causal, causal discovery results
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We consider the effect of temporal aggregation on instantaneous (non-temporal) causal discovery in a general setting. This is motivated by the observation that the true causal time lag is often considerably shorter than the observational interval. This discrepancy leads to high aggregation, causing time-delay causality to vanish and instantaneous dependence to manifest. Although we expect such instantaneous dependence to be consistent with the true causal relation in a certain sense, so that the discovery results are meaningful, it remains unclear what type of consistency we need and when such consistency is satisfied. We propose functional consistency and conditional independence consistency in a formal way, corresponding to functional causal model-based methods and conditional independence-based methods respectively, and provide the conditions under which these consistencies hold. We show theoretically and experimentally that causal discovery results may be seriously distorted by aggregation, especially in the completely nonlinear case, and we also find that the causal relationship is still recoverable from aggregated data if we have partial linearity or an appropriate prior. Our findings suggest that the community should take a cautious and meticulous approach when interpreting causal discovery results from such data, and show why and when aggregation will distort the performance of causal discovery methods.

[LG-186] Online Learning and Information Exponents: On The Importance of Batch size and Time/Complexity Tradeoffs

链接: https://arxiv.org/abs/2406.02157
作者: Luca Arnaboldi,Yatin Dandi,Florent Krzakala,Bruno Loureiro,Luca Pesce,Ludovic Stephan
关键词: two-layer neural networks, stochastic gradient descent, one-pass stochastic gradient, multi-index target functions, training two-layer neural
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study the impact of the batch size $n_b$ on the iteration time $T$ of training two-layer neural networks with one-pass stochastic gradient descent (SGD) on multi-index target functions of isotropic covariates. We characterize the optimal batch size minimizing the iteration time as a function of the hardness of the target, as characterized by the information exponents. We show that performing gradient updates with large batches $n_b \lesssim d^{\ell/2}$ minimizes the training time without changing the total sample complexity, where $\ell$ is the information exponent of the target to be learned (Ben Arous et al., 2021) and $d$ is the input dimension. However, larger batch sizes than $n_b \gg d^{\ell/2}$ are detrimental for improving the time complexity of SGD. We provably overcome this fundamental limitation via a different training protocol, Correlation loss SGD, which suppresses the auto-correlation terms in the loss function. We show that one can track the training progress by a system of low-dimensional ordinary differential equations (ODEs). Finally, we validate our theoretical results with numerical experiments.

[LG-187] Learning Hamiltonian neural Koopman operator and simultaneously sustaining and discovering conservation law

链接: https://arxiv.org/abs/2406.02154
作者: Jingdong Zhang,Qunxi Zhu,Wei Lin
关键词: major challenge presently, predicting dynamics based, Hamiltonian Neural Koopman, Neural Koopman Operator, Accurately finding
类目: Mathematical Physics (math-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurately finding and predicting dynamics from observational data with noise perturbations is of paramount significance but remains a major challenge. Here, for Hamiltonian mechanics, we propose the Hamiltonian Neural Koopman Operator (HNKO), which integrates knowledge of mathematical physics into learning the Koopman operator and makes it automatically sustain and even discover conservation laws. We demonstrate the superior performance of the HNKO and its extension on a number of representative physical systems, even those with hundreds or thousands of degrees of freedom. Our results suggest that feeding the prior knowledge of the underlying system and the mathematical theory appropriately to the learning framework can reinforce the capability of machine learning in solving physical problems.

[LG-188] SimulTron: On-Device Simultaneous Speech to Speech Translation

链接: https://arxiv.org/abs/2406.02133
作者: Alex Agranovich,Eliya Nachmani,Oleg Rybakov,Yifan Ding,Ye Jia,Nadav Bar,Heiga Zen,Michelle Tadmor Ramanovich
关键词: enabling fluid conversations, holds the promise, conversations across languages, promise of breaking, breaking down communication
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
*备注:

点击查看摘要

Abstract:Simultaneous speech-to-speech translation (S2ST) holds the promise of breaking down communication barriers and enabling fluid conversations across languages. However, achieving accurate, real-time translation through mobile devices remains a major challenge. We introduce SimulTron, a novel S2ST architecture designed to tackle this task. SimulTron is a lightweight direct S2ST model that uses the strengths of the Translatotron framework while incorporating key modifications for streaming operation and an adjustable fixed delay. Our experiments show that SimulTron surpasses Translatotron 2 in offline evaluations. Furthermore, real-time evaluations reveal that SimulTron improves upon the performance achieved by Translatotron 1. Additionally, SimulTron achieves superior BLEU scores and latency compared to the previous real-time S2ST method on the MuST-C dataset. Significantly, we have successfully deployed SimulTron on a Pixel 7 Pro device, showing its potential for simultaneous S2ST on-device.

[LG-189] Causal Effect Identification in LiNGAM Models with Latent Confounders

链接: https://arxiv.org/abs/2406.02049
作者: Daniele Tramontano,Yaroslav Kivva,Saber Salehkaleybar,Mathias Drton,Negar Kiyavash
关键词: non-Gaussian acyclic models, linear non-Gaussian acyclic, causal effects, acyclic models, study the generic
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注: Accepted at International Conference on Machine Learning (ICML) 2024

点击查看摘要

Abstract:We study the generic identifiability of causal effects in linear non-Gaussian acyclic models (LiNGAM) with latent variables. We consider the problem in two main settings: When the causal graph is known a priori, and when it is unknown. In both settings, we provide a complete graphical characterization of the identifiable direct or total causal effects among observed variables. Moreover, we propose efficient algorithms to certify the graphical conditions. Finally, we propose an adaptation of the reconstruction independent component analysis (RICA) algorithm that estimates the causal effects from the observational data given the causal graph. Experimental results show the effectiveness of the proposed method in estimating the causal effects.

[LG-190] Adaptive and Optimal Second-order Optimistic Methods for Minimax Optimization

链接: https://arxiv.org/abs/2406.02016
作者: Ruichen Jiang,Ali Kavis,Qiujiang Jin,Sujay Sanghavi,Aryan Mokhtari
关键词: convex-concave min-max problems, solving convex-concave min-max, line search-free second-order, search-free second-order methods, min-max problems
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 33 pages, 2 figures

点击查看摘要

Abstract:We propose adaptive, line search-free second-order methods with optimal rate of convergence for solving convex-concave min-max problems. By means of an adaptive step size, our algorithms feature a simple update rule that requires solving only one linear system per iteration, eliminating the need for line search or backtracking mechanisms. Specifically, we base our algorithms on the optimistic method and appropriately combine it with second-order information. Moreover, distinct from common adaptive schemes, we define the step size recursively as a function of the gradient norm and the prediction error in the optimistic update. We first analyze a variant where the step size requires knowledge of the Lipschitz constant of the Hessian. Under the additional assumption of Lipschitz continuous gradients, we further design a parameter-free version by tracking the Hessian Lipschitz constant locally and ensuring the iterates remain bounded. We also evaluate the practical performance of our algorithm by comparing it to existing second-order algorithms for minimax optimization.

[LG-191] Understanding Auditory Evoked Brain Signal via Physics-informed Embedding Network with Multi-Task Transformer

链接: https://arxiv.org/abs/2406.02014
作者: Wanli Ma,Xuegang Tang,Jin Gu,Ying Wang,Yuling Xia
关键词: magnetic resonance imaging, task-based functional magnetic, functional magnetic resonance, cognitive neuroscience, resonance imaging
类目: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:In the fields of brain-computer interaction and cognitive neuroscience, effective decoding of auditory signals from task-based functional magnetic resonance imaging (fMRI) is key to understanding how the brain processes complex auditory information. Although existing methods have enhanced decoding capabilities, limitations remain in information utilization and model representation. To overcome these challenges, we propose an innovative multi-task learning model, Physics-informed Embedding Network with Multi-Task Transformer (PEMT-Net), which enhances decoding performance through physics-informed embedding and deep learning techniques. PEMT-Net consists of two principal components: feature augmentation and classification. For feature augmentation, we propose a novel approach by creating neural embedding graphs via node embedding, utilizing random walks to simulate the physical diffusion of neural information. This method captures both local and non-local information overflow and proposes a position encoding based on relative physical coordinates. In the classification segment, we propose adaptive embedding fusion to maximally capture linear and non-linear characteristics. Furthermore, we propose an innovative parameter-sharing mechanism to optimize the retention and learning of extracted features. Experiments on a specific dataset demonstrate PEMT-Net’s significant performance in multi-task auditory signal decoding, surpassing existing methods and offering new insights into the brain’s mechanisms for processing complex auditory information.

[LG-192] Adaptive Variance Reduction for Stochastic Optimization under Weaker Assumptions

链接: https://arxiv.org/abs/2406.01959
作者: Wei Jiang,Sifan Yang,Yibo Wang,Lijun Zhang
关键词: paper explores adaptive, optimal convergence rate, convergence rate, mathcal, adaptive STORM method
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper explores adaptive variance reduction methods for stochastic optimization based on the STORM technique. Existing adaptive extensions of STORM rely on strong assumptions like bounded gradients and bounded function values, or suffer an additional $\mathcal{O}(\log T)$ term in the convergence rate. To address these limitations, we introduce a novel adaptive STORM method that achieves an optimal convergence rate of $\mathcal{O}(T^{-1/3})$ for non-convex functions with our newly designed learning rate strategy. Compared with existing approaches, our method requires weaker assumptions and attains the optimal convergence rate without the additional $\mathcal{O}(\log T)$ term. We also extend the proposed technique to stochastic compositional optimization, obtaining the same optimal rate of $\mathcal{O}(T^{-1/3})$. Furthermore, we investigate the non-convex finite-sum problem and develop another innovative adaptive variance reduction method that achieves an optimal convergence rate of $\mathcal{O}(n^{1/4} T^{-1/2})$, where $n$ represents the number of component functions. Numerical experiments across various tasks validate the effectiveness of our method.
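
For orientation, the recursion usually associated with STORM can be written as below. This is the commonly cited non-adaptive form; the symbols $d_t$, $a_t$, $\eta_t$ and $\xi_t$ are notation assumed here, and the paper's newly designed learning-rate strategy is not reproduced.

```latex
% Variance-reduced gradient estimate: the same sample \xi_t is evaluated at x_t and x_{t-1}
d_t = \nabla f(x_t; \xi_t) + (1 - a_t)\,\bigl(d_{t-1} - \nabla f(x_{t-1}; \xi_t)\bigr),
\qquad
x_{t+1} = x_t - \eta_t\, d_t .
```

With $a_t = 1$ the estimate reduces to the plain stochastic gradient, while smaller $a_t$ reuses more of the previous estimate and corrects it with the gradient difference.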

[LG-193] Orthogonal Causal Calibration

链接: https://arxiv.org/abs/2406.01933
作者: Justin Whitehouse,Christopher Jung,Vasilis Syrgkanis,Bryan Wilder,Zhiwei Steven Wu
关键词: treatment effects play, average treatment effects, quantile treatment effects, conditional average treatment, conditional quantile treatment
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)
*备注: 44 pages

点击查看摘要

Abstract:Estimates of causal parameters such as conditional average treatment effects and conditional quantile treatment effects play an important role in real-world decision making. Given this importance, one should ensure these estimators are calibrated. While there is a rich literature on calibrating estimators of non-causal parameters, very few methods have been derived for calibrating estimators of causal parameters, or more generally estimators of quantities involving nuisance parameters. In this work, we provide a general framework for calibrating predictors involving nuisance estimation. We consider a notion of calibration defined with respect to an arbitrary, nuisance-dependent loss $\ell$, under which we say an estimator $\theta$ is calibrated if its predictions cannot be changed on any level set to decrease loss. We prove generic upper bounds on the calibration error of any causal parameter estimate $\theta$ with respect to any loss $\ell$ using a concept called Neyman Orthogonality. Our bounds involve two decoupled terms - one measuring the error in estimating the unknown nuisance parameters, and the other representing the calibration error in a hypothetical world where the learned nuisance estimates were true. We use our bound to analyze the convergence of two sample splitting algorithms for causal calibration. One algorithm, which applies to universally orthogonalizable loss functions, transforms the data into generalized pseudo-outcomes and applies an off-the-shelf calibration procedure. The other algorithm, which applies to conditionally orthogonalizable loss functions, extends the classical uniform mass binning algorithm to include nuisance estimation. Our results are exceedingly general, showing that essentially any existing calibration algorithm can be used in causal settings, with additional loss only arising from errors in nuisance estimation.

[LG-194] Diffusion Boosted Trees

链接: https://arxiv.org/abs/2406.01813
作者: Xizewen Han,Mingyuan Zhou
关键词: supervised learning problems, Combining the merits, tackling supervised learning, denoising diffusion probabilistic, diffusion boosting paradigm
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Applications (stat.AP); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Combining the merits of both denoising diffusion probabilistic models and gradient boosting, the diffusion boosting paradigm is introduced for tackling supervised learning problems. We develop Diffusion Boosted Trees (DBT), which can be viewed as both a new denoising diffusion generative model parameterized by decision trees (one single tree for each diffusion timestep), and a new boosting algorithm that combines the weak learners into a strong learner of conditional distributions without making explicit parametric assumptions on their density forms. We demonstrate through experiments the advantages of DBT over deep neural network-based diffusion models as well as the competence of DBT on real-world regression tasks, and present a business application (fraud detection) of DBT for classification on tabular data with the ability of learning to defer.

[LG-195] Fearless Stochasticity in Expectation Propagation

链接: https://arxiv.org/abs/2406.01801
作者: Jonathan So,Richard E. Turner
关键词: performing approximate inference, Expectation propagation, family of algorithms, algorithms for performing, performing approximate
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Expectation propagation (EP) is a family of algorithms for performing approximate inference in probabilistic models. The updates of EP involve the evaluation of moments – expectations of certain functions – which can be estimated from Monte Carlo (MC) samples. However, the updates are not robust to MC noise when performed naively, and various prior works have attempted to address this issue in different ways. In this work, we provide a novel perspective on the moment-matching updates of EP; namely, that they perform natural-gradient-based optimisation of a variational objective. We use this insight to motivate two new EP variants, with updates that are particularly well-suited to MC estimation; they remain stable and are most sample-efficient when estimated with just a single sample. These new variants combine the benefits of their predecessors and address key weaknesses. In particular, they are easier to tune, offer an improved speed-accuracy trade-off, and do not rely on the use of debiasing estimators. We demonstrate their efficacy on a variety of probabilistic inference tasks.

[LG-196] An efficient solution to Hidden Markov Models on trees with coupled branches

链接: https://arxiv.org/abs/2406.01663
作者: Farzan Vafa,Sahand Hormoz
关键词: Hidden Markov Models, Hidden Markov, underlying states evolve, modeling sequential data, Markov Models
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Signal Processing (eess.SP); Quantitative Methods (q-bio.QM); Methodology (stat.ME)
*备注: 24 + 6 pages, 5 figures

点击查看摘要

Abstract:Hidden Markov Models (HMMs) are powerful tools for modeling sequential data, where the underlying states evolve in a stochastic manner and are only indirectly observable. Traditional HMM approaches are well-established for linear sequences, and have been extended to other structures such as trees. In this paper, we extend the framework of HMMs on trees to address scenarios where the tree-like structure of the data includes coupled branches – a common feature in biological systems where entities within the same lineage exhibit dependent characteristics. We develop a dynamic programming algorithm that efficiently solves the likelihood, decoding, and parameter learning problems for tree-based HMMs with coupled branches. Our approach scales polynomially with the number of states and nodes, making it computationally feasible for a wide range of applications and does not suffer from the underflow problem. We demonstrate our algorithm by applying it to simulated data and propose self-consistency checks for validating the assumptions of the model used for inference. This work not only advances the theoretical understanding of HMMs on trees but also provides a practical tool for analyzing complex biological data where dependencies between branches cannot be ignored.
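
The underflow problem mentioned above is easiest to see on an ordinary chain HMM. The sketch below runs the forward recursion in log space using a log-sum-exp helper. It is the textbook linear-sequence algorithm, not the paper's tree algorithm with coupled branches, and the two-state model parameters are made up for illustration.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))

def log_likelihood(obs, log_pi, log_A, log_B):
    """Forward algorithm in log space for a chain HMM with discrete emissions."""
    n = len(log_pi)
    alpha = [log_pi[k] + log_B[k][obs[0]] for k in range(n)]
    for o in obs[1:]:
        alpha = [logsumexp([alpha[j] + log_A[j][k] for j in range(n)]) + log_B[k][o]
                 for k in range(n)]
    return logsumexp(alpha)

# Hypothetical two-state model: initial, transition and emission probabilities.
log_pi = [math.log(p) for p in (0.5, 0.5)]
log_A = [[math.log(p) for p in row] for row in ((0.9, 0.1), (0.2, 0.8))]
log_B = [[math.log(p) for p in row] for row in ((0.7, 0.3), (0.1, 0.9))]

short_ll = log_likelihood([0, 1], log_pi, log_A, log_B)
long_ll = log_likelihood([0] * 5000, log_pi, log_A, log_B)
print(math.exp(short_ll), long_ll)
```

For the 5000-step sequence the likelihood itself is far below the smallest positive double, so a forward pass on raw probabilities would return zero, while the log-space value stays finite.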

[LG-197] An efficient Wasserstein-distance approach for reconstructing jump-diffusion processes using parameterized neural networks

Link: https://arxiv.org/abs/2406.01653
Authors: Mingtao Xia, Xiangting Li, Qijing Shen, Tom Chou
Keywords: temporally decoupled squared, multidimensional jump-diffusion processes, Wasserstein distance, jump-diffusion processes, probability distributions
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR); Applications (stat.AP); Methodology (stat.ME)
Comments:

Abstract:We analyze the Wasserstein distance (W-distance) between two probability distributions associated with two multidimensional jump-diffusion processes. Specifically, we analyze a temporally decoupled squared W_2-distance, which provides both upper and lower bounds associated with the discrepancies in the drift, diffusion, and jump amplitude functions between the two jump-diffusion processes. Then, we propose a temporally decoupled squared W_2-distance method for efficiently reconstructing unknown jump-diffusion processes from data using parameterized neural networks. We further show its performance can be enhanced by utilizing prior information on the drift function of the jump-diffusion process. The effectiveness of our proposed reconstruction method is demonstrated across several examples and applications.
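
As a rough illustration of the quantity involved, the sketch below computes the squared W_2-distance between two equal-size one-dimensional empirical samples (the mean squared difference of the sorted samples) and sums it over time points. Reading "temporally decoupled" as a time-by-time comparison of marginal distributions is an assumption here, the paper treats the multidimensional case, and the function names are hypothetical.

```python
def w2_squared_1d(xs, ys):
    """Squared 2-Wasserstein distance between two equal-size 1D empirical samples."""
    if len(xs) != len(ys):
        raise ValueError("samples must have the same size")
    # In 1D the optimal coupling matches sorted samples in order.
    return sum((a - b) ** 2 for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

def decoupled_w2_squared(traj_a, traj_b, dt):
    """Sum of per-time-step squared W_2 distances, weighted by the step size dt.

    traj_a[k] and traj_b[k] hold the samples of each process at time step k.
    This time-by-time form is an assumed reading of "temporally decoupled".
    """
    return dt * sum(w2_squared_1d(a, b) for a, b in zip(traj_a, traj_b))

# Toy data: two time steps, two samples per step.
traj_a = [[0.0, 1.0], [0.0, 2.0]]
traj_b = [[1.0, 2.0], [2.0, 0.0]]
single = w2_squared_1d(traj_a[0], traj_b[0])
total = decoupled_w2_squared(traj_a, traj_b, dt=0.5)
print(single, total)
```

In a reconstruction setting, a value like `total` would serve as the loss between observed trajectories and trajectories simulated from the parameterized model.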

[LG-198] Distributional bias compromises leave-one-out cross-validation

Link: https://arxiv.org/abs/2406.01652
Authors: George I. Austin, Itsik Pe’er, Tal Korem
Keywords: method for estimating, estimating the predictive, machine learning models, Cross-validation, common method
Subjects: Methodology (stat.ME); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments: 20 pages, 5 figures, supplementary information

Abstract:Cross-validation is a common method for estimating the predictive performance of machine learning models. In a data-scarce regime, where one typically wishes to maximize the number of instances used for training the model, an approach called “leave-one-out cross-validation” is often used. In this design, a separate model is built for predicting each data instance after training on all other instances. Since this results in a single test data point available per model trained, predictions are aggregated across the entire dataset to calculate common rank-based performance metrics such as the area under the receiver operating characteristic or precision-recall curves. In this work, we demonstrate that this approach creates a negative correlation between the average label of each training fold and the label of its corresponding test instance, a phenomenon that we term distributional bias. As machine learning models tend to regress to the mean of their training data, this distributional bias tends to negatively impact performance evaluation and hyperparameter optimization. We show that this effect generalizes to leave-P-out cross-validation and persists across a wide range of modeling and evaluation approaches, and that it can lead to a bias against stronger regularization. To address this, we propose a generalizable rebalanced cross-validation approach that corrects for distributional bias. We demonstrate that our approach improves cross-validation performance evaluation in synthetic simulations and in several published leave-one-out analyses.
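
The negative correlation described in the abstract can be reproduced with a few lines of arithmetic. The sketch below uses a made-up label vector and a deliberately trivial "model" that predicts the mean label of its training fold. It is an illustration of the effect, not the authors' experiments or their rebalanced cross-validation.

```python
from statistics import mean

def loo_train_means(labels):
    """Mean label of the training fold for each held-out instance."""
    total = sum(labels)
    n = len(labels)
    return [(total - y) / (n - 1) for y in labels]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical dataset: 20 positive and 30 negative labels.
labels = [1] * 20 + [0] * 30

# A "model" that just predicts its training-fold mean for the held-out point.
scores = loo_train_means(labels)

r = pearson(scores, labels)
pooled_auc = auc(scores, labels)
print(r, pooled_auc)
```

Removing a positive instance lowers the training mean and removing a negative one raises it, so the fold mean is a decreasing function of the held-out label. Pooling these predictions across folds gives the mean predictor an AUC of 0 rather than the 0.5 expected from an uninformative model.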

[LG-199] FusionDTI: Fine-grained Binding Discovery with Token-level Fusion for Drug-Target Interaction

Link: https://arxiv.org/abs/2406.01651
Authors: Zhaohan Meng, Zaiqiao Meng, Iadh Ounis
Keywords: Predicting drug-target interaction, drug discovery process, Predicting drug-target, discovery process, drug-target interaction
Subjects: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Biomolecules (q-bio.BM)
Comments: 10 pages, 8 figures

Abstract:Predicting drug-target interaction (DTI) is critical in the drug discovery process. Despite remarkable advances in recent DTI models through the integration of representations from diverse drug and target encoders, such models often struggle to capture the fine-grained interactions between drugs and protein, i.e. the binding of specific drug atoms (or substructures) and key amino acids of proteins, which is crucial for understanding the binding mechanisms and optimising drug design. To address this issue, this paper introduces a novel model, called FusionDTI, which uses a token-level Fusion module to effectively learn fine-grained information for Drug-Target Interaction. In particular, our FusionDTI model uses the SELFIES representation of drugs to mitigate sequence fragment invalidation and incorporates the structure-aware (SA) vocabulary of target proteins to address the limitation of amino acid sequences in structural information, additionally leveraging pre-trained language models extensively trained on large-scale biomedical datasets as encoders to capture the complex information of drugs and targets. Experiments on three well-known benchmark datasets show that our proposed FusionDTI model achieves the best performance in DTI prediction compared with seven existing state-of-the-art baselines. Furthermore, our case study indicates that FusionDTI could highlight the potential binding sites, enhancing the explainability of the DTI prediction.

[LG-200] TAGMol: Target-Aware Gradient-guided Molecule Generation

Link: https://arxiv.org/abs/2406.01650
Authors: Vineeth Dorna, D. Subhalingam, Keshav Kolluru, Shreshth Tuli, Mrityunjay Singh, Saurabh Singal, N. M. Anoop Krishnan, Sayan Ranu
Keywords: shown significant promise, target binding sites, specific target binding, discovering ligands tailored, structure-based drug design
Subjects: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:3D generative models have shown significant promise in structure-based drug design (SBDD), particularly in discovering ligands tailored to specific target binding sites. Existing algorithms often focus primarily on ligand-target binding, characterized by binding affinity. Moreover, models trained solely on target-ligand distribution may fall short in addressing the broader objectives of drug discovery, such as the development of novel ligands with desired properties like drug-likeness, and synthesizability, underscoring the multifaceted nature of the drug design process. To overcome these challenges, we decouple the problem into molecular generation and property prediction. The latter synergistically guides the diffusion sampling process, facilitating guided diffusion and resulting in the creation of meaningful molecules with the desired properties. We call this guided molecular generation process as TAGMol. Through experiments on benchmark datasets, TAGMol demonstrates superior performance compared to state-of-the-art baselines, achieving a 22% improvement in average Vina Score and yielding favorable outcomes in essential auxiliary properties. This establishes TAGMol as a comprehensive framework for drug generation.

[LG-201] Equivariant amortized inference of poses for cryo-EM

Link: https://arxiv.org/abs/2406.01630
Authors: Larissa de Ruijter, Gabriele Cesa
Keywords: technique for determining, structure of biological, proteins and viruses, vital technique, biological molecules
Subjects: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
Comments: Published at the GEM workshop, ICLR 2024

Abstract:Cryo-EM is a vital technique for determining the 3D structure of biological molecules such as proteins and viruses. The cryo-EM reconstruction problem is challenging due to the high noise levels, the missing poses of particles, and the computational demands of processing large datasets. A promising solution to these challenges lies in the use of amortized inference methods, which have shown particular efficacy in pose estimation for large datasets. However, these methods also encounter convergence issues, often necessitating sophisticated initialization strategies or engineered solutions for effective convergence. Building upon the existing cryoAI pipeline, which employs a symmetric loss function to address convergence problems, this work explores the emergence and persistence of these issues within the pipeline. Additionally, we explore the impact of equivariant amortized inference on enhancing convergence. Our investigations reveal that, when applied to simulated data, a pipeline incorporating an equivariant encoder not only converges faster and more frequently than the standard approach but also demonstrates superior performance in terms of pose estimation accuracy and the resolution of the reconstructed volume. Notably, D_4-equivariant encoders make the symmetric loss superfluous and, therefore, allow for a more efficient reconstruction pipeline.

[LG-202] GenBench: A Benchmarking Suite for Systematic Evaluation of Genomic Foundation Models

Link: https://arxiv.org/abs/2406.01627
Authors: Zicheng Liu, Jiahui Li, Siyuan Li, Zelin Zang, Cheng Tan, Yufei Huang, Yajing Bai, Stan Z. Li
Keywords: massive genomic data, Genomic Foundation Model, downstream applications, paradigm is expected, Genomic Foundation
Subjects: Genomics (q-bio.GN); Machine Learning (cs.LG)
Comments:

Abstract:The Genomic Foundation Model (GFM) paradigm is expected to facilitate the extraction of generalizable representations from massive genomic data, thereby enabling their application across a spectrum of downstream applications. Despite advancements, a lack of evaluation framework makes it difficult to ensure equitable assessment due to experimental settings, model intricacy, benchmark datasets, and reproducibility challenges. In the absence of standardization, comparative analyses risk becoming biased and unreliable. To surmount this impasse, we introduce GenBench, a comprehensive benchmarking suite specifically tailored for evaluating the efficacy of Genomic Foundation Models. GenBench offers a modular and expandable framework that encapsulates a variety of state-of-the-art methodologies. Through systematic evaluations of datasets spanning diverse biological domains with a particular emphasis on both short-range and long-range genomic tasks, firstly including the three most important DNA tasks covering Coding Region, Non-Coding Region, Genome Structure, etc. Moreover, we provide a nuanced analysis of the interplay between model architecture and dataset characteristics on task-specific performance. Our findings reveal an interesting observation: independent of the number of parameters, the discernible difference in preference between the attention-based and convolution-based models on short- and long-range tasks may provide insights into the future design of GFM.

[LG-203] Unveiling Hidden Factors: Explainable AI for Feature Boosting in Speech Emotion Recognition

Link: https://arxiv.org/abs/2406.01624
Authors: Alaa Nfissi, Wassim Bouachir, Nizar Bouguila, Brian Mishara
Keywords: gained significant attention, significant attention due, Speech emotion recognition, SER, SER systems
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
Comments: Published in: Springer Nature International Journal of Applied Intelligence (2024)

Abstract:Speech emotion recognition (SER) has gained significant attention due to its several application fields, such as mental health, education, and human-computer interaction. However, the accuracy of SER systems is hindered by high-dimensional feature sets that may contain irrelevant and redundant information. To overcome this challenge, this study proposes an iterative feature boosting approach for SER that emphasizes feature relevance and explainability to enhance machine learning model performance. Our approach involves meticulous feature selection and analysis to build efficient SER systems. In addressing our main problem through model explainability, we employ a feature evaluation loop with Shapley values to iteratively refine feature sets. This process strikes a balance between model performance and transparency, which enables a comprehensive understanding of the model’s predictions. The proposed approach offers several advantages, including the identification and removal of irrelevant and redundant features, leading to a more effective model. Additionally, it promotes explainability, facilitating comprehension of the model’s predictions and the identification of crucial features for emotion determination. The effectiveness of the proposed method is validated on the SER benchmarks of the Toronto emotional speech set (TESS), Berlin Database of Emotional Speech (EMO-DB), Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), and Surrey Audio-Visual Expressed Emotion (SAVEE) datasets, outperforming state-of-the-art methods. These results highlight the potential of the proposed technique in developing accurate and explainable SER systems. To the best of our knowledge, this is the first work to incorporate model explainability into an SER framework.
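
To make the idea of a Shapley-based feature evaluation step concrete, here is a minimal sketch that computes exact Shapley values by averaging marginal contributions over all feature orderings and then drops features that contribute nothing. The feature names, the toy value function, and the pruning threshold are all hypothetical. The paper's actual models, features, and loop are not specified in the abstract.

```python
from itertools import permutations

def shapley_values(features, value):
    """Exact Shapley values: average marginal contribution over all orderings.

    `value` maps a frozenset of features to a score. The cost grows
    factorially with the number of features, so this only suits tiny sets.
    """
    totals = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        seen = frozenset()
        for f in order:
            with_f = seen | {f}
            totals[f] += value(with_f) - value(seen)
            seen = with_f
    return {f: t / len(orderings) for f, t in totals.items()}

def toy_score(subset):
    """Hypothetical model score: 'pitch' and 'energy' help, 'noise' does not."""
    score = 0.0
    if "pitch" in subset:
        score += 0.5
    if "energy" in subset:
        score += 0.3
    return score

features = ["pitch", "energy", "noise"]
phi = shapley_values(features, toy_score)

# One refinement pass: keep only features with a non-negligible contribution.
kept = [f for f in features if abs(phi[f]) > 1e-12]
print(phi, kept)
```

In practice the value function would be a trained model's validation score on a feature subset, and an approximation would replace the exact enumeration for realistic feature counts.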

[LG-204] Sifting through the Noise: A Survey of Diffusion Probabilistic Models and Their Applications to Biomolecules

Link: https://arxiv.org/abs/2406.01622
Authors: Trevor Norton, Debswapna Bhattacharya
Keywords: Diffusion probabilistic models, number of high-profile, diffusion models, models, Diffusion probabilistic
Subjects: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments: 31 pages, 6 figures

Abstract:Diffusion probabilistic models have made their way into a number of high-profile applications since their inception. In particular, there has been a wave of research into using diffusion models in the prediction and design of biomolecular structures and sequences. Their growing ubiquity makes it imperative for researchers in these fields to understand them. This paper serves as a general overview for the theory behind these models and the current state of research. We first introduce diffusion models and discuss common motifs used when applying them to biomolecules. We then present the significant outcomes achieved through the application of these models in generative and predictive tasks. This survey aims to provide readers with a comprehensive understanding of the increasingly critical role of diffusion models.

[LG-205] LightCPPgen: An Explainable Machine Learning Pipeline for Rational Design of Cell Penetrating Peptides

Link: https://arxiv.org/abs/2406.01617
Authors: Gabriele Maroni, Filip Stojceski, Lorenzo Pallante, Marco A. Deriu, Dario Piga, Gianvito Grasso
Keywords: therapeutic molecules, Cell-penetrating peptides, powerful vectors, intracellular delivery, diverse array
Subjects: Biomolecules (q-bio.BM); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments:

Abstract:Cell-penetrating peptides (CPPs) are powerful vectors for the intracellular delivery of a diverse array of therapeutic molecules. Despite their potential, the rational design of CPPs remains a challenging task that often requires extensive experimental efforts and iterations. In this study, we introduce an innovative approach for the de novo design of CPPs, leveraging the strengths of machine learning (ML) and optimization algorithms. Our strategy, named LightCPPgen, integrates a LightGBM-based predictive model with a genetic algorithm (GA), enabling the systematic generation and optimization of CPP sequences. At the core of our methodology is the development of an accurate, efficient, and interpretable predictive model, which utilizes 20 explainable features to shed light on the critical factors influencing CPP translocation capacity. The CPP predictive model works synergistically with an optimization algorithm, which is tuned to enhance computational efficiency while maintaining optimization performance. The GA solutions specifically target the candidate sequences’ penetrability score, while trying to maximize similarity with the original non-penetrating peptide in order to retain its original biological and physicochemical properties. By prioritizing the synthesis of only the most promising CPP candidates, LightCPPgen can drastically reduce the time and cost associated with wet lab experiments. In summary, our research makes a substantial contribution to the field of CPP design, offering a robust framework that combines ML and optimization techniques to facilitate the rational design of penetrating peptides, by enhancing the explainability and interpretability of the design process.

[LG-206] Markov Chain Monte Carlo with Gaussian Process Emulation for a 1D Hemodynamics Model of CTEPH

Link: https://arxiv.org/abs/2406.01599
Authors: Amirreza Kachabi, Mitchel J. Colebank, Sofia Altieri Correa, Naomi C. Chesler
Keywords: persistent pulmonary hypertension, thromboembolic pulmonary hypertension, chronic thromboembolic pulmonary, pulmonary hypertension, persistent pulmonary
Subjects: Quantitative Methods (q-bio.QM); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an); Applications (stat.AP)
Comments:

Abstract:Microvascular disease is a contributor to persistent pulmonary hypertension in those with chronic thromboembolic pulmonary hypertension (CTEPH). The heterogeneous nature of the micro- and macrovascular defects motivates the use of personalized computational models, which can predict flow dynamics within multiple generations of the arterial tree and into the microvasculature. Our study uses computational hemodynamics models and Gaussian processes for rapid, subject-specific calibration using retrospective data from a large animal model of CTEPH. Our subject-specific predictions shed light on microvascular dysfunction and arterial wall shear stress changes in CTEPH.

[LG-207] Quantum consistent neural/tensor networks for photonic circuits with strongly/weakly entangled states

Link: https://arxiv.org/abs/2406.01157
Authors: Nicolas Allegra
关键词: Modern quantum optical, imaging devices require, realistically exploit entang