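This blog post lists the latest papers fetched from arXiv each day. It is updated automatically at around 11:30 every morning and is organized into five broad areas: NLP, CV, ML, AI, and IR.

Tip: to receive the daily paper list by email, leave your email address in the comments. The email is also sent automatically at around 11:30 each day.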

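Overview (2024-06-07)

453 papers were updated today, including:

  • Natural Language Processing: 85 papers (Computation and Language (cs.CL))
  • Computer Vision: 106 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Artificial Intelligence: 119 papers (Artificial Intelligence (cs.AI))
  • Machine Learning: 192 papers (Machine Learning (cs.LG))

Natural Language Processing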

[NLP-0] Verbalized Machine Learning: Revisiting Machine Learning with Language Models

Link: https://arxiv.org/abs/2406.04344
Authors: Tim Z. Xiao, Robert Bamler, Bernhard Schölkopf, Weiyang Liu
Keywords: large progress made, large language models, machine learning models, machine learning, large progress
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical Report v1 (92 pages, 15 figures)

Abstract:Motivated by the large progress made by large language models (LLMs), we introduce the framework of verbalized machine learning (VML). In contrast to conventional machine learning models that are typically optimized over a continuous parameter space, VML constrains the parameter space to be human-interpretable natural language. Such a constraint leads to a new perspective of function approximation, where an LLM with a text prompt can be viewed as a function parameterized by the text prompt. Guided by this perspective, we revisit classical machine learning problems, such as regression and classification, and find that these problems can be solved by an LLM-parameterized learner and optimizer. The major advantages of VML include (1) easy encoding of inductive bias: prior knowledge about the problem and hypothesis class can be encoded in natural language and fed into the LLM-parameterized learner; (2) automatic model class selection: the optimizer can automatically select a concrete model class based on data and verbalized prior knowledge, and it can update the model class during training; and (3) interpretable learner updates: the LLM-parameterized optimizer can provide explanations for why each learner update is performed. We conduct several studies to empirically evaluate the effectiveness of VML, and hope that VML can serve as a stepping stone to stronger interpretability and trustworthiness in ML.

[NLP-1] PaCE: Parsimonious Concept Engineering for Large Language Models

Link: https://arxiv.org/abs/2406.04331
Authors: Jinqi Luo, Tianjiao Ding, Kwan Ho Ryan Chan, Darshan Thaker, Aditya Chattopadhyay, Chris Callison-Burch, René Vidal
Keywords: Large Language Models, Large Language, wide variety, Large, Alignment
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: 26 pages, 17 figures, 5 tables, dataset and code at this https URL

Abstract:Large Language Models (LLMs) are being used for a wide variety of tasks. While they are capable of generating human-like responses, they can also produce undesirable output including potentially harmful information, racist or sexist language, and hallucinations. Alignment methods are designed to reduce such undesirable output, via techniques such as fine-tuning, prompt engineering, and representation engineering. However, existing methods face several challenges: some require costly fine-tuning for every alignment task; some do not adequately remove undesirable concepts, failing alignment; some remove benign concepts, lowering the linguistic capabilities of LLMs. To address these issues, we propose Parsimonious Concept Engineering (PaCE), a novel activation engineering framework for alignment. First, to sufficiently model the concepts, we construct a large-scale concept dictionary in the activation space, in which each atom corresponds to a semantic concept. Then, given any alignment task, we instruct a concept partitioner to efficiently annotate the concepts as benign or undesirable. Finally, at inference time, we decompose the LLM activations along the concept dictionary via sparse coding, to accurately represent the activation as a linear combination of the benign and undesirable components. By removing the latter ones from the activation, we reorient the behavior of LLMs towards alignment goals. We conduct experiments on tasks such as response detoxification, faithfulness enhancement, and sentiment revising, and show that PaCE achieves state-of-the-art alignment performance while maintaining linguistic capabilities.
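The decomposition step described in the PaCE abstract can be illustrated with a toy sketch. This is not the authors' implementation. It assumes a tiny hand-written dictionary of orthonormal atoms and uses plain matching pursuit as a stand-in for the sparse coding solver, which the abstract does not name. The function names `matching_pursuit` and `remove_undesirable` are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matching_pursuit(x, atoms, n_nonzero=2):
    """Greedy sparse coding over unit-norm atoms (stand-in solver, an assumption)."""
    residual = list(x)
    coeffs = [0.0] * len(atoms)
    for _ in range(n_nonzero):
        # Pick the atom most correlated with what is still unexplained.
        scores = [dot(residual, a) for a in atoms]
        k = max(range(len(atoms)), key=lambda i: abs(scores[i]))
        coeffs[k] += scores[k]
        residual = [r - scores[k] * a for r, a in zip(residual, atoms[k])]
    return coeffs, residual

def remove_undesirable(x, atoms, undesirable, n_nonzero=2):
    """Subtract the components of x that lie along atoms flagged as undesirable."""
    coeffs, _ = matching_pursuit(x, atoms, n_nonzero)
    edited = list(x)
    for k in undesirable:
        edited = [e - coeffs[k] * a for e, a in zip(edited, atoms[k])]
    return edited

# Toy "concept dictionary": three orthonormal atoms; atom 2 is labeled undesirable.
ATOMS = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
activation = [2.0, 0.0, 3.0]
edited = remove_undesirable(activation, ATOMS, undesirable=[2])
print(edited)  # [2.0, 0.0, 0.0]
```

With a real, overcomplete dictionary the atoms are not orthogonal, so the choice of sparse solver and sparsity level would matter far more than it does in this toy case.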

[NLP-2] Improving Alignment and Robustness with Short Circuiting

Link: https://arxiv.org/abs/2406.04313
Authors: Andy Zou, Long Phan, Justin Wang, Derek Duenas, Maxwell Lin, Maksym Andriushchenko, Rowan Wang, Zico Kolter, Matt Fredrikson, Dan Hendrycks
Keywords: highly vulnerable, harmful, adversarial, attacks, harmful outputs
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)

Abstract:AI systems can take harmful actions and are highly vulnerable to adversarial attacks. We present an approach, inspired by recent advances in representation engineering, that “short-circuits” models as they respond with harmful outputs. Existing techniques aimed at improving alignment, such as refusal training, are often bypassed. Techniques such as adversarial training try to plug these holes by countering specific attacks. As an alternative to refusal training and adversarial training, short-circuiting directly controls the representations that are responsible for harmful outputs in the first place. Our technique can be applied to both text-only and multimodal language models to prevent the generation of harmful outputs without sacrificing utility – even in the presence of powerful unseen attacks. Notably, while adversarial robustness in standalone image recognition remains an open challenge, short-circuiting allows the larger multimodal system to reliably withstand image “hijacks” that aim to produce harmful content. Finally, we extend our approach to AI agents, demonstrating considerable reductions in the rate of harmful actions when they are under attack. Our approach represents a significant step forward in the development of reliable safeguards to harmful behavior and adversarial attacks.

[NLP-3] Measuring and Addressing Indexical Bias in Information Retrieval

Link: https://arxiv.org/abs/2406.04298
Authors: Caleb Ziems, William Held, Jane Dwivedi-Yu, Diyi Yang
Keywords: Information Retrieval, deliver relevant content, relevant content, rankings for fairness, balance of ideas
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: ACL 2024

Abstract:Information Retrieval (IR) systems are designed to deliver relevant content, but traditional systems may not optimize rankings for fairness, neutrality, or the balance of ideas. Consequently, IR can often introduce indexical biases, or biases in the positional order of documents. Although indexical bias can demonstrably affect people’s opinion, voting patterns, and other behaviors, these issues remain understudied as the field lacks reliable metrics and procedures for automatically measuring indexical bias. Towards this end, we introduce the PAIR framework, which supports automatic bias audits for ranked documents or entire IR systems. After introducing DUO, the first general-purpose automatic bias metric, we run an extensive evaluation of 8 IR systems on a new corpus of 32k synthetic and 4.7k natural documents, with 4k queries spanning 1.4k controversial issue topics. A human behavioral study validates our approach, showing that our bias metric can help predict when and how indexical bias will shift a reader’s opinion.

[NLP-4] VISTA: Visualized Text Embedding For Universal Multi-Modal Retrieval

Link: https://arxiv.org/abs/2406.04292
Authors: Junjie Zhou, Zheng Liu, Shitao Xiao, Bo Zhao, Yongping Xiong
Keywords: popular in practice, increasingly popular, Multi-modal retrieval, Multi-modal, data
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ACL 2024 main conference

Abstract:Multi-modal retrieval is becoming increasingly popular in practice. However, the existing retrievers are mostly text-oriented and lack the capability to process visual information. Despite the presence of vision-language models like CLIP, the current methods are severely limited in representing the text-only and image-only data. In this work, we present a new embedding model, VISTA, for universal multi-modal retrieval. Our work brings forth threefold technical contributions. Firstly, we introduce a flexible architecture which extends a powerful text encoder with the image understanding capability by introducing visual token embeddings. Secondly, we develop two data generation strategies, which bring high-quality composed image-text to facilitate the training of the embedding model. Thirdly, we introduce a multi-stage training algorithm, which first aligns the visual token embedding with the text encoder using massive weakly labeled data, and then develops multi-modal representation capability using the generated composed image-text data. In our experiments, VISTA achieves superior performance across a variety of multi-modal retrieval tasks in both zero-shot and supervised settings. Our model, data, and source code are available at this https URL.

[NLP-5] What Languages are Easy to Language-Model? A Perspective from Learning Probabilistic Regular Languages

Link: https://arxiv.org/abs/2406.04289
Authors: Nadav Borenstein, Anej Svete, Robin Chan, Josef Valvoda, Franz Nowak, Isabelle Augenstein, Eleanor Chodroff, Ryan Cotterell
Keywords: language models learn, large language models, language models, models learn, large language
Categories: Computation and Language (cs.CL)
Comments: Accepted to ACL 2024

Abstract:What can large language models learn? By definition, language models (LMs) are distributions over strings. Therefore, an intuitive way of addressing the above question is to formalize it as a matter of learnability of classes of distributions over strings. While prior work in this direction focused on assessing the theoretical limits, we instead seek to understand the empirical learnability. Unlike prior empirical work, we evaluate neural LMs on their home turf, learning probabilistic languages, rather than as classifiers of formal languages. In particular, we investigate the learnability of regular LMs (RLMs) by RNN and Transformer LMs. We empirically test the learnability of RLMs as a function of various complexity parameters of the RLM and the hidden state size of the neural LM. We find that the RLM rank, which corresponds to the size of linear space spanned by the logits of its conditional distributions, and the expected length of sampled strings are strong and significant predictors of learnability for both RNNs and Transformers. Several other predictors also reach significance, but with differing patterns between RNNs and Transformers.
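To make "language models are distributions over strings" concrete, here is a minimal sketch of a probabilistic finite-state automaton, used here as an assumed formalization of a regular LM. The two-state automaton, its probabilities, and the names `string_probability` and `total_mass` are invented for illustration and do not come from the paper.

```python
from itertools import product

# Hypothetical deterministic automaton over the alphabet {"a", "b"}.
# TRANSITIONS[state][symbol] = (next_state, probability of emitting symbol)
TRANSITIONS = {
    0: {"a": (0, 0.5), "b": (1, 0.3)},
    1: {"a": (0, 0.4), "b": (1, 0.2)},
}
# Probability of ending the string in each state. In every state the
# emission probabilities plus the stop probability sum to 1.
STOP = {0: 0.2, 1: 0.4}

def string_probability(s, start=0):
    """Probability that the automaton generates exactly the string s."""
    state, prob = start, 1.0
    for symbol in s:
        state, p = TRANSITIONS[state][symbol]
        prob *= p
    return prob * STOP[state]

def total_mass(max_len):
    """Sum of probabilities of all strings of length at most max_len."""
    return sum(
        string_probability("".join(chars))
        for n in range(max_len + 1)
        for chars in product("ab", repeat=n)
    )

print(string_probability("ab"))  # 0.5 * 0.3 * 0.4
print(total_mass(12))            # approaches 1 as max_len grows
```

Because every state has a positive stop probability, the mass assigned to strings up to a given length approaches 1 as that length grows, which is what makes this a distribution over finite strings.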

[NLP-6] ABEX: Data Augmentation for Low-Resource NLU via Expanding Abstract Descriptions

Link: https://arxiv.org/abs/2406.04286
Authors: Sreyan Ghosh, Utkarsh Tyagi, Sonal Kumar, C. K. Evuru, S Ramaneswaran, S Sakshi, Dinesh Manocha
Keywords: Natural Language Understanding, low-resource Natural Language, Language Understanding, Natural Language, effective generative data
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: ACL 2024 Main Conference. Code and data: this https URL

Abstract:We present ABEX, a novel and effective generative data augmentation methodology for low-resource Natural Language Understanding (NLU) tasks. ABEX is based on ABstract-and-EXpand, a novel paradigm for generating diverse forms of an input document – we first convert a document into its concise, abstract description and then generate new documents based on expanding the resultant abstraction. To learn the task of expanding abstract descriptions, we first train BART on a large-scale synthetic dataset with abstract-document pairs. Next, to generate abstract descriptions for a document, we propose a simple, controllable, and training-free method based on editing AMR graphs. ABEX brings the best of both worlds: by expanding from abstract representations, it preserves the original semantic properties of the documents, like style and meaning, thereby maintaining alignment with the original label and data distribution. At the same time, the fundamental process of elaborating on abstract descriptions facilitates diverse generations. We demonstrate the effectiveness of ABEX on 4 NLU tasks spanning 12 datasets and 4 low-resource settings. ABEX outperforms all our baselines quantitatively with improvements of 0.04% - 38.8%. Qualitatively, ABEX outperforms all prior methods from literature in terms of context and length diversity.

[NLP-7] Characterizing Similarities and Divergences in Conversational Tones in Humans and LLMs by Sampling with People

Link: https://arxiv.org/abs/2406.04278
Authors: Dun-Ming Huang, Pol Van Rijn, Ilia Sucholutsky, Raja Marjieh, Nori Jacoby
Keywords: Large Language Models, Conversational tones, speakers communicate, effective communication, Language Models
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: Accepted to Main Conference at ACL 2024

Abstract:Conversational tones – the manners and attitudes in which speakers communicate – are essential to effective communication. Amidst the increasing popularization of Large Language Models (LLMs) over recent years, it becomes necessary to characterize the divergences in their conversational tones relative to humans. However, existing investigations of conversational modalities rely on pre-existing taxonomies or text corpora, which suffer from experimenter bias and may not be representative of real-world distributions for the studies’ psycholinguistic domains. Inspired by methods from cognitive science, we propose an iterative method for simultaneously eliciting conversational tones and sentences, where participants alternate between two tasks: (1) one participant identifies the tone of a given sentence and (2) a different participant generates a sentence based on that tone. We run 100 iterations of this process with human participants and GPT-4, then obtain a dataset of sentences and frequent conversational tones. In an additional experiment, humans and GPT-4 annotated all sentences with all tones. With data from 1,339 human participants, 33,370 human judgments, and 29,900 GPT-4 queries, we show how our approach can be used to create an interpretable geometric representation of relations between conversational tones in humans and GPT-4. This work demonstrates how combining ideas from machine learning and cognitive science can address challenges in human-computer interactions.

[NLP-8] Self-Play with Adversarial Critic: Provable and Scalable Offline Alignment for Language Models

Link: https://arxiv.org/abs/2406.04274
Authors: Xiang Ji, Sanjeev Kulkarni, Mengdi Wang, Tengyang Xie
Keywords: aligning large language, large language models, preference optimization methods, studies the challenge, challenge of aligning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:This work studies the challenge of aligning large language models (LLMs) with offline preference data. We focus on alignment by Reinforcement Learning from Human Feedback (RLHF) in particular. While popular preference optimization methods exhibit good empirical performance in practice, they are not theoretically guaranteed to converge to the optimal policy and can provably fail when the data coverage is sparse by classical offline reinforcement learning (RL) results. On the other hand, a recent line of work has focused on theoretically motivated preference optimization methods with provable guarantees, but these are not computationally efficient for large-scale applications like LLM alignment. To bridge this gap, we propose SPAC, a new offline preference optimization method with self-play, inspired by the on-average pessimism technique from the offline RL literature, to be the first provable and scalable approach to LLM alignment. We both provide theoretical analysis for its convergence under single-policy concentrability for the general function approximation setting and demonstrate its competitive empirical performance for LLM alignment on a 7B Mistral model with Open LLM Leaderboard evaluations.

[NLP-9] Buffer of Thoughts: Thought-Augmented Reasoning with Large Language Models

Link: https://arxiv.org/abs/2406.04271
Authors: Ling Yang, Zhaochen Yu, Tianjun Zhang, Shiyi Cao, Minkai Xu, Wentao Zhang, Joseph E. Gonzalez, Bin Cui
Keywords: versatile thought-augmented reasoning, thought-augmented reasoning approach, introduce Buffer, large language models, Buffer of Thoughts
Categories: Computation and Language (cs.CL)
Comments: Project: this https URL

Abstract:We introduce Buffer of Thoughts (BoT), a novel and versatile thought-augmented reasoning approach for enhancing accuracy, efficiency and robustness of large language models (LLMs). Specifically, we propose meta-buffer to store a series of informative high-level thoughts, namely thought-template, distilled from the problem-solving processes across various tasks. Then for each problem, we retrieve a relevant thought-template and adaptively instantiate it with specific reasoning structures to conduct efficient reasoning. To guarantee the scalability and stability, we further propose buffer-manager to dynamically update the meta-buffer, thus enhancing the capacity of meta-buffer as more tasks are solved. We conduct extensive experiments on 10 challenging reasoning-intensive tasks, and achieve significant performance improvements over previous SOTA methods: 11% on Game of 24, 20% on Geometric Shapes and 51% on Checkmate-in-One. Further analysis demonstrates the superior generalization ability and model robustness of our BoT, while requiring only 12% of the cost of multi-query prompting methods (e.g., tree/graph of thoughts) on average. Notably, we find that our Llama3-8B+BoT has the potential to surpass Llama3-70B model. Our project is available at: this https URL

[NLP-10] Transformers need glasses! Information over-squashing in language tasks

Link: https://arxiv.org/abs/2406.04267
Authors: Federico Barbero, Andrea Banino, Steven Kapturowski, Dharshan Kumaran, João G.M. Araújo, Alex Vitvitskyi, Razvan Pascanu, Petar Veličković
Keywords: existing frontier large, frontier large language, study how information, information propagates, architectural backbone
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:We study how information propagates in decoder-only Transformers, which are the architectural backbone of most existing frontier large language models (LLMs). We rely on a theoretical signal propagation analysis – specifically, we analyse the representations of the last token in the final layer of the Transformer, as this is the representation used for next-token prediction. Our analysis reveals a representational collapse phenomenon: we prove that certain distinct sequences of inputs to the Transformer can yield arbitrarily close representations in the final token. This effect is exacerbated by the low-precision floating-point formats frequently used in modern LLMs. As a result, the model is provably unable to respond to these sequences in different ways – leading to errors in, e.g., tasks involving counting or copying. Further, we show that decoder-only Transformer language models can lose sensitivity to specific tokens in the input, which relates to the well-known phenomenon of over-squashing in graph neural networks. We provide empirical evidence supporting our claims on contemporary LLMs. Our theory also points to simple solutions towards ameliorating these issues.
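The abstract's point that low-precision formats can make distinct inputs indistinguishable can be shown with a toy calculation. The sketch below is a stand-in and not the paper's construction: it assumes scalar token values and uniform averaging in place of attention, and rounds to IEEE 754 half precision with the standard-library `struct` module. The helper names are hypothetical.

```python
import struct

def to_float16(x):
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack("e", struct.pack("e", x))[0]

def mean_of_ones_then_zero(n):
    """Uniform average over n tokens of value 1.0 followed by one token of value 0.0."""
    return n / (n + 1)

# Two different sequence lengths give two different averages in double precision...
rep_a = mean_of_ones_then_zero(10_000)
rep_b = mean_of_ones_then_zero(20_000)
print(rep_a != rep_b)  # True

# ...but both round to the same half-precision value, so anything computed
# from this number alone cannot tell the two sequences apart.
print(to_float16(rep_a) == to_float16(rep_b))  # True
```

The two averages already differ by only about 5e-5 in double precision, and half precision cannot resolve differences that small near 1.0, which is why a count of the repeated tokens is no longer recoverable from the rounded value.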

[NLP-11] MLVU: A Comprehensive Benchmark for Multi-Task Long Video Understanding

Link: https://arxiv.org/abs/2406.04264
Authors: Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Shitao Xiao, Xi Yang, Yongping Xiong, Bo Zhang, Tiejun Huang, Zheng Liu
Keywords: Long Video Understanding, Video Understanding, Multi-task Long Video, video understanding benchmarks, Long Video
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:The evaluation of Long Video Understanding (LVU) performance poses an important but challenging research problem. Despite previous efforts, the existing video understanding benchmarks are severely constrained by several issues, especially the insufficient lengths of videos, a lack of diversity in video types and evaluation tasks, and the inappropriateness for evaluating LVU performances. To address the above problems, we propose a new benchmark, called MLVU (Multi-task Long Video Understanding Benchmark), for the comprehensive and in-depth evaluation of LVU. MLVU presents the following critical values: 1) The substantial and flexible extension of video lengths, which enables the benchmark to evaluate LVU performance across a wide range of durations. 2) The inclusion of various video genres, e.g., movies, surveillance footage, egocentric videos, cartoons, game videos, etc., which reflects the models’ LVU performances in different scenarios. 3) The development of diversified evaluation tasks, which enables a comprehensive examination of MLLMs’ key abilities in long-video understanding. The empirical study with 20 latest MLLMs reveals significant room for improvement in today’s technique, as all existing methods struggle with most of the evaluation tasks and exhibit severe performance degradation when handling longer videos. Additionally, it suggests that factors such as context length, image-understanding quality, and the choice of LLM backbone can play critical roles in future advancements. We anticipate that MLVU will advance the research of long video understanding by providing a comprehensive and in-depth analysis of MLLMs.

[NLP-12] Benchmark Data Contamination of Large Language Models: A Survey

Link: https://arxiv.org/abs/2406.04244
Authors: Cheng Xu, Shuhao Guan, Derek Greene, M-Tahar Kechadi
Keywords: Large Language Models, natural language processing, development of Large, Gemini has transformed, Large Language
Categories: Computation and Language (cs.CL)
Comments: 31 pages, 7 figures, 3 tables

Abstract:The rapid development of Large Language Models (LLMs) like GPT-4, Claude-3, and Gemini has transformed the field of natural language processing. However, it has also resulted in a significant issue known as Benchmark Data Contamination (BDC). This occurs when language models inadvertently incorporate evaluation benchmark information from their training data, leading to inaccurate or unreliable performance during the evaluation phase of the process. This paper reviews the complex challenge of BDC in LLM evaluation and explores alternative assessment methods to mitigate the risks associated with traditional benchmarks. The paper also examines challenges and future directions in mitigating BDC risks, highlighting the complexity of the issue and the need for innovative solutions to ensure the reliability of LLM evaluation in real-world applications.

[NLP-13] Hypernetworks for Personalizing ASR to Atypical Speech

Link: https://arxiv.org/abs/2406.04240
Authors: Max Mueller-Eberstein, Dianna Yee, Karren Yang, Gautam Varma Mantena, Colin Lea
Keywords: recently shown promise, automatic speech recognition, personalizing automatic speech, adapting general population, general population models
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Abstract:Parameter-efficient fine-tuning (PEFT) for personalizing automatic speech recognition (ASR) has recently shown promise for adapting general population models to atypical speech. However, these approaches assume a priori knowledge of the atypical speech disorder being adapted for – the diagnosis of which requires expert knowledge that is not always available. Even given this knowledge, data scarcity and high inter/intra-speaker variability further limit the effectiveness of traditional fine-tuning. To circumvent these challenges, we first identify the minimal set of model parameters required for ASR adaptation. Our analysis of each individual parameter’s effect on adaptation performance allows us to reduce Word Error Rate (WER) by half while adapting 0.03% of all weights. Alleviating the need for cohort-specific models, we next propose the novel use of a meta-learned hypernetwork to generate highly individualized, utterance-level adaptations on-the-fly for a diverse set of atypical speech characteristics. Evaluating adaptation at the global, cohort and individual-level, we show that hypernetworks generalize better to out-of-distribution speakers, while maintaining an overall relative WER reduction of 75.2% using 0.1% of the full parameter budget.
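The abstract reports results as Word Error Rate (WER) and relative WER reduction. As a reminder of how such figures are commonly computed, here is a minimal sketch that assumes the standard definition: word-level edit distance divided by the number of reference words. The example sentences and baseline numbers are invented, and the paper's own evaluation setup may differ.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length.

    Assumes a non-empty reference transcript.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] = edit distance between the processed prefix of ref and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)

def relative_reduction(baseline_wer, adapted_wer):
    """Fraction of the baseline error removed by adaptation."""
    return (baseline_wer - adapted_wer) / baseline_wer

# One substitution ("sat" -> "sit") and one deletion ("mat") over 5 reference words.
print(word_error_rate("the cat sat on mat", "the cat sit on"))  # 0.4
```

Under this definition, a relative reduction of 75.2% means the adapted model keeps roughly a quarter of the baseline's errors, whatever the absolute WER values are.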

[NLP-14] FairytaleQA Translated: Enabling Educational Question and Answer Generation in Less-Resourced Languages

Link: https://arxiv.org/abs/2406.04233
Authors: Bernardo Leite, Tomás Freitas Osório, Henrique Lopes Cardoso
Keywords: Question Answering, machines and humans, crucial in assessing, assessing reading comprehension, comprehension skills
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Preprint - Accepted for publication at ECTEL 2024

Abstract:Question Answering (QA) datasets are crucial in assessing reading comprehension skills for both machines and humans. While numerous datasets have been developed in English for this purpose, a noticeable void exists in less-resourced languages. To alleviate this gap, our paper introduces machine-translated versions of FairytaleQA, a renowned QA dataset designed to assess and enhance narrative comprehension skills in young children. By employing fine-tuned, modest-scale models, we establish benchmarks for both Question Generation (QG) and QA tasks within the translated datasets. In addition, we present a case study proposing a model for generating question-answer pairs, with an evaluation incorporating quality metrics such as question well-formedness, answerability, relevance, and children suitability. Our evaluation prioritizes quantifying and describing error cases, along with providing directions for future work. This paper contributes to the advancement of QA and QG research in less-resourced languages, promoting accessibility and inclusivity in the development of these models for reading comprehension. The code and data is publicly available at this http URL.
摘要:问答(QA)数据集在评估机器和人类的阅读理解能力方面都至关重要。虽然已经为此目的用英语开发了许多数据集,但在资源较少的语言中存在明显的空白。为了弥补这一差距,我们的论文引入了FairytaleQA的机器翻译版本,FairytaleQA是一个著名的QA数据集,旨在评估和提高幼儿的叙事理解能力。通过使用微调的中等规模模型,我们在翻译后的数据集上为问题生成(QG)和QA任务建立了基准。此外,我们还提供了一个案例研究,提出了一个生成问答对的模型,其评估包含了问题的良构性、可回答性、相关性和儿童适宜性等质量指标。我们的评估着重于量化和描述错误案例,并为未来的工作提供方向。本文有助于推进资源较少语言中的QA和QG研究,促进这些阅读理解模型开发中的可及性和包容性。代码和数据可在此http URL上公开获得。

[NLP-15] The CLRS-Text Algorithmic Reasoning Language Benchmark
[NLP-15] CLRS-Text算法推理语言基准

链接: https://arxiv.org/abs/2406.04229
作者: Larisa Markeeva,Sean McLeish,Borja Ibarz,Wilfried Bounsi,Olga Kozlova,Alex Vitvitskyi,Charles Blundell,Tom Goldstein,Avi Schwarzschild,Petar Veličković
关键词: Eliciting reasoning capabilities, building intelligent systems, Eliciting reasoning, language models, intelligent systems
中文关键词: 激发推理能力,构建智能系统,激发推理,语言模型,智能系统
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)
备注: Preprint, under review. Comments welcome


Abstract:Eliciting reasoning capabilities from language models (LMs) is a critical direction on the path towards building intelligent systems. Most recent studies dedicated to reasoning focus on out-of-distribution performance on procedurally-generated synthetic benchmarks, bespoke-built to evaluate specific skills only. This trend makes results hard to transfer across publications, slowing down progress. Three years ago, a similar issue was identified and rectified in the field of neural algorithmic reasoning, with the advent of the CLRS benchmark. CLRS is a dataset generator comprising graph execution traces of classical algorithms from the Introduction to Algorithms textbook. Inspired by this, we propose CLRS-Text – a textual version of these algorithmic traces. Out of the box, CLRS-Text is capable of procedurally generating trace data for thirty diverse, challenging algorithmic tasks across any desirable input distribution, while offering a standard pipeline in which any additional algorithmic tasks may be created in the benchmark. We fine-tune and evaluate various LMs as generalist executors on this benchmark, validating prior work and revealing a novel, interesting challenge for the LM reasoning community. Our code is available at this https URL.
摘要:从语言模型(LM)中激发推理能力是构建智能系统道路上的一个关键方向。最近致力于推理的研究大多关注模型在程序化生成的合成基准上的分布外表现,而这些基准是专门为评估特定技能而定制的。这种趋势使得结果很难在不同的论文之间迁移,从而减缓了进展。三年前,随着CLRS基准的出现,神经算法推理领域发现并纠正了类似的问题。CLRS是一个数据集生成器,包含《算法导论》教科书中经典算法的图执行轨迹。受此启发,我们提出了CLRS-Text,即这些算法轨迹的文本版本。开箱即用,CLRS-Text能够在任意所需的输入分布上为30个多样且具有挑战性的算法任务程序化地生成轨迹数据,同时提供一个标准流水线,可以在基准中创建任何额外的算法任务。我们在这个基准上对各种LM作为通才执行器进行了微调和评估,验证了之前的工作,并为LM推理社区揭示了一个新颖、有趣的挑战。我们的代码可以在这个https URL上找到。

[NLP-16] BEADs: Bias Evaluation Across Domains
[NLP-16] BEADs:跨领域的偏见评估

链接: https://arxiv.org/abs/2406.04220
作者: Shaina Raza,Mizanur Rahman,Michael R. Zhang
关键词: significantly enhanced natural, Recent improvements, natural language processing, enhanced natural language, NLP tasks
中文关键词: 显着增强的自然、最近的改进、自然语言处理、增强的自然语言、NLP任务
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: under review


Abstract:Recent improvements in large language models (LLMs) have significantly enhanced natural language processing (NLP) applications. However, these models can also inherit and perpetuate biases from their training data. Addressing this issue is crucial, yet many existing datasets do not offer evaluation across diverse NLP tasks. To tackle this, we introduce the Bias Evaluations Across Domains (BEADs) dataset, designed to support a wide range of NLP tasks, including text classification, bias entity recognition, bias quantification, and benign language generation. BEADs uses AI-driven annotation combined with experts’ verification to provide reliable labels. This method overcomes the limitations of existing datasets that typically depend on crowd-sourcing, expert-only annotations with limited bias evaluations, or unverified AI labeling. Our empirical analysis shows that BEADs is effective in detecting and reducing biases across different language models, with smaller models fine-tuned on BEADs often outperforming LLMs in bias classification tasks. However, these models may still exhibit biases towards certain demographics. Fine-tuning LLMs with our benign language data also reduces biases while preserving the models’ knowledge. Our findings highlight the importance of comprehensive bias evaluation and the potential of targeted fine-tuning for reducing the bias of LLMs. We are making BEADs publicly available at this https URL. Warning: This paper contains examples that may be considered offensive.
摘要:最近在大语言模型(LLM)方面的改进显著地增强了自然语言处理(NLP)的应用。然而,这些模型也可能继承并延续其训练数据中的偏见。解决这个问题至关重要,但许多现有的数据集并不提供跨不同NLP任务的评估。为了解决这个问题,我们引入了跨域偏见评估(BEADs)数据集,旨在支持广泛的NLP任务,包括文本分类、偏见实体识别、偏见量化和良性语言生成。BEADs使用人工智能驱动的标注结合专家验证来提供可靠的标签。这种方法克服了现有数据集的局限性,这些数据集通常依赖于众包、偏见评估有限的纯专家标注,或者未经验证的AI标注。我们的实证分析表明,BEADs在检测和减少不同语言模型的偏见方面是有效的,在偏见分类任务中,在BEADs上微调的较小模型往往优于LLM。然而,这些模型可能仍会表现出对某些人口群体的偏见。使用我们的良性语言数据对LLM进行微调也可以减少偏见,同时保留模型的知识。我们的发现强调了全面偏见评估的重要性,以及有针对性的微调在减少LLM偏见方面的潜力。我们将在此https URL上公开提供BEADs。警告:本文包含可能被视为冒犯的示例。

[NLP-17] Rethinking LLM and Linguistic Steganalysis: An Efficient Detection of Strongly Concealed Stego
[NLP-17] 重新思考LLM和语言隐写分析:对强烈隐藏的隐写的有效检测

链接: https://arxiv.org/abs/2406.04218
作者: Yifan Tang,Yihao Wang,Ru Zhang,Jianyi Liu
关键词: achieved excellent performance, linguistic steganalysis, complex scenarios, achieved excellent, steganographic text
中文关键词: 实现了出色的性能、语言隐写分析、复杂的场景、实现了出色的、隐写文本
类目: Computation and Language (cs.CL)
备注:


Abstract:To detect stego (steganographic text) in complex scenarios, linguistic steganalysis (LS) with various motivations has been proposed and achieved excellent performance. However, with the development of generative steganography, some stegos have strong concealment, especially after the emergence of LLMs-based steganography, the existing LS has low detection or even cannot detect them. We designed a novel LS with two modes called LSGC. In the generation mode, we created an LS-task “description” and used the generation ability of LLM to explain whether texts to be detected are stegos. On this basis, we rethought the principle of LS and LLMs, and proposed the classification mode. In this mode, LSGC deleted the LS-task “description” and changed the “causalLM” LLMs to the “sequenceClassification” architecture. The LS features can be extracted by only one pass of the model, and a linear layer with initialization weights is added to obtain the classification probability. Experiments on strongly concealed stegos show that LSGC significantly improves detection and reaches SOTA performance. Additionally, LSGC in classification mode greatly reduces training time while maintaining high performance.
摘要:为了检测复杂场景中的隐写文本(stego),人们提出了出于多种动机的语言隐写分析(LS)方法,并取得了出色的性能。然而,随着生成式隐写术的发展,一些隐写文本具有很强的隐蔽性,特别是在基于LLM的隐写术出现后,现有的LS检测能力较低,甚至无法检测到它们。我们设计了一种具有两种模式的新型LS,称为LSGC。在生成模式中,我们创建了LS任务“描述”,并利用LLM的生成能力来解释待检测文本是否为隐写文本。在此基础上,我们重新思考了LS和LLM的原理,提出了分类模式。在这种模式下,LSGC删除了LS任务“描述”,并将“causalLM”LLM更改为“sequenceClassification”架构。模型只需一次前向传递即可提取LS特征,并加入一个带有初始化权重的线性层来获得分类概率。在强隐蔽性隐写文本上的实验表明,LSGC显著提高了检测性能,达到了SOTA水平。此外,分类模式下的LSGC在保持高性能的同时大大减少了训练时间。

[NLP-18] What Do Language Models Learn in Context? The Structured Task Hypothesis
[NLP-18] 语言模型在上下文中学习什么?结构化任务假设

链接: https://arxiv.org/abs/2406.04216
作者: Jiaoda Li,Yifan Hou,Mrinmaya Sachan,Ryan Cotterell
关键词: Large language models, Large language, termed in-context learning, termed in-context, exhibit an intriguing
中文关键词: 大型语言模型,大型语言,称为上下文学习,称为上下文学习,表现出有趣的
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: This work is published in ACL 2024


Abstract:Large language models (LLMs) exhibit an intriguing ability to learn a novel task from in-context examples presented in a demonstration, termed in-context learning (ICL). Understandably, a swath of research has been dedicated to uncovering the theories underpinning ICL. One popular hypothesis explains ICL by task selection. LLMs identify the task based on the demonstration and generalize it to the prompt. Another popular hypothesis is that ICL is a form of meta-learning, i.e., the models learn a learning algorithm at pre-training time and apply it to the demonstration. Finally, a third hypothesis argues that LLMs use the demonstration to select a composition of tasks learned during pre-training to perform ICL. In this paper, we empirically explore these three hypotheses that explain LLMs’ ability to learn in context with a suite of experiments derived from common text classification tasks. We invalidate the first two hypotheses with counterexamples and provide evidence in support of the last hypothesis. Our results suggest an LLM could learn a novel task in context via composing tasks learned during pre-training.
摘要:大型语言模型(LLM)展示了一种有趣的能力,即从演示中给出的上下文示例学习新任务,称为上下文学习(ICL)。可以理解的是,大量研究一直致力于揭示支撑ICL的理论。一种流行的假设用任务选择来解释ICL:LLM根据演示识别任务,并将其泛化到提示上。另一个流行的假设是ICL是元学习的一种形式,即模型在预训练时学习一种学习算法并将其应用于演示。最后,第三个假设认为,LLM使用演示来选择在预训练期间学到的任务的组合来执行ICL。在本文中,我们通过一组源自常见文本分类任务的实验,对这三个解释LLM上下文学习能力的假设进行了实证研究。我们用反例否定了前两个假设,并提供了支持最后一个假设的证据。我们的结果表明,LLM可以通过组合在预训练中学到的任务,在上下文中学习新的任务。

[NLP-19] mCSQA: Multilingual Commonsense Reasoning Dataset with Unified Creation Strategy by Language Models and Humans
[NLP-19] mCSQA:具有语言模型和人类统一创建策略的多语言常识推理数据集

链接: https://arxiv.org/abs/2406.04215
作者: Yusuke Sakai,Hidetaka Kamigaito,Taro Watanabe
关键词: natural language understanding, evaluate natural language, language understanding capabilities, challenging to curate, common sense
中文关键词: 自然语言理解、评估自然语言、语言理解能力、具有挑战性的策展、常识
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted at Findings of ACL 2024


Abstract:It is very challenging to curate a dataset for language-specific knowledge and common sense in order to evaluate natural language understanding capabilities of language models. Due to the limitation in the availability of annotators, most current multilingual datasets are created through translation, which cannot evaluate such language-specific aspects. Therefore, we propose Multilingual CommonsenseQA (mCSQA) based on the construction process of CSQA but leveraging language models for a more efficient construction, e.g., by asking LM to generate questions/answers, refine answers and verify QAs followed by reduced human efforts for verification. Constructed dataset is a benchmark for cross-lingual language-transfer capabilities of multilingual LMs, and experimental results showed high language-transfer capabilities for questions that LMs could easily solve, but lower transfer capabilities for questions requiring deep knowledge or commonsense. This highlights the necessity of language-specific datasets for evaluation and training. Finally, our method demonstrated that multilingual LMs could create QA including language-specific knowledge, significantly reducing the dataset creation cost compared to manual creation. The datasets are available at this https URL.
摘要:为了评估语言模型的自然语言理解能力,整理涵盖特定语言知识和常识的数据集是非常具有挑战性的。由于标注人员的可获得性有限,目前大多数多语言数据集都是通过翻译创建的,无法评估这些特定语言的方面。因此,我们提出了多语言常识问答(mCSQA),它基于CSQA的构建过程,但利用语言模型来更高效地构建,例如,要求LM生成问题/答案、改进答案并验证问答对,随后只需较少的人工验证。构建的数据集是多语言LM跨语言迁移能力的基准,实验结果表明,对于LM容易解决的问题,语言迁移能力较高,而对于需要深入知识或常识的问题,迁移能力较低。这突出表明,评估和训练需要特定语言的数据集。最后,我们的方法证明了多语言LM可以创建包含特定语言知识的问答对,与手动创建相比,显著降低了数据集的创建成本。这些数据集可以在此https URL上找到。

[NLP-20] ValueBench: Towards Comprehensively Evaluating Value Orientations and Understanding of Large Language Models
[NLP-20] ValueBench:全面评估价值取向和大型语言模型的理解

链接: https://arxiv.org/abs/2406.04214
作者: Yuanyi Ren,Haoran Ye,Hanjun Fang,Xin Zhang,Guojie Song
关键词: Large Language Models, Large Language, Language Models, transforming diverse fields, gaining increasing influence
中文关键词: 大语言模型,大语言,语言模型,改变不同领域,影响力越来越大
类目: Computation and Language (cs.CL)
备注: Accepted at ACL 2024


Abstract:Large Language Models (LLMs) are transforming diverse fields and gaining increasing influence as human proxies. This development underscores the urgent need for evaluating value orientations and understanding of LLMs to ensure their responsible integration into public-facing applications. This work introduces ValueBench, the first comprehensive psychometric benchmark for evaluating value orientations and value understanding in LLMs. ValueBench collects data from 44 established psychometric inventories, encompassing 453 multifaceted value dimensions. We propose an evaluation pipeline grounded in realistic human-AI interactions to probe value orientations, along with novel tasks for evaluating value understanding in an open-ended value space. With extensive experiments conducted on six representative LLMs, we unveil their shared and distinctive value orientations and exhibit their ability to approximate expert conclusions in value-related extraction and generation tasks. ValueBench is openly accessible at this https URL.
摘要:大型语言模型(LLM)正在改变各个领域,并作为人类的代理获得越来越大的影响力。这一发展突出表明,迫切需要评估LLM的价值取向和价值理解,以确保将其负责任地集成到面向公众的应用程序中。这项工作介绍了ValueBench,这是第一个用于评估LLM价值取向和价值理解的综合心理测量基准。ValueBench从44个已建立的心理测量量表中收集数据,涵盖453个多方面的价值维度。我们提出了一种基于现实人机交互的评估流程来探测价值取向,并提出了在开放式价值空间中评估价值理解的新任务。通过对六个具有代表性的LLM进行广泛的实验,我们揭示了它们共有的和独特的价值取向,并展示了它们在与价值相关的提取和生成任务中逼近专家结论的能力。ValueBench可通过此https URL公开访问。

[NLP-21] Legal Documents Drafting with Fine-Tuned Pre-Trained Large Language Model
[NLP-21] 使用微调的预训练大型语言模型起草法律文件

链接: https://arxiv.org/abs/2406.04202
作者: Chun-Hsien Lin,Pu-Jen Cheng
关键词: natural language processing, language model, solving downstream tasks, fine-tuning pre-trained LLM, large-scale Language Models
中文关键词: 自然语言处理、语言模型、解决下游任务、微调预训练LLM、大规模语言模型
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 12th International Conference on Software Engineering Trends (SE 2024), April 27 ~ 28, 2024, Copenhagen, Denmark Volume Editors : David C. Wyld, Dhinaharan Nagamalai (Eds) ISBN : 978-1-923107-24-3


Abstract:With the development of large-scale Language Models (LLM), fine-tuning pre-trained LLM has become a mainstream paradigm for solving downstream tasks of natural language processing. However, training a language model in the legal field requires a large number of legal documents so that the language model can learn legal terminology and the particularity of the format of legal documents. The typical NLP approaches usually rely on many manually annotated data sets for training. However, in the legal field application, it is difficult to obtain a large number of manually annotated data sets, which restricts the typical method applied to the task of drafting legal documents. The experimental results of this paper show that not only can we leverage a large number of annotation-free legal documents without Chinese word segmentation to fine-tune a large-scale language model, but more importantly, it can fine-tune a pre-trained LLM on the local computer to achieve the generating legal document drafts task, and at the same time achieve the protection of information privacy and to improve information security issues.
摘要:随着大规模语言模型的发展,微调预训练语言模型已成为解决自然语言处理下游任务的主流范式。然而,在法律领域训练语言模型需要大量的法律文档,以便语言模型能够学习法律术语和法律文档格式的特殊性。典型的NLP方法通常依赖于许多人工标注的数据集来进行训练。然而,在法律领域的应用中,很难获得大量的人工标注数据集,这限制了典型的方法应用于法律文书起草的任务。本文的实验结果表明,不仅可以利用大量没有中文分词的无标注法律文档来微调大规模的语言模型,更重要的是可以在本地计算机上微调预先训练好的LLM来完成法律文档草稿的生成任务,同时实现对信息隐私的保护和改善信息安全问题。

[NLP-22] DICE: Detecting In-distribution Contamination in LLMs Fine-tuning Phase for Math Reasoning
[NLP-22] DICE:在LLM数学推理微调阶段检测内分布污染

链接: https://arxiv.org/abs/2406.04197
作者: Shangqing Tu,Kejian Zhu,Yushi Bai,Zijun Yao,Lei Hou,Juanzi Li
关键词: large language models, In-distribution contamination, contamination, relies on evaluation, advancement of large
中文关键词: 大型语言模型,内分布污染,污染,依赖于评估,大型的进步
类目: Computation and Language (cs.CL)
备注: 13 pages, 7 figures


Abstract:The advancement of large language models (LLMs) relies on evaluation using public benchmarks, but data contamination can lead to overestimated performance. Previous researches focus on detecting contamination by determining whether the model has seen the exact same data during training. In this work, we argue that even training on data similar to benchmark data inflates performance on in-distribution tasks without improving overall capacity, which we called In-distribution contamination. To effectively detect in-distribution contamination, we propose DICE, a novel method that leverages the internal states of LLMs to locate-then-detect the contamination. DICE first identifies the most sensitive layer to contamination, then trains a classifier based on the internal states of that layer. Experiments reveal DICE’s high accuracy in detecting in-distribution contamination across various LLMs and math reasoning datasets. We also show the generalization capability of the trained DICE detector, which is able to detect contamination across multiple benchmarks with similar distributions. Additionally, we find that the DICE detection scores are positively correlated with the performance of ten LLMs fine-tuned by either us or other organizations on four math reasoning datasets (with R^2 values between 0.6 and 0.75). This indicates that the in-distribution contamination problem potentially lead to an overestimation of the true capabilities of many existing models. The code and data are available at this https URL.
摘要:大型语言模型(LLM)的发展依赖于使用公共基准进行评估,但数据污染可能导致性能被高估。以前的研究侧重于通过确定模型在训练期间是否见过完全相同的数据来检测污染。在这项工作中,我们认为,即使是在与基准数据相似的数据上训练,也会在没有提高整体能力的情况下夸大分布内任务的性能,我们称之为分布内污染。为了有效地检测分布内污染,我们提出了一种新的方法DICE,它利用LLM的内部状态先定位再检测污染。DICE首先识别对污染最敏感的层,然后根据该层的内部状态训练分类器。实验表明,DICE在检测各种LLM和数学推理数据集的分布内污染方面具有很高的准确率。我们还展示了训练好的DICE检测器的泛化能力,它能够在具有相似分布的多个基准上检测污染。此外,我们还发现,DICE检测分数与由我们或其他组织微调的10个LLM在四个数学推理数据集上的性能呈正相关(R^2值在0.6到0.75之间)。这表明,分布内污染问题可能会导致对许多现有模型真实能力的高估。代码和数据可在此https URL上找到。
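
The locate-then-detect idea in the DICE abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the hidden states are plain lists of floats, layer sensitivity is scored by the Euclidean gap between the clean and contaminated centroids, and a nearest-centroid rule stands in for the trained classifier.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def locate_layer(clean_states, contaminated_states):
    """Pick the layer whose clean and contaminated centroids are farthest apart.

    clean_states[layer] and contaminated_states[layer] are lists of hidden-state
    vectors. The centroid gap is an assumed stand-in for a sensitivity score.
    """
    gaps = [
        distance(mean_vector(clean), mean_vector(contaminated))
        for clean, contaminated in zip(clean_states, contaminated_states)
    ]
    return max(range(len(gaps)), key=gaps.__getitem__)


def fit_detector(clean_vectors, contaminated_vectors):
    """Return a nearest-centroid detector (hypothetical stand-in for a classifier)."""
    clean_centroid = mean_vector(clean_vectors)
    contaminated_centroid = mean_vector(contaminated_vectors)

    def detect(vector):
        # True when the vector lies closer to the contaminated centroid.
        return distance(vector, contaminated_centroid) < distance(vector, clean_centroid)

    return detect
```

In use, one would call `locate_layer` on per-layer states collected from a clean and a contaminated model, then pass only the selected layer's states to `fit_detector`. A real detector would presumably use a learned classifier rather than centroids.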

[NLP-23] Confabulation: The Surprising Value of Large Language Model Hallucinations
[NLP-23] 虚构:大型语言模型幻觉的惊人价值

链接: https://arxiv.org/abs/2406.04175
作者: Peiqi Sui,Eamon Duede,Sophie Wu,Richard Jean So
关键词: large language model, categorically negative pitfall, language model, negative pitfall, presents a systematic
中文关键词: 大语言模型,绝对消极的陷阱,语言模型,消极的陷阱,呈现出系统性的
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Forthcoming at ACL2024 main conference. 1 figure


Abstract:This paper presents a systematic defense of large language model (LLM) hallucinations or ‘confabulations’ as a potential resource instead of a categorically negative pitfall. The standard view is that confabulations are inherently problematic and AI research should eliminate this flaw. In this paper, we argue and empirically demonstrate that measurable semantic characteristics of LLM confabulations mirror a human propensity to utilize increased narrativity as a cognitive resource for sense-making and communication. In other words, it has potential value. Specifically, we analyze popular hallucination benchmarks and reveal that hallucinated outputs display increased levels of narrativity and semantic coherence relative to veridical outputs. This finding reveals a tension in our usually dismissive understandings of confabulation. It suggests, counter-intuitively, that the tendency for LLMs to confabulate may be intimately associated with a positive capacity for coherent narrative-text generation.
摘要:本文对大型语言模型(LLM)的幻觉或“虚构”进行了系统的辩护,认为它是一种潜在的资源,而不是绝对的负面陷阱。标准的观点是,虚构本质上是有问题的,人工智能研究应该消除这一缺陷。在这篇文章中,我们论证并实证证明,LLM虚构的可测量的语义特征反映了人类将增强的叙事性作为一种认知资源来进行意义构建和交流的倾向。换句话说,它具有潜在的价值。具体地说,我们分析了流行的幻觉基准,发现与真实输出相比,幻觉输出表现出更高水平的叙事性和语义连贯性。这一发现揭示了我们通常对虚构不屑一顾的理解中存在的一种张力。这表明,与直觉相反的是,LLM的虚构倾向可能与连贯的叙事文本生成这一积极能力密切相关。

[NLP-24] Pointer-Guided Pre-Training: Infusing Large Language Models with Paragraph-Level Contextual Awareness
[NLP-24] 指针引导的预训练:将大型语言模型注入段落级上下文意识

链接: https://arxiv.org/abs/2406.04156
作者: Lars Hillebrand,Prabhupad Pradhan,Christian Bauckhage,Rafet Sifa
关键词: paragraph-level text representations, pre-training technique aimed, large language models, pointer-guided segment ordering, technique aimed
中文关键词: 段落级文本表示、针对预训练技术、大型语言模型、指针引导的片段排序、针对的技术
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 17 pages, 3 figures, 5 tables, accepted at ECML-PKDD 2024


Abstract:We introduce “pointer-guided segment ordering” (SO), a novel pre-training technique aimed at enhancing the contextual understanding of paragraph-level text representations in large language models. Our methodology leverages a self-attention-driven pointer network to restore the original sequence of shuffled text segments, addressing the challenge of capturing the structural coherence and contextual dependencies within documents. This pre-training approach is complemented by a fine-tuning methodology that incorporates dynamic sampling, augmenting the diversity of training instances and improving sample efficiency for various downstream applications. We evaluate our method on a diverse set of datasets, demonstrating its efficacy in tasks requiring sequential text classification across scientific literature and financial reporting domains. Our experiments show that pointer-guided pre-training significantly enhances the model’s ability to understand complex document structures, leading to state-of-the-art performance in downstream classification tasks.
摘要:我们介绍了一种新的预训练技术–指针引导的分段排序(SO),旨在增强对大型语言模型中段级文本表示的上下文理解。我们的方法利用自我注意力驱动的指针网络来恢复洗牌文本片段的原始序列,解决了捕获文档中的结构一致性和上下文相关性的挑战。这种预训练方法得到了微调方法的补充,该方法结合了动态采样,增加了训练实例的多样性,并提高了各种下游应用的样本效率。我们在一组不同的数据集上评估了我们的方法,展示了它在跨科学文献和金融报告领域需要顺序文本分类的任务中的有效性。我们的实验表明,指针引导的预训练显著增强了模型理解复杂文档结构的能力,从而在下游分类任务中获得了最先进的性能。
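
The segment-ordering objective described in the abstract can be made concrete with a small data-preparation sketch. It only shows how a shuffled input and its pointer targets might be built. The function names and the target convention are assumptions made for illustration, and the self-attention pointer network itself is not shown.

```python
import random


def make_segment_ordering_example(segments, seed=0):
    """Shuffle text segments and build pointer targets that undo the shuffle.

    Returns (shuffled, targets), where targets[k] is the position in `shuffled`
    of the segment that was originally k-th. This convention is an assumption.
    """
    rng = random.Random(seed)
    order = list(range(len(segments)))
    rng.shuffle(order)  # order[j] = original index of the segment placed at j
    shuffled = [segments[i] for i in order]
    targets = [0] * len(segments)
    for position, original_index in enumerate(order):
        targets[original_index] = position
    return shuffled, targets


def restore(shuffled, targets):
    """Follow the pointer targets to recover the original segment order."""
    return [shuffled[position] for position in targets]
```

A model trained on such pairs would be asked to predict `targets` from `shuffled`. Calling `restore` with correct targets returns the original paragraph order.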

[NLP-25] AgentGym: Evolving Large Language Model-based Agents across Diverse Environments
[NLP-25] AgentGym:在不同环境中进化基于大型语言模型的代理

链接: https://arxiv.org/abs/2406.04151
作者: Zhiheng Xi,Yiwen Ding,Wenxiang Chen,Boyang Hong,Honglin Guo,Junzhe Wang,Dingwen Yang,Chenyang Liao,Xin Guo,Wei He,Songyang Gao,Lu Chen,Rui Zheng,Yicheng Zou,Tao Gui,Qi Zhang,Xipeng Qiu,Xuanjing Huang,Zuxuan Wu,Yu-Gang Jiang
关键词: Building generalist agents, long-term goal, agents, Building generalist, handle diverse tasks
中文关键词: 构建通才代理,长期目标,代理,构建通才,处理多样化的任务
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Project site: this https URL


Abstract:Building generalist agents that can handle diverse tasks and evolve themselves across different environments is a long-term goal in the AI community. Large language models (LLMs) are considered a promising foundation to build such agents due to their generalized capabilities. Current approaches either have LLM-based agents imitate expert-provided trajectories step-by-step, requiring human supervision, which is hard to scale and limits environmental exploration; or they let agents explore and learn in isolated environments, resulting in specialist agents with limited generalization. In this paper, we take the first step towards building generally-capable LLM-based agents with self-evolution ability. We identify a trinity of ingredients: 1) diverse environments for agent exploration and learning, 2) a trajectory set to equip agents with basic capabilities and prior knowledge, and 3) an effective and scalable evolution method. We propose AgentGym, a new framework featuring a variety of environments and tasks for broad, real-time, uni-format, and concurrent agent exploration. AgentGym also includes a database with expanded instructions, a benchmark suite, and high-quality trajectories across environments. Next, we propose a novel method, AgentEvol, to investigate the potential of agent self-evolution beyond previously seen data across tasks and environments. Experimental results show that the evolved agents can achieve results comparable to SOTA models. We release the AgentGym suite, including the platform, dataset, benchmark, checkpoints, and algorithm implementations. The AgentGym suite is available on this https URL.
摘要:构建能够处理不同任务并在不同环境中自我进化的通才代理是人工智能社区的长期目标。大型语言模型(LLM)因其通用功能而被认为是构建此类代理的有前途的基础。目前的方法要么让基于LLM的代理一步一步地模仿专家提供的轨迹,需要人工监督,这很难扩展,限制了环境探索;要么让代理在孤立的环境中探索和学习,导致专家代理的推广有限。在本文中,我们向构建具有自进化能力的通用的基于LLM的代理迈出了第一步。我们确定了三位一体的要素:1)不同的代理探索和学习环境,2)为代理提供基本能力和先验知识的轨迹集,以及3)有效和可扩展的进化方法。我们提出了AgentGym,这是一个新的框架,具有广泛的、实时的、统一格式的和并发的代理探索的各种环境和任务。AgentGym还包括一个数据库,其中包含扩展的指令、基准测试套件和跨环境的高质量轨迹。接下来,我们提出了一种新的方法,AgentEvol,来研究跨任务和环境的代理自我进化的潜力,而不是以前看到的数据。实验结果表明,进化后的智能体可以获得与SOTA模型相当的结果。我们发布了AgentGym套件,包括平台、数据集、基准测试、检查点和算法实现。AgentGym套件可在此HTTPS URL上找到。

[NLP-26] Towards Understanding Task-agnostic Debiasing Through the Lenses of Intrinsic Bias and Forgetfulness
[NLP-26] 通过内在偏见和遗忘的视角理解任务无关的去偏见

链接: https://arxiv.org/abs/2406.04146
作者: Guangliang Liu,Milad Afshari,Xitong Zhang,Zhiyu Xue,Avrajit Ghosh,Bidhan Bashyal,Rongrong Wang,Kristen Johnson
关键词: debiasing Pretrained Language, Pretrained Language Models, language modeling ability, Pretrained Language, relearning social biases
中文关键词: 去偏见预训练语言、预训练语言模型、语言建模能力、预训练语言、重新学习社会偏见
类目: Computation and Language (cs.CL)
备注:


Abstract:While task-agnostic debiasing provides notable generalizability and reduced reliance on downstream data, its impact on language modeling ability and the risk of relearning social biases from downstream task-specific data remain as the two most significant challenges when debiasing Pretrained Language Models (PLMs). The impact on language modeling ability can be alleviated given a high-quality and long-contextualized debiasing corpus, but there remains a deficiency in understanding the specifics of relearning biases. We empirically ascertain that the effectiveness of task-agnostic debiasing hinges on the quantitative bias level of both the task-specific data used for downstream applications and the debiased model. We empirically show that the lower bound of the bias level of the downstream fine-tuned model can be approximated by the bias level of the debiased model, in most practical cases. To gain more in-depth understanding about how the parameters of PLMs change during fine-tuning due to the forgetting issue of PLMs, we propose a novel framework which can Propagate Socially-fair Debiasing to Downstream Fine-tuning, ProSocialTuning. Our proposed framework can push the fine-tuned model to approach the bias lower bound during downstream fine-tuning, indicating that the ineffectiveness of debiasing can be alleviated by overcoming the forgetting issue through regularizing successfully debiased attention heads based on the PLMs’ bias levels from stages of pretraining and debiasing.
摘要:尽管任务不可知性去偏向提供了显著的概括性并减少了对下游数据的依赖,但它对语言建模能力的影响和从下游特定任务数据中重新学习社会偏见的风险仍然是去偏向预训练语言模型(PLM)时最重要的两个挑战。高质量和长语境化的去偏倚语料库对语言建模能力的影响可以得到缓解,但在理解再学习偏差的细节方面仍然存在不足。我们的经验证明,任务不可知性去偏倚的有效性取决于用于下游应用的特定于任务的数据和去偏模型的量化偏差水平。我们的经验表明,在大多数实际情况下,下游微调模型的偏置水平的下界可以用去偏模型的偏置水平近似。为了更深入地了解由于PLM遗忘问题而导致的PLM参数在微调过程中的变化,我们提出了一种新的框架ProSocial-Tuning,它可以将社会公平去偏向传播到下游的微调。我们提出的框架可以在下游微调时将精调模型推向偏差下限,这表明通过基于PLM在预训练和去偏阶段的偏差水平成功地规则化去偏的注意头部来克服遗忘问题可以缓解去偏的无效。

[NLP-27] Every Answer Matters: Evaluating Commonsense with Probabilistic Measures
[NLP-27] 每个答案都很重要:用概率衡量标准评估常识

链接: https://arxiv.org/abs/2406.04145
作者: Qi Cheng,Michael Boratko,Pranay Kumar Yelugam,Tim O’Gorman,Nalini Singh,Andrew McCallum,Xiang Lorraine Li
关键词: exploit systematic biases, demonstrated impressive performance, Large language models, Large language, multiple-choice questions
中文关键词: 利用系统性偏见,表现出令人印象深刻的表现,大型语言模型,大型语言,多项选择题
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: ACL 2024 Camera Ready


Abstract:Large language models have demonstrated impressive performance on commonsense tasks; however, these tasks are often posed as multiple-choice questions, allowing models to exploit systematic biases. Commonsense is also inherently probabilistic with multiple correct answers. The purpose of “boiling water” could be making tea and cooking, but it also could be killing germs. Existing tasks do not capture the probabilistic nature of common sense. To this end, we present commonsense frame completion (CFC), a new generative task that evaluates common sense via multiple open-ended generations. We also propose a method of probabilistic evaluation that strongly correlates with human judgments. Humans drastically outperform strong language model baselines on our dataset, indicating this approach is both a challenging and useful evaluation of machine common sense.
摘要:大型语言模型在常识任务上表现出了令人印象深刻的性能;然而,这些任务通常被提出为多项选择题,允许模型利用系统性偏差。常识本质上也是概率性的,有多个正确答案。“沸水”的目的可能是泡茶和做饭,但也可能是杀死细菌。现有的任务没有捕捉到常识的概率本质。为此,我们提出了常识框架完成(CFC),这是一项新的生成性任务,通过多个开放式世代评估常识。我们还提出了一种与人类判断密切相关的概率评估方法。人类在我们的数据集上的表现大大超过了强语言模型基线,这表明这种方法既是对机器常识的一种具有挑战性且有用的评估。

[NLP-28] Do Language Models Understand Morality? Towards a Robust Detection of Moral Content
[NLP-28] 语言模型理解道德吗?迈向道德内容的稳健检测

链接: https://arxiv.org/abs/2406.04143
作者: Luana Bulla,Aldo Gangemi,Misael Mongiovì
关键词: natural language processing, including natural language, Natural Language Inference, natural language, Large Language Models
中文关键词: 自然语言处理,包括自然语言、自然语言推理、自然语言、大型语言模型
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:


Abstract:The task of detecting moral values in text has significant implications in various fields, including natural language processing, social sciences, and ethical decision-making. Previously proposed supervised models often suffer from overfitting, leading to hyper-specialized moral classifiers that struggle to perform well on data from different domains. To address this issue, we introduce novel systems that leverage abstract concepts and common-sense knowledge acquired from Large Language Models and Natural Language Inference models during previous stages of training on multiple data sources. By doing so, we aim to develop versatile and robust methods for detecting moral values in real-world scenarios. Our approach uses the GPT 3.5 model as a zero-shot ready-made unsupervised multi-label classifier for moral values detection, eliminating the need for explicit training on labeled data. We compare it with a smaller NLI-based zero-shot model. The results show that the NLI approach achieves competitive results compared to the Davinci model. Furthermore, we conduct an in-depth investigation of the performance of supervised systems in the context of cross-domain multi-label moral value detection. This involves training supervised models on different domains to explore their effectiveness in handling data from different sources and comparing their performance with the unsupervised methods. Our contributions encompass a thorough analysis of both supervised and unsupervised methodologies for cross-domain value detection. We introduce the Davinci model as a state-of-the-art zero-shot unsupervised moral values classifier, pushing the boundaries of moral value detection without the need for explicit training on labeled data. Additionally, we perform a comparative evaluation of our approach with the supervised models, shedding light on their respective strengths and weaknesses.
摘要:检测文本中道德价值的任务在自然语言处理、社会科学和伦理决策等多个领域都具有重要意义。以前提出的监督模型往往存在过拟合问题,导致道德分类器过度专门化,难以在不同领域的数据上表现良好。为了解决这个问题,我们引入了新的系统,这些系统利用大型语言模型和自然语言推理模型在先前针对多个数据源的训练阶段中获得的抽象概念和常识知识。通过这样做,我们的目标是开发出在现实世界场景中检测道德价值的通用且稳健的方法。我们的方法使用GPT 3.5模型作为现成的零样本无监督多标签分类器来检测道德价值,无需在标注数据上进行显式训练。我们将其与一个较小的基于NLI的零样本模型进行比较。结果表明,与Davinci模型相比,NLI方法获得了具有竞争力的结果。此外,我们在跨域多标签道德价值检测的背景下对监督系统的性能进行了深入的研究。这涉及在不同领域上训练监督模型,以探索它们在处理来自不同来源的数据方面的有效性,并将它们的性能与无监督方法进行比较。我们的贡献包括对跨域价值检测的监督和无监督方法进行全面分析。我们将Davinci模型作为一种最先进的零样本无监督道德价值分类器,在不需要在标注数据上进行显式训练的情况下,拓展了道德价值检测的边界。此外,我们将我们的方法与监督模型进行了比较评估,揭示了它们各自的优势和劣势。

[NLP-29] Legal Judgment Reimagined: PredEx and the Rise of Intelligent AI Interpretation in Indian Courts
[NLP-29] 重新构想法律判决:PredEx和印度法院智能人工智能解释的兴起

链接: https://arxiv.org/abs/2406.04136
作者: Shubham Kumar Nigam,Anurag Sharma,Danush Khanna,Noel Shallum,Kripabandhu Ghosh,Arnab Bhattacharya
关键词: Large Language Models, Large Language, predicting judicial outcomes, judicial outcomes poses, outcomes poses significant
中文关键词: 大型语言模型,大型语言,预测司法结果,司法结果构成,结果构成重大
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:In the era of Large Language Models (LLMs), predicting judicial outcomes poses significant challenges due to the complexity of legal proceedings and the scarcity of expert-annotated datasets. Addressing this, we introduce Prediction with Explanation (PredEx), the largest expert-annotated dataset for legal judgment prediction and explanation in the Indian context, featuring over 15,000 annotations. This groundbreaking corpus significantly enhances the training and evaluation of AI models in legal analysis, with innovations including the application of instruction tuning to LLMs. This method has markedly improved the predictive accuracy and explanatory depth of these models for legal judgments. We employed various transformer-based models, tailored for both general and Indian legal contexts. Through rigorous lexical, semantic, and expert assessments, our models effectively leverage PredEx to provide precise predictions and meaningful explanations, establishing it as a valuable benchmark for both the legal profession and the NLP community.
摘要:在大型语言模型(LLM)时代,由于法律程序的复杂性和专家标注数据集的稀缺,预测司法结果带来了巨大的挑战。针对这一问题,我们引入了Prediction with Explanation(PredEx),这是印度语境下最大的用于法律判决预测和解释的专家标注数据集,包含超过15,000条标注。这个突破性的语料库极大地加强了法律分析中人工智能模型的训练和评估,其创新包括将指令微调应用于LLM。该方法显著提高了这些模型对法律判决的预测精度和解释深度。我们采用了各种基于Transformer的模型,分别针对通用语境和印度法律语境进行了定制。通过严格的词汇、语义和专家评估,我们的模型有效地利用PredEx提供准确的预测和有意义的解释,使其成为法律专业和NLP社区的宝贵基准。

[NLP-30] Are We Done with MMLU?
[NLP-30] MMLU结束了吗?

链接: https://arxiv.org/abs/2406.04127
作者: Aryo Pradipta Gema,Joshua Ong Jun Leang,Giwon Hong,Alessio Devoto,Alberto Carlo Maria Mancino,Rohit Saxena,Xuanli He,Yu Zhao,Xiaotang Du,Mohammad Reza Ghasemi Madani,Claire Barale,Robert McHardy,Joshua Harris,Jean Kaddour,Emile van Krieken,Pasquale Minervini
关键词: Multitask Language Understanding, Massive Multitask Language, popular Massive Multitask, Language Understanding, Massive Multitask
中文关键词: 多任务语言理解、大规模多任务语言、流行的大规模多任务、语言理解、大规模多任务
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Maybe not. We identify and analyse errors in the popular Massive Multitask Language Understanding (MMLU) benchmark. Even though MMLU is widely adopted, our analysis demonstrates numerous ground truth errors that obscure the true capabilities of LLMs. For example, we find that 57% of the analysed questions in the Virology subset contain errors. To address this issue, we introduce a comprehensive framework for identifying dataset errors using a novel error taxonomy. Then, we create MMLU-Redux, which is a subset of 3,000 manually re-annotated questions across 30 MMLU subjects. Using MMLU-Redux, we demonstrate significant discrepancies with the model performance metrics that were originally reported. Our results strongly advocate for revising MMLU’s error-ridden questions to enhance its future utility and reliability as a benchmark. Therefore, we open up MMLU-Redux for additional annotation this https URL.
摘要:也许不是。我们识别并分析流行的大规模多任务语言理解(MMLU)基准测试中的错误。尽管MMLU被广泛采用,但我们的分析表明存在许多基本事实错误,这些错误掩盖了LLM的真实能力。例如,我们发现病毒学子集中57%的分析问题包含错误。为了解决这个问题,我们引入了一个全面的框架,用于使用新颖的错误分类法识别数据集错误。然后,我们创建MMLU-Redux,它是30个MMLU科目中3,000个手动重新注释问题的子集。使用MMLU-Redux,我们证明了与最初报告的模型性能指标存在显着差异。我们的结果强烈主张修改MMLU错误百出的问题,以增强其作为基准的未来实用性和可靠性。因此,我们打开MMLU-Redux以进行额外的注释该https URL。
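The effect described in the abstract above, that label errors in a benchmark can distort reported model performance, can be illustrated with a toy re-scoring. This is only a sketch under stated assumptions: the predictions and labels below are invented for illustration, the helper `accuracy` is hypothetical, and treating a question with no valid answer as `None` (and skipping it) is an assumed convention, not the MMLU-Redux protocol.

```python
def accuracy(predictions, labels):
    """Fraction of correct predictions, skipping questions whose label is None.

    A None label is used here (as an assumption) to mark a question that
    re-annotation found to have no valid answer.
    """
    scored = [(p, l) for p, l in zip(predictions, labels) if l is not None]
    return sum(p == l for p, l in scored) / len(scored)


# Invented toy data: five multiple-choice questions.
predictions = ["A", "C", "C", "B", "B"]
original_labels = ["A", "B", "C", "D", "A"]    # labels as originally released
corrected_labels = ["A", "C", "C", None, "A"]  # hypothetical re-annotation

print(f"accuracy on original labels:  {accuracy(predictions, original_labels):.2f}")
print(f"accuracy on corrected labels: {accuracy(predictions, corrected_labels):.2f}")
```

The same predictions score differently once one wrong label is fixed and one invalid question is dropped, which is the kind of discrepancy the abstract reports at benchmark scale.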

[NLP-31] Promoting Fairness and Diversity in Speech Datasets for Mental Health and Neurological Disorders Research
[NLP-31] 促进心理健康和神经系统疾病研究语音数据集的公平性和多样性

链接: https://arxiv.org/abs/2406.04116
作者: Eleonora Mancini,Ana Tanevska,Andrea Galassi,Alessio Galatolo,Federico Ruggeri,Paolo Torroni
关键词: Current research, performance evaluation, machine learning, learning and artificial, artificial intelligence
中文关键词: 当前研究、性能评估、机器学习、学习和人工、人工智能
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 34 pages

点击查看摘要

Abstract:Current research in machine learning and artificial intelligence is largely centered on modeling and performance evaluation, less so on data collection. However, recent research demonstrated that limitations and biases in data may negatively impact trustworthiness and reliability. These aspects are particularly impactful on sensitive domains such as mental health and neurological disorders, where speech data are used to develop AI applications aimed at improving the health of patients and supporting healthcare providers. In this paper, we chart the landscape of available speech datasets for this domain, to highlight possible pitfalls and opportunities for improvement and promote fairness and diversity. We present a comprehensive list of desiderata for building speech datasets for mental health and neurological disorders and distill it into a checklist focused on ethical concerns to foster more responsible research.
摘要:当前机器学习和人工智能的研究主要集中在建模和性能评估上,而不是数据收集。然而,最近的研究表明,数据的限制和偏见可能会对可信度和可靠性产生负面影响。这些方面对心理健康和神经系统疾病等敏感领域尤其有影响,其中语音数据用于开发人工智能应用程序,旨在改善患者的健康并支持医疗保健提供者。在本文中,我们绘制了该领域可用语音数据集的格局,以强调可能的陷阱和改进机会,并促进公平性和多样性。我们提出了一份构建心理健康和神经系统疾病语音数据集的全面需求清单,并将其提炼成一份专注于道德问题的清单,以促进更负责任的研究。

[NLP-32] Uncovering Limitations of Large Language Models in Information Seeking from Tables
[NLP-32] 揭示大型语言模型在从表格中查找信息中的局限性

链接: https://arxiv.org/abs/2406.04113
作者: Chaoxu Pang,Yixuan Cao,Chunhao Yang,Ping Luo
关键词: high information density, Large Language Models, widespread usage, serving as essential, density and widespread
中文关键词: 高信息密度、大型语言模型、广泛使用、充当必需品、密度和广泛
类目: Computation and Language (cs.CL)
备注: Findings of ACL 2024

点击查看摘要

Abstract:Tables are recognized for their high information density and widespread usage, serving as essential sources of information. Seeking information from tables (TIS) is a crucial capability for Large Language Models (LLMs), serving as the foundation of knowledge-based QA systems. However, this field presently suffers from an absence of thorough and reliable evaluation. This paper introduces a more reliable benchmark for Table Information Seeking (TabIS). To avoid the unreliable evaluation caused by text similarity-based metrics, TabIS adopts a single-choice question format (with two options per question) instead of a text generation format. We establish an effective pipeline for generating options, ensuring their difficulty and quality. Experiments conducted on 12 LLMs reveal that while the performance of GPT-4-turbo is marginally satisfactory, both other proprietary and open-source models perform inadequately. Further analysis shows that LLMs exhibit a poor understanding of table structures, and struggle to balance between TIS performance and robustness against pseudo-relevant tables (common in retrieval-augmented systems). These findings uncover the limitations and potential challenges of LLMs in seeking information from tables. We release our data and code to facilitate further research in this field.
摘要:表格因其高度的信息密度和广泛的使用而被公认为是必不可少的信息源。从表格中查找信息是大型语言模型的一项重要功能,是基于知识的问答系统的基础。然而,这一领域目前缺乏全面和可靠的评估。本文介绍了一种更可靠的表信息查找基准(TABIS)。为了避免基于文本相似性度量带来的不可靠评价,Tabis采用了单选题格式(每个问题有两个选项)来代替文本生成格式。我们建立了一条有效的渠道来产生备选方案,确保其难度和质量。在12个LLM上进行的实验表明,虽然GPT-4-Turbo的性能略有令人满意,但其他专有和开源模型的性能都不够好。进一步的分析表明,LLM对表结构的理解很差,并且难以在TIS性能和对伪相关表(在检索增强系统中常见)的健壮性之间取得平衡。这些发现揭示了LLMS在从表格中寻找信息方面的局限性和潜在挑战。我们公布了我们的数据和代码,以促进这一领域的进一步研究。

[NLP-33] Intention and Face in Dialog
[NLP-33] 对话中的意图与面子

链接: https://arxiv.org/abs/2406.04109
作者: Adil Soubki,Owen Rambow
关键词: Brown and Levinson, great detail, face, studied in great, mediate the planning
中文关键词: 布朗和莱文森,非常详细,面对,研究得很好,调解规划
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The notion of face described by Brown and Levinson (1987) has been studied in great detail, but a critical aspect of the framework, that which focuses on how intentions mediate the planning of turns which impose upon face, has received far less attention. We present an analysis of three computational systems trained for classifying both intention and politeness, focusing on how the former influences the latter. In politeness theory, agents attend to the desire to have their wants appreciated (positive face), and a complementary desire to act unimpeded and maintain freedom (negative face). Similar to speech acts, utterances can perform so-called face acts which can either raise or threaten the positive or negative face of the speaker or hearer. We begin by using an existing corpus to train a model which classifies face acts, achieving a new SoTA in the process. We then observe that every face act has an underlying intention that motivates it and perform additional experiments integrating dialog act annotations to provide these intentions by proxy. Our analysis finds that dialog acts improve performance on face act detection for minority classes and points to a close relationship between aspects of face and intent.
摘要:Brown和Levinson(1987)所描述的面子概念已得到非常详细的研究,但该框架的一个关键方面,即意图如何调节对面子造成影响的话轮规划,却鲜有人关注。我们分析了三个被训练用于同时分类意图和礼貌的计算系统,重点关注前者如何影响后者。在礼貌理论中,行为主体关注的是希望自己的需求得到认可的愿望(积极面子),以及一种与之互补的、希望行动不受阻碍并保持自由的愿望(消极面子)。与言语行为类似,话语也可以执行所谓的面子行为,这种面子行为可以提升或威胁说话人或听话人的积极或消极面子。我们首先使用现有的语料库来训练一个对面子行为进行分类的模型,并在这个过程中取得了新的SoTA。然后,我们观察到每个面子行为背后都有一个驱动它的潜在意图,并进行了额外的实验,整合对话行为标注,以间接提供这些意图。我们的分析发现,对话行为提高了少数类别上面子行为检测的性能,并指出了面子的各个方面与意图之间的密切关系。

[NLP-34] Explainability and Hate Speech: Structured Explanations Make Social Media Moderators Faster
[NLP-34] 解释性和仇恨言论:结构化解释让社交媒体版主更快

链接: https://arxiv.org/abs/2406.04106
作者: Agostina Calabrese,Leonardo Neves,Neil Shah,Maarten W. Bos,Björn Ross,Mirella Lapata,Francesco Barbieri
关键词: social media healthy, media healthy, Content moderators play, play a key, key role
中文关键词: 社交媒体健康,媒体健康,内容版主发挥着关键的作用
类目: Computation and Language (cs.CL)
备注: 11 pages, 14 figures, to be published at ACL 2024

点击查看摘要

Abstract:Content moderators play a key role in keeping the conversation on social media healthy. While the high volume of content they need to judge represents a bottleneck to the moderation pipeline, no studies have explored how models could support them to make faster decisions. There is, by now, a vast body of research into detecting hate speech, sometimes explicitly motivated by a desire to help improve content moderation, but published research using real content moderators is scarce. In this work we investigate the effect of explanations on the speed of real-world moderators. Our experiments show that while generic explanations do not affect their speed and are often ignored, structured explanations lower moderators’ decision making time by 7.4%.
摘要:内容版主在保持社交媒体对话健康方面发挥着关键作用。虽然他们需要判断的大量内容是审核管道的瓶颈,但还没有研究探索模型如何支持他们更快地做出决策。到目前为止,有大量关于检测仇恨言论的研究,有时明确的动机是为了帮助改进内容审核,但使用真实内容审核员的已发表研究却很少。在这项工作中,我们研究了解释对现实世界版主速度的影响。我们的实验表明,虽然通用解释不会影响其速度并且经常被忽视,但结构化解释可以将版主的决策时间缩短7.4%。

[NLP-35] Ask LLMs Directly “What shapes your bias?”: Measuring Social Bias in Large Language Models
[NLP-35] 直接询问LLM“是什么塑造了你的偏见?”:测量大型语言模型中的社会偏见

链接: https://arxiv.org/abs/2406.04064
作者: Jisu Shin,Hoyun Song,Huije Lee,Soyeong Jeong,Jong C. Park
关键词: Social, social perceptions, Social bias, LLMs, bias
中文关键词: 社会、社会认知、社会偏见、LLM、偏见
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: Findings of ACL 2024

点击查看摘要

Abstract:Social bias is shaped by the accumulation of social perceptions towards targets across various demographic identities. To fully understand such social bias in large language models (LLMs), it is essential to consider the composite of social perceptions from diverse perspectives among identities. Previous studies have either evaluated biases in LLMs by indirectly assessing the presence of sentiments towards demographic identities in the generated text or measuring the degree of alignment with given stereotypes. These methods have limitations in directly quantifying social biases at the level of distinct perspectives among identities. In this paper, we aim to investigate how social perceptions from various viewpoints contribute to the development of social bias in LLMs. To this end, we propose a novel strategy to intuitively quantify these social perceptions and suggest metrics that can evaluate the social biases within LLMs by aggregating diverse social perceptions. The experimental results show the quantitative demonstration of the social attitude in LLMs by examining social perception. The analysis we conducted shows that our proposed metrics capture the multi-dimensional aspects of social bias, enabling a fine-grained and comprehensive investigation of bias in LLMs.
摘要:社会偏见是由不同人口身份对目标的社会认知积累形成的。为了充分理解大型语言模型(LLM)中的这种社会偏见,有必要考虑不同身份之间不同视角的社会认知的组合。以前的研究要么通过间接评估生成的文本中对人口身份的情绪的存在,要么通过测量与给定的刻板印象的一致程度来评估LLMS中的偏见。这些方法在直接量化不同身份之间不同视角的社会偏见方面存在局限性。在这篇论文中,我们旨在从不同的角度调查社会认知如何促进LLMS中社会偏见的发展。为此,我们提出了一种新的策略来直观地量化这些社会感知,并提出了一种度量方法,通过聚合不同的社会感知来评估LLM中的社会偏见。实验结果表明,通过考察社会知觉,LLMS中的社会态度得到了定量的展示。我们进行的分析表明,我们提出的度量方法捕捉到了社会偏见的多维方面,使得对LLMS中的偏见进行细粒度和全面的调查成为可能。

[NLP-36] The syntax-semantics interface in a child's path: A study of 3- to 11-year-olds' elicited production of Mandarin recursive relative clauses
[NLP-36] 儿童发展路径中的句法-语义接口:3至11岁儿童普通话递归关系从句诱发产出研究

链接: https://arxiv.org/abs/2406.04025
作者: Caimei Yang,Qihang Yang,Xingzhi Su,Chenxi Fu,Xiaoyi Wang,Ying Yan,Zaijiang Man
关键词: apparently conflicting claims, apparently conflicting, conflicting claims, RRCs, subject-gapped RC embedded
中文关键词: 明显相互冲突的主张、明显相互冲突、相互冲突的主张、RRC、嵌入的主语空位RC
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:There have been apparently conflicting claims over the syntax-semantics relationship in child acquisition. However, few of them have assessed the child’s path toward the acquisition of recursive relative clauses (RRCs). The authors of the current paper ran experiments to investigate 3- to 11-year-olds’ most-structured elicited production of eight Mandarin RRCs in a 4 (syntactic types) × 2 (semantic conditions) design. The four syntactic types were RRCs with a subject-gapped RC embedded in an object-gapped RC (SORRCs), RRCs with an object-gapped RC embedded in another object-gapped RC (OORRCs), RRCs with an object-gapped RC embedded in a subject-gapped RC (OSRRCs), and RRCs with a subject-gapped RC embedded in another subject-gapped RC (SSRRCs). Each syntactic type was put in two conditions differing in internal semantics: irreversible internal semantics (IIS) and reversible internal semantics (RIS). For example, “the balloon that [the girl that _ eats the banana] holds” is an SORRC in the IIS condition; “the monkey that [the dog that _ bites the pig] hits” is an SORRC in the RIS condition. For each target, the participants were provided with a speech-visual stimulus constructing a condition of irreversible external semantics (IES). The results showed that SSRRCs, OSRRCs and SORRCs in the IIS-IES condition were produced two years earlier than their counterparts in the RIS-IES condition. Thus, a 2-stage development path is proposed: the language acquisition device starts with the interface between (irreversible) syntax and IIS, and ends with the interface between syntax and IES, both abiding by the syntax-semantic interface principle.
摘要:关于儿童语言习得中的句法-语义关系,一直存在看似相互矛盾的主张。然而,很少有研究考察儿童习得递归关系从句(RRC)的路径。本文作者采用4(句法类型)×2(语义条件)设计,通过实验考察了3至11岁儿童对8种普通话RRC的最具结构的诱发产出。四种句法类型分别是:主语空位关系从句嵌入宾语空位关系从句的RRC(SORRC)、宾语空位关系从句嵌入另一个宾语空位关系从句的RRC(OORRC)、宾语空位关系从句嵌入主语空位关系从句的RRC(OSRRC),以及主语空位关系从句嵌入另一个主语空位关系从句的RRC(SSRRC)。每种句法类型被置于两种内部语义不同的条件下:不可逆的内部语义(IIS)和可逆的内部语义(RIS)。例如,“[吃香蕉的女孩]拿着的气球”是IIS条件下的SORRC;“[咬猪的狗]打的猴子”是RIS条件下的SORRC。对于每个目标句,研究者为参与者提供了构建不可逆外部语义(IES)条件的语音-视觉刺激。结果表明,IIS-IES条件下的SSRRC、OSRRC和SORRC比RIS-IES条件下的对应结构早两年产出。据此,作者提出了两阶段发展路径:语言习得机制以(不可逆)句法与IIS的接口为起点,以句法与IES的接口为终点,两者均遵循句法-语义接口原则。

[NLP-37] American Sign Language Handshapes Reflect Pressures for Communicative Efficiency
[NLP-37] 美国手语手势反映了沟通效率的压力

链接: https://arxiv.org/abs/2406.04024
作者: Kayo Yin,Terry Regier,Dan Klein
关键词: American Sign Language, cognitive science, ASL, prominent theory, theory in linguistics
中文关键词: 美国手语、认知科学、美国手语、杰出理论、语言学理论
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to ACL 2024

点击查看摘要

Abstract:Communicative efficiency is a prominent theory in linguistics and cognitive science. While numerous studies have shown how the pressure to save energy is reflected in the form of spoken languages, few have explored this phenomenon in signed languages. In this paper, we show how handshapes in American Sign Language (ASL) reflect these efficiency pressures and we present new evidence of communicative efficiency in the visual-gestural modality. We focus on handshapes that are used in both native ASL signs and signs borrowed from English to compare efficiency pressures from both ASL and English. First, we design new methodologies to quantify the articulatory effort required to produce handshapes as well as the perceptual effort needed to recognize them. Then, we compare correlations between communicative effort and usage statistics in ASL and English. Our findings reveal that frequent ASL handshapes are easier to produce and that pressures for communicative efficiency mostly come from ASL usage, not from English lexical borrowing.
摘要:交际效率是语言学和认知科学中的一个重要理论。虽然许多研究表明,节约能源的压力如何以口语的形式反映出来,但很少有人研究这种现象在手语中的表现。在这篇文章中,我们展示了美国手语(ASL)中的手形如何反映了这些效率压力,并提出了视觉手势通道中交流效率的新证据。我们将重点放在美国手语手势和从英语借来的手势上,以比较来自美国手语和英语的效率压力。首先,我们设计了新的方法来量化产生手形所需的发音努力以及识别它们所需的知觉努力。然后,我们比较了ASL和英语中交际努力和使用统计之间的相关性。我们的发现表明,频繁的ASL手形更容易产生,而且对交际效率的压力主要来自ASL的使用,而不是来自英语词汇借用。

[NLP-38] Assessing LLMs for Zero-shot Abstractive Summarization Through the Lens of Relevance Paraphrasing
[NLP-38] 通过相关性复述的视角评估LLM的零样本生成式摘要能力

链接: https://arxiv.org/abs/2406.03993
作者: Hadi Askari,Anshuman Chhabra,Muhao Chen,Prasant Mohapatra
关键词: Large Language Models, Large Language, Language Models, generation of abstractive, abstractive summaries
中文关键词: 大型语言模型、大型语言、语言模型、抽象、抽象摘要的生成
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have achieved state-of-the-art performance at zero-shot generation of abstractive summaries for given articles. However, little is known about the robustness of such a process of zero-shot summarization. To bridge this gap, we propose relevance paraphrasing, a simple strategy that can be used to measure the robustness of LLMs as summarizers. The relevance paraphrasing approach identifies the most relevant sentences that contribute to generating an ideal summary, and then paraphrases these inputs to obtain a minimally perturbed dataset. Then, by evaluating model performance for summarization on both the original and perturbed datasets, we can assess the LLM’s one aspect of robustness. We conduct extensive experiments with relevance paraphrasing on 4 diverse datasets, as well as 4 LLMs of different sizes (GPT-3.5-Turbo, Llama-2-13B, Mistral-7B, and Dolly-v2-7B). Our results indicate that LLMs are not consistent summarizers for the minimally perturbed articles, necessitating further improvements.
摘要:大型语言模型(LLM)在为给定文章零样本生成生成式摘要方面取得了最先进的性能。然而,人们对这种零样本摘要过程的稳健性知之甚少。为了弥补这一差距,我们提出了相关性复述,这是一种简单的策略,可以用来衡量LLM作为摘要器的稳健性。相关性复述方法识别出对生成理想摘要贡献最大的最相关句子,然后对这些输入进行复述,以获得最小扰动的数据集。然后,通过评估模型在原始数据集和扰动数据集上的摘要性能,我们可以评估LLM稳健性的一个方面。我们在4个不同的数据集以及4个不同大小的LLM(GPT-3.5-Turbo、Llama-2-13B、Mistral-7B和Dolly-v2-7B)上进行了广泛的相关性复述实验。我们的结果表明,对于最小扰动的文章,LLM并不是一致的摘要器,需要进一步改进。
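The relevance paraphrasing idea in the abstract above (find the article sentences most relevant to an ideal summary, paraphrase only those, then compare summarization quality on the original and perturbed inputs) can be sketched roughly as follows. Everything here is an assumption made for illustration: the Jaccard token overlap used as the relevance score, the function names `most_relevant_indices` and `perturb`, and the caller-supplied `paraphrase` callable are hypothetical stand-ins, not the authors' implementation.

```python
import re


def tokens(text):
    """Lowercased set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def most_relevant_indices(article_sentences, summary_sentences):
    """For each summary sentence, pick the article sentence with the highest
    Jaccard token overlap (an assumed stand-in for a real relevance metric)."""
    picked = []
    for summary_sentence in summary_sentences:
        s_tok = tokens(summary_sentence)

        def score(i):
            a_tok = tokens(article_sentences[i])
            union = s_tok | a_tok
            return len(s_tok & a_tok) / len(union) if union else 0.0

        best = max(range(len(article_sentences)), key=score)
        if best not in picked:
            picked.append(best)
    return picked


def perturb(article_sentences, summary_sentences, paraphrase):
    """Paraphrase only the most relevant sentences; leave the rest untouched."""
    relevant = set(most_relevant_indices(article_sentences, summary_sentences))
    return [paraphrase(s) if i in relevant else s
            for i, s in enumerate(article_sentences)]


article = [
    "The city council approved the new budget on Monday.",
    "Residents enjoyed sunny weather all week.",
    "The budget increases funding for public schools.",
]
summary = ["Council approved a budget that increases school funding."]

# A placeholder paraphraser; in practice this would be a paraphrasing model.
perturbed = perturb(article, summary, lambda s: "[PARAPHRASED] " + s)
print(perturbed)
```

Summarizing both `article` and `perturbed` with the same model and scoring each against the reference would then give a rough robustness gap.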

[NLP-39] On The Persona-based Summarization of Domain-Specific Documents
[NLP-39] 基于人物的领域特定文档总结

链接: https://arxiv.org/abs/2406.03986
作者: Ankan Mullick,Sombit Bose,Rounak Saha,Ayan Kumar Bhowmick,Pawan Goyal,Niloy Ganguly,Prasenjit Dey,Ravi Kokku
关键词: storing information necessitates, large information repositories, complexity of consuming, ever-expanding world, increasing complexity
中文关键词: 存储信息需要大型信息存储库、消费的复杂性、不断扩大的世界、不断增加的复杂性
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:In an ever-expanding world of domain-specific knowledge, the increasing complexity of consuming, and storing information necessitates the generation of summaries from large information repositories. However, every persona of a domain has different requirements of information and hence their summarization. For example, in the healthcare domain, a persona-based (such as Doctor, Nurse, Patient etc.) approach is imperative to deliver targeted medical information efficiently. Persona-based summarization of domain-specific information by humans is a high cognitive load task and is generally not preferred. The summaries generated by two different humans have high variability and do not scale in cost and subject matter expertise as domains and personas grow. Further, AI-generated summaries using generic Large Language Models (LLMs) may not necessarily offer satisfactory accuracy for different domains unless they have been specifically trained on domain-specific data and can also be very expensive to use in day-to-day operations. Our contribution in this paper is two-fold: 1) We present an approach to efficiently fine-tune a domain-specific small foundation LLM using a healthcare corpus and also show that we can effectively evaluate the summarization quality using AI-based critiquing. 2) We further show that AI-based critiquing has good concordance with Human-based critiquing of the summaries. Hence, such AI-based pipelines to generate domain-specific persona-based summaries can be easily scaled to other domains such as legal, enterprise documents, education etc. in a very efficient and cost-effective manner.
摘要:在不断扩大的领域特定知识世界中,消费和存储信息的日益复杂要求从大型信息存储库中生成摘要。然而,一个领域的每个人物角色对信息的要求不同,因此他们的摘要也不同。例如,在医疗保健领域,基于角色(如医生、护士、患者等)要有效地传递有针对性的医疗信息,方法势在必行。基于人物角色的人类对特定领域信息的总结是一项高认知负荷的任务,通常不是首选的。由两个不同的人生成的摘要具有很高的可变性,并且不会随域和角色的增长而在成本和主题专业知识上进行伸缩。此外,使用通用大型语言模型(LLM)生成的人工智能摘要可能不一定为不同的领域提供令人满意的准确性,除非它们已就特定领域的数据进行了专门培训,而且在日常操作中使用也可能非常昂贵。我们在本文中的贡献有两个方面:1)我们提出了一种使用医疗保健语料库对特定于领域的小基础LLM进行高效微调的方法,并表明我们可以使用基于人工智能的批评来有效地评估摘要质量。2)进一步证明了基于人工智能的评测与基于人的评测具有很好的一致性。因此,这种基于人工智能的管道,以生成特定于领域的基于人物角色的摘要,可以很容易地以非常高效和经济的方式扩展到其他领域,如法律、企业文档、教育等。

[NLP-40] A + B: A General Generator-Reader Framework for Optimizing LLMs to Unleash Synergy Potential
[NLP-40] A + B:用于优化LLM以释放协同潜力的通用生成器-阅读器框架

链接: https://arxiv.org/abs/2406.03963
作者: Wei Tang,Yixin Cao,Jiahao Ying,Bo Wang,Yuyue Zhao,Yong Liao,Pengyuan Zhou
关键词: large language models, Retrieval-Augmented Generation, solution to supplement, large language, RAG
中文关键词: 大型语言模型、检索增强生成、补充解决方案、大型语言、RAG
类目: Computation and Language (cs.CL)
备注: Accepted to ACL’24 (Findings)

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) is an effective solution to supplement necessary knowledge to large language models (LLMs). Targeting its bottleneck of retriever performance, “generate-then-read” pipeline is proposed to replace the retrieval stage with generation from the LLM itself. Although promising, this research direction is underexplored and still cannot work in the scenario when source knowledge is given. In this paper, we formalize a general “A + B” framework with varying combinations of foundation models and types for systematic investigation. We explore the efficacy of the base and chat versions of LLMs and found their different functionalities suitable for generator A and reader B, respectively. Their combinations consistently outperform single models, especially in complex scenarios. Furthermore, we extend the application of the “A + B” framework to scenarios involving source documents through continuous learning, enabling the direct integration of external knowledge into LLMs. This approach not only facilitates effective acquisition of new knowledge but also addresses the challenges of safety and helpfulness post-adaptation. The paper underscores the versatility of the “A + B” framework, demonstrating its potential to enhance the practical application of LLMs across various domains.
摘要:检索增强生成(RAG)是向大型语言模型(LLM)补充必要知识的有效方法。针对检索器的性能瓶颈,已有研究提出了“先生成后阅读”的流水线,用LLM本身的生成代替检索阶段。虽然这一研究方向很有前途,但还没有得到充分的探索,并且在给定源知识的场景下仍然无法工作。在这篇文章中,我们形式化了一个通用的“A + B”框架,它具有不同的基础模型和类型组合,用于系统研究。我们探索了LLM的基础版本和聊天版本的效果,发现它们的不同功能分别适合生成器A和阅读器B。它们的组合始终优于单一模型,特别是在复杂的场景下。此外,我们通过持续学习将“A + B”框架的应用扩展到涉及源文档的场景,使外部知识能够直接整合到LLM中。这种方法不仅有助于有效地获取新知识,而且还解决了适应后在安全性和有用性方面的挑战。本文强调了“A + B”框架的通用性,展示了其在各个领域增强LLM实际应用的潜力。

[NLP-41] Tox-BART: Leveraging Toxicity Attributes for Explanation Generation of Implicit Hate Speech
[NLP-41] Tox-BART:利用毒性属性生成隐性仇恨言论的解释

链接: https://arxiv.org/abs/2406.03953
作者: Neemesh Yadav,Sarah Masud,Vikram Goyal,Vikram Goyal,Md Shad Akhtar,Tanmoy Chakraborty
关键词: Employing language models, Employing language, incoming implicit hate, implicit hate post, area of research
中文关键词: 使用语言模型,使用语言,即将到来的隐性仇恨,隐性仇恨帖子,研究领域
类目: Computation and Language (cs.CL)
备注: 17 Pages, 5 Figures, 13 Tables, ACL Findings 2024

点击查看摘要

Abstract:Employing language models to generate explanations for an incoming implicit hate post is an active area of research. The explanation is intended to make explicit the underlying stereotype and aid content moderators. The training often combines top-k relevant knowledge graph (KG) tuples to provide world knowledge and improve performance on standard metrics. Interestingly, our study presents conflicting evidence for the role of the quality of KG tuples in generating implicit explanations. Consequently, simpler models incorporating external toxicity signals outperform KG-infused models. Compared to the KG-based setup, we observe a comparable performance for SBIC (LatentHatred) datasets with a performance variation of +0.44 (+0.49), +1.83 (-1.56), and -4.59 (+0.77) in BLEU, ROUGE-L, and BERTScore. Further human evaluation and error analysis reveal that our proposed setup produces more precise explanations than zero-shot GPT-3.5, highlighting the intricate nature of the task.
摘要:利用语言模型为新出现的隐性仇恨帖子生成解释是一个活跃的研究领域。该解释旨在明确潜在的刻板印象并帮助内容审核员。训练时通常会结合top-k相关的知识图谱(KG)元组,以提供世界知识并提高标准指标上的性能。有趣的是,我们的研究为KG元组的质量在生成隐性解释方面的作用提供了相互矛盾的证据。因此,结合外部毒性信号的更简单模型优于融入KG的模型。与基于KG的设置相比,我们观察到在SBIC(LatentHatred)数据集上的性能相当,BLEU、ROUGE-L和BERTScore的性能变化分别为+0.44(+0.49)、+1.83(-1.56)和-4.59(+0.77)。进一步的人工评估和错误分析表明,我们提出的设置比零样本GPT-3.5产生更精确的解释,凸显了该任务的复杂性。

[NLP-42] UltraMedical: Building Specialized Generalists in Biomedicine
[NLP-42] UltraMedical:培养生物医学专业通才

链接: https://arxiv.org/abs/2406.03949
作者: Kaiyan Zhang,Sihang Zeng,Ermo Hua,Ning Ding,Zhang-Ren Chen,Zhiyuan Ma,Haoxin Li,Ganqu Cui,Biqing Qi,Xuekai Zhu,Xingtai Lv,Hu Jinfang,Zhiyuan Liu,Bowen Zhou
关键词: Large Language Models, Large Language, demonstrated remarkable capabilities, Language Models, demonstrated remarkable
中文关键词: 大型语言模型,大型语言,表现出非凡的能力,语言模型,表现出非凡的能力
类目: Computation and Language (cs.CL)
备注: Datasets and models are available at this https URL

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities across various domains and are moving towards more specialized areas. Recent advanced proprietary models such as GPT-4 and Gemini have achieved significant advancements in biomedicine, which have also raised privacy and security challenges. The construction of specialized generalists hinges largely on high-quality datasets, enhanced by techniques like supervised fine-tuning and reinforcement learning from human or AI feedback, and direct preference optimization. However, these leading technologies (e.g., preference learning) are still significantly limited in the open source community due to the scarcity of specialized data. In this paper, we present the UltraMedical collections, which consist of high-quality manual and synthetic datasets in the biomedicine domain, featuring preference annotations across multiple advanced LLMs. By utilizing these datasets, we fine-tune a suite of specialized medical models based on Llama-3 series, demonstrating breathtaking capabilities across various medical benchmarks. Moreover, we develop powerful reward models skilled in biomedical and general reward benchmark, enhancing further online preference learning within the biomedical LLM community.
摘要:大型语言模型在各个领域都显示出了卓越的能力,并正在向更专业的领域发展。最近先进的专有模型,如GPT-4和Gemini,在生物医学方面取得了重大进展,这也带来了隐私和安全挑战。专业多面手的构建在很大程度上取决于高质量的数据集,并通过监督微调和来自人类或人工智能反馈的强化学习以及直接偏好优化等技术来增强。然而,由于缺乏专门的数据,这些领先技术(例如,偏好学习)在开放源码社区中仍然受到很大限制。在本文中,我们介绍了UltraMedical集合,它由生物医学领域的高质量手动和合成数据集组成,具有跨多个高级LLM的偏好标注。通过利用这些数据集,我们微调了一套基于Llama-3系列的专业医学模型,展示了各种医学基准的惊人能力。此外,我们开发了在生物医学和一般奖励基准方面熟练的强大奖励模型,增强了生物医学LLM社区内进一步的在线偏好学习。

[NLP-43] Culturally Aware and Adapted NLP: A Taxonomy and a Survey of the State of the Art
[NLP-43] 文化意识和适应的NLP:分类学和最新技术状况调查

链接: https://arxiv.org/abs/2406.03930
作者: Chen Cecilia Liu,Iryna Gurevych,Anna Korhonen
关键词: Natural Language Processing, adapted Natural Language, Language Processing, Natural Language, adapted Natural
中文关键词: 自然语言处理,改编自然语言,语言处理,自然语言,改编自然
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The surge of interest in culturally aware and adapted Natural Language Processing (NLP) has inspired much recent research. However, the lack of common understanding of the concept of “culture” has made it difficult to evaluate progress in this emerging area. Drawing on prior research in NLP and related fields, we propose an extensive taxonomy of elements of culture that can provide a systematic framework for analyzing and understanding research progress. Using the taxonomy, we survey existing resources and models for culturally aware and adapted NLP, providing an overview of the state of the art and the research gaps that still need to be filled.
摘要:对文化感知和适应性自然语言处理(NLP)的兴趣激增激发了最近的许多研究。然而,由于对“文化”概念缺乏共识,因此很难评估这一新兴领域的进展。根据NLP和相关领域的先前研究,我们提出了一种广泛的文化元素分类法,可以为分析和理解研究进展提供系统性框架。使用分类法,我们调查了具有文化意识和适应性的NLP的现有资源和模型,概述了最新技术水平和仍需要填补的研究空白。

[NLP-44] ArMeme: Propagandistic Content in Arabic Memes
[NLP-44] ArMeme:阿拉伯语Meme中的宣传性内容

链接: https://arxiv.org/abs/2406.03916
作者: Firoj Alam,Abul Hasnat,Fatema Ahmed,Md Arid Hasan,Maram Hasanain
关键词: digital communication, mislead audiences, rise of digital, cultural and political, political expression
中文关键词: 数字传播,误导受众,数字、文化和政治、政治表达的兴起
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: disinformation, misinformation, factuality, harmfulness, fake news, propaganda, multimodality, text, images

点击查看摘要

Abstract:With the rise of digital communication, memes have become a significant medium for cultural and political expression that is often used to mislead audiences. Identification of such misleading and persuasive multimodal content has become more important among various stakeholders, including social media platforms, policymakers, and the broader society as they often cause harm to individuals, organizations, and/or society. While there has been effort to develop AI-based automatic systems for resource-rich languages (e.g., English), it is relatively little to none for medium to low resource languages. In this study, we focused on developing an Arabic memes dataset with manual annotations of propagandistic content. We annotated ~6K Arabic memes collected from various social media platforms, which is a first resource for Arabic multimodal research. We provide a comprehensive analysis aiming to develop computational tools for their detection. We will make them publicly available for the community.
摘要:随着数字传播的兴起,模因已经成为一种重要的文化和政治表达媒介,经常被用来误导受众。识别这种误导性和有说服力的多模式内容在各种利益攸关方中变得更加重要,包括社交媒体平台、政策制定者和更广泛的社会,因为它们往往对个人、组织和/或社会造成伤害。虽然已经努力为资源丰富的语言(例如英语)开发基于人工智能的自动系统,但对于中低资源语言来说,这方面的工作相对较少甚至没有。在这项研究中,我们专注于开发一个带有宣传内容手动标注的阿拉伯语模因数据集。我们对从各种社交媒体平台收集的~6K阿拉伯语表情包进行了标注,这是阿拉伯语多模式研究的第一个资源。我们提供了一项全面的分析,旨在为它们的检测开发计算工具。我们将把它们公之于众。

[NLP-45] HeSum: a Novel Dataset for Abstractive Text Summarization in Hebrew
[NLP-45] HeSum:希伯来语抽象文本摘要的新型数据集

链接: https://arxiv.org/abs/2406.03897
作者: Tzuf Paz-Argaman,Itai Mondshine,Asaf Achi Mordechai,Reut Tsarfaty
关键词: natural language tasks, large language models, tasks in English, remains unclear, performance in lower-resourced
中文关键词: 自然语言任务、大型语言模型、英语任务仍然不清楚,在资源较少的情况下表现
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While large language models (LLMs) excel in various natural language tasks in English, their performance in lower-resourced languages like Hebrew, especially for generative tasks such as abstractive summarization, remains unclear. The high morphological richness in Hebrew adds further challenges due to the ambiguity in sentence comprehension and the complexities in meaning construction. In this paper, we address this resource and evaluation gap by introducing HeSum, a novel benchmark specifically designed for abstractive text summarization in Modern Hebrew. HeSum consists of 10,000 article-summary pairs sourced from Hebrew news websites written by professionals. Linguistic analysis confirms HeSum’s high abstractness and unique morphological challenges. We show that HeSum presents distinct difficulties for contemporary state-of-the-art LLMs, establishing it as a valuable testbed for generative language technology in Hebrew, and MRLs generative challenges in general.
摘要:虽然大型语言模型(LLM)在英语中的各种自然语言任务中表现出色,但它们在希伯来语等资源较低的语言中的表现仍然不清楚,尤其是对于抽象摘要等生成性任务。由于句子理解的模糊性和意义构建的复杂性,希伯来语形态的高度丰富性增加了进一步的挑战。在本文中,我们通过引入HeSum来解决这一资源和评估差距,HeSum是一种专门为现代希伯来语抽象文本摘要设计的新型基准。HeSum由10,000篇文章摘要对组成,这些文章摘要来自专业人士撰写的希伯来新闻网站。语言分析证实了HeSum的高度抽象性和独特的形态挑战。我们表明,HeSum为当代最先进的LLM带来了明显的困难,使其成为希伯来语生成式语言技术的宝贵测试平台,以及总体而言MRL生成式挑战。

[NLP-46] How Good is Zero-Shot MT Evaluation for Low Resource Indian Languages?
[NLP-46] 低资源印度语言的零样本机器翻译评估有多好?

链接: https://arxiv.org/abs/2406.03893
作者: Anushka Singh,Ananya B. Sai,Raj Dabre,Ratish Puduppully,Anoop Kunchukuttan,Mitesh M Khapra
关键词: machine translation evaluation, low-resource languages due, low-resource Indian languages, machine translation, studied primarily
中文关键词: 机器翻译评估,低资源语言,低资源印度语言,机器翻译,主要研究
类目: Computation and Language (cs.CL)
备注:

Abstract:While machine translation evaluation has been studied primarily for high-resource languages, there has been a recent interest in evaluation for low-resource languages due to the increasing availability of data and models. In this paper, we focus on a zero-shot evaluation setting focusing on low-resource Indian languages, namely Assamese, Kannada, Maithili, and Punjabi. We collect sufficient Multi-Dimensional Quality Metrics (MQM) and Direct Assessment (DA) annotations to create test sets and meta-evaluate a plethora of automatic evaluation metrics. We observe that even for learned metrics, which are known to exhibit zero-shot performance, the Kendall Tau and Pearson correlations with human annotations are only as high as 0.32 and 0.45. Synthetic data approaches show mixed results and overall do not help close the gap by much for these languages. This indicates that there is still a long way to go for low-resource evaluation.
摘要:虽然机器翻译评估主要针对高资源语言进行研究,但由于数据和模型的可用性不断增加,最近人们对低资源语言的评估产生了兴趣。在本文中,我们重点关注低资源印度语言(即阿萨姆语、卡纳达语、迈蒂利语和旁遮普语)的零样本评估设置。我们收集足够的多维质量指标(MQM)和直接评估(DA)标注来创建测试集,并对大量自动评估指标进行元评估。我们观察到,即使对于已知具备零样本能力的学习型指标,其与人工标注的Kendall Tau和Pearson相关性也最高仅为0.32和0.45。合成数据方法的结果好坏参半,总体上未能明显缩小这些语言上的差距。这表明低资源评估还有很长的路要走。
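
The meta-evaluation described above compares metric scores against human judgements using Kendall Tau and Pearson correlations. A minimal sketch of both statistics follows. The score lists are made-up illustrative values, not data from the paper, and the tau-a variant (which ignores tie corrections) is an assumption chosen for simplicity.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def kendall_tau_a(x, y):
    """Kendall tau-a: (concordant - discordant) / number of pairs."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (x[i] - x[j]) * (y[i] - y[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical segment-level scores (illustrative only).
human_scores = [0.9, 0.4, 0.7, 0.2, 0.6]
metric_scores = [0.8, 0.5, 0.6, 0.3, 0.9]

print("Kendall tau-a:", kendall_tau_a(human_scores, metric_scores))
print("Pearson r:", pearson(human_scores, metric_scores))
```

In practice a library routine with tie handling would likely be preferable; the hand-written versions only make the definitions explicit.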

[NLP-47] Spontaneous Speech-Based Suicide Risk Detection Using Whisper and Large Language Models
[NLP-47] 使用Whisper和大型语言模型的基于自发语音的自杀风险检测

链接: https://arxiv.org/abs/2406.03882
作者: Ziyun Cui,Chang Lei,Wen Wu,Yinan Duan,Diyang Qu,Ji Wu,Runsen Chen,Chao Zhang
关键词: suicide risk detection, suicide risk, potential suicide attempts, prevent potential suicide, risk detection
中文关键词: 自杀风险检测,自杀风险,潜在的自杀企图,预防潜在的自杀,风险检测
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Accepted by Interspeech 2024

Abstract:The early detection of suicide risk is important since it enables the intervention to prevent potential suicide attempts. This paper studies the automatic detection of suicide risk based on spontaneous speech from adolescents, and collects a Mandarin dataset with 15 hours of suicide speech from more than a thousand adolescents aged from ten to eighteen for our experiments. To leverage the diverse acoustic and linguistic features embedded in spontaneous speech, both the Whisper speech model and textual large language models (LLMs) are used for suicide risk detection. Both all-parameter finetuning and parameter-efficient finetuning approaches are used to adapt the pre-trained models for suicide risk detection, and multiple audio-text fusion approaches are evaluated to combine the representations of Whisper and the LLM. The proposed system achieves a detection accuracy of 0.807 and an F1-score of 0.846 on the test set with 119 subjects, indicating promising potential for real suicide risk detection applications.
摘要:早期发现自杀风险很重要,因为它使干预能够防止潜在的自杀企图。本文对基于青少年自发语音的自杀风险自动检测进行了研究,收集了1000多名10-18岁青少年15小时自杀语音的普通话语料。为了利用自发语音中嵌入的各种声学和语言特征,Whisper语音模型和文本大语言模型(LLM)都被用于自杀风险检测。采用全参数微调和参数高效微调两种方法使预训练模型适应自杀风险检测任务,并对多种音频-文本融合方法进行了评估,以结合Whisper和LLM的表示。该系统在包含119个对象的测试集上达到了0.807的检测准确率和0.846的F1分数,这表明该系统在真实的自杀风险检测应用中具有很大的潜力。

[NLP-48] Evaluating the IWSLT2023 Speech Translation Tasks: Human Annotations, Automatic Metrics, and Segmentation
[NLP-48] 评估IWSLT 2023语音翻译任务:人工标注、自动指标与分段

链接: https://arxiv.org/abs/2406.03881
作者: Matthias Sperber,Ondřej Bojar,Barry Haddow,Dávid Javorský,Xutai Ma,Matteo Negri,Jan Niehues,Peter Polák,Elizabeth Salesky,Katsuhito Sudoh,Marco Turchi
关键词: text translation research, Spoken Language Translation, Human evaluation, machine translation system, translation system development
中文关键词: 文本翻译研究、口语翻译、人工评估、机器翻译系统、翻译系统开发
类目: Computation and Language (cs.CL)
备注: LREC-COLING2024 publication (with corrections for Table 3)

Abstract:Human evaluation is a critical component in machine translation system development and has received much attention in text translation research. However, little prior work exists on the topic of human evaluation for speech translation, which adds additional challenges such as noisy data and segmentation mismatches. We take first steps to fill this gap by conducting a comprehensive human evaluation of the results of several shared tasks from the last International Workshop on Spoken Language Translation (IWSLT 2023). We propose an effective evaluation strategy based on automatic resegmentation and direct assessment with segment context. Our analysis revealed that: 1) the proposed evaluation strategy is robust and scores well-correlated with other types of human judgements; 2) automatic metrics are usually, but not always, well-correlated with direct assessment scores; and 3) COMET as a slightly stronger automatic metric than chrF, despite the segmentation noise introduced by the resegmentation step systems. We release the collected human-annotated data in order to encourage further investigation.
摘要:人工评价是机器翻译系统开发中的一个重要组成部分,在文本翻译研究中受到了广泛的关注。然而,关于语音翻译的人工评价这一主题的研究很少,而语音翻译带来了额外的挑战,如数据噪声和分段不匹配。我们迈出填补这一空白的第一步,对上一届国际口语翻译研讨会(IWSLT 2023)的几项共享任务的结果进行了全面的人工评估。我们提出了一种基于自动重新分段和带分段上下文的直接评估的有效评估策略。我们的分析表明:1)所提出的评估策略是稳健的,并且分数与其他类型的人工判断有很好的相关性;2)自动指标通常(但不总是)与直接评估分数有很好的相关性;3)尽管重新分段步骤引入了分段噪声,COMET仍是比chrF略强的自动指标。我们公布了收集的人工标注数据,以鼓励进一步的研究。

[NLP-49] Decoder-only Streaming Transformer for Simultaneous Translation
[NLP-49] 用于同步翻译的仅解码器流式Transformer

链接: https://arxiv.org/abs/2406.03878
作者: Shoutao Guo,Shaolei Zhang,Yang Feng
关键词: Simultaneous Machine Translation, Simultaneous Machine, reading source tokens, target prefix based, Machine Translation
中文关键词: 同步机器翻译,同步机器,读取源词元,基于目标前缀,机器翻译
类目: Computation and Language (cs.CL)
备注: Accepted to ACL 2024. 14 pages, 10 Tables, 5 Figures

Abstract:Simultaneous Machine Translation (SiMT) generates translation while reading source tokens, essentially producing the target prefix based on the source prefix. To achieve good performance, it leverages the relationship between source and target prefixes to exact a policy to guide the generation of translations. Although existing SiMT methods primarily focus on the Encoder-Decoder architecture, we explore the potential of Decoder-only architecture, owing to its superior performance in various tasks and its inherent compatibility with SiMT. However, directly applying the Decoder-only architecture to SiMT poses challenges in terms of training and inference. To alleviate the above problems, we propose the first Decoder-only SiMT model, named Decoder-only Streaming Transformer (DST). Specifically, DST separately encodes the positions of the source and target prefixes, ensuring that the position of the target prefix remains unaffected by the expansion of the source prefix. Furthermore, we propose a Streaming Self-Attention (SSA) mechanism tailored for the Decoder-only architecture. It is capable of obtaining translation policy by assessing the sufficiency of input source information and integrating with the soft-attention mechanism to generate translations. Experiments demonstrate that our approach achieves state-of-the-art performance on three translation tasks.
摘要:同步机器翻译(SiMT)在读取源词元的同时生成翻译,本质上是根据源前缀生成目标前缀。为了获得良好的性能,它利用源前缀和目标前缀之间的关系来确定指导翻译生成的策略。虽然现有的SiMT方法主要集中在编码器-解码器架构上,但由于仅解码器架构在各种任务中的卓越性能以及与SiMT的内在兼容性,我们探索了仅解码器架构的潜力。然而,直接将仅解码器架构应用于SiMT会在训练和推理方面带来挑战。为了缓解上述问题,我们提出了第一个仅解码器的SiMT模型,称为仅解码器流式Transformer(DST)。具体地说,DST对源前缀和目标前缀的位置分别进行编码,确保目标前缀的位置不受源前缀扩展的影响。此外,我们还提出了一种为仅解码器架构量身定制的流式自注意力(SSA)机制。它能够通过评估输入源信息的充分性来获得翻译策略,并结合软注意力机制生成翻译。实验表明,我们的方法在三个翻译任务上取得了最先进的性能。
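
The abstract states that DST encodes source-prefix and target-prefix positions separately, so that target positions do not shift as more source tokens arrive. The sketch below illustrates only that indexing idea. The function names and the "joint" baseline are hypothetical and not taken from the paper's implementation.

```python
def joint_positions(num_source, num_target):
    """Baseline: one running index over the concatenated [source; target] sequence."""
    source_ids = list(range(num_source))
    target_ids = list(range(num_source, num_source + num_target))
    return source_ids, target_ids


def separate_positions(num_source, num_target):
    """Source and target prefixes each get their own index, both starting at 0."""
    return list(range(num_source)), list(range(num_target))


# Two target tokens have been generated; the source prefix grows from 3 to 5 tokens.
print(joint_positions(3, 2)[1], "->", joint_positions(5, 2)[1])
print(separate_positions(3, 2)[1], "->", separate_positions(5, 2)[1])
```

With joint indexing the target position ids change whenever the source prefix expands; with separate indexing they stay fixed, which is the property the abstract highlights.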

[NLP-50] BLSP-Emo: Towards Empathetic Large Speech-Language Models
[NLP-50] BLSP-Emo:迈向具有同理心的大型语音-语言模型

链接: https://arxiv.org/abs/2406.03872
作者: Chen Wang,Minpeng Liao,Zhongqiang Huang,Junhong Wu,Chengqing Zong,Jiajun Zhang
关键词: showcased the potential, terms of low, low latency, ability to understand, generate expressive speech
中文关键词: 展示了潜力、在低……方面、低延迟、理解能力、生成富有表现力的语音
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

Abstract:The recent release of GPT-4o showcased the potential of end-to-end multimodal models, not just in terms of low latency but also in their ability to understand and generate expressive speech with rich emotions. While the details are unknown to the open research community, it likely involves significant amounts of curated data and compute, neither of which is readily accessible. In this paper, we present BLSP-Emo (Bootstrapped Language-Speech Pretraining with Emotion support), a novel approach to developing an end-to-end speech-language model capable of understanding both semantics and emotions in speech and generate empathetic responses. BLSP-Emo utilizes existing speech recognition (ASR) and speech emotion recognition (SER) datasets through a two-stage process. The first stage focuses on semantic alignment, following recent work on pretraining speech-language models using ASR data. The second stage performs emotion alignment with the pretrained speech-language model on an emotion-aware continuation task constructed from SER data. Our experiments demonstrate that the BLSP-Emo model excels in comprehending speech and delivering empathetic responses, both in instruction-following tasks and conversations.
摘要:最近发布的GPT-4o展示了端到端多模态模型的潜力,不仅在低延迟方面,而且在理解和生成具有丰富情感的表达性语音的能力方面也是如此。虽然开放研究界不知道细节,但它很可能涉及大量经过整理的数据和算力,而这两者都不容易获得。本文提出了BLSP-Emo(Bootstrapped Language-Speech Pretraining with Emotion support),这是一种开发端到端语音-语言模型的新方法,该模型能够理解语音中的语义和情感,并生成富有同理心的回应。BLSP-Emo通过两个阶段的过程利用现有的语音识别(ASR)和语音情感识别(SER)数据集。第一阶段关注语义对齐,沿用了最近使用ASR数据预训练语音-语言模型的工作。第二阶段在由SER数据构建的情感感知续写任务上,对预训练的语音-语言模型进行情感对齐。我们的实验表明,无论是在指令跟随任务中还是在对话中,BLSP-Emo模型在理解语音和给出富有同理心的回应方面都表现出色。

[NLP-51] Recovering document annotations for sentence-level bitext
[NLP-51] 恢复句子级双语文本的文档标注

链接: https://arxiv.org/abs/2406.03869
作者: Rachel Wicks,Matt Post,Philipp Koehn
关键词: Data availability limits, availability limits, limits the scope, Data availability, document-level
中文关键词: 数据可用性限制、可用性限制、限制范围、数据可用性、文档级别
类目: Computation and Language (cs.CL)
备注: ACL 2024 Findings

Abstract:Data availability limits the scope of any given task. In machine translation, historical models were incapable of handling longer contexts, so the lack of document-level datasets was less noticeable. Now, despite the emergence of long-sequence methods, we remain within a sentence-level paradigm and without data to adequately approach context-aware machine translation. Most large-scale datasets have been processed through a pipeline that discards document-level metadata. In this work, we reconstruct document-level information for three (ParaCrawl, News Commentary, and Europarl) large datasets in German, French, Spanish, Italian, Polish, and Portuguese (paired with English). We then introduce a document-level filtering technique as an alternative to traditional bitext filtering. We present this filtering with analysis to show that this method prefers context-consistent translations rather than those that may have been sentence-level machine translated. Last we train models on these longer contexts and demonstrate improvement in document-level translation without degradation of sentence-level translation. We release our dataset, ParaDocs, and resulting models as a resource to the community.
摘要:数据可用性限制了任何给定任务的范围。在机器翻译中,早期模型无法处理更长的上下文,因此文档级数据集的缺乏不那么明显。现在,尽管出现了长序列方法,但我们仍然停留在句子级别的范式中,缺乏足够的数据来充分研究上下文感知机器翻译。大多数大型数据集都是通过丢弃文档级元数据的管道进行处理的。在这项工作中,我们为德语、法语、西班牙语、意大利语、波兰语和葡萄牙语(与英语配对)的三个大型数据集(ParaCrawl、News Commentary和Europarl)重建了文档级信息。然后,我们引入了一种文档级过滤技术,作为传统双语文本(bitext)过滤的替代方案。我们对这种过滤进行了分析,以表明该方法更偏好上下文一致的翻译,而不是那些可能经过句子级机器翻译的翻译。最后,我们在这些较长的上下文上训练模型,并在不降低句子级翻译质量的情况下展示了文档级翻译的改进。我们将我们的数据集ParaDocs和由此得到的模型作为资源发布给社区。

[NLP-52] MuJo: Multimodal Joint Feature Space Learning for Human Activity Recognition
[NLP-52] MuJo:用于人类活动识别的多模态联合特征空间学习

链接: https://arxiv.org/abs/2406.03857
作者: Stefan Gerd Fritsch,Cennet Oguz,Vitor Fortes Rey,Lala Ray,Maximilian Kiefer-Emmanouilidis,Paul Lukowicz
关键词: Human Activity Recognition, human computer interaction, Human Activity, Activity Recognition, sports and fitness
中文关键词: 人类活动识别、人机交互、人类活动、活动识别、运动和健身
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

Abstract:Human Activity Recognition is a longstanding problem in AI with applications in a broad range of areas: from healthcare, sports and fitness, security, and human computer interaction to robotics. The performance of HAR in real-world settings is strongly dependent on the type and quality of the input signal that can be acquired. Given an unobstructed, high-quality camera view of a scene, computer vision systems, in particular in conjunction with foundational models (e.g., CLIP), can today fairly reliably distinguish complex activities. On the other hand, recognition using modalities such as wearable sensors (which are often more broadly available, e.g, in mobile phones and smartwatches) is a more difficult problem, as the signals often contain less information and labeled training data is more difficult to acquire. In this work, we show how we can improve HAR performance across different modalities using multimodal contrastive pretraining. Our approach MuJo (Multimodal Joint Feature Space Learning), learns a multimodal joint feature space with video, language, pose, and IMU sensor data. The proposed approach combines contrastive and multitask learning methods and analyzes different multitasking strategies for learning a compact shared representation. A large dataset with parallel video, language, pose, and sensor data points is also introduced to support the research, along with an analysis of the robustness of the multimodal joint space for modal-incomplete and low-resource data. On the MM-Fit dataset, our model achieves an impressive Macro F1-Score of up to 0.992 with only 2% of the train data and 0.999 when using all available training data for classification tasks. Moreover, in the scenario where the MM-Fit dataset is unseen, we demonstrate a generalization performance of up to 0.638.
摘要:人类活动识别(HAR)是人工智能中一个长期存在的问题,其应用领域非常广泛:从医疗保健、体育健身、安全、人机交互到机器人。HAR在实际环境中的性能在很大程度上取决于可以采集的输入信号的类型和质量。在给定场景的无遮挡、高质量摄像机视角的情况下,计算机视觉系统,特别是与基础模型(例如CLIP)相结合时,如今可以相当可靠地区分复杂的活动。另一方面,使用可穿戴传感器(通常更容易获得,例如在手机和智能手表中)等模态进行识别是一个更困难的问题,因为信号通常包含较少的信息,并且带标注的训练数据更难获取。在这项工作中,我们展示了如何使用多模态对比预训练来提高不同模态下的HAR性能。我们的方法MuJo(多模态联合特征空间学习)学习包含视频、语言、姿态和IMU传感器数据的多模态联合特征空间。该方法结合了对比学习和多任务学习方法,并分析了学习紧凑共享表示的不同多任务策略。为了支持这项研究,还引入了一个包含并行视频、语言、姿态和传感器数据点的大型数据集,并分析了多模态联合空间对模态不完整和低资源数据的稳健性。在MM-Fit数据集上,我们的模型在仅使用2%的训练数据的情况下获得了高达0.992的宏F1分数,在使用所有可用的训练数据进行分类任务时获得了0.999的分数。此外,在未见过MM-Fit数据集的情况下,我们展示了高达0.638的泛化性能。

[NLP-53] Performance of large language models in numerical vs. semantic medical knowledge: Benchmarking on evidence-based QAs
[NLP-53] 大型语言模型在数字与语义医学知识中的性能:基于证据的QA基准

链接: https://arxiv.org/abs/2406.03855
作者: Eden Avnat,Michal Levy,Daniel Herstain,Elia Yanko,Daniel Ben Joya,Michal Tzuchman Katz,Dafna Eshel,Sahar Laros,Yael Dagan,Shahar Barami,Joseph Mermelstein,Shahar Ovadia,Noam Shomron,Varda Shalev,Raja-Elie E. Abdulnour
关键词: Clinical problem-solving requires, problem-solving requires processing, Clinical problem-solving, problem-solving requires, requires processing
中文关键词: 临床问题解决需要,问题解决需要处理,临床问题解决,问题解决需要,需要处理
类目: Computation and Language (cs.CL)
备注:

Abstract:Clinical problem-solving requires processing of semantic medical knowledge such as illness scripts and numerical medical knowledge of diagnostic tests for evidence-based decision-making. As large language models (LLMs) show promising results in many aspects of language-based clinical practice, their ability to generate non-language evidence-based answers to clinical questions is inherently limited by tokenization. Therefore, we evaluated LLMs’ performance on two question types: numeric (correlating findings) and semantic (differentiating entities) while examining differences within and between LLMs in medical aspects and comparing their performance to humans. To generate straightforward multi-choice questions and answers (QAs) based on evidence-based medicine (EBM), we used a comprehensive medical knowledge graph (encompassed data from more than 50,00 peer-reviewed articles) and created the “EBMQA”. EBMQA contains 105,000 QAs labeled with medical and non-medical topics and classified into numerical or semantic questions. We benchmarked this dataset using more than 24,500 QAs on two state-of-the-art LLMs: Chat-GPT4 and Claude3-Opus. We evaluated the LLMs accuracy on semantic and numerical question types and according to sub-labeled topics. For validation, six medical experts were tested on 100 numerical EBMQA questions. We found that both LLMs excelled more in semantic than numerical QAs, with Claude3 surpassing GPT4 in numerical QAs. However, both LLMs showed inter and intra gaps in different medical aspects and remained inferior to humans. Thus, their medical advice should be addressed carefully.
摘要:临床问题的解决需要对疾病脚本等语义医学知识和诊断试验的数值医学知识进行处理,以进行基于证据的决策。尽管大语言模型(LLM)在基于语言的临床实践的许多方面都显示出良好的结果,但它们为临床问题生成非语言循证答案的能力固有地受到词元化(tokenization)的限制。因此,我们评估了LLM在两种问题类型上的表现:数值型(关联发现)和语义型(区分实体),同时考察了LLM内部和之间在医学方面的差异,并将它们的表现与人类进行了比较。为了生成基于循证医学(EBM)的简单多选问答(QA),我们使用了一个全面的医学知识图谱(包括来自50,00多篇同行评议文章的数据),并创建了“EBMQA”。EBMQA包含105,000个标注了医学和非医学主题的问答,并被分类为数值或语义问题。我们在两个最先进的LLM(Chat-GPT4和Claude3-Opus)上使用了超过24,500个问答对该数据集进行了基准测试。我们评估了LLM在语义问题类型和数值问题类型上的准确率,并根据子标签主题进行了评估。为了验证,六名医学专家就100个数值型EBMQA问题接受了测试。我们发现,两种LLM在语义问答上的表现都优于数值问答,Claude3在数值问答上超过了GPT4。然而,两个LLM在不同的医学方面都显示出模型间和模型内的差距,并且仍然不如人类。因此,应当谨慎对待它们给出的医疗建议。

[NLP-54] Speculative Decoding via Early-exiting for Faster LLM Inference with Thompson Sampling Control Mechanism
[NLP-54] 通过提前退出进行推测解码,以实现更快的LLM推理,并采用Thompson采样控制机制

链接: https://arxiv.org/abs/2406.03853
作者: Jiahao Liu,Qifan Wang,Jingang Wang,Xunliang Cai
关键词: escalating inference costs, large language models, Early-exiting Speculative Decoding, called Early-exiting Speculative, real-world applications
中文关键词: 不断上升的推理成本、大型语言模型、早期退出的推测解码,称为早期退出的推测、现实世界应用程序
类目: Computation and Language (cs.CL)
备注: Accepted by ACL 2024 (Findings)

Abstract:The recent advancements in large language models (LLMs) have been extraordinary, yet the escalating inference costs associated with them present challenges in real-world applications. To address these challenges, we propose a novel approach called Early-exiting Speculative Decoding (EESD) with lossless acceleration. Specifically, EESD utilizes a segment of the LLM to generate draft tokens, incorporating Early-exiting structures after the first N layers. To enhance the quality of draft tokens, a self-distillation method is integrated. This early-exiting design not only reduces deployment and training costs but also significantly accelerates the token generation speed. Moreover, we introduce a novel sampling mechanism that leverages Thompson Sampling to regulate the generation processes, automatically determining the quantity of draft tokens in each round. The original LLM is then employed to validate these draft tokens through a single forward pass, and thus guarantees that the final output text maintains a distribution consistent with vanilla auto-regressive decoding. The experimental results on both 13B and 70B models demonstrate that our approach decodes tokens at a markedly accelerated rate compared to prior methods, showing the effectiveness of our approach.
摘要:大型语言模型(LLM)的最新进展是非同寻常的,但与之相关的不断上升的推理成本在现实世界的应用中提出了挑战。为了应对这些挑战,我们提出了一种新的方法,称为具有无损加速的提前退出推测解码(EESD)。具体地说,EESD利用LLM的一部分来生成草稿词元,在前N层之后加入提前退出结构。为了提高草稿词元的质量,集成了一种自蒸馏方法。这种提前退出的设计不仅降低了部署和训练成本,还显著加快了词元生成速度。此外,我们引入了一种新颖的采样机制,该机制利用汤普森采样(Thompson Sampling)来调节生成过程,自动确定每轮草稿词元的数量。然后使用原始LLM通过单次前向传播来验证这些草稿词元,从而保证最终输出文本保持与普通自回归解码一致的分布。在13B和70B模型上的实验结果表明,与以前的方法相比,我们的方法解码词元的速度明显加快,表明了我们方法的有效性。
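
The abstract says Thompson Sampling is used to decide how many draft tokens to generate per round, without giving details. The following is only a rough sketch of how such a controller could look, under stated assumptions: a Beta posterior over the per-token acceptance probability, and a rule that keeps drafting while the sampled probability of the whole draft being accepted stays above a threshold. The class name, the threshold rule, and all parameters are hypothetical and are not the paper's mechanism.

```python
import random


class DraftLengthSampler:
    """Thompson-style controller for the number of draft tokens (illustrative)."""

    def __init__(self, max_len=8, threshold=0.5, seed=0):
        self.accepted = 1.0  # Beta prior pseudo-count for accepted draft tokens
        self.rejected = 1.0  # Beta prior pseudo-count for rejected draft tokens
        self.max_len = max_len
        self.threshold = threshold
        self.rng = random.Random(seed)

    def choose_length(self):
        # Sample an acceptance probability from the posterior, then pick the
        # longest draft whose all-accepted probability stays above the threshold.
        p = self.rng.betavariate(self.accepted, self.rejected)
        length = 1
        while length < self.max_len and p ** (length + 1) >= self.threshold:
            length += 1
        return length

    def update(self, num_accepted, num_rejected):
        # Feed back the verification outcome from the full model.
        self.accepted += num_accepted
        self.rejected += num_rejected


sampler = DraftLengthSampler()
for _ in range(5):
    k = sampler.choose_length()
    # Pretend the verifier accepted all but the last draft token (made-up feedback).
    sampler.update(num_accepted=k - 1, num_rejected=1)
    print("draft length this round:", k)
```

Sampling from the posterior rather than using its mean is what makes this Thompson-style: early rounds explore different draft lengths, and the choice settles as verification feedback accumulates.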

[NLP-55] Lean Workbook: A large-scale Lean problem set formalized from natural language math problems
[NLP-55] Lean Workbook:由自然语言数学问题形式化得到的大规模Lean问题集

链接: https://arxiv.org/abs/2406.03847
作者: Huaiyuan Ying,Zijian Wu,Yihan Geng,Jiayu Wang,Dahua Lin,Kai Chen
关键词: Large language models, demonstrated impressive capabilities, language processing tasks, Large language, processing tasks
中文关键词: 大型语言模型,展示了令人印象深刻的能力,语言处理任务,大型语言,处理任务
类目: Computation and Language (cs.CL)
备注:

Abstract:Large language models have demonstrated impressive capabilities across various natural language processing tasks, especially in solving mathematical problems. However, large language models are not good at math theorem proving using formal languages like Lean. A significant challenge in this area is the scarcity of training data available in these formal languages. To address this issue, we propose a novel pipeline that iteratively generates and filters synthetic data to translate natural language mathematical problems into Lean 4 statements, and vice versa. Our results indicate that the synthetic data pipeline can provide useful training data and improve the performance of LLMs in translating and understanding complex mathematical problems and proofs. Our final dataset contains about 57K formal-informal question pairs along with searched proof from the math contest forum and 21 new IMO questions. We open-source our code at this https URL and our data at this https URL.
摘要:大型语言模型在各种自然语言处理任务中表现出了令人印象深刻的能力,尤其是在解决数学问题方面。然而,大型语言模型并不擅长使用Lean等形式语言证明数学定理。该领域的一个重大挑战是这些形式语言中可用的训练数据稀缺。为了解决这个问题,我们提出了一种新颖的管道,该管道迭代地生成和过滤合成数据,以将自然语言数学问题转化为Lean 4陈述,反之亦然。我们的结果表明,合成数据管道可以提供有用的训练数据,并提高LLM在翻译和理解复杂数学问题和证明方面的性能。我们的最终数据集包含约57,000个形式化-非形式化问题对、从数学竞赛论坛中搜索得到的证明,以及21个新的国际数学奥林匹克(IMO)题目。我们在这个https URL上开源我们的代码,并在这个https URL上开源我们的数据。

[NLP-56] Chaos with Keywords: Exposing Large Language Models Sycophancy to Misleading Keywords and Evaluating Defense Strategies
[NLP-56] 关键词带来的混乱:揭示大型语言模型对误导性关键词的谄媚并评估防御策略

链接: https://arxiv.org/abs/2406.03827
作者: Aswin RRV,Nemika Tyagi,Md Nayem Uddin,Neeraj Varshney,Chitta Baral
关键词: Large Language Models, Large Language, tendencies of Large, Language Models, study explores
中文关键词: 大型语言模型,大型语言,大型趋势,语言模型,研究探索
类目: Computation and Language (cs.CL)
备注: To be published in Findings of ACL 2024

Abstract:This study explores the sycophantic tendencies of Large Language Models (LLMs), where these models tend to provide answers that match what users want to hear, even if they are not entirely correct. The motivation behind this exploration stems from the common behavior observed in individuals searching the internet for facts with partial or misleading knowledge. Similar to using web search engines, users may recall fragments of misleading keywords and submit them to an LLM, hoping for a comprehensive response. Our empirical analysis of several LLMs shows the potential danger of these models amplifying misinformation when presented with misleading keywords. Additionally, we thoroughly assess four existing hallucination mitigation strategies to reduce LLMs sycophantic behavior. Our experiments demonstrate the effectiveness of these strategies for generating factually correct statements. Furthermore, our analyses delve into knowledge-probing experiments on factual keywords and different categories of sycophancy mitigation.
摘要:这项研究探索了大型语言模型(LLM)的奉承倾向,这些模型往往提供与用户想听的答案相匹配的答案,即使它们并不完全正确。这种探索背后的动机源于观察到的个人在互联网上搜索具有部分或误导性知识的事实的常见行为。与使用网络搜索引擎类似,用户可能会回忆起误导性的关键字片段,并将它们提交给LLM,希望得到全面的回应。我们对几个LLM的实证分析表明,当出现误导性的关键字时,这些模型存在放大错误信息的潜在危险。此外,我们彻底评估了四种现有的减少幻觉的策略,以减少LLMS的奉承行为。我们的实验证明了这些策略在生成真实正确语句方面的有效性。此外,我们的分析深入到了对事实关键词和不同类别的奉承缓解的知识探索实验中。

[NLP-57] ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search
[NLP-57] ReST-MCTS*:通过过程奖励引导的树搜索进行LLM自训练

链接: https://arxiv.org/abs/2406.03816
作者: Dan Zhang,Sining Zhoubian,Yisong Yue,Yuxiao Dong,Jie Tang
关键词: LLM generating responses, Recent methodologies, correct output answers, generating responses, responses and filtering
中文关键词: LLM生成响应、最新方法论、正确输出答案、生成响应、响应和过滤
类目: Computation and Language (cs.CL)
备注: 29 pages

Abstract:Recent methodologies in LLM self-training mostly rely on LLM generating responses and filtering those with correct output answers as training data. This approach often yields a low-quality fine-tuning training set (e.g., incorrect plans or intermediate reasoning). In this paper, we develop a reinforced self-training approach, called ReST-MCTS*, based on integrating process reward guidance with tree search MCTS* for collecting higher-quality reasoning traces as well as per-step value to train policy and reward models. ReST-MCTS* circumvents the per-step manual annotation typically used to train process rewards by tree-search-based reinforcement learning: Given oracle final correct answers, ReST-MCTS* is able to infer the correct process rewards by estimating the probability this step can help lead to the correct answer. These inferred rewards serve dual purposes: they act as value targets for further refining the process reward model and also facilitate the selection of high-quality traces for policy model self-training. We first show that the tree-search policy in ReST-MCTS* achieves higher accuracy compared with prior LLM reasoning baselines such as Best-of-N and Tree-of-Thought, within the same search budget. We then show that by using traces searched by this tree-search policy as training data, we can continuously enhance the three language models for multiple iterations, and outperform other self-training algorithms such as ReST^EM and Self-Rewarding LM.
摘要:目前的LLM自训练方法大多依赖于LLM生成响应,并筛选出输出答案正确的响应作为训练数据。这种方法经常产生低质量的微调训练集(例如,不正确的计划或中间推理)。在本文中,我们开发了一种强化自训练方法,称为ReST-MCTS*,它将过程奖励引导与树搜索MCTS*相结合,以收集更高质量的推理轨迹以及每一步的价值,用于训练策略模型和奖励模型。ReST-MCTS*通过基于树搜索的强化学习绕过了通常用于训练过程奖励的逐步人工标注:给定最终正确答案(oracle),ReST-MCTS*能够通过估计某一步骤有助于得到正确答案的概率来推断正确的过程奖励。这些推断出的奖励具有双重用途:它们作为进一步优化过程奖励模型的价值目标,也有助于为策略模型自训练选择高质量的轨迹。我们首先证明了在相同的搜索预算下,ReST-MCTS*中的树搜索策略比以往的LLM推理基线(如Best-of-N和Tree-of-Thought)获得了更高的准确率。然后,我们证明了通过使用这种树搜索策略搜索到的轨迹作为训练数据,我们可以在多次迭代中持续增强三种语言模型,并优于其他自训练算法,如ReST^EM和Self-Rewarding LM。
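
The abstract describes inferring a process reward for a step by estimating the probability that the step leads to the correct final answer. One simple way to picture such an estimate is plain Monte Carlo rollouts from a partial trace, sketched below. This is a swapped-in simplification, not the paper's MCTS* procedure, and the toy task, `toy_rollout`, and `toy_is_correct` are hypothetical stand-ins for an LLM policy and an oracle answer check.

```python
import random


def estimate_step_value(prefix, rollout, is_correct, num_rollouts=200, seed=0):
    """Fraction of rollouts continuing `prefix` that end in a correct answer."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(num_rollouts) if is_correct(rollout(prefix, rng)))
    return hits / num_rollouts


# Toy task: a "trace" is a list of three bits; the answer is correct iff all are 1.
def toy_rollout(prefix, rng):
    """Complete a partial trace with random steps (stand-in for sampling a policy)."""
    return list(prefix) + [rng.randint(0, 1) for _ in range(3 - len(prefix))]


def toy_is_correct(trace):
    """Oracle check on the final answer only; no per-step labels are needed."""
    return sum(trace) == 3


for prefix in ([0], [1], [1, 1]):
    value = estimate_step_value(prefix, toy_rollout, toy_is_correct)
    print(prefix, "estimated value:", value)
```

A bad first step receives value 0 because no continuation can recover, while good partial traces receive higher estimates. Such estimates could then serve as per-step training targets without manual annotation.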

[NLP-58] Improving Zero-Shot Chinese-English Code-Switching ASR with kNN-CTC and Gated Monolingual Datastores
[NLP-58] 利用kNN-CTC和门控单语数据存储库改进零样本中英语码转换ASR

链接: https://arxiv.org/abs/2406.03814
作者: Jiaming Zhou,Shiwan Zhao,Hui Wang,Tian-Hao Zhang,Haoqin Sun,Xuechen Wang,Yong Qin
关键词: automatic speech recognition, monolingual automatic speech, speech recognition, automatic speech, kNN-CTC model
中文关键词: 自动语音识别,单语自动语音,语音识别,自动语音,kNN-CTC模型
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

Abstract:The kNN-CTC model has proven to be effective for monolingual automatic speech recognition (ASR). However, its direct application to multilingual scenarios like code-switching, presents challenges. Although there is potential for performance improvement, a kNN-CTC model utilizing a single bilingual datastore can inadvertently introduce undesirable noise from the alternative language. To address this, we propose a novel kNN-CTC-based code-switching ASR (CS-ASR) framework that employs dual monolingual datastores and a gated datastore selection mechanism to reduce noise interference. Our method selects the appropriate datastore for decoding each frame, ensuring the injection of language-specific information into the ASR process. We apply this framework to cutting-edge CTC-based models, developing an advanced CS-ASR system. Extensive experiments demonstrate the remarkable effectiveness of our gated datastore mechanism in enhancing the performance of zero-shot Chinese-English CS-ASR.
摘要:kNN-CTC模型已被证明对单语自动语音识别(ASR)有效。然而,将它直接应用于语码转换等多语言场景会带来挑战。尽管性能有改进的潜力,但使用单个双语数据存储的kNN-CTC模型可能会无意中引入来自另一种语言的不良噪声。针对这一问题,我们提出了一种新的基于kNN-CTC的语码转换ASR(CS-ASR)框架,该框架使用两个单语数据存储和门控数据存储选择机制来降低噪声干扰。我们的方法为每一帧的解码选择适当的数据存储,确保将特定于语言的信息注入ASR过程。我们将该框架应用于基于CTC的前沿模型,开发了一个先进的CS-ASR系统。大量实验表明,我们的门控数据存储机制在提高零样本中英CS-ASR性能方面具有显著的效果。
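
The framework above selects one of two monolingual datastores per frame and injects the retrieved information into decoding. A minimal sketch of that flow for a single frame is given below. The linear interpolation of the kNN and CTC distributions, the distance-based weighting, the toy vectors, and the fact that the gate decision is passed in as a plain argument (rather than predicted by a learned gate) are all assumptions for illustration.

```python
import math


def knn_distribution(query, datastore, k, vocab, temperature=1.0):
    """Turn the k nearest (key, label) entries into a distribution over vocab."""
    nearest = sorted(
        (sum((q - x) ** 2 for q, x in zip(query, key)), label)
        for key, label in datastore
    )[:k]
    weights = {token: 0.0 for token in vocab}
    for distance, label in nearest:
        weights[label] += math.exp(-distance / temperature)
    total = sum(weights.values())
    return {token: w / total for token, w in weights.items()}


def gated_knn_ctc(query, ctc_probs, gate_language, datastores, k=2, lam=0.5):
    """Interpolate CTC probabilities with kNN probabilities from the gated datastore."""
    knn = knn_distribution(query, datastores[gate_language], k, ctc_probs.keys())
    return {t: lam * knn[t] + (1 - lam) * ctc_probs[t] for t in ctc_probs}


# Toy monolingual datastores: frame embedding -> token label (illustrative values).
datastores = {
    "zh": [((0.0, 0.0), "你"), ((0.1, 0.0), "你"), ((1.0, 1.0), "好")],
    "en": [((0.0, 0.0), "hi"), ((1.0, 1.0), "yo")],
}
ctc_probs = {"你": 0.25, "好": 0.25, "hi": 0.25, "yo": 0.25}

mixed = gated_knn_ctc((0.0, 0.0), ctc_probs, "zh", datastores)
print(mixed)
```

Because only the gated datastore is queried, English entries that happen to lie close to the frame embedding cannot add probability mass to the wrong language, which is the noise the abstract aims to reduce.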

[NLP-59] Tool-Planner: Dynamic Solution Tree Planning for Large Language Model with Tool Clustering
[NLP-59] Tool-Planner:使用工具聚类的大型语言模型动态解决方案树规划

链接: https://arxiv.org/abs/2406.03807
作者: Yanming Liu,Xinyue Peng,Yuwei Zhang,Jiannan Cao,Xuhong Zhang,Sheng Cheng,Xun Wang,Jianwei Yin,Tianyu Du
关键词: exceptional reasoning capabilities, demonstrated exceptional reasoning, Large language models, Large language, reasoning capabilities
中文关键词: 卓越的推理能力,表现出卓越的推理,大型语言模型,大型语言,推理能力
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Robotics (cs.RO)
备注: 46 pages, first version

Abstract:Large language models (LLMs) have demonstrated exceptional reasoning capabilities, enabling them to solve various complex problems. Recently, this ability has been applied to the paradigm of tool learning. Tool learning involves providing examples of tool usage and their corresponding functions, allowing LLMs to formulate plans and demonstrate the process of invoking and executing each tool. LLMs can address tasks that they cannot complete independently, thereby enhancing their potential across different tasks. However, this approach faces two key challenges. First, redundant error correction leads to unstable planning and long execution time. Additionally, designing a correct plan among multiple tools is also a challenge in tool learning. To address these issues, we propose Tool-Planner, a task-processing framework based on toolkits. Tool-Planner groups tools based on the API functions with the same function into a toolkit and allows LLMs to implement planning across the various toolkits. When a tool error occurs, the language model can reselect and adjust tools based on the toolkit. Experiments show that our approach demonstrates a high pass and win rate across different datasets and optimizes the planning scheme for tool learning in models such as GPT-4 and Claude 3, showcasing the potential of our method.
摘要:大型语言模型(LLM)表现出了卓越的推理能力,使其能够解决各种复杂的问题。最近,这种能力被应用到工具学习的范式中。工具学习包括提供工具使用及其相应功能的示例,使LLM能够制定计划并演示调用和执行每个工具的过程。LLM可以处理它们无法独立完成的任务,从而增强它们在不同任务中的潜力。然而,这种方法面临着两个关键挑战。首先,冗余纠错导致规划不稳定,执行时间长。此外,在多个工具中设计正确的计划也是工具学习中的一个挑战。为了解决这些问题,我们提出了基于工具包的任务处理框架Tool-Planner。Tool-Planner基于具有相同功能的API函数将工具分组到一个工具包中,并允许LLM跨各种工具包实施计划。当出现工具错误时,语言模型可以根据工具包重新选择和调整工具。实验表明,该方法在不同的数据集上具有较高的通过率和优胜率,并在GPT-4和Claude 3等模型中优化了工具学习的规划方案,展示了该方法的潜力。
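
Tool-Planner, as summarized above, groups tools with the same function into toolkits and lets the model pick another tool from the same toolkit when one fails. The sketch below shows that fallback pattern in isolation. The tool names, the explicit function labels (the paper clusters tools rather than relying on hand-written labels), and the helper functions are hypothetical.

```python
from collections import defaultdict


def build_toolkits(tools):
    """Group (name, function_label, callable) triples by their function label."""
    toolkits = defaultdict(list)
    for name, function_label, fn in tools:
        toolkits[function_label].append((name, fn))
    return dict(toolkits)


def call_toolkit(toolkits, function_label, *args):
    """Try each tool in the toolkit in order; fall back to the next one on error."""
    errors = []
    for name, fn in toolkits[function_label]:
        try:
            return name, fn(*args)
        except Exception as exc:  # a failed tool triggers reselection in the kit
            errors.append((name, exc))
    raise RuntimeError(f"all tools failed in toolkit {function_label!r}: {errors}")


# Hypothetical tools sharing the same function ("weather lookup").
def flaky_weather(city):
    raise TimeoutError("upstream API timed out")


def backup_weather(city):
    return f"sunny in {city}"


toolkits = build_toolkits([
    ("weather_a", "weather", flaky_weather),
    ("weather_b", "weather", backup_weather),
])
print(call_toolkit(toolkits, "weather", "Paris"))
```

Planning over toolkits rather than individual tools keeps the plan unchanged when a single tool fails, since only the tool choice inside the toolkit needs to be revised.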

[NLP-60] Light-PEFT: Lightening Parameter-Efficient Fine-Tuning via Early Pruning
[NLP-60] Light-PEFT:通过早期剪枝轻量化参数高效微调

链接: https://arxiv.org/abs/2406.03792
作者: Naibin Gu,Peng Fu,Xiyu Liu,Bowen Shen,Zheng Lin,Weiping Wang
关键词: large language models, PEFT, Parameter-efficient fine-tuning, Foundation Model, Masked Early Pruning
中文关键词: 大型语言模型、PEFT、参数高效微调、基础模型、掩蔽早期修剪
类目: Computation and Language (cs.CL)
备注: Findings of ACL 2024

Abstract:Parameter-efficient fine-tuning (PEFT) has emerged as the predominant technique for fine-tuning in the era of large language models. However, existing PEFT methods still have inadequate training efficiency. Firstly, the utilization of large-scale foundation models during the training process is excessively redundant for certain fine-tuning tasks. Secondly, as the model size increases, the growth in trainable parameters of empirically added PEFT modules becomes non-negligible and redundant, leading to inefficiency. To achieve task-specific efficient fine-tuning, we propose the Light-PEFT framework, which includes two methods: Masked Early Pruning of the Foundation Model and Multi-Granularity Early Pruning of PEFT. The Light-PEFT framework allows for the simultaneous estimation of redundant parameters in both the foundation model and PEFT modules during the early stage of training. These parameters can then be pruned for more efficient fine-tuning. We validate our approach on GLUE, SuperGLUE, QA tasks, and various models. With Light-PEFT, parameters of the foundation model can be pruned by up to over 40%, while still controlling trainable parameters to be only 25% of the original PEFT method. Compared to utilizing the PEFT method directly, Light-PEFT achieves training and inference speedup, reduces memory usage, and maintains comparable performance and the plug-and-play feature of PEFT.
摘要:在大型语言模型时代,参数高效微调(PEFT)已成为主要的微调技术。然而,现有的PEFT方法的训练效率仍然不足。首先,在训练过程中使用大规模基础模型,对于某些微调任务来说过于冗余。其次,随着模型规模的增加,凭经验添加的PEFT模块的可训练参数增长变得不可忽略且冗余,从而导致效率低下。为了实现面向特定任务的高效微调,我们提出了Light-PEFT框架,该框架包括两种方法:基础模型的掩码早期剪枝(Masked Early Pruning)和PEFT的多粒度早期剪枝(Multi-Granularity Early Pruning)。Light-PEFT框架允许在训练的早期阶段同时估计基础模型和PEFT模块中的冗余参数,随后可以剪除这些参数,以便更高效地进行微调。我们在GLUE、SuperGLUE、问答(QA)任务以及多种模型上验证了我们的方法。使用Light-PEFT,基础模型的参数最多可剪除40%以上,同时可训练参数仅为原始PEFT方法的25%。与直接使用PEFT方法相比,Light-PEFT实现了训练和推理加速,减少了内存使用,并保持了相当的性能以及PEFT的即插即用特性。

[NLP-61] End-to-End Trainable Soft Retriever for Low-resource Relation Extraction
[NLP-61] 用于低资源关系提取的端到端可训练软检索器

链接: https://arxiv.org/abs/2406.03790
作者: Kohei Makino,Makoto Miwa,Yutaka Sasaki
关键词: TRAinable Soft K-nearest, Soft K-nearest neighbor, study addresses, addresses a crucial, crucial challenge
中文关键词: 可训练的软K最近,软K最近邻居,研究解决了一个至关重要的挑战
类目: Computation and Language (cs.CL)
备注: preprint

Abstract:This study addresses a crucial challenge in instance-based relation extraction using text generation models: end-to-end training in target relation extraction task is not applicable to retrievers due to the non-differentiable nature of instance selection. We propose a novel End-to-end TRAinable Soft K-nearest neighbor retriever (ETRASK) by the neural prompting method that utilizes a soft, differentiable selection of the k nearest instances. This approach enables the end-to-end training of retrievers in target tasks. On the TACRED benchmark dataset with a low-resource setting where the training data was reduced to 10%, our method achieved a state-of-the-art F1 score of 71.5%. Moreover, ETRASK consistently improved the baseline model by adding instances for all settings. These results highlight the efficacy of our approach in enhancing relation extraction performance, especially in resource-constrained environments. Our findings offer a promising direction for future research with extraction and the broader application of text generation in natural language processing.
摘要:本研究针对使用文本生成模型的基于实例的关系抽取中的一个关键挑战:由于实例选择不可微,目标关系抽取任务中的端到端训练无法应用于检索器。我们通过神经提示方法提出了一种新的端到端可训练软K近邻检索器(ETRASK),该方法对k个最近实例进行软的、可微的选择,从而使检索器能够在目标任务中进行端到端训练。在训练数据减少到10%的低资源设置下,我们的方法在TACRED基准数据集上取得了71.5%的最先进F1分数。此外,在所有设置下,ETRASK都通过添加实例持续改进了基线模型。这些结果突出了我们的方法在提高关系抽取性能方面的有效性,特别是在资源受限的环境中。我们的发现为未来的抽取研究以及文本生成在自然语言处理中的更广泛应用提供了一个有前景的方向。
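
To make the idea of a "soft, differentiable selection of the k nearest instances" concrete, here is a minimal sketch of one generic relaxation: a sigmoid around the top-k score threshold. This is not the ETRASK formulation from the paper; the function name `soft_top_k_weights`, the temperature `tau`, and the toy similarity values are all assumptions made for illustration.

```python
import math


def soft_top_k_weights(scores, k, tau=0.05):
    """Relaxed top-k membership: one weight in (0, 1) per candidate instance.

    A hard top-k returns 0/1 indicators, which have no useful gradient.
    Here the hard test "score >= threshold" is replaced by a sigmoid.
    """
    if not 0 < k < len(scores):
        raise ValueError("k must satisfy 0 < k < len(scores)")
    ranked = sorted(scores, reverse=True)
    # Threshold halfway between the k-th and (k+1)-th best score.
    threshold = 0.5 * (ranked[k - 1] + ranked[k])
    weights = []
    for s in scores:
        # Clamp the logit so math.exp cannot overflow for tiny tau.
        z = max(-60.0, min(60.0, (s - threshold) / tau))
        weights.append(1.0 / (1.0 + math.exp(-z)))
    return weights


# Hypothetical query-to-instance similarities.
similarities = [0.9, 0.1, 0.8, 0.2]
weights = soft_top_k_weights(similarities, k=2)
```

With a small `tau` the weights approach the hard 0/1 selection; a larger `tau` spreads weight over more instances, which is the usual trade-off between fidelity to hard selection and gradient signal.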

[NLP-62] XL-HeadTags: Leveraging Multimodal Retrieval Augmentation for the Multilingual Generation of News Headlines and Tags
[NLP-62] XL-HeadTags:利用多模态检索增强进行新闻标题和标签的多语言生成

链接: https://arxiv.org/abs/2406.03776
作者: Faisal Tareque Shohan,Mir Tafseer Nayeem,Samsul Islam,Abu Ubaida Akash,Shafiq Joty
关键词: published online daily, articles published online, published online, online daily, daily can overwhelm
中文关键词: 每天在网上发表,文章在网上发表,在线发表,每天在网上,每天都可以压倒
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
备注: ACL 2024 camera ready

Abstract:Millions of news articles published online daily can overwhelm readers. Headlines and entity (topic) tags are essential for guiding readers to decide if the content is worth their time. While headline generation has been extensively studied, tag generation remains largely unexplored, yet it offers readers better access to topics of interest. The need for conciseness in capturing readers’ attention necessitates improved content selection strategies for identifying salient and relevant segments within lengthy articles, thereby guiding language models effectively. To address this, we propose to leverage auxiliary information such as images and captions embedded in the articles to retrieve relevant sentences and utilize instruction tuning with variations to generate both headlines and tags for news articles in a multilingual context. To make use of the auxiliary information, we have compiled a dataset named XL-HeadTags, which includes 20 languages across 6 diverse language families. Through extensive evaluation, we demonstrate the effectiveness of our plug-and-play multimodal-multilingual retrievers for both tasks. Additionally, we have developed a suite of tools for processing and evaluating multilingual texts, significantly contributing to the research community by enabling more accurate and efficient analysis across languages.
摘要:每天在网上发布的数百万篇新闻文章会让读者不知所措。标题和实体(主题)标签对于引导读者判断内容是否值得花时间阅读至关重要。虽然标题生成已经得到了广泛的研究,但标签生成在很大程度上仍未被探索,而它能让读者更好地获取感兴趣的主题。为了吸引读者注意力而保持简洁,就需要改进内容选择策略,以便在长篇文章中识别突出且相关的片段,从而有效地引导语言模型。为了解决这个问题,我们提出利用文章中嵌入的图像和图片说明等辅助信息来检索相关句子,并利用带有多种变体的指令微调,在多语言环境下为新闻文章生成标题和标签。为了利用这些辅助信息,我们构建了一个名为XL-HeadTags的数据集,涵盖6个不同语系的20种语言。通过广泛的评估,我们证明了我们的即插即用多模态-多语言检索器对这两项任务的有效性。此外,我们还开发了一套处理和评估多语言文本的工具,通过支持更准确、更高效的跨语言分析,为研究界做出了重要贡献。

[NLP-63] Character-Level Chinese Dependency Parsing via Modeling Latent Intra-Word Structure
[NLP-63] 通过建模潜在词内结构实现字符级中文依存句法分析

链接: https://arxiv.org/abs/2406.03772
作者: Yang Hou,Zhenghua Li
关键词: poses significant challenges, Chinese poses significant, clear word boundaries, Revealing the syntactic, word-level parsers due
中文关键词: 带来重大挑战,中文带来重大,清晰的词边界,揭示句法,词级解析器
类目: Computation and Language (cs.CL)
备注: Findings of ACL 2024

Abstract:Revealing the syntactic structure of sentences in Chinese poses significant challenges for word-level parsers due to the absence of clear word boundaries. To facilitate a transition from word-level to character-level Chinese dependency parsing, this paper proposes modeling latent internal structures within words. In this way, each word-level dependency tree is interpreted as a forest of character-level trees. A constrained Eisner algorithm is implemented to ensure the compatibility of character-level trees, guaranteeing a single root for intra-word structures and establishing inter-word dependencies between these roots. Experiments on Chinese treebanks demonstrate the superiority of our method over both the pipeline framework and previous joint models. A detailed analysis reveals that a coarse-to-fine parsing strategy empowers the model to predict more linguistically plausible intra-word structures.
摘要:由于缺乏明确的词边界,揭示中文句子的句法结构对词级解析器提出了重大挑战。为了促进从词级到字符级中文依存句法分析的过渡,本文提出对词内潜在的内部结构进行建模。这样,每棵词级依存树都被解释为由字符级树组成的森林。本文实现了一种受约束的Eisner算法,以确保字符级树的兼容性,保证词内结构只有单一的根,并在这些根之间建立词间依存关系。在中文树库上的实验证明了我们的方法优于流水线框架和之前的联合模型。详细的分析表明,从粗到细的解析策略使模型能够预测语言学上更合理的词内结构。
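
For readers unfamiliar with the Eisner algorithm mentioned in the abstract, the following is a minimal sketch of the standard, unconstrained first-order version, which finds the score of the best projective dependency tree by dynamic programming over spans. It is not the constrained variant proposed in the paper (it does not enforce a single root per word), and the function name `eisner_best_score` and the toy arc scores are assumptions made for illustration.

```python
NEG_INF = float("-inf")


def eisner_best_score(score):
    """Score of the best projective tree.

    score[h][m] is the weight of the arc h -> m; token 0 is an artificial ROOT.
    """
    n = len(score) - 1
    # Span tables indexed [s][t][d]: d=1 means the head is at s, d=0 at t.
    complete = [[[0.0, 0.0] for _ in range(n + 1)] for _ in range(n + 1)]
    incomplete = [[[NEG_INF, NEG_INF] for _ in range(n + 1)] for _ in range(n + 1)]
    for width in range(1, n + 1):
        for s in range(0, n - width + 1):
            t = s + width
            # Join two facing complete spans, then add one arc between s and t.
            joined = max(
                complete[s][r][1] + complete[r + 1][t][0] for r in range(s, t)
            )
            incomplete[s][t][0] = joined + score[t][s]
            incomplete[s][t][1] = joined + score[s][t]
            # Extend an incomplete span with a complete span on the far side.
            complete[s][t][0] = max(
                complete[s][r][0] + incomplete[r][t][0] for r in range(s, t)
            )
            complete[s][t][1] = max(
                incomplete[s][r][1] + complete[r][t][1] for r in range(s + 1, t + 1)
            )
    return complete[0][n][1]


# Toy example with ROOT plus two tokens; arcs into ROOT are forbidden.
arc_scores = [
    [NEG_INF, 5.0, 1.0],      # ROOT -> 1, ROOT -> 2
    [NEG_INF, NEG_INF, 4.0],  # 1 -> 2
    [NEG_INF, 3.0, NEG_INF],  # 2 -> 1
]
best = eisner_best_score(arc_scores)
```

In this toy case the best tree is ROOT -> 1 -> 2. A constrained variant would restrict which spans and arcs are allowed, for example so that each word's characters form a single-rooted subtree.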

[NLP-64] NAP2: A Benchmark for Naturalness and Privacy-Preserving Text Rewriting by Learning from Human
[NLP-64] NAP2:通过向人类学习实现自然且保护隐私的文本改写的基准

链接: https://arxiv.org/abs/2406.03749
作者: Shuo Huang,William MacLean,Xiaoxi Kang,Anqi Wu,Lizhen Qu,Qiongkai Xu,Zhuang Li,Xingliang Yuan,Gholamreza Haffari
关键词: employing NLP models, Increasing concerns, employing NLP, process sensitive texts, NLP models
中文关键词: 采用NLP模型、增加担忧、采用NLP、处理敏感文本、NLP模型
类目: Computation and Language (cs.CL)
备注:

Abstract:Increasing concerns about privacy leakage issues in academia and industry arise when employing NLP models from third-party providers to process sensitive texts. To protect privacy before sending sensitive data to those models, we suggest sanitizing sensitive text using two common strategies used by humans: i) deleting sensitive expressions, and ii) obscuring sensitive details by abstracting them. To explore the issues and develop a tool for text rewriting, we curate the first corpus, coined NAP^2, through both crowdsourcing and the use of large language models (LLMs). Compared to the prior works based on differential privacy, which lead to a sharp drop in information utility and unnatural texts, the human-inspired approaches result in more natural rewrites and offer an improved balance between privacy protection and data utility, as demonstrated by our extensive experiments.
摘要:当使用第三方提供商的NLP模型来处理敏感文本时,学术界和工业界对隐私泄露问题的担忧日益增加。为了在将敏感数据发送到这些模型之前保护隐私,我们建议使用人类常用的两种策略来净化敏感文本:i)删除敏感表达,ii)通过抽象化来模糊敏感细节。为了探索这些问题并开发文本改写工具,我们通过众包和使用大型语言模型(LLM)构建了首个语料库,命名为NAP^2。以往基于差分隐私的工作会导致信息效用急剧下降并产生不自然的文本;与之相比,受人类启发的方法可以实现更自然的改写,并在隐私保护和数据效用之间取得更好的平衡,我们的大量实验证明了这一点。

[NLP-65] Efficient Knowledge Infusion via KG-LLM Alignment
[NLP-65] 通过KG-LLM对齐高效知识注入

链接: https://arxiv.org/abs/2406.03746
作者: Zhouyu Jiang,Ling Zhong,Mengshu Sun,Jun Xu,Rui Sun,Hui Cai,Shuhan Luo,Zhiqiang Zhang
关键词: large language models, knowledge graph-retrieval-augmented method, domain-specific knowledge scarcity, language models, knowledge graphs
中文关键词: 大型语言模型、知识图检索增强方法、领域特定知识稀缺性、语言模型、知识图
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: ACL2024 Findings

Abstract:To tackle the problem of domain-specific knowledge scarcity within large language models (LLMs), knowledge graph-retrieval-augmented method has been proven to be an effective and efficient technique for knowledge infusion. However, existing approaches face two primary challenges: knowledge mismatch between public available knowledge graphs and the specific domain of the task at hand, and poor information compliance of LLMs with knowledge graphs. In this paper, we leverage a small set of labeled samples and a large-scale corpus to efficiently construct domain-specific knowledge graphs by an LLM, addressing the issue of knowledge mismatch. Additionally, we propose a three-stage KG-LLM alignment strategy to enhance the LLM’s capability to utilize information from knowledge graphs. We conduct experiments with a limited-sample setting on two biomedical question-answering datasets, and the results demonstrate that our approach outperforms existing baselines.
摘要:为了解决大型语言模型(LLM)中特定领域知识稀缺的问题,知识图谱检索增强方法已被证明是一种有效且高效的知识注入技术。然而,现有方法面临两个主要挑战:公开可用的知识图谱与手头任务的特定领域之间存在知识不匹配,以及LLM对知识图谱信息的遵循能力较差。在本文中,我们利用一小组标注样本和大规模语料库,通过LLM高效地构建特定领域的知识图谱,从而解决知识不匹配的问题。此外,我们提出了一个三阶段的KG-LLM对齐策略,以增强LLM利用知识图谱信息的能力。我们在两个生物医学问答数据集上进行了有限样本设置的实验,结果表明我们的方法优于现有的基线。

[NLP-66] Your Absorbing Discrete Diffusion Secretly Models the Conditional Distributions of Clean Data
[NLP-66] 你的吸收离散扩散模型其实在暗中建模干净数据的条件分布

链接: https://arxiv.org/abs/2406.03736
作者: Jingyang Ou,Shen Nie,Kaiwen Xue,Fengqi Zhu,Jiacheng Sun,Zhenguo Li,Chongxuan Li
关键词: absorbing discrete diffusion, concrete score, Discrete diffusion, Discrete diffusion models, processes have shown
中文关键词: 吸收离散扩散,具体得分,离散扩散,离散扩散模型,过程已显示
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

Abstract:Discrete diffusion models with absorbing processes have shown promise in language modeling. The key quantities to be estimated are the ratios between the marginal probabilities of two transitive states at all timesteps, called the concrete score. In this paper, we reveal that the concrete score in absorbing diffusion can be expressed as conditional probabilities of clean data, multiplied by a time-dependent scalar in an analytic form. Motivated by the finding, we propose reparameterized absorbing discrete diffusion (RADD), a dedicated diffusion model that characterizes the time-independent conditional probabilities. Besides its simplicity, RADD can reduce the number of function evaluations (NFEs) by caching the output of the time-independent network when the noisy sample remains unchanged in a sampling interval. Empirically, RADD is up to 3.5 times faster while consistently achieving a better performance than the strongest baseline. Built upon the new factorization of the concrete score, we further prove a surprising result that the exact likelihood of absorbing diffusion can be rewritten to a simple form (named denoising cross-entropy) and then estimated efficiently by the Monte Carlo method. The resulting approach also applies to the original parameterization of the concrete score. It significantly advances the state-of-the-art discrete diffusion on 5 zero-shot language modeling benchmarks (measured by perplexity) at the GPT-2 scale.
摘要:具有吸收过程的离散扩散模型在语言建模中显示出良好的前景。需要估计的关键量是所有时间步上两个可相互转移状态的边际概率之比,称为具体分数(concrete score)。在本文中,我们揭示了吸收扩散中的具体分数可以表示为干净数据的条件概率乘以一个具有解析形式的时间相关标量。受这一发现的启发,我们提出了重参数化吸收离散扩散(RADD),这是一种刻画与时间无关的条件概率的专用扩散模型。除了简单之外,当噪声样本在某个采样区间内保持不变时,RADD还可以通过缓存与时间无关的网络的输出来减少函数求值次数(NFE)。实验表明,RADD的速度最高可快3.5倍,同时性能始终优于最强基线。基于对具体分数的新分解,我们进一步证明了一个令人惊讶的结果:吸收扩散的精确似然可以改写为一种简单的形式(称为去噪交叉熵),然后用蒙特卡罗方法高效地估计。由此得到的方法也适用于具体分数的原始参数化。它在GPT-2规模的5个零样本语言建模基准(以困惑度衡量)上显著推进了离散扩散模型的最先进水平。
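
The NFE saving described in the abstract rests on a simple observation: if the network does not take the timestep as input, its output can be reused whenever the noisy sample is unchanged between sampling steps. The sketch below illustrates only that caching idea. The class `CachedNetwork`, the stand-in `toy_network`, and the hand-written trajectory are hypothetical and do not reproduce the RADD sampler.

```python
class CachedNetwork:
    """Wrap a time-independent network and reuse its output for repeated inputs."""

    def __init__(self, network):
        self.network = network
        self.nfe = 0          # number of real function evaluations
        self._key = None      # last input seen, stored as a tuple
        self._output = None   # cached output for that input

    def __call__(self, tokens):
        key = tuple(tokens)
        if key != self._key:
            # The sample changed, so the network must be evaluated again.
            self._output = self.network(tokens)
            self._key = key
            self.nfe += 1
        return self._output


def toy_network(tokens):
    # Stand-in for a denoiser: flags which positions are already unmasked.
    return [0 if tok == "[MASK]" else 1 for tok in tokens]


model = CachedNetwork(toy_network)
# Hypothetical sampling trajectory: the sample changes on only some steps.
trajectory = [
    ["[MASK]", "[MASK]"],
    ["[MASK]", "[MASK]"],
    ["a", "[MASK]"],
    ["a", "[MASK]"],
    ["a", "b"],
]
for sample in trajectory:
    model(sample)
```

Here five sampling steps cost three network evaluations. The cache would be invalid for a network that also conditions on the timestep, since the same sample could then map to different outputs.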

[NLP-67] LLMEmbed: Rethinking Lightweight LLMs Genuine Function in Text Classification
[NLP-67] LLMEmbed:重新思考轻量级LLM在文本分类中的真正功能

链接: https://arxiv.org/abs/2406.03725
作者: Chun Liu,Hongguang Zhang,Kainan Zhao,Xinghai Ju,Lin Yang
关键词: Large Language Models, Large Language, booming of Large, Language Models, research areas
中文关键词: 大型语言模型,大型语言,大型语言的蓬勃发展,语言模型,研究领域
类目: Computation and Language (cs.CL)
备注: ACL 2024 main conference

Abstract:With the booming of Large Language Models (LLMs), prompt-learning has become a promising method mainly researched in various research areas. Recently, many attempts based on prompt-learning have been made to improve the performance of text classification. However, most of these methods are based on heuristic Chain-of-Thought (CoT), and tend to be more complex but less efficient. In this paper, we rethink the LLM-based text classification methodology, propose a simple and effective transfer learning strategy, namely LLMEmbed, to address this classical but challenging task. To illustrate, we first study how to properly extract and fuse the text embeddings via various lightweight LLMs at different network depths to improve their robustness and discrimination, then adapt such embeddings to train the classifier. We perform extensive experiments on publicly available datasets, and the results show that LLMEmbed achieves strong performance while enjoys low training overhead using lightweight LLM backbones compared to recent methods based on larger LLMs, i.e. GPT-3, and sophisticated prompt-based strategies. Our LLMEmbed achieves adequate accuracy on publicly available benchmarks without any fine-tuning while merely use 4% model parameters, 1.8% electricity consumption and 1.5% runtime compared to its counterparts. Code is available at: this https URL.
摘要:随着大型语言模型(LLM)的蓬勃发展,提示学习已成为各个研究领域中被重点研究的一种有前景的方法。最近,人们进行了许多基于提示学习的尝试来提高文本分类的性能。然而,这些方法大多基于启发式的思维链(CoT),往往更复杂但效率更低。在本文中,我们重新思考了基于LLM的文本分类方法,提出了一种简单有效的迁移学习策略LLMEmbed,以解决这一经典但具有挑战性的任务。具体来说,我们首先研究了如何通过各种轻量级LLM在不同网络深度上恰当地提取和融合文本嵌入,以提高其稳健性和区分性,然后利用这些嵌入来训练分类器。我们在公开可用的数据集上进行了广泛的实验,结果表明,与基于更大的LLM(即GPT-3)和复杂提示策略的最新方法相比,使用轻量级LLM主干的LLMEmbed取得了强大的性能,同时训练开销较低。与同类方法相比,我们的LLMEmbed无需任何微调即可在公开可用的基准上达到足够的准确率,而仅使用4%的模型参数、1.8%的电力消耗和1.5%的运行时间。代码可在以下网址获得:这个HTTPS URL。

[NLP-68] Generalization-Enhanced Code Vulnerability Detection via Multi-Task Instruction Fine-Tuning
[NLP-68] 通过多任务指令微调实现泛化增强的代码漏洞检测

链接: https://arxiv.org/abs/2406.03718
作者: Xiaohu Du,Ming Wen,Jiahao Zhu,Zifan Xie,Bin Ji,Huijun Liu,Xuanhua Shi,Hai Jin
关键词: Code Pre-trained Models, achieved promising results, Code Pre-trained, Pre-trained Models, recent years
中文关键词: 代码预训练模型,取得了有希望的结果,代码预训练,预训练模型,近年来
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted to ACL 2024 Findings

Abstract:Code Pre-trained Models (CodePTMs) based vulnerability detection have achieved promising results over recent years. However, these models struggle to generalize as they typically learn superficial mapping from source code to labels instead of understanding the root causes of code vulnerabilities, resulting in poor performance in real-world scenarios beyond the training instances. To tackle this challenge, we introduce VulLLM, a novel framework that integrates multi-task learning with Large Language Models (LLMs) to effectively mine deep-seated vulnerability features. Specifically, we construct two auxiliary tasks beyond the vulnerability detection task. First, we utilize the vulnerability patches to construct a vulnerability localization task. Second, based on the vulnerability features extracted from patches, we leverage GPT-4 to construct a vulnerability interpretation task. VulLLM innovatively augments vulnerability classification by leveraging generative LLMs to understand complex vulnerability patterns, thus compelling the model to capture the root causes of vulnerabilities rather than overfitting to spurious features of a single task. The experiments conducted on six large datasets demonstrate that VulLLM surpasses seven state-of-the-art models in terms of effectiveness, generalization, and robustness.
摘要:近年来,基于代码预训练模型(CodePTM)的漏洞检测取得了可喜的成果。然而,这些模型难以泛化,因为它们通常学习从源代码到标签的表面映射,而不是理解代码漏洞的根本原因,从而导致在训练实例之外的现实场景中性能较差。为了应对这一挑战,我们引入了VulLLM,这是一个将多任务学习与大型语言模型(LLM)相结合的新框架,可以有效地挖掘深层次的漏洞特征。具体地说,我们在漏洞检测任务之外构造了两个辅助任务。首先,我们利用漏洞补丁构建漏洞定位任务。其次,基于从补丁中提取的漏洞特征,我们利用GPT-4构建漏洞解释任务。VulLLM创新性地利用生成式LLM来理解复杂的漏洞模式,以此增强漏洞分类,从而迫使模型捕获漏洞的根本原因,而不是过拟合单个任务的虚假特征。在六个大型数据集上进行的实验表明,VulLLM在有效性、泛化性和稳健性方面都超过了七个最先进的模型。

[NLP-69] A Survey on Medical Large Language Models: Technology, Application, Trustworthiness, and Future Directions
[NLP-69] 医学大型语言模型综述:技术、应用、可信度与未来方向

链接: https://arxiv.org/abs/2406.03712
作者: Lei Liu,Xiaoyan Yang,Junchi Lei,Xiaoyang Liu,Yue Shen,Zhiqiang Zhang,Peng Wei,Jinjie Gu,Zhixuan Chu,Zhan Qin,Kui Ren
关键词: GPT series models, Large language models, received substantial attention, substantial attention due, understanding human-level language
中文关键词: GPT系列模型,大型语言模型,受到了大量关注,应有的大量关注,理解人类水平的语言
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

Abstract:Large language models (LLMs), such as GPT series models, have received substantial attention due to their impressive capabilities for generating and understanding human-level language. More recently, LLMs have emerged as an innovative and powerful adjunct in the medical field, transforming traditional practices and heralding a new era of enhanced healthcare services. This survey provides a comprehensive overview of Medical Large Language Models (Med-LLMs), outlining their evolution from general to the medical-specific domain (i.e, Technology and Application), as well as their transformative impact on healthcare (e.g., Trustworthiness and Safety). Concretely, starting from the fundamental history and technology of LLMs, we first delve into the progressive adaptation and refinements of general LLM models in the medical domain, especially emphasizing the advanced algorithms that boost the LLMs’ performance in handling complicated medical environments, including clinical reasoning, knowledge graph, retrieval-augmented generation, human alignment, and multi-modal learning. Secondly, we explore the extensive applications of Med-LLMs across domains such as clinical decision support, report generation, and medical education, illustrating their potential to streamline healthcare services and augment patient outcomes. Finally, recognizing the imperative and responsible innovation, we discuss the challenges of ensuring fairness, accountability, privacy, and robustness in Med-LLMs applications. Finally, we conduct a concise discussion for anticipating possible future trajectories of Med-LLMs, identifying avenues for the prudent expansion of Med-LLMs. By consolidating above-mentioned insights, this review seeks to provide a comprehensive investigation of the potential strengths and limitations of Med-LLMs for professionals and researchers, ensuring a responsible landscape in the healthcare setting.
摘要:大型语言模型(LLM),如GPT系列模型,由于其在生成和理解人类水平语言方面的出色能力而受到广泛关注。最近,LLM已成为医疗领域一种创新而强大的辅助工具,改变了传统做法,预示着医疗服务提升的新时代的到来。本综述全面概述了医学大型语言模型(Med-LLM),梳理了它们从通用领域到医学特定领域的演变(即技术和应用),以及它们对医疗保健的变革性影响(例如可信度和安全性)。具体地说,我们从LLM的基本历史和技术出发,首先深入探讨通用LLM在医学领域的逐步适配和改进,尤其强调了提升LLM处理复杂医疗环境性能的先进算法,包括临床推理、知识图谱、检索增强生成、人类对齐和多模态学习。其次,我们探讨了Med-LLM在临床决策支持、报告生成和医学教育等领域的广泛应用,展示了它们在简化医疗服务和改善患者结局方面的潜力。最后,认识到负责任创新的必要性,我们讨论了在Med-LLM应用中确保公平性、问责性、隐私和稳健性的挑战。最后,我们简要讨论了Med-LLM未来可能的发展轨迹,指出了审慎扩展Med-LLM的途径。通过综合上述见解,本综述旨在为专业人员和研究人员全面考察Med-LLM的潜在优势和局限性,确保医疗保健环境中形成负责任的格局。

[NLP-70] What Should Embeddings Embed? Autoregressive Models Represent Latent Generating Distributions
[NLP-70] 嵌入应该嵌入什么?自回归模型代表潜在生成分布

链接: https://arxiv.org/abs/2406.03707
作者: Liyi Zhang,Michael Y. Li,Thomas L. Griffiths
关键词: Autoregressive language models, extract latent structure, large language models, language models, structure from text
中文关键词: 自回归语言模型、提取潜在结构、大语言模型、语言模型、文本结构
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
备注: 15 pages, 8 figures

Abstract:Autoregressive language models have demonstrated a remarkable ability to extract latent structure from text. The embeddings from large language models have been shown to capture aspects of the syntax and semantics of language. But what should embeddings represent? We connect the autoregressive prediction objective to the idea of constructing predictive sufficient statistics to summarize the information contained in a sequence of observations, and use this connection to identify three settings where the optimal content of embeddings can be identified: independent identically distributed data, where the embedding should capture the sufficient statistics of the data; latent state models, where the embedding should encode the posterior distribution over states given the data; and discrete hypothesis spaces, where the embedding should reflect the posterior distribution over hypotheses given the data. We then conduct empirical probing studies to show that transformers encode these three kinds of latent generating distributions, and that they perform well in out-of-distribution cases and without token memorization in these settings.
摘要:自回归语言模型表现出了从文本中提取潜在结构的显著能力。来自大型语言模型的嵌入已被证明能够捕获语言的语法和语义的各个方面。但嵌入应该表示什么呢?我们将自回归预测目标与构造预测充分统计量以概括一系列观测所含信息的思想联系起来,并利用这种联系确定了三种可以确定嵌入最优内容的设定:独立同分布数据,其中嵌入应该捕获数据的充分统计量;潜在状态模型,其中嵌入应该编码给定数据时状态的后验分布;以及离散假设空间,其中嵌入应该反映给定数据时假设的后验分布。然后,我们进行了实证探测研究,表明Transformer能够编码这三种潜在生成分布,并且在这些设定下,它们在分布外的情况下以及在不依赖词元记忆的情况下都表现良好。

[NLP-71] Improving Audio Codec-based Zero-Shot Text-to-Speech Synthesis with Multi-Modal Context and Large Language Model
[NLP-71] 利用多模态上下文和大语言模型改进基于音频编解码器的零样本文本到语音合成

链接: https://arxiv.org/abs/2406.03706
作者: Jinlong Xue,Yayue Deng,Yicheng Han,Yingming Gao,Ya Li
关键词: codecs greatly propel, Recent advances, large language models, audio codecs greatly, advances in large
中文关键词: 编解码器极大地推动了,最近的进步,大型语言模型,音频编解码器极大地进步
类目: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: Accepted by Interspeech 2024

Abstract:Recent advances in large language models (LLMs) and development of audio codecs greatly propel the zero-shot TTS. They can synthesize personalized speech with only a 3-second speech of an unseen speaker as acoustic prompt. However, they only support short speech prompts and cannot leverage longer context information, as required in audiobook and conversational TTS scenarios. In this paper, we introduce a novel audio codec-based TTS model to adapt context features with multiple enhancements. Inspired by the success of Qformer, we propose a multi-modal context-enhanced Qformer (MMCE-Qformer) to utilize additional multi-modal context information. Besides, we adapt a pretrained LLM to leverage its understanding ability to predict semantic tokens, and use a SoundStorm to generate acoustic tokens thereby enhancing audio quality and speaker similarity. The extensive objective and subjective evaluations show that our proposed method outperforms baselines across various context TTS scenarios.
摘要:大语言模型(LLM)的最新进展和音频编解码器的发展极大地推动了零样本TTS的发展。它们只需一段未见过的说话人的3秒语音作为声学提示,就可以合成个性化的语音。然而,它们只支持较短的语音提示,无法利用有声读物和对话式TTS场景所需要的较长上下文信息。在本文中,我们提出了一种新的基于音频编解码器的TTS模型,通过多项增强来适配上下文特征。受Qformer的成功启发,我们提出了一种多模态上下文增强型Qformer(MMCE-Qformer)来利用额外的多模态上下文信息。此外,我们适配了一个预训练的LLM,利用其理解能力来预测语义词元,并使用SoundStorm来生成声学词元,从而提高音频质量和说话人相似度。大量的客观和主观评估表明,我们提出的方法在各种上下文TTS场景中都优于基线。

[NLP-72] Synthesizing Conversations from Unlabeled Documents using Automatic Response Segmentation
[NLP-72] 使用自动响应分段从未标记文档合成对话

链接: https://arxiv.org/abs/2406.03703
作者: Fanyou Wu,Weijie Xu,Chandan K. Reddy,Srinivasan H. Sengamedu
关键词: conversational question answering, costly training data, question answering, tackle the challenge, challenge of inadequate
中文关键词: 对话式问答、昂贵的训练数据、问答、应对挑战、不足的挑战
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: findings of ACL 2024

Abstract:In this study, we tackle the challenge of inadequate and costly training data that has hindered the development of conversational question answering (ConvQA) systems. Enterprises have a large corpus of diverse internal documents. Instead of relying on a searching engine, a more compelling approach for people to comprehend these documents is to create a dialogue system. In this paper, we propose a robust dialog synthesising method. We learn the segmentation of data for the dialog task instead of using segmenting at sentence boundaries. The synthetic dataset generated by our proposed method achieves superior quality when compared to WikiDialog, as assessed through machine and human evaluations. By employing our inpainted data for ConvQA retrieval system pre-training, we observed a notable improvement in performance across OR-QuAC benchmarks.
摘要:在这项研究中,我们致力于解决训练数据不足且成本高昂这一阻碍对话式问答(ConvQA)系统发展的挑战。企业拥有大量多样的内部文档。与依赖搜索引擎相比,帮助人们理解这些文档的一种更有吸引力的方法是创建对话系统。在本文中,我们提出了一种稳健的对话合成方法。我们针对对话任务学习数据的切分方式,而不是简单地在句子边界处切分。经机器和人工评估,与WikiDialog相比,我们提出的方法生成的合成数据集质量更高。通过将我们补全(inpainted)的数据用于ConvQA检索系统的预训练,我们观察到在OR-QuAC基准上的性能有了显著提高。

[NLP-73] M-QALM: A Benchmark to Assess Clinical Reading Comprehension and Knowledge Recall in Large Language Models via Question Answering
[NLP-73] M-QALM:通过问答评估大型语言模型中临床阅读理解和知识回忆的基准

链接: https://arxiv.org/abs/2406.03699
作者: Anand Subramanian,Viktor Schlegel,Abhinav Ramesh Kashyap,Thanh-Tung Nguyen,Vijay Prakash Dwivedi,Stefan Winkler
关键词: adapting Large Language, Large Language Models, adapting Large, Large Language, Abstractive Question Answering
中文关键词: 适应大语言、大语言模型、适应大、大语言、抽象问题解答
类目: Computation and Language (cs.CL)
备注: Accepted at ACL 2024 (Findings)

Abstract:There is vivid research on adapting Large Language Models (LLMs) to perform a variety of tasks in high-stakes domains such as healthcare. Despite their popularity, there is a lack of understanding of the extent and contributing factors that allow LLMs to recall relevant knowledge and combine it with presented information in the clinical and biomedical domain: a fundamental pre-requisite for success on down-stream tasks. Addressing this gap, we use Multiple Choice and Abstractive Question Answering to conduct a large-scale empirical study on 22 datasets in three generalist and three specialist biomedical sub-domains. Our multifaceted analysis of the performance of 15 LLMs, further broken down by sub-domain, source of knowledge and model architecture, uncovers success factors such as instruction tuning that lead to improved recall and comprehension. We further show that while recently proposed domain-adapted models may lack adequate knowledge, directly fine-tuning on our collected medical knowledge datasets shows encouraging results, even generalising to unseen specialist sub-domains. We complement the quantitative results with a skill-oriented manual error analysis, which reveals a significant gap between the models’ capabilities to simply recall necessary knowledge and to integrate it with the presented context. To foster research and collaboration in this field we share M-QALM, our resources, standardised methodology, and evaluation results, with the research community to facilitate further advancements in clinical knowledge representation learning within language models.
摘要:目前有大量活跃的研究致力于让大型语言模型(LLM)适配医疗保健等高风险领域中的各种任务。尽管LLM很受欢迎,但对于LLM能在多大程度上回忆相关知识并将其与临床和生物医学领域中给出的信息相结合,以及有哪些因素促成了这种能力,人们仍缺乏了解,而这是下游任务取得成功的基本先决条件。为了弥补这一差距,我们使用多项选择问答和生成式(abstractive)问答,对三个通用和三个专科生物医学子领域中的22个数据集进行了大规模的实证研究。我们对15个LLM的性能进行了多方面的分析,并进一步按子领域、知识来源和模型架构进行了细分,发现了指令微调等有助于提高回忆和理解能力的成功因素。我们进一步表明,虽然最近提出的领域适配模型可能缺乏足够的知识,但直接在我们收集的医学知识数据集上进行微调显示出令人鼓舞的结果,甚至可以泛化到未见过的专科子领域。我们用面向技能的人工错误分析来补充定量结果,该分析表明,模型单纯回忆必要知识的能力与将其同所给上下文相结合的能力之间存在显著差距。为了促进这一领域的研究和合作,我们与研究社区共享M-QALM,即我们的资源、标准化方法和评估结果,以推动语言模型中临床知识表征学习的进一步发展。

[NLP-74] Evaluating the World Model Implicit in a Generative Model
[NLP-74] 评估生成模型中隐含的世界模型

链接: https://arxiv.org/abs/2406.03689
作者: Keyon Vafa,Justin Y. Chen,Jon Kleinberg,Sendhil Mullainathan,Ashesh Rambachan
关键词: Recent work suggests, Recent work, implicitly learn world, implicitly learn, Recent
中文关键词: 最近的工作表明,最近的工作,隐性学习世界,隐性学习,最近
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

Abstract:Recent work suggests that large language models may implicitly learn world models. How should we assess this possibility? We formalize this question for the case where the underlying reality is governed by a deterministic finite automaton. This includes problems as diverse as simple logical reasoning, geographic navigation, game-playing, and chemistry. We propose new evaluation metrics for world model recovery inspired by the classic Myhill-Nerode theorem from language theory. We illustrate their utility in three domains: game playing, logic puzzles, and navigation. In all domains, the generative models we consider do well on existing diagnostics for assessing world models, but our evaluation metrics reveal their world models to be far less coherent than they appear. Such incoherence creates fragility: using a generative model to solve related but subtly different tasks can lead it to fail badly. Building generative models that meaningfully capture the underlying logic of the domains they model would be immensely valuable; our results suggest new ways to assess how close a given model is to that goal.
摘要:最近的研究表明,大型语言模型可能会隐式地学习世界模型。我们应该如何评估这种可能性?我们针对底层现实由确定性有限自动机支配的情形将这个问题形式化。这涵盖了简单逻辑推理、地理导航、游戏博弈和化学等各种各样的问题。受语言理论中经典的Myhill-Nerode定理的启发,我们提出了用于评估世界模型恢复程度的新指标。我们在三个领域说明了它们的用途:游戏博弈、逻辑谜题和导航。在所有领域中,我们考察的生成模型在现有的世界模型评估诊断方法上都表现良好,但我们的评估指标显示,它们的世界模型远没有看起来那么连贯。这种不连贯造成了脆弱性:使用生成模型来解决相关但略有不同的任务可能会导致它严重失败。构建能够有意义地捕捉其所建模领域的底层逻辑的生成模型将非常有价值;我们的结果提出了新的方法来评估给定模型距离这一目标有多近。
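
The intuition behind a Myhill-Nerode-style check can be sketched in a few lines: if two prefixes drive the true automaton into the same state, a model with a coherent world model should treat them identically. The code below is a simplified, one-step illustration of that intuition and not the paper's metrics; the toy DFA, the function names, and the deliberately flawed `length_biased_model` are all assumptions.

```python
# Hypothetical toy DFA: a bracket that must be closed before it is reopened.
TRANSITIONS = {("closed", "("): "open", ("open", ")"): "closed"}


def run(prefix, start="closed"):
    """Return the DFA state reached after reading the prefix."""
    state = start
    for tok in prefix:
        state = TRANSITIONS[(state, tok)]
    return state


def valid_next(state):
    """Tokens the DFA allows from the given state."""
    return {tok for (s, tok) in TRANSITIONS if s == state}


def consistent_on_pair(model_next, prefix_a, prefix_b):
    """Same DFA state should imply the same set of allowed next tokens."""
    if run(prefix_a) != run(prefix_b):
        return True  # the check only constrains same-state prefixes
    return model_next(prefix_a) == model_next(prefix_b)


def good_model(prefix):
    # Idealised model that tracks the true state.
    return valid_next(run(prefix))


def length_biased_model(prefix):
    # Flawed model: correct on short prefixes, permissive on longer ones.
    return valid_next(run(prefix)) if len(prefix) < 4 else {"(", ")"}


ok_good = consistent_on_pair(good_model, "()", "()()")
ok_flawed = consistent_on_pair(length_biased_model, "()", "()()")
```

Both prefixes end in the `closed` state, so the flawed model is caught because it answers differently for them. Comparing only single next tokens is weaker than comparing full sets of accepted continuations, which is what equivalence in the Myhill-Nerode sense refers to.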

[NLP-75] Linguistically Conditioned Semantic Textual Similarity
[NLP-75] 语言条件语义文本相似性

链接: https://arxiv.org/abs/2406.03673
作者: Jingxuan Tu,Keer Xu,Liulu Yue,Bingyang Ye,Kyeongmin Rim,James Pustejovsky
关键词: Semantic textual similarity, fundamental NLP task, called Conditional STS, fundamental NLP, Semantic textual
中文关键词: 语义文本相似性,基本NLP任务,称为条件STS,基本NLP,语义文本
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: To appear in the ACL 2024 main proceedings

Abstract:Semantic textual similarity (STS) is a fundamental NLP task that measures the semantic similarity between a pair of sentences. In order to reduce the inherent ambiguity posed from the sentences, a recent work called Conditional STS (C-STS) has been proposed to measure the sentences’ similarity conditioned on a certain aspect. Despite the popularity of C-STS, we find that the current C-STS dataset suffers from various issues that could impede proper evaluation on this task. In this paper, we reannotate the C-STS validation set and observe an annotator discrepancy on 55% of the instances resulting from the annotation errors in the original label, ill-defined conditions, and the lack of clarity in the task definition. After a thorough dataset analysis, we improve the C-STS task by leveraging the models’ capability to understand the conditions under a QA task setting. With the generated answers, we present an automatic error identification pipeline that is able to identify annotation errors from the C-STS data with over 80% F1 score. We also propose a new method that largely improves the performance over baselines on the C-STS data by training the models with the answers. Finally we discuss the conditionality annotation based on the typed-feature structure (TFS) of entity types. We show in examples that the TFS is able to provide a linguistic foundation for constructing C-STS data with new conditions.
摘要:语义文本相似度(STS)是一项基本的自然语言处理任务,用于衡量一对句子之间的语义相似度。为了减少句子本身带来的歧义,最近有一项称为条件STS(C-STS)的工作被提出,用于度量句子在某一特定方面条件下的相似度。尽管C-STS很受欢迎,但我们发现当前的C-STS数据集存在各种问题,这些问题可能会妨碍对这项任务的恰当评估。在本文中,我们重新标注了C-STS验证集,并观察到55%的实例存在标注者分歧,其原因包括原始标签中的标注错误、条件定义不明确以及任务定义不清晰。在对数据集进行全面分析之后,我们利用模型在问答(QA)任务设置下理解条件的能力来改进C-STS任务。借助生成的答案,我们提出了一个自动错误识别流程,能够以超过80%的F1分数从C-STS数据中识别出标注错误。我们还提出了一种新方法,通过用这些答案训练模型,大幅提升了在C-STS数据上相对于基线的性能。最后,我们讨论了基于实体类型的类型化特征结构(TFS)的条件性标注。我们通过示例表明,TFS能够为构造带有新条件的C-STS数据提供语言学基础。

[NLP-76] What Makes Language Models Good-enough?
[NLP-76] 是什么让语言模型足够好?

链接: https://arxiv.org/abs/2406.03666
作者: Daiki Asami,Saku Sugawara
关键词: good-enough language processing, language processing, task at hand, humans may build, build a representation
中文关键词: 足够好的语言处理,语言处理,手头的任务,人类可以建立,建立一个表示
类目: Computation and Language (cs.CL)
备注: To appear in Findings of ACL2024

Abstract:Psycholinguistic research suggests that humans may build a representation of linguistic input that is ‘good-enough’ for the task at hand. This study examines what architectural features make language models learn human-like good-enough language processing. We focus on the number of layers and self-attention heads in Transformers. We create a good-enough language processing (GELP) evaluation dataset (7,680 examples), which is designed to test the effects of two plausibility types, eight construction types, and three degrees of memory cost on language processing. To annotate GELP, we first conduct a crowdsourcing experiment whose design follows prior psycholinguistic studies. Our model evaluation against the annotated GELP then reveals that the full model as well as models with fewer layers and/or self-attention heads exhibit a good-enough performance. This result suggests that models with shallower depth and fewer heads can learn good-enough language processing.
摘要:心理语言学研究表明,人类可能会为语言输入建立一种对手头任务而言“足够好”的表示。本研究探讨哪些架构特征能使语言模型学到类似人类的“足够好”的语言处理。我们关注Transformer中的层数和自注意力头的数量。我们创建了一个足够好的语言处理(GELP)评估数据集(7,680个示例),旨在测试两种合理性类型、八种句式结构类型和三种程度的记忆成本对语言处理的影响。为了标注GELP,我们首先进行了一项众包实验,其设计遵循了先前的心理语言学研究。随后,我们在标注后的GELP上进行模型评估,结果显示,完整模型以及层数和/或自注意力头更少的模型都表现出足够好的性能。这一结果表明,深度更浅、注意力头更少的模型也可以学到足够好的语言处理。

[NLP-77] Is Free Self-Alignment Possible?
[NLP-77] 零成本的自对齐可能吗?

链接: https://arxiv.org/abs/2406.03642
作者: Dyah Adila,Changho Shin,Yijing Zhang,Frederic Sala
关键词: Aligning pretrained language, Aligning pretrained, resource-intensive process, substantial compute, complex and resource-intensive
中文关键词: 调整预训练语言、调整预训练、资源密集型流程、大量计算、复杂且资源密集型
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

Abstract:Aligning pretrained language models (LMs) is a complex and resource-intensive process, often requiring access to large amounts of ground-truth preference data and substantial compute. Are these costs necessary? That is, is it possible to align using only inherent model knowledge and without additional training? We tackle this challenge with AlignEZ, a novel approach that uses (1) self-generated preference data and (2) representation editing to provide nearly cost-free alignment. During inference, AlignEZ modifies LM representations to reduce undesirable and boost desirable components using subspaces identified via self-generated preference pairs. Our experiments reveal that this nearly cost-free procedure significantly narrows the gap between base pretrained and tuned models by an average of 31.6%, observed across six datasets and three model architectures. Additionally, we explore the potential of using AlignEZ as a means of expediting more expensive alignment procedures. Our experiments show that AlignEZ improves DPO models tuned only using a small subset of ground-truth preference data. Lastly, we study the conditions under which improvement using AlignEZ is feasible, providing valuable insights into its effectiveness.
摘要:对齐预训练语言模型(LM)是一个复杂且资源密集的过程,通常需要大量的真实偏好数据和可观的算力。这些成本是必要的吗?也就是说,能否仅利用模型固有的知识、不经额外训练就完成对齐?我们用AlignEZ来应对这一挑战,这是一种新方法,它使用(1)自生成的偏好数据和(2)表示编辑来实现几乎零成本的对齐。在推理过程中,AlignEZ利用通过自生成偏好对识别出的子空间来修改LM的表示,以削弱不期望的分量并增强期望的分量。我们的实验表明,这一几乎零成本的过程将基础预训练模型与经过调优的模型之间的差距平均缩小了31.6%,该结果是在六个数据集和三种模型架构上观察到的。此外,我们还探讨了将AlignEZ用于加速成本更高的对齐流程的潜力。我们的实验表明,AlignEZ能够改进仅用一小部分真实偏好数据调优的DPO模型。最后,我们研究了AlignEZ能够带来改进的条件,为其有效性提供了有价值的见解。
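
The "representation editing" idea in the abstract above (reduce undesirable components, boost desirable ones) can be illustrated with a small vector sketch. This is only a toy illustration under stated assumptions, not the AlignEZ implementation: the function name `edit_representation`, the use of a single direction per subspace, and the `boost` factor are all hypothetical choices made for the example.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale v to unit length (assumes v is not the zero vector)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def edit_representation(h, undesirable, desirable, boost=1.0):
    """Toy edit of a hidden vector h (hypothetical helper, not from the paper).

    Step 1: project out the component of h along `undesirable`.
    Step 2: amplify what remains along `desirable` by `boost`.
    If the two directions are not orthogonal, step 2 can reintroduce
    part of the undesirable component; this sketch does not handle that.
    """
    u = normalize(undesirable)
    d = normalize(desirable)
    along_u = dot(h, u)
    h = [x - along_u * ui for x, ui in zip(h, u)]
    along_d = dot(h, d)
    return [x + boost * along_d * di for x, di in zip(h, d)]


if __name__ == "__main__":
    hidden = [1.0, 2.0, 3.0]
    print(edit_representation(hidden, [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], boost=0.5))
```

In the paper the directions are said to come from self-generated preference pairs; here they are simply passed in by hand.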

[NLP-78] TACT: Advancing Complex Aggregative Reasoning with Information Extraction Tools
[NLP-78] TACT:使用信息提取工具推进复杂的聚合推理

链接: https://arxiv.org/abs/2406.03618
作者: Avi Caciularu,Alon Jacovi,Eyal Ben-David,Sasha Goldshtein,Tal Schuster,Jonathan Herzig,Gal Elidan,Amir Globerson
关键词: Large Language Models, Large Language, Language Models, require the aggregation, Large
中文关键词: 大型语言模型,大型语言,语言模型,需要聚合,大型
类目: Computation and Language (cs.CL)
备注: Website ( this https URL ), Huggingface ( this https URL )

Abstract:Large Language Models (LLMs) often do not perform well on queries that require the aggregation of information across texts. To better evaluate this setting and facilitate modeling efforts, we introduce TACT - Text And Calculations through Tables, a dataset crafted to evaluate LLMs’ reasoning and computational abilities using complex instructions. TACT contains challenging instructions that demand stitching information scattered across one or more texts, and performing complex integration on this information to generate the answer. We construct this dataset by leveraging an existing dataset of texts and their associated tables. For each such table, we formulate new queries, and gather their respective answers. We demonstrate that all contemporary LLMs perform poorly on this dataset, achieving an accuracy below 38%. To pinpoint the difficulties and thoroughly dissect the problem, we analyze model performance across three components: table-generation, Pandas command-generation, and execution. Unexpectedly, we discover that each component presents substantial challenges for current LLMs. These insights lead us to propose a focused modeling framework, which we refer to as IE as a tool. Specifically, we propose to add “tools” for each of the above steps, and implement each such tool with few-shot prompting. This approach shows an improvement over existing prompting techniques, offering a promising direction for enhancing model capabilities in these tasks.
摘要:大型语言模型(LLM)在需要跨文本聚合信息的查询上往往表现不佳。为了更好地评估这一场景并促进建模工作,我们提出了TACT(Text And Calculations through Tables),这是一个精心构建的数据集,用于通过复杂指令评估LLM的推理和计算能力。TACT包含具有挑战性的指令,要求将分散在一个或多个文本中的信息拼接起来,并对这些信息进行复杂的整合以生成答案。我们利用一个已有的文本及其关联表格的数据集来构建该数据集。对于每个这样的表格,我们设计了新的查询并收集了相应的答案。我们证明,所有当前的LLM在这个数据集上表现都很差,准确率低于38%。为了找出难点并彻底剖析问题,我们从三个组成部分分析模型性能:表格生成、Pandas命令生成和执行。出乎意料的是,我们发现每个部分都给当前的LLM带来了巨大的挑战。这些发现促使我们提出一个有针对性的建模框架,我们称之为“IE as a tool”(将信息抽取作为工具)。具体而言,我们建议为上述每个步骤添加“工具”,并通过少样本提示来实现每个工具。这种方法相比现有的提示技术有所改进,为增强模型在这些任务中的能力提供了一个有前景的方向。

[NLP-79] Advancing Anomaly Detection: Non-Semantic Financial Data Encoding with LLMs
[NLP-79] 推进异常检测:使用LLM进行非语义金融数据编码

链接: https://arxiv.org/abs/2406.03614
作者: Alexander Bakumenko(1),Kateřina Hlaváčková-Schindler(2),Claudia Plant(2),Nina C. Hubig(1) ((1) Clemson University, USA, (2) University of Vienna, Austria)
关键词: Detecting anomalies, utmost importance, importance to ensure, ensure trustworthiness, machine learning
中文关键词: 检测异常,最重要,确保的重要性,确保可信度,机器学习
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Risk Management (q-fin.RM)
备注:

Abstract:Detecting anomalies in general ledger data is of utmost importance to ensure trustworthiness of financial records. Financial audits increasingly rely on machine learning (ML) algorithms to identify irregular or potentially fraudulent journal entries, each characterized by a varying number of transactions. In machine learning, heterogeneity in feature dimensions adds significant complexity to data analysis. In this paper, we introduce a novel approach to anomaly detection in financial data using Large Language Models (LLMs) embeddings. To encode non-semantic categorical data from real-world financial records, we tested 3 pre-trained general purpose sentence-transformer models. For the downstream classification task, we implemented and evaluated 5 optimized ML models including Logistic Regression, Random Forest, Gradient Boosting Machines, Support Vector Machines, and Neural Networks. Our experiments demonstrate that LLMs contribute valuable information to anomaly detection as our models outperform the baselines, in selected settings even by a large margin. The findings further underscore the effectiveness of LLMs in enhancing anomaly detection in financial journal entries, particularly by tackling feature sparsity. We discuss a promising perspective on using LLM embeddings for non-semantic data in the financial context and beyond.
摘要:检测总账数据中的异常对于确保财务记录的可信性至关重要。财务审计越来越依赖机器学习(ML)算法来识别不规则或潜在欺诈的日记账分录,每条分录包含的交易数量各不相同。在机器学习中,特征维度的异构性大大增加了数据分析的复杂性。本文提出了一种利用大型语言模型(LLM)嵌入进行金融数据异常检测的新方法。为了对真实财务记录中的非语义类别数据进行编码,我们测试了3个预训练的通用sentence-transformer模型。对于下游分类任务,我们实现并评估了5个经过优化的ML模型,包括逻辑回归、随机森林、梯度提升机、支持向量机和神经网络。我们的实验表明,LLM为异常检测提供了有价值的信息:我们的模型优于基线,在部分设置下甚至大幅领先。这些发现进一步凸显了LLM在增强日记账分录异常检测方面的有效性,尤其是在应对特征稀疏性方面。我们讨论了在金融领域及其他领域将LLM嵌入用于非语义数据的前景。

[NLP-80] Knowledge-Infused Legal Wisdom: Navigating LLM Consultation through the Lens of Diagnostics and Positive-Unlabeled Reinforcement Learning
[NLP-80] 知识注入的法律智慧:从诊断和正-无标签强化学习的视角看LLM咨询

链接: https://arxiv.org/abs/2406.03600
作者: Yang Wu,Chenghao Wang,Ece Gumusel,Xiaozhong Liu
关键词: Large Language Models, generative Large Language, Legal Large Language, Large Language, Language Models
中文关键词: 大型语言模型、生成式大型语言、法律大型语言、大型语言、语言模型
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted by ACL Findings 2024

Abstract:The integration of generative Large Language Models (LLMs) into various applications, including the legal domain, has been accelerated by their expansive and versatile nature. However, when facing a legal case, users without a legal background often struggle to formulate professional queries and may inadvertently overlook critical legal factors when presenting their case narrative to LLMs. To address this issue, we propose the Diagnostic Legal Large Language Model (D3LM), which utilizes adaptive lawyer-like diagnostic questions to collect additional case information and then provides high-quality feedback. D3LM incorporates an innovative graph-based Positive-Unlabeled Reinforcement Learning (PURL) algorithm, enabling the generation of critical questions and enhancing user-LLM interactions. Moreover, an integrated LLM-based stopping criterion facilitates precise Court Views Generation (CVG). Our research also introduces a new English-language CVG dataset based on the US case law database, enriching the realm of LLM research and deployment with a vital dimension. D3LM surpasses classical LLMs by delivering outstanding performance and a remarkable user experience in the legal domain.
摘要:生成式大型语言模型(LLM)凭借其广泛而通用的特性,正加速融入包括法律领域在内的各种应用。然而,在面对法律案件时,没有法律背景的用户往往难以提出专业的问题,并且在向LLM陈述案情时可能会无意中忽略关键的法律因素。为了解决这一问题,我们提出了诊断式法律大语言模型(D3LM),它利用自适应的、类似律师的诊断性问题来收集额外的案件信息,然后提供高质量的反馈。D3LM引入了一种创新的基于图的正-无标签强化学习(PURL)算法,能够生成关键问题并增强用户与LLM的交互。此外,一个集成的基于LLM的停止准则有助于实现精确的法院观点生成(CVG)。我们的研究还引入了一个基于美国判例法数据库的新的英文CVG数据集,为LLM的研究和部署增添了一个重要维度。D3LM在法律领域表现出色并带来了良好的用户体验,超越了经典的LLM。

[NLP-81] Measuring Retrieval Complexity in Question Answering Systems
[NLP-81] 测量问题解答系统中的检索复杂性

链接: https://arxiv.org/abs/2406.03592
作者: Matteo Gabburo,Nicolaas Paul Jedema,Siddhant Garg,Leonardo F. R. Ribeiro,Alessandro Moschitti
关键词: retrieval-based Question Answering, propose retrieval complexity, questions, Answering, Question Answering
中文关键词: 基于检索的问题解答,提出检索复杂性,问题,解答,问题解答
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to ACL 2024 (findings)

Abstract:In this paper, we investigate which questions are challenging for retrieval-based Question Answering (QA). We (i) propose retrieval complexity (RC), a novel metric conditioned on the completeness of retrieved documents, which measures the difficulty of answering questions, and (ii) propose an unsupervised pipeline to measure RC given an arbitrary retrieval system. Our proposed pipeline measures RC more accurately than alternative estimators, including LLMs, on six challenging QA benchmarks. Further investigation reveals that RC scores strongly correlate with both QA performance and expert judgment across five of the six studied benchmarks, indicating that RC is an effective measure of question difficulty. Subsequent categorization of high-RC questions shows that they span a broad set of question shapes, including multi-hop, compositional, and temporal QA, indicating that RC scores can categorize a new subset of complex questions. Our system can also have a major impact on retrieval-based systems by helping to identify more challenging questions on existing datasets.
摘要:在本文中,我们研究哪些问题对基于检索的问答(QA)具有挑战性。我们(i)提出了检索复杂度(RC),这是一种以检索到的文档的完备性为条件的新指标,用于衡量回答问题的难度;(ii)提出了一条无监督流水线,可在给定任意检索系统的情况下度量RC。在六个具有挑战性的QA基准上,我们提出的流水线比包括LLM在内的其他估计方法更准确地度量RC。进一步的研究发现,在六个基准中的五个上,RC分数与QA性能和专家判断都高度相关,表明RC是衡量问题难度的有效指标。随后对高RC问题的分类表明,它们涵盖了多种问题形式,包括多跳、组合式和时间相关的QA,这表明RC分数可以归类出一个新的复杂问题子集。我们的系统还可以帮助在现有数据集上识别更具挑战性的问题,从而对基于检索的系统产生重要影响。

[NLP-82] Ranking Manipulation for Conversational Search Engines
[NLP-82] 对话式搜索引擎的排名操纵

链接: https://arxiv.org/abs/2406.03589
作者: Samuel Pfrommer,Yatong Bai,Tanmay Gautam,Somayeh Sojoudi
关键词: Large Language Model, incorporating Large Language, rapidly incorporating Large, Language Model, Large Language
中文关键词: 大型语言模型,合并大型语言,快速合并大型、语言模型、大型语言
类目: Computation and Language (cs.CL)
备注:

Abstract:Major search engine providers are rapidly incorporating Large Language Model (LLM)-generated content in response to user queries. These conversational search engines operate by loading retrieved website text into the LLM context for summarization and interpretation. Recent research demonstrates that LLMs are highly vulnerable to jailbreaking and prompt injection attacks, which disrupt the safety and quality goals of LLMs using adversarial strings. This work investigates the impact of prompt injections on the ranking order of sources referenced by conversational search engines. To this end, we introduce a focused dataset of real-world consumer product websites and formalize conversational search ranking as an adversarial problem. Experimentally, we analyze conversational search rankings in the absence of adversarial injections and show that different LLMs vary significantly in prioritizing product name, document content, and context position. We then present a tree-of-attacks-based jailbreaking technique which reliably promotes low-ranked products. Importantly, these attacks transfer effectively to state-of-the-art conversational search engines such as this http URL. Given the strong financial incentive for website owners to boost their search ranking, we argue that our problem formulation is of critical importance for future robustness work.
摘要:各大搜索引擎提供商正在迅速引入大型语言模型(LLM)生成的内容来响应用户查询。这些对话式搜索引擎的工作方式是将检索到的网站文本加载到LLM上下文中,以进行摘要和解读。最近的研究表明,LLM极易受到越狱和提示注入攻击,这些攻击利用对抗性字符串破坏LLM的安全和质量目标。本文研究提示注入对对话式搜索引擎所引用来源的排序的影响。为此,我们引入了一个聚焦于真实世界消费品网站的数据集,并将对话式搜索排序形式化为一个对抗性问题。在实验中,我们分析了没有对抗性注入时的对话式搜索排序,结果表明不同的LLM在对产品名称、文档内容和上下文位置的偏重上存在显著差异。然后,我们提出了一种基于攻击树的越狱技术,能够可靠地提升低排名产品的排名。重要的是,这些攻击能够有效迁移到最先进的对话式搜索引擎,例如 this http URL。考虑到网站所有者有强烈的经济动机来提高其搜索排名,我们认为我们的问题表述对未来的鲁棒性研究至关重要。

[NLP-83] LLMs Meet Multimodal Generation and Editing: A Survey
[NLP-83] LLM遇上多模态生成与编辑:综述

链接: https://arxiv.org/abs/2405.19334
作者: Yingqing He,Zhaoyang Liu,Jingye Chen,Zeyue Tian,Hongyu Liu,Xiaowei Chi,Runtao Liu,Ruibin Yuan,Yazhou Xing,Wenhai Wang,Jifeng Dai,Yong Zhang,Wei Xue,Qifeng Liu,Yike Guo,Qifeng Chen
关键词: large language models, large language, combining LLMs, growing interest, interest in combining
中文关键词: 大型语言模型,大型语言,结合LLM,日益增长的兴趣,对结合的兴趣
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD)
备注: 51 Pages with 16 Figures, 12 Tables, and 534 References. GitHub Repository at: this https URL

Abstract:With the recent advancement in large language models (LLMs), there is a growing interest in combining LLMs with multimodal learning. Previous surveys of multimodal large language models (MLLMs) mainly focus on understanding. This survey elaborates on multimodal generation across different domains, including image, video, 3D, and audio, where we highlight the notable advancements with milestone works in these fields. Specifically, we exhaustively investigate the key technical components behind methods and multimodal datasets utilized in these studies. Moreover, we dig into tool-augmented multimodal agents that can use existing generative models for human-computer interaction. Lastly, we also comprehensively discuss the advancement in AI safety and investigate emerging applications as well as future prospects. Our work provides a systematic and insightful overview of multimodal generation, which is expected to advance the development of Artificial Intelligence for Generative Content (AIGC) and world models. A curated list of all related papers can be found at this https URL
摘要:随着大型语言模型(LLM)的最新进展,将LLM与多模态学习相结合越来越受到关注。以往关于多模态大型语言模型(MLLM)的综述主要集中在理解方面。本综述详细阐述了不同领域的多模态生成,包括图像、视频、3D和音频,并重点介绍了这些领域中里程碑式工作的显著进展。具体而言,我们详尽地考察了这些研究所用方法和多模态数据集背后的关键技术组件。此外,我们深入探讨了工具增强的多模态智能体,它们可以利用现有的生成模型进行人机交互。最后,我们还全面讨论了人工智能安全方面的进展,并探讨了新兴应用及未来前景。我们的工作对多模态生成提供了系统而深入的概述,有望推动人工智能生成内容(AIGC)和世界模型的发展。所有相关论文的精选列表可在 this https URL 找到

[NLP-84] Style Mixture of Experts for Expressive Text-To-Speech Synthesis
[NLP-84] 表达性文本到语音合成的专家风格混合

链接: https://arxiv.org/abs/2406.03637
作者: Ahad Jawaid,Shreeram Suresh Chandra,Junchen Lu,Berrak Sisman
关键词: Recent advances, style transfer TTS, style, improved the expressiveness, expressiveness of synthesized
中文关键词: 最新进展,风格迁移TTS,风格,提高了表现力,合成的表现力
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
备注:

Abstract:Recent advances in style transfer text-to-speech (TTS) have improved the expressiveness of synthesized speech. Despite these advancements, encoding stylistic information from diverse and unseen reference speech remains challenging. This paper introduces StyleMoE, an approach that divides the embedding space, modeled by the style encoder, into tractable subsets handled by style experts. The proposed method replaces the style encoder in a TTS system with a Mixture of Experts (MoE) layer. By utilizing a gating network to route reference speeches to different style experts, each expert specializes in aspects of the style space during optimization. Our experiments objectively and subjectively demonstrate the effectiveness of our proposed method in increasing the coverage of the style space for diverse and unseen styles. This approach can enhance the performance of existing state-of-the-art style transfer TTS models, marking the first study of MoE in style transfer TTS to our knowledge.
摘要:风格迁移文本转语音(TTS)的最新进展提高了合成语音的表现力。尽管取得了这些进展,从多样且未见过的参考语音中编码风格信息仍然具有挑战性。本文介绍了StyleMoE,这种方法将由风格编码器建模的嵌入空间划分为若干易于处理的子集,分别交由风格专家处理。该方法用混合专家(MoE)层替换TTS系统中的风格编码器。通过门控网络将参考语音路由到不同的风格专家,每个专家在优化过程中专门负责风格空间的某些方面。我们的客观和主观实验证明,所提方法能够有效提高风格空间对多样且未见过的风格的覆盖率。这种方法可以提升现有最先进的风格迁移TTS模型的性能,据我们所知,这是首个在风格迁移TTS中研究MoE的工作。
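
The gating-plus-experts structure described in the abstract above can be sketched generically. The following is a minimal dense Mixture of Experts layer in plain Python, assuming linear experts and a linear softmax gate; the names `moe_layer`, `gate_weights`, and `expert_weights` are hypothetical, and nothing here reflects the actual StyleMoE architecture, which may for instance use sparse routing.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def moe_layer(x, gate_weights, expert_weights):
    """Dense mixture of experts (illustrative sketch only).

    x              : input embedding, e.g. a reference-speech feature vector
    gate_weights   : one row per expert; the gate scores are gate_weights @ x
    expert_weights : one weight matrix per expert (linear experts assumed)
    Returns the gate-weighted sum of expert outputs and the gate weights.
    """
    gates = softmax(matvec(gate_weights, x))
    outputs = [matvec(w, x) for w in expert_weights]
    dim = len(outputs[0])
    mixed = [sum(g * out[j] for g, out in zip(gates, outputs)) for j in range(dim)]
    return mixed, gates


if __name__ == "__main__":
    x = [1.0, 0.0]
    gate_weights = [[0.0, 0.0], [0.0, 0.0]]  # uniform gate for the demo
    expert_weights = [
        [[1.0, 0.0], [0.0, 1.0]],  # expert 0: identity
        [[3.0, 0.0], [0.0, 3.0]],  # expert 1: scales by 3
    ]
    print(moe_layer(x, gate_weights, expert_weights))
```

With a trained gate, different inputs would put most of their weight on different experts, which is the sense in which each expert can specialize in part of the input space.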

计算机视觉

[CV-0] Stereo-Depth Fusion through Virtual Pattern Projection

链接: https://arxiv.org/abs/2406.04345
作者: Luca Bartolomei,Matteo Poggi,Fabio Tosi,Andrea Conti,Stefano Mattoccia
关键词: depth data fusion, unreliable physical pattern, physical pattern projector, data fusion paradigm, active stereo principle
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: extended version of ICCV 2023: “Active Stereo Without Pattern Projector”

Abstract:This paper presents a novel general-purpose stereo and depth data fusion paradigm that mimics the active stereo principle by replacing the unreliable physical pattern projector with a depth sensor. It works by projecting virtual patterns consistent with the scene geometry onto the left and right images acquired by a conventional stereo camera, using the sparse hints obtained from a depth sensor, to facilitate the visual correspondence. Purposely, any depth sensing device can be seamlessly plugged into our framework, enabling the deployment of a virtual active stereo setup in any possible environment and overcoming the severe limitations of physical pattern projection, such as the limited working range and environmental conditions. Exhaustive experiments on indoor and outdoor datasets featuring both long and close range, including those providing raw, unfiltered depth hints from off-the-shelf depth sensors, highlight the effectiveness of our approach in notably boosting the robustness and accuracy of algorithms and deep stereo without any code modification and even without re-training. Additionally, we assess the performance of our strategy on active stereo evaluation datasets with conventional pattern projection. Indeed, in all these scenarios, our virtual pattern projection paradigm achieves state-of-the-art performance. The source code is available at: this https URL.

[CV-1] Verbalized Machine Learning: Revisiting Machine Learning with Language Models

链接: https://arxiv.org/abs/2406.04344
作者: Tim Z. Xiao,Robert Bamler,Bernhard Schölkopf,Weiyang Liu
关键词: large progress made, large language models, machine learning models, machine learning, large progress
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: Technical Report v1 (92 pages, 15 figures)

Abstract:Motivated by the large progress made by large language models (LLMs), we introduce the framework of verbalized machine learning (VML). In contrast to conventional machine learning models that are typically optimized over a continuous parameter space, VML constrains the parameter space to be human-interpretable natural language. Such a constraint leads to a new perspective of function approximation, where an LLM with a text prompt can be viewed as a function parameterized by the text prompt. Guided by this perspective, we revisit classical machine learning problems, such as regression and classification, and find that these problems can be solved by an LLM-parameterized learner and optimizer. The major advantages of VML include (1) easy encoding of inductive bias: prior knowledge about the problem and hypothesis class can be encoded in natural language and fed into the LLM-parameterized learner; (2) automatic model class selection: the optimizer can automatically select a concrete model class based on data and verbalized prior knowledge, and it can update the model class during training; and (3) interpretable learner updates: the LLM-parameterized optimizer can provide explanations for why each learner update is performed. We conduct several studies to empirically evaluate the effectiveness of VML, and hope that VML can serve as a stepping stone to stronger interpretability and trustworthiness in ML.

[CV-2] Flash3D: Feed-Forward Generalisable 3D Scene Reconstruction from a Single Image

链接: https://arxiv.org/abs/2406.04343
作者: Stanislaw Szymanowicz,Eldar Insafutdinov,Chuanxia Zheng,Dylan Campbell,João F. Henriques,Christian Rupprecht,Andrea Vedaldi
关键词: feed-forward Gaussian Splatting, Gaussian Splatting, single image, scene reconstruction, view synthesis
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

Abstract:In this paper, we propose Flash3D, a method for scene reconstruction and novel view synthesis from a single image which is both very generalisable and efficient. For generalisability, we start from a “foundation” model for monocular depth estimation and extend it to a full 3D shape and appearance reconstructor. For efficiency, we base this extension on feed-forward Gaussian Splatting. Specifically, we predict a first layer of 3D Gaussians at the predicted depth, and then add additional layers of Gaussians that are offset in space, allowing the model to complete the reconstruction behind occlusions and truncations. Flash3D is very efficient, trainable on a single GPU in a day, and thus accessible to most researchers. It achieves state-of-the-art results when trained and tested on RealEstate10k. When transferred to unseen datasets like NYU it outperforms competitors by a large margin. More impressively, when transferred to KITTI, Flash3D achieves better PSNR than methods trained specifically on that dataset. In some instances, it even outperforms recent methods that use multiple views as input. Code, models, demo, and more results are available at this https URL.

[CV-3] Learning 1D Causal Visual Representation with De-focus Attention Networks

链接: https://arxiv.org/abs/2406.04342
作者: Chenxin Tao,Xizhou Zhu,Shiqian Su,Lewei Lu,Changyao Tian,Xuan Luo,Gao Huang,Hongsheng Li,Yu Qiao,Jie Zhou,Jifeng Dai
关键词: Modality differences, differences have led, development of heterogeneous, heterogeneous architectures, causal modeling
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

Abstract:Modality differences have led to the development of heterogeneous architectures for vision and language models. While images typically require 2D non-causal modeling, texts utilize 1D causal modeling. This distinction poses significant challenges in constructing unified multi-modal models. This paper explores the feasibility of representing images using 1D causal modeling. We identify an “over-focus” issue in existing 1D causal vision models, where attention overly concentrates on a small proportion of visual tokens. The issue of “over-focus” hinders the model’s ability to extract diverse visual features and to receive effective gradients for optimization. To address this, we propose De-focus Attention Networks, which employ learnable bandpass filters to create varied attention patterns. During training, large and scheduled drop path rates, and an auxiliary loss on globally pooled features for global understanding tasks are introduced. These two strategies encourage the model to attend to a broader range of tokens and enhance network optimization. Extensive experiments validate the efficacy of our approach, demonstrating that 1D causal visual representation can perform comparably to 2D non-causal representation in tasks such as global perception, dense prediction, and multi-modal understanding. Code is released at this https URL.

[CV-4] Interpreting the Second-Order Effects of Neurons in CLIP

链接: https://arxiv.org/abs/2406.04341
作者: Yossi Gandelsman,Alexei A. Efros,Jacob Steinhardt
关键词: automatically describing, CLIP, neuron, individual neurons, CLIP by automatically
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: project page: this https URL

Abstract:We interpret the function of individual neurons in CLIP by automatically describing them using text. Analyzing the direct effects (i.e. the flow from a neuron through the residual stream to the output) or the indirect effects (overall contribution) fails to capture the neurons’ function in CLIP. Therefore, we present the “second-order lens”, analyzing the effect flowing from a neuron through the later attention heads, directly to the output. We find that these effects are highly selective: for each neuron, the effect is significant for 2% of the images. Moreover, each effect can be approximated by a single direction in the text-image space of CLIP. We describe neurons by decomposing these directions into sparse sets of text representations. The sets reveal polysemantic behavior - each neuron corresponds to multiple, often unrelated, concepts (e.g. ships and cars). Exploiting this neuron polysemy, we mass-produce “semantic” adversarial examples by generating images with concepts spuriously correlated to the incorrect class. Additionally, we use the second-order effects for zero-shot segmentation and attribute discovery in images. Our results indicate that a scalable understanding of neurons can be used for model deception and for introducing new model capabilities.

[CV-5] GLACE: Global Local Accelerated Coordinate Encoding

链接: https://arxiv.org/abs/2406.04340
作者: Fangjinhua Wang,Xudong Jiang,Silvano Galliani,Christoph Vogel,Marc Pollefeys
关键词: camera pose estimation, Scene coordinate regression, visual localization methods, coordinate regression, directly regress
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Large-scale visual localization with a single optimizable MLP. CVPR 2024. Code: this https URL . Project page: this https URL

Abstract:Scene coordinate regression (SCR) methods are a family of visual localization methods that directly regress 2D-3D matches for camera pose estimation. They are effective in small-scale scenes but face significant challenges in large-scale scenes that are further amplified in the absence of ground truth 3D point clouds for supervision. Here, the model can only rely on reprojection constraints and needs to implicitly triangulate the points. The challenges stem from a fundamental dilemma: The network has to be invariant to observations of the same landmark at different viewpoints and lighting conditions, etc., but at the same time discriminate unrelated but similar observations. The latter becomes more relevant and severe in larger scenes. In this work, we tackle this problem by introducing the concept of co-visibility to the network. We propose GLACE, which integrates pre-trained global and local encodings and enables SCR to scale to large scenes with only a single small-sized network. Specifically, we propose a novel feature diffusion technique that implicitly groups the reprojection constraints with co-visibility and avoids overfitting to trivial solutions. Additionally, our position decoder parameterizes the output positions for large-scale scenes more effectively. Without using 3D models or depth maps for supervision, our method achieves state-of-the-art results on large-scale scenes with a low-map-size model. On Cambridge landmarks, with a single model, we achieve 17% lower median position error than Poker, the ensemble variant of the state-of-the-art SCR method ACE. Code is available at: this https URL.

[CV-6] RoboMamba: Multimodal State Space Model for Efficient Robot Reasoning and Manipulation

链接: https://arxiv.org/abs/2406.04339
作者: Jiaming Liu,Mengzhen Liu,Zhenyu Wang,Lily Lee,Kaichen Zhou,Pengju An,Senqiao Yang,Renrui Zhang,Yandong Guo,Shanghang Zhang
关键词: robot Multimodal Large, Multimodal Large Language, comprehend visual scenes, Large Language Models, fundamental objective
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

Abstract:A fundamental objective in robot manipulation is to enable models to comprehend visual scenes and execute actions. Although existing robot Multimodal Large Language Models (MLLMs) can handle a range of basic tasks, they still face challenges in two areas: 1) inadequate reasoning ability to tackle complex tasks, and 2) high computational costs for MLLM fine-tuning and inference. The recently proposed state space model (SSM) known as Mamba demonstrates promising capabilities in non-trivial sequence modeling with linear inference complexity. Inspired by this, we introduce RoboMamba, an end-to-end robotic MLLM that leverages the Mamba model to deliver both robotic reasoning and action capabilities, while maintaining efficient fine-tuning and inference. Specifically, we first integrate the vision encoder with Mamba, aligning visual data with language embedding through co-training, empowering our model with visual common sense and robot-related reasoning. To further equip RoboMamba with action pose prediction abilities, we explore an efficient fine-tuning strategy with a simple policy head. We find that once RoboMamba possesses sufficient reasoning capability, it can acquire manipulation skills with minimal fine-tuning parameters (0.1% of the model) and time (20 minutes). In experiments, RoboMamba demonstrates outstanding reasoning capabilities on general and robotic evaluation benchmarks. Meanwhile, our model showcases impressive pose prediction results in both simulation and real-world experiments, achieving inference speeds 7 times faster than existing robot MLLMs. Our project web page: this https URL

[CV-7] Physics3D: Learning Physical Properties of 3D Gaussians via Video Diffusion

Link: https://arxiv.org/abs/2406.04338
Authors: Fangfu Liu,Hanyang Wang,Shunyu Yao,Shengjun Zhang,Jie Zhou,Yueqi Duan
Keywords: recent years, rapid development, physical properties, physical, dynamic movements
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Notes: Project page: this https URL

Abstract:In recent years, there has been rapid development in 3D generation models, opening up new possibilities for applications such as simulating the dynamic movements of 3D objects and customizing their behaviors. However, current 3D generative models tend to focus only on surface features such as color and shape, neglecting the inherent physical properties that govern the behavior of objects in the real world. To accurately simulate physics-aligned dynamics, it is essential to predict the physical properties of materials and incorporate them into the behavior prediction process. Nonetheless, predicting the diverse materials of real-world objects is still challenging due to the complex nature of their physical attributes. In this paper, we propose Physics3D, a novel method for learning various physical properties of 3D objects through a video diffusion model. Our approach involves designing a highly generalizable physical simulation system based on a viscoelastic material model, which enables us to simulate a wide range of materials with high-fidelity capabilities. Moreover, we distill the physical priors from a video diffusion model that contains more understanding of realistic object materials. Extensive experiments demonstrate the effectiveness of our method with both elastic and plastic materials. Physics3D shows great potential for bridging the gap between the physical world and virtual neural space, providing a better integration and application of realistic physical principles in virtual environments. Project page: this https URL.

[CV-8] Coherent Zero-Shot Visual Instruction Generation

Link: https://arxiv.org/abs/2406.04337
Authors: Quynh Phung,Songwei Ge,Jia-Bin Huang
Keywords: require consistent representation, smooth state transitions, sequential steps remains, generating visual instructions, formidable challenge
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: this https URL

Abstract:Despite the advances in text-to-image synthesis, particularly with diffusion models, generating visual instructions that require consistent representation and smooth state transitions of objects across sequential steps remains a formidable challenge. This paper introduces a simple, training-free framework to tackle these issues, capitalizing on the advancements in diffusion models and large language models (LLMs). Our approach systematically integrates text comprehension and image generation to ensure visual instructions are visually appealing and maintain consistency and accuracy throughout the instruction sequence. We validate the effectiveness by testing multi-step instructions and comparing the text alignment and consistency with several baselines. Our experiments show that our approach can visualize coherent and visually pleasing instructions.

[CV-9] DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs

Link: https://arxiv.org/abs/2406.04334
Authors: Lingchen Meng,Jianwei Yang,Rui Tian,Xiyang Dai,Zuxuan Wu,Jianfeng Gao,Yu-Gang Jiang
Keywords: large multimodal models, implemented by feeding, LLM, feeding visual tokens
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Project Page: this https URL

Abstract:Most large multimodal models (LMMs) are implemented by feeding visual tokens as a sequence into the first layer of a large language model (LLM). The resulting architecture is simple but significantly increases computation and memory costs, as it has to handle a large number of additional tokens in its input layer. This paper presents a new architecture, DeepStack, for LMMs. Considering N layers in the language and vision transformer of LMMs, we stack the visual tokens into N groups and feed each group to its aligned transformer layer from bottom to top. Surprisingly, this simple method greatly enhances the power of LMMs to model interactions among visual tokens across layers but with minimal additional cost. We apply DeepStack to both language and vision transformer in LMMs, and validate the effectiveness of DeepStack LMMs with extensive empirical results. Using the same context length, our DeepStack 7B and 13B models surpass their counterparts by 2.7 and 2.9 on average across 9 benchmarks, respectively. Using only one-fifth of the context length, DeepStack closely rivals the counterparts that use the full context length. These gains are particularly pronounced on high-resolution tasks, e.g., 4.2, 11.0, and 4.0 improvements on TextVQA, DocVQA, and InfoVQA compared to LLaVA-1.5-7B, respectively. We further apply DeepStack to vision transformer layers, which brings us a similar amount of improvements, 3.8 on average compared with LLaVA-1.5-7B.
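
The stacking idea in this abstract (split the visual tokens into N groups and feed one group to each of N layers, bottom to top) can be illustrated with a toy sketch. Everything below is an assumption made for illustration: tokens are plain floats, `toy_layer` is a hypothetical stand-in for a transformer layer, and adding each group to the hidden state residually is one plausible way to "feed" a group into a layer, not something the abstract specifies.

```python
from typing import List


def split_into_groups(tokens: List[float], n_groups: int) -> List[List[float]]:
    """Split the visual tokens into n_groups contiguous groups of equal size."""
    if len(tokens) % n_groups != 0:
        raise ValueError("token count must be divisible by the number of groups")
    size = len(tokens) // n_groups
    return [tokens[i * size:(i + 1) * size] for i in range(n_groups)]


def toy_layer(hidden: List[float]) -> List[float]:
    """Hypothetical stand-in for one transformer layer (just halves each value)."""
    return [h * 0.5 for h in hidden]


def deepstack_forward(visual_tokens: List[float], n_layers: int) -> List[float]:
    """Feed one group of visual tokens into each layer, from bottom to top."""
    groups = split_into_groups(visual_tokens, n_layers)
    hidden = [0.0] * len(groups[0])
    for group in groups:
        # Assumption: the group aligned with this layer is added residually.
        hidden = [h + g for h, g in zip(hidden, group)]
        hidden = toy_layer(hidden)
    return hidden


out = deepstack_forward([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], n_layers=3)
print(len(out), out)
```

In this sketch each layer only ever sees a sequence of length 2 even though 6 visual tokens are consumed in total, which is the intuition behind the abstract's claim of using a shorter context length.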

[CV-10] BitsFusion: 1.99 bits Weight Quantization of Diffusion Model

Link: https://arxiv.org/abs/2406.04333
Authors: Yang Sui,Yanyu Li,Anil Kag,Yerlan Idelbayev,Junli Cao,Ju Hu,Dhritiman Sagar,Bo Yuan,Sergey Tulyakov,Jian Ren
Keywords: synthesizing high-quality content, Diffusion-based image generation, achieved great success, Diffusion-based image, high-quality content
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Project Page: this https URL

Abstract:Diffusion-based image generation models have achieved great success in recent years by showing the capability of synthesizing high-quality content. However, these models contain a huge number of parameters, resulting in a significantly large model size. Saving and transferring them is a major bottleneck for various applications, especially those running on resource-constrained devices. In this work, we develop a novel weight quantization method that quantizes the UNet from Stable Diffusion v1.5 to 1.99 bits, achieving a model with 7.9X smaller size while exhibiting even better generation quality than the original one. Our approach includes several novel techniques, such as assigning optimal bits to each layer, initializing the quantized model for better performance, and improving the training strategy to dramatically reduce quantization error. Furthermore, we extensively evaluate our quantized model across various benchmark datasets and through human evaluation to demonstrate its superior generation quality.

[CV-11] Coarse-To-Fine Tensor Trains for Compact Visual Representations

Link: https://arxiv.org/abs/2406.04332
Authors: Sebastian Loeschcke,Dan Wang,Christian Leth-Espensen,Serge Belongie,Michael J. Kastoryano,Sagie Benaim
Keywords: tensor train, tensor, tensor train representation, train, ability to learn
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Notes: Project webpage: this https URL

Abstract:The ability to learn compact, high-quality, and easy-to-optimize representations for visual data is paramount to many applications such as novel view synthesis and 3D reconstruction. Recent work has shown substantial success in using tensor networks to design such compact and high-quality representations. However, the ability to optimize tensor-based representations, and in particular, the highly compact tensor train representation, is still lacking. This has prevented practitioners from deploying the full potential of tensor networks for visual data. To this end, we propose ‘Prolongation Upsampling Tensor Train (PuTT)’, a novel method for learning tensor train representations in a coarse-to-fine manner. Our method involves the prolonging or ‘upsampling’ of a learned tensor train representation, creating a sequence of ‘coarse-to-fine’ tensor trains that are incrementally refined. We evaluate our representation along three axes: (1) compression, (2) denoising capability, and (3) image completion capability. To assess these axes, we consider the tasks of image fitting, 3D fitting, and novel view synthesis, where our method shows an improved performance compared to state-of-the-art tensor-based methods. For full results see our project webpage: this https URL

[CV-12] Parameter-Inverted Image Pyramid Networks

Link: https://arxiv.org/abs/2406.04330
Authors: Xizhou Zhu,Xue Yang,Zhaokai Wang,Hao Li,Wenhan Dou,Junqi Ge,Lewei Lu,Yu Qiao,Jifeng Dai
Keywords: Image Pyramid, modern computer vision, modern computer, precise understanding, Image
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Image pyramids are commonly used in modern computer vision tasks to obtain multi-scale features for precise understanding of images. However, image pyramids process multiple resolutions of images using the same large-scale model, which requires significant computational cost. To overcome this issue, we propose a novel network architecture known as the Parameter-Inverted Image Pyramid Networks (PIIP). Our core idea is to use models with different parameter sizes to process different resolution levels of the image pyramid, thereby balancing computational efficiency and performance. Specifically, the input to PIIP is a set of multi-scale images, where higher resolution images are processed by smaller networks. We further propose a feature interaction mechanism to allow features of different resolutions to complement each other and effectively integrate information from different spatial scales. Extensive experiments demonstrate that the PIIP achieves superior performance in tasks such as object detection, segmentation, and image classification, compared to traditional image pyramid methods and single-branch networks, while reducing computational cost. Notably, when applying our method on a large-scale vision foundation model InternViT-6B, we improve its performance by 1%-2% on detection and segmentation with only 40%-60% of the original computation. These results validate the effectiveness of the PIIP approach and provide a new technical direction for future vision computing tasks. Our code and models are available at this https URL.

[CV-13] ShareGPT4Video: Improving Video Understanding and Generation with Better Captions

Link: https://arxiv.org/abs/2406.04325
Authors: Lin Chen,Xilin Wei,Jinsong Li,Xiaoyi Dong,Pan Zhang,Yuhang Zang,Zehui Chen,Haodong Duan,Bin Lin,Zhenyu Tang,Li Yuan,Yu Qiao,Dahua Lin,Feng Zhao,Jiaqi Wang
Keywords: large video-language models, aiming to facilitate, large video-language, videos, video
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Project Page: this https URL

Abstract:We present the ShareGPT4Video series, aiming to facilitate the video understanding of large video-language models (LVLMs) and the video generation of text-to-video models (T2VMs) via dense and precise captions. The series comprises: 1) ShareGPT4Video, 40K GPT4V annotated dense captions of videos with various lengths and sources, developed through carefully designed data filtering and annotating strategy. 2) ShareCaptioner-Video, an efficient and capable captioning model for arbitrary videos, with 4.8M high-quality aesthetic videos annotated by it. 3) ShareGPT4Video-8B, a simple yet superb LVLM that reached SOTA performance on three advancing video benchmarks. To achieve this, taking aside the non-scalable costly human annotators, we find using GPT4V to caption video with a naive multi-frame or frame-concatenation input strategy leads to less detailed and sometimes temporal-confused results. We argue the challenge of designing a high-quality video captioning strategy lies in three aspects: 1) Inter-frame precise temporal change understanding. 2) Intra-frame detailed content description. 3) Frame-number scalability for arbitrary-length videos. To this end, we meticulously designed a differential video captioning strategy, which is stable, scalable, and efficient for generating captions for videos with arbitrary resolution, aspect ratios, and length. Based on it, we construct ShareGPT4Video, which contains 40K high-quality videos spanning a wide range of categories, and the resulting captions encompass rich world knowledge, object attributes, camera movements, and crucially, detailed and precise temporal descriptions of events. Based on ShareGPT4Video, we further develop ShareCaptioner-Video, a superior captioner capable of efficiently generating high-quality captions for arbitrary videos…

[CV-14] SF-V: Single Forward Video Generation Model

Link: https://arxiv.org/abs/2406.04324
Authors: Zhixing Zhang,Yanyu Li,Yushu Wu,Yanwu Xu,Anil Kag,Ivan Skorokhodov,Willi Menapace,Aliaksandr Siarohin,Junli Cao,Dimitris Metaxas,Sergey Tulyakov,Jian Ren
Keywords: demonstrated remarkable success, Diffusion-based video generation, obtaining high-fidelity videos, iterative denoising process, Diffusion-based video
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Notes: Project Page: this https URL

Abstract:Diffusion-based video generation models have demonstrated remarkable success in obtaining high-fidelity videos through the iterative denoising process. However, these models require multiple denoising steps during sampling, resulting in high computational costs. In this work, we propose a novel approach to obtain single-step video generation models by leveraging adversarial training to fine-tune pre-trained video diffusion models. We show that, through the adversarial training, the multi-step video diffusion model, i.e., Stable Video Diffusion (SVD), can be trained to perform a single forward pass to synthesize high-quality videos, capturing both temporal and spatial dependencies in the video data. Extensive experiments demonstrate that our method achieves competitive generation quality of synthesized videos with significantly reduced computational overhead for the denoising process (i.e., around 23× speedup compared with SVD and 6× speedup compared with existing works, with even better generation quality), paving the way for real-time video synthesis and editing. More visualization results are made publicly available at this https URL.

[CV-15] ATraDiff: Accelerating Online Reinforcement Learning with Imaginary Trajectories

Link: https://arxiv.org/abs/2406.04323
Authors: Qianlan Yang,Yu-Xiong Wang
Keywords: Training autonomous agents, Training autonomous, low data efficiency, offline data, due to low
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Notes: ICML 2024 Accepted

Abstract:Training autonomous agents with sparse rewards is a long-standing problem in online reinforcement learning (RL), due to low data efficiency. Prior work overcomes this challenge by extracting useful knowledge from offline data, often accomplished through the learning of action distribution from offline data and utilizing the learned distribution to facilitate online RL. However, since the offline data are given and fixed, the extracted knowledge is inherently limited, making it difficult to generalize to new tasks. We propose a novel approach that leverages offline data to learn a generative diffusion model, coined as Adaptive Trajectory Diffuser (ATraDiff). This model generates synthetic trajectories, serving as a form of data augmentation and consequently enhancing the performance of online RL methods. The key strength of our diffuser lies in its adaptability, allowing it to effectively handle varying trajectory lengths and mitigate distribution shifts between online and offline data. Because of its simplicity, ATraDiff seamlessly integrates with a wide spectrum of RL methods. Empirical evaluation shows that ATraDiff consistently achieves state-of-the-art performance across a variety of environments, with particularly pronounced improvements in complicated settings. Our code and demo video are available at this https URL.

[CV-16] DIRECT-3D: Learning Direct Text-to-3D Generation on Massive Noisy 3D Data

Link: https://arxiv.org/abs/2406.04322
Authors: Qihao Liu,Yi Zhang,Song Bai,Adam Kortylewski,Alan Yuille
Keywords: Neural Radiance Fields, Radiance Fields, Neural Radiance, represented by Neural, Fields
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Accepted to CVPR 2024; code: this https URL; project page: this https URL

Abstract:We present DIRECT-3D, a diffusion-based 3D generative model for creating high-quality 3D assets (represented by Neural Radiance Fields) from text prompts. Unlike recent 3D generative models that rely on clean and well-aligned 3D data, limiting them to single or few-class generation, our model is directly trained on extensive noisy and unaligned ‘in-the-wild’ 3D assets, mitigating the key challenge (i.e., data scarcity) in large-scale 3D generation. In particular, DIRECT-3D is a tri-plane diffusion model that integrates two innovations: 1) A novel learning framework where noisy data are filtered and aligned automatically during the training process. Specifically, after an initial warm-up phase using a small set of clean data, an iterative optimization is introduced in the diffusion process to explicitly estimate the 3D pose of objects and select beneficial data based on conditional density. 2) An efficient 3D representation that is achieved by disentangling object geometry and color features with two separate conditional diffusion models that are optimized hierarchically. Given a prompt input, our model generates high-quality, high-resolution, realistic, and complex 3D objects with accurate geometric details in seconds. We achieve state-of-the-art performance in both single-class generation and text-to-3D generation. We also demonstrate that DIRECT-3D can serve as a useful 3D geometric prior of objects, for example to alleviate the well-known Janus problem in 2D-lifting methods such as DreamFusion. The code and models are available for research purposes at: this https URL.

[CV-17] VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling

Link: https://arxiv.org/abs/2406.04321
Authors: Zeyue Tian,Zhaoyang Liu,Ruibin Yuan,Jiahao Pan,Xiaoqiang Huang,Qifeng Liu,Xu Tan,Qifeng Chen,Wei Xue,Yike Guo
Keywords: generation conditioned solely, study music generation, music generation conditioned, systematically study music, systematically study
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM); Sound (cs.SD)
Notes: The code and datasets will be available at this https URL

Abstract:In this work, we systematically study music generation conditioned solely on the video. First, we present a large-scale dataset comprising 190K video-music pairs, including various genres such as movie trailers, advertisements, and documentaries. Furthermore, we propose VidMuse, a simple framework for generating music aligned with video inputs. VidMuse stands out by producing high-fidelity music that is both acoustically and semantically aligned with the video. By incorporating local and global visual cues, VidMuse enables the creation of musically coherent audio tracks that consistently match the video content through Long-Short-Term modeling. Through extensive experiments, VidMuse outperforms existing models in terms of audio quality, diversity, and audio-visual alignment. The code and datasets will be available at this https URL.

[CV-18] Adaptive Sampling of k-Space in Magnetic Resonance for Rapid Pathology Prediction

Link: https://arxiv.org/abs/2406.04318
Authors: Chen-Yu Yen,Raghav Singhal,Umang Sharma,Rajesh Ranganath,Sumit Chopra,Lerrel Pinto
Keywords: Magnetic Resonance, proven diagnostic utility, inaccessible imaging modality, imaging modality, diagnostic utility
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Notes: ICML 2024. Project website at this https URL

Abstract:Magnetic Resonance (MR) imaging, despite its proven diagnostic utility, remains an inaccessible imaging modality for disease surveillance at the population level. A major factor rendering MR inaccessible is lengthy scan times. An MR scanner collects measurements associated with the underlying anatomy in the Fourier space, also known as the k-space. Creating a high-fidelity image requires collecting large quantities of such measurements, increasing the scan time. Traditionally, to accelerate an MR scan, image reconstruction from under-sampled k-space data is the method of choice. However, recent works show the feasibility of bypassing image reconstruction and learning to detect disease directly from a sparser learned subset of the k-space measurements. In this work, we propose Adaptive Sampling for MR (ASMR), a sampling method that learns an adaptive policy to sequentially select k-space samples to optimize for target disease detection. On 6 out of 8 pathology classification tasks spanning the Knee, Brain, and Prostate MR scans, ASMR reaches within 2% of the performance of a fully sampled classifier while using only 8% of the k-space, as well as outperforming prior state-of-the-art work in k-space sampling such as EMRT, LOUPE, and DPS.

[CV-19] Omni6DPose: A Benchmark and Model for Universal 6D Object Pose Estimation and Tracking

Link: https://arxiv.org/abs/2406.04316
Authors: Jiyao Zhang,Weiyao Huang,Bo Peng,Mingdong Wu,Fei Hu,Zijian Chen,Bo Zhao,Hao Dong
Keywords: Object Pose Estimation, Pose Estimation, Pose Estimation Dataset, Object Pose, computer vision
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:6D Object Pose Estimation is a crucial yet challenging task in computer vision, suffering from a significant lack of large-scale datasets. This scarcity impedes comprehensive evaluation of model performance, limiting research advancements. Furthermore, the restricted number of available instances or categories curtails its applications. To address these issues, this paper introduces Omni6DPose, a substantial dataset characterized by its diversity in object categories, large scale, and variety in object materials. Omni6DPose is divided into three main components: ROPE (Real 6D Object Pose Estimation Dataset), which includes 332K images annotated with over 1.5M annotations across 581 instances in 149 categories; SOPE (Simulated 6D Object Pose Estimation Dataset), consisting of 475K images created in a mixed reality setting with depth simulation, annotated with over 5M annotations across 4162 instances in the same 149 categories; and the manually aligned real scanned objects used in both ROPE and SOPE. Omni6DPose is inherently challenging due to the substantial variations and ambiguities. To address this challenge, we introduce GenPose++, an enhanced version of the SOTA category-level pose estimation framework, incorporating two pivotal improvements: Semantic-aware feature extraction and Clustering-based aggregation. Moreover, we provide a comprehensive benchmarking analysis to evaluate the performance of previous methods on this large-scale dataset in the realms of 6D object pose estimation and pose tracking.

[CV-20] Step-aware Preference Optimization: Aligning Preference with Denoising Performance at Each Step

Link: https://arxiv.org/abs/2406.04314
Authors: Zhanhao Liang,Yuhui Yuan,Shuyang Gu,Bohan Chen,Tiankai Hang,Ji Li,Liang Zheng
Keywords: Direct Preference Optimization, Direct Preference, Step-aware Preference Optimization, Preference Optimization, large language models
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Recently, Direct Preference Optimization (DPO) has extended its success from aligning large language models (LLMs) to aligning text-to-image diffusion models with human preferences. Unlike most existing DPO methods that assume all diffusion steps share a consistent preference order with the final generated images, we argue that this assumption neglects step-specific denoising performance and that preference labels should be tailored to each step’s contribution. To address this limitation, we propose Step-aware Preference Optimization (SPO), a novel post-training approach that independently evaluates and adjusts the denoising performance at each step, using a step-aware preference model and a step-wise resampler to ensure accurate step-aware supervision. Specifically, at each denoising step, we sample a pool of images, find a suitable win-lose pair, and, most importantly, randomly select a single image from the pool to initialize the next denoising step. This step-wise resampler process ensures the next win-lose image pair comes from the same image, making the win-lose comparison independent of the previous step. To assess the preferences at each step, we train a separate step-aware preference model that can be applied to both noisy and clean images. Our experiments with Stable Diffusion v1.5 and SDXL demonstrate that SPO significantly outperforms the latest Diffusion-DPO in aligning generated images with complex, detailed prompts and enhancing aesthetics, while also training more than 20× faster. Code and model: this https URL
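
The step-wise resampler described in this abstract can be sketched as a small loop: at each step, sample a pool of candidates from the same starting point, pick a win-lose pair, then restart from one randomly chosen candidate. This is a toy illustration only. The scalar "images", `toy_denoise_candidates`, and `toy_preference` are hypothetical stand-ins, and choosing the best and worst candidate as the pair is an assumption (the abstract only says a "suitable" pair is found).

```python
import random
from typing import List, Tuple


def toy_denoise_candidates(x: float, pool_size: int, rng: random.Random) -> List[float]:
    """Hypothetical one-step denoiser: returns a pool of noisy variants of x."""
    return [0.9 * x + rng.gauss(0.0, 0.1) for _ in range(pool_size)]


def toy_preference(x: float) -> float:
    """Hypothetical step-aware preference score: values closer to 0 score higher."""
    return -abs(x)


def spo_rollout(x_start: float, num_steps: int, pool_size: int,
                rng: random.Random) -> List[Tuple[int, float, float]]:
    """Collect one (step, win, lose) pair per denoising step."""
    pairs = []
    x = x_start
    for t in range(num_steps, 0, -1):
        # Every candidate in the pool is derived from the same image x.
        pool = toy_denoise_candidates(x, pool_size, rng)
        win = max(pool, key=toy_preference)   # assumption: best candidate wins
        lose = min(pool, key=toy_preference)  # assumption: worst candidate loses
        pairs.append((t, win, lose))
        # Resample: continue from a random pool member, not from the winner.
        x = rng.choice(pool)
    return pairs


for step, win, lose in spo_rollout(1.0, num_steps=5, pool_size=4, rng=random.Random(0)):
    print(step, round(win, 3), round(lose, 3))
```

Continuing from a random pool member rather than from the winner is what keeps each step's comparison independent of the previous step's outcome, as the abstract argues.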

[CV-21] Improving Alignment and Robustness with Short Circuiting

Link: https://arxiv.org/abs/2406.04313
Authors: Andy Zou,Long Phan,Justin Wang,Derek Duenas,Maxwell Lin,Maksym Andriushchenko,Rowan Wang,Zico Kolter,Matt Fredrikson,Dan Hendrycks
Keywords: highly vulnerable, harmful, adversarial, attacks, harmful outputs
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)

Abstract:AI systems can take harmful actions and are highly vulnerable to adversarial attacks. We present an approach, inspired by recent advances in representation engineering, that “short-circuits” models as they respond with harmful outputs. Existing techniques aimed at improving alignment, such as refusal training, are often bypassed. Techniques such as adversarial training try to plug these holes by countering specific attacks. As an alternative to refusal training and adversarial training, short-circuiting directly controls the representations that are responsible for harmful outputs in the first place. Our technique can be applied to both text-only and multimodal language models to prevent the generation of harmful outputs without sacrificing utility – even in the presence of powerful unseen attacks. Notably, while adversarial robustness in standalone image recognition remains an open challenge, short-circuiting allows the larger multimodal system to reliably withstand image “hijacks” that aim to produce harmful content. Finally, we extend our approach to AI agents, demonstrating considerable reductions in the rate of harmful actions when they are under attack. Our approach represents a significant step forward in the development of reliable safeguards to harmful behavior and adversarial attacks.

[CV-22] ReNO: Enhancing One-step Text-to-Image Models through Reward-based Noise Optimization

Link: https://arxiv.org/abs/2406.04312
Authors: Luca Eyring,Shyamgopal Karthik,Karsten Roth,Alexey Dosovitskiy,Zeynep Akata
Keywords: made significant advancements, accurately capture intricate, capture intricate details, complex compositional prompts, recent years
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Preprint

Abstract:Text-to-Image (T2I) models have made significant advancements in recent years, but they still struggle to accurately capture intricate details specified in complex compositional prompts. While fine-tuning T2I models with reward objectives has shown promise, it suffers from “reward hacking” and may not generalize well to unseen prompt distributions. In this work, we propose Reward-based Noise Optimization (ReNO), a novel approach that enhances T2I models at inference by optimizing the initial noise based on the signal from one or multiple human preference reward models. Remarkably, solving this optimization problem with gradient ascent for 50 iterations yields impressive results on four different one-step models across two competitive benchmarks, T2I-CompBench and GenEval. Within a computational budget of 20-50 seconds, ReNO-enhanced one-step models consistently surpass the performance of all current open-source Text-to-Image models. Extensive user studies demonstrate that our model is preferred nearly twice as often compared to the popular SDXL model and is on par with the proprietary Stable Diffusion 3 with 8B parameters. Moreover, given the same computational resources, a ReNO-optimized one-step model outperforms widely-used open-source models such as SDXL and PixArt-α, highlighting the efficiency and effectiveness of ReNO in enhancing T2I model performance at inference time. Code is available at this https URL.
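
The core loop in this abstract (keep the generator fixed and run gradient ascent on the initial noise for 50 iterations to increase a reward) can be sketched on a toy problem. All names below are hypothetical: `toy_generator` is a linear stand-in for a one-step T2I model, `toy_reward` stands in for a human-preference reward model, and the gradient is written out by hand rather than obtained by backpropagating through real models. The learning rate is also an arbitrary choice for this sketch.

```python
from typing import List

TARGET = [3.0, -1.0]  # hypothetical output that the toy reward prefers


def toy_generator(z: List[float]) -> List[float]:
    """Hypothetical one-step generator: a fixed linear map from noise to output."""
    return [2.0 * v + 1.0 for v in z]


def toy_reward(x: List[float]) -> float:
    """Hypothetical reward: negative squared distance to TARGET (higher is better)."""
    return -sum((a - b) ** 2 for a, b in zip(x, TARGET))


def reward_grad_wrt_noise(z: List[float]) -> List[float]:
    """Analytic gradient of toy_reward(toy_generator(z)) with respect to z."""
    x = toy_generator(z)
    # Chain rule: d(reward)/dx = -2 (x - target), and dx/dz = 2.
    return [-2.0 * (a - b) * 2.0 for a, b in zip(x, TARGET)]


def optimize_noise(z0: List[float], steps: int = 50, lr: float = 0.05) -> List[float]:
    """Gradient ascent on the initial noise; the generator itself is never updated."""
    z = list(z0)
    for _ in range(steps):
        grad = reward_grad_wrt_noise(z)
        z = [v + lr * g for v, g in zip(z, grad)]
    return z


z_init = [0.0, 0.0]
z_opt = optimize_noise(z_init)
print(toy_reward(toy_generator(z_init)), toy_reward(toy_generator(z_opt)))
```

Because only the noise changes, this kind of optimization runs at inference time and leaves model weights untouched, which is the contrast the abstract draws with reward fine-tuning.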

[CV-23] ReFiNe: Recursive Field Networks for Cross-modal Multi-scene Representation

Link: https://arxiv.org/abs/2406.04309
Authors: Sergey Zakharov,Katherine Liu,Adrien Gaidon,Rares Ambrus
Keywords: involve trading modeling, trading modeling accuracy, multi-shape representation, involve trading, common trade-offs
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG); Multimedia (cs.MM)
Notes: SIGGRAPH 2024. Project Page: this https URL

Abstract:The common trade-offs of state-of-the-art methods for multi-shape representation (a single model “packing” multiple objects) involve trading modeling accuracy against memory and storage. We show how to encode multiple shapes represented as continuous neural fields with a higher degree of precision than previously possible and with low memory usage. Key to our approach is a recursive hierarchical formulation that exploits object self-similarity, leading to a highly compressed and efficient shape latent space. Thanks to the recursive formulation, our method supports spatial and global-to-local latent feature fusion without needing to initialize and maintain auxiliary data structures, while still allowing for continuous field queries to enable applications such as raytracing. In experiments on a set of diverse datasets, we provide compelling qualitative results and demonstrate state-of-the-art multi-scene reconstruction and compression results with a single network per dataset.

[CV-24] Vision-LSTM: xLSTM as Generic Vision Backbone

Link: https://arxiv.org/abs/2406.04303
Authors: Benedikt Alkin,Maximilian Beck,Korbinian Pöppel,Sepp Hochreiter,Johannes Brandstetter
Keywords: natural language processing, Transformers are widely, language processing, initially introduced, introduced for natural
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Transformers are widely used as generic backbones in computer vision, despite being initially introduced for natural language processing. Recently, the Long Short-Term Memory (LSTM) has been extended to a scalable and performant architecture - the xLSTM - which overcomes long-standing LSTM limitations via exponential gating and parallelizable matrix memory structure. In this report, we introduce Vision-LSTM (ViL), an adaptation of the xLSTM building blocks to computer vision. ViL comprises a stack of xLSTM blocks where odd blocks process the sequence of patch tokens from top to bottom while even blocks go from bottom to top. Experiments show that ViL holds promise to be further deployed as a new generic backbone for computer vision architectures.
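
The alternating traversal in this abstract (odd blocks scan the patch sequence top to bottom, even blocks scan bottom to top) can be shown with a minimal sketch. The block itself is a hypothetical stand-in: `toy_recurrent_block` is a causal running average, not an xLSTM block, and is used only because it shares the property that each position sees only earlier positions in the scan order.

```python
from typing import List


def toy_recurrent_block(seq: List[float]) -> List[float]:
    """Hypothetical causal block: position i outputs the mean of positions 0..i."""
    out, total = [], 0.0
    for count, value in enumerate(seq, start=1):
        total += value
        out.append(total / count)
    return out


def vil_forward(patch_tokens: List[float], num_blocks: int) -> List[float]:
    """Alternate the scan direction between consecutive blocks."""
    x = list(patch_tokens)
    for block_idx in range(1, num_blocks + 1):
        if block_idx % 2 == 1:
            # Odd block: scan from the first patch to the last.
            x = toy_recurrent_block(x)
        else:
            # Even block: flip, scan, then flip back to the original order.
            x = toy_recurrent_block(x[::-1])[::-1]
    return x


# Only the last patch carries signal in this toy input.
print(vil_forward([0.0, 0.0, 0.0, 1.0], num_blocks=1))
print(vil_forward([0.0, 0.0, 0.0, 1.0], num_blocks=2))
```

With a single forward block the first patch cannot see the last one; after adding the reversed block it can, which is why a stack of causal blocks needs both directions to cover a 2D image.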

[CV-25] Neural Surface Reconstruction from Sparse Views Using Epipolar Geometry

Link: https://arxiv.org/abs/2406.04301
Authors: Kaichen Zhou
Keywords: pose significant hurdles, missing information pose, information pose significant, significant hurdles, paper addresses
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:This paper addresses the challenge of reconstructing surfaces from sparse view inputs, where ambiguity and occlusions due to missing information pose significant hurdles. We present a novel approach, named EpiS, that incorporates Epipolar information into the reconstruction process. Existing methods in sparse-view neural surface learning have mainly focused on mean and variance considerations using cost volumes for feature extraction. In contrast, our method aggregates coarse information from the cost volume into Epipolar features extracted from multiple source views, enabling the generation of fine-grained Signed Distance Function (SDF)-aware features. Additionally, we employ an attention mechanism along the line dimension to facilitate feature fusion based on the SDF feature. Furthermore, to address the information gaps in sparse conditions, we integrate depth information from monocular depth estimation using global and local regularization techniques. The global regularization utilizes a triplet loss function, while the local regularization employs a derivative loss function. Extensive experiments demonstrate that our approach outperforms state-of-the-art methods, especially in cases with sparse and generalizable conditions.

[CV-26] Everything to the Synthetic: Diffusion-driven Test-time Adaptation via Synthetic-Domain Alignment

Link: https://arxiv.org/abs/2406.04295
Authors: Jiayi Guo, Junhao Zhao, Chunjiang Ge, Chaoqun Du, Zanlin Ni, Shiji Song, Humphrey Shi, Gao Huang
Keywords: diffusion-driven TTA methods, TTA methods, unknown shifted target, diffusion-driven TTA, Test-time adaptation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: GitHub: this https URL

Abstract:Test-time adaptation (TTA) aims to enhance the performance of source-domain pretrained models when tested on unknown shifted target domains. Traditional TTA methods primarily adapt model weights based on target data streams, making model performance sensitive to the amount and order of target data. Recently, diffusion-driven TTA methods have demonstrated strong performance by using an unconditional diffusion model, which is also trained on the source domain to transform target data into synthetic data as a source domain projection. This allows the source model to make predictions without weight adaptation. In this paper, we argue that the domains of the source model and the synthetic data in diffusion-driven TTA methods are not aligned. To adapt the source model to the synthetic domain of the unconditional diffusion model, we introduce a Synthetic-Domain Alignment (SDA) framework to fine-tune the source model with synthetic data. Specifically, we first employ a conditional diffusion model to generate labeled samples, creating a synthetic dataset. Subsequently, we use the aforementioned unconditional diffusion model to add noise to and denoise each sample before fine-tuning. This process mitigates the potential domain gap between the conditional and unconditional models. Extensive experiments across various models and benchmarks demonstrate that SDA achieves superior domain alignment and consistently outperforms existing diffusion-driven TTA methods. Our code is available at this https URL.

[CV-27] VISTA: Visualized Text Embedding For Universal Multi-Modal Retrieval

Link: https://arxiv.org/abs/2406.04292
Authors: Junjie Zhou, Zheng Liu, Shitao Xiao, Bo Zhao, Yongping Xiong
Keywords: popular in practice, increasingly popular, Multi-modal retrieval, Multi-modal, data
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ACL 2024 main conference

Abstract:Multi-modal retrieval is becoming increasingly popular in practice. However, the existing retrievers are mostly text-oriented and lack the capability to process visual information. Despite the presence of vision-language models like CLIP, the current methods are severely limited in representing text-only and image-only data. In this work, we present a new embedding model, VISTA, for universal multi-modal retrieval. Our work makes three technical contributions. Firstly, we introduce a flexible architecture which extends a powerful text encoder with image understanding capability by introducing visual token embeddings. Secondly, we develop two data generation strategies, which provide high-quality composed image-text data to facilitate the training of the embedding model. Thirdly, we introduce a multi-stage training algorithm, which first aligns the visual token embedding with the text encoder using massive weakly labeled data, and then develops multi-modal representation capability using the generated composed image-text data. In our experiments, VISTA achieves superior performance across a variety of multi-modal retrieval tasks in both zero-shot and supervised settings. Our model, data, and source code are available at this https URL.

[CV-28] SpectralZoom: Efficient Segmentation with an Adaptive Hyperspectral Camera

Link: https://arxiv.org/abs/2406.04287
Authors: Jackson Arnold, Sophia Rossi, Chloe Petrosino, Ethan Mitchell, Sanjeev J. Koppal
Keywords: remote sensing, battlefield sensing, biomedical imaging, sensing and astronomy, sensing
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

Abstract:Hyperspectral image segmentation is crucial for many fields such as agriculture, remote sensing, biomedical imaging, battlefield sensing and astronomy. However, the challenge of hyper and multi spectral imaging is its large data footprint. We propose both a novel camera design and a vision transformer-based (ViT) algorithm that alleviate both the captured data footprint and the computational load for hyperspectral segmentation. Our camera is able to adaptively sample image regions or patches at different resolutions, instead of capturing the entire hyperspectral cube at one high resolution. Our segmentation algorithm works in concert with the camera, applying ViT-based segmentation only to adaptively selected patches. We show results both in simulation and on a real hardware platform demonstrating both accurate segmentation results and reduced computational burden.

[CV-29] xMIL: Insightful Explanations for Multiple Instance Learning in Histopathology

Link: https://arxiv.org/abs/2406.04280
Authors: Julius Hense, Mina Jamshidi Idaji, Oliver Eberle, Thomas Schnake, Jonas Dippel, Laure Ciernik, Oliver Buchstab, Andreas Mock, Frederick Klauschen, Klaus-Robert Müller
Keywords: Multiple instance learning, supervised machine learning, weakly supervised machine, machine learning, Multiple instance
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Multiple instance learning (MIL) is an effective and widely used approach for weakly supervised machine learning. In histopathology, MIL models have achieved remarkable success in tasks like tumor detection, biomarker prediction, and outcome prognostication. However, MIL explanation methods are still lagging behind, as they are limited to small bag sizes or disregard instance interactions. We revisit MIL through the lens of explainable AI (XAI) and introduce xMIL, a refined framework with more general assumptions. We demonstrate how to obtain improved MIL explanations using layer-wise relevance propagation (LRP) and conduct extensive evaluation experiments on three toy settings and four real-world histopathology datasets. Our approach consistently outperforms previous explanation attempts with particularly improved faithfulness scores on challenging biomarker prediction tasks. Finally, we showcase how xMIL explanations enable pathologists to extract insights from MIL models, representing a significant advance for knowledge discovery and model debugging in digital histopathology.

[CV-30] VideoTetris: Towards Compositional Text-to-Video Generation

Link: https://arxiv.org/abs/2406.04277
Authors: Ye Tian, Ling Yang, Haotian Yang, Yuan Gao, Yufan Deng, Jingmin Chen, Xintao Wang, Zhaochen Yu, Xin Tao, Pengfei Wan, Di Zhang, Bin Cui
Keywords: demonstrated great success, models have demonstrated, demonstrated great, great success, Diffusion models
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code: this https URL

Abstract:Diffusion models have demonstrated great success in text-to-video (T2V) generation. However, existing methods may face challenges when handling complex (long) video generation scenarios that involve multiple objects or dynamic changes in object numbers. To address these limitations, we propose VideoTetris, a novel framework that enables compositional T2V generation. Specifically, we propose spatio-temporal compositional diffusion to precisely follow complex textual semantics by manipulating and composing the attention maps of denoising networks spatially and temporally. Moreover, we propose an enhanced video data preprocessing to enhance the training data regarding motion dynamics and prompt understanding, equipped with a new reference frame attention mechanism to improve the consistency of auto-regressive video generation. Extensive experiments demonstrate that our VideoTetris achieves impressive qualitative and quantitative results in compositional T2V generation. Code is available at: this https URL

[CV-31] ELFS: Enhancing Label-Free Coreset Selection via Clustering-based Pseudo-Labeling

Link: https://arxiv.org/abs/2406.04273
Authors: Haizhong Zheng, Elisa Tsai, Yifu Lu, Jiachen Sun, Brian R. Bartoldson, Bhavya Kailkhura, Atul Prakash
Keywords: High-quality human-annotated data, High-quality human-annotated, human annotation process, deep learning pipelines, modern deep learning
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:High-quality human-annotated data is crucial for modern deep learning pipelines, yet the human annotation process is both costly and time-consuming. Given a constrained human labeling budget, selecting an informative and representative data subset for labeling can significantly reduce human annotation effort. Well-performing state-of-the-art (SOTA) coreset selection methods require ground-truth labels over the whole dataset, failing to reduce the human labeling burden. Meanwhile, SOTA label-free coreset selection methods deliver inferior performance due to poor geometry-based scores. In this paper, we introduce ELFS, a novel label-free coreset selection method. ELFS employs deep clustering to estimate data difficulty scores without ground-truth labels. Furthermore, ELFS uses a simple but effective double-end pruning method to mitigate bias on calculated scores, which further improves the performance on selected coresets. We evaluate ELFS on five vision benchmarks and show that ELFS consistently outperforms SOTA label-free baselines. For instance, at a 90% pruning rate, ELFS surpasses the best-performing baseline by 5.3% on CIFAR10 and 7.1% on CIFAR100. Moreover, ELFS even achieves comparable performance to supervised coreset selection at low pruning rates (e.g., 30% and 50%) on CIFAR10 and ImageNet-1K.
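
As a rough illustration of the double-end pruning idea mentioned in this abstract, the sketch below drops samples from both the easiest and the hardest end of a difficulty ranking. It is a minimal sketch under assumptions: the difficulty scores are taken as given (the deep-clustering step that would produce them is not shown), and `double_end_prune` and its `hard_fraction` knob are hypothetical names rather than the paper's API or settings.

```python
from typing import List, Sequence


def double_end_prune(
    scores: Sequence[float],
    prune_rate: float,
    hard_fraction: float = 0.5,
) -> List[int]:
    """Return indices of kept samples after pruning both ends of the ranking.

    scores: one difficulty score per sample (higher = harder); assumed input.
    prune_rate: fraction of the dataset to drop overall.
    hard_fraction: share of the dropped samples taken from the hard end
        (an illustrative knob, not a value from the paper).
    """
    if not 0.0 <= prune_rate < 1.0:
        raise ValueError("prune_rate must be in [0, 1)")
    if not 0.0 <= hard_fraction <= 1.0:
        raise ValueError("hard_fraction must be in [0, 1]")
    n = len(scores)
    n_drop = int(n * prune_rate)
    n_hard = int(n_drop * hard_fraction)
    n_easy = n_drop - n_hard
    # Rank sample indices from easiest to hardest.
    order = sorted(range(n), key=lambda i: scores[i])
    # Keep the middle of the ranking, dropping n_easy and n_hard at the ends.
    kept = order[n_easy : n - n_hard]
    return sorted(kept)
```

Setting `hard_fraction` to 0 reduces this to pruning only the easiest samples, which makes it easy to compare one-sided and two-sided pruning on the same scores.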

[CV-32] MLVU: A Comprehensive Benchmark for Multi-Task Long Video Understanding

Link: https://arxiv.org/abs/2406.04264
Authors: Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Shitao Xiao, Xi Yang, Yongping Xiong, Bo Zhang, Tiejun Huang, Zheng Liu
Keywords: Long Video Understanding, Video Understanding, Multi-task Long Video, video understanding benchmarks, Long Video
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:The evaluation of Long Video Understanding (LVU) performance poses an important but challenging research problem. Despite previous efforts, the existing video understanding benchmarks are severely constrained by several issues, especially the insufficient lengths of videos, a lack of diversity in video types and evaluation tasks, and the inappropriateness for evaluating LVU performances. To address the above problems, we propose a new benchmark, called MLVU (Multi-task Long Video Understanding Benchmark), for the comprehensive and in-depth evaluation of LVU. MLVU presents the following critical values: 1) The substantial and flexible extension of video lengths, which enables the benchmark to evaluate LVU performance across a wide range of durations. 2) The inclusion of various video genres, e.g., movies, surveillance footage, egocentric videos, cartoons, game videos, etc., which reflects the models’ LVU performances in different scenarios. 3) The development of diversified evaluation tasks, which enables a comprehensive examination of MLLMs’ key abilities in long-video understanding. The empirical study with 20 latest MLLMs reveals significant room for improvement in today’s technique, as all existing methods struggle with most of the evaluation tasks and exhibit severe performance degradation when handling longer videos. Additionally, it suggests that factors such as context length, image-understanding quality, and the choice of LLM backbone can play critical roles in future advancements. We anticipate that MLVU will advance the research of long video understanding by providing a comprehensive and in-depth analysis of MLLMs.

[CV-33] GeoGen: Geometry-Aware Generative Modeling via Signed Distance Functions

Link: https://arxiv.org/abs/2406.04254
Authors: Salvatore Esposito, Qingshan Xu, Kacper Kania, Charlie Hewitt, Octave Mariotti, Lohit Petikam, Julien Valentin, Arno Onken, Oisin Mac Aodha
Keywords: approach for synthesizing, single-view collections, Signed Distance Function, multi-view consistent images, generative approach
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Computer Vision and Pattern Recognition 2024

Abstract:We introduce a new generative approach for synthesizing 3D geometry and images from single-view collections. Most existing approaches predict volumetric density to render multi-view consistent images. By employing volumetric rendering using neural radiance fields, they inherit a key limitation: the generated geometry is noisy and unconstrained, limiting the quality and utility of the output meshes. To address this issue, we propose GeoGen, a new SDF-based 3D generative model trained in an end-to-end manner. Initially, we reinterpret the volumetric density as a Signed Distance Function (SDF). This allows us to introduce useful priors to generate valid meshes. However, those priors prevent the generative model from learning details, limiting the applicability of the method to real-world scenarios. To alleviate that problem, we make the transformation learnable and constrain the rendered depth map to be consistent with the zero-level set of the SDF. Through the lens of adversarial training, we encourage the network to produce higher fidelity details on the output meshes. For evaluation, we introduce a synthetic dataset of human avatars captured from 360-degree camera angles, to overcome the challenges presented by real-world datasets, which often lack 3D consistency and do not cover all camera angles. Our experiments on multiple datasets show that GeoGen produces visually and quantitatively better geometry than the previous generative models based on neural radiance fields.

[CV-34] A Survey on 3D Human Avatar Modeling – From Reconstruction to Generation

Link: https://arxiv.org/abs/2406.04253
Authors: Ruihe Wang, Yukang Cao, Kai Han, Kwan-Yee K. Wong
Keywords: computer graphics, human avatar modeling, computer vision, important area, human
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 30 pages, 21 figures

Abstract:3D modeling has long been an important area in computer vision and computer graphics. Recently, thanks to the breakthroughs in neural representations and generative models, we witnessed a rapid development of 3D modeling. 3D human modeling, lying at the core of many real-world applications, such as gaming and animation, has attracted significant attention. Over the past few years, a large body of work on creating 3D human avatars has been introduced, forming a new and abundant knowledge base for 3D human modeling. The scale of the literature makes it difficult for individuals to keep track of all the works. This survey aims to provide a comprehensive overview of these emerging techniques for 3D human avatar modeling, from both reconstruction and generation perspectives. Firstly, we review representative methods for 3D human reconstruction, including methods based on pixel-aligned implicit function, neural radiance field, and 3D Gaussian Splatting, etc. We then summarize representative methods for 3D human generation, especially those using large language models like CLIP, diffusion models, and various 3D representations, which demonstrate state-of-the-art performance. Finally, we discuss our reflection on existing methods and open challenges for 3D human avatar modeling, shedding light on future research.

[CV-35] Localized Gaussian Point Management

Link: https://arxiv.org/abs/2406.04251
Authors: Haosen Yang, Chenhao Zhang, Wenqing Wang, Marco Volino, Adrian Hilton, Li Zhang, Xiatian Zhu
Keywords: Adaptive Density Control, Gaussian Splatting, Localized Point Management, component in optimizing, structure from motion
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Point management is a critical component in optimizing 3D Gaussian Splatting (3DGS) models, as the point initiation (e.g., via structure from motion) is distributionally inappropriate. Typically, the Adaptive Density Control (ADC) algorithm is applied, leveraging view-averaged gradient magnitude thresholding for point densification, opacity thresholding for pruning, and regular all-points opacity reset. However, we reveal that this strategy is limited in tackling intricate/special image regions (e.g., transparent) as it is unable to identify all the 3D zones that require point densification, and lacks an appropriate mechanism to handle the ill-conditioned points with negative impacts (occlusion due to false high opacity). To address these limitations, we propose a Localized Point Management (LPM) strategy, capable of identifying those error-contributing zones in the highest demand for both point addition and geometry calibration. Zone identification is achieved by leveraging the underlying multiview geometry constraints, with the guidance of image rendering errors. We apply point densification in the identified zone, whilst resetting the opacity of those points residing in front of these regions so that a new opportunity is created to correct ill-conditioned points. Serving as a versatile plugin, LPM can be seamlessly integrated into existing 3D Gaussian Splatting models. Experimental evaluation across both static 3D and dynamic 4D scenes validates the efficacy of our LPM strategy in boosting a variety of existing 3DGS models both quantitatively and qualitatively. Notably, LPM improves both vanilla 3DGS and SpaceTimeGS to achieve state-of-the-art rendering quality while retaining real-time speeds, outperforming on challenging datasets such as Tanks & Temples and the Neural 3D Video Dataset.
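
The three ADC ingredients named in this abstract (gradient-magnitude thresholding for densification, opacity thresholding for pruning, and a periodic opacity reset) can be sketched as follows. This is a simplified illustration and not the 3DGS or LPM code: `GaussianPoint`, `density_control_step`, `reset_opacity`, and all threshold values are assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class GaussianPoint:
    grad_magnitude: float  # view-averaged positional gradient magnitude
    opacity: float         # value in [0, 1]


def density_control_step(
    points: List[GaussianPoint],
    grad_threshold: float = 0.0002,    # illustrative value
    opacity_threshold: float = 0.005,  # illustrative value
) -> Tuple[List[int], List[int]]:
    """Return (indices to densify, indices to prune) for one control step."""
    # Points that are almost transparent contribute little and are pruned.
    prune = [i for i, p in enumerate(points) if p.opacity < opacity_threshold]
    pruned = set(prune)
    # Surviving points with large gradients mark under-reconstructed regions.
    densify = [
        i for i, p in enumerate(points)
        if i not in pruned and p.grad_magnitude > grad_threshold
    ]
    return densify, prune


def reset_opacity(points: List[GaussianPoint], value: float = 0.01) -> None:
    """Periodic all-points opacity reset: cap every opacity at a small value."""
    for p in points:
        p.opacity = min(p.opacity, value)
```

Because the reset in this sketch touches every point uniformly, it has no notion of which regions actually render badly, which is the gap the abstract says a localized strategy is meant to address.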

[CV-36] Conv-INR: Convolutional Implicit Neural Representation for Multimodal Visual Signals

Link: https://arxiv.org/abs/2406.04249
Author: Zhicheng Cai
Keywords: Implicit neural representation, Implicit neural, neural representation, recently emerged, promising paradigm
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Implicit neural representation (INR) has recently emerged as a promising paradigm for signal representations. Typically, INR is parameterized by a multilayer perceptron (MLP) which takes the coordinates as the inputs and generates corresponding attributes of a signal. However, MLP-based INRs face two critical issues: i) individually considering each coordinate while ignoring the connections; ii) suffering from the spectral bias thus failing to learn high-frequency components. Since target visual signals usually exhibit strong local structures and neighborhood dependencies, and high-frequency components are significant in these signals, the issues harm the representational capacity of INRs. This paper proposes Conv-INR, the first INR model fully based on convolution. Due to the inherent attributes of convolution, Conv-INR can simultaneously consider adjacent coordinates and learn high-frequency components effectively. Compared to existing MLP-based INRs, Conv-INR has better representational capacity and trainability without requiring primary function expansion. We conduct extensive experiments on four tasks, including image fitting, CT/MRI reconstruction, and novel view synthesis; in all of them, Conv-INR significantly surpasses existing MLP-based INRs, validating its effectiveness. Finally, we propose three reparameterization methods that can further enhance the performance of the vanilla Conv-INR without introducing any extra inference cost.

[CV-37] Understanding Information Storage and Transfer in Multi-modal Large Language Models

Link: https://arxiv.org/abs/2406.04236
Authors: Samyadeep Basu, Martin Grayson, Cecily Morrison, Besmira Nushi, Soheil Feizi, Daniela Massiceti
Keywords: model understanding progress, driving model understanding, Large Language Models, transfer in Transformer-based, understanding progress
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 20 pages

Abstract:Understanding the mechanisms of information storage and transfer in Transformer-based models is important for driving model understanding progress. Recent work has studied these mechanisms for Large Language Models (LLMs), revealing insights on how information is stored in a model’s parameters and how information flows to and from these parameters in response to specific prompts. However, these studies have not yet been extended to Multi-modal Large Language Models (MLLMs). Given their expanding capabilities and real-world use, we start by studying one aspect of these models – how MLLMs process information in a factual visual question answering task. We use a constraint-based formulation which views a visual question as having a set of visual or textual constraints that the model’s generated answer must satisfy to be correct (e.g. What movie directed by the director in this photo has won a Golden Globe?). Under this setting, we contribute i) a method that extends causal information tracing from pure language to the multi-modal setting, and ii) VQA-Constraints, a test-bed of 9.7K visual questions annotated with constraints. We use these tools to study two open-source MLLMs, LLaVa and multi-modal Phi-2. Our key findings show that these MLLMs rely on MLP and self-attention blocks in much earlier layers for information storage, compared to LLMs whose mid-layer MLPs are more important. We also show that a consistent small subset of visual tokens output by the vision encoder are responsible for transferring information from the image to these causal blocks. We validate these mechanisms by introducing MultEdit, a model-editing algorithm that can correct errors and insert new long-tailed information into MLLMs by targeting these causal blocks.

[CV-38] M3LEO: A Multi-Modal Multi-Label Earth Observation Dataset Integrating Interferometric SAR and RGB Data

Link: https://arxiv.org/abs/2406.04230
Authors: Matthew J Allen, Francisco Dorr, Joseph Alejandro Gallego Mejia, Laura Martínez-Ferrer, Anna Jungbluth, Freddie Kalaitzis, Raúl Ramos-Pollán
Keywords: Satellite-based remote sensing, rapidly evolving world, address global challenges, Satellite-based remote, evolving world
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 9 pages, 2 figures

Abstract:Satellite-based remote sensing has revolutionised the way we address global challenges in a rapidly evolving world. Huge quantities of Earth Observation (EO) data are generated by satellite sensors daily, but processing these large datasets for use in ML pipelines is technically and computationally challenging. Specifically, different types of EO data are often hosted on a variety of platforms, with differing availability for Python preprocessing tools. In addition, spatial alignment across data sources and data tiling can present significant technical hurdles for novice users. While some preprocessed EO datasets exist, their content is often limited to optical or near-optical wavelength data, which is ineffective at night or in adverse weather conditions. Synthetic Aperture Radar (SAR), an active sensing technique based on microwave length radiation, offers a viable alternative. However, the application of machine learning to SAR has been limited due to a lack of ML-ready data and pipelines, particularly for the full diversity of SAR data, including polarimetry, coherence and interferometry. We introduce M3LEO, a multi-modal, multi-label EO dataset that includes polarimetric, interferometric, and coherence SAR data derived from Sentinel-1, alongside Sentinel-2 RGB imagery and a suite of labelled tasks for model evaluation. M3LEO spans 17.5TB and contains approximately 10M data chips across six geographic regions. The dataset is complemented by a flexible PyTorch Lightning framework, with configuration management using Hydra. We provide tools to process any dataset available on popular platforms such as Google Earth Engine for integration with our framework. Initial experiments validate the utility of our data and framework, showing that SAR imagery contains information additional to that extractable from RGB data. Data at this http URL, and code at this http URL.

[CV-39] R-CONV: An Analytical Approach for Efficient Data Reconstruction via Convolutional Gradients

Link: https://arxiv.org/abs/2406.04227
Authors: Tamer Ahmed Eltaras, Qutaibah Malluhi, Alessandro Savino, Stefano Di Carlo, Adnan Qayyum, Junaid Qadir
Keywords: exchanging raw data, federated learning, effort to learn, learn from extensive, extensive collections
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In the effort to learn from extensive collections of distributed data, federated learning has emerged as a promising approach for preserving privacy by using a gradient-sharing mechanism instead of exchanging raw data. However, recent studies show that private training data can be leaked through many gradient attacks. While previous analytical-based attacks have successfully reconstructed input data from fully connected layers, their effectiveness diminishes when applied to convolutional layers. This paper introduces an advanced data leakage method to efficiently exploit convolutional layers’ gradients. We present a surprising finding: even with non-fully invertible activation functions, such as ReLU, we can analytically reconstruct training samples from the gradients. To the best of our knowledge, this is the first analytical approach that successfully reconstructs convolutional layer inputs directly from the gradients, bypassing the need to reconstruct layers’ outputs. Prior research has mainly concentrated on the weight constraints of convolution layers, overlooking the significance of gradient constraints. Our findings demonstrate that existing analytical methods used to estimate the risk of gradient attacks lack accuracy. In some layers, attacks can be launched with less than 5% of the reported constraints.
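
For background on the fully connected case that this abstract contrasts with, the sketch below shows why a linear layer's gradients reveal its input for a single sample: with y = W x + b, dL/dW[i][j] = dL/dy[i] * x[j] and dL/db[i] = dL/dy[i], so dividing a row of the weight gradient by the matching bias gradient returns x. This is the well-known fully connected result, not the R-CONV method for convolutional layers; the function names are hypothetical and a batch size of one is assumed.

```python
from typing import List, Tuple


def fc_gradients(
    x: List[float], grad_output: List[float]
) -> Tuple[List[List[float]], List[float]]:
    """Gradients of a linear layer y = W x + b given dL/dy.

    dL/dW[i][j] = dL/dy[i] * x[j] and dL/db[i] = dL/dy[i].
    """
    grad_w = [[g * xj for xj in x] for g in grad_output]
    grad_b = list(grad_output)
    return grad_w, grad_b


def reconstruct_input(
    grad_w: List[List[float]], grad_b: List[float]
) -> List[float]:
    """Recover x from shared gradients using a row with nonzero bias gradient."""
    # Pick the row with the largest |dL/db| for numerical stability.
    row = max(range(len(grad_b)), key=lambda i: abs(grad_b[i]))
    if grad_b[row] == 0.0:
        raise ValueError("all bias gradients are zero; input is not recoverable")
    return [g / grad_b[row] for g in grad_w[row]]
```

The division only works when at least one output unit has a nonzero gradient, which is the kind of constraint counting the abstract refers to when it discusses estimating the risk of gradient attacks.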

[CV-40] Matching Anything by Segmenting Anything

Link: https://arxiv.org/abs/2406.04221
Authors: Siyuan Li, Lei Ke, Martin Danelljan, Luigi Piccinelli, Mattia Segu, Luc Van Gool, Fisher Yu
Keywords: scenes is crucial, MASA, video frames, frames in complex, complex scenes
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2024 Highlight. code at: this https URL

Abstract:The robust association of the same objects across video frames in complex scenes is crucial for many applications, especially Multiple Object Tracking (MOT). Current methods predominantly rely on labeled domain-specific video datasets, which limits the cross-domain generalization of learned similarity embeddings. We propose MASA, a novel method for robust instance association learning, capable of matching any objects within videos across diverse domains without tracking labels. Leveraging the rich object segmentation from the Segment Anything Model (SAM), MASA learns instance-level correspondence through exhaustive data transformations. We treat the SAM outputs as dense object region proposals and learn to match those regions from a vast image collection. We further design a universal MASA adapter which can work in tandem with foundational segmentation or detection models and enable them to track any detected objects. Those combinations present strong zero-shot tracking ability in complex domains. Extensive tests on multiple challenging MOT and MOTS benchmarks indicate that the proposed method, using only unlabeled static images, achieves even better performance than state-of-the-art methods trained with fully annotated in-domain video sequences, in zero-shot association. Project Page: this https URL

[CV-41] CDMamba: Remote Sensing Image Change Detection with Mamba

Link: https://arxiv.org/abs/2406.04207
Authors: Haotian Zhang, Keyan Chen, Chenyang Liu, Hao Chen, Zhengxia Zou, Zhenwei Shi
Keywords: Mamba architecture based, demonstrated remarkable performance, natural language processing, language processing tasks, state space models
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recently, the Mamba architecture based on state space models has demonstrated remarkable performance in a series of natural language processing tasks and has been rapidly applied to remote sensing change detection (CD) tasks. However, most methods enhance the global receptive field by directly modifying the scanning mode of Mamba, neglecting the crucial role that local information plays in dense prediction tasks (e.g., CD). In this article, we propose a model called CDMamba, which effectively combines global and local features for handling CD tasks. Specifically, the Scaled Residual ConvMamba (SRCM) block is proposed to utilize the ability of Mamba to extract global features and convolution to enhance the local details, to alleviate the issue that current Mamba-based methods lack detailed clues and are difficult to achieve fine detection in dense prediction tasks. Furthermore, considering the characteristics of bi-temporal feature interaction required for CD, the Adaptive Global Local Guided Fusion (AGLGF) block is proposed to dynamically facilitate the bi-temporal interaction guided by other temporal global/local features. Our intuition is that more discriminative change features can be acquired with the guidance of other temporal features. Extensive experiments on three datasets demonstrate that our proposed CDMamba outperforms the current state-of-the-art methods. Our code will be open-sourced at this https URL.

[CV-42] Diffusion-based image inpainting with internal learning

Link: https://arxiv.org/abs/2406.04206
Authors: Nicolas Cherel, Andrés Almansa, Yann Gousseau, Alasdair Newson
Keywords: Diffusion models, image, lightweight diffusion models, image restoration, image generation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 pages, 4 figures. EUSIPCO 2024

Abstract:Diffusion models are now the undisputed state-of-the-art for image generation and image restoration. However, they require large amounts of computational power for training and inference. In this paper, we propose lightweight diffusion models for image inpainting that can be trained on a single image, or a few images. We show that our approach competes with large state-of-the-art models in specific cases. We also show that training a model on a single image is particularly relevant for image acquisition modalities that differ from the RGB images of standard learning databases. We show results in three different contexts: texture images, line drawing images, and materials BRDF, for which we achieve state-of-the-art results in terms of realism, with a computational load that is greatly reduced compared to concurrent methods.

[CV-43] Encoding Semantic Priors into the Weights of Implicit Neural Representation

Link: https://arxiv.org/abs/2406.04178
Authors: Zhicheng Cai, Qiu Shen
Keywords: Implicit neural representation, Implicit neural, INR, semantic information, semantic
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICME 2024

Abstract:Implicit neural representation (INR) has recently emerged as a promising paradigm for signal representations, which takes coordinates as inputs and generates corresponding signal values. Since these coordinates contain no semantic features, INR fails to take any semantic information into consideration. However, semantic information has been proven critical in many vision tasks, especially for visual signal representation. This paper proposes a reparameterization method termed as SPW, which encodes the semantic priors to the weights of INR, thus making INR contain semantic information implicitly and enhancing its representational capacity. Specifically, SPW uses the Semantic Neural Network (SNN) to extract both low- and high-level semantic information of the target visual signal and generates the semantic vector, which is input into the Weight Generation Network (WGN) to generate the weights of INR model. Finally, INR uses the generated weights with semantic priors to map the coordinates to the signal values. After training, we only retain the generated weights while abandoning both SNN and WGN, thus SPW introduces no extra costs in inference. Experimental results show that SPW can improve the performance of various INR models significantly on various tasks, including image fitting, CT reconstruction, MRI reconstruction, and novel view synthesis. Further experiments illustrate that model with SPW has lower weight redundancy and learns more novel representations, validating the effectiveness of SPW.

[CV-44] A Voxel-based Approach for Simulating Microbial Decomposition in Soil: Comparison with LBM and Improvement of Morphological Models

Link: https://arxiv.org/abs/2406.04177
Authors: Mouad Klai, Olivier Monga, Mohamed Soufiane Jouini, Valérie Pot
Keywords: micro-computed tomography, organic matter, microbial decomposition, complex soil matrix, Network Geometrical Modelling
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint submitted to IEEE Access

Abstract:This study presents a new computational approach for simulating the microbial decomposition of organic matter from 3D micro-computed tomography (micro-CT) images of soil. The method employs a valuated graph of connected voxels to simulate transformation and diffusion processes involved in microbial decomposition within the complex soil matrix. The resulting model can be adapted to simulate any diffusion-transformation processes in porous media. We implemented parallelization strategies and explored different numerical methods, including implicit, explicit, synchronous, and asynchronous schemes. To validate our method, we compared simulation outputs with those provided by LBioS and by Mosaic models. LBioS uses a lattice-Boltzmann method for diffusion and Mosaic takes advantage of Pore Network Geometrical Modelling (PNGM) by means of geometrical primitives such as spheres and ellipsoids. This approach achieved comparable results to traditional LBM-based simulations, but required only one-fourth of the computing time. Compared to Mosaic simulation, the proposed method is slower but more accurate and does not require any calibration. Furthermore, we present a theoretical framework and an application example to enhance PNGM-based simulations. This is accomplished by approximating the diffusional conductance coefficients using stochastic gradient descent and data generated by the current approach.

[CV-45] Sparse Multi-baseline SAR Cross-modal 3D Reconstruction of Vehicle Targets

Link: https://arxiv.org/abs/2406.04158
Authors: Da Li, Guoqiang Zhao, Houjun Sun, Jiacheng Bao
Keywords: faces significant challenges, significant challenges due, Multi-baseline SAR, imaging faces significant, SAR
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Abstract:Multi-baseline SAR 3D imaging faces significant challenges due to data sparsity. In recent years, deep learning techniques have achieved notable success in enhancing the quality of sparse SAR 3D imaging. However, previous work typically relies on full-aperture high-resolution radar images to supervise the training of deep neural networks (DNNs), utilizing only single-modal information from radar data. Consequently, imaging performance is limited, and acquiring full-aperture data for multi-baseline SAR is costly and sometimes impractical in real-world applications. In this paper, we propose a Cross-Modal Reconstruction Network (CMR-Net), which integrates differentiable rendering and cross-modal supervision with optical images to reconstruct highly sparse multi-baseline SAR 3D images of vehicle targets into visually structured and high-resolution images. We meticulously designed the network architecture and training strategies to enhance network generalization capability. Remarkably, CMR-Net, trained solely on simulated data, demonstrates high-resolution reconstruction capabilities on both publicly available simulation datasets and real measured datasets, outperforming traditional sparse reconstruction algorithms based on compressed sensing and other learning-based methods. Additionally, using optical images as supervision provides a cost-effective way to build training datasets, reducing the difficulty of method dissemination. Our work showcases the broad prospects of deep learning in multi-baseline SAR 3D imaging and offers a novel path for researching radar imaging based on cross-modal learning theory.

[CV-46] Improving Physics-Augmented Continuum Neural Radiance Field-Based Geometry-Agnostic System Identification with Lagrangian Particle Optimization

Link: https://arxiv.org/abs/2406.04155
Authors: Takuhiro Kaneko
Keywords: Geometry-agnostic system identification, Geometry-agnostic system, Eulerian grid representations, video sequences, grid representations
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: Accepted to CVPR 2024. Project page: this https URL

Abstract:Geometry-agnostic system identification is a technique for identifying the geometry and physical properties of an object from video sequences without any geometric assumptions. Recently, physics-augmented continuum neural radiance fields (PAC-NeRF) has demonstrated promising results for this technique by utilizing a hybrid Eulerian-Lagrangian representation, in which the geometry is represented by the Eulerian grid representations of NeRF, the physics is described by a material point method (MPM), and they are connected via Lagrangian particles. However, a notable limitation of PAC-NeRF is that its performance is sensitive to the learning of the geometry from the first frames owing to its two-step optimization. First, the grid representations are optimized with the first frames of video sequences, and then the physical properties are optimized through video sequences utilizing the fixed first-frame grid representations. This limitation can be critical when learning of the geometric structure is difficult, for example, in a few-shot (sparse view) setting. To overcome this limitation, we propose Lagrangian particle optimization (LPO), in which the positions and features of particles are optimized through video sequences in Lagrangian space. This method allows for the optimization of the geometric structure across the entire video sequence within the physical constraints imposed by the MPM. The experimental results demonstrate that the LPO is useful for geometric correction and physical identification in sparse-view settings.

[CV-47] Redundancy-aware Action Spaces for Robot Learning

Link: https://arxiv.org/abs/2406.04144
Authors: Pietro Mazzaglia, Nicholas Backshall, Xiao Ma, Stephen James
Keywords: dominant action modes, robot learning literature, controlling robot arms, task space control, modes for controlling
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Published in the RA-L journal

Abstract:Joint space and task space control are the two dominant action modes for controlling robot arms within the robot learning literature. Actions in joint space provide precise control over the robot’s pose, but tend to suffer from inefficient training; actions in task space boast data-efficient training but sacrifice the ability to perform tasks in confined spaces due to limited control over the full joint configuration. This work analyses the criteria for designing action spaces for robot manipulation and introduces ER (End-effector Redundancy), a novel action space formulation that, by addressing the redundancies present in the manipulator, aims to combine the advantages of both joint and task spaces, offering fine-grained comprehensive control with overactuated robot arms whilst achieving highly efficient robot learning. We present two implementations of ER, ERAngle (ERA) and ERJoint (ERJ), and we show that ERJ in particular demonstrates superior performance across multiple settings, especially when precise control over the robot configuration is required. We validate our results both in simulated and real robotic environments.

[CV-48] The 3D-PC: a benchmark for visual perspective taking in humans and machines

Link: https://arxiv.org/abs/2406.04138
Authors: Drew Linsley, Peisen Zhou, Alekh Karkada Ashok, Akash Nagaraj, Gaurav Gaonkar, Francis E Lewis, Zygmunt Pizlo, Thomas Serre
Keywords: Visual perspective taking, perspective taking, Visual perspective, perceive and reason, DNNs
Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)

Abstract:Visual perspective taking (VPT) is the ability to perceive and reason about the perspectives of others. It is an essential feature of human intelligence, which develops over the first decade of life and requires an ability to process the 3D structure of visual scenes. A growing number of reports have indicated that deep neural networks (DNNs) become capable of analyzing 3D scenes after training on large image datasets. We investigated if this emergent ability for 3D analysis in DNNs is sufficient for VPT with the 3D perception challenge (3D-PC): a novel benchmark for 3D perception in humans and DNNs. The 3D-PC is comprised of three 3D-analysis tasks posed within natural scene images: 1. a simple test of object depth order, 2. a basic VPT task (VPT-basic), and 3. another version of VPT (VPT-Strategy) designed to limit the effectiveness of “shortcut” visual strategies. We tested human participants (N=33) and linearly probed or text-prompted over 300 DNNs on the challenge and found that nearly all of the DNNs approached or exceeded human accuracy in analyzing object depth order. Surprisingly, DNN accuracy on this task correlated with their object recognition performance. In contrast, there was an extraordinary gap between DNNs and humans on VPT-basic. Humans were nearly perfect, whereas most DNNs were near chance. Fine-tuning DNNs on VPT-basic brought them close to human performance, but they, unlike humans, dropped back to chance when tested on VPT-perturb. Our challenge demonstrates that the training routines and architectures of today’s DNNs are well-suited for learning basic 3D properties of scenes and objects but are ill-suited for reasoning about these properties like humans do. We release our 3D-PC datasets and code to help bridge this gap in 3D perception between humans and machines.

[CV-49] LenslessFace: An End-to-End Optimized Lensless System for Privacy-Preserving Face Verification

Link: https://arxiv.org/abs/2406.04129
Authors: Xin Cai, Hailong Zhang, Chenchen Wang, Wentao Liu, Jinwei Gu, Tianfan Xue
Keywords: encode light directly, innovatively replacing traditional, replacing traditional lenses, flat optics, innovatively replacing
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: under review

Abstract:Lensless cameras, innovatively replacing traditional lenses with ultra-thin, flat optics, encode light directly onto sensors, producing images that are not immediately recognizable. This compact, lightweight, and cost-effective imaging solution offers inherent privacy advantages, making it attractive for privacy-sensitive applications like face verification. Typical lensless face verification adopts a two-stage process of reconstruction followed by verification, incurring privacy risks from reconstructed faces and high computational costs. This paper presents an end-to-end optimization approach for privacy-preserving face verification directly on encoded lensless captures, ensuring that the entire software pipeline remains encoded with no visible faces as intermediate results. To achieve this, we propose several techniques to address unique challenges from the lensless setup which precludes traditional face detection and alignment. Specifically, we propose a face center alignment scheme, an augmentation curriculum to build robustness against variations, and a knowledge distillation method to smooth optimization and enhance performance. Evaluations under both simulation and real environment demonstrate our method outperforms two-stage lensless verification while enhancing privacy and efficiency. Project website: this http URL.

[CV-50] Global Parameterization-based Texture Space Optimization

Link: https://arxiv.org/abs/2406.04115
Authors: Wei Chen, Yuxue Ren, Na Lei, Zhongxuan Luo, Xianfeng Gu
Keywords: texture space, computer graphics, common technology, area of computer, loose texture space
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Preprint submitted to Comput. Math. Math. Phys

Abstract:Texture mapping is a common technology in the area of computer graphics; it maps the 3D surface space onto the 2D texture space. However, a loose texture space reduces the efficiency of data storage and GPU memory addressing in the rendering process. Many of the existing methods focus on repacking given textures, but they still suffer from high computational cost and hardly produce a wholly tight texture space. In this paper, we propose a method to optimize the texture space and produce a new, compact texture mapping based on global parameterization. The proposed method is computationally robust and efficient. Experiments show the effectiveness of the proposed method and its potential for improving storage and rendering efficiency.

[CV-51] UrbanSARFloods: Sentinel-1 SLC-Based Benchmark Dataset for Urban and Open-Area Flood Mapping

Link: https://arxiv.org/abs/2406.04111
Authors: Jie Zhao, Zhitong Xiong, Xiao Xiang Zhu
Keywords: Synthetic Aperture Radar, satellite Synthetic Aperture, Aperture Radar, Synthetic Aperture, providing global coverage
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: Accepted by CVPR 2024 EarthVision Workshop

Abstract:Due to its cloud-penetrating capability and independence from solar illumination, satellite Synthetic Aperture Radar (SAR) is the preferred data source for large-scale flood mapping, providing global coverage and including various land cover classes. However, most studies on large-scale SAR-derived flood mapping using deep learning algorithms have primarily focused on flooded open areas, utilizing available open-access datasets (e.g., Sen1Floods11) and with limited attention to urban floods. To address this gap, we introduce UrbanSARFloods, a floodwater dataset featuring pre-processed Sentinel-1 intensity data and interferometric coherence imagery acquired before and during flood events. It contains 8,879 512 × 512 chips covering 807,500 km² across 20 land cover classes and 5 continents, spanning 18 flood events. We used UrbanSARFloods to benchmark existing state-of-the-art convolutional neural networks (CNNs) for segmenting open and urban flood areas. Our findings indicate that prevalent approaches, including the Weighted Cross-Entropy (WCE) loss and the application of transfer learning with pretrained models, fall short in overcoming the obstacles posed by imbalanced data and the constraints of a small training dataset. Urban flood detection remains challenging. Future research should explore strategies for addressing imbalanced data challenges and investigate transfer learning’s potential for SAR-based large-scale flood mapping. Besides, expanding this dataset to include additional flood events holds promise for enhancing its utility and contributing to advancements in flood mapping techniques.

[CV-52] Multistep Distillation of Diffusion Models via Moment Matching

Link: https://arxiv.org/abs/2406.04103
Authors: Tim Salimans, Thomas Mensink, Jonathan Heek, Emiel Hoogeboom
Keywords: faster to sample, making diffusion models, diffusion models faster, making diffusion, diffusion models
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)

Abstract:We present a new method for making diffusion models faster to sample. The method distills many-step diffusion models into few-step models by matching conditional expectations of the clean data given noisy data along the sampling trajectory. Our approach extends recently proposed one-step methods to the multi-step case, and provides a new perspective by interpreting these approaches in terms of moment matching. By using up to 8 sampling steps, we obtain distilled models that outperform not only their one-step versions but also their original many-step teacher models, obtaining new state-of-the-art results on the ImageNet dataset. We also show promising results on a large text-to-image model where we achieve fast generation of high resolution images directly in image space, without needing autoencoders or upsamplers.

[CV-53] How Far Can We Compress Instant-NGP-Based NeRF?

Link: https://arxiv.org/abs/2406.04101
Authors: Yihang Chen, Qianyi Wu, Mehrtash Harandi, Jianfei Cai
Keywords: Neural Radiance Field, Neural Radiance, Radiance Field, demonstrated remarkable capabilities, recent years
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL. Code: this https URL. We further propose a 3DGS compression method HAC, which is based on CNC: this https URL

Abstract:In recent years, Neural Radiance Field (NeRF) has demonstrated remarkable capabilities in representing 3D scenes. To expedite the rendering process, learnable explicit representations have been introduced for combination with implicit NeRF representation, which, however, results in a large storage space requirement. In this paper, we introduce the Context-based NeRF Compression (CNC) framework, which leverages highly efficient context models to provide a storage-friendly NeRF representation. Specifically, we excavate both level-wise and dimension-wise context dependencies to enable probability prediction for information entropy reduction. Additionally, we exploit hash collision and occupancy grids as strong prior knowledge for better context modeling. To the best of our knowledge, we are the first to construct and exploit context models for NeRF compression. We achieve a size reduction of 100× and 70× with improved fidelity against the baseline Instant-NGP on Synthetic-NeRF and Tanks and Temples datasets, respectively. Additionally, we attain 86.7% and 82.3% storage size reduction against the SOTA NeRF compression method BiRF. Our code is available here: this https URL.
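
The abstract above reports compression both as a multiplicative factor (100×, 70×) and as a percentage size reduction (86.7%, 82.3%). The following minimal sketch, which is not taken from the paper, converts between the two so the figures can be compared on one scale. The function names `reduction_to_ratio` and `ratio_to_reduction` are hypothetical helpers introduced here for illustration.

```python
def reduction_to_ratio(reduction_pct: float) -> float:
    """Convert a percentage size reduction into a compression factor.

    A reduction of r% leaves (100 - r)% of the original size,
    so the factor is 100 / (100 - r).
    """
    if not 0.0 <= reduction_pct < 100.0:
        raise ValueError("reduction_pct must be in [0, 100)")
    return 100.0 / (100.0 - reduction_pct)


def ratio_to_reduction(ratio: float) -> float:
    """Convert a compression factor (e.g. 100 for 100x) into a percentage reduction."""
    if ratio < 1.0:
        raise ValueError("ratio must be >= 1")
    return 100.0 * (1.0 - 1.0 / ratio)


if __name__ == "__main__":
    for pct in (86.7, 82.3):
        print(f"{pct}% reduction is about {reduction_to_ratio(pct):.2f}x")
    for factor in (100.0, 70.0):
        print(f"{factor:.0f}x is about {ratio_to_reduction(factor):.2f}% reduction")
```

Read this way, an 86.7% reduction corresponds to roughly a 7.5× smaller file, while a 100× factor corresponds to a 99% reduction.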

[CV-54] Class-Aware Cartilage Segmentation for Autonomous US-CT Registration in Robotic Intercostal Ultrasound Imaging

Link: https://arxiv.org/abs/2406.04100
Authors: Zhongliang Jiang, Yunfeng Kang, Yuan Bi, Xuesong Li, Chenyang Li, Nassir Navab
Keywords: clinical examinations owing, Ultrasound imaging, clinical examinations, examinations owing, autonomous examination systems
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Abstract:Ultrasound imaging has been widely used in clinical examinations owing to the advantages of being portable, real-time, and radiation-free. Considering the potential of extensive deployment of autonomous examination systems in hospitals, robotic US imaging has attracted increased attention. However, due to the inter-patient variations, it is still challenging to have an optimal path for each patient, particularly for thoracic applications with limited acoustic windows, e.g., intercostal liver imaging. To address this problem, a class-aware cartilage bone segmentation network with geometry-constraint post-processing is presented to capture patient-specific rib skeletons. Then, a dense skeleton graph-based non-rigid registration is presented to map the intercostal scanning path from a generic template to individual patients. By explicitly considering the high-acoustic impedance bone structures, the transferred scanning path can be precisely located in the intercostal space, enhancing the visibility of internal organs by reducing the acoustic shadow. To evaluate the proposed approach, the final path mapping performance is validated on five distinct CTs and US data from two volunteers, resulting in ten pairs of CT-US combinations. Results demonstrate that the proposed graph-based registration method can robustly and precisely map the path from CT template to individual patients (Euclidean error: 2.21 ± 1.11 mm).

[CV-55] Interpretable Lightweight Transformer via Unrolling of Learned Graph Smoothness Priors

Link: https://arxiv.org/abs/2406.04090
Authors: Tam Thuc Do, Parham Eftekhar, Seyed Alireza Hosseini, Gene Cheung, Philip Chou
Keywords: graph Laplacian regularizer, quadratic graph Laplacian, lightweight transformer-like neural, unrolling iterative optimization, iterative optimization algorithms
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Signal Processing (eess.SP)

Abstract:We build interpretable and lightweight transformer-like neural networks by unrolling iterative optimization algorithms that minimize graph smoothness priors – the quadratic graph Laplacian regularizer (GLR) and the ℓ₁-norm graph total variation (GTV) – subject to an interpolation constraint. The crucial insight is that a normalized signal-dependent graph learning module amounts to a variant of the basic self-attention mechanism in conventional transformers. Unlike “black-box” transformers that require learning of large key, query and value matrices to compute scaled dot products as affinities and subsequent output embeddings, resulting in huge parameter sets, our unrolled networks employ shallow CNNs to learn low-dimensional features per node to establish pairwise Mahalanobis distances and construct sparse similarity graphs. At each layer, given a learned graph, the target interpolated signal is simply a low-pass filtered output derived from the minimization of an assumed graph smoothness prior, leading to a dramatic reduction in parameter count. Experiments for two image interpolation applications verify the restoration performance, parameter efficiency and robustness to covariate shift of our graph-based unrolled networks compared to conventional transformers.
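
To make the idea of minimizing a quadratic graph Laplacian regularizer under an interpolation constraint more concrete, here is a minimal NumPy sketch. It is not the authors' unrolled network: it assumes a fixed, hand-given weight matrix `W` (the paper learns the graph from features), and it solves the resulting linear system directly instead of unrolling an iterative solver. The function name `glr_interpolate` is hypothetical.

```python
import numpy as np


def glr_interpolate(W, known_idx, known_vals):
    """Interpolate missing node values by minimizing x^T L x.

    W          : symmetric non-negative adjacency (weight) matrix, shape (n, n)
    known_idx  : indices of nodes whose values are observed
    known_vals : observed values, kept fixed (the interpolation constraint)

    Assumes every connected component of the graph contains at least one
    known node, so that the sub-Laplacian over unknown nodes is invertible.
    """
    W = np.asarray(W, dtype=float)
    n = W.shape[0]
    L = np.diag(W.sum(axis=1)) - W  # combinatorial graph Laplacian

    known_idx = np.asarray(known_idx)
    unknown_idx = np.setdiff1d(np.arange(n), known_idx)

    x = np.zeros(n)
    x[known_idx] = known_vals

    # Setting the gradient of x^T L x w.r.t. the unknown entries to zero
    # gives the linear system  L_UU x_U = -L_UK x_K.
    L_UU = L[np.ix_(unknown_idx, unknown_idx)]
    L_UK = L[np.ix_(unknown_idx, known_idx)]
    x[unknown_idx] = np.linalg.solve(L_UU, -L_UK @ x[known_idx])
    return x


if __name__ == "__main__":
    # Path graph 0 - 1 - 2 with unit weights; the middle node is missing.
    W = [[0, 1, 0],
         [1, 0, 1],
         [0, 1, 0]]
    print(glr_interpolate(W, known_idx=[0, 2], known_vals=[0.0, 2.0]))
```

On the three-node path graph, the missing middle value comes out as the average of its two neighbours, which is the smooth (low-pass) behaviour the abstract describes.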

[CV-56] Semmeldetector: Application of Machine Learning in Commercial Bakeries

Link: https://arxiv.org/abs/2406.04050
Authors: Thomas H. Schmitt, Maximilian Bundscherer, Tobias Bocklet
Keywords: count baked goods, baked goods, classify and count, utilizes object detection, Semmeldetector
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:The Semmeldetector is a machine learning application that utilizes object detection models to detect, classify and count baked goods in images. Our application allows commercial bakers to track unsold baked goods, which allows them to optimize production and increase resource efficiency. We compiled a dataset comprising 1151 images that distinguishes between 18 different types of baked goods to train our detection models. To facilitate model training, we used a Copy-Paste augmentation pipeline to expand our dataset. We trained the state-of-the-art object detection model YOLOv8 on our detection task. We tested the impact of different training data, model scale, and online image augmentation pipelines on model performance. Our overall best performing model achieved an AP@0.5 of 89.1% on our test set. Based on our results, we conclude that machine learning can be a valuable tool even for unforeseen industries like bakeries, even with very limited datasets.
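
The AP@0.5 metric quoted above depends on an intersection-over-union (IoU) threshold of 0.5: under the common convention, a predicted box counts as a match only if its IoU with a ground-truth box of the same class is at least 0.5. The sketch below shows only that IoU building block, not the full average-precision computation and not the authors' evaluation code. The `(x1, y1, x2, y2)` corner format and the name `box_iou` are assumptions made for illustration.

```python
def box_iou(a, b):
    """Intersection over union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b

    # Width and height of the overlap region (zero if the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    pred = (10, 10, 50, 50)
    truth = (20, 20, 60, 60)
    iou = box_iou(pred, truth)
    print(f"IoU = {iou:.3f}, counts as a match at 0.5: {iou >= 0.5}")
```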

[CV-57] Shaping History: Advanced Machine Learning Techniques for the Analysis and Dating of Cuneiform Tablets over Three Millennia

Link: https://arxiv.org/abs/2406.04039
Authors: Danielle Kapon, Michael Fire, Shai Gordin
Keywords: fourth millennium BCE, late fourth millennium, earliest writing systems, millennium BCE, humanity earliest writing
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments: 24 pages, 18 figures

Abstract:Cuneiform tablets, emerging in ancient Mesopotamia around the late fourth millennium BCE, represent one of humanity’s earliest writing systems. Characterized by wedge-shaped marks on clay tablets, these artifacts provided insight into Mesopotamian civilization across various domains. Traditionally, the analysis and dating of these tablets rely on subjective assessment of shape and writing style, leading to uncertainties in pinpointing their exact temporal origins. Recent advances in digitization have revolutionized the study of cuneiform by enhancing accessibility and analytical capabilities. Our research uniquely focuses on the silhouette of tablets as significant indicators of their historical periods, diverging from most studies that concentrate on textual content. Utilizing an unprecedented dataset of over 94,000 images from the Cuneiform Digital Library Initiative collection, we apply deep learning methods to classify cuneiform tablets, covering over 3,000 years of history. By leveraging statistical, computational techniques, and generative modeling through Variational Auto-Encoders (VAEs), we achieve substantial advancements in the automatic classification of these ancient documents, focusing on the tablets’ silhouettes as key predictors. Our classification approach begins with a Decision Tree using height-to-width ratios and culminates with a ResNet50 model, achieving a 61% macro F1-score for tablet silhouettes. Moreover, we introduce novel VAE-powered tools to enhance explainability and enable researchers to explore changes in tablet shapes across different eras and genres. This research contributes to document analysis and diplomatics by demonstrating the value of large-scale data analysis combined with statistical methods. These insights offer valuable tools for historians and epigraphists, enriching our understanding of cuneiform tablets and the cultures that produced them.
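
The 61% figure above is a macro F1-score, which averages per-class F1 values with equal weight so that rare classes (here, presumably less-represented historical periods) count as much as common ones. The following pure-Python sketch illustrates the metric on toy labels. It is not the authors' evaluation code, and the function name `macro_f1` and the toy data are assumptions.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy example with three classes (e.g. three hypothetical period labels).
    y_true = [0, 0, 1, 1, 2, 2]
    y_pred = [0, 1, 1, 1, 2, 0]
    print(f"macro F1 = {macro_f1(y_true, y_pred):.4f}")
```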

[CV-58] Zero-Painter: Training-Free Layout Control for Text-to-Image Synthesis

Link: https://arxiv.org/abs/2406.04032
Authors: Marianna Ohanyan, Hayk Manukyan, Zhangyang Wang, Shant Navasardyan, Humphrey Shi
Keywords: framework for layout-conditional, synthesis that facilitates, training-free framework, facilitates the creation, creation of detailed
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:We present Zero-Painter, a novel training-free framework for layout-conditional text-to-image synthesis that facilitates the creation of detailed and controlled imagery from textual prompts. Our method utilizes object masks and individual descriptions, coupled with a global text prompt, to generate images with high fidelity. Zero-Painter employs a two-stage process involving our novel Prompt-Adjusted Cross-Attention (PACA) and Region-Grouped Cross-Attention (ReGCA) blocks, ensuring precise alignment of generated objects with textual prompts and mask shapes. Our extensive experiments demonstrate that Zero-Painter surpasses current state-of-the-art methods in preserving textual details and adhering to mask shapes.

[CV-59] Jailbreak Vision Language Models via Bi-Modal Adversarial Prompt

Link: https://arxiv.org/abs/2406.04031
Authors: Zonghao Ying, Aishan Liu, Tianyuan Zhang, Zhengmin Yu, Siyuan Liang, Xianglong Liu, Dacheng Tao
Keywords: uncover safety implications, safety implications, bypass guardrails, guardrails and uncover, uncover safety
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)

Abstract:In the realm of large vision language models (LVLMs), jailbreak attacks serve as a red-teaming approach to bypass guardrails and uncover safety implications. Existing jailbreaks predominantly focus on the visual modality, perturbing solely visual inputs in the prompt for attacks. However, they fall short when confronted with aligned models that fuse visual and textual features simultaneously for generation. To address this limitation, this paper introduces the Bi-Modal Adversarial Prompt Attack (BAP), which executes jailbreaks by optimizing textual and visual prompts cohesively. Initially, we adversarially embed universally harmful perturbations in an image, guided by a few-shot query-agnostic corpus (e.g., affirmative prefixes and negative inhibitions). This process ensures that the image prompts LVLMs to respond positively to any harmful queries. Subsequently, leveraging the adversarial image, we optimize textual prompts with specific harmful intent. In particular, we utilize a large language model to analyze jailbreak failures and employ chain-of-thought reasoning to refine textual prompts in a feedback-iteration manner. To validate the efficacy of our approach, we conducted extensive evaluations on various datasets and LVLMs, demonstrating that our method significantly outperforms other methods by large margins (+29.03% in attack success rate on average). Additionally, we showcase the potential of our attacks on black-box commercial LVLMs, such as Gemini and ChatGLM.

[CV-60] 3rd Place Solution for PVUW Challenge 2024: Video Panoptic Segmentation

Link: https://arxiv.org/abs/2406.04002
Authors: Ruipu Wu, Jifei Che, Han Li, Chengjing Wu, Ting Liu, Luoqi Liu
Keywords: Video panoptic segmentation, extends panoptic segmentation, panoptic segmentation, Pixel-level Video Understanding, Video panoptic
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Video panoptic segmentation is an advanced task that extends panoptic segmentation by applying its concept to video sequences. In the hope of addressing the challenge of video panoptic segmentation in diverse conditions, we utilize DVIS++ as our baseline model and enhance it by introducing a comprehensive approach centered on the query-wise ensemble, supplemented by additional techniques. Our proposed approach achieved a VPQ score of 57.01 on the VIPSeg test set, and ranked 3rd in the VPS track of the 3rd Pixel-level Video Understanding in the Wild Challenge.

[CV-61] Unveiling the Dynamics of Information Interplay in Supervised Learning

Link: https://arxiv.org/abs/2406.03999
Authors: Kun Song, Zhiquan Tan, Bochao Zou, Huimin Ma, Weiran Huang
Keywords: MIR and HDR, classification head vectors, matrix information theory, Neural Collapse, classification heads
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICML 2024

Abstract:In this paper, we use matrix information theory as an analytical tool to analyze the dynamics of the information interplay between data representations and classification head vectors in the supervised learning process. Specifically, inspired by the theory of Neural Collapse, we introduce matrix mutual information ratio (MIR) and matrix entropy difference ratio (HDR) to assess the interactions of data representation and classification heads in supervised learning, and we determine the theoretical optimal values for MIR and HDR when Neural Collapse happens. Our experiments show that MIR and HDR can effectively explain many phenomena occurring in neural networks, for example, the standard supervised training dynamics, linear mode connectivity, and the performance of label smoothing and pruning. Additionally, we use MIR and HDR to gain insights into the dynamics of grokking, which is an intriguing phenomenon observed in supervised training, where the model demonstrates generalization capabilities long after it has learned to fit the training data. Furthermore, we introduce MIR and HDR as loss terms in supervised and semi-supervised learning to optimize the information interactions among samples and classification heads. The empirical results provide evidence of the method’s effectiveness, demonstrating that the utilization of MIR and HDR not only aids in comprehending the dynamics throughout the training process but can also enhance the training procedure itself.

[CV-62] LNQ Challenge 2023: Learning Mediastinal Lymph Node Segmentation with a Probabilistic Lymph Node Atlas

Link: https://arxiv.org/abs/2406.03984
Authors: Sofija Engelson, Jan Ehrhardt, Timo Kepp, Joshua Niemeijer, Heinz Handels
Keywords: precise cancer staging, influencing subsequent decisions, achieving precise cancer, node metastases plays, lymph node metastases
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA) this https URL

Abstract:The evaluation of lymph node metastases plays a crucial role in achieving precise cancer staging, influencing subsequent decisions regarding treatment options. Lymph node detection poses challenges due to the presence of unclear boundaries and the diverse range of sizes and morphological characteristics, making it a resource-intensive process. As part of the LNQ 2023 MICCAI challenge, we propose the use of anatomical priors as a tool to address the challenges that persist in mediastinal lymph node segmentation in combination with the partial annotation of the challenge training data. The model ensemble using all suggested modifications yields a Dice score of 0.6033 and segments 57% of the ground truth lymph nodes, compared to 27% when training on CT only. Segmentation accuracy is improved significantly by incorporating a probabilistic lymph node atlas in loss weighting and post-processing. The largest performance gains are achieved by oversampling fully annotated data to account for the partial annotation of the challenge training data, as well as adding additional data augmentation to address the high heterogeneity of the CT images and lymph node appearance. Our code is available at this https URL.

[CV-63] Vectorized Conditional Neural Fields: A Framework for Solving Time-dependent Parametric Partial Differential Equations

Link: https://arxiv.org/abs/2406.03919
Authors: Jan Hagnberger,Marimuthu Kalimuthu,Daniel Musekamp,Mathias Niepert
Keywords: Partial Differential Equations, solving Partial Differential, Differential Equations, Partial Differential, solving Partial
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE); Computational Physics (physics.comp-ph)
Comments: Accepted for publication at the 41st International Conference on Machine Learning (ICML) 2024

Abstract:Transformer models are increasingly used for solving Partial Differential Equations (PDEs). Several adaptations have been proposed, all of which suffer from the typical problems of Transformers, such as quadratic memory and time complexity. Furthermore, all prevalent architectures for PDE solving lack at least one of several desirable properties of an ideal surrogate model, such as (i) generalization to PDE parameters not seen during training, (ii) spatial and temporal zero-shot super-resolution, (iii) continuous temporal extrapolation, (iv) support for 1D, 2D, and 3D PDEs, and (v) efficient inference for longer temporal rollouts. To address these limitations, we propose Vectorized Conditional Neural Fields (VCNeFs), which represent the solution of time-dependent PDEs as neural fields. Contrary to prior methods, however, VCNeFs compute, for a set of multiple spatio-temporal query points, their solutions in parallel and model their dependencies through attention mechanisms. Moreover, VCNeF can condition the neural field on both the initial conditions and the parameters of the PDEs. An extensive set of experiments demonstrates that VCNeFs are competitive with and often outperform existing ML-based surrogate models.

[CV-64] Frequency-based Matcher for Long-tailed Semantic Segmentation

Link: https://arxiv.org/abs/2406.03917
Authors: Shan Li,Lu Yang,Pu Cao,Liulei Li,Huadong Ma
Keywords: semantic segmentation technology, computer vision community, semantic segmentation, long-tailed semantic segmentation, segmentation technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for publication as a Regular paper in the IEEE Transactions on Multimedia

Abstract:The successful application of semantic segmentation technology in the real world has been among the most exciting achievements in the computer vision community over the past decade. Although the long-tailed phenomenon has been investigated in many fields, e.g., classification and object detection, it has not received enough attention in semantic segmentation and has become a non-negligible obstacle to applying semantic segmentation technology in autonomous driving and virtual reality. Therefore, in this work, we focus on a relatively under-explored task setting, long-tailed semantic segmentation (LTSS). We first establish three representative datasets from different aspects, i.e., scene, object, and human. We further propose a dual-metric evaluation system and construct the LTSS benchmark to demonstrate the performance of semantic segmentation methods and long-tailed solutions. We also propose a transformer-based algorithm to improve LTSS, frequency-based matcher, which solves the oversuppression problem by one-to-many matching and automatically determines the number of matching queries for each class. Given the comprehensiveness of this work and the importance of the issues revealed, this work aims to promote the empirical study of semantic segmentation tasks. Our datasets, codes, and models will be publicly available.

[CV-65] ArMeme: Propagandistic Content in Arabic Memes

Link: https://arxiv.org/abs/2406.03916
Authors: Firoj Alam,Abul Hasnat,Fatema Ahmed,Md Arid Hasan,Maram Hasanain
Keywords: digital communication, mislead audiences, rise of digital, cultural and political, political expression
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: disinformation, misinformation, factuality, harmfulness, fake news, propaganda, multimodality, text, images

Abstract:With the rise of digital communication, memes have become a significant medium for cultural and political expression that is often used to mislead audiences. Identification of such misleading and persuasive multimodal content has become more important among various stakeholders, including social media platforms, policymakers, and the broader society, as it often causes harm to individuals, organizations, and/or society. While there have been efforts to develop AI-based automatic systems for resource-rich languages (e.g., English), there has been little to none for medium- to low-resource languages. In this study, we focused on developing an Arabic memes dataset with manual annotations of propagandistic content. We annotated ~6K Arabic memes collected from various social media platforms, which is a first resource for Arabic multimodal research. We provide a comprehensive analysis aiming to develop computational tools for their detection. We will make them publicly available for the community.

[CV-66] Exploring the Zero-Shot Capabilities of Vision-Language Models for Improving Gaze Following

Link: https://arxiv.org/abs/2406.03907
Authors: Anshul Gupta,Pierre Vuillecard,Arya Farkhondeh,Jean-Marc Odobez
Keywords: Contextual cues related, provide valuable information, Contextual cues, pose and interactions, interactions with objects
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at the GAZE Workshop at CVPR 2024

Abstract:Contextual cues related to a person’s pose and interactions with objects and other people in the scene can provide valuable information for gaze following. While existing methods have focused on dedicated cue extraction methods, in this work we investigate the zero-shot capabilities of Vision-Language Models (VLMs) for extracting a wide array of contextual cues to improve gaze following performance. We first evaluate various VLMs, prompting strategies, and in-context learning (ICL) techniques for zero-shot cue recognition performance. We then use these insights to extract contextual cues for gaze following, and investigate their impact when incorporated into a state of the art model for the task. Our analysis indicates that BLIP-2 is the overall top performing VLM and that ICL can improve performance. We also observe that VLMs are sensitive to the choice of the text prompt although ensembling over multiple text prompts can provide more robust performance. Additionally, we discover that using the entire image along with an ellipse drawn around the target person is the most effective strategy for visual prompting. For gaze following, incorporating the extracted cues results in better generalization performance, especially when considering a larger set of cues, highlighting the potential of this approach.

[CV-67] Decay Pruning Method: Smooth Pruning With a Self-Rectifying Procedure

Link: https://arxiv.org/abs/2406.03879
Authors: Minghao Yang,Linlin Gao,Pengyuan Li,Wenbo Li,Yihong Dong,Zhiying Cui
Keywords: Current structured pruning, considerable accuracy drops, accuracy drops due, Current structured, structured pruning methods
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Current structured pruning methods often result in considerable accuracy drops due to abrupt network changes and loss of information from pruned structures. To address these issues, we introduce the Decay Pruning Method (DPM), a novel smooth pruning approach with a self-rectifying mechanism. DPM consists of two key components: (i) Smooth Pruning: It converts conventional single-step pruning into multi-step smooth pruning, gradually reducing redundant structures to zero over N steps with ongoing optimization. (ii) Self-Rectifying: This procedure further enhances the aforementioned process by rectifying sub-optimal pruning based on gradient information. Our approach demonstrates strong generalizability and can be easily integrated with various existing pruning methods. We validate the effectiveness of DPM by integrating it with three popular pruning methods: OTOv2, Depgraph, and Gate Decorator. Experimental results show consistent improvements in performance compared to the original pruning methods, along with further reductions of FLOPs in most scenarios.
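
The "smooth pruning" idea in the abstract, shrinking selected structures to zero over N steps instead of removing them in one shot, can be sketched as follows. This is only an illustration under stated assumptions: the linear schedule, the names `decay_factor` and `smooth_prune`, and the toy weights are hypothetical, and the sketch leaves out both the ongoing optimization between steps and the gradient-based self-rectifying procedure.

```python
def decay_factor(step, n_steps):
    """Scale applied at `step`: 1.0 at step 0, 0.0 at step n_steps.

    A linear schedule is an assumption of this sketch; the abstract does not
    state the shape of the decay.
    """
    if n_steps <= 0:
        raise ValueError("n_steps must be positive")
    return max(0.0, 1.0 - step / n_steps)


def smooth_prune(weights, pruned_idx, n_steps):
    """Yield one weight snapshot per step while the selected entries decay."""
    original = list(weights)
    for step in range(1, n_steps + 1):
        factor = decay_factor(step, n_steps)
        # Entries marked for pruning shrink; the others are left untouched.
        yield [w * factor if i in pruned_idx else w
               for i, w in enumerate(original)]


# Toy example: three "channels", the middle one is marked for pruning.
snapshots = list(smooth_prune([0.8, -0.5, 1.2], pruned_idx={1}, n_steps=4))
```

In a real training loop the remaining weights would keep being optimized between snapshots, which is what lets the network adapt to the gradual change.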

[CV-68] Bench2Drive: Towards Multi-Ability Benchmarking of Closed-Loop End-To-End Autonomous Driving

Link: https://arxiv.org/abs/2406.03877
Authors: Xiaosong Jia,Zhenjie Yang,Qifeng Li,Zhiyuan Zhang,Junchi Yan
Keywords: autonomous driving technologies, autonomous driving, rapid scaling, era marked, technologies are approaching
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Abstract:In an era marked by the rapid scaling of foundation models, autonomous driving technologies are approaching a transformative threshold where end-to-end autonomous driving (E2E-AD) emerges due to its potential of scaling up in the data-driven manner. However, existing E2E-AD methods are mostly evaluated under the open-loop log-replay manner with L2 errors and collision rate as metrics (e.g., in nuScenes), which could not fully reflect the driving performance of algorithms as recently acknowledged in the community. For those E2E-AD methods evaluated under the closed-loop protocol, they are tested in fixed routes (e.g., Town05Long and Longest6 in CARLA) with the driving score as metrics, which is known for high variance due to the unsmoothed metric function and large randomness in the long route. Besides, these methods usually collect their own data for training, which makes algorithm-level fair comparison infeasible. To fulfill the paramount need of comprehensive, realistic, and fair testing environments for Full Self-Driving (FSD), we present Bench2Drive, the first benchmark for evaluating E2E-AD systems’ multiple abilities in a closed-loop manner. Bench2Drive’s official training data consists of 2 million fully annotated frames, collected from 10000 short clips uniformly distributed under 44 interactive scenarios (cut-in, overtaking, detour, etc), 23 weathers (sunny, foggy, rainy, etc), and 12 towns (urban, village, university, etc) in CARLA v2. Its evaluation protocol requires E2E-AD models to pass 44 interactive scenarios under different locations and weathers which sums up to 220 routes and thus provides a comprehensive and disentangled assessment about their driving capability under different situations. We implement state-of-the-art E2E-AD models and evaluate them in Bench2Drive, providing insights regarding current status and future directions. 

[CV-69] Quantum Implicit Neural Representations

Link: https://arxiv.org/abs/2406.03873
Authors: Jiaming Zhao,Wenbo Qiao,Peng Zhang,Hui Gao
Keywords: neural networks, Implicit neural representations, Fourier Neural Networks, neural, Implicit neural
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper was accepted by ICML 2024

Abstract:Implicit neural representations have emerged as a powerful paradigm to represent signals such as images and sounds. This approach aims to utilize neural networks to parameterize the implicit function of the signal. However, when representing implicit functions, traditional neural networks such as ReLU-based multilayer perceptrons face challenges in accurately modeling high-frequency components of signals. Recent research has begun to explore the use of Fourier Neural Networks (FNNs) to overcome this limitation. In this paper, we propose Quantum Implicit Representation Network (QIREN), a novel quantum generalization of FNNs. Furthermore, through theoretical analysis, we demonstrate that QIREN possesses a quantum advantage over classical FNNs. Lastly, we conducted experiments in signal representation, image superresolution, and image generation tasks to show the superior performance of QIREN compared to state-of-the-art (SOTA) models. Our work not only incorporates quantum advantages into implicit neural representations but also uncovers a promising application direction for Quantum Neural Networks.

[CV-70] LLplace: The 3D Indoor Scene Layout Generation and Editing via Large Language Model

Link: https://arxiv.org/abs/2406.03866
Authors: Yixuan Yang,Junru Lu,Zixiang Zhao,Zhen Luo,James J.Q. Yu,Victor Sanchez,Feng Zheng
Keywords: automated space planning, Large Language Models, proprietary Large Language, virtual reality, space planning
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Designing 3D indoor layouts is a crucial task with significant applications in virtual reality, interior design, and automated space planning. Existing methods for 3D layout design either rely on diffusion models, which utilize spatial relationship priors, or heavily leverage the inferential capabilities of proprietary Large Language Models (LLMs), which require extensive prompt engineering and in-context exemplars via black-box trials. These methods often face limitations in generalization and dynamic scene editing. In this paper, we introduce LLplace, a novel 3D indoor scene layout designer based on lightweight fine-tuned open-source LLM Llama3. LLplace circumvents the need for spatial relationship priors and in-context exemplars, enabling efficient and credible room layout generation based solely on user inputs specifying the room type and desired objects. We curated a new dialogue dataset based on the 3D-Front dataset, expanding the original data volume and incorporating dialogue data for adding and removing objects. This dataset can enhance the LLM’s spatial understanding. Furthermore, through dialogue, LLplace activates the LLM’s capability to understand 3D layouts and perform dynamic scene editing, enabling the addition and removal of objects. Our approach demonstrates that LLplace can effectively generate and edit 3D indoor layouts interactively and outperform existing methods in delivering high-quality 3D design solutions. Code and dataset will be released.

[CV-71] Semantic Similarity Score for Measuring Visual Similarity at Semantic Level

Link: https://arxiv.org/abs/2406.03865
Authors: Senran Fan,Zhicheng Bao,Chen Dong,Haotai Liang,Xiaodong Xu,Ping Zhang
Keywords: semantic communication systems, communication systems, revolutionary communication architecture, Semantic communication, communication
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Semantic communication, as a revolutionary communication architecture, is considered a promising novel communication paradigm. Unlike traditional symbol-based error-free communication systems, semantic-based visual communication systems extract, compress, transmit, and reconstruct images at the semantic level. However, widely used image similarity evaluation metrics, whether pixel-based MSE or PSNR or structure-based MS-SSIM, struggle to accurately measure the loss of semantic-level information of the source during system transmission. This presents challenges in evaluating the performance of visual semantic communication systems, especially when comparing them with traditional communication systems. To address this, we propose a semantic evaluation metric, SeSS (Semantic Similarity Score), based on Scene Graph Generation and graph matching, which shifts the similarity scores between images into semantic-level graph matching scores. Meanwhile, semantic similarity scores for tens of thousands of image pairs are manually annotated to fine-tune the hyperparameters in the graph matching algorithm, aligning the metric more closely with human semantic perception. The performance of SeSS is tested on different datasets, including (1) images transmitted by traditional and semantic communication systems at different compression rates, (2) images transmitted by traditional and semantic communication systems at different signal-to-noise ratios, (3) images generated by a large-scale model with different noise levels introduced, and (4) cases of images subjected to certain special transformations. The experiments demonstrate the effectiveness of SeSS, indicating that the metric can measure semantic-level differences between images and can be used for evaluation in visual semantic communication systems.

[CV-72] From operculum and body tail movements to different coupling of physical activity and respiratory frequency in farmed gilthead sea bream and European sea bass. Insights on aquaculture biosensing

Link: https://arxiv.org/abs/2406.03859
Authors: Miguel A. Ferrer,Josep A. Calduch-Giner,Moises Díaz,Javier Sosa,Enrique Rosell-Moll,Judith Santana Abril,Graciela Santana Sosa,Tomás Bautista Delgado,Cristina Carmona,Juan Antonio Martos-Sitcha,Enric Cabruja,Juan Manuel Afonso,Aurelio Vega,Manuel Lozano,Juan Antonio Montiel-Nelson,Jaume Pérez-Sánchez
Keywords: European sea bass, European sea, AEFishBIT tri-axial accelerometer, gilthead sea bream, Sparus aurata
Categories: Computer Vision and Pattern Recognition (cs.CV); Populations and Evolution (q-bio.PE)

Abstract:The AEFishBIT tri-axial accelerometer was externally attached to the operculum to assess the divergent activity and respiratory patterns of two marine farmed fish, the gilthead sea bream (Sparus aurata) and European sea bass (Dicentrarchus labrax). Analysis of raw data from exercised fish highlighted the large amplitude of operculum aperture and body tail movements in European sea bass, which were overall more stable at low-medium exercise intensity levels. Cosinor analysis in free-swimming fish (on-board data processing) highlighted a pronounced daily rhythmicity of locomotor activity and respiratory frequency in both gilthead sea bream and European sea bass. Acrophases of activity and respiration were coupled in gilthead sea bream, with feeding time (once daily at 11:00 h) acting as a main synchronizing factor. By contrast, locomotor activity and respiratory frequency were out of phase in European sea bass, with the activity acrophase in the early morning and the respiration acrophase in the afternoon. The daily range of activity and respiration variation was also higher in European sea bass, probably as part of the adaptation of this fish species to act as a fast swimming predator. In any case, lower locomotor activity and enhanced respiration were associated with larger body weight in both fish species. This agrees with the notion that selection for fast growth in farming conditions is accompanied by a lower activity profile, which may favor an efficient feed conversion for growth purposes. Therefore, the use of behavioral monitoring is becoming a reliable and promising large-scale tool for selecting more efficient farmed fish, allowing researchers and farmers to establish stricter criteria of welfare for more sustainable and ethical fish production.
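
Cosinor analysis, mentioned in the abstract, fits a cosine with a fixed period to a time series and reports its mean level (mesor), amplitude, and time of peak (acrophase). The sketch below shows the general technique on synthetic data, not the paper's on-board processing. It assumes a 24 h period and evenly spaced samples covering whole days, which is what makes the closed-form sums valid. The name `cosinor_fit` and the signal are made up for illustration.

```python
import math


def cosinor_fit(samples, period=24.0):
    """Fit y(t) = mesor + amplitude * cos(2*pi*(t - acrophase) / period).

    `samples` is a list of (time_in_hours, value) pairs. The sums below equal
    the least-squares solution only for evenly spaced samples that cover an
    integer number of periods; irregular data needs a general solver.
    """
    n = len(samples)
    omega = 2.0 * math.pi / period
    mesor = sum(y for _, y in samples) / n
    beta = 2.0 / n * sum(y * math.cos(omega * t) for t, y in samples)
    gamma = 2.0 / n * sum(y * math.sin(omega * t) for t, y in samples)
    amplitude = math.hypot(beta, gamma)
    acrophase = (math.atan2(gamma, beta) / omega) % period  # hour of the peak
    return mesor, amplitude, acrophase


# Synthetic hourly signal over one day, built to peak at 11:00.
data = [(t, 10.0 + 3.0 * math.cos(2.0 * math.pi * (t - 11) / 24.0))
        for t in range(24)]
mesor, amplitude, acrophase = cosinor_fit(data)
```

Comparing the fitted acrophase of two signals, for example activity and respiration, is how one would tell whether they are coupled or out of phase.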

[CV-73] MuJo: Multimodal Joint Feature Space Learning for Human Activity Recognition

Link: https://arxiv.org/abs/2406.03857
Authors: Stefan Gerd Fritsch,Cennet Oguz,Vitor Fortes Rey,Lala Ray,Maximilian Kiefer-Emmanouilidis,Paul Lukowicz
Keywords: Human Activity Recognition, human computer interaction, Human Activity, Activity Recognition, sports and fitness
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Human Activity Recognition is a longstanding problem in AI with applications in a broad range of areas: from healthcare, sports and fitness, security, and human computer interaction to robotics. The performance of HAR in real-world settings is strongly dependent on the type and quality of the input signal that can be acquired. Given an unobstructed, high-quality camera view of a scene, computer vision systems, in particular in conjunction with foundational models (e.g., CLIP), can today fairly reliably distinguish complex activities. On the other hand, recognition using modalities such as wearable sensors (which are often more broadly available, e.g., in mobile phones and smartwatches) is a more difficult problem, as the signals often contain less information and labeled training data is more difficult to acquire. In this work, we show how we can improve HAR performance across different modalities using multimodal contrastive pretraining. Our approach, MuJo (Multimodal Joint Feature Space Learning), learns a multimodal joint feature space with video, language, pose, and IMU sensor data. The proposed approach combines contrastive and multitask learning methods and analyzes different multitasking strategies for learning a compact shared representation. A large dataset with parallel video, language, pose, and sensor data points is also introduced to support the research, along with an analysis of the robustness of the multimodal joint space for modal-incomplete and low-resource data. On the MM-Fit dataset, our model achieves an impressive Macro F1-Score of up to 0.992 with only 2% of the train data and 0.999 when using all available training data for classification tasks. Moreover, in the scenario where the MM-Fit dataset is unseen, we demonstrate a generalization performance of up to 0.638.
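
The Macro F1-Score quoted above averages the per-class F1 scores with equal weight, so rare activities count as much as frequent ones. Here is a minimal plain-Python sketch of the metric; the function name `macro_f1` and the toy labels are illustrative and are not data from the paper.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 = 2*TP / (2*TP + FP + FN)."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # denom is never 0 for labels drawn from the data; guard kept for safety.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Made-up predictions for two activity classes.
y_true = ["squat", "squat", "squat", "squat", "lunge", "lunge"]
y_pred = ["squat", "squat", "squat", "lunge", "lunge", "lunge"]
score = macro_f1(y_true, y_pred)
```

In this toy case the per-class F1 values are 6/7 and 4/5, so the macro average is 29/35 (about 0.829), slightly below the plain accuracy of 5/6.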

[CV-74] Monocular Localization with Semantics Map for Autonomous Vehicles

Link: https://arxiv.org/abs/2406.03835
Authors: Jixiang Wan,Xudong Zhang,Shuzhou Dong,Yuwei Zhang,Yuchen Yang,Ruoxi Wu,Ye Jiang,Jijunnan Li,Jinquan Lin,Ming Yang
Keywords: Accurate and robust, robust localization remains, remains a significant, significant challenge, Accurate
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Abstract:Accurate and robust localization remains a significant challenge for autonomous vehicles. The cost of sensors and limitations in local computational efficiency make it difficult to scale to large commercial applications. Traditional vision-based approaches focus on texture features that are susceptible to changes in lighting, season, perspective, and appearance. Additionally, the large storage size of maps with descriptors and complex optimization processes hinder system performance. To balance efficiency and accuracy, we propose a novel lightweight visual semantic localization algorithm that employs stable semantic features instead of low-level texture features. First, semantic maps are constructed offline by detecting semantic objects, such as ground markers, lane lines, and poles, using cameras or LiDAR sensors. Then, online visual localization is performed through data association of semantic features and map objects. We evaluated our proposed localization framework in the publicly available KAIST Urban dataset and in scenarios recorded by ourselves. The experimental results demonstrate that our method is a reliable and practical localization solution in various autonomous driving localization tasks.

[CV-75] Amortized Equation Discovery in Hybrid Dynamical Systems

Link: https://arxiv.org/abs/2406.03818
Authors: Yongtuo Liu,Sara Magliacane,Miltiadis Kofinas,Efstratios Gavves
Keywords: express complex systems, discrete states, Hybrid dynamical systems, prevalent in science, science and engineering
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Symbolic Computation (cs.SC)
Comments: 24 pages, 5 figures, accepted by International Conference on Machine Learning (ICML) 2024

Abstract:Hybrid dynamical systems are prevalent in science and engineering to express complex systems with continuous and discrete states. To learn the laws of systems, all previous methods for equation discovery in hybrid systems follow a two-stage paradigm, i.e. they first group time series into small cluster fragments and then discover equations in each fragment separately through methods in non-hybrid systems. Although effective, these methods do not fully take advantage of the commonalities in the shared dynamics of multiple fragments that are driven by the same equations. Besides, the two-stage paradigm breaks the interdependence between categorizing and representing dynamics that jointly form hybrid systems. In this paper, we reformulate the problem and propose an end-to-end learning framework, i.e. Amortized Equation Discovery (AMORE), to jointly categorize modes and discover equations characterizing the dynamics of each mode by all segments of the mode. Experiments on four hybrid and six non-hybrid systems show that our method outperforms previous methods on equation discovery, segmentation, and forecasting.

[CV-76] Enhanced Semantic Segmentation Pipeline for WeatherProof Dataset Challenge

Link: https://arxiv.org/abs/2406.03799
Authors: Nan Zhang,Xidan Zhang,Jianing Wei,Fangjun Wang,Zhiming Tan
Keywords: WeatherProof Dataset Challenge, report describes, describes the winning, CVPR, Track
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:This report describes the winning solution to the WeatherProof Dataset Challenge (CVPR 2024 UG2+ Track 3). Details regarding the challenge are available at this https URL. We propose an enhanced semantic segmentation pipeline for this challenge. Firstly, we improve semantic segmentation models, using backbone pretrained with Depth Anything to improve UperNet model and SETRMLA model, and adding language guidance based on both weather and category information to InternImage model. Secondly, we introduce a new dataset WeatherProofExtra with wider viewing angle and employ data augmentation methods, including adverse weather and super-resolution. Finally, effective training strategies and ensemble method are applied to improve final performance further. Our solution is ranked 1st on the final leaderboard. Code will be available at this https URL.

[CV-77] Low-Rank Similarity Mining for Multimodal Dataset Distillation

Link: https://arxiv.org/abs/2406.03793
Authors: Yue Xu,Zhilin Lin,Yusong Qiu,Cewu Lu,Yong-Lu Li
Keywords: witnessed rapid development, recent years, poses unique, under-explored challenges, witnessed rapid
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ICML 2024

Abstract:Though dataset distillation has witnessed rapid development in recent years, the distillation of multimodal data, e.g., image-text pairs, poses unique and under-explored challenges. Unlike unimodal data, image-text contrastive learning (ITC) data lack inherent categorization and should instead place greater emphasis on modality correspondence. In this work, we propose Low-Rank Similarity Mining (LoRS) for multimodal dataset distillation, that concurrently distills a ground truth similarity matrix with image-text pairs, and leverages low-rank factorization for efficiency and scalability. The proposed approach brings significant improvement to the existing algorithms, marking a significant contribution to the field of visual-language dataset distillation. We advocate adopting LoRS as a foundational synthetic data setup for image-text dataset distillation. Our code is available at this https URL.
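
To see why a low-rank factorization helps with a learned similarity matrix, consider N image-text pairs: a full N x N matrix has N^2 entries, while two N x r factors have only 2Nr. The sketch below illustrates the general idea. The identity-plus-low-rank form, the function names, and the toy numbers are assumptions made for illustration; the abstract does not confirm them as the paper's exact parameterization.

```python
def low_rank_similarity(left, right):
    """Build S = I + left @ right^T from two N x r factors (lists of rows).

    The identity term keeps each image matched to its own text; the low-rank
    term adds soft similarity between different pairs.
    """
    n = len(left)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dot = sum(a * b for a, b in zip(left[i], right[j]))
            sim[i][j] = (1.0 if i == j else 0.0) + dot
    return sim


def parameter_counts(n, r):
    """Return (entries of a full N x N matrix, entries of two N x r factors)."""
    return n * n, 2 * n * r


# Toy rank-1 example: image 0 is also somewhat similar to text 1.
left = [[1.0], [0.0], [0.0], [0.0]]
right = [[0.0], [0.5], [0.0], [0.0]]
sim = low_rank_similarity(left, right)
full, factored = parameter_counts(n=1000, r=10)
```

With 1000 pairs and rank 10, the factored form stores 20,000 numbers instead of 1,000,000, which is where the efficiency and scalability argument comes from.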

[CV-78] XL-HeadTags: Leveraging Multimodal Retrieval Augmentation for the Multilingual Generation of News Headlines and Tags

Link: https://arxiv.org/abs/2406.03776
Authors: Faisal Tareque Shohan,Mir Tafseer Nayeem,Samsul Islam,Abu Ubaida Akash,Shafiq Joty
Keywords: published online daily, articles published online, published online, online daily, daily can overwhelm
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Comments: ACL 2024 camera ready

Abstract:Millions of news articles published online daily can overwhelm readers. Headlines and entity (topic) tags are essential for guiding readers to decide if the content is worth their time. While headline generation has been extensively studied, tag generation remains largely unexplored, yet it offers readers better access to topics of interest. The need for conciseness in capturing readers’ attention necessitates improved content selection strategies for identifying salient and relevant segments within lengthy articles, thereby guiding language models effectively. To address this, we propose to leverage auxiliary information such as images and captions embedded in the articles to retrieve relevant sentences and utilize instruction tuning with variations to generate both headlines and tags for news articles in a multilingual context. To make use of the auxiliary information, we have compiled a dataset named XL-HeadTags, which includes 20 languages across 6 diverse language families. Through extensive evaluation, we demonstrate the effectiveness of our plug-and-play multimodal-multilingual retrievers for both tasks. Additionally, we have developed a suite of tools for processing and evaluating multilingual texts, significantly contributing to the research community by enabling more accurate and efficient analysis across languages.

[CV-79] Instance Segmentation and Teeth Classification in Panoramic X-rays

Link: https://arxiv.org/abs/2406.03747
Authors: Devichand Budagam,Ayush Kumar,Sayan Ghosh,Anuj Shrivastav,Azamat Zhanatuly Imanbayev,Iskander Rafailovich Akhmetov,Dmitrii Kaplun,Sergey Antonov,Artem Rychenkov,Gleb Cyganov,Aleksandr Sinitca
Keywords: Teeth segmentation, Teeth, recognition are critical, segmentation, deep learning models
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: submitted to Expert Systems with Applications Journal

Abstract:Teeth segmentation and recognition are critical in various dental applications and dental diagnosis. Automatic and accurate segmentation approaches have been made possible by integrating deep learning models. Although teeth segmentation has been studied in the past, only some techniques were able to effectively classify and segment teeth simultaneously. This article offers a pipeline of two deep learning models, U-Net and YOLOv8, which results in BB-UNet, a new architecture for the classification and segmentation of teeth on panoramic X-rays that is efficient and reliable. We have improved the quality and reliability of teeth segmentation by utilising the YOLOv8 and U-Net capabilities. The proposed networks have been evaluated using the mean average precision (mAP) and dice coefficient for YOLOv8 and BB-UNet, respectively. We have achieved a 3% increase in mAP score for teeth classification compared to existing methods, and a 10-15% increase in dice coefficient for teeth segmentation compared to U-Net across different categories of teeth. A new Dental dataset was created based on UFBA-UESC dataset with Bounding-Box and Polygon annotations of 425 dental panoramic X-rays. The findings of this research pave the way for a wider adoption of object detection models in the field of dental diagnosis.

[CV-80] ReDistill: Residual Encoded Distillation for Peak Memory Reduction

Link: https://arxiv.org/abs/2406.03744
Authors: Fang Chen,Gourav Datta,Mujahid Al Rafi,Hyeran Jeon,Meng Tang
Keywords: modern camera sensors, camera sensors result, neural network sizes, peak memory, memory
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:The expansion of neural network sizes and the enhancement of image resolution through modern camera sensors result in heightened memory and power demands for neural networks. Reducing peak memory, which is the maximum memory consumed during the execution of a neural network, is critical to deploy neural networks on edge devices with limited memory budget. A naive approach to reducing peak memory is aggressive down-sampling of feature maps via pooling with large stride, which often results in unacceptable degradation in network performance. To mitigate this problem, we propose residual encoded distillation (ReDistill) for peak memory reduction in a teacher-student framework, in which a student network with less memory is derived from the teacher network using aggressive pooling. We apply our distillation method to multiple problems in computer vision including image classification and diffusion based image generation. For image classification, our method yields 2x-3.2x lower measured peak memory on an edge GPU with negligible degradation in accuracy for most CNN based architectures. Additionally, our method yields improved test accuracy for tiny vision transformer (ViT) based models distilled from large CNN based teacher architectures. For diffusion-based image generation, our proposed distillation method yields a denoising network with 4x lower theoretical peak memory while maintaining decent diversity and fidelity for image generation. Experiments demonstrate our method’s superior performance compared to other feature-based and response-based distillation methods.
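
The link between pooling stride and peak memory can be made concrete with a rough activation-memory model. The sketch below is a simplification under loud assumptions: a purely sequential network in which only a layer's input and output are alive at once, element counts instead of bytes, and hypothetical layer sizes. It is not the paper's measurement procedure, and `peak_activation_elems` and `toy_network` are made-up names.

```python
def peak_activation_elems(shapes):
    """Largest (input + output) element count over consecutive layers.

    `shapes` lists (channels, height, width) feature maps in execution order.
    Weights and workspace memory are ignored in this simplified model.
    """
    sizes = [c * h * w for c, h, w in shapes]
    return max(a + b for a, b in zip(sizes, sizes[1:]))


def toy_network(stride):
    """Input image -> downsampling stem -> one wider layer (hypothetical sizes)."""
    side = 224 // stride
    return [(3, 224, 224), (64, side, side), (128, side, side)]


teacher_peak = peak_activation_elems(toy_network(stride=2))
student_peak = peak_activation_elems(toy_network(stride=4))
reduction = teacher_peak / student_peak
```

In this toy setup, doubling the stem stride shrinks every later feature map by 4x and cuts the peak by the same factor; the numbers are illustrative and unrelated to the paper's results. As the abstract notes, such aggressive down-sampling usually hurts accuracy on its own, which is the gap the distillation is meant to close.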

[CV-81] Evaluating Durability: Benchmark Insights into Multimodal Watermarking

Link: https://arxiv.org/abs/2406.03728
Authors: Jielin Qiu, William Han, Xuandong Zhao, Shangbang Long, Christos Faloutsos, Lei Li
Keywords: monitor content distribution, verify authenticity, assert copyright, increasingly employed, employed to assert
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:With the development of large models, watermarks are increasingly employed to assert copyright, verify authenticity, or monitor content distribution. As applications become more multimodal, the utility of watermarking techniques becomes even more critical. The effectiveness and reliability of these watermarks largely depend on their robustness to various disturbances. However, the robustness of these watermarks in real-world scenarios, particularly under perturbations and corruption, is not well understood. To highlight the significance of robustness in watermarking techniques, our study evaluated the robustness of watermarked content generated by image and text generation models against common real-world image corruptions and text perturbations. Our results could pave the way for the development of more robust watermarking techniques in the future. Our project website can be found at this https URL.

[CV-82] Gear-NeRF: Free-Viewpoint Rendering and Tracking with Motion-aware Spatio-Temporal Sampling

Link: https://arxiv.org/abs/2406.03723
Authors: Xinhang Liu, Yu-Wing Tai, Chi-Keung Tang, Pedro Miraldo, Suhas Lohit, Moitreya Chatterjee
Keywords: Neural Radiance Fields, Extensions of Neural, Radiance Fields, Neural Radiance, model dynamic scenes
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Multimedia (cs.MM)
Comments: Paper accepted to IEEE/CVF CVPR 2024 (Spotlight). Work done when XL was an intern at MERL. Project Page Link: this https URL

Abstract:Extensions of Neural Radiance Fields (NeRFs) to model dynamic scenes have enabled their near photo-realistic, free-viewpoint rendering. Although these methods have shown some potential in creating immersive experiences, two drawbacks limit their ubiquity: (i) a significant reduction in reconstruction quality when the computing budget is limited, and (ii) a lack of semantic understanding of the underlying scenes. To address these issues, we introduce Gear-NeRF, which leverages semantic information from powerful image segmentation models. Our approach presents a principled way for learning a spatio-temporal (4D) semantic embedding, based on which we introduce the concept of gears to allow for stratified modeling of dynamic regions of the scene based on the extent of their motion. Such differentiation allows us to adjust the spatio-temporal sampling resolution for each region in proportion to its motion scale, achieving more photo-realistic dynamic novel view synthesis. At the same time, almost for free, our approach enables free-viewpoint tracking of objects of interest - a functionality not yet achieved by existing NeRF-based methods. Empirical studies validate the effectiveness of our method, where we achieve state-of-the-art rendering and tracking performance on multiple challenging datasets.

[CV-83] Attribute-Aware Implicit Modality Alignment for Text Attribute Person Search

Link: https://arxiv.org/abs/2406.03721
Authors: Xin Wang, Fangfang Liu, Zheng Li, Caili Guo
Keywords: Text attribute person, find specific pedestrians, person search aims, attribute person search, Text attribute
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Abstract:Text attribute person search aims to find specific pedestrians through given textual attributes, which is very meaningful in the scene of searching for designated pedestrians through witness descriptions. The key challenge is the significant modality gap between textual attributes and images. Previous methods focused on achieving explicit representation and alignment through unimodal pre-trained models. Nevertheless, the absence of inter-modality correspondence in these models may lead to distortions in the local information of intra-modality. Moreover, these methods only considered the alignment of inter-modality and ignored the differences between different attribute categories. To mitigate the above problems, we propose an Attribute-Aware Implicit Modality Alignment (AIMA) framework to learn the correspondence of local representations between textual attributes and images and combine global representation matching to narrow the modality gap. Firstly, we introduce the CLIP model as the backbone and design prompt templates to transform attribute combinations into structured sentences. This facilitates the model’s ability to better understand and match image details. Next, we design a Masked Attribute Prediction (MAP) module that predicts the masked attributes after the interaction of image and masked textual attribute features through multi-modal interaction, thereby achieving implicit local relationship alignment. Finally, we propose an Attribute-IoU Guided Intra-Modal Contrastive (A-IoU IMC) loss, aligning the distribution of different textual attributes in the embedding space with their IoU distribution, achieving better semantic arrangement. Extensive experiments on the Market-1501 Attribute, PETA, and PA100K datasets show that the performance of our proposed method significantly surpasses the current state-of-the-art methods.
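
The A-IoU IMC loss in the abstract above relies on an IoU between attribute combinations. The abstract does not give the exact formula, so the following is only a sketch that assumes the usual set-based intersection over union. The function name and the attribute strings are hypothetical.

```python
def attribute_iou(attrs_a, attrs_b):
    """Set-based IoU between two attribute combinations (assumed definition)."""
    set_a, set_b = set(attrs_a), set(attrs_b)
    union = set_a | set_b
    # Two empty attribute sets are treated as identical.
    return 1.0 if not union else len(set_a & set_b) / len(union)


# Hypothetical attribute descriptions of two pedestrians.
person_1 = ["male", "backpack", "long hair"]
person_2 = ["male", "backpack", "hat"]
iou = attribute_iou(person_1, person_2)  # 2 shared attributes out of 4 distinct ones
```

Under this assumption, descriptions that share more attributes get a higher IoU, which is the kind of signal a contrastive loss could use to place them closer together in the embedding space.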

[CV-84] JIGMARK: A Black-Box Approach for Enhancing Image Watermarks against Diffusion Model Edits

Link: https://arxiv.org/abs/2406.03720
Authors: Minzhou Pan, Yi Zeng, Xue Lin, Ning Yu, Cho-Jui Hsieh, Peter Henderson, Ruoxi Jia
Keywords: accessing gradient information, False Positive Rate, True Positive Rate, investigate the vulnerability, challenge exacerbated
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

Abstract:In this study, we investigate the vulnerability of image watermarks to diffusion-model-based image editing, a challenge exacerbated by the computational cost of accessing gradient information and the closed-source nature of many diffusion models. To address this issue, we introduce JIGMARK. This first-of-its-kind watermarking technique enhances robustness through contrastive learning with pairs of images, processed and unprocessed by diffusion models, without needing a direct backpropagation of the diffusion process. Our evaluation reveals that JIGMARK significantly surpasses existing watermarking solutions in resilience to diffusion-model edits, demonstrating a True Positive Rate more than triple that of leading baselines at a 1% False Positive Rate while preserving image quality. At the same time, it consistently improves the robustness against other conventional perturbations (like JPEG, blurring, etc.) and malicious watermark attacks over the state-of-the-art, often by a large margin. Furthermore, we propose the Human Aligned Variation (HAV) score, a new metric that surpasses traditional similarity measures in quantifying the number of image derivatives from image editing.
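
The abstract above reports a True Positive Rate at a 1% False Positive Rate. A minimal sketch of how such a number can be computed from detector scores follows. It assumes that a higher score means "more likely watermarked"; the function name, the detection rule, and the toy scores are illustrative assumptions, not the paper's evaluation code.

```python
def tpr_at_fpr(clean_scores, watermarked_scores, max_fpr=0.01):
    """TPR at the lowest threshold whose FPR on clean images is <= max_fpr."""
    if not clean_scores or not watermarked_scores:
        raise ValueError("both score lists must be non-empty")
    clean_sorted = sorted(clean_scores, reverse=True)
    # Number of clean images we may wrongly flag while staying within max_fpr.
    allowed_false_positives = int(max_fpr * len(clean_sorted))
    if allowed_false_positives >= len(clean_sorted):
        threshold = float("-inf")  # every image may be flagged
    else:
        # Scores strictly above this clean score are flagged as watermarked.
        threshold = clean_sorted[allowed_false_positives]
    detected = sum(1 for score in watermarked_scores if score > threshold)
    return detected / len(watermarked_scores)


# Toy scores (hypothetical): with 4 clean images, a 25% FPR allows one false alarm.
clean = [0.1, 0.2, 0.3, 0.4]
marked = [0.25, 0.35, 0.5, 0.9]
tpr = tpr_at_fpr(clean, marked, max_fpr=0.25)
```

Fixing the false positive rate first makes detectors comparable, because each one is then judged only by how many watermarked images it still catches.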

[CV-85] DSNet: A Novel Way to Use Atrous Convolutions in Semantic Segmentation

Link: https://arxiv.org/abs/2406.03702
Authors: Zilu Guo, Liuyang Bian, Xuan Huang, Hu Wei, Jingyu Li, Huasheng Ni
Keywords: semantic segmentation tasks, Atrous convolutions, apply atrous convolutions, semantic segmentation, method to increase
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Atrous convolutions are employed as a method to increase the receptive field in semantic segmentation tasks. However, in previous works on semantic segmentation, they were rarely employed in the shallow layers of the model. We revisit the design of atrous convolutions in modern convolutional neural networks (CNNs), and demonstrate that the concept of using large kernels to apply atrous convolutions could be a more powerful paradigm. We propose three guidelines to apply atrous convolutions more efficiently. Following these guidelines, we propose DSNet, a Dual-Branch CNN architecture, which incorporates atrous convolutions in the shallow layers of the model architecture and pretrains nearly the entire encoder on ImageNet to achieve better performance. Our models achieve a new state-of-the-art trade-off between accuracy and speed on the ADE20K, Cityscapes and BDD datasets. Specifically, DSNet achieves 40.0% mIOU with inference speed of 179.2 FPS on ADE20K, and 80.4% mIOU with speed of 81.9 FPS on Cityscapes. Source code and models are available at Github: this https URL.
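
To make the receptive-field claim in the abstract above concrete, here is a minimal 1D sketch of an atrous (dilated) convolution. It is a generic illustration, not DSNet code: the function names are made up for this example, and the operation is written as cross-correlation without padding, as is common in deep learning libraries.

```python
def effective_kernel_size(kernel_size, dilation):
    """Input span covered by a dilated kernel: k + (k - 1) * (d - 1)."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def atrous_conv1d(signal, kernel, dilation=1):
    """Valid (unpadded) 1D atrous convolution over a list of numbers."""
    span = effective_kernel_size(len(kernel), dilation)
    output = []
    for start in range(len(signal) - span + 1):
        # Kernel taps are spaced `dilation` samples apart in the input.
        output.append(sum(w * signal[start + i * dilation]
                          for i, w in enumerate(kernel)))
    return output


signal = [1, 2, 3, 4, 5, 6, 7]
kernel = [1, 0, -1]
dense = atrous_conv1d(signal, kernel, dilation=1)   # each output sees 3 samples
atrous = atrous_conv1d(signal, kernel, dilation=2)  # each output sees 5 samples
```

With the same three weights, dilation 2 widens the input span from 3 to 5 samples, which is how atrous convolutions enlarge the receptive field without adding parameters.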

[CV-86] Superpoint Gaussian Splatting for Real-Time High-Fidelity Dynamic Scene Reconstruction

Link: https://arxiv.org/abs/2406.03697
Authors: Diwen Wan, Ruijie Lu, Gang Zeng
Keywords: challenging task, view images, crucial yet challenging, Gaussian Splatting, additional time-variant MLP
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICML 2024

Abstract:Rendering novel view images in dynamic scenes is a crucial yet challenging task. Current methods mainly utilize NeRF-based methods to represent the static scene and an additional time-variant MLP to model scene deformations, resulting in relatively low rendering quality as well as slow inference speed. To tackle these challenges, we propose a novel framework named Superpoint Gaussian Splatting (SP-GS). Specifically, our framework first employs explicit 3D Gaussians to reconstruct the scene and then clusters Gaussians with similar properties (e.g., rotation, translation, and location) into superpoints. Empowered by these superpoints, our method manages to extend 3D Gaussian splatting to dynamic scenes with only a slight increase in computational expense. Apart from achieving state-of-the-art visual quality and real-time rendering under high resolutions, the superpoint representation provides a stronger manipulation capability. Extensive experiments demonstrate the practicality and effectiveness of our approach on both synthetic and real-world datasets. Please see our project page at this https URL.

[CV-87] Untrained Neural Nets for Snapshot Compressive Imaging: Theory and Algorithms

Link: https://arxiv.org/abs/2406.03694
Authors: Mengyu Zhao, Xi Chen, Xin Yuan, Shirin Jalali
Keywords: Snapshot compressive imaging, enabling diverse applications, Snapshot compressive, compressive imaging, hyperspectral imaging
Categories: Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT)

Abstract:Snapshot compressive imaging (SCI) recovers high-dimensional (3D) data cubes from a single 2D measurement, enabling diverse applications like video and hyperspectral imaging to go beyond standard techniques in terms of acquisition speed and efficiency. In this paper, we focus on SCI recovery algorithms that employ untrained neural networks (UNNs), such as deep image prior (DIP), to model source structure. Such UNN-based methods are appealing as they have the potential of avoiding the computationally intensive retraining required for different source models and different measurement scenarios. We first develop a theoretical framework for characterizing the performance of such UNN-based methods. The theoretical framework, on the one hand, enables us to optimize the parameters of data-modulating masks, and on the other hand, provides a fundamental connection between the number of data frames that can be recovered from a single measurement to the parameters of the untrained NN. We also employ the recently proposed bagged-deep-image-prior (bagged-DIP) idea to develop SCI Bagged Deep Video Prior (SCI-BDVP) algorithms that address the common challenges faced by standard UNN solutions. Our experimental results show that in video SCI our proposed solution achieves state-of-the-art among UNN methods, and in the case of noisy measurements, it even outperforms supervised solutions.
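
The abstract above describes recovering a 3D data cube from a single 2D measurement formed with data-modulating masks. A commonly used noiseless measurement model for this setting is y = Σ_t C_t ⊙ x_t, where each frame x_t is multiplied elementwise by its mask C_t and the results are summed. The sketch below assumes that model; the function name and toy data are illustrative and not taken from the paper.

```python
def sci_measure(frames, masks):
    """Collapse T frames (each H x W) into one snapshot: y = sum_t mask_t * frame_t."""
    if len(frames) != len(masks):
        raise ValueError("need exactly one mask per frame")
    height, width = len(frames[0]), len(frames[0][0])
    snapshot = [[0.0] * width for _ in range(height)]
    for frame, mask in zip(frames, masks):
        for i in range(height):
            for j in range(width):
                # Elementwise modulation, then accumulation on the 2D sensor.
                snapshot[i][j] += mask[i][j] * frame[i][j]
    return snapshot


# Two hypothetical 2x2 frames with complementary binary masks.
frames = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
masks = [[[1, 0], [0, 1]], [[0, 1], [1, 0]]]
snapshot = sci_measure(frames, masks)
```

Recovery algorithms such as the UNN-based methods in the abstract then try to invert this many-to-one mapping, which is why the choice of masks matters.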

[CV-88] Principles of Designing Robust Remote Face Anti-Spoofing Systems

Link: https://arxiv.org/abs/2406.03684
Authors: Xiang Xu, Tianchen Zhao, Zheng Zhang, Zhihua Li, Jon Wu, Alessandro Achille, Mani Srivastava
Keywords: Protecting digital identities, face anti-spoofing, Protecting digital, identities of human, plays a crucial
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Comments: Under review

Abstract:Protecting digital identities of human faces from various attack vectors is paramount, and face anti-spoofing plays a crucial role in this endeavor. Current approaches primarily focus on detecting spoofing attempts within individual frames to detect presentation attacks. However, the emergence of hyper-realistic generative models capable of real-time operation has heightened the risk of digitally generated attacks. In light of these evolving threats, this paper aims to address two key aspects. First, it sheds light on the vulnerabilities of state-of-the-art face anti-spoofing methods against digital attacks. Second, it presents a comprehensive taxonomy of common threats encountered in face anti-spoofing systems. Through a series of experiments, we demonstrate the limitations of current face anti-spoofing detection techniques and their failure to generalize to novel digital attack scenarios. Notably, the existing models struggle with digital injection attacks including adversarial noise, realistic deepfake attacks, and digital replay attacks. To aid in the design and implementation of robust face anti-spoofing systems resilient to these emerging vulnerabilities, the paper proposes key design principles from model accuracy and robustness to pipeline robustness and even platform robustness. In particular, we suggest implementing a proactive face anti-spoofing system using active sensors to significantly reduce the risks from unseen attack vectors and improve the user experience.

[CV-89] 3rd Place Solution for MOSE Track in CVPR 2024 PVUW workshop: Complex Video Object Segmentation

Link: https://arxiv.org/abs/2406.03668
Authors: Xinyu Liu, Jing Zhang, Kexin Zhang, Yuting Yang, Licheng Jiao, Shuyuan Yang
Keywords: Video Object Segmentation, distinguishing foreground objects, Object Segmentation, coMplex video Object, Video Object
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Video Object Segmentation (VOS) is a vital task in computer vision, focusing on distinguishing foreground objects from the background across video frames. Our work draws inspiration from the Cutie model, and we investigate the effects of object memory, the total number of memory frames, and input resolution on segmentation performance. This report validates the effectiveness of our inference method on the coMplex video Object SEgmentation (MOSE) dataset, which features complex occlusions. Our experimental results demonstrate that our approach achieves a J&F score of 0.8139 on the test set, securing the third position in the final ranking. These findings highlight the robustness and accuracy of our method in handling challenging VOS scenarios.

[CV-90] Partial Label Learning with Focal Loss for Sea Ice Classification Based on Ice Charts

Link: https://arxiv.org/abs/2406.03645
Authors: Behzad Vahedi, Benjamin Lucas, Farnoush Banaei-Kashani, Andrew P. Barrett, Walter N. Meier, Siri Jodha Khalsa, Morteza Karimzadeh
Keywords: Arctic and Earth, requires consistent monitoring, Earth climate, Sea ice, ice
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Abstract:Sea ice, crucial to the Arctic and Earth’s climate, requires consistent monitoring and high-resolution mapping. Manual sea ice mapping, however, is time-consuming and subjective, prompting the need for automated deep learning-based classification approaches. However, training these algorithms is challenging because expert-generated ice charts, commonly used as training data, do not map single ice types but instead map polygons with multiple ice types. Moreover, the distribution of various ice types in these charts is frequently imbalanced, resulting in a performance bias towards the dominant class. In this paper, we present a novel GeoAI approach to training sea ice classification by formalizing it as a partial label learning task with explicit confidence scores to address multiple labels and class imbalance. We treat the polygon-level labels as candidate partial labels, assign the corresponding ice concentrations as confidence scores to each candidate label, and integrate them with focal loss to train a Convolutional Neural Network (CNN). Our proposed approach leads to enhanced performance for sea ice classification in Sentinel-1 dual-polarized SAR images, improving classification accuracy (from 87% to 92%) and weighted average F-1 score (from 90% to 93%) compared to the conventional training approach of using one-hot encoded labels and Categorical Cross-Entropy loss. It also improves the F-1 score in 4 out of the 6 sea ice classes.
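
The abstract above combines candidate partial labels, per-label confidence scores, and focal loss. The exact loss is not given in the abstract, so the sketch below is one plausible reading and should be treated as an assumption: each candidate class c contributes -conf_c * (1 - p_c)^gamma * log(p_c), where p is the softmax output. All names and the example numbers are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def confidence_weighted_focal_loss(logits, confidences, gamma=2.0):
    """Focal loss summed over candidate labels, weighted by label confidence.

    Assumed form: -sum_c conf_c * (1 - p_c)**gamma * log(p_c).
    Classes with zero confidence are not candidates and contribute nothing.
    """
    probs = softmax(logits)
    return -sum(conf * (1.0 - p) ** gamma * math.log(p)
                for p, conf in zip(probs, confidences) if conf > 0)


# Hypothetical polygon: 70% of one ice type, 30% of another, none of the third.
loss = confidence_weighted_focal_loss([2.0, 0.5, -1.0], [0.7, 0.3, 0.0])
```

With a one-hot confidence vector and gamma set to 0, this expression reduces to ordinary categorical cross-entropy, the baseline the abstract compares against.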

[CV-91] Degrees of Freedom Matter: Inferring Dynamics from Point Trajectories

Link: https://arxiv.org/abs/2406.03625
Authors: Yan Zhang, Sergey Prokudin, Marko Mihajlovic, Qianli Ma, Siyu Tang
Keywords: enhancing applications related, scene reconstruction, computer vision, essential in enhancing, avatar creation
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: cvpr24 post camera ready

Abstract:Understanding the dynamics of generic 3D scenes is fundamentally challenging in computer vision, essential in enhancing applications related to scene reconstruction, motion tracking, and avatar creation. In this work, we address the task as the problem of inferring dense, long-range motion of 3D points. By observing a set of point trajectories, we aim to learn an implicit motion field parameterized by a neural network to predict the movement of novel points within the same domain, without relying on any data-driven or scene-specific priors. To achieve this, our approach builds upon the recently introduced dynamic point field model that learns smooth deformation fields between the canonical frame and individual observation frames. However, temporal consistency between consecutive frames is neglected, and the number of required parameters increases linearly with the sequence length due to per-frame modeling. To address these shortcomings, we exploit the intrinsic regularization provided by SIREN, and modify the input layer to produce a spatiotemporally smooth motion field. Additionally, we analyze the motion field Jacobian matrix, and discover that the motion degrees of freedom (DOFs) in an infinitesimal area around a point and the network hidden variables have different behaviors to affect the model’s representational power. This enables us to improve the model representation capability while retaining the model compactness. Furthermore, to reduce the risk of overfitting, we introduce a regularization term based on the assumption of piece-wise motion smoothness. Our experiments assess the model’s performance in predicting unseen point trajectories and its application in temporal mesh alignment with guidance. The results demonstrate its superiority and effectiveness. The code and data for the project are publicly available: this https URL

[CV-92] FedPylot: Navigating Federated Learning for Real-Time Object Detection in Internet of Vehicles

Link: https://arxiv.org/abs/2406.03611
Authors: Cyprien Quéméneur, Soumaya Cherkaoui
Keywords: enabling low-latency big, low-latency big data, big data processing, dense interconnected network, intelligent transportation systems
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)

Abstract:The Internet of Vehicles (IoV) emerges as a pivotal component for autonomous driving and intelligent transportation systems (ITS), by enabling low-latency big data processing in a dense interconnected network that comprises vehicles, infrastructures, pedestrians and the cloud. Autonomous vehicles are heavily reliant on machine learning (ML) and can strongly benefit from the wealth of sensory data generated at the edge, which calls for measures to reconcile model training with preserving the privacy of sensitive user data. Federated learning (FL) stands out as a promising solution to train sophisticated ML models in vehicular networks while protecting the privacy of road users and mitigating communication overhead. This paper examines the federated optimization of the cutting-edge YOLOv7 model to tackle real-time object detection amid data heterogeneity, encompassing unbalancedness, concept drift, and label distribution skews. To this end, we introduce FedPylot, a lightweight MPI-based prototype to simulate federated object detection experiments on high-performance computing (HPC) systems, where we safeguard server-client communications using hybrid encryption. Our study factors in accuracy, communication cost, and inference speed, thereby presenting a balanced approach to the challenges faced by autonomous vehicles. We demonstrate promising results for the applicability of FL in IoV and hope that FedPylot will provide a basis for future research into federated real-time object detection. The source code is available at this https URL.
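
The abstract above talks about federated optimization without naming an aggregation rule. As a generic illustration of the idea, here is a minimal sketch of FedAvg-style aggregation, where the server averages client parameters weighted by local dataset size. This is an assumption made for illustration and may differ from what FedPylot does; the function name and numbers are hypothetical.

```python
def federated_average(client_weights, client_sizes):
    """Average flat parameter vectors, weighted by each client's sample count."""
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need one non-empty weight vector and one size per client")
    total_samples = sum(client_sizes)
    num_params = len(client_weights[0])
    return [
        sum(weights[k] * size for weights, size in zip(client_weights, client_sizes))
        / total_samples
        for k in range(num_params)
    ]


# Two hypothetical vehicles: the second holds three times more local data.
global_weights = federated_average([[1.0, 2.0], [3.0, 4.0]], [1, 3])
```

Only model parameters travel to the server in such a scheme, which is why federated learning is attractive when raw sensor data must stay on the vehicle.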

[CV-93] Hi5: 2D Hand Pose Estimation with Zero Human Annotation

Link: https://arxiv.org/abs/2406.03599
Authors: Masum Hasan, Cengiz Ozel, Nina Long, Alexander Martin, Samuel Potter, Tariq Adnan, Sangwu Lee, Amir Zadeh, Ehsan Hoque
Keywords: collecting high-quality synthetic, hand pose estimation, large synthetic hand, inexpensive method, method for collecting
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)

Abstract:We propose a new large synthetic hand pose estimation dataset, Hi5, and a novel inexpensive method for collecting high-quality synthetic data that requires no human annotation or validation. Leveraging recent advancements in computer graphics, high-fidelity 3D hand models with diverse genders and skin colors, and dynamic environments and camera movements, our data synthesis pipeline allows precise control over data diversity and representation, ensuring robust and fair model training. We generate a dataset with 583,000 images with accurate pose annotation using a single consumer PC that closely represents real-world variability. Pose estimation models trained with Hi5 perform competitively on real-hand benchmarks while surpassing models trained with real data when tested on occlusions and perturbations. Our experiments show promising results for synthetic data as a viable solution for data representation problems in real datasets. Overall, this paper provides a promising new approach to synthetic data creation and annotation that can reduce costs and increase the diversity and quality of data for hand pose estimation.

[CV-94] CountCLIP – [Re] Teaching CLIP to Count to Ten

Link: https://arxiv.org/abs/2406.03586
Authors: Harshvardhan Mestha, Tejas Agarwal, Karan Bania, Shreyas V, Yash Bhisikar
Keywords: relevant downstream tasks, Large vision-language models, learn rich joint, rich joint image-text, joint image-text representations
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Large vision-language models (VLMs) are shown to learn rich joint image-text representations enabling high performances in relevant downstream tasks. However, they fail to showcase their quantitative understanding of objects, and they lack good counting-aware representation. This paper conducts a reproducibility study of ‘Teaching CLIP to Count to Ten’ (Paiss et al., 2023), which presents a method to finetune a CLIP model (Radford et al., 2021) to improve zero-shot counting accuracy in an image while maintaining the performance for zero-shot classification by introducing a counting-contrastive loss term. We improve the model’s performance on a smaller subset of their training data with lower computational resources. We verify these claims by reproducing their study with our own code. The implementation can be found at this https URL.

[CV-95] Understanding the Limitations of Diffusion Concept Algebra Through Food

Link: https://arxiv.org/abs/2406.03582
Authors: E. Zhixuan Zeng, Yuhao Chen, Alexander Wong
Keywords: Image generation techniques, latent diffusion models, Image generation, recent years, latent diffusion
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Image generation techniques, particularly latent diffusion models, have exploded in popularity in recent years. Many techniques have been developed to manipulate and clarify the semantic concepts these large-scale models learn, offering crucial insights into biases and concept relationships. However, these techniques are often only validated in conventional realms of human or animal faces and artistic style transitions. The food domain offers unique challenges through complex compositions and regional biases, which can shed light on the limitations and opportunities within existing methods. Through the lens of food imagery, we analyze both qualitative and quantitative patterns within a concept traversal technique. We reveal measurable insights into the model’s ability to capture and represent the nuances of culinary diversity, while also identifying areas where the model’s biases and limitations emerge.

[CV-96] Enhancing Traffic Sign Recognition with Tailored Data Augmentation: Addressing Class Imbalance and Instance Scarcity

Link: https://arxiv.org/abs/2406.03576
Authors: Ulan Alsiyeu, Zhasdauren Duisebekov
Keywords: paper tackles critical, tackles critical challenges, traffic sign recognition, sign recognition systems, road safety
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:This paper tackles critical challenges in traffic sign recognition (TSR), which is essential for road safety – specifically, class imbalance and instance scarcity in datasets. We introduce tailored data augmentation techniques, including synthetic image generation, geometric transformations, and a novel obstacle-based augmentation method to enhance dataset quality for improved model robustness and accuracy. Our methodology incorporates diverse augmentation processes to accurately simulate real-world conditions, thereby expanding the training data’s variety and representativeness. Our findings demonstrate substantial improvements in TSR model performance, offering significant implications for traffic sign recognition systems. This research not only addresses dataset limitations in TSR but also proposes a model for similar challenges across different regions and applications, marking a step forward in the field of computer vision and traffic sign recognition systems.

[CV-97] Npix2Cpix: A GAN-based Image-to-Image Translation Network with Retrieval-Classification Integration for Watermark Retrieval from Historical Document Images

Link: https://arxiv.org/abs/2406.03556
Authors: Utsab Saha, Sawradip Saha, Shaikh Anowarul Fattah, Mohammad Saquib
Keywords: codicology and history, identification and restoration, restoration of ancient, major topic, topic in codicology
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:The identification and restoration of ancient watermarks have long been a major topic in codicology and history. Classifying historical documents based on watermarks can be difficult due to the diversity of watermarks, crowded and noisy samples, multiple modes of representation, and minor distinctions between classes and intra-class changes. This paper proposes a U-net-based conditional generative adversarial network (GAN) to translate noisy raw historical watermarked images into clean, handwriting-free images with just watermarks. Considering its ability to perform image translation from degraded (noisy) pixels to clean pixels, the proposed network is termed as Npix2Cpix. Instead of employing directly degraded watermarked images, the proposed network uses image-to-image translation using adversarial learning to create clutter and handwriting-free images for restoring and categorizing the watermarks for the first time. In order to learn the mapping from input noisy image to output clean image, the generator and discriminator of the proposed U-net-based GAN are trained using two separate loss functions, each of which is based on the distance between images. After using the proposed GAN to pre-process noisy watermarked images, Siamese-based one-shot learning is used to classify watermarks. According to experimental results on a large-scale historical watermark dataset, extracting watermarks from tainted images can result in high one-shot classification accuracy. The qualitative and quantitative evaluation of the retrieved watermarks illustrates the effectiveness of the proposed approach.

[CV-98] VideoPhy: Evaluating Physical Commonsense for Video Generation

Link: https://arxiv.org/abs/2406.03520
Authors: Hritik Bansal, Zongyu Lin, Tianyi Xie, Zeshun Zong, Michal Yarom, Yonatan Bitton, Chenfanfu Jiang, Yizhou Sun, Kai-Wei Chang, Aditya Grover
Keywords: Recent advances, internet-scale video data, video data pretraining, create high-quality videos, generative models
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 36 pages, 26 figures, 8 tables

Abstract:Recent advances in internet-scale video data pretraining have led to the development of text-to-video generative models that can create high-quality videos across a broad range of visual concepts and styles. Due to their ability to synthesize realistic motions and render complex objects, these generative models have the potential to become general-purpose simulators of the physical world. However, it is unclear how far we are from this goal with the existing text-to-video generative models. To this end, we present VideoPhy, a benchmark designed to assess whether the generated videos follow physical commonsense for real-world activities (e.g. marbles will roll down when placed on a slanted surface). Specifically, we curate a list of 688 captions that involve interactions between various material types in the physical world (e.g., solid-solid, solid-fluid, fluid-fluid). We then generate videos conditioned on these captions from diverse state-of-the-art text-to-video generative models, including open models (e.g., VideoCrafter2) and closed models (e.g., Lumiere from Google, Pika). Further, our human evaluation reveals that the existing models severely lack the ability to generate videos adhering to the given text prompts, while also lacking physical commonsense. Specifically, the best performing model, Pika, generates videos that adhere to the caption and physical laws for only 19.7% of the instances. VideoPhy thus highlights that the video generative models are far from accurately simulating the physical world. Finally, we also supplement the dataset with an auto-evaluator, VideoCon-Physics, to assess semantic adherence and physical commonsense at scale.

[CV-99] LLMs Meet Multimodal Generation and Editing: A Survey

Link: https://arxiv.org/abs/2405.19334
Authors: Yingqing He, Zhaoyang Liu, Jingye Chen, Zeyue Tian, Hongyu Liu, Xiaowei Chi, Runtao Liu, Ruibin Yuan, Yazhou Xing, Wenhai Wang, Jifeng Dai, Yong Zhang, Wei Xue, Qifeng Liu, Yike Guo, Qifeng Chen
Keywords: large language models, large language, combining LLMs, growing interest, interest in combining
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD)
Comments: 51 Pages with 16 Figures, 12 Tables, and 534 References. GitHub Repository at: this https URL

Abstract:With the recent advancement in large language models (LLMs), there is a growing interest in combining LLMs with multimodal learning. Previous surveys of multimodal large language models (MLLMs) mainly focus on understanding. This survey elaborates on multimodal generation across different domains, including image, video, 3D, and audio, where we highlight the notable advancements with milestone works in these fields. Specifically, we exhaustively investigate the key technical components behind methods and multimodal datasets utilized in these studies. Moreover, we dig into tool-augmented multimodal agents that can use existing generative models for human-computer interaction. Lastly, we also comprehensively discuss the advancement in AI safety and investigate emerging applications as well as future prospects. Our work provides a systematic and insightful overview of multimodal generation, which is expected to advance the development of Artificial Intelligence for Generative Content (AIGC) and world models. A curated list of all related papers can be found at this https URL

[CV-100] Information-driven Affordance Discovery for Efficient Robotic Manipulation

Link: https://arxiv.org/abs/2405.03865
Authors: Pietro Mazzaglia,Taco Cohen,Daniel Dijkman
Keywords: aid robotic manipulation, robotic manipulation, aid robotic, providing information, Robotic
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: arXiv admin note: substantial text overlap with arXiv:2308.14915

Abstract:Robotic affordances, providing information about what actions can be taken in a given situation, can aid robotic manipulation. However, learning about affordances requires expensive large annotated datasets of interactions or demonstrations. In this work, we argue that well-directed interactions with the environment can mitigate this problem and propose an information-based measure to augment the agent’s objective and accelerate the affordance discovery process. We provide a theoretical justification of our approach and we empirically validate the approach both in simulation and real-world tasks. Our method, which we dub IDA, enables the efficient discovery of visual affordances for several action primitives, such as grasping, stacking objects, or opening drawers, strongly improving data efficiency in simulation, and it allows us to learn grasping affordances in a small number of interactions, on a real-world setup with a UFACTORY XArm 6 robot arm.

[CV-101] LDM-RSIC: Exploring Distortion Prior with Latent Diffusion Models for Remote Sensing Image Compression

Link: https://arxiv.org/abs/2406.03961
Authors: Junhui Li,Jutao Li,Xingsong Hou,Huake Wang,Yutao Zhang,Yujie Dun,Wenke Sun
Keywords: entropy model estimation, Deep learning-based image, algorithms typically focus, learning-based image compression, image compression
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Deep learning-based image compression algorithms typically focus on designing encoding and decoding networks and improving the accuracy of entropy model estimation to enhance the rate-distortion (RD) performance. However, few algorithms leverage the compression distortion prior from existing compression algorithms to improve RD performance. In this paper, we propose a latent diffusion model-based remote sensing image compression (LDM-RSIC) method, which aims to enhance the final decoding quality of RS images by utilizing the generated distortion prior from an LDM. Our approach consists of two stages. In the first stage, a self-encoder learns a prior from the high-quality input image. In the second stage, the prior is generated through an LDM, conditioned on the decoded image of an existing learning-based image compression algorithm, to be used as auxiliary information for generating the texture-rich enhanced image. To better utilize the prior, a channel attention and gate-based dynamic feature attention module (DFAM) is embedded into a Transformer-based multi-scale enhancement network (MEN) for image enhancement. Extensive experiments demonstrate the proposed LDM-RSIC significantly outperforms existing state-of-the-art traditional and learning-based image compression algorithms in terms of both subjective perception and objective metrics. Additionally, we use the LDM-based scheme to improve the traditional image compression algorithm JPEG2000 and obtain 32.00% bit savings on the DOTA testing set. The code will be available at this https URL.

[CV-102] Data-Centric Label Smoothing for Explainable Glaucoma Screening from Eye Fundus Images

Link: https://arxiv.org/abs/2406.03903
Authors: Adrian Galdran,Miguel A. González Ballester
Keywords: advanced optimization strategies, modern machine learning, current computing capabilities, computer vision system, vision system tend
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted to ISBI 2024 (Challenges), 2nd position in the JustRAIGS challenge (this https URL)

Abstract:As current computing capabilities increase, modern machine learning and computer vision systems tend to increase in complexity, mostly by means of larger models and advanced optimization strategies. Although often neglected, in many problems there is also much to be gained by considering potential improvements in understanding and better leveraging already-available training data, including annotations. This so-called data-centric approach can lead to substantial performance increases, sometimes beyond what can be achieved by larger models. In this paper we adopt such an approach for the task of justifiable glaucoma screening from retinal images. In particular, we focus on how to combine information from multiple annotators of different skills into a tailored label smoothing scheme that allows us to better employ a large collection of fundus images, instead of discarding samples suffering from inter-rater variability. Internal validation results indicate that our bespoke label smoothing approach surpasses the performance of a standard resnet50 model and also the same model trained with conventional label smoothing techniques, in particular for the multi-label scenario of predicting clinical reasons of glaucoma likelihood in a highly imbalanced screening context. Our code is made available at this http URL.
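
To make the idea of annotator-aware label smoothing concrete, here is a minimal sketch. It is not the authors' scheme: the function names, the per-annotator trust weights, and the choice of a weighted vote fraction as the soft target are all assumptions made for illustration.

```python
import math


def soft_label_from_annotators(votes, weights=None):
    """Turn binary votes (0 or 1) from several annotators into one soft target.

    `weights` is a hypothetical per-annotator trust score; equal trust by default.
    The result is the trust-weighted fraction of positive votes.
    """
    if not votes:
        raise ValueError("at least one vote is required")
    if weights is None:
        weights = [1.0] * len(votes)
    if len(weights) != len(votes):
        raise ValueError("votes and weights must have the same length")
    return sum(w * v for w, v in zip(weights, votes)) / sum(weights)


def binary_cross_entropy(pred, target, eps=1e-7):
    """Cross-entropy between a predicted probability and a soft target in [0, 1]."""
    pred = min(max(pred, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(pred) + (1.0 - target) * math.log(1.0 - pred))


# Three graders disagree; the first is trusted twice as much (made-up weights).
target = soft_label_from_annotators([1, 0, 1], weights=[2.0, 1.0, 1.0])
loss = binary_cross_entropy(0.9, target)
```

With a soft target, an image with disagreeing annotations still contributes to training instead of being discarded; the loss is smallest when the prediction matches the level of agreement rather than a hard 0 or 1.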

[CV-103] C2RV: Cross-Regional and Cross-View Learning for Sparse-View CBCT Reconstruction

Link: https://arxiv.org/abs/2406.03902
Authors: Yiqun Lin,Jiewen Yang,Hualiang Wang,Xinpeng Ding,Wei Zhao,Xiaomeng Li
Keywords: Cone beam computed, important imaging technology, imaging technology widely, beam computed tomography, Cone beam
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2024

Abstract:Cone beam computed tomography (CBCT) is an important imaging technology widely used in medical scenarios, such as diagnosis and preoperative planning. Using fewer projection views to reconstruct CT, also known as sparse-view reconstruction, can reduce ionizing radiation and further benefit interventional radiology. Compared with sparse-view reconstruction for traditional parallel/fan-beam CT, CBCT reconstruction is more challenging due to the increased dimensionality caused by the measurement process based on cone-shaped X-ray beams. As a 2D-to-3D reconstruction problem, although implicit neural representations have been introduced to enable efficient training, only local features are considered and different views are processed equally in previous works, resulting in spatial inconsistency and poor performance on complicated anatomies. To this end, we propose C²RV by leveraging explicit multi-scale volumetric representations to enable cross-regional learning in the 3D space. Additionally, the scale-view cross-attention module is introduced to adaptively aggregate multi-scale and multi-view features. Extensive experiments demonstrate that our C²RV achieves consistent and significant improvement over previous state-of-the-art methods on datasets with diverse anatomy.

[CV-104] Polyp and Surgical Instrument Segmentation with Double Encoder-Decoder Networks

Link: https://arxiv.org/abs/2406.03901
Authors: Adrian Galdran
Keywords: MedAI competition, endoscopic images, paper describes, describes a solution, participants were required
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:This paper describes a solution for the MedAI competition, in which participants were required to segment both polyps and surgical instruments from endoscopic images. Our approach relies on a double encoder-decoder neural network which we have previously applied for polyp segmentation, but with a series of enhancements: a more powerful encoder architecture, an improved optimization procedure, and the post-processing of segmentations based on tempered model ensembling. Experimental results show that our method produces segmentations that show a good agreement with manual delineations provided by medical experts.

[CV-105] Shadow and Light: Digitally Reconstructed Radiographs for Disease Classification

Link: https://arxiv.org/abs/2406.03688
Authors: Benjamin Hou,Qingqing Zhu,Tejas Sudarshan Mathai,Qiao Jin,Zhiyong Lu,Ronald M. Summers
Keywords: recently released CT-RATE, Digitally Reconstructed Radiographs, released CT-RATE dataset, frontal Digitally Reconstructed, synthetic chest X-ray
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In this paper, we introduce DRR-RATE, a large-scale synthetic chest X-ray dataset derived from the recently released CT-RATE dataset. DRR-RATE comprises 50,188 frontal Digitally Reconstructed Radiographs (DRRs) from 21,304 unique patients. Each image is paired with a corresponding radiology text report and binary labels for 18 pathology classes. Given the controllable nature of DRR generation, it facilitates the inclusion of lateral view images and images from any desired viewing position. This opens up avenues for research into novel multimodal applications involving paired CT, X-ray images from various views, text, and binary labels. We demonstrate the applicability of DRR-RATE alongside existing large-scale chest X-ray resources, notably the CheXpert dataset and CheXnet model. Experiments demonstrate that CheXnet, when trained and tested on the DRR-RATE dataset, achieves sufficient to high AUC scores for the six common pathologies cited in common literature: Atelectasis, Cardiomegaly, Consolidation, Lung Lesion, Lung Opacity, and Pleural Effusion. Additionally, CheXnet trained on the CheXpert dataset can accurately identify several pathologies, even when operating out of distribution. This confirms that the generated DRR images effectively capture the essential pathology features from CT images. The dataset and labels are publicly accessible at this https URL.

Machine Learning

[LG-0] Verbalized Machine Learning: Revisiting Machine Learning with Language Models

Link: https://arxiv.org/abs/2406.04344
Authors: Tim Z. Xiao,Robert Bamler,Bernhard Schölkopf,Weiyang Liu
Keywords: large progress made, large language models, machine learning models, machine learning, large progress
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical Report v1 (92 pages, 15 figures)

Abstract:Motivated by the large progress made by large language models (LLMs), we introduce the framework of verbalized machine learning (VML). In contrast to conventional machine learning models that are typically optimized over a continuous parameter space, VML constrains the parameter space to be human-interpretable natural language. Such a constraint leads to a new perspective of function approximation, where an LLM with a text prompt can be viewed as a function parameterized by the text prompt. Guided by this perspective, we revisit classical machine learning problems, such as regression and classification, and find that these problems can be solved by an LLM-parameterized learner and optimizer. The major advantages of VML include (1) easy encoding of inductive bias: prior knowledge about the problem and hypothesis class can be encoded in natural language and fed into the LLM-parameterized learner; (2) automatic model class selection: the optimizer can automatically select a concrete model class based on data and verbalized prior knowledge, and it can update the model class during training; and (3) interpretable learner updates: the LLM-parameterized optimizer can provide explanations for why each learner update is performed. We conduct several studies to empirically evaluate the effectiveness of VML, and hope that VML can serve as a stepping stone to stronger interpretability and trustworthiness in ML.

[LG-1] On the Expressive Power of Spectral Invariant Graph Neural Networks

Link: https://arxiv.org/abs/2406.04336
Authors: Bohang Zhang,Lingxiao Zhao,Haggai Maron
Keywords: Graph Neural Networks, enhance Graph Neural, Neural Networks, Incorporating spectral information, Graph Neural
Subjects: Machine Learning (cs.LG); Discrete Mathematics (cs.DM); Data Structures and Algorithms (cs.DS); Combinatorics (math.CO); Spectral Theory (math.SP)
Comments: 31 pages; 3 figures; to appear in ICML 2024

Abstract:Incorporating spectral information to enhance Graph Neural Networks (GNNs) has shown promising results but raises a fundamental challenge due to the inherent ambiguity of eigenvectors. Various architectures have been proposed to address this ambiguity, referred to as spectral invariant architectures. Notable examples include GNNs and Graph Transformers that use spectral distances, spectral projection matrices, or other invariant spectral features. However, the potential expressive power of these spectral invariant architectures remains largely unclear. The goal of this work is to gain a deep theoretical understanding of the expressive power obtainable when using spectral features. We first introduce a unified message-passing framework for designing spectral invariant GNNs, called Eigenspace Projection GNN (EPNN). A comprehensive analysis shows that EPNN essentially unifies all prior spectral invariant architectures, in that they are either strictly less expressive or equivalent to EPNN. A fine-grained expressiveness hierarchy among different architectures is also established. On the other hand, we prove that EPNN itself is bounded by a recently proposed class of Subgraph GNNs, implying that all these spectral invariant architectures are strictly less expressive than 3-WL. Finally, we discuss whether using spectral features can gain additional expressiveness when combined with more expressive GNNs.

[LG-2] Coarse-To-Fine Tensor Trains for Compact Visual Representations

Link: https://arxiv.org/abs/2406.04332
Authors: Sebastian Loeschcke,Dan Wang,Christian Leth-Espensen,Serge Belongie,Michael J. Kastoryano,Sagie Benaim
Keywords: tensor train, tensor, tensor train representation, train, ability to learn
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Project webpage: this https URL

Abstract:The ability to learn compact, high-quality, and easy-to-optimize representations for visual data is paramount to many applications such as novel view synthesis and 3D reconstruction. Recent work has shown substantial success in using tensor networks to design such compact and high-quality representations. However, the ability to optimize tensor-based representations, and in particular, the highly compact tensor train representation, is still lacking. This has prevented practitioners from deploying the full potential of tensor networks for visual data. To this end, we propose ‘Prolongation Upsampling Tensor Train (PuTT)’, a novel method for learning tensor train representations in a coarse-to-fine manner. Our method involves the prolonging or ‘upsampling’ of a learned tensor train representation, creating a sequence of ‘coarse-to-fine’ tensor trains that are incrementally refined. We evaluate our representation along three axes: (1) compression, (2) denoising capability, and (3) image completion capability. To assess these axes, we consider the tasks of image fitting, 3D fitting, and novel view synthesis, where our method shows an improved performance compared to state-of-the-art tensor-based methods. For full results see our project webpage: this https URL

[LG-3] PaCE: Parsimonious Concept Engineering for Large Language Models

Link: https://arxiv.org/abs/2406.04331
Authors: Jinqi Luo,Tianjiao Ding,Kwan Ho Ryan Chan,Darshan Thaker,Aditya Chattopadhyay,Chris Callison-Burch,René Vidal
Keywords: Large Language Models, Large Language, wide variety, Large, Alignment
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: 26 pages, 17 figures, 5 tables, dataset and code at this https URL

Abstract:Large Language Models (LLMs) are being used for a wide variety of tasks. While they are capable of generating human-like responses, they can also produce undesirable output including potentially harmful information, racist or sexist language, and hallucinations. Alignment methods are designed to reduce such undesirable output, via techniques such as fine-tuning, prompt engineering, and representation engineering. However, existing methods face several challenges: some require costly fine-tuning for every alignment task; some do not adequately remove undesirable concepts, failing alignment; some remove benign concepts, lowering the linguistic capabilities of LLMs. To address these issues, we propose Parsimonious Concept Engineering (PaCE), a novel activation engineering framework for alignment. First, to sufficiently model the concepts, we construct a large-scale concept dictionary in the activation space, in which each atom corresponds to a semantic concept. Then, given any alignment task, we instruct a concept partitioner to efficiently annotate the concepts as benign or undesirable. Finally, at inference time, we decompose the LLM activations along the concept dictionary via sparse coding, to accurately represent the activation as a linear combination of the benign and undesirable components. By removing the latter ones from the activation, we reorient the behavior of LLMs towards alignment goals. We conduct experiments on tasks such as response detoxification, faithfulness enhancement, and sentiment revising, and show that PaCE achieves state-of-the-art alignment performance while maintaining linguistic capabilities.
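
The decompose-then-remove step described in this abstract can be illustrated with a toy sketch. This is not the PaCE implementation: it swaps in plain greedy matching pursuit as the sparse coder, assumes unit-norm dictionary atoms, and uses a tiny hand-written dictionary; every name below is hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def matching_pursuit(x, atoms, n_steps):
    """Greedy sparse coding: approximate x as a sparse combination of unit-norm atoms."""
    residual = list(x)
    coeffs = [0.0] * len(atoms)
    for _ in range(n_steps):
        scores = [dot(residual, atom) for atom in atoms]
        j = max(range(len(atoms)), key=lambda i: abs(scores[i]))  # best-matching atom
        coeffs[j] += scores[j]
        residual = [r - scores[j] * a for r, a in zip(residual, atoms[j])]
    return coeffs, residual


def remove_concepts(x, atoms, undesirable, n_steps=3):
    """Subtract from x the components assigned to atoms flagged as undesirable."""
    coeffs, _ = matching_pursuit(x, atoms, n_steps)
    cleaned = list(x)
    for j in undesirable:
        cleaned = [c - coeffs[j] * a for c, a in zip(cleaned, atoms[j])]
    return cleaned


# Toy "concept dictionary": three orthonormal atoms in a 3-dimensional activation space.
atoms = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
activation = [2.0, -1.0, 0.5]
cleaned = remove_concepts(activation, atoms, undesirable={1})
```

In this toy case the component along atom 1 is removed and the other two components are left intact, which mirrors the stated goal of removing undesirable concepts without discarding benign ones.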

[LG-4] Simplified and Generalized Masked Diffusion for Discrete Data

Link: https://arxiv.org/abs/2406.04329
Authors: Jiaxin Shi,Kehang Han,Zhe Wang,Arnaud Doucet,Michalis K. Titsias
Keywords: masked diffusion models, masked diffusion, actively explored, diffusion models, models
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:Masked (or absorbing) diffusion is actively explored as an alternative to autoregressive models for generative modeling of discrete data. However, existing work in this area has been hindered by unnecessarily complex model formulations and unclear relationships between different perspectives, leading to suboptimal parameterization, training objectives, and ad hoc adjustments to counteract these issues. In this work, we aim to provide a simple and general framework that unlocks the full potential of masked diffusion models. We show that the continuous-time variational objective of masked diffusion models is a simple weighted integral of cross-entropy losses. Our framework also enables training generalized masked diffusion models with state-dependent masking schedules. When evaluated by perplexity, our models trained on OpenWebText surpass prior diffusion language models at GPT-2 scale and demonstrate superior performance on 4 out of 5 zero-shot language modeling tasks. Furthermore, our models vastly outperform previous discrete diffusion models on pixel-level image modeling, achieving 2.78 (CIFAR-10) and 3.42 (ImageNet 64×64) bits per dimension that are comparable or better than autoregressive models of similar sizes.

[LG-5] The Brain's Bitter Lesson: Scaling Speech Decoding With Self-Supervised Learning

Link: https://arxiv.org/abs/2406.04328
Authors: Dulhan Jayalath,Gilad Landau,Brendan Shillingford,Mark Woolrich,Oiwi Parker Jones
Keywords: brain activity, past few years, years have produced, produced a series, series of spectacular
Subjects: Machine Learning (cs.LG)
Comments: 10 pages, 4 figures, under review

Abstract:The past few years have produced a series of spectacular advances in the decoding of speech from brain activity. The engine of these advances has been the acquisition of labelled data, with increasingly large datasets acquired from single subjects. However, participants exhibit anatomical and other individual differences, and datasets use varied scanners and task designs. As a result, prior work has struggled to leverage data from multiple subjects, multiple datasets, multiple tasks, and unlabelled datasets. In turn, the field has not benefited from the rapidly growing number of open neural data repositories to exploit large-scale data and deep learning. To address this, we develop an initial set of neuroscience-inspired self-supervised objectives, together with a neural architecture, for representation learning from heterogeneous and unlabelled neural recordings. Experimental results show that representations learned with these objectives generalise across subjects, datasets, and tasks, and are also learned faster than using only labelled data. In addition, we set new benchmarks for two foundational speech decoding tasks. Taken together, these methods now unlock the potential for training speech decoding models with orders of magnitude more existing data.

[LG-6] Causal Estimation of Memorisation Profiles

Link: https://arxiv.org/abs/2406.04327
Authors: Pietro Lesci,Clara Meister,Thomas Hofmann,Andreas Vlachos,Tiago Pimentel
Keywords: preventing copyright infringements, studying models’ training, models’ training dynamics, Understanding memorisation, societal implications
Subjects: Machine Learning (cs.LG)
Comments: Published at the ACL 2024 Conference (main)

Abstract:Understanding memorisation in language models has practical and societal implications, e.g., studying models’ training dynamics or preventing copyright infringements. Prior work defines memorisation as the causal effect of training with an instance on the model’s ability to predict that instance. This definition relies on a counterfactual: the ability to observe what would have happened had the model not seen that instance. Existing methods struggle to provide computationally efficient and accurate estimates of this counterfactual. Further, they often estimate memorisation for a model architecture rather than for a specific model instance. This paper fills an important gap in the literature, proposing a new, principled, and efficient method to estimate memorisation based on the difference-in-differences design from econometrics. Using this method, we characterise a model’s memorisation profile (its memorisation trends across training) by only observing its behaviour on a small set of instances throughout training. In experiments with the Pythia model suite, we find that memorisation (i) is stronger and more persistent in larger models, (ii) is determined by data order and learning rate, and (iii) has stable trends across model sizes, thus making memorisation in larger models predictable from smaller ones.
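
For readers unfamiliar with the difference-in-differences design mentioned here, the textbook two-group, two-period estimator can be sketched as below. This is the generic econometrics estimator, not the paper's estimator; the function name and the log-likelihood numbers are made up for illustration.

```python
from statistics import mean


def difference_in_differences(treated_pre, treated_post, control_pre, control_post):
    """Change in the treated group minus change in the control group.

    Here "treated" would be instances the model trained on between two checkpoints,
    and "control" would be instances it has not yet seen (an illustrative reading).
    """
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Hypothetical per-instance log-likelihoods before and after a training step.
effect = difference_in_differences(
    treated_pre=[-3.0, -3.2],
    treated_post=[-1.0, -1.2],
    control_pre=[-3.1, -2.9],
    control_post=[-2.6, -2.4],
)
```

The control group's change (0.5 here) stands in for the general improvement that would have happened anyway, so only the remaining 1.5 is attributed to having trained on the treated instances.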

[LG-7] ATraDiff: Accelerating Online Reinforcement Learning with Imaginary Trajectories

Link: https://arxiv.org/abs/2406.04323
Authors: Qianlan Yang,Yu-Xiong Wang
Keywords: Training autonomous agents, Training autonomous, low data efficiency, offline data, due to low
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: ICML 2024 Accepted

Abstract:Training autonomous agents with sparse rewards is a long-standing problem in online reinforcement learning (RL), due to low data efficiency. Prior work overcomes this challenge by extracting useful knowledge from offline data, often accomplished through the learning of action distribution from offline data and utilizing the learned distribution to facilitate online RL. However, since the offline data are given and fixed, the extracted knowledge is inherently limited, making it difficult to generalize to new tasks. We propose a novel approach that leverages offline data to learn a generative diffusion model, coined as Adaptive Trajectory Diffuser (ATraDiff). This model generates synthetic trajectories, serving as a form of data augmentation and consequently enhancing the performance of online RL methods. The key strength of our diffuser lies in its adaptability, allowing it to effectively handle varying trajectory lengths and mitigate distribution shifts between online and offline data. Because of its simplicity, ATraDiff seamlessly integrates with a wide spectrum of RL methods. Empirical evaluation shows that ATraDiff consistently achieves state-of-the-art performance across a variety of environments, with particularly pronounced improvements in complicated settings. Our code and demo video are available at this https URL .

[LG-8] VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling

Link: https://arxiv.org/abs/2406.04321
Authors: Zeyue Tian,Zhaoyang Liu,Ruibin Yuan,Jiahao Pan,Xiaoqiang Huang,Qifeng Liu,Xu Tan,Qifeng Chen,Wei Xue,Yike Guo
Keywords: generation conditioned solely, study music generation, music generation conditioned, systematically study music, systematically study
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM); Sound (cs.SD)
Comments: The code and datasets will be available at this https URL

Abstract:In this work, we systematically study music generation conditioned solely on the video. First, we present a large-scale dataset comprising 190K video-music pairs, including various genres such as movie trailers, advertisements, and documentaries. Furthermore, we propose VidMuse, a simple framework for generating music aligned with video inputs. VidMuse stands out by producing high-fidelity music that is both acoustically and semantically aligned with the video. By incorporating local and global visual cues, VidMuse enables the creation of musically coherent audio tracks that consistently match the video content through Long-Short-Term modeling. Through extensive experiments, VidMuse outperforms existing models in terms of audio quality, diversity, and audio-visual alignment. The code and datasets will be available at this https URL.

[LG-9] Chimera: Effectively Modeling Multivariate Time Series with 2-Dimensional State Space Models

Link: https://arxiv.org/abs/2406.04320
Authors: Ali Behrouz,Michele Santacatterina,Ramin Zabih
Keywords: Modeling multivariate time, State Space Models, time series, time series modeling, multivariate time series
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Modeling multivariate time series is a well-established problem with a wide range of applications from healthcare to financial markets. Traditional State Space Models (SSMs) are classical approaches for univariate time series modeling due to their simplicity and expressive power to represent linear dependencies. They, however, have fundamentally limited expressive power to capture non-linear dependencies, are slow in practice, and fail to model the inter-variate information flow. Despite recent attempts to improve the expressive power of SSMs by using deep structured SSMs, the existing methods are either limited to univariate time series, fail to model complex patterns (e.g., seasonal patterns), fail to dynamically model the dependencies of variate and time dimensions, and/or are input-independent. We present Chimera, which uses two input-dependent 2-D SSM heads with different discretization processes to learn long-term progression and seasonal patterns. To improve the efficiency of complex 2D recurrence, we present a fast training procedure using a new 2-dimensional parallel selective scan. We further present and discuss 2-dimensional Mamba and Mamba-2 as special cases of our 2D SSM. Our experimental evaluation shows the superior performance of Chimera on extensive and diverse benchmarks, including ECG and speech time series classification, long-term and short-term time series forecasting, and time series anomaly detection.
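
As background for the "traditional SSM" baseline this abstract contrasts against, the classical discrete linear recurrence h_t = A·h_{t-1} + B·x_t, y_t = C·h_t can be sketched as follows. This is only the one-dimensional, scalar-state textbook form under assumed parameter names, not Chimera's input-dependent 2-D SSM; in general A, B and C are matrices.

```python
def run_linear_ssm(inputs, a, b, c, h0=0.0):
    """Scan a scalar linear state space model over a univariate sequence.

    State update:  h_t = a * h_{t-1} + b * x_t
    Readout:       y_t = c * h_t
    """
    h = h0
    outputs = []
    for x in inputs:
        h = a * h + b * x   # the state carries a decaying memory of past inputs
        outputs.append(c * h)
    return outputs


# Impulse response: a single input spike decays geometrically with rate a.
impulse_response = run_linear_ssm([1.0, 0.0, 0.0], a=0.5, b=1.0, c=1.0)
```

Because a, b and c are fixed, the output is a linear function of the inputs, which is the limitation the abstract points to; input-dependent variants make these parameters functions of x_t.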

[LG-10] Adaptive Sampling of k-Space in Magnetic Resonance for Rapid Pathology Prediction

Link: https://arxiv.org/abs/2406.04318
Authors: Chen-Yu Yen,Raghav Singhal,Umang Sharma,Rajesh Ranganath,Sumit Chopra,Lerrel Pinto
Keywords: Magnetic Resonance, proven diagnostic utility, inaccessible imaging modality, imaging modality, diagnostic utility
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: ICML 2024. Project website at this https URL

Abstract:Magnetic Resonance (MR) imaging, despite its proven diagnostic utility, remains an inaccessible imaging modality for disease surveillance at the population level. A major factor rendering MR inaccessible is lengthy scan times. An MR scanner collects measurements associated with the underlying anatomy in the Fourier space, also known as the k-space. Creating a high-fidelity image requires collecting large quantities of such measurements, increasing the scan time. Traditionally, to accelerate an MR scan, image reconstruction from under-sampled k-space data is the method of choice. However, recent works show the feasibility of bypassing image reconstruction and learning to detect disease directly from a sparser learned subset of the k-space measurements. In this work, we propose Adaptive Sampling for MR (ASMR), a sampling method that learns an adaptive policy to sequentially select k-space samples to optimize for target disease detection. On 6 out of 8 pathology classification tasks spanning the Knee, Brain, and Prostate MR scans, ASMR reaches within 2% of the performance of a fully sampled classifier while using only 8% of the k-space, as well as outperforming prior state-of-the-art work in k-space sampling such as EMRT, LOUPE, and DPS.
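
What "using only a subset of the k-space" means can be shown with a one-dimensional toy. This sketch uses a fixed, hand-picked sampling mask and a naive discrete Fourier transform; it does not implement ASMR's learned adaptive policy, and the signal and function names are illustrative assumptions.

```python
import cmath
import math


def dft(signal):
    """Naive discrete Fourier transform: signal -> k-space coefficients."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(kspace):
    """Naive inverse transform: k-space coefficients -> signal."""
    n = len(kspace)
    return [sum(kspace[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def undersample(kspace, keep):
    """Zero out every k-space coefficient whose index is not in `keep`."""
    return [value if index in keep else 0j for index, value in enumerate(kspace)]


# A pure sinusoid occupies only two of the eight frequency bins.
signal = [math.sin(2 * math.pi * 2 * t / 8) for t in range(8)]
kspace = dft(signal)
recon = [v.real for v in idft(undersample(kspace, keep={2, 6}))]
max_error = max(abs(r - s) for r, s in zip(recon, signal))
```

Here 25% of the coefficients carry all the information, so the reconstruction error is at the level of floating-point noise. Real anatomy is not this sparse, which is why choosing which measurements to acquire is a learning problem.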

[LG-11] Regularized KL-Divergence for Well-Defined Function-Space Variational Inference in Bayesian neural networks

Link: https://arxiv.org/abs/2406.04317
Authors: Tristan Cinquin,Robert Bamler
Keywords: Bayesian neural networks, Bayesian neural, neural networks, principled uncertainty modeling, uncertainty modeling important
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:Bayesian neural networks (BNN) promise to combine the predictive performance of neural networks with principled uncertainty modeling important for safety-critical systems and decision making. However, posterior uncertainty estimates depend on the choice of prior, and finding informative priors in weight-space has proven difficult. This has motivated variational inference (VI) methods that pose priors directly on the function generated by the BNN rather than on weights. In this paper, we address a fundamental issue with such function-space VI approaches pointed out by Burt et al. (2020), who showed that the objective function (ELBO) is negative infinite for most priors of interest. Our solution builds on generalized VI (Knoblauch et al., 2019) with the regularized KL divergence (Quang, 2019) and is, to the best of our knowledge, the first well-defined variational objective for function-space inference in BNNs with Gaussian process (GP) priors. Experiments show that our method incorporates the properties specified by the GP prior on synthetic and small real-world data sets, and provides competitive uncertainty estimates for regression, classification and out-of-distribution detection compared to BNN baselines with both function and weight-space priors.

[LG-12] Improving Alignment and Robustness with Short Circuiting

Link: https://arxiv.org/abs/2406.04313
Authors: Andy Zou,Long Phan,Justin Wang,Derek Duenas,Maxwell Lin,Maksym Andriushchenko,Rowan Wang,Zico Kolter,Matt Fredrikson,Dan Hendrycks
Keywords: highly vulnerable, harmful, adversarial, attacks, harmful outputs
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
Comments:

Abstract:AI systems can take harmful actions and are highly vulnerable to adversarial attacks. We present an approach, inspired by recent advances in representation engineering, that “short-circuits” models as they respond with harmful outputs. Existing techniques aimed at improving alignment, such as refusal training, are often bypassed. Techniques such as adversarial training try to plug these holes by countering specific attacks. As an alternative to refusal training and adversarial training, short-circuiting directly controls the representations that are responsible for harmful outputs in the first place. Our technique can be applied to both text-only and multimodal language models to prevent the generation of harmful outputs without sacrificing utility – even in the presence of powerful unseen attacks. Notably, while adversarial robustness in standalone image recognition remains an open challenge, short-circuiting allows the larger multimodal system to reliably withstand image “hijacks” that aim to produce harmful content. Finally, we extend our approach to AI agents, demonstrating considerable reductions in the rate of harmful actions when they are under attack. Our approach represents a significant step forward in the development of reliable safeguards to harmful behavior and adversarial attacks.

[LG-13] ReFiNe: Recursive Field Networks for Cross-modal Multi-scene Representation

Link: https://arxiv.org/abs/2406.04309
Authors: Sergey Zakharov,Katherine Liu,Adrien Gaidon,Rares Ambrus
Keywords: involve trading modeling, trading modeling accuracy, multi-shape representation, involve trading, common trade-offs
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: SIGGRAPH 2024. Project Page: this https URL

Abstract:The common trade-offs of state-of-the-art methods for multi-shape representation (a single model “packing” multiple objects) involve trading modeling accuracy against memory and storage. We show how to encode multiple shapes represented as continuous neural fields with a higher degree of precision than previously possible and with low memory usage. Key to our approach is a recursive hierarchical formulation that exploits object self-similarity, leading to a highly compressed and efficient shape latent space. Thanks to the recursive formulation, our method supports spatial and global-to-local latent feature fusion without needing to initialize and maintain auxiliary data structures, while still allowing for continuous field queries to enable applications such as raytracing. In experiments on a set of diverse datasets, we provide compelling qualitative results and demonstrate state-of-the-art multi-scene reconstruction and compression results with a single network per dataset.

[LG-14] Approximation-Aware Bayesian Optimization

Link: https://arxiv.org/abs/2406.04308
Authors: Natalie Maus,Kyurae Kim,Geoff Pleiss,David Eriksson,John P. Cunningham,Jacob R. Gardner
Keywords: High-dimensional Bayesian optimization, obtaining meaningful results, Bayesian optimization, High-dimensional Bayesian, evaluations before obtaining
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:High-dimensional Bayesian optimization (BO) tasks such as molecular design often require 10,000 function evaluations before obtaining meaningful results. While methods like sparse variational Gaussian processes (SVGPs) reduce computational requirements in these settings, the underlying approximations result in suboptimal data acquisitions that slow the progress of optimization. In this paper we modify SVGPs to better align with the goals of BO: targeting informed data acquisition rather than global posterior fidelity. Using the framework of utility-calibrated variational inference, we unify GP approximation and data acquisition into a joint optimization problem, thereby ensuring optimal decisions under a limited computational budget. Our approach can be used with any decision-theoretic acquisition function and is compatible with trust region methods like TuRBO. We derive efficient joint objectives for the expected improvement and knowledge gradient acquisition functions in both the standard and batch BO settings. Our approach outperforms standard SVGPs on high-dimensional benchmark tasks in control and molecular design.
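
The abstract mentions the expected improvement (EI) acquisition function. As a reminder of what that quantity is, here is a minimal sketch of the textbook closed-form EI for maximization under a Gaussian posterior. It is not the paper's joint SVGP objective; the function name and arguments are illustrative assumptions.

```python
import math


def expected_improvement(mu: float, sigma: float, best: float) -> float:
    """Textbook closed-form EI for maximization.

    Assumes the surrogate posterior at a candidate point is N(mu, sigma^2)
    and `best` is the best objective value observed so far.
    """
    if sigma <= 0.0:
        # Degenerate posterior: the improvement is deterministic.
        return max(mu - best, 0.0)
    z = (mu - best) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # standard normal density
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))         # standard normal CDF
    return (mu - best) * cdf + sigma * pdf


# A candidate whose mean is below the incumbent can still have positive EI
# when the posterior is uncertain.
print(expected_improvement(mu=0.8, sigma=0.5, best=1.0))
```

EI trades off a high posterior mean against high posterior uncertainty, which is why the quality of the GP approximation directly affects which points get acquired.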

[LG-15] Semantically Diverse Language Generation for Uncertainty Estimation in Language Models

Link: https://arxiv.org/abs/2406.04306
Authors: Lukas Aichberger,Kajetan Schweighofer,Mykyta Ielanskyi,Sepp Hochreiter
Keywords: Large language models, Large language, Large, language models, LLMs
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Large language models (LLMs) can suffer from hallucinations when generating text. These hallucinations impede various applications in society and industry by making LLMs untrustworthy. Current LLMs generate text in an autoregressive fashion by predicting and appending text tokens. When an LLM is uncertain about the semantic meaning of the next tokens to generate, it is likely to start hallucinating. Thus, it has been suggested that hallucinations stem from predictive uncertainty. We introduce Semantically Diverse Language Generation (SDLG) to quantify predictive uncertainty in LLMs. SDLG steers the LLM to generate semantically diverse yet likely alternatives for an initially generated text. This approach provides a precise measure of aleatoric semantic uncertainty, detecting whether the initial text is likely to be hallucinated. Experiments on question-answering tasks demonstrate that SDLG consistently outperforms existing methods while being the most computationally efficient, setting a new standard for uncertainty estimation in LLMs.

[LG-16] Vision-LSTM: xLSTM as Generic Vision Backbone

Link: https://arxiv.org/abs/2406.04303
Authors: Benedikt Alkin,Maximilian Beck,Korbinian Pöppel,Sepp Hochreiter,Johannes Brandstetter
Keywords: natural language processing, Transformers are widely, language processing, initially introduced, introduced for natural
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Transformers are widely used as generic backbones in computer vision, despite being initially introduced for natural language processing. Recently, the Long Short-Term Memory (LSTM) has been extended to a scalable and performant architecture, the xLSTM, which overcomes long-standing LSTM limitations via exponential gating and a parallelizable matrix memory structure. In this report, we introduce Vision-LSTM (ViL), an adaptation of the xLSTM building blocks to computer vision. ViL comprises a stack of xLSTM blocks where odd blocks process the sequence of patch tokens from top to bottom while even blocks go from bottom to top. Experiments show that ViL holds promise to be further deployed as a new generic backbone for computer vision architectures.
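
The alternating traversal described in the abstract can be sketched in a few lines. This is only an illustration of the direction-flipping idea: `toy_recurrent_block` is a hypothetical stand-in (a causal exponential moving average over scalar tokens), not an xLSTM block, and all names are assumptions.

```python
def toy_recurrent_block(tokens, decay=0.5):
    """Hypothetical stand-in for a recurrent block: a causal moving average.

    Each output depends only on the current and earlier inputs.
    """
    state, out = 0.0, []
    for t in tokens:
        state = decay * state + (1.0 - decay) * t
        out.append(state)
    return out


def alternating_stack(tokens, num_blocks=2):
    """Apply blocks whose traversal direction alternates with depth."""
    x = list(tokens)
    for i in range(1, num_blocks + 1):  # 1-based so "odd" matches the abstract
        if i % 2 == 1:
            x = toy_recurrent_block(x)              # first patch to last patch
        else:
            x = toy_recurrent_block(x[::-1])[::-1]  # last patch to first patch
    return x


# With one block the first token cannot see later tokens; with two it can.
print(alternating_stack([0.0, 0.0, 1.0], num_blocks=1))
print(alternating_stack([0.0, 0.0, 1.0], num_blocks=2))
```

Flipping the direction every other block is one simple way to give a causal, sequential layer access to context on both sides of each patch.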

[LG-17] Representational Alignment Supports Effective Machine Teaching

Link: https://arxiv.org/abs/2406.04302
Authors: Ilia Sucholutsky,Katherine M. Collins,Maya Malaviya,Nori Jacoby,Weiyang Liu,Theodore R. Sumers,Michalis Korakakis,Umang Bhatt,Mark Ho,Joshua B. Tenenbaum,Brad Love,Zachary A. Pardos,Adrian Weller,Thomas L. Griffiths
Keywords: representational alignment, utility curve, representational, alignment, student
Categories: Machine Learning (cs.LG)
Comments: Preprint

Abstract:A good teacher should not only be knowledgeable, but should also be able to communicate in a way that the student understands – to share the student’s representation of the world. In this work, we integrate insights from machine teaching and pragmatic communication with the burgeoning literature on representational alignment to characterize a utility curve defining a relationship between representational alignment and teacher capability for promoting student learning. To explore the characteristics of this utility curve, we design a supervised learning environment that disentangles representational alignment from teacher accuracy. We conduct extensive computational experiments with machines teaching machines, complemented by a series of experiments in which machines teach humans. Drawing on our findings that improved representational alignment with a student improves student learning outcomes (i.e., task accuracy), we design a classroom matching procedure that assigns students to teachers based on the utility curve. If we are to design effective machine teachers, it is not enough to build teachers that are accurate – we want teachers that can align, representationally, to their students too.

[LG-18] NoisyGL: A Comprehensive Benchmark for Graph Neural Networks under Label Noise

Link: https://arxiv.org/abs/2406.04299
Authors: Zhonghao Wang,Danyu Sun,Sheng Zhou,Haobo Wang,Jiapei Fan,Longtao Huang,Jiajun Bu
Keywords: exhibit strong potential, Graph Neural Networks, node classification task, Neural Networks, Graph Neural
Categories: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments: Submitted to the 38th Conference on Neural Information Processing Systems (NeurIPS 2024) Track on Datasets and Benchmarks

Abstract:Graph Neural Networks (GNNs) exhibit strong potential in node classification tasks through a message-passing mechanism. However, their performance often hinges on high-quality node labels, which are challenging to obtain in real-world scenarios due to unreliable sources or adversarial attacks. Consequently, label noise is common in real-world graph data, negatively impacting GNNs by propagating incorrect information during training. To address this issue, the study of Graph Neural Networks under Label Noise (GLN) has recently gained traction. However, due to variations in dataset selection, data splitting, and preprocessing techniques, the community currently lacks a comprehensive benchmark, which impedes deeper understanding and further development of GLN. To fill this gap, we introduce NoisyGL in this paper, the first comprehensive benchmark for graph neural networks under label noise. NoisyGL enables fair comparisons and detailed analyses of GLN methods on noisy labeled graph data across various datasets, with unified experimental settings and interface. Our benchmark has uncovered several important insights that were missed in previous research, and we believe these findings will be highly beneficial for future studies. We hope our open-source benchmark library will foster further advancements in this field. The code of the benchmark can be found at this https URL.

[LG-19] Stratified Prediction-Powered Inference for Hybrid Language Model Evaluation

Link: https://arxiv.org/abs/2406.04291
Authors: Adam Fisch,Joshua Maynez,R. Alex Hofer,Bhuwan Dhingra,Amir Globerson,William W. Cohen
Keywords: Stratified Prediction-Powered Inference, Prediction-powered inference, limited human-labeled data, statistical estimates based, tighter confidence intervals
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Prediction-powered inference (PPI) is a method that improves statistical estimates based on limited human-labeled data. PPI achieves this by combining small amounts of human-labeled data with larger amounts of data labeled by a reasonably accurate – but potentially biased – automatic system, in a way that results in tighter confidence intervals for certain parameters of interest (e.g., the mean performance of a language model). In this paper, we propose a method called Stratified Prediction-Powered Inference (StratPPI), in which we show that the basic PPI estimates can be considerably improved by employing simple data stratification strategies. Without making any assumptions on the underlying automatic labeling system or data distribution, we derive an algorithm for computing provably valid confidence intervals for population parameters (such as averages) that is based on stratified sampling. In particular, we show both theoretically and empirically that, with appropriate choices of stratification and sample allocation, our approach can provide substantially tighter confidence intervals than unstratified approaches. Specifically, StratPPI is expected to improve in cases where the performance of the autorater varies across different conditional distributions of the target data.
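
To make the idea concrete, here is a minimal sketch of the basic prediction-powered point estimate of a mean, plus a stratified variant that combines per-stratum estimates. It covers only the point estimate, not the confidence intervals or sample allocation that the paper derives. The function names and the assumption that stratum weights are known population shares are illustrative, not taken from the paper.

```python
from statistics import fmean


def ppi_mean(labeled_y, labeled_pred, unlabeled_pred):
    """Autorater mean on unlabeled data plus a bias correction from labeled data."""
    # The "rectifier" estimates the autorater's average error on human-labeled items.
    rectifier = fmean([y - p for y, p in zip(labeled_y, labeled_pred)])
    return fmean(unlabeled_pred) + rectifier


def stratified_ppi_mean(strata):
    """Weighted sum of per-stratum PPI means.

    `strata` is a list of (weight, labeled_y, labeled_pred, unlabeled_pred);
    the weights are assumed to be known population shares that sum to 1.
    """
    return sum(w * ppi_mean(y, p, u) for w, y, p, u in strata)


# Hypothetical data: the autorater underestimates in stratum A and
# overestimates in stratum B, so each stratum gets its own correction.
stratum_a = (0.5, [1, 0, 1, 1], [0.9, 0.2, 0.8, 0.7], [0.5, 0.7])
stratum_b = (0.5, [0, 0], [0.5, 0.5], [0.5, 0.5])
print(stratified_ppi_mean([stratum_a, stratum_b]))
```

Estimating the correction separately per stratum is what helps when the autorater's accuracy differs across subsets of the data, which is the case the abstract highlights.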

[LG-20] What is Dataset Distillation Learning?

Link: https://arxiv.org/abs/2406.04284
Authors: William Yang,Ye Zhu,Zhiwei Deng,Olga Russakovsky
Keywords: distilled data, strategy to overcome, overcome the hurdles, learning a compact, compact set
Categories: Machine Learning (cs.LG)
Comments: ICML 2024

Abstract:Dataset distillation has emerged as a strategy to overcome the hurdles associated with large datasets by learning a compact set of synthetic data that retains essential information from the original dataset. While distilled data can be used to train high performing models, little is understood about how the information is stored. In this study, we posit and answer three questions about the behavior, representativeness, and point-wise information content of distilled data. We reveal distilled data cannot serve as a substitute for real data during training outside the standard evaluation setting for dataset distillation. Additionally, the distillation process retains high task performance by compressing information related to the early training dynamics of real models. Finally, we provide a framework for interpreting distilled data and reveal that individual distilled data points contain meaningful semantic information. This investigation sheds light on the intricate nature of distilled data, providing a better understanding of how they can be effectively utilized.

[LG-21] xMIL: Insightful Explanations for Multiple Instance Learning in Histopathology

Link: https://arxiv.org/abs/2406.04280
Authors: Julius Hense,Mina Jamshidi Idaji,Oliver Eberle,Thomas Schnake,Jonas Dippel,Laure Ciernik,Oliver Buchstab,Andreas Mock,Frederick Klauschen,Klaus-Robert Müller
Keywords: Multiple instance learning, supervised machine learning, weakly supervised machine, machine learning, Multiple instance
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Multiple instance learning (MIL) is an effective and widely used approach for weakly supervised machine learning. In histopathology, MIL models have achieved remarkable success in tasks like tumor detection, biomarker prediction, and outcome prognostication. However, MIL explanation methods are still lagging behind, as they are limited to small bag sizes or disregard instance interactions. We revisit MIL through the lens of explainable AI (XAI) and introduce xMIL, a refined framework with more general assumptions. We demonstrate how to obtain improved MIL explanations using layer-wise relevance propagation (LRP) and conduct extensive evaluation experiments on three toy settings and four real-world histopathology datasets. Our approach consistently outperforms previous explanation attempts with particularly improved faithfulness scores on challenging biomarker prediction tasks. Finally, we showcase how xMIL explanations enable pathologists to extract insights from MIL models, representing a significant advance for knowledge discovery and model debugging in digital histopathology.

[LG-22] Generative AI-in-the-loop: Integrating LLMs and GPTs into the Next Generation Networks

Link: https://arxiv.org/abs/2406.04276
Authors: Han Zhang,Akram Bin Sediq,Ali Afana,Melike Erol-Kantarci
Keywords: created numerous opportunities, machine learning, recent years, techniques have created, created numerous
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:In recent years, machine learning (ML) techniques have created numerous opportunities for intelligent mobile networks and have accelerated the automation of network operations. However, complex network tasks may involve variables and considerations even beyond the capacity of traditional ML algorithms. On the other hand, large language models (LLMs) have recently emerged, demonstrating near-human-level performance in cognitive tasks across various fields. However, they remain prone to hallucinations and often lack common sense in basic tasks. Therefore, they are regarded as assistive tools for humans. In this work, we propose the concept of “generative AI-in-the-loop” and utilize the semantic understanding, context awareness, and reasoning abilities of LLMs to assist humans in handling complex or unforeseen situations in mobile communication networks. We believe that combining LLMs and ML models allows both to leverage their respective capabilities and achieve better results than either model alone. To support this idea, we begin by analyzing the capabilities of LLMs and compare them with traditional ML algorithms. We then explore potential LLM-based applications in line with the requirements of next-generation networks. We further examine the integration of ML and LLMs, discussing how they can be used together in mobile networks. Unlike existing studies, our research emphasizes the fusion of LLMs with traditional ML-driven next-generation networks and serves as a comprehensive refinement of existing surveys. Finally, we provide a case study to enhance ML-based network intrusion detection with synthesized data generated by LLMs. Our case study further demonstrates the advantages of our proposed idea.

[LG-23] Self-Play with Adversarial Critic: Provable and Scalable Offline Alignment for Language Models

Link: https://arxiv.org/abs/2406.04274
Authors: Xiang Ji,Sanjeev Kulkarni,Mengdi Wang,Tengyang Xie
Keywords: aligning large language, large language models, preference optimization methods, studies the challenge, challenge of aligning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:This work studies the challenge of aligning large language models (LLMs) with offline preference data. We focus on alignment by Reinforcement Learning from Human Feedback (RLHF) in particular. While popular preference optimization methods exhibit good empirical performance in practice, they are not theoretically guaranteed to converge to the optimal policy and can provably fail when the data coverage is sparse by classical offline reinforcement learning (RL) results. On the other hand, a recent line of work has focused on theoretically motivated preference optimization methods with provable guarantees, but these are not computationally efficient for large-scale applications like LLM alignment. To bridge this gap, we propose SPAC, a new offline preference optimization method with self-play, inspired by the on-average pessimism technique from the offline RL literature, to be the first provable and scalable approach to LLM alignment. We both provide theoretical analysis for its convergence under single-policy concentrability for the general function approximation setting and demonstrate its competitive empirical performance for LLM alignment on a 7B Mistral model with Open LLM Leaderboard evaluations.

[LG-24] Open-Endedness is Essential for Artificial Superhuman Intelligence

Link: https://arxiv.org/abs/2406.04268
Authors: Edward Hughes,Michael Dennis,Jack Parker-Holder,Feryal Behbahani,Aditi Mavalankar,Yuge Shi,Tom Schaul,Tim Rocktaschel
Keywords: internet-scale data, recent years, tremendous surge, general capabilities, fuelled by training
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:In recent years there has been a tremendous surge in the general capabilities of AI systems, mainly fuelled by training foundation models on internet-scale data. Nevertheless, the creation of open-ended, ever self-improving AI remains elusive. In this position paper, we argue that the ingredients are now in place to achieve open-endedness in AI systems with respect to a human observer. Furthermore, we claim that such open-endedness is an essential property of any artificial superhuman intelligence (ASI). We begin by providing a concrete formal definition of open-endedness through the lens of novelty and learnability. We then illustrate a path towards ASI via open-ended systems built on top of foundation models, capable of making novel, human-relevant discoveries. We conclude by examining the safety implications of generally-capable open-ended AI. We expect that open-ended foundation models will prove to be an increasingly fertile and safety-critical area of research in the near future.

[LG-25] Transformers need glasses! Information over-squashing in language tasks

Link: https://arxiv.org/abs/2406.04267
Authors: Federico Barbero,Andrea Banino,Steven Kapturowski,Dharshan Kumaran,João G.M. Araújo,Alex Vitvitskyi,Razvan Pascanu,Petar Veličković
Keywords: existing frontier large, frontier large language, study how information, information propagates, architectural backbone
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:We study how information propagates in decoder-only Transformers, which are the architectural backbone of most existing frontier large language models (LLMs). We rely on a theoretical signal propagation analysis – specifically, we analyse the representations of the last token in the final layer of the Transformer, as this is the representation used for next-token prediction. Our analysis reveals a representational collapse phenomenon: we prove that certain distinct sequences of inputs to the Transformer can yield arbitrarily close representations in the final token. This effect is exacerbated by the low-precision floating-point formats frequently used in modern LLMs. As a result, the model is provably unable to respond to these sequences in different ways – leading to errors in, e.g., tasks involving counting or copying. Further, we show that decoder-only Transformer language models can lose sensitivity to specific tokens in the input, which relates to the well-known phenomenon of over-squashing in graph neural networks. We provide empirical evidence supporting our claims on contemporary LLMs. Our theory also points to simple solutions towards ameliorating these issues.
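
A toy illustration of how low precision can erase the difference between two distinct sequences. This is not the paper's construction: it assumes scalar tokens and a last token that simply averages the sequence (uniform attention), and all function names are hypothetical. Half precision is emulated with the standard library's `struct` format `e`.

```python
import struct


def to_float16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def last_token_summary(n_ones: int) -> float:
    """Uniform average over a sequence of `n_ones` ones followed by one zero."""
    return n_ones / (n_ones + 1)


def collapses(n_ones: int) -> bool:
    """True if sequences with n_ones and n_ones + 1 ones look identical in float16."""
    a = to_float16(last_token_summary(n_ones))
    b = to_float16(last_token_summary(n_ones + 1))
    return a == b


# Short sequences stay distinguishable; longer ones collapse to the same value.
print(collapses(3), collapses(99))
```

Under these assumptions the two summaries differ by roughly 1/n², which drops below the float16 spacing once the sequences are long enough. That gives one intuition for why counting-style tasks are fragile in low precision.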

[LG-26] Simulating Fast and Slow: Learning Policies for Black-Box Optimization

Link: https://arxiv.org/abs/2406.04261
Authors: Fabio Valerio Massoli,Tim Bakker,Thomas Hehn,Tribhuvanesh Orekondy,Arash Behboodi
Keywords: machine learning community, learning community due, recent years, science and engineering, point of focus
Categories: Machine Learning (cs.LG)

Abstract:In recent years, solving optimization problems involving black-box simulators has become a point of focus for the machine learning community due to their ubiquity in science and engineering. The simulators describe a forward process $f_{\mathrm{sim}}: (\psi, x) \rightarrow y$ from simulation parameters $\psi$ and input data $x$ to observations $y$, and the goal of the optimization problem is to find parameters $\psi$ that minimize a desired loss function. Sophisticated optimization algorithms typically require gradient information regarding the forward process, $f_{\mathrm{sim}}$, with respect to the parameters $\psi$. However, obtaining gradients from black-box simulators can often be prohibitively expensive or, in some cases, impossible. Furthermore, in many applications, practitioners aim to solve a set of related problems. Thus, starting the optimization "ab initio", i.e. from scratch, each time might be inefficient if the forward model is expensive to evaluate. To address those challenges, this paper introduces a novel method for solving classes of similar black-box optimization problems by learning an active learning policy that guides a differentiable surrogate’s training and uses the surrogate’s gradients to optimize the simulation parameters with gradient descent. After training the policy, downstream optimization of problems involving black-box simulators requires up to $\sim$90% fewer expensive simulator calls compared to baselines such as local surrogate-based approaches, numerical optimization, and Bayesian methods.

[LG-27] Data Measurements for Decentralized Data Markets

Link: https://arxiv.org/abs/2406.04257
Authors: Charles Lu,Mohammad Mohammadi Amiri,Ramesh Raskar
Keywords: Decentralized data markets, machine learning, Decentralized data, markets can provide, provide more equitable
Categories: Machine Learning (cs.LG); Information Retrieval (cs.IR)
Comments: 20 pages, 11 figures

Abstract:Decentralized data markets can provide more equitable forms of data acquisition for machine learning. However, to realize practical marketplaces, efficient techniques for seller selection need to be developed. We propose and benchmark federated data measurements to allow a data buyer to find sellers with relevant and diverse datasets. Diversity and relevance measures enable a buyer to make relative comparisons between sellers without requiring intermediate brokers and training task-dependent models.

[LG-28] Hypernetworks for Personalizing ASR to Atypical Speech

Link: https://arxiv.org/abs/2406.04240
Authors: Max Mueller-Eberstein,Dianna Yee,Karren Yang,Gautam Varma Mantena,Colin Lea
Keywords: recently shown promise, automatic speech recognition, personalizing automatic speech, adapting general population, general population models
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Abstract:Parameter-efficient fine-tuning (PEFT) for personalizing automatic speech recognition (ASR) has recently shown promise for adapting general population models to atypical speech. However, these approaches assume a priori knowledge of the atypical speech disorder being adapted for – the diagnosis of which requires expert knowledge that is not always available. Even given this knowledge, data scarcity and high inter/intra-speaker variability further limit the effectiveness of traditional fine-tuning. To circumvent these challenges, we first identify the minimal set of model parameters required for ASR adaptation. Our analysis of each individual parameter’s effect on adaptation performance allows us to reduce Word Error Rate (WER) by half while adapting 0.03% of all weights. Alleviating the need for cohort-specific models, we next propose the novel use of a meta-learned hypernetwork to generate highly individualized, utterance-level adaptations on-the-fly for a diverse set of atypical speech characteristics. Evaluating adaptation at the global, cohort and individual-level, we show that hypernetworks generalize better to out-of-distribution speakers, while maintaining an overall relative WER reduction of 75.2% using 0.1% of the full parameter budget.

[LG-29] Solving Inverse Problems in Protein Space Using Diffusion-Based Priors

Link: https://arxiv.org/abs/2406.04239
Authors: Axel Levy,Eric R. Chan,Sara Fridovich-Keil,Frédéric Poitevin,Ellen D. Zhong,Gordon Wetzstein
Keywords: understood and controlled, protein structure determination, structure determination, inverse problems, structure
Categories: Machine Learning (cs.LG)

Abstract:The interaction of a protein with its environment can be understood and controlled via its 3D structure. Experimental methods for protein structure determination, such as X-ray crystallography or cryogenic electron microscopy, shed light on biological processes but introduce challenging inverse problems. Learning-based approaches have emerged as accurate and efficient methods to solve these inverse problems for 3D structure determination, but are specialized for a predefined type of measurement. Here, we introduce a versatile framework to turn raw biophysical measurements of varying types into 3D atomic models. Our method combines a physics-based forward model of the measurement process with a pretrained generative model providing a task-agnostic, data-driven prior. Our method outperforms posterior sampling baselines on both linear and non-linear inverse problems. In particular, it is the first diffusion-based method for refining atomic models from cryo-EM density maps.

[LG-30] The CLRS-Text Algorithmic Reasoning Language Benchmark

Link: https://arxiv.org/abs/2406.04229
Authors: Larisa Markeeva,Sean McLeish,Borja Ibarz,Wilfried Bounsi,Olga Kozlova,Alex Vitvitskyi,Charles Blundell,Tom Goldstein,Avi Schwarzschild,Petar Veličković
Keywords: Eliciting reasoning capabilities, building intelligent systems, Eliciting reasoning, language models, intelligent systems
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)
Comments: Preprint, under review. Comments welcome

Abstract:Eliciting reasoning capabilities from language models (LMs) is a critical direction on the path towards building intelligent systems. Most recent studies dedicated to reasoning focus on out-of-distribution performance on procedurally-generated synthetic benchmarks, bespoke-built to evaluate specific skills only. This trend makes results hard to transfer across publications, slowing down progress. Three years ago, a similar issue was identified and rectified in the field of neural algorithmic reasoning, with the advent of the CLRS benchmark. CLRS is a dataset generator comprising graph execution traces of classical algorithms from the Introduction to Algorithms textbook. Inspired by this, we propose CLRS-Text – a textual version of these algorithmic traces. Out of the box, CLRS-Text is capable of procedurally generating trace data for thirty diverse, challenging algorithmic tasks across any desirable input distribution, while offering a standard pipeline in which any additional algorithmic tasks may be created in the benchmark. We fine-tune and evaluate various LMs as generalist executors on this benchmark, validating prior work and revealing a novel, interesting challenge for the LM reasoning community. Our code is available at this https URL.

[LG-31] R-CONV: An Analytical Approach for Efficient Data Reconstruction via Convolutional Gradients

Link: https://arxiv.org/abs/2406.04227
Authors: Tamer Ahmed Eltaras,Qutaibah Malluhi,Alessandro Savino,Stefano Di Carlo,Adnan Qayyum,Junaid Qadir
Keywords: exchanging raw data, federated learning, effort to learn, learn from extensive, extensive collections
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)

Abstract:In the effort to learn from extensive collections of distributed data, federated learning has emerged as a promising approach for preserving privacy by using a gradient-sharing mechanism instead of exchanging raw data. However, recent studies show that private training data can be leaked through many gradient attacks. While previous analytical-based attacks have successfully reconstructed input data from fully connected layers, their effectiveness diminishes when applied to convolutional layers. This paper introduces an advanced data leakage method to efficiently exploit convolutional layers’ gradients. We present a surprising finding: even with non-fully invertible activation functions, such as ReLU, we can analytically reconstruct training samples from the gradients. To the best of our knowledge, this is the first analytical approach that successfully reconstructs convolutional layer inputs directly from the gradients, bypassing the need to reconstruct layers’ outputs. Prior research has mainly concentrated on the weight constraints of convolution layers, overlooking the significance of gradient constraints. Our findings demonstrate that existing analytical methods used to estimate the risk of gradient attacks lack accuracy. In some layers, attacks can be launched with less than 5% of the reported constraints.

[LG-32] Multi-Agent Imitation Learning: Value is Easy Regret is Hard

Link: https://arxiv.org/abs/2406.04219
Authors: Jingwu Tang,Gokul Swamy,Fei Fang,Zhiwei Steven Wu
Keywords: multi-agent imitation learning, imitation learning, multi-agent imitation, attempting to coordinate, regret gap
Categories: Machine Learning (cs.LG)

Abstract:We study a multi-agent imitation learning (MAIL) problem where we take the perspective of a learner attempting to coordinate a group of agents based on demonstrations of an expert doing so. Most prior work in MAIL essentially reduces the problem to matching the behavior of the expert within the support of the demonstrations. While doing so is sufficient to drive the value gap between the learner and the expert to zero under the assumption that agents are non-strategic, it does not guarantee robustness to deviations by strategic agents. Intuitively, this is because strategic deviations can depend on a counterfactual quantity: the coordinator’s recommendations outside of the state distribution their recommendations induce. In response, we initiate the study of an alternative objective for MAIL in Markov Games we term the regret gap that explicitly accounts for potential deviations by agents in the group. We first perform an in-depth exploration of the relationship between the value and regret gaps. First, we show that while the value gap can be efficiently minimized via a direct extension of single-agent IL algorithms, even value equivalence can lead to an arbitrarily large regret gap. This implies that achieving regret equivalence is harder than achieving value equivalence in MAIL. We then provide a pair of efficient reductions to no-regret online convex optimization that are capable of minimizing the regret gap (a) under a coverage assumption on the expert (MALICE) or (b) with access to a queryable expert (BLADES).

[LG-33] What Do Language Models Learn in Context? The Structured Task Hypothesis

Link: https://arxiv.org/abs/2406.04216
Authors: Jiaoda Li, Yifan Hou, Mrinmaya Sachan, Ryan Cotterell
Keywords: Large language models, Large language, termed in-context learning, termed in-context, exhibit an intriguing
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: This work is published in ACL 2024

Abstract:Large language models (LLMs) exhibit an intriguing ability to learn a novel task from in-context examples presented in a demonstration, termed in-context learning (ICL). Understandably, a swath of research has been dedicated to uncovering the theories underpinning ICL. One popular hypothesis explains ICL by task selection. LLMs identify the task based on the demonstration and generalize it to the prompt. Another popular hypothesis is that ICL is a form of meta-learning, i.e., the models learn a learning algorithm at pre-training time and apply it to the demonstration. Finally, a third hypothesis argues that LLMs use the demonstration to select a composition of tasks learned during pre-training to perform ICL. In this paper, we empirically explore these three hypotheses that explain LLMs’ ability to learn in context with a suite of experiments derived from common text classification tasks. We invalidate the first two hypotheses with counterexamples and provide evidence in support of the last hypothesis. Our results suggest an LLM could learn a novel task in context via composing tasks learned during pre-training.

[LG-34] mCSQA: Multilingual Commonsense Reasoning Dataset with Unified Creation Strategy by Language Models and Humans

Link: https://arxiv.org/abs/2406.04215
Authors: Yusuke Sakai, Hidetaka Kamigaito, Taro Watanabe
Keywords: natural language understanding, evaluate natural language, language understanding capabilities, challenging to curate, common sense
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at Findings of ACL 2024

Abstract:It is very challenging to curate a dataset for language-specific knowledge and common sense in order to evaluate natural language understanding capabilities of language models. Due to the limitation in the availability of annotators, most current multilingual datasets are created through translation, which cannot evaluate such language-specific aspects. Therefore, we propose Multilingual CommonsenseQA (mCSQA) based on the construction process of CSQA but leveraging language models for a more efficient construction, e.g., by asking LM to generate questions/answers, refine answers and verify QAs followed by reduced human efforts for verification. Constructed dataset is a benchmark for cross-lingual language-transfer capabilities of multilingual LMs, and experimental results showed high language-transfer capabilities for questions that LMs could easily solve, but lower transfer capabilities for questions requiring deep knowledge or commonsense. This highlights the necessity of language-specific datasets for evaluation and training. Finally, our method demonstrated that multilingual LMs could create QA including language-specific knowledge, significantly reducing the dataset creation cost compared to manual creation. The datasets are available at this https URL.

[LG-35] Aligning Agents like Large Language Models

Link: https://arxiv.org/abs/2406.04208
Authors: Adam Jelley, Yuhan Cao, Dave Bignell, Sam Devlin, Tabish Rashid
Keywords: high-dimensional sensory information, information is challenging, high-dimensional sensory, sensory information, Training agents
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Training agents to behave as desired in complex 3D environments from high-dimensional sensory information is challenging. Imitation learning from diverse human behavior provides a scalable approach for training an agent with a sensible behavioral prior, but such an agent may not perform the specific behaviors of interest when deployed. To address this issue, we draw an analogy between the undesirable behaviors of imitation learning agents and the unhelpful responses of unaligned large language models (LLMs). We then investigate how the procedure for aligning LLMs can be applied to aligning agents in a 3D environment from pixels. For our analysis, we utilize an academically illustrative part of a modern console game in which the human behavior distribution is multi-modal, but we want our agent to imitate a single mode of this behavior. We demonstrate that we can align our agent to consistently perform the desired mode, while providing insights and advice for successfully applying this approach to training agents. Project webpage at this https URL .

[LG-36] Towards Principled Superhuman AI for Multiplayer Symmetric Games

Link: https://arxiv.org/abs/2406.04201
Authors: Jiawei Ge, Yuanhao Wang, Wenzhe Li, Chi Jin
Keywords: extensively studied two-player, studied two-player zero-sum, two-player zero-sum games, present unique challenges, number of players
Subjects: Machine Learning (cs.LG); Multiagent Systems (cs.MA); Optimization and Control (math.OC); Machine Learning (stat.ML)

Abstract:Multiplayer games, when the number of players exceeds two, present unique challenges that fundamentally distinguish them from the extensively studied two-player zero-sum games. These challenges arise from the non-uniqueness of equilibria and the risk of agents performing highly suboptimally when adopting equilibrium strategies. While a line of recent works developed learning systems successfully achieving human-level or even superhuman performance in popular multiplayer games such as Mahjong, Poker, and Diplomacy, two critical questions remain unaddressed: (1) What is the correct solution concept that AI agents should find? and (2) What is the general algorithmic framework that provably solves all games within this class? This paper takes the first step towards solving these unique challenges of multiplayer games by provably addressing both questions in multiplayer symmetric normal-form games. We also demonstrate that many meta-algorithms developed in prior practical systems for multiplayer games can fail to achieve even the basic goal of obtaining agent’s equal share of the total reward.

[LG-37] Shield Synthesis for LTL Modulo Theories

Link: https://arxiv.org/abs/2406.04184
Authors: Andoni Rodriguez, Guy Amir, Davide Corsi, Cesar Sanchez, Guy Katz
Keywords: Machine Learning, achieved remarkable success, achieved remarkable, remarkable success, Machine
Subjects: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)

Abstract:In recent years, Machine Learning (ML) models have achieved remarkable success in various domains. However, these models also tend to demonstrate unsafe behaviors, precluding their deployment in safety-critical systems. To cope with this issue, ample research focuses on developing methods that guarantee the safe behaviour of a given ML model. A prominent example is shielding which incorporates an external component (a “shield”) that blocks unwanted behavior. Despite significant progress, shielding suffers from a main setback: it is currently geared towards properties encoded solely in propositional logics (e.g., LTL) and is unsuitable for richer logics. This, in turn, limits the widespread applicability of shielding in many real-world systems. In this work, we address this gap, and extend shielding to LTL modulo theories, by building upon recent advances in reactive synthesis modulo theories. This allowed us to develop a novel approach for generating shields conforming to complex safety specifications in these more expressive logics. We evaluated our shields and demonstrate their ability to handle rich data with temporal dynamics. To the best of our knowledge, this is the first approach for synthesizing shields for such expressivity.

[LG-38] Element-wise Multiplication Based Physics-informed Neural Networks

Link: https://arxiv.org/abs/2406.04170
Authors: Feilong Jiang, Xiaonan Hou, Min Xia
Keywords: physics-informed neural networks, partial differential equations, resolving partial differential, received widespread attention, Based Physics-informed Neural
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)

Abstract:As a promising framework for resolving partial differential equations (PDEs), physics-informed neural networks (PINNs) have received widespread attention from industrial and scientific fields. However, lack of expressive ability and initialization pathology issues are found to prevent the application of PINNs in complex PDEs. In this work, we propose Element-wise Multiplication Based Physics-informed Neural Networks (EM-PINNs) to resolve these issues. The element-wise multiplication operation is adopted to transform features into high-dimensional, non-linear spaces, which effectively enhance the expressive capability of PINNs. Benefiting from element-wise multiplication operation, EM-PINNs can eliminate the initialization pathologies of PINNs. The proposed structure is verified on various benchmarks. The results show that EM-PINNs have strong expressive ability.

[LG-39] Repurposing Language Models into Embedding Models: Finding the Compute-Optimal Recipe

Link: https://arxiv.org/abs/2406.04165
Authors: Alicja Ziarko, Albert Q. Jiang, Bartosz Piotrowski, Wenda Li, Mateja Jamnik, Piotr Miłoś
Keywords: semantic similarity assessment, document retrieval, similarity assessment, train text embedding, text embedding models
Subjects: Machine Learning (cs.LG)

Abstract:Text embeddings are essential for many tasks, such as document retrieval, clustering, and semantic similarity assessment. In this paper, we study how to contrastively train text embedding models in a compute-optimal fashion, given a suite of pre-trained decoder-only language models. Our innovation is an algorithm that produces optimal configurations of model sizes, data quantities, and fine-tuning methods for text-embedding models at different computational budget levels. The resulting recipe, which we obtain through extensive experiments, can be used by practitioners to make informed design choices for their embedding models. Specifically, our findings suggest that full fine-tuning and low-rank adaptation fine-tuning produce optimal models at lower and higher computational budgets respectively.

[LG-40] Pointer-Guided Pre-Training: Infusing Large Language Models with Paragraph-Level Contextual Awareness

Link: https://arxiv.org/abs/2406.04156
Authors: Lars Hillebrand, Prabhupad Pradhan, Christian Bauckhage, Rafet Sifa
Keywords: paragraph-level text representations, pre-training technique aimed, large language models, pointer-guided segment ordering, technique aimed
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 17 pages, 3 figures, 5 tables, accepted at ECML-PKDD 2024

Abstract:We introduce “pointer-guided segment ordering” (SO), a novel pre-training technique aimed at enhancing the contextual understanding of paragraph-level text representations in large language models. Our methodology leverages a self-attention-driven pointer network to restore the original sequence of shuffled text segments, addressing the challenge of capturing the structural coherence and contextual dependencies within documents. This pre-training approach is complemented by a fine-tuning methodology that incorporates dynamic sampling, augmenting the diversity of training instances and improving sample efficiency for various downstream applications. We evaluate our method on a diverse set of datasets, demonstrating its efficacy in tasks requiring sequential text classification across scientific literature and financial reporting domains. Our experiments show that pointer-guided pre-training significantly enhances the model’s ability to understand complex document structures, leading to state-of-the-art performance in downstream classification tasks.
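
The segment-ordering task described above can be illustrated by how a training example might be built: shuffle a paragraph's segments and record, for each original position, where that segment ended up. This is only a sketch under assumptions. The function names and the target encoding are hypothetical and are not taken from the paper, whose pointer network and loss are not shown here.

```python
import random


def make_segment_ordering_example(segments, seed=None):
    """Shuffle segments and return (shuffled, targets).

    targets[t] is the position in `shuffled` that holds the t-th segment
    of the original order (a hypothetical pointer-target encoding).
    """
    rng = random.Random(seed)
    order = list(range(len(segments)))
    rng.shuffle(order)  # order[p] = original index of the segment placed at p
    shuffled = [segments[i] for i in order]
    targets = [order.index(t) for t in range(len(segments))]
    return shuffled, targets


def restore(shuffled, targets):
    """Follow the pointers to rebuild the original sequence."""
    return [shuffled[p] for p in targets]


paragraphs = ["Intro.", "Method.", "Results.", "Conclusion."]
shuffled, targets = make_segment_ordering_example(paragraphs, seed=0)
assert restore(shuffled, targets) == paragraphs
```

A model trained on such pairs would be asked to predict `targets` from `shuffled`; following the predicted pointers then restores the original order.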

[LG-41] Improving Physics-Augmented Continuum Neural Radiance Field-Based Geometry-Agnostic System Identification with Lagrangian Particle Optimization

Link: https://arxiv.org/abs/2406.04155
Authors: Takuhiro Kaneko
Keywords: Geometry-agnostic system identification, Geometry-agnostic system, Eulerian grid representations, video sequences, grid representations
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: Accepted to CVPR 2024. Project page: this https URL

Abstract:Geometry-agnostic system identification is a technique for identifying the geometry and physical properties of an object from video sequences without any geometric assumptions. Recently, physics-augmented continuum neural radiance fields (PAC-NeRF) has demonstrated promising results for this technique by utilizing a hybrid Eulerian-Lagrangian representation, in which the geometry is represented by the Eulerian grid representations of NeRF, the physics is described by a material point method (MPM), and they are connected via Lagrangian particles. However, a notable limitation of PAC-NeRF is that its performance is sensitive to the learning of the geometry from the first frames owing to its two-step optimization. First, the grid representations are optimized with the first frames of video sequences, and then the physical properties are optimized through video sequences utilizing the fixed first-frame grid representations. This limitation can be critical when learning of the geometric structure is difficult, for example, in a few-shot (sparse view) setting. To overcome this limitation, we propose Lagrangian particle optimization (LPO), in which the positions and features of particles are optimized through video sequences in Lagrangian space. This method allows for the optimization of the geometric structure across the entire video sequence within the physical constraints imposed by the MPM. The experimental results demonstrate that the LPO is useful for geometric correction and physical identification in sparse-view settings.

[LG-42] Learned Feature Importance Scores for Automated Feature Engineering

Link: https://arxiv.org/abs/2406.04153
Authors: Yihe Dong, Sercan Arik, Nathanael Yoder, Tomas Pfister
Keywords: demonstrated substantial utility, small data regime, machine learning workflows, shifts are severe, Feature engineering
Subjects: Machine Learning (cs.LG)

Abstract:Feature engineering has demonstrated substantial utility for many machine learning workflows, such as in the small data regime or when distribution shifts are severe. Thus automating this capability can relieve much manual effort and improve model performance. Towards this, we propose AutoMAN, or Automated Mask-based Feature Engineering, an automated feature engineering framework that achieves high accuracy, low latency, and can be extended to heterogeneous and time-varying data. AutoMAN is based on effectively exploring the candidate transforms space, without explicitly manifesting transformed features. This is achieved by learning feature importance masks, which can be extended to support other modalities such as time series. AutoMAN learns feature transform importance end-to-end, incorporating a dataset’s task target directly into feature engineering, resulting in state-of-the-art performance with significantly lower latency compared to alternatives.

[LG-43] Fast Redescription Mining Using Locality-Sensitive Hashing

Link: https://arxiv.org/abs/2406.04148
Authors: Maiju Karjalainen, Esther Galbrun, Pauli Miettinen
Keywords: data analysis technique, Redescription mining, diverse fields, redescription mining approaches, analysis technique
Subjects: Machine Learning (cs.LG); Databases (cs.DB)
Comments: 20 pages, 4 figures, to appear at ECML-PKDD 2024

Abstract:Redescription mining is a data analysis technique that has found applications in diverse fields. The most used redescription mining approaches involve two phases: finding matching pairs among data attributes and extending the pairs. This process is relatively efficient when the number of attributes remains limited and when the attributes are Boolean, but becomes almost intractable when the data consist of many numerical attributes. In this paper, we present new algorithms that perform the matching and extension orders of magnitude faster than the existing approaches. Our algorithms are based on locality-sensitive hashing with a tailored approach to handle the discretisation of numerical attributes as used in redescription mining.
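
To make the matching phase concrete, here is a generic MinHash-based locality-sensitive hashing sketch that proposes pairs of Boolean attributes whose supports (the sets of rows where the attribute holds) are likely to have high Jaccard similarity. It is a textbook technique shown for illustration, not the paper's algorithm: the function names, the banding parameters, and the assumption that supports are non-empty sets of integer row ids are all mine, and the paper's tailored handling of numerical attributes is not covered.

```python
import random
from collections import defaultdict
from itertools import combinations

PRIME = 2_147_483_647  # 2**31 - 1, used for the hash family (a*x + b) mod p


def jaccard(a, b):
    """Exact Jaccard similarity between two sets."""
    return len(a & b) / len(a | b)


def minhash_signature(support, hash_params):
    """One minimum per hash function; support must be a non-empty set of ints."""
    return tuple(min((a * row + b) % PRIME for row in support)
                 for a, b in hash_params)


def candidate_pairs(supports, num_hashes=20, bands=10, seed=0):
    """Return attribute pairs that share at least one LSH bucket."""
    rng = random.Random(seed)
    hash_params = [(rng.randrange(1, PRIME), rng.randrange(PRIME))
                   for _ in range(num_hashes)]
    rows_per_band = num_hashes // bands
    buckets = defaultdict(set)
    for name, support in supports.items():
        sig = minhash_signature(support, hash_params)
        for band in range(bands):
            chunk = sig[band * rows_per_band:(band + 1) * rows_per_band]
            buckets[(band, chunk)].add(name)
    pairs = set()
    for names in buckets.values():
        pairs.update(combinations(sorted(names), 2))
    return pairs


supports = {"a": {1, 2, 3}, "b": {1, 2, 3}, "c": {100, 200}}
print(candidate_pairs(supports))
```

Only the candidate pairs need an exact `jaccard` check afterwards, which is what avoids comparing every attribute against every other one.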

[LG-44] Redundancy-aware Action Spaces for Robot Learning

Link: https://arxiv.org/abs/2406.04144
Authors: Pietro Mazzaglia, Nicholas Backshall, Xiao Ma, Stephen James
Keywords: dominant action modes, robot learning literature, controlling robot arms, task space control, modes for controlling
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Published in the RA-L journal

Abstract:Joint space and task space control are the two dominant action modes for controlling robot arms within the robot learning literature. Actions in joint space provide precise control over the robot’s pose, but tend to suffer from inefficient training; actions in task space boast data-efficient training but sacrifice the ability to perform tasks in confined spaces due to limited control over the full joint configuration. This work analyses the criteria for designing action spaces for robot manipulation and introduces ER (End-effector Redundancy), a novel action space formulation that, by addressing the redundancies present in the manipulator, aims to combine the advantages of both joint and task spaces, offering fine-grained comprehensive control with overactuated robot arms whilst achieving highly efficient robot learning. We present two implementations of ER, ERAngle (ERA) and ERJoint (ERJ), and we show that ERJ in particular demonstrates superior performance across multiple settings, especially when precise control over the robot configuration is required. We validate our results both in simulated and real robotic environments.

[LG-45] Do Language Models Understand Morality? Towards a Robust Detection of Moral Content

Link: https://arxiv.org/abs/2406.04143
Authors: Luana Bulla, Aldo Gangemi, Misael Mongiovì
Keywords: natural language processing, including natural language, Natural Language Inference, natural language, Large Language Models
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:The task of detecting moral values in text has significant implications in various fields, including natural language processing, social sciences, and ethical decision-making. Previously proposed supervised models often suffer from overfitting, leading to hyper-specialized moral classifiers that struggle to perform well on data from different domains. To address this issue, we introduce novel systems that leverage abstract concepts and common-sense knowledge acquired from Large Language Models and Natural Language Inference models during previous stages of training on multiple data sources. By doing so, we aim to develop versatile and robust methods for detecting moral values in real-world scenarios. Our approach uses the GPT 3.5 model as a zero-shot ready-made unsupervised multi-label classifier for moral values detection, eliminating the need for explicit training on labeled data. We compare it with a smaller NLI-based zero-shot model. The results show that the NLI approach achieves competitive results compared to the Davinci model. Furthermore, we conduct an in-depth investigation of the performance of supervised systems in the context of cross-domain multi-label moral value detection. This involves training supervised models on different domains to explore their effectiveness in handling data from different sources and comparing their performance with the unsupervised methods. Our contributions encompass a thorough analysis of both supervised and unsupervised methodologies for cross-domain value detection. We introduce the Davinci model as a state-of-the-art zero-shot unsupervised moral values classifier, pushing the boundaries of moral value detection without the need for explicit training on labeled data. Additionally, we perform a comparative evaluation of our approach with the supervised models, shedding light on their respective strengths and weaknesses.

[LG-46] Optimal Batched Linear Bandits

Link: https://arxiv.org/abs/2406.04137
Authors: Xuanfei Ren, Tianyuan Jin, Pan Xu
Keywords: linear bandit problem, batched linear bandit, asymptotic optimality, regret, asymptotically optimal regret
Subjects: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
Comments: 26 pages, 6 figures, 4 tables. To appear in the proceedings of the 41st International Conference on Machine Learning (ICML 2024)

Abstract:We introduce the $E^4$ algorithm for the batched linear bandit problem, incorporating an Explore-Estimate-Eliminate-Exploit framework. With a proper choice of exploration rate, we prove $E^4$ achieves the finite-time minimax optimal regret with only $O(\log\log T)$ batches, and the asymptotically optimal regret with only 3 batches as $T\rightarrow\infty$, where $T$ is the time horizon. We further prove a lower bound on the batch complexity of linear contextual bandits showing that any asymptotically optimal algorithm must require at least 3 batches in expectation as $T\rightarrow\infty$, which indicates $E^4$ achieves the asymptotic optimality in regret and batch complexity simultaneously. To the best of our knowledge, $E^4$ is the first algorithm for linear bandits that simultaneously achieves the minimax and asymptotic optimality in regret with the corresponding optimal batch complexities. In addition, we show that with another choice of exploration rate $E^4$ achieves an instance-dependent regret bound requiring at most $O(\log T)$ batches, and maintains the minimax optimality and asymptotic optimality. We conduct thorough experiments to evaluate our algorithm on randomly generated instances and the challenging End of Optimism instances (Lattimore and Szepesvári, 2017) which were shown to be hard to learn for optimism based algorithms. Empirical results show that $E^4$ consistently outperforms baseline algorithms with respect to regret minimization, batch complexity, and computational efficiency.

[LG-47] Legal Judgment Reimagined: PredEx and the Rise of Intelligent AI Interpretation in Indian Courts

Link: https://arxiv.org/abs/2406.04136
Authors: Shubham Kumar Nigam, Anurag Sharma, Danush Khanna, Noel Shallum, Kripabandhu Ghosh, Arnab Bhattacharya
Keywords: Large Language Models, Large Language, predicting judicial outcomes, judicial outcomes poses, outcomes poses significant
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:In the era of Large Language Models (LLMs), predicting judicial outcomes poses significant challenges due to the complexity of legal proceedings and the scarcity of expert-annotated datasets. Addressing this, we introduce Prediction with Explanation (PredEx), the largest expert-annotated dataset for legal judgment prediction and explanation in the Indian context, featuring over 15,000 annotations. This groundbreaking corpus significantly enhances the training and evaluation of AI models in legal analysis, with innovations including the application of instruction tuning to LLMs. This method has markedly improved the predictive accuracy and explanatory depth of these models for legal judgments. We employed various transformer-based models, tailored for both general and Indian legal contexts. Through rigorous lexical, semantic, and expert assessments, our models effectively leverage PredEx to provide precise predictions and meaningful explanations, establishing it as a valuable benchmark for both the legal profession and the NLP community.

[LG-48] Compressible Dynamics in Deep Overparameterized Low-Rank Learning Adaptation

Link: https://arxiv.org/abs/2406.04112
Authors: Can Yaras, Peng Wang, Laura Balzano, Qing Qu
Keywords: increased computational requirements, model sizes grow, offers great benefits, models offers great, optimization and generalization
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP); Machine Learning (stat.ML)
Comments: Accepted at ICML’24 (Oral)

Abstract:While overparameterization in machine learning models offers great benefits in terms of optimization and generalization, it also leads to increased computational requirements as model sizes grow. In this work, we show that by leveraging the inherent low-dimensional structures of data and compressible dynamics within the model parameters, we can reap the benefits of overparameterization without the computational burdens. In practice, we demonstrate the effectiveness of this approach for deep low-rank matrix completion as well as fine-tuning language models. Our approach is grounded in theoretical findings for deep overparameterized low-rank matrix recovery, where we show that the learning dynamics of each weight matrix are confined to an invariant low-dimensional subspace. Consequently, we can construct and train compact, highly compressed factorizations possessing the same benefits as their overparameterized counterparts. In the context of deep matrix completion, our technique substantially improves training efficiency while retaining the advantages of overparameterization. For language model fine-tuning, we propose a method called “Deep LoRA”, which improves the existing low-rank adaptation (LoRA) technique, leading to reduced overfitting and a simplified hyperparameter setup, while maintaining comparable efficiency. We validate the effectiveness of Deep LoRA on natural language tasks, particularly when fine-tuning with limited data.

[LG-49] From Tissue Plane to Organ World: A Benchmark Dataset for Multimodal Biomedical Image Registration using Deep Co-Attention Networks

Link: https://arxiv.org/abs/2406.04105
Authors: Yifeng Wang, Weipeng Li, Thomas Pearce, Haohan Wang
Keywords: numerous disease states, emerging methodology expected, Correlating neuropathology, human organ spanning, spanning the meso
Subjects: Machine Learning (cs.LG); Image and Video Processing (eess.IV)

Abstract:Correlating neuropathology with neuroimaging findings provides a multiscale view of pathologic changes in the human organ spanning the meso- to micro-scales, and is an emerging methodology expected to shed light on numerous disease states. To gain the most information from this multimodal, multiscale approach, it is desirable to identify precisely where a histologic tissue section was taken from within the organ in order to correlate with the tissue features in exactly the same organ region. Histology-to-organ registration poses an extra challenge, as any given histologic section can capture only a small portion of a human organ. Making use of the capabilities of state-of-the-art deep learning models, we unlock the potential to address and solve such intricate challenges. Therefore, we create the ATOM benchmark dataset, sourced from diverse institutions, with the primary objective of transforming this challenge into a machine learning problem and delivering outstanding outcomes that enlighten the biomedical community. The performance of our RegisMCAN model demonstrates the potential of deep learning to accurately predict where a subregion extracted from an organ image was obtained from within the overall 3D volume. The code and dataset can be found at: this https URL

[LG-50] Multistep Distillation of Diffusion Models via Moment Matching

Link: https://arxiv.org/abs/2406.04103
Authors: Tim Salimans, Thomas Mensink, Jonathan Heek, Emiel Hoogeboom
Keywords: faster to sample, making diffusion models, diffusion models faster, making diffusion, diffusion models
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)

Abstract:We present a new method for making diffusion models faster to sample. The method distills many-step diffusion models into few-step models by matching conditional expectations of the clean data given noisy data along the sampling trajectory. Our approach extends recently proposed one-step methods to the multi-step case, and provides a new perspective by interpreting these approaches in terms of moment matching. By using up to 8 sampling steps, we obtain distilled models that outperform not only their one-step versions but also their original many-step teacher models, obtaining new state-of-the-art results on the Imagenet dataset. We also show promising results on a large text-to-image model where we achieve fast generation of high resolution images directly in image space, without needing autoencoders or upsamplers.

[LG-51] Enhancing Weather Predictions: Super-Resolution via Deep Diffusion Models

Link: https://arxiv.org/abs/2406.04099
Authors: Jan Martinů, Petr Šimánek
Keywords: deep-learning diffusion models, diffusion models, low-resolution weather data, study investigates, approach aimed
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:This study investigates the application of deep-learning diffusion models for the super-resolution of weather data, a novel approach aimed at enhancing the spatial resolution and detail of meteorological variables. Leveraging the capabilities of diffusion models, specifically the SR3 and ResDiff architectures, we present a methodology for transforming low-resolution weather data into high-resolution outputs. Our experiments, conducted using the WeatherBench dataset, focus on the super-resolution of the two-meter temperature variable, demonstrating the models’ ability to generate detailed and accurate weather maps. The results indicate that the ResDiff model, further improved by incorporating physics-based modifications, significantly outperforms traditional SR3 methods in terms of Mean Squared Error (MSE), Structural Similarity Index (SSIM), and Peak Signal-to-Noise Ratio (PSNR). This research highlights the potential of diffusion models in meteorological applications, offering insights into their effectiveness, challenges, and prospects for future advancements in weather prediction and climate analysis.
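
Two of the metrics named above, MSE and PSNR, are simple enough to sketch directly. The snippet below uses their standard definitions on small 2D grids stored as nested lists. The function names and the `data_range` parameter are illustrative assumptions, and nothing here reproduces the paper's evaluation pipeline. SSIM is left out because it requires windowed statistics.

```python
import math


def mse(pred, target):
    """Mean squared error between two equally sized 2D grids."""
    diffs = [(p - t) ** 2
             for pred_row, target_row in zip(pred, target)
             for p, t in zip(pred_row, target_row)]
    return sum(diffs) / len(diffs)


def psnr(pred, target, data_range):
    """Peak signal-to-noise ratio in dB; data_range is max minus min of the field."""
    error = mse(pred, target)
    if error == 0:
        return math.inf  # identical grids
    return 10.0 * math.log10(data_range ** 2 / error)


# Toy 2x2 "temperature" grids (values are made up for illustration).
prediction = [[0.0, 0.0], [0.0, 0.0]]
reference = [[1.0, 1.0], [1.0, 1.0]]
print(mse(prediction, reference), psnr(prediction, reference, data_range=10.0))
```

Lower MSE and higher PSNR both indicate a super-resolved field closer to the reference.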

[LG-52] Scaling and evaluating sparse autoencoders

Link: https://arxiv.org/abs/2406.04093
Authors: Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, Jeffrey Wu
Keywords: sparse bottleneck layer, Sparse autoencoders provide, extracting interpretable features, sparse bottleneck, promising unsupervised approach
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Sparse autoencoders provide a promising unsupervised approach for extracting interpretable features from a language model by reconstructing activations from a sparse bottleneck layer. Since language models learn many concepts, autoencoders need to be very large to recover all relevant features. However, studying the properties of autoencoder scaling is difficult due to the need to balance reconstruction and sparsity objectives and the presence of dead latents. We propose using k-sparse autoencoders [Makhzani and Frey, 2013] to directly control sparsity, simplifying tuning and improving the reconstruction-sparsity frontier. Additionally, we find modifications that result in few dead latents, even at the largest scales we tried. Using these techniques, we find clean scaling laws with respect to autoencoder size and sparsity. We also introduce several new metrics for evaluating feature quality based on the recovery of hypothesized features, the explainability of activation patterns, and the sparsity of downstream effects. These metrics all generally improve with autoencoder size. To demonstrate the scalability of our approach, we train a 16 million latent autoencoder on GPT-4 activations for 40 billion tokens. We release training code and autoencoders for open-source models, as well as a visualizer.
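
The idea of controlling sparsity directly can be shown with a toy forward pass: encode, keep only the k largest latent activations, and decode. This is a minimal sketch assuming plain linear maps without biases. The function names and tiny weight matrices are hypothetical, and the released training code may differ in many details (normalization, bias terms, handling of dead latents).

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def topk_activation(pre_activations, k):
    """Keep the k largest entries and set every other entry to zero."""
    ranked = sorted(range(len(pre_activations)),
                    key=lambda i: pre_activations[i], reverse=True)
    keep = set(ranked[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(pre_activations)]


def ksparse_autoencode(x, w_enc, w_dec, k):
    """Return (latents, reconstruction) with exactly k active latents at most."""
    latents = topk_activation(matvec(w_enc, x), k)
    return latents, matvec(w_dec, latents)


# Toy example: 2-dimensional input, 3 latents, k = 1.
w_enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_dec = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]
latents, recon = ksparse_autoencode([1.0, 2.0], w_enc, w_dec, k=1)
print(latents, recon)
```

Because k is fixed by construction, no sparsity penalty has to be balanced against the reconstruction loss, which is the tuning simplification the abstract refers to.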

[LG-53] Interpretable Lightweight Transformer via Unrolling of Learned Graph Smoothness Priors

Link: https://arxiv.org/abs/2406.04090
Authors: Tam Thuc Do,Parham Eftekhar,Seyed Alireza Hosseini,Gene Cheung,Philip Chou
Keywords: graph Laplacian regularizer, quadratic graph Laplacian, lightweight transformer-like neural, unrolling iterative optimization, iterative optimization algorithms
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Signal Processing (eess.SP)

Abstract:We build interpretable and lightweight transformer-like neural networks by unrolling iterative optimization algorithms that minimize graph smoothness priors – the quadratic graph Laplacian regularizer (GLR) and the $\ell_1$-norm graph total variation (GTV) – subject to an interpolation constraint. The crucial insight is that a normalized signal-dependent graph learning module amounts to a variant of the basic self-attention mechanism in conventional transformers. Unlike “black-box” transformers that require learning of large key, query and value matrices to compute scaled dot products as affinities and subsequent output embeddings, resulting in huge parameter sets, our unrolled networks employ shallow CNNs to learn low-dimensional features per node to establish pairwise Mahalanobis distances and construct sparse similarity graphs. At each layer, given a learned graph, the target interpolated signal is simply a low-pass filtered output derived from the minimization of an assumed graph smoothness prior, leading to a dramatic reduction in parameter count. Experiments for two image interpolation applications verify the restoration performance, parameter efficiency and robustness to covariate shift of our graph-based unrolled networks compared to conventional transformers.

[LG-54] On Limitation of Transformer for Learning HMMs

Link: https://arxiv.org/abs/2406.04089
Authors: Jiachen Hu,Qinghua Liu,Chi Jin
Keywords: Hidden Markov Models, Hidden Markov, basic sequential models, natural language processing, Recurrent Neural Networks
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Despite the remarkable success of Transformer-based architectures in various sequential modeling tasks, such as natural language processing, computer vision, and robotics, their ability to learn basic sequential models, like Hidden Markov Models (HMMs), is still unclear. This paper investigates the performance of Transformers in learning HMMs and their variants through extensive experimentation and compares them to Recurrent Neural Networks (RNNs). We show that Transformers consistently underperform RNNs in both training speed and testing accuracy across all tested HMM models. There are even challenging HMM instances where Transformers struggle to learn, while RNNs can successfully do so. Our experiments further reveal the relation between the depth of Transformers and the longest sequence length they can effectively learn, based on the types and the complexity of HMMs. To address the limitation of transformers in modeling HMMs, we demonstrate that a variant of the Chain-of-Thought (CoT), called block CoT in the training phase, can help transformers to reduce the evaluation error and to learn longer sequences at a cost of increasing the training time. Finally, we complement our empirical findings by theoretical results proving the expressiveness of transformers in approximating HMMs with logarithmic depth.

[LG-55] Deterministic Uncertainty Propagation for Improved Model-Based Offline Reinforcement Learning

Link: https://arxiv.org/abs/2406.04088
Authors: Abdullah Akgül,Manuel Haußmann,Melih Kandemir
Keywords: offline Reinforcement Learning, Reinforcement Learning, distributional shift problem, incorporate uncertainty-based reward, uncertainty-based reward penalization
Categories: Machine Learning (cs.LG)

Abstract:Current approaches to model-based offline Reinforcement Learning (RL) often incorporate uncertainty-based reward penalization to address the distributional shift problem. While these approaches have achieved some success, we argue that this penalization introduces excessive conservatism, potentially resulting in suboptimal policies through underestimation. We identify as an important cause of over-penalization the lack of a reliable uncertainty estimator capable of propagating uncertainties in the Bellman operator. The common approach to calculating the penalty term relies on sampling-based uncertainty estimation, resulting in high variance. To address this challenge, we propose a novel method termed Moment Matching Offline Model-Based Policy Optimization (MOMBO). MOMBO learns a Q-function using moment matching, which allows us to deterministically propagate uncertainties through the Q-function. We evaluate MOMBO’s performance across various environments and demonstrate empirically that MOMBO is a more stable and sample-efficient approach.

[LG-56] Bootstrapping Expectiles in Reinforcement Learning

Link: https://arxiv.org/abs/2406.04081
Authors: Pierre Clavier,Emmanuel Rachelson,Erwan Le Pennec,Matthieu Geist
Keywords: classic Reinforcement Learning, Reinforcement Learning, Bellman operator, concept of bootstrapping, classic Reinforcement
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Many classic Reinforcement Learning (RL) algorithms rely on a Bellman operator, which involves an expectation over the next states, leading to the concept of bootstrapping. To introduce a form of pessimism, we propose to replace this expectation with an expectile. In practice, this can be very simply done by replacing the $L_2$ loss with a more general expectile loss for the critic. Introducing pessimism in RL is desirable for various reasons, such as tackling the overestimation problem (for which classic solutions are double Q-learning or the twin-critic approach of TD3) or robust RL (where transitions are adversarial). We empirically study these two cases. For the overestimation problem, we show that the proposed approach, ExpectRL, provides better results than a classic twin-critic. On robust RL benchmarks, involving changes of the environment, we show that our approach is more robust than classic RL algorithms. We also introduce a variation of ExpectRL combined with domain randomization which is competitive with state-of-the-art robust RL agents. Finally, we also extend ExpectRL with a mechanism for automatically choosing the expectile value, that is, the degree of pessimism.
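
As a rough illustration of swapping the squared loss for an expectile loss, the sketch below uses the standard asymmetric squared loss with residual u = target - prediction, weighted by tau when u > 0 and by 1 - tau otherwise. The sign convention, the helper names `expectile_loss` and `expectile`, and the sample values are assumptions made for illustration; this is not the ExpectRL implementation.

```python
def expectile_loss(prediction, target, tau):
    """Asymmetric squared loss; tau = 0.5 gives half the usual squared error."""
    u = target - prediction
    weight = tau if u > 0 else 1.0 - tau
    return weight * u * u


def expectile(samples, tau, iterations=100):
    """Minimizer of the summed expectile loss, found by reweighted averaging."""
    estimate = sum(samples) / len(samples)
    for _ in range(iterations):
        # Samples above the current estimate get weight tau, the rest 1 - tau.
        weights = [tau if x > estimate else 1.0 - tau for x in samples]
        estimate = sum(w * x for w, x in zip(weights, samples)) / sum(weights)
    return estimate


# Hypothetical next-state values, used only to show the effect of tau.
next_values = [0.0, 1.0, 2.0, 10.0]
neutral = expectile(next_values, tau=0.5)      # coincides with the mean
pessimistic = expectile(next_values, tau=0.1)  # pulled toward low outcomes
```

Under this convention, tau below 0.5 penalizes overestimates more heavily than underestimates, so the resulting bootstrap target sits below the plain expectation.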

[LG-57] Batch-in-Batch: a new adversarial training framework for initial perturbation and sample selection

Link: https://arxiv.org/abs/2406.04070
Authors: Yinting Wu(1),Pai Peng(2),Bo Cai(3),Le Li(1). ((1) School of Mathematics and Statistics, and Key Lab NAA–MOE, Central China Normal University, (2) School of Mathematics and Computer Science, Jianghan University, (3) Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, and School of Cyber Science and Engineering, Wuhan University)
Keywords: simple uniform distribution, commonly generate independent, training methods commonly, methods commonly generate, uniform distribution
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 29 pages, 11 figures

Abstract:Adversarial training methods commonly generate independent initial perturbations for adversarial samples from a simple uniform distribution, and obtain the training batch for the classifier without selection. In this work, we propose a simple yet effective training framework called Batch-in-Batch (BB) to enhance model robustness. Specifically, it involves a joint construction of initial values that simultaneously generates $m$ sets of perturbations from the original batch set to provide more diversity for adversarial samples; it also includes various sample selection strategies that enable the trained models to have smoother losses and avoid overconfident outputs. Through extensive experiments on three benchmark datasets (CIFAR-10, SVHN, CIFAR-100) with two networks (PreActResNet18 and WideResNet28-10) that are used in both the single-step (Noise-Fast Gradient Sign Method, N-FGSM) and multi-step (Projected Gradient Descent, PGD-10) adversarial training, we show that models trained within the BB framework consistently have higher adversarial accuracy across various adversarial settings, notably achieving over a 13% improvement on the SVHN dataset with an attack radius of 8/255 compared to the N-FGSM baseline model. Furthermore, experimental analysis of the efficiency of both the proposed initial perturbation method and sample selection strategies validates our insights. Finally, we show that our framework is cost-effective in terms of computational resources, even with a relatively large value of $m$.

[LG-58] Reassessing How to Compare and Improve the Calibration of Machine Learning Models

Link: https://arxiv.org/abs/2406.04068
Authors: Muthu Chidambaram,Rong Ge
Keywords: outcome matches, outcome conditional, machine learning model, machine learning, predicted probability
Categories: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
Comments: 20 pages, 7 figures

Abstract:A machine learning model is calibrated if its predicted probability for an outcome matches the observed frequency for that outcome conditional on the model prediction. This property has become increasingly important as the impact of machine learning models has continued to spread to various domains. As a result, there are now a dizzying number of recent papers on measuring and improving the calibration of (specifically deep learning) models. In this work, we reassess the reporting of calibration metrics in the recent literature. We show that there exist trivial recalibration approaches that can appear seemingly state-of-the-art unless calibration and prediction metrics (i.e. test accuracy) are accompanied by additional generalization metrics such as negative log-likelihood. We then derive a calibration-based decomposition of Bregman divergences that can be used to both motivate a choice of calibration metric based on a generalization metric, and to detect trivial calibration. Finally, we apply these ideas to develop a new extension to reliability diagrams that can be used to jointly visualize calibration as well as the estimated generalization error of a model.
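
The definition of calibration in the first sentence can be checked numerically. The sketch below computes a binned expected calibration error (ECE), a widely used calibration metric, in plain Python. The equal-width binning, the function name `expected_calibration_error`, and the toy inputs are assumptions for illustration; this is not the decomposition or the diagram extension proposed in the paper.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin."""
    total = len(confidences)
    bins = [[] for _ in range(num_bins)]
    for conf, is_correct in zip(confidences, correct):
        # Equal-width bins over [0, 1]; a confidence of 1.0 goes in the last bin.
        index = min(int(conf * num_bins), num_bins - 1)
        bins[index].append((conf, is_correct))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(mean_conf - accuracy)
    return ece


# Toy data: 80% confidence with 4 of 5 correct is calibrated in this sense.
calibrated = expected_calibration_error([0.8] * 5, [1, 1, 1, 1, 0])
# 90% confidence with 1 of 4 correct is badly overconfident.
overconfident = expected_calibration_error([0.9] * 4, [1, 0, 0, 0])
```

A low ECE on its own says nothing about how informative the predictions are, which is consistent with the abstract's call to report a generalization metric such as negative log-likelihood alongside it.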

[LG-59] Bisimulation Metrics are Optimal Transport Distances and Can be Computed Efficiently

Link: https://arxiv.org/abs/2406.04056
Authors: Sergio Calo,Anders Jonsson,Gergely Neu,Ludovic Schwartz,Javier Segovia
Keywords: Markov chains, optimal transport distances, optimal transport, formulating optimal transport, Markov decision process
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)

Abstract:We propose a new framework for formulating optimal transport distances between Markov chains. Previously known formulations studied couplings between the entire joint distribution induced by the chains, and derived solutions via a reduction to dynamic programming (DP) in an appropriately defined Markov decision process. This formulation has, however, not led to particularly efficient algorithms so far, since computing the associated DP operators requires fully solving a static optimal transport problem, and these operators need to be applied numerous times during the overall optimization process. In this work, we develop an alternative perspective by considering couplings between a flattened version of the joint distributions that we call discounted occupancy couplings, and show that calculating optimal transport distances in the full space of joint distributions can be equivalently formulated as solving a linear program (LP) in this reduced space. This LP formulation allows us to port several algorithmic ideas from other areas of optimal transport theory. In particular, our formulation makes it possible to introduce an appropriate notion of entropy regularization into the optimization problem, which in turn enables us to directly calculate optimal transport distances via a Sinkhorn-like method we call Sinkhorn Value Iteration (SVI). We show both theoretically and empirically that this method converges quickly to an optimal coupling, essentially at the same computational cost of running vanilla Sinkhorn in each pair of states. Along the way, we point out that our optimal transport distance exactly matches the common notion of bisimulation metrics between Markov chains, and thus our results also apply to computing such metrics, and in fact our algorithm turns out to be significantly more efficient than the best known methods developed so far for this purpose.
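
For readers unfamiliar with the "vanilla Sinkhorn" baseline mentioned above, the following sketch shows the classical entropy-regularized iteration for a static optimal transport problem between two discrete distributions. It is not Sinkhorn Value Iteration: the function name `sinkhorn`, the regularization strength, the iteration count, and the toy cost matrix are all assumptions chosen for illustration.

```python
import math


def sinkhorn(cost, mu, nu, reg=0.1, iterations=500):
    """Entropy-regularized transport plan between marginals mu and nu."""
    n, m = len(mu), len(nu)
    # Gibbs kernel derived from the cost matrix.
    kernel = [[math.exp(-cost[i][j] / reg) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iterations):
        # Alternately rescale rows and columns to match the two marginals.
        u = [mu[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [nu[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Toy problem: two states per side, zero cost on the diagonal.
cost = [[0.0, 1.0], [1.0, 0.0]]
plan = sinkhorn(cost, mu=[0.5, 0.5], nu=[0.5, 0.5])
```

Each iteration only rescales rows and columns of a fixed kernel, which is why the per-iteration cost is low compared with solving the transport linear program exactly.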

[LG-60] Leveraging SPD Matrices on Riemannian Manifolds in Quantum Classical Hybrid Models for Structural Health Monitoring

Link: https://arxiv.org/abs/2406.04055
Authors: Azadeh Alavi,Sanduni Jayasinghe
Keywords: assists modern structural, modern structural health, structural health monitoring, finite element modeling, health monitoring systems
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 3 pages, 1 figure

Abstract:Realtime finite element modeling of bridges assists modern structural health monitoring systems by providing comprehensive insights into structural integrity. This capability is essential for ensuring the safe operation of bridges and preventing sudden catastrophic failures. However, FEM computational cost and the need for realtime analysis pose significant challenges. Additionally, the input data is a 7 dimensional vector, while the output is a 1017 dimensional vector, making accurate and efficient analysis particularly difficult. In this study, we propose a novel hybrid quantum classical Multilayer Perceptron pipeline leveraging Symmetric Positive Definite matrices and Riemannian manifolds for effective data representation. To maintain the integrity of the qubit structure, we utilize SPD matrices, ensuring data representation is well aligned with the quantum computational framework. Additionally, the method leverages polynomial feature expansion to capture nonlinear relationships within the data. The proposed pipeline combines classical fully connected neural network layers with quantum circuit layers to enhance model performance and efficiency. Our experiments focused on various configurations of such hybrid models to identify the optimal structure for accurate and efficient realtime analysis. The best performing model achieved a Mean Squared Error of 0.00031, significantly outperforming traditional methods.

[LG-61] Multivector Neurons: Better and Faster O(n)-Equivariant Clifford Graph Neural Networks

Link: https://arxiv.org/abs/2406.04052
Authors: Cong Liu,David Ruhe,Patrick Forré
Keywords: high computational complexity, current deep learning, computational complexity, distances and angles, deep learning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Most current deep learning models equivariant to O(n) or SO(n) either consider mostly scalar information such as distances and angles or have a very high computational complexity. In this work, we test a few novel message passing graph neural networks (GNNs) based on Clifford multivectors, structured similarly to other prevalent equivariant models in geometric deep learning. Our approach leverages efficient invariant scalar features while simultaneously performing expressive learning on multivector representations, particularly through the use of the equivariant geometric product operator. By integrating these elements, our methods outperform established efficient baseline models on an N-body simulation task and a protein denoising task while maintaining high efficiency. In particular, we push the state-of-the-art error on the N-body dataset to 0.0035 (averaged over 3 runs), an 8% improvement over recent methods. Our implementation is available on GitHub.

[LG-62] Energy-based Epistemic Uncertainty for Graph Neural Networks

Link: https://arxiv.org/abs/2406.04043
Authors: Dominik Fuchsgruber,Tom Wollschläger,Stephan Günnemann
Keywords: Graph Neural Network, Neural Network, Graph Neural, quantifying the epistemic, domains with interdependent
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:In domains with interdependent data, such as graphs, quantifying the epistemic uncertainty of a Graph Neural Network (GNN) is challenging as uncertainty can arise at different structural scales. Existing techniques neglect this issue or only distinguish between structure-aware and structure-agnostic uncertainty without combining them into a single measure. We propose GEBM, an energy-based model (EBM) that provides high-quality uncertainty estimates by aggregating energy at different structural levels that naturally arise from graph diffusion. In contrast to logit-based EBMs, we provably induce an integrable density in the data space by regularizing the energy function. We introduce an evidential interpretation of our EBM that significantly improves the predictive robustness of the GNN. Our framework is a simple and effective post hoc method applicable to any pre-trained GNN that is sensitive to various distribution shifts. It consistently achieves the best separation of in-distribution and out-of-distribution data on 6 out of 7 anomaly types while having the best average rank over shifts on all datasets.

[LG-63] Linear Opinion Pooling for Uncertainty Quantification on Graphs

Link: https://arxiv.org/abs/2406.04041
Authors: Clemens Damke,Eyke Hüllermeier
Keywords: address the problem, support uncertainty quantification, uncertainty quantification, quantify the predictive, predictive uncertainty
Categories: Machine Learning (cs.LG)
Comments: Accepted for the 40th Conference on Uncertainty in Artificial Intelligence (UAI 2024). Implementation available at this https URL

Abstract:We address the problem of uncertainty quantification for graph-structured data, or, more specifically, the problem of quantifying the predictive uncertainty in (semi-supervised) node classification. Key questions in this regard concern the distinction between two different types of uncertainty, aleatoric and epistemic, and how to support uncertainty quantification by leveraging the structural information provided by the graph topology. Challenging assumptions and postulates of state-of-the-art methods, we propose a novel approach that represents (epistemic) uncertainty in terms of mixtures of Dirichlet distributions and refers to the established principle of linear opinion pooling for propagating information between neighboring nodes in the graph. The effectiveness of this approach is demonstrated in a series of experiments on a variety of graph-structured datasets.

[LG-64] Shaping History: Advanced Machine Learning Techniques for the Analysis and Dating of Cuneiform Tablets over Three Millennia

Link: https://arxiv.org/abs/2406.04039
Authors: Danielle Kapon,Michael Fire,Shai Gordin
Keywords: fourth millennium BCE, late fourth millennium, earliest writing systems, millennium BCE, humanity earliest writing
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments: 24 pages, 18 figures

Abstract:Cuneiform tablets, emerging in ancient Mesopotamia around the late fourth millennium BCE, represent one of humanity’s earliest writing systems. Characterized by wedge-shaped marks on clay tablets, these artifacts provided insight into Mesopotamian civilization across various domains. Traditionally, the analysis and dating of these tablets rely on subjective assessment of shape and writing style, leading to uncertainties in pinpointing their exact temporal origins. Recent advances in digitization have revolutionized the study of cuneiform by enhancing accessibility and analytical capabilities. Our research uniquely focuses on the silhouette of tablets as significant indicators of their historical periods, diverging from most studies that concentrate on textual content. Utilizing an unprecedented dataset of over 94,000 images from the Cuneiform Digital Library Initiative collection, we apply deep learning methods to classify cuneiform tablets, covering over 3,000 years of history. By leveraging statistical, computational techniques, and generative modeling through Variational Auto-Encoders (VAEs), we achieve substantial advancements in the automatic classification of these ancient documents, focusing on the tablets’ silhouettes as key predictors. Our classification approach begins with a Decision Tree using height-to-width ratios and culminates with a ResNet50 model, achieving a 61% macro F1-score for tablet silhouettes. Moreover, we introduce novel VAE-powered tools to enhance explainability and enable researchers to explore changes in tablet shapes across different eras and genres. This research contributes to document analysis and diplomatics by demonstrating the value of large-scale data analysis combined with statistical methods. These insights offer valuable tools for historians and epigraphists, enriching our understanding of cuneiform tablets and the cultures that produced them.

[LG-65] Road Network Representation Learning with the Third Law of Geography

Link: https://arxiv.org/abs/2406.04038
Authors: Haicang Zhou,Weiming Huang,Yile Chen,Tiantian He,Gao Cong,Yew-Soon Ong
Keywords: effective vectorized representations, Road network representation, Law, representation learning aims, aims to learn
Categories: Machine Learning (cs.LG)

Abstract:Road network representation learning aims to learn compressed and effective vectorized representations for road segments that are applicable to numerous tasks. In this paper, we identify the limitations of existing methods, particularly their overemphasis on the distance effect as outlined in the First Law of Geography. In response, we propose to endow road network representation with the principles of the recent Third Law of Geography. To this end, we propose a novel graph contrastive learning framework that employs geographic configuration-aware graph augmentation and spectral negative sampling, ensuring that road segments with similar geographic configurations yield similar representations, and vice versa, aligning with the principles stated in the Third Law. The framework further fuses the Third Law with the First Law through a dual contrastive learning objective to effectively balance the implications of both laws. We evaluate our framework on two real-world datasets across three downstream tasks. The results show that the integration of the Third Law significantly improves the performance of road segment representations in downstream tasks.

[LG-66] Spatio-temporal Early Prediction based on Multi-objective Reinforcement Learning

Link: https://arxiv.org/abs/2406.04035
Authors: Wei Shao,Yufan Kang,Ziyan Peng,Xiao Xiao,Lei Wang,Yuhui Yang,Flora D Salim
Keywords: conflicting goals, prediction tasks, prediction, Accuracy, Accuracy and timeliness
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Conference

Abstract:Accuracy and timeliness are indeed often conflicting goals in prediction tasks. Premature predictions may yield a higher rate of false alarms, whereas delaying predictions to gather more information can render them too late to be useful. In applications such as wildfires, crimes, and traffic jams, timely predictions are vital for safeguarding human life and property. Consequently, finding a balance between accuracy and timeliness is crucial. In this paper, we propose a spatio-temporal early prediction model based on Multi-Objective reinforcement learning that can either implement an optimal policy given a preference or infer the preference based on a small number of samples. The model addresses two primary challenges: 1) enhancing the accuracy of early predictions and 2) providing the optimal policy for determining the most suitable prediction time for each area. Our method demonstrates superior performance on three large-scale real-world datasets, surpassing existing methods in early spatio-temporal prediction tasks.

[LG-67] Pre-trained Transformer Uncovers Meaningful Patterns in Human Mobility Data

Link: https://arxiv.org/abs/2406.04029
Authors: Alameen Najjar
Keywords: country-scale unlabeled human, learns embeddings capable, unlabeled human mobility, mobility data learns, data learns embeddings
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 10 pages, 12 figures, 14 tables

Abstract:We empirically demonstrate that a transformer pre-trained on country-scale unlabeled human mobility data learns embeddings capable, through fine-tuning, of developing a deep understanding of the target geography and its corresponding mobility patterns. Utilizing an adaptation framework, we evaluate the performance of our pre-trained embeddings in encapsulating a broad spectrum of concepts directly and indirectly related to human mobility. This includes basic notions, such as geographic location and distance, and extends to more complex constructs, such as administrative divisions and land cover. Our extensive empirical analysis reveals a substantial performance boost gained from pre-training, reaching up to 38% in tasks such as tree-cover regression. We attribute this result to the ability of the pre-training to uncover meaningful patterns hidden in the raw data, beneficial for modeling relevant high-level concepts. The pre-trained embeddings emerge as robust representations of regions and trajectories, potentially valuable for a wide range of downstream applications.

[LG-68] Unveiling the Dynamics of Information Interplay in Supervised Learning

Link: https://arxiv.org/abs/2406.03999
Authors: Kun Song,Zhiquan Tan,Bochao Zou,Huimin Ma,Weiran Huang
Keywords: MIR and HDR, classification head vectors, matrix information theory, Neural Collapse, classification heads
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICML 2024

Abstract:In this paper, we use matrix information theory as an analytical tool to analyze the dynamics of the information interplay between data representations and classification head vectors in the supervised learning process. Specifically, inspired by the theory of Neural Collapse, we introduce matrix mutual information ratio (MIR) and matrix entropy difference ratio (HDR) to assess the interactions of data representations and classification heads in supervised learning, and we determine the theoretical optimal values for MIR and HDR when Neural Collapse happens. Our experiments show that MIR and HDR can effectively explain many phenomena occurring in neural networks, for example, the standard supervised training dynamics, linear mode connectivity, and the performance of label smoothing and pruning. Additionally, we use MIR and HDR to gain insights into the dynamics of grokking, which is an intriguing phenomenon observed in supervised training, where the model demonstrates generalization capabilities long after it has learned to fit the training data. Furthermore, we introduce MIR and HDR as loss terms in supervised and semi-supervised learning to optimize the information interactions among samples and classification heads. The empirical results provide evidence of the method’s effectiveness, demonstrating that the utilization of MIR and HDR not only aids in comprehending the dynamics throughout the training process but can also enhance the training procedure itself.

[LG-69] HackAtari: Atari Learning Environments for Robust and Continual Reinforcement Learning

Link: https://arxiv.org/abs/2406.03997
Authors: Quentin Delfosse,Jannis Blüml,Bjarne Gregori,Kristian Kersting
Keywords: Artificial agents’ adaptability, effective deployment, Artificial agents’, alignment with intended, Atari Learning Environment
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 9 main pages, 4 pages references, 19 pages of appendix

Abstract:Artificial agents’ adaptability to novelty and alignment with intended behavior are crucial for their effective deployment. Reinforcement learning (RL) leverages novelty as a means of exploration, yet agents often struggle to handle novel situations, hindering generalization. To address these issues, we propose HackAtari, a framework introducing controlled novelty to the most common RL benchmark, the Atari Learning Environment. HackAtari allows us to create novel game scenarios (including simplification for curriculum learning), to swap the game elements’ colors, as well as to introduce different reward signals for the agent. We demonstrate that current agents trained on the original environments exhibit robustness failures, and evaluate HackAtari’s efficacy in enhancing RL agents’ robustness and aligning behavior through experiments using C51 and PPO. Overall, HackAtari can be used to improve the robustness of current and future RL algorithms, allowing Neuro-Symbolic RL, curriculum RL, causal RL, as well as LLM-driven RL. Our work underscores the significance of developing interpretable RL agents.

[LG-70] Position: Embracing Negative Results in Machine Learning

Link: https://arxiv.org/abs/2406.03980
Authors: Florian Karl,Lukas Malte Kemeter,Gabriel Dax,Paulina Sierak
Keywords: exhibited predictive performance, machine learning methods, machine learning research, predictive performance, primarily rated
Categories: Machine Learning (cs.LG)

Abstract:Publications proposing novel machine learning methods are often primarily rated by exhibited predictive performance on selected problems. In this position paper we argue that predictive performance alone is not a good indicator for the worth of a publication. Using it as such even fosters problems like inefficiencies of the machine learning research community as a whole and setting wrong incentives for researchers. We therefore put out a call for the publication of “negative” results, which can help alleviate some of these problems and improve the scientific output of the machine learning research community. To substantiate our position, we present the advantages of publishing negative results and provide concrete measures for the community to move towards a paradigm where their publication is normalized.

[LG-71] Mini Honor of Kings: A Lightweight Environment for Multi-Agent Reinforcement Learning

Link: https://arxiv.org/abs/2406.03978
Authors: Lin Liu,Jian Zhao,Cheng Hu,Zhengtao Cao,Youpeng Zhao,Zhenbin Ye,Meng Meng,Wenjun Wang,Zhaofeng He,Houqiang Li,Xia Lin,Lanxiao Huang
Keywords: high computational demands, multi-agent reinforcement learning, Honor of Kings, limited customization, reinforcement learning
Categories: Multiagent Systems (cs.MA); Machine Learning (cs.LG)

Abstract:Games are widely used as research environments for multi-agent reinforcement learning (MARL), but they pose three significant challenges: limited customization, high computational demands, and oversimplification. To address these issues, we introduce the first publicly available map editor for the popular mobile game Honor of Kings and design a lightweight environment, Mini Honor of Kings (Mini HoK), for researchers to conduct experiments. Mini HoK is highly efficient, allowing experiments to be run on personal PCs or laptops while still presenting sufficient challenges for existing MARL algorithms. We have tested our environment on common MARL algorithms and demonstrated that these algorithms have yet to find optimal solutions within this environment. This facilitates the dissemination and advancement of MARL methods within the research community. Additionally, we hope that more researchers will leverage the Honor of Kings map editor to develop innovative and scientifically valuable new maps. Our code and user manual are available at: this https URL.

[LG-72] Weight-based Decomposition: A Case for Bilinear MLPs

Link: https://arxiv.org/abs/2406.03947
Authors: Michael T. Pearce,Thomas Dooms,Alice Rigg
Keywords: Gated Linear Units, common building block, Gated Linear, Linear Units, modern foundation models
Categories: Machine Learning (cs.LG)

Abstract:Gated Linear Units (GLUs) have become a common building block in modern foundation models. Bilinear layers drop the non-linearity in the “gate” but still have comparable performance to other GLUs. An attractive quality of bilinear layers is that they can be fully expressed in terms of a third-order tensor and linear operations. Leveraging this, we develop a method to decompose the bilinear tensor into a set of sparsely interacting eigenvectors that show promising interpretability properties in preliminary experiments for shallow image classifiers (MNIST) and small language models (Tiny Stories). Since the decomposition is fully equivalent to the model’s original computations, bilinear layers may be an interpretability-friendly architecture that helps connect features to the model weights. Application of our method may not be limited to pretrained bilinear models since we find that language models such as TinyLlama-1.1B can be finetuned into bilinear variants.
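
The claim that a bilinear layer can be written as a third-order tensor can be checked numerically. The sketch below assumes the common bias-free form of a bilinear layer, out = (W x) * (V x) with an elementwise product; the abstract does not spell out the formula, so this form, the tensor B[k][i][j] = W[k][i] * V[k][j], and all function names are illustrative assumptions rather than the authors' code.

```python
def bilinear_layer(W, V, x):
    """Assumed bilinear layer: elementwise product of two linear maps, no bias."""
    left = [sum(w * xi for w, xi in zip(row, x)) for row in W]
    right = [sum(v * xi for v, xi in zip(row, x)) for row in V]
    return [l * r for l, r in zip(left, right)]


def bilinear_tensor(W, V):
    """Third-order tensor B[k][i][j] = W[k][i] * V[k][j]."""
    return [
        [[w_i * v_j for v_j in v_row] for w_i in w_row]
        for w_row, v_row in zip(W, V)
    ]


def apply_tensor(B, x):
    """Evaluate out[k] = sum over i, j of B[k][i][j] * x[i] * x[j]."""
    n = len(x)
    return [
        sum(B_k[i][j] * x[i] * x[j] for i in range(n) for j in range(n))
        for B_k in B
    ]


# Toy weights and input, chosen only for illustration.
W = [[1.0, 2.0], [0.0, -1.0]]
V = [[3.0, 0.0], [1.0, 1.0]]
x = [2.0, 1.0]

direct = bilinear_layer(W, V, x)
via_tensor = apply_tensor(bilinear_tensor(W, V), x)
print(direct, via_tensor)
```

Both routes give the same output, which is the sense in which a weight-based decomposition of the tensor is equivalent to the layer's original computation. The eigenvector decomposition itself is not shown here.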

[LG-73] A Probabilistic Approach to Learning the Degree of Equivariance in Steerable CNNs

Link: https://arxiv.org/abs/2406.03946
Authors: Lars Veefkind,Gabriele Cesa
Keywords: Steerable convolutional neural, Steerable convolutional, enhance task performance, modelling geometric symmetries, convolutional neural networks
Subjects: Machine Learning (cs.LG)
Comments: 9 pages, to be published at ICML 2024 as main conference paper

Abstract:Steerable convolutional neural networks (SCNNs) enhance task performance by modelling geometric symmetries through equivariance constraints on weights. Yet, unknown or varying symmetries can lead to overconstrained weights and decreased performance. To address this, this paper introduces a probabilistic method to learn the degree of equivariance in SCNNs. We parameterise the degree of equivariance as a likelihood distribution over the transformation group using Fourier coefficients, offering the option to model layer-wise and shared equivariance. These likelihood distributions are regularised to ensure an interpretable degree of equivariance across the network. Advantages include the applicability to many types of equivariant networks through the flexible framework of SCNNs and the ability to learn equivariance with respect to any subgroup of any compact group without requiring additional layers. Our experiments reveal competitive performance on datasets with mixed symmetries, with learnt likelihood distributions that are representative of the underlying degree of equivariance.

[LG-74] Provably Neural Active Learning Succeeds via Prioritizing Perplexing Samples

Link: https://arxiv.org/abs/2406.03944
Authors: Dake Bu,Wei Huang,Taiji Suzuki,Ji Cheng,Qingfu Zhang,Zhiqiang Xu,Hau-San Wong
Keywords: Neural Network-based active, utilizes neural networks, Network-based active learning, Neural Network-based, Network-based active
Subjects: Machine Learning (cs.LG)
Comments: Accepted by the 41st International Conference on Machine Learning (ICML 2024)

Abstract:Neural Network-based active learning (NAL) is a cost-effective data selection technique that utilizes neural networks to select and train on a small subset of samples. While existing work successfully develops various effective or theory-justified NAL algorithms, the understanding of the two commonly used query criteria of NAL, uncertainty-based and diversity-based, remains in its infancy. In this work, we try to move one step forward by offering a unified explanation for the success of NAL under both query criteria from a feature learning view. Specifically, we consider a feature-noise data model comprising easy-to-learn or hard-to-learn features disrupted by noise, and conduct analysis over 2-layer NN-based NALs in the pool-based scenario. We provably show that both uncertainty-based and diversity-based NAL are inherently amenable to one and the same principle, i.e., striving to prioritize samples that contain yet-to-be-learned features. We further prove that this shared principle is the key to their success: achieving a small test error with a small labeled set. In contrast, strategy-free passive learning exhibits a large test error due to the inadequate learning of yet-to-be-learned features, and therefore needs a significantly larger label complexity to reduce the test error sufficiently. Experimental results validate our findings.

[LG-75] Breeding Programs Optimization with Reinforcement Learning

Link: https://arxiv.org/abs/2406.03932
Authors: Omar G. Younis,Luca Corinzia,Ioannis N. Athanasiadis,Andreas Krause,Joachim M. Buhmann,Matteo Turchetta
Keywords: greenhouse gas emissions, decreasing land usage, improving agricultural productivity, potentially decreasing land, land usage
Subjects: Machine Learning (cs.LG)
Comments: NeurIPS 2023 Workshop on Tackling Climate Change with Machine Learning

Abstract:Crop breeding is crucial in improving agricultural productivity while potentially decreasing land usage, greenhouse gas emissions, and water consumption. However, breeding programs are challenging due to long turnover times, high-dimensional decision spaces, long-term objectives, and the need to adapt to rapid climate change. This paper introduces the use of Reinforcement Learning (RL) to optimize simulated crop breeding programs. RL agents are trained to make optimal crop selection and cross-breeding decisions based on genetic information. To benchmark RL-based breeding algorithms, we introduce a suite of Gym environments. The study demonstrates the superiority of RL techniques over standard practices in terms of genetic gain when simulated in silico using real-world genomic maize data.

[LG-76] Latent Neural Operator for Solving Forward and Inverse PDE Problems

Link: https://arxiv.org/abs/2406.03923
Authors: Tian Wang,Chuang Wang
Keywords: effectively solve PDE, Neural operators effectively, operators effectively solve, Latent Neural Operator, solve PDE problems
Subjects: Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments:

Abstract:Neural operators effectively solve PDE problems from data without knowing the explicit equations: they learn the map from input sequences of observed samples to the predicted values. Most existing works build the model in the original geometric space, leading to high computational costs when the number of sample points is large. We present the Latent Neural Operator (LNO), which solves PDEs in the latent space. In particular, we first propose Physics-Cross-Attention (PhCA) to transform representations from the geometric space to the latent space, then learn the operator in the latent space, and finally recover the real-world geometric space via the inverse PhCA map. Our model retains the flexibility to decode values at any position, not only at the locations defined in the training set, and can therefore naturally perform interpolation and extrapolation tasks, which are particularly useful for inverse problems. Moreover, the proposed LNO improves both prediction accuracy and computational efficiency. Experiments show that LNO reduces GPU memory by 50%, speeds up training 1.8 times, and reaches state-of-the-art accuracy on four out of six benchmarks for forward problems and on a benchmark for the inverse problem.

[LG-77] Towards Physically Consistent Deep Learning For Climate Model Parameterizations

Link: https://arxiv.org/abs/2406.03920
Authors: Birgit Kühbacher,Fernando Iglesias-Suarez,Niki Kilbertus,Veronika Eyring
Keywords: projecting climate change, play a critical, critical role, role in understanding, understanding and projecting
Subjects: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
Comments:

Abstract:Climate models play a critical role in understanding and projecting climate change. Due to their complexity, their horizontal resolution of ~40-100 km remains too coarse to resolve processes such as clouds and convection, which need to be approximated via parameterizations. These parameterizations are a major source of systematic errors and large uncertainties in climate projections. Deep learning (DL)-based parameterizations, trained on computationally expensive, short high-resolution simulations, have shown great promise for improving climate models in that regard. However, their lack of interpretability and tendency to learn spurious non-physical correlations result in reduced trust in the climate simulation. We propose an efficient supervised learning framework for DL-based parameterizations that leads to physically consistent models with improved interpretability and negligible computational overhead compared to standard supervised training. First, key features determining the target physical processes are uncovered. Subsequently, the neural network is fine-tuned using only those relevant features. We show empirically that our method robustly identifies a small subset of the inputs as actual physical drivers, thereby removing spurious non-physical relationships. The resulting neural networks are physically consistent and interpretable by design while maintaining the predictive performance of standard black-box DL-based parameterizations. Our framework represents a crucial step in addressing a major challenge in data-driven climate model parameterizations by respecting the underlying physical processes, and may also benefit physically consistent deep learning in other research fields.

[LG-78] Vectorized Conditional Neural Fields: A Framework for Solving Time-dependent Parametric Partial Differential Equations

Link: https://arxiv.org/abs/2406.03919
Authors: Jan Hagnberger,Marimuthu Kalimuthu,Daniel Musekamp,Mathias Niepert
Keywords: Partial Differential Equations, solving Partial Differential, Differential Equations, Partial Differential, solving Partial
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE); Computational Physics (physics.comp-ph)
Comments: Accepted for publication at the 41st International Conference on Machine Learning (ICML) 2024

Abstract:Transformer models are increasingly used for solving Partial Differential Equations (PDEs). Several adaptations have been proposed, all of which suffer from the typical problems of Transformers, such as quadratic memory and time complexity. Furthermore, all prevalent architectures for PDE solving lack at least one of several desirable properties of an ideal surrogate model, such as (i) generalization to PDE parameters not seen during training, (ii) spatial and temporal zero-shot super-resolution, (iii) continuous temporal extrapolation, (iv) support for 1D, 2D, and 3D PDEs, and (v) efficient inference for longer temporal rollouts. To address these limitations, we propose Vectorized Conditional Neural Fields (VCNeFs), which represent the solution of time-dependent PDEs as neural fields. Contrary to prior methods, however, VCNeFs compute, for a set of multiple spatio-temporal query points, their solutions in parallel and model their dependencies through attention mechanisms. Moreover, VCNeF can condition the neural field on both the initial conditions and the parameters of the PDEs. An extensive set of experiments demonstrates that VCNeFs are competitive with and often outperform existing ML-based surrogate models.

[LG-79] Neuro-Symbolic Temporal Point Processes

Link: https://arxiv.org/abs/2406.03914
Authors: Yang Yang,Chao Yang,Boyang Li,Yinghao Fu,Shuang Li
Keywords: explain irregular events, discover a compact, temporal logic rules, compact set, explain irregular
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Our goal is to efficiently discover a compact set of temporal logic rules to explain irregular events of interest. We introduce a neural-symbolic rule induction framework within the temporal point process model. The negative log-likelihood is the loss that guides the learning, where the explanatory logic rules and their weights are learned end-to-end in a differentiable way. Specifically, predicates and logic rules are represented as vector embeddings, where the predicate embeddings are fixed and the rule embeddings are trained via gradient descent to obtain the most appropriate compositional representations of the predicate embeddings. To make the rule learning process more efficient and flexible, we adopt a sequential covering algorithm, which progressively adds rules to the model and removes the event sequences that have been explained until all event sequences have been covered. All the found rules will be fed back to the models for a final rule embedding and weight refinement. Our approach showcases notable efficiency and accuracy across synthetic and real datasets, surpassing state-of-the-art baselines by a wide margin in terms of efficiency.

[LG-80] GenSafe: A Generalizable Safety Enhancer for Safe Reinforcement Learning Algorithms Based on Reduced Order Markov Decision Process Model

Link: https://arxiv.org/abs/2406.03912
Authors: Zhehua Zhou,Xuan Xie,Jiayang Song,Zhan Shu,Lei Ma
Keywords: Markov Decision Process, demonstrated impressive achievements, random exploration raises, deep reinforcement learning, Safe Reinforcement Learning
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO); Systems and Control (eess.SY)
Comments:

Abstract:Although deep reinforcement learning has demonstrated impressive achievements in controlling various autonomous systems, e.g., autonomous vehicles or humanoid robots, its inherent reliance on random exploration raises safety concerns in real-world applications. To improve system safety during the learning process, a variety of Safe Reinforcement Learning (SRL) algorithms have been proposed, which usually incorporate safety constraints within the Constrained Markov Decision Process (CMDP) framework. However, the efficacy of these SRL algorithms often relies on accurate function approximations, a task that is notably challenging to accomplish in the early learning stages due to data insufficiency. To address this problem, we introduce a Generalizable Safety enhancer (GenSafe) in this work. Leveraging model order reduction techniques, we first construct a Reduced Order Markov Decision Process (ROMDP) as a low-dimensional proxy for the original cost function in CMDP. Then, by solving ROMDP-based constraints that are reformulated from the original cost constraints, the proposed GenSafe refines the actions taken by the agent to enhance the possibility of constraint satisfaction. Essentially, GenSafe acts as an additional safety layer for SRL algorithms, offering broad compatibility across diverse SRL approaches. The performance of GenSafe is examined on multiple SRL benchmark problems. The results show that it not only improves safety performance, especially in the early learning phases, but also maintains task performance at a satisfactory level.

[LG-81] Transductive Off-policy Proximal Policy Optimization

Link: https://arxiv.org/abs/2406.03894
Authors: Yaozhong Gan,Renye Yan,Xiaoyang Tan,Zhe Wu,Junliang Xing
Keywords: Proximal Policy Optimization, reinforcement learning algorithm, popular model-free reinforcement, model-free reinforcement learning, Proximal Policy
Subjects: Machine Learning (cs.LG)
Comments: 18

Abstract:Proximal Policy Optimization (PPO) is a popular model-free reinforcement learning algorithm, esteemed for its simplicity and efficacy. However, due to its inherent on-policy nature, its proficiency in harnessing data from disparate policies is constrained. This paper introduces a novel off-policy extension to the original PPO method, christened Transductive Off-policy PPO (ToPPO). Herein, we provide theoretical justification for incorporating off-policy data in PPO training and prudent guidelines for its safe application. Our contribution includes a novel formulation of the policy improvement lower bound for prospective policies derived from off-policy data, accompanied by a computationally efficient mechanism to optimize this bound, underpinned by assurances of monotonic improvement. Comprehensive experimental results across six representative tasks underscore ToPPO’s promising performance.

[LG-82] Polyhedral Conic Classifier for CTR Prediction

Link: https://arxiv.org/abs/2406.03892
Authors: Beyza Turkmen,Ramazan Tarik Turksoy,Hasan Saribas,Hakan Cevikalp
Keywords: industrial recommender systems, click-through rate, recommender systems, addressing the inherent, geometric asymmetry
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:

Abstract:This paper introduces a novel approach for click-through rate (CTR) prediction within industrial recommender systems, addressing the inherent challenges of numerical imbalance and geometric asymmetry. These challenges stem from imbalanced datasets, where positive (click) instances occur less frequently than negatives (non-clicks), and geometrically asymmetric distributions, where positive samples exhibit visually coherent patterns while negatives demonstrate greater diversity. To address these challenges, we have used a deep neural network classifier that uses the polyhedral conic functions. This classifier is similar to the one-class classifiers in spirit and it returns compact polyhedral acceptance regions to separate the positive class samples from the negative samples that have diverse distributions. Extensive experiments have been conducted to test the proposed approach using state-of-the-art (SOTA) CTR prediction models on four public datasets, namely Criteo, Avazu, MovieLens and Frappe. The experimental evaluations highlight the superiority of our proposed approach over Binary Cross Entropy (BCE) Loss, which is widely used in CTR prediction tasks.

[LG-83] Exploring Pessimism and Optimism Dynamics in Deep Reinforcement Learning

Link: https://arxiv.org/abs/2406.03890
Authors: Bahareh Tasdighi,Nicklas Werge,Yi-Shan Wu,Melih Kandemir
Keywords: deep reinforcement learning, shown promise, promise in deep, deep reinforcement, reinforcement learning
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:Off-policy actor-critic algorithms have shown promise in deep reinforcement learning for continuous control tasks. Their success largely stems from leveraging pessimistic state-action value function updates, which effectively address function approximation errors and improve performance. However, such pessimism can lead to under-exploration, constraining the agent’s ability to explore/refine its policies. Conversely, optimism can counteract under-exploration, but it also carries the risk of excessive risk-taking and poor convergence if not properly balanced. Based on these insights, we introduce Utility Soft Actor-Critic (USAC), a novel framework within the actor-critic paradigm that enables independent control over the degree of pessimism/optimism for both the actor and the critic via interpretable parameters. USAC adapts its exploration strategy based on the uncertainty of critics through a utility function that allows us to balance between pessimism and optimism separately. By going beyond binary choices of optimism and pessimism, USAC represents a significant step towards achieving balance within off-policy actor-critic algorithms. Our experiments across various continuous control problems show that the degree of pessimism or optimism depends on the nature of the task. Furthermore, we demonstrate that USAC can outperform state-of-the-art algorithms for appropriately configured pessimism/optimism parameters.

[LG-84] BiomedBench: A benchmark suite of TinyML biomedical applications for low-power wearables

Link: https://arxiv.org/abs/2406.03886
Authors: Dimitrios Samakovlis,Stefano Albini,Rubén Rodríguez Álvarez,Denisa-Andreea Constantinescu,Pasquale Davide Schiavone,Miguel Peón Quirós,David Atienza
Keywords: allowed real-time monitoring, recent decades, received a lot, lot of attention, attention in recent
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: 7 pages, 5 figures. Submitted to Design Test Special Issue TinyML

Abstract:The design of low-power wearables for the biomedical domain has received a lot of attention in recent decades, as technological advances in chip manufacturing have allowed real-time monitoring of patients using low-complexity ML within the mW range. Despite advances in application and hardware design research, the domain lacks a systematic approach to hardware evaluation. In this work, we propose BiomedBench, a new benchmark suite composed of complete end-to-end TinyML biomedical applications for real-time monitoring of patients using wearable devices. Each application presents different requirements during typical signal acquisition and processing phases, including varying computational workloads and relations between active and idle times. Furthermore, our evaluation of five state-of-the-art low-power platforms in terms of energy efficiency shows that modern platforms cannot effectively target all types of biomedical applications. BiomedBench will be released as an open-source suite to enable future improvements in the entire domain of bioengineering systems and TinyML application design.

[LG-85] Memorization in deep learning: A survey

Link: https://arxiv.org/abs/2406.03880
Authors: Jiaheng Wei,Yanjun Zhang,Leo Yu Zhang,Ming Ding,Chao Chen,Kok-Leong Ong,Jun Zhang,Yang Xiang
Keywords: Deep Neural Networks, Deep Neural, Neural Networks, learning processes remains, Deep Learning
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Deep Learning (DL) powered by Deep Neural Networks (DNNs) has revolutionized various domains, yet understanding the intricacies of DNN decision-making and learning processes remains a significant challenge. Recent investigations have uncovered an interesting memorization phenomenon in which DNNs tend to memorize specific details from examples rather than learning general patterns, affecting model generalization, security, and privacy. This raises critical questions about the nature of generalization in DNNs and their susceptibility to security breaches. In this survey, we present a systematic framework to organize memorization definitions based on the generalization and security/privacy domains and summarize memorization evaluation methods at both the example and model levels. Through a comprehensive literature review, we explore DNN memorization behaviors and their impacts on security and privacy. We also introduce privacy vulnerabilities caused by memorization and the phenomenon of forgetting and explore its connection with memorization. Furthermore, we spotlight various applications leveraging memorization and forgetting mechanisms, including noisy label learning, privacy preservation, and model enhancement. This survey offers the first-in-kind understanding of memorization in DNNs, providing insights into its challenges and opportunities for enhancing AI development while addressing critical ethical concerns.

[LG-86] Decay Pruning Method: Smooth Pruning With a Self-Rectifying Procedure

Link: https://arxiv.org/abs/2406.03879
Authors: Minghao Yang,Linlin Gao,Pengyuan Li,Wenbo Li,Yihong Dong,Zhiying Cui
Keywords: Current structured pruning, considerable accuracy drops, accuracy drops due, Current structured, structured pruning methods
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Current structured pruning methods often result in considerable accuracy drops due to abrupt network changes and loss of information from pruned structures. To address these issues, we introduce the Decay Pruning Method (DPM), a novel smooth pruning approach with a self-rectifying mechanism. DPM consists of two key components: (i) Smooth Pruning: It converts conventional single-step pruning into multi-step smooth pruning, gradually reducing redundant structures to zero over N steps with ongoing optimization. (ii) Self-Rectifying: This procedure further enhances the aforementioned process by rectifying sub-optimal pruning based on gradient information. Our approach demonstrates strong generalizability and can be easily integrated with various existing pruning methods. We validate the effectiveness of DPM by integrating it with three popular pruning methods: OTOv2, Depgraph, and Gate Decorator. Experimental results show consistent improvements in performance compared to the original pruning methods, along with further reductions of FLOPs in most scenarios.
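
The smooth-pruning idea, shrinking a structure to zero over N steps instead of removing it at once, can be illustrated with a toy schedule. The sketch below assumes a linear decay factor and scales a fixed list of channel weights; the actual decay schedule, the ongoing optimization between steps, and the gradient-based self-rectifying procedure are not given in the abstract and are not modeled here. All names are hypothetical.

```python
def decay_factor(step, num_steps):
    """Assumed linear schedule: 1.0 before pruning starts, exactly 0.0 at step == num_steps."""
    return max(0.0, 1.0 - step / num_steps)


def smooth_prune(weights, pruned, num_steps):
    """Return the weight list after each decay step.

    Channels whose index is in `pruned` are scaled toward zero;
    all other channels are left unchanged in this toy version.
    """
    history = []
    for step in range(1, num_steps + 1):
        factor = decay_factor(step, num_steps)
        history.append(
            [w * factor if i in pruned else w for i, w in enumerate(weights)]
        )
    return history


# Three toy channel weights; channel 1 is marked as redundant.
history = smooth_prune([4.0, 2.0, 8.0], pruned={1}, num_steps=4)
for step_weights in history:
    print(step_weights)
```

With four steps the marked channel passes through 1.5, 1.0 and 0.5 before reaching 0.0, so the network sees a gradual change rather than an abrupt removal. In a real training loop the kept channels would keep being optimized between steps.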

[LG-87] Quantum Implicit Neural Representations

Link: https://arxiv.org/abs/2406.03873
Authors: Jiaming Zhao,Wenbo Qiao,Peng Zhang,Hui Gao
Keywords: neural networks, Implicit neural representations, Fourier Neural Networks, neural, Implicit neural
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper was accepted by ICML 2024

Abstract:Implicit neural representations have emerged as a powerful paradigm to represent signals such as images and sounds. This approach aims to utilize neural networks to parameterize the implicit function of the signal. However, when representing implicit functions, traditional neural networks such as ReLU-based multilayer perceptrons face challenges in accurately modeling high-frequency components of signals. Recent research has begun to explore the use of Fourier Neural Networks (FNNs) to overcome this limitation. In this paper, we propose Quantum Implicit Representation Network (QIREN), a novel quantum generalization of FNNs. Furthermore, through theoretical analysis, we demonstrate that QIREN possesses a quantum advantage over classical FNNs. Lastly, we conducted experiments in signal representation, image superresolution, and image generation tasks to show the superior performance of QIREN compared to state-of-the-art (SOTA) models. Our work not only incorporates quantum advantages into implicit neural representations but also uncovers a promising application direction for Quantum Neural Networks.

[LG-88] PairNet: Training with Observed Pairs to Estimate Individual Treatment Effect

Link: https://arxiv.org/abs/2406.03864
Authors: Lokesh Nagalapatti,Pranava Singhal,Avishek Ghosh,Sunita Sarawagi
Keywords: individual treatment effect, covariate vector, ITE, predict outcome, dataset of individuals
Subjects: Machine Learning (cs.LG)
Comments: Lokesh and Pranava contributed equally. Accepted at ICML-24

Abstract:Given a dataset of individuals each described by a covariate vector, a treatment, and an observed outcome on the treatment, the goal of the individual treatment effect (ITE) estimation task is to predict outcome changes resulting from a change in treatment. A fundamental challenge is that in the observational data, a covariate’s outcome is observed only under one treatment, whereas we need to infer the difference in outcomes under two different treatments. Several existing approaches address this issue through training with inferred pseudo-outcomes, but their success relies on the quality of these pseudo-outcomes. We propose PairNet, a novel ITE estimation training strategy that minimizes losses over pairs of examples based on their factual observed outcomes. Theoretical analysis for binary treatments reveals that PairNet is a consistent estimator of ITE risk, and achieves smaller generalization error than baseline models. Empirical comparison with thirteen existing methods across eight benchmarks, covering both discrete and continuous treatments, shows that PairNet achieves significantly lower ITE error compared to the baselines. Also, it is model-agnostic and easy to implement.

[LG-89] Behavior-Targeted Attack on Reinforcement Learning with Limited Access to Victim's Policy

Link: https://arxiv.org/abs/2406.03862
Authors: Shojiro Yamabe,Kazuto Fukuchi,Ryoma Senda,Jun Sakuma
Keywords: adding adversarial modifications, victim state observation, victim agent behavior, victim state, victim
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:This study considers the attack on reinforcement learning agents where the adversary aims to control the victim’s behavior as specified by the adversary by adding adversarial modifications to the victim’s state observation. While some attack methods reported success in manipulating the victim agent’s behavior, these methods often rely on environment-specific heuristics. In addition, all existing attack methods require white-box access to the victim’s policy. In this study, we propose a novel method for manipulating the victim agent in the black-box (i.e., the adversary is allowed to observe the victim’s state and action only) and no-box (i.e., the adversary is allowed to observe the victim’s state only) setting without requiring environment-specific heuristics. Our attack method is formulated as a bi-level optimization problem that is reduced to a distribution matching problem and can be solved by an existing imitation learning algorithm in the black-box and no-box settings. Empirical evaluations on several reinforcement learning benchmarks show that our proposed method has superior attack performance to baselines.

[LG-90] MuJo: Multimodal Joint Feature Space Learning for Human Activity Recognition

Link: https://arxiv.org/abs/2406.03857
Authors: Stefan Gerd Fritsch,Cennet Oguz,Vitor Fortes Rey,Lala Ray,Maximilian Kiefer-Emmanouilidis,Paul Lukowicz
Keywords: Human Activity Recognition, human computer interaction, Human Activity, Activity Recognition, sports and fitness
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Human Activity Recognition (HAR) is a longstanding problem in AI with applications in a broad range of areas: from healthcare, sports and fitness, security, and human computer interaction to robotics. The performance of HAR in real-world settings is strongly dependent on the type and quality of the input signal that can be acquired. Given an unobstructed, high-quality camera view of a scene, computer vision systems, in particular in conjunction with foundational models (e.g., CLIP), can today fairly reliably distinguish complex activities. On the other hand, recognition using modalities such as wearable sensors (which are often more broadly available, e.g., in mobile phones and smartwatches) is a more difficult problem, as the signals often contain less information and labeled training data is more difficult to acquire. In this work, we show how we can improve HAR performance across different modalities using multimodal contrastive pretraining. Our approach, MuJo (Multimodal Joint Feature Space Learning), learns a multimodal joint feature space with video, language, pose, and IMU sensor data. The proposed approach combines contrastive and multitask learning methods and analyzes different multitasking strategies for learning a compact shared representation. A large dataset with parallel video, language, pose, and sensor data points is also introduced to support the research, along with an analysis of the robustness of the multimodal joint space for modal-incomplete and low-resource data. On the MM-Fit dataset, our model achieves an impressive Macro F1-Score of up to 0.992 with only 2% of the train data and 0.999 when using all available training data for classification tasks. Moreover, in the scenario where the MM-Fit dataset is unseen, we demonstrate a generalization performance of up to 0.638.

[LG-91] Why the Metric Backbone Preserves Community Structure

Link: https://arxiv.org/abs/2406.03852
Authors: Maximilien Dreveton,Charbel Chucri,Matthias Grossglauser,Patrick Thiran
Keywords: metric backbone, all-pairs shortest paths, metric, backbone, union of all-pairs
Subjects: Social and Information Networks (cs.SI); Machine Learning (cs.LG); Probability (math.PR)
Comments:

Abstract:The metric backbone of a weighted graph is the union of all-pairs shortest paths. It is obtained by removing all edges (u,v) that are not the shortest path between u and v . In networks with well-separated communities, the metric backbone tends to preserve many inter-community edges, because these edges serve as bridges connecting two communities, but tends to delete many intra-community edges because the communities are dense. This suggests that the metric backbone would dilute or destroy the community structure of the network. However, this is not borne out by prior empirical work, which instead showed that the metric backbone of real networks preserves the community structure of the original network well. In this work, we analyze the metric backbone of a broad class of weighted random graphs with communities, and we formally prove the robustness of the community structure with respect to the deletion of all the edges that are not in the metric backbone. An empirical comparison of several graph sparsification techniques confirms our theoretical finding and shows that the metric backbone is an efficient sparsifier in the presence of communities.
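
The definition in the abstract, keeping only the edges (u,v) that are themselves a shortest path between u and v, translates directly into a small procedure. The sketch below is a minimal pure-Python illustration for an undirected graph with positive weights, using Dijkstra's algorithm from every node; the function names, the toy graph, and the choice to keep an edge when it ties with another shortest path are assumptions for illustration, not the authors' implementation.

```python
import heapq


def shortest_distances(adj, source):
    """Dijkstra from `source` over an adjacency dict {node: {neighbor: weight}}."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in adj[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def metric_backbone(edges):
    """Keep edges (u, v, w) whose weight equals the shortest-path distance between u and v."""
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w
    dist = {u: shortest_distances(adj, u) for u in adj}
    # An edge survives if no strictly shorter path exists (ties are kept: an assumption).
    return [(u, v, w) for u, v, w in edges if w <= dist[u][v] + 1e-12]


# Toy graph: the direct edge a-c (weight 3) is longer than the path a-b-c (weight 2).
edges = [("a", "b", 1.0), ("b", "c", 1.0), ("a", "c", 3.0), ("c", "d", 2.0)]
backbone = metric_backbone(edges)
print(backbone)
```

In the toy graph only the edge a-c is dropped, because the two-hop path through b is shorter. Running Dijkstra from every node is adequate for small graphs; large networks would need a more careful approach.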

[LG-92] A Noise-robust Multi-head Attention Mechanism for Formation Resistivity Prediction: Frequency Aware LSTM

Link: https://arxiv.org/abs/2406.03849
Authors: Yongan Zhang, Junfeng Zhao, Jian Li, Xuanran Wang, Youzhuang Sun, Yuntian Chen, Dongxiao Zhang
Keywords: geothermal energy resources, temporal anti-noise block, formation resistivity plays, gas reservoirs, identification and assessment
Categories: Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)

Abstract:The prediction of formation resistivity plays a crucial role in the evaluation of oil and gas reservoirs, identification and assessment of geothermal energy resources, groundwater detection and monitoring, and carbon capture and storage. However, traditional well logging techniques fail to measure accurate resistivity in cased boreholes, and the transient electromagnetic method for cased borehole resistivity logging encounters challenges of high-frequency disaster (the problem of inadequate learning by neural networks in high-frequency features) and noise interference, badly affecting accuracy. To address these challenges, a frequency-aware framework and a temporal anti-noise block are proposed to build a frequency-aware LSTM (FAL). The frequency-aware framework implements a dual-stream structure through wavelet transformation, allowing the neural network to simultaneously handle high-frequency and low-frequency flows of time-series data, thus avoiding high-frequency disaster. The temporal anti-noise block integrates multiple attention mechanisms and soft-threshold attention mechanisms, enabling the model to better distinguish noise from redundant features. Ablation experiments demonstrate that the frequency-aware framework and temporal anti-noise block contribute significantly to performance improvement. FAL achieves a 24.3% improvement in R^2 over LSTM, reaching the highest value of 0.91 among all models. In robustness experiments, the impact of noise on FAL is approximately 1/8 of the baseline, confirming the noise resistance of FAL. The proposed FAL effectively reduces noise interference in predicting formation resistivity from cased transient electromagnetic well logging curves, better learns high-frequency features, and thereby enhances the prediction accuracy and noise resistance of the neural network model.
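
The abstract mentions soft-threshold attention for separating noise from features. The sketch below shows only the standard soft-thresholding operator that the name refers to. How FAL chooses the threshold and combines it with attention is not stated in the abstract, so the fixed `tau` and the sample values here are assumptions for illustration.

```python
def soft_threshold(x, tau):
    """Shrink x toward zero by tau; any value with |x| <= tau becomes exactly 0."""
    if x > tau:
        return x - tau
    if x < -tau:
        return x + tau
    return 0.0


# Hypothetical feature values: small entries are treated as noise and zeroed.
features = [0.05, -0.8, 0.3, -0.02, 1.5]
denoised = [soft_threshold(v, 0.1) for v in features]
```

Small-magnitude entries are removed entirely, while large entries are kept with a slightly reduced magnitude.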

[LG-93] Open Problem: Active Representation Learning

Link: https://arxiv.org/abs/2406.03845
Authors: Nikola Milosevic, Gesine Müller, Jan Huisken, Nico Scherf
Keywords: partially observable environments, Active Representation Learning, Active Simultaneous Localization, Representation Learning, observable environments
Categories: Machine Learning (cs.LG); Robotics (cs.RO); Systems and Control (eess.SY)

Abstract:In this work, we introduce the concept of Active Representation Learning, a novel class of problems that intertwines exploration and representation learning within partially observable environments. We extend ideas from Active Simultaneous Localization and Mapping (active SLAM), and translate them to scientific discovery problems, exemplified by adaptive microscopy. We explore the need for a framework that derives exploration skills from representations that are in some sense actionable, aiming to enhance the efficiency and effectiveness of data collection and model building in the natural sciences.

[LG-94] Exploiting Global Graph Homophily for Generalized Defense in Graph Neural Networks

Link: https://arxiv.org/abs/2406.03833
Authors: Duanyu Li, Huijun Wu, Min Xie, Xugang Wu, Zhenwei Wu, Wenzhe Zhang
Keywords: numerous tasks involving, tasks involving graph-related, graph-related data analysis, Graph neural network, involving graph-related data
Categories: Machine Learning (cs.LG)

Abstract:Graph neural network (GNN) models play a pivotal role in numerous tasks involving graph-related data analysis. Despite their efficacy, similar to other deep learning models, GNNs are susceptible to adversarial attacks. Even minor perturbations in graph data can induce substantial alterations in model predictions. While existing research has explored various adversarial defense techniques for GNNs, the challenge of defending against adversarial attacks on real-world scale graph data remains largely unresolved. On one hand, methods reliant on graph purification and preprocessing tend to excessively emphasize local graph information, leading to sub-optimal defensive outcomes. On the other hand, approaches rooted in graph structure learning entail significant time overheads, rendering them impractical for large-scale graphs. In this paper, we propose a new defense method named Talos, which enhances the global, rather than local, homophily of graphs as a defense. Experiments show that the proposed approach notably outperforms state-of-the-art defense approaches, while imposing little computational overhead.

[LG-95] Predictability Analysis of Regression Problems via Conditional Entropy Estimations

Link: https://arxiv.org/abs/2406.03824
Authors: Yu-Hsueh Fang, Chia-Yen Lee
Keywords: predict continuous outcomes, machine learning, continuous outcomes, field of machine, pivotal due
Categories: Machine Learning (cs.LG); Information Theory (cs.IT)

Abstract:In the field of machine learning, regression problems are pivotal due to their ability to predict continuous outcomes. Traditional error metrics like mean squared error, mean absolute error, and coefficient of determination measure model accuracy. The model accuracy is the consequence of the selected model and the features, which blurs the analysis of contribution. Predictability, on the other hand, focuses on the predictable level of a target variable given a set of features. This study introduces conditional entropy estimators to assess predictability in regression problems, bridging this gap. We enhance and develop reliable conditional entropy estimators, particularly the KNIFE-P estimator and LMC-P estimator, which offer under- and over-estimation, providing a practical framework for predictability analysis. Extensive experiments on synthesized and real-world datasets demonstrate the robustness and utility of these estimators. Additionally, we extend the analysis to the coefficient of determination (R^2), enhancing the interpretability of predictability. The results highlight the effectiveness of KNIFE-P and LMC-P in capturing the achievable performance and limitations of feature sets, providing valuable tools in the development of regression models. These indicators offer a robust framework for assessing the predictability for regression problems.
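
One way to see why conditional entropy bounds achievable regression accuracy is the classical information-theoretic inequality MSE >= exp(2 h(Y|X)) / (2 pi e), with h(Y|X) in nats, which gives R^2 <= 1 - exp(2 h(Y|X)) / (2 pi e Var(Y)). The sketch below evaluates that textbook bound. It is not claimed to be the derivation used in the paper, and the function names are made up for this example.

```python
import math


def mse_lower_bound(cond_entropy_nats):
    """Smallest MSE any predictor of Y from X can reach, given h(Y|X) in nats."""
    return math.exp(2.0 * cond_entropy_nats) / (2.0 * math.pi * math.e)


def r2_upper_bound(cond_entropy_nats, var_y):
    """Largest achievable R^2 implied by the MSE bound and the target variance."""
    return 1.0 - mse_lower_bound(cond_entropy_nats) / var_y


def gaussian_entropy(variance):
    """Differential entropy (nats) of a Gaussian with the given variance."""
    return 0.5 * math.log(2.0 * math.pi * math.e * variance)


# Sanity check: Y = f(X) + Gaussian noise with variance 0.25 and Var(Y) = 1.
h = gaussian_entropy(0.25)
best_mse = mse_lower_bound(h)      # recovers the noise variance
best_r2 = r2_upper_bound(h, 1.0)   # at most 75% of the variance is explainable
```

For Gaussian noise the bound is tight, which makes it a convenient check: an estimated h(Y|X) translates directly into a ceiling on R^2 for the given feature set, regardless of the model.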

[LG-96] A Survey on Intelligent Internet of Things: Applications, Security, Privacy, and Future Directions

Link: https://arxiv.org/abs/2406.03820
Authors: Ons Aouedi, Thai-Hoc Vu, Alessio Sacco, Dinh C. Nguyen, Kandaraj Piamrat, Guido Marchetto, Quoc-Viet Pham
Keywords: Internet of Things, customer services, rapid advances, promoted a revolution, revolution in communication
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Comments: This work has been accepted by IEEE Communications Surveys & Tutorials

Abstract:The rapid advances in the Internet of Things (IoT) have promoted a revolution in communication technology and offered various customer services. Artificial intelligence (AI) techniques have been exploited to facilitate IoT operations and maximize their potential in modern application scenarios. In particular, the convergence of IoT and AI has led to a new networking paradigm called Intelligent IoT (IIoT), which has the potential to significantly transform businesses and industrial domains. This paper presents a comprehensive survey of IIoT by investigating its significant applications in mobile networks, as well as its associated security and privacy issues. Specifically, we explore and discuss the roles of IIoT in a wide range of key application domains, from smart healthcare and smart cities to smart transportation and smart industries. Through such extensive discussions, we investigate important security issues in IIoT networks, where network attacks, confidentiality, integrity, and intrusion are analyzed, along with a discussion of potential countermeasures. Privacy issues in IIoT networks were also surveyed and discussed, including data, location, and model privacy leakage. Finally, we outline several key challenges and highlight potential research directions in this important area.

[LG-97] Subspace Clustering in Wavelet Packets Domain

Link: https://arxiv.org/abs/2406.03819
Authors: Ivica Kopriva, Damir Sersic
Keywords: cluster data points, subspaces model, Subspace clustering, MERA tensor network, utilize the union
Categories: Machine Learning (cs.LG)
Comments: 18 pages, 9 tables, 1 figure

Abstract:Subspace clustering (SC) algorithms utilize the union of subspaces model to cluster data points according to the subspaces from which they are drawn. To better address separability of subspaces and robustness to noise, we propose a wavelet packet (WP) based transform domain subspace clustering. Depending on the number of resolution levels, WP yields several representations instantiated in terms of subbands. The first approach combines original and subband data into one complementary multi-view representation. Afterward, we formulate joint representation learning as a low-rank MERA tensor network approximation problem. That is motivated by the strong representation power of the MERA network to capture complex intra/inter-view dependencies in the corresponding self-representation tensor. In the second approach, we use a self-stopping computationally efficient method to select the subband with the smallest clustering error on the validation set. When existing SC algorithms are applied to the chosen subband, their performance is expected to improve. Consequently, both approaches enable the re-use of SC algorithms developed so far. Improved clustering performance is due to the dual nature of subbands as representations and filters, which is essential for noise suppression. We exemplify the proposed WP domain approach to SC on the MERA tensor network and eight other well-known linear SC algorithms using six well-known image datasets representing faces, digits, and objects. Although WP domain-based SC is a linear method, it achieved clustering performance comparable with some of the best deep SC algorithms and outperformed many other deep SC algorithms by a significant margin. That is particularly the case for the WP MERA SC algorithm. On the COIL100 dataset, it achieves an accuracy of 87.45% and outperforms the best deep SC competitor by 14.75%.

[LG-98] Amortized Equation Discovery in Hybrid Dynamical Systems

Link: https://arxiv.org/abs/2406.03818
Authors: Yongtuo Liu, Sara Magliacane, Miltiadis Kofinas, Efstratios Gavves
Keywords: express complex systems, discrete states, Hybrid dynamical systems, prevalent in science, science and engineering
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Symbolic Computation (cs.SC)
Comments: 24 pages, 5 figures, accepted by International Conference on Machine Learning (ICML) 2024

Abstract:Hybrid dynamical systems are prevalent in science and engineering to express complex systems with continuous and discrete states. To learn the laws of systems, all previous methods for equation discovery in hybrid systems follow a two-stage paradigm, i.e. they first group time series into small cluster fragments and then discover equations in each fragment separately through methods in non-hybrid systems. Although effective, these methods do not fully take advantage of the commonalities in the shared dynamics of multiple fragments that are driven by the same equations. Besides, the two-stage paradigm breaks the interdependence between categorizing and representing dynamics that jointly form hybrid systems. In this paper, we reformulate the problem and propose an end-to-end learning framework, i.e. Amortized Equation Discovery (AMORE), to jointly categorize modes and discover equations characterizing the dynamics of each mode by all segments of the mode. Experiments on four hybrid and six non-hybrid systems show that our method outperforms previous methods on equation discovery, segmentation, and forecasting.

[LG-99] How to Scale Inverse RL to Large State Spaces? A Provably Efficient Approach

Link: https://arxiv.org/abs/2406.03812
Authors: Filippo Lazzati, Mirco Mutti, Alberto Maria Metelli
Keywords: Inverse Reinforcement Learning, online Inverse Reinforcement, Reinforcement Learning, Inverse Reinforcement, online IRL focus
Categories: Machine Learning (cs.LG)

Abstract:In online Inverse Reinforcement Learning (IRL), the learner can collect samples about the dynamics of the environment to improve its estimate of the reward function. Since IRL suffers from identifiability issues, many theoretical works on online IRL focus on estimating the entire set of rewards that explain the demonstrations, named the feasible reward set. However, none of the algorithms available in the literature can scale to problems with large state spaces. In this paper, we focus on the online IRL problem in Linear Markov Decision Processes (MDPs). We show that the structure offered by Linear MDPs is not sufficient for efficiently estimating the feasible set when the state space is large. As a consequence, we introduce the novel framework of rewards compatibility, which generalizes the notion of feasible set, and we develop CATY-IRL, a sample efficient algorithm whose complexity is independent of the cardinality of the state space in Linear MDPs. When restricted to the tabular setting, we demonstrate that CATY-IRL is minimax optimal up to logarithmic factors. As a by-product, we show that Reward-Free Exploration (RFE) enjoys the same worst-case rate, improving over the state-of-the-art lower bound. Finally, we devise a unifying framework for IRL and RFE that may be of independent interest.

[LG-100] Cross-variable Linear Integrated ENhanced Transformer for Photovoltaic power forecasting

Link: https://arxiv.org/abs/2406.03808
Authors: Jiaxin Gao, Qinglong Cao, Yuntian Chen, Dongxiao Zhang
Keywords: enabling efficient energy, efficient energy management, power forecasting, Photovoltaic power forecasting, power forecasting plays
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Applications (stat.AP)

Abstract:Photovoltaic (PV) power forecasting plays a crucial role in optimizing the operation and planning of PV systems, thereby enabling efficient energy management and grid integration. However, uncertainties caused by fluctuating weather conditions and complex interactions between different variables pose significant challenges to accurate PV power forecasting. In this study, we propose PV-Client (Cross-variable Linear Integrated ENhanced Transformer for Photovoltaic power forecasting) to address these challenges and enhance PV power forecasting accuracy. PV-Client employs an ENhanced Transformer module to capture complex interactions of various features in PV systems, and utilizes a linear module to learn trend information in PV power. Diverging from conventional time series-based Transformer models that use cross-time Attention to learn dependencies between different time steps, the Enhanced Transformer module integrates cross-variable Attention to capture dependencies between PV power and weather factors. Furthermore, PV-Client streamlines the embedding and position encoding layers by replacing the Decoder module with a projection layer. Experimental results on three real-world PV power datasets affirm PV-Client's state-of-the-art (SOTA) performance in PV power forecasting. Specifically, PV-Client surpasses the second-best model GRU by 5.3% in MSE metrics and 0.9% in accuracy metrics at the Jingang Station. Similarly, PV-Client outperforms the second-best model SVR by 10.1% in MSE metrics and 0.2% in accuracy metrics at the Xinqingnian Station, and PV-Client exhibits superior performance compared to the second-best model SVR with enhancements of 3.4% in MSE metrics and 0.9% in accuracy metrics at the Hongxing Station.

[LG-101] Infusing Self-Consistency into Density Functional Theory Hamiltonian Prediction via Deep Equilibrium Models

Link: https://arxiv.org/abs/2406.03794
Authors: Zun Wang, Chang Liu, Nianlong Zou, He Zhang, Xinran Wei, Lin Huang, Lijun Wu, Bin Shao
Keywords: Density Functional Theory, Functional Theory Hamiltonian, Equilibrium Density Functional, predicting Density Functional, Density Functional
Categories: Machine Learning (cs.LG)

Abstract:In this study, we introduce a unified neural network architecture, the Deep Equilibrium Density Functional Theory Hamiltonian (DEQH) model, which incorporates Deep Equilibrium Models (DEQs) for predicting Density Functional Theory (DFT) Hamiltonians. The DEQH model inherently captures the self-consistency nature of Hamiltonian, a critical aspect often overlooked by traditional machine learning approaches for Hamiltonian prediction. By employing DEQ within our model architecture, we circumvent the need for DFT calculations during the training phase to introduce the Hamiltonian’s self-consistency, thus addressing computational bottlenecks associated with large or complex systems. We propose a versatile framework that combines DEQ with off-the-shelf machine learning models for predicting Hamiltonians. When benchmarked on the MD17 and QH9 datasets, DEQHNet, an instantiation of the DEQH framework, has demonstrated a significant improvement in prediction accuracy. Beyond a predictor, the DEQH model is a Hamiltonian solver, in the sense that it uses the fixed-point solving capability of the deep equilibrium model to iteratively solve for the Hamiltonian. Ablation studies of DEQHNet further elucidate the network’s effectiveness, offering insights into the potential of DEQ-integrated networks for Hamiltonian learning.

[LG-102] Low-Rank Similarity Mining for Multimodal Dataset Distillation

Link: https://arxiv.org/abs/2406.03793
Authors: Yue Xu, Zhilin Lin, Yusong Qiu, Cewu Lu, Yong-Lu Li
Keywords: witnessed rapid development, recent years, poses unique, under-explored challenges, witnessed rapid
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ICML 2024

Abstract:Though dataset distillation has witnessed rapid development in recent years, the distillation of multimodal data, e.g., image-text pairs, poses unique and under-explored challenges. Unlike unimodal data, image-text contrastive learning (ITC) data lack inherent categorization and should instead place greater emphasis on modality correspondence. In this work, we propose Low-Rank Similarity Mining (LoRS) for multimodal dataset distillation, that concurrently distills a ground truth similarity matrix with image-text pairs, and leverages low-rank factorization for efficiency and scalability. The proposed approach brings significant improvement to the existing algorithms, marking a significant contribution to the field of visual-language dataset distillation. We advocate adopting LoRS as a foundational synthetic data setup for image-text dataset distillation. Our code is available at this https URL.
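
To make the efficiency argument for low-rank factorization concrete: a full n-by-n similarity matrix over n distilled image-text pairs has n^2 learnable entries, while a rank-r factorization needs only 2nr. The sketch below assumes the parameterization S = I + L R^T purely for illustration; the abstract does not specify the exact form used by LoRS, and the function names are hypothetical.

```python
def low_rank_similarity(left, right):
    """Build S = I + left @ right^T from two n x r factor matrices (lists of rows)."""
    n = len(left)
    return [
        [
            (1.0 if i == j else 0.0)
            + sum(a * b for a, b in zip(left[i], right[j]))
            for j in range(n)
        ]
        for i in range(n)
    ]


def parameter_counts(n, r):
    """Learnable entries for a full similarity matrix versus rank-r factors."""
    return {"full": n * n, "low_rank": 2 * n * r}


# Tiny example: 2 pairs, rank 1. The off-diagonal entry marks pair 0 as also
# matching the text of pair 1.
similarity = low_rank_similarity([[1.0], [0.0]], [[0.0], [2.0]])
counts = parameter_counts(1000, 10)
```

With 1000 pairs and rank 10, the factorized form stores 20,000 values instead of 1,000,000, which is where the scalability benefit comes from.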

[LG-103] Speed of Light Exact Greedy Decoding for RNN-T Speech Recognition Models on GPU

Link: https://arxiv.org/abs/2406.03791
Authors: Daniel Galvez, Vladimir Bataev, Hainan Xu, Tim Kaldewey
Keywords: RNN Transducer, billion parameter RNN-T, vast majority, today is spent, RNN-T
Categories: Machine Learning (cs.LG)
Comments: Interspeech 2024 Proceedings

Abstract:The vast majority of inference time for RNN Transducer (RNN-T) models today is spent on decoding. Current state-of-the-art RNN-T decoding implementations leave the GPU idle ~80% of the time. Leveraging a new CUDA 12.4 feature, CUDA graph conditional nodes, we present an exact GPU-based implementation of greedy decoding for RNN-T models that eliminates this idle time. Our optimizations speed up a 1.1 billion parameter RNN-T model end-to-end by a factor of 2.5x. This technique can be applied to the "label looping" alternative greedy decoding algorithm as well, achieving 1.7x and 1.4x end-to-end speedups when applied to 1.1 billion parameter RNN-T and Token and Duration Transducer models respectively. This work enables a 1.1 billion parameter RNN-T model to run only 16% slower than a similarly sized CTC model, contradicting the common belief that RNN-T models are not suitable for high throughput inference. The implementation is available in NVIDIA NeMo.
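
For readers unfamiliar with RNN-T greedy decoding, the sketch below shows the standard frame-by-frame loop in plain Python. The inner loop runs until the model emits a blank, so its trip count depends on model outputs; this kind of data-dependent control flow is presumably what the conditional graph nodes keep on the GPU. The `toy_joint` function is a hypothetical stand-in for a real prediction and joint network, and none of this reflects the NeMo implementation.

```python
BLANK = 0  # index of the blank symbol in the toy vocabulary


def greedy_decode(encoder_frames, joint, max_symbols_per_frame=10):
    """Standard transducer greedy search.

    joint(frame, prev_tokens) must return a list of scores over the vocabulary.
    """
    hypothesis = []
    for frame in encoder_frames:
        for _ in range(max_symbols_per_frame):
            scores = joint(frame, hypothesis)
            token = max(range(len(scores)), key=scores.__getitem__)
            if token == BLANK:
                break  # blank: move on to the next encoder frame
            hypothesis.append(token)  # non-blank: emit and stay on this frame
    return hypothesis


def toy_joint(frame, prev_tokens):
    """Hypothetical joint: emit the frame's token once, then predict blank."""
    scores = [0.0] * 4
    last = prev_tokens[-1] if prev_tokens else BLANK
    target = frame if frame != last else BLANK
    scores[target] = 1.0
    return scores


decoded = greedy_decode([2, 2, 0, 3], toy_joint)
```

Each iteration needs the previous token before the next joint evaluation can start, which is why the loop is inherently sequential within an utterance.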

[LG-104] Enhancing Graph U-Nets for Mesh-Agnostic Spatio-Temporal Flow Prediction

Link: https://arxiv.org/abs/2406.03789
Authors: Sunwoong Yang, Ricardo Vinuesa, Namwoo Kang
Keywords: convolutional neural networks, inherent mesh dependency, deep-learning approaches based, conventional deep-learning approaches, neural networks
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Fluid Dynamics (physics.flu-dyn)

Abstract:This study aims to overcome the conventional deep-learning approaches based on convolutional neural networks, whose applicability to complex geometries and unstructured meshes is limited due to their inherent mesh dependency. We propose novel approaches to improve mesh-agnostic spatio-temporal prediction of transient flow fields using graph U-Nets, enabling accurate prediction on diverse mesh configurations. Key enhancements to the graph U-Net architecture, including the Gaussian mixture model convolutional operator and noise injection approaches, provide increased flexibility in modeling node dynamics: the former reduces prediction error by 95% compared to conventional convolutional operators, while the latter improves long-term prediction robustness, resulting in an error reduction of 86%. We also investigate transductive and inductive-learning perspectives of graph U-Nets with proposed improvements. In the transductive setting, they effectively predict quantities for unseen nodes within the trained graph. In the inductive setting, they successfully perform in mesh scenarios with different vortex-shedding periods, showing 98% improvement in predicting the future flow fields compared to a model trained without the inductive settings. It is found that graph U-Nets without pooling operations, i.e. without reducing and restoring the node dimensionality of the graph data, perform better in inductive settings due to their ability to learn from the detailed structure of each graph. Meanwhile, we also discover that the choice of normalization technique significantly impacts graph U-Net performance.

[LG-105] Empirical Guidelines for Deploying LLMs onto Resource-constrained Edge Devices

Link: https://arxiv.org/abs/2406.03777
Authors: Ruiyang Qin, Dancheng Liu, Zheyu Yan, Zhaoxuan Tan, Zixuan Pan, Zhenge Jia, Meng Jiang, Ahmed Abbasi, Jinjun Xiong, Yiyu Shi
Keywords: designing large language, unlimited computing resources, large language models, training and inference, scaling laws
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Under review

Abstract:The scaling laws have become the de facto guidelines for designing large language models (LLMs), but they were studied under the assumption of unlimited computing resources for both training and inference. As LLMs are increasingly used as personalized intelligent assistants, their customization (i.e., learning through fine-tuning) and deployment onto resource-constrained edge devices will become more and more prevalent. An urgent but open question is how a resource-constrained computing environment would affect the design choices for a personalized LLM. We study this problem empirically in this work. In particular, we consider the tradeoffs among a number of key design factors and their intertwined impacts on learning efficiency and accuracy. The factors include the learning methods for LLM customization, the amount of personalized data used for learning customization, the types and sizes of LLMs, the compression methods of LLMs, the amount of time afforded to learn, and the difficulty levels of the target use cases. Through extensive experimentation and benchmarking, we draw a number of surprisingly insightful guidelines for deploying LLMs onto resource-constrained devices. For example, the optimal choice between parameter learning and RAG may vary depending on the difficulty of the downstream task, a longer fine-tuning time does not necessarily help the model, and a compressed LLM may be a better choice than an uncompressed LLM to learn from limited personalized data.

[LG-106] DeepRacer on Physical Track: Parameters Exploration and Performance Evaluation

Link: https://arxiv.org/abs/2406.03769
Authors: Sinan Koparan, Bahman Javadi
Keywords: physical racetrack capabilities, descent batch size, gradient descent batch, physical environment, physical
Categories: Machine Learning (cs.LG); Robotics (cs.RO)

Abstract:This paper focuses on the physical racetrack capabilities of AWS DeepRacer. Two separate experiments were conducted. The first experiment (Experiment I) focused on evaluating the impact of hyperparameters on the physical environment. Hyperparameters such as gradient descent batch size and loss type were changed systematically as well as training time settings. The second experiment (Experiment II) focused on exploring AWS DeepRacer object avoidance in the physical environment. It was uncovered that in the simulated environment, models with a higher gradient descent batch size had better performance than models with a lower gradient descent batch size. Alternatively, in the physical environment, a gradient descent batch size of 128 appears to be preferable. It was found that models using the loss type of Huber outperformed models that used the loss type of MSE in both the simulated and physical environments. Finally, object avoidance in the simulated environment appeared to be effective; however, when bringing these models to the physical environment, there was a pronounced challenge to avoid objects. Therefore, object avoidance in the physical environment remains an open challenge.
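
The two loss types compared in the abstract differ in how they treat large errors. The sketch below uses the textbook definitions, with a 0.5 factor on the squared loss so the two agree for small errors. The threshold `delta=1.0` is an assumed default for illustration, not a value reported in the paper.

```python
def squared_loss(error):
    """Squared-error loss with the conventional 0.5 factor."""
    return 0.5 * error * error


def huber_loss(error, delta=1.0):
    """Quadratic for |error| <= delta, linear beyond it."""
    magnitude = abs(error)
    if magnitude <= delta:
        return 0.5 * error * error
    return delta * (magnitude - 0.5 * delta)


small = (squared_loss(0.5), huber_loss(0.5))   # identical for small errors
large = (squared_loss(3.0), huber_loss(3.0))   # Huber grows more slowly
```

Because Huber loss grows linearly for large errors, a single outlier sample contributes a bounded gradient, which is one commonly cited reason it can train more stably than MSE.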

[LG-107] Enhancing In-Context Learning Performance with just SVD-Based Weight Pruning: A Theoretical Perspective

Link: https://arxiv.org/abs/2406.03768
Authors: Xinhao Yao, Xiaolin Hu, Shenzhi Yang, Yong Liu
Keywords: Pre-trained large language, large language models, striking in-context learning, demonstrated striking in-context, Pre-trained large
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Pre-trained large language models (LLMs) based on Transformer have demonstrated striking in-context learning (ICL) abilities. With a few demonstration input-label pairs, they can predict the label for an unseen input without any parameter updates. In this paper, we show an exciting phenomenon that SVD-based weight pruning can enhance ICL performance, and more surprisingly, pruning weights in deep layers often results in more stable performance improvements than in shallow layers. However, the underlying mechanism of those findings still remains an open question. To explain those findings, we conduct an in-depth theoretical analysis by presenting the implicit gradient descent (GD) trajectories of ICL and giving the mutual information based generalization bounds of ICL via full implicit GD trajectories. This helps us reasonably explain the surprising experimental findings. Besides, based on all our experimental and theoretical insights, we intuitively propose a simple, model-compression and derivative-free algorithm for downstream tasks in enhancing ICL inference. Experiments on benchmark datasets and open source LLMs demonstrate the method's effectiveness. The code is available at this https URL.

[LG-108] RoboCoder: Robotic Learning from Basic Skills to General Tasks with Large Language Models

Link: https://arxiv.org/abs/2406.03757
Authors: Jingyao Li, Pengguang Chen, Sitong Wu, Chuanyang Zheng, Hong Xu, Jiaya Jia
Keywords: Large Language Models, improved the prospects, Large Language, Language Models, integrates Large Language
Categories: Robotics (cs.RO); Machine Learning (cs.LG)

Abstract:The emergence of Large Language Models (LLMs) has improved the prospects for robotic tasks. However, existing benchmarks are still limited to single tasks with limited generalization capabilities. In this work, we introduce a comprehensive benchmark and an autonomous learning framework, RoboCoder, aimed at enhancing the generalization capabilities of robots in complex environments. Unlike traditional methods that focus on single-task learning, our research emphasizes the development of a general-purpose robotic coding algorithm that enables robots to leverage basic skills to tackle increasingly complex tasks. The newly proposed benchmark consists of 80 manually designed tasks across 7 distinct entities, testing the models' ability to learn from minimal initial mastery. Initial testing revealed that even advanced models like GPT-4 could only achieve a 47% pass rate in three-shot scenarios with humanoid entities. To address these limitations, the RoboCoder framework integrates LLMs with a dynamic learning system that uses real-time environmental feedback to continuously update and refine action codes. This adaptive method achieved a 36% relative improvement. Our code will be released.

[LG-109] Adaptive Multi-Scale Decomposition Framework for Time Series Forecasting

Link: https://arxiv.org/abs/2406.03751
Authors: Yifan Hu, Peiyuan Liu, Peng Zhu, Dawei Cheng, Tao Dai
Keywords: Transformer-based methods excel, emerged as leading, leading approaches, Transformer-based methods, Adaptive Multi-predictor Synthesis
Categories: Machine Learning (cs.LG)

Abstract:Transformer-based and MLP-based methods have emerged as leading approaches in time series forecasting (TSF). While Transformer-based methods excel in capturing long-range dependencies, they suffer from high computational complexities and tend to overfit. Conversely, MLP-based methods offer computational efficiency and adeptness in modeling temporal dynamics, but they struggle with capturing complex temporal patterns effectively. To address these challenges, we propose a novel MLP-based Adaptive Multi-Scale Decomposition (AMD) framework for TSF. Our framework decomposes time series into distinct temporal patterns at multiple scales, leveraging the Multi-Scale Decomposable Mixing (MDM) block to dissect and aggregate these patterns in a residual manner. Complemented by the Dual Dependency Interaction (DDI) block and the Adaptive Multi-predictor Synthesis (AMS) block, our approach effectively models both temporal and channel dependencies and utilizes autocorrelation to refine multi-scale data integration. Comprehensive experiments demonstrate that our AMD framework not only overcomes the limitations of existing methods but also consistently achieves state-of-the-art performance in both long-term and short-term forecasting tasks across various datasets, showcasing superior efficiency. Code is available at this https URL.

[LG-110] Instance Segmentation and Teeth Classification in Panoramic X-rays

Link: https://arxiv.org/abs/2406.03747
Authors: Devichand Budagam, Ayush Kumar, Sayan Ghosh, Anuj Shrivastav, Azamat Zhanatuly Imanbayev, Iskander Rafailovich Akhmetov, Dmitrii Kaplun, Sergey Antonov, Artem Rychenkov, Gleb Cyganov, Aleksandr Sinitca
Keywords: Teeth segmentation, Teeth, recognition are critical, segmentation, deep learning models
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Submitted to Expert Systems with Applications Journal

Abstract:Teeth segmentation and recognition are critical in various dental applications and dental diagnosis. Automatic and accurate segmentation approaches have been made possible by integrating deep learning models. Although teeth segmentation has been studied in the past, only some techniques were able to effectively classify and segment teeth simultaneously. This article offers a pipeline of two deep learning models, U-Net and YOLOv8, which results in BB-UNet, a new architecture for the classification and segmentation of teeth on panoramic X-rays that is efficient and reliable. We have improved the quality and reliability of teeth segmentation by utilising the YOLOv8 and U-Net capabilities. The proposed networks have been evaluated using the mean average precision (mAP) and dice coefficient for YOLOv8 and BB-UNet, respectively. We have achieved a 3% increase in mAP score for teeth classification compared to existing methods, and a 10-15% increase in dice coefficient for teeth segmentation compared to U-Net across different categories of teeth. A new Dental dataset was created based on UFBA-UESC dataset with Bounding-Box and Polygon annotations of 425 dental panoramic X-rays. The findings of this research pave the way for a wider adoption of object detection models in the field of dental diagnosis.
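
The segmentation results above are reported as a Dice coefficient. As a quick reference, the sketch below computes the standard Dice score, 2|A ∩ B| / (|A| + |B|), for two flattened binary masks. The example masks are made up for illustration.

```python
def dice_coefficient(pred, target):
    """Dice score for two equal-length sequences of 0/1 mask values."""
    intersection = sum(p & t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Two empty masks are treated as a perfect match by convention.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Hypothetical flattened masks: one overlapping pixel out of two each.
predicted = [1, 1, 0, 0]
reference = [1, 0, 1, 0]
score = dice_coefficient(predicted, reference)
```

A score of 1.0 means the predicted tooth mask matches the annotation exactly, and 0.0 means no overlap at all.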

[LG-111] ReDistill: Residual Encoded Distillation for Peak Memory Reduction

Link: https://arxiv.org/abs/2406.03744
Authors: Fang Chen,Gourav Datta,Mujahid Al Rafi,Hyeran Jeon,Meng Tang
Keywords: modern camera sensors, camera sensors result, neural network sizes, peak memory, memory
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:The expansion of neural network sizes and the enhancement of image resolution through modern camera sensors result in heightened memory and power demands for neural networks. Reducing peak memory, which is the maximum memory consumed during the execution of a neural network, is critical to deploy neural networks on edge devices with limited memory budget. A naive approach to reducing peak memory is aggressive down-sampling of feature maps via pooling with large stride, which often results in unacceptable degradation in network performance. To mitigate this problem, we propose residual encoded distillation (ReDistill) for peak memory reduction in a teacher-student framework, in which a student network with less memory is derived from the teacher network using aggressive pooling. We apply our distillation method to multiple problems in computer vision including image classification and diffusion based image generation. For image classification, our method yields 2x-3.2x measured peak memory on an edge GPU with negligible degradation in accuracy for most CNN based architectures. Additionally, our method yields improved test accuracy for tiny vision transformer (ViT) based models distilled from large CNN based teacher architectures. For diffusion-based image generation, our proposed distillation method yields a denoising network with 4x lower theoretical peak memory while maintaining decent diversity and fidelity for image generation. Experiments demonstrate our method’s superior performance compared to other feature-based and response-based distillation methods.
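
To see why pooling with a large stride cuts memory so sharply, here is a rough back-of-the-envelope sketch. It only sizes a single feature map, whereas real peak memory depends on every tensor alive at the same time; the helper names, the 64x224x224 shape, and the 4-byte float assumption are all illustrative.

```python
def feature_map_bytes(channels, height, width, bytes_per_value=4):
    """Memory of one dense feature map, assuming 4-byte floats by default."""
    return channels * height * width * bytes_per_value


def pooled_shape(height, width, stride):
    """Output size of non-overlapping pooling (kernel == stride, no padding)."""
    return height // stride, width // stride


base = feature_map_bytes(64, 224, 224)
h, w = pooled_shape(224, 224, 4)
pooled = feature_map_bytes(64, h, w)
ratio = base / pooled  # a stride of s shrinks the map by a factor of s * s
```

The quadratic saving is what makes aggressive pooling attractive, and the lost spatial detail is what the distillation described above is meant to compensate for.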

[LG-112] Your Absorbing Discrete Diffusion Secretly Models the Conditional Distributions of Clean Data

Link: https://arxiv.org/abs/2406.03736
Authors: Jingyang Ou,Shen Nie,Kaiwen Xue,Fengqi Zhu,Jiacheng Sun,Zhenguo Li,Chongxuan Li
Keywords: absorbing discrete diffusion, concrete score, Discrete diffusion, Discrete diffusion models, processes have shown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Abstract:Discrete diffusion models with absorbing processes have shown promise in language modeling. The key quantities to be estimated are the ratios between the marginal probabilities of two transitive states at all timesteps, called the concrete score. In this paper, we reveal that the concrete score in absorbing diffusion can be expressed as conditional probabilities of clean data, multiplied by a time-dependent scalar in an analytic form. Motivated by the finding, we propose reparameterized absorbing discrete diffusion (RADD), a dedicated diffusion model that characterizes the time-independent conditional probabilities. Besides its simplicity, RADD can reduce the number of function evaluations (NFEs) by caching the output of the time-independent network when the noisy sample remains unchanged in a sampling interval. Empirically, RADD is up to 3.5 times faster while consistently achieving a better performance than the strongest baseline. Built upon the new factorization of the concrete score, we further prove a surprising result that the exact likelihood of absorbing diffusion can be rewritten to a simple form (named denoising cross-entropy) and then estimated efficiently by the Monte Carlo method. The resulting approach also applies to the original parameterization of the concrete score. It significantly advances the state-of-the-art discrete diffusion on 5 zero-shot language modeling benchmarks (measured by perplexity) at the GPT-2 scale.
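
The caching idea in this abstract can be sketched with a toy sampler: if the network takes only the current sequence as input (no timestep), its output can be reused whenever the sequence did not change in a step. Everything below is hypothetical, including the stand-in `toy_network` and the unmasking schedule; it illustrates the bookkeeping, not RADD itself.

```python
import random

MASK = "[M]"


def toy_network(tokens):
    """Hypothetical stand-in for a time-independent denoiser."""
    return ["a" if i % 2 == 0 else "b" for i in range(len(tokens))]


def sample_with_cache(length, steps, seed=0):
    rng = random.Random(seed)
    x = [MASK] * length
    cached_input, cached_output = None, None
    nfe = 0  # number of function evaluations actually spent
    for step in range(steps):
        # Call the network only if the sequence changed since the last call.
        if tuple(x) != cached_input:
            cached_output = toy_network(x)
            cached_input = tuple(x)
            nfe += 1
        # Toy schedule: the unmasking probability reaches 1 at the last step.
        p_unmask = 1.0 / (steps - step)
        for i, tok in enumerate(x):
            if tok == MASK and rng.random() < p_unmask:
                x[i] = cached_output[i]
    return x, nfe


tokens, nfe = sample_with_cache(length=4, steps=50)
```

With 4 positions and 50 steps the sequence can change in at most 4 steps, so the network is called at most 5 times instead of 50.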

[LG-113] Phase-Amplitude Reduction-Based Imitation Learning

Link: https://arxiv.org/abs/2406.03735
Authors: Satoshi Yamamori,Jun Morimoto
Keywords: phase-amplitude reduction method, imitation learning framework, phase-amplitude reduction, imitation learning, method
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: 18 pages, 5 figures

Abstract:In this study, we propose the use of the phase-amplitude reduction method to construct an imitation learning framework. Imitating human movement trajectories is recognized as a promising strategy for generating a range of human-like robot movements. Unlike previous dynamical system-based imitation learning approaches, our proposed method allows the robot not only to imitate a limit cycle trajectory but also to replicate the transient movement from the initial or disturbed state to the limit cycle. Consequently, our method offers a safer imitation learning approach that avoids generating unpredictable motions immediately after disturbances or from a specified initial state. We first validated our proposed method by reconstructing a simple limit-cycle attractor. We then compared the proposed approach with a conventional method on a lemniscate trajectory tracking task with a simulated robot arm. Our findings confirm that our proposed method can more accurately generate transient movements to converge on a target periodic attractor compared to the previous standard approach. Subsequently, we applied our method to a real robot arm to imitate periodic human movements.

[LG-114] Credit Card Fraud Detection Using Advanced Transformer Model

Link: https://arxiv.org/abs/2406.03733
Authors: Chang Yu,Yongshun Xu,Jin Cao,Ye Zhang,Yinxin Jin,Mengran Zhu
Keywords: mobile payment systems, credit card fraud, payment systems, credit card, financial security
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: This paper has been received by this https URL

Abstract:With the proliferation of various online and mobile payment systems, credit card fraud has emerged as a significant threat to financial security. This study focuses on innovative applications of the latest Transformer models for more robust and precise fraud detection. To ensure the reliability of the data, we meticulously processed the data sources, balancing the dataset to address the issue of data sparsity significantly. We also selected highly correlated vectors to strengthen the training process. To guarantee the reliability and practicality of the new Transformer model, we conducted performance comparisons with several widely adopted models, including Support Vector Machine (SVM), Random Forest, Neural Network, and Logistic Regression. We rigorously compared these models using metrics such as Precision, Recall, and F1 Score. Through these detailed analyses and comparisons, we present to the readers a highly efficient and powerful anti-fraud mechanism with promising prospects. The results demonstrate that the Transformer model not only excels in traditional applications but also shows great potential in niche areas like fraud detection, offering a substantial advancement in the field.

[LG-115] Quality-Diversity with Limited Resources

Link: https://arxiv.org/abs/2406.03731
Authors: Ren-Jian Wang,Ke Xue,Cong Guan,Chao Qian
Keywords: powerful optimization paradigm, diverse solutions, powerful optimization, optimization paradigm, aim of generating
Categories: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments: ICML 2024

Abstract:Quality-Diversity (QD) algorithms have emerged as a powerful optimization paradigm with the aim of generating a set of high-quality and diverse solutions. To achieve such a challenging goal, QD algorithms require maintaining a large archive and a large population in each iteration, which brings two main issues, sample and resource efficiency. Most advanced QD algorithms focus on improving the sample efficiency, while the resource efficiency is overlooked to some extent. Particularly, the resource overhead during the training process has not been touched yet, hindering the wider application of QD algorithms. In this paper, we highlight this important research question, i.e., how to efficiently train QD algorithms with limited resources, and propose a novel and effective method called RefQD to address it. RefQD decomposes a neural network into representation and decision parts, and shares the representation part with all decision parts in the archive to reduce the resource overhead. It also employs a series of strategies to address the mismatch issue between the old decision parts and the newly updated representation part. Experiments on different types of tasks from small to large resource consumption demonstrate the excellent performance of RefQD: it not only uses significantly fewer resources (e.g., 16% GPU memories on QDax and 3.7% on Atari) but also achieves comparable or better performance compared to sample-efficient QD algorithms. Our code is available at this https URL.

[LG-116] FastGAS: Fast Graph-based Annotation Selection for In-Context Learning

Link: https://arxiv.org/abs/2406.03730
Authors: Zihan Chen,Song Wang,Cong Shen,Jundong Li
Keywords: empowers large language, large language models, In-context learning, empowers large, language models
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:In-context learning (ICL) empowers large language models (LLMs) to tackle new tasks by using a series of training instances as prompts. Since generating the prompts needs to sample from a vast pool of instances and annotate them (e.g., add labels in classification task), existing methods have proposed to select a subset of unlabeled examples for annotation, thus enhancing the quality of prompts and concurrently mitigating annotation costs. However, these methods often require a long time to select instances due to their complexity, hindering their practical viability. To address this limitation, we propose a graph-based selection method, FastGAS, designed to efficiently identify high-quality instances while minimizing computational overhead. Initially, we construct a data similarity graph based on instance similarities. Subsequently, employing a graph partitioning algorithm, we partition the graph into pieces. Within each piece (i.e., subgraph), we adopt a greedy approach to pick the most representative nodes. By aggregating nodes from diverse pieces and annotating the corresponding instances, we identify a set of diverse and representative instances for ICL. Compared to prior approaches, our method not only exhibits superior performance on different tasks but also significantly reduces selection time. In addition, we demonstrate the efficacy of our approach in LLMs of larger sizes.
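
The partition-then-pick pipeline described above can be sketched as follows. The abstract does not say which partitioning algorithm is used or how "most representative" is scored, so this sketch assumes the pieces are already given and swaps in a simple greedy coverage rule (pick the node that covers the most not-yet-covered nodes in its piece). All names here are hypothetical.

```python
def greedy_pick(adjacency, piece, budget):
    """Pick up to `budget` nodes that cover the most uncovered nodes in `piece`.

    A node covers itself and its neighbours inside the same piece.
    """
    piece_set = set(piece)
    covered, chosen = set(), []
    for _ in range(budget):
        best, best_gain = None, 0
        for node in piece:
            if node in chosen:
                continue
            reach = ({node} | (adjacency.get(node, set()) & piece_set)) - covered
            if len(reach) > best_gain:
                best, best_gain = node, len(reach)
        if best is None:  # nothing left to gain in this piece
            break
        chosen.append(best)
        covered |= {best} | (adjacency.get(best, set()) & piece_set)
    return chosen


def select_for_annotation(adjacency, pieces, budget_per_piece=1):
    """Aggregate the greedy picks from every piece of the partitioned graph."""
    selected = []
    for piece in pieces:
        selected.extend(greedy_pick(adjacency, piece, budget_per_piece))
    return selected


# Toy similarity graph with two pieces (the partition is assumed precomputed).
adjacency = {0: {1, 2}, 1: {0}, 2: {0}, 3: {4}, 4: {3, 5}, 5: {4}}
pieces = [[0, 1, 2], [3, 4, 5]]
selected = select_for_annotation(adjacency, pieces)
```

Picking within pieces rather than over the whole graph keeps each greedy search small and spreads the selected instances across the data.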

[LG-117] Enhancing Sign Language Detection through Mediapipe and Convolutional Neural Networks (CNN)

Link: https://arxiv.org/abs/2406.03729
Authors: Aditya Raj Verma,Gagandeep Singh,Karnim Meghwal,Banawath Ramji,Praveen Kumar Dadheech
Keywords: American Sign Language, ASL dataset, sign language, research combines MediaPipe, ASL
Categories: Machine Learning (cs.LG)

Abstract:This research combines MediaPipe and CNNs to interpret an ASL dataset efficiently and accurately for the real-time detection of sign language. The system presented here captures and processes hand gestures in real time. The intended purpose was to create an easy, accurate, and fast way of entering commands without the need to touch anything. MediaPipe provides a powerful framework for real-time hand tracking, used here to capture and preprocess hand movements, which increases the accuracy of the gesture recognition system. Integrating the CNN with MediaPipe results in higher efficiency for real-time processing. The model was tested on American Sign Language (ASL) datasets, where it achieved an accuracy of 99.12%. The results were then compared to those of existing methods, using established evaluation techniques, to evaluate how well it performed. The system will have applications in the communication, education, and accessibility domains. Improving systems such as the one described in this paper will assist people with hearing impairment and make things more accessible to them. We tested the recognition and translation performance on an ASL dataset and achieved better accuracy than previous methods. The aim of the research is to identify the characters of American Sign Language from hand images taken with a web camera, using MediaPipe and CNNs.

[LG-118] Efficient Graph Encoder Embedding for Large Sparse Graphs in Python

Link: https://arxiv.org/abs/2406.03726
Authors: Xihan Qin,Cencheng Shen
Keywords: generating fixed-sized attributes, prevalent machine learning, capturing key features, machine learning technique, Graph Encoder Embedding
Categories: Machine Learning (cs.LG)

Abstract:Graph is a ubiquitous representation of data in various research fields, and graph embedding is a prevalent machine learning technique for capturing key features and generating fixed-sized attributes. However, most state-of-the-art graph embedding methods are computationally and spatially expensive. Recently, the Graph Encoder Embedding (GEE) has been shown as the fastest graph embedding technique and is suitable for a variety of network data applications. As real-world data often involves large and sparse graphs, the huge sparsity usually results in redundant computations and storage. To address this issue, we propose an improved version of GEE, sparse GEE, which optimizes the calculation and storage of zero entries in sparse matrices to enhance the running time further. Our experiments demonstrate that the sparse version achieves significant speedup compared to the original GEE with Python implementation for large sparse graphs, and sparse GEE is capable of processing millions of edges within minutes on a standard laptop.
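
A minimal sketch of the idea, assuming the commonly stated GEE formulation Z = AW, where W[j][k] = 1/n_k if node j belongs to class k and 0 otherwise. Working from an edge list and storing only non-zero entries avoids touching the zeros of A at all. This is an illustration in plain Python, not the authors' implementation, and the function name is hypothetical.

```python
from collections import Counter, defaultdict


def sparse_gee(edges, labels):
    """Edge-list graph encoder embedding that stores only non-zero entries.

    edges:  iterable of (i, j, weight) for an undirected graph
    labels: dict mapping node -> class index
    Returns a dict mapping node -> {class index: embedding value}.
    """
    class_sizes = Counter(labels.values())
    embedding = defaultdict(dict)
    for i, j, w in edges:
        # Each undirected edge contributes in both directions (once for a self-loop).
        pairs = ((i, j),) if i == j else ((i, j), (j, i))
        for src, dst in pairs:
            k = labels[dst]
            embedding[src][k] = embedding[src].get(k, 0.0) + w / class_sizes[k]
    return dict(embedding)


edges = [(0, 1, 1.0), (1, 2, 1.0), (2, 3, 1.0)]
labels = {0: 0, 1: 0, 2: 1, 3: 1}
Z = sparse_gee(edges, labels)
```

The cost is one pass over the edges, so the running time scales with the number of edges rather than with the square of the number of nodes.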

[LG-119] Offline Multi-Objective Optimization

Link: https://arxiv.org/abs/2406.03722
Authors: Ke Xue,Rong-Xi Tan,Xiaobin Huang,Chao Qian
Keywords: offline MOO, MOO, black-box objective function, Offline, objective function
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: ICML 2024

Abstract:Offline optimization aims to maximize a black-box objective function with a static dataset and has wide applications. In addition to the objective function being black-box and expensive to evaluate, numerous complex real-world problems entail optimizing multiple conflicting objectives, i.e., multi-objective optimization (MOO). Nevertheless, offline MOO has not progressed as much as offline single-objective optimization (SOO), mainly due to the lack of benchmarks like Design-Bench for SOO. To bridge this gap, we propose a first benchmark for offline MOO, covering a range of problems from synthetic to real-world tasks. This benchmark provides tasks, datasets, and open-source examples, which can serve as a foundation for method comparisons and advancements in offline MOO. Furthermore, we analyze how the current related methods can be adapted to offline MOO from four fundamental perspectives, including data, model architecture, learning algorithm, and search algorithm. Empirical results show improvements over the best value of the training set, demonstrating the effectiveness of offline MOO methods. As no particular method stands out significantly, there is still an open challenge in further enhancing the effectiveness of offline MOO. We finally discuss future challenges for offline MOO, with the hope of shedding some light on this emerging field. Our code is available at this https URL.
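
With several conflicting objectives there is usually no single best design, only a set of non-dominated trade-offs. A minimal sketch of extracting that set from a static dataset, assuming every objective is to be maximized (function names are illustrative):

```python
def dominates(a, b):
    """True if `a` is at least as good as `b` everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points):
    """Keep the points that no other point dominates (quadratic-time scan)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Each tuple holds the two objective values of one design in the dataset.
points = [(1.0, 5.0), (2.0, 4.0), (1.5, 3.0), (3.0, 1.0), (0.5, 0.5)]
front = pareto_front(points)
```

Here (1.5, 3.0) drops out because (2.0, 4.0) beats it on both objectives, while the three remaining front points each win on at least one objective.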

[LG-120] A Survey on Medical Large Language Models: Technology, Application, Trustworthiness, and Future Directions

Link: https://arxiv.org/abs/2406.03712
Authors: Lei Liu,Xiaoyan Yang,Junchi Lei,Xiaoyang Liu,Yue Shen,Zhiqiang Zhang,Peng Wei,Jinjie Gu,Zhixuan Chu,Zhan Qin,Kui Ren
Keywords: GPT series models, Large language models, received substantial attention, substantial attention due, understanding human-level language
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:Large language models (LLMs), such as GPT series models, have received substantial attention due to their impressive capabilities for generating and understanding human-level language. More recently, LLMs have emerged as an innovative and powerful adjunct in the medical field, transforming traditional practices and heralding a new era of enhanced healthcare services. This survey provides a comprehensive overview of Medical Large Language Models (Med-LLMs), outlining their evolution from general to the medical-specific domain (i.e., Technology and Application), as well as their transformative impact on healthcare (e.g., Trustworthiness and Safety). Concretely, starting from the fundamental history and technology of LLMs, we first delve into the progressive adaptation and refinements of general LLM models in the medical domain, especially emphasizing the advanced algorithms that boost the LLMs’ performance in handling complicated medical environments, including clinical reasoning, knowledge graph, retrieval-augmented generation, human alignment, and multi-modal learning. Secondly, we explore the extensive applications of Med-LLMs across domains such as clinical decision support, report generation, and medical education, illustrating their potential to streamline healthcare services and augment patient outcomes. Thirdly, recognizing the imperative of responsible innovation, we discuss the challenges of ensuring fairness, accountability, privacy, and robustness in Med-LLMs applications. Finally, we conduct a concise discussion for anticipating possible future trajectories of Med-LLMs, identifying avenues for the prudent expansion of Med-LLMs. By consolidating above-mentioned insights, this review seeks to provide a comprehensive investigation of the potential strengths and limitations of Med-LLMs for professionals and researchers, ensuring a responsible landscape in the healthcare setting.

[LG-121] TwinS: Revisiting Non-Stationarity in Multivariate Time Series Forecasting

Link: https://arxiv.org/abs/2406.03710
Authors: Jiaxi Hu,Qingsong Wen,Sijie Ruan,Li Liu,Yuxuan Liang
Keywords: significant practical applications, series forecasting tasks, garnered increasing attention, increasing attention due, time series forecasting
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Recently, multivariate time series forecasting tasks have garnered increasing attention due to their significant practical applications, leading to the emergence of various deep forecasting models. However, real-world time series exhibit pronounced non-stationary distribution characteristics. These characteristics are not solely limited to time-varying statistical properties highlighted by non-stationary Transformer but also encompass three key aspects: nested periodicity, absence of periodic distributions, and hysteresis among time variables. In this paper, we begin by validating this theory through wavelet analysis and propose the Transformer-based TwinS model, which consists of three modules to address the non-stationary periodic distributions: Wavelet Convolution, Period-Aware Attention, and Channel-Temporal Mixed MLP. Specifically, the Wavelet Convolution models nested periods by scaling the convolution kernel size like wavelet transform. The Period-Aware Attention guides attention computation by generating period relevance scores through a convolutional sub-network. The Channel-Temporal Mixed MLP captures the overall relationships between time series through channel-time mixing learning. TwinS achieves SOTA performance compared to mainstream TS models, with a maximum improvement in MSE of 25.8% over PatchTST.

[LG-122] What Should Embeddings Embed? Autoregressive Models Represent Latent Generating Distributions

Link: https://arxiv.org/abs/2406.03707
Authors: Liyi Zhang,Michael Y. Li,Thomas L. Griffiths
Keywords: Autoregressive language models, extract latent structure, large language models, language models, structure from text
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
Comments: 15 pages, 8 figures

Abstract:Autoregressive language models have demonstrated a remarkable ability to extract latent structure from text. The embeddings from large language models have been shown to capture aspects of the syntax and semantics of language. But what should embeddings represent? We connect the autoregressive prediction objective to the idea of constructing predictive sufficient statistics to summarize the information contained in a sequence of observations, and use this connection to identify three settings where the optimal content of embeddings can be identified: independent identically distributed data, where the embedding should capture the sufficient statistics of the data; latent state models, where the embedding should encode the posterior distribution over states given the data; and discrete hypothesis spaces, where the embedding should reflect the posterior distribution over hypotheses given the data. We then conduct empirical probing studies to show that transformers encode these three kinds of latent generating distributions, and that they perform well in out-of-distribution cases and without token memorization in these settings.

[LG-123] Excluding the Irrelevant: Focusing Reinforcement Learning through Continuous Action Masking

Link: https://arxiv.org/abs/2406.03704
Authors: Roland Stolz,Hanna Krasowski,Jakob Thumm,Michael Eichelbeck,Philipp Gassert,Matthias Althoff
Keywords: relevant actions, action, Continuous action spaces, relevant, commonly defined
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY)

Abstract:Continuous action spaces in reinforcement learning (RL) are commonly defined as interval sets. While intervals usually reflect the action boundaries for tasks well, they can be challenging for learning because the typically large global action space leads to frequent exploration of irrelevant actions. Yet, little task knowledge can be sufficient to identify significantly smaller state-specific sets of relevant actions. Focusing learning on these relevant actions can significantly improve training efficiency and effectiveness. In this paper, we propose to focus learning on the set of relevant actions and introduce three continuous action masking methods for exactly mapping the action space to the state-dependent set of relevant actions. Thus, our methods ensure that only relevant actions are executed, enhancing the predictability of the RL agent and enabling its use in safety-critical applications. We further derive the implications of the proposed methods on the policy gradient. Using Proximal Policy Optimization (PPO), we evaluate our methods on three control tasks, where the relevant action set is computed based on the system dynamics and a relevant state set. Our experiments show that the three action masking methods achieve higher final rewards and converge faster than the baseline without action masking.
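
The core idea, mapping the global action interval onto a smaller state-dependent set, can be illustrated in one dimension. The sketch below is not one of the three methods introduced in the paper; it assumes the relevant set is itself an interval and uses a plain affine rescaling. The toy task (keep a position inside [-limit, limit]) and both function names are hypothetical.

```python
def mask_action(raw_action, global_low, global_high, relevant_low, relevant_high):
    """Affinely rescale an action from the global interval to the relevant interval."""
    if not (global_low < global_high and relevant_low <= relevant_high):
        raise ValueError("invalid bounds")
    clipped = min(max(raw_action, global_low), global_high)
    fraction = (clipped - global_low) / (global_high - global_low)
    return relevant_low + fraction * (relevant_high - relevant_low)


def relevant_interval(position, limit=1.0, max_step=0.5):
    """Hypothetical task knowledge: steps that keep the position within [-limit, limit]."""
    return max(-max_step, -limit - position), min(max_step, limit - position)


position = 0.8
low, high = relevant_interval(position)
# The policy outputs in [-1, 1]; even its most extreme action stays relevant.
step = mask_action(1.0, -1.0, 1.0, low, high)
```

Because every policy output lands inside the relevant interval, exploration never spends samples on steps that would leave the allowed region.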

[LG-124] Synthesizing Conversations from Unlabeled Documents using Automatic Response Segmentation

Link: https://arxiv.org/abs/2406.03703
Authors: Fanyou Wu,Weijie Xu,Chandan K. Reddy,Srinivasan H. Sengamedu
Keywords: conversational question answering, costly training data, question answering, tackle the challenge, challenge of inadequate
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Findings of ACL 2024

Abstract:In this study, we tackle the challenge of inadequate and costly training data that has hindered the development of conversational question answering (ConvQA) systems. Enterprises have a large corpus of diverse internal documents. Instead of relying on a search engine, a more compelling approach for people to comprehend these documents is to create a dialogue system. In this paper, we propose a robust dialog synthesising method. We learn the segmentation of data for the dialog task instead of segmenting at sentence boundaries. The synthetic dataset generated by our proposed method achieves superior quality when compared to WikiDialog, as assessed through machine and human evaluations. By employing our inpainted data for ConvQA retrieval system pre-training, we observed a notable improvement in performance across OR-QuAC benchmarks.

[LG-125] BindGPT: A Scalable Framework for 3D Molecular Design via Language Modeling and Reinforcement Learning

Link: https://arxiv.org/abs/2406.03686
Authors: Artem Zholus,Maksim Kuznetsov,Roman Schutski,Rim Shayakhmetov,Daniil Polykovskiy,Sarath Chandar,Alex Zhavoronkov
Keywords: extremely challenging task, complex physical interactions, Generating novel active, extremely challenging, challenging task
Categories: Machine Learning (cs.LG)

Abstract:Generating novel active molecules for a given protein is an extremely challenging task for generative models that requires an understanding of the complex physical interactions between the molecule and its environment. In this paper, we present a novel generative model, BindGPT, which uses a conceptually simple but powerful approach to create 3D molecules within the protein’s binding site. Our model produces molecular graphs and conformations jointly, eliminating the need for an extra graph reconstruction step. We pretrain BindGPT on a large-scale dataset and fine-tune it with reinforcement learning using scores from external simulation software. We demonstrate how a single pretrained language model can serve at the same time as a 3D molecular generative model, a conformer generator conditioned on the molecular graph, and a pocket-conditioned 3D molecule generator. Notably, the model does not make any representational equivariance assumptions about the domain of generation. We show how such a simple conceptual approach combined with pretraining and scaling can perform on par with or better than the current best specialized diffusion models, language models, and graph neural networks while being two orders of magnitude cheaper to sample.

[LG-126] Bayesian Power Steering: An Effective Approach for Domain Adaptation of Diffusion Models

Link: https://arxiv.org/abs/2406.03683
Authors: Ding Huang,Ting Li,Jian Huang
Keywords: Bayesian Power Steering, called Bayesian Power, structure called Bayesian, Power Steering, network structure called
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 25 pages, 26 figures, and 4 tables

Abstract:We propose a Bayesian framework for fine-tuning large diffusion models with a novel network structure called Bayesian Power Steering (BPS). We clarify the meaning behind adaptation from a large probability space to a small probability space and explore the task of fine-tuning pre-trained models using learnable modules from a Bayesian perspective. BPS extracts task-specific knowledge from a pre-trained model’s learned prior distribution. It efficiently leverages large diffusion models, differentially intervening different hidden features with a head-heavy and foot-light configuration. Experiments highlight the superiority of BPS over contemporary methods across a range of tasks even with a limited amount of data. Notably, BPS attains an FID score of 10.49 under the sketch condition on the COCO17 dataset.

[LG-127] A Universal Class of Sharpness-Aware Minimization Algorithms

Link: https://arxiv.org/abs/2406.03682
Authors: Behrooz Tahmasebi,Ashkan Soleymani,Dara Bahri,Stefanie Jegelka,Patrick Jaillet
Keywords: developing optimization algorithms, training loss Hessian, suitable biases, sharpness measures, achieving generalization
Categories: Machine Learning (cs.LG)
Comments: ICML 2024. Code is available at this http URL

Abstract:Recently, there has been a surge in interest in developing optimization algorithms for overparameterized models as achieving generalization is believed to require algorithms with suitable biases. This interest centers on minimizing sharpness of the original loss function; the Sharpness-Aware Minimization (SAM) algorithm has proven effective. However, most literature only considers a few sharpness measures, such as the maximum eigenvalue or trace of the training loss Hessian, which may not yield meaningful insights for non-convex optimization scenarios like neural networks. Additionally, many sharpness measures are sensitive to parameter invariances in neural networks, magnifying significantly under rescaling parameters. Motivated by these challenges, we introduce a new class of sharpness measures in this paper, leading to new sharpness-aware objective functions. We prove that these measures are universally expressive, allowing any function of the training loss Hessian matrix to be represented by appropriate hyperparameters. Furthermore, we show that the proposed objective functions explicitly bias towards minimizing their corresponding sharpness measures, and how they allow meaningful applications to models with parameter invariances (such as scale-invariances). Finally, as instances of our proposed general framework, we present Frob-SAM and Det-SAM, which are specifically designed to minimize the Frobenius norm and the determinant of the Hessian of the training loss, respectively. We also demonstrate the advantages of our general framework through extensive experiments.
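
For readers unfamiliar with the baseline, here is a minimal sketch of the standard SAM update that the abstract builds on: first move to a nearby point in the normalized gradient direction, then descend using the gradient measured there. It does not implement the new sharpness measures, Frob-SAM, or Det-SAM; the toy quadratic loss and the values of `lr` and `rho` are assumptions chosen for illustration.

```python
import math


def loss(w):
    """Toy loss L(w) = 0.5 * (10 * w0^2 + w1^2), sharper along w0 than along w1."""
    return 0.5 * (10.0 * w[0] ** 2 + w[1] ** 2)


def loss_grad(w):
    """Gradient of the toy loss."""
    return [10.0 * w[0], w[1]]


def sam_step(w, lr=0.05, rho=0.1):
    """One SAM update: perturb towards higher loss, then descend from there."""
    g = loss_grad(w)
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        return list(w)
    # Ascent step of radius rho along the normalized gradient.
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # Descent step on the original weights using the perturbed-point gradient.
    g_adv = loss_grad(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]


w = [1.0, 1.0]
initial_loss = loss(w)
for _ in range(100):
    w = sam_step(w)
final_loss = loss(w)
```

Compared with plain gradient descent, the only extra cost is the second gradient evaluation at the perturbed point.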

[LG-128] Meta-learning for Positive-unlabeled Classification

Link: https://arxiv.org/abs/2406.03680
Authors: Atsutoshi Kumagai,Tomoharu Iwata,Yasuhiro Fujiwara
Keywords: unseen target tasks, propose a meta-learning, improves the performance, performance of binary, unseen target
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 21 pages

Abstract:We propose a meta-learning method for positive and unlabeled (PU) classification, which improves the performance of binary classifiers obtained from only PU data in unseen target tasks. PU learning is an important problem since PU data naturally arise in real-world applications such as outlier detection and information retrieval. Existing PU learning methods require many PU data, but sufficient data are often unavailable in practice. The proposed method minimizes the test classification risk after the model is adapted to PU data by using related tasks that consist of positive, negative, and unlabeled data. We formulate the adaptation as an estimation problem of the Bayes optimal classifier, which is an optimal classifier to minimize the classification risk. The proposed method embeds each instance into a task-specific space using neural networks. With the embedded PU data, the Bayes optimal classifier is estimated through density-ratio estimation of PU densities, whose solution is obtained as a closed-form solution. The closed-form solution enables us to efficiently and effectively minimize the test classification risk. We empirically show that the proposed method outperforms existing methods with one synthetic and three real-world datasets.

[LG-129] On the Effects of Data Scale on Computer Control Agents

Link: https://arxiv.org/abs/2406.03679
Authors: Wei Li,William Bishop,Alice Li,Chris Rawles,Folawiyo Campbell-Ajala,Divya Tyamagundlu,Oriana Riva
Keywords: Autonomous agents, accomplish human tasks, interfaces to accomplish, accomplish human, control computer interfaces
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Autonomous agents that control computer interfaces to accomplish human tasks are emerging. Leveraging LLMs to power such agents has been of special interest, but unless fine-tuned on human-collected task demonstrations, performance is still relatively low. In this work we study whether fine-tuning alone is a viable approach for building real-world computer control agents. In particular, we investigate how performance measured on both high and low-level tasks in domain and out of domain scales as more training data is collected. To this end we collect and release a new dataset, AndroidControl, consisting of 15,283 demonstrations of everyday tasks with Android apps. Compared to existing datasets, each AndroidControl task instance includes both high and low-level human-generated instructions, allowing us to explore the level of task complexity an agent can handle. Moreover, AndroidControl is the most diverse computer control dataset to date, including 15,283 unique tasks over 833 Android apps, thus allowing us to conduct in-depth analysis of the model performance in and out of the domain of the training data. Using the dataset, we find that when tested in domain, fine-tuned models outperform zero- and few-shot baselines and scale in such a way that robust performance might feasibly be obtained simply by collecting more data. Out of domain, performance scales significantly more slowly and suggests that in particular for high-level tasks, fine-tuning on more data alone may be insufficient for achieving robust out-of-domain performance.

[LG-130] Reflective Policy Optimization

Link: https://arxiv.org/abs/2406.03678
Authors: Yaozhong Gan,Renye Yan,Zhe Wu,Junliang Xing
Keywords: Region Policy Optimization, Proximal Policy Optimization, Trust Region Policy, Reflective Policy Optimization, Policy Optimization
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 20 pages

Abstract:On-policy reinforcement learning methods, like Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO), often demand extensive data per update, leading to sample inefficiency. This paper introduces Reflective Policy Optimization (RPO), a novel on-policy extension that combines past and future state-action information for policy optimization. This approach enables the agent to introspect and modify its actions within the current state. Theoretical analysis confirms that RPO monotonically improves policy performance and contracts the solution space, consequently speeding up convergence. Empirical results demonstrate RPO’s feasibility and efficacy on two reinforcement learning benchmarks, culminating in superior sample efficiency. The source code of this work is available at this https URL.

[LG-131] PANDA: Expanded Width-Aware Message Passing Beyond Rewiring

Link: https://arxiv.org/abs/2406.03671
Authors: Jeongwhan Choi,Sumin Park,Hyowon Wi,Sung-Bae Cho,Noseong Park
Keywords: graph neural network, Recent research, neural network, identified a critical, critical issue
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at ICML 2024

Abstract:Recent research on graph neural networks (GNNs) has identified a critical issue known as “over-squashing,” resulting from the bottleneck phenomenon in graph structures, which impedes the propagation of long-range information. Prior works have proposed a variety of graph rewiring concepts that aim at optimizing the spatial or spectral properties of graphs to promote signal propagation. However, such approaches inevitably deteriorate the original graph topology, which may lead to a distortion of information flow. To address this, we introduce expanded width-aware (PANDA) message passing, a new message passing paradigm where nodes with high centrality, a potential source of over-squashing, are selectively expanded in width to encapsulate the growing influx of signals from distant nodes. Experimental results show that our method outperforms existing rewiring methods, suggesting that selectively expanding the hidden state of nodes can be a compelling alternative to graph rewiring for addressing over-squashing.

[LG-132] Towards Dynamic Trend Filtering through Trend Point Detection with Reinforcement Learning

Link: https://arxiv.org/abs/2406.03665
Authors: Jihyeon Seong,Sekwang Oh,Jaesik Choi
Keywords: simplifies complex time, filtering simplifies complex, time series data, complex time series, Trend filtering simplifies
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 18 pages, 11 figures

Abstract:Trend filtering simplifies complex time series data by applying smoothness to filter out noise while emphasizing proximity to the original data. However, existing trend filtering methods fail to reflect abrupt changes in the trend due to ‘approximateness,’ resulting in constant smoothness. This approximateness uniformly filters out the tail distribution of time series data, characterized by extreme values, including both abrupt changes and noise. In this paper, we propose Trend Point Detection formulated as a Markov Decision Process (MDP), a novel approach to identifying essential points that should be reflected in the trend, departing from approximations. We term these essential points as Dynamic Trend Points (DTPs) and extract trends by interpolating them. To identify DTPs, we utilize Reinforcement Learning (RL) within a discrete action space and a forecasting sum-of-squares loss function as a reward, referred to as the Dynamic Trend Filtering network (DTF-net). DTF-net integrates flexible noise filtering, preserving critical original subsequences while removing noise as required for other subsequences. We demonstrate that DTF-net excels at capturing abrupt changes compared to other trend filtering algorithms and enhances forecasting performance, as abrupt changes are predicted rather than smoothed out.
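The step "extract trends by interpolating them" can be pictured with a minimal sketch. This is not DTF-net: the indices of the trend points are simply given by hand here, whereas the paper selects them with an RL agent. The function name `interpolate_trend` and the choice of piecewise-linear interpolation are assumptions made for illustration only.

```python
def interpolate_trend(series, trend_points):
    """Piecewise-linear trend through hand-picked indices (hypothetical helper).

    Values before the first and after the last selected index are held constant.
    """
    pts = sorted(set(trend_points))
    if not pts:
        raise ValueError("need at least one trend point")
    trend = []
    for t in range(len(series)):
        if t <= pts[0]:
            trend.append(float(series[pts[0]]))
        elif t >= pts[-1]:
            trend.append(float(series[pts[-1]]))
        else:
            # Find the pair of selected indices that brackets t.
            for left, right in zip(pts, pts[1:]):
                if left <= t <= right:
                    w = (t - left) / (right - left)
                    trend.append((1 - w) * series[left] + w * series[right])
                    break
    return trend


# A noisy series with an abrupt level shift between index 3 and index 4.
series = [0.0, 0.3, -0.2, 0.1, 5.0, 5.2, 4.9, 5.1]
# Keeping both sides of the jump as trend points preserves the abrupt change.
trend = interpolate_trend(series, [0, 3, 4, 7])
```

Because indices 3 and 4 are both kept, the jump survives in the trend while the noise inside each flat segment is smoothed away.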

[LG-133] The Missing Curve Detectors of InceptionV1: Applying Sparse Autoencoders to InceptionV1 Early Vision

Link: https://arxiv.org/abs/2406.03662
Authors: Liv Gorton
Keywords: Recent work, sparse autoencoders, caused by superposition, work on sparse, shown promise
Subjects: Machine Learning (cs.LG)

Abstract:Recent work on sparse autoencoders (SAEs) has shown promise in extracting interpretable features from neural networks and addressing challenges with polysemantic neurons caused by superposition. In this paper, we apply SAEs to the early vision layers of InceptionV1, a well-studied convolutional neural network, with a focus on curve detectors. Our results demonstrate that SAEs can uncover new interpretable features not apparent from examining individual neurons, including additional curve detectors that fill in previous gaps. We also find that SAEs can decompose some polysemantic neurons into more monosemantic constituent features. These findings suggest SAEs are a valuable tool for understanding InceptionV1, and convolutional neural networks more generally.

[LG-134] Inductive Generalization in Reinforcement Learning from Specifications

Link: https://arxiv.org/abs/2406.03651
Authors: Vignesh Subramanian,Rohit Kushwah,Subhajit Roy,Suguman Bansal
Keywords: logical specifications, inductive, natural inductive structure, Abstract, inductive generalization framework
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)

Abstract:We present a novel inductive generalization framework for RL from logical specifications. Many interesting tasks in RL environments have a natural inductive structure. These inductive tasks have similar overarching goals but they differ inductively in low-level predicates and distributions. We present a generalization procedure that leverages this inductive relationship to learn a higher-order function, a policy generator, that generates appropriately adapted policies for instances of an inductive task in a zero-shot manner. An evaluation of the proposed approach on a set of challenging control benchmarks demonstrates the promise of our framework in generalizing to unseen policies for long-horizon tasks.

[LG-135] Decision-focused Graph Neural Networks for Combinatorial Optimization

Link: https://arxiv.org/abs/2406.03647
Authors: Yang Liu,Chuan Zhou,Peng Zhang,Shirui Pan,Zhao Li,Hongyang Chen
Keywords: investigating combinatorial optimization, recent years, notable interest, interest in investigating, investigating combinatorial
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 9 pages

Abstract:In recent years, there has been notable interest in tackling combinatorial optimization (CO) problems with neural frameworks. An emerging strategy for these challenging problems is to adopt graph neural networks (GNNs) as an alternative to traditional algorithms, a subject that has attracted considerable attention. Despite the growing popularity of both GNNs and traditional algorithmic solvers in CO, there is limited research on their integrated use and on the correlation between them within an end-to-end framework. The primary focus of our work is to formulate a more efficient and precise framework for CO by employing decision-focused learning on graphs. Additionally, we introduce a decision-focused framework that utilizes GNNs to address CO problems with auxiliary support. To realize an end-to-end approach, we have designed two cascaded modules: (a) a graph predictive model trained without supervision, and (b) a solver for quadratic unconstrained binary optimization. Empirical evaluations are conducted on various classical tasks, including maximum cut (MaxCut), maximum independent set (MIS), and minimum vertex cover (MVC). The experimental results on these problems demonstrate the superiority of our method over both the standalone GNN approach and classical methods.
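For readers unfamiliar with the quadratic binary formulation mentioned in the abstract, here is a minimal sketch of how maximum cut can be written as a quadratic unconstrained binary optimization (QUBO) problem. It is a textbook encoding solved by brute force on a toy graph, not the paper's GNN pipeline or solver; the helper names `maxcut_qubo`, `qubo_energy`, `cut_size`, and `brute_force_min` are made up for this illustration.

```python
from itertools import product


def maxcut_qubo(n, edges):
    """Build a symmetric matrix Q such that x^T Q x == -cut(x) for binary x."""
    Q = [[0] * n for _ in range(n)]
    for i, j in edges:
        Q[i][i] -= 1  # each endpoint contributes -1 when it is on side 1
        Q[j][j] -= 1
        Q[i][j] += 1  # +2 in total when both endpoints are on side 1
        Q[j][i] += 1
    return Q


def qubo_energy(Q, x):
    n = len(x)
    return sum(Q[i][j] * x[i] * x[j] for i in range(n) for j in range(n))


def cut_size(edges, x):
    return sum(1 for i, j in edges if x[i] != x[j])


def brute_force_min(Q):
    """Exhaustive search; only feasible for very small graphs."""
    return min(product((0, 1), repeat=len(Q)), key=lambda x: qubo_energy(Q, x))


# Toy instance: a 4-cycle, whose maximum cut contains all 4 edges.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
Q = maxcut_qubo(4, edges)
best = brute_force_min(Q)
```

Minimizing the QUBO energy is equivalent to maximizing the cut, which is why one binary quadratic solver can serve several graph problems once each is encoded as a suitable Q.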

[LG-136] Is Free Self-Alignment Possible?

Link: https://arxiv.org/abs/2406.03642
Authors: Dyah Adila,Changho Shin,Yijing Zhang,Frederic Sala
Keywords: Aligning pretrained language, Aligning pretrained, resource-intensive process, substantial compute, complex and resource-intensive
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:Aligning pretrained language models (LMs) is a complex and resource-intensive process, often requiring access to large amounts of ground-truth preference data and substantial compute. Are these costs necessary? That is, is it possible to align using only inherent model knowledge and without additional training? We tackle this challenge with AlignEZ, a novel approach that uses (1) self-generated preference data and (2) representation editing to provide nearly cost-free alignment. During inference, AlignEZ modifies LM representations to reduce undesirable and boost desirable components using subspaces identified via self-generated preference pairs. Our experiments reveal that this nearly cost-free procedure significantly narrows the gap between base pretrained and tuned models by an average of 31.6%, observed across six datasets and three model architectures. Additionally, we explore the potential of using AlignEZ as a means of expediting more expensive alignment procedures. Our experiments show that AlignEZ improves DPO models tuned only using a small subset of ground-truth preference data. Lastly, we study the conditions under which improvement using AlignEZ is feasible, providing valuable insights into its effectiveness.

[LG-137] Synthetic Programming Elicitation and Repair for Text-to-Code in Very Low-Resource Programming Languages

Link: https://arxiv.org/abs/2406.03636
Authors: Federico Mora,Justin Wong,Haley Lepe,Sahil Bhatia,Karim Elmaaroufi,George Varghese,Joseph E. Gonzalez,Elizabeth Polgreen,Sanjit A. Seshia
Keywords: demonstrated remarkable zero-shot, remarkable zero-shot fluency, related tasks ranging, Recent advances, challenging code related
Subjects: Programming Languages (cs.PL); Machine Learning (cs.LG)
Comments: 15 pages, 6 figures, 1 table

Abstract:Recent advances in large language models (LLMs) for code applications have demonstrated remarkable zero-shot fluency and instruction following on challenging code-related tasks ranging from test case generation to self-repair. Unsurprisingly, however, models struggle to compose syntactically valid programs in programming languages unrepresented in pre-training, referred to as very low-resource Programming Languages (VLPLs). VLPLs appear in crucial settings, including domain-specific languages for internal tools and tool-chains, and legacy languages. Inspired by an HCI technique called natural program elicitation, we propose designing an intermediate language that LLMs “naturally” know how to use and which can be automatically compiled to the target VLPL. Specifically, we introduce synthetic programming elicitation and compilation (SPEAK), an approach that enables LLMs to generate syntactically valid code even for VLPLs. We empirically evaluate the performance of SPEAK in a case study and find that, compared to existing retrieval and fine-tuning baselines, SPEAK produces syntactically correct programs more frequently without sacrificing semantic correctness.

[LG-138] Discovering Bias in Latent Space: An Unsupervised Debiasing Approach

Link: https://arxiv.org/abs/2406.03631
Authors: Dyah Adila,Shuai Zhang,Boran Han,Yuyang Wang
Keywords: capabilities of foundation, highly sensitive, foundation models, superficial image features, bias
Subjects: Machine Learning (cs.LG)

Abstract:The question-answering (QA) capabilities of foundation models are highly sensitive to prompt variations, rendering their performance susceptible to superficial, non-meaning-altering changes. This vulnerability often stems from the model’s preference or bias towards specific input characteristics, such as option position or superficial image features in multi-modal settings. We propose to rectify this bias directly in the model’s internal representation. Our approach, SteerFair, finds the bias direction in the model’s representation space and steers activation values away from it during inference. Specifically, we exploit the observation that bias often adheres to simple association rules, such as the spurious association between the first option and correctness likelihood. Next, we construct demonstrations of these rules from unlabeled samples and use them to identify the bias directions. We empirically show that SteerFair significantly reduces instruction-tuned model performance variance across prompt modifications on three benchmark tasks. Remarkably, our approach surpasses a supervised baseline with 100 labels by an average of 10.86% accuracy points and 12.95 score points and matches the performance with 500 labels.
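The idea of steering activations away from a bias direction can be sketched in a few lines. This is only an illustration under stated assumptions: the bias direction is given rather than discovered, the edit is a plain projection removal, and the names `steer_away` and `strength` are hypothetical. The abstract does not specify SteerFair's exact update rule, so this should not be read as the authors' method.

```python
import math


def unit(v):
    """Normalize a vector to unit length."""
    norm = math.sqrt(sum(c * c for c in v))
    if norm == 0:
        raise ValueError("direction must be non-zero")
    return [c / norm for c in v]


def steer_away(activation, bias_direction, strength=1.0):
    """Subtract (a fraction of) the activation's component along the bias direction."""
    d = unit(bias_direction)
    coeff = sum(a * b for a, b in zip(activation, d))
    return [a - strength * coeff * b for a, b in zip(activation, d)]


h = [2.0, 1.0, -1.0]   # a toy hidden activation
v = [1.0, 0.0, 0.0]    # an assumed bias direction
steered = steer_away(h, v)                      # full removal of the component along v
half_steered = steer_away(h, v, strength=0.5)   # partial removal
```

With `strength=1.0` the steered activation is orthogonal to the bias direction; smaller values trade off bias removal against preserving the original representation.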

[LG-139] Active ML for 6G: Towards Efficient Data Generation, Acquisition, and Annotation

Link: https://arxiv.org/abs/2406.03630
Authors: Omar Alhussein,Ning Zhang,Sami Muhaidat,Weihua Zhuang
Keywords: active learning, active, learning, active machine learning, area that remains
Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Submitted to IEEE Network Magazine

Abstract:This paper explores the integration of active machine learning (ML) for 6G networks, an area that remains under-explored yet holds potential. Unlike passive ML systems, active ML can be made to interact with the network environment. It actively selects informative and representative data points for training, thereby reducing the volume of data needed while accelerating the learning process. While active learning research mainly focuses on data annotation, we call for a network-centric active learning framework that considers both annotation (i.e., what is the label) and data acquisition (i.e., which and how many samples to collect). Moreover, we explore the synergy between generative artificial intelligence (AI) and active learning to overcome existing limitations in both active learning and generative AI. This paper also features a case study on a mmWave throughput prediction problem to demonstrate the practical benefits and improved performance of active learning for 6G networks. Furthermore, we discuss how the implications of active learning extend to numerous 6G network use cases. We highlight the potential of active learning based 6G networks to enhance computational efficiency, data annotation and acquisition efficiency, adaptability, and overall network intelligence. We conclude with a discussion on challenges and future research directions for active learning in 6G networks, including development of novel query strategies, distributed learning integration, and inclusion of human- and machine-in-the-loop learning.
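As a rough picture of what "actively selects informative data points" can mean, the sketch below ranks unlabeled samples by predictive entropy and picks the most uncertain ones for annotation. Entropy-based uncertainty sampling is one common active learning query strategy; it is used here as an assumption, not as the strategy proposed in the paper, and the names `select_for_annotation` and `budget` are illustrative.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_for_annotation(predictions, budget):
    """Return indices of the `budget` samples with the highest predictive entropy."""
    ranked = sorted(
        range(len(predictions)),
        key=lambda i: entropy(predictions[i]),
        reverse=True,
    )
    return ranked[:budget]


# Toy model outputs for four unlabeled samples (two classes each).
preds = [[0.98, 0.02], [0.55, 0.45], [0.80, 0.20], [0.50, 0.50]]
chosen = select_for_annotation(preds, budget=2)
```

Here the samples at indices 3 and 1 are chosen because the model is least certain about them, so labeling them is expected to be the most informative use of a limited annotation budget.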

[LG-140] Private Online Learning via Lazy Algorithms

Link: https://arxiv.org/abs/2406.03620
Authors: Hilal Asi,Tomer Koren,Daogao Liu,Kunal Talwar
Keywords: online convex optimization, prediction from experts, convex optimization, online learning, private online learning
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Data Structures and Algorithms (cs.DS); Optimization and Control (math.OC); Machine Learning (stat.ML)

Abstract:We study the problem of private online learning, specifically, online prediction from experts (OPE) and online convex optimization (OCO). We propose a new transformation that transforms lazy online learning algorithms into private algorithms. We apply our transformation for differentially private OPE and OCO using existing lazy algorithms for these problems. Our final algorithms obtain regret that significantly improves on prior bounds in the high privacy regime $\varepsilon \ll 1$, obtaining $\sqrt{T \log d} + T^{1/3} \log(d)/\varepsilon^{2/3}$ for DP-OPE and $\sqrt{T} + T^{1/3} \sqrt{d}/\varepsilon^{2/3}$ for DP-OCO. We also complement our results with a lower bound for DP-OPE, showing that these rates are optimal for a natural family of low-switching private algorithms.

[LG-141] Symmetry Discovery Beyond Affine Transformations

Link: https://arxiv.org/abs/2406.03619
Authors: Ben Shaw,Abram Magner,Kevin R. Moon
Keywords: machine learning tasks, learning tasks, shown to improve, improve various machine, machine learning
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Symmetry detection has been shown to improve various machine learning tasks. In the context of continuous symmetry detection, current state-of-the-art experiments are limited to the detection of affine transformations. Under the manifold assumption, we outline a framework for discovering continuous symmetry in data beyond the affine transformation group. We also provide a similar framework for discovering discrete symmetry. We experimentally compare our method to an existing method known as LieGAN and show that our method is competitive at detecting affine symmetries for large sample sizes and superior to LieGAN for small sample sizes. We also show that our method is able to detect continuous symmetries beyond the affine group and is generally more computationally efficient than LieGAN.

[LG-142] Advancing Anomaly Detection: Non-Semantic Financial Data Encoding with LLMs

Link: https://arxiv.org/abs/2406.03614
Authors: Alexander Bakumenko(1),Kateřina Hlaváčková-Schindler(2),Claudia Plant(2),Nina C. Hubig(1) ((1) Clemson University, USA, (2) University of Vienna, Austria)
Keywords: Detecting anomalies, utmost importance, importance to ensure, ensure trustworthiness, machine learning
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Risk Management (q-fin.RM)

Abstract:Detecting anomalies in general ledger data is of utmost importance to ensure trustworthiness of financial records. Financial audits increasingly rely on machine learning (ML) algorithms to identify irregular or potentially fraudulent journal entries, each characterized by a varying number of transactions. In machine learning, heterogeneity in feature dimensions adds significant complexity to data analysis. In this paper, we introduce a novel approach to anomaly detection in financial data using Large Language Models (LLMs) embeddings. To encode non-semantic categorical data from real-world financial records, we tested 3 pre-trained general purpose sentence-transformer models. For the downstream classification task, we implemented and evaluated 5 optimized ML models including Logistic Regression, Random Forest, Gradient Boosting Machines, Support Vector Machines, and Neural Networks. Our experiments demonstrate that LLMs contribute valuable information to anomaly detection as our models outperform the baselines, in selected settings even by a large margin. The findings further underscore the effectiveness of LLMs in enhancing anomaly detection in financial journal entries, particularly by tackling feature sparsity. We discuss a promising perspective on using LLM embeddings for non-semantic data in the financial context and beyond.

[LG-143] FedPylot: Navigating Federated Learning for Real-Time Object Detection in Internet of Vehicles

Link: https://arxiv.org/abs/2406.03611
Authors: Cyprien Quéméneur,Soumaya Cherkaoui
Keywords: enabling low-latency big, low-latency big data, big data processing, dense interconnected network, intelligent transportation systems
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)

Abstract:The Internet of Vehicles (IoV) emerges as a pivotal component for autonomous driving and intelligent transportation systems (ITS), by enabling low-latency big data processing in a dense interconnected network that comprises vehicles, infrastructures, pedestrians and the cloud. Autonomous vehicles are heavily reliant on machine learning (ML) and can strongly benefit from the wealth of sensory data generated at the edge, which calls for measures to reconcile model training with preserving the privacy of sensitive user data. Federated learning (FL) stands out as a promising solution to train sophisticated ML models in vehicular networks while protecting the privacy of road users and mitigating communication overhead. This paper examines the federated optimization of the cutting-edge YOLOv7 model to tackle real-time object detection amid data heterogeneity, encompassing unbalancedness, concept drift, and label distribution skews. To this end, we introduce FedPylot, a lightweight MPI-based prototype to simulate federated object detection experiments on high-performance computing (HPC) systems, where we safeguard server-client communications using hybrid encryption. Our study factors in accuracy, communication cost, and inference speed, thereby presenting a balanced approach to the challenges faced by autonomous vehicles. We demonstrate promising results for the applicability of FL in IoV and hope that FedPylot will provide a basis for future research into federated real-time object detection. The source code is available at this https URL.

[LG-144] Alignment Calibration: Machine Unlearning for Contrastive Learning under Auditing

Link: https://arxiv.org/abs/2406.03603
Authors: Yihan Wang,Yiwei Lu,Guojun Zhang,Franziska Boenisch,Adam Dziedzic,Yaoliang Yu,Xiao-Shan Gao
Keywords: pre-trained model parameters, viable solutions, solutions to revoke, contrastive learning, unlearning
Subjects: Machine Learning (cs.LG)

Abstract:Machine unlearning provides viable solutions to revoke the effect of certain training data on pre-trained model parameters. Existing approaches provide unlearning recipes for classification and generative models. However, a category of important machine learning models, i.e., contrastive learning (CL) methods, is overlooked. In this paper, we fill this gap by first proposing the framework of Machine Unlearning for Contrastive learning (MUC) and adapting existing methods. Furthermore, we observe that several methods are mediocre unlearners and existing auditing tools may not be sufficient for data owners to validate the unlearning effects in contrastive learning. We thus propose a novel method called Alignment Calibration (AC) by explicitly considering the properties of contrastive learning and optimizing towards novel auditing metrics to easily verify unlearning. We empirically compare AC with baseline methods on SimCLR, MoCo and CLIP. We observe that AC addresses drawbacks of existing methods: (1) achieving state-of-the-art performance and approximating exact unlearning (retraining); (2) allowing data owners to clearly visualize the effect caused by unlearning through black-box auditing.

[LG-145] Hi5: 2D Hand Pose Estimation with Zero Human Annotation

Link: https://arxiv.org/abs/2406.03599
Authors: Masum Hasan,Cengiz Ozel,Nina Long,Alexander Martin,Samuel Potter,Tariq Adnan,Sangwu Lee,Amir Zadeh,Ehsan Hoque
Keywords: collecting high-quality synthetic, hand pose estimation, large synthetic hand, inexpensive method, method for collecting
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)

Abstract:We propose a new large synthetic hand pose estimation dataset, Hi5, and a novel inexpensive method for collecting high-quality synthetic data that requires no human annotation or validation. Leveraging recent advancements in computer graphics, high-fidelity 3D hand models with diverse genders and skin colors, and dynamic environments and camera movements, our data synthesis pipeline allows precise control over data diversity and representation, ensuring robust and fair model training. Using a single consumer PC, we generate a dataset of 583,000 images with accurate pose annotation that closely represents real-world variability. Pose estimation models trained with Hi5 perform competitively on real-hand benchmarks while surpassing models trained with real data when tested on occlusions and perturbations. Our experiments show promising results for synthetic data as a viable solution for data representation problems in real datasets. Overall, this paper provides a promising new approach to synthetic data creation and annotation that can reduce costs and increase the diversity and quality of data for hand pose estimation.

[LG-146] Why is “Problems” Predictive of Positive Sentiment? A Case Study of Explaining Unintuitive Features in Sentiment Classification

Link: https://arxiv.org/abs/2406.03594
Authors: Jiaming Qu,Jaime Arguello,Yue Wang
Keywords: model makes predictions, machine learning model, learning model makes, algorithms aim, makes predictions
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Explainable AI (XAI) algorithms aim to help users understand how a machine learning model makes predictions. To this end, many approaches explain which input features are most predictive of a target label. However, such explanations can still be puzzling to users (e.g., in product reviews, the word “problems” is predictive of positive sentiment). If left unexplained, puzzling explanations can have negative impacts. Explaining unintuitive associations between an input feature and a target label is an underexplored area in XAI research. We make an initial effort in this direction using unintuitive associations learned by sentiment classifiers as a case study. We propose approaches for (1) automatically detecting associations that can appear unintuitive to users and (2) generating explanations to help users understand why an unintuitive feature is predictive. Results from a crowdsourced study (N=300) show that our proposed approaches can effectively detect and explain predictive but unintuitive features in sentiment classification.

[LG-147] BVE EKF: A viewpoint estimator for the estimation of the objects position in the 3D task space using Extended Kalman Filters

Link: https://arxiv.org/abs/2406.03591
Authors: Sandro Costa Magalhães,António Paulo Moreira,Filipe Neves dos Santos,Jorge Dias
Keywords: RGB-D sensors face, sensors face multiple, face multiple challenges, multiple challenges operating, RGB-D sensors
Subjects: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)

Abstract:RGB-D sensors face multiple challenges operating under open-field environments because of their sensitivity to external perturbations such as radiation or rain. Multiple works are approaching the challenge of perceiving the 3D position of objects using monocular cameras. However, most of these works focus mainly on deep learning-based solutions, which are complex, data-driven, and difficult to predict. So, we aim to approach the problem of predicting the 3D objects’ position using a Gaussian viewpoint estimator named best viewpoint estimator (BVE) powered by an extended Kalman filter (EKF). The algorithm proved efficient on the tasks and reached a maximum average Euclidean error of about 32 mm. The experiments were deployed and evaluated in MATLAB using artificial Gaussian noise. Future work aims to implement the system in a robotic system.

[LG-148] CountCLIP – [Re] Teaching CLIP to Count to Ten

Link: https://arxiv.org/abs/2406.03586
Authors: Harshvardhan Mestha,Tejas Agarwal,Karan Bania,Shreyas V,Yash Bhisikar
Keywords: relevant downstream tasks, Large vision-language models, learn rich joint, rich joint image-text, joint image-text representations
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Large vision-language models (VLMs) are shown to learn rich joint image-text representations enabling high performances in relevant downstream tasks. However, they fail to showcase their quantitative understanding of objects, and they lack good counting-aware representation. This paper conducts a reproducibility study of ‘Teaching CLIP to Count to Ten’ (Paiss et al., 2023), which presents a method to finetune a CLIP model (Radford et al., 2021) to improve zero-shot counting accuracy in an image while maintaining the performance for zero-shot classification by introducing a counting-contrastive loss term. We improve the model’s performance on a smaller subset of their training data with lower computational resources. We verify these claims by reproducing their study with our own code. The implementation can be found at this https URL.

[LG-149] A Comparison of Recent Algorithms for Symbolic Regression to Genetic Programming

Link: https://arxiv.org/abs/2406.03585
Authors: Yousef A. Radwan,Gabriel Kronberger,Stephan Winkler
Keywords: Symbolic regression, produce interpretable results, goal to produce, produce interpretable, machine learning
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Symbolic regression is a machine learning method with the goal of producing interpretable results. Unlike other machine learning methods such as random forests or neural networks, which are opaque, symbolic regression aims to model and map data in a way that can be understood by scientists. Recent advancements have attempted to bridge the gap between these two fields; new methodologies attempt to fuse the mapping power of neural networks and deep learning techniques with the explanatory power of symbolic regression. In this paper, we examine these newly emerging systems and test the performance of an end-to-end transformer model for symbolic regression against the reigning traditional methods based on genetic programming (GP) that have spearheaded symbolic regression throughout the years. We compare these systems on novel datasets to avoid bias towards older methods, which were improved on well-known benchmark datasets. Our results show that traditional GP methods, as implemented, e.g., by Operon, still remain superior to two recently published symbolic regression methods.

[LG-150] Reconciling Heterogeneous Effects in Causal Inference

Link: https://arxiv.org/abs/2406.03575
Authors: Audrey Chang,Emily Diana,Alexander Williams Tolbert
Keywords: reference class problem, reference class, problem pitch paper, causal inference, class problem
Subjects: Machine Learning (cs.LG)

Abstract:In this position and problem pitch paper, we offer a solution to the reference class problem in causal inference. We apply the Reconcile algorithm for model multiplicity in machine learning to reconcile heterogeneous effects in causal inference. Discrepancy between conditional average treatment effect (CATE) estimators of heterogeneous effects poses the reference class problem, where estimates for individual predictions differ by choice of reference class. By adopting the individual-to-group framework for interpreting probability, we can recognize that the reference class problem, which appears across fields such as philosophy of science and causal inference, is equivalent to the model multiplicity problem in computer science. We then apply the Reconcile algorithm to reconcile differences in estimates of individual probability among CATE estimators. Because the reference class problem manifests in contexts of individual probability prediction using group-based evidence, our results have tangible implications for ensuring fair outcomes in high-stakes domains such as healthcare, insurance, and housing, especially for marginalized communities. By highlighting the importance of mitigating disparities in predictive modeling, our work invites further exploration into interdisciplinary strategies that combine technical rigor with a keen awareness of social implications. Ultimately, our findings advocate for a holistic approach to algorithmic fairness, underscoring the critical role of thoughtful, well-rounded solutions in achieving the broader goals of equity and access.

[LG-151] A Simple Learning-Augmented Algorithm for Online Packing with Concave Objectives

Link: https://arxiv.org/abs/2406.03574
Authors: Elena Grigorescu,Young-San Lin,Maoyuan Song
Keywords: extensively studied recently, algorithms, computer-science community, Learning-augmented algorithms, online
Subjects: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: 13 pages, 2 figures. Abstract shortened to fit arXiv limit

Abstract:Learning-augmented algorithms have been extensively studied recently in the computer-science community, due to the potential of using machine learning predictions to improve the performance of algorithms. Predictions are especially useful for online algorithms making irrevocable decisions without knowledge of the future. Such learning-augmented algorithms aim to overcome the limitations of classical online algorithms when the predictions are accurate, and still perform comparably when the predictions are inaccurate. A common approach is to adapt existing online algorithms to the particular advice notion employed, which often involves understanding previous sophisticated algorithms and their analyses. However, ideally, one would simply use previous online solutions in a black-box fashion, without much loss in the approximation guarantees. Such clean solutions that avoid opening up black boxes are often rare, and may even be missed the first time around. For example, Grigorescu et al. (NeurIPS 22) proposed a learning-augmented algorithm for online covering linear programs, but it later turned out that their results can be subsumed by a natural approach that switches between the advice and an online algorithm given as a black box, as noted in their paper. In this work, we introduce and analyze a simple learning-augmented algorithm for online packing problems with linear constraints and concave objectives. We exhibit several direct applications of our framework including online packing linear programming, knapsack, resource management benefit, throughput maximization, and network utility maximization. We further raise the problem of understanding necessary and sufficient conditions for when such simple black-box solutions may be optimal. We believe this is an important direction of research that would unify many ad-hoc approaches from the literature.

[LG-152] GFN: A graph feedforward network for resolution-invariant reduced operator learning in multifidelity applications

Link: https://arxiv.org/abs/2406.03569
Authors: Oisín M. Morrison,Federico Pichi,Jan S. Hesthaven
Keywords: resolution-invariant model order, order reduction strategy, model order reduction, work presents, graph feedforward network
Subjects: Numerical Analysis (math.NA); Machine Learning (cs.LG)

Abstract:This work presents a novel resolution-invariant model order reduction strategy for multifidelity applications. We base our architecture on a novel neural network layer developed in this work, the graph feedforward network, which extends the concept of feedforward networks to graph-structured data by creating a direct link between the weights of a neural network and the nodes of a mesh, enhancing the interpretability of the network. We exploit the method’s capability of training and testing on different mesh sizes in an autoencoder-based reduction strategy for parametrised partial differential equations. We show that this extension comes with provable guarantees on the performance via error bounds. The capabilities of the proposed methodology are tested on three challenging benchmarks, including advection-dominated phenomena and problems with a high-dimensional parameter space. The method results in a more lightweight and highly flexible strategy when compared to state-of-the-art models, while showing excellent generalisation performance in both single fidelity and multifidelity scenarios.

[LG-153] Neural empirical interpolation method for nonlinear model reduction

Link: https://arxiv.org/abs/2406.03562
Authors: Max Hirsch,Federico Pichi,Jan S. Hesthaven
Keywords: empirical interpolation method, discrete empirical interpolation, partial differential equation, neural empirical interpolation, neural network-based alternative
Subjects: Numerical Analysis (math.NA); Machine Learning (cs.LG)

Abstract:In this paper, we introduce the neural empirical interpolation method (NEIM), a neural network-based alternative to the discrete empirical interpolation method for reducing the time complexity of computing the nonlinear term in a reduced order model (ROM) for a parameterized nonlinear partial differential equation. NEIM is a greedy algorithm which accomplishes this reduction by approximating an affine decomposition of the nonlinear term of the ROM, where the vector terms of the expansion are given by neural networks depending on the ROM solution, and the coefficients are given by an interpolation of some “optimal” coefficients. Because NEIM is based on a greedy strategy, we are able to provide a basic error analysis to investigate its performance. NEIM has the advantages of being easy to implement in models with automatic differentiation, of being a nonlinear projection of the ROM nonlinearity, of being efficient for both nonlocal and local nonlinearities, and of relying solely on data and not the explicit form of the ROM nonlinearity. We demonstrate the effectiveness of the methodology on solution-dependent and solution-independent nonlinearities, a nonlinear elliptic problem, and a nonlinear parabolic model of liquid crystals.

[LG-154] Robust Communication and Computation using Deep Learning via Joint Uncertainty Injection

Link: https://arxiv.org/abs/2406.03548
Authors: Robert-Jeron Reifert,Hayssam Dahrouj,Alaa Alameer Ahmad,Haris Gacanin,Aydin Sezgin
Keywords: key empowering pillars, artificial intelligence, stand as key, integration of machine, key empowering
Subjects: Information Theory (cs.IT); Machine Learning (cs.LG)
Comments: 7 pages, 6 figures, one table. Accepted for presentation at the 19th International Symposium on Wireless Communication Systems 2024 (ISWCS 2024)

Abstract:The convergence of communication and computation, along with the integration of machine learning and artificial intelligence, stand as key empowering pillars for the sixth-generation of communication systems (6G). This paper considers a network of one base station serving a number of devices simultaneously using spatial multiplexing. The paper then presents an innovative deep learning-based approach to simultaneously manage the transmit and computing powers, alongside computation allocation, amidst uncertainties in both channel and computing states information. More specifically, the paper aims at proposing a robust solution that minimizes the worst-case delay across the served devices subject to computation and power constraints. The paper uses a deep neural network (DNN)-based solution that maps estimated channels and computation requirements to optimized resource allocations. During training, uncertainty samples are injected after the DNN output to jointly account for both communication and computation estimation errors. The DNN is then trained via backpropagation using the robust utility, thus implicitly learning the uncertainty distributions. Our results validate the enhanced robust delay performance of the joint uncertainty injection versus the classical DNN approach, especially in high channel and computational uncertainty regimes.

[LG-155] A Geometric View of Data Complexity: Efficient Local Intrinsic Dimension Estimation with Diffusion Models

Link: https://arxiv.org/abs/2406.03537
Authors: Hamidreza Kamkari,Brendan Leigh Ross,Rasa Hosseinzadeh,Jesse C. Cresswell,Gabriel Loaiza-Ganem
Keywords: High-dimensional data commonly, local intrinsic dimension, data commonly lies, High-dimensional data, low-dimensional submanifolds
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 10 pages

Abstract:High-dimensional data commonly lies on low-dimensional submanifolds, and estimating the local intrinsic dimension (LID) of a datum – i.e. the dimension of the submanifold it belongs to – is a longstanding problem. LID can be understood as the number of local factors of variation: the more factors of variation a datum has, the more complex it tends to be. Estimating this quantity has proven useful in contexts ranging from generalization in neural networks to detection of out-of-distribution data, adversarial examples, and AI-generated text. The recent successes of deep generative models present an opportunity to leverage them for LID estimation, but current methods based on generative models produce inaccurate estimates, require more than a single pre-trained model, are computationally intensive, or do not exploit the best available deep generative models, i.e. diffusion models (DMs). In this work, we show that the Fokker-Planck equation associated with a DM can provide a LID estimator which addresses all the aforementioned deficiencies. Our estimator, called FLIPD, is compatible with all popular DMs, and outperforms existing baselines on LID estimation benchmarks. We also apply FLIPD on natural images where the true LID is unknown. Compared to competing estimators, FLIPD exhibits a higher correlation with non-LID measures of complexity, better matches a qualitative assessment of complexity, and is the only estimator to remain tractable with high-resolution images at the scale of Stable Diffusion.

[LG-156] VideoPhy: Evaluating Physical Commonsense for Video Generation

Link: https://arxiv.org/abs/2406.03520
Authors: Hritik Bansal,Zongyu Lin,Tianyi Xie,Zeshun Zong,Michal Yarom,Yonatan Bitton,Chenfanfu Jiang,Yizhou Sun,Kai-Wei Chang,Aditya Grover
Keywords: Recent advances, internet-scale video data, video data pretraining, create high-quality videos, generative models
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 36 pages, 26 figures, 8 tables

Abstract:Recent advances in internet-scale video data pretraining have led to the development of text-to-video generative models that can create high-quality videos across a broad range of visual concepts and styles. Due to their ability to synthesize realistic motions and render complex objects, these generative models have the potential to become general-purpose simulators of the physical world. However, it is unclear how far we are from this goal with the existing text-to-video generative models. To this end, we present VideoPhy, a benchmark designed to assess whether the generated videos follow physical commonsense for real-world activities (e.g., marbles will roll down when placed on a slanted surface). Specifically, we curate a list of 688 captions that involve interactions between various material types in the physical world (e.g., solid-solid, solid-fluid, fluid-fluid). We then generate videos conditioned on these captions from diverse state-of-the-art text-to-video generative models, including open models (e.g., VideoCrafter2) and closed models (e.g., Lumiere from Google, Pika). Further, our human evaluation reveals that the existing models severely lack the ability to generate videos adhering to the given text prompts, while also lacking physical commonsense. Specifically, the best performing model, Pika, generates videos that adhere to the caption and physical laws for only 19.7% of the instances. VideoPhy thus highlights that the video generative models are far from accurately simulating the physical world. Finally, we also supplement the dataset with an auto-evaluator, VideoCon-Physics, to assess semantic adherence and physical commonsense at scale.

[LG-157] Noise-Aware Algorithm for Heterogeneous Differentially Private Federated Learning

Link: https://arxiv.org/abs/2406.03519
Authors: Saber Malekmohammadi,Yaoliang Yu,Yang Cao
Keywords: rigorous data privacy, High utility, rigorous data, data distributed, federated learning
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024

Abstract:High utility and rigorous data privacy are among the main goals of a federated learning (FL) system, which learns a model from the data distributed among some clients. The latter has been pursued by using differential privacy in FL (DPFL). There is often heterogeneity in clients’ privacy requirements, and existing DPFL works either assume uniform privacy requirements for clients or are not applicable when the server is not fully trusted (our setting). Furthermore, there is often heterogeneity in the batch and/or dataset size of clients, which, as shown, results in extra variation in the DP noise level across clients’ model updates. With these sources of heterogeneity, straightforward aggregation strategies, e.g., assigning clients aggregation weights proportional to their privacy parameters, will lead to lower utility. We propose Robust-HDP, which efficiently estimates the true noise level in clients’ model updates and considerably reduces the noise level in the aggregated model updates. Robust-HDP improves utility and convergence speed, while being robust to clients that may maliciously send falsified privacy parameters to the server. Extensive experimental results on multiple datasets and our theoretical analysis confirm the effectiveness of Robust-HDP. Our code can be found here.
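
To make the aggregation point above concrete, here is a minimal sketch of why weighting clients by their noise level can help. It assumes each client's update carries independent zero-mean noise with a known variance and uses standard inverse-variance weighting; this is an illustration only, not the Robust-HDP procedure, which estimates the noise levels rather than receiving them. The function names are hypothetical.

```python
def inverse_variance_weights(noise_variances):
    """Aggregation weights proportional to 1 / variance, normalized to sum to 1."""
    inverse = [1.0 / v for v in noise_variances]
    total = sum(inverse)
    return [x / total for x in inverse]


def aggregated_noise_variance(weights, noise_variances):
    """Variance of the weighted sum of independent noise terms: sum of w_i^2 * var_i."""
    return sum(w * w * v for w, v in zip(weights, noise_variances))


# Assumed (made-up) per-client noise variances, for illustration only.
variances = [1.0, 4.0, 16.0]
uniform = [1.0 / len(variances)] * len(variances)
weighted = inverse_variance_weights(variances)

print("uniform weights  ->", aggregated_noise_variance(uniform, variances))
print("inverse-variance ->", aggregated_noise_variance(weighted, variances))
```

Under the independence assumption, inverse-variance weights minimize the variance of the aggregate among all weights that sum to one, which is why the second figure printed is never larger than the first.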

[LG-158] Buffered Asynchronous Secure Aggregation for Cross-Device Federated Learning

Link: https://arxiv.org/abs/2406.03516
Authors: Kun Wang,Yi-Rui Yang,Wu-Jun Li
Keywords: secure aggregation, federated learning, secure aggregation protocols, existing secure aggregation, Asynchronous federated learning
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Asynchronous federated learning (AFL) is an effective method to address the challenge of device heterogeneity in cross-device federated learning. However, AFL is usually incompatible with existing secure aggregation protocols used to protect user privacy in federated learning because most existing secure aggregation protocols are based on synchronous aggregation. To address this problem, we propose a novel secure aggregation protocol named buffered asynchronous secure aggregation (BASA) in this paper. Compared with existing protocols, BASA is fully compatible with AFL and provides secure aggregation under the condition that each user only needs one round of communication with the server without relying on any synchronous interaction among users. Based on BASA, we propose the first AFL method which achieves secure aggregation without extra requirements on hardware. We empirically demonstrate that BASA outperforms existing secure aggregation protocols for cross-device federated learning in terms of training efficiency and scalability.

[LG-159] MagiNet: Mask-Aware Graph Imputation Network for Incomplete Traffic Data

Link: https://arxiv.org/abs/2406.03511
Authors: Jianping Zhou,Bin Lu,Zhanyu Liu,Siyu Pan,Xuejun Feng,Hua Wei,Guanjie Zheng,Xinbing Wang,Chenghu Zhou
Keywords: Intelligent Transportation System, Due to detector, communication failures, detector malfunctions, malfunctions and communication
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 19 pages, 7 figures

Abstract:Due to detector malfunctions and communication failures, missing data is ubiquitous during the collection of traffic data. Therefore, it is of vital importance to impute the missing values to facilitate data analysis and decision-making for Intelligent Transportation System (ITS). However, existing imputation methods generally perform zero pre-filling techniques to initialize missing values, introducing inevitable noises. Moreover, we observe prevalent over-smoothing interpolations, falling short in revealing the intrinsic spatio-temporal correlations of incomplete traffic data. To this end, we propose Mask-Aware Graph imputation Network: MagiNet. Our method designs an adaptive mask spatio-temporal encoder to learn the latent representations of incomplete data, eliminating the reliance on pre-filling missing values. Furthermore, we devise a spatio-temporal decoder that stacks multiple blocks to capture the inherent spatial and temporal dependencies within incomplete traffic data, alleviating over-smoothing imputation. Extensive experiments demonstrate that our method outperforms state-of-the-art imputation methods on five real-world traffic datasets, yielding an average improvement of 4.31% in RMSE and 3.72% in MAPE.

[LG-160] Mutual Information Guided Backdoor Mitigation for Pre-trained Encoders

Link: https://arxiv.org/abs/2406.03508
Authors: Tingxu Han,Weisong Sun,Ziqi Ding,Chunrong Fang,Hanwei Qian,Jiaxun Li,Zhenyu Chen,Xiangyu Zhang
Keywords: Self-supervised learning, requiring labeled data, increasingly attractive, requiring labeled, Self-supervised
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)

Abstract:Self-supervised learning (SSL) is increasingly attractive for pre-training encoders without requiring labeled data. Downstream tasks built on top of those pre-trained encoders can achieve nearly state-of-the-art performance. The pre-trained encoders by SSL, however, are vulnerable to backdoor attacks as demonstrated by existing studies. Numerous backdoor mitigation techniques are designed for downstream task models. However, their effectiveness is impaired and limited when adapted to pre-trained encoders, due to the lack of label information when pre-training. To address backdoor attacks against pre-trained encoders, in this paper, we innovatively propose a mutual information guided backdoor mitigation technique, named MIMIC. MIMIC treats the potentially backdoored encoder as the teacher net and employs knowledge distillation to distill a clean student encoder from the teacher net. Different from existing knowledge distillation approaches, MIMIC initializes the student with random weights, inheriting no backdoors from teacher nets. Then MIMIC leverages mutual information between each layer and extracted features to locate where benign knowledge lies in the teacher net, with which distillation is deployed to clone clean features from teacher to student. We craft the distillation loss with two aspects, including clone loss and attention loss, aiming to mitigate backdoors and maintain encoder performance at the same time. Our evaluation conducted on two backdoor attacks in SSL demonstrates that MIMIC can significantly reduce the attack success rate by only utilizing 5% of clean data, surpassing seven state-of-the-art backdoor mitigation techniques.

[LG-161] Robust Prediction Model for Multidimensional and Unbalanced Datasets

Link: https://arxiv.org/abs/2406.03507
Authors: Pooja Thakar,Anil Mehta,Manisha
Keywords: Data Mining, promising field, applied in multiple, predictive capabilities, Data
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 9 pages

Abstract:Data Mining is a promising field and is applied in multiple domains for its predictive capabilities. Data in the real world cannot be readily used for data mining as it suffers from the problems of multidimensionality, imbalance and missing values. Novice users find it difficult to use its predictive capabilities, and it is difficult for a beginner to find the relevant set of attributes in a large pool of available data. The paper presents a Robust Prediction Model that finds a relevant set of attributes, resolves the problems of unbalanced and multidimensional real-life datasets, and helps in finding patterns for informed decision making. The model is tested on five different datasets in the domains of Health Sector, Education, Business and Fraud Detection. The results showcase the robust behaviour of the model and its applicability in various domains.

[LG-162] Fuzzy Convolution Neural Networks for Tabular Data Classification

Link: https://arxiv.org/abs/2406.03506
Authors: Arun D. Kulkarni
Keywords: tabular data classification, data, tabular data, attracted a great, great deal
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 10 pages, 16 figures, Submitted to IEEE Access

Abstract:Recently, convolution neural networks (CNNs) have attracted a great deal of attention due to their remarkable performance in various domains, particularly in image and text classification tasks. However, their application to tabular data classification remains underexplored. There are many fields, such as bioinformatics, finance, and medicine, where nonimage data are prevalent. Adaptation of CNNs to classify nonimage data remains highly challenging. This paper investigates the efficacy of CNNs for tabular data classification, aiming to bridge the gap between traditional machine learning approaches and deep learning techniques. We propose a novel framework, the fuzzy convolution neural network (FCNN), tailored specifically for tabular data to capture local patterns within feature vectors. In our approach, we map feature values to fuzzy memberships. The fuzzy membership vectors are converted into images that are used to train the CNN model. The trained CNN model is used to classify unknown feature vectors. To validate our approach, we generated six complex noisy data sets. We used a randomly selected seventy percent of the samples from each data set for training and thirty percent for testing. The data sets were also classified using state-of-the-art machine learning algorithms such as the decision tree (DT), support vector machine (SVM), fuzzy neural network (FNN), Bayes classifier, and Random Forest (RF). Experimental results demonstrate that our proposed model can effectively learn meaningful representations from tabular data, achieving competitive or superior performance compared to existing methods. Overall, our findings suggest that the proposed FCNN model holds promise as a viable alternative for tabular data classification tasks, offering a fresh perspective and potentially unlocking new opportunities for leveraging deep learning in structured data analysis.
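
The step of mapping feature values to fuzzy memberships and arranging them as an image can be sketched roughly as follows. The abstract does not specify the membership functions or the image layout, so this sketch assumes evenly spaced triangular membership functions over features scaled to [0, 1] and a simple features-by-fuzzy-sets grid; the function names are hypothetical and not taken from the paper.

```python
def triangular_membership(value, center, width):
    """Membership degree in [0, 1]: 1 at the center, falling linearly to 0 at distance `width`."""
    return max(0.0, 1.0 - abs(value - center) / width)


def feature_vector_to_image(features, n_sets=3):
    """Map features scaled to [0, 1] to an (n_features x n_sets) grid of memberships.

    Assumes n_sets >= 2, with fuzzy-set centers evenly spaced on [0, 1].
    """
    centers = [i / (n_sets - 1) for i in range(n_sets)]
    width = 1.0 / (n_sets - 1)
    return [
        [triangular_membership(value, center, width) for center in centers]
        for value in features
    ]


# Each row is one feature; each column is one fuzzy set (e.g. low, medium, high).
for row in feature_vector_to_image([0.0, 0.25, 1.0]):
    print(row)
```

A grid like this could then be fed to a CNN as a single-channel image; how the paper actually shapes its images is not stated in the abstract.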

[LG-163] Dynamic and Adaptive Feature Generation with LLM

Link: https://arxiv.org/abs/2406.03505
Authors: Xinhao Zhang,Jinghan Zhang,Banafsheh Rekabdar,Yuanchun Zhou,Pengfei Wang,Kunpeng Liu
Keywords: feature generation, upcoming modeling, crucial environment, points get vectorized, vectorized and embedded
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:The representation of feature space is a crucial environment where data points get vectorized and embedded for upcoming modeling. Thus the efficacy of machine learning (ML) algorithms is closely related to the quality of feature engineering. As one of the most important techniques, feature generation transforms raw data into an optimized feature space conducive to model training and further refines the space. Despite the advancements in automated feature engineering and feature generation, current methodologies often suffer from three fundamental issues: lack of explainability, limited applicability, and inflexible strategy. These shortcomings frequently hinder and limit the deployment of ML models across varied scenarios. Our research introduces a novel approach adopting large language models (LLMs) and feature-generating prompts to address these challenges. We propose a dynamic and adaptive feature generation method that enhances the interpretability of the feature generation process. Our approach broadens the applicability across various data types and tasks and draws advantages over strategic flexibility. A broad range of experiments showcases that our approach is significantly superior to existing methods.

[LG-164] Position: Rethinking Post-Hoc Search-Based Neural Approaches for Solving Large-Scale Traveling Salesman Problems

Link: https://arxiv.org/abs/2406.03503
Authors: Yifan Xia,Xianliang Yang,Zichuan Liu,Zhihao Liu,Lei Song,Jiang Bian
Keywords: Monte Carlo tree, heatmap-guided Monte Carlo, Carlo tree search, Monte Carlo, solving large-scale traveling
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted by International Conference on Machine Learning (ICML 2024)

Abstract:Recent advancements in solving large-scale traveling salesman problems (TSP) utilize the heatmap-guided Monte Carlo tree search (MCTS) paradigm, where machine learning (ML) models generate heatmaps, indicating the probability distribution of each edge being part of the optimal solution, to guide MCTS in solution finding. However, our theoretical and experimental analysis raises doubts about the effectiveness of ML-based heatmap generation. In support of this, we demonstrate that a simple baseline method can outperform complex ML approaches in heatmap generation. Furthermore, we question the practical value of the heatmap-guided MCTS paradigm. To substantiate this, our findings show its inferiority to the LKH-3 heuristic despite the paradigm’s reliance on problem-specific, hand-crafted strategies. For the future, we suggest research directions focused on developing more theoretically sound heatmap generation methods and exploring autonomous, generalizable ML approaches for combinatorial problems. The code is available for review: this https URL.

[LG-165] Llumnix: Dynamic Scheduling for Large Language Model Serving

Link: https://arxiv.org/abs/2406.03243
Authors: Biao Sun,Ziming Huang,Hanyu Zhao,Wencong Xiao,Xinyi Zhang,Yong Li,Wei Lin
Keywords: large language models, people daily lives, Inference serving, LLM serving, LLM serving system
Subjects: Hardware Architecture (cs.AR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments: To appear at OSDI '24; open-source repo will be available in June 2024

Abstract:Inference serving for large language models (LLMs) is the key to unleashing their potential in people’s daily lives. However, efficient LLM serving remains challenging today because the requests are inherently heterogeneous and unpredictable in terms of resource and latency requirements, as a result of the diverse applications and the dynamic execution nature of LLMs. Existing systems are fundamentally limited in handling these characteristics and cause problems such as severe queuing delays, poor tail latencies, and SLO violations. We introduce Llumnix, an LLM serving system that reacts to such heterogeneous and unpredictable requests by runtime rescheduling across multiple model instances. Similar to context switching across CPU cores in modern operating systems, Llumnix reschedules requests to improve load balancing and isolation, mitigate resource fragmentation, and differentiate request priorities and SLOs. Llumnix implements the rescheduling with an efficient and scalable live migration mechanism for requests and their in-memory states, and exploits it in a dynamic scheduling policy that unifies the multiple rescheduling scenarios elegantly. Our evaluations show that Llumnix improves tail latencies by an order of magnitude, accelerates high-priority requests by up to 1.5x, and delivers up to 36% cost savings while achieving similar tail latencies, compared against state-of-the-art LLM serving systems. Llumnix is publicly available at this https URL.

[LG-166] Information-driven Affordance Discovery for Efficient Robotic Manipulation

Link: https://arxiv.org/abs/2405.03865
Authors: Pietro Mazzaglia,Taco Cohen,Daniel Dijkman
Keywords: aid robotic manipulation, robotic manipulation, aid robotic, providing information, Robotic
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: arXiv admin note: substantial text overlap with arXiv:2308.14915

Abstract:Robotic affordances, providing information about what actions can be taken in a given situation, can aid robotic manipulation. However, learning about affordances requires expensive large annotated datasets of interactions or demonstrations. In this work, we argue that well-directed interactions with the environment can mitigate this problem and propose an information-based measure to augment the agent’s objective and accelerate the affordance discovery process. We provide a theoretical justification of our approach and we empirically validate the approach both in simulation and real-world tasks. Our method, which we dub IDA, enables the efficient discovery of visual affordances for several action primitives, such as grasping, stacking objects, or opening drawers, strongly improving data efficiency in simulation, and it allows us to learn grasping affordances in a small number of interactions, on a real-world setup with a UFACTORY XArm 6 robot arm.

[LG-167] TENG: Time-Evolving Natural Gradient for Solving PDEs With Deep Neural Nets Toward Machine Precision

Link: https://arxiv.org/abs/2404.10771
Authors: Zhuo Chen,Jacob McCarran,Esteban Vizcaino,Marin Soljačić,Di Luo
Keywords: Partial differential equations, modeling dynamical systems, Partial differential, science and engineering, instrumental for modeling
Subjects: Machine Learning (cs.LG); Numerical Analysis (math.NA); Computational Physics (physics.comp-ph)

Abstract:Partial differential equations (PDEs) are instrumental for modeling dynamical systems in science and engineering. The advent of neural networks has initiated a significant shift in tackling these complexities, though challenges in accuracy persist, especially for initial value problems. In this paper, we introduce the Time-Evolving Natural Gradient (TENG), generalizing time-dependent variational principles and optimization-based time integration, leveraging natural gradient optimization to obtain high accuracy in neural-network-based PDE solutions. Our comprehensive development includes algorithms like TENG-Euler and its high-order variants, such as TENG-Heun, tailored for enhanced precision and efficiency. TENG’s effectiveness is further validated through its performance, surpassing current leading methods and achieving machine precision in step-by-step optimizations across a spectrum of PDEs, including the heat equation, Allen-Cahn equation, and Burgers’ equation.

[LG-168] Online learning of quantum processes

Link: https://arxiv.org/abs/2406.04250
Authors: Asad Raza,Matthias C. Caro,Jens Eisert,Sumeet Khatri
Keywords: adaptively chosen observables, accurately predict expectation, online learning, learning, online
Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 14 + 72 pages, 6 figures

Abstract:Among recent insights into learning quantum states, online learning and shadow tomography procedures are notable for their ability to accurately predict expectation values even of adaptively chosen observables. In contrast to the state case, quantum process learning tasks with a similarly adaptive nature have received little attention. In this work, we investigate online learning tasks for quantum processes. Whereas online learning is infeasible for general quantum channels, we show that channels of bounded gate complexity as well as Pauli channels can be online learned in the regret and mistake-bounded models of online learning. In fact, we can online learn probabilistic mixtures of any exponentially large set of known channels. We also provide a provably sample-efficient shadow tomography procedure for Pauli channels. Our results extend beyond quantum channels to non-Markovian multi-time processes, with favorable regret and mistake bounds, as well as a shadow tomography procedure. We complement our online learning upper bounds with mistake as well as computational lower bounds. On the technical side, we make use of the multiplicative weights update algorithm, classical adaptive data analysis, and Bell sampling, as well as tools from the theory of quantum combs for multi-time quantum processes. Our work initiates a study of online learning for classes of quantum channels and, more generally, non-Markovian quantum processes. Given the importance of online learning for state shadow tomography, this may serve as a step towards quantum channel variants of adaptive shadow tomography.
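
For readers unfamiliar with the multiplicative weights update algorithm mentioned above, the following is a minimal sketch of its classical form in the experts setting, with losses in [0, 1]. It is not the quantum-channel learner from the paper; the function name, the learning rate `eta`, and the toy loss sequence are assumptions made for illustration.

```python
import math


def multiplicative_weights(loss_rounds, eta=0.5):
    """Classical multiplicative weights update over a fixed set of experts.

    loss_rounds: list of rounds, each a list of per-expert losses in [0, 1].
    Returns the final normalized weights and the learner's total expected loss.
    """
    n_experts = len(loss_rounds[0])
    weights = [1.0] * n_experts
    total_expected_loss = 0.0
    for losses in loss_rounds:
        norm = sum(weights)
        probabilities = [w / norm for w in weights]
        # The learner follows expert i with probability proportional to its weight.
        total_expected_loss += sum(p * l for p, l in zip(probabilities, losses))
        # Experts with higher loss are down-weighted exponentially.
        weights = [w * math.exp(-eta * l) for w, l in zip(weights, losses)]
    norm = sum(weights)
    return [w / norm for w in weights], total_expected_loss


# Toy sequence: expert 0 is always right, experts 1 and 2 are always wrong.
final_weights, learner_loss = multiplicative_weights([[0.0, 1.0, 1.0]] * 20)
print(final_weights, learner_loss)
```

The weight mass concentrates on the best expert, so the learner's cumulative loss stays within a sublinear regret of it; regret bounds of this kind are what the abstract refers to in the online learning model.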

[LG-169] Online learning of a panoply of quantum objects

Link: https://arxiv.org/abs/2406.04245
Authors: Akshay Bansal, Ian George, Soumik Ghosh, Jamie Sikora, Alice Zheng
Keywords: regret bound, quantum, unknown quantum object, sublinear regret bound, wishes to learn
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Comments: 34 pages. Comments welcome

Abstract:In many quantum tasks, there is an unknown quantum object that one wishes to learn. An online strategy for this task involves adaptively refining a hypothesis to reproduce such an object or its measurement statistics. A common evaluation metric for such a strategy is its regret, or roughly the accumulated errors in hypothesis statistics. We prove a sublinear regret bound for learning over general subsets of positive semidefinite matrices via the regularized-follow-the-leader algorithm and apply it to various settings where one wishes to learn quantum objects. For concrete applications, we present a sublinear regret bound for learning quantum states, effects, channels, interactive measurements, strategies, co-strategies, and the collection of inner products of pure states. Our bound applies to many other quantum objects with compact, convex representations. In proving our regret bound, we establish various matrix analysis results useful in quantum information theory. This includes a generalization of Pinsker’s inequality for arbitrary positive semidefinite operators with possibly different traces, which may be of independent interest and applicable to more general classes of divergences.

[LG-170] Essentially Sharp Estimates on the Entropy Regularization Error in Discrete Discounted Markov Decision Processes

Link: https://arxiv.org/abs/2406.04163
Authors: Johannes Müller, Semih Cayci
Keywords: infinite-horizon discrete discounted, discrete discounted Markov, discounted Markov decision, Markov decision processes, natural policy gradient
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: 25 pages, 1 figure

Abstract:We study the error introduced by entropy regularization of infinite-horizon discrete discounted Markov decision processes. We show that this error decreases exponentially in the inverse regularization strength both in a weighted KL-divergence and in value with a problem-specific exponent. We provide a lower bound matching our upper bound up to a polynomial factor. Our proof relies on the correspondence of the solutions of entropy-regularized Markov decision processes with gradient flows of the unregularized reward with respect to a Riemannian metric common in natural policy gradient methods. Further, this correspondence allows us to identify the limit of the gradient flow as the generalized maximum entropy optimal policy, thereby characterizing the implicit bias of the Kakade gradient flow which corresponds to a time-continuous version of the natural policy gradient method. We use this to show that for entropy-regularized natural policy gradient methods the overall error decays exponentially in the square root of the number of iterations improving existing sublinear guarantees.
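
To make the notion of regularization error concrete, here is a minimal sketch on a hypothetical 2-state, 2-action MDP (the transition table `P`, rewards `R`, discount `GAMMA`, and all function names are illustrative assumptions, not taken from the paper). It computes the entropy-regularized optimal policy by soft value iteration at temperature `tau`, evaluates that policy without regularization, and reports the gap to the unregularized optimal value.

```python
import math

# Hypothetical 2-state, 2-action MDP; all numbers are illustrative assumptions.
GAMMA = 0.9
P = [[[0.9, 0.1], [0.2, 0.8]],   # P[s][a][s'] transition probabilities
     [[0.7, 0.3], [0.1, 0.9]]]
R = [[1.0, 0.0],                 # R[s][a] immediate rewards
     [0.0, 2.0]]
STATES, ACTIONS = range(2), range(2)

def q_values(V):
    """One-step lookahead Q(s, a) = R(s, a) + gamma * E[V(s')]."""
    return [[R[s][a] + GAMMA * sum(p * v for p, v in zip(P[s][a], V))
             for a in ACTIONS] for s in STATES]

def softmax(q, tau):
    """Numerically stable softmax of q / tau."""
    m = max(q)
    w = [math.exp((x - m) / tau) for x in q]
    z = sum(w)
    return [x / z for x in w]

def soft_optimal_policy(tau, iters=2000):
    """Soft value iteration; returns the entropy-regularized optimal policy."""
    V = [0.0 for _ in STATES]
    for _ in range(iters):
        # Soft Bellman backup: V(s) = tau * logsumexp(Q(s, .) / tau)
        V = [max(q) + tau * math.log(sum(math.exp((x - max(q)) / tau) for x in q))
             for q in q_values(V)]
    return [softmax(q, tau) for q in q_values(V)]

def policy_value(policy, iters=2000):
    """Unregularized value of a fixed stochastic policy."""
    V = [0.0 for _ in STATES]
    for _ in range(iters):
        Q = q_values(V)
        V = [sum(pa * qa for pa, qa in zip(policy[s], Q[s])) for s in STATES]
    return V

def optimal_value(iters=2000):
    """Standard (unregularized) value iteration."""
    V = [0.0 for _ in STATES]
    for _ in range(iters):
        V = [max(q) for q in q_values(V)]
    return V

def regularization_error(tau):
    """Largest per-state loss in value caused by regularizing with strength tau."""
    v_star = optimal_value()
    v_tau = policy_value(soft_optimal_policy(tau))
    return max(a - b for a, b in zip(v_star, v_tau))

if __name__ == "__main__":
    for tau in (1.0, 0.5, 0.1, 0.05):
        print(tau, regularization_error(tau))
```

On this toy problem the printed gap should shrink quickly as `tau` decreases, which is the qualitative behavior the abstract describes; the sketch does not reproduce the paper's bounds or its problem-specific exponent.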

[LG-171] Stochastic Polyak Step-sizes and Momentum: Convergence Guarantees and Practical Performance

Link: https://arxiv.org/abs/2406.04142
Authors: Dimitris Oikonomou, Nicolas Loizou
Keywords: Stochastic Heavy Ball, Heavy Ball method, Heavy Ball, machine learning tasks, solving large-scale stochastic
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 39 pages, 20 Figures

Abstract:Stochastic gradient descent with momentum, also known as Stochastic Heavy Ball method (SHB), is one of the most popular algorithms for solving large-scale stochastic optimization problems in various machine learning tasks. In practical scenarios, tuning the step-size and momentum parameters of the method is a prohibitively expensive and time-consuming process. In this work, inspired by the recent advantages of stochastic Polyak step-size in the performance of stochastic gradient descent (SGD), we propose and explore new Polyak-type variants suitable for the update rule of the SHB method. In particular, using the Iterate Moving Average (IMA) viewpoint of SHB, we propose and analyze three novel step-size selections: MomSPS$_{\max}$, MomDecSPS, and MomAdaSPS. For MomSPS$_{\max}$, we provide convergence guarantees for SHB to a neighborhood of the solution for convex and smooth problems (without assuming interpolation). If interpolation is also satisfied, then using MomSPS$_{\max}$, SHB converges to the true solution at a fast rate matching the deterministic HB. The other two variants, MomDecSPS and MomAdaSPS, are the first adaptive step-sizes for SHB that guarantee convergence to the exact minimizer without prior knowledge of the problem parameters and without assuming interpolation. The convergence analysis of SHB is tight and obtains the convergence guarantees of SGD with stochastic Polyak step-sizes as a special case. We supplement our analysis with experiments that validate the theory and demonstrate the effectiveness and robustness of the new algorithms.
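
The abstract does not state the step-size formulas, so the following is only a sketch of the general idea: a stochastic heavy ball update whose step is a capped Polyak step scaled by `(1 - beta)`. That step rule, the function name `shb_polyak`, and the parameters `c` and `gamma_b` are assumptions made for illustration and may differ from the paper's MomSPS$_{\max}$ definition. The test problem is a small consistent least-squares system, so interpolation holds and each per-sample optimum is zero.

```python
import random

def shb_polyak(A, b, beta=0.2, c=1.0, gamma_b=10.0, steps=2000, seed=0):
    """Stochastic heavy ball on f_i(x) = 0.5 * (a_i . x - b_i)^2 with a capped Polyak step."""
    rng = random.Random(seed)
    dim = len(A[0])
    x_prev = [0.0] * dim
    x = [0.0] * dim
    for _ in range(steps):
        i = rng.randrange(len(A))
        resid = sum(aj * xj for aj, xj in zip(A[i], x)) - b[i]
        loss = 0.5 * resid * resid            # f_i(x); here f_i^* = 0 (interpolation)
        grad = [resid * aj for aj in A[i]]
        g2 = sum(g * g for g in grad)
        # Assumed step rule: (1 - beta) * min(Polyak step, gamma_b).
        step = (1.0 - beta) * min(loss / (c * g2), gamma_b) if g2 > 0.0 else 0.0
        # Heavy ball update: gradient step plus momentum on the previous displacement.
        x_next = [xj - step * gj + beta * (xj - pj)
                  for xj, gj, pj in zip(x, grad, x_prev)]
        x_prev, x = x, x_next
    return x

if __name__ == "__main__":
    A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    x_star = [1.0, -2.0]                      # hypothetical ground truth
    b = [sum(aj * xj for aj, xj in zip(row, x_star)) for row in A]
    print(shb_polyak(A, b))
```

No learning rate is tuned by hand here: the step adapts to the current loss and gradient norm, which is the practical appeal of Polyak-type rules.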

[LG-172] A Large-Scale Neutral Comparison Study of Survival Models on Low-Dimensional Data

Link: https://arxiv.org/abs/2406.04098
Authors: Lukas Burk, John Zobolas, Bernd Bischl, Andreas Bender, Marvin N. Wright, Raphael Sonabend
Keywords: large-scale neutral benchmark, neutral benchmark experiment, benchmark experiment focused, focused on single-event, work presents
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 42 pages, 28 figures

Abstract:This work presents the first large-scale neutral benchmark experiment focused on single-event, right-censored, low-dimensional survival data. Benchmark experiments are essential in methodological research to scientifically compare new and existing model classes through proper empirical evaluation. Existing benchmarks in the survival literature are often narrow in scope, focusing, for example, on high-dimensional data. Additionally, they may lack appropriate tuning or evaluation procedures, or are qualitative reviews, rather than quantitative comparisons. This comprehensive study aims to fill the gap by neutrally evaluating a broad range of methods and providing generalizable conclusions. We benchmark 18 models, ranging from classical statistical approaches to many common machine learning methods, on 32 publicly available datasets. The benchmark tunes for both a discrimination measure and a proper scoring rule to assess performance in different settings. Evaluating on 8 survival metrics, we assess discrimination, calibration, and overall predictive performance of the tested models. Using discrimination measures, we find that no method significantly outperforms the Cox model. However, (tuned) Accelerated Failure Time models were able to achieve significantly better results with respect to overall predictive performance as measured by the right-censored log-likelihood. Machine learning methods that performed comparably well include Oblique Random Survival Forests under discrimination, and Cox-based likelihood-boosting under overall predictive performance. We conclude that for predictive purposes in the standard survival analysis setting of low-dimensional, right-censored data, the Cox Proportional Hazards model remains a simple and robust method, sufficient for practitioners.

[LG-173] Dynamic angular synchronization under smoothness constraints

Link: https://arxiv.org/abs/2406.04071
Authors: Ernesto Araya, Mihai Cucuringu, Hemant Tyagi
Keywords: classical angular synchronization, recovering unknown angles, theta, synchronization problem consists, angular synchronization problem
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
Comments: 40 pages, 9 figures

Abstract:Given an undirected measurement graph $\mathcal{H} = ([n], \mathcal{E})$, the classical angular synchronization problem consists of recovering unknown angles $\theta_1^*,\dots,\theta_n^*$ from a collection of noisy pairwise measurements of the form $(\theta_i^* - \theta_j^*) \mod 2\pi$, for all $\{i,j\} \in \mathcal{E}$. This problem arises in a variety of applications, including computer vision, time synchronization of distributed networks, and ranking from pairwise comparisons. In this paper, we consider a dynamic version of this problem where the angles, and also the measurement graphs evolve over $T$ time points. Assuming a smoothness condition on the evolution of the latent angles, we derive three algorithms for joint estimation of the angles over all time points. Moreover, for one of the algorithms, we establish non-asymptotic recovery guarantees for the mean-squared error (MSE) under different statistical models. In particular, we show that the MSE converges to zero as $T$ increases under milder conditions than in the static setting. This includes the setting where the measurement graphs are highly sparse and disconnected, and also when the measurement noise is large and can potentially increase with $T$. We complement our theoretical results with experiments on synthetic data.
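
For orientation, the static problem described in the first sentence can be approached with the classical spectral relaxation: build a Hermitian matrix whose entries are the unit complex numbers of the measured offsets and read the angles off its leading eigenvector. The sketch below is that textbook static method, not one of the paper's three dynamic algorithms; the function names, the example graph, and the angles are illustrative assumptions.

```python
import cmath
import math
import random

def spectral_sync(n, measurements, iters=500, seed=0):
    """Estimate n angles (up to a global shift) from delta_ij ~ theta_i - theta_j (mod 2*pi)."""
    H = [[0j] * n for _ in range(n)]
    degree = [0] * n
    for (i, j), delta in measurements.items():
        H[i][j] = cmath.exp(1j * delta)
        H[j][i] = cmath.exp(-1j * delta)   # Hermitian by construction
        degree[i] += 1
        degree[j] += 1
    shift = max(degree)                    # H + shift*I is positive semidefinite (Gershgorin)
    rng = random.Random(seed)
    v = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n)]
    for _ in range(iters):                 # power iteration for the leading eigenvector
        w = [shift * v[i] + sum(H[i][k] * v[k] for k in range(n)) for i in range(n)]
        norm = math.sqrt(sum(abs(z) ** 2 for z in w))
        v = [z / norm for z in w]
    return [cmath.phase(z) % (2 * math.pi) for z in v]

def max_alignment_error(estimate, truth):
    """Largest deviation after removing the global shift fixed by node 0."""
    offset = estimate[0] - truth[0]
    return max(abs(cmath.exp(1j * (e - t - offset)) - 1) for e, t in zip(estimate, truth))

if __name__ == "__main__":
    theta = [0.3, 1.7, 2.9, 4.0, 5.1, 6.0]                      # hypothetical true angles
    edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (0, 3)]
    noiseless = {(i, j): (theta[i] - theta[j]) % (2 * math.pi) for i, j in edges}
    print(max_alignment_error(spectral_sync(6, noiseless), theta))
```

The angles are only identifiable up to a common rotation, which is why the comparison removes a global offset first. With noiseless measurements on a connected graph the recovery is exact up to numerical precision; with noise the leading eigenvector gives an estimate rather than the truth.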

[LG-174] Slicing Mutual Information Generalization Bounds for Neural Networks

Link: https://arxiv.org/abs/2406.04047
Authors: Kimia Nadjahi, Kristjan Greenewald, Rickard Brüel Gabrielsson, Justin Solomon
Keywords: unseen data, training data, input-output mutual information, learned hypothesis, ability of machine
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: Accepted at ICML 2024

Abstract:The ability of machine learning (ML) algorithms to generalize well to unseen data has been studied through the lens of information theory, by bounding the generalization error with the input-output mutual information (MI), i.e., the MI between the training data and the learned hypothesis. Yet, these bounds have limited practicality for modern ML applications (e.g., deep learning), due to the difficulty of evaluating MI in high dimensions. Motivated by recent findings on the compressibility of neural networks, we consider algorithms that operate by slicing the parameter space, i.e., trained on random lower-dimensional subspaces. We introduce new, tighter information-theoretic generalization bounds tailored for such algorithms, demonstrating that slicing improves generalization. Our bounds offer significant computational and statistical advantages over standard MI bounds, as they rely on scalable alternative measures of dependence, i.e., disintegrated mutual information and $k$-sliced mutual information. Then, we extend our analysis to algorithms whose parameters do not need to exactly lie on random subspaces, by leveraging rate-distortion theory. This strategy yields generalization bounds that incorporate a distortion term measuring model compressibility under slicing, thereby tightening existing bounds without compromising performance or requiring model compression. Building on this, we propose a regularization scheme enabling practitioners to control generalization through compressibility. Finally, we empirically validate our results and achieve the computation of non-vacuous information-theoretic generalization bounds for neural networks, a task that was previously out of reach.

[LG-175] Variational inference Mixture of Gaussians Bayesian Machine Learning

Link: https://arxiv.org/abs/2406.04012
Authors: Tom Huix, Anna Korba, Alain Durmus, Eric Moulines
Keywords: approach in Bayesian, parametric family, Bayesian inference, minimizing a loss, popular approach
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Variational inference (VI) is a popular approach in Bayesian inference that looks for the best approximation of the posterior distribution within a parametric family, minimizing a loss that is typically the (reverse) Kullback-Leibler (KL) divergence. Despite its empirical success, the theoretical properties of VI have only received attention recently, and mostly when the parametric family is the one of Gaussians. This work aims to contribute to the theoretical study of VI in the non-Gaussian case by investigating the setting of Mixture of Gaussians with fixed covariance and constant weights. In this view, VI over this specific family can be cast as the minimization of a mollified relative entropy, i.e. the KL between the convolution (with respect to a Gaussian kernel) of an atomic measure supported on Diracs, and the target distribution. The support of the atomic measure corresponds to the localization of the Gaussian components. Hence, solving variational inference becomes equivalent to optimizing the positions of the Diracs (the particles), which can be done through gradient descent and takes the form of an interacting particle system. We study two sources of error of variational inference in this context when optimizing the mollified relative entropy. The first one is an optimization result, that is, a descent lemma establishing that the algorithm decreases the objective at each iteration. The second one is an approximation error that upper bounds the objective between an optimal finite mixture and the target distribution.

[LG-176] Statistical Multicriteria Benchmarking via the GSD-Front

Link: https://arxiv.org/abs/2406.03924
Authors: Christoph Jansen (1), Georg Schollmeyer (2), Julian Rodemann (2), Hannah Blocher (2), Thomas Augustin (2) ((1) Lancaster University Leipzig, (2) Ludwig-Maximilians-Universität München)
Keywords: reliable methods, increasingly important, vast number, methods for comparing, Comparisons
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
Comments: CJ, GS, JR and HB equally contributed to this work

Abstract:Given the vast number of classifiers that have been (and continue to be) proposed, reliable methods for comparing them are becoming increasingly important. The desire for reliability is broken down into three main aspects: (1) Comparisons should allow for different quality metrics simultaneously. (2) Comparisons should take into account the statistical uncertainty induced by the choice of benchmark suite. (3) The robustness of the comparisons under small deviations in the underlying assumptions should be verifiable. To address (1), we propose to compare classifiers using a generalized stochastic dominance ordering (GSD) and present the GSD-front as an information-efficient alternative to the classical Pareto-front. For (2), we propose a consistent statistical estimator for the GSD-front and construct a statistical test for whether a (potentially new) classifier lies in the GSD-front of a set of state-of-the-art classifiers. For (3), we relax our proposed test using techniques from robust statistics and imprecise probabilities. We illustrate our concepts on the benchmark suite PMLB and on the platform OpenML.

[LG-177] Data-Centric Label Smoothing for Explainable Glaucoma Screening from Eye Fundus Images

Link: https://arxiv.org/abs/2406.03903
Authors: Adrian Galdran, Miguel A. González Ballester
Keywords: advanced optimization strategies, modern machine learning, current computing capabilities, computer vision system, vision system tend
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted to ISBI 2024 (Challenges), 2nd position in the JustRAIGS challenge (this https URL)

Abstract:As current computing capabilities increase, modern machine learning and computer vision systems tend to increase in complexity, mostly by means of larger models and advanced optimization strategies. Although often neglected, in many problems there is also much to be gained by considering potential improvements in understanding and better leveraging already-available training data, including annotations. This so-called data-centric approach can lead to substantial performance increases, sometimes beyond what can be achieved by larger models. In this paper we adopt such an approach for the task of justifiable glaucoma screening from retinal images. In particular, we focus on how to combine information from multiple annotators of different skills into a tailored label smoothing scheme that allows us to better employ a large collection of fundus images, instead of discarding samples suffering from inter-rater variability. Internal validation results indicate that our bespoke label smoothing approach surpasses the performance of a standard resnet50 model and also the same model trained with conventional label smoothing techniques, in particular for the multi-label scenario of predicting clinical reasons of glaucoma likelihood in a highly imbalanced screening context. Our code is made available at this http URL.

[LG-178] Polyp and Surgical Instrument Segmentation with Double Encoder-Decoder Networks

Link: https://arxiv.org/abs/2406.03901
Authors: Adrian Galdran
Keywords: MedAI competition, endoscopic images, paper describes, describes a solution, participants were required
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:This paper describes a solution for the MedAI competition, in which participants were required to segment both polyps and surgical instruments from endoscopic images. Our approach relies on a double encoder-decoder neural network which we have previously applied for polyp segmentation, but with a series of enhancements: a more powerful encoder architecture, an improved optimization procedure, and the post-processing of segmentations based on tempered model ensembling. Experimental results show that our method produces segmentations that show a good agreement with manual delineations provided by medical experts.

[LG-179] Data-driven discovery of self-similarity using neural networks

Link: https://arxiv.org/abs/2406.03896
Authors: Ryota Watanabe, Takanori Ishii, Yuji Hirono, Hirokazu Maruoka
Keywords: Finding self-similarity, key step, step for understanding, complex physical phenomena, understanding the governing
Categories: Soft Condensed Matter (cond-mat.soft); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG)
Comments: 21 pages, 15 figures, 5 tables

Abstract:Finding self-similarity is a key step for understanding the governing law behind complex physical phenomena. Traditional methods for identifying self-similarity often rely on specific models, which can introduce significant bias. In this paper, we present a novel neural network-based approach that discovers self-similarity directly from observed data, without presupposing any models. The presence of self-similar solutions in a physical problem signals that the governing law contains a function whose arguments are given by power-law monomials of physical parameters, which are characterized by power-law exponents. The basic idea is to enforce such particular forms structurally in a neural network in a parametrized way. We train the neural network model using the observed data, and when the training is successful, we can extract the power exponents that characterize scale-transformation symmetries of the physical problem. We demonstrate the effectiveness of our method with both synthetic and experimental data, validating its potential as a robust, model-independent tool for exploring self-similarity in complex systems.

[LG-180] Spherinator and HiPSter: Representation Learning for Unbiased Knowledge Discovery from Simulations

Link: https://arxiv.org/abs/2406.03810
Authors: Kai L. Polsterer, Bernd Doser, Andreas Fehlner, Sebastian Trujillo-Gomez
Keywords: astrophysics and cosmology, approximation to experimental, experimental laboratories, laboratories in astrophysics, outputs severely limit
Categories: Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
Comments: 4 pages, 1 figure

Abstract:Simulations are the best approximation to experimental laboratories in astrophysics and cosmology. However, the complexity, richness, and large size of their outputs severely limit the interpretability of their predictions. We describe a new, unbiased, and machine learning based approach to obtaining useful scientific insights from a broad range of simulations. The method can be used on today’s largest simulations and will be essential to solve the extreme data exploration and analysis challenges posed by the Exascale era. Furthermore, this concept is so flexible, that it will also enable explorative access to observed data. Our concept is based on applying nonlinear dimensionality reduction to learn compact representations of the data in a low-dimensional space. The simulation data is projected onto this space for interactive inspection, visual interpretation, sample selection, and local analysis. We present a prototype using a rotational invariant hyperspherical variational convolutional autoencoder, utilizing a power distribution in the latent space, and trained on galaxies from IllustrisTNG simulation. Thereby, we obtain a natural Hubble tuning fork like similarity space that can be visualized interactively on the surface of a sphere by exploiting the power of HiPS tilings in Aladin Lite.

[LG-181] Projection-Free Variance Reduction Methods for Stochastic Constrained Multi-Level Compositional Optimization

Link: https://arxiv.org/abs/2406.03787
Authors: Wei Jiang, Sifan Yang, Wenhao Yang, Yibo Wang, Yuanyu Wan, Lijun Zhang
Keywords: constrained multi-level optimization, stochastic constrained multi-level, paper investigates projection-free, multi-level optimization, paper investigates
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG)

Abstract:This paper investigates projection-free algorithms for stochastic constrained multi-level optimization. In this context, the objective function is a nested composition of several smooth functions, and the decision set is closed and convex. Existing projection-free algorithms for solving this problem suffer from two limitations: 1) they solely focus on the gradient mapping criterion and fail to match the optimal sample complexities in unconstrained settings; 2) their analysis is exclusively applicable to non-convex functions, without considering convex and strongly convex objectives. To address these issues, we introduce novel projection-free variance reduction algorithms and analyze their complexities under different criteria. For gradient mapping, our complexities improve existing results and match the optimal rates for unconstrained problems. For the widely-used Frank-Wolfe gap criterion, we provide theoretical guarantees that align with those for single-level problems. Additionally, by using a stage-wise adaptation, we further obtain complexities for convex and strongly convex functions. Finally, numerical experiments on different tasks demonstrate the effectiveness of our methods.
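
As background for the terms "projection-free" and "Frank-Wolfe gap", the sketch below runs the basic deterministic, single-level Frank-Wolfe method on the probability simplex. It is not the paper's variance-reduced multi-level algorithm; the objective, the `TARGET` point, and the function names are illustrative assumptions. Each iteration calls a linear minimization oracle (here, picking the vertex with the smallest gradient coordinate) instead of projecting, and the Frank-Wolfe gap is the inner product of the gradient with the difference between the current iterate and that vertex.

```python
def frank_wolfe_simplex(grad, x0, steps=2000):
    """Frank-Wolfe on the probability simplex; returns the last iterate and the smallest gap seen."""
    x = list(x0)
    best_gap = float("inf")
    for t in range(steps):
        g = grad(x)
        k = min(range(len(x)), key=lambda j: g[j])          # LMO: best vertex e_k
        gap = sum(gj * xj for gj, xj in zip(g, x)) - g[k]    # <grad, x - e_k> >= 0
        best_gap = min(best_gap, gap)
        eta = 2.0 / (t + 2)                                  # standard open-loop step size
        x = [(1.0 - eta) * xj for xj in x]
        x[k] += eta                                          # convex combination stays feasible
    return x, best_gap

TARGET = [0.2, 0.5, 0.3]   # hypothetical target inside the simplex

def objective(x):
    """Squared distance to TARGET, a smooth convex test function."""
    return sum((xj - pj) ** 2 for xj, pj in zip(x, TARGET))

def gradient(x):
    return [2.0 * (xj - pj) for xj, pj in zip(x, TARGET)]

if __name__ == "__main__":
    x, gap = frank_wolfe_simplex(gradient, [1.0, 0.0, 0.0])
    print(x, objective(x), gap)
```

The gap is zero exactly at a constrained stationary point and can be computed from quantities the method already has, which is why it is a common convergence criterion for this family of algorithms.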

[LG-182] Privacy Preserving Semi-Decentralized Mean Estimation over Intermittently-Connected Networks

Link: https://arxiv.org/abs/2406.03766
Authors: Rajarshi Saha, Mohamed Seif, Michal Yemini, Andrea J. Goldsmith, H. Vincent Poor
Keywords: unreliable wireless network, problem of privately, privately estimating, vectors distributed, unreliable wireless
Categories: Signal Processing (eess.SP); Distributed, Parallel, and Cluster Computing (cs.DC); Information Theory (cs.IT); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: 14 pages, 6 figures. arXiv admin note: text overlap with arXiv:2303.00035

Abstract:We consider the problem of privately estimating the mean of vectors distributed across different nodes of an unreliable wireless network, where communications between nodes can fail intermittently. We adopt a semi-decentralized setup, wherein to mitigate the impact of intermittently connected links, nodes can collaborate with their neighbors to compute a local consensus, which they relay to a central server. In such a setting, the communications between any pair of nodes must ensure that the privacy of the nodes is rigorously maintained to prevent unauthorized information leakage. We study the tradeoff between collaborative relaying and privacy leakage due to the data sharing among nodes and, subsequently, propose PriCER: Private Collaborative Estimation via Relaying – a differentially private collaborative algorithm for mean estimation to optimize this tradeoff. The privacy guarantees of PriCER arise (i) implicitly, by exploiting the inherent stochasticity of the flaky network connections, and (ii) explicitly, by adding Gaussian perturbations to the estimates exchanged by the nodes. Local and central privacy guarantees are provided against eavesdroppers who can observe different signals, such as the communications amongst nodes during local consensus and (possibly multiple) transmissions from the relays to the central server. We substantiate our theoretical findings with numerical simulations. Our implementation is available at this https URL.

[LG-183] Discrete error dynamics of mini-batch gradient descent for least squares regression

Link: https://arxiv.org/abs/2406.03696
Authors: Jackie Lok, Rishi Sonthalia, Elizaveta Rebrova
Keywords: mini-batch gradient descent, gradient descent, full-batch gradient descent, mini-batch gradient, sampling without replacement
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: 26 pages

Abstract:We study the discrete dynamics of mini-batch gradient descent for least squares regression when sampling without replacement. We show that the dynamics and generalization error of mini-batch gradient descent depend on a sample cross-covariance matrix $Z$ between the original features $X$ and a set of new features $\widetilde{X}$, in which each feature is modified by the mini-batches that appear before it during the learning process in an averaged way. Using this representation, we rigorously establish that the dynamics of mini-batch and full-batch gradient descent agree up to leading order with respect to the step size using the linear scaling rule. We also study discretization effects that a continuous-time gradient flow analysis cannot detect, and show that mini-batch gradient descent converges to a step-size dependent solution, in contrast with full-batch gradient descent. Finally, we investigate the effects of batching, assuming a random matrix model, by using tools from free probability theory to numerically compute the spectrum of $Z$.
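
The leading-order agreement under the linear scaling rule can be checked numerically on a toy problem. The sketch below (the data `X`, `Y` and all function names are illustrative assumptions) runs one epoch of mini-batch gradient descent without replacement with step size `alpha`, compares it with a single full-batch step of size `alpha` times the number of batches, and returns the distance between the two results. Because the first-order terms coincide, that distance should scale like `alpha` squared.

```python
import math
import random

# Hypothetical toy regression data (6 samples, 2 features).
X = [[1.0, 0.5], [0.3, -1.0], [2.0, 1.0], [-1.0, 0.7], [0.5, 0.5], [1.5, -0.2]]
Y = [1.0, 0.0, 2.0, -1.0, 0.5, 1.0]

def batch_gradient(w, idx):
    """Gradient of the mean squared loss (1 / (2|idx|)) * sum_i (x_i . w - y_i)^2."""
    g = [0.0] * len(w)
    for i in idx:
        resid = sum(xj * wj for xj, wj in zip(X[i], w)) - Y[i]
        for j in range(len(w)):
            g[j] += resid * X[i][j] / len(idx)
    return g

def minibatch_epoch(w, batches, alpha):
    """One pass over disjoint batches, so each sample is used exactly once."""
    for idx in batches:
        g = batch_gradient(w, idx)
        w = [wj - alpha * gj for wj, gj in zip(w, g)]
    return w

def fullbatch_step(w, alpha, num_batches):
    """One full-batch step with the step size scaled linearly by the number of batches."""
    g = batch_gradient(w, range(len(X)))
    return [wj - num_batches * alpha * gj for wj, gj in zip(w, g)]

def epoch_gap(alpha, batch_size=2, seed=0):
    """Euclidean distance between the two updates after one epoch, starting from zero."""
    order = list(range(len(X)))
    random.Random(seed).shuffle(order)
    batches = [order[k:k + batch_size] for k in range(0, len(order), batch_size)]
    w0 = [0.0, 0.0]
    w_mini = minibatch_epoch(w0, batches, alpha)
    w_full = fullbatch_step(w0, alpha, len(batches))
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(w_mini, w_full)))

if __name__ == "__main__":
    for alpha in (0.004, 0.002, 0.001):
        print(alpha, epoch_gap(alpha))
```

Halving `alpha` should shrink the printed gap by roughly a factor of four. The remaining second-order discrepancy depends on the batch order, which is the kind of discretization effect the abstract says a continuous-time analysis cannot detect.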

[LG-184] A Hybrid Deep Learning Classification of Perimetric Glaucoma Using Peripapillary Nerve Fiber Layer Reflectance and Other OCT Parameters from Three Anatomy Regions

Link: https://arxiv.org/abs/2406.03663
Authors: Ou Tan, David S. Greenfield, Brian A. Francis, Rohit Varma, Joel S. Schuman, David Huang, Dongseok Choi
Keywords: hybrid deep learning, deep learning model, deep learning, NFL reflectance, hybrid deep
Categories: Image and Video Processing (eess.IV); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments: 12 pages

Abstract:Precis: A hybrid deep-learning model combines NFL reflectance and other OCT parameters to improve glaucoma diagnosis. Objective: To investigate if a deep learning model could be used to combine nerve fiber layer (NFL) reflectance and other OCT parameters for glaucoma diagnosis. Patients and Methods: This is a prospective observational study of 106 normal subjects and 164 perimetric glaucoma (PG) patients. Peripapillary NFL reflectance map, NFL thickness map, optic head analysis of disc, and macular ganglion cell complex thickness were obtained using spectral domain OCT. A hybrid deep learning model combined a fully connected network (FCN) and a convolution neural network (CNN) to develop and combine those OCT maps and parameters to distinguish normal and PG eyes. Two deep learning models were compared based on whether the NFL reflectance map was used as part of the input or not. Results: The hybrid deep learning model with reflectance achieved 0.909 sensitivity at 99% specificity and 0.926 at 95%. The overall accuracy was 0.948 with 0.893 sensitivity and 1.000 specificity, and the AROC was 0.979, which is significantly better than the logistic regression models (p < 0.001). The second best model is the hybrid deep learning model without reflectance, which also had significantly higher AROC than logistic regression models (p < 0.001). The logistic regression model with reflectance had slightly higher AROC or sensitivity than the logistic regression model without reflectance (p = 0.024). Conclusions: The hybrid deep learning model significantly improved the diagnostic accuracy, with or without NFL reflectance. A hybrid deep learning model combining reflectance, NFL thickness, GCC thickness, and ONH parameters may be a practical model for glaucoma screening purposes.

[LG-185] Equivalence Set Restricted Latent Class Models (ESRLCM)

Link: https://arxiv.org/abs/2406.03653
Authors: Jesse Bowers, Steve Culpepper
Keywords: multivariate categorical data, Restricted Latent Class, Latent Class Models, Latent Class, Set Restricted Latent
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
Comments: 43 pages, 10 tables, 1 figure

Abstract:Latent Class Models (LCMs) are used to cluster multivariate categorical data, commonly used to interpret survey responses. We propose a novel Bayesian model called the Equivalence Set Restricted Latent Class Model (ESRLCM). This model identifies clusters who have common item response probabilities, and does so more generically than traditional restricted latent attribute models. We verify the identifiability of ESRLCMs, and demonstrate the effectiveness in both simulations and real-world applications.

[LG-186] Ensembling Portfolio Strategies for Long-Term Investments: A Distribution-Free Preference Framework for Decision-Making and Algorithms

Link: https://arxiv.org/abs/2406.03652
Authors: Duy Khanh Lam
Keywords: outperform individual strategies, ensembling multiple strategies, paper investigates, investigates the problem, problem of ensembling
Categories: Portfolio Management (q-fin.PM); Information Theory (cs.IT); Machine Learning (cs.LG); Computational Finance (q-fin.CP)
Comments: 25 pages, 12 figures, 3 tables, working paper

Abstract:This paper investigates the problem of ensembling multiple strategies for sequential portfolios to outperform individual strategies in terms of long-term wealth. Due to the uncertainty of strategies’ performances in the future market, which are often based on specific models and statistical assumptions, investors often mitigate risk and enhance robustness by combining multiple strategies, akin to common approaches in collective learning prediction. However, the absence of a distribution-free and consistent preference framework complicates decisions of combination due to the ambiguous objective. To address this gap, we introduce a novel framework for decision-making in combining strategies, irrespective of market conditions, by establishing the investor’s preference between decisions and then forming a clear objective. Through this framework, we propose a combinatorial strategy construction, free from statistical assumptions, for any scale of component strategies, even infinite, such that it meets the determined criterion. Finally, we test the proposed strategy along with its accelerated variant and some other multi-strategies. The numerical experiments show results in favor of the proposed strategies, albeit with small tradeoffs in their Sharpe ratios, in which their cumulative wealths eventually exceed those of the best component strategies while the accelerated strategy significantly improves performance.

[LG-187] Style Mixture of Experts for Expressive Text-To-Speech Synthesis

Link: https://arxiv.org/abs/2406.03637
Authors: Ahad Jawaid, Shreeram Suresh Chandra, Junchen Lu, Berrak Sisman
Keywords: Recent advances, style transfer TTS, style, improved the expressiveness, expressiveness of synthesized
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)

Abstract:Recent advances in style transfer text-to-speech (TTS) have improved the expressiveness of synthesized speech. Despite these advancements, encoding stylistic information from diverse and unseen reference speech remains challenging. This paper introduces StyleMoE, an approach that divides the embedding space, modeled by the style encoder, into tractable subsets handled by style experts. The proposed method replaces the style encoder in a TTS system with a Mixture of Experts (MoE) layer. By utilizing a gating network to route reference speeches to different style experts, each expert specializes in aspects of the style space during optimization. Our experiments objectively and subjectively demonstrate the effectiveness of our proposed method in increasing the coverage of the style space for diverse and unseen styles. This approach can enhance the performance of existing state-of-the-art style transfer TTS models, marking the first study of MoE in style transfer TTS to our knowledge.

[LG-188] Synthetic Oversampling: Theory and A Practical Approach Using LLMs to Address Data Imbalance

Link: https://arxiv.org/abs/2406.03628
Authors: Ryumei Nakada, Yichen Xu, Lexin Li, Linjun Zhang
Keywords: machine learning, synthetic data, data, high-quality synthetic data
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 59 pages, 7 figures

Abstract:Imbalanced data and spurious correlations are common challenges in machine learning and data science. Oversampling, which artificially increases the number of instances in the underrepresented classes, has been widely adopted to tackle these challenges. In this article, we introduce OPAL (OversamPling with Artificial LLM-generated data), a systematic oversampling approach that leverages the capabilities of large language models (LLMs) to generate high-quality synthetic data for minority groups. Recent studies on synthetic data generation using deep generative models mostly target prediction tasks. Our proposal differs in that we focus on handling imbalanced data and spurious correlations. More importantly, we develop a novel theory that rigorously characterizes the benefits of using the synthetic data, and shows the capacity of transformers in generating high-quality synthetic data for both labels and covariates. We further conduct intensive numerical experiments to demonstrate the efficacy of our proposed approach compared to some representative alternative solutions.

[LG-189] BEACON: A Bayesian Optimization Strategy for Novelty Search in Expensive Black-Box Systems

Link: https://arxiv.org/abs/2406.03616
Authors: Wei-Ting Tang, Ankush Chakrabarty, Joel A. Paulson
Keywords: automatically uncover diverse, simulations or experiments, automatically uncover, uncover diverse system, neural architecture search
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Abstract:Novelty search (NS) refers to a class of exploration algorithms that automatically uncover diverse system behaviors through simulations or experiments. Systematically obtaining diverse outcomes is a key component in many real-world design problems such as material and drug discovery, neural architecture search, reinforcement learning, and robot navigation. Since the relationship between the inputs and outputs (i.e., behaviors) of these complex systems is typically not available in closed form, NS requires a black-box perspective. Consequently, popular NS algorithms rely on evolutionary optimization and other meta-heuristics that require intensive sampling of the input space, which is impractical when the system is expensive to evaluate. We propose a Bayesian optimization inspired algorithm for sample-efficient NS that is specifically designed for such expensive black-box systems. Our approach models the input-to-behavior mapping with multi-output Gaussian processes (MOGP) and selects the next point to evaluate by maximizing a novelty metric that depends on a posterior sample drawn from the MOGP that promotes both exploration and exploitation. By leveraging advances in efficient posterior sampling and high-dimensional Gaussian process modeling, we discuss how our approach can be made scalable with respect to both amount of data and number of inputs. We test our approach on ten synthetic benchmark problems and eight real-world problems (with up to 2133 inputs) including new applications such as discovery of diverse metal organic frameworks for use in clean energy technology. We show that our approach greatly outperforms existing NS algorithms by finding substantially larger sets of diverse behaviors under limited sample budgets.

[LG-190] A New Branch-and-Bound Pruning Framework for $\ell_0$-Regularized Problems

Link: https://arxiv.org/abs/2406.03504
Authors: Theo Guyard, Cédric Herzet, Clément Elvira, Ayşe-Nur Arslan
Keywords: learning problems involving, resolution of learning, problems involving, pruning tests, learning problems
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG)
Comments:

Abstract:We consider the resolution of learning problems involving $\ell_0$-regularization via Branch-and-Bound (BnB) algorithms. These methods explore regions of the feasible space of the problem and check, through “pruning tests”, whether they contain no solutions. In standard implementations, evaluating a pruning test requires solving a convex optimization problem, which may result in computational bottlenecks. In this paper, we present an alternative way to implement pruning tests for a generic family of $\ell_0$-regularized problems. Our proposed procedure allows the simultaneous assessment of several regions and can be embedded in standard BnB implementations with a negligible computational overhead. We show through numerical simulations that our pruning strategy can improve the solving time of BnB procedures by several orders of magnitude for typical problems encountered in machine-learning applications.

[LG-191] Graphon Mean Field Games with a Representative Player: Analysis and Learning Algorithm

Link: https://arxiv.org/abs/2405.08005
Authors: Fuzhong Zhou, Chenyu Zhang, Xu Chen, Xuan Di
Keywords: discrete time graphon, study stochastic games, interaction among agents, time graphon game, propose a discrete
Categories: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Published as a conference paper at ICML 2024

Abstract:We propose a discrete time graphon game formulation on continuous state and action spaces using a representative player to study stochastic games with heterogeneous interaction among agents. This formulation admits both philosophical and mathematical advantages, compared to a widely adopted formulation using a continuum of players. We prove the existence and uniqueness of the graphon equilibrium with mild assumptions, and show that this equilibrium can be used to construct an approximate solution for finite player game on networks, which is challenging to analyze and solve due to curse of dimensionality. An online oracle-free learning algorithm is developed to solve the equilibrium numerically, and sample complexity analysis is provided for its convergence.

Information Retrieval

[IR-0] PaCE: Parsimonious Concept Engineering for Large Language Models

Link: https://arxiv.org/abs/2406.04331
Authors: Jinqi Luo, Tianjiao Ding, Kwan Ho Ryan Chan, Darshan Thaker, Aditya Chattopadhyay, Chris Callison-Burch, René Vidal
Keywords: Large Language Models, Large Language, wide variety, Large, Alignment
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: 26 pages, 17 figures, 5 tables, dataset and code at this https URL

Abstract:Large Language Models (LLMs) are being used for a wide variety of tasks. While they are capable of generating human-like responses, they can also produce undesirable output including potentially harmful information, racist or sexist language, and hallucinations. Alignment methods are designed to reduce such undesirable output, via techniques such as fine-tuning, prompt engineering, and representation engineering. However, existing methods face several challenges: some require costly fine-tuning for every alignment task; some do not adequately remove undesirable concepts, failing alignment; some remove benign concepts, lowering the linguistic capabilities of LLMs. To address these issues, we propose Parsimonious Concept Engineering (PaCE), a novel activation engineering framework for alignment. First, to sufficiently model the concepts, we construct a large-scale concept dictionary in the activation space, in which each atom corresponds to a semantic concept. Then, given any alignment task, we instruct a concept partitioner to efficiently annotate the concepts as benign or undesirable. Finally, at inference time, we decompose the LLM activations along the concept dictionary via sparse coding, to accurately represent the activation as a linear combination of the benign and undesirable components. By removing the latter ones from the activation, we reorient the behavior of LLMs towards alignment goals. We conduct experiments on tasks such as response detoxification, faithfulness enhancement, and sentiment revising, and show that PaCE achieves state-of-the-art alignment performance while maintaining linguistic capabilities.

[IR-1] Measuring and Addressing Indexical Bias in Information Retrieval

Link: https://arxiv.org/abs/2406.04298
Authors: Caleb Ziems, William Held, Jane Dwivedi-Yu, Diyi Yang
Keywords: Information Retrieval, deliver relevant content, relevant content, rankings for fairness, balance of ideas
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: ACL 2024

Abstract:Information Retrieval (IR) systems are designed to deliver relevant content, but traditional systems may not optimize rankings for fairness, neutrality, or the balance of ideas. Consequently, IR can often introduce indexical biases, or biases in the positional order of documents. Although indexical bias can demonstrably affect people’s opinion, voting patterns, and other behaviors, these issues remain understudied as the field lacks reliable metrics and procedures for automatically measuring indexical bias. Towards this end, we introduce the PAIR framework, which supports automatic bias audits for ranked documents or entire IR systems. After introducing DUO, the first general-purpose automatic bias metric, we run an extensive evaluation of 8 IR systems on a new corpus of 32k synthetic and 4.7k natural documents, with 4k queries spanning 1.4k controversial issue topics. A human behavioral study validates our approach, showing that our bias metric can help predict when and how indexical bias will shift a reader’s opinion.

[IR-2] VISTA: Visualized Text Embedding For Universal Multi-Modal Retrieval

Link: https://arxiv.org/abs/2406.04292
Authors: Junjie Zhou, Zheng Liu, Shitao Xiao, Bo Zhao, Yongping Xiong
Keywords: popular in practice, increasingly popular, Multi-modal retrieval, Multi-modal, data
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ACL 2024 main conference

Abstract:Multi-modal retrieval is becoming increasingly popular in practice. However, the existing retrievers are mostly text-oriented and lack the capability to process visual information. Despite the presence of vision-language models like CLIP, the current methods are severely limited in representing text-only and image-only data. In this work, we present a new embedding model, VISTA, for universal multi-modal retrieval. Our work makes three technical contributions. Firstly, we introduce a flexible architecture which extends a powerful text encoder with image understanding capability by introducing visual token embeddings. Secondly, we develop two data generation strategies, which produce high-quality composed image-text data to facilitate the training of the embedding model. Thirdly, we introduce a multi-stage training algorithm, which first aligns the visual token embedding with the text encoder using massive weakly labeled data, and then develops multi-modal representation capability using the generated composed image-text data. In our experiments, VISTA achieves superior performance across a variety of multi-modal retrieval tasks in both zero-shot and supervised settings. Our model, data, and source code are available at this https URL.

[IR-3] Data Measurements for Decentralized Data Markets

Link: https://arxiv.org/abs/2406.04257
Authors: Charles Lu, Mohammad Mohammadi Amiri, Ramesh Raskar
Keywords: Decentralized data markets, machine learning, Decentralized data, markets can provide, provide more equitable
Categories: Machine Learning (cs.LG); Information Retrieval (cs.IR)
Comments: 20 pages, 11 figures

Abstract:Decentralized data markets can provide more equitable forms of data acquisition for machine learning. However, to realize practical marketplaces, efficient techniques for seller selection need to be developed. We propose and benchmark federated data measurements to allow a data buyer to find sellers with relevant and diverse datasets. Diversity and relevance measures enable a buyer to make relative comparisons between sellers without requiring intermediate brokers and training task-dependent models.

[IR-4] On The Persona-based Summarization of Domain-Specific Documents

Link: https://arxiv.org/abs/2406.03986
Authors: Ankan Mullick, Sombit Bose, Rounak Saha, Ayan Kumar Bhowmick, Pawan Goyal, Niloy Ganguly, Prasenjit Dey, Ravi Kokku
Keywords: storing information necessitates, large information repositories, complexity of consuming, ever-expanding world, increasing complexity
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:

Abstract:In an ever-expanding world of domain-specific knowledge, the increasing complexity of consuming and storing information necessitates the generation of summaries from large information repositories. However, every persona of a domain has different information requirements and hence needs a different summary. For example, in the healthcare domain, a persona-based (such as Doctor, Nurse, Patient, etc.) approach is imperative to deliver targeted medical information efficiently. Persona-based summarization of domain-specific information by humans is a high cognitive load task and is generally not preferred. The summaries generated by two different humans have high variability and do not scale in cost and subject matter expertise as domains and personas grow. Further, AI-generated summaries using generic Large Language Models (LLMs) may not necessarily offer satisfactory accuracy for different domains unless they have been specifically trained on domain-specific data and can also be very expensive to use in day-to-day operations. Our contribution in this paper is two-fold: 1) We present an approach to efficiently fine-tune a domain-specific small foundation LLM using a healthcare corpus and also show that we can effectively evaluate the summarization quality using AI-based critiquing. 2) We further show that AI-based critiquing has good concordance with human-based critiquing of the summaries. Hence, such AI-based pipelines to generate domain-specific persona-based summaries can be easily scaled to other domains such as legal, enterprise documents, education, etc. in a very efficient and cost-effective manner.

[IR-5] Beyond Similarity: Personalized Federated Recommendation with Composite Aggregation

Link: https://arxiv.org/abs/2406.03933
Authors: Honglei Zhang, Haoxuan Li, Jundong Chen, Sen Cui, Kunda Yan, Abudukelimu Wuerkaixi, Xin Zhou, Zhiqi Shen, Yidong Li
Keywords: collect global knowledge, aggregating local models, Federated recommendation aims, massive devices, ensuring privacy
Categories: Cryptography and Security (cs.CR); Information Retrieval (cs.IR)
Comments:

Abstract:Federated recommendation aims to collect global knowledge by aggregating local models from massive devices, to provide recommendations while ensuring privacy. Current methods mainly leverage aggregation functions invented by the federated vision community to aggregate parameters from similar clients, e.g., clustering aggregation. Despite their considerable performance, we argue that it is suboptimal to apply them to federated recommendation directly. This is mainly reflected in the disparate model architectures. Unlike models with structured parameters, such as convolutional neural networks in federated vision, federated recommender models usually distinguish themselves by employing a one-to-one item embedding table. Such a discrepancy induces the challenging embedding skew issue, in which aggregation continually updates the trained embeddings but ignores the non-trained ones, thus failing to predict future items accurately. To this end, we propose a personalized Federated recommendation model with Composite Aggregation (FedCA), which not only aggregates similar clients to enhance trained embeddings, but also aggregates complementary clients to update non-trained embeddings. Besides, we formulate the overall learning process into a unified optimization algorithm to jointly learn the similarity and complementarity. Extensive experiments on several real-world datasets substantiate the effectiveness of our proposed model. The source codes are available at this https URL.

[IR-6] Polyhedral Conic Classifier for CTR Prediction

Link: https://arxiv.org/abs/2406.03892
Authors: Beyza Turkmen, Ramazan Tarik Turksoy, Hasan Saribas, Hakan Cevikalp
Keywords: industrial recommender systems, click-through rate, recommender systems, addressing the inherent, geometric asymmetry
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:

Abstract:This paper introduces a novel approach for click-through rate (CTR) prediction within industrial recommender systems, addressing the inherent challenges of numerical imbalance and geometric asymmetry. These challenges stem from imbalanced datasets, where positive (click) instances occur less frequently than negatives (non-clicks), and geometrically asymmetric distributions, where positive samples exhibit visually coherent patterns while negatives demonstrate greater diversity. To address these challenges, we have used a deep neural network classifier that uses the polyhedral conic functions. This classifier is similar to the one-class classifiers in spirit and it returns compact polyhedral acceptance regions to separate the positive class samples from the negative samples that have diverse distributions. Extensive experiments have been conducted to test the proposed approach using state-of-the-art (SOTA) CTR prediction models on four public datasets, namely Criteo, Avazu, MovieLens and Frappe. The experimental evaluations highlight the superiority of our proposed approach over Binary Cross Entropy (BCE) Loss, which is widely used in CTR prediction tasks.

[IR-7] Reducing the climate impact of data portals: a case study

Link: https://arxiv.org/abs/2406.03858
Authors: Noah Gießing, Madhurima Deb, Ankit Satpute, Moritz Schubotz, Olaf Teschke
Keywords: carbon footprint share, communication technology, sector has steadily, steadily increased, past decade
Categories: Information Retrieval (cs.IR); Digital Libraries (cs.DL)
Comments: 4 pages

Abstract:The carbon footprint share of the information and communication technology (ICT) sector has steadily increased in the past decade and is predicted to make up as much as 23% of global emissions in 2030. This shows a pressing need for developers, including the information retrieval community, to make their code more energy-efficient. In this project proposal, we discuss techniques to reduce the energy footprint of the MaRDI (Mathematical Research Data Initiative) Portal, a MediaWiki-based knowledge base. In future work, we plan to implement these changes and provide concrete measurements on the gain in energy efficiency. Researchers developing similar knowledge bases can adapt our measures to reduce their environmental footprint. In this way, we are working on mitigating the climate impact of Information Retrieval research.

[IR-8] XL-HeadTags: Leveraging Multimodal Retrieval Augmentation for the Multilingual Generation of News Headlines and Tags

Link: https://arxiv.org/abs/2406.03776
Authors: Faisal Tareque Shohan, Mir Tafseer Nayeem, Samsul Islam, Abu Ubaida Akash, Shafiq Joty
Keywords: published online daily, articles published online, published online, online daily, daily can overwhelm
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Comments: ACL 2024 camera ready

Abstract:Millions of news articles published online daily can overwhelm readers. Headlines and entity (topic) tags are essential for guiding readers to decide if the content is worth their time. While headline generation has been extensively studied, tag generation remains largely unexplored, yet it offers readers better access to topics of interest. The need for conciseness in capturing readers’ attention necessitates improved content selection strategies for identifying salient and relevant segments within lengthy articles, thereby guiding language models effectively. To address this, we propose to leverage auxiliary information such as images and captions embedded in the articles to retrieve relevant sentences and utilize instruction tuning with variations to generate both headlines and tags for news articles in a multilingual context. To make use of the auxiliary information, we have compiled a dataset named XL-HeadTags, which includes 20 languages across 6 diverse language families. Through extensive evaluation, we demonstrate the effectiveness of our plug-and-play multimodal-multilingual retrievers for both tasks. Additionally, we have developed a suite of tools for processing and evaluating multilingual texts, significantly contributing to the research community by enabling more accurate and efficient analysis across languages.

[IR-9] Attribute-Aware Implicit Modality Alignment for Text Attribute Person Search

Link: https://arxiv.org/abs/2406.03721
Authors: Xin Wang, Fangfang Liu, Zheng Li, Caili Guo
Keywords: Text attribute person, find specific pedestrians, person search aims, attribute person search, Text attribute
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:

Abstract:Text attribute person search aims to find specific pedestrians through given textual attributes, which is very meaningful in the scene of searching for designated pedestrians through witness descriptions. The key challenge is the significant modality gap between textual attributes and images. Previous methods focused on achieving explicit representation and alignment through unimodal pre-trained models. Nevertheless, the absence of inter-modality correspondence in these models may lead to distortions in the local information of intra-modality. Moreover, these methods only considered the alignment of inter-modality and ignored the differences between different attribute categories. To mitigate the above problems, we propose an Attribute-Aware Implicit Modality Alignment (AIMA) framework to learn the correspondence of local representations between textual attributes and images and combine global representation matching to narrow the modality gap. Firstly, we introduce the CLIP model as the backbone and design prompt templates to transform attribute combinations into structured sentences. This facilitates the model’s ability to better understand and match image details. Next, we design a Masked Attribute Prediction (MAP) module that predicts the masked attributes after the interaction of image and masked textual attribute features through multi-modal interaction, thereby achieving implicit local relationship alignment. Finally, we propose an Attribute-IoU Guided Intra-Modal Contrastive (A-IoU IMC) loss, aligning the distribution of different textual attributes in the embedding space with their IoU distribution, achieving better semantic arrangement. Extensive experiments on the Market-1501 Attribute, PETA, and PA100K datasets show that the performance of our proposed method significantly surpasses the current state-of-the-art methods.

Artificial Intelligence

[AI-0] Physics3D: Learning Physical Properties of 3D Gaussians via Video Diffusion

Link: https://arxiv.org/abs/2406.04338
Authors: Fangfu Liu, Hanyang Wang, Shunyu Yao, Shengjun Zhang, Jie Zhou, Yueqi Duan
Keywords: recent years, rapid development, physical properties, physical, dynamic movements
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Comments: Project page: this https URL

Abstract:In recent years, there has been rapid development in 3D generation models, opening up new possibilities for applications such as simulating the dynamic movements of 3D objects and customizing their behaviors. However, current 3D generative models tend to focus only on surface features such as color and shape, neglecting the inherent physical properties that govern the behavior of objects in the real world. To accurately simulate physics-aligned dynamics, it is essential to predict the physical properties of materials and incorporate them into the behavior prediction process. Nonetheless, predicting the diverse materials of real-world objects is still challenging due to the complex nature of their physical attributes. In this paper, we propose Physics3D, a novel method for learning various physical properties of 3D objects through a video diffusion model. Our approach involves designing a highly generalizable physical simulation system based on a viscoelastic material model, which enables us to simulate a wide range of materials with high-fidelity capabilities. Moreover, we distill the physical priors from a video diffusion model that contains more understanding of realistic object materials. Extensive experiments demonstrate the effectiveness of our method with both elastic and plastic materials. Physics3D shows great potential for bridging the gap between the physical world and virtual neural space, providing a better integration and application of realistic physical principles in virtual environments. Project page: this https URL.

[AI-1] Coherent Zero-Shot Visual Instruction Generation

Link: https://arxiv.org/abs/2406.04337
Authors: Quynh Phung, Songwei Ge, Jia-Bin Huang
Keywords: require consistent representation, smooth state transitions, sequential steps remains, generating visual instructions, formidable challenge
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: this https URL

Abstract:Despite the advances in text-to-image synthesis, particularly with diffusion models, generating visual instructions that require consistent representation and smooth state transitions of objects across sequential steps remains a formidable challenge. This paper introduces a simple, training-free framework to tackle these issues, capitalizing on the advancements in diffusion models and large language models (LLMs). Our approach systematically integrates text comprehension and image generation to ensure visual instructions are visually appealing and maintain consistency and accuracy throughout the instruction sequence. We validate the effectiveness by testing multi-step instructions and comparing the text alignment and consistency with several baselines. Our experiments show that our approach can visualize coherent and visually pleasing instructions.

[AI-2] PaCE: Parsimonious Concept Engineering for Large Language Models

Link: https://arxiv.org/abs/2406.04331
Authors: Jinqi Luo, Tianjiao Ding, Kwan Ho Ryan Chan, Darshan Thaker, Aditya Chattopadhyay, Chris Callison-Burch, René Vidal
Keywords: Large Language Models, Large Language, wide variety, Large, Alignment
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: 26 pages, 17 figures, 5 tables, dataset and code at this https URL

Abstract:Large Language Models (LLMs) are being used for a wide variety of tasks. While they are capable of generating human-like responses, they can also produce undesirable output including potentially harmful information, racist or sexist language, and hallucinations. Alignment methods are designed to reduce such undesirable output, via techniques such as fine-tuning, prompt engineering, and representation engineering. However, existing methods face several challenges: some require costly fine-tuning for every alignment task; some do not adequately remove undesirable concepts, failing alignment; some remove benign concepts, lowering the linguistic capabilities of LLMs. To address these issues, we propose Parsimonious Concept Engineering (PaCE), a novel activation engineering framework for alignment. First, to sufficiently model the concepts, we construct a large-scale concept dictionary in the activation space, in which each atom corresponds to a semantic concept. Then, given any alignment task, we instruct a concept partitioner to efficiently annotate the concepts as benign or undesirable. Finally, at inference time, we decompose the LLM activations along the concept dictionary via sparse coding, to accurately represent the activation as a linear combination of the benign and undesirable components. By removing the latter ones from the activation, we reorient the behavior of LLMs towards alignment goals. We conduct experiments on tasks such as response detoxification, faithfulness enhancement, and sentiment revising, and show that PaCE achieves state-of-the-art alignment performance while maintaining linguistic capabilities.

[AI-3] ATraDiff: Accelerating Online Reinforcement Learning with Imaginary Trajectories

Link: https://arxiv.org/abs/2406.04323
Authors: Qianlan Yang, Yu-Xiong Wang
Keywords: Training autonomous agents, Training autonomous, low data efficiency, offline data, due to low
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: ICML 2024 Accepted

Abstract:Training autonomous agents with sparse rewards is a long-standing problem in online reinforcement learning (RL), due to low data efficiency. Prior work overcomes this challenge by extracting useful knowledge from offline data, often accomplished through the learning of action distribution from offline data and utilizing the learned distribution to facilitate online RL. However, since the offline data are given and fixed, the extracted knowledge is inherently limited, making it difficult to generalize to new tasks. We propose a novel approach that leverages offline data to learn a generative diffusion model, coined as Adaptive Trajectory Diffuser (ATraDiff). This model generates synthetic trajectories, serving as a form of data augmentation and consequently enhancing the performance of online RL methods. The key strength of our diffuser lies in its adaptability, allowing it to effectively handle varying trajectory lengths and mitigate distribution shifts between online and offline data. Because of its simplicity, ATraDiff seamlessly integrates with a wide spectrum of RL methods. Empirical evaluation shows that ATraDiff consistently achieves state-of-the-art performance across a variety of environments, with particularly pronounced improvements in complicated settings. Our code and demo video are available at this https URL .

[AI-4] Chimera: Effectively Modeling Multivariate Time Series with 2-Dimensional State Space Models

Link: https://arxiv.org/abs/2406.04320
Authors: Ali Behrouz, Michele Santacatterina, Ramin Zabih
Keywords: Modeling multivariate time, State Space Models, time series, time series modeling, multivariate time series
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Modeling multivariate time series is a well-established problem with a wide range of applications from healthcare to financial markets. Traditional State Space Models (SSMs) are classical approaches for univariate time series modeling due to their simplicity and expressive power to represent linear dependencies. They, however, have fundamentally limited expressive power to capture non-linear dependencies, are slow in practice, and fail to model the inter-variate information flow. Despite recent attempts to improve the expressive power of SSMs by using deep structured SSMs, the existing methods are either limited to univariate time series, fail to model complex patterns (e.g., seasonal patterns), fail to dynamically model the dependencies of variate and time dimensions, and/or are input-independent. We present Chimera, which uses two input-dependent 2-D SSM heads with different discretization processes to learn long-term progression and seasonal patterns. To improve the efficiency of complex 2D recurrence, we present a fast training procedure using a new 2-dimensional parallel selective scan. We further present and discuss 2-dimensional Mamba and Mamba-2 as the special cases of our 2D SSM. Our experimental evaluation shows the superior performance of Chimera on extensive and diverse benchmarks, including ECG and speech time series classification, long-term and short-term time series forecasting, and time series anomaly detection.

[AI-5] Adaptive Sampling of k-Space in Magnetic Resonance for Rapid Pathology Prediction

Link: https://arxiv.org/abs/2406.04318
Authors: Chen-Yu Yen, Raghav Singhal, Umang Sharma, Rajesh Ranganath, Sumit Chopra, Lerrel Pinto
Keywords: Magnetic Resonance, proven diagnostic utility, inaccessible imaging modality, imaging modality, diagnostic utility
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: ICML 2024. Project website at this https URL

Abstract:Magnetic Resonance (MR) imaging, despite its proven diagnostic utility, remains an inaccessible imaging modality for disease surveillance at the population level. A major factor rendering MR inaccessible is lengthy scan times. An MR scanner collects measurements associated with the underlying anatomy in the Fourier space, also known as the k-space. Creating a high-fidelity image requires collecting large quantities of such measurements, increasing the scan time. Traditionally, to accelerate an MR scan, image reconstruction from under-sampled k-space data is the method of choice. However, recent works show the feasibility of bypassing image reconstruction and learning to detect disease directly from a sparser learned subset of the k-space measurements. In this work, we propose Adaptive Sampling for MR (ASMR), a sampling method that learns an adaptive policy to sequentially select k-space samples to optimize for target disease detection. On 6 out of 8 pathology classification tasks spanning the Knee, Brain, and Prostate MR scans, ASMR reaches within 2% of the performance of a fully sampled classifier while using only 8% of the k-space, as well as outperforming prior state-of-the-art work in k-space sampling such as EMRT, LOUPE, and DPS.

[AI-6] Improving Alignment and Robustness with Short Circuiting

Link: https://arxiv.org/abs/2406.04313
Authors: Andy Zou, Long Phan, Justin Wang, Derek Duenas, Maxwell Lin, Maksym Andriushchenko, Rowan Wang, Zico Kolter, Matt Fredrikson, Dan Hendrycks
Keywords: highly vulnerable, harmful, adversarial, attacks, harmful outputs
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)

Abstract:AI systems can take harmful actions and are highly vulnerable to adversarial attacks. We present an approach, inspired by recent advances in representation engineering, that “short-circuits” models as they respond with harmful outputs. Existing techniques aimed at improving alignment, such as refusal training, are often bypassed. Techniques such as adversarial training try to plug these holes by countering specific attacks. As an alternative to refusal training and adversarial training, short-circuiting directly controls the representations that are responsible for harmful outputs in the first place. Our technique can be applied to both text-only and multimodal language models to prevent the generation of harmful outputs without sacrificing utility – even in the presence of powerful unseen attacks. Notably, while adversarial robustness in standalone image recognition remains an open challenge, short-circuiting allows the larger multimodal system to reliably withstand image “hijacks” that aim to produce harmful content. Finally, we extend our approach to AI agents, demonstrating considerable reductions in the rate of harmful actions when they are under attack. Our approach represents a significant step forward in the development of reliable safeguards to harmful behavior and adversarial attacks.

[AI-7] Semantically Diverse Language Generation for Uncertainty Estimation in Language Models

Link: https://arxiv.org/abs/2406.04306
Authors: Lukas Aichberger, Kajetan Schweighofer, Mykyta Ielanskyi, Sepp Hochreiter
Keywords: Large language models, Large language, Large, language models, LLMs
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Large language models (LLMs) can suffer from hallucinations when generating text. These hallucinations impede various applications in society and industry by making LLMs untrustworthy. Current LLMs generate text in an autoregressive fashion by predicting and appending text tokens. When an LLM is uncertain about the semantic meaning of the next tokens to generate, it is likely to start hallucinating. Thus, it has been suggested that hallucinations stem from predictive uncertainty. We introduce Semantically Diverse Language Generation (SDLG) to quantify predictive uncertainty in LLMs. SDLG steers the LLM to generate semantically diverse yet likely alternatives for an initially generated text. This approach provides a precise measure of aleatoric semantic uncertainty, detecting whether the initial text is likely to be hallucinated. Experiments on question-answering tasks demonstrate that SDLG consistently outperforms existing methods while being the most computationally efficient, setting a new standard for uncertainty estimation in LLMs.
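
As a rough illustration of what a semantic uncertainty score can look like, the sketch below computes the entropy over groups of generations that share a meaning. This is a generic cluster-entropy calculation, not the SDLG estimator from the paper. The function name `semantic_entropy` is hypothetical, and the cluster ids and probabilities are assumed to be given (for example, produced by some separate equivalence check).

```python
import math
from collections import defaultdict


def semantic_entropy(generations):
    """Entropy over semantic clusters of sampled generations.

    generations: list of (cluster_id, probability) pairs, where the
    cluster_id marks generations judged to carry the same meaning.
    Probabilities are renormalised over the samples provided.
    """
    mass = defaultdict(float)
    for cluster_id, prob in generations:
        mass[cluster_id] += prob  # pool probability of same-meaning outputs
    total = sum(mass.values())
    if total <= 0:
        raise ValueError("total probability mass must be positive")
    return -sum((m / total) * math.log(m / total) for m in mass.values() if m > 0)


if __name__ == "__main__":
    # Two paraphrases of one answer: no semantic uncertainty.
    print(semantic_entropy([("paris", 0.6), ("paris", 0.4)]))
    # Two conflicting answers with equal mass: maximal uncertainty for two clusters.
    print(semantic_entropy([("paris", 0.5), ("lyon", 0.5)]))
```

Pooling probability by meaning rather than by surface string is the point: many differently worded but equivalent answers should not count as uncertainty.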

[AI-8] Vision-LSTM: xLSTM as Generic Vision Backbone

Link: https://arxiv.org/abs/2406.04303
Authors: Benedikt Alkin, Maximilian Beck, Korbinian Pöppel, Sepp Hochreiter, Johannes Brandstetter
Keywords: natural language processing, Transformers are widely, language processing, initially introduced, introduced for natural
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Transformers are widely used as generic backbones in computer vision, despite being initially introduced for natural language processing. Recently, the Long Short-Term Memory (LSTM) has been extended to a scalable and performant architecture - the xLSTM - which overcomes long-standing LSTM limitations via exponential gating and parallelizable matrix memory structure. In this report, we introduce Vision-LSTM (ViL), an adaptation of the xLSTM building blocks to computer vision. ViL comprises a stack of xLSTM blocks where odd blocks process the sequence of patch tokens from top to bottom while even blocks go from bottom to top. Experiments show that ViL holds promise to be further deployed as a new generic backbone for computer vision architectures.
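
The alternating traversal described above can be sketched in a few lines. This is only an illustration of the scan order, not an xLSTM implementation: `run_alternating_blocks` and `running_sum` are hypothetical names, blocks are assumed to be numbered from 1, and a causal running sum stands in for a real sequence block.

```python
def run_alternating_blocks(tokens, blocks):
    """Apply sequence blocks with alternating traversal direction.

    Odd-numbered blocks (1st, 3rd, ...) see the tokens in their given order;
    even-numbered blocks see them reversed, and the result is flipped back
    so that positions stay aligned with the original patch order.
    """
    seq = list(tokens)
    for number, block in enumerate(blocks, start=1):
        if number % 2 == 1:
            seq = block(seq)
        else:
            seq = block(seq[::-1])[::-1]
    return seq


def running_sum(seq):
    """Toy causal block: each output depends only on earlier inputs."""
    out, total = [], 0
    for value in seq:
        total += value
        out.append(total)
    return out


if __name__ == "__main__":
    print(run_alternating_blocks([1, 2, 3], [running_sum, running_sum]))
```

Because each block is causal in its own direction, stacking the two directions lets every position eventually receive information from both earlier and later patches.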

[AI-9] ABEX: Data Augmentation for Low-Resource NLU via Expanding Abstract Descriptions

Link: https://arxiv.org/abs/2406.04286
Authors: Sreyan Ghosh, Utkarsh Tyagi, Sonal Kumar, C. K. Evuru, S Ramaneswaran, S Sakshi, Dinesh Manocha
Keywords: Natural Language Understanding, low-resource Natural Language, Language Understanding, Natural Language, effective generative data
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: ACL 2024 Main Conference. Code and data: this https URL

Abstract:We present ABEX, a novel and effective generative data augmentation methodology for low-resource Natural Language Understanding (NLU) tasks. ABEX is based on ABstract-and-EXpand, a novel paradigm for generating diverse forms of an input document – we first convert a document into its concise, abstract description and then generate new documents based on expanding the resultant abstraction. To learn the task of expanding abstract descriptions, we first train BART on a large-scale synthetic dataset with abstract-document pairs. Next, to generate abstract descriptions for a document, we propose a simple, controllable, and training-free method based on editing AMR graphs. ABEX brings the best of both worlds: by expanding from abstract representations, it preserves the original semantic properties of the documents, like style and meaning, thereby maintaining alignment with the original label and data distribution. At the same time, the fundamental process of elaborating on abstract descriptions facilitates diverse generations. We demonstrate the effectiveness of ABEX on 4 NLU tasks spanning 12 datasets and 4 low-resource settings. ABEX outperforms all our baselines quantitatively with improvements of 0.04% - 38.8%. Qualitatively, ABEX outperforms all prior methods from literature in terms of context and length diversity.

[AI-10] Generative AI-in-the-loop: Integrating LLMs and GPTs into the Next Generation Networks

Link: https://arxiv.org/abs/2406.04276
Authors: Han Zhang, Akram Bin Sediq, Ali Afana, Melike Erol-Kantarci
Keywords: created numerous opportunities, machine learning, recent years, techniques have created, created numerous
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:In recent years, machine learning (ML) techniques have created numerous opportunities for intelligent mobile networks and have accelerated the automation of network operations. However, complex network tasks may involve variables and considerations even beyond the capacity of traditional ML algorithms. On the other hand, large language models (LLMs) have recently emerged, demonstrating near-human-level performance in cognitive tasks across various fields. However, they remain prone to hallucinations and often lack common sense in basic tasks. Therefore, they are regarded as assistive tools for humans. In this work, we propose the concept of “generative AI-in-the-loop” and utilize the semantic understanding, context awareness, and reasoning abilities of LLMs to assist humans in handling complex or unforeseen situations in mobile communication networks. We believe that combining LLMs and ML models allows both to leverage their respective capabilities and achieve better results than either model alone. To support this idea, we begin by analyzing the capabilities of LLMs and compare them with traditional ML algorithms. We then explore potential LLM-based applications in line with the requirements of next-generation networks. We further examine the integration of ML and LLMs, discussing how they can be used together in mobile networks. Unlike existing studies, our research emphasizes the fusion of LLMs with traditional ML-driven next-generation networks and serves as a comprehensive refinement of existing surveys. Finally, we provide a case study to enhance ML-based network intrusion detection with synthesized data generated by LLMs. Our case study further demonstrates the advantages of our proposed idea.

[AI-11] Self-Play with Adversarial Critic: Provable and Scalable Offline Alignment for Language Models

Link: https://arxiv.org/abs/2406.04274
Authors: Xiang Ji, Sanjeev Kulkarni, Mengdi Wang, Tengyang Xie
Keywords: aligning large language, large language models, preference optimization methods, studies the challenge, challenge of aligning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:This work studies the challenge of aligning large language models (LLMs) with offline preference data. We focus on alignment by Reinforcement Learning from Human Feedback (RLHF) in particular. While popular preference optimization methods exhibit good empirical performance in practice, they are not theoretically guaranteed to converge to the optimal policy and can provably fail when the data coverage is sparse by classical offline reinforcement learning (RL) results. On the other hand, a recent line of work has focused on theoretically motivated preference optimization methods with provable guarantees, but these are not computationally efficient for large-scale applications like LLM alignment. To bridge this gap, we propose SPAC, a new offline preference optimization method with self-play, inspired by the on-average pessimism technique from the offline RL literature, to be the first provable and scalable approach to LLM alignment. We both provide theoretical analysis for its convergence under single-policy concentrability for the general function approximation setting and demonstrate its competitive empirical performance for LLM alignment on a 7B Mistral model with Open LLM Leaderboard evaluations.

[AI-12] ELFS: Enhancing Label-Free Coreset Selection via Clustering-based Pseudo-Labeling

Link: https://arxiv.org/abs/2406.04273
Authors: Haizhong Zheng, Elisa Tsai, Yifu Lu, Jiachen Sun, Brian R. Bartoldson, Bhavya Kailkhura, Atul Prakash
Keywords: High-quality human-annotated data, High-quality human-annotated, human annotation process, deep learning pipelines, modern deep learning
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:High-quality human-annotated data is crucial for modern deep learning pipelines, yet the human annotation process is both costly and time-consuming. Given a constrained human labeling budget, selecting an informative and representative data subset for labeling can significantly reduce human annotation effort. Well-performing state-of-the-art (SOTA) coreset selection methods require ground-truth labels over the whole dataset, failing to reduce the human labeling burden. Meanwhile, SOTA label-free coreset selection methods deliver inferior performance due to poor geometry-based scores. In this paper, we introduce ELFS, a novel label-free coreset selection method. ELFS employs deep clustering to estimate data difficulty scores without ground-truth labels. Furthermore, ELFS uses a simple but effective double-end pruning method to mitigate bias on calculated scores, which further improves the performance on selected coresets. We evaluate ELFS on five vision benchmarks and show that ELFS consistently outperforms SOTA label-free baselines. For instance, at a 90% pruning rate, ELFS surpasses the best-performing baseline by 5.3% on CIFAR10 and 7.1% on CIFAR100. Moreover, ELFS even achieves comparable performance to supervised coreset selection at low pruning rates (e.g., 30% and 50%) on CIFAR10 and ImageNet-1K.
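
The abstract does not spell out the double-end pruning rule, so the sketch below shows one plausible reading under stated assumptions: drop a fixed number of the hardest samples first, then drop the easiest samples until the labeling budget is met. The function name `double_end_prune` and the parameters `budget` and `hard_cut` are hypothetical, and the difficulty scores are assumed to come from elsewhere (higher means harder).

```python
def double_end_prune(scores, budget, hard_cut):
    """Select sample indices by trimming both ends of the difficulty ranking.

    scores:   difficulty score per sample (higher = harder; assumed given).
    budget:   number of samples to keep.
    hard_cut: number of hardest samples to discard before filling the budget.
    Returns the kept indices in ascending order.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])  # easy -> hard
    keep_until = max(0, len(order) - hard_cut)
    order = order[:keep_until]                    # trim the hard end
    kept = order[max(0, len(order) - budget):]    # trim the easy end
    return sorted(kept)


if __name__ == "__main__":
    scores = [0.1, 0.9, 0.5, 0.3, 0.7]
    # Drop the single hardest sample (index 1), then keep the 2 hardest remaining.
    print(double_end_prune(scores, budget=2, hard_cut=1))
```

Trimming the hard end guards against samples whose high scores may reflect noisy pseudo-labels rather than genuine difficulty, while trimming the easy end removes redundant samples.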

[AI-13] Open-Endedness is Essential for Artificial Superhuman Intelligence

Link: https://arxiv.org/abs/2406.04268
Authors: Edward Hughes, Michael Dennis, Jack Parker-Holder, Feryal Behbahani, Aditi Mavalankar, Yuge Shi, Tom Schaul, Tim Rocktäschel
Keywords: internet-scale data, recent years, tremendous surge, general capabilities, fuelled by training
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:In recent years there has been a tremendous surge in the general capabilities of AI systems, mainly fuelled by training foundation models on internet-scale data. Nevertheless, the creation of open-ended, ever self-improving AI remains elusive. In this position paper, we argue that the ingredients are now in place to achieve open-endedness in AI systems with respect to a human observer. Furthermore, we claim that such open-endedness is an essential property of any artificial superhuman intelligence (ASI). We begin by providing a concrete formal definition of open-endedness through the lens of novelty and learnability. We then illustrate a path towards ASI via open-ended systems built on top of foundation models, capable of making novel, human-relevant discoveries. We conclude by examining the safety implications of generally-capable open-ended AI. We expect that open-ended foundation models will prove to be an increasingly fertile and safety-critical area of research in the near future.

[AI-14] MLVU: A Comprehensive Benchmark for Multi-Task Long Video Understanding

Link: https://arxiv.org/abs/2406.04264
Authors: Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Shitao Xiao, Xi Yang, Yongping Xiong, Bo Zhang, Tiejun Huang, Zheng Liu
Keywords: Long Video Understanding, Video Understanding, Multi-task Long Video, video understanding benchmarks, Long Video
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:The evaluation of Long Video Understanding (LVU) performance poses an important but challenging research problem. Despite previous efforts, the existing video understanding benchmarks are severely constrained by several issues, especially the insufficient lengths of videos, a lack of diversity in video types and evaluation tasks, and the inappropriateness for evaluating LVU performances. To address the above problems, we propose a new benchmark, called MLVU (Multi-task Long Video Understanding Benchmark), for the comprehensive and in-depth evaluation of LVU. MLVU presents the following critical