This post presents the latest paper listings fetched daily from arXiv, automatically updated every morning at around 11:30 and organized into five major areas: NLP, CV, ML, AI, and IR. If you would like to receive the daily digest by email, please leave your email address in the comments.

Note: the paper data is fetched from the arXiv website and updated automatically at around 11:30 each morning.

Tip: if you need to receive the daily paper data by email, leave your email address in the comments; emails are likewise sent automatically at around 11:30 each day.

Table of Contents

Overview (2024-05-30)

A total of 511 papers were updated today, including:

  • Natural Language Processing: 72 papers (Computation and Language, cs.CL)
  • Computer Vision: 116 papers (Computer Vision and Pattern Recognition, cs.CV)
  • Artificial Intelligence: 149 papers (cs.AI)
  • Machine Learning: 203 papers (cs.LG)

Natural Language Processing

[NLP-0] X-VILA: Cross-Modality Alignment for Large Language Model

Link: https://arxiv.org/abs/2405.19335
Authors: Hanrong Ye, De-An Huang, Yao Lu, Zhiding Yu, Wei Ping, Andrew Tao, Jan Kautz, Song Han, Dan Xu, Pavlo Molchanov, Hongxu Yin
Keywords: omni-modality model designed, large language models, incorporating image, omni-modality model, model designed
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Technical Report

Abstract:We introduce X-VILA, an omni-modality model designed to extend the capabilities of large language models (LLMs) by incorporating image, video, and audio modalities. By aligning modality-specific encoders with LLM inputs and diffusion decoders with LLM outputs, X-VILA achieves cross-modality understanding, reasoning, and generation. To facilitate this cross-modality alignment, we curate an effective interleaved any-to-any modality instruction-following dataset. Furthermore, we identify a significant problem with the current cross-modality alignment method, which results in visual information loss. To address the issue, we propose a visual alignment mechanism with a visual embedding highway module. We then introduce a resource-efficient recipe for training X-VILA, that exhibits proficiency in any-to-any modality conversation, surpassing previous approaches by large margins. X-VILA also showcases emergent properties across modalities even in the absence of similar training data. The project will be made open-source.

[NLP-1] LLMs Meet Multimodal Generation and Editing: A Survey

Link: https://arxiv.org/abs/2405.19334
Authors: Yingqing He, Zhaoyang Liu, Jingye Chen, Zeyue Tian, Hongyu Liu, Xiaowei Chi, Runtao Liu, Ruibin Yuan, Yazhou Xing, Wenhai Wang, Jifeng Dai, Yong Zhang, Wei Xue, Qifeng Liu, Yike Guo, Qifeng Chen
Keywords: large language models, large language, combining LLMs, growing interest, interest in combining
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: 51 Pages with 16 Figures, 12 Tables, and 534 References. GitHub Repository at: this https URL

Abstract:With the recent advancement in large language models (LLMs), there is a growing interest in combining LLMs with multimodal learning. Previous surveys of multimodal large language models (MLLMs) mainly focus on understanding. This survey elaborates on multimodal generation across different domains, including image, video, 3D, and audio, where we highlight the notable advancements with milestone works in these fields. Specifically, we exhaustively investigate the key technical components behind methods and multimodal datasets utilized in these studies. Moreover, we dig into tool-augmented multimodal agents that can use existing generative models for human-computer interaction. Lastly, we also comprehensively discuss the advancement in AI safety and investigate emerging applications as well as future prospects. Our work provides a systematic and insightful overview of multimodal generation, which is expected to advance the development of Artificial Intelligence for Generative Content (AIGC) and world models. A curated list of all related papers can be found at this https URL

[NLP-2] MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series

Link: https://arxiv.org/abs/2405.19327
Authors: Ge Zhang, Scott Qu, Jiaheng Liu, Chenchen Zhang, Chenghua Lin, Chou Leuang Yu, Danny Pan, Esther Cheng, Jie Liu, Qunshu Lin, Raven Yuan, Tuney Zheng, Wei Pang, Xinrun Du, Yiming Liang, Yinghao Ma, Yizhi Li, Ziyang Ma, Bill Lin, Emmanouil Benetos, Huan Yang, Junting Zhou, Kaijing Ma, Minghao Liu, Morry Niu, Noah Wang, Quehry Que, Ruibo Liu, Sine Liu, Shawn Guo, Soren Gao, Wangchunshu Zhou, Xinyue Zhang, Yizhi Zhou, Yubo Wang, Yuelin Bai, Yuhan Zhang, Yuxiang Zhang, Zenith Wang, Zhenzhu Yang, Zijian Zhao, Jiajun Zhang, Wanli Ouyang, Wenhao Huang, Wenhu Chen
Keywords: made great strides, achieve unprecedented performance, LLMs, made great, great strides
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: this https URL

Abstract:Large Language Models (LLMs) have made great strides in recent years to achieve unprecedented performance across different tasks. However, due to commercial interest, the most competitive models like GPT, Gemini, and Claude have been gated behind proprietary interfaces without disclosing the training details. Recently, many institutions have open-sourced several strong LLMs like LLaMA-3, comparable to existing closed-source LLMs. However, only the model’s weights are provided with most details (e.g., intermediate checkpoints, pre-training corpus, and training code, etc.) being undisclosed. To improve the transparency of LLMs, the research community has formed to open-source truly open LLMs (e.g., Pythia, Amber, OLMo), where more details (e.g., pre-training corpus and training code) are being provided. These models have greatly advanced the scientific study of these large models including their strengths, weaknesses, biases and risks. However, we observe that the existing truly open LLMs on reasoning, knowledge, and coding tasks are still inferior to existing state-of-the-art LLMs with similar model sizes. To this end, we open-source MAP-Neo, a highly capable and transparent bilingual language model with 7B parameters trained from scratch on 4.5T high-quality tokens. Our MAP-Neo is the first fully open-sourced bilingual LLM with comparable performance compared to existing state-of-the-art LLMs. Moreover, we open-source all details to reproduce our MAP-Neo, where the cleaned pre-training corpus, data cleaning pipeline, checkpoints, and well-optimized training/evaluation framework are provided. Finally, we hope our MAP-Neo will enhance and strengthen the open research community and inspire more innovations and creativities to facilitate the further improvements of LLMs.

[NLP-3] Nearest Neighbor Speculative Decoding for LLM Generation and Attribution

Link: https://arxiv.org/abs/2405.19325
Authors: Minghan Li, Xilun Chen, Ari Holtzman, Beidi Chen, Jimmy Lin, Wen-tau Yih, Xi Victoria Lin
Keywords: Large language models, Large language, hallucinate and lack, lack the ability, ability to provide
Subjects: Computation and Language (cs.CL)

Abstract:Large language models (LLMs) often hallucinate and lack the ability to provide attribution for their generations. Semi-parametric LMs, such as kNN-LM, approach these limitations by refining the output of an LM for a given prompt using its nearest neighbor matches in a non-parametric data store. However, these models often exhibit slow inference speeds and produce non-fluent texts. In this paper, we introduce Nearest Neighbor Speculative Decoding (NEST), a novel semi-parametric language modeling approach that is capable of incorporating real-world text spans of arbitrary length into the LM generations and providing attribution to their sources. NEST performs token-level retrieval at each inference step to compute a semi-parametric mixture distribution and identify promising span continuations in a corpus. It then uses an approximate speculative decoding procedure that accepts a prefix of the retrieved span or generates a new token. NEST significantly enhances the generation quality and attribution rate of the base LM across a variety of knowledge-intensive tasks, surpassing the conventional kNN-LM method and performing competitively with in-context retrieval augmentation. In addition, NEST substantially improves the generation speed, achieving a 1.8x speedup in inference time when applied to Llama-2-Chat 70B.
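The semi-parametric mixture that NEST (and kNN-LM before it) computes at each inference step can be sketched as follows. This is an illustrative interpolation of a parametric LM distribution with a retrieval-based distribution; the interpolation weight `lam` and the toy probabilities are assumptions, not values from the paper:

```python
def mix_distributions(p_lm, p_retrieval, lam):
    """kNN-LM-style semi-parametric mixture of a parametric LM
    next-token distribution with a retrieval-based distribution."""
    return [lam * r + (1.0 - lam) * p for p, r in zip(p_lm, p_retrieval)]

# Toy 3-token vocabulary: retrieval sharpens the LM's guess.
p_lm = [0.5, 0.3, 0.2]
p_knn = [0.1, 0.8, 0.1]
p_mix = mix_distributions(p_lm, p_knn, lam=0.5)
# p_mix still sums to 1 and shifts mass toward the retrieved continuation.
```

NEST additionally retrieves whole spans and accepts a prefix of each retrieved span speculatively, rather than mixing one token at a time.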

[NLP-4] Are Large Language Models Chameleons?

Link: https://arxiv.org/abs/2405.19323
Authors: Mingmeng Geng, Sihong He, Roberto Trotta
Keywords: large language models, personality tendencies, large language, worldviews and personality, European Social Survey
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: 16 pages, 8 figures

Abstract:Do large language models (LLMs) have their own worldviews and personality tendencies? Simulations in which an LLM was asked to answer subjective questions were conducted more than 1 million times. Comparison of the responses from different LLMs with real data from the European Social Survey (ESS) suggests that the effect of prompts on bias and variability is fundamental, highlighting major cultural, age, and gender biases. Methods for measuring the difference between LLMs and survey data are discussed, such as calculating weighted means and a new proposed measure inspired by Jaccard similarity. We conclude that it is important to analyze the robustness and variability of prompts before using LLMs to model individual decisions or collective behavior, as their imitation abilities are approximate at best.
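The paper's proposed measure is inspired by Jaccard similarity; as a minimal sketch, Jaccard similarity between an LLM's answer set and a survey answer set can be computed as follows (the example responses are invented):

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two response sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

llm_answers = {"agree", "neutral", "disagree"}
survey_answers = {"agree", "neutral"}
similarity = jaccard(llm_answers, survey_answers)  # 2 shared / 3 total
```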

[NLP-5] Robust Preference Optimization through Reward Model Distillation

Link: https://arxiv.org/abs/2405.19316
Authors: Adam Fisch, Jacob Eisenstein, Vicky Zayats, Alekh Agarwal, Ahmad Beirami, Chirag Nagpal, Pete Shaw, Jonathan Berant
Keywords: Language model, involves maximizing, preference, Direct Preference Optimization, reward model
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)

Abstract:Language model (LM) post-training (or alignment) involves maximizing a reward function that is derived from preference annotations. Direct Preference Optimization (DPO) is a popular offline alignment method that trains a policy directly on preference data without the need to train a reward model or apply reinforcement learning. However, typical preference datasets have only a single, or at most a few, annotation per preference pair, which causes DPO to overconfidently assign rewards that trend towards infinite magnitude. This frequently leads to degenerate policies, sometimes causing even the probabilities of the preferred generations to go to zero. In this work, we analyze this phenomenon and propose distillation to get a better proxy for the true preference distribution over generation pairs: we train the LM to produce probabilities that match the distribution induced by a reward model trained on the preference data. Moreover, to account for uncertainty in the reward model we are distilling from, we optimize against a family of reward models that, as a whole, is likely to include at least one reasonable proxy for the preference distribution. Our results show that distilling from such a family of reward models leads to improved robustness to distribution shift in preference annotations, while preserving the simple supervised nature of DPO.
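For context, the standard DPO objective this work builds on scores a preference pair by the margin of implicit rewards. Below is a minimal sketch with toy log-probabilities; β and the numbers are assumptions, and the paper's distillation variant replaces the hard preference labels with reward-model targets rather than using this loss directly:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss: -log sigmoid(beta * margin of implicit rewards)."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Toy log-probabilities: the policy slightly prefers the chosen response.
loss = dpo_loss(logp_w=-5.0, logp_l=-6.0, ref_logp_w=-5.5, ref_logp_l=-5.5)
# The loss falls below log(2) whenever the preference margin is positive.
```

The overconfidence problem the paper describes shows up here as the loss pushing the margin toward infinity when a pair has only a single annotation.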

[NLP-6] Matryoshka Query Transformer for Large Vision-Language Models

Link: https://arxiv.org/abs/2405.19315
Authors: Wenbo Hu, Zi-Yi Dou, Liunian Harold Li, Amita Kamath, Nanyun Peng, Kai-Wei Chang
Keywords: Large Vision-Language Models, Large Vision-Language, visual tokens, tokens, visual
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Preprint. Our code and model are publicly available at this https URL

Abstract:Large Vision-Language Models (LVLMs) typically encode an image into a fixed number of visual tokens (e.g., 576) and process these tokens with a language model. Despite their strong performance, LVLMs face challenges in adapting to varying computational constraints. This raises the question: can we achieve flexibility in the number of visual tokens to suit different tasks and computational resources? We answer this with an emphatic yes. Inspired by Matryoshka Representation Learning, we introduce the Matryoshka Query Transformer (MQT), capable of encoding an image into m visual tokens during inference, where m can be any number up to a predefined maximum. This is achieved by employing a query transformer with M latent query tokens to compress the visual embeddings. During each training step, we randomly select m ≤ M latent query tokens and train the model using only these first m tokens, discarding the rest. Combining MQT with LLaVA, we train a single model once, and flexibly and drastically reduce the number of inference-time visual tokens while maintaining similar or better performance compared to training independent models for each number of tokens. Our model, MQT-LLAVA, matches LLaVA-1.5 performance across 11 benchmarks using a maximum of 256 tokens instead of LLaVA’s fixed 576. Reducing to 16 tokens (8x less TFLOPs) only sacrifices the performance by 2.4 points on MMBench. On certain tasks such as ScienceQA and MMMU, we can even go down to only 2 visual tokens with performance drops of just 3% and 6% each. Our exploration of the trade-off between the accuracy and computational cost brought about by the number of visual tokens facilitates future research to achieve the best of both worlds.
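The training trick described above — drawing a random number m of latent query tokens per step and keeping only the first m — can be sketched as follows; the token names and the maximum M = 256 are placeholders, not the authors' training code:

```python
import random

def select_query_tokens(latent_queries, rng):
    """Keep only a random-length prefix of the M latent query tokens,
    so every prefix length is trained and usable at inference."""
    m = rng.randint(1, len(latent_queries))  # m drawn anew each step
    return latent_queries[:m]

rng = random.Random(0)
queries = [f"q{i}" for i in range(256)]  # M = 256 latent query tokens
kept = select_query_tokens(queries, rng)
# Training only on prefixes is what lets m shrink at inference time.
```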

[NLP-7] Language Models Trained to do Arithmetic Predict Human Risky and Intertemporal Choice

Link: https://arxiv.org/abs/2405.19313
Authors: Jian-Qiao Zhu, Haijiang Yan, Thomas L. Griffiths
Keywords: Large Language Models, Large Language, Language Models, LLMs, prompted researchers
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); General Economics (econ.GN)

Abstract:The observed similarities in the behavior of humans and Large Language Models (LLMs) have prompted researchers to consider the potential of using LLMs as models of human cognition. However, several significant challenges must be addressed before LLMs can be legitimately regarded as cognitive models. For instance, LLMs are trained on far more data than humans typically encounter, and may have been directly trained on human data in specific cognitive tasks or aligned with human preferences. Consequently, the origins of these behavioral similarities are not well understood. In this paper, we propose a novel way to enhance the utility of LLMs as cognitive models. This approach involves (i) leveraging computationally equivalent tasks that both an LLM and a rational agent need to master for solving a cognitive problem and (ii) examining the specific task distributions required for an LLM to exhibit human-like behaviors. We apply this approach to decision-making – specifically risky and intertemporal choice – where the key computationally equivalent task is the arithmetic of expected value calculations. We show that an LLM pretrained on an ecologically valid arithmetic dataset, which we call Arithmetic-GPT, predicts human behavior better than many traditional cognitive models. Pretraining LLMs on ecologically valid arithmetic datasets is sufficient to produce a strong correspondence between these models and human decision-making. Our results also suggest that LLMs used as cognitive models should be carefully investigated via ablation studies of the pretraining data.
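The "computationally equivalent task" for risky choice named above is expected-value arithmetic; a minimal illustration (the gamble payoffs are invented):

```python
def expected_value(outcomes):
    """EV = sum of probability-weighted payoffs of a gamble."""
    return sum(p * x for p, x in outcomes)

# Choose between a sure $45 and a 50/50 gamble for $100.
sure_thing = [(1.0, 45.0)]
gamble = [(0.5, 100.0), (0.5, 0.0)]
best = max([sure_thing, gamble], key=expected_value)
# An EV-maximizer takes the gamble (EV 50 > 45); humans are often risk-averse here.
```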

[NLP-8] Expert-Guided Extinction of Toxic Tokens for Debiased Generation

Link: https://arxiv.org/abs/2405.19299
Authors: Xueyao Sun, Kaize Shi, Haoran Tang, Guandong Xu, Qing Li
Keywords: Large language models, Large language, language models, toxic prompts, Large
Subjects: Computation and Language (cs.CL)

Abstract:Large language models (LLMs) can elicit social bias during generations, especially when inference with toxic prompts. Controlling the sensitive attributes in generation encounters challenges in data distribution, generalizability, and efficiency. Specifically, fine-tuning and retrieval demand extensive unbiased corpus, while direct prompting requires meticulously curated instructions for correcting the output in multiple rounds of thoughts but poses challenges on memory and inference latency. In this work, we propose the Expert-Guided Extinction of Toxic Tokens for Debiased Generation (EXPOSED) to eliminate the undesired harmful outputs for LLMs without the aforementioned requirements. EXPOSED constructs a debiasing expert based on the abundant toxic corpus to expose and elicit the potentially dangerous tokens. It then processes the output to the LLMs and constructs a fair distribution by suppressing and attenuating the toxic tokens. EXPOSED is evaluated on fairness benchmarks over three LLM families. Extensive experiments demonstrate that compared with other baselines, the proposed EXPOSED significantly reduces the potential social bias while balancing fairness and generation performance.
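One way to picture the "suppressing and attenuating the toxic tokens" step is as a penalty applied to flagged logits before sampling. This is a schematic sketch under that assumption, not the authors' implementation; the penalty value and token ids are made up:

```python
def attenuate_tokens(logits, flagged_ids, penalty=5.0):
    """Subtract a fixed penalty from the logits of flagged (toxic) tokens,
    shifting probability mass toward the remaining continuations."""
    return [z - penalty if i in flagged_ids else z
            for i, z in enumerate(logits)]

logits = [2.0, 1.5, 0.5]                      # toy next-token logits
safe = attenuate_tokens(logits, flagged_ids={0})
# Token 0 drops from most likely to least likely after attenuation.
```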

[NLP-9] Integrating Multi-scale Contextualized Information for Byte-based Neural Machine Translation

Link: https://arxiv.org/abs/2405.19290
Authors: Langlin Huang, Yang Feng
Keywords: Neural Machine Translation, Neural Machine, building in Neural, Machine Translation, Subword tokenization
Subjects: Computation and Language (cs.CL)
Comments: Accepted by ACL 2024 Findings

Abstract:Subword tokenization is a common method for vocabulary building in Neural Machine Translation (NMT) models. However, increasingly complex tasks have revealed its disadvantages. First, a vocabulary cannot be modified once it is learned, making it hard to adapt to new words. Second, in multilingual translation, the imbalance in data volumes across different languages spreads to the vocabulary, exacerbating translations involving low-resource languages. While byte-based tokenization addresses these issues, byte-based models struggle with the low information density inherent in UTF-8 byte sequences. Previous works enhance token semantics through local contextualization but fail to select an appropriate contextualizing scope based on the input. Consequently, we propose the Multi-Scale Contextualization (MSC) method, which learns contextualized information of varying scales across different hidden state dimensions. It then leverages the attention module to dynamically integrate the multi-scale contextualized information. Experiments show that MSC significantly outperforms subword-based and other byte-based methods in both multilingual and out-of-domain scenarios. Code can be found in this https URL.
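The low information density of UTF-8 byte sequences mentioned above is easy to see in a couple of lines: each CJK character becomes three byte-level tokens, so a byte model must aggregate several positions just to recover one character — which is what multi-scale contextualization is designed to help with.

```python
text = "机器翻译"  # "machine translation" in Chinese: 4 characters
byte_tokens = list(text.encode("utf-8"))
# 4 characters expand to 12 byte-level tokens (3 bytes per CJK character).
ratio = len(byte_tokens) / len(text)
```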

[NLP-10] MASSIVE Multilingual Abstract Meaning Representation: A Dataset and Baselines for Hallucination Detection

Link: https://arxiv.org/abs/2405.19285
Authors: Michael Regan, Shira Wein, George Baker, Emilio Monti
Keywords: Abstract Meaning Representation, Meaning Representation, Abstract Meaning, core meaning, semantic formalism
Subjects: Computation and Language (cs.CL)

Abstract:Abstract Meaning Representation (AMR) is a semantic formalism that captures the core meaning of an utterance. There has been substantial work developing AMR corpora in English and more recently across languages, though the limited size of existing datasets and the cost of collecting more annotations are prohibitive. With both engineering and scientific questions in mind, we introduce MASSIVE-AMR, a dataset with more than 84,000 text-to-graph annotations, currently the largest and most diverse of its kind: AMR graphs for 1,685 information-seeking utterances mapped to 50+ typologically diverse languages. We describe how we built our resource and its unique features before reporting on experiments using large language models for multilingual AMR and SPARQL parsing as well as applying AMRs for hallucination detection in the context of knowledge base question answering, with results shedding light on persistent issues using LLMs for structured parsing.

[NLP-11] PediatricsGPT: Large Language Models as Chinese Medical Assistants for Pediatric Applications

Link: https://arxiv.org/abs/2405.19266
Authors: Dingkang Yang, Jinjie Wei, Dongling Xiao, Shunli Wang, Tong Wu, Gang Li, Mingcheng Li, Shuaibing Wang, Jiawei Chen, Yue Jiang, Qingyao Xu, Ke Li, Peng Zhai, Lihua Zhang
Keywords: Developing intelligent pediatric, consultation systems offers, systems offers promising, offers promising prospects, improving diagnostic efficiency
Subjects: Computation and Language (cs.CL)
Comments: A Technical Report on a Powerful Chinese Medical Large Language Model

Abstract:Developing intelligent pediatric consultation systems offers promising prospects for improving diagnostic efficiency, especially in China, where healthcare resources are scarce. Despite recent advances in Large Language Models (LLMs) for Chinese medicine, their performance is sub-optimal in pediatric applications due to inadequate instruction data and vulnerable training procedures. To address the above issues, this paper builds PedCorpus, a high-quality dataset of over 300,000 multi-task instructions from pediatric textbooks, guidelines, and knowledge graph resources to fulfil diverse diagnostic demands. Upon well-designed PedCorpus, we propose PediatricsGPT, the first Chinese pediatric LLM assistant built on a systematic and robust training pipeline. In the continuous pre-training phase, we introduce a hybrid instruction pre-training mechanism to mitigate the internal-injected knowledge inconsistency of LLMs for medical domain adaptation. Immediately, the full-parameter Supervised Fine-Tuning (SFT) is utilized to incorporate the general medical knowledge schema into the models. After that, we devise a direct following preference optimization to enhance the generation of pediatrician-like humanistic responses. In the parameter-efficient secondary SFT phase, a mixture of universal-specific experts strategy is presented to resolve the competency conflict between medical generalist and pediatric expertise mastery. Extensive results based on the metrics, GPT-4, and doctor evaluations on distinct doctor downstream tasks show that PediatricsGPT consistently outperforms previous Chinese medical LLMs. Our model and dataset will be open-source for community development.

[NLP-12] AlchemistCoder: Harmonizing and Eliciting Code Capability by Hindsight Tuning on Multi-source Data

Link: https://arxiv.org/abs/2405.19265
Authors: Zifan Song, Yudong Wang, Wenwei Zhang, Kuikun Liu, Chengqi Lyu, Demin Song, Qipeng Guo, Hang Yan, Dahua Lin, Kai Chen, Cairong Zhao
Keywords: Open-source Large Language, Large Language Models, Open-source Large, Large Language, delivered impressive performance
Subjects: Computation and Language (cs.CL)
Comments: Preprint with 20 pages and 20 figures. Source code and models at this https URL

Abstract:Open-source Large Language Models (LLMs) and their specialized variants, particularly Code LLMs, have recently delivered impressive performance. However, previous Code LLMs are typically fine-tuned on single-source data with limited quality and diversity, which may insufficiently elicit the potential of pre-trained Code LLMs. In this paper, we present AlchemistCoder, a series of Code LLMs with enhanced code generation and generalization capabilities fine-tuned on multi-source data. To achieve this, we pioneer to unveil inherent conflicts among the various styles and qualities in multi-source code corpora and introduce data-specific prompts with hindsight relabeling, termed AlchemistPrompts, to harmonize different data sources and instruction-response pairs. Additionally, we propose incorporating the data construction process into the fine-tuning data as code comprehension tasks, including instruction evolution, data filtering, and code review. Extensive experiments demonstrate that AlchemistCoder holds a clear lead among all models of the same size (6.7B/7B) and rivals or even surpasses larger models (15B/33B/70B), showcasing the efficacy of our method in refining instruction-following capabilities and advancing the boundaries of code intelligence.

[NLP-13] Weak-to-Strong Search: Align Large Language Models via Searching over Small Language Models

Link: https://arxiv.org/abs/2405.19262
Authors: Zhanhui Zhou, Zhixuan Liu, Jie Liu, Zhichen Dong, Chao Yang, Yu Qiao
Keywords: large language model, Large language, human preferences, Large, fine-tuned to align
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Large language models are usually fine-tuned to align with human preferences. However, fine-tuning a large language model can be challenging. In this work, we introduce weak-to-strong search, framing the alignment of a large language model as a test-time greedy search to maximize the log-likelihood difference between small tuned and untuned models while sampling from the frozen large model. This method serves both as (i) a compute-efficient model up-scaling strategy that avoids directly tuning the large model and as (ii) an instance of weak-to-strong generalization that enhances a strong model with weak test-time guidance. Empirically, we demonstrate the flexibility of weak-to-strong search across different tasks. In controlled-sentiment generation and summarization, we use tuned and untuned gpt2 models to effectively improve the alignment of large models without additional training. Crucially, in a more difficult instruction-following benchmark, AlpacaEval 2.0, we show that reusing off-the-shelf small model pairs (e.g., zephyr-7b-beta and its untuned version) can significantly improve the length-controlled win rates of both white-box and black-box large models against gpt-4-turbo (e.g., 34.4 → 37.9 for Llama-3-70B-Instruct and 16.0 → 20.1 for gpt-3.5-turbo-instruct), despite the small models' low win rates of approximately 10.0.
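The guided scoring at the heart of weak-to-strong search — steering the frozen large model by the small models' tuned-minus-untuned log-likelihood difference — can be sketched per candidate token. The candidate names, scores, and the weight beta below are toy assumptions:

```python
def guided_score(logp_large, logp_tuned, logp_untuned, beta=1.0):
    """Frozen large model's log-prob plus the small models'
    tuned-minus-untuned log-likelihood difference."""
    return logp_large + beta * (logp_tuned - logp_untuned)

# Greedy step over two toy candidate tokens.
candidates = {
    "aligned_reply": guided_score(-1.2, -0.8, -2.0),  # tuned small model likes it
    "raw_reply": guided_score(-1.0, -2.5, -1.0),
}
choice = max(candidates, key=candidates.get)
# The guidance flips the pick away from the large model's raw argmax ("raw_reply").
```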

[NLP-14] Faster Cascades via Speculative Decoding
[NLP-14] 通过推测解码更快的级联

链接: https://arxiv.org/abs/2405.19261
作者: Harikrishna Narasimhan,Wittawat Jitkrittum,Ankit Singh Rawat,Seungyeon Kim,Neha Gupta,Aditya Krishna Menon,Sanjiv Kumar
关键词: models’ inference efficiency, improving language models’, language models’ inference, inference efficiency, speculative decoding
中文关键词: 模型的推理效率,改进语言模型,语言模型的推理,推理效率,推测解码
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Cascades and speculative decoding are two common approaches to improving language models’ inference efficiency. Both approaches involve interleaving models of different sizes, but via fundamentally distinct mechanisms: cascades employ a deferral rule that invokes the larger model only for “hard” inputs, while speculative decoding uses speculative execution to primarily invoke the larger model in parallel verification mode. These mechanisms offer different benefits: empirically, cascades are often capable of yielding better quality than even the larger model, while theoretically, speculative decoding offers a guarantee of quality-neutrality. In this paper, we leverage the best of both these approaches by designing new speculative cascading techniques that implement their deferral rule through speculative execution. We characterize the optimal deferral rule for our speculative cascades, and employ a plug-in approximation to the optimal rule. Through experiments with T5 models on benchmark language tasks, we show that the proposed approach yields better cost-quality trade-offs than cascading and speculative decoding baselines.
摘要:级联和推测解码是提高语言模型推理效率的两种常用方法。两者都交替使用不同规模的模型,但机制上有根本区别:级联采用延迟规则,仅对“困难”输入调用较大的模型;而推测解码利用推测执行,主要以并行验证模式调用较大的模型。这两种机制各有优势:经验上,级联往往能产生比更大的模型更好的质量;理论上,推测解码则提供了质量中立的保证。在本文中,我们设计了新的推测级联技术,通过推测执行来实现其延迟规则,从而兼取两种方法之长。我们刻画了推测级联的最优延迟规则,并对该最优规则采用插件式近似。在基准语言任务上使用 T5 模型的实验表明,所提方法比级联和推测解码基线取得了更好的成本-质量折衷。
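下面的 Python 玩具示例勾勒“通过推测执行实现延迟规则”的单步验证逻辑(阈值式规则只是对论文最优延迟规则的一个假设性简化,分布与 alpha 均为虚构):

```python
def speculative_cascade_step(draft, p_small, p_large, alpha=0.5):
    """单步推测级联示意:小模型先给出草稿词元 draft,大模型并行验证;
    若大模型对该词元的概率不算过低则接受草稿,否则延迟给大模型。"""
    if p_large.get(draft, 0.0) >= alpha * p_small[draft]:
        return draft, "accept"
    return max(p_large, key=p_large.get), "defer"

p_small  = {"cat": 0.8, "dog": 0.2}   # 小模型的下一词元分布,草稿为 "cat"
agree    = {"cat": 0.5, "dog": 0.5}   # 大模型大体同意
disagree = {"cat": 0.1, "dog": 0.9}   # 大模型强烈反对
```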

[NLP-15] Lower Bounds on the Expressivity of Recurrent Neural Language Models
[NLP-15] 回归神经语言模型表达能力的下限

链接: https://arxiv.org/abs/2405.19222
作者: Anej Svete,Franz Nowak,Anisha Mohamed Sahabdeen,Ryan Cotterell
关键词: representational capacity, recent successes, successes and spread, spread of large
中文关键词: 表示能力、最近的成功、成功与传播、大型模型的传播
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The recent successes and spread of large neural language models (LMs) call for a thorough understanding of their computational ability. Describing their computational abilities through LMs' representational capacity is a lively area of research. However, investigation into the representational capacity of neural LMs has predominantly focused on their ability to recognize formal languages. For example, recurrent neural networks (RNNs) with Heaviside activations are tightly linked to regular languages, i.e., languages defined by finite-state automata (FSAs). Such results, however, fall short of describing the capabilities of RNN language models (LMs), which are definitionally distributions over strings. We take a fresh look at the representational capacity of RNN LMs by connecting them to probabilistic FSAs and demonstrate that RNN LMs with linearly bounded precision can express arbitrary regular LMs.
摘要:大型神经语言模型(LM)近来的成功与普及要求我们深入理解其计算能力。通过 LM 的表示能力来刻画其计算能力是一个活跃的研究领域。然而,对神经 LM 表示能力的研究主要集中在它们识别形式语言的能力上。例如,采用 Heaviside 激活的循环神经网络(RNN)与正则语言(即由有限状态自动机(FSA)定义的语言)紧密相关。然而,这类结果不足以刻画 RNN 语言模型(LM)的能力,因为语言模型在定义上是字符串上的概率分布。我们通过将 RNN LM 与概率 FSA 联系起来,重新审视其表示能力,并证明具有线性有界精度的 RNN LM 可以表达任意正则 LM。
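为直观说明“语言模型在定义上是字符串上的分布”,下面用 Python 给出一个玩具的确定性概率 FSA,并计算它赋予某个字符串的概率(转移表与概率均为虚构示例,并非论文中的构造):

```python
def pfsa_prob(s, start, trans, stop):
    """确定性概率 FSA 下字符串 s 的概率:
    trans[state][symbol] = (下一状态, 发射概率),stop[state] = 停机概率。"""
    state, p = start, 1.0
    for sym in s:
        if sym not in trans[state]:
            return 0.0  # 无对应转移,概率为零
        nxt, q = trans[state][sym]
        state, p = nxt, p * q
    return p * stop[state]

# 两状态 PFSA:每个状态的发射概率与停机概率之和为 1,故定义了合法的字符串分布
trans = {0: {"a": (0, 0.6), "b": (1, 0.3)}, 1: {"a": (1, 0.5)}}
stop = {0: 0.1, 1: 0.5}
```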

[NLP-16] WRDScore: New Metric for Evaluation of Natural Language Generation Models
[NLP-16] WRDScore:自然语言生成模型评估的新指标

链接: https://arxiv.org/abs/2405.19220
作者: Ravil Mussabayev
关键词: natural language generation, faces significant difficulties, language generation, faces significant, test data
中文关键词: 自然语言生成,面临重大困难,语言生成,面临重大测试数据
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The problem of natural language generation, and, more specifically, method name prediction, faces significant difficulties when proposed models need to be evaluated on test data. A suitable metric would need to account for the versatility with which a single method can be named, with respect to both semantics and syntax. Measuring the direct overlap between the predicted and reference (true) sequences cannot capture these subtleties. Other existing embedding-based metrics either do not measure precision and recall or impose strict, unrealistic assumptions on both sequences. To address these issues, we propose a new metric that, on the one hand, is very simple and lightweight, and, on the other hand, is able to calculate precision and recall without resorting to any assumptions while achieving good agreement with human judgement.
摘要:当所提出的模型需要在测试数据上进行评估时,自然语言生成问题(更具体地说是方法名预测)面临重大困难。合适的度量需要在语义和语法两个层面考虑同一方法可以有多种不同命名方式这一事实。直接衡量预测序列与参考(真实)序列之间的重叠无法捕捉这些细微之处。其他现有的基于嵌入的度量要么不能同时衡量精确率和召回率,要么对两个序列施加了严格而不切实际的假设。为了解决这些问题,我们提出了一种新度量:它一方面非常简单、轻量,另一方面无需依赖任何假设即可计算精确率和召回率,同时在与人工判断的一致性上表现良好。
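下面用纯 Python 勾勒“基于嵌入、同时给出精确率与召回率”的打分思路(这里采用 BERTScore 式的贪婪软匹配作为示意性替代,并非 WRDScore 的真实公式;词向量均为虚构):

```python
def cos(u, v):
    """两个向量的余弦相似度。"""
    num = sum(a * b for a, b in zip(u, v))
    den = (sum(a * a for a in u) ** 0.5) * (sum(b * b for b in v) ** 0.5)
    return num / den

def soft_precision_recall(pred_vecs, ref_vecs):
    """每个预测词元与最相似的参考词元软匹配得到精确率,反向匹配得到召回率。"""
    precision = sum(max(cos(p, r) for r in ref_vecs) for p in pred_vecs) / len(pred_vecs)
    recall = sum(max(cos(p, r) for p in pred_vecs) for r in ref_vecs) / len(ref_vecs)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# 玩具词向量:预测的方法名比参考多出一个语义无关的词元
pred = [(1.0, 0.0), (0.0, 1.0)]
ref = [(1.0, 0.0)]
p, r, f1 = soft_precision_recall(pred, ref)
```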

[NLP-17] VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos
[NLP-17] VideoTree:用于长视频LLM推理的自适应基于树的视频表示

链接: https://arxiv.org/abs/2405.19209
作者: Ziyang Wang,Shoubin Yu,Elias Stengel-Eskin,Jaehong Yoon,Feng Cheng,Gedas Bertasius,Mohit Bansal
关键词: Video-language understanding tasks, short video clips, video understanding tasks, Large Language Models, understanding tasks
中文关键词: 视频语言理解任务、短视频剪辑、视频理解任务、大型语言模型、理解任务
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 20 pages, first three authors contributed equally; Project page: this https URL

点击查看摘要

Abstract:Video-language understanding tasks have focused on short video clips, often struggling with long-form video understanding tasks. Recently, many long video-language understanding approaches have leveraged the reasoning capabilities of Large Language Models (LLMs) to perform long video QA, transforming videos into densely sampled frame captions, and asking LLMs to respond to text queries over captions. However, the frames used for captioning are often redundant and contain irrelevant information, making dense sampling inefficient, and ignoring the fact that video QA requires varying levels of granularity, with some video segments being highly relevant to the question (needing more fine-grained detail) while others being less relevant. Thus, these LLM-based approaches are prone to missing information and operate on large numbers of irrelevant captions, lowering both performance and efficiency. To address these issues, we introduce VideoTree, a query-adaptive and hierarchical framework for long-video understanding with LLMs. VideoTree dynamically extracts query-related information from a video and builds a tree-based representation for LLM reasoning. First, VideoTree adaptively selects frames for captioning by iteratively clustering frames based on their visual features and scoring clusters using their relevance to the query. Second, it organizes visual clusters into a query-adaptive and hierarchical tree structure; the tree encodes varying levels of granularity, with higher resolution on relevant segments. Finally, VideoTree produces an answer by traversing the tree’s keyframes and passing their captions to an LLM answerer. Our method improves both reasoning accuracy and efficiency compared to existing methods: VideoTree achieves a 7.0%, 2.2%, and 2.7% accuracy gain over baselines on the EgoSchema, NExT-QA, and IntentQA benchmarks, respectively, while reducing inference time by 40%.
摘要:视频语言理解任务大多集中在短视频片段上,在长视频理解任务上往往表现不佳。最近,许多长视频语言理解方法利用大型语言模型(LLM)的推理能力来执行长视频问答:将视频转换为密集采样的帧字幕,并让 LLM 针对字幕回答文本查询。然而,用于生成字幕的帧通常是冗余的且包含无关信息,使得密集采样效率低下;这也忽略了视频问答需要不同粒度的事实:有些视频片段与问题高度相关(需要更细粒度的细节),而另一些则相关性较低。因此,这些基于 LLM 的方法容易遗漏信息,并且要处理大量无关字幕,从而降低了性能和效率。为了解决这些问题,我们提出了 VideoTree,一个用于 LLM 长视频理解的查询自适应分层框架。VideoTree 从视频中动态提取与查询相关的信息,并构建基于树的表示以供 LLM 推理。首先,VideoTree 基于视觉特征对帧进行迭代聚类,并根据聚类与查询的相关性进行评分,从而自适应地选择要生成字幕的帧。其次,它将视觉聚类组织成查询自适应的分层树结构;该树编码不同级别的粒度,在相关片段上具有更高的分辨率。最后,VideoTree 遍历树的关键帧并将其字幕传递给 LLM 回答器来生成答案。与现有方法相比,我们的方法同时提高了推理准确率和效率:在 EgoSchema、NExT-QA 和 IntentQA 基准上,VideoTree 分别比基线提高了 7.0%、2.2% 和 2.7% 的准确率,同时将推理时间减少了 40%。
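下面的玩具函数示意“查询自适应选帧”的思想:与查询相关的片段采样更密,无关片段采样更稀。这只是对 VideoTree 分层聚类的一个极度扁平化的假设性替代,步长与阈值均为虚构:

```python
def adaptive_keyframes(sims, fine_stride=1, coarse_stride=3, thresh=0.5):
    """sims[i] 为第 i 帧与查询的相关度;
    相关区域以细步长逐帧保留,无关区域以粗步长跳过。"""
    keep, i = [], 0
    while i < len(sims):
        keep.append(i)
        i += fine_stride if sims[i] >= thresh else coarse_stride
    return keep

sims = [0.9, 0.8, 0.1, 0.1, 0.1, 0.7, 0.6]  # 玩具相关度序列
```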

[NLP-18] MetaToken: Detecting Hallucination in Image Descriptions by Meta Classification
[NLP-18] MetaToken:通过元分类检测图像描述中的幻觉

链接: https://arxiv.org/abs/2405.19186
作者: Laura Fieback(1,2),Jakob Spiegelberg(1),Hanno Gottschalk(2) ((1) Volkswagen AG, (2) TU Berlin)
关键词: Vision Language Models, shown remarkable capabilities, Large Vision Language, visual question answering, Language Models
中文关键词: 视觉语言模型,表现出非凡的能力,大视觉语言,视觉问答,语言模型
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 18 pages, 8 figures

点击查看摘要

Abstract:Large Vision Language Models (LVLMs) have shown remarkable capabilities in multimodal tasks like visual question answering or image captioning. However, inconsistencies between the visual information and the generated text, a phenomenon referred to as hallucinations, remain an unsolved problem with regard to the trustworthiness of LVLMs. To address this problem, recent works proposed to incorporate computationally costly Large (Vision) Language Models in order to detect hallucinations on a sentence- or subsentence-level. In this work, we introduce MetaToken, a lightweight binary classifier to detect hallucinations on the token-level at negligible cost. Based on a statistical analysis, we reveal key factors of hallucinations in LVLMs which have been overlooked in previous works. MetaToken can be applied to any open-source LVLM without any knowledge about ground truth data providing a reliable detection of hallucinations. We evaluate our method on four state-of-the-art LVLMs demonstrating the effectiveness of our approach.
摘要:大型视觉语言模型(LVLM)在视觉问答或图像字幕等多模态任务中表现出了卓越的能力。然而,视觉信息与生成文本之间的不一致(即所谓的幻觉现象)仍然是影响 LVLM 可信度的一个悬而未决的问题。为了解决这一问题,近期工作提出引入计算代价高昂的大型(视觉)语言模型,以在句子或子句级别检测幻觉。在这项工作中,我们提出了 MetaToken,一个轻量级二分类器,能以可忽略的成本在词元(token)级别检测幻觉。基于统计分析,我们揭示了以往工作所忽视的 LVLM 产生幻觉的关键因素。MetaToken 可应用于任何开源 LVLM,无需任何真值数据知识,即可提供可靠的幻觉检测。我们在四个最先进的 LVLM 上评估了我们的方法,证明了其有效性。
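下面用纯 Python 勾勒一个“轻量级词元级二分类器”的最小示例:对每个词元的两个简单特征训练逻辑回归。特征([词元概率, 归一化熵])与数据完全是虚构的,仅用于示意,并非 MetaToken 实际使用的特征或实现:

```python
import math

def train_logreg(X, y, lr=0.5, epochs=300):
    """用随机梯度下降训练极简逻辑回归分类器。"""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - yi  # 交叉熵损失对 z 的梯度
            w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
            b -= lr * g
    return w, b

def is_hallucinated(w, b, x):
    """预测该词元是否为幻觉(概率 > 0.5)。"""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z)) > 0.5

# 玩具特征:[词元概率, 归一化熵];标签 1 = 幻觉词元
X = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.9], [0.1, 0.8]]
y = [0, 0, 1, 1]
w, b = train_logreg(X, y)
```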

[NLP-19] DGRC: An Effective Fine-tuning Framework for Distractor Generation in Chinese Multi-choice Reading Comprehension
[NLP-19] DGRC:中文多项选择阅读理解中干扰因子生成的有效微调框架

链接: https://arxiv.org/abs/2405.19139
作者: Runfeng Lin,Dacheng Xu,Huijiang Wang,Zebiao Chen,Yating Wang,Shouqiang Liu
关键词: learner knowledge proficiency, standardized tests, evaluating a learner, efficient and widely, widely used format
中文关键词: 学习者知识熟练程度、标准化测试、评估学习者、高效且广泛使用的格式
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:When evaluating a learner's knowledge proficiency, the multiple-choice question is an efficient and widely used format in standardized tests. Nevertheless, generating these questions, particularly plausible distractors (incorrect options), poses a considerable challenge. Generally, distractor generation can be classified into cloze-style distractor generation (CDG) and natural questions distractor generation (NQDG). In contrast to CDG, utilizing pre-trained language models (PLMs) for NQDG presents three primary challenges: (1) PLMs are typically trained to generate "correct" content, like answers, while rarely trained to generate "plausible" content, like distractors; (2) PLMs often struggle to produce content that aligns well with specific knowledge and the style of exams; (3) NQDG requires the model to produce longer, context-sensitive, and question-relevant distractors. In this study, we introduce a fine-tuning framework named DGRC for NQDG in Chinese multi-choice reading comprehension from authentic examinations. DGRC comprises three major components: hard chain-of-thought, multi-task learning, and generation mask patterns. The experiment results demonstrate that DGRC significantly enhances generation performance, achieving a more than 2.5-fold improvement in BLEU scores.
摘要:在评估学习者的知识水平时,多项选择题是标准化考试中一种高效且被广泛使用的题型。然而,生成这些题目,尤其是生成貌似合理的干扰项(错误选项),是一个相当大的挑战。一般而言,干扰项生成可分为完形填空式干扰项生成(CDG)和自然问题干扰项生成(NQDG)。与 CDG 相比,利用预训练语言模型(PLM)进行 NQDG 面临三个主要挑战:(1)PLM 通常被训练来生成“正确”的内容(如答案),而很少被训练来生成“貌似合理”的内容(如干扰项);(2)PLM 往往难以生成与特定知识和考试风格良好契合的内容;(3)NQDG 要求模型生成更长的、上下文敏感且与题目相关的干扰项。在这项研究中,我们针对源自真实考试的中文多项选择阅读理解中的 NQDG,提出了一个名为 DGRC 的微调框架。DGRC 包括三个主要组成部分:硬思维链、多任务学习和生成掩码模式。实验结果表明,DGRC 显著提升了生成性能,BLEU 分数提高了 2.5 倍以上。

[NLP-20] PathReasoner: Modeling Reasoning Path with Equivalent Extension for Logical Question Answering
[NLP-20] PathReasoner:为逻辑问题解答建模具有等效扩展的推理路径

链接: https://arxiv.org/abs/2405.19109
作者: Fangzhi Xu,Qika Lin,Tianzhe Zhao,Jiawei Han,Jun Liu
关键词: attracted great interest, Logical, Logical reasoning task, Logical reasoning, reasoning
中文关键词: 引起了极大的兴趣,逻辑,逻辑推理任务,逻辑推理,推理
类目: Computation and Language (cs.CL)
备注: Accepted by ACL 2024

点击查看摘要

Abstract:Logical reasoning task has attracted great interest since it was proposed. Faced with such a task, current competitive models, even large language models (e.g., ChatGPT and PaLM 2), still perform badly. Previous promising LMs struggle in logical consistency modeling and logical structure perception. To this end, we model the logical reasoning task by transforming each logical sample into reasoning paths and propose an architecture named PathReasoner. It addresses the task from the views of both data and model. To expand the diversity of the logical samples, we propose an atom extension strategy supported by equivalent logical formulas, to form new reasoning paths. From the model perspective, we design a stack of transformer-style blocks. In particular, we propose a path-attention module to joint model in-atom and cross-atom relations with the high-order diffusion strategy. Experiments show that PathReasoner achieves competitive performances on two logical reasoning benchmarks and great generalization abilities.
摘要:逻辑推理任务自提出以来就引起了极大的兴趣。面对这类任务,当前有竞争力的模型,甚至大型语言模型(如 ChatGPT 和 PaLM 2)仍然表现不佳。此前有潜力的语言模型在逻辑一致性建模和逻辑结构感知方面存在困难。为此,我们通过将每个逻辑样本转换为推理路径来对逻辑推理任务建模,并提出了名为 PathReasoner 的架构,它从数据和模型两个角度来处理该任务。为了扩大逻辑样本的多样性,我们提出了一种由等价逻辑公式支持的原子扩展策略,以构造新的推理路径。在模型方面,我们设计了由 Transformer 风格模块组成的堆叠结构。特别地,我们提出了一个路径注意力模块,结合高阶扩散策略对原子内和原子间的关系进行联合建模。实验表明,PathReasoner 在两个逻辑推理基准上取得了有竞争力的性能,并具有很强的泛化能力。

[NLP-21] Faithful Chart Summarization with ChaTS-Pi
[NLP-21] 使用ChaTS-Pi进行忠实图表总结

链接: https://arxiv.org/abs/2405.19094
作者: Syrine Krichene,Francesco Piccinno,Fangyu Liu,Julian Martin Eisenschlos
关键词: visually impaired people, communicate insights, explore data, impaired people, visually impaired
中文关键词: 视障人士,交流见解,探索数据,视障人士,视障人士
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: To be published in the proceedings of the 2024 Annual Meeting of the Association for Computational Linguistics

点击查看摘要

Abstract:Chart-to-summary generation can help explore data, communicate insights, and help the visually impaired people. Multi-modal generative models have been used to produce fluent summaries, but they can suffer from factual and perceptual errors. In this work we present CHATS-CRITIC, a reference-free chart summarization metric for scoring faithfulness. CHATS-CRITIC is composed of an image-to-text model to recover the table from a chart, and a tabular entailment model applied to score the summary sentence by sentence. We find that CHATS-CRITIC evaluates the summary quality according to human ratings better than reference-based metrics, either learned or n-gram based, and can be further used to fix candidate summaries by removing not supported sentences. We then introduce CHATS-PI, a chart-to-summary pipeline that leverages CHATS-CRITIC during inference to fix and rank sampled candidates from any chart-summarization model. We evaluate CHATS-PI and CHATS-CRITIC using human raters, establishing state-of-the-art results on two popular chart-to-summary datasets.
摘要:图表到摘要的生成可以帮助探索数据、交流见解,并帮助视障人士。多模态生成模型已被用来生成流畅的摘要,但它们可能出现事实性和感知性错误。在这项工作中,我们提出了 CHATS-CRITIC,一种用于评估忠实度的无参考图表摘要度量。CHATS-CRITIC 由一个从图表中恢复表格的图像到文本模型和一个用于逐句为摘要评分的表格蕴涵模型组成。我们发现,无论与基于学习还是基于 n-gram 的有参考度量相比,CHATS-CRITIC 依据人工评分评估摘要质量的效果都更好,并且可以通过删除不被支持的句子来进一步修复候选摘要。随后我们提出了 CHATS-PI,一个图表到摘要的流水线,它在推理过程中利用 CHATS-CRITIC 来修复并排序来自任何图表摘要模型的采样候选摘要。我们使用人工评分员评估了 CHATS-PI 和 CHATS-CRITIC,在两个流行的图表到摘要数据集上取得了最先进的结果。
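下面的 Python 草图示意“逐句打分、删除不被支持句子”的流程。真实系统中的表格蕴涵模型在此用一个基于词重叠的玩具打分函数代替(表格、句子与阈值均为虚构):

```python
def overlap_entail(cells, sentence):
    """玩具蕴涵打分:句中词元出现在表格单元格中的比例
    (真实实现应替换为表格蕴涵模型,此处仅为假设性替代)。"""
    toks = sentence.lower().replace(".", "").split()
    vocab = {w for c in cells for w in str(c).lower().split()}
    return sum(t in vocab for t in toks) / len(toks)

def chart_summary_critic(cells, sentences, entail, thresh=0.5):
    """逐句为摘要打分:删除得分低于阈值的句子,句均分作为整体忠实度。"""
    scores = [entail(cells, s) for s in sentences]
    kept = [s for s, sc in zip(sentences, scores) if sc >= thresh]
    return sum(scores) / len(scores), kept

cells = ["year", "2020", "sales", "100", "2021", "sales", "150"]
sentences = ["Sales rose from 100 to 150.", "Profit doubled."]
faithfulness, kept = chart_summary_critic(cells, sentences, overlap_entail)
```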

[NLP-22] Multi-stage Retrieve and Re-rank Model for Automatic Medical Coding Recommendation
[NLP-22] 自动医疗编码推荐的多阶段排序和重排序模型

链接: https://arxiv.org/abs/2405.19093
作者: Xindi Wang,Robert E. Mercer,Frank Rudzicz
关键词: classification system encompassing, definitive medical classification, medical classification system, International Classification, range of diseases
中文关键词: 分类系统涵盖明确的医学分类、医学分类系统、国际分类、疾病范围
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: Accepted to NAACL 2024 – camera-ready version

点击查看摘要

Abstract:The International Classification of Diseases (ICD) serves as a definitive medical classification system encompassing a wide range of diseases and conditions. The primary objective of ICD indexing is to allocate a subset of ICD codes to a medical record, which facilitates standardized documentation and management of various health conditions. Most existing approaches have suffered from selecting the proper label subsets from an extremely large ICD collection with a heavy long-tailed label distribution. In this paper, we leverage a multi-stage "retrieve and re-rank" framework as a novel solution to ICD indexing, via a hybrid discrete retrieval method, and re-rank retrieved candidates with contrastive learning that allows the model to make more accurate predictions from a simplified label space. The retrieval model is a hybrid of auxiliary knowledge of the electronic health records (EHR) and a discrete retrieval method (BM25), which efficiently collects high-quality candidates. In the last stage, we propose a label co-occurrence guided contrastive re-ranking model, which re-ranks the candidate labels by pulling together the clinical notes with positive ICD codes. Experimental results show the proposed method achieves state-of-the-art performance on a number of measures on the MIMIC-III benchmark.
摘要:国际疾病分类(ICD)是一个涵盖广泛疾病和病症的权威医学分类系统。ICD 索引的主要目标是为病历分配 ICD 代码的一个子集,从而便于对各种健康状况进行标准化记录和管理。大多数现有方法都难以从标签呈严重长尾分布的超大规模 ICD 集合中选出合适的标签子集。在本文中,我们利用多阶段“检索与重排序”框架作为 ICD 索引的一种新解决方案:通过混合离散检索方法检索候选,并使用对比学习对检索到的候选重新排序,使模型能够在简化的标签空间中做出更准确的预测。检索模型混合了电子健康记录(EHR)的辅助知识与离散检索方法(BM25),能高效地收集高质量候选。在最后阶段,我们提出了一种标签共现引导的对比重排序模型,它通过将临床记录与正例 ICD 代码拉近来对候选标签重新排序。实验结果表明,所提方法在 MIMIC-III 基准的多项指标上达到了最先进的性能。
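下面用纯 Python 实现一个极简 BM25,示意混合检索器中的稀疏(离散)检索部分如何从代码描述中召回候选(ICD 代码描述与查询均为虚构的玩具数据):

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """对每个(已分词的)文档相对查询计算 BM25 得分。"""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter(t for d in docs for t in set(d))  # 文档频率
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query:
            if tf[t] == 0:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

# 玩具 ICD 代码描述(已分词)与一条临床记录查询
codes = {"I10": ["essential", "hypertension"],
         "E11": ["type", "2", "diabetes"],
         "J45": ["asthma"]}
docs = list(codes.values())
scores = bm25_scores(["patient", "with", "hypertension"], docs)
best = list(codes)[scores.index(max(scores))]
```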

[NLP-23] Cracking the Code of Juxtaposition: Can AI Models Understand the Humorous Contradictions
[NLP-23] 破解并置密码:人工智能模型能否理解幽默的矛盾

链接: https://arxiv.org/abs/2405.19088
作者: Zhe Hu,Tuo Liang,Jing Li,Yiren Lu,Yunlai Zhou,Yiran Qiao,Jing Ma,Yu Yin
关键词: demonstrated remarkable proficiency, demonstrated remarkable, remarkable proficiency, wide range, multimodal language models
中文关键词: 表现出非凡的熟练程度,表现出非凡的熟练程度,广泛的多模式语言模型
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advancements in large multimodal language models have demonstrated remarkable proficiency across a wide range of tasks. Yet, these models still struggle with understanding the nuances of human humor through juxtaposition, particularly when it involves nonlinear narratives that underpin many jokes and humor cues. This paper investigates this challenge by focusing on comics with contradictory narratives, where each comic consists of two panels that create a humorous contradiction. We introduce the YesBut benchmark, which comprises tasks of varying difficulty aimed at assessing AI’s capabilities in recognizing and interpreting these comics, ranging from literal content comprehension to deep narrative reasoning. Through extensive experimentation and analysis of recent commercial or open-sourced large (vision) language models, we assess their capability to comprehend the complex interplay of the narrative humor inherent in these comics. Our results show that even state-of-the-art models still lag behind human performance on this task. Our findings offer insights into the current limitations and potential improvements for AI in understanding human creative expressions.
摘要:大型多模态语言模型的最新进展在广泛的任务中展现出了卓越的能力。然而,这些模型在理解通过并置(juxtaposition)表达的人类幽默的细微之处时仍有困难,尤其是当其涉及支撑许多笑话和幽默线索的非线性叙事时。本文通过关注具有矛盾叙事的漫画来研究这一挑战:每幅漫画由两个画格组成,共同构成一种幽默的矛盾。我们提出了 YesBut 基准,其中包含难度各异的任务,旨在评估 AI 识别和解释这类漫画的能力,范围从字面内容理解到深层叙事推理。通过对近期商用或开源的大型(视觉)语言模型进行广泛的实验和分析,我们评估了它们理解这些漫画中固有的叙事幽默的复杂相互作用的能力。结果表明,即使是最先进的模型在这项任务上也仍落后于人类表现。我们的发现为 AI 在理解人类创造性表达方面当前的局限和潜在改进提供了见解。

[NLP-24] MEMoE: Enhancing Model Editing with Mixture of Experts Adaptors
[NLP-24] MEMoE:通过混合专家适配器增强模型编辑

链接: https://arxiv.org/abs/2405.19086
作者: Renzhi Wang,Piji Li
关键词: Large Language Models, Large Language, Language Models, behavior of Large, Model editing aims
中文关键词: 大型语言模型、大型语言、语言模型、大型行为、模型编辑目标
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Model editing aims to efficiently alter the behavior of Large Language Models (LLMs) within a desired scope, while ensuring no adverse impact on other inputs. Recent years have witnessed various model editing methods being proposed. However, these methods either exhibit poor overall performance or struggle to strike a balance between generalization and locality. We propose MEMoE, a model editing adapter utilizing a Mixture of Experts (MoE) architecture with a knowledge anchor routing strategy. MEMoE updates knowledge using a bypass MoE structure, keeping the original parameters unchanged to preserve the general ability of LLMs. And, the knowledge anchor routing ensures that inputs requiring similar knowledge are routed to the same expert, thereby enhancing the generalization of the updated knowledge. Experimental results show the superiority of our approach over both batch editing and sequential batch editing tasks, exhibiting exceptional overall performance alongside outstanding balance between generalization and locality. Our code will be available.
摘要:模型编辑旨在在期望的范围内高效地改变大型语言模型(LLM)的行为,同时确保不对其他输入产生不利影响。近年来,各种模型编辑方法相继被提出,但这些方法要么整体性能不佳,要么难以在泛化性和局部性之间取得平衡。我们提出了 MEMoE,一种采用混合专家(MoE)架构并结合知识锚点路由策略的模型编辑适配器。MEMoE 使用旁路 MoE 结构更新知识,保持原始参数不变,以保留 LLM 的通用能力。同时,知识锚点路由确保需要相似知识的输入被路由到同一个专家,从而增强更新后知识的泛化性。实验结果表明,我们的方法在批量编辑和顺序批量编辑任务上均优于现有方法,在整体性能出色的同时,在泛化性与局部性之间取得了出色的平衡。我们的代码即将公开。
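下面的 Python 草图示意“旁路 MoE + 知识锚点路由”的前向过程:冻结的基座输出保持不变,仅叠加被路由到的专家的修正量(锚点、专家与向量均为虚构的玩具设定,并非论文实现):

```python
def memoe_forward(h, base_out, experts, anchors):
    """按输入表示 h 与哪个知识锚点最相似进行路由,
    在冻结的基座输出 base_out 上叠加对应专家的修正量。"""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    idx = max(range(len(anchors)), key=lambda i: dot(h, anchors[i]))
    delta = experts[idx](h)
    return [o + d for o, d in zip(base_out, delta)], idx

# 两个玩具专家,各自负责一类知识的修正
anchors = [(1.0, 0.0), (0.0, 1.0)]
experts = [lambda h: [0.5, 0.0],   # 专家 0 的修正量
           lambda h: [0.0, 0.5]]   # 专家 1 的修正量

out, routed = memoe_forward((0.9, 0.1), [1.0, 1.0], experts, anchors)
```

这样,需要相似知识的输入(与同一锚点更相似)总会命中同一个专家,而基座参数本身不被修改。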

[NLP-25] Auxiliary Knowledge-Induced Learning for Automatic Multi-Label Medical Document Classification
[NLP-25] 多标签医疗文档自动分类的辅助知识诱导学习

链接: https://arxiv.org/abs/2405.19084
作者: Xindi Wang,Robert E. Mercer,Frank Rudzicz
关键词: ICD codes, ICD, International Classification, authoritative medical classification, medical classification system
中文关键词: ICD代码、ICD、国际分类、权威医学分类、医学分类体系
类目: Computation and Language (cs.CL)
备注: Accepted to LREC-COLING 2024 – camera-ready version

点击查看摘要

Abstract:The International Classification of Diseases (ICD) is an authoritative medical classification system of different diseases and conditions for clinical and management purposes. ICD indexing assigns a subset of ICD codes to a medical record. Since human coding is labour-intensive and error-prone, many studies employ machine learning to automate the coding process. ICD coding is a challenging task, as it needs to assign multiple codes to each medical document from an extremely large hierarchically organized collection. In this paper, we propose a novel approach for ICD indexing that adopts three ideas: (1) we use a multi-level deep dilated residual convolution encoder to aggregate the information from the clinical notes and learn document representations across different lengths of the texts; (2) we formalize the task of ICD classification with auxiliary knowledge of the medical records, which incorporates not only the clinical texts but also different clinical code terminologies and drug prescriptions for better inferring the ICD codes; and (3) we introduce a graph convolutional network to leverage the co-occurrence patterns among ICD codes, aiming to enhance the quality of label representations. Experimental results show the proposed method achieves state-of-the-art performance on a number of measures.
摘要:国际疾病分类(ICD)是一个用于临床和管理目的、针对不同疾病和病症的权威医学分类系统。ICD 索引为病历分配 ICD 代码的一个子集。由于人工编码劳动密集且容易出错,许多研究采用机器学习来自动化编码过程。ICD 编码是一项具有挑战性的任务,因为它需要从规模极大的层次化组织的代码集合中为每份医疗文档分配多个代码。在本文中,我们提出了一种新的 ICD 索引方法,包含三个思想:(1)使用多级深度空洞残差卷积编码器聚合临床记录中的信息,并学习不同文本长度下的文档表示;(2)利用病历的辅助知识将 ICD 分类任务形式化,不仅纳入临床文本,还纳入不同的临床代码术语和药物处方,以更好地推断 ICD 代码;(3)引入图卷积网络来利用 ICD 代码之间的共现模式,旨在提高标签表示的质量。实验结果表明,所提方法在多项指标上达到了最先进的性能。

[NLP-26] Cephalo: Multi-Modal Vision-Language Models for Bio-Inspired Materials Analysis and Design
[NLP-26] Cephalo:用于生物启发材料分析和设计的多模式视觉语言模型

链接: https://arxiv.org/abs/2405.19076
作者: Markus J. Buehler
关键词: multimodal vision large, multi-agent AI frameworks, present Cephalo, vision large language, series of multimodal
中文关键词: 多模式视觉大型、多智能体人工智能框架、Present Cephalo、视觉大型语言、系列多模式
类目: Computer Vision and Pattern Recognition (cs.CV); Mesoscale and Nanoscale Physics (cond-mat.mes-hall); Materials Science (cond-mat.mtrl-sci); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We present Cephalo, a series of multimodal vision large language models (V-LLMs) designed for materials science applications, integrating visual and linguistic data for enhanced understanding and interaction within human-AI and multi-agent AI frameworks. A key innovation of Cephalo is its advanced dataset generation method, which employs a sophisticated algorithm to accurately detect and separate images and their corresponding textual descriptions from PDF documents, such as scientific papers. The method includes a careful refinement of image-text pairs through integrated vision and language processing, ensuring high-quality, contextually relevant, and well reasoned training data. Cephalo is trained on integrated image and text data extracted from thousands of scientific papers and science-focused Wikipedia pages, and demonstrates the ability to interpret complex visual scenes, generate precise language descriptions, and answer queries about images effectively. The combination of a vision encoder with an autoregressive transformer supports complex natural language understanding in an integrated model, which can be coupled with other generative methods to create an image-to-text-to-image or image-to-text-to-3D pipeline. To explore the development of larger models from smaller ones, we merge sets of layers that originate from different pre-trained source models. This hybrid approach allows us to leverage the domain-specific expertise and general conversational capabilities to harness the strengths of multiple models. We examine the models in diverse use cases that incorporate biological materials, fracture and engineering analysis, protein biophysics, and bio-inspired design based on insect behavior. Generative applications include bio-inspired designs, including pollen-inspired architected materials, as well as the synthesis of bio-inspired material microstructures from a photograph of a solar eclipse.
摘要:我们提出了 Cephalo,一系列面向材料科学应用的多模态视觉大语言模型(V-LLM),它整合视觉和语言数据,以增强人类-AI 以及多智能体 AI 框架内的理解与交互。Cephalo 的一项关键创新是其先进的数据集生成方法,该方法采用复杂的算法,从 PDF 文档(如科学论文)中准确检测并分离图像及其对应的文本描述。该方法通过集成的视觉与语言处理对图像-文本对进行仔细精炼,确保训练数据高质量、上下文相关且推理合理。Cephalo 在从数千篇科学论文和以科学为主题的维基百科页面中提取的图文数据上进行训练,能够解释复杂的视觉场景、生成准确的语言描述,并有效回答有关图像的问题。视觉编码器与自回归 Transformer 的组合使集成模型支持复杂的自然语言理解,并可与其他生成方法结合,构建图像到文本到图像或图像到文本到 3D 的流水线。为了探索由较小模型构建更大模型,我们合并了来自不同预训练源模型的层集合。这种混合方法使我们能够同时利用特定领域的专业知识和通用对话能力,发挥多个模型的优势。我们在多种用例中检验了这些模型,包括生物材料、断裂与工程分析、蛋白质生物物理学以及基于昆虫行为的仿生设计。生成式应用包括仿生设计(如受花粉启发的结构材料),以及根据一张日食照片合成仿生材料的微观结构。

[NLP-27] BLSP-KD: Bootstrapping Language-Speech Pre-training via Knowledge Distillation
[NLP-27] BLSP-KD:通过知识提炼的Bootstrapping语音预训练

链接: https://arxiv.org/abs/2405.19041
作者: Chen Wang,Minpeng Liao,Zhongqiang Huang,Jiajun Zhang
关键词: speech-text length mismatch, Knowledge Distillation, large language models, Bootstrapping Language-Speech Pretraining, optimizing alignment quality
中文关键词: 语音-文本长度不匹配、知识蒸馏、大型语言模型、Bootstrapping语法-语音预训练、优化对齐质量
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Recent end-to-end approaches have shown promise in extending large language models (LLMs) to speech inputs, but face limitations in directly assessing and optimizing alignment quality and fail to achieve fine-grained alignment due to speech-text length mismatch. We introduce BLSP-KD, a novel approach for Bootstrapping Language-Speech Pretraining via Knowledge Distillation, which addresses these limitations through two key techniques. First, it optimizes speech-text alignment by minimizing the divergence between the LLM's next-token prediction distributions for speech and text inputs using knowledge distillation. Second, it employs a continuous-integrate-and-fire strategy to segment speech into tokens that correspond one-to-one with text tokens, enabling fine-grained alignment. We also introduce Partial LoRA (PLoRA), a new adaptation method supporting LLM finetuning for speech inputs under knowledge distillation. Quantitative evaluation shows that BLSP-KD outperforms previous end-to-end baselines and cascaded systems with comparable scale of parameters, facilitating general instruction-following capabilities for LLMs with speech inputs. This approach provides new possibilities for extending LLMs to spoken language interactions.
摘要:最近的端到端方法在将大型语言模型(LLM)扩展到语音输入方面展现出良好前景,但在直接评估和优化对齐质量方面存在局限,且由于语音与文本长度不匹配而无法实现细粒度对齐。我们提出了 BLSP-KD,一种通过知识蒸馏进行语言-语音预训练自举的新方法,它通过两项关键技术解决上述局限。首先,它利用知识蒸馏最小化 LLM 对语音输入和文本输入的下一词元预测分布之间的差异,从而优化语音-文本对齐。其次,它采用连续整合-发放(continuous-integrate-and-fire)策略将语音切分为与文本词元一一对应的词元,从而实现细粒度对齐。我们还提出了部分 LoRA(PLoRA),一种支持在知识蒸馏下针对语音输入微调 LLM 的新适配方法。定量评估表明,在参数规模相当的情况下,BLSP-KD 优于此前的端到端基线和级联系统,增强了带语音输入的 LLM 的通用指令遵循能力。该方法为将 LLM 扩展到口语交互提供了新的可能性。
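下面用一个玩具例子示意逐词元知识蒸馏的目标:最小化文本输入与语音输入的下一词元分布之间的 KL 散度(分布数值为虚构,仅说明蒸馏应使语音侧分布逼近文本侧):

```python
import math

def kl_div(p, q):
    """KL(p || q):两个离散下一词元分布之间的差异。"""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p_text          = [0.7, 0.2, 0.1]     # 教师侧:文本输入的下一词元分布
p_speech_before = [0.3, 0.4, 0.3]     # 蒸馏前:语音输入的分布
p_speech_after  = [0.65, 0.22, 0.13]  # 蒸馏后:已明显靠近文本分布
```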

[NLP-28] DiveR-CT: Diversity-enhanced Red Teaming with Relaxing Constraints
[NLP-28] DiveR-CT:多元化增强的红色团队,具有轻松的约束

链接: https://arxiv.org/abs/2405.19026
作者: Andrew Zhao,Quentin Xu,Matthieu Lin,Shenzhi Wang,Yong-jin Liu,Zilong Zheng,Gao Huang
关键词: raising significant concerns, Recent advances, large language models, made them indispensable, raising significant
中文关键词: 引发重大担忧,最近的进步、大型语言模型使它们不可或缺,引发重大担忧
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Recent advances in large language models (LLMs) have made them indispensable, raising significant concerns over managing their safety. Automated red teaming offers a promising alternative to the labor-intensive and error-prone manual probing for vulnerabilities, providing more consistent and scalable safety evaluations. However, existing approaches often compromise diversity by focusing on maximizing attack success rate. Additionally, methods that decrease the cosine similarity from historical embeddings with semantic diversity rewards lead to novelty stagnation as history grows. To address these issues, we introduce DiveR-CT, which relaxes conventional constraints on the objective and semantic reward, granting greater freedom for the policy to enhance diversity. Our experiments demonstrate DiveR-CT’s marked superiority over baselines by 1) generating data that perform better in various diversity metrics across different attack success rate levels, 2) better-enhancing resiliency in blue team models through safety tuning based on collected data, 3) allowing dynamic control of objective weights for reliable and controllable attack success rates, and 4) reducing susceptibility to reward overoptimization. Project details and code can be found at this https URL.
摘要:大型语言模型(LLM)的最新进展使其变得不可或缺,也引发了对其安全管理的极大关注。自动化红队为劳动密集且容易出错的人工漏洞探测提供了一种有前景的替代方案,能够提供更一致、可扩展的安全评估。然而,现有方法往往因专注于最大化攻击成功率而牺牲多样性。此外,通过语义多样性奖励来降低与历史嵌入的余弦相似度的方法,会随着历史的增长导致新颖性停滞。为了解决这些问题,我们提出了 DiveR-CT,它放松了对目标和语义奖励的传统约束,为策略增强多样性提供了更大的自由度。我们的实验展示了 DiveR-CT 相对于基线的显著优势:1)在不同攻击成功率水平下,生成的数据在各种多样性指标上表现更好;2)基于收集的数据进行安全调优,能更好地增强蓝队模型的韧性;3)允许动态控制目标权重,以获得可靠且可控的攻击成功率;4)降低奖励过度优化的风险。项目详细信息和代码可在此 https URL 找到。

[NLP-29] Evaluating the External and Parametric Knowledge Fusion of Large Language Models
[NLP-29] 评估大型语言模型的外部知识和参数知识融合

链接: https://arxiv.org/abs/2405.19010
作者: Hao Zhang,Yuyang Zhang,Xiaoguang Li,Wenxuan Shi,Haonan Xu,Huanshuo Liu,Yasheng Wang,Lifeng Shang,Qun Liu,Yong Liu,Ruiming Tang
关键词: Integrating external knowledge, large language models, static parametric memory, Integrating external, parametric knowledge
中文关键词: 集成外部知识、大型语言模型、静态参数记忆、集成外部参数知识
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: 15 pages, 3 figures, 3 tables

点击查看摘要

Abstract:Integrating external knowledge into large language models (LLMs) presents a promising solution to overcome the limitations imposed by their antiquated and static parametric memory. Prior studies, however, have tended to over-reliance on external knowledge, underestimating the valuable contributions of an LLMs’ intrinsic parametric knowledge. The efficacy of LLMs in blending external and parametric knowledge remains largely unexplored, especially in cases where external knowledge is incomplete and necessitates supplementation by their parametric knowledge. We propose to deconstruct knowledge fusion into four distinct scenarios, offering the first thorough investigation of LLM behavior across each. We develop a systematic pipeline for data construction and knowledge infusion to simulate these fusion scenarios, facilitating a series of controlled experiments. Our investigation reveals that enhancing parametric knowledge within LLMs can significantly bolster their capability for knowledge integration. Nonetheless, we identify persistent challenges in memorizing and eliciting parametric knowledge, and determining parametric knowledge boundaries. Our findings aim to steer future explorations on harmonizing external and parametric knowledge within LLMs.
摘要:将外部知识集成到大型语言模型(LLM)中是一种很有前途的解决方案,可以克服其陈旧的静态参数记忆带来的限制。然而,以前的研究倾向于过度依赖外部知识,低估了LLM内在参数知识的宝贵贡献。LLM在融合外部知识和参数知识方面的有效性在很大程度上仍未得到探索,特别是在外部知识不完整且需要由其参数知识补充的情况下。我们建议将知识融合解构为四种不同的场景,首次对每种场景下的LLM行为进行彻底研究。我们为数据构建和知识注入开发了一个系统化的流程来模拟这些融合场景,从而支持一系列对照实验。我们的研究表明,增强LLM内部的参数知识可以显著提升其知识整合能力。尽管如此,我们发现在记忆和唤起参数知识以及确定参数知识边界方面仍存在持续的挑战。我们的发现旨在指导未来在LLM中协调外部知识和参数知识的探索。

[NLP-30] EasyAnimate: A High-Performance Long Video Generation Method based on Transformer Architecture
[NLP-30] EasyAnimate:基于Transformer架构的高性能长视频生成方法

链接: https://arxiv.org/abs/2405.18991
作者: Jiaqi Xu,Xinyi Zou,Kunzhe Huang,Yunkuo Chen,Bo Liu,MengLi Cheng,Xing Shi,Jun Huang
关键词: paper presents EasyAnimate, high-performance outcomes, paper presents, leverages the power, power of transformer
中文关键词: 论文介绍了EasyAnimate,高性能成果,论文介绍了,利用了Transformer的力量
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Multimedia (cs.MM)
备注: 6 pages, 5 figures

点击查看摘要

Abstract:This paper presents EasyAnimate, an advanced method for video generation that leverages the power of transformer architecture for high-performance outcomes. We have expanded the DiT framework originally designed for 2D image synthesis to accommodate the complexities of 3D video generation by incorporating a motion module block. It is used to capture temporal dynamics, thereby ensuring the production of consistent frames and seamless motion transitions. The motion module can be adapted to various DiT baseline methods to generate video with different styles. It can also generate videos with different frame rates and resolutions during both training and inference phases, suitable for both images and videos. Moreover, we introduce slice VAE, a novel approach to condense the temporal axis, facilitating the generation of long duration videos. Currently, EasyAnimate exhibits the proficiency to generate videos with 144 frames. We provide a holistic ecosystem for video production based on DiT, encompassing aspects such as data pre-processing, VAE training, DiT models training (both the baseline model and LoRA model), and end-to-end video inference. Code is available at: this https URL. We are continuously working to enhance the performance of our method.
摘要:本文介绍了EasyAnimate,这是一种先进的视频生成方法,它利用Transformer架构的强大能力来实现高性能结果。我们扩展了最初为2D图像合成设计的DiT框架,通过加入运动模块(motion module)来适应3D视频生成的复杂性。该模块用于捕获时间动态,从而确保生成一致的帧和无缝的运动过渡。运动模块可以适配各种DiT基线方法,以生成不同风格的视频。它还可以在训练和推理阶段生成不同帧率和分辨率的视频,同时适用于图像和视频。此外,我们还引入了切片VAE(slice VAE),这是一种压缩时间轴的新方法,便于生成长时长视频。目前,EasyAnimate已能够生成144帧的视频。我们为基于DiT的视频制作提供了一个完整的生态系统,涵盖数据预处理、VAE训练、DiT模型训练(包括基线模型和LoRA模型)以及端到端视频推理等方面。代码可在此https URL获得。我们正在持续努力提升该方法的性能。

[NLP-31] Encoding Hierarchical Schema via Concept Flow for Multifaceted Ideology Detection
[NLP-31] 通过概念流编码分层模式以实现多面意识形态检测

链接: https://arxiv.org/abs/2405.18974
作者: Songtao Liu,Bang Wang,Wei Xiang,Han Xu,Minghua Xu
关键词: Multifaceted ideology detection, aims to detect, detect the ideological, ideological leanings, leanings of texts
中文关键词: 多方面意识形态检测,旨在检测、检测意识形态、意识形态倾向、文本倾向
类目: Computation and Language (cs.CL)
备注: 13pages, 4 figures (Accepted to Findings of ACL 2024)

点击查看摘要

Abstract:Multifaceted ideology detection (MID) aims to detect the ideological leanings of texts towards multiple facets. Previous studies on ideology detection mainly focus on one generic facet and ignore label semantics and explanatory descriptions of ideologies, which are a kind of instructive information and reveal the specific concepts of ideologies. In this paper, we develop a novel concept semantics-enhanced framework for the MID task. Specifically, we propose a bidirectional iterative concept flow (BICo) method to encode multifaceted ideologies. BICo enables the concepts to flow across levels of the schema tree and enriches concept representations with multi-granularity semantics. Furthermore, we explore concept attentive matching and concept-guided contrastive learning strategies to guide the model to capture ideology features with the learned concept semantics. Extensive experiments on the benchmark dataset show that our approach achieves state-of-the-art performance in MID, including in the cross-topic scenario.
摘要:多方面意识形态检测(MID)旨在检测文本在多个方面上的意识形态倾向。以往关于意识形态检测的研究主要集中在单一通用方面,而忽略了意识形态的标签语义和解释性描述,这些信息具有指导性,能够揭示意识形态的具体概念。在本文中,我们为MID任务开发了一个新颖的概念语义增强框架。具体地说,我们提出了一种双向迭代概念流(BICo)方法来编码多方面的意识形态。BICo使概念能够在模式树的各个层级间流动,并用多粒度语义丰富概念表示。此外,我们还探索了概念注意匹配和概念引导的对比学习策略,以指导模型利用所学习的概念语义来捕捉意识形态特征。在基准数据集上的大量实验表明,我们的方法在MID任务上取得了最先进的性能,包括在跨主题场景中。

[NLP-32] Are You Sure? Rank Them Again: Repeated Ranking For Better Preference Datasets
[NLP-32] 你确定吗?再次对它们进行排名:重复排名以获得更好的偏好数据集

链接: https://arxiv.org/abs/2405.18952
作者: Peter Devine
关键词: Reinforcement Learning, aligns model outputs, Training Large Language, Large Language Models, Training Large
中文关键词: 强化学习,对齐模型输出,训练大型语言,大型语言模型,训练大型
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Training Large Language Models (LLMs) with Reinforcement Learning from AI Feedback (RLAIF) aligns model outputs more closely with human preferences. This involves an evaluator model ranking multiple candidate responses to user prompts. However, the rankings from popular evaluator models such as GPT-4 can be inconsistent. We propose the Repeat Ranking method - where we evaluate the same responses multiple times and train only on those responses which are consistently ranked. Using 2,714 prompts in 62 languages, we generated responses from 7 top multilingual LLMs and had GPT-4 rank them five times each. Evaluating on MT-Bench chat benchmarks in six languages, our method outperformed the standard practice of training on all available prompts. Our work highlights the quality versus quantity trade-off in RLAIF dataset generation and offers a stackable strategy for enhancing dataset and thus model quality.
摘要:通过人工智能反馈强化学习(RLAIF)训练大型语言模型(LLM),可使模型输出与人类偏好更紧密地保持一致。这需要一个评估模型对用户提示的多个候选响应进行排名。然而,GPT-4等流行评估模型给出的排名可能不一致。我们提出了重复排名(Repeat Ranking)方法:多次评估相同的响应,并仅在排名始终一致的响应上进行训练。我们使用62种语言的2,714个提示,从7个顶级多语言LLM生成回复,并让GPT-4对它们各进行五次排名。在六种语言的MT-Bench聊天基准上进行评估,我们的方法优于在所有可用提示上训练的标准做法。我们的工作强调了RLAIF数据集生成中质量与数量的权衡,并提供了一种可叠加的策略来提升数据集质量,进而提升模型质量。
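上面的重复排名思想可以用一个极简草图说明:对同一组候选响应跑多轮排名,只保留每一轮名次都一致的响应(仅为示意,函数名与数据均为虚构,并非论文代码):

```python
def consistently_ranked(rankings):
    """只保留在每一轮排名中名次完全一致的响应。

    rankings 是多轮排名的列表,每一轮是 {响应id: 名次} 的字典。
    示意性实现,对应论文中只在 GPT-4 五次排名一致的响应上训练的思路。
    """
    kept = []
    for rid in rankings[0]:
        ranks = [one_pass[rid] for one_pass in rankings]
        if len(set(ranks)) == 1:  # 每一轮名次相同
            kept.append((rid, ranks[0]))
    return kept

# 模拟同一提示下三个候选响应的五轮排名
passes = [
    {"a": 1, "b": 2, "c": 3},
    {"a": 1, "b": 3, "c": 2},
    {"a": 1, "b": 2, "c": 3},
    {"a": 1, "b": 2, "c": 3},
    {"a": 1, "b": 3, "c": 2},
]
print(consistently_ranked(passes))  # 只有 "a" 的名次始终一致
```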

[NLP-33] Kestrel: Point Grounding Multimodal LLM for Part-Aware 3D Vision-Language Understanding
[NLP-33] Kestrel:点接地多模式LLM,用于部分感知3D视觉语言理解

链接: https://arxiv.org/abs/2405.18937
作者: Junjie Fei,Mahmoud Ahmed,Jian Ding,Eslam Mohamed Bakr,Mohamed Elhoseiny
关键词: achieved significant progress, part level, segmentation grounding, Point Grounded Captioning, Part-Aware Point
中文关键词: 取得了重大进展、零件级别、细分基础、点接地字幕、零件感知点
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:While 3D MLLMs have achieved significant progress, they are restricted to object and scene understanding and struggle to understand 3D spatial structures at the part level. In this paper, we introduce Kestrel, representing a novel approach that empowers 3D MLLMs with part-aware understanding, enabling better interpretation and segmentation grounding of 3D objects at the part level. Despite its significance, the current landscape lacks tasks and datasets that endow and assess this capability. Therefore, we propose two novel tasks: (1) Part-Aware Point Grounding, the model is tasked with directly predicting a part-level segmentation mask based on user instructions, and (2) Part-Aware Point Grounded Captioning, the model provides a detailed caption that includes part-level descriptions and their corresponding masks. To support learning and evaluating for these tasks, we introduce 3DCoMPaT Grounded Instructions Dataset (3DCoMPaT-GRIN). 3DCoMPaT-GRIN Vanilla, comprising 789k part-aware point cloud-instruction-segmentation mask triplets, is used to evaluate MLLMs’ ability of part-aware segmentation grounding. 3DCoMPaT-GRIN Grounded Caption, containing 107k part-aware point cloud-instruction-grounded caption triplets, assesses both MLLMs’ part-aware language comprehension and segmentation grounding capabilities. Our introduced tasks, dataset, and Kestrel represent a preliminary effort to bridge the gap between human cognition and 3D MLLMs, i.e., the ability to perceive and engage with the environment at both global and part levels. Extensive experiments on the 3DCoMPaT-GRIN show that Kestrel can generate user-specified segmentation masks, a capability not present in any existing 3D MLLM. Kestrel thus established a benchmark for evaluating the part-aware language comprehension and segmentation grounding of 3D objects. Project page at this https URL
摘要:虽然3D MLLM已经取得了重大进展,但它们仍局限于对物体和场景的理解,难以在部件层面理解3D空间结构。在本文中,我们介绍了Kestrel,这是一种使3D MLLM具备部件感知理解能力的新方法,能够在部件层面更好地解释和定位分割3D对象。尽管意义重大,但目前的研究领域缺乏赋予和评估这一能力的任务和数据集。因此,我们提出了两个新任务:(1)部件感知点定位(Part-Aware Point Grounding),模型需要根据用户指令直接预测部件级分割掩码;(2)部件感知点定位描述(Part-Aware Point Grounded Captioning),模型提供包含部件级描述及其对应掩码的详细描述。为了支持这些任务的学习和评估,我们引入了3DCoMPaT定位指令数据集(3DCoMPaT-GRIN)。3DCoMPaT-GRIN Vanilla包括789K个部件感知的点云-指令-分割掩码三元组,用于评估MLLM的部件感知分割定位能力。3DCoMPaT-GRIN Grounded Caption包含107K个部件感知的点云-指令-定位描述三元组,同时评估MLLM的部件感知语言理解和分割定位能力。我们引入的任务、数据集和Kestrel代表了弥合人类认知与3D MLLM之间差距的初步努力,即在全局和部件层面感知并与环境交互的能力。在3DCoMPaT-GRIN上的大量实验表明,Kestrel可以生成用户指定的分割掩码,这是任何现有3D MLLM都不具备的能力。Kestrel由此为评估3D对象的部件感知语言理解和分割定位建立了基准。项目页面见此https URL

[NLP-34] Understanding and Addressing the Under-Translation Problem from the Perspective of Decoding Objective
[NLP-34] 从解码目标的角度理解和解决翻译不足问题

链接: https://arxiv.org/abs/2405.18922
作者: Chenze Shao,Fandong Meng,Jiali Zeng,Jie Zhou
关键词: Neural Machine Translation, Neural Machine, made remarkable progress, Machine Translation, past years
中文关键词: 神经机器翻译,神经机器,取得显着进展,机器翻译,过去几年
类目: Computation and Language (cs.CL)
备注: ACL 2024 main conference

点击查看摘要

Abstract:Neural Machine Translation (NMT) has made remarkable progress over the past years. However, under-translation and over-translation remain two challenging problems in state-of-the-art NMT systems. In this work, we conduct an in-depth analysis on the underlying cause of under-translation in NMT, providing an explanation from the perspective of decoding objective. To optimize the beam search objective, the model tends to overlook words it is less confident about, leading to the under-translation phenomenon. Correspondingly, the model’s confidence in predicting the End Of Sentence (EOS) diminishes when under-translation occurs, serving as a mild penalty for under-translated candidates. Building upon this analysis, we propose employing the confidence of predicting EOS as a detector for under-translation, and strengthening the confidence-based penalty to penalize candidates with a high risk of under-translation. Experiments on both synthetic and real-world data show that our method can accurately detect and rectify under-translated outputs, with minor impact on other correct translations.
摘要:神经机器翻译(NMT)在过去的几年中取得了显著的进步。然而,在最先进的神经机器翻译系统中,欠翻译和过度翻译仍然是两个具有挑战性的问题。在这项工作中,我们对神经机器翻译中欠翻译的深层原因进行了深入分析,并从解码目标的角度给出了解释。为了优化束搜索目标,模型倾向于忽略其信心较低的词,从而导致欠翻译现象。相应地,当出现欠翻译时,模型预测句尾符(EOS)的置信度会降低,这对欠翻译的候选译文起到了轻微的惩罚作用。基于这一分析,我们建议使用预测EOS的置信度作为欠翻译的检测器,并加强基于置信度的惩罚,以惩罚欠翻译风险较高的候选译文。在合成数据和真实数据上的实验表明,我们的方法能够准确地检测和纠正欠翻译的输出,而对其他正确翻译的影响很小。
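摘要中用预测 EOS 的置信度惩罚欠翻译候选的思路,可以用一个极简的重打分草图来说明(alpha 与 threshold 为虚构的示意超参数,并非论文设定):

```python
def penalized_score(logprob_sum, eos_prob, alpha=2.0, threshold=0.5):
    """按 EOS 置信度对已结束的束搜索候选重新打分。

    若模型给出的 EOS 概率偏低,说明该候选很可能过早停止(欠翻译),
    因此减去一个基于置信度的惩罚项。alpha 与 threshold 为示意超参数。
    """
    penalty = alpha * max(0.0, threshold - eos_prob)
    return logprob_sum - penalty

confident = penalized_score(-3.2, eos_prob=0.9)  # EOS 置信度高,不受惩罚
doubtful = penalized_score(-3.0, eos_prob=0.1)   # 对数似然更高,但因疑似欠翻译被罚到其后
print(confident, doubtful)
```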

[NLP-35] Towards Faithful Chain-of-Thought: Large Language Models are Bridging Reasoners
[NLP-35] 迈向忠实的思维链:大型语言模型是桥接推理者

链接: https://arxiv.org/abs/2405.18915
作者: Jiachun Li,Pengfei Cao,Yubo Chen,Kang Liu,Jun Zhao
关键词: Large language models, Large language, language models, CoT, Large
中文关键词: 大型语言模型,大型语言,语言模型,CoT,大型
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 25 pages, under review

点击查看摘要

Abstract:Large language models (LLMs) suffer from serious unfaithful chain-of-thought (CoT) issues. Previous work attempts to measure and explain it but lacks in-depth analysis within CoTs and does not consider the interactions among all reasoning components jointly. In this paper, we first study the CoT faithfulness issue at the granularity of CoT steps, identify two reasoning paradigms: centralized reasoning and distributed reasoning, and find their relationship with faithfulness. Subsequently, we conduct a joint analysis of the causal relevance among the context, CoT, and answer during reasoning. The result proves that, when the LLM predicts answers, it can recall correct information missing in the CoT from the context, leading to unfaithfulness issues. Finally, we propose the inferential bridging method to mitigate this issue, in which we use the attribution method to recall information as hints for CoT generation and filter out noisy CoTs based on their semantic consistency and attribution scores. Extensive experiments demonstrate that our approach effectively alleviates the unfaithful CoT problem.
摘要:大型语言模型(LLM)存在严重的思维链(CoT)不忠实问题。以前的工作试图度量和解释这一现象,但缺乏对CoT内部的深入分析,也没有联合考虑所有推理组件之间的相互作用。在本文中,我们首先在CoT步骤的粒度上研究CoT忠实性问题,识别出两种推理范式:集中式推理和分布式推理,并找出它们与忠实性的关系。随后,我们对推理过程中上下文、CoT和答案之间的因果相关性进行了联合分析。结果证明,当LLM预测答案时,它可以从上下文中回忆起CoT中缺失的正确信息,从而导致不忠实问题。最后,我们提出了推理桥接方法来缓解这一问题:使用归因方法召回信息作为CoT生成的提示,并根据语义一致性和归因得分过滤掉有噪声的CoT。大量实验表明,该方法有效地缓解了CoT不忠实问题。

[NLP-36] Language Generation with Strictly Proper Scoring Rules
[NLP-36] 具有严格正确的评分规则的语言生成

链接: https://arxiv.org/abs/2405.18906
作者: Chenze Shao,Fandong Meng,Yijin Liu,Jie Zhou
关键词: maximum likelihood estimation, logarithmic score, maximum likelihood, likelihood estimation, proper scoring rules
中文关键词: 最大似然估计、对数得分、最大似然、似然估计、适当的评分规则
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: ICML 2024

点击查看摘要

Abstract:Language generation based on maximum likelihood estimation (MLE) has become the fundamental approach for text generation. Maximum likelihood estimation is typically performed by minimizing the log-likelihood loss, also known as the logarithmic score in statistical decision theory. The logarithmic score is strictly proper in the sense that it encourages honest forecasts, where the expected score is maximized only when the model reports true probabilities. Although many strictly proper scoring rules exist, the logarithmic score is the only local scoring rule among them that depends exclusively on the probability of the observed sample, making it capable of handling the exponentially large sample space of natural text. In this work, we propose a straightforward strategy for adapting scoring rules to language generation, allowing for language modeling with any non-local scoring rules. Leveraging this strategy, we train language generation models using two classic strictly proper scoring rules, the Brier score and the Spherical score, as alternatives to the logarithmic score. Experimental results indicate that simply substituting the loss function, without adjusting other hyperparameters, can yield substantial improvements in model’s generation capabilities. Moreover, these improvements can scale up to large language models (LLMs) such as LLaMA-7B and LLaMA-13B. Source code: this https URL.
摘要:基于最大似然估计(MLE)的语言生成已成为文本生成的基本方法。最大似然估计通常通过最小化对数似然损失来实现,后者在统计决策理论中也称为对数得分。对数得分是严格真分(strictly proper)的,即它鼓励诚实的预报:只有当模型报告真实概率时,期望得分才会最大化。虽然存在许多严格真分的评分规则,但对数得分是其中唯一只依赖于观测样本概率的局部评分规则,使其能够处理自然文本指数级大的样本空间。在这项工作中,我们提出了一种将评分规则适配到语言生成的简单策略,允许使用任意非局部评分规则进行语言建模。利用这一策略,我们使用两个经典的严格真分评分规则–Brier得分和球形(Spherical)得分–作为对数得分的替代来训练语言生成模型。实验结果表明,只需替换损失函数而无需调整其他超参数,就可以显著提高模型的生成能力。此外,这些改进可以扩展到LLaMA-7B和LLaMA-13B等大型语言模型(LLM)。源代码:此https URL。
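上述严格真分(strictly proper)性质可以用几行代码做数值验证:下面的草图实现了摘要中提到的对数、Brier 和球形三种评分规则,并检验当真实分布为 q 时,诚实预报 p=q 的期望得分高于偏离的预报(示意性实现,分布取值为虚构示例):

```python
import math

def log_score(p, y):
    """对数得分:log p[y],即负的交叉熵损失(越高越好)。"""
    return math.log(p[y])

def brier(p, y):
    """Brier 得分(越高越好):2*p[y] - sum_j p[j]^2,严格真分。"""
    return 2.0 * p[y] - sum(pj * pj for pj in p)

def spherical(p, y):
    """球形得分(越高越好):p[y] / ||p||_2,同样是严格真分。"""
    return p[y] / math.sqrt(sum(pj * pj for pj in p))

def expected_score(score, p, q):
    """当真实分布为 q 时,预报 p 的期望得分。"""
    return sum(qy * score(p, y) for y, qy in enumerate(q))

# 数值验证:诚实预报 p=q 的期望得分高于一个偏向均匀的预报
q = [0.7, 0.2, 0.1]        # 真实分布(虚构示例)
hedged = [0.5, 0.3, 0.2]   # 不诚实的保守预报
for score in (log_score, brier, spherical):
    assert expected_score(score, q, q) > expected_score(score, hedged, q)
print("all three scoring rules reward the honest forecast")
```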

[NLP-37] LLMs achieve adult human performance on higher-order theory of mind tasks
[NLP-37] LLM在高阶心理理论任务上达到成人水平的表现

链接: https://arxiv.org/abs/2405.18870
作者: Winnie Street,John Oliver Siy,Geoff Keeling,Adrien Baranes,Benjamin Barnett,Michael McKibben,Tatenda Kanyere,Alison Lentz,Blaise Aguera y Arcas,Robin I. M. Dunbar
关键词: theory of mind, large language models, recursive manner, examines the extent, large language
中文关键词: 心理理论、大语言模型、回归方式、检查程度、大语言
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:This paper examines the extent to which large language models (LLMs) have developed higher-order theory of mind (ToM); the human ability to reason about multiple mental and emotional states in a recursive manner (e.g. I think that you believe that she knows). This paper builds on prior work by introducing a handwritten test suite – Multi-Order Theory of Mind QA – and using it to compare the performance of five LLMs to a newly gathered adult human benchmark. We find that GPT-4 and Flan-PaLM reach adult-level and near adult-level performance on ToM tasks overall, and that GPT-4 exceeds adult performance on 6th order inferences. Our results suggest that there is an interplay between model size and finetuning for the realisation of ToM abilities, and that the best-performing LLMs have developed a generalised capacity for ToM. Given the role that higher-order ToM plays in a wide range of cooperative and competitive human behaviours, these findings have significant implications for user-facing LLM applications.
摘要:本文考察了大型语言模型(LLM)在多大程度上发展了高阶心理理论(ToM),即人类以递归方式推理多种心理和情绪状态的能力(例如,我认为你相信她知道)。本文在先前工作的基础上,引入了一个手写测试套件–多阶心理理论问答(Multi-Order Theory of Mind QA)–并用它来比较五个LLM与新收集的成人基准的表现。我们发现,GPT-4和Flan-PaLM在ToM任务上总体达到了成人水平或接近成人水平,而GPT-4在六阶推理上超过了成人表现。我们的结果表明,在ToM能力的实现上,模型规模和微调之间存在相互作用,并且表现最好的LLM已经发展出一种泛化的ToM能力。鉴于高阶ToM在人类广泛的合作和竞争行为中所起的作用,这些发现对面向用户的LLM应用具有重要意义。

[NLP-38] Simulation Modelling and Classification of Wiki Contributors: Spotting The Good The Bad and The Ugly
[NLP-38] 维基贡献者的模拟建模和分类:发现好的、坏的和丑陋的

链接: https://arxiv.org/abs/2405.18845
作者: Silvia García Méndez,Fátima Leal,Benedita Malheiro,Juan Carlos Burguillo Rial,Bruno Veloso,Adriana E. Chis,Horacio González Vélez
关键词: data acquisition process, highly relevant data, relevant data ranging, voluntary contributors feed, contributors feed platforms
中文关键词: 数据获取过程、高度相关的数据、相关的数据范围、自愿贡献者提要、贡献者提要平台
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Data crowdsourcing is a data acquisition process where groups of voluntary contributors feed platforms with highly relevant data ranging from news, comments, and media to knowledge and classifications. It typically processes user-generated data streams to provide and refine popular services such as wikis, collaborative maps, e-commerce sites, and social networks. Nevertheless, this modus operandi raises severe concerns regarding ill-intentioned data manipulation in adversarial environments. This paper presents a simulation, modelling, and classification approach to automatically identify human and non-human (bots) as well as benign and malign contributors by using data fabrication to balance classes within experimental data sets, data stream modelling to build and update contributor profiles and, finally, autonomic data stream classification. By employing WikiVoyage - a free worldwide wiki travel guide open to contribution from the general public - as a testbed, our approach proves to significantly boost the confidence and quality of the classifier by using a class-balanced data stream, comprising both real and synthetic data. Our empirical results show that the proposed method distinguishes between benign and malign bots as well as human contributors with a classification accuracy of up to 92 %.
摘要:数据众包是一种数据获取过程,由自愿贡献者群体向平台提供高度相关的数据,从新闻、评论和媒体到知识和分类。它通常处理用户生成的数据流,以提供和改进流行的服务,如维基、协作地图、电子商务网站和社交网络。然而,这种模式引发了人们对敌对环境中恶意数据操纵的严重担忧。本文提出了一种模拟、建模和分类方法,通过数据伪造来平衡实验数据集中的类别、通过数据流建模来建立和更新贡献者画像,以及最后的自主数据流分类,来自动识别人类和非人类(机器人)以及良性和恶意贡献者。通过使用WikiVoyage(一个向公众开放贡献的免费全球维基旅行指南)作为试验平台,我们的方法被证明通过使用包含真实数据和合成数据的类别平衡数据流,显著提高了分类器的置信度和质量。我们的实验结果表明,该方法能够区分良性和恶意机器人以及人类贡献者,分类准确率高达92%。

[NLP-39] Toxicity Detection for Free
[NLP-39] 免费毒性检测

链接: https://arxiv.org/abs/2405.18822
作者: Zhanhao Hu,Julien Piet,Geng Zhao,Jiantao Jiao,David Wagner
关键词: follow safety requirements, Current LLMs, refuse toxic prompts, toxic prompts, generally aligned
中文关键词: 遵循安全要求,当前的LLM,拒绝有毒提示,有毒提示,总体一致
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Current LLMs are generally aligned to follow safety requirements and tend to refuse toxic prompts. However, LLMs can fail to refuse toxic prompts or be overcautious and refuse benign examples. In addition, state-of-the-art toxicity detectors have low TPRs at low FPR, incurring high costs in real-world applications where toxic examples are rare. In this paper, we explore Moderation Using LLM Introspection (MULI), which detects toxic prompts using the information extracted directly from LLMs themselves. We found significant gaps between benign and toxic prompts in the distribution of alternative refusal responses and in the distribution of the first response token’s logits. These gaps can be used to detect toxicities: We show that a toy model based on the logits of specific starting tokens gets reliable performance, while requiring no training or additional computational cost. We build a more robust detector using a sparse logistic regression model on the first response token logits, which greatly exceeds SOTA detectors under multiple metrics.
摘要:当前的LLM通常经过对齐以遵循安全要求,并倾向于拒绝有毒提示。然而,LLM可能无法拒绝有毒提示,或者过于谨慎而拒绝良性样本。此外,最先进的毒性检测器在低FPR下的TPR较低,在有毒样本稀少的实际应用中会带来高昂的成本。在本文中,我们探索了基于LLM自省的内容审核方法(MULI),它使用直接从LLM自身提取的信息来检测有毒提示。我们发现,良性提示和有毒提示在替代拒绝回应的分布以及首个响应词元logit的分布上存在显著差距。这些差距可以用来检测毒性:我们证明了基于特定起始词元logit的玩具模型可以获得可靠的性能,且无需训练或额外的计算成本。我们在首个响应词元的logit上使用稀疏逻辑回归模型构建了一个更稳健的检测器,在多个指标下大大超过了SOTA检测器。
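摘要中在首个响应词元的 logit 上训练稀疏逻辑回归的检测思路,可以用如下纯 Python 草图示意(数据为虚构的玩具 logit,L1 次梯度下降也只是示意实现,并非论文代码):

```python
import math
import random

def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # 防止溢出
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(X, y, l1=0.01, lr=0.1, steps=500):
    """在首词元 logit 特征上做 L1 正则逻辑回归(次梯度下降,示意实现)。"""
    d, n = len(X[0]), len(X)
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        gw, gb = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                gw[j] += err * xi[j]
            gb += err
        for j in range(d):
            sub = (w[j] > 0) - (w[j] < 0)  # L1 次梯度:sign(w_j)
            w[j] -= lr * (gw[j] / n + l1 * sub)
        b -= lr * gb / n
    return w, b

# 虚构的玩具数据:第 0 维模拟某个拒绝词元(如 "Sorry")的 logit,
# 有毒提示下偏高、良性提示下偏低,其余维为噪声
random.seed(0)
toxic = [[random.gauss(3, 1)] + [random.gauss(0, 0.3) for _ in range(3)] for _ in range(50)]
benign = [[random.gauss(-3, 1)] + [random.gauss(0, 0.3) for _ in range(3)] for _ in range(50)]
X, y = toxic + benign, [1] * 50 + [0] * 50

w, b = train_logreg(X, y)
acc = sum(
    (sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) > 0.5) == bool(t)
    for x, t in zip(X, y)
) / len(y)
print(acc)  # 数据高度可分,训练集准确率应接近 1.0
```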

[NLP-40] LMO-DP: Optimizing the Randomization Mechanism for Differentially Private Fine-Tuning (Large) Language Models
[NLP-40] LMO-DP:优化差异专用微调(大型)语言模型的随机化机制

链接: https://arxiv.org/abs/2405.18776
作者: Qin Yang,Meisam Mohammad,Han Wang,Ali Payani,Ashish Kundu,Kai Shu,Yan Yan,Yuan Hong
关键词: Differentially Private Stochastic, Stochastic Gradient Descent, Private Stochastic Gradient, Differentially Private, Private Stochastic
中文关键词: 差异私人随机、随机梯度下降、私人随机梯度、差异私人、私人随机
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 18 pages, 15 figures

点击查看摘要

Abstract:Differentially Private Stochastic Gradient Descent (DP-SGD) and its variants have been proposed to ensure rigorous privacy for fine-tuning large-scale pre-trained language models. However, they rely heavily on the Gaussian mechanism, which may overly perturb the gradients and degrade the accuracy, especially in stronger privacy regimes (e.g., the privacy budget \epsilon < 3 ). To address such limitations, we propose a novel Language Model-based Optimal Differential Privacy (LMO-DP) mechanism, which takes the first step to enable the tight composition of accurately fine-tuning (large) language models with a sub-optimal DP mechanism, even in strong privacy regimes (e.g., 0.1 \leq \epsilon < 3 ). Furthermore, we propose a novel offline optimal noise search method to efficiently derive the sub-optimal DP that significantly reduces the noise magnitude. For instance, fine-tuning RoBERTa-large (with 300M parameters) on the SST-2 dataset can achieve an accuracy of 92.20% (given \epsilon=0.3 , \delta=10^{-10} ) by drastically outperforming the Gaussian mechanism (e.g., \sim 50% for small \epsilon and \delta ). We also draw similar findings on the text generation tasks on GPT-2. Finally, to our best knowledge, LMO-DP is also the first solution to accurately fine-tune Llama-2 with strong differential privacy guarantees. The code will be released soon and available upon request.
摘要:差分隐私随机梯度下降(DP-SGD)及其变体已被提出,以确保对大规模预训练语言模型微调的严格隐私。然而,它们严重依赖高斯机制,这可能会过度扰动梯度并降低准确性,特别是在更强的隐私制度下(例如,隐私预算 \epsilon < 3)。为了解决这些局限性,我们提出了一种新颖的基于语言模型的最优差分隐私机制(LMO-DP),它首次实现了用次优DP机制精确微调(大型)语言模型的紧致组合,即使在强隐私制度下(例如,0.1 \leq \epsilon < 3)也是如此。此外,我们提出了一种新颖的离线最优噪声搜索方法,以高效地得到显著降低噪声幅度的次优DP。例如,在SST-2数据集上微调RoBERTa-large(3亿参数)可以达到92.20%的准确率(给定 \epsilon=0.3,\delta=10^{-10}),大幅优于高斯机制(例如,在较小的 \epsilon 和 \delta 下约为50%)。我们在GPT-2的文本生成任务上也得出了类似的结果。最后,据我们所知,LMO-DP也是第一个在强差分隐私保证下精确微调Llama-2的解决方案。代码即将发布,并可应要求提供。

[NLP-41] Musical Phrase Segmentation via Grammatical Induction
[NLP-41] 通过语法归纳进行音乐短语分段

链接: https://arxiv.org/abs/2405.18742
作者: Reed Perkins,Dan Ventura
关键词: musical phrase segmentation, grammatical induction algorithms, grammatical induction, outline a solution, phrase segmentation
中文关键词: 音乐短语分割、语法归纳算法、语法归纳、概述解决方案、短语分割
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Extended version of a paper appearing in the proceedings of IJCAI 2024 that includes additional material in an appendix. Please cite the IJCAI version

点击查看摘要

Abstract:We outline a solution to the challenge of musical phrase segmentation that uses grammatical induction algorithms, a class of algorithms which infer a context-free grammar from an input sequence. We analyze the performance of five grammatical induction algorithms on three datasets using various musical viewpoint combinations. Our experiments show that the LONGESTFIRST algorithm achieves the best F1 scores across all three datasets and that input encodings that include the duration viewpoint result in the best performance.
摘要:我们概述了一种使用语法归纳算法应对音乐短语切分挑战的解决方案,这是一类从输入序列中推断上下文无关文法(context-free grammar)的算法。我们使用各种音乐视角(viewpoint)组合,在三个数据集上分析了五种语法归纳算法的性能。实验表明,LONGESTFIRST算法在所有三个数据集上都取得了最佳的F1分数,并且包含时值(duration)视角的输入编码能带来最佳性能。
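作为示意,下面用一个字符级玩具实现展示"最长优先"式的语法归纳:反复把最长的重复子串替换为新的非终结符(真实的 LONGESTFIRST 等算法作用于音乐视角序列并处理重叠等更多细节,此处函数与符号命名均为虚构):

```python
def longest_repeat(s):
    """返回 s 中出现至少两次的最长子串(朴素 O(n^3) 扫描)。"""
    n = len(s)
    for length in range(n - 1, 1, -1):
        for i in range(n - length + 1):
            sub = s[i:i + length]
            if s.find(sub, i + 1) != -1:
                return sub
    return ""

def induce_grammar(seq):
    """用最长优先策略做语法归纳的草图:反复把最长重复子串替换为新非终结符。

    非终结符从 "A" 起依次分配(假设输入不含大写字母),
    归纳出的规则集即一个小型上下文无关文法。
    """
    rules, nt = {}, 0
    while True:
        rep = longest_repeat(seq)
        if len(rep) < 2:
            break
        sym = chr(ord("A") + nt)
        rules[sym] = rep
        seq = seq.replace(rep, sym)
        nt += 1
    return seq, rules

print(induce_grammar("abcxabcyabc"))  # 重复短语 "abc" 被归纳为规则 A
```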

[NLP-42] Genshin: General Shield for Natural Language Processing with Large Language Models
[NLP-42] Genshin:基于大型语言模型的自然语言处理通用盾牌

链接: https://arxiv.org/abs/2405.18741
作者: Xiao Peng,Tao Liu,Ying Wang
关键词: Large language models, demonstrating considerable advancement, Large language, trending recently, Natural Language Processing
中文关键词: 大型语言模型,展示了相当大的进步,大型语言,最近的趋势,自然语言处理
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) like ChatGPT, Gemini, or LLaMA have been trending recently, demonstrating considerable advancement and generalizability power in countless domains. However, LLMs create an even bigger black box exacerbating opacity, with interpretability limited to few approaches. The uncertainty and opacity embedded in LLMs’ nature restrict their application in high-stakes domains like financial fraud, phishing, etc. Current approaches mainly rely on traditional textual classification with posterior interpretable algorithms, suffering from attackers who may create versatile adversarial samples to break the system’s defense, forcing users to make trade-offs between efficiency and robustness. To address this issue, we propose a novel cascading framework called Genshin (General Shield for Natural Language Processing with Large Language Models), utilizing LLMs as defensive one-time plug-ins. Unlike most applications of LLMs that try to transform text into something new or structural, Genshin uses LLMs to recover text to its original state. Genshin aims to combine the generalizability of the LLM, the discrimination of the median model, and the interpretability of the simple model. Our experiments on the task of sentimental analysis and spam detection have shown fatal flaws of the current median models and exhilarating results on LLMs’ recovery ability, demonstrating that Genshin is both effective and efficient. In our ablation study, we unearth several intriguing observations. Utilizing the LLM defender, a tool derived from the 4th paradigm, we have reproduced BERT’s 15% optimal mask rate results in the 3rd paradigm of NLP. Additionally, when employing the LLM as a potential adversarial tool, attackers are capable of executing effective attacks that are nearly semantically lossless.
摘要:像ChatGPT、Gemini或LLaMA这样的大型语言模型(LLM)最近已经成为一种趋势,在无数领域显示出相当大的先进性和泛化能力。然而,LLM构成了一个更大的黑箱,加剧了不透明性,可解释性仅限于少数方法。LLM本质上的不确定性和不透明性限制了它们在金融欺诈、网络钓鱼等高风险领域的应用。目前的方法主要依赖于传统的文本分类和后验可解释算法,攻击者可能会构造多样的对抗样本来破坏系统防御,迫使用户在效率和鲁棒性之间做出权衡。为了解决这个问题,我们提出了一种新颖的级联框架Genshin(General Shield for Natural Language Processing with Large Language Models),利用LLM作为防御性的一次性插件。与大多数试图把文本转换为新内容或结构化内容的LLM应用不同,Genshin使用LLM将文本恢复到其原始状态。Genshin的目标是把LLM的泛化能力、中间模型的判别能力和简单模型的可解释性结合起来。我们在情感分析和垃圾信息检测任务上的实验表明,现有中间模型存在致命缺陷,而LLM的恢复能力带来了令人振奋的结果,证明Genshin既有效又高效。在我们的消融研究中,我们发现了几个有趣的观察结果。利用源自第四范式的工具LLM Defender,我们在NLP的第三范式中复现了BERT的15%最优掩蔽率结果。此外,当把LLM用作潜在的对抗工具时,攻击者能够执行几乎在语义上无损的有效攻击。

[NLP-43] Reverse Image Retrieval Cues Parametric Memory in Multimodal LLMs
[NLP-43] 反向图像检索激发多模态LLM中的参数记忆

链接: https://arxiv.org/abs/2405.18740
作者: Jialiang Xu,Michael Moor,Jure Leskovec
关键词: recent multimodal large, multimodal large language, Reverse Image Retrieval, large language models, suite still struggle
中文关键词: 最近的多模式大型语言、多模式大型语言、反向图像检索、大型语言模型、套件仍在苦苦挣扎
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Despite impressive advances in recent multimodal large language models (MLLMs), state-of-the-art models such as from the GPT-4 suite still struggle with knowledge-intensive tasks. To address this, we consider Reverse Image Retrieval (RIR) augmented generation, a simple yet effective strategy to augment MLLMs with web-scale reverse image search results. RIR robustly improves knowledge-intensive visual question answering (VQA) of GPT-4V by 37-43%, GPT-4 Turbo by 25-27%, and GPT-4o by 18-20% in terms of open-ended VQA evaluation metrics. To our surprise, we discover that RIR helps the model to better access its own world knowledge. Concretely, our experiments suggest that RIR augmentation helps by providing further visual and textual cues without necessarily containing the direct answer to a query. In addition, we elucidate cases in which RIR can hurt performance and conduct a human evaluation. Finally, we find that the overall advantage of using RIR makes it difficult for an agent that can choose to use RIR to perform better than an approach where RIR is the default setting.
摘要:尽管最近的多模态大型语言模型(MLLM)取得了令人印象深刻的进展,但GPT-4套件等最先进的模型在知识密集型任务上仍然表现不佳。为了解决这个问题,我们考虑了反向图像检索(RIR)增强生成,这是一种简单而有效的策略,用网络规模的反向图像搜索结果来增强MLLM。在开放式VQA评估指标上,RIR将GPT-4V的知识密集型视觉问答(VQA)提高了37%-43%,GPT-4 Turbo提高了25%-27%,GPT-4o提高了18%-20%。令我们惊讶的是,我们发现RIR帮助模型更好地访问其自身的世界知识。具体地说,我们的实验表明,RIR增强通过提供进一步的视觉和文本线索来提供帮助,而不一定包含查询的直接答案。此外,我们还阐述了RIR可能损害性能的情况,并进行了人工评估。最后,我们发现,使用RIR的总体优势使得可以自主选择是否使用RIR的智能体难以超越将RIR作为默认设置的方案。

[NLP-44] CtrlA: Adaptive Retrieval-Augmented Generation via Probe-Guided Control
[NLP-44] CtrlA:通过探针引导控制的自适应检索增强生成

链接: https://arxiv.org/abs/2405.18727
作者: Huanshuo Liu,Hao Zhang,Zhijiang Guo,Kuicai Dong,Xiangyang Li,Yi Quan Lee,Cong Zhang,Yong Liu
关键词: large language models, Adaptive RAG, adaptive RAG methods, existing adaptive RAG, Retrieval-augmented generation
中文关键词: 大型语言模型、自适应RAG、自适应RAG方法、现有自适应RAG、检索增强生成
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: 28 pages, 7 figures, 9 tables

点击查看摘要

Abstract:Retrieval-augmented generation (RAG) has emerged as a promising solution for mitigating hallucinations of large language models (LLMs) with retrieved external knowledge. Adaptive RAG enhances this approach by dynamically assessing the retrieval necessity, aiming to balance external and internal knowledge usage. However, existing adaptive RAG methods primarily realize retrieval on demand by relying on superficially verbalize-based or probability-based feedback of LLMs, or directly fine-tuning LLMs via carefully crafted datasets, resulting in unreliable retrieval necessity decisions, heavy extra costs, and sub-optimal response generation. We present the first attempts to delve into the internal states of LLMs to mitigate such issues by introducing an effective probe-guided adaptive RAG framework, termed CtrlA. Specifically, CtrlA employs an honesty probe to regulate the LLM’s behavior by manipulating its representations for increased honesty, and a confidence probe to monitor the internal states of LLM and assess confidence levels, determining the retrieval necessity during generation. Experiments show that CtrlA is superior to existing adaptive RAG methods on a diverse set of tasks, the honesty control can effectively make LLMs more honest and confidence monitoring is proven to be a promising indicator of retrieval trigger. Our codes are available at this https URL.
摘要:检索增强生成(RAG)是一种很有前途的解决方案,可利用检索到的外部知识来缓解大型语言模型(LLM)的幻觉。自适应RAG通过动态评估检索必要性来增强这种方法,旨在平衡外部和内部知识的使用。然而,现有的自适应RAG方法主要依靠LLM表面化的言语反馈或基于概率的反馈来实现按需检索,或者通过精心构造的数据集直接微调LLM,导致检索必要性决策不可靠、额外成本高昂以及响应生成次优。我们首次尝试深入LLM的内部状态来缓解这些问题,提出了一个有效的探针引导的自适应RAG框架,称为CtrlA。具体地说,CtrlA使用诚实探针,通过操纵LLM的内部表示来增加其诚实度、规范其行为;并使用置信度探针监控LLM的内部状态、评估置信水平,从而在生成过程中确定检索的必要性。实验表明,CtrlA在多种任务上优于现有的自适应RAG方法;诚实控制可以有效地使LLM更诚实,置信度监控也被证明是一种很有前途的检索触发指标。我们的代码可以在这个HTTPS URL上找到。
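
其中"置信度探针"一步可用如下极简数值示意来理解(这并非论文实现;探针参数 w、b 与阈值均为假设,仅演示"对隐状态做线性读出、低置信时触发检索"的思路):

```python
import numpy as np

def needs_retrieval(hidden_state, w, b, threshold=0.5):
    """线性置信度探针示意:对 LLM 某层的隐状态 h,
    用 sigmoid(w·h + b) 估计模型置信度;
    置信度低于阈值时返回 True,表示应触发外部检索。
    w、b 为假设的已训练探针参数。"""
    conf = 1.0 / (1.0 + np.exp(-(np.dot(w, hidden_state) + b)))
    return conf < threshold, conf
```

实际方法中探针是在 LLM 内部表示上训练得到的,这里只展示推理时的触发逻辑。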

[NLP-45] Correctable Landmark Discovery via Large Models for Vision-Language Navigation
[NLP-45] 通过视觉语言导航的大型模型发现可纠正的地标

链接: https://arxiv.org/abs/2405.18721
作者: Bingqian Lin,Yunshuang Nie,Ziming Wei,Yi Zhu,Hang Xu,Shikui Ma,Jianzhuang Liu,Xiaodan Liang
关键词: follow language instructions, target position, VLN, LaNdmark DiScOvery, follow language
中文关键词: 遵循语言说明、目标位置、VLN、LaNdmark DiScOvery、遵循语言
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted by TPAMI 2024

点击查看摘要

Abstract:Vision-Language Navigation (VLN) requires the agent to follow language instructions to reach a target position. A key factor for successful navigation is to align the landmarks implied in the instruction with diverse visual observations. However, previous VLN agents fail to perform accurate modality alignment especially in unexplored scenes, since they learn from limited navigation data and lack sufficient open-world alignment knowledge. In this work, we propose a new VLN paradigm, called COrrectable LaNdmark DiScOvery via Large ModEls (CONSOLE). In CONSOLE, we cast VLN as an open-world sequential landmark discovery problem, by introducing a novel correctable landmark discovery scheme based on two large models ChatGPT and CLIP. Specifically, we use ChatGPT to provide rich open-world landmark cooccurrence commonsense, and conduct CLIP-driven landmark discovery based on these commonsense priors. To mitigate the noise in the priors due to the lack of visual constraints, we introduce a learnable cooccurrence scoring module, which corrects the importance of each cooccurrence according to actual observations for accurate landmark discovery. We further design an observation enhancement strategy for an elegant combination of our framework with different VLN agents, where we utilize the corrected landmark features to obtain enhanced observation features for action decision. Extensive experimental results on multiple popular VLN benchmarks (R2R, REVERIE, R4R, RxR) show the significant superiority of CONSOLE over strong baselines. Especially, our CONSOLE establishes the new state-of-the-art results on R2R and R4R in unseen scenarios. Code is available at this https URL.
摘要:视觉-语言导航(VLN)要求智能体按照语言指令到达目标位置。成功导航的一个关键因素是将指令中隐含的地标与多样的视觉观察对齐。然而,以前的VLN智能体无法执行准确的模态对齐,特别是在未探索的场景中,因为它们从有限的导航数据中学习,且缺乏足够的开放世界对齐知识。在这项工作中,我们提出了一种新的VLN范式,称为基于大模型的可纠正地标发现(CONSOLE)。在CONSOLE中,我们基于ChatGPT和CLIP两个大模型引入了一种新颖的可纠正地标发现方案,将VLN建模为一个开放世界的顺序地标发现问题。具体地说,我们使用ChatGPT提供丰富的开放世界地标共现常识,并基于这些常识先验进行CLIP驱动的地标发现。为了减少由于缺乏视觉约束而导致的先验噪声,我们引入了一个可学习的共现评分模块,该模块根据实际观察校正每个共现的重要性,以便准确地发现地标。我们进一步设计了一种观察增强策略,将我们的框架与不同的VLN智能体优雅地结合,利用校正后的地标特征获得增强的观察特征用于动作决策。在多个流行的VLN基准(R2R、REVERIE、R4R、RxR)上的大量实验结果表明,CONSOLE显著优于强基线。特别是,我们的CONSOLE在未见场景中的R2R和R4R上取得了新的最先进结果。代码可在此HTTPS URL上找到。

[NLP-46] Contextual Position Encoding: Learning to Count Whats Important
[NLP-46] 上下文位置编码:学会计算重要的内容

链接: https://arxiv.org/abs/2405.18719
作者: Olga Golovneva,Tianlu Wang,Jason Weston,Sainbayar Sukhbaatar
关键词: Large Language Models, component of Large, Large Language, attention mechanism, critical component
中文关键词: 大型语言模型,大型组件,大型语言,注意力机制,关键组件
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The attention mechanism is a critical component of Large Language Models (LLMs) that allows tokens in a sequence to interact with each other, but is order-invariant. Incorporating position encoding (PE) makes it possible to address by position, such as attending to the i-th token. However, current PE methods use token counts to derive position, and thus cannot generalize to higher levels of abstraction, such as attending to the i-th sentence. In this paper, we propose a new position encoding method, Contextual Position Encoding (CoPE), that allows positions to be conditioned on context by incrementing position only on certain tokens determined by the model. This allows more general position addressing such as attending to the i -th particular word, noun, or sentence. We show that CoPE can solve the selective copy, counting and Flip-Flop tasks where popular position embeddings fail, and improves perplexity on language modeling and coding tasks.
摘要:注意力机制是大型语言模型(LLM)的关键组成部分,它允许序列中的token相互交互,但其本身是顺序无关的。引入位置编码(PE)使得按位置寻址成为可能,例如关注第i个token。然而,当前的PE方法使用token计数来推导位置,因此无法泛化到更高的抽象层次,例如关注第i个句子。本文中,我们提出了一种新的位置编码方法,即上下文位置编码(CoPE),该方法仅在模型选定的某些token上递增位置,从而使位置以上下文为条件。这允许更通用的位置寻址,例如关注第i个特定单词、名词或句子。我们表明,CoPE可以解决流行的位置嵌入方法无法完成的选择性复制、计数和Flip-Flop任务,并降低语言建模和编码任务上的困惑度。
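
"仅在某些token上递增位置"这一核心思想可用下面的数值示意来理解(非论文实现,门控与位置的具体定义做了简化假设):每个键的门控值由查询与键的内积经 sigmoid 得到,位置则是门控值的累加,因此只有被门控"选中"的token才会使上下文位置前进:

```python
import numpy as np

def cope_positions(q, K):
    """CoPE 式上下文位置的极简示意。
    q: 单个查询向量 (d,);K: 其之前所有键组成的矩阵 (n, d),按时间顺序排列。
    门控值 gate_j = sigmoid(q·k_j);
    第 j 个键相对当前查询的位置 p_j = sum_{k=j..n-1} gate_k,
    即位置只在门控值大的 token 上显著递增。"""
    gates = 1.0 / (1.0 + np.exp(-K @ q))        # (n,) 每个键的门控值
    positions = np.cumsum(gates[::-1])[::-1]    # 从 j 到序列末尾的反向累积和
    return gates, positions
```

若门控只对句号附近的token激活,累加得到的位置便近似"句子序号",这正是CoPE能按句子寻址的直观原因。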

[NLP-47] Efficient Model-agnostic Alignment via Bayesian Persuasion
[NLP-47] 通过Bayesian说服实现高效的模型不可知对齐

链接: https://arxiv.org/abs/2405.18718
作者: Fengshuo Bai,Mingzhi Wang,Zhaowei Zhang,Boyuan Chen,Yinda Xu,Ying Wen,Yaodong Yang
关键词: keeping LLMs consensus, large language models, keeping LLMs, LLMs consensus, human intent
中文关键词: 保持LLM共识、大型语言模型、保持LLM、LLM共识、人类意图
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:With recent advancements in large language models (LLMs), alignment has emerged as an effective technique for keeping LLMs consensus with human intent. Current methods primarily involve direct training through Supervised Fine-tuning (SFT) or Reinforcement Learning from Human Feedback (RLHF), both of which require substantial computational resources and extensive ground truth data. This paper explores an efficient method for aligning black-box large models using smaller models, introducing a model-agnostic and lightweight Bayesian Persuasion Alignment framework. We formalize this problem as an optimization of the signaling strategy from the small model’s perspective. In the persuasion process, the small model (Advisor) observes the information item (i.e., state) and persuades large models (Receiver) to elicit improved responses. The Receiver then generates a response based on the input, the signal from the Advisor, and its updated belief about the information item. Through training using our framework, we demonstrate that the Advisor can significantly enhance the performance of various Receivers across a range of tasks. We theoretically analyze our persuasion framework and provide an upper bound on the Advisor’s regret, confirming its effectiveness in learning the optimal signaling strategy. Our Empirical results demonstrates that GPT-2 can significantly improve the performance of various models, achieving an average enhancement of 16.1% in mathematical reasoning ability and 13.7% in code generation. We hope our work can provide an initial step toward rethinking the alignment framework from the Bayesian Persuasion perspective.
摘要:随着大语言模型(LLM)的最新进展,对齐已成为保持LLM与人类意图一致的有效技术。目前的方法主要通过监督微调(SFT)或基于人类反馈的强化学习(RLHF)进行直接训练,这两种方法都需要大量的计算资源和大量的真实标注数据。本文探索了一种使用较小模型对齐黑盒大模型的高效方法,引入了一个模型无关的轻量级贝叶斯说服对齐框架。我们从小模型的角度将这个问题形式化为信号传递策略的优化。在说服过程中,小模型(Advisor)观察信息项(即状态),并说服大模型(Receiver)给出更好的回答。然后,Receiver根据输入、来自Advisor的信号以及它对信息项的更新信念生成响应。通过使用我们的框架进行训练,我们证明了Advisor可以显著提高各种Receiver在一系列任务中的性能。我们从理论上分析了我们的说服框架,并给出了Advisor遗憾值(regret)的上界,证实了其在学习最优信号传递策略方面的有效性。我们的实验结果表明,GPT-2(作为Advisor)能够显著提高各种模型的性能,数学推理能力平均提高16.1%,代码生成能力平均提高13.7%。我们希望我们的工作能为从贝叶斯说服的角度重新思考对齐框架迈出第一步。

[NLP-48] Calibrating Reasoning in Language Models with Internal Consistency
[NLP-48] 具有内部一致性的语言模型中的推理校准

链接: https://arxiv.org/abs/2405.18711
作者: Zhihui Xie,Jizhou Guo,Tong Yu,Shuai Li
关键词: Large language models, demonstrated impressive capabilities, Large language, elicits verbalized reasoning, aided by techniques
中文关键词: 大型语言模型,表现出令人印象深刻的能力,大型语言,在技术的帮助下实现言语推理
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated impressive capabilities in various reasoning tasks, aided by techniques like chain-of-thought (CoT) prompting that elicits verbalized reasoning. However, LLMs often generate text with obvious mistakes and contradictions, raising doubts about their ability to robustly process and utilize generated rationales. In this work, we investigate CoT reasoning in LLMs through the lens of internal representations, focusing on how these representations are influenced by generated rationales. Our preliminary analysis reveals that while generated rationales improve answer accuracy, inconsistencies emerge between the model’s internal representations in middle layers and those in final layers, potentially undermining the reliability of their reasoning processes. To address this, we propose internal consistency as a measure of the model’s confidence by examining the agreement of latent predictions decoded from intermediate layers. Extensive empirical studies across different models and datasets demonstrate that internal consistency effectively distinguishes between correct and incorrect reasoning paths. Motivated by this, we propose a new approach to calibrate CoT reasoning by up-weighting reasoning paths with high internal consistency, resulting in a significant boost in reasoning performance. Further analysis uncovers distinct patterns in attention and feed-forward modules across layers, providing insights into the emergence of internal inconsistency. In summary, our results demonstrate the potential of using internal representations for self-evaluation of LLMs.
摘要:大型语言模型(LLM)在各种推理任务中表现出了令人印象深刻的能力,这得益于思维链(CoT)提示等能引出言语化推理的技术。然而,LLM生成的文本往往带有明显的错误和矛盾,令人怀疑它们能否稳健地处理和利用生成的推理依据(rationale)。在这项工作中,我们通过内部表示的视角研究LLM中的CoT推理,重点关注这些表示如何受到生成的推理依据的影响。我们的初步分析表明,尽管生成的推理依据提高了答案的准确性,但模型中间层与最终层的内部表示之间出现了不一致,潜在地削弱了其推理过程的可靠性。为了解决这个问题,我们提出以内部一致性作为模型置信度的度量,即检查从中间层解码出的潜在预测之间的一致程度。对不同模型和数据集的大量实证研究表明,内部一致性能有效区分正确和不正确的推理路径。受此启发,我们提出了一种新方法,通过对内部一致性高的推理路径加权来校准CoT推理,从而显著提升了推理性能。进一步的分析揭示了各层注意力模块和前馈模块中的不同模式,为内部不一致的产生提供了洞察。综上所述,我们的结果证明了利用内部表示对LLM进行自我评估的潜力。
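
"从中间层解码潜在预测并检查其一致性"可用如下示意来理解(非论文原方法;这里假设用 logit lens 式的线性解码得到各层的潜在预测,并以中间层与最终层 argmax 的一致比例作为内部一致性分数):

```python
import numpy as np

def internal_consistency(hidden_states, W_U):
    """内部一致性的极简示意。
    hidden_states: 各层对同一位置的隐状态向量列表(最后一项为最终层);
    W_U: 输出嵌入矩阵 (d, |V|),用于把隐状态解码为词表上的 logits。
    返回:中间层 argmax 预测与最终层 argmax 预测一致的比例。"""
    preds = [int(np.argmax(h @ W_U)) for h in hidden_states]
    final = preds[-1]
    return sum(p == final for p in preds[:-1]) / (len(preds) - 1)
```

分数越高,说明中间层的"潜在答案"与最终输出越一致,可作为该推理路径置信度的一个代理指标。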

[NLP-49] Efficient Preference-based Reinforcement Learning via Aligned Experience Estimation
[NLP-49] 通过一致经验估计进行高效的基于偏好的强化学习

链接: https://arxiv.org/abs/2405.18688
作者: Fengshuo Bai,Rui Zhao,Hongming Zhang,Sijia Cui,Ying Wen,Yaodong Yang,Bo Xu,Lei Han
关键词: Preference-based reinforcement learning, shown impressive capabilities, Preference-based reinforcement, shown impressive, impressive capabilities
中文关键词: 基于偏好的强化学习,表现出令人印象深刻的能力,基于偏好的强化,表现出令人印象深刻的能力
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Preference-based reinforcement learning (PbRL) has shown impressive capabilities in training agents without reward engineering. However, a notable limitation of PbRL is its dependency on substantial human feedback. This dependency stems from the learning loop, which entails accurate reward learning compounded with value/policy learning, necessitating a considerable number of samples. To boost the learning loop, we propose SEER, an efficient PbRL method that integrates label smoothing and policy regularization techniques. Label smoothing reduces overfitting of the reward model by smoothing human preference labels. Additionally, we bootstrap a conservative estimate $\widehat{Q}$ using well-supported state-action pairs from the current replay memory to mitigate overestimation bias and utilize it for policy learning regularization. Our experimental results across a variety of complex tasks, both in online and offline settings, demonstrate that our approach improves feedback efficiency, outperforming state-of-the-art methods by a large margin. Ablation studies further reveal that SEER achieves a more accurate Q-function compared to prior work.
摘要:基于偏好的强化学习(PbRL)在无需奖励工程的情况下训练智能体方面表现出了令人印象深刻的能力。然而,PbRL的一个显著局限是它依赖于大量的人类反馈。这种依赖源于学习循环:准确的奖励学习与价值/策略学习相叠加,需要相当数量的样本。为了提升学习循环的效率,我们提出了SEER,一种结合了标签平滑和策略正则化技术的高效PbRL方法。标签平滑通过平滑人类偏好标签来减少奖励模型的过拟合。此外,我们利用当前回放记忆中支撑充分的状态-动作对来自举一个保守估计 $\widehat{Q}$,以减轻高估偏差,并将其用于策略学习的正则化。在线和离线环境下各种复杂任务上的实验结果表明,我们的方法提高了反馈效率,大幅超过了最先进的方法。消融研究进一步表明,与以前的工作相比,SEER获得了更精确的Q函数。
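
其中对人类偏好标签做"标签平滑"的一步非常简单,可示意如下(eps 的取值为假设;标签 0/1 表示偏好片段A/片段B):

```python
def smooth_preference(label, eps=0.1):
    """偏好标签平滑示意:对人类偏好标签 label ∈ {0, 1},
    平滑后 y = label * (1 - eps) + eps / 2,
    把硬标签拉向 0.5,以缓解奖励模型对噪声偏好的过拟合。"""
    return label * (1.0 - eps) + eps / 2.0
```

例如 eps=0.1 时,标签 1 被平滑为 0.95,标签 0 被平滑为 0.05,奖励模型随后在平滑标签上做交叉熵训练。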

[NLP-50] Can GPT Redefine Medical Understanding? Evaluating GPT on Biomedical Machine Reading Comprehension
[NLP-50] GPT能否重新定义医学理解?生物医学机器阅读理解中GPT的评估

链接: https://arxiv.org/abs/2405.18682
作者: Shubham Vatsal,Ayush Singh
关键词: shown remarkable performance, Large language models, Large language, shown remarkable, closed-book biomedical MRC
中文关键词: 表现出出色的性能,大型语言模型,大型语言,表现出出色的、闭门生物医学MRC
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have shown remarkable performance on many tasks in different domains. However, their performance in closed-book biomedical machine reading comprehension (MRC) has not been evaluated in depth. In this work, we evaluate GPT on four closed-book biomedical MRC benchmarks. We experiment with different conventional prompting techniques as well as introduce our own novel prompting method. To solve some of the retrieval problems inherent to LLMs, we propose a prompting strategy named Implicit Retrieval Augmented Generation (RAG) that alleviates the need for using vector databases to retrieve important chunks in traditional RAG setups. Moreover, we report qualitative assessments on the natural language generation outputs from our approach. The results show that our new prompting technique is able to get the best performance in two out of four datasets and ranks second in rest of them. Experiments show that modern-day LLMs like GPT even in a zero-shot setting can outperform supervised models, leading to new state-of-the-art (SoTA) results on two of the benchmarks.
摘要:大型语言模型(LLM)在不同领域的许多任务中表现出了显著的性能。然而,它们在闭卷生物医学机器阅读理解(MRC)中的表现还没有得到深入的评估。在这项工作中,我们在四个闭卷生物医学MRC基准上评估GPT。我们对不同的常规提示技术进行了实验,并提出了我们自己的新提示方法。为了解决LLM固有的一些检索问题,我们提出了一种称为隐式检索增强生成(RAG)的提示策略,该策略减少了传统RAG设置中使用向量数据库检索重要文本块的需要。此外,我们报告了对我们方法的自然语言生成输出的定性评估。结果表明,我们的新提示技术在四个数据集中的两个上取得了最佳性能,在其余数据集上排名第二。实验表明,像GPT这样的现代LLM即使在零样本设置下也可以超越有监督模型,在其中两个基准上取得了新的最先进(SoTA)结果。

[NLP-51] LLM-based Hierarchical Concept Decomposition for Interpretable Fine-Grained Image Classification
[NLP-51] 基于LLM的分层概念分解用于可解释细粒度图像分类

链接: https://arxiv.org/abs/2405.18672
作者: Renyi Qu,Mark Yatskar
关键词: unstructured text outputs, Recent advancements, achieved competitive performance, large language models, advancements in interpretable
中文关键词: 非结构化文本输出、最近的进步、取得的竞争性能、大型语言模型、可解释性的进步
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent advancements in interpretable models for vision-language tasks have achieved competitive performance; however, their interpretability often suffers due to the reliance on unstructured text outputs from large language models (LLMs). This introduces randomness and compromises both transparency and reliability, which are essential for addressing safety issues in AI systems. We introduce \texttt{Hi-CoDe} (Hierarchical Concept Decomposition), a novel framework designed to enhance model interpretability through structured concept analysis. Our approach consists of two main components: (1) We use GPT-4 to decompose an input image into a structured hierarchy of visual concepts, thereby forming a visual concept tree. (2) We then employ an ensemble of simple linear classifiers that operate on concept-specific features derived from CLIP to perform classification. Our approach not only aligns with the performance of state-of-the-art models but also advances transparency by providing clear insights into the decision-making process and highlighting the importance of various concepts. This allows for a detailed analysis of potential failure modes and improves model compactness, therefore setting a new benchmark in interpretability without compromising the accuracy.
摘要:近年来,视觉语言任务的可解释模型取得了有竞争力的性能;然而,由于依赖大型语言模型(LLM)的非结构化文本输出,它们的可解释性经常受到影响。这引入了随机性,并损害了透明度和可靠性,而这两者对解决AI系统中的安全问题至关重要。我们提出了Hi-CoDe(层次概念分解),一个旨在通过结构化概念分析增强模型可解释性的新框架。我们的方法由两个主要部分组成:(1)使用GPT-4将输入图像分解成结构化的视觉概念层次,从而形成视觉概念树;(2)使用一组简单线性分类器的集成,基于来自CLIP的概念特征进行分类。我们的方法不仅与最先进模型的性能相当,还通过提供对决策过程的清晰洞察并突出各个概念的重要性来提高透明度。这允许对潜在失败模式进行详细分析,并提高模型的紧凑性,从而在不牺牲准确性的情况下树立了新的可解释性基准。
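
其中"简单线性分类器集成"一步大致可示意如下(特征与分类器参数均为假设的占位,非论文实现;实际特征来自CLIP对概念树节点的编码):

```python
import numpy as np

def ensemble_predict(concept_feats, classifiers):
    """可解释分类的极简示意。
    concept_feats: 由概念树得到的概念特征向量 (d,);
    classifiers: 线性分类器列表,每项为 (W, b),W 形状 (d, 类别数)。
    取各分类器得分的平均后,返回 argmax 类别;
    由于每个分类器都是线性的,可直接检查各概念维度对得分的贡献。"""
    scores = np.mean([concept_feats @ W + b for W, b in classifiers], axis=0)
    return int(np.argmax(scores))
```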

[NLP-52] Zipper: A Multi-Tower Decoder Architecture for Fusing Modalities
[NLP-52] 拉链:用于融合模式的多塔解码器架构

链接: https://arxiv.org/abs/2405.18669
作者: Vicky Zayats,Peter Chen,Melissa Merrari,Dirk Padfield
关键词: Integrating multiple generative, poses significant challenges, parts poses significant, Integrating multiple, multiple generative foundation
中文关键词: 整合多个生成,构成重大挑战,部分构成重大挑战,整合多个、多个生成基础
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: Under review at NeurIPS

点击查看摘要

Abstract:Integrating multiple generative foundation models, especially those trained on different modalities, into something greater than the sum of its parts poses significant challenges. Two key hurdles are the availability of aligned data (concepts that contain similar meaning but is expressed differently in different modalities), and effectively leveraging unimodal representations in cross-domain generative tasks, without compromising their original unimodal capabilities. We propose Zipper, a multi-tower decoder architecture that addresses these concerns by using cross-attention to flexibly compose multimodal generative models from independently pre-trained unimodal decoders. In our experiments fusing speech and text modalities, we show the proposed architecture performs very competitively in scenarios with limited aligned text-speech data. We also showcase the flexibility of our model to selectively maintain unimodal (e.g., text-to-text generation) generation performance by freezing the corresponding modal tower (e.g. text). In cross-modal tasks such as automatic speech recognition (ASR) where the output modality is text, we show that freezing the text backbone results in negligible performance degradation. In cross-modal tasks such as text-to-speech generation (TTS) where the output modality is speech, we show that using a pre-trained speech backbone results in superior performance to the baseline.
摘要:将多个生成式基础模型(特别是在不同模态上训练的模型)整合为大于各部分之和的整体,面临着巨大挑战。两个关键障碍是对齐数据(含义相似但在不同模态中表达不同的概念)的可获得性,以及在跨域生成任务中有效利用单模态表示而不损害其原有单模态能力。我们提出了Zipper,一种多塔解码器架构,它通过交叉注意力,从独立预训练的单模态解码器中灵活组合出多模态生成模型,从而解决这些问题。在融合语音和文本模态的实验中,我们表明所提出的架构在对齐文本-语音数据有限的情况下表现得非常有竞争力。我们还展示了模型的灵活性:通过冻结相应的模态塔(例如文本塔),可以选择性地保持单模态(例如文本到文本)的生成性能。在输出模态为文本的跨模态任务(如自动语音识别,ASR)中,我们表明冻结文本主干只会导致可忽略的性能下降。在输出模态为语音的跨模态任务(如文本到语音生成,TTS)中,我们表明使用预训练的语音主干可以获得优于基线的性能。
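
"用交叉注意力组合两个解码器塔"的核心计算可用如下示意来理解(非论文实现;形状与权重均为假设,仅演示A塔查询关注B塔键/值的单头交叉注意力):

```python
import numpy as np

def cross_attend(h_a, h_b, Wq, Wk, Wv):
    """交叉注意力示意:A 塔(如文本解码器)的隐状态 h_a (m, d) 作为查询,
    关注 B 塔(如语音解码器)的隐状态 h_b (n, d) 产生的键/值,
    从而在不修改各自骨干的情况下融合两个单模态解码器。
    Wq/Wk/Wv 为假设的新增可训练投影矩阵 (d, d)。"""
    Q, K, V = h_a @ Wq, h_b @ Wk, h_b @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # (m, n) 缩放点积
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)          # 按行 softmax
    return attn @ V                                   # (m, d) 融合后的表示
```

冻结某个塔即固定其骨干参数、只训练这些交叉注意力投影,对应文中"选择性保持单模态性能"的做法。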

[NLP-53] Understanding Intrinsic Socioeconomic Biases in Large Language Models
[NLP-53] 了解大型语言模型中内在的社会经济偏见

链接: https://arxiv.org/abs/2405.18662
作者: Mina Arzaghi,Florian Carichon,Golnoosh Farnadi
关键词: Large Language Models, Large Language, critical decision-making processes, decision-making processes, Language Models
中文关键词: 大型语言模型、大型语言、关键决策过程、决策过程、语言模型
类目: Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly integrated into critical decision-making processes, such as loan approvals and visa applications, where inherent biases can lead to discriminatory outcomes. In this paper, we examine the nuanced relationship between demographic attributes and socioeconomic biases in LLMs, a crucial yet understudied area of fairness in LLMs. We introduce a novel dataset of one million English sentences to systematically quantify socioeconomic biases across various demographic groups. Our findings reveal pervasive socioeconomic biases in both established models such as GPT-2 and state-of-the-art models like Llama 2 and Falcon. We demonstrate that these biases are significantly amplified when considering intersectionality, with LLMs exhibiting a remarkable capacity to extract multiple demographic attributes from names and then correlate them with specific socioeconomic biases. This research highlights the urgent necessity for proactive and robust bias mitigation techniques to safeguard against discriminatory outcomes when deploying these powerful models in critical real-world applications.
摘要:大型语言模型(LLM)越来越多地融入贷款审批和签证申请等关键决策过程中,而其固有偏见可能会导致歧视性结果。在本文中,我们考察了LLM中人口统计属性与社会经济偏见之间的微妙关系,这是LLM公平性中一个关键但研究不足的领域。我们引入了一个包含一百万个英语句子的新数据集,以系统地量化不同人口群体所受的社会经济偏见。我们的发现揭示了在GPT-2等成熟模型和Llama 2、Falcon等最先进模型中普遍存在的社会经济偏见。我们证明,在考虑交叉性时,这些偏见被显著放大:LLM表现出从姓名中提取多种人口统计属性,并将其与特定社会经济偏见相关联的非凡能力。这项研究强调,在关键的现实应用中部署这些强大的模型时,迫切需要主动且稳健的偏见缓解技术来防止歧视性结果。
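
这类偏见探测数据集的构造思路通常是模板填充,可示意如下(模板与名字均为假设示例,并非论文原始数据;生成句子后再比较模型对不同群体句子的打分差异):

```python
def fill_templates(templates, names):
    """偏见探测句生成示意:用不同人口群体的名字填充句子模板,
    为每个 (模板, 名字) 组合生成一条探测句。
    templates 中用 {name} 作为占位符。"""
    return [t.format(name=n) for t in templates for n in names]
```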

[NLP-54] Recent Advances of Foundation Language Models-based Continual Learning: A Survey
[NLP-54] 基于基础语言模型的持续学习的最新进展:调查

链接: https://arxiv.org/abs/2405.18653
作者: Yutao Yang,Jie Zhou,Xuanwen Ding,Tianyu Huai,Shunyu Liu,Qin Chen,Liang He,Yuan Xie
关键词: marked significant achievements, foundation language models, natural language processing, language models, models
中文关键词: 显着的重大成就、基础语言模型、自然语言处理、语言模型、模型
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recently, foundation language models (LMs) have marked significant achievements in the domains of natural language processing (NLP) and computer vision (CV). Unlike traditional neural network models, foundation LMs obtain a great ability for transfer learning by acquiring rich commonsense knowledge through pre-training on extensive unsupervised datasets with a vast number of parameters. However, they still can not emulate human-like continuous learning due to catastrophic forgetting. Consequently, various continual learning (CL)-based methodologies have been developed to refine LMs, enabling them to adapt to new tasks without forgetting previous knowledge. However, a systematic taxonomy of existing approaches and a comparison of their performance are still lacking, which is the gap that our survey aims to fill. We delve into a comprehensive review, summarization, and classification of the existing literature on CL-based approaches applied to foundation language models, such as pre-trained language models (PLMs), large language models (LLMs) and vision-language models (VLMs). We divide these studies into offline CL and online CL, which consist of traditional methods, parameter-efficient-based methods, instruction tuning-based methods and continual pre-training methods. Offline CL encompasses domain-incremental learning, task-incremental learning, and class-incremental learning, while online CL is subdivided into hard task boundary and blurry task boundary settings. Additionally, we outline the typical datasets and metrics employed in CL research and provide a detailed analysis of the challenges and future work for LMs-based continual learning.
摘要:近年来,基础语言模型(LM)在自然语言处理(NLP)和计算机视觉(CV)领域取得了显著的成就。与传统的神经网络模型不同,基础语言模型通过在大规模无监督数据集上以海量参数进行预训练来获取丰富的常识知识,从而获得很强的迁移学习能力。然而,由于灾难性遗忘,它们仍然无法模仿人类式的持续学习。因此,各种基于持续学习(CL)的方法被开发出来改进语言模型,使其能够在不忘记先前知识的情况下适应新任务。然而,目前仍缺乏对现有方法的系统分类及其性能比较,这正是本综述旨在填补的空白。我们对应用于基础语言模型(如预训练语言模型(PLM)、大型语言模型(LLM)和视觉语言模型(VLM))的基于CL的方法的现有文献进行了全面的回顾、总结和分类。我们将这些研究分为离线CL和在线CL,其中包括传统方法、基于参数高效的方法、基于指令微调的方法和持续预训练方法。离线CL包括领域增量学习、任务增量学习和类增量学习,而在线CL又细分为硬任务边界和模糊任务边界设置。此外,我们概述了持续学习研究中使用的典型数据集和度量标准,并详细分析了基于语言模型的持续学习所面临的挑战和未来工作。

[NLP-55] Training LLMs to Better Self-Debug and Explain Code
[NLP-55] 训练LLM以更好地自我调试和解释代码

链接: https://arxiv.org/abs/2405.18649
作者: Nan Jiang,Xiaopeng Li,Shiqi Wang,Qiang Zhou,Soneya Binta Hossain,Baishakhi Ray,Varun Kumar,Xiaofei Ma,Anoop Deoras
关键词: code, LLMs, refinement, pass, code generation
中文关键词: 代码、LLM、细化、传递、代码生成
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:In the domain of code generation, self-debugging is crucial. It allows LLMs to refine their generated code based on execution feedback. This is particularly important because generating correct solutions in one attempt proves challenging for complex tasks. Prior works on self-debugging mostly focus on prompting methods by providing LLMs with few-shot examples, which work poorly on small open-sourced LLMs. In this work, we propose a training framework that significantly improves self-debugging capability of LLMs. Intuitively, we observe that a chain of explanations on the wrong code followed by code refinement helps LLMs better analyze the wrong code and do refinement. We thus propose an automated pipeline to collect a high-quality dataset for code explanation and refinement by generating a number of explanations and refinement trajectories and filtering via execution verification. We perform supervised fine-tuning (SFT) and further reinforcement learning (RL) on both success and failure trajectories with a novel reward design considering code explanation and refinement quality. SFT improves the pass@1 by up to 15.92% and pass@10 by 9.30% over four benchmarks. RL training brings additional up to 3.54% improvement on pass@1 and 2.55% improvement on pass@10. The trained LLMs show iterative refinement ability, and can keep refining code continuously. Lastly, our human evaluation shows that the LLMs trained with our framework generate more useful code explanations and help developers better understand bugs in source code.
摘要:在代码生成领域,自调试至关重要。它允许LLM根据执行反馈改进其生成的代码。这一点尤其重要,因为对于复杂任务,一次尝试就生成正确的解决方案极具挑战性。以前关于自调试的工作主要集中在通过向LLM提供少样本示例进行提示,这类方法在小型开源LLM上效果很差。在这项工作中,我们提出了一个显著提高LLM自调试能力的训练框架。直观地说,我们观察到,先对错误代码给出一系列解释、再进行代码改进,有助于LLM更好地分析错误代码并完成改进。因此,我们提出了一个自动化流水线:生成大量解释和改进轨迹,并通过执行验证进行过滤,从而收集用于代码解释和改进的高质量数据集。我们在成功和失败轨迹上进行了监督微调(SFT)和进一步的强化学习(RL),并采用了一种考虑代码解释和改进质量的新颖奖励设计。在四个基准测试中,SFT将pass@1提高了最多15.92%,将pass@10提高了9.30%。RL训练进一步带来pass@1最多3.54%、pass@10最多2.55%的提升。训练后的LLM具有迭代改进能力,可以持续不断地改进代码。最后,我们的人工评估表明,使用我们的框架训练的LLM能生成更有用的代码解释,并帮助开发人员更好地理解源代码中的错误。
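
文中报告的 pass@1 / pass@10 指标,在代码生成评测中通常按 Chen 等人(2021)提出的无偏估计公式计算(下面是该通用公式的示意,并非本文的新方法):

```python
from math import comb

def pass_at_k(n, c, k):
    """pass@k 的无偏估计:每题采样 n 份代码,其中 c 份通过测试,
    则 pass@k = 1 - C(n-c, k) / C(n, k),
    即任取 k 份样本中至少有一份通过测试的概率。"""
    if n - c < k:
        return 1.0  # 失败样本不足 k 份,任取 k 份必含通过样本
    return 1.0 - comb(n - c, k) / comb(n, k)
```

例如 n=2、c=1 时,pass@1 = 1 - C(1,1)/C(2,1) = 0.5。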

[NLP-56] JADS: A Framework for Self-supervised Joint Aspect Discovery and Summarization
[NLP-56] JADS:自我监督联合方面发现和总结的框架

链接: https://arxiv.org/abs/2405.18642
作者: Xiaobo Guo,Jay Desai,Srinivasan H. Sengamedu
关键词: group relevant sentences, group relevant, include multiple aspects, Joint Aspect Discovery, modeling to group
中文关键词: 组相关句子,组相关,包括多个方面,联合方面发现,建模到组
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: preprint

点击查看摘要

Abstract:To generate summaries that include multiple aspects or topics for text documents, most approaches use clustering or topic modeling to group relevant sentences and then generate a summary for each group. These approaches struggle to optimize the summarization and clustering algorithms jointly. On the other hand, aspect-based summarization requires known aspects. Our solution integrates topic discovery and summarization into a single step. Given text data, our Joint Aspect Discovery and Summarization algorithm (JADS) discovers aspects from the input and generates a summary of the topics, in one step. We propose a self-supervised framework that creates a labeled dataset by first mixing sentences from multiple documents (e.g., CNN/DailyMail articles) as the input and then uses the article summaries from the mixture as the labels. The JADS model outperforms the two-step baselines. With pretraining, the model achieves better performance and stability. Furthermore, embeddings derived from JADS exhibit superior clustering capabilities. Our proposed method achieves higher semantic alignment with ground truth and is factual.
摘要:为了为文本文档生成包含多个方面或主题的摘要,大多数方法使用聚类或主题建模来对相关句子进行分组,然后为每组生成摘要。这些方法难以联合优化摘要和聚类算法。另一方面,基于方面的摘要需要已知的方面。我们的解决方案将主题发现和摘要集成到一个步骤中。给定文本数据,我们的联合方面发现和摘要算法(JADS)在一个步骤中从输入中发现方面并生成主题摘要。我们提出了一种自监督框架,该框架首先从多个文档(例如CNN/DailyMail文章)中混合句子作为输入,然后使用混合文档中的文章摘要作为标签来创建标签数据集。JADS模型的表现优于两步基线。通过预训练,该模型获得了更好的性能和稳定性。此外,从JADS派生的嵌入显示了卓越的集群功能。我们提出的方法实现了与基本事实更高的语义对齐,并且是事实。
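
其中"混合多篇文档的句子作为输入、以各文档摘要作为标签"的自监督构造方式可示意如下(非论文实现;数据格式与拼接方式均为假设):

```python
import random

def make_jads_example(docs, seed=0):
    """JADS 式自监督样本构造示意。
    docs: [(句子列表, 该文档的摘要)] 形式的多篇文档;
    输入为各文档句子打乱后的混合文本,
    标签为各文档摘要的拼接(即多方面摘要)。"""
    rng = random.Random(seed)
    sentences = [s for sents, _ in docs for s in sents]
    rng.shuffle(sentences)  # 打乱句子顺序,迫使模型自行发现各方面
    label = " ".join(summary for _, summary in docs)
    return " ".join(sentences), label
```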

[NLP-57] ConSiDERS-The-Human Evaluation Framework: Rethinking Human Evaluation for Generative Large Language Models
[NLP-57] ConSiDERS人类评估框架:重新思考生成性大型语言模型的人类评估

链接: https://arxiv.org/abs/2405.18638
作者: Aparna Elangovan,Ling Liu,Lei Xu,Sravan Bodapati,Dan Roth
关键词: human behavioral psychology, large language models, user experience research, position paper, results are reliable
中文关键词: 人类行为心理学、大型语言模型、用户体验研究、立场论文、结果可靠
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted in ACL 2024

点击查看摘要

Abstract:In this position paper, we argue that human evaluation of generative large language models (LLMs) should be a multidisciplinary undertaking that draws upon insights from disciplines such as user experience research and human behavioral psychology to ensure that the experimental design and results are reliable. The conclusions from these evaluations, thus, must consider factors such as usability, aesthetics, and cognitive biases. We highlight how cognitive biases can conflate fluent information and truthfulness, and how cognitive uncertainty affects the reliability of rating scores such as Likert. Furthermore, the evaluation should differentiate the capabilities and weaknesses of increasingly powerful large language models – which requires effective test sets. The scalability of human evaluation is also crucial to wider adoption. Hence, to design an effective human evaluation system in the age of generative NLP, we propose the ConSiDERS-The-Human evaluation framework consisting of 6 pillars – Consistency, Scoring Criteria, Differentiating, User Experience, Responsible, and Scalability.
摘要:在这份立场文件中,我们认为对生成式大语言模型(LLM)的人工评估应该是一项多学科的工作,需要借鉴用户体验研究和人类行为心理学等学科的见解,以确保实验设计和结果可靠。因此,这些评估得出的结论必须考虑可用性、美感和认知偏差等因素。我们强调了认知偏差如何将信息的流畅性与真实性混为一谈,以及认知不确定性如何影响Likert等评分量表的可靠性。此外,评估应该区分日益强大的大型语言模型的能力和弱点,这需要有效的测试集。人工评估的可扩展性对于更广泛的采用也至关重要。因此,为了在生成式NLP时代设计一个有效的人工评估体系,我们提出了由6个支柱组成的ConSiDERS-The-Human评估框架:一致性(Consistency)、评分准则(Scoring Criteria)、差异化(Differentiating)、用户体验(User Experience)、负责任(Responsible)和可扩展性(Scalability)。

[NLP-58] A Theoretical Understanding of Self-Correction through In-context Alignment
[NLP-58] 通过上下文调整自我纠正的理论理解

链接: https://arxiv.org/abs/2405.18634
作者: Yifei Wang,Yuyang Wu,Zeming Wei,Stefanie Jegelka,Yisen Wang
关键词: limited human experiences, recent studies show, mimicking limited human, studies show initial, show initial evidence
中文关键词: 最近的研究表明,人类经历有限,模仿有限的人类,研究表明,初步证据
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Going beyond mimicking limited human experiences, recent studies show initial evidence that, like humans, large language models (LLMs) are capable of improving their abilities purely by self-correction, i.e., correcting previous responses through self-examination, in certain circumstances. Nevertheless, little is known about how such capabilities arise. In this work, based on a simplified setup akin to an alignment task, we theoretically analyze self-correction from an in-context learning perspective, showing that when LLMs give relatively accurate self-examinations as rewards, they are capable of refining responses in an in-context way. Notably, going beyond previous theories on over-simplified linear transformers, our theoretical construction underpins the roles of several key designs of realistic transformers for self-correction: softmax attention, multi-head attention, and the MLP block. We validate these findings extensively on synthetic datasets. Inspired by these findings, we also illustrate novel applications of self-correction, such as defending against LLM jailbreaks, where a simple self-correction step does make a large difference. We believe that these findings will inspire further research on understanding, exploiting, and enhancing self-correction for building better foundation models.
摘要:除了模仿有限的人类经验之外,最近的研究给出了初步证据:像人类一样,大语言模型(LLM)在某些情况下能够纯粹通过自我纠正来提升能力,即通过自我检查来纠正先前的回答。然而,人们对这种能力如何产生知之甚少。在这项工作中,基于一个类似对齐任务的简化设定,我们从上下文学习(in-context learning)的角度对自我纠正进行了理论分析,结果表明,当LLM能给出相对准确的自我检查作为奖励时,它们能够以上下文方式改进回答。值得注意的是,超越以往关于过度简化的线性Transformer的理论,我们的理论构建揭示了现实Transformer中几个关键设计对自我纠正的作用:softmax注意力、多头注意力和MLP模块。我们在合成数据集上广泛验证了这些发现。受这些发现的启发,我们还展示了自我纠正的新应用,例如防御LLM越狱,在这种场景下一个简单的自我纠正步骤确实能带来很大差异。我们相信,这些发现将启发关于理解、利用和增强自我纠正以构建更好基础模型的进一步研究。

[NLP-59] Hardware-Aware Parallel Prompt Decoding for Memory-Efficient Acceleration of LLM Inference
[NLP-59] 硬件感知的并行提示解码:面向内存高效的LLM推理加速

链接: https://arxiv.org/abs/2405.18628
作者: Hao (Mark)Chen,Wayne Luk,Ka Fai Cedric Yiu,Rui Li,Konstantin Mishchenko,Stylianos I. Venieris,Hongxiang Fan
关键词: Large Language Models, Language Models, Large Language, results in significant, hardware performance
中文关键词: 大型语言模型,语言模型,大型语言,带来显着的硬件性能
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: The code for this implementation is available at this https URL

点击查看摘要

Abstract:The auto-regressive decoding of Large Language Models (LLMs) results in significant overheads in their hardware performance. While recent research has investigated various speculative decoding techniques for multi-token generation, these efforts have primarily focused on improving processing speed such as throughput. Crucially, they often neglect other metrics essential for real-life deployments, such as memory consumption and training cost. To overcome these limitations, we propose a novel parallel prompt decoding that requires only 0.0002 % trainable parameters, enabling efficient training on a single A100-40GB GPU in just 16 hours. Inspired by the human natural language generation process, PPD approximates outputs generated at future timesteps in parallel by using multiple prompt tokens. This approach partially recovers the missing conditional dependency information necessary for multi-token generation, resulting in up to a 28% higher acceptance rate for long-range predictions. Furthermore, we present a hardware-aware dynamic sparse tree technique that adaptively optimizes this decoding scheme to fully leverage the computational capacities on different GPUs. Through extensive experiments across LLMs ranging from MobileLlama to Vicuna-13B on a wide range of benchmarks, our approach demonstrates up to 2.49 \times speedup and maintains a minimal runtime memory overhead of just 0.0004 %. More importantly, our parallel prompt decoding can serve as an orthogonal optimization for synergistic integration with existing speculative decoding, showing up to 1.22\times further speed improvement. Our code is available at this https URL.
摘要:大语言模型(LLM)的自回归解码给硬件性能带来了巨大开销。虽然最近的研究已经探索了多种用于多token生成的推测解码技术,但这些工作主要集中在提升吞吐量等处理速度指标上。关键的是,它们往往忽略了实际部署所必需的其他指标,例如内存消耗和训练成本。为克服这些局限,我们提出了一种新颖的并行提示解码(PPD),它仅需0.0002%的可训练参数,在单块A100-40GB GPU上仅用16小时即可完成高效训练。受人类自然语言生成过程的启发,PPD通过使用多个提示token并行近似未来时间步生成的输出。这种方法部分恢复了多token生成所需的缺失条件依赖信息,使长程预测的接受率最高提升28%。此外,我们提出了一种硬件感知的动态稀疏树技术,自适应地优化该解码方案,以充分利用不同GPU的计算能力。通过在从MobileLlama到Vicuna-13B的各种LLM上进行广泛基准测试,我们的方法展示了最高2.49倍的加速,同时仅带来0.0004%的最小运行时内存开销。更重要的是,我们的并行提示解码可以作为一种正交优化,与现有推测解码协同集成,进一步带来最高1.22倍的速度提升。我们的代码可在此https URL获取。

[NLP-60] RealitySummary: On-Demand Mixed Reality Document Enhancement using Large Language Models
[NLP-60] RealitySummary:使用大型语言模型的按需混合现实文档增强

链接: https://arxiv.org/abs/2405.18620
作者: Aditya Gunturu,Shivesh Jadon,Nandi Zhang,Jarin Thundathil,Wesley Willett,Ryo Suzuki
关键词: mixed reality reading, reality reading assistant, introduce RealitySummary, mixed reality, enhance physical reading
中文关键词: 混合现实阅读,现实阅读助手,引入RealitySummary,混合现实,增强物理阅读
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We introduce RealitySummary, a mixed reality reading assistant that can enhance any printed or digital document using on-demand text extraction, summarization, and augmentation. While augmented reading tools promise to enhance physical reading experiences with overlaid digital content, prior systems have typically required pre-processed documents, which limits their generalizability and real-world use cases. In this paper, we explore on-demand document augmentation by leveraging large language models. To understand generalizable techniques for diverse documents, we first conducted an exploratory design study which identified five categories of document enhancements (summarization, augmentation, navigation, comparison, and extraction). Based on this, we developed a proof-of-concept system that can automatically extract and summarize text using Google Cloud OCR and GPT-4, then embed information around documents using a Microsoft Hololens 2 and Apple Vision Pro. We demonstrate real-time examples of six specific document augmentations: 1) summaries, 2) comparison tables, 3) timelines, 4) keyword lists, 5) summary highlighting, and 6) information cards. Results from a usability study (N=12) and in-the-wild study (N=11) highlight the potential benefits of on-demand MR document enhancement and opportunities for future research.
摘要:我们介绍了RealitySummary,这是一个混合现实阅读助手,可以通过按需文本提取、摘要和增强来增强任何印刷或数字文档。虽然增强阅读工具有望通过叠加数字内容来改善实体阅读体验,但现有系统通常需要预处理过的文档,这限制了它们的泛化能力和真实场景下的应用。在本文中,我们探索利用大语言模型实现按需文档增强。为了解适用于各类文档的通用技术,我们首先进行了一项探索性设计研究,确定了五类文档增强(摘要、补充、导航、比较和提取)。在此基础上,我们开发了一个概念验证系统,可使用Google Cloud OCR和GPT-4自动提取和总结文本,然后借助Microsoft Hololens 2和Apple Vision Pro将信息嵌入文档周围。我们实时演示了六种具体的文档增强:1)摘要、2)比较表、3)时间线、4)关键词列表、5)摘要高亮和6)信息卡片。可用性研究(N=12)和真实环境研究(N=11)的结果突显了按需混合现实文档增强的潜在益处以及未来研究的机会。

[NLP-61] GLOCON Database: Design Decisions and User Manual (v1.0)
[NLP-61] GLOCON数据库:设计决策和用户手册(v1.0)

链接: https://arxiv.org/abs/2405.18613
作者: Ali Hürriyetoğlu,Osman Mutlu,Fırat Duruşan,Erdem Yörük
关键词: contentious events automatically, events automatically extracted, multiple languages, database of contentious, automatically extracted
中文关键词: 有争议的事件自动提取,多语言,有争议的数据库,自动提取
类目: Computation and Language (cs.CL); Computers and Society (cs.CY); Databases (cs.DB); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:GLOCON is a database of contentious events automatically extracted from national news sources from various countries in multiple languages. National news sources are utilized, and complete news archives are processed to create an event list for each source. Automation is achieved using a gold standard corpus sampled randomly from complete news archives (Yörük et al. 2022) and all annotated by at least two domain experts based on the event definition provided in Duruşan et al. (2022).
摘要:GLOCON是一个有争议事件的数据库,事件从多个国家、多种语言的全国性新闻来源中自动提取。系统利用全国性新闻来源并处理完整的新闻档案,为每个来源生成事件列表。自动化的实现依托于从完整新闻档案中随机抽样的金标准语料库(Yörük等人,2022),其全部条目由至少两名领域专家根据Duruşan等人(2022)给出的事件定义进行标注。

[NLP-62] BioBERT-based Deep Learning and Merged ChemProt-DrugProt for Enhanced Biomedical Relation Extraction
[NLP-62] 基于BioBERT的深度学习和合并的ChemProt-DrugProt用于增强生物医学关系提取

链接: https://arxiv.org/abs/2405.18605
作者: Bridget T. McInnes,Jiawei Tang,Darshini Mahendran,Mai H. Nguyen
关键词: enhancing relation extraction, focusing specifically, chemical-gene interactions, paper presents, presents a methodology
中文关键词: 论文提出,加强关系提取,特别关注化学与基因的相互作用,提出了一种方法
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR); Molecular Networks (q-bio.MN)
备注:

点击查看摘要

Abstract:This paper presents a methodology for enhancing relation extraction from biomedical texts, focusing specifically on chemical-gene interactions. Leveraging the BioBERT model and a multi-layer fully connected network architecture, our approach integrates the ChemProt and DrugProt datasets using a novel merging strategy. Through extensive experimentation, we demonstrate significant performance improvements, particularly in CPR groups shared between the datasets. The findings underscore the importance of dataset merging in augmenting sample counts and improving model accuracy. Moreover, the study highlights the potential of automated information extraction in biomedical research and clinical practice.
摘要:本文提出了一种增强生物医学文本中关系提取的方法,特别关注化学与基因的相互作用。利用BioBERT模型和多层全连接网络架构,我们的方法使用新颖的合并策略集成了ChemProt和DrugProt数据集。通过广泛的实验,我们展示了显着的性能改进,特别是在数据集之间共享的CPR组中。研究结果强调了数据集合并在增加样本计数和提高模型准确性方面的重要性。此外,该研究强调了自动化信息提取在生物医学研究和临床实践中的潜力。

[NLP-63] Low-rank finetuning for LLMs: A fairness perspective
[NLP-63] LLM的低秩微调:公平性视角

链接: https://arxiv.org/abs/2405.18572
作者: Saswat Das,Marco Romanelli,Cuong Tran,Zarreen Reza,Bhavya Kailkhura,Ferdinando Fioretto
关键词: Large Language Models, fine-tuning Large Language, Large Language, Low-rank approximation techniques, Language Models
中文关键词: 大型语言模型、微调大型语言、大型语言、低秩近似技术、语言模型
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Low-rank approximation techniques have become the de facto standard for fine-tuning Large Language Models (LLMs) due to their reduced computational and memory requirements. This paper investigates the effectiveness of these methods in capturing the shift of fine-tuning datasets from the initial pre-trained data distribution. Our findings reveal that there are cases in which low-rank fine-tuning falls short in learning such shifts. This, in turn, produces non-negligible side effects, especially when fine-tuning is adopted for toxicity mitigation in pre-trained models, or in scenarios where it is important to provide fair models. Through comprehensive empirical evidence on several models, datasets, and tasks, we show that low-rank fine-tuning inadvertently preserves undesirable biases and toxic behaviors. We also show that this extends to sequential decision-making tasks, emphasizing the need for careful evaluation to promote responsible LLMs development.
摘要:低秩近似技术由于降低了计算和内存需求,已成为微调大语言模型(LLM)的事实标准。本文研究了这些方法在捕捉微调数据集相对于初始预训练数据分布的偏移方面的有效性。我们的研究结果表明,在某些情况下,低秩微调难以学习到此类偏移。这反过来会产生不可忽视的副作用,尤其是在预训练模型中采用微调来缓解毒性,或需要提供公平模型的场景中。通过对多个模型、数据集和任务的全面实证证据,我们表明低秩微调会在不经意间保留不良偏见和有毒行为。我们还表明,这一现象同样延伸到序列决策任务,强调需要仔细评估以促进负责任的LLM开发。

[NLP-64] Its Not a Modality Gap: Characterizing and Addressing the Contrastive Gap
[NLP-64] 这不是模态差距:刻画并解决对比差距

链接: https://arxiv.org/abs/2405.18570
作者: Abrar Fahim,Alex Murphy,Alona Fyshe
关键词: embedding input images, contrastive models, Multi-modal contrastive models, contrastive, gap
中文关键词: 嵌入输入图像,对比模型,多模式对比模型,对比,差距
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Multi-modal contrastive models such as CLIP achieve state-of-the-art performance in zero-shot classification by embedding input images and texts on a joint representational space. Recently, a modality gap has been reported in two-encoder contrastive models like CLIP, meaning that the image and text embeddings reside in disjoint areas of the latent space. Previous studies suggest that this gap exists due to 1) the cone effect, 2) mismatched pairs in the dataset, and 3) insufficient training. We show that, even when accounting for all these factors, and even when using the same modality, the contrastive loss actually creates a gap during training. As a result, We propose that the modality gap is inherent to the two-encoder contrastive loss and rename it the contrastive gap. We present evidence that attributes this contrastive gap to low uniformity in CLIP space, resulting in embeddings that occupy only a small portion of the latent space. To close the gap, we adapt the uniformity and alignment properties of unimodal contrastive loss to the multi-modal setting and show that simply adding these terms to the CLIP loss distributes the embeddings more uniformly in the representational space, closing the gap. In our experiments, we show that the modified representational space achieves better performance than default CLIP loss in downstream tasks such as zero-shot image classification and multi-modal arithmetic.
摘要:CLIP等多模态对比模型通过将输入图像和文本嵌入到联合表示空间,在零样本分类上取得了最先进的性能。最近,有研究在CLIP这类双编码器对比模型中发现了模态差距(modality gap),即图像嵌入和文本嵌入位于潜在空间中互不相交的区域。以往研究认为这一差距源于:1)锥体效应,2)数据集中不匹配的样本对,3)训练不足。我们发现,即使考虑了所有这些因素,甚至在使用同一模态时,对比损失实际上也会在训练过程中造成差距。因此,我们提出该差距是双编码器对比损失所固有的,并将其重新命名为对比差距(contrastive gap)。我们给出的证据表明,这种对比差距可归因于CLIP空间中的低均匀性,导致嵌入只占据潜在空间的一小部分。为了缩小差距,我们将单模态对比损失的均匀性和对齐性质推广到多模态设置,并证明只需将这些项加入CLIP损失,即可使嵌入在表示空间中分布得更均匀,从而缩小差距。实验表明,在零样本图像分类和多模态算术等下游任务中,修改后的表示空间比默认CLIP损失取得了更好的性能。
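作为示意,下面用NumPy给出对比表示学习中常用的对齐(alignment)与均匀性(uniformity)度量的一个极简草图(沿用Wang & Isola提出的通用定义;这只是假设性的演示代码,并非该论文的官方实现,函数名与参数均为自行约定):

```python
import numpy as np

def alignment_loss(x, y):
    # 对齐损失:配对的图像/文本嵌入之间的平均平方距离(嵌入需已归一化)
    return np.mean(np.sum((x - y) ** 2, axis=1))

def uniformity_loss(x, t=2.0):
    # 均匀性损失:衡量嵌入在单位超球面上的铺展程度(越小越均匀)
    sq_dists = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    mask = ~np.eye(x.shape[0], dtype=bool)  # 排除自身到自身的距离
    return np.log(np.mean(np.exp(-t * sq_dists[mask])))

# 构造归一化的随机嵌入来演示
rng = np.random.default_rng(0)
img = rng.normal(size=(8, 4)); img /= np.linalg.norm(img, axis=1, keepdims=True)
txt = rng.normal(size=(8, 4)); txt /= np.linalg.norm(txt, axis=1, keepdims=True)
print(alignment_loss(img, txt), uniformity_loss(np.vstack([img, txt])))
```

对齐项鼓励配对嵌入靠近,均匀性项(取值不大于0)鼓励嵌入铺满超球面;摘要所述方法即是将此类项加入CLIP损失。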

[NLP-65] Automatic detection of cognitive impairment in elderly people using an entertainment chatbot with Natural Language Processing capabilities
[NLP-65] 使用具有自然语言处理功能的娱乐聊天机器人自动检测老年人的认知障碍

链接: https://arxiv.org/abs/2405.18542
作者: Francisco de Arriba-Pérez,Silvia García-Méndez,Francisco J. González-Castaño,Enrique Costa-Montenegro
关键词: Previous researchers, cognitive impairment, researchers have proposed, therapeutic monitoring, proposed intelligent systems
中文关键词: 之前的研究人员,认知障碍,研究人员提出了,治疗监测,提出了智能系统
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Previous researchers have proposed intelligent systems for therapeutic monitoring of cognitive impairments. However, most existing practical approaches for this purpose are based on manual tests. This raises issues such as excessive caretaking effort and the white-coat effect. To avoid these issues, we present an intelligent conversational system for entertaining elderly people with news of their interest that monitors cognitive impairment transparently. Automatic chatbot dialogue stages allow assessing content description skills and detecting cognitive impairment with Machine Learning algorithms. We create these dialogue flows automatically from updated news items using Natural Language Generation techniques. The system also infers the gold standard of the answers to the questions, so it can assess cognitive capabilities automatically by comparing these answers with the user responses. It employs a similarity metric with values in [0, 1], in increasing level of similarity. To evaluate the performance and usability of our approach, we have conducted field tests with a test group of 30 elderly people in the earliest stages of dementia, under the supervision of gerontologists. In the experiments, we have analysed the effect of stress and concentration in these users. Those without cognitive impairment performed up to five times better. In particular, the similarity metric varied between 0.03, for stressed and unfocused participants, and 0.36, for relaxed and focused users. Finally, we developed a Machine Learning algorithm based on textual analysis features for automatic cognitive impairment detection, which attained accuracy, F-measure and recall levels above 80%. We have thus validated the automatic approach to detect cognitive impairment in elderly people based on entertainment content.
摘要:此前的研究者提出了用于认知障碍治疗性监测的智能系统。然而,现有面向这一目的的大多数实用方法都基于人工测验,由此带来照护负担过重和白大褂效应等问题。为避免这些问题,我们提出了一个智能对话系统,用老年人感兴趣的新闻为其提供娱乐,同时以无感知的方式监测认知障碍。自动化的聊天机器人对话环节允许利用机器学习算法评估内容描述能力并检测认知障碍。我们使用自然语言生成技术,从最新新闻条目中自动生成这些对话流。系统还会推断问题答案的金标准,因此可以通过将这些答案与用户回答进行比较来自动评估认知能力。系统采用取值在[0, 1]之间、随相似程度递增的相似性度量。为评估该方法的性能和可用性,我们在老年病学家的监督下,对30名处于痴呆最早期阶段的老年人组成的测试组进行了现场测试。在实验中,我们分析了压力和专注程度对这些用户的影响。无认知障碍者的表现最高可达前者的五倍。具体而言,相似性度量在0.03(紧张且不专注的参与者)到0.36(放松且专注的用户)之间变化。最后,我们开发了一种基于文本分析特征的机器学习算法用于认知障碍的自动检测,其准确率、F值和召回率均超过80%。由此,我们验证了这种基于娱乐内容自动检测老年人认知障碍的方法。

[NLP-66] Learning diverse attacks on large language models for robust red-teaming and safety tuning
[NLP-66] 学习对大语言模型的多样化攻击,以实现稳健的红队测试与安全调优

链接: https://arxiv.org/abs/2405.18540
作者: Seanie Lee,Minsu Kim,Lynn Cherif,David Dobre,Juho Lee,Sung Ju Hwang,Kenji Kawaguchi,Gauthier Gidel,Yoshua Bengio,Nikolay Malkin,Moksh Jain
关键词: elicit harmful responses, large language models, critical step, step in ensuring, ensuring the safe
中文关键词: 引发有害反应、大型语言模型、关键步骤、确保、确保安全的步骤
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Red-teaming, or identifying prompts that elicit harmful responses, is a critical step in ensuring the safe and responsible deployment of large language models (LLMs). Developing effective protection against many modes of attack prompts requires discovering diverse attacks. Automated red-teaming typically uses reinforcement learning to fine-tune an attacker language model to generate prompts that elicit undesirable responses from a target LLM, as measured, for example, by an auxiliary toxicity classifier. We show that even with explicit regularization to favor novelty and diversity, existing approaches suffer from mode collapse or fail to generate effective attacks. As a flexible and probabilistically principled alternative, we propose to use GFlowNet fine-tuning, followed by a secondary smoothing phase, to train the attacker model to generate diverse and effective attack prompts. We find that the attacks generated by our method are effective against a wide range of target LLMs, both with and without safety tuning, and transfer well between target LLMs. Finally, we demonstrate that models safety-tuned using a dataset of red-teaming prompts generated by our method are robust to attacks from other RL-based red-teaming approaches.
摘要:红队测试(red-teaming),即识别能引发有害响应的提示,是确保安全、负责任地部署大语言模型(LLM)的关键步骤。要开发针对多种攻击提示的有效防护,需要发现多样化的攻击。自动化红队测试通常使用强化学习微调攻击者语言模型,以生成能引发目标LLM不良响应的提示,其效果例如可由辅助毒性分类器来衡量。我们表明,即使采用显式正则化来鼓励新颖性和多样性,现有方法仍会遭遇模式坍缩,或无法生成有效攻击。作为一种灵活且符合概率原则的替代方案,我们建议使用GFlowNet微调并辅以二次平滑阶段,来训练攻击者模型生成多样且有效的攻击提示。我们发现,无论目标LLM是否经过安全调优,我们方法生成的攻击对大范围目标LLM都有效,并能在目标LLM之间很好地迁移。最后,我们证明,使用我们方法生成的红队提示数据集进行安全调优的模型,对其他基于强化学习的红队方法的攻击具有鲁棒性。

[NLP-67] LLMs and Memorization: On Quality and Specificity of Copyright Compliance
[NLP-67] LLM与记忆化:关于版权合规的质量与特异性

链接: https://arxiv.org/abs/2405.18492
作者: Felix B Mueller,Rebekka Görge,Anna K Bernzen,Janna C Pirk,Maximilian Poretschkin
关键词: Memorization in large, large language models, growing concern, large language, Memorization
中文关键词: 大型语言模型中的记忆化,日益增长的关注,大型语言,记忆化
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 10 pages, 3 figures

点击查看摘要

Abstract:Memorization in large language models (LLMs) is a growing concern. LLMs have been shown to easily reproduce parts of their training data, including copyrighted work. This is an important problem to solve, as it may violate existing copyright laws as well as the European AI Act. In this work, we propose a systematic analysis to quantify the extent of potential copyright infringements in LLMs using European law as an example. Unlike previous work, we evaluate instruction-finetuned models in a realistic end-user scenario. Our analysis builds on a proposed threshold of 160 characters, which we borrow from the German Copyright Service Provider Act and a fuzzy text matching algorithm to identify potentially copyright-infringing textual reproductions. The specificity of countermeasures against copyright infringement is analyzed by comparing model behavior on copyrighted and public domain data. We investigate what behaviors models show instead of producing protected text (such as refusal or hallucination) and provide a first legal assessment of these behaviors. We find that there are huge differences in copyright compliance, specificity, and appropriate refusal among popular LLMs. Alpaca, GPT 4, GPT 3.5, and Luminous perform best in our comparison, with OpenGPT-X, Alpaca, and Luminous producing a particularly low absolute number of potential copyright violations. Code will be published soon.
摘要:大语言模型(LLM)中的记忆化问题日益受到关注。LLM已被证明很容易复现其部分训练数据,包括受版权保护的作品。这是一个需要解决的重要问题,因为它可能违反现有版权法以及欧盟人工智能法案。在这项工作中,我们以欧洲法律为例,提出一种系统性分析来量化LLM中潜在版权侵权的程度。与以往工作不同,我们在现实的终端用户场景中评估经过指令微调的模型。我们的分析建立在160字符的建议阈值之上(借鉴德国版权服务提供商法案),并使用模糊文本匹配算法来识别潜在侵犯版权的文本复现。通过比较模型在受版权保护数据和公共领域数据上的行为,我们分析了版权侵权对策的特异性。我们考察了模型在不生成受保护文本时表现出的行为(如拒绝或幻觉),并对这些行为提供了初步的法律评估。我们发现,流行的LLM在版权合规性、特异性和恰当拒绝方面存在巨大差异。Alpaca、GPT 4、GPT 3.5和Luminous在我们的比较中表现最好,其中OpenGPT-X、Alpaca和Luminous产生的潜在版权违规绝对数量尤其低。代码将很快发布。
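下面是一个极简的示意草图,演示摘要所述思路的核心:用最长公共连续片段(一种简单的模糊匹配)配合160字符阈值来识别潜在的逐字复制。这只是假设性示例,并非论文使用的具体算法或实现:

```python
import difflib

THRESHOLD = 160  # 借鉴德国版权服务提供商法案的160字符建议阈值

def longest_shared_span(generated: str, source: str) -> int:
    """返回模型输出与受版权保护文本之间最长公共连续片段的长度。"""
    matcher = difflib.SequenceMatcher(None, generated, source, autojunk=False)
    match = matcher.find_longest_match(0, len(generated), 0, len(source))
    return match.size

def potentially_infringing(generated: str, source: str) -> bool:
    """若逐字重合片段达到阈值,则标记为潜在侵权。"""
    return longest_shared_span(generated, source) >= THRESHOLD

copyrighted = "Lorem ipsum dolor sit amet " * 10  # 270字符的示例"受保护文本"
print(potentially_infringing(copyrighted[:200], copyrighted))  # True:逐字重合200字符
print(potentially_infringing("an entirely different sentence", copyrighted))  # False
```

注意这里将 `autojunk=False`,否则 `SequenceMatcher` 会在长文本上启用启发式规则,可能漏掉由高频字符构成的长段重合。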

[NLP-68] Multi-objective Representation for Numbers in Clinical Narratives Using CamemBERT-bio
[NLP-68] 使用CamemBERT-bio对临床叙述中的数字进行多目标表示

链接: https://arxiv.org/abs/2405.18448
作者: Boammani Aser Lompo,Thanh-Dung Le
关键词: distinct physiological categories, physiological categories, medical documents, distinct physiological, traditional NLP models
中文关键词: 不同的生理类别、生理类别、医学文档、不同的生理、传统NLP模型
类目: Computation and Language (cs.CL); Signal Processing (eess.SP)
备注: Under the revision. arXiv admin note: substantial text overlap with arXiv:2404.10171

点击查看摘要

Abstract:This research aims to classify numerical values extracted from medical documents across seven distinct physiological categories, employing CamemBERT-bio. Previous studies suggested that transformer-based models might not perform as well as traditional NLP models in such tasks. To enhance CamemBERT-bio’s performances, we introduce two main innovations: integrating keyword embeddings into the model and adopting a number-agnostic strategy by excluding all numerical data from the text. The implementation of label embedding techniques refines the attention mechanisms, while the technique of using a `numerical-blind’ dataset aims to bolster context-centric learning. Another key component of our research is determining the criticality of extracted numerical data. To achieve this, we utilized a simple approach that involves verifying if the value falls within the established standard ranges. Our findings are encouraging, showing substantial improvements in the effectiveness of CamemBERT-bio, surpassing conventional methods with an F1 score of 0.89. This represents an over 20% increase over the 0.73 F_1 score of traditional approaches and an over 9% increase over the 0.82 F_1 score of state-of-the-art approaches. All this was achieved despite using small and imbalanced training datasets.
摘要:本研究旨在利用CamemBERT-bio将从医学文档中提取的数值划分到七个不同的生理类别。以往研究表明,基于Transformer的模型在此类任务中的表现可能不如传统NLP模型。为提升CamemBERT-bio的性能,我们引入了两项主要创新:将关键词嵌入集成到模型中,以及通过从文本中剔除所有数值数据来采用数值无关(number-agnostic)策略。标签嵌入技术的实施完善了注意力机制,而使用“数值盲”数据集的技术旨在强化以上下文为中心的学习。我们研究的另一个关键部分是确定所提取数值数据的危急程度。为此,我们采用了一种简单方法:验证数值是否落在既定标准范围内。我们的结果令人鼓舞:CamemBERT-bio的有效性得到实质性提升,以0.89的F1分数超越了传统方法。这比传统方法0.73的F1分数提高了20%以上,比最先进方法0.82的F1分数提高了9%以上。所有这些都是在训练数据集规模小且不平衡的情况下实现的。
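摘要中判定数值危急程度的“标准范围检查”可以草绘如下(以下类别与范围均为示意用的假设值,并非论文所用的七个生理类别或其实际阈值):

```python
# 各生理类别的参考范围(示意用的假设值)
NORMAL_RANGES = {
    "heart_rate": (60.0, 100.0),   # 次/分
    "temperature": (36.1, 37.8),   # 摄氏度
    "spo2": (95.0, 100.0),         # 血氧饱和度,%
}

def is_critical(category: str, value: float) -> bool:
    """若数值落在既定标准范围之外,则标记为危急。"""
    low, high = NORMAL_RANGES[category]
    return not (low <= value <= high)

print(is_critical("heart_rate", 72))     # False:在正常范围内
print(is_critical("temperature", 39.2))  # True:超出范围
```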

[NLP-69] InversionView: A General-Purpose Method for Reading Information from Neural Activations
[NLP-69] InversionView:一种读取神经激活信息的通用方法

链接: https://arxiv.org/abs/2405.17653
作者: Xinting Huang,Madhur Panwar,Navin Goyal,Michael Hahn
关键词: neural networks, fully decipher, information encoded, workings of neural, encoded in neural
中文关键词: 神经网络,完全破译,信息编码,神经的工作,神经编码
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The inner workings of neural networks can be better understood if we can fully decipher the information encoded in neural activations. In this paper, we argue that this information is embodied by the subset of inputs that give rise to similar activations. Computing such subsets is nontrivial as the input space is exponentially large. We propose InversionView, which allows us to practically inspect this subset by sampling from a trained decoder model conditioned on activations. This helps uncover the information content of activation vectors, and facilitates understanding of the algorithms implemented by transformer models. We present three case studies where we investigate models ranging from small transformers to GPT-2. In these studies, we demonstrate the characteristics of our method, show the distinctive advantages it offers, and provide causally verified circuits.
摘要:如果我们能够完全破译神经激活中编码的信息,就能更好地理解神经网络的内部运作。在本文中,我们认为该信息体现在能引起相似激活的输入子集中。由于输入空间呈指数级庞大,计算此类子集并非易事。我们提出了InversionView,通过从以激活为条件训练的解码器模型中采样,实际地检查这一子集。这有助于揭示激活向量的信息内容,并促进对Transformer模型所实现算法的理解。我们给出了三个案例研究,考察了从小型Transformer到GPT-2的多种模型。在这些研究中,我们展示了方法的特性及其独特优势,并提供了经过因果验证的电路(circuit)。

[NLP-70] Are queries and keys always relevant? A case study on Transformer wave functions
[NLP-70] 查询和键总是相关的吗?Transformer波函数案例研究

链接: https://arxiv.org/abs/2405.18874
作者: Riccardo Rende,Luciano Loris Viteritti
关键词: natural language processing, dot product attention, product attention mechanism, standard attention mechanisms, modern Transformers
中文关键词: 自然语言处理、点产品注意力、产品注意力机制、标准注意力机制、现代变形金刚
类目: Disordered Systems and Neural Networks (cond-mat.dis-nn); Computation and Language (cs.CL); Computational Physics (physics.comp-ph)
备注: 9 pages, 4 figures

点击查看摘要

Abstract:The dot product attention mechanism, originally designed for natural language processing (NLP) tasks, is a cornerstone of modern Transformers. It adeptly captures semantic relationships between word pairs in sentences by computing a similarity overlap between queries and keys. In this work, we explore the suitability of Transformers, focusing on their attention mechanisms, in the specific domain of the parametrization of variational wave functions to approximate ground states of quantum many-body spin Hamiltonians. Specifically, we perform numerical simulations on the two-dimensional J_1 - J_2 Heisenberg model, a common benchmark in the field of quantum-many body systems on lattice. By comparing the performance of standard attention mechanisms with a simplified version that excludes queries and keys, relying solely on positions, we achieve competitive results while reducing computational cost and parameter usage. Furthermore, through the analysis of the attention maps generated by standard attention mechanisms, we show that the attention weights become effectively input-independent at the end of the optimization. We support the numerical results with analytical calculations, providing physical insights of why queries and keys should be, in principle, omitted from the attention mechanism when studying large systems. Interestingly, the same arguments can be extended to the NLP domain, in the limit of long input sentences.
摘要:点积注意力机制最初是为自然语言处理(NLP)任务设计的,是现代Transformer的基石。它通过计算查询(query)与键(key)之间的相似度重叠,巧妙地捕捉句子中词对之间的语义关系。在这项工作中,我们聚焦注意力机制,探索Transformer在一个特定领域的适用性:用变分波函数的参数化来近似量子多体自旋哈密顿量的基态。具体而言,我们对二维J_1-J_2海森堡模型进行了数值模拟,这是格点量子多体系统领域的常用基准。通过比较标准注意力机制与仅依赖位置、去除查询和键的简化版本的性能,我们在降低计算开销和参数量的同时获得了具有竞争力的结果。此外,通过分析标准注意力机制生成的注意力图,我们发现在优化结束时,注意力权重实际上变得与输入无关。我们用解析计算支持数值结果,从物理上解释了为什么在研究大型系统时,原则上可以从注意力机制中省略查询和键。有趣的是,在输入句子很长的极限下,同样的论证可以推广到NLP领域。
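摘要所述“排除查询和键、仅依赖位置”的简化注意力,可以用如下极简草图示意:注意力权重来自一个只与token位置有关的可学习矩阵,与输入内容无关(仅为说明概念的假设性代码,并非论文的实际实现):

```python
import numpy as np

def positional_attention(x, w_pos):
    # x: (n, d) 输入序列;w_pos: (n, n) 仅由位置决定的可学习打分矩阵
    # 注意力权重 = 按行softmax(w_pos),不计算任何查询/键的内积
    attn = np.exp(w_pos) / np.exp(w_pos).sum(axis=-1, keepdims=True)
    return attn @ x  # (n, n) @ (n, d) -> (n, d)

n, d = 5, 3
rng = np.random.default_rng(1)
x = rng.normal(size=(n, d))
w_pos = rng.normal(size=(n, n))
out = positional_attention(x, w_pos)
print(out.shape)  # (5, 3)
```

与标准点积注意力相比,这里的权重在训练后对所有输入都相同,正对应摘要中“注意力权重在优化结束时与输入无关”的观察。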

[NLP-71] Improving Speech Decoding from ECoG with Self-Supervised Pretraining
[NLP-71] 通过自监督预训练改进基于ECoG的语音解码

链接: https://arxiv.org/abs/2405.18639
作者: Brian A. Yuan,Joseph G. Makin
关键词: intracranial brain-machine interfaces, Recent work, deep neural networks, high accuracy, essentially by treating
中文关键词: 脑机接口,最近的工作,深度神经网络,高准确性,本质上是通过治疗
类目: Neurons and Cognition (q-bio.NC); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Recent work on intracranial brain-machine interfaces has demonstrated that spoken speech can be decoded with high accuracy, essentially by treating the problem as an instance of supervised learning and training deep neural networks to map from neural activity to text. However, such networks pay for their expressiveness with very large numbers of labeled data, a requirement that is particularly burdensome for invasive neural recordings acquired from human patients. On the other hand, these patients typically produce speech outside of the experimental blocks used for training decoders. Making use of such data, and data from other patients, to improve decoding would ease the burden of data collection – especially onerous for dys- and anarthric patients. Here we demonstrate that this is possible, by reengineering wav2vec – a simple, self-supervised, fully convolutional model that learns latent representations of audio using a noise-contrastive loss – for electrocorticographic (ECoG) data. We train this model on unlabelled ECoG recordings, and subsequently use it to transform ECoG from labeled speech sessions into wav2vec’s representation space, before finally training a supervised encoder-decoder to map these representations to text. We experiment with various numbers of labeled blocks; for almost all choices, the new representations yield superior decoding performance to the original ECoG data, and in no cases do they yield worse. Performance can also be improved in some cases by pretraining wav2vec on another patient’s data. In the best cases, wav2vec’s representations decrease word error rates over the original data by upwards of 50%.
摘要:最近关于颅内脑机接口的研究表明,口语可以被高精度解码,其本质是将该问题视为监督学习的一个实例,训练深度神经网络从神经活动映射到文本。然而,这类网络的表达能力以大量标注数据为代价,对于从人类患者身上获取的侵入式神经记录而言,这一要求尤其沉重。另一方面,这些患者通常也会在用于训练解码器的实验区块之外产生语音。利用这些数据以及来自其他患者的数据来改进解码,将减轻数据收集的负担,这对构音障碍和构音不能患者而言尤为繁重。在这里,我们通过为皮层脑电(ECoG)数据重新设计wav2vec(一种简单、自监督、全卷积的模型,使用噪声对比损失学习音频的潜在表示)证明了这一点的可行性。我们在未标注的ECoG记录上训练该模型,随后用它将带标注语音会话的ECoG变换到wav2vec的表示空间,最后训练一个监督编码器-解码器将这些表示映射到文本。我们对不同数量的标注区块进行了实验;在几乎所有设置下,新表示都比原始ECoG数据产生更优的解码性能,且没有任何情况下更差。在某些情况下,用另一位患者的数据预训练wav2vec也能进一步提升性能。在最好的情况下,wav2vec的表示使词错误率相比原始数据降低了50%以上。

计算机视觉

[CV-0] X-VILA: Cross-Modality Alignment for Large Language Model

链接: https://arxiv.org/abs/2405.19335
作者: Hanrong Ye,De-An Huang,Yao Lu,Zhiding Yu,Wei Ping,Andrew Tao,Jan Kautz,Song Han,Dan Xu,Pavlo Molchanov,Hongxu Yin
关键词: omni-modality model designed, large language models, incorporating image, omni-modality model, model designed
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: Technical Report

点击查看摘要

Abstract:We introduce X-VILA, an omni-modality model designed to extend the capabilities of large language models (LLMs) by incorporating image, video, and audio modalities. By aligning modality-specific encoders with LLM inputs and diffusion decoders with LLM outputs, X-VILA achieves cross-modality understanding, reasoning, and generation. To facilitate this cross-modality alignment, we curate an effective interleaved any-to-any modality instruction-following dataset. Furthermore, we identify a significant problem with the current cross-modality alignment method, which results in visual information loss. To address the issue, we propose a visual alignment mechanism with a visual embedding highway module. We then introduce a resource-efficient recipe for training X-VILA, that exhibits proficiency in any-to-any modality conversation, surpassing previous approaches by large margins. X-VILA also showcases emergent properties across modalities even in the absence of similar training data. The project will be made open-source.

[CV-1] LLMs Meet Multimodal Generation and Editing: A Survey

链接: https://arxiv.org/abs/2405.19334
作者: Yingqing He,Zhaoyang Liu,Jingye Chen,Zeyue Tian,Hongyu Liu,Xiaowei Chi,Runtao Liu,Ruibin Yuan,Yazhou Xing,Wenhai Wang,Jifeng Dai,Yong Zhang,Wei Xue,Qifeng Liu,Yike Guo,Qifeng Chen
关键词: large language models, large language, combining LLMs, growing interest, interest in combining
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注: 51 Pages with 16 Figures, 12 Tables, and 534 References. GitHub Repository at: this https URL

点击查看摘要

Abstract:With the recent advancement in large language models (LLMs), there is a growing interest in combining LLMs with multimodal learning. Previous surveys of multimodal large language models (MLLMs) mainly focus on understanding. This survey elaborates on multimodal generation across different domains, including image, video, 3D, and audio, where we highlight the notable advancements with milestone works in these fields. Specifically, we exhaustively investigate the key technical components behind methods and multimodal datasets utilized in these studies. Moreover, we dig into tool-augmented multimodal agents that can use existing generative models for human-computer interaction. Lastly, we also comprehensively discuss the advancement in AI safety and investigate emerging applications as well as future prospects. Our work provides a systematic and insightful overview of multimodal generation, which is expected to advance the development of Artificial Intelligence for Generative Content (AIGC) and world models. A curated list of all related papers can be found at this https URL

[CV-2] Multi-Modal Generative Embedding Model

链接: https://arxiv.org/abs/2405.19333
作者: Feipeng Ma,Hongwei Xue,Guangting Wang,Yizhou Zhou,Fengyun Rao,Shilin Yan,Yueyi Zhang,Siying Wu,Mike Zheng Shou,Xiaoyan Sun
关键词: embedding, Large Language Model, model, generation, Generative Embedding Model
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Most multi-modal tasks can be formulated into problems of either generation or embedding. Existing models usually tackle these two types of problems by decoupling language modules into a text decoder for generation, and a text encoder for embedding. To explore the minimalism of multi-modal paradigms, we attempt to achieve only one model per modality in this work. We propose a Multi-Modal Generative Embedding Model (MM-GEM), whereby the generative and embedding objectives are encapsulated in one Large Language Model. We also propose a PoolAggregator to boost efficiency and enable the ability of fine-grained embedding and generation. A surprising finding is that these two objectives do not significantly conflict with each other. For example, MM-GEM instantiated from ViT-Large and TinyLlama shows competitive performance on benchmarks for multimodal embedding models such as cross-modal retrieval and zero-shot classification, while has good ability of image captioning. Additionally, MM-GEM can seamlessly execute region-level image caption generation and retrieval tasks. Besides, the advanced text model in MM-GEM brings over 5% improvement in Recall@1 for long text and image retrieval.

[CV-3] NPGA: Neural Parametric Gaussian Avatars

Link: https://arxiv.org/abs/2405.19331
Authors: Simon Giebenhain,Tobias Kirschstein,Martin Rünz,Lourdes Agapito,Matthias Nießner
Keywords: important stepping stone, integrating virtual components, digital versions, everyday lives, versions of human
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Comments: Project Page: see this https URL ; YouTube Video: see this https URL

Click to view abstract

Abstract:The creation of high-fidelity, digital versions of human heads is an important stepping stone in the process of further integrating virtual components into our everyday lives. Constructing such avatars is a challenging research problem, due to a high demand for photo-realism and real-time rendering performance. In this work, we propose Neural Parametric Gaussian Avatars (NPGA), a data-driven approach to create high-fidelity, controllable avatars from multi-view video recordings. We build our method around 3D Gaussian Splatting for its highly efficient rendering and to inherit the topological flexibility of point clouds. In contrast to previous work, we condition our avatars’ dynamics on the rich expression space of neural parametric head models (NPHM), instead of mesh-based 3DMMs. To this end, we distill the backward deformation field of our underlying NPHM into forward deformations which are compatible with rasterization-based rendering. All remaining fine-scale, expression-dependent details are learned from the multi-view videos. To increase the representational capacity of our avatars, we augment the canonical Gaussian point cloud using per-primitive latent features which govern its dynamic behavior. To regularize this increased dynamic expressivity, we propose Laplacian terms on the latent features and predicted dynamics. We evaluate our method on the public NeRSemble dataset, demonstrating that NPGA significantly outperforms the previous state-of-the-art avatars on the self-reenactment task by 2.6 PSNR. Furthermore, we demonstrate accurate animation capabilities from real-world monocular videos.

[CV-4] Reasoning3D – Grounding and Reasoning in 3D: Fine-Grained Zero-Shot Open-Vocabulary 3D Reasoning Part Segmentation via Large Vision-Language Models

Link: https://arxiv.org/abs/2405.19326
Authors: Tianrun Chen,Chunan Yu,Jing Li,Jianqi Zhang,Lanyun Zhu,Deyi Ji,Yong Zhang,Ying Zang,Zejian Li,Lingyun Sun
Keywords: Reasoning Segmentation, searching and localization, transcends limitations, Segmentation, instance segmentation
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Human-Computer Interaction (cs.HC)
Comments:

Click to view abstract

Abstract:In this paper, we introduce a new task: Zero-Shot 3D Reasoning Segmentation for part searching and localization in objects, a new paradigm for 3D segmentation that transcends the limitations of previous category-specific 3D semantic segmentation, 3D instance segmentation, and open-vocabulary 3D segmentation. We design a simple baseline method, Reasoning3D, with the capability to understand and execute complex commands for (fine-grained) segmentation of specific parts of 3D meshes, with contextual awareness and reasoned answers for interactive segmentation. Specifically, Reasoning3D leverages an off-the-shelf pre-trained 2D segmentation network, powered by Large Language Models (LLMs), to interpret user input queries in a zero-shot manner. Previous research has shown that extensive pre-training endows foundation models with prior world knowledge, enabling them to comprehend complex commands, a capability we can harness to “segment anything” in 3D with limited 3D datasets (source efficient). Experimentation reveals that our approach is generalizable and can effectively localize and highlight parts of 3D objects (in 3D meshes) based on implicit textual queries, including articulated 3D objects and real-world scanned data. Our method can also generate natural language explanations corresponding to these 3D models and their decomposition. Moreover, our training-free approach allows rapid deployment and serves as a viable universal baseline for future research on part-level 3D (semantic) object understanding in various fields, including robotics, object manipulation, part assembly, autonomous driving, augmented and virtual reality (AR/VR), and medical applications. The code, model weights, deployment guide, and evaluation protocol are available at: this http URL

[CV-5] DGD: Dynamic 3D Gaussians Distillation

Link: https://arxiv.org/abs/2405.19321
Authors: Isaac Labe,Noam Issachar,Itai Lang,Sagie Benaim
Keywords: single monocular video, video as input, semantic radiance fields, radiance field captures, tackle the task
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:We tackle the task of learning dynamic 3D semantic radiance fields given a single monocular video as input. Our learned semantic radiance field captures per-point semantics as well as color and geometric properties for a dynamic 3D scene, enabling the generation of novel views and their corresponding semantics. This enables the segmentation and tracking of a diverse set of 3D semantic entities, specified using a simple and intuitive interface that includes a user click or a text prompt. To this end, we present DGD, a unified 3D representation for both the appearance and semantics of a dynamic 3D scene, building upon the recently proposed dynamic 3D Gaussians representation. Our representation is optimized over time with both color and semantic information. Key to our method is the joint optimization of the appearance and semantic attributes, which jointly affect the geometric properties of the scene. We evaluate our approach in its ability to enable dense semantic 3D object tracking and demonstrate high-quality results that are fast to render, for a diverse set of scenes. Our project webpage is available on this https URL
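The joint optimization of appearance and semantics described above reduces to a combined reconstruction objective over the shared Gaussian parameters. A minimal Python sketch, where the L2 terms, the `sem_weight` balance, and all names are illustrative assumptions rather than the paper's exact losses:

```python
def joint_loss(rendered_rgb, gt_rgb, rendered_sem, gt_sem, sem_weight=0.1):
    """Combined color + semantic objective: both terms backpropagate into
    the same dynamic 3D Gaussian parameters, so appearance and semantics
    jointly influence the optimized geometry."""
    color = sum((r - g) ** 2 for r, g in zip(rendered_rgb, gt_rgb)) / len(gt_rgb)
    sem = sum((r - g) ** 2 for r, g in zip(rendered_sem, gt_sem)) / len(gt_sem)
    return color + sem_weight * sem

# Toy 2-pixel render: color error 0.25, semantic error 0.5.
loss = joint_loss([0.5, 0.5], [1.0, 0.0], [0.0, 1.0], [0.0, 0.0])
```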

[CV-6] Matryoshka Query Transformer for Large Vision-Language Models

Link: https://arxiv.org/abs/2405.19315
Authors: Wenbo Hu,Zi-Yi Dou,Liunian Harold Li,Amita Kamath,Nanyun Peng,Kai-Wei Chang
Keywords: Large Vision-Language Models, Large Vision-Language, visual tokens, tokens, visual
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Preprint. Our code and model are publicly available at this https URL

Click to view abstract

Abstract:Large Vision-Language Models (LVLMs) typically encode an image into a fixed number of visual tokens (e.g., 576) and process these tokens with a language model. Despite their strong performance, LVLMs face challenges in adapting to varying computational constraints. This raises the question: can we achieve flexibility in the number of visual tokens to suit different tasks and computational resources? We answer this with an emphatic yes. Inspired by Matryoshka Representation Learning, we introduce the Matryoshka Query Transformer (MQT), capable of encoding an image into m visual tokens during inference, where m can be any number up to a predefined maximum. This is achieved by employing a query transformer with M latent query tokens to compress the visual embeddings. During each training step, we randomly select m ≤ M latent query tokens and train the model using only these first m tokens, discarding the rest. Combining MQT with LLaVA, we train a single model once, and flexibly and drastically reduce the number of inference-time visual tokens while maintaining similar or better performance compared to training independent models for each number of tokens. Our model, MQT-LLAVA, matches LLaVA-1.5 performance across 11 benchmarks using a maximum of 256 tokens instead of LLaVA’s fixed 576. Reducing to 16 tokens (8x fewer TFLOPs) only sacrifices performance by 2.4 points on MMBench. On certain tasks such as ScienceQA and MMMU, we can even go down to only 2 visual tokens with performance drops of just 3% and 6%, respectively. Our exploration of the trade-off between accuracy and the computational cost brought about by the number of visual tokens facilitates future research into achieving the best of both worlds.
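The Matryoshka-style training step described above, keeping only the first m of M latent query tokens, can be sketched as follows; `truncate_latent_queries` and the toy token vectors are hypothetical names for illustration, not the paper's code:

```python
import random

def truncate_latent_queries(latent_queries, m=None, rng=random):
    """Keep only the first m of the M latent query tokens; the tail is
    simply unused for this training step. In the actual MQT, a query
    transformer first compresses visual embeddings into these tokens."""
    M = len(latent_queries)
    if m is None:
        m = rng.randint(1, M)  # sample the truncation point per step
    return latent_queries[:m]

# Toy example: M = 8 latent query tokens, each a 2-d vector.
queries = [[float(i)] * 2 for i in range(8)]
kept = truncate_latent_queries(queries, m=3)        # fixed budget of 3 tokens
sampled = truncate_latent_queries(queries)          # random m in [1, 8]
```

Because only prefixes are ever trained, a single model can later be served at any token budget m ≤ M.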

[CV-7] Real-Time Environment Condition Classification for Autonomous Vehicles

Link: https://arxiv.org/abs/2405.19305
Authors: Marco Introvigne,Andrea Ramazzina,Stefanie Walz,Dominik Scheuble,Mario Bijelic
Keywords: Current autonomous driving, Current autonomous, autonomous driving technologies, well-defined operation conditions, geo-fenced areas
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Current autonomous driving technologies are being rolled out in geo-fenced areas with well-defined operation conditions such as time of operation, area, weather conditions and road conditions. In this way, challenging conditions such as adverse weather, slippery roads or densely-populated city centers can be excluded. In order to lift the geo-fenced restriction and allow a more dynamic availability of autonomous driving functions, it is necessary for the vehicle to autonomously perform an environment condition assessment in real time to identify when the system cannot operate safely and either stop operation or require the resting passenger to take control. In particular, adverse-weather challenges are a fundamental limitation, as sensor performance degrades quickly, prohibiting the use of sensors such as cameras to locate and monitor road signs, pedestrians or other vehicles. To address this issue, we train a deep learning model to identify outdoor weather and dangerous road conditions, enabling a quick reaction to new situations and environments. We achieve this by introducing an improved taxonomy and label hierarchy for a state-of-the-art adverse-weather dataset, relabelling it with a novel semi-automated labeling pipeline. Using the proposed dataset and hierarchy, we train RECNet, a deep learning model for the classification of environment conditions from a single RGB frame. We outperform baseline models by a relative 16% in F1-Score, while maintaining real-time capable performance of 20 Hz.

[CV-8] Adaptive Image Quality Assessment via Teaching Large Multimodal Model to Compare

Link: https://arxiv.org/abs/2405.19298
Authors: Hanwei Zhu,Haoning Wu,Yixuan Li,Zicheng Zhang,Baoliang Chen,Lingyu Zhu,Yuming Fang,Guangtao Zhai,Weisi Lin,Shiqi Wang
Keywords: remains largely unexplored, transfer reliable relative, large multimodal models, scores remains largely, reliable relative quality
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Click to view abstract

Abstract:While recent advancements in large multimodal models (LMMs) have significantly improved their abilities in image quality assessment (IQA) relying on absolute quality ratings, how to transfer reliable relative quality comparison outputs to continuous perceptual quality scores remains largely unexplored. To address this gap, we introduce Compare2Score, an all-around LMM-based no-reference IQA (NR-IQA) model, which is capable of producing qualitatively comparative responses and effectively translating these discrete comparative levels into a continuous quality score. Specifically, during training, we generate scaled-up comparative instructions by comparing images from the same IQA dataset, allowing for more flexible integration of diverse IQA datasets. Utilizing the established large-scale training corpus, we develop a human-like visual quality comparator. During inference, moving beyond binary choices, we propose a soft comparison method that calculates the likelihood of the test image being preferred over multiple predefined anchor images. The quality score is further optimized by maximum a posteriori estimation with the resulting probability matrix. Extensive experiments on nine IQA datasets validate that Compare2Score effectively bridges text-defined comparative levels during training with the converted single-image quality score for inference, surpassing state-of-the-art IQA models across diverse scenarios. Moreover, we verify that the probability-matrix-based inference conversion improves the rating accuracy not only of Compare2Score but also of zero-shot general-purpose LMMs, suggesting its intrinsic effectiveness.
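The soft-comparison inference above, converting preference probabilities over anchor images into one continuous score, can be illustrated with a simplified stand-in. The Bradley-Terry-style preference model, the score grid, and all names below are assumptions; the paper's actual conversion uses MAP estimation over the full probability matrix:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def score_from_comparisons(pref_probs, anchor_scores, grid=None):
    """Recover a continuous quality score s from soft pairwise preferences,
    assuming the probability that the test image beats anchor i is
    sigmoid(s - q_i). Grid-search maximum likelihood is used here purely
    for illustration."""
    if grid is None:
        grid = [i / 100.0 for i in range(0, 1001)]  # candidate scores 0..10
    def log_likelihood(s):
        ll = 0.0
        for p, q in zip(pref_probs, anchor_scores):
            m = sigmoid(s - q)
            # Bernoulli log-likelihood of the observed soft preference p
            ll += p * math.log(m) + (1.0 - p) * math.log(1.0 - m)
        return ll
    return max(grid, key=log_likelihood)

anchors = [2.0, 4.0, 6.0, 8.0]                # hypothetical anchor qualities
prefs = [sigmoid(5.0 - q) for q in anchors]   # consistent with true score 5.0
estimated = score_from_comparisons(prefs, anchors)
```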

[CV-9] Neural Isometries: Taming Transformations for Equivariant ML

Link: https://arxiv.org/abs/2405.19296
Authors: Thomas W. Mitchel,Michael Taylor,Vincent Sitzmann
Keywords: tractable analytical expression, defy tractable analytical, Real-world geometry, vision tasks, analytical expression
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Real-world geometry and 3D vision tasks are replete with challenging symmetries that defy tractable analytical expression. In this paper, we introduce Neural Isometries, an autoencoder framework which learns to map the observation space to a general-purpose latent space wherein encodings are related by isometries whenever their corresponding observations are geometrically related in world space. Specifically, we regularize the latent space such that maps between encodings preserve a learned inner product and commute with a learned functional operator, in the same manner as rigid-body transformations commute with the Laplacian. This approach forms an effective backbone for self-supervised representation learning, and we demonstrate that a simple off-the-shelf equivariant network operating in the pre-trained latent space can achieve results on par with meticulously-engineered, handcrafted networks designed to handle complex, nonlinear symmetries. Furthermore, isometric maps capture information about the respective transformations in world space, and we show that this allows us to regress camera poses directly from the coefficients of the maps between encodings of adjacent views of a scene.
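The two latent-space regularities above, maps that preserve a learned inner product and commute with a learned operator, can be written as simple penalty terms. The matrix forms and names below are an illustrative sketch, not the paper's implementation:

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def fro_sq(A, B):
    """Squared Frobenius distance between two matrices."""
    return sum((x - y) ** 2 for ra, rb in zip(A, B) for x, y in zip(ra, rb))

def isometry_losses(T, G, Omega):
    """Penalties in the spirit of Neural Isometries: the latent map T should
    preserve a learned inner product G (T^T G T = G) and commute with a
    learned operator Omega (T Omega = Omega T)."""
    inner = fro_sq(matmul(transpose(T), matmul(G, T)), G)
    comm = fro_sq(matmul(T, Omega), matmul(Omega, T))
    return inner, comm

# A 90-degree rotation preserves the identity inner product and commutes
# with any multiple of the identity operator; a non-uniform scaling does not.
R = [[0.0, -1.0], [1.0, 0.0]]
I = [[1.0, 0.0], [0.0, 1.0]]
Omega = [[2.0, 0.0], [0.0, 2.0]]
inner_loss, comm_loss = isometry_losses(R, I, Omega)
bad_inner, _ = isometry_losses([[2.0, 0.0], [0.0, 1.0]], I, Omega)
```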

[CV-10] 3D Neural Edge Reconstruction

Link: https://arxiv.org/abs/2405.19295
Authors: Lei Li,Songyou Peng,Zehao Yu,Shaohui Liu,Rémi Pautrat,Xiaochuan Yin,Marc Pollefeys
Keywords: Real-world objects, including straight lines, including straight, objects and environments, environments are predominantly
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Click to view abstract

Abstract:Real-world objects and environments are predominantly composed of edge features, including straight lines and curves. Such edges are crucial elements for various applications, such as CAD modeling, surface meshing, lane mapping, etc. However, existing traditional methods only prioritize lines over curves for simplicity in geometric modeling. To this end, we introduce EMAP, a new method for learning 3D edge representations with a focus on both lines and curves. Our method implicitly encodes 3D edge distance and direction in Unsigned Distance Functions (UDF) from multi-view edge maps. On top of this neural representation, we propose an edge extraction algorithm that robustly abstracts parametric 3D edges from the inferred edge points and their directions. Comprehensive evaluations demonstrate that our method achieves better 3D edge reconstruction on multiple challenging datasets. We further show that our learned UDF field enhances neural surface reconstruction by capturing more details.

[CV-11] Programmable Motion Generation for Open-Set Motion Control Tasks

Link: https://arxiv.org/abs/2405.19283
Authors: Hanchao Liu,Xiaohang Zhan,Shaoli Huang,Tai-Jiang Mu,Ying Shan
Keywords: real-world scenarios necessitates, motion control, motion, control, motion control problem
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2024

Click to view abstract

Abstract:Character animation in real-world scenarios necessitates a variety of constraints, such as trajectories, key-frames, interactions, etc. Existing methodologies typically treat a single one or a finite set of these constraints as separate control tasks. They are often specialized, and the tasks they address are rarely extendable or customizable. We categorize these as solutions to the closed-set motion control problem. In response to the complexity of practical motion control, we propose and attempt to solve the open-set motion control problem. This problem is characterized by an open and fully customizable set of motion control tasks. To address this, we introduce a new paradigm, programmable motion generation. In this paradigm, any given motion control task is broken down into a combination of atomic constraints. These constraints are then programmed into an error function that quantifies the degree to which a motion sequence adheres to them. We utilize a pre-trained motion generation model and optimize its latent code to minimize the error function of the generated motion. Consequently, the generated motion not only inherits the prior of the generative model but also satisfies the required constraints. Experiments show that we can generate high-quality motions when addressing a wide range of unseen tasks. These tasks encompass motion control by motion dynamics, geometric constraints, physical laws, interactions with scenes, objects or the character's own body parts, etc. All of these are achieved in a unified approach, without the need for ad-hoc paired training data collection or specialized network designs. During the programming of novel tasks, we observed the emergence of new skills beyond those of the prior model. With the assistance of large language models, we also achieved automatic programming. We hope that this work will pave the way for the motion control of general AI agents.
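The paradigm above, program atomic constraints into an error function and optimize the generator's latent code against it, can be sketched end-to-end on a toy "generator". Every function here is a hypothetical stand-in for the pre-trained motion model, and finite differences stand in for backpropagation:

```python
def decode(z):
    """Stand-in for a pretrained motion generator: a 2-d latent code
    (start, step) decodes to a tiny 'motion' of 3 positions on a line."""
    start, step = z
    return [start + step * t for t in range(3)]

def total_error(z, constraints):
    """Programmed error function: sum of atomic constraint penalties
    evaluated on the decoded motion."""
    motion = decode(z)
    return sum(c(motion) for c in constraints)

def optimize_latent(z, constraints, lr=0.1, steps=500, eps=1e-4):
    """Minimize the error function over the latent code with a
    finite-difference gradient."""
    z = list(z)
    for _ in range(steps):
        grad = []
        for i in range(len(z)):
            zp, zm = list(z), list(z)
            zp[i] += eps
            zm[i] -= eps
            grad.append((total_error(zp, constraints) -
                         total_error(zm, constraints)) / (2.0 * eps))
        z = [zi - lr * gi for zi, gi in zip(z, grad)]
    return z

# Atomic constraints: start at position 0, end at position 4.
constraints = [lambda m: (m[0] - 0.0) ** 2,
               lambda m: (m[-1] - 4.0) ** 2]
motion = decode(optimize_latent([1.0, 1.0], constraints))
```

New tasks are "programmed" simply by composing new constraint functions; the optimizer and generator are untouched.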

[CV-12] ConceptPrune: Concept Editing in Diffusion Models via Skilled Neuron Pruning

Link: https://arxiv.org/abs/2405.19237
Authors: Ruchika Chavhan,Da Li,Timothy Hospedales
Keywords: impressive image-generation capabilities, perpetuating societal biases, demonstrated impressive image-generation, generating unsafe content, violating copyright
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:While large-scale text-to-image diffusion models have demonstrated impressive image-generation capabilities, there are significant concerns about their potential misuse for generating unsafe content, violating copyright, and perpetuating societal biases. Recently, the text-to-image generation community has begun addressing these concerns by editing or unlearning undesired concepts from pre-trained models. However, these methods often involve data-intensive and inefficient fine-tuning or utilize various forms of token remapping, rendering them susceptible to adversarial jailbreaks. In this paper, we present a simple and effective training-free approach, ConceptPrune, wherein we first identify critical regions within pre-trained models responsible for generating undesirable concepts, thereby facilitating straightforward concept unlearning via weight pruning. Experiments across a range of concepts including artistic styles, nudity, object erasure, and gender debiasing demonstrate that target concepts can be efficiently erased by pruning a tiny fraction, approximately 0.12% of total weights, enabling multi-concept erasure and robustness against various white-box and black-box adversarial attacks.
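The pruning recipe above, identify the regions most responsible for a concept and zero a tiny fraction of the weights, can be sketched as follows. The activation-difference importance score and all names are assumptions for illustration, not ConceptPrune's actual criterion:

```python
def prune_skilled_weights(weights, concept_acts, neutral_acts, fraction=0.0012):
    """Rank neurons by how much more they activate on concept inputs than on
    neutral ones, then zero the weights of the top `fraction` of them (the
    paper reports erasing concepts by pruning roughly 0.12% of weights)."""
    importance = [c - n for c, n in zip(concept_acts, neutral_acts)]
    n_prune = max(1, int(len(weights) * fraction))
    ranked = sorted(range(len(weights)), key=lambda i: importance[i], reverse=True)
    pruned = list(weights)
    for i in ranked[:n_prune]:
        pruned[i] = 0.0  # training-free unlearning: just zero the skilled neuron
    return pruned, ranked[:n_prune]

weights = [0.5] * 1000
concept_acts = [0.1] * 1000
concept_acts[42] = 5.0        # neuron 42 fires strongly on the target concept
neutral_acts = [0.1] * 1000
pruned, pruned_idx = prune_skilled_weights(weights, concept_acts, neutral_acts)
```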

[CV-13] Forward-Backward Knowledge Distillation for Continual Clustering

Link: https://arxiv.org/abs/2405.19234
Authors: Mohammadreza Sadeghi,Zihan Wang,Narges Armanfard
Keywords: enabling neural networks, explicit label information, Unsupervised Continual Learning, Unsupervised Continual Clustering, knowledge distillation
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Unsupervised Continual Learning (UCL) is a burgeoning field in machine learning, focusing on enabling neural networks to sequentially learn tasks without explicit label information. Catastrophic Forgetting (CF), where models forget previously learned tasks upon learning new ones, poses a significant challenge in continual learning, especially in UCL, where labeled information of data is not accessible. CF mitigation strategies, such as knowledge distillation and replay buffers, often face memory inefficiency and privacy issues. Although current research in UCL has endeavored to refine data representations and address CF in streaming data contexts, there is a noticeable lack of algorithms specifically designed for unsupervised clustering. To fill this gap, in this paper, we introduce the concept of Unsupervised Continual Clustering (UCC). We propose Forward-Backward Knowledge Distillation for unsupervised Continual Clustering (FBCC) to counteract CF within the context of UCC. FBCC employs a single continual learner (the “teacher”) with a cluster projector, along with multiple student models, to address the CF issue. The proposed method consists of two phases: Forward Knowledge Distillation, where the teacher learns new clusters while retaining knowledge from previous tasks with guidance from specialized student models, and Backward Knowledge Distillation, where a student model mimics the teacher’s behavior to retain task-specific knowledge, aiding the teacher in subsequent tasks. FBCC marks a pioneering approach to UCC, demonstrating enhanced performance and memory efficiency in clustering across various tasks, outperforming the application of clustering algorithms to the latent space of state-of-the-art UCL algorithms.
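The forward and backward phases above both rest on a standard distillation loss between teacher and student outputs. A generic sketch of that building block, assuming temperature-softened soft targets (FBCC's full objective additionally involves the cluster projector and clustering terms):

```python
import math

def distillation_loss(teacher_logits, student_logits, T=2.0):
    """Temperature-softened distillation: cross-entropy of the student's
    soft predictions against the teacher's soft targets."""
    def softmax(xs):
        m = max(xs)
        exps = [math.exp((x - m) / T) for x in xs]
        s = sum(exps)
        return [e / s for e in exps]
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

# Matching the teacher attains the minimum (the teacher's own entropy);
# disagreeing yields a strictly larger loss.
matched = distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
mismatched = distillation_loss([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```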

[CV-14] ContextBLIP: Doubly Contextual Alignment for Contrastive Image Retrieval from Linguistically Complex Descriptions

Link: https://arxiv.org/abs/2405.19226
Authors: Honglin Lin,Siyu Li,Guoshun Nan,Chaoyue Tang,Xueting Wang,Jingxin Xu,Rong Yankai,Zhili Zhou,Yutong Gao,Qimei Cui,Xiaofeng Tao
Keywords: aims to identify, set of minimally, based on linguistically, IRCD, minimally contrastive candidates
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: Accepted in ACL 2024 Findings

Click to view abstract

Abstract:Image retrieval from contextual descriptions (IRCD) aims to identify an image within a set of minimally contrastive candidates based on linguistically complex text. Despite the success of VLMs, they still significantly lag behind human performance in IRCD. The main challenges lie in aligning key contextual cues in two modalities, where these subtle cues are concealed in tiny areas of multiple contrastive images and within the complex linguistics of textual descriptions. This motivates us to propose ContextBLIP, a simple yet effective method that relies on a doubly contextual alignment scheme for challenging IRCD. Specifically, 1) our model comprises a multi-scale adapter, a matching loss, and a text-guided masking loss. The adapter learns to capture fine-grained visual cues. The two losses enable iterative supervision for the adapter, gradually highlighting the focal patches of a single image to the key textual cues. We term such a way as intra-contextual alignment. 2) Then, ContextBLIP further employs an inter-context encoder to learn dependencies among candidates, facilitating alignment between the text to multiple images. We term this step as inter-contextual alignment. Consequently, the nuanced cues concealed in each modality can be effectively aligned. Experiments on two benchmarks show the superiority of our method. We observe that ContextBLIP can yield comparable results with GPT-4V, despite involving about 7,500 times fewer parameters.

[CV-15] VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos

Link: https://arxiv.org/abs/2405.19209
Authors: Ziyang Wang,Shoubin Yu,Elias Stengel-Eskin,Jaehong Yoon,Feng Cheng,Gedas Bertasius,Mohit Bansal
Keywords: Video-language understanding tasks, short video clips, video understanding tasks, Large Language Models, understanding tasks
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 20 pages, first three authors contributed equally; Project page: this https URL

Click to view abstract

Abstract:Video-language understanding tasks have focused on short video clips, often struggling with long-form video understanding tasks. Recently, many long video-language understanding approaches have leveraged the reasoning capabilities of Large Language Models (LLMs) to perform long video QA, transforming videos into densely sampled frame captions, and asking LLMs to respond to text queries over captions. However, the frames used for captioning are often redundant and contain irrelevant information, making dense sampling inefficient, and ignoring the fact that video QA requires varying levels of granularity, with some video segments being highly relevant to the question (needing more fine-grained detail) while others being less relevant. Thus, these LLM-based approaches are prone to missing information and operate on large numbers of irrelevant captions, lowering both performance and efficiency. To address these issues, we introduce VideoTree, a query-adaptive and hierarchical framework for long-video understanding with LLMs. VideoTree dynamically extracts query-related information from a video and builds a tree-based representation for LLM reasoning. First, VideoTree adaptively selects frames for captioning by iteratively clustering frames based on their visual features and scoring clusters using their relevance to the query. Second, it organizes visual clusters into a query-adaptive and hierarchical tree structure; the tree encodes varying levels of granularity, with higher resolution on relevant segments. Finally, VideoTree produces an answer by traversing the tree’s keyframes and passing their captions to an LLM answerer. Our method improves both reasoning accuracy and efficiency compared to existing methods: VideoTree achieves a 7.0%, 2.2%, and 2.7% accuracy gain over baselines on the EgoSchema, NExT-QA, and IntentQA benchmarks, respectively, while reducing inference time by 40%.
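The first stage above, clustering frames by visual features and keeping the clusters most relevant to the query, can be sketched in one dimension. The greedy clustering rule and all names are illustrative assumptions, not VideoTree's actual algorithm:

```python
def cluster_frames(features, threshold=1.0):
    """Greedy single-pass clustering of (1-d) frame features; each cluster
    keeps a running centroid."""
    clusters, centroids = [], []
    for i, f in enumerate(features):
        for c, cen in enumerate(centroids):
            if abs(f - cen) <= threshold:
                clusters[c].append(i)
                members = [features[j] for j in clusters[c]]
                centroids[c] = sum(members) / len(members)
                break
        else:  # no nearby centroid: start a new cluster
            clusters.append([i])
            centroids.append(f)
    return clusters, centroids

def select_relevant_clusters(clusters, centroids, query_feature, top_k=1):
    """Score clusters by distance to the query feature and keep the top_k
    most relevant, mirroring query-adaptive frame selection."""
    order = sorted(range(len(clusters)),
                   key=lambda c: abs(centroids[c] - query_feature))
    return [clusters[c] for c in order[:top_k]]

# Two visually distinct video segments in a 1-d toy feature space.
feats = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
clusters, centroids = cluster_frames(feats)
relevant = select_relevant_clusters(clusters, centroids, query_feature=5.0)
```

Only the relevant clusters' frames would then be captioned and passed down the tree to the LLM, which is what saves inference time.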

[CV-16] E3Gen: Efficient Expressive and Editable Avatars Generation

Link: https://arxiv.org/abs/2405.19203
Authors: Weitian Zhang,Yichao Yan,Yunhui Liu,Xingdong Sheng,Xiaokang Yang
Keywords: editable digital avatar, Gaussian, digital avatar generation, editable digital, avatar generation
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:This paper aims to introduce 3D Gaussian for efficient, expressive, and editable digital avatar generation. This task faces two major challenges: (1) The unstructured nature of 3D Gaussian makes it incompatible with current generation pipelines; (2) the expressive animation of 3D Gaussian in a generative setting that involves training with multiple subjects remains unexplored. In this paper, we propose a novel avatar generation method named E^3Gen, to effectively address these challenges. First, we propose a novel generative UV features plane representation that encodes unstructured 3D Gaussian onto a structured 2D UV space defined by the SMPL-X parametric model. This novel representation not only preserves the representation ability of the original 3D Gaussian but also introduces a shared structure among subjects to enable generative learning of the diffusion model. To tackle the second challenge, we propose a part-aware deformation module to achieve robust and accurate full-body expressive pose control. Extensive experiments demonstrate that our method achieves superior performance in avatar generation and enables expressive full-body pose control and editing.

[CV-17] Going beyond compositional generalization: DDPMs can produce zero-shot interpolation

Link: https://arxiv.org/abs/2405.19201
Authors: Justin Deschenaux,Igor Krawczuk,Grigorios Chrysos,Volkan Cevher
Keywords: Denoising Diffusion Probabilistic, Diffusion Probabilistic Models, Denoising Diffusion, exhibit remarkable capabilities, Diffusion Probabilistic
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments:

Click to view abstract

Abstract:Denoising Diffusion Probabilistic Models (DDPMs) exhibit remarkable capabilities in image generation, with studies suggesting that they can generalize by composing latent factors learned from the training data. In this work, we go further and study DDPMs trained on strictly separate subsets of the data distribution with large gaps on the support of the latent factors. We show that such a model can effectively generate images in the unexplored, intermediate regions of the distribution. For instance, when trained on clearly smiling and non-smiling faces, we demonstrate a sampling procedure which can generate slightly smiling faces without reference images (zero-shot interpolation). We replicate these findings for other attributes as well as other datasets. Our code is available on GitHub at this https URL.

[CV-18] LOGO: Video Text Spotting with Language Collaboration and Glyph Perception Model

Link: https://arxiv.org/abs/2405.19194
Authors: Hongen Liu,Yi Liu,Di Sun,Jiahao Wang,Gang Pan
Keywords: Video text spotting, text spotting aims, Video text, text, simultaneously localize
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Video text spotting aims to simultaneously localize, recognize and track text instances in videos. To address the limited recognition capability of end-to-end methods, directly tracking the zero-shot results of state-of-the-art image text spotters can achieve impressive performance. However, owing to the domain gap between different datasets, these methods usually obtain limited tracking trajectories on extreme datasets. Fine-tuning transformer-based text spotters on specific datasets could yield performance enhancements, albeit at the expense of considerable training resources. In this paper, we propose a Language Collaboration and Glyph Perception Model, termed LOGO, to enhance the performance of conventional text spotters through the integration of a synergy module. To achieve this goal, a language synergy classifier (LSC) is designed to explicitly discern text instances from background noise in the recognition stage. Specifically, the language synergy classifier can output text content or a background code based on the legibility of text regions, thus computing language scores. Subsequently, fusion scores are computed by taking the average of detection scores and language scores, and are utilized to re-score the detection results before tracking. Through this re-scoring mechanism, the proposed LSC facilitates the detection of low-resolution text instances while filtering out text-like regions. In addition, glyph supervision and a visual position mixture module are proposed to enhance the recognition accuracy of noisy text regions and to acquire more discriminative tracking features, respectively. Extensive experiments on public benchmarks validate the effectiveness of the proposed method.
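The re-scoring mechanism above, averaging detection and language scores and filtering before tracking, is simple enough to sketch directly; the field names and threshold below are assumptions for illustration:

```python
def rescore_detections(detections, threshold=0.5):
    """Average detector confidence with the language score from the language
    synergy classifier, then drop low-fusion detections before tracking."""
    kept = []
    for det in detections:
        fusion = (det["det_score"] + det["lang_score"]) / 2.0
        if fusion >= threshold:
            kept.append({**det, "fusion_score": fusion})
    return kept

detections = [
    {"text": "EXIT", "det_score": 0.9, "lang_score": 0.8},  # legible text
    {"text": "",     "det_score": 0.7, "lang_score": 0.1},  # text-like region
]
kept = rescore_detections(detections)
```

A confident detection with an illegible region gets pulled down by its language score and filtered out, which is exactly how text-like background is suppressed.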

[CV-19] MetaToken: Detecting Hallucination in Image Descriptions by Meta Classification

Link: https://arxiv.org/abs/2405.19186
Authors: Laura Fieback(1,2),Jakob Spiegelberg(1),Hanno Gottschalk(2) ((1) Volkswagen AG, (2) TU Berlin)
Keywords: Vision Language Models, shown remarkable capabilities, Large Vision Language, visual question answering, Language Models
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 18 pages, 8 figures

Click to view abstract

Abstract:Large Vision Language Models (LVLMs) have shown remarkable capabilities in multimodal tasks like visual question answering or image captioning. However, inconsistencies between the visual information and the generated text, a phenomenon referred to as hallucination, remain an unsolved problem with regard to the trustworthiness of LVLMs. To address this problem, recent works proposed incorporating computationally costly Large (Vision) Language Models in order to detect hallucinations on a sentence- or subsentence-level. In this work, we introduce MetaToken, a lightweight binary classifier to detect hallucinations on the token-level at negligible cost. Based on a statistical analysis, we reveal key factors of hallucinations in LVLMs which have been overlooked in previous works. MetaToken can be applied to any open-source LVLM without any knowledge about ground truth data, providing reliable detection of hallucinations. We evaluate our method on four state-of-the-art LVLMs, demonstrating the effectiveness of our approach.
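A token-level hallucination detector of the kind described above can be as light as a logistic regression over per-token statistics. The features, training setup, and names below are illustrative assumptions, not MetaToken's actual inputs:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_meta_classifier(X, y, lr=0.5, epochs=200):
    """Plain logistic regression trained by SGD over per-token features;
    a lightweight binary meta-classifier in the spirit of MetaToken."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # gradient of the logistic loss w.r.t. the logit
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

def is_hallucinated(w, b, x):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5

# Toy per-token features: [token probability, predictive entropy];
# low-probability, high-entropy tokens are labeled hallucinated (1).
X = [[0.9, 0.2], [0.8, 0.3], [0.1, 1.5], [0.2, 1.4]]
y = [0, 0, 1, 1]
w, b = train_meta_classifier(X, y)
flags = [is_hallucinated(w, b, x) for x in X]
```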

[CV-20] Model Agnostic Defense against Adversarial Patch Attacks on Object Detection in Unmanned Aerial Vehicles

链接: https://arxiv.org/abs/2405.19179
作者: Saurabh Pathak,Samridha Shrestha,Abdelrahman AlMahmoud
关键词: Unmanned Aerial Vehicles, Aerial Vehicles, Unmanned Aerial, component in Unmanned, completing high-level tasks
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注: submitted to IROS 2024

点击查看摘要

Abstract:Object detection forms a key component in Unmanned Aerial Vehicles (UAVs) for completing high-level tasks that depend on the awareness of objects on the ground from an aerial perspective. In that scenario, adversarial patch attacks on an onboard object detector can severely impair the performance of upstream tasks. This paper proposes a novel model-agnostic defense mechanism against the threat of adversarial patch attacks in the context of UAV-based object detection. We formulate adversarial patch defense as an occlusion removal task. The proposed defense method can neutralize adversarial patches located on objects of interest, without exposure to adversarial patches during training. Our lightweight single-stage defense approach allows us to maintain a model-agnostic nature, that once deployed does not require to be updated in response to changes in the object detection pipeline. The evaluations in digital and physical domains show the feasibility of our method for deployment in UAV object detection pipelines, by significantly decreasing the Attack Success Ratio without incurring significant processing costs. As a result, the proposed defense solution can improve the reliability of object detection for UAVs.

[CV-21] Exploring AI-based Anonymization of Industrial Image and Video Data in the Context of Feature Preservation

链接: https://arxiv.org/abs/2405.19173
作者: Sabrina Cynthia Triess,Timo Leitritz,Christian Jauch
关键词: rising technologies, increasingly important, protection of privacy-sensitive information
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at 32nd European Signal Processing Conference (EUSIPCO 2024)

点击查看摘要

Abstract:With rising technologies, the protection of privacy-sensitive information is becoming increasingly important. In industry and production facilities, image or video recordings are beneficial for documentation, tracing production errors or coordinating workflows. Individuals in images or videos need to be anonymized. However, the anonymized data should be reusable for further applications. In this work, we apply the Deep Learning-based full-body anonymization framework DeepPrivacy2, which generates artificial identities, to industrial image and video data. We compare its performance with conventional anonymization techniques. To this end, we consider the quality of identity generation, temporal consistency, and the applicability of pose estimation and action recognition.

[CV-22] CaLa: Complementary Association Learning for Augmenting Composed Image Retrieval

链接: https://arxiv.org/abs/2405.19149
作者: Xintong Jiang,Yaxiong Wang,Mengjian Li,Yujiao Wu,Bingwen Hu,Xueming Qian
关键词: image-text pair query, Composed Image Retrieval, involves searching, image-text pair, target images based
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注: arXiv admin note: text overlap with arXiv:2309.02169

点击查看摘要

Abstract:Composed Image Retrieval (CIR) involves searching for target images based on an image-text pair query. While current methods treat this as a query-target matching problem, we argue that CIR triplets contain additional associations beyond this primary relation. In our paper, we identify two new relations within triplets, treating each triplet as a graph node. Firstly, we introduce the concept of text-bridged image alignment, where the query text serves as a bridge between the query image and the target image. We propose a hinge-based cross-attention mechanism to incorporate this relation into network learning. Secondly, we explore complementary text reasoning, considering CIR as a form of cross-modal retrieval where two images compose to reason about complementary text. To integrate these perspectives effectively, we design a twin attention-based compositor. By combining these complementary associations with the explicit query pair-target image relation, we establish a comprehensive set of constraints for CIR. Our framework, CaLa (Complementary Association Learning for Augmenting Composed Image Retrieval), leverages these insights. We evaluate CaLa on CIRR and FashionIQ benchmarks with multiple backbones, demonstrating its superiority in composed image retrieval.

[CV-23] ACCSAMS: Automatic Conversion of Exam Documents to Accessible Learning Material for Blind and Visually Impaired

链接: https://arxiv.org/abs/2405.19124
作者: David Wilkening,Omar Moured,Thorsten Schwarz,Karin Muller,Rainer Stiefelhagen
关键词: essential educational materials, essential educational, educational materials, Exam documents, exam preparation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at ICCHP 2024

点击查看摘要

Abstract:Exam documents are essential educational materials for exam preparation. However, they pose a significant academic barrier for blind and visually impaired students, as they are often created without accessibility considerations. Typically, these documents are incompatible with screen readers, contain excessive white space, and lack alternative text for visual elements. This situation frequently requires intervention by experienced sighted individuals to modify the format and content for accessibility. We propose ACCSAMS, a semi-automatic system designed to enhance the accessibility of exam documents. Our system offers three key contributions: (1) creating an accessible layout and removing unnecessary white space, (2) adding navigational structures, and (3) incorporating alternative text for visual elements that were previously missing. Additionally, we present the first multilingual manually annotated dataset, comprising 1,293 German and 900 English exam documents which could serve as a good training source for deep learning models.

[CV-24] ChartFormer: A Large Vision Language Model for Converting Chart Images into Tactile Accessible SVGs

链接: https://arxiv.org/abs/2405.19117
作者: Omar Moured,Sara Alzalabny,Anas Osman,Thorsten Schwarz,Karin Muller,Rainer Stiefelhagen
关键词: interpreting complex data, complex data, crucial for interpreting, interpreting complex, Visualizations
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at ICCHP 2024. Codes will be available at this https URL

点击查看摘要

Abstract:Visualizations, such as charts, are crucial for interpreting complex data. However, they are often provided as raster images, which are not compatible with assistive technologies for people with blindness and visual impairments, such as embossed papers or tactile displays. At the same time, creating accessible vector graphics requires a skilled sighted person and is time-intensive. In this work, we leverage advancements in the field of chart analysis to generate tactile charts in an end-to-end manner. Our three key contributions are as follows: (1) introducing the ChartFormer model trained to convert raster chart images into tactile-accessible SVGs, (2) training this model on the Chart2Tactile dataset, a synthetic chart dataset we created following accessibility standards, and (3) evaluating the effectiveness of our SVGs through a pilot user study with a refreshable two-dimensional tactile display. Our work is publicly available at this https URL .

[CV-25] Alt4Blind: A User Interface to Simplify Charts Alt-Text Creation

链接: https://arxiv.org/abs/2405.19111
作者: Omar Moured,Shahid Ali Farooqui,Karin Muller,Sharifeh Fadaeijouybari,Thorsten Schwarz,Mohammed Javed,Rainer Stiefelhagen
关键词: Alternative Texts, making graphics accessible, essential for making, making graphics, graphics accessible
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
*备注: Accepted at ICCHP 2024. Codes will be available at this https URL

点击查看摘要

Abstract:Alternative Texts (Alt-Text) for chart images are essential for making graphics accessible to people with blindness and visual impairments. Traditionally, Alt-Text is written manually by authors but often suffers from issues such as oversimplification or overcomplication. Recent trends have seen the use of AI for Alt-Text generation. However, existing models are susceptible to producing inaccurate or misleading information. We address this challenge by retrieving high-quality alt-texts from similar chart images, serving as a reference for the user when creating alt-texts. Our three contributions are as follows: (1) we introduce a new benchmark comprising 5,000 real images with semantically labeled high-quality Alt-Texts, collected from Human Computer Interaction venues. (2) We developed a deep learning-based model to rank and retrieve similar chart images that share the same visual and textual semantics. (3) We designed a user interface (UI) to facilitate the alt-text creation process. Our preliminary interviews and investigations highlight the usability of our UI. For the dataset and further details, please refer to our project page: this https URL.

[CV-26] Enhancing Zero-Shot Facial Expression Recognition by LLM Knowledge Transfer

链接: https://arxiv.org/abs/2405.19100
作者: Zengqun Zhao,Yu Cao,Shaogang Gong,Ioannis Patras
关键词: supervised learning manner, Current facial expression, Current facial, facial expression recognition, high-quality annotations
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Current facial expression recognition (FER) models are often designed in a supervised learning manner thus are constrained by the lack of large-scale facial expression images with high-quality annotations. Consequently, these models often fail to generalize well, performing poorly on unseen images in training. Vision-language-based zero-shot models demonstrate a promising potential for addressing such challenges. However, these models lack task-specific knowledge therefore are not optimized for the nuances of recognizing facial expressions. To bridge this gap, this work proposes a novel method, Exp-CLIP, to enhance zero-shot FER by transferring the task knowledge from large language models (LLMs). Specifically, based on the pre-trained vision-language encoders, we incorporate a projection head designed to map the initial joint vision-language space into a space that captures representations of facial actions. To train this projection head for subsequent zero-shot predictions, we propose to align the projected visual representations with task-specific semantic meanings derived from the LLM encoder, and the text instruction-based strategy is employed to customize the LLM knowledge. Given unlabelled facial data and efficient training of the projection head, Exp-CLIP achieves superior zero-shot results to the CLIP models and several other large vision-language models (LVLMs) on seven in-the-wild FER datasets. The code and pre-trained models are available at this https URL.
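A minimal numerical sketch of such a projection-head alignment, assuming (purely for illustration) that the LLM-derived semantic targets are linearly reachable from the frozen visual embeddings; all shapes and data are made up:

```python
import numpy as np

rng = np.random.default_rng(1)
d_vis, d_sem, n = 8, 4, 64
V = rng.normal(size=(n, d_vis))        # frozen joint vision-language embeddings
W_true = rng.normal(size=(d_vis, d_sem))
T = V @ W_true                         # synthetic "LLM-derived" semantic targets

W = np.zeros((d_vis, d_sem))           # the trainable projection head
for _ in range(500):
    W -= 0.05 * V.T @ (V @ W - T) / n  # gradient descent on the squared alignment error

P = V @ W                              # projected visual representations
cos = np.sum(P * T, axis=1) / (np.linalg.norm(P, axis=1) * np.linalg.norm(T, axis=1))
print("mean alignment cosine:", cos.mean())
```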

[CV-27] Efficient Black-box Adversarial Attacks via Bayesian Optimization Guided by a Function Prior

链接: https://arxiv.org/abs/2405.19098
作者: Shuyu Cheng,Yibo Miao,Yinpeng Dong,Xiao Yang,Xiao-Shan Gao,Jun Zhu
关键词: challenging black-box adversarial, Prior-guided Bayesian Optimization, studies the challenging, aims to generate, output feedback
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
*备注: ICML 2024

点击查看摘要

Abstract:This paper studies the challenging black-box adversarial attack that aims to generate adversarial examples against a black-box model by only using output feedback of the model to input queries. Some previous methods improve the query efficiency by incorporating the gradient of a surrogate white-box model into query-based attacks due to the adversarial transferability. However, the localized gradient is not informative enough, making these methods still query-intensive. In this paper, we propose a Prior-guided Bayesian Optimization (P-BO) algorithm that leverages the surrogate model as a global function prior in black-box adversarial attacks. As the surrogate model contains rich prior information of the black-box one, P-BO models the attack objective with a Gaussian process whose mean function is initialized as the surrogate model’s loss. Our theoretical analysis on the regret bound indicates that the performance of P-BO may be affected by a bad prior. Therefore, we further propose an adaptive integration strategy to automatically adjust a coefficient on the function prior by minimizing the regret bound. Extensive experiments on image classifiers and large vision-language models demonstrate the superiority of the proposed algorithm in reducing queries and improving attack success rates compared with the state-of-the-art black-box attacks. Code is available at this https URL.
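The core prior idea can be sketched in one dimension (illustrative only, not the paper's algorithm): a Gaussian process whose mean function is the surrogate model's loss interpolates the black-box observations near queried points and falls back to the surrogate far away:

```python
import numpy as np

def surrogate(x):                 # stand-in for the surrogate white-box loss
    return np.sin(x)

def rbf(a, b, ell=1.0):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell**2)

X = np.array([0.0, 1.0, 2.0])                    # points queried so far
y = surrogate(X) + np.array([0.5, -0.3, 0.2])    # black-box loss observations
K = rbf(X, X) + 1e-8 * np.eye(len(X))            # jitter for numerical stability
alpha = np.linalg.solve(K, y - surrogate(X))     # fit residuals w.r.t. the prior mean

def posterior_mean(xs):
    return surrogate(xs) + rbf(xs, X) @ alpha

print(posterior_mean(np.array([1.0])))    # interpolates the observation at x = 1
print(posterior_mean(np.array([50.0])))   # reverts to the surrogate prior far away
```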

[CV-28] Benchmarking and Improving Detail Image Caption

链接: https://arxiv.org/abs/2405.19092
作者: Hongyuan Dong,Jiawen Li,Bohong Wu,Jiacong Wang,Yuan Zhang,Haoyuan Guo
关键词: Image captioning, long been regarded, model image captioning, Image, evaluation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Image captioning has long been regarded as a fundamental task in visual understanding. Recently, however, little large vision-language model (LVLM) research discusses models’ image captioning performance, because of outdated short-caption benchmarks and unreliable evaluation metrics. In this work, we propose to benchmark the detail image caption task by curating high-quality evaluation datasets annotated by human experts, GPT-4V and Gemini-1.5-Pro. We also design a more reliable caption evaluation metric called CAPTURE (CAPtion evaluation by exTracting and coUpling coRE information). CAPTURE extracts visual elements, e.g., objects, attributes and relations from captions, and then matches these elements through three stages, achieving the highest consistency with expert judgements over other rule-based or model-based caption metrics. The proposed benchmark and metric provide reliable evaluation for LVLM’s detailed image captioning ability. Guided by this evaluation, we further explore to unleash LVLM’s detail caption capabilities by synthesizing high-quality data through a five-stage data construction pipeline. Our pipeline only uses a given LVLM itself and other open-source tools, without any human or GPT-4V annotation in the loop. Experiments show that the proposed data construction strategy significantly improves model-generated detail caption data quality for LVLMs with leading performance, and the data quality can be further improved in a self-looping paradigm. All code and dataset will be publicly available at this https URL.
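The element-matching idea behind CAPTURE can be pictured with a toy exact-overlap version (the real metric uses a three-stage matching procedure rather than the plain set intersection shown here): extracted objects, attributes, and relations are compared per category and combined into an F1-style score.

```python
def f1(cand, ref):
    cand, ref = set(cand), set(ref)
    if not cand or not ref:
        return 0.0
    p = len(cand & ref) / len(cand)      # precision over extracted elements
    r = len(cand & ref) / len(ref)       # recall over reference elements
    return 2 * p * r / (p + r) if p + r else 0.0

candidate = {"objects": ["dog", "frisbee"],
             "attributes": ["brown"],
             "relations": [("dog", "catches", "frisbee")]}
reference = {"objects": ["dog", "frisbee", "grass"],
             "attributes": ["brown", "green"],
             "relations": [("dog", "catches", "frisbee")]}

score = sum(f1(candidate[k], reference[k]) for k in candidate) / len(candidate)
print(round(score, 3))
```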

[CV-29] Cracking the Code of Juxtaposition: Can AI Models Understand the Humorous Contradictions

链接: https://arxiv.org/abs/2405.19088
作者: Zhe Hu,Tuo Liang,Jing Li,Yiren Lu,Yunlai Zhou,Yiran Qiao,Jing Ma,Yu Yin
关键词: demonstrated remarkable proficiency, demonstrated remarkable, remarkable proficiency, wide range, multimodal language models
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Recent advancements in large multimodal language models have demonstrated remarkable proficiency across a wide range of tasks. Yet, these models still struggle with understanding the nuances of human humor through juxtaposition, particularly when it involves nonlinear narratives that underpin many jokes and humor cues. This paper investigates this challenge by focusing on comics with contradictory narratives, where each comic consists of two panels that create a humorous contradiction. We introduce the YesBut benchmark, which comprises tasks of varying difficulty aimed at assessing AI’s capabilities in recognizing and interpreting these comics, ranging from literal content comprehension to deep narrative reasoning. Through extensive experimentation and analysis of recent commercial or open-sourced large (vision) language models, we assess their capability to comprehend the complex interplay of the narrative humor inherent in these comics. Our results show that even state-of-the-art models still lag behind human performance on this task. Our findings offer insights into the current limitations and potential improvements for AI in understanding human creative expressions.

[CV-30] Patch-enhanced Mask Encoder Prompt Image Generation

链接: https://arxiv.org/abs/2405.19085
作者: Shusong Xu,Peiye Liu
关键词: Artificial Intelligence Generated, Artificial Intelligence, Intelligence Generated Content, Intelligence Generated, high-cost advertising applications
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Artificial Intelligence Generated Content (AIGC), known for its superior visual results, represents a promising mitigation method for high-cost advertising applications. Numerous approaches have been developed to manipulate generated content under different conditions. However, a crucial limitation lies in the accurate description of products in advertising applications. Applying previous methods directly may lead to considerable distortion and deformation of advertised products, primarily due to oversimplified content control conditions. Hence, in this work, we propose a patch-enhanced mask encoder approach to ensure accurate product descriptions while preserving diverse backgrounds. Our approach consists of three components: Patch Flexible Visibility, a Mask Encoder Prompt Adapter, and an image Foundation Model. Patch Flexible Visibility is used for generating a more reasonable background image. The Mask Encoder Prompt Adapter enables region-controlled fusion. We also conduct an analysis of the structure and operational mechanisms of the Generation Module. Experimental results show our method can achieve the highest visual results and FID scores compared with other methods.

[CV-31] Uniform vs. Lognormal Kinematics in Robots: Perceptual Preferences for Robotic Movements

链接: https://arxiv.org/abs/2405.19081
作者: Jose J. Quintana,Miguel A. Ferrer,Moises Diaz,Jose J. Feo,Adam Wolniakowski,Konstantsin Miatliuk
关键词: common work environment, Collaborative robots, Collaborative, cobots interact, work environment
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Collaborative robots or cobots interact with humans in a common work environment. In cobots, one under-investigated but important issue is related to their movement and how it is perceived by humans. This paper analyzes whether humans prefer a robot moving in a human or in a robotic fashion. To this end, the present work lays out what differentiates the movement performed by an industrial robotic arm from that performed by a human one. The main difference lies in the fact that the robotic movement has a trapezoidal speed profile, while for the human arm, the speed profile is bell-shaped and during complex movements, it can be considered as a sum of superimposed bell-shaped movements. Based on the lognormality principle, a procedure was developed for a robotic arm to perform human-like movements. Both speed profiles were implemented in two industrial robots, namely, an ABB IRB 120 and a Universal Robot UR3. Three tests were used to study the subjects’ preference when watching both movements, and a further test analyzed the same preference when interacting with the robot by touching its ends with their fingers.
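The two speed profiles contrasted in the abstract can be sketched numerically (parameter values are arbitrary, chosen only to show the shapes): the trapezoidal profile has a flat velocity plateau, while the lognormal profile posited by the lognormality principle is a smooth bell curve.

```python
import numpy as np

t = np.linspace(0.01, 2.0, 400)

def trapezoid(t, v_max=1.0, t_acc=0.4, t_total=2.0):
    up = np.clip(t / t_acc, 0.0, 1.0)                # constant-acceleration ramp
    down = np.clip((t_total - t) / t_acc, 0.0, 1.0)  # constant-deceleration ramp
    return v_max * np.minimum(up, down)

def lognormal(t, D=1.0, t0=0.0, mu=-0.5, sigma=0.4):
    s = t - t0
    return D / (sigma * np.sqrt(2.0 * np.pi) * s) * np.exp(
        -(np.log(s) - mu) ** 2 / (2.0 * sigma**2))

v_robot, v_human = trapezoid(t), lognormal(t)
print("plateau samples (robot):", int(np.sum(np.isclose(v_robot, v_robot.max()))))
print("peak time (human):", float(t[np.argmax(v_human)]))
```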

[CV-32] Cephalo: Multi-Modal Vision-Language Models for Bio-Inspired Materials Analysis and Design

链接: https://arxiv.org/abs/2405.19076
作者: Markus J. Buehler
关键词: multimodal vision large, multi-agent AI frameworks, present Cephalo, vision large language, series of multimodal
类目: Computer Vision and Pattern Recognition (cs.CV); Mesoscale and Nanoscale Physics (cond-mat.mes-hall); Materials Science (cond-mat.mtrl-sci); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present Cephalo, a series of multimodal vision large language models (V-LLMs) designed for materials science applications, integrating visual and linguistic data for enhanced understanding and interaction within human-AI and multi-agent AI frameworks. A key innovation of Cephalo is its advanced dataset generation method, which employs a sophisticated algorithm to accurately detect and separate images and their corresponding textual descriptions from PDF documents, such as scientific papers. The method includes a careful refinement of image-text pairs through integrated vision and language processing, ensuring high-quality, contextually relevant, and well reasoned training data. Cephalo, trained on integrated image and text data extracted from thousands of scientific papers and science-focused Wikipedia pages, can interpret complex visual scenes, generate precise language descriptions, and answer queries about images effectively. The combination of a vision encoder with an autoregressive transformer supports complex natural language understanding in an integrated model, which can be coupled with other generative methods to create an image-to-text-to-image or image-to-text-to-3D pipeline. To explore the development of larger models from smaller ones, we merge sets of layers that originate from different pre-trained source models. This hybrid approach allows us to leverage the domain-specific expertise and general conversational capabilities to harness the strengths of multiple models. We examine the models in diverse use cases that incorporate biological materials, fracture and engineering analysis, protein biophysics, and bio-inspired design based on insect behavior. Generative applications include bio-inspired designs, including pollen-inspired architected materials, as well as the synthesis of bio-inspired material microstructures from a photograph of a solar eclipse.

[CV-33] Resurrecting Old Classes with New Data for Exemplar-Free Continual Learning

链接: https://arxiv.org/abs/2405.19074
作者: Dipam Goswami,Albin Soutif–Cormerais,Yuyang Liu,Sandesh Kamath,Bartłomiej Twardowski,Joost van de Weijer
关键词: catastrophic forgetting, suffer from catastrophic, hard to counter, store exemplars, exemplars of previous
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted at CVPR 2024

点击查看摘要

Abstract:Continual learning methods are known to suffer from catastrophic forgetting, a phenomenon that is particularly hard to counter for methods that do not store exemplars of previous tasks. Therefore, to reduce potential drift in the feature extractor, existing exemplar-free methods are typically evaluated in settings where the first task is significantly larger than subsequent tasks. Their performance drops drastically in more challenging settings starting with a smaller first task. To address this problem of feature drift estimation for exemplar-free methods, we propose to adversarially perturb the current samples such that their embeddings are close to the old class prototypes in the old model embedding space. We then estimate the drift in the embedding space from the old to the new model using the perturbed images and compensate the prototypes accordingly. We exploit the fact that adversarial samples are transferable from the old to the new feature space in a continual learning setting. The generation of these images is simple and computationally cheap. We demonstrate in our experiments that the proposed approach better tracks the movement of prototypes in embedding space and outperforms existing methods on several standard continual learning benchmarks as well as on fine-grained datasets. Code is available at this https URL.
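The prototype-compensation step can be sketched schematically (illustrative only: the real method generates adversarial images to obtain samples that transfer between feature spaces, whereas here a shared set of sample embeddings is simply assumed): embed the same samples with the old and the new extractor, estimate the embedding-space drift, and shift the stored class prototype accordingly.

```python
import numpy as np

rng = np.random.default_rng(2)
old_proto = np.array([1.0, 2.0])                          # stored old-class prototype
samples_old = rng.normal(old_proto, 0.1, size=(100, 2))   # embeddings in the old model
true_drift = np.array([0.5, -0.3])
samples_new = samples_old + true_drift                    # same samples in the new model

est_drift = (samples_new - samples_old).mean(axis=0)      # estimated embedding drift
new_proto = old_proto + est_drift                         # compensated prototype
print(new_proto)
```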

[CV-34] FUSU: A Multi-temporal-source Land Use Change Segmentation Dataset for Fine-grained Urban Semantic Understanding

链接: https://arxiv.org/abs/2405.19055
作者: Shuai Yuan,Guancong Lin,Lixian Zhang,Runmin Dong,Jinxiao Zhang,Shuang Chen,Juepeng Zheng,Jie Wang,Haohuan Fu
关键词: Fine urban change, understanding human-environment interactions, Fine urban, multi-temporal remote sensing, human-environment interactions
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Fine urban change segmentation using multi-temporal remote sensing images is essential for understanding human-environment interactions. Despite advances in remote sensing data for urban monitoring, coarse-grained classification systems and the lack of continuous temporal observations hinder the application of deep learning to urban change analysis. To address this, we introduce FUSU, a multi-source, multi-temporal change segmentation dataset for fine-grained urban semantic understanding. FUSU features the most detailed land use classification system to date, with 17 classes and 30 billion pixels of annotations. It includes bi-temporal high-resolution satellite images with 20-50 cm ground sample distance and monthly optical and radar satellite time series, covering 847 km2 across five urban areas in China. The fine-grained pixel-wise annotations and high spatial-temporal resolution data provide a robust foundation for deep learning models to understand urbanization and land use changes. To fully leverage FUSU, we propose a unified time-series architecture for both change detection and segmentation and benchmark FUSU on various methods for several tasks. Dataset and code will be available at: this https URL.

[CV-35] A Good Foundation is Worth Many Labels: Label-Efficient Panoptic Segmentation

链接: https://arxiv.org/abs/2405.19035
作者: Niclas Vödisch,Kürsat Petek,Markus Käppeler,Abhinav Valada,Wolfram Burgard
关键词: achieving accurate predictions, widespread application, application of learning-based, reduce the required, required amount
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:A key challenge for the widespread application of learning-based models for robotic perception is to significantly reduce the required amount of annotated training data while achieving accurate predictions. This is essential not only to decrease operating costs but also to speed up deployment time. In this work, we address this challenge for PAnoptic SegmenTation with fEw Labels (PASTEL) by exploiting the groundwork paved by visual foundation models. We leverage descriptive image features from such a model to train two lightweight network heads for semantic segmentation and object boundary detection, using very few annotated training samples. We then merge their predictions via a novel fusion module that yields panoptic maps based on normalized cut. To further enhance the performance, we utilize self-training on unlabeled images selected by a feature-driven similarity scheme. We underline the relevance of our approach by employing PASTEL to important robot perception use cases from autonomous driving and agricultural robotics. In extensive experiments, we demonstrate that PASTEL significantly outperforms previous methods for label-efficient segmentation even when using fewer annotations. The code of our work is publicly available at this http URL.

[CV-36] Enhancing Vision-Language Model with Unmasked Token Alignment

链接: https://arxiv.org/abs/2405.19009
作者: Jihao Liu,Jinliang Zheng,Boxiao Liu,Yu Liu,Hongsheng Li
关键词: Contrastive pre-training, Masked Image Modeling, CLIP, standard technique, Contrastive
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by TMLR; Code and models are available at this https URL

点击查看摘要

Abstract:Contrastive pre-training on image-text pairs, exemplified by CLIP, becomes a standard technique for learning multi-modal visual-language representations. Although CLIP has demonstrated remarkable performance, training it from scratch on noisy web-scale datasets is computationally demanding. On the other hand, mask-then-predict pre-training approaches, like Masked Image Modeling (MIM), offer efficient self-supervised learning for single-modal representations. This paper introduces Unmasked Token Alignment (UTA), a method that leverages existing CLIP models to further enhance its vision-language representations. UTA trains a Vision Transformer (ViT) by aligning unmasked visual tokens to the corresponding image tokens from a frozen CLIP vision encoder, which automatically aligns the ViT model with the CLIP text encoder. The pre-trained ViT can be directly applied for zero-shot evaluation even without training on image-text pairs. Compared to MIM approaches, UTA does not suffer from training-finetuning inconsistency and is much more training-efficient by avoiding using the extra [MASK] tokens. Extensive experimental results demonstrate that UTA can enhance CLIP models and outperform existing MIM methods on various uni- and multi-modal benchmarks. Code and models are available at this https URL.
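The alignment objective is essentially a cosine loss between corresponding tokens, with no [MASK] tokens involved. A minimal sketch with made-up shapes and random data (not the authors' implementation):

```python
import numpy as np

def uta_loss(vit_tokens, clip_tokens):
    """Both arrays: (num_tokens, dim). Mean negative cosine similarity."""
    a = vit_tokens / np.linalg.norm(vit_tokens, axis=1, keepdims=True)
    b = clip_tokens / np.linalg.norm(clip_tokens, axis=1, keepdims=True)
    return -np.mean(np.sum(a * b, axis=1))

rng = np.random.default_rng(3)
clip_tokens = rng.normal(size=(16, 8))     # targets from the frozen CLIP vision encoder
aligned = clip_tokens * 2.0                # same directions, different scale
misaligned = rng.normal(size=(16, 8))      # uncorrelated ViT tokens
print(uta_loss(aligned, clip_tokens))      # -1: perfectly aligned directions
print(uta_loss(misaligned, clip_tokens))   # near 0: no alignment
```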

[CV-37] Auto-selected Knowledge Adapters for Lifelong Person Re-identification

链接: https://arxiv.org/abs/2405.19005
作者: Xuelin Qian,Ruiqi Wu,Gong Cheng,Junwei Han
关键词: Lifelong Person Re-Identification, Person Re-Identification, extends traditional ReID, Lifelong Person, extends traditional
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Lifelong Person Re-Identification (LReID) extends traditional ReID by requiring systems to continually learn from non-overlapping datasets across different times and locations, adapting to new identities while preserving knowledge of previous ones. Existing approaches, either rehearsal-free or rehearsal-based, still suffer from the problem of catastrophic forgetting since they try to cram diverse knowledge into one fixed model. To overcome this limitation, we introduce a novel framework AdalReID, that adopts knowledge adapters and a parameter-free auto-selection mechanism for lifelong learning. Concretely, we incrementally build distinct adapters to learn domain-specific knowledge at each step, which can effectively learn and preserve knowledge across different datasets. Meanwhile, the proposed auto-selection strategy adaptively calculates the knowledge similarity between the input set and the adapters. On the one hand, the appropriate adapters are selected for the inputs to process ReID, and on the other hand, the knowledge interaction and fusion between adapters are enhanced to improve the generalization ability of the model. Extensive experiments are conducted to demonstrate the superiority of our AdalReID, which significantly outperforms SOTAs by about 10-20% mAP on both seen and unseen domains.
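A hypothetical sketch of a parameter-free auto-selection (the prototype representation and the cosine measure are illustrative assumptions, not the paper's exact mechanism): each adapter stores a prototype of the domain it was trained on, and an input batch is routed to the adapter whose prototype its mean feature is most similar to.

```python
import numpy as np

def select_adapter(batch_features, adapter_prototypes):
    q = batch_features.mean(axis=0)
    q = q / np.linalg.norm(q)
    P = adapter_prototypes / np.linalg.norm(adapter_prototypes, axis=1, keepdims=True)
    sims = P @ q                            # cosine similarity to each adapter's domain
    return int(np.argmax(sims)), sims

protos = np.array([[1.0, 0.0, 0.0],         # adapter 0: learned on dataset A
                   [0.0, 1.0, 0.0],         # adapter 1: learned on dataset B
                   [0.0, 0.0, 1.0]])        # adapter 2: learned on dataset C
rng = np.random.default_rng(4)
batch = rng.normal([0.1, 0.9, 0.1], 0.05, size=(32, 3))  # resembles dataset B
idx, sims = select_adapter(batch, protos)
print("selected adapter:", idx)
```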

[CV-38] EasyAnimate: A High-Performance Long Video Generation Method based on Transformer Architecture

链接: https://arxiv.org/abs/2405.18991
作者: Jiaqi Xu,Xinyi Zou,Kunzhe Huang,Yunkuo Chen,Bo Liu,MengLi Cheng,Xing Shi,Jun Huang
关键词: paper presents EasyAnimate, high-performance outcomes, paper presents, leverages the power, power of transformer
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Multimedia (cs.MM)
*备注: 6 pages, 5 figures

点击查看摘要

Abstract:This paper presents EasyAnimate, an advanced method for video generation that leverages the power of transformer architecture for high-performance outcomes. We have expanded the DiT framework originally designed for 2D image synthesis to accommodate the complexities of 3D video generation by incorporating a motion module block. It is used to capture temporal dynamics, thereby ensuring the production of consistent frames and seamless motion transitions. The motion module can be adapted to various DiT baseline methods to generate video with different styles. It can also generate videos with different frame rates and resolutions during both training and inference phases, suitable for both images and videos. Moreover, we introduce slice VAE, a novel approach to condense the temporal axis, facilitating the generation of long-duration videos. Currently, EasyAnimate can generate videos with 144 frames. We provide a holistic ecosystem for video production based on DiT, encompassing aspects such as data pre-processing, VAE training, DiT models training (both the baseline model and LoRA model), and end-to-end video inference. Code is available at: this https URL. We are continuously working to enhance the performance of our method.

[CV-39] Transcending Fusion: A Multi-Scale Alignment Method for Remote Sensing Image-Text Retrieval

链接: https://arxiv.org/abs/2405.18959
作者: Rui Yang,Shuang Wang,Yingping Han,Yuanheng Li,Dong Zhao,Dou Quan,Yanhe Guo,Licheng Jiao
关键词: Remote Sensing Image-Text, Remote Sensing, Sensing Image-Text Retrieval, Sensing Image-Text, Remote
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
*备注: 16 pages, 9 figures

点击查看摘要

Abstract:Remote Sensing Image-Text Retrieval (RSITR) is pivotal for knowledge services and data mining in the remote sensing (RS) domain. Accounting for multi-scale representations in image content and text vocabulary enables models to learn richer representations and enhances retrieval. Current multi-scale RSITR approaches typically align multi-scale fused image features with text features, but overlook aligning image-text pairs at distinct scales separately. This oversight restricts their ability to learn joint representations suitable for effective retrieval. We introduce a novel Multi-Scale Alignment (MSA) method to overcome this limitation. Our method comprises three key innovations: (1) Multi-scale Cross-Modal Alignment Transformer (MSCMAT), which computes cross-attention between single-scale image features and localized text features, integrating global textual context to derive a matching score matrix within a mini-batch, (2) a multi-scale cross-modal semantic alignment loss that enforces semantic alignment across scales, and (3) a cross-scale multi-modal semantic consistency loss that uses the matching matrix from the largest scale to guide alignment at smaller scales. We evaluated our method across multiple datasets, demonstrating its efficacy with various visual backbones and establishing its superiority over existing state-of-the-art methods. The GitHub URL for our project is: this https URL
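A mini-batch matching score matrix of the kind MSCMAT derives can be illustrated with plain cosine similarities (this sketch omits the cross-attention and global textual context; the temperature parameter is an assumption):

```python
import numpy as np

def matching_score_matrix(img_feats, txt_feats, tau=0.07):
    """scores[i, j] = similarity of image i to text j within the batch."""
    I = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    T = txt_feats / np.linalg.norm(txt_feats, axis=1, keepdims=True)
    return (I @ T.T) / tau

# toy batch where image i matches text i
feats = np.eye(3)
scores = matching_score_matrix(feats, feats)
# diagonal entries (matched pairs) dominate each row
```

A retrieval loss then pushes the diagonal (matched pairs) above the off-diagonal entries, which is what "guiding alignment at smaller scales" operates on.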

[CV-40] RGB-T Object Detection via Group Shuffled Multi-receptive Attention and Multi-modal Supervision

链接: https://arxiv.org/abs/2405.18955
作者: Jinzhong Wang,Xuetao Tian,Shun Dai,Tao Zhuo,Haorui Zeng,Hongjuan Liu,Jiaqi Liu,Xiuwei Zhang,Yanning Zhang
关键词: garnered significant attention, Multispectral object detection, utilizing both visible, Shuffled Multi-receptive Attention, Group Shuffled Multi-receptive
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Multispectral object detection, utilizing both visible (RGB) and thermal infrared (T) modalities, has garnered significant attention for its robust performance across diverse weather and lighting conditions. However, effectively exploiting the complementarity between RGB-T modalities while maintaining efficiency remains a critical challenge. In this paper, a very simple Group Shuffled Multi-receptive Attention (GSMA) module is proposed to extract and combine multi-scale RGB and thermal features. Then, the extracted multi-modal features are directly integrated with a multi-level path aggregation neck, which significantly improves the fusion effect and efficiency. Meanwhile, multi-modal object detection often adopts union annotations for both modalities. This kind of supervision is insufficient and unfair, since objects observed in one modality may not be seen in the other. To solve this issue, Multi-modal Supervision (MS) is proposed to sufficiently supervise RGB-T object detection. Comprehensive experiments on two challenging benchmarks, KAIST and DroneVehicle, demonstrate that the proposed model achieves state-of-the-art accuracy while maintaining competitive efficiency.
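The "group shuffled" part of GSMA echoes the classic channel-shuffle operation from ShuffleNet, which mixes information across channel groups at negligible cost; a minimal sketch of that standard operation (GSMA's exact layout may differ):

```python
import numpy as np

def channel_shuffle(x: np.ndarray, groups: int) -> np.ndarray:
    """Shuffle channels across groups (ShuffleNet-style).

    x: feature map of shape (C, H, W); C must be divisible by `groups`.
    """
    c, h, w = x.shape
    assert c % groups == 0, "channel count must be divisible by groups"
    # (groups, C//groups, H, W) -> swap group/channel axes -> flatten back
    x = x.reshape(groups, c // groups, h, w)
    x = x.transpose(1, 0, 2, 3)
    return x.reshape(c, h, w)

feat = np.arange(8 * 2 * 2, dtype=np.float32).reshape(8, 2, 2)
shuffled = channel_shuffle(feat, groups=2)
# output channel 1 comes from input channel 4 (the second group)
```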

[CV-41] WTTFNet: A Weather-Time-Trajectory Fusion Network for Pedestrian Trajectory Prediction in Urban Complex

链接: https://arxiv.org/abs/2405.18945
作者: Ho Chun Wu,Esther Hoi Shan Lau,Paul Yuen,Kevin Hung,John Kwok Tai Chui,Andrew Kwok Fai Lui
关键词: urban complex, complex is challenging, Pedestrian trajectory modelling, Pacific Trade Center, affect pedestrian behavior
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 12 pages, 7 figures

点击查看摘要

Abstract:Pedestrian trajectory modelling in an urban complex is challenging because pedestrians can have many possible destinations, such as shops, escalators, and attractions. Moreover, weather and time-of-day may affect pedestrian behavior. In this paper, a new weather-time-trajectory fusion network (WTTFNet) is proposed to improve the performance of a baseline deep neural network architecture. By incorporating weather and time-of-day information as an embedding structure, a novel WTTFNet based on a gate multimodal unit is used to fuse the multimodal information and deep representation of trajectories. A joint loss function based on focal loss is used to co-optimize both the deep trajectory features and the final classifier, which helps to improve the accuracy in predicting the intended destination of pedestrians, and hence the trajectories, under possible class-imbalance scenarios. Experimental results using the Osaka Asia and Pacific Trade Center (ATC) dataset show that the proposed approach improves over state-of-the-art algorithms with a 23.67% increase in classification accuracy and 9.16% and 7.07% reductions in average and final displacement error, respectively. The proposed approach may serve as an attractive option for improving existing baseline trajectory prediction models when they are applied to scenarios influenced by weather-time conditions. It can be employed in numerous applications such as pedestrian facility engineering, public space development and technology-driven retail.
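The focal-loss component mentioned above down-weights well-classified samples so training focuses on hard, minority-class destinations; a standard single-sample sketch (the gamma and alpha values are the usual defaults, not necessarily WTTFNet's):

```python
import numpy as np

def focal_loss(probs: np.ndarray, target: int,
               gamma: float = 2.0, alpha: float = 0.25) -> float:
    """Focal loss for one sample; probs are softmax class probabilities."""
    pt = probs[target]                       # probability of the true class
    return float(-alpha * (1.0 - pt) ** gamma * np.log(pt))

easy = focal_loss(np.array([0.9, 0.1]), target=0)   # confident, tiny loss
hard = focal_loss(np.array([0.1, 0.9]), target=0)   # misclassified, large loss
```

The `(1 - pt) ** gamma` factor is what suppresses the loss of confident predictions, which is why focal loss helps under class imbalance.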

[CV-42] Kestrel: Point Grounding Multimodal LLM for Part-Aware 3D Vision-Language Understanding

链接: https://arxiv.org/abs/2405.18937
作者: Junjie Fei,Mahmoud Ahmed,Jian Ding,Eslam Mohamed Bakr,Mohamed Elhoseiny
关键词: achieved significant progress, part level, segmentation grounding, Point Grounded Captioning, Part-Aware Point
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:While 3D MLLMs have achieved significant progress, they are restricted to object and scene understanding and struggle to understand 3D spatial structures at the part level. In this paper, we introduce Kestrel, representing a novel approach that empowers 3D MLLMs with part-aware understanding, enabling better interpretation and segmentation grounding of 3D objects at the part level. Despite its significance, the current landscape lacks tasks and datasets that endow and assess this capability. Therefore, we propose two novel tasks: (1) Part-Aware Point Grounding, the model is tasked with directly predicting a part-level segmentation mask based on user instructions, and (2) Part-Aware Point Grounded Captioning, the model provides a detailed caption that includes part-level descriptions and their corresponding masks. To support learning and evaluating for these tasks, we introduce 3DCoMPaT Grounded Instructions Dataset (3DCoMPaT-GRIN). 3DCoMPaT-GRIN Vanilla, comprising 789k part-aware point cloud-instruction-segmentation mask triplets, is used to evaluate MLLMs’ ability of part-aware segmentation grounding. 3DCoMPaT-GRIN Grounded Caption, containing 107k part-aware point cloud-instruction-grounded caption triplets, assesses both MLLMs’ part-aware language comprehension and segmentation grounding capabilities. Our introduced tasks, dataset, and Kestrel represent a preliminary effort to bridge the gap between human cognition and 3D MLLMs, i.e., the ability to perceive and engage with the environment at both global and part levels. Extensive experiments on the 3DCoMPaT-GRIN show that Kestrel can generate user-specified segmentation masks, a capability not present in any existing 3D MLLM. Kestrel thus established a benchmark for evaluating the part-aware language comprehension and segmentation grounding of 3D objects. Project page at this https URL

[CV-43] MDIW-13: a New Multi-Lingual and Multi-Script Database and Benchmark for Script Identification

链接: https://arxiv.org/abs/2405.18924
作者: Miguel A. Ferrer,Abhijit Das,Moises Diaz,Aythami Morales,Cristina Carmona-Duarte,Umapada Pal
关键词: Script identification plays, Script identification, plays a vital, vital role, role in applications
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Script identification plays a vital role in applications that involve handwriting and document analysis within a multi-script and multi-lingual environment. Moreover, it exhibits a profound connection with human cognition. This paper provides a new database for benchmarking script identification algorithms, which contains both printed and handwritten documents collected from a wide variety of scripts, such as Arabic, Bengali (Bangla), Gujarati, Gurmukhi, Devanagari, Japanese, Kannada, Malayalam, Oriya, Roman, Tamil, Telugu, and Thai. The dataset consists of 1,135 documents scanned from local newspapers, as well as handwritten letters and notes from different native writers. Further, these documents are segmented into lines and words, comprising a total of 13,979 lines and 86,655 words in the dataset. Easy-to-reproduce benchmarks are proposed with handcrafted and deep learning methods. The benchmark includes results at the document, line, and word levels with printed and handwritten documents. Results of script identification independent of the document/line/word level and independent of printed/handwritten letters are also given. The new multi-lingual database is expected to enable the development of new script identifiers, present various challenges (including identifying handwritten and printed samples), and serve as a foundation for future research in script identification based on the reported results of the three benchmarks.

[CV-44] Exploring Human-in-the-Loop Test-Time Adaptation by Synergizing Active Learning and Model Selection

链接: https://arxiv.org/abs/2405.18911
作者: Yushu Li,Yongyi Su,Xulei Yang,Kui Jia,Xun Xu
关键词: Existing test-time adaptation, test-time adaptation, testing data stream, unlabeled testing data, active learning
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Existing test-time adaptation (TTA) approaches often adapt models with the unlabeled testing data stream. A recent attempt relaxed this assumption by introducing limited human annotation, referred to as Human-In-the-Loop Test-Time Adaptation (HILTTA) in this study. The focus of existing HILTTA work lies in selecting the most informative samples to label, a.k.a. active learning. In this work, we are motivated by a pitfall of TTA, i.e., its sensitivity to hyper-parameters, and propose to approach HILTTA by synergizing active learning and model selection. Specifically, we first select samples for human annotation (active learning) and then use the labeled data to select optimal hyper-parameters (model selection). A sample selection strategy is tailored to choose samples by balancing the purposes of active learning and model selection. We demonstrate on 4 TTA datasets that the proposed HILTTA approach is compatible with off-the-shelf TTA methods and outperforms the state-of-the-art HILTTA methods and stream-based active learning methods. Importantly, our proposed method can always prevent choosing the worst hyper-parameters on all off-the-shelf TTA methods. The source code will be released upon publication.
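The model-selection half of this pipeline reduces to scoring each candidate hyper-parameter on the few human-labeled samples and keeping the best; a toy sketch (the `eval_error` callback is a hypothetical stand-in for adapting the model with one candidate and measuring its labeled-set error):

```python
def select_hyperparameter(candidates, eval_error):
    """Return the candidate with the lowest error on the labeled samples."""
    return min(candidates, key=eval_error)

# toy usage: pretend the labeled-set error is minimized at lr = 1e-3
best = select_hyperparameter([1e-2, 1e-3, 1e-4],
                             eval_error=lambda lr: abs(lr - 1e-3))
```

Even a handful of labels suffices to rule out the worst candidates, which is the "prevent choosing the worst hyper-parameters" guarantee the abstract highlights.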

[CV-45] Spectral Fidelity and Spatial Enhancement: An Assessment and Cascading of Pan-Sharpening Techniques for Satellite Imagery

链接: https://arxiv.org/abs/2405.18900
作者: Abdul Aziz A.B,A.B Abdul Rahim
关键词: satellite imagery, presents a comprehensive, comprehensive assessment, techniques for satellite, research presents
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:This research presents a comprehensive assessment of pan-sharpening techniques for satellite imagery, focusing on the critical aspects of spectral fidelity and spatial enhancement. Motivated by the need for informed algorithm selection in remote sensing, a novel cascaded and structured evaluation framework is proposed, with a detailed comparative analysis of existing methodologies. The findings underscore the intricate trade-off between spectral accuracy (about 88%) and spatial resolution enhancement. The research sheds light on the practical implications of pan-sharpening and emphasizes the significance of both spectral and spatial aspects in remote sensing applications. Various pan-sharpening algorithms were systematically employed to provide a holistic view of their performance, contributing to a deeper understanding of their capabilities and limitations.

[CV-46] MLAE: Masked LoRA Experts for Parameter-Efficient Fine-Tuning

链接: https://arxiv.org/abs/2405.18897
作者: Junjie Wang,Guangjing Yang,Wentao Chen,Huahui Yi,Xiaohu Wu,Qicheng Lao
关键词: extensive parameter updates, parameter updates required, large-scale pre-trained models, Low-Rank Adaptation, challenges posed
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Tech report

点击查看摘要

Abstract:In response to the challenges posed by the extensive parameter updates required for full fine-tuning of large-scale pre-trained models, parameter-efficient fine-tuning (PEFT) methods, exemplified by Low-Rank Adaptation (LoRA), have emerged. LoRA simplifies the fine-tuning process but may still struggle with a certain level of redundancy in low-rank matrices and limited effectiveness from merely increasing their rank. To address these issues, a natural idea is to enhance the independence and diversity of the learning process for the low-rank matrices. Therefore, we propose Masked LoRA Experts (MLAE), an innovative approach that applies the concept of masking to PEFT. Our method incorporates a cellular decomposition strategy that transforms a low-rank matrix into independent rank-1 submatrices, or "experts", thus enhancing independence. Additionally, we introduce a binary mask matrix that selectively activates these experts during training to promote more diverse and anisotropic learning, based on expert-level dropout strategies. Our investigations reveal that this selective activation not only enhances performance but also fosters a more diverse acquisition of knowledge with a marked decrease in parameter similarity among MLAE, significantly boosting the quality of the model while barely increasing the parameter count. Remarkably, MLAE achieves new SOTA performance with an average accuracy score of 78.8% on the VTAB-1k benchmark and 90.9% on the FGVC benchmark, demonstrating superior performance. Our code is available at this https URL.
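The cellular decomposition plus masking described above can be checked numerically: a rank-r LoRA update factors exactly into r rank-1 "experts", and a binary mask then activates a subset of them. A minimal sketch with made-up dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 6, 5, 4            # illustrative sizes, LoRA rank r
A = rng.normal(size=(d_out, r))     # LoRA up-projection factor
B = rng.normal(size=(r, d_in))      # LoRA down-projection factor

# The rank-r update A @ B is exactly the sum of r rank-1 "experts".
experts = [np.outer(A[:, k], B[k, :]) for k in range(r)]

# Expert-level dropout: a binary mask selects which experts are active.
mask = np.array([1, 0, 1, 1])
delta_W = sum(m * E for m, E in zip(mask, experts))
```

Masking an expert removes its rank-1 term from the weight update, which is what lets each submatrix learn more independently during training.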

[CV-47] DecomCAM: Advancing Beyond Saliency Maps through Decomposition and Integration

链接: https://arxiv.org/abs/2405.18882
作者: Yuguang Yang,Runtang Guo,Sheng Wu,Yimi Wang,Linlin Yang,Bo Fan,Jilong Zhong,Juan Zhang,Baochang Zhang
关键词: notably pre-trained vision-language, Interpreting complex deep, Interpreting complex, complex deep networks, pre-trained vision-language models
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by Neurocomputing journal

点击查看摘要

Abstract:Interpreting complex deep networks, notably pre-trained vision-language models (VLMs), is a formidable challenge. Current Class Activation Map (CAM) methods highlight regions revealing the model’s decision-making basis but lack clear saliency maps and detailed interpretability. To bridge this gap, we propose DecomCAM, a novel decomposition-and-integration method that distills shared patterns from channel activation maps. Utilizing singular value decomposition, DecomCAM decomposes class-discriminative activation maps into orthogonal sub-saliency maps (OSSMs), which are then integrated together based on their contribution to the target concept. Extensive experiments on six benchmarks reveal that DecomCAM not only excels in localization accuracy but also achieves an optimal balance between interpretability and computational efficiency. Further analysis unveils that OSSMs correlate with discernible object components, facilitating a granular understanding of the model’s reasoning. This positions DecomCAM as a potential tool for fine-grained interpretation of advanced deep learning models. The code is available at this https URL.
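The decomposition step can be sketched with NumPy's SVD: flattening the channel activation maps and keeping the top singular components yields mutually orthogonal sub-saliency maps (the integration weighting by contribution to the target concept is omitted in this sketch):

```python
import numpy as np

def decompose_activation_maps(maps: np.ndarray, top_k: int = 2) -> np.ndarray:
    """Decompose (C, H, W) activation maps into top_k orthogonal OSSMs."""
    c, h, w = maps.shape
    u, s, vt = np.linalg.svd(maps.reshape(c, h * w), full_matrices=False)
    # Each right-singular vector, scaled by its singular value, is one OSSM.
    return (s[:top_k, None] * vt[:top_k]).reshape(top_k, h, w)

rng = np.random.default_rng(1)
ossms = decompose_activation_maps(rng.normal(size=(16, 8, 8)))
# the two sub-saliency maps are orthogonal as flattened vectors
```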

[CV-48] EventZoom: A Progressive Approach to Event-Based Data Augmentation for Enhanced Neuromorphic Vision

链接: https://arxiv.org/abs/2405.18880
作者: Yiting Dong,Xiang He,Guobin Shen,Dongcheng Zhao,Yang Li,Yi Zeng
关键词: Dynamic Vision Sensors, Vision Sensors, traditional video capture, Event data captured, Dynamic Vision
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Event data captured by Dynamic Vision Sensors (DVS) offers a unique approach to visual processing that differs from traditional video capture, showcasing its efficiency in dynamic and real-time scenarios. Despite advantages such as high temporal resolution and low energy consumption, the application of event data faces challenges due to limited dataset size and diversity. To address this, we developed EventZoom – a data augmentation strategy specifically designed for event data. EventZoom employs a progressive temporal strategy that intelligently blends time and space to enhance the diversity and complexity of the data while maintaining its authenticity. This method aims to improve the quality of data for model training and enhance the adaptability and robustness of algorithms in handling complex dynamic scenes. We have experimentally validated EventZoom across various learning frameworks, including supervised, semi-supervised, and unsupervised learning. Our results demonstrate that EventZoom consistently outperforms other data augmentation methods, confirming its effectiveness and applicability as a powerful event-based data augmentation tool in diverse learning settings.

[CV-49] Single image super-resolution based on trainable feature matching attention network

链接: https://arxiv.org/abs/2405.18872
作者: Qizhou Chen,Qing Shao
关键词: Convolutional Neural Networks, Neural Networks, Convolutional Neural, Trainable Feature Matching, recent years
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 35pages, 12 figures

点击查看摘要

Abstract:Convolutional Neural Networks (CNNs) have been widely employed for image Super-Resolution (SR) in recent years. Various techniques enhance SR performance by altering CNN structures or incorporating improved self-attention mechanisms. Interestingly, these advancements share a common trait. Instead of explicitly learning high-frequency details, they learn an implicit feature processing mode that utilizes weighted sums of a feature map’s own elements for reconstruction, akin to convolution and non-local. In contrast, early dictionary-based approaches learn feature decompositions explicitly to match and rebuild Low-Resolution (LR) features. Building on this analysis, we introduce Trainable Feature Matching (TFM) to amalgamate this explicit feature learning into CNNs, augmenting their representation capabilities. Within TFM, trainable feature sets are integrated to explicitly learn features from training images through feature matching. Furthermore, we integrate non-local and channel attention into our proposed Trainable Feature Matching Attention Network (TFMAN) to further enhance SR performance. To alleviate the computational demands of non-local operations, we propose a streamlined variant called Same-size-divided Region-level Non-Local (SRNL). SRNL conducts non-local computations in parallel on blocks uniformly divided from the input feature map. The efficacy of TFM and SRNL is validated through ablation studies and module explorations. We employ a recurrent convolutional network as the backbone of our TFMAN to optimize parameter utilization. Comprehensive experiments on benchmark datasets demonstrate that TFMAN achieves superior results in most comparisons while using fewer parameters. The code is available at this https URL.

[CV-50] Neural Radiance Fields for Novel View Synthesis in Monocular Gastroscopy

链接: https://arxiv.org/abs/2405.18863
作者: Zijie Jiang,Yusuke Monno,Masatoshi Okutomi,Sho Suzuki,Kenji Miki
关键词: synthesis of arbitrarily, promising topic, Enabling the synthesis, Poisson surface reconstruction, arbitrarily novel viewpoint
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted for EMBC 2024

点击查看摘要

Abstract:Enabling the synthesis of arbitrarily novel viewpoint images within a patient’s stomach from pre-captured monocular gastroscopic images is a promising topic in stomach diagnosis. Typical methods to achieve this objective integrate traditional 3D reconstruction techniques, including structure-from-motion (SfM) and Poisson surface reconstruction. These methods produce explicit 3D representations, such as point clouds and meshes, thereby enabling the rendering of the images from novel viewpoints. However, the existence of low-texture and non-Lambertian regions within the stomach often results in noisy and incomplete reconstructions of point clouds and meshes, hindering the attainment of high-quality image rendering. In this paper, we apply the emerging technique of neural radiance fields (NeRF) to monocular gastroscopic data for synthesizing photo-realistic images for novel viewpoints. To address the performance degradation due to view sparsity in local regions of monocular gastroscopy, we incorporate geometry priors from a pre-reconstructed point cloud into the training of NeRF, which introduces a novel geometry-based loss to both pre-captured observed views and generated unobserved views. Compared to other recent NeRF methods, our approach showcases high-fidelity image renderings from novel viewpoints within the stomach both qualitatively and quantitatively.
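The geometry-based loss can be pictured as penalizing rendered ray depths that deviate from depths supplied by the pre-reconstructed point cloud, wherever that prior exists; a hedged sketch (the squared-error form and the validity masking are assumptions, not the paper's exact loss):

```python
import numpy as np

def geometry_prior_loss(rendered_depth, prior_depth, valid_mask):
    """Mean squared depth error over rays with a valid point-cloud prior."""
    diff = (rendered_depth - prior_depth) ** 2
    return float((diff * valid_mask).sum() / max(valid_mask.sum(), 1))

loss = geometry_prior_loss(np.array([1.0, 2.0, 5.0]),
                           np.array([1.0, 3.0, 0.0]),
                           np.array([1, 1, 0]))   # third ray has no prior
```

Masking matters here because the point cloud is noisy and incomplete, so only rays that actually hit reconstructed geometry should be supervised.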

[CV-51] Domain-Inspired Sharpness-Aware Minimization Under Domain Shifts

链接: https://arxiv.org/abs/2405.18861
作者: Ruipeng Zhang,Ziqing Fan,Jiangchao Yao,Ya Zhang,Yanfeng Wang
关键词: Domain-Inspired Sharpness-Aware Minimization, Sharpness-Aware Minimization, paper presents, presents a Domain-Inspired, Domain-Inspired Sharpness-Aware
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Published as a conference paper at ICLR 2024

点击查看摘要

Abstract:This paper presents a Domain-Inspired Sharpness-Aware Minimization (DISAM) algorithm for optimization under domain shifts. It is motivated by the inconsistent convergence degree of SAM across different domains, which induces optimization bias towards certain domains and thus impairs the overall convergence. To address this issue, we consider the domain-level convergence consistency in the sharpness estimation to prevent the overwhelming (deficient) perturbations for less (well) optimized domains. Specifically, DISAM introduces the constraint of minimizing variance in the domain loss, which allows the elastic gradient calibration in perturbation generation: when one domain is optimized above the averaging level w.r.t. loss, the gradient perturbation towards that domain will be weakened automatically, and vice versa. Under this mechanism, we theoretically show that DISAM can achieve faster overall convergence and improved generalization in principle when inconsistent convergence emerges. Extensive experiments on various domain generalization benchmarks show the superiority of DISAM over a range of state-of-the-art methods. Furthermore, we show the superior efficiency of DISAM in parameter-efficient fine-tuning combined with the pretraining models. The source code is released at this https URL.
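The elastic gradient calibration can be illustrated as per-domain weights that grow with a domain's loss relative to the batch average, so under-optimized domains receive stronger perturbations; a toy sketch of the idea (the linear form and the eta scale are assumptions, not the paper's exact formula):

```python
import numpy as np

def disam_weights(domain_losses, eta=1.0):
    """Per-domain perturbation weights, centered at 1 around the mean loss."""
    losses = np.asarray(domain_losses, dtype=float)
    w = 1.0 + eta * (losses - losses.mean())
    return np.clip(w, 0.0, None)         # keep weights non-negative

w = disam_weights([0.2, 0.5, 0.8])
# the well-optimized first domain is down-weighted, the last up-weighted
```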

[CV-52] SSGA-Net: Stepwise Spatial Global-local Aggregation Networks for for Autonomous Driving

链接: https://arxiv.org/abs/2405.18857
作者: Yiming Cui,Cheng Han,Dongfang Liu
关键词: Visual-based perception, autonomous driving, key module, module for autonomous, visual perception tasks
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Visual-based perception is the key module for autonomous driving. Among those visual perception tasks, video object detection is a primary yet challenging one because of feature degradation caused by fast motion or multiple poses. Current models usually aggregate features from the neighboring frames to enhance the object representations for the task heads to generate more accurate predictions. Though getting better performance, these methods rely on the information from the future frames and suffer from high computational complexity. Meanwhile, the aggregation process is not reconfigurable during the inference time. These issues make most of the existing models infeasible for online applications. To solve these problems, we introduce a stepwise spatial global-local aggregation network. Our proposed models mainly contain three parts: 1). Multi-stage stepwise network gradually refines the predictions and object representations from the previous stage; 2). Spatial global-local aggregation fuses the local information from the neighboring frames and global semantics from the current frame to eliminate the feature degradation; 3). Dynamic aggregation strategy stops the aggregation process early based on the refinement results to remove redundancy and improve efficiency. Extensive experiments on the ImageNet VID benchmark validate the effectiveness and efficiency of our proposed models.

[CV-53] Supervised Contrastive Learning for Snapshot Spectral Imaging Face Anti-Spoofing

链接: https://arxiv.org/abs/2405.18853
作者: Chuanbiao Song,Yan Hong,Jun Lan,Huijia Zhu,Weiqiang Wang,Jianfu Zhang
关键词: facial recognition systems, highly realistic silicone, Snapshot Spectral Imaging, cutting-edge re-balanced contrastive, re-balanced contrastive learning
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: We rank first at the Chalearn Snapshot Spectral Imaging Face Anti-spoofing Challenge on CVPR 2024; the paper is accepted by CVPR 2024 workshop;

点击查看摘要

Abstract:This study reveals a cutting-edge re-balanced contrastive learning strategy aimed at strengthening face anti-spoofing capabilities within facial recognition systems, with a focus on countering the challenges posed by printed photos, and highly realistic silicone or latex masks. Leveraging the HySpeFAS dataset, which benefits from Snapshot Spectral Imaging technology to provide hyperspectral images, our approach harmonizes class-level contrastive learning with data resampling and an innovative real-face oriented reweighting technique. This method effectively mitigates dataset imbalances and reduces identity-related biases. Notably, our strategy achieved an unprecedented 0.0000% Average Classification Error Rate (ACER) on the HySpeFAS dataset, ranking first at the Chalearn Snapshot Spectral Imaging Face Anti-spoofing Challenge on CVPR 2024.
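The ACER figure quoted above is simply the mean of the attack and bona-fide error rates; a quick sketch of how it is computed from binary predictions (label convention here: 1 = live, 0 = spoof):

```python
import numpy as np

def acer(y_true, y_pred):
    """Average Classification Error Rate = (APCER + BPCER) / 2."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    apcer = np.mean(y_pred[y_true == 0] == 1)   # attacks accepted as live
    bpcer = np.mean(y_pred[y_true == 1] == 0)   # live faces rejected
    return float((apcer + bpcer) / 2.0)

score = acer([0, 0, 1, 1], [0, 1, 1, 1])   # one attack slips through
```

A 0.0000% ACER therefore means every attack was rejected and every bona-fide face accepted on the evaluation set.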

[CV-54] LetsMap: Unsupervised Representation Learning for Semantic BEV Mapping

链接: https://arxiv.org/abs/2405.18852
作者: Nikhil Gosala,Kürsat Petek,B Ravi Kiran,Senthil Yogamani,Paulo Drews-Jr,Wolfram Burgard,Abhinav Valada
关键词: Bird Eye View, Semantic Bird Eye, Bird Eye, strong occlusion reasoning, Eye View
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: 23 pages, 5 figures

点击查看摘要

Abstract:Semantic Bird’s Eye View (BEV) maps offer a rich representation with strong occlusion reasoning for various decision making tasks in autonomous driving. However, most BEV mapping approaches employ a fully supervised learning paradigm that relies on large amounts of human-annotated BEV ground truth data. In this work, we address this limitation by proposing the first unsupervised representation learning approach to generate semantic BEV maps from a monocular frontal view (FV) image in a label-efficient manner. Our approach pretrains the network to independently reason about scene geometry and scene semantics using two disjoint neural pathways in an unsupervised manner and then finetunes it for the task of semantic BEV mapping using only a small fraction of labels in the BEV. We achieve label-free pretraining by exploiting spatial and temporal consistency of FV images to learn scene geometry while relying on a novel temporal masked autoencoder formulation to encode the scene representation. Extensive evaluations on the KITTI-360 and nuScenes datasets demonstrate that our approach performs on par with the existing state-of-the-art approaches while using only 1% of BEV labels and no additional labeled data.

[CV-55] SFANet: Spatial-Frequency Attention Network for Weather Forecasting

链接: https://arxiv.org/abs/2405.18849
作者: Jiaze Wang,Hao Chen,Hongcan Xu,Jinpeng Li,Bowen Wang,Kun Shao,Furui Liu,Huaxi Chen,Guangyong Chen,Pheng-Ann Heng
关键词: Weather forecasting plays, driving decision-making, risk management, plays a critical, critical role
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Weather forecasting plays a critical role in various sectors, driving decision-making and risk management. However, traditional methods often struggle to capture the complex dynamics of meteorological systems, particularly in the presence of high-resolution data. In this paper, we propose the Spatial-Frequency Attention Network (SFANet), a novel deep learning framework designed to address these challenges and enhance the accuracy of spatiotemporal weather prediction. Drawing inspiration from the limitations of existing methodologies, we present an innovative approach that seamlessly integrates advanced token mixing and attention mechanisms. By leveraging both pooling and spatial mixing strategies, SFANet optimizes the processing of high-dimensional spatiotemporal sequences, preserving inter-component relational information and modeling extensive long-range relationships. To further enhance feature integration, we introduce a novel spatial-frequency attention module, enabling the model to capture intricate cross-modal correlations. Our extensive experimental evaluation on two distinct datasets, the Storm EVent ImageRy (SEVIR) and the Institute for Climate and Application Research (ICAR) - El Niño Southern Oscillation (ENSO) dataset, demonstrates the remarkable performance of SFANet. Notably, SFANet achieves substantial advancements over state-of-the-art methods, showcasing its proficiency in forecasting precipitation patterns and predicting El Niño events.

[CV-56] Descriptive Image Quality Assessment in the Wild

链接: https://arxiv.org/abs/2405.18842
作者: Zhiyuan You,Jinjin Gu,Zheyuan Li,Xin Cai,Kaiwen Zhu,Tianfan Xue,Chao Dong
关键词: Vision Language Models, Vision Language, advancement of Vision, Language Models, Image Quality Assessment
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:With the rapid advancement of Vision Language Models (VLMs), VLM-based Image Quality Assessment (IQA) seeks to describe image quality linguistically to align with human expression and capture the multifaceted nature of IQA tasks. However, current methods are still far from practical usage. First, prior works focus narrowly on specific sub-tasks or settings, which do not align with diverse real-world applications. Second, their performance is sub-optimal due to limitations in dataset coverage, scale, and quality. To overcome these challenges, we introduce Depicted image Quality Assessment in the Wild (DepictQA-Wild). Our method includes a multi-functional IQA task paradigm that encompasses both assessment and comparison tasks, brief and detailed responses, full-reference and non-reference scenarios. We introduce a ground-truth-informed dataset construction approach to enhance data quality, and scale up the dataset to 495K under the brief-detail joint framework. Consequently, we construct a comprehensive, large-scale, and high-quality dataset, named DQ-495K. We also retain image resolution during training to better handle resolution-related quality issues, and estimate a confidence score that is helpful to filter out low-quality responses. Experimental results demonstrate that DepictQA-Wild significantly outperforms traditional score-based methods, prior VLM-based IQA models, and proprietary GPT-4V in distortion identification, instant rating, and reasoning tasks. Our advantages are further confirmed by real-world applications including assessing the web-downloaded images and ranking model-processed images. Datasets and codes will be released in this https URL.

[CV-57] Parameter-efficient Fine-tuning in Hyperspherical Space for Open-vocabulary Semantic Segmentation

链接: https://arxiv.org/abs/2405.18840
作者: Zelin Peng,Zhengqin Xu,Zhilin Zeng,Yaoming Wang,Lingxi Xie,Qi Tian,Wei Shen
关键词: arbitrary text descriptions, CLIP, CLIP text encoder, Open-vocabulary semantic segmentation, seeks to label
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Open-vocabulary semantic segmentation seeks to label each pixel in an image with arbitrary text descriptions. Vision-language foundation models, especially CLIP, have recently emerged as powerful tools for acquiring open-vocabulary capabilities. However, fine-tuning CLIP to equip it with pixel-level prediction ability often suffers from three issues: 1) high computational cost, 2) misalignment between the two inherent modalities of CLIP, and 3) degraded generalization ability on unseen categories. To address these issues, we propose H-CLIP, a symmetrical parameter-efficient fine-tuning (PEFT) strategy conducted in hyperspherical space for both of the two CLIP modalities. Specifically, the PEFT strategy is achieved by a series of efficient block-diagonal learnable transformation matrices and a dual cross-relation communication module among all learnable matrices. Since the PEFT strategy is conducted symmetrically to the two CLIP modalities, the misalignment between them is mitigated. Furthermore, we apply an additional constraint to PEFT on the CLIP text encoder according to the hyperspherical energy principle, i.e., minimizing hyperspherical energy during fine-tuning preserves the intrinsic structure of the original parameter space, to prevent the destruction of the generalization ability offered by the CLIP text encoder. Extensive evaluations across various benchmarks show that H-CLIP achieves new SOTA open-vocabulary semantic segmentation results while only requiring updating approximately 4% of the total parameters of CLIP.

[CV-58] MEGA: Masked Generative Autoencoder for Human Mesh Recovery

链接: https://arxiv.org/abs/2405.18839
作者: Guénolé Fiche,Simon Leglaive,Xavier Alameda-Pineda,Francesc Moreno-Noguer
关键词: highly ambiguous problem, Human Mesh Recovery, single RGB image, single RGB, Mesh Recovery
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Human Mesh Recovery (HMR) from a single RGB image is a highly ambiguous problem, as similar 2D projections can correspond to multiple 3D interpretations. Nevertheless, most HMR methods overlook this ambiguity and make a single prediction without accounting for the associated uncertainty. A few approaches generate a distribution of human meshes, enabling the sampling of multiple predictions; however, none of them is competitive with the latest single-output model when making a single prediction. This work proposes a new approach based on masked generative modeling. By tokenizing the human pose and shape, we formulate the HMR task as generating a sequence of discrete tokens conditioned on an input image. We introduce MEGA, a MaskEd Generative Autoencoder trained to recover human meshes from images and partial human mesh token sequences. Given an image, our flexible generation scheme allows us to predict a single human mesh in deterministic mode or to generate multiple human meshes in stochastic mode. MEGA enables us to propose multiple outputs and to evaluate the uncertainty of the predictions. Experiments on in-the-wild benchmarks show that MEGA achieves state-of-the-art performance in deterministic and stochastic modes, outperforming single-output and multi-output approaches.

[CV-59] Evaluating Zero-Shot GPT-4V Performance on 3D Visual Question Answering Benchmarks

链接: https://arxiv.org/abs/2405.18831
作者: Simranjit Singh,Georgios Pavlakos,Dimitrios Stamoulis
关键词: Visual Question Answering, Visual Question, Question Answering, paradigms influence existing, foundation models grows
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted at 1st Workshop on Multimodalities for 3D Scenes CVPR 2024

点击查看摘要

Abstract:As interest in “reformulating” the 3D Visual Question Answering (VQA) problem in the context of foundation models grows, it is imperative to assess how these new paradigms influence existing closed-vocabulary datasets. In this case study, we evaluate the zero-shot performance of foundational models (GPT-4 Vision and GPT-4) on well-established 3D VQA benchmarks, namely 3D-VQA and ScanQA. We provide an investigation to contextualize the performance of GPT-based agents relative to traditional modeling approaches. We find that GPT-based agents without any fine-tuning perform on par with the closed vocabulary approaches. Our findings corroborate recent results that “blind” models establish a surprisingly strong baseline in closed-vocabulary settings. We demonstrate that agents benefit significantly from scene-specific vocabulary via in-context textual grounding. By presenting a preliminary comparison with previous baselines, we hope to inform the community’s ongoing efforts to refine multi-modal 3D benchmarks.

[CV-60] Flow Priors for Linear Inverse Problems via Iterative Corrupted Trajectory Matching

链接: https://arxiv.org/abs/2405.18816
作者: Yasi Zhang,Peiyu Yu,Yaxuan Zhu,Yingshan Chang,Feng Gao,Ying Nian Wu,Oscar Leong
关键词: Generative models based, attracted significant attention, Generative models, high-resolution image synthesis, attracted significant
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Generative models based on flow matching have attracted significant attention for their simplicity and superior performance in high-resolution image synthesis. By leveraging the instantaneous change-of-variables formula, one can directly compute image likelihoods from a learned flow, making them enticing candidates as priors for downstream tasks such as inverse problems. In particular, a natural approach would be to incorporate such image probabilities in a maximum-a-posteriori (MAP) estimation problem. A major obstacle, however, lies in the slow computation of the log-likelihood, as it requires backpropagating through an ODE solver, which can be prohibitively slow for high-dimensional problems. In this work, we propose an iterative algorithm to approximate the MAP estimator efficiently to solve a variety of linear inverse problems. Our algorithm is mathematically justified by the observation that the MAP objective can be approximated by a sum of N "local MAP" objectives, where N is the number of function evaluations. By leveraging Tweedie's formula, we show that we can perform gradient steps to sequentially optimize these objectives. We validate our approach for various linear inverse problems, such as super-resolution, deblurring, inpainting, and compressed sensing, and demonstrate that we can outperform other methods based on flow matching.
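The key step in the abstract above leans on Tweedie's formula, which recovers a posterior mean from the score of the noisy marginal: E[x | y] = y + σ² ∇ log p(y). As an illustration (not the paper's algorithm), here is a minimal one-dimensional sketch where the score is known in closed form, so the formula can be checked exactly:

```python
def tweedie_denoise(y, sigma, score):
    """Tweedie's formula: E[x | y] = y + sigma^2 * d/dy log p(y),
    where p(y) is the marginal density of the noisy observation y."""
    return y + sigma ** 2 * score(y)

# Toy check with a standard-normal prior x ~ N(0, 1) and y = x + N(0, sigma^2):
# the marginal is y ~ N(0, 1 + sigma^2), so its score is -y / (1 + sigma^2),
# and Tweedie's formula recovers the exact posterior mean y / (1 + sigma^2).
sigma = 1.0
score = lambda y: -y / (1.0 + sigma ** 2)
print(tweedie_denoise(2.0, sigma, score))  # 1.0
```

In the paper's setting the score is not analytic; it comes from the learned flow, which is precisely what makes the approximation scheme above necessary.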

[CV-61] MindSemantix: Deciphering Brain Visual Experiences with a Brain-Language Model

链接: https://arxiv.org/abs/2405.18812
作者: Ziqi Ren,Jie Li,Xuetong Xue,Xin Li,Fan Yang,Zhicheng Jiao,Xinbo Gao
关键词: Deciphering the human, brain activities captured, human visual experience, brain, neuroscience research
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 13 pages, 6 figures

点击查看摘要

Abstract:Deciphering the human visual experience through brain activities captured by fMRI represents a compelling and cutting-edge challenge in the field of neuroscience research. Compared to merely predicting the viewed image itself, decoding brain activity into meaningful captions provides a higher-level interpretation and summarization of visual information, which naturally enhances the application flexibility in real-world situations. In this work, we introduce MindSemantix, a novel multi-modal framework that enables LLMs to comprehend visually-evoked semantic content in brain activity. Our MindSemantix explores a more ideal brain captioning paradigm by weaving LLMs into brain activity analysis, crafting a seamless, end-to-end Brain-Language Model. To effectively capture semantic information from brain responses, we propose Brain-Text Transformer, utilizing a Brain Q-Former as its core architecture. It integrates a pre-trained brain encoder with a frozen LLM to achieve multi-modal alignment of brain-vision-language and establish a robust brain-language correspondence. To enhance the generalizability of neural representations, we pre-train our brain encoder on a large-scale, cross-subject fMRI dataset using self-supervised learning techniques. MindSemantix provides more feasibility to downstream brain decoding tasks such as stimulus reconstruction. Conditioned by MindSemantix captioning, our framework facilitates this process by integrating with advanced generative models like Stable Diffusion and excels in understanding brain visual perception. MindSemantix generates high-quality captions that are deeply rooted in the visual and semantic information derived from brain activity. This approach has demonstrated substantial quantitative improvements over prior art. Our code will be released.

[CV-62] UniPTS: A Unified Framework for Proficient Post-Training Sparsity

链接: https://arxiv.org/abs/2405.18810
作者: Jingjing Xie,Yuxin Zhang,Mingbao Lin,Zhihang Lin,Liujuan Cao,Rongrong Ji
关键词: recently emerged avenue, Post-training Sparsity, Existing PTS methods, chases efficient network, PTS
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted by CVPR2024

点击查看摘要

Abstract:Post-training Sparsity (PTS) is a recently emerged avenue that pursues efficient network sparsity using only limited data. Existing PTS methods, however, undergo significant performance degradation compared with traditional methods that retrain the sparse networks via the whole dataset, especially at high sparsity ratios. In this paper, we attempt to reconcile this disparity by transposing three cardinal factors that profoundly alter the performance of conventional sparsity into the context of PTS. Our endeavors particularly comprise (1) A base-decayed sparsity objective that promotes efficient knowledge transferring from dense network to the sparse counterpart. (2) A reducing-regrowing search algorithm designed to ascertain the optimal sparsity distribution while circumventing overfitting to the small calibration set in PTS. (3) The employment of dynamic sparse training predicated on the preceding aspects, aimed at comprehensively optimizing the sparsity structure while ensuring training stability. Our proposed framework, termed UniPTS, is validated to be much superior to existing PTS methods across extensive benchmarks. As an illustration, it amplifies the performance of POT, a recently proposed recipe, from 3.9% to 68.6% when pruning ResNet-50 at 90% sparsity ratio on ImageNet. We release the code of our paper at this https URL.

[CV-63] BRACTIVE: A Brain Activation Approach to Human Visual Brain Learning

链接: https://arxiv.org/abs/2405.18808
作者: Xuan-Bac Nguyen,Hojin Jang,Xin Li,Samee U. Khan,Pawan Sinha,Khoa Luu
关键词: efficient processing unit, highly efficient processing, processing unit, human visual brain, highly efficient
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The human brain is a highly efficient processing unit, and understanding how it works can inspire new algorithms and architectures in machine learning. In this work, we introduce a novel framework named Brain Activation Network (BRACTIVE), a transformer-based approach to studying the human visual brain. The main objective of BRACTIVE is to align the visual features of subjects with corresponding brain representations via fMRI signals. It allows us to identify the brain’s Regions of Interest (ROI) of the subjects. Unlike previous brain research methods, which can only identify ROIs for one subject at a time and are limited by the number of subjects, BRACTIVE automatically extends this identification to multiple subjects and ROIs. Our experiments demonstrate that BRACTIVE effectively identifies person-specific regions of interest, such as face and body-selective areas, aligning with neuroscience findings and indicating potential applicability to various object categories. More importantly, we found that leveraging human visual brain activity to guide deep neural networks enhances performance across various benchmarks. It encourages the potential of BRACTIVE in both neuroscience and machine intelligence studies.

[CV-64] SketchTriplet: Self-Supervised Scenarized Sketch-Text-Image Triplet Generation

链接: https://arxiv.org/abs/2405.18801
作者: Zhenbei Wu,Qiang Wang,Jie Yang
关键词: free-hand sketch presents, challenging problem, scarcity of free-hand, presents a challenging, sketch
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The scarcity of free-hand sketch presents a challenging problem. Despite the emergence of some large-scale sketch datasets, these datasets primarily consist of sketches at the single-object level. There continues to be a lack of large-scale paired datasets for scene sketches. In this paper, we propose a self-supervised method for scene sketch generation that does not rely on any existing scene sketch, enabling the transformation of single-object sketches into scene sketches. To accomplish this, we introduce a method for vector sketch captioning and sketch semantic expansion. Additionally, we design a sketch generation network that incorporates a fusion of multi-modal perceptual constraints, suitable for application in zero-shot image-to-sketch downstream task, demonstrating state-of-the-art performance through experimental validation. Finally, leveraging our proposed sketch-to-sketch generation method, we contribute a large-scale dataset centered around scene sketches, comprising highly semantically consistent “text-sketch-image” triplets. Our research confirms that this dataset can significantly enhance the capabilities of existing models in sketch-based image retrieval and sketch-controlled image synthesis tasks. We will make our dataset and code publicly available.

[CV-65] Face processing emerges from object-trained convolutional neural networks

链接: https://arxiv.org/abs/2405.18800
作者: Zhenhua Zhao,Ji Chen,Zhicheng Lin,Haojiang Ying
关键词: domain-specific neurocognitive mechanisms, depends on unique, domain-specific neurocognitive, long been debated, domain-general object recognition
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 31 pages, 5 Figures

点击查看摘要

Abstract:Whether face processing depends on unique, domain-specific neurocognitive mechanisms or domain-general object recognition mechanisms has long been debated. Directly testing these competing hypotheses in humans has proven challenging due to extensive exposure to both faces and objects. Here, we systematically test these hypotheses by capitalizing on recent progress in convolutional neural networks (CNNs) that can be trained without face exposure (i.e., pre-trained weights). Domain-general mechanism accounts posit that face processing can emerge from a neural network without specialized pre-training on faces. Consequently, we trained CNNs solely on objects and tested their ability to recognize and represent faces as well as objects that look like faces (face pareidolia stimuli)… Due to character limits, see the attached PDF for more details.

[CV-66] Opinion-Unaware Blind Image Quality Assessment using Multi-Scale Deep Feature Statistics

链接: https://arxiv.org/abs/2405.18790
作者: Zhangkai Ni,Yue Liu,Keyan Ding,Wenhan Yang,Hanli Wang,Shiqi Wang
关键词: Deep learning-based methods, Deep Feature Statistics, human rating data, Multi-scale Deep Feature, influenced the blind
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Image and Video Processing (eess.IV)
*备注: Accepted to IEEE Transactions on Multimedia 2024

点击查看摘要

Abstract:Deep learning-based methods have significantly influenced the blind image quality assessment (BIQA) field, however, these methods often require training using large amounts of human rating data. In contrast, traditional knowledge-based methods are cost-effective for training but face challenges in effectively extracting features aligned with human visual perception. To bridge these gaps, we propose integrating deep features from pre-trained visual models with a statistical analysis model into a Multi-scale Deep Feature Statistics (MDFS) model for achieving opinion-unaware BIQA (OU-BIQA), thereby eliminating the reliance on human rating data and significantly improving training efficiency. Specifically, we extract patch-wise multi-scale features from pre-trained vision models, which are subsequently fitted into a multivariate Gaussian (MVG) model. The final quality score is determined by quantifying the distance between the MVG model derived from the test image and the benchmark MVG model derived from the high-quality image set. A comprehensive series of experiments conducted on various datasets show that our proposed model exhibits superior consistency with human visual perception compared to state-of-the-art BIQA models. Furthermore, it shows improved generalizability across diverse target-specific BIQA tasks. Our code is available at: this https URL
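The final quality score described above is a distance between two multivariate Gaussian (MVG) models. The abstract does not spell out which distance is used; a common NIQE-style choice, shown here purely as an assumption-laden two-dimensional sketch, is the Mahalanobis-like distance under the averaged covariance, sqrt((μ₁−μ₂)ᵀ ((Σ₁+Σ₂)/2)⁻¹ (μ₁−μ₂)):

```python
import math

def mvg_distance(mu1, cov1, mu2, cov2):
    """NIQE-style distance between two 2-D Gaussian models:
    sqrt((mu1-mu2)^T ((cov1+cov2)/2)^{-1} (mu1-mu2))."""
    d = [mu1[0] - mu2[0], mu1[1] - mu2[1]]
    # Average the two 2x2 covariance matrices entrywise.
    a = (cov1[0][0] + cov2[0][0]) / 2.0
    b = (cov1[0][1] + cov2[0][1]) / 2.0
    c = (cov1[1][0] + cov2[1][0]) / 2.0
    e = (cov1[1][1] + cov2[1][1]) / 2.0
    # Closed-form 2x2 matrix inverse.
    det = a * e - b * c
    inv = [[e / det, -b / det], [-c / det, a / det]]
    # Quadratic form d^T inv d.
    q = (d[0] * (inv[0][0] * d[0] + inv[0][1] * d[1])
         + d[1] * (inv[1][0] * d[0] + inv[1][1] * d[1]))
    return math.sqrt(q)

identity = [[1.0, 0.0], [0.0, 1.0]]
print(mvg_distance([0.0, 0.0], identity, [3.0, 4.0], identity))  # 5.0
```

With identity covariances the distance reduces to the Euclidean distance between the means, which is why the toy call returns 5.0; in the MDFS setting the two models would instead be fitted to deep-feature statistics.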

[CV-67] MOKD: Cross-domain Finetuning for Few-shot Classification via Maximizing Optimized Kernel Dependence

链接: https://arxiv.org/abs/2405.18786
作者: Hongduan Tian,Feng Liu,Tongliang Liu,Bo Du,Yiu-ming Cheung,Bo Han
关键词: cross-domain few-shot classification, nearest centroid classifier, few-shot classification, cross-domain few-shot, construct a metric
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In cross-domain few-shot classification, the nearest centroid classifier (NCC) aims to learn representations to construct a metric space where few-shot classification can be performed by measuring the similarities between samples and the prototype of each class. An intuition behind NCC is that each sample is pulled closer to the class centroid it belongs to while pushed away from those of other classes. However, in this paper, we find that there exist high similarities between NCC-learned representations of two samples from different classes. In order to address this problem, we propose a bi-level optimization framework, maximizing optimized kernel dependence (MOKD), to learn a set of class-specific representations that match the cluster structures indicated by labeled data of the given task. Specifically, MOKD first optimizes the kernel adopted in the Hilbert-Schmidt independence criterion (HSIC) to obtain the optimized kernel HSIC (opt-HSIC) that can capture the dependence more precisely. Then, an optimization problem regarding the opt-HSIC is addressed to simultaneously maximize the dependence between representations and labels and minimize the dependence among all samples. Extensive experiments on Meta-Dataset demonstrate that MOKD can not only achieve better generalization performance on unseen domains in most cases but also learn better data representation clusters. The project repository of MOKD is available at: this https URL.
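MOKD is built on the Hilbert-Schmidt independence criterion. Independent of the paper's specifics, the biased empirical HSIC estimator has a compact closed form, trace(KHLH)/(n−1)², where K and L are kernel matrices and H = I − (1/n)11ᵀ is the centering matrix. A small pure-Python sketch:

```python
def matmul(A, B):
    """Plain nested-list matrix multiplication."""
    n, m, p = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

def hsic(K, L):
    """Biased empirical HSIC(K, L) = trace(K H L H) / (n-1)^2,
    with the centering matrix H = I - (1/n) * 11^T."""
    n = len(K)
    H = [[(1.0 if i == j else 0.0) - 1.0 / n for j in range(n)]
         for i in range(n)]
    M = matmul(matmul(matmul(K, H), L), H)
    return sum(M[i][i] for i in range(n)) / (n - 1) ** 2

K = [[1.0, 0.5], [0.5, 1.0]]
L = [[1.0, 0.2], [0.2, 1.0]]
print(abs(hsic(K, L) - hsic(L, K)) < 1e-12)  # True (HSIC is symmetric)
```

By the cyclic property of the trace, HSIC is symmetric in its two kernel arguments, and it is non-negative for positive semi-definite kernels; "optimizing the kernel" in opt-HSIC means learning the kernel that enters K and L before this quantity is maximized or minimized.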

[CV-68] LP-3DGS: Learning to Prune 3D Gaussian Splatting

链接: https://arxiv.org/abs/2405.18784
作者: Zhaoliang Zhang,Tianchen Song,Yongjae Lee,Li Yang,Cheng Peng,Rama Chellappa,Deliang Fan
关键词: Gaussian Splatting, fast rendering speed, view synthesis, mainstream methodologies, pruning ratio
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Recently, 3D Gaussian Splatting (3DGS) has become one of the mainstream methodologies for novel view synthesis (NVS) due to its high quality and fast rendering speed. However, as a point-based scene representation, 3DGS potentially generates a large number of Gaussians to fit the scene, leading to high memory usage. Improvements that have been proposed require either an empirical and preset pruning ratio or importance score threshold to prune the point cloud. Such hyperparameters require multiple rounds of training to optimize in order to reach the maximum pruning ratio while maintaining the rendering quality for each scene. In this work, we propose learning-to-prune 3DGS (LP-3DGS), where a trainable binary mask is applied to the importance score that can find the optimal pruning ratio automatically. Instead of using the traditional straight-through estimator (STE) method to approximate the binary mask gradient, we redesign the masking function to leverage the Gumbel-Sigmoid method, making it differentiable and compatible with the existing training process of 3DGS. Extensive experiments have shown that LP-3DGS consistently produces a good balance that is both efficient and high quality.
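The Gumbel-Sigmoid (Binary-Concrete) trick mentioned above replaces a hard 0/1 mask with a differentiable soft mask by injecting logistic noise into the logit before a temperature-scaled sigmoid. A minimal sketch of sampling one mask entry; the function name and the temperature value are illustrative, not taken from the paper:

```python
import math
import random

def gumbel_sigmoid(logit, tau, rng):
    """Binary-Concrete / Gumbel-Sigmoid relaxation of a Bernoulli mask entry:
    add Logistic(0, 1) noise to the logit, then squash with a
    temperature-scaled sigmoid, so the soft mask stays differentiable
    with respect to the logit."""
    u = rng.random()
    noise = math.log(u) - math.log(1.0 - u)  # Logistic(0, 1) sample
    return 1.0 / (1.0 + math.exp(-(logit + noise) / tau))

rng = random.Random(0)
soft_mask = [gumbel_sigmoid(2.0, 1.0, rng) for _ in range(5)]
print(all(0.0 < m < 1.0 for m in soft_mask))  # True
```

As the temperature tau is annealed toward 0, the soft values saturate toward 0 or 1, recovering a nearly binary mask while gradients still flow through the logit, which is the property that lets the pruning ratio be learned jointly with the scene.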

[CV-69] LLaMA-Reg: Using LLaMA 2 for Unsupervised Medical Image Registration

链接: https://arxiv.org/abs/2405.18774
作者: Mingrui Ma,Yu Yang
关键词: large language model, large language, Medical image registration, language model, pretrained large language
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Medical image registration is an essential topic in medical image analysis. In this paper, we propose a method for medical image registration using a pretrained large language model. We find that using the pretrained large language model to encode deep features of the medical images in the registration model can effectively improve image registration accuracy, indicating the great potential of the large language model in medical image registration tasks. We use dual encoders to perform deep feature extraction on image pairs and then input the features into the pretrained large language model. To adapt the large language model to our registration task, the weights of the large language model are frozen in the registration model, and an adapter is utilized to fine-tune the large language model, which aims at (a) mapping the visual tokens to the language space before the large language model computes, and (b) projecting the language tokens output by the large language model back to the visual space. Our method combines output features from the fine-tuned large language model with the features output from each encoder layer to gradually generate the deformation fields required for registration in the decoder. To demonstrate the effectiveness of the large language model in registration tasks, we conducted experiments on knee and brain MRI and achieved state-of-the-art results.

[CV-70] Leveraging Many-To-Many Relationships for Defending Against Visual-Language Adversarial Attacks

链接: https://arxiv.org/abs/2405.18770
作者: Futa Waseda,Antonio Tejero-de-Pablos
关键词: Recent studies, revealed that vision-language, ITR, adversarial attacks, adversarial
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注: Under review

点击查看摘要

Abstract:Recent studies have revealed that vision-language (VL) models are vulnerable to adversarial attacks for image-text retrieval (ITR). However, existing defense strategies for VL models primarily focus on zero-shot image classification, which do not consider the simultaneous manipulation of image and text, as well as the inherent many-to-many (N:N) nature of ITR, where a single image can be described in numerous ways, and vice versa. To this end, this paper studies defense strategies against adversarial attacks on VL models for ITR for the first time. Particularly, we focus on how to leverage the N:N relationship in ITR to enhance adversarial robustness. We found that, although adversarial training easily overfits to specific one-to-one (1:1) image-text pairs in the train data, diverse augmentation techniques to create one-to-many (1:N) / many-to-one (N:1) image-text pairs can significantly improve adversarial robustness in VL models. Additionally, we show that the alignment of the augmented image-text pairs is crucial for the effectiveness of the defense strategy, and that inappropriate augmentations can even degrade the model’s performance. Based on these findings, we propose a novel defense strategy that leverages the N:N relationship in ITR, which effectively generates diverse yet highly-aligned N:N pairs using basic augmentations and generative model-based augmentations. This work provides a novel perspective on defending against adversarial attacks in VL tasks and opens up new research directions for future work.

[CV-71] OUS: Scene-Guided Dynamic Facial Expression Recognition

链接: https://arxiv.org/abs/2405.18769
作者: Xinji Mai,Haoran Wang,Zeng Tao,Junxiong Lin,Shaoqi Yan,Yan Wang,Jing Liu,Jiawen Yu,Xuan Tong,Yating Li,Wenqiang Zhang
关键词: Rigid Cognitive Problem, Dynamic Facial Expression, Facial Expression Recognition, Rigid Cognitive, Dynamic Facial
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 12 pages, 6 figures, 6 tables

点击查看摘要

Abstract:Dynamic Facial Expression Recognition (DFER) is crucial for affective computing but often overlooks the impact of scene context. We have identified a significant issue in current DFER tasks: human annotators typically integrate emotions from various angles, including environmental cues and body language, whereas existing DFER methods tend to consider the scene as noise that needs to be filtered out, focusing solely on facial information. We refer to this as the Rigid Cognitive Problem. The Rigid Cognitive Problem can lead to discrepancies between the cognition of annotators and models in some samples. To align more closely with the human cognitive paradigm of emotions, we propose an Overall Understanding of the Scene DFER method (OUS). OUS effectively integrates scene and facial features, combining scene-specific emotional knowledge for DFER. Extensive experiments on the two largest datasets in the DFER field, DFEW and FERV39k, demonstrate that OUS significantly outperforms existing methods. By analyzing the Rigid Cognitive Problem, OUS successfully understands the complex relationship between scene context and emotional expression, closely aligning with human emotional understanding in real-world scenarios.

[CV-72] Inpaint Biases: A Pathway to Accurate and Unbiased Image Generation

链接: https://arxiv.org/abs/2405.18762
作者: Jiyoon Myung,Jihyeon Park
关键词: accurately rendering unconventional, rendering unconventional concepts, training datasets, paper examines, accurately rendering
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper examines the limitations of advanced text-to-image models in accurately rendering unconventional concepts which are scarcely represented or absent in their training datasets. We identify how these limitations not only confine the creative potential of these models but also pose risks of reinforcing stereotypes. To address these challenges, we introduce the Inpaint Biases framework, which employs user-defined masks and inpainting techniques to enhance the accuracy of image generation, particularly for novel or inaccurately rendered objects. Through experimental validation, we demonstrate how this framework significantly improves the fidelity of generated images to the user’s intent, thereby expanding the models’ creative capabilities and mitigating the risk of perpetuating biases. Our study contributes to the advancement of text-to-image models as unbiased, versatile tools for creative expression.

[CV-73] Provable Contrastive Continual Learning

链接: https://arxiv.org/abs/2405.18756
作者: Yichen Wen,Zhiquan Tan,Kaipeng Zheng,Chuanlong Xie,Weiran Huang
关键词: dynamic data distributions, Continual learning, requires learning incremental, contrastive continual learning, Continual learning requires
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Applications (stat.AP); Machine Learning (stat.ML)
*备注: Accepted by ICML 2024

点击查看摘要

Abstract:Continual learning requires learning incremental tasks with dynamic data distributions. So far, it has been observed that employing a combination of contrastive loss and distillation loss for training in continual learning yields strong performance. To the best of our knowledge, however, this contrastive continual learning framework lacks convincing theoretical explanations. In this work, we fill this gap by establishing theoretical performance guarantees, which reveal how the performance of the model is bounded by training losses of previous tasks in the contrastive continual learning framework. Our theoretical explanations further support the idea that pre-training can benefit continual learning. Inspired by our theoretical analysis of these guarantees, we propose a novel contrastive continual learning algorithm called CILA, which uses adaptive distillation coefficients for different tasks. These distillation coefficients are easily computed by the ratio between average distillation losses and average contrastive losses from previous tasks. Our method shows great improvement on standard benchmarks and achieves new state-of-the-art performance.

[CV-74] On the Limits of Multi-modal Meta-Learning with Auxiliary Task Modulation Using Conditional Batch Normalization

链接: https://arxiv.org/abs/2405.18751
作者: Jordi Armengol-Estapé,Vincent Michalski,Ramnath Kumar,Pierre-Luc St-Charles,Doina Precup,Samira Ebrahimi Kahou
关键词: Few-shot learning aims, small number, Few-shot learning, bridge network, Few-shot
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Few-shot learning aims to learn representations that can tackle novel tasks given a small number of examples. Recent studies show that cross-modal learning can improve representations for few-shot classification. More specifically, language is a rich modality that can be used to guide visual learning. In this work, we experiment with a multi-modal architecture for few-shot learning that consists of three components: a classifier, an auxiliary network, and a bridge network. While the classifier performs the main classification task, the auxiliary network learns to predict language representations from the same input, and the bridge network transforms high-level features of the auxiliary network into modulation parameters for layers of the few-shot classifier using conditional batch normalization. The bridge should encourage a form of lightweight semantic alignment between language and vision which could be useful for the classifier. However, after evaluating the proposed approach on two popular few-shot classification benchmarks we find that a) the improvements do not reproduce across benchmarks, and b) when they do, the improvements are due to the additional compute and parameters introduced by the bridge network. We contribute insights and recommendations for future work in multi-modal meta-learning, especially when using language representations.
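Conditional batch normalization, as used by the bridge network above, normalizes features and then applies a condition-dependent scale and shift (the FiLM-style gamma/beta). A toy scalar sketch, with gamma and beta fixed where the bridge network would actually predict them from language features:

```python
import math

def conditional_batchnorm(features, gamma, beta, eps=1e-5):
    """Normalize a batch of scalar features to zero mean / unit variance,
    then modulate with condition-predicted scale (gamma) and shift (beta)."""
    n = len(features)
    mean = sum(features) / n
    var = sum((f - mean) ** 2 for f in features) / n
    return [gamma * (f - mean) / math.sqrt(var + eps) + beta
            for f in features]

# gamma/beta would come from the bridge network; fixed here for illustration.
out = conditional_batchnorm([1.0, 2.0, 3.0, 4.0], gamma=2.0, beta=0.5)
print(round(sum(out) / len(out), 6))  # 0.5 (mean shifts to beta)
```

Because the normalized activations have zero mean, the modulated output's mean equals beta and its spread is controlled by gamma, which is exactly how the conditioning signal steers the classifier's feature statistics.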

[CV-75] T2V-Turbo: Breaking the Quality Bottleneck of Video Consistency Model with Mixed Reward Feedback

链接: https://arxiv.org/abs/2405.18750
作者: Jiachen Li,Weixi Feng,Tsu-Jui Fu,Xinyi Wang,Sugato Basu,Wenhu Chen,William Yang Wang
关键词: achieved significant success, slow sampling speed, iterative sampling processes, achieved significant, significant success
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Project page: this https URL

点击查看摘要

Abstract:Diffusion-based text-to-video (T2V) models have achieved significant success but continue to be hampered by the slow sampling speed of their iterative sampling processes. To address the challenge, consistency models have been proposed to facilitate fast inference, albeit at the cost of sample quality. In this work, we aim to break the quality bottleneck of a video consistency model (VCM) to achieve both fast and high-quality video generation. We introduce T2V-Turbo, which integrates feedback from a mixture of differentiable reward models into the consistency distillation (CD) process of a pre-trained T2V model. Notably, we directly optimize rewards associated with single-step generations that arise naturally from computing the CD loss, effectively bypassing the memory constraints imposed by backpropagating gradients through an iterative sampling process. Remarkably, the 4-step generations from our T2V-Turbo achieve the highest total score on VBench, even surpassing Gen-2 and Pika. We further conduct human evaluations to corroborate the results, validating that the 4-step generations from our T2V-Turbo are preferred over the 50-step DDIM samples from their teacher models, representing more than a tenfold acceleration while improving video generation quality.

[CV-76] PanoNormal: Monocular Indoor 360° Surface Normal Estimation

链接: https://arxiv.org/abs/2405.18745
作者: Kun Huang,Fanglue Zhang,Neil Dodgson
关键词: dense regression computer, computer vision tasks, regression computer vision, surface normal estimation, Equirectangular image
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The presence of spherical distortion on the Equirectangular image is an acknowledged challenge in dense regression computer vision tasks, such as surface normal estimation. Recent advances in convolutional neural networks (CNNs) strive to mitigate spherical distortion but often fall short in capturing holistic structures effectively, primarily due to their fixed receptive field. On the other hand, vision transformers (ViTs) excel in establishing long-range dependencies through a global self-attention mechanism, yet they encounter limitations in preserving local details. We introduce PanoNormal, a monocular surface normal estimation architecture designed for 360° images, which combines the strengths of CNNs and ViTs. Specifically, we employ a multi-level global self-attention scheme with the consideration of the spherical feature distribution, enhancing the comprehensive understanding of the scene. Our experimental results demonstrate that our approach achieves state-of-the-art performance across multiple popular 360° monocular datasets. The code and models will be released.

[CV-77] WLC-Net: a robust and fast deep-learning wood-leaf classification method

链接: https://arxiv.org/abs/2405.18737
作者: Hanlong Li,Pei Wang,Yuhan Wu,Jing Ren,Yuhang Gao,Lingyun Zhang,Mingtai Zhang,Wenxin Chen
关键词: Wood-Leaf Classification Network, DGCNN, Krishna Moorthy method, LeWoS
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 41 pages, 14 figures, 5 tables

点击查看摘要

Abstract:Wood-leaf classification is an essential and fundamental prerequisite in the analysis and estimation of forest attributes from terrestrial laser scanning (TLS) point clouds, including critical measurements such as diameter at breast height (DBH), above-ground biomass (AGB), and wood volume. To address this, we introduce the Wood-Leaf Classification Network (WLC-Net), a deep learning model derived from PointNet++, designed to differentiate between wood and leaf points within tree point clouds. WLC-Net enhances classification accuracy, completeness, and speed by incorporating linearity as an inherent feature, refining the input-output framework, and optimizing the centroid sampling technique. WLC-Net was trained and assessed using three distinct tree species datasets, comprising a total of 102 individual tree point clouds: 21 Chinese ash trees, 21 willow trees, and 60 tropical trees. For comparative evaluation, five alternative methods, including PointNet++, DGCNN, Krishna Moorthy’s method, LeWoS, and Sun’s method, were also applied to these datasets. The classification accuracy of all six methods was quantified using three metrics: overall accuracy (OA), mean Intersection over Union (mIoU), and F1-score. Across all three datasets, WLC-Net demonstrated superior performance, achieving OA scores of 0.9778, 0.9712, and 0.9508; mIoU scores of 0.9761, 0.9693, and 0.9141; and F1-scores of 0.8628, 0.7938, and 0.9019, respectively. The time costs of WLC-Net were also recorded to evaluate its efficiency; the average processing time was 102.74 s per million points. In terms of visual inspection, accuracy evaluation, and efficiency evaluation, the results suggest that WLC-Net presents a promising approach for wood-leaf classification, distinguished by its high accuracy. In addition, WLC-Net also exhibits strong applicability across various tree point clouds and holds promise for further optimization.

[CV-78] PillarHist: A Quantization-aware Pillar Feature Encoder based on Height-aware Histogram

链接: https://arxiv.org/abs/2405.18734
作者: Sifan Zhou,Zhihang Yuan,Dawei Yang,Xubin Wen,Xing Hu,Yuguang Shi,Ziyu Zhao,Xiaobo Lu
关键词: Real-time and high-performance, object detection plays, driving and robotics, plays a critical, critical role
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注: 17 pages, 3 figures

点击查看摘要

Abstract:Real-time and high-performance 3D object detection plays a critical role in autonomous driving and robotics. Recent pillar-based 3D object detectors have gained significant attention due to their compact representation and low computational overhead, making them suitable for onboard deployment and quantization. However, existing pillar-based detectors still suffer from information loss along height dimension and large numerical distribution difference during pillar feature encoding (PFE), which severely limits their performance and quantization potential. To address above issue, we first unveil the importance of different input information during PFE and identify the height dimension as a key factor in enhancing 3D detection performance. Motivated by this observation, we propose a height-aware pillar feature encoder named PillarHist. Specifically, PillarHist statistics the discrete distribution of points at different heights within one pillar. This simple yet effective design greatly preserves the information along the height dimension while significantly reducing the computation overhead of the PFE. Meanwhile, PillarHist also constrains the arithmetic distribution of PFE input to a stable range, making it quantization-friendly. Notably, PillarHist operates exclusively within the PFE stage to enhance performance, enabling seamless integration into existing pillar-based methods without introducing complex operations. Extensive experiments show the effectiveness of PillarHist in terms of both efficiency and performance.
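The central idea, statistics of point heights within one pillar, reduces to a fixed-range histogram. A minimal sketch (bin count, height range, and normalization are illustrative choices, not the paper's configuration):

```python
import numpy as np

def pillar_height_histogram(points, num_bins=8, z_min=-3.0, z_max=1.0):
    """Encode one pillar by the discrete distribution of its points' heights.

    A minimal sketch of the idea behind PillarHist: counting points per
    height bin preserves information along the height dimension in a
    fixed-size, bounded-range feature, which is what makes the encoding
    quantization-friendly.
    """
    z = np.clip(points[:, 2], z_min, z_max)
    hist, _ = np.histogram(z, bins=num_bins, range=(z_min, z_max))
    return hist / max(len(points), 1)             # normalized counts in [0, 1]

rng = np.random.default_rng(2)
pillar_points = rng.uniform([-1, -1, -3], [1, 1, 1], size=(50, 3))  # x, y, z
feat = pillar_height_histogram(pillar_points)
```

Note how the output range is fixed regardless of the raw point coordinates, in line with the abstract's point about constraining the arithmetic distribution of the PFE input.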

[CV-79] Reverse the auditory processing pathway: Coarse-to-fine audio reconstruction from fMRI

链接: https://arxiv.org/abs/2405.18726
作者: Che Liu,Changde Du,Xiaoyu Chen,Huiguang He
关键词: Magnetic Resonance Imaging, Drawing inspiration, high-level semantic understanding, low-level acoustic features, functional Magnetic Resonance
类目: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Drawing inspiration from the hierarchical processing of the human auditory system, which transforms sound from low-level acoustic features to high-level semantic understanding, we introduce a novel coarse-to-fine audio reconstruction method. Leveraging non-invasive functional Magnetic Resonance Imaging (fMRI) data, our approach mimics the inverse pathway of auditory processing. Initially, we utilize CLAP to decode fMRI data coarsely into a low-dimensional semantic space, followed by a fine-grained decoding into the high-dimensional AudioMAE latent space guided by semantic features. These fine-grained neural features serve as conditions for audio reconstruction through a Latent Diffusion Model (LDM). Validation on three public fMRI datasets-Brain2Sound, Brain2Music, and Brain2Speech-underscores the superiority of our coarse-to-fine decoding method over stand-alone fine-grained approaches, showcasing state-of-the-art performance in metrics like FD, FAD, and KL. Moreover, by employing semantic prompts during decoding, we enhance the quality of reconstructed audio when semantic features are suboptimal. The demonstrated versatility of our model across diverse stimuli highlights its potential as a universal brain-to-audio framework. This research contributes to the comprehension of the human auditory system, pushing boundaries in neural decoding and audio reconstruction methodologies.

[CV-80] Correctable Landmark Discovery via Large Models for Vision-Language Navigation

链接: https://arxiv.org/abs/2405.18721
作者: Bingqian Lin,Yunshuang Nie,Ziming Wei,Yi Zhu,Hang Xu,Shikui Ma,Jianzhuang Liu,Xiaodan Liang
关键词: follow language instructions, target position, VLN, LaNdmark DiScOvery, follow language
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: Accepted by TPAMI 2024

点击查看摘要

Abstract:Vision-Language Navigation (VLN) requires the agent to follow language instructions to reach a target position. A key factor for successful navigation is to align the landmarks implied in the instruction with diverse visual observations. However, previous VLN agents fail to perform accurate modality alignment especially in unexplored scenes, since they learn from limited navigation data and lack sufficient open-world alignment knowledge. In this work, we propose a new VLN paradigm, called COrrectable LaNdmark DiScOvery via Large ModEls (CONSOLE). In CONSOLE, we cast VLN as an open-world sequential landmark discovery problem, by introducing a novel correctable landmark discovery scheme based on two large models ChatGPT and CLIP. Specifically, we use ChatGPT to provide rich open-world landmark cooccurrence commonsense, and conduct CLIP-driven landmark discovery based on these commonsense priors. To mitigate the noise in the priors due to the lack of visual constraints, we introduce a learnable cooccurrence scoring module, which corrects the importance of each cooccurrence according to actual observations for accurate landmark discovery. We further design an observation enhancement strategy for an elegant combination of our framework with different VLN agents, where we utilize the corrected landmark features to obtain enhanced observation features for action decision. Extensive experimental results on multiple popular VLN benchmarks (R2R, REVERIE, R4R, RxR) show the significant superiority of CONSOLE over strong baselines. Especially, our CONSOLE establishes the new state-of-the-art results on R2R and R4R in unseen scenarios. Code is available at this https URL.

[CV-81] SketchDeco: Decorating B&W Sketches with Colour

链接: https://arxiv.org/abs/2405.18716
作者: Chaitat Utintu,Pinaki Nath Chowdhury,Aneeshan Sain,Subhadeep Koley,Ayan Kumar Bhunia,Yi-Zhe Song
关键词: universal childhood activity, design and story-boarding, paper introduces, approach to sketch, universal childhood
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This paper introduces a novel approach to sketch colourisation, inspired by the universal childhood activity of colouring and its professional applications in design and story-boarding. Striking a balance between precision and convenience, our method utilises region masks and colour palettes to allow intuitive user control, steering clear of the meticulousness of manual colour assignments or the limitations of textual prompts. By strategically combining ControlNet and staged generation, incorporating Stable Diffusion v1.5, and leveraging BLIP-2 text prompts, our methodology facilitates faithful image generation and user-directed colourisation. Addressing challenges of local and global consistency, we employ inventive solutions such as an inversion scheme, guided sampling, and a self-attention mechanism with a scaling factor. The resulting tool is not only fast and training-free but also compatible with consumer-grade Nvidia RTX 4090 Super GPUs, making it a valuable asset for both creative professionals and enthusiasts in various fields. Project Page: this https URL

[CV-82] NeRF On-the-go: Exploiting Uncertainty for Distractor-free NeRFs in the Wild

链接: https://arxiv.org/abs/2405.18715
作者: Weining Ren,Zihan Zhu,Boyang Sun,Jiaqi Chen,Marc Pollefeys,Songyou Peng
关键词: Neural Radiance Fields, Neural Radiance, Radiance Fields, shown remarkable success, synthesizing photorealistic views
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: CVPR 2024, first two authors contributed equally. Project Page: this https URL

点击查看摘要

Abstract:Neural Radiance Fields (NeRFs) have shown remarkable success in synthesizing photorealistic views from multi-view images of static scenes, but face challenges in dynamic, real-world environments with distractors like moving objects, shadows, and lighting changes. Existing methods manage controlled environments and low occlusion ratios but fall short in render quality, especially under high occlusion scenarios. In this paper, we introduce NeRF On-the-go, a simple yet effective approach that enables the robust synthesis of novel views in complex, in-the-wild scenes from only casually captured image sequences. Delving into uncertainty, our method not only efficiently eliminates distractors, even when they are predominant in captures, but also achieves a notably faster convergence speed. Through comprehensive experiments on various scenes, our method demonstrates a significant improvement over state-of-the-art techniques. This advancement opens new avenues for NeRF in diverse and dynamic real-world applications.

[CV-83] FocSAM: Delving Deeply into Focused Objects in Segmenting Anything

链接: https://arxiv.org/abs/2405.18706
作者: You Huang,Zongyu Lan,Liujuan Cao,Xianming Lin,Shengchuan Zhang,Guannan Jiang,Rongrong Ji
关键词: handle diverse prompts, robust zero-shot capabilities, Segment Anything Model, marks a notable, diverse prompts
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to CVPR 2024

点击查看摘要

Abstract:The Segment Anything Model (SAM) marks a notable milestone in segmentation models, highlighted by its robust zero-shot capabilities and ability to handle diverse prompts. SAM follows a pipeline that separates interactive segmentation into image preprocessing through a large encoder and interactive inference via a lightweight decoder, ensuring efficient real-time performance. However, SAM faces stability issues in challenging samples upon this pipeline. These issues arise from two main factors. Firstly, the image preprocessing disables SAM from dynamically using image-level zoom-in strategies to refocus on the target object during interaction. Secondly, the lightweight decoder struggles to sufficiently integrate interactive information with image embeddings. To address these two limitations, we propose FocSAM with a pipeline redesigned on two pivotal aspects. First, we propose Dynamic Window Multi-head Self-Attention (Dwin-MSA) to dynamically refocus SAM’s image embeddings on the target object. Dwin-MSA localizes attention computations around the target object, enhancing object-related embeddings with minimal computational overhead. Second, we propose Pixel-wise Dynamic ReLU (P-DyReLU) to enable sufficient integration of interactive information from a few initial clicks that have significant impacts on the overall segmentation results. Experimentally, FocSAM augments SAM’s interactive segmentation performance to match the existing state-of-the-art method in segmentation quality, requiring only about 5.6% of this method’s inference time on CPUs.

[CV-84] Multi-Condition Latent Diffusion Network for Scene-Aware Neural Human Motion Prediction

链接: https://arxiv.org/abs/2405.18700
作者: Xuehao Gao,Yang Yang,Yang Wu,Shaoyi Du,Guo-Jun Qi
关键词: including understanding human, understanding human activity, human motion, human motion prediction, inferring human motion
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by IEEE Transactions on Image Processing

点击查看摘要

Abstract:Inferring 3D human motion is fundamental in many applications, including understanding human activity and analyzing one’s intention. While many fruitful efforts have been made to human motion prediction, most approaches focus on pose-driven prediction and inferring human motion in isolation from the contextual environment, thus leaving the body location movement in the scene behind. However, real-world human movements are goal-directed and highly influenced by the spatial layout of their surrounding scenes. In this paper, instead of planning future human motion in a ‘dark’ room, we propose a Multi-Condition Latent Diffusion network (MCLD) that reformulates the human motion prediction task as a multi-condition joint inference problem based on the given historical 3D body motion and the current 3D scene contexts. Specifically, instead of directly modeling joint distribution over the raw motion sequences, MCLD performs a conditional diffusion process within the latent embedding space, characterizing the cross-modal mapping from the past body movement and current scene context condition embeddings to the future human motion embedding. Extensive experiments on large-scale human motion prediction datasets demonstrate that our MCLD achieves significant improvements over the state-of-the-art methods on both realistic and diverse predictions.

[CV-85] Learning Diffeomorphism for Image Registration with Time-Continuous Networks using Semigroup Regularization

链接: https://arxiv.org/abs/2405.18684
作者: Mohammadjavad Matinkia,Nilanjan Ray
关键词: finding topology preserving, topology preserving deformations, medical image analysis, aimed at finding, critical task
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 15 pages, 8 figures

点击查看摘要

Abstract:Diffeomorphic image registration (DIR) is a critical task in 3D medical image analysis, aimed at finding topology preserving deformations between pairs of images. Focusing on the solution of the flow map differential equation as the diffeomorphic deformation, recent methods use discrete timesteps along with various regularization terms to penalize the negative determinant of Jacobian and impose smoothness of the solution vector field. In this paper, we propose a novel learning-based approach for diffeomorphic 3D-image registration which finds the diffeomorphisms in the time continuum with fewer regularization terms and no additional integration. As one of the fundamental properties of flow maps, we exploit the semigroup property as the only form of regularization, ensuring temporally continuous diffeomorphic flows between pairs of images. Leveraging this property, our method alleviates the need for additional regularization terms and scaling and squaring integration during both training and evaluation. To achieve time-continuous diffeomorphisms, we employ time-embedded UNets, a technique commonly utilized in diffusion models. The proposed method reveals that ensuring diffeomorphism in a continuous time interval leads to better registration results. Experimental results on two public datasets (OASIS and CANDI) demonstrate the superiority of our model over both learning-based and optimization-based methods.
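The semigroup property used as the sole regularizer states that flowing for time t+s must equal flowing for time t and then for time s. A toy check of the corresponding residual, using a simple analytic flow in place of the paper's time-embedded UNet:

```python
import numpy as np

def semigroup_residual(flow, x, t, s):
    """Penalty enforcing the semigroup property: phi_{t+s} = phi_s o phi_t.

    flow(x, t) returns the deformed points at time t. Driving this residual
    to zero over sampled (t, s) pairs is, per the abstract, the only form of
    regularization needed for temporally continuous diffeomorphic flows.
    """
    direct = flow(x, t + s)
    composed = flow(flow(x, t), s)
    return np.mean((direct - composed) ** 2)

# Toy flow: exponential scaling phi_t(x) = x * exp(a*t) satisfies the
# semigroup property exactly, so its residual is ~0.
a = 0.3
flow = lambda x, t: x * np.exp(a * t)
x = np.linspace(-1.0, 1.0, 11)
res = semigroup_residual(flow, x, t=0.4, s=0.25)
```

A flow that violates the property (e.g. one with a quadratic time dependence) yields a strictly positive residual, which is what the training signal would penalize.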

[CV-86] Vim-F: Visual State Space Model Benefiting from Learning in the Frequency Domain

链接: https://arxiv.org/abs/2405.18679
作者: Juntao Zhang,Kun Bian,Peng Cheng,Wenbo An,Jianning Liu,Jun Zhou
关键词: State Space Models, State Space, made significant progress, Mamba deep learning, efficient hardware-aware designs
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In recent years, State Space Models (SSMs) with efficient hardware-aware designs, known as the Mamba deep learning models, have made significant progress in modeling long sequences such as language understanding. Therefore, building efficient and general-purpose visual backbones based on SSMs is a promising direction. Compared to traditional convolutional neural networks (CNNs) and Vision Transformers (ViTs), the performance of Vision Mamba (ViM) methods is not yet fully competitive. To enable SSMs to process image data, ViMs typically flatten 2D images into 1D sequences, inevitably ignoring some 2D local dependencies, thereby weakening the model’s ability to interpret spatial relationships from a global perspective. We use Fast Fourier Transform (FFT) to obtain the spectrum of the feature map and add it to the original feature map, enabling ViM to model a unified visual representation in both frequency and spatial domains. The introduction of frequency domain information enables ViM to have a global receptive field during scanning. We propose a novel model called Vim-F, which employs pure Mamba encoders and scans in both the frequency and spatial domains. Moreover, we question the necessity of position embedding in ViM and remove it accordingly in Vim-F, which helps to fully utilize the efficient long-sequence modeling capability of ViM. Finally, we redesign a patch embedding for Vim-F, leveraging a convolutional stem to capture more local correlations, further improving the performance of Vim-F. Code is available at: this https URL.
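The frequency branch amounts to computing the feature map's Fourier spectrum and adding it back onto the spatial features, so every position a scanner visits carries some global information. A rough NumPy sketch (the log-magnitude and normalization choices here are assumptions for illustration, not taken from the paper):

```python
import numpy as np

def add_frequency_branch(feature_map):
    """Augment a 2D feature map with its Fourier spectrum (Vim-F idea).

    The magnitude spectrum is a whole-image quantity: every output position
    now mixes in frequency-domain information computed over the full map,
    giving a flattened 1D scan a form of global receptive field.
    """
    spectrum = np.fft.fft2(feature_map)
    magnitude = np.log1p(np.abs(np.fft.fftshift(spectrum)))   # nonnegative
    magnitude = magnitude / (magnitude.max() + 1e-8)          # keep scales comparable
    return feature_map + magnitude

rng = np.random.default_rng(3)
fmap = rng.standard_normal((16, 16))
fused = add_frequency_branch(fmap)
```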

[CV-87] Zero-to-Hero: Enhancing Zero-Shot Novel View Synthesis via Attention Map Filtering

链接: https://arxiv.org/abs/2405.18677
作者: Ido Sobol,Chenfeng Xu,Or Litany
关键词: immersive virtual experiences, Generating realistic images, broad applications ranging, single source image, source image remains
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Project page: this https URL

点击查看摘要

Abstract:Generating realistic images from arbitrary views based on a single source image remains a significant challenge in computer vision, with broad applications ranging from e-commerce to immersive virtual experiences. Recent advancements in diffusion models, particularly the Zero-1-to-3 model, have been widely adopted for generating plausible views, videos, and 3D models. However, these models still struggle with inconsistencies and implausibility in new views generation, especially for challenging changes in viewpoint. In this work, we propose Zero-to-Hero, a novel test-time approach that enhances view synthesis by manipulating attention maps during the denoising process of Zero-1-to-3. By drawing an analogy between the denoising process and stochastic gradient descent (SGD), we implement a filtering mechanism that aggregates attention maps, enhancing generation reliability and authenticity. This process improves geometric consistency without requiring retraining or significant computational resources. Additionally, we modify the self-attention mechanism to integrate information from the source view, reducing shape distortions. These processes are further supported by a specialized sampling schedule. Experimental results demonstrate substantial improvements in fidelity and consistency, validated on a diverse set of out-of-distribution objects.

[CV-88] LLM-based Hierarchical Concept Decomposition for Interpretable Fine-Grained Image Classification

链接: https://arxiv.org/abs/2405.18672
作者: Renyi Qu,Mark Yatskar
关键词: unstructured text outputs, Recent advancements, achieved competitive performance, large language models, advancements in interpretable
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Recent advancements in interpretable models for vision-language tasks have achieved competitive performance; however, their interpretability often suffers due to the reliance on unstructured text outputs from large language models (LLMs). This introduces randomness and compromises both transparency and reliability, which are essential for addressing safety issues in AI systems. We introduce Hi-CoDe (Hierarchical Concept Decomposition), a novel framework designed to enhance model interpretability through structured concept analysis. Our approach consists of two main components: (1) We use GPT-4 to decompose an input image into a structured hierarchy of visual concepts, thereby forming a visual concept tree. (2) We then employ an ensemble of simple linear classifiers that operate on concept-specific features derived from CLIP to perform classification. Our approach not only aligns with the performance of state-of-the-art models but also advances transparency by providing clear insights into the decision-making process and highlighting the importance of various concepts. This allows for a detailed analysis of potential failure modes and improves model compactness, therefore setting a new benchmark in interpretability without compromising the accuracy.
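Component (2), an ensemble of simple linear classifiers over concept-specific features, can be sketched as averaging per-concept linear heads. The concept names, head weights, and dimensions below are toy stand-ins, not the paper's actual concept tree or CLIP features:

```python
import numpy as np

def concept_ensemble_predict(concept_feats, classifiers):
    """Average the decisions of per-concept linear classifiers (Hi-CoDe sketch).

    concept_feats: dict mapping a concept name to its feature vector (in the
    paper, derived from CLIP); classifiers: matching dict of (W, b) linear
    heads. Per-concept logits make each concept's contribution inspectable.
    """
    logits = [feats @ classifiers[name][0].T + classifiers[name][1]
              for name, feats in concept_feats.items()]
    avg = np.mean(logits, axis=0)
    return int(np.argmax(avg)), avg

rng = np.random.default_rng(6)
num_classes, feat_dim = 3, 5
concepts = {"beak shape": rng.standard_normal(feat_dim),      # hypothetical concepts
            "wing pattern": rng.standard_normal(feat_dim)}
heads = {name: (rng.standard_normal((num_classes, feat_dim)),
                rng.standard_normal(num_classes))
         for name in concepts}
pred, avg_logits = concept_ensemble_predict(concepts, heads)
```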

[CV-89] Mitigating Object Hallucination via Data Augmented Contrastive Tuning

链接: https://arxiv.org/abs/2405.18654
作者: Pritam Sarkar,Sayna Ebrahimi,Ali Etemad,Ahmad Beirami,Sercan Ö. Arık,Tomas Pfister
关键词: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, hallucinate factually inaccurate
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Despite their remarkable progress, Multimodal Large Language Models (MLLMs) tend to hallucinate factually inaccurate information. In this work, we address object hallucinations in MLLMs, where information is offered about an object that is not present in the model input. We introduce a contrastive tuning method that can be applied to a pretrained off-the-shelf MLLM for mitigating hallucinations while preserving its general vision-language capabilities. For a given factual token, we create a hallucinated token through generative data augmentation by selectively altering the ground-truth information. The proposed contrastive tuning is applied at the token level to improve the relative likelihood of the factual token compared to the hallucinated one. Our thorough evaluation confirms the effectiveness of contrastive tuning in mitigating hallucination. Moreover, the proposed contrastive tuning is simple, fast, and requires minimal training with no additional overhead at inference.
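Token-level contrastive tuning of this kind can be illustrated as raising the factual token's log-odds against its hallucinated counterpart. The two-way softmax below is a minimal sketch; the paper's exact objective and weighting may differ:

```python
import numpy as np

def contrastive_token_loss(logits, factual_id, hallucinated_id):
    """Raise the factual token's likelihood relative to its hallucinated pair.

    The loss is the negative log-probability of the factual token under a
    two-way softmax over the (factual, hallucinated) logits; minimizing it
    pushes the factual logit above the hallucinated one.
    """
    pair = np.array([logits[factual_id], logits[hallucinated_id]])
    pair = pair - pair.max()                       # numerical stability
    p_factual = np.exp(pair[0]) / np.exp(pair).sum()
    return -np.log(p_factual)

vocab_logits = np.array([2.0, 0.5, 1.5, -1.0])     # toy LM-head output
loss = contrastive_token_loss(vocab_logits, factual_id=0, hallucinated_id=2)
```

The hallucinated token here plays the role of the generatively augmented counterpart the abstract describes, created by altering the ground-truth information.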

[CV-90] Wavelet-Based Image Tokenizer for Vision Transformers

链接: https://arxiv.org/abs/2405.18616
作者: Zhenhai Zhu,Radu Soricut
关键词: Non-overlapping patch-wise convolution, vision Transformer, Non-overlapping patch-wise, patch-wise convolution, default image tokenizer
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Non-overlapping patch-wise convolution is the default image tokenizer for all state-of-the-art vision Transformer (ViT) models. Even though many ViT variants have been proposed to improve its efficiency and accuracy, little research on improving the image tokenizer itself has been reported in the literature. In this paper, we propose a new image tokenizer based on wavelet transformation. We show that ViT models with the new tokenizer achieve both higher training throughput and better top-1 precision for the ImageNet validation set. We present a theoretical analysis on why the proposed tokenizer improves the training throughput without any change to ViT model architecture. Our analysis suggests that the new tokenizer can effectively handle high-resolution images and is naturally resistant to adversarial attack. Furthermore, the proposed image tokenizer offers a fresh perspective on important new research directions for ViT-based model design, such as image tokens on a non-uniform grid for image understanding.
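A wavelet-based tokenizer can be illustrated with a one-level Haar transform per non-overlapping patch, flattening the four subbands (LL, LH, HL, HH) into a token. This sketch omits the multi-level decomposition and any coefficient handling the paper may use:

```python
import numpy as np

def haar_tokenize(image, patch=4):
    """Sketch of a wavelet-based image tokenizer (one-level Haar per patch).

    For each non-overlapping patch we compute the four Haar subbands and
    flatten them into a token, so each token separates a patch's coarse
    content (LL) from its horizontal/vertical/diagonal detail (LH, HL, HH).
    """
    H, W = image.shape
    tokens = []
    for i in range(0, H, patch):
        for j in range(0, W, patch):
            p = image[i:i + patch, j:j + patch]
            a = (p[0::2, :] + p[1::2, :]) / 2      # row averages
            d = (p[0::2, :] - p[1::2, :]) / 2      # row differences
            ll = (a[:, 0::2] + a[:, 1::2]) / 2
            lh = (a[:, 0::2] - a[:, 1::2]) / 2
            hl = (d[:, 0::2] + d[:, 1::2]) / 2
            hh = (d[:, 0::2] - d[:, 1::2]) / 2
            tokens.append(np.concatenate([ll, lh, hl, hh], axis=None))
    return np.stack(tokens)

img = np.arange(64, dtype=float).reshape(8, 8)
toks = haar_tokenize(img, patch=4)                  # 4 tokens of length 16
```

For a constant patch all detail subbands vanish, which hints at why such a tokenizer can compress smooth regions and, per the paper's analysis, supports non-uniform token grids.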

[CV-91] Augmented Physics: A Machine Learning-Powered Tool for Creating Interactive Physics Simulations from Static Diagrams

链接: https://arxiv.org/abs/2405.18614
作者: Aditya Gunturu,Yi Wen,Jarin Thundathil,Nandi Zhang,Rubaiat Habib Kazi,Ryo Suzuki
关键词: machine learning-powered tool, learning-powered tool designed, introduce Augmented Physics, machine learning-powered, learning-powered tool
类目: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce Augmented Physics, a machine learning-powered tool designed for creating interactive physics simulations from static textbook diagrams. Leveraging computer vision techniques, such as Segment Anything and OpenCV, our web-based system enables users to semi-automatically extract diagrams from physics textbooks and then generate interactive simulations based on the extracted content. These interactive diagrams are seamlessly integrated into scanned textbook pages, facilitating interactive and personalized learning experiences across various physics concepts, including gravity, optics, circuits, and kinematics. Drawing on an elicitation study with seven physics instructors, we explore four key augmentation techniques: 1) augmented experiments, 2) animated diagrams, 3) bi-directional manipulatives, and 4) parameter visualization. We evaluate our system through technical evaluation, a usability study (N=12), and expert interviews (N=12). The study findings suggest that our system can facilitate more engaging and personalized learning experiences in physics education.

[CV-92] Track Initialization and Re-Identification for 3D Multi-View Multi-Object Tracking

链接: https://arxiv.org/abs/2405.18606
作者: Linh Van Ma,Tran Thien Dat Nguyen,Ba-Ngu Vo,Hyunsung Jang,Moongu Jeon
关键词: resolves track appearance-reappearance, automatically initiates, terminates tracks, multi-object tracking, resolves track
类目: Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT)
*备注:

点击查看摘要

Abstract:We propose a 3D multi-object tracking (MOT) solution using only 2D detections from monocular cameras, which automatically initiates/terminates tracks as well as resolves track appearance-reappearance and occlusions. Moreover, this approach does not require detector retraining when cameras are reconfigured but only the camera matrices of reconfigured cameras need to be updated. Our approach is based on a Bayesian multi-object formulation that integrates track initiation/termination, re-identification, occlusion handling, and data association into a single Bayes filtering recursion. However, the exact filter that utilizes all these functionalities is numerically intractable due to the exponentially growing number of terms in the (multi-object) filtering density, while existing approximations trade-off some of these functionalities for speed. To this end, we develop a more efficient approximation suitable for online MOT by incorporating object features and kinematics into the measurement model, which improves data association and subsequently reduces the number of terms. Specifically, we exploit the 2D detections and extracted features from multiple cameras to provide a better approximation of the multi-object filtering density to realize the track initiation/termination and re-identification functionalities. Further, incorporating a tractable geometric occlusion model based on 2D projections of 3D objects on the camera planes realizes the occlusion handling functionality of the filter. Evaluation of the proposed solution on challenging datasets demonstrates significant improvements and robustness when camera configurations change on-the-fly, compared to existing multi-view MOT solutions. The source code is publicly available at this https URL.

[CV-93] It's Not a Modality Gap: Characterizing and Addressing the Contrastive Gap

链接: https://arxiv.org/abs/2405.18570
作者: Abrar Fahim,Alex Murphy,Alona Fyshe
关键词: embedding input images, contrastive models, Multi-modal contrastive models, contrastive, gap
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multi-modal contrastive models such as CLIP achieve state-of-the-art performance in zero-shot classification by embedding input images and texts on a joint representational space. Recently, a modality gap has been reported in two-encoder contrastive models like CLIP, meaning that the image and text embeddings reside in disjoint areas of the latent space. Previous studies suggest that this gap exists due to 1) the cone effect, 2) mismatched pairs in the dataset, and 3) insufficient training. We show that, even when accounting for all these factors, and even when using the same modality, the contrastive loss actually creates a gap during training. As a result, we propose that the modality gap is inherent to the two-encoder contrastive loss and rename it the contrastive gap. We present evidence that attributes this contrastive gap to low uniformity in CLIP space, resulting in embeddings that occupy only a small portion of the latent space. To close the gap, we adapt the uniformity and alignment properties of unimodal contrastive loss to the multi-modal setting and show that simply adding these terms to the CLIP loss distributes the embeddings more uniformly in the representational space, closing the gap. In our experiments, we show that the modified representational space achieves better performance than default CLIP loss in downstream tasks such as zero-shot image classification and multi-modal arithmetic.
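The uniformity and alignment terms referenced here follow the standard definitions from unimodal contrastive learning: alignment is the mean squared distance between matched pairs, and uniformity is the log of the average Gaussian potential over all pairs. A NumPy sketch on toy embeddings (the temperature and data below are illustrative):

```python
import numpy as np

def alignment_loss(x, y):
    """Mean squared distance between matched image/text embeddings."""
    return np.mean(np.sum((x - y) ** 2, axis=1))

def uniformity_loss(z, t=2.0):
    """Log of the mean Gaussian potential between all distinct pairs.

    Lower values mean the embeddings spread more uniformly on the sphere;
    adding this term to the CLIP loss is the fix the abstract describes.
    """
    sq = np.sum((z[:, None, :] - z[None, :, :]) ** 2, axis=-1)
    iu = np.triu_indices(len(z), k=1)
    return np.log(np.mean(np.exp(-t * sq[iu])))

def l2_normalize(z):
    return z / np.linalg.norm(z, axis=1, keepdims=True)

rng = np.random.default_rng(4)
img = l2_normalize(rng.standard_normal((32, 8)))            # toy image embeddings
txt = l2_normalize(img + 0.05 * rng.standard_normal((32, 8)))  # near-aligned text
align = alignment_loss(img, txt)
uniform = uniformity_loss(np.concatenate([img, txt]))
```

A fully collapsed embedding set attains the worst uniformity value of 0, illustrating the "low uniformity" failure mode the paper attributes the gap to.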

[CV-94] Potential Field Based Deep Metric Learning

链接: https://arxiv.org/abs/2405.18560
作者: Shubhang Bhatnagar,Narendra Ahuja
关键词: Deep metric learning, meaningful representation space, semantically meaningful representation, Deep metric, involves training
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:Deep metric learning (DML) involves training a network to learn a semantically meaningful representation space. Many current approaches mine n-tuples of examples and model interactions within each tuple. We present a novel, compositional DML model, inspired by electrostatic fields in physics that, instead of in tuples, represents the influence of each example (embedding) by a continuous potential field, and superposes the fields to obtain their combined global potential field. We use attractive/repulsive potential fields to represent interactions among embeddings from images of the same/different classes. Contrary to typical learning methods, where mutual influence of samples is proportional to their distance, we enforce reduction in such influence with distance, leading to a decaying field. We show that such decay helps improve performance on real world datasets with large intra-class variations and label noise. Like other proxy-based methods, we also use proxies to succinctly represent sub-populations of examples. We evaluate our method on three standard DML benchmarks (Cars-196, CUB-200-2011, and SOP), where it outperforms state-of-the-art baselines.

[CV-95] Low-Rank Few-Shot Adaptation of Vision-Language Models

链接: https://arxiv.org/abs/2405.18541
作者: Maxime Zanella,Ismail Ben Ayed
关键词: Vision-Language Models, generalization capabilities, target downstream task, pushed their generalization, labeled samples
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Recent progress in the few-shot adaptation of Vision-Language Models (VLMs) has further pushed their generalization capabilities, at the expense of just a few labeled samples within the target downstream task. However, this promising, already quite abundant few-shot literature has focused principally on prompt learning and, to a lesser extent, on adapters, overlooking the recent advances in Parameter-Efficient Fine-Tuning (PEFT). Furthermore, existing few-shot learning methods for VLMs often rely on heavy training procedures and/or carefully chosen, task-specific hyper-parameters, which might impede their applicability. In response, we introduce Low-Rank Adaptation (LoRA) in few-shot learning for VLMs, and show its potential on 11 datasets, in comparison to current state-of-the-art prompt- and adapter-based approaches. Surprisingly, our simple CLIP-LoRA method exhibits substantial improvements, while reducing the training times and keeping the same hyper-parameters in all the target tasks, i.e., across all the datasets and numbers of shots. Certainly, our surprising results do not dismiss the potential of prompt-learning and adapter-based research. However, we believe that our strong baseline could be used to evaluate progress in these emergent subjects in few-shot VLMs.
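The LoRA mechanism the paper applies to CLIP can be sketched as a frozen weight plus a scaled low-rank update. The `alpha / r` scaling and zero initialization of the up-projection follow common LoRA convention; they are not necessarily the authors' exact implementation details:

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=1.0):
    """Forward pass through a frozen weight W plus the low-rank
    update (alpha / r) * B @ A; only A and B would be trained."""
    r = A.shape[0]
    return x @ (W + (alpha / r) * (B @ A)).T

rng = np.random.default_rng(0)
d_in, d_out, r = 16, 8, 2
W = rng.normal(size=(d_out, d_in))       # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01    # trainable down-projection
B = np.zeros((d_out, r))                 # trainable up-projection, zero init
x = rng.normal(size=(4, d_in))
y = lora_forward(x, W, A, B)             # identical to the frozen model at init
```

Because `B` starts at zero, the adapted model reproduces the pretrained model exactly before fine-tuning, and only `2 * r * d` extra parameters per layer are trained.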

[CV-96] Task-Driven Uncertainty Quantification in Inverse Problems via Conformal Prediction

链接: https://arxiv.org/abs/2405.18527
作者: Jeffrey Wen,Rizwan Ahmad,Philip Schniter
关键词: imaging inverse problems, seeks to recover, inverse problems, corrupted measurements, imaging inverse
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:In imaging inverse problems, one seeks to recover an image from missing/corrupted measurements. Because such problems are ill-posed, there is great motivation to quantify the uncertainty induced by the measurement-and-recovery process. Motivated by applications where the recovered image is used for a downstream task, such as soft-output classification, we propose a task-centered approach to uncertainty quantification. In particular, we use conformal prediction to construct an interval that is guaranteed to contain the task output from the true image up to a user-specified probability, and we use the width of that interval to quantify the uncertainty contributed by measurement-and-recovery. For posterior-sampling-based image recovery, we construct locally adaptive prediction intervals. Furthermore, we propose to collect measurements over multiple rounds, stopping as soon as the task uncertainty falls below an acceptable level. We demonstrate our methodology on accelerated magnetic resonance imaging (MRI).
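A minimal sketch of the split-conformal interval construction the abstract builds on (the locally adaptive, posterior-sampling variant is not shown, and the calibration scores below are synthetic stand-ins for task-output errors):

```python
import numpy as np

def conformal_interval(cal_scores, pred, alpha=0.1):
    """Split conformal prediction: the (1 - alpha)-adjusted quantile of
    calibration nonconformity scores yields an interval around the task
    output with marginal coverage >= 1 - alpha."""
    n = len(cal_scores)
    q_level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    q = np.quantile(cal_scores, q_level, method="higher")
    return pred - q, pred + q

# synthetic calibration scores: |task output on true image - on recovered image|
rng = np.random.default_rng(0)
cal_scores = np.abs(rng.normal(scale=0.5, size=200))
lo, hi = conformal_interval(cal_scores, pred=3.2, alpha=0.1)
width = hi - lo  # interval width quantifies measurement-and-recovery uncertainty
```

In the paper's multi-round setting, one would keep acquiring measurements and recomputing `width` until it falls below an acceptable threshold.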

[CV-97] REPARO: Compositional 3D Assets Generation with Differentiable 3D Layout Alignment

链接: https://arxiv.org/abs/2405.18525
作者: Haonan Han,Rui Yang,Huan Liao,Jiankai Xing,Zunnan Xu,Xiaoming Yu,Junwei Zha,Xiu Li,Wanhua Li
关键词: multiple objects due, due to biases, biases and occlusion, Traditional, REPARO
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Traditional image-to-3D models often struggle with scenes containing multiple objects due to biases and occlusion complexities. To address this challenge, we present REPARO, a novel approach for compositional 3D asset generation from single images. REPARO employs a two-step process: first, it extracts individual objects from the scene and reconstructs their 3D meshes using off-the-shelf image-to-3D models; then, it optimizes the layout of these meshes through differentiable rendering techniques, ensuring coherent scene composition. By integrating an optimal transport-based long-range appearance loss term and a high-level semantic loss term in the differentiable rendering, REPARO can effectively recover the layout of 3D assets. The proposed method can significantly enhance object independence, detail accuracy, and overall scene coherence. Extensive evaluation of multi-object scenes demonstrates that our REPARO offers a comprehensive approach to address the complexities of multi-object 3D scene generation from single images.

[CV-98] Aligning in a Compact Space: Contrastive Knowledge Distillation between Heterogeneous Architectures

链接: https://arxiv.org/abs/2405.18524
作者: Hongjun Wu,Li Xiao,Xingkuo Zhang,Yining Miao
关键词: compress neural networks, Contrastive Knowledge Distillation, Knowledge distillation, neural networks, reducing the inference
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 12 pages, 3 figures, conference paper

点击查看摘要

Abstract:Knowledge distillation is commonly employed to compress neural networks, reducing the inference costs and memory footprint. In the scenario of homogenous architecture, feature-based methods have been widely validated for their effectiveness. However, in scenarios where the teacher and student models are of heterogeneous architectures, the inherent differences in feature representation significantly degrade the performance of these methods. Recent studies have highlighted that low-frequency components constitute the majority of image features. Motivated by this, we propose a Low-Frequency Components-based Contrastive Knowledge Distillation (LFCC) framework that significantly enhances the performance of feature-based distillation between heterogeneous architectures. Specifically, we design a set of multi-scale low-pass filters to extract the low-frequency components of intermediate features from both the teacher and student models, aligning them in a compact space to overcome architectural disparities. Moreover, leveraging the intrinsic pairing characteristic of the teacher-student framework, we design an innovative sample-level contrastive learning framework that adeptly restructures the constraints of within-sample feature similarity and between-sample feature divergence into a contrastive learning task. This strategy enables the student model to capitalize on intra-sample feature congruence while simultaneously enhancing the discrimination of features among disparate samples. Consequently, our LFCC framework accurately captures the commonalities in feature representation across heterogeneous architectures. Extensive evaluations and empirical analyses across three architectures (CNNs, Transformers, and MLPs) demonstrate that LFCC achieves superior performance on the challenging benchmarks of ImageNet-1K and CIFAR-100. All codes will be publicly available.

[CV-99] TripletMix: Triplet Data Augmentation for 3D Understanding

链接: https://arxiv.org/abs/2405.18523
作者: Jiaze Wang,Yi Wang,Ziyu Guo,Renrui Zhang,Donghao Zhou,Guangyong Chen,Anfeng Liu,Pheng-Ann Heng
关键词: multimodal data augmentation, multimodal triplet data, vision where traditional, vital tool, tool for enhancing
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Data augmentation has proven to be a vital tool for enhancing the generalization capabilities of deep learning models, especially in the context of 3D vision where traditional datasets are often limited. Despite previous advancements, existing methods primarily cater to unimodal data scenarios, leaving a gap in the augmentation of multimodal triplet data, which integrates text, images, and point clouds. Simultaneously augmenting all three modalities enhances diversity and improves alignment across modalities, resulting in more comprehensive and robust 3D representations. To address this gap, we propose TripletMix, a novel approach to address the previously unexplored issue of multimodal data augmentation in 3D understanding. TripletMix innovatively applies the principles of mixed-based augmentation to multimodal triplet data, allowing for the preservation and optimization of cross-modal connections. Our proposed TripletMix combines feature-level and input-level augmentations to achieve dual enhancement between raw data and latent features, significantly improving the model’s cross-modal understanding and generalization capabilities by ensuring feature consistency and providing diverse and realistic training samples. We demonstrate that TripletMix not only improves the baseline performance of models in various learning scenarios including zero-shot and linear probing classification but also significantly enhances model generalizability. Notably, we improved the zero-shot classification accuracy on ScanObjectNN from 51.3 percent to 61.9 percent, and on Objaverse-LVIS from 46.8 percent to 51.4 percent. Our findings highlight the potential of multimodal data augmentation to significantly advance 3D object recognition and understanding.

[CV-100] Feasibility and benefits of joint learning from MRI databases with different brain diseases and modalities for segmentation

链接: https://arxiv.org/abs/2405.18511
作者: Wentian Xu,Matthew Moffat,Thalia Seale,Ziyun Liang,Felix Wagner,Daniel Whitehouse,David Menon,Virginia Newcombe,Natalie Voets,Abhirup Banerjee,Konstantinos Kamnitsas
关键词: MRI modalities, specific disease, MRI, modalities, specific pathology
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to MIDL 2024

点击查看摘要

Abstract:Models for segmentation of brain lesions in multi-modal MRI are commonly trained for a specific pathology using a single database with a predefined set of MRI modalities, determined by a protocol for the specific disease. This work explores the following open questions: Is it feasible to train a model using multiple databases that contain varying sets of MRI modalities and annotations for different brain pathologies? Will this joint learning benefit performance on the sets of modalities and pathologies available during training? Will it enable analysis of new databases with different sets of modalities and pathologies? We develop and compare different methods and show that promising results can be achieved with appropriate, simple and practical alterations to the model and training framework. We experiment with 7 databases containing 5 types of brain pathologies and different sets of MRI modalities. Results demonstrate, for the first time, that joint training on multi-modal MRI databases with different brain pathologies and sets of modalities is feasible and offers practical benefits. It enables a single model to segment pathologies encountered during training in diverse sets of modalities, while facilitating segmentation of new types of pathologies such as via follow-up fine-tuning. The insights this study provides into the potential and limitations of this paradigm should prove useful for guiding future advances in the direction. Code and pretrained models: this https URL

[CV-101] The Unified Balance Theory of Second-Moment Exponential Scaling Optimizers in Visual Tasks

链接: https://arxiv.org/abs/2405.18498
作者: Gongyue Zhang,Honghai Liu
关键词: Second-Moment Exponential Scaling, variable Second-Moment Exponential, Exponential Scaling, unifying first-order optimizers, identified a potential
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We have identified a potential method for unifying first-order optimizers through the use of variable Second-Moment Exponential Scaling (SMES). We begin with back propagation, addressing classic phenomena such as gradient vanishing and explosion, as well as issues related to dataset sparsity, and introduce the theory of balance in optimization. Through this theory, we suggest that SGD and adaptive optimizers can be unified under a broader inference, employing variable moving exponential scaling to achieve a balanced approach within a generalized formula for first-order optimizers. We conducted tests on some classic datasets and networks to confirm the impact of different balance coefficients on the overall training process.
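The paper's exact SMES formula is not quoted in the abstract; the following is only a hedged illustration of how a variable second-moment exponent can interpolate between SGD-with-momentum and an Adam-style adaptive step:

```python
def unified_step(param, grad, state, lr=0.01, beta1=0.9, beta2=0.999, p=0.5, eps=1e-8):
    """Generalized first-order update with a variable second-moment exponent p:
        m <- beta1*m + (1-beta1)*grad      (first moment)
        v <- beta2*v + (1-beta2)*grad**2   (second moment)
        param <- param - lr * m / (v**p + eps)
    p = 0 makes the denominator ~1 (SGD with momentum);
    p = 0.5 recovers an Adam-style adaptive step (bias correction omitted)."""
    m = beta1 * state["m"] + (1 - beta1) * grad
    v = beta2 * state["v"] + (1 - beta2) * grad ** 2
    state = {"m": m, "v": v}
    return param - lr * m / (v ** p + eps), state

# one step from the same starting point under the two limiting exponents
p_sgd, _ = unified_step(1.0, 0.5, {"m": 0.0, "v": 0.0}, lr=0.1, p=0.0)
p_ada, _ = unified_step(1.0, 0.5, {"m": 0.0, "v": 0.0}, lr=0.1, p=0.5)
```

Intermediate values of `p` (the "balance coefficient" in this sketch) trade off between the two regimes.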

[CV-102] Anomaly detection for the identification of volcanic unrest in satellite imagery

链接: https://arxiv.org/abs/2405.18487
作者: Robert Gabriel Popescu,Nantheera Anantrasirichai,Juliet Biggs
关键词: prior to eruptions, routinely acquired, volcanic deformation events, potential to detect, vast number
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:Satellite images have the potential to detect volcanic deformation prior to eruptions, but while a vast number of images are routinely acquired, only a small percentage contain volcanic deformation events. Manual inspection could miss these anomalies, and an automatic system modelled with supervised learning requires suitably labelled datasets. To tackle these issues, this paper explores the use of unsupervised deep learning on satellite data for the purpose of identifying volcanic deformation as anomalies. Our detector is based on Patch Distribution Modeling (PaDiM), and the detection performance is enhanced with a weighted distance, assigning greater importance to features from deeper layers. Additionally, we propose a preprocessing approach to handle noisy and incomplete data points. The final framework was tested with five volcanoes, which have different deformation characteristics and its performance was compared against the supervised learning method for volcanic deformation detection.
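The weighted distance the abstract describes, built on PaDiM-style per-patch Gaussian statistics, can be sketched as a layer-weighted sum of Mahalanobis distances. The layer weights below are placeholders; the paper assigns greater importance to deeper layers:

```python
import numpy as np

def mahalanobis(x, mu, inv_cov):
    """Anomaly distance of a patch feature under a Gaussian fit on normal data."""
    d = x - mu
    return float(np.sqrt(d @ inv_cov @ d))

def weighted_score(layer_feats, layer_stats, layer_weights):
    """Layer-weighted sum of Mahalanobis distances; larger weights on
    deeper layers give their features more influence on the final score."""
    return sum(w * mahalanobis(x, mu, ic)
               for w, x, (mu, ic) in zip(layer_weights, layer_feats, layer_stats))

mu, inv_cov = np.zeros(3), np.eye(3)  # stats estimated from deformation-free data
feat = np.array([1.0, 0.0, 0.0])      # the same patch feature at two layers
score = weighted_score([feat, feat], [(mu, inv_cov)] * 2, layer_weights=[1.0, 2.0])
```

Patches whose combined score exceeds a threshold would be flagged as potential volcanic deformation.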

[CV-103] Towards Open Domain Text-Driven Synthesis of Multi-Person Motions

链接: https://arxiv.org/abs/2405.18483
作者: Mengyi Shan,Lu Dong,Yutao Han,Yuan Yao,Tao Liu,Ifeoma Nwogu,Guo-Jun Qi,Mitch Hill
关键词: diverse group motions, natural and diverse, diverse group, work aims, textual descriptions
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Project page: this https URL

点击查看摘要

Abstract:This work aims to generate natural and diverse group motions of multiple humans from textual descriptions. While single-person text-to-motion generation is extensively studied, it remains challenging to synthesize motions for more than one or two subjects from in-the-wild prompts, mainly due to the lack of available datasets. In this work, we curate human pose and motion datasets by estimating pose information from large-scale image and video datasets. Our models use a transformer-based diffusion framework that accommodates multiple datasets with any number of subjects or frames. Experiments explore both generation of multi-person static poses and generation of multi-person motion sequences. To our knowledge, our method is the first to generate multi-subject motion sequences with high diversity and fidelity from a large variety of textual prompts.

[CV-104] GHOST: Grounded Human Motion Generation with Open Vocabulary Scene-and-Text Contexts

链接: https://arxiv.org/abs/2405.18438
作者: Zoltán Á. Milacski,Koichiro Niinuma,Ryosuke Kawamura,Fernando de la Torre,László A. Jeni
关键词: generating human motion, descriptive language, language that characterizes, well-suited for localizing, localizing and generating
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 18 pages, 5 figures

点击查看摘要

Abstract:The connection between our 3D surroundings and the descriptive language that characterizes them would be well-suited for localizing and generating human motion in context but for one problem. The complexity introduced by multiple modalities makes capturing this connection challenging with a fixed set of descriptors. Specifically, closed vocabulary scene encoders, which require learning text-scene associations from scratch, have been favored in the literature, often resulting in inaccurate motion grounding. In this paper, we propose a method that integrates an open vocabulary scene encoder into the architecture, establishing a robust connection between text and scene. Our two-step approach starts with pretraining the scene encoder through knowledge distillation from an existing open vocabulary semantic image segmentation model, ensuring a shared text-scene feature space. Subsequently, the scene encoder is fine-tuned for conditional motion generation, incorporating two novel regularization losses that regress the category and size of the goal object. Our methodology achieves up to a 30% reduction in the goal object distance metric compared to the prior state-of-the-art baseline model on the HUMANISE dataset. This improvement is demonstrated through evaluations conducted using three implementations of our framework and a perceptual study. Additionally, our method is designed to seamlessly accommodate future 2D segmentation methods that provide per-pixel text-aligned features for distillation.

[CV-105] Transductive Zero-Shot and Few-Shot CLIP

链接: https://arxiv.org/abs/2405.18437
作者: Ségolène Martin(OPIS, CVN),Yunshi Huang(ETS),Fereshteh Shakeri(ETS),Jean-Christophe Pesquet(OPIS, CVN),Ismail Ben Ayed(ETS)
关键词: fast growing literature, adapting vision-language models, few-shot CLIP classification, few-shot image classification, fast growing
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 2024 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2024, Seattle (USA), Washington, United States

点击查看摘要

Abstract:Transductive inference has been widely investigated in few-shot image classification, but completely overlooked in the recent, fast growing literature on adapting vision-language models like CLIP. This paper addresses the transductive zero-shot and few-shot CLIP classification challenge, in which inference is performed jointly across a mini-batch of unlabeled query samples, rather than treating each instance independently. We initially construct informative vision-text probability features, leading to a classification problem on the unit simplex set. Inspired by Expectation-Maximization (EM), our optimization-based classification objective models the data probability distribution for each class using a Dirichlet law. The minimization problem is then tackled with a novel block Majorization-Minimization algorithm, which simultaneously estimates the distribution parameters and class assignments. Extensive numerical experiments on 11 datasets underscore the benefits and efficacy of our batch inference approach. On zero-shot tasks with test batches of 75 samples, our approach yields near 20% improvement in ImageNet accuracy over CLIP’s zero-shot performance. Additionally, we outperform state-of-the-art methods in the few-shot setting. The code is available at: this https URL.

[CV-106] A study on the adequacy of common IQA measures for medical images

链接: https://arxiv.org/abs/2405.19224
作者: Anna Breger,Clemens Karner,Ian Selby,Janek Gröhl,Sören Dittmer,Edward Lilley,Judith Babar,Jake Beckford,Timothy J Sadler,Shahab Shahipasand,Arthikkaa Thavakumar,Michael Roberts,Carola-Bibiane Schönlieb
关键词: IQA measures, machine learning algorithms, natural images, Image quality assessment, standard practice
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Image quality assessment (IQA) is standard practice in the development stage of novel machine learning algorithms that operate on images. The most commonly used IQA measures have been developed and tested for natural images, but not in the medical setting. Reported inconsistencies arising in medical images are not surprising, as they have different properties than natural images. In this study, we test the applicability of common IQA measures for medical image data by comparing their assessment to manually rated chest X-ray (5 experts) and photoacoustic image data (1 expert). Moreover, we include supplementary studies on grayscale natural images and accelerated brain MRI data. The results of all experiments show a similar outcome in line with previous findings for medical imaging: PSNR and SSIM in the default setting are in the lower range of the result list and HaarPSI outperforms the other tested measures in the overall performance. Also among the top performers in our medical experiments are the full reference measures DISTS, FSIM, LPIPS and MS-SSIM. Generally, the results on natural images yield considerably higher correlations, suggesting that the additional employment of tailored IQA measures for medical imaging algorithms is needed.

[CV-107] Contrastive-Adversarial and Diffusion: Exploring pre-training and fine-tuning strategies for sulcal identification

链接: https://arxiv.org/abs/2405.19204
作者: Michail Mamalakis,Héloïse de Vareilles,Shun-Chin Jim Wu,Ingrid Agartz,Lynn Egeland Mørch-Johnsen,Jane Garrison,Jon Simons,Pietro Lio,John Suckling,Graham Murray
关键词: witnessed the establishment, computer vision, learning, diffusion denoising learning, fine-tuning
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In the last decade, computer vision has witnessed the establishment of various training and learning approaches. Techniques like adversarial learning, contrastive learning, diffusion denoising learning, and ordinary reconstruction learning have become standard, representing state-of-the-art methods extensively employed for fully training or pre-training networks across various vision tasks. The exploration of fine-tuning approaches has emerged as a current focal point, addressing the need for efficient model tuning with reduced GPU memory usage and time costs while enhancing overall performance, as exemplified by methodologies like low-rank adaptation (LoRA). Key questions arise: which pre-training technique yields optimal results - adversarial, contrastive, reconstruction, or diffusion denoising? How does the performance of these approaches vary as the complexity of fine-tuning is adjusted? This study aims to elucidate the advantages of pre-training techniques and fine-tuning strategies to enhance the learning process of neural networks in independent identical distribution (IID) cohorts. We underscore the significance of fine-tuning by examining various cases, including full tuning, decoder tuning, top-level tuning, and fine-tuning of linear parameters using LoRA. Systematic summaries of model performance and efficiency are presented, leveraging metrics such as accuracy, time cost, and memory efficiency. To empirically demonstrate our findings, we focus on a multi-task segmentation-classification challenge involving the paracingulate sulcus (PCS) using different 3D Convolutional Neural Network (CNN) architectures by using the TOP-OSLO cohort comprising 596 subjects.

[CV-108] Reconstructing Interpretable Features in Computational Super-Resolution microscopy via Regularized Latent Search

链接: https://arxiv.org/abs/2405.19112
作者: Marzieh Gheisari,Auguste Genovesio
关键词: Supervised deep learning, Supervised deep, approaches can artificially, deep learning approaches, artificially increase
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: accepted for publication in Biological Imaging

点击查看摘要

Abstract:Supervised deep learning approaches can artificially increase the resolution of microscopy images by learning a mapping between two image resolutions or modalities. However, such methods often require a large set of hard-to-get low-res/high-res image pairs and produce synthetic images with a moderate increase in resolution. Conversely, recent methods based on GAN latent search offered a drastic increase in resolution without the need of paired images. However, they offer limited reconstruction of the high-resolution image interpretable features. Here, we propose a robust super-resolution method based on regularized latent search (RLS) that offers an actionable balance between fidelity to the ground-truth and realism of the recovered image given a distribution prior. The latter allows to split the analysis of a low-resolution image into a computational super-resolution task performed by deep learning followed by a quantification task performed by a handcrafted algorithm and based on interpretable biological features. This two-step process holds potential for various applications such as diagnostics on mobile devices, where the main aim is not to recover the high-resolution details of a specific sample but rather to obtain high-resolution images that preserve explainable and quantifiable differences between conditions.

[CV-109] A study of why we need to reassess full reference image quality assessment with medical images

链接: https://arxiv.org/abs/2405.19097
作者: Anna Breger,Ander Biguri,Malena Sabaté Landman,Ian Selby,Nicole Amberg,Elisabeth Brunner,Janek Gröhl,Sepideh Hatamikia,Clemens Karner,Lipeng Ning,Sören Dittmer,Michael Roberts,AIX-COVNET Collaboration,Carola-Bibiane Schönlieb
关键词: ensure high standards, medical images, Image quality assessment, image quality measures, high standards
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Image quality assessment (IQA) is not just indispensable in clinical practice to ensure high standards, but also in the development stage of novel algorithms that operate on medical images with reference data. This paper provides a structured and comprehensive collection of examples where the two most common full reference (FR) image quality measures prove to be unsuitable for the assessment of novel algorithms using different kinds of medical images, including real-world MRI, CT, OCT, X-Ray, digital pathology and photoacoustic imaging data. In particular, the FR-IQA measures PSNR and SSIM are known and tested for working successfully in many natural imaging tasks, but discrepancies in medical scenarios have been noted in the literature. Inconsistencies arising in medical images are not surprising, as they have very different properties than natural images which have not been targeted nor tested in the development of the mentioned measures, and therefore might imply wrong judgement of novel methods for medical images. Therefore, improvement is urgently needed in particular in this era of AI to increase explainability, reproducibility and generalizability in machine learning for medical imaging and beyond. On top of the pitfalls we will provide ideas for future research as well as suggesting guidelines for the usage of FR-IQA measures applied to medical images.

[CV-110] On the Influence of Smoothness Constraints in Computed Tomography Motion Compensation

链接: https://arxiv.org/abs/2405.19079
作者: Mareike Thies,Fabian Wagner,Noah Maul,Siyuan Mei,Mingxuan Gu,Laura Pfaff,Nastassia Vysotskaya,Haijun Yu,Andreas Maier
关键词: precise patient immobilization, Computed tomography, motion, relies on precise, Motion compensation
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Computed tomography (CT) relies on precise patient immobilization during image acquisition. Nevertheless, motion artifacts in the reconstructed images can persist. Motion compensation methods aim to correct such artifacts post-acquisition, often incorporating temporal smoothness constraints on the estimated motion patterns. This study analyzes the influence of a spline-based motion model within an existing rigid motion compensation algorithm for cone-beam CT on the recoverable motion frequencies. Results demonstrate that the choice of motion model crucially influences recoverable frequencies. The optimization-based motion compensation algorithm is able to accurately fit the spline nodes for frequencies almost up to the node-dependent theoretical limit according to the Nyquist-Shannon theorem. Notably, a higher node count does not compromise reconstruction performance for slow motion patterns, but can extend the range of recoverable high frequencies for the investigated algorithm. Eventually, the optimal motion model is dependent on the imaged anatomy, clinical use case, and scanning protocol and should be tailored carefully to the expected motion frequency spectrum to ensure accurate motion compensation.

[CV-111] EntProp: High Entropy Propagation for Improving Accuracy and Robustness

链接: https://arxiv.org/abs/2405.18931
作者: Shohei Enomoto
关键词: Deep neural networks, Deep neural, samples, struggle to generalize, impressive performance
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted to UAI2024

点击查看摘要

Abstract:Deep neural networks (DNNs) struggle to generalize to out-of-distribution domains that are different from those in training despite their impressive performance. In practical applications, it is important for DNNs to have both high standard accuracy and robustness against out-of-distribution domains. One technique that achieves both of these improvements is disentangled learning with mixture distribution via auxiliary batch normalization layers (ABNs). This technique treats clean and transformed samples as different domains, allowing a DNN to learn better features from mixed domains. However, if we distinguish the domains of the samples based on entropy, we find that some transformed samples are drawn from the same domain as clean samples, and these samples are not completely different domains. To generate samples drawn from a completely different domain than clean samples, we hypothesize that transforming clean high-entropy samples to further increase the entropy generates out-of-distribution samples that are much further away from the in-distribution domain. On the basis of the hypothesis, we propose high entropy propagation (EntProp), which feeds high-entropy samples to the network that uses ABNs. We introduce two techniques, data augmentation and free adversarial training, that increase entropy and bring the sample further away from the in-distribution domain. These techniques do not require additional training costs. Our experimental results show that EntProp achieves higher standard accuracy and robustness with a lower training cost than the baseline methods. In particular, EntProp is highly effective at training on small datasets.
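EntProp's routing decisions hinge on prediction entropy. A minimal sketch of entropy computation and high-entropy sample selection follows; the subsequent transformation and ABN routing steps are omitted, and the selection ratio is an assumption:

```python
import numpy as np

def prediction_entropy(logits):
    """Shannon entropy of the softmax distribution for each sample."""
    z = logits - logits.max(axis=1, keepdims=True)
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return -np.sum(p * np.log(p + 1e-12), axis=1)

def select_high_entropy(logits, ratio=0.5):
    """Indices of the highest-entropy samples; EntProp would further
    transform these and route them through the auxiliary BN branch."""
    h = prediction_entropy(logits)
    k = max(1, int(ratio * len(h)))
    return np.argsort(h)[-k:]

logits = np.array([[10.0, 0.0, 0.0],   # confident prediction -> low entropy
                   [0.0, 0.0, 0.0]])   # uniform prediction   -> high entropy
picked = select_high_entropy(logits, ratio=0.5)
```

Transforming only the selected high-entropy samples is what, per the hypothesis, pushes them further from the in-distribution domain.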

[CV-112] Principled Probabilistic Imaging using Diffusion Models as Plug-and-Play Priors

链接: https://arxiv.org/abs/2405.18782
作者: Zihui Wu,Yu Sun,Yifan Chen,Bingliang Zhang,Yisong Yue,Katherine L. Bouman
关键词: modeling complex image, expressive image priors, recently shown outstanding, shown outstanding capability, Diffusion models
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Diffusion models (DMs) have recently shown outstanding capability in modeling complex image distributions, making them expressive image priors for solving Bayesian inverse problems. However, most existing DM-based methods rely on approximations in the generative process to be generic to different inverse problems, leading to inaccurate sample distributions that deviate from the target posterior defined within the Bayesian framework. To harness the generative power of DMs while avoiding such approximations, we propose a Markov chain Monte Carlo algorithm that performs posterior sampling for general inverse problems by reducing it to sampling the posterior of a Gaussian denoising problem. Crucially, we leverage a general DM formulation as a unified interface that allows for rigorously solving the denoising problem with a range of state-of-the-art DMs. We demonstrate the effectiveness of the proposed method on six inverse problems (three linear and three nonlinear), including a real-world black hole imaging problem. Experimental results indicate that our proposed method offers more accurate reconstructions and posterior estimation compared to existing DM-based imaging inverse methods.

[CV-113] Cardiovascular Disease Detection from Multi-View Chest X-rays with BI-Mamba

链接: https://arxiv.org/abs/2405.18533
作者: Zefan Yang,Jiajin Zhang,Ge Wang,Mannudeep K. Kalra,Pingkun Yan
关键词: Cardiovascular disease, CVD risk, chest X-ray, patient health management, effective patient health
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: Early accepted paper for MICCAI 2024

点击查看摘要

Abstract:Accurate prediction of Cardiovascular disease (CVD) risk in medical imaging is central to effective patient health management. Previous studies have demonstrated that imaging features in computed tomography (CT) can help predict CVD risk. However, CT entails notable radiation exposure, which may result in adverse health effects for patients. In contrast, chest X-ray emits significantly lower levels of radiation, offering a safer option. This rationale motivates our investigation into the feasibility of using chest X-ray for predicting CVD risk. Convolutional Neural Networks (CNNs) and Transformers are two established network architectures for computer-aided diagnosis. However, they struggle to model very high resolution chest X-ray due to the lack of large context modeling power or quadratic time complexity. Inspired by state space sequence models (SSMs), a new class of network architectures with sequence modeling power competitive with Transformers and linear time complexity, we propose Bidirectional Image Mamba (BI-Mamba) to complement the unidirectional SSMs with opposite directional information. BI-Mamba utilizes parallel forward and backward blocks to encode long-range dependencies of multi-view chest X-rays. We conduct extensive experiments on images from 10,395 subjects in the National Lung Screening Trial (NLST). Results show that BI-Mamba outperforms ResNet-50 and ViT-S with comparable parameter size, and saves a significant amount of GPU memory during training. Besides, BI-Mamba achieves promising performance compared with the previous state of the art in CT, unraveling the potential of chest X-ray for CVD risk prediction.
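The bidirectional idea, complementing a unidirectional state-space scan with a second scan over the reversed sequence, can be sketched with a toy linear recurrence. This is a minimal illustration under our own simplification; the real BI-Mamba blocks use learned, input-dependent (selective) SSM parameters.

```python
import numpy as np

def ssm_scan(x, a=0.9):
    """Toy linear state-space recurrence: h_t = a * h_{t-1} + x_t."""
    h = np.zeros_like(x[0])
    out = []
    for xt in x:
        h = a * h + xt
        out.append(h.copy())
    return np.stack(out)

def bidirectional_encode(tokens):
    """Concatenate a forward scan with a reversed ('backward') scan,
    so every position sees context from both directions."""
    fwd = ssm_scan(tokens)
    bwd = ssm_scan(tokens[::-1])[::-1]
    return np.concatenate([fwd, bwd], axis=-1)
```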

[CV-114] Adaptive Multiscale Retinal Diagnosis: A Hybrid Trio-Model Approach for Comprehensive Fundus Multi-Disease Detection Leveraging Transfer Learning and Siamese Networks

链接: https://arxiv.org/abs/2405.18449
作者: Yavuz Selim Inan
关键词: billion people worldwide, visual disorders, media haze, people worldwide, worldwide are suffering
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:WHO has declared that more than 2.2 billion people worldwide are suffering from visual disorders, such as media haze, glaucoma, and drusen. At least 1 billion of these cases could have been either prevented or successfully treated, yet they remain unaddressed due to poverty, a lack of specialists, inaccurate ocular fundus diagnoses by ophthalmologists, or the presence of a rare disease. To address this, the research has developed the Hybrid Trio-Network Model Algorithm for accurately diagnosing 12 distinct common and rare eye diseases. This algorithm utilized the RFMiD dataset of 3,200 fundus images and the Binary Relevance Method to detect diseases separately, ensuring expandability and avoiding incorrect correlations. Each detector, incorporating finely tuned hyperparameters to optimize performance, consisted of three feature components: A classical transfer learning CNN model, a two-stage CNN model, and a Siamese Network. The diagnosis was made using features extracted through this Trio-Model with Ensembled Machine Learning algorithms. The proposed model achieved an average accuracy of 97% and an AUC score of 0.96. Compared to past benchmark studies, an increase of over 10% in the F1-score was observed for most diseases. Furthermore, using the Siamese Network, the model successfully made predictions in diseases like optic disc pallor, which past studies failed to predict due to low confidence. This diagnostic tool presents a stable, adaptive, cost-effective, efficient, accessible, and fast solution for globalizing early detection of both common and rare diseases.

[CV-115] QUBIQ: Uncertainty Quantification for Biomedical Image Segmentation Challenge

链接: https://arxiv.org/abs/2405.18435
作者: Hongwei Bran,Fernando Navarro,Ivan Ezhov,Amirhossein Bayat,Dhritiman Das,Florian Kofler,Suprosanna Shit,Diana Waldmannstetter,Johannes C. Paetzold,Xiaobin Hu,Benedikt Wiestler,Lucas Zimmer,Tamaz Amiranashvili,Chinmay Prabhakar,Christoph Berger,Jonas Weidner,Michelle Alonso-Basant,Arif Rashid,Ujjwal Baid,Wesam Adel,Deniz Ali,Bhakti Baheti,Yingbin Bai,Ishaan Bhatt,Sabri Can Cetindag,Wenting Chen,Li Cheng,Prasad Dutand,Lara Dular,Mustafa A. Elattar,Ming Feng,Shengbo Gao,Henkjan Huisman,Weifeng Hu,Shubham Innani,Wei Jiat,Davood Karimi,Hugo J. Kuijf,Jin Tae Kwak,Hoang Long Le,Xiang Lia,Huiyan Lin,Tongliang Liu,Jun Ma,Kai Ma,Ting Ma,Ilkay Oksuz,Robbie Holland,Arlindo L. Oliveira,Jimut Bahan Pal,Xuan Pei,Maoying Qiao,Anindo Saha,Raghavendra Selvan,Linlin Shen,Joao Lourenco Silva,Ziga Spiclin,Sanjay Talbar,Dadong Wang,Wei Wang,Xiong Wang,Yin Wang,Ruiling Xia,Kele Xu,Yanwu Yan,Mert Yergin,Shuang Yu,Lingxi Zeng,YingLin Zhang,Jiachen Zhao,Yefeng Zheng,Martin Zukovec,Richard Do,Anton Becker,Amber Simpson,Ender Konukoglu,Andras Jakab,Spyridon Bakas,Leo Joskowicz,Bjoern Menze
关键词: medical image segmentation, medical image, reliable image segmentation, medical image interpretation, Medical Image Computing
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: initial technical report

点击查看摘要

Abstract:Uncertainty in medical image segmentation tasks, especially inter-rater variability, arising from differences in interpretations and annotations by various experts, presents a significant challenge in achieving consistent and reliable image segmentation. This variability not only reflects the inherent complexity and subjective nature of medical image interpretation but also directly impacts the development and evaluation of automated segmentation algorithms. Accurately modeling and quantifying this variability is essential for enhancing the robustness and clinical applicability of these algorithms. We report the set-up and summarize the benchmark results of the Quantification of Uncertainties in Biomedical Image Quantification Challenge (QUBIQ), which was organized in conjunction with the International Conferences on Medical Image Computing and Computer-Assisted Intervention (MICCAI) 2020 and 2021. The challenge focuses on the uncertainty quantification of medical image segmentation which considers the omnipresence of inter-rater variability in imaging datasets. The large collection of images with multi-rater annotations features various modalities such as MRI and CT; various organs such as the brain, prostate, kidney, and pancreas; and different image dimensions (2D vs. 3D). A total of 24 teams submitted different solutions to the problem, combining various baseline models, Bayesian neural networks, and ensemble model techniques. The obtained results indicate the importance of the ensemble models, as well as the need for further research to develop efficient methods for uncertainty quantification in 3D segmentation tasks.

机器学习

[LG-0] X-VILA: Cross-Modality Alignment for Large Language Model

链接: https://arxiv.org/abs/2405.19335
作者: Hanrong Ye,De-An Huang,Yao Lu,Zhiding Yu,Wei Ping,Andrew Tao,Jan Kautz,Song Han,Dan Xu,Pavlo Molchanov,Hongxu Yin
关键词: omni-modality model designed, large language models, incorporating image, omni-modality model, model designed
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: Technical Report

点击查看摘要

Abstract:We introduce X-VILA, an omni-modality model designed to extend the capabilities of large language models (LLMs) by incorporating image, video, and audio modalities. By aligning modality-specific encoders with LLM inputs and diffusion decoders with LLM outputs, X-VILA achieves cross-modality understanding, reasoning, and generation. To facilitate this cross-modality alignment, we curate an effective interleaved any-to-any modality instruction-following dataset. Furthermore, we identify a significant problem with the current cross-modality alignment method, which results in visual information loss. To address the issue, we propose a visual alignment mechanism with a visual embedding highway module. We then introduce a resource-efficient recipe for training X-VILA that exhibits proficiency in any-to-any modality conversation, surpassing previous approaches by large margins. X-VILA also showcases emergent properties across modalities even in the absence of similar training data. The project will be made open-source.

[LG-1] Self-Exploring Language Models: Active Preference Elicitation for Online Alignment

链接: https://arxiv.org/abs/2405.19332
作者: Shenao Zhang,Donghan Yu,Hiteshi Sharma,Ziyi Yang,Shuohang Wang,Hany Hassan,Zhaoran Wang
关键词: aligning Large Language, Reinforcement Learning, achieved significant success, Large Language Models, aligning Large
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Preference optimization, particularly through Reinforcement Learning from Human Feedback (RLHF), has achieved significant success in aligning Large Language Models (LLMs) to adhere to human intentions. Unlike offline alignment with a fixed dataset, online feedback collection from humans or AI on model generations typically leads to more capable reward models and better-aligned LLMs through an iterative process. However, achieving a globally accurate reward model requires systematic exploration to generate diverse responses that span the vast space of natural language. Random sampling from standard reward-maximizing LLMs alone is insufficient to fulfill this requirement. To address this issue, we propose a bilevel objective optimistically biased towards potentially high-reward responses to actively explore out-of-distribution regions. By solving the inner-level problem with the reparameterized reward function, the resulting algorithm, named Self-Exploring Language Models (SELM), eliminates the need for a separate RM and iteratively updates the LLM with a straightforward objective. Compared to Direct Preference Optimization (DPO), the SELM objective reduces indiscriminate favor of unseen extrapolations and enhances exploration efficiency. Our experimental results demonstrate that when finetuned on Zephyr-7B-SFT and Llama-3-8B-Instruct models, SELM significantly boosts the performance on instruction-following benchmarks such as MT-Bench and AlpacaEval 2.0, as well as various standard academic benchmarks in different settings. Our code and models are available at this https URL.

[LG-2] MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series

链接: https://arxiv.org/abs/2405.19327
作者: Ge Zhang,Scott Qu,Jiaheng Liu,Chenchen Zhang,Chenghua Lin,Chou Leuang Yu,Danny Pan,Esther Cheng,Jie Liu,Qunshu Lin,Raven Yuan,Tuney Zheng,Wei Pang,Xinrun Du,Yiming Liang,Yinghao Ma,Yizhi Li,Ziyang Ma,Bill Lin,Emmanouil Benetos,Huan Yang,Junting Zhou,Kaijing Ma,Minghao Liu,Morry Niu,Noah Wang,Quehry Que,Ruibo Liu,Sine Liu,Shawn Guo,Soren Gao,Wangchunshu Zhou,Xinyue Zhang,Yizhi Zhou,Yubo Wang,Yuelin Bai,Yuhan Zhang,Yuxiang Zhang,Zenith Wang,Zhenzhu Yang,Zijian Zhao,Jiajun Zhang,Wanli Ouyang,Wenhao Huang,Wenhu Chen
关键词: made great strides, achieve unprecedented performance, LLMs, made great, great strides
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: this https URL

点击查看摘要

Abstract:Large Language Models (LLMs) have made great strides in recent years to achieve unprecedented performance across different tasks. However, due to commercial interest, the most competitive models like GPT, Gemini, and Claude have been gated behind proprietary interfaces without disclosing the training details. Recently, many institutions have open-sourced several strong LLMs like LLaMA-3, comparable to existing closed-source LLMs. However, only the model’s weights are provided with most details (e.g., intermediate checkpoints, pre-training corpus, and training code, etc.) being undisclosed. To improve the transparency of LLMs, research communities have formed to open-source truly open LLMs (e.g., Pythia, Amber, OLMo), where more details (e.g., pre-training corpus and training code) are being provided. These models have greatly advanced the scientific study of these large models including their strengths, weaknesses, biases and risks. However, we observe that the existing truly open LLMs on reasoning, knowledge, and coding tasks are still inferior to existing state-of-the-art LLMs with similar model sizes. To this end, we open-source MAP-Neo, a highly capable and transparent bilingual language model with 7B parameters trained from scratch on 4.5T high-quality tokens. Our MAP-Neo is the first fully open-sourced bilingual LLM with performance comparable to existing state-of-the-art LLMs. Moreover, we open-source all details to reproduce our MAP-Neo, where the cleaned pre-training corpus, data cleaning pipeline, checkpoints, and well-optimized training/evaluation framework are provided. Finally, we hope our MAP-Neo will enhance and strengthen the open research community and inspire more innovation and creativity to facilitate further improvements of LLMs.

[LG-3] Are Large Language Models Chameleons?

链接: https://arxiv.org/abs/2405.19323
作者: Mingmeng Geng,Sihong He,Roberto Trotta
关键词: large language models, personality tendencies, large language, worldviews and personality, European Social Survey
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注: 16 pages, 8 figures

点击查看摘要

Abstract:Do large language models (LLMs) have their own worldviews and personality tendencies? Simulations in which an LLM was asked to answer subjective questions were conducted more than 1 million times. Comparison of the responses from different LLMs with real data from the European Social Survey (ESS) suggests that the effect of prompts on bias and variability is fundamental, highlighting major cultural, age, and gender biases. Methods for measuring the difference between LLMs and survey data are discussed, such as calculating weighted means and a new proposed measure inspired by Jaccard similarity. We conclude that it is important to analyze the robustness and variability of prompts before using LLMs to model individual decisions or collective behavior, as their imitation abilities are approximate at best.
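The Jaccard-inspired measure mentioned above builds on plain Jaccard similarity between two sets of responses, which is straightforward to compute. This is a generic sketch of the underlying similarity, not the paper's exact proposed measure.

```python
def jaccard_similarity(responses_a, responses_b):
    """|A ∩ B| / |A ∪ B| between two collections of answers."""
    a, b = set(responses_a), set(responses_b)
    if not a and not b:
        return 1.0  # two empty answer sets count as identical
    return len(a & b) / len(a | b)
```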

[LG-4] Value-Incentivized Preference Optimization: A Unified Approach to Online and Offline RLHF

链接: https://arxiv.org/abs/2405.19320
作者: Shicong Cen,Jincheng Mei,Katayoon Goshvadi,Hanjun Dai,Tong Yang,Sherry Yang,Dale Schuurmans,Yuejie Chi,Bo Dai
关键词: demonstrated great promise, human feedback, aligning large language, large language models, preference data
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Reinforcement learning from human feedback (RLHF) has demonstrated great promise in aligning large language models (LLMs) with human preference. Depending on the availability of preference data, both online and offline RLHF are active areas of investigation. A key bottleneck is understanding how to incorporate uncertainty estimation in the reward function learned from the preference data for RLHF, regardless of how the preference data is collected. While the principles of optimism or pessimism under uncertainty are well-established in standard reinforcement learning (RL), a practically-implementable and theoretically-grounded form amenable to large language models is not yet available, as standard techniques for constructing confidence intervals become intractable under arbitrary policy parameterizations. In this paper, we introduce a unified approach to online and offline RLHF – value-incentivized preference optimization (VPO) – which regularizes the maximum-likelihood estimate of the reward function with the corresponding value function, modulated by a sign indicating whether optimism or pessimism is chosen. VPO also directly optimizes the policy with implicit reward modeling, and therefore shares a simpler RLHF pipeline similar to direct preference optimization. Theoretical guarantees of VPO are provided for both online and offline settings, matching the rates of their standard RL counterparts. Moreover, experiments on text summarization and dialog verify the practicality and effectiveness of VPO.

[LG-5] Adaptive Generalized Neyman Allocation: Local Asymptotic Minimax Optimal Best Arm Identification

链接: https://arxiv.org/abs/2405.19317
作者: Masahiro Kato
关键词: Adaptive Generalized Neyman, Generalized Neyman Allocation, Neyman Allocation, minimax optimal strategy, study investigates
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Econometrics (econ.EM); Methodology (stat.ME); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:This study investigates a local asymptotic minimax optimal strategy for fixed-budget best arm identification (BAI). We propose the Adaptive Generalized Neyman Allocation (AGNA) strategy and show that its worst-case upper bound of the probability of misidentifying the best arm aligns with the worst-case lower bound under the small-gap regime, where the gap between the expected outcomes of the best and suboptimal arms is small. Our strategy corresponds to a generalization of the Neyman allocation for two-armed bandits (Neyman, 1934; Kaufmann et al., 2016) and a refinement of existing strategies such as the ones proposed by Glynn and Juneja (2004) and Shin et al. (2018). Compared to Komiyama et al. (2022), which proposes a minimax rate-optimal strategy, our proposed strategy has a tighter upper bound that exactly matches the lower bound, including the constant terms, by restricting the class of distributions to the class of small-gap distributions. Our result contributes to the longstanding open issue about the existence of asymptotically optimal strategies in fixed-budget BAI, by presenting the local asymptotic minimax optimal strategy.
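For the two-armed case, the classical Neyman allocation that AGNA generalizes simply samples each arm in proportion to its outcome standard deviation. The sketch below is the textbook rule only, not the paper's full adaptive strategy (which estimates these quantities online).

```python
def neyman_allocation(sigma_1, sigma_2):
    """Fraction of the sampling budget to spend on each of two arms,
    proportional to each arm's outcome standard deviation."""
    total = sigma_1 + sigma_2
    return sigma_1 / total, sigma_2 / total
```

Intuitively, noisier arms need more samples to estimate their means to the same precision, which is why the budget tilts toward the larger standard deviation.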

[LG-6] Robust Preference Optimization through Reward Model Distillation

链接: https://arxiv.org/abs/2405.19316
作者: Adam Fisch,Jacob Eisenstein,Vicky Zayats,Alekh Agarwal,Ahmad Beirami,Chirag Nagpal,Pete Shaw,Jonathan Berant
关键词: Language model, involves maximizing, preference, Direct Preference Optimization, reward model
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Language model (LM) post-training (or alignment) involves maximizing a reward function that is derived from preference annotations. Direct Preference Optimization (DPO) is a popular offline alignment method that trains a policy directly on preference data without the need to train a reward model or apply reinforcement learning. However, typical preference datasets have only a single, or at most a few, annotation per preference pair, which causes DPO to overconfidently assign rewards that trend towards infinite magnitude. This frequently leads to degenerate policies, sometimes causing even the probabilities of the preferred generations to go to zero. In this work, we analyze this phenomenon and propose distillation to get a better proxy for the true preference distribution over generation pairs: we train the LM to produce probabilities that match the distribution induced by a reward model trained on the preference data. Moreover, to account for uncertainty in the reward model we are distilling from, we optimize against a family of reward models that, as a whole, is likely to include at least one reasonable proxy for the preference distribution. Our results show that distilling from such a family of reward models leads to improved robustness to distribution shift in preference annotations, while preserving the simple supervised nature of DPO.
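The distillation objective described above, matching the policy's pairwise preference probability to the one induced by a reward model, can be sketched as a cross-entropy on a single preference pair. This is an illustrative reduction under our own simplification (a Bradley-Terry parameterization of the soft target); function names are hypothetical.

```python
import math

def reward_induced_preference(r_chosen, r_rejected, beta=1.0):
    """Bradley-Terry probability that the chosen response is preferred,
    given the reward model's scores for the two responses."""
    return 1.0 / (1.0 + math.exp(-beta * (r_chosen - r_rejected)))

def distillation_loss(p_policy, p_reward):
    """Cross-entropy between the reward model's soft label and the
    policy's implied preference probability."""
    return -(p_reward * math.log(p_policy)
             + (1.0 - p_reward) * math.log(1.0 - p_policy))
```

Because the target is a soft probability rather than a hard 0/1 label, the policy is no longer pushed toward the infinite-magnitude implicit rewards that single-annotation DPO can drift to.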

[LG-7] Matryoshka Query Transformer for Large Vision-Language Models

链接: https://arxiv.org/abs/2405.19315
作者: Wenbo Hu,Zi-Yi Dou,Liunian Harold Li,Amita Kamath,Nanyun Peng,Kai-Wei Chang
关键词: Large Vision-Language Models, Large Vision-Language, visual tokens, tokens, visual
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: Preprint. Our code and model are publicly available at this https URL

点击查看摘要

Abstract:Large Vision-Language Models (LVLMs) typically encode an image into a fixed number of visual tokens (e.g., 576) and process these tokens with a language model. Despite their strong performance, LVLMs face challenges in adapting to varying computational constraints. This raises the question: can we achieve flexibility in the number of visual tokens to suit different tasks and computational resources? We answer this with an emphatic yes. Inspired by Matryoshka Representation Learning, we introduce the Matryoshka Query Transformer (MQT), capable of encoding an image into m visual tokens during inference, where m can be any number up to a predefined maximum. This is achieved by employing a query transformer with M latent query tokens to compress the visual embeddings. During each training step, we randomly select m ≤ M latent query tokens and train the model using only these first m tokens, discarding the rest. Combining MQT with LLaVA, we train a single model once, and flexibly and drastically reduce the number of inference-time visual tokens while maintaining similar or better performance compared to training independent models for each number of tokens. Our model, MQT-LLAVA, matches LLaVA-1.5 performance across 11 benchmarks using a maximum of 256 tokens instead of LLaVA’s fixed 576. Reducing to 16 tokens (8x fewer TFLOPs) only sacrifices the performance by 2.4 points on MMBench. On certain tasks such as ScienceQA and MMMU, we can even go down to only 2 visual tokens with performance drops of just 3% and 6% each. Our exploration of the trade-off between the accuracy and computational cost brought about by the number of visual tokens facilitates future research to achieve the best of both worlds.
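The elastic-token training trick, drawing m ≤ M latent query tokens per step and discarding the rest, is simple to sketch. This is an illustration only; the function name and the uniform sampling of m are our own assumptions about the selection step.

```python
import random

def sample_query_tokens(latent_queries, rng=random):
    """Keep a random prefix of the M latent query tokens for this
    training step, so the model learns to work at every token budget."""
    m = rng.randint(1, len(latent_queries))  # m drawn uniformly from 1..M
    return latent_queries[:m]
```

At inference time, the same model can then be run with any fixed prefix length m, trading accuracy for compute without retraining.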

[LG-8] Measuring and Mitigating Bias for Tabular Datasets with Multiple Protected Attributes

链接: https://arxiv.org/abs/2405.19300
作者: Manh Khoi Duong,Stefan Conrad
关键词: current corrigendum, propose and present, multiple protected attributes, European Union, datasets
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 12 pages, 13 figures

点击查看摘要

Abstract:Motivated by the recital (67) of the current corrigendum of the AI Act in the European Union, we propose and present measures and mitigation strategies for discrimination in tabular datasets. We specifically focus on datasets that contain multiple protected attributes, such as nationality, age, and sex. This makes measuring and mitigating bias more challenging, as many existing methods are designed for a single protected attribute. This paper comes with a twofold contribution: Firstly, new discrimination measures are introduced. These measures are categorized in our framework along with existing ones, guiding researchers and practitioners in choosing the right measure to assess the fairness of the underlying dataset. Secondly, a novel application of an existing bias mitigation method, FairDo, is presented. We show that this strategy can mitigate any type of discrimination, including intersectional discrimination, by transforming the dataset. By conducting experiments on real-world datasets (Adult, Bank, Compas), we demonstrate that de-biasing datasets with multiple protected attributes is achievable. Further, the transformed fair datasets do not compromise any of the tested machine learning models’ performances significantly when trained on these datasets compared to the original datasets. Discrimination was reduced by up to 83% in our experimentation. For most experiments, the disparity between protected groups was reduced by at least 7% and 27% on average. Generally, the findings show that the mitigation strategy used is effective, and this study contributes to the ongoing discussion on the implementation of the European Union’s AI Act.
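A common way to quantify the "disparity between protected groups" reported above is the gap in positive-outcome rates across groups. The sketch below is one generic such measure (a statistical-parity-style gap), not one of the paper's newly introduced measures.

```python
def outcome_rate_gap(outcomes, groups):
    """Largest difference in positive-outcome rate across protected groups.

    outcomes: iterable of 0/1 labels; groups: group label per sample.
    """
    rates = {}
    for g in set(groups):
        member_outcomes = [o for o, gg in zip(outcomes, groups) if gg == g]
        rates[g] = sum(member_outcomes) / len(member_outcomes)
    return max(rates.values()) - min(rates.values())
```

With multiple protected attributes, the same computation can be applied to intersectional groups (e.g., one group label per combination of nationality, age band, and sex).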

[LG-9] Neural Isometries: Taming Transformations for Equivariant ML

链接: https://arxiv.org/abs/2405.19296
作者: Thomas W. Mitchel,Michael Taylor,Vincent Sitzmann
关键词: tractable analytical expression, defy tractable analytical, Real-world geometry, vision tasks, analytical expression
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Real-world geometry and 3D vision tasks are replete with challenging symmetries that defy tractable analytical expression. In this paper, we introduce Neural Isometries, an autoencoder framework which learns to map the observation space to a general-purpose latent space wherein encodings are related by isometries whenever their corresponding observations are geometrically related in world space. Specifically, we regularize the latent space such that maps between encodings preserve a learned inner product and commute with a learned functional operator, in the same manner as rigid-body transformations commute with the Laplacian. This approach forms an effective backbone for self-supervised representation learning, and we demonstrate that a simple off-the-shelf equivariant network operating in the pre-trained latent space can achieve results on par with meticulously-engineered, handcrafted networks designed to handle complex, nonlinear symmetries. Furthermore, isometric maps capture information about the respective transformations in world space, and we show that this allows us to regress camera poses directly from the coefficients of the maps between encodings of adjacent views of a scene.

[LG-10] Understanding and Minimising Outlier Features in Neural Network Training

链接: https://arxiv.org/abs/2405.19279
作者: Bobby He,Lorenzo Noci,Daniele Paliotta,Imanol Schlag,Thomas Hofmann
关键词: magnitudes significantly exceed, Outlier Features, activation magnitudes significantly, neural network, magnitudes significantly
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Outlier Features (OF) are neurons whose activation magnitudes significantly exceed the average over a neural network’s (NN) width. They are well known to emerge during standard transformer training and have the undesirable effect of hindering quantisation in afflicted models. Despite their practical importance, little is known about why OFs emerge during training, nor how one can minimise them. Our work focuses on the above questions, first identifying several quantitative metrics, such as the kurtosis over neuron activation norms, to measure OFs. With these metrics, we study how architectural and optimisation choices influence OFs, and provide practical insights to minimise OFs during training. As highlights, we emphasise the importance of controlling signal propagation throughout training, and propose the Outlier Protected transformer block, which removes standard Pre-Norm layers to mitigate OFs, without loss of convergence speed or training stability. Overall, our findings shed new light on our understanding of, our ability to prevent, and the complexity of this important facet in NN training dynamics.

[LG-11] Deep Latent Variable Modeling of Physiological Signals

链接: https://arxiv.org/abs/2405.19277
作者: Khuong Vo
关键词: capturing complex distributions, complex distributions, capturing complex, latent variable model, latent variable
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:A deep latent variable model is a powerful method for capturing complex distributions. These models assume that underlying structures, but unobserved, are present within the data. In this dissertation, we explore high-dimensional problems related to physiological monitoring using latent variable models. First, we present a novel deep state-space model to generate electrical waveforms of the heart using optically obtained signals as inputs. This can bring about clinical diagnoses of heart disease via simple assessment through wearable devices. Second, we present a brain signal modeling scheme that combines the strengths of probabilistic graphical models and deep adversarial learning. The structured representations can provide interpretability and encode inductive biases to reduce the data complexity of neural oscillations. The efficacy of the learned representations is further studied in epilepsy seizure detection formulated as an unsupervised learning problem. Third, we propose a framework for the joint modeling of physiological measures and behavior. Existing methods to combine multiple sources of brain data provided are limited. Direct analysis of the relationship between different types of physiological measures usually does not involve behavioral data. Our method can identify the unique and shared contributions of brain regions to behavior and can be used to discover new functions of brain regions. The success of these innovative computational methods would allow the translation of biomarker findings across species and provide insight into neurocognitive analysis in numerous biological studies and clinical diagnoses, as well as emerging consumer applications.

[LG-12] Mitigating Disparate Impact of Differential Privacy in Federated Learning through Robust Clustering

链接: https://arxiv.org/abs/2405.19272
作者: Saber Malekmohammadi,Afaf Taik,Golnoosh Farnadi
关键词: decentralized machine learning, incorporates Differential Privacy, private federated learning, Federated Learning, incorporates Differential
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Federated Learning (FL) is a decentralized machine learning (ML) approach that keeps data localized and often incorporates Differential Privacy (DP) to enhance privacy guarantees. Similar to previous work on DP in ML, we observed that differentially private federated learning (DPFL) introduces performance disparities, particularly affecting minority groups. Recent work has attempted to address performance fairness in vanilla FL through clustering, but this method remains sensitive and prone to errors, which are further exacerbated by the DP noise in DPFL. To fill this gap, in this paper, we propose a novel clustered DPFL algorithm designed to effectively identify clients’ clusters in highly heterogeneous settings while maintaining high accuracy with DP guarantees. To this end, we propose to cluster clients based on both their model updates and training loss values. Our proposed approach also addresses the server’s uncertainties in clustering clients’ model updates by employing larger batch sizes along with Gaussian Mixture Model (GMM) to alleviate the impact of noise and potential clustering errors, especially in privacy-sensitive scenarios. We provide theoretical analysis of the effectiveness of our proposed approach. We also extensively evaluate our approach across diverse data distributions and privacy budgets and show its effectiveness in mitigating the disparate impact of DP in FL settings with a small computational cost.

[LG-13] Rich-Observation Reinforcement Learning with Continuous Latent Dynamics

Link: https://arxiv.org/abs/2405.19269
Authors: Yuda Song, Lili Wu, Dylan J. Foster, Akshay Krishnamurthy
Keywords: reliability remain major, remain major bottlenecks, high-dimensional perceptual inputs, Sample-efficiency and reliability, Lipschitz continuous dynamics
Subjects: Machine Learning (cs.LG)
Comments: 63 pages, 4 figures, published at ICML 2024

Click to view abstract

Abstract:Sample-efficiency and reliability remain major bottlenecks toward wide adoption of reinforcement learning algorithms in continuous settings with high-dimensional perceptual inputs. Toward addressing these challenges, we introduce a new theoretical framework, RichCLD (Rich-Observation RL with Continuous Latent Dynamics), in which the agent performs control based on high-dimensional observations, but the environment is governed by low-dimensional latent states and Lipschitz continuous dynamics. Our main contribution is a new algorithm for this setting that is provably statistically and computationally efficient. The core of our algorithm is a new representation learning objective; we show that prior representation learning schemes tailored to discrete dynamics do not naturally extend to the continuous setting. Our new objective is amenable to practical implementation, and empirically, we find that it compares favorably to prior schemes in a standard evaluation protocol. We further provide several insights into the statistical complexity of the RichCLD framework, in particular proving that certain notions of Lipschitzness that admit sample-efficient learning in the absence of rich observations are insufficient in the rich-observation setting.

[LG-14] Weak-to-Strong Search: Align Large Language Models via Searching over Small Language Models

Link: https://arxiv.org/abs/2405.19262
Authors: Zhanhui Zhou, Zhixuan Liu, Jie Liu, Zhichen Dong, Chao Yang, Yu Qiao
Keywords: large language model, Large language, human preferences, Large, fine-tuned to align
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Large language models are usually fine-tuned to align with human preferences. However, fine-tuning a large language model can be challenging. In this work, we introduce *weak-to-strong search*, framing the alignment of a large language model as a test-time greedy search to maximize the log-likelihood difference between small tuned and untuned models while sampling from the frozen large model. This method serves both as (i) a compute-efficient model up-scaling strategy that avoids directly tuning the large model and as (ii) an instance of weak-to-strong generalization that enhances a strong model with weak test-time guidance. Empirically, we demonstrate the flexibility of weak-to-strong search across different tasks. In controlled-sentiment generation and summarization, we use tuned and untuned `gpt2` models to effectively improve the alignment of large models without additional training. Crucially, in a more difficult instruction-following benchmark, AlpacaEval 2.0, we show that reusing off-the-shelf small model pairs (e.g., `zephyr-7b-beta` and its untuned version) can significantly improve the length-controlled win rates of both white-box and black-box large models against `gpt-4-turbo` (e.g., 34.4 → 37.9 for `Llama-3-70B-Instruct` and 16.0 → 20.1 for `gpt-3.5-turbo-instruct`), despite the small models' low win rates of ≈ 10.0.
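
The re-ranking idea at the heart of weak-to-strong search can be illustrated with a toy single-step example: candidates from the frozen large model are scored by adding the log-likelihood difference between the small tuned and untuned models. This is only a sketch of one reading of the method; the function name `weak_to_strong_step` and the token distributions below are made up for illustration, not taken from the paper's code.

```python
import numpy as np

def weak_to_strong_step(logp_large, logp_small_tuned, logp_small_untuned):
    """Greedily pick the next token: the frozen large model's log-probs are
    re-ranked by the log-likelihood difference between the small tuned and
    untuned models (a toy, single-token reading of weak-to-strong search)."""
    score = logp_large + (logp_small_tuned - logp_small_untuned)
    return int(np.argmax(score))

# Toy vocabulary of 4 tokens: the large model slightly prefers token 0,
# but the tuned small model's preference shifts the choice to token 2.
logp_large   = np.log(np.array([0.35, 0.30, 0.25, 0.10]))
logp_tuned   = np.log(np.array([0.10, 0.20, 0.60, 0.10]))
logp_untuned = np.log(np.array([0.25, 0.25, 0.25, 0.25]))

print(weak_to_strong_step(logp_large, logp_tuned, logp_untuned))  # → 2
```

With a uniform untuned model, the difference term reduces to the tuned small model's log-probability, so the search interpolates between the large model's and the tuned small model's preferences.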

[LG-15] Faster Cascades via Speculative Decoding

Link: https://arxiv.org/abs/2405.19261
Authors: Harikrishna Narasimhan, Wittawat Jitkrittum, Ankit Singh Rawat, Seungyeon Kim, Neha Gupta, Aditya Krishna Menon, Sanjiv Kumar
Keywords: models’ inference efficiency, improving language models’, language models’ inference, inference efficiency, speculative decoding
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Cascades and speculative decoding are two common approaches to improving language models’ inference efficiency. Both approaches involve interleaving models of different sizes, but via fundamentally distinct mechanisms: cascades employ a deferral rule that invokes the larger model only for “hard” inputs, while speculative decoding uses speculative execution to primarily invoke the larger model in parallel verification mode. These mechanisms offer different benefits: empirically, cascades are often capable of yielding better quality than even the larger model, while theoretically, speculative decoding offers a guarantee of quality-neutrality. In this paper, we leverage the best of both these approaches by designing new speculative cascading techniques that implement their deferral rule through speculative execution. We characterize the optimal deferral rule for our speculative cascades, and employ a plug-in approximation to the optimal rule. Through experiments with T5 models on benchmark language tasks, we show that the proposed approach yields better cost-quality trade-offs than cascading and speculative decoding baselines.
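
The combination of the two mechanisms can be sketched in a few lines: a small model drafts tokens, the large model verifies them in parallel, and a deferral rule decides per token whether to keep the draft. The threshold rule below is a naive stand-in (the paper derives an optimal deferral rule and a plug-in approximation to it); all names and values are illustrative.

```python
def speculative_cascade(drafts, small_conf, large_tokens, threshold=0.7):
    """Toy deferral rule: keep the small model's draft token when its
    confidence clears the threshold, otherwise defer to the token the
    large model produced during parallel verification."""
    out = []
    for draft, conf, verified in zip(drafts, small_conf, large_tokens):
        out.append(draft if conf >= threshold else verified)
    return out

drafts       = [5, 7, 7, 2]            # small model's drafted token ids
small_conf   = [0.9, 0.4, 0.95, 0.6]   # small model's confidence per token
large_tokens = [5, 8, 7, 3]            # large model's tokens from verification
print(speculative_cascade(drafts, small_conf, large_tokens))  # → [5, 8, 7, 3]
```

Lowering the threshold keeps more draft tokens (cheaper, cascade-like behavior); raising it defers more often to the large model (speculative-decoding-like behavior).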

[LG-16] Weak Generative Sampler to Efficiently Sample Invariant Distribution of Stochastic Differential Equation

Link: https://arxiv.org/abs/2405.19256
Authors: Zhiqiang Cai, Yu Cao, Yuanfei Huang, Xiang Zhou
Keywords: Ito diffusion process, diffusion process presents, Ito diffusion, Planck equation, Sampling invariant distributions
Subjects: Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments: 24 pages, 10 figures

Click to view abstract

Abstract:Sampling invariant distributions from an Ito diffusion process presents a significant challenge in stochastic simulation. Traditional numerical solvers for stochastic differential equations require both a fine step size and a lengthy simulation period, resulting in both biased and correlated samples. Current deep learning-based methods solve the stationary Fokker–Planck equation to determine the invariant probability density function in the form of deep neural networks, but they generally do not directly address the problem of sampling from the computed density function. In this work, we introduce a framework that employs a weak generative sampler (WGS) to directly generate independent and identically distributed (iid) samples induced by a transformation map derived from the stationary Fokker–Planck equation. Our proposed loss function is based on the weak form of the Fokker–Planck equation, integrating normalizing flows to characterize the invariant distribution and facilitate sample generation from the base distribution. Our randomized test function circumvents the need for mini-max optimization in the traditional weak formulation. Distinct from conventional generative models, our method neither necessitates the computationally intensive calculation of the Jacobian determinant nor the invertibility of the transformation map. A crucial component of our framework is the adaptively chosen family of test functions in the form of Gaussian kernel functions with centres selected from the generated data samples. Experimental results on several benchmark examples demonstrate the effectiveness of our method, which offers both low computational costs and excellent capability in exploring multiple metastable states.

[LG-17] Comparative Study of Neighbor-based Methods for Local Outlier Detection

Link: https://arxiv.org/abs/2405.19247
Authors: Zhuang Qi, Junlin Zhang, Xiaming Chen, Xin Qi
Keywords: outlier detection problem, outlier detection, existing outlier detection, outlier detection method, neighbor-based outlier detection
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:The neighbor-based method has become a powerful tool for the outlier detection problem, which aims to infer the abnormal degree of a sample based on the compactness of the sample and its neighbors. However, existing methods commonly focus on designing different processes to locate outliers in the dataset, while the contributions of different types of neighbors to outlier detection have not been well discussed. To this end, this paper studies the role of neighbors in existing outlier detection algorithms and introduces a taxonomy, which uses the three-level components of information, neighbor and methodology to define hybrid methods. This taxonomy can serve as a paradigm in which a novel neighbor-based outlier detection method can be proposed by combining different components. A large number of comparative experiments were conducted on synthetic and real-world datasets in terms of performance comparison and case study, and the results show that reverse k-nearest-neighbor based methods achieve promising performance and that dynamic selection methods are suitable for working in high-dimensional space. Notably, it is verified that rationally selecting components from this taxonomy may create an algorithm superior to existing methods.
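
The reverse k-nearest-neighbor component highlighted in the abstract is easy to illustrate: for each point, count how many other points include it among their k nearest neighbours; isolated points end up in few neighbour lists. This is a minimal sketch of that one component only, not of the paper's full taxonomy or any specific algorithm.

```python
import numpy as np

def reverse_knn_counts(X, k=3):
    """For each point, count how many other points list it among their
    k nearest neighbours. Low counts flag outlier candidates."""
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # a point is not its own neighbour
    knn = np.argsort(d, axis=1)[:, :k]   # each point's k nearest neighbours
    counts = np.zeros(n, dtype=int)
    for neighbours in knn:
        counts[neighbours] += 1
    return counts

# A tight cluster plus one far-away point: the isolated point appears
# in (almost) no neighbour lists, so its reverse-kNN count is lowest.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [5.0, 5.0]])
counts = reverse_knn_counts(X, k=2)
print(counts)
```

Thresholding or ranking these counts yields an outlier score; real methods refine this with density information and dynamic neighbour selection.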

[LG-18] ConceptPrune: Concept Editing in Diffusion Models via Skilled Neuron Pruning

Link: https://arxiv.org/abs/2405.19237
Authors: Ruchika Chavhan, Da Li, Timothy Hospedales
Keywords: impressive image-generation capabilities, perpetuating societal biases, demonstrated impressive image-generation, generating unsafe content, violating copyright
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:While large-scale text-to-image diffusion models have demonstrated impressive image-generation capabilities, there are significant concerns about their potential misuse for generating unsafe content, violating copyright, and perpetuating societal biases. Recently, the text-to-image generation community has begun addressing these concerns by editing or unlearning undesired concepts from pre-trained models. However, these methods often involve data-intensive and inefficient fine-tuning or utilize various forms of token remapping, rendering them susceptible to adversarial jailbreaks. In this paper, we present a simple and effective training-free approach, ConceptPrune, wherein we first identify critical regions within pre-trained models responsible for generating undesirable concepts, thereby facilitating straightforward concept unlearning via weight pruning. Experiments across a range of concepts including artistic styles, nudity, object erasure, and gender debiasing demonstrate that target concepts can be efficiently erased by pruning a tiny fraction, approximately 0.12% of total weights, enabling multi-concept erasure and robustness against various white-box and black-box adversarial attacks.
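
The pruning step can be sketched generically: score each weight's importance for the unwanted concept, then zero out the top fraction (the 0.12% figure comes from the abstract). The magnitude-based importance score below is a placeholder, not the paper's skilled-neuron criterion; all names are illustrative.

```python
import numpy as np

def prune_by_importance(W, importance, fraction=0.0012):
    """Zero out the `fraction` of weights with the highest importance
    scores. Toy stand-in for concept erasure via weight pruning."""
    k = max(1, int(round(fraction * W.size)))
    idx = np.argpartition(importance.ravel(), -k)[-k:]  # k largest scores
    W_pruned = W.copy().ravel()
    W_pruned[idx] = 0.0
    return W_pruned.reshape(W.shape)

rng = np.random.default_rng(0)
W = rng.normal(size=(100, 100))
importance = np.abs(W)  # placeholder score; the paper derives its own
W_pruned = prune_by_importance(W, importance, fraction=0.0012)
print(int((W_pruned == 0).sum()))  # → 12 of 10,000 weights removed
```

Because the edit is a sparse in-place zeroing, it needs no fine-tuning and composes across concepts by taking the union of pruned indices.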

[LG-19] Exploring the impact of traffic signal control and connected and automated vehicles on intersections safety: A deep reinforcement learning approach

Link: https://arxiv.org/abs/2405.19236
Authors: Amir Hossein Karbasi, Hao Yang, Saiedeh Razavi
Keywords: traffic signal control, traffic signal, signal control, adaptive traffic signal, pose significant risks
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: TRB 103rd Annual Meeting

Click to view abstract

Abstract:In transportation networks, intersections pose significant risks of collisions due to conflicting movements of vehicles approaching from different directions. To address this issue, various tools can exert influence on traffic safety both directly and indirectly. This study focuses on investigating the impact of adaptive signal control and connected and automated vehicles (CAVs) on intersection safety using a deep reinforcement learning approach. The objective is to assess the individual and combined effects of CAVs and adaptive traffic signal control on traffic safety, considering rear-end and crossing conflicts. The study employs a Deep Q Network (DQN) to regulate traffic signals and driving behaviors of both CAVs and Human-Driven Vehicles (HDVs), and uses the Time To Collision (TTC) metric to evaluate safety. The findings demonstrate a significant reduction in rear-end and crossing conflicts through the combined implementation of CAVs and DQN-based traffic signal control. Additionally, the long-term positive effects of CAVs on safety are similar to the short-term effects of combined CAVs and DQN-based traffic signal control. Overall, the study emphasizes the potential benefits of integrating CAVs and adaptive traffic signal control approaches in order to enhance traffic safety. The findings of this study could provide valuable insights for city officials and transportation authorities in developing effective strategies to improve safety at signalized intersections.
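
The TTC metric used for evaluation has a simple closed form for a rear-end pair: the distance gap divided by the closing speed, defined only when the follower is actually closing in. A minimal sketch (units assumed to be metres and metres per second; the study's full conflict definitions are richer):

```python
def time_to_collision(gap_m, v_follower, v_leader):
    """Time To Collision for a rear-end pair: gap / closing speed.
    Returns float('inf') when the follower is not closing in,
    i.e. there is no conflict."""
    closing_speed = v_follower - v_leader
    if closing_speed <= 0:
        return float('inf')
    return gap_m / closing_speed

print(time_to_collision(30.0, 20.0, 10.0))  # → 3.0 seconds
print(time_to_collision(30.0, 10.0, 20.0))  # → inf (follower is slower)
```

Safety studies typically count a conflict whenever TTC falls below a threshold (often a few seconds), which is how a reduction in conflicts can be measured.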

[LG-20] Forward-Backward Knowledge Distillation for Continual Clustering

Link: https://arxiv.org/abs/2405.19234
Authors: Mohammadreza Sadeghi, Zihan Wang, Narges Armanfard
Keywords: enabling neural networks, explicit label information, Unsupervised Continual Learning, Unsupervised Continual Clustering, knowledge distillation
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Unsupervised Continual Learning (UCL) is a burgeoning field in machine learning, focusing on enabling neural networks to sequentially learn tasks without explicit label information. Catastrophic Forgetting (CF), where models forget previously learned tasks upon learning new ones, poses a significant challenge in continual learning, especially in UCL, where labeled information of data is not accessible. CF mitigation strategies, such as knowledge distillation and replay buffers, often face memory inefficiency and privacy issues. Although current research in UCL has endeavored to refine data representations and address CF in streaming data contexts, there is a noticeable lack of algorithms specifically designed for unsupervised clustering. To fill this gap, in this paper, we introduce the concept of Unsupervised Continual Clustering (UCC). We propose Forward-Backward Knowledge Distillation for unsupervised Continual Clustering (FBCC) to counteract CF within the context of UCC. FBCC employs a single continual learner (the “teacher”) with a cluster projector, along with multiple student models, to address the CF issue. The proposed method consists of two phases: Forward Knowledge Distillation, where the teacher learns new clusters while retaining knowledge from previous tasks with guidance from specialized student models, and Backward Knowledge Distillation, where a student model mimics the teacher’s behavior to retain task-specific knowledge, aiding the teacher in subsequent tasks. FBCC marks a pioneering approach to UCC, demonstrating enhanced performance and memory efficiency in clustering across various tasks, outperforming the application of clustering algorithms to the latent space of state-of-the-art UCL algorithms.

[LG-21] Synthetic Potential Outcomes for Mixtures of Treatment Effects

Link: https://arxiv.org/abs/2405.19225
Authors: Bijan Mazaheri, Chandler Squires, Caroline Uhler
Keywords: Modern data analysis, analysis frequently relies, data analysis frequently, diverse populations, Modern data
Subjects: Machine Learning (cs.LG); Econometrics (econ.EM); Methodology (stat.ME)
Comments:

Click to view abstract

Abstract:Modern data analysis frequently relies on the use of large datasets, often constructed as amalgamations of diverse populations or data-sources. Heterogeneity across these smaller datasets constitutes two major challenges for causal inference: (1) the source of each sample can introduce latent confounding between treatment and effect, and (2) diverse populations may respond differently to the same treatment, giving rise to heterogeneous treatment effects (HTEs). The issues of latent confounding and HTEs have been studied separately but not in conjunction. In particular, previous works only report the conditional average treatment effect (CATE) among similar individuals (with respect to the measured covariates). CATEs cannot resolve mixtures of potential treatment effects driven by latent heterogeneity, which we call mixtures of treatment effects (MTEs). Inspired by method of moment approaches to mixture models, we propose “synthetic potential outcomes” (SPOs). Our new approach deconfounds heterogeneity while also guaranteeing the identifiability of MTEs. This technique bypasses full recovery of a mixture, which significantly simplifies its requirements for identifiability. We demonstrate the efficacy of SPOs on synthetic data.

[LG-22] LoByITFL: Low Communication Secure and Private Federated Learning

Link: https://arxiv.org/abs/2405.19217
Authors: Yue Xia, Christoph Hofmeister, Maximilian Egger, Rawad Bitar
Keywords: Federated Learning, faces several challenges, Byzantine clients, clients data, security against Byzantine
Subjects: Information Theory (cs.IT); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Federated Learning (FL) faces several challenges, such as the privacy of the clients’ data and security against Byzantine clients. Existing works treating privacy and security jointly make sacrifices on the privacy guarantee. In this work, we introduce LoByITFL, the first communication-efficient Information-Theoretic (IT) private and secure FL scheme that makes no sacrifices on the privacy guarantees while ensuring security against Byzantine adversaries. The key ingredients are a small and representative dataset available to the federator, a careful transformation of the FLTrust algorithm and the use of a trusted third party only in a one-time preprocessing phase before the start of the learning algorithm. We provide theoretical guarantees on privacy and Byzantine-resilience, and provide convergence guarantees and experimental results validating our theoretical findings.

[LG-23] HawkVision: Low-Latency Modeless Edge AI Serving

Link: https://arxiv.org/abs/2405.19213
Authors: ChonLam Lao, Jiaqi Gao, Ganesh Ananthanarayanan, Aditya Akella, Minlan Yu
Keywords: users and caters, user and application, increasingly growing, growing in popularity, hides the complexity
Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Comments:

Click to view abstract

Abstract:The trend of modeless ML inference is increasingly growing in popularity as it hides the complexity of model inference from users and caters to diverse user and application accuracy requirements. Previous work mostly focuses on modeless inference in data centers. To provide low-latency inference, in this paper, we promote modeless inference at the edge. The edge environment introduces additional challenges related to low power consumption, limited device memory, and volatile network environments. To address these challenges, we propose HawkVision, which provides low-latency modeless serving of vision DNNs. HawkVision leverages a two-layer edge-DC architecture that employs confidence scaling to reduce the number of model options while meeting diverse accuracy requirements. It also supports lossy inference under volatile network environments. Our experimental results show that HawkVision outperforms current serving systems by up to 1.6X in P99 latency for providing modeless service. Our FPGA prototype demonstrates similar performance at certain accuracy levels with up to a 3.34X reduction in power consumption.

[LG-24] Partial Information Decomposition for Data Interpretability and Feature Selection

Link: https://arxiv.org/abs/2405.19212
Authors: Charles Westphal, Stephen Hailes, Mirco Musolesi
Keywords: Partial Information Decomposition, introduce Partial Information, introduce Partial, simultaneous data interpretability, Partial Information
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
Comments:

Click to view abstract

Abstract:In this paper, we introduce Partial Information Decomposition of Features (PIDF), a new paradigm for simultaneous data interpretability and feature selection. Contrary to traditional methods that assign a single importance value, our approach is based on three metrics per feature: the mutual information shared with the target variable, the feature’s contribution to synergistic information, and the amount of this information that is redundant. In particular, we develop a novel procedure based on these three metrics, which reveals not only how features are correlated with the target but also the additional and overlapping information provided by considering them in combination with other features. We extensively evaluate PIDF using both synthetic and real-world data, demonstrating its potential applications and effectiveness, by considering case studies from genetics and neuroscience.
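
The synergy notion behind the three metrics is concrete in the classic XOR example: each feature alone shares no information with the target, yet the pair determines it completely. The plug-in mutual-information estimator below is a toy illustration of the underlying quantity, not the paper's PIDF procedure.

```python
import numpy as np

def mutual_information(xs, y):
    """Plug-in estimate of the mutual information I(X;Y) in bits
    for discrete arrays."""
    xs, y = np.asarray(xs), np.asarray(y)
    mi = 0.0
    for xv in np.unique(xs):
        for yv in np.unique(y):
            pxy = np.mean((xs == xv) & (y == yv))
            px, py = np.mean(xs == xv), np.mean(y == yv)
            if pxy > 0:
                mi += pxy * np.log2(pxy / (px * py))
    return mi

# XOR target: each feature alone carries zero information about y,
# but the pair determines it completely -- pure synergy.
x1 = np.array([0, 0, 1, 1])
x2 = np.array([0, 1, 0, 1])
y = x1 ^ x2
mi_single = mutual_information(x1, y)          # → 0.0 bits
mi_joint = mutual_information(x1 * 2 + x2, y)  # → 1.0 bit
print(mi_single, mi_joint)
```

A single per-feature importance score would rank both features as useless here; separating shared, synergistic, and redundant information is exactly what exposes their joint value.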

[LG-25] Gone but Not Forgotten: Improved Benchmarks for Machine Unlearning

Link: https://arxiv.org/abs/2405.19211
Authors: Keltin Grimes, Collin Abidi, Cole Frank, Shannon Gallagher
Keywords: Machine learning models, model training data, vulnerable to adversarial, leak information, Machine learning
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Machine learning models are vulnerable to adversarial attacks, including attacks that leak information about the model’s training data. There has recently been an increase in interest about how to best address privacy concerns, especially in the presence of data-removal requests. Machine unlearning algorithms aim to efficiently update trained models to comply with data deletion requests while maintaining performance and without having to resort to retraining the model from scratch, a costly endeavor. Several algorithms in the machine unlearning literature demonstrate some level of privacy gains, but they are often evaluated only on rudimentary membership inference attacks, which do not represent realistic threats. In this paper we describe and propose alternative evaluation methods for three key shortcomings in the current evaluation of unlearning algorithms. We show the utility of our alternative evaluations via a series of experiments of state-of-the-art unlearning algorithms on different computer vision datasets, presenting a more detailed picture of the state of the field.

[LG-26] Gradient Guided Hypotheses: A unified solution to enable machine learning models on scarce and noisy data regimes

Link: https://arxiv.org/abs/2405.19210
Authors: Paulo Neves, Joerg K. Wegner, Philippe Schwaller
Keywords: Ensuring high-quality data, business intelligence systems, Ensuring high-quality, GGH, paramount for maximizing
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Ensuring high-quality data is paramount for maximizing the performance of machine learning models and business intelligence systems. However, challenges in data quality, including noise in data capture, missing records, limited data production, and confounding variables, significantly constrain the potential performance of these systems. In this study, we propose an architecture-agnostic algorithm, Gradient Guided Hypotheses (GGH), designed to address these challenges. GGH analyses gradients from hypotheses as a proxy of distinct and possibly contradictory patterns in the data. This framework entails an additional step in machine learning training, where gradients can be included or excluded from backpropagation. In this manner, missing and noisy data are addressed through a unified solution that perceives both challenges as facets of the same overarching issue: the propagation of erroneous information. Experimental validation of GGH is conducted using real-world open-source datasets, where records with missing rates of up to 98.5% are simulated. Comparative analysis with state-of-the-art imputation methods demonstrates a substantial improvement in model performance achieved by GGH. Specifically in very high scarcity regimes, GGH was found to be the only viable solution. Additionally, GGH’s noise detection capabilities are showcased by introducing simulated noise into the datasets and observing enhanced model performance after filtering out the noisy data. This study presents GGH as a promising solution for improving data quality and model performance in various applications.
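
The include/exclude step can be pictured with a toy gradient filter: per-sample gradients that disagree with the consensus direction are dropped before averaging. This is a loose stand-in for GGH's hypothesis-based criterion, which the abstract describes only at a high level; the function name and threshold are made up for illustration.

```python
import numpy as np

def filter_gradients(grads, threshold=0.0):
    """Keep per-sample gradients whose cosine similarity with the mean
    gradient exceeds `threshold`, then average the survivors."""
    mean = grads.mean(axis=0)
    mean_norm = np.linalg.norm(mean)
    kept = [g for g in grads
            if g @ mean / (np.linalg.norm(g) * mean_norm + 1e-12) > threshold]
    return np.mean(kept, axis=0), len(kept)

# Three samples agree on a direction; one noisy sample points the other way.
grads = np.array([[1.0, 0.9], [0.9, 1.1], [1.1, 1.0], [-1.0, -1.0]])
filtered, n_kept = filter_gradients(grads)
print(n_kept)  # → 3 gradients survive the filter
```

Excluding the dissenting gradient from backpropagation is how noisy or erroneous records can be prevented from propagating their information into the model.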

[LG-27] Vulnerable Road User Detection and Safety Enhancement: A Comprehensive Survey

Link: https://arxiv.org/abs/2405.19202
Authors: Renato M. Silva, Gregório F. Azevedo, Matheus V. V. Berto, Jean R. Rocha, Eduardo C. Fidelis, Matheus V. Nogueira, Pedro H. Lisboa, Tiago A. Almeida
Keywords: vulnerable road users, global road accidents, involving vulnerable road, incidents involving vulnerable, Traffic incidents involving
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 48 pages, 8 figures, citing 333 (up-to-date) papers, preprint submitted to ACM Computing Surveys

Click to view abstract

Abstract:Traffic incidents involving vulnerable road users (VRUs) constitute a significant proportion of global road accidents. Advances in traffic communication ecosystems, coupled with sophisticated signal processing and machine learning techniques, have facilitated the utilization of data from diverse sensors. Despite these advancements and the availability of extensive datasets, substantial progress is required to mitigate traffic casualties. This paper provides a comprehensive survey of state-of-the-art technologies and methodologies to enhance the safety of VRUs. The study delves into the communication networks between vehicles and VRUs, emphasizing the integration of advanced sensors and the availability of relevant datasets. It explores preprocessing techniques and data fusion methods to enhance sensor data quality. Furthermore, our study assesses critical simulation environments essential for developing and testing VRU safety systems. Our research also highlights recent advances in VRU detection and classification algorithms, addressing challenges such as variable environmental conditions. Additionally, we cover cutting-edge research in predicting VRU intentions and behaviors, which is crucial for proactive collision avoidance strategies. Through this survey, we aim to provide a comprehensive understanding of the current landscape of VRU safety technologies, identifying areas of progress and areas needing further research and development.

[LG-28] Diffusion-based Dynamics Models for Long-Horizon Rollout in Offline Reinforcement Learning

Link: https://arxiv.org/abs/2405.19189
Authors: Hanye Zhao, Xiaoshen Han, Zhengbang Zhu, Minghuan Liu, Yong Yu, Weinan Zhang
Keywords: generating realistic synthetic, realistic synthetic vision, synthetic vision data, decision-making and control, great success
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:With the great success of diffusion models (DMs) in generating realistic synthetic vision data, many researchers have investigated their potential in decision-making and control. Most of these works utilized DMs to sample directly from the trajectory space, where DMs can be viewed as a combination of dynamics models and policies. In this work, we explore how to decouple DMs’ ability as dynamics models in fully offline settings, allowing the learning policy to roll out trajectories. As DMs learn the data distribution from the dataset, their intrinsic policy is actually the behavior policy induced from the dataset, which results in a mismatch between the behavior policy and the learning policy. We propose Dynamics Diffusion (DyDiff), which can inject information from the learning policy into DMs iteratively. DyDiff ensures long-horizon rollout accuracy while maintaining policy consistency and can be easily deployed on model-free algorithms. We provide theoretical analysis to show the advantage of DMs on long-horizon rollout over models and demonstrate the effectiveness of DyDiff in the context of offline reinforcement learning, where the rollout dataset is provided but no online environment for interaction. Our code is at this https URL.

[LG-29] MetaToken: Detecting Hallucination in Image Descriptions by Meta Classification

Link: https://arxiv.org/abs/2405.19186
Authors: Laura Fieback (1,2), Jakob Spiegelberg (1), Hanno Gottschalk (2) ((1) Volkswagen AG, (2) TU Berlin)
Keywords: Vision Language Models, shown remarkable capabilities, Large Vision Language, visual question answering, Language Models
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 18 pages, 8 figures

Click to view abstract

Abstract:Large Vision Language Models (LVLMs) have shown remarkable capabilities in multimodal tasks like visual question answering or image captioning. However, inconsistencies between the visual information and the generated text, a phenomenon referred to as hallucinations, remain an unsolved problem with regard to the trustworthiness of LVLMs. To address this problem, recent works proposed to incorporate computationally costly Large (Vision) Language Models in order to detect hallucinations on a sentence- or subsentence-level. In this work, we introduce MetaToken, a lightweight binary classifier to detect hallucinations on the token-level at negligible cost. Based on a statistical analysis, we reveal key factors of hallucinations in LVLMs which have been overlooked in previous works. MetaToken can be applied to any open-source LVLM without any knowledge about ground truth data, providing reliable detection of hallucinations. We evaluate our method on four state-of-the-art LVLMs demonstrating the effectiveness of our approach.

[LG-30] Online Linear Regression in Dynamic Environments via Discounting

Link: https://arxiv.org/abs/2405.19175
Authors: Andrew Jacobsen, Ashok Cutkosky
Keywords: online linear regression, achieves dynamic regret, prior knowledge, achieve optimal static, develop algorithms
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: ICML 2024, 38 pages

Click to view abstract

Abstract:We develop algorithms for online linear regression which achieve optimal static and dynamic regret guarantees *even in the complete absence of prior knowledge*. We present a novel analysis showing that a discounted variant of the Vovk-Azoury-Warmuth forecaster achieves dynamic regret of the form $R_T(\vec{u}) \le O\left(d\log(T) \vee \sqrt{d P_T^\gamma(\vec{u}) T}\right)$, where $P_T^\gamma(\vec{u})$ is a measure of variability of the comparator sequence, and show that the discount factor achieving this result can be learned on-the-fly. We show that this result is optimal by providing a matching lower bound. We also extend our results to *strongly-adaptive* guarantees which hold over every sub-interval $[a,b] \subseteq [1,T]$ simultaneously.
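
A discounted Vovk-Azoury-Warmuth forecaster is short enough to sketch: the covariance and target statistics are decayed by a factor gamma before each update, so old observations fade and the forecaster can track a drifting comparator. The sketch fixes gamma, whereas the paper shows the discount factor can be learned on-the-fly; the class name and constants are illustrative.

```python
import numpy as np

class DiscountedVAW:
    """Sketch of a discounted Vovk-Azoury-Warmuth forecaster."""
    def __init__(self, dim, gamma=0.9, reg=1.0):
        self.A = reg * np.eye(dim)   # discounted covariance (plus regularizer)
        self.b = np.zeros(dim)       # discounted feature-target correlation
        self.gamma = gamma

    def predict(self, x):
        # VAW peeks at the current features: x enters the covariance
        # before the prediction is made.
        A_t = self.gamma * self.A + np.outer(x, x)
        return x @ np.linalg.solve(A_t, self.gamma * self.b)

    def update(self, x, y):
        self.A = self.gamma * self.A + np.outer(x, x)
        self.b = self.gamma * self.b + y * x

# A linear relation that flips sign halfway through the stream: the
# discount forgets the stale half, so the implied weights recover.
rng = np.random.default_rng(0)
model = DiscountedVAW(dim=2)
for t in range(200):
    w_true = np.array([1.0, -1.0]) if t < 100 else np.array([-1.0, 1.0])
    x = rng.normal(size=2)
    model.update(x, w_true @ x)
w_hat = np.linalg.solve(model.A, model.b)
print(w_hat)  # close to [-1, 1]
```

With gamma = 1 the same recursion reduces to the standard (undiscounted) VAW forecaster, which cannot shed the first half of the stream and would land between the two weight vectors.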

[LG-31] Transformers as Neural Operators for Solutions of Differential Equations with Finite Regularity

Link: https://arxiv.org/abs/2405.19166
Authors: Benjamin Shih, Ahmad Peyvan, Zhongqiang Zhang, George Em Karniadakis
Keywords: operator learning models, Neural operator learning, partial differential equations, operator learning, science and engineering
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Neural operator learning models have emerged as very effective surrogates in data-driven methods for partial differential equations (PDEs) across different applications from computational science and engineering. Such operator learning models not only predict particular instances of a physical or biological system in real-time but also forecast classes of solutions corresponding to a distribution of initial and boundary conditions or forcing terms. DeepONet is the first neural operator model and has been tested extensively for a broad class of solutions, including Riemann problems. Transformers have not been used in that capacity, and specifically, they have not been tested for solutions of PDEs with low regularity. In this work, we first establish the theoretical groundwork that transformers possess the universal approximation property as operator learning models. We then apply transformers to forecast solutions of diverse dynamical systems with solutions of finite regularity for a plurality of initial conditions and forcing terms. In particular, we consider three examples: the Izhikevich neuron model, the tempered fractional-order Leaky Integrate-and-Fire (LIF) model, and the one-dimensional Euler equation Riemann problem. For the latter problem, we also compare with variants of DeepONet, and we find that transformers outperform DeepONet in accuracy but they are computationally more expensive.

[LG-32] Does learning the right latent variables necessarily improve in-context learning?

链接: https://arxiv.org/abs/2405.19162
作者: Sarthak Mittal,Eric Elmoznino,Leo Gagnon,Sangnie Bhardwaj,Dhanya Sridhar,Guillaume Lajoie
关键词: Large autoregressive models, Large autoregressive, in-context learning, suggesting avenues, autoregressive models
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large autoregressive models like Transformers can solve tasks through in-context learning (ICL) without learning new weights, suggesting avenues for efficiently solving new tasks. For many tasks, e.g., linear regression, the data factorizes: examples are independent given a task latent that generates the data, e.g., linear coefficients. While an optimal predictor leverages this factorization by inferring task latents, it is unclear if Transformers implicitly do so or if they instead exploit heuristics and statistical shortcuts enabled by attention layers. Both scenarios have inspired active ongoing work. In this paper, we systematically investigate the effect of explicitly inferring task latents. We minimally modify the Transformer architecture with a bottleneck designed to prevent shortcuts in favor of more structured solutions, and then compare performance against standard Transformers across various ICL tasks. Contrary to intuition and some recent works, we find little discernible difference between the two; biasing towards task-relevant latent variables does not lead to better out-of-distribution performance, in general. Curiously, we find that while the bottleneck effectively learns to extract latent task variables from context, downstream processing struggles to utilize them for robust prediction. Our study highlights the intrinsic limitations of Transformers in achieving structured ICL solutions that generalize, and shows that while inferring the right latents aids interpretability, it is not sufficient to alleviate this problem.
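The linear-regression ICL setting described above is easy to instantiate: each episode draws its own task latent w, then i.i.d. context examples conditioned on it. A minimal NumPy sketch of the data-generating process (episode sizes are illustrative); the ridge estimate at the end is the kind of latent inference an optimal predictor could perform:

```python
import numpy as np

def make_icl_regression_batch(n_tasks=32, n_context=16, dim=4, noise=0.1, seed=0):
    """Generate in-context linear-regression episodes.

    Each episode has its own latent weight vector w (the "task latent"):
    given w, the context examples (x_i, y_i) are i.i.d., which is the
    factorization an optimal predictor could exploit by inferring w.
    """
    rng = np.random.default_rng(seed)
    w = rng.normal(size=(n_tasks, dim))                 # task latents
    x = rng.normal(size=(n_tasks, n_context, dim))      # context inputs
    y = np.einsum("tcd,td->tc", x, w) + noise * rng.normal(size=(n_tasks, n_context))
    x_query = rng.normal(size=(n_tasks, dim))
    y_query = np.einsum("td,td->t", x_query, w)
    return x, y, x_query, y_query, w

x, y, xq, yq, w = make_icl_regression_batch()
# A ridge estimate of w from the context recovers the latent well at low noise.
w_hat = np.stack([np.linalg.solve(xi.T @ xi + 1e-3 * np.eye(4), xi.T @ yi)
                  for xi, yi in zip(x, y)])
```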

[LG-33] Beyond Discrepancy: A Closer Look at the Theory of Distribution Shift

链接: https://arxiv.org/abs/2405.19156
作者: Robi Bhattacharjee,Nick Rittler,Kamalika Chaudhuri
关键词: machine learning models, distribution shift, distribution, target, deploy effortlessly
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Many machine learning models appear to deploy effortlessly under distribution shift, and perform well on a target distribution that is considerably different from the training distribution. Yet, learning theory of distribution shift bounds performance on the target distribution as a function of the discrepancy between the source and target, rarely guaranteeing high target accuracy. Motivated by this gap, this work takes a closer look at the theory of distribution shift for a classifier from a source to a target distribution. Instead of relying on the discrepancy, we adopt an Invariant-Risk-Minimization (IRM)-like assumption connecting the distributions, and characterize conditions under which data from a source distribution is sufficient for accurate classification of the target. When these conditions are not met, we show when only unlabeled data from the target is sufficient, and when labeled target data is needed. In all cases, we provide rigorous theoretical guarantees in the large sample regime.

[LG-34] A Study of Plasticity Loss in On-Policy Deep Reinforcement Learning

链接: https://arxiv.org/abs/2405.19153
作者: Arthur Juliani,Jordan T. Ash
关键词: convex continual learning, presents challenges distinct, continual learning regimes, neural networks presents, Continual learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Continual learning with deep neural networks presents challenges distinct from both the fixed-dataset and convex continual learning regimes. One such challenge is plasticity loss, wherein a neural network trained in an online fashion displays a degraded ability to fit new tasks. This problem has been extensively studied in both supervised learning and off-policy reinforcement learning (RL), where a number of remedies have been proposed. Still, plasticity loss has received less attention in the on-policy deep RL setting. Here we perform an extensive set of experiments examining plasticity loss and a variety of mitigation methods in on-policy deep RL. We demonstrate that plasticity loss is pervasive under domain shift in this regime, and that a number of methods developed to resolve it in other settings fail, sometimes even resulting in performance that is worse than performing no intervention at all. In contrast, we find that a class of "regenerative" methods are able to consistently mitigate plasticity loss in a variety of contexts, including in gridworld tasks and more challenging environments like Montezuma's Revenge and ProcGen.
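As one concrete representative of the "regenerative" class, shrink-and-perturb (Ash & Adams, 2020) scales weights toward zero and adds fresh initialization noise, partially re-randomizing the network while retaining most of what it has learned. The abstract does not list which regenerative methods are tested, so treat this as an illustrative example:

```python
import numpy as np

def shrink_and_perturb(params, shrink=0.8, noise_scale=0.01, rng=None):
    """One 'regenerative' intervention against plasticity loss.

    Scales every weight tensor toward zero and adds small fresh noise,
    restoring trainability without discarding the learned solution.
    """
    rng = rng or np.random.default_rng(0)
    return [shrink * p + noise_scale * rng.normal(size=p.shape) for p in params]

layers = [np.ones((4, 4)), np.ones(4)]   # stand-in for network parameters
regen = shrink_and_perturb(layers)
```

Applied periodically during on-policy training, such a step trades a small amount of immediate performance for sustained ability to fit new data.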

[LG-35] Spatio-Spectral Graph Neural Networks

链接: https://arxiv.org/abs/2405.19121
作者: Simon Geisler,Arthur Kosmala,Daniel Herbst,Stephan Günnemann
关键词: Spatial Message Passing, Graph Neural Networks, Message Passing Graph, Passing Graph Neural, Neural Networks
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 47 pages, 27 figures, 12 tables

点击查看摘要

Abstract:Spatial Message Passing Graph Neural Networks (MPGNNs) are widely used for learning on graph-structured data. However, key limitations of l-step MPGNNs are that their "receptive field" is typically limited to the l-hop neighborhood of a node and that information exchange between distant nodes is limited by over-squashing. Motivated by these limitations, we propose Spatio-Spectral Graph Neural Networks (S²GNNs) – a new modeling paradigm for Graph Neural Networks (GNNs) that synergistically combines spatially and spectrally parametrized graph filters. Parameterizing filters partially in the frequency domain enables global yet efficient information propagation. We show that S²GNNs vanquish over-squashing and yield strictly tighter approximation-theoretic error bounds than MPGNNs. Further, rethinking graph convolutions at a fundamental level unlocks new design spaces. For example, S²GNNs allow for free positional encodings that make them strictly more expressive than the 1-Weisfeiler-Lehman (WL) test. Moreover, to obtain general-purpose S²GNNs, we propose spectrally parametrized filters for directed graphs. S²GNNs outperform spatial MPGNNs, graph transformers, and graph rewirings, e.g., on the peptide long-range benchmark tasks, and are competitive with state-of-the-art sequence modeling. On a 40 GB GPU, S²GNNs scale to millions of nodes.
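A toy version of the spectral half of such a model: filter a graph signal in the Laplacian eigenbasis, so that a single layer propagates information across the whole graph regardless of hop distance. This uses a dense eigendecomposition for clarity; the paper's models use partial spectra and learned, directed-graph filters:

```python
import numpy as np

def spectral_filter(adj, signal, filter_fn):
    """Apply a spectrally parametrized filter to a graph signal.

    Filtering in the Laplacian eigenbasis propagates information globally
    in one step, the mechanism combined with local message passing to
    avoid over-squashing.
    """
    deg = adj.sum(axis=1)
    lap = np.diag(deg) - adj                    # combinatorial Laplacian
    eigvals, eigvecs = np.linalg.eigh(lap)
    # Transform to frequency domain, scale each mode, transform back.
    coeffs = eigvecs.T @ signal
    return eigvecs @ (filter_fn(eigvals) * coeffs)

# Path graph on 4 nodes; a low-pass (heat-kernel) filter spreads a spike
# at node 0 to the far end in a single application.
adj = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], float)
x = np.array([1.0, 0.0, 0.0, 0.0])
smoothed = spectral_filter(adj, x, lambda lam: np.exp(-lam))
```

A purely spatial 1-step MPGNN could not move any signal from node 0 to node 3 here; the spectral filter does so immediately.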

[LG-36] Can Graph Learning Improve Task Planning?

链接: https://arxiv.org/abs/2405.19119
作者: Xixi Wu,Yifei Shen,Caihua Shan,Kaitao Song,Siwei Wang,Bohang Zhang,Jiarui Feng,Hong Cheng,Wei Chen,Yun Xiong,Dongsheng Li
关键词: important research topic, research topic alongside, large language models, important research, research topic
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Task planning is emerging as an important research topic alongside the development of large language models (LLMs). It aims to break down complex user requests into solvable sub-tasks, thereby fulfilling the original requests. In this context, the sub-tasks can be naturally viewed as a graph, where the nodes represent the sub-tasks, and the edges denote the dependencies among them. Consequently, task planning is a decision-making problem that involves selecting a connected path or subgraph within the corresponding graph and invoking it. In this paper, we explore graph learning-based methods for task planning, a direction that is orthogonal to the prevalent focus on prompt design. Our interest in graph learning stems from a theoretical discovery: the biases of attention and auto-regressive loss impede LLMs’ ability to effectively navigate decision-making on graphs, which is adeptly addressed by graph neural networks (GNNs). This theoretical insight led us to integrate GNNs with LLMs to enhance overall performance. Extensive experiments demonstrate that GNN-based methods surpass existing solutions even without training, and minimal training can further enhance their performance. Additionally, our approach complements prompt engineering and fine-tuning techniques, with performance further enhanced by improved prompts or a fine-tuned model.

[LG-37] Offline Regularised Reinforcement Learning for Large Language Models Alignment

链接: https://arxiv.org/abs/2405.19107
作者: Pierre Harvey Richemond,Yunhao Tang,Daniel Guo,Daniele Calandriello,Mohammad Gheshlaghi Azar,Rafael Rafailov,Bernardo Avila Pires,Eugene Tarassov,Lucas Spangher,Will Ellsworth,Aliaksei Severyn,Jonathan Mallinson,Lior Shani,Gil Shamir,Rishabh Joshi,Tianqi Liu,Remi Munos,Bilal Piot
关键词: alignment of large, reinforcement learning, direct preference optimisation, large language models, Direct Reward Optimisation
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The dominant framework for alignment of large language models (LLM), whether through reinforcement learning from human feedback or direct preference optimisation, is to learn from preference data. This involves building datasets where each element is a quadruplet composed of a prompt, two independent responses (completions of the prompt) and a human preference between the two independent responses, yielding a preferred and a dis-preferred response. Such data is typically scarce and expensive to collect. On the other hand, *single-trajectory* datasets, where each element is a triplet composed of a prompt, a response and human feedback, are naturally more abundant. The canonical element of such datasets is for instance an LLM's response to a user's prompt followed by a user's feedback such as a thumbs-up/down. Consequently, in this work, we propose DRO, or *Direct Reward Optimisation*, as a framework and associated algorithms that do not require pairwise preferences. DRO uses a simple mean-squared objective that can be implemented in various ways. We validate our findings empirically, using T5 encoder-decoder language models, and show DRO's performance over selected baselines such as Kahneman-Tversky Optimization (KTO). Thus, we confirm that DRO is a simple and empirically compelling method for single-trajectory policy optimisation.
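The abstract only says the objective is "a simple mean-squared objective" over (prompt, response, feedback) triplets; the exact parameterization is in the paper. A hedged sketch, assuming the objective regresses the observed reward onto a value baseline plus a scaled policy log-ratio:

```python
import numpy as np

def dro_loss(reward, log_ratio, value, beta=1.0):
    """Mean-squared single-trajectory objective in the spirit of DRO.

    For each (prompt, response, reward) triplet, penalize the squared gap
    between the observed reward and value(prompt) + beta * log-ratio.
    NOTE: the precise form used in the paper may differ; this is an
    illustrative assumption, not the paper's exact objective.
    """
    residual = reward - value - beta * log_ratio
    return 0.5 * np.mean(residual ** 2)

rewards = np.array([1.0, -1.0, 0.5])      # e.g. thumbs-up/down signals
log_ratios = np.array([0.2, -0.3, 0.0])   # log pi(y|x) - log pi_ref(y|x)
values = np.array([0.1, 0.0, 0.2])        # value baseline per prompt
loss = dro_loss(rewards, log_ratios, values)
```

The key practical point survives the hedging: no pairwise comparison is needed, so every logged (prompt, response, feedback) triple is usable training data.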

[LG-38] Voice Jailbreak Attacks Against GPT-4o

链接: https://arxiv.org/abs/2405.19103
作者: Xinyue Shen,Yixin Wu,Michael Backes,Yang Zhang
关键词: real-world applications, concept of artificial, artificial assistants, assistants has evolved, evolved from science
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recently, the concept of artificial assistants has evolved from science fiction into real-world applications. GPT-4o, the newest multimodal large language model (MLLM) across audio, vision, and text, has further blurred the line between fiction and reality by enabling more natural human-computer interactions. However, the advent of GPT-4o’s voice mode may also introduce a new attack surface. In this paper, we present the first systematic measurement of jailbreak attacks against the voice mode of GPT-4o. We show that GPT-4o demonstrates good resistance to forbidden questions and text jailbreak prompts when directly transferring them to voice mode. This resistance is primarily due to GPT-4o’s internal safeguards and the difficulty of adapting text jailbreak prompts to voice mode. Inspired by GPT-4o’s human-like behaviors, we propose VoiceJailbreak, a novel voice jailbreak attack that humanizes GPT-4o and attempts to persuade it through fictional storytelling (setting, character, and plot). VoiceJailbreak is capable of generating simple, audible, yet effective jailbreak prompts, which significantly increases the average attack success rate (ASR) from 0.033 to 0.778 in six forbidden scenarios. We also conduct extensive experiments to explore the impacts of interaction steps, key elements of fictional writing, and different languages on VoiceJailbreak’s effectiveness and further enhance the attack performance with advanced fictional writing techniques. We hope our study can assist the research community in building more secure and well-regulated MLLMs.

[LG-39] Poseidon: Efficient Foundation Models for PDEs

链接: https://arxiv.org/abs/2405.19101
作者: Maximilian Herde,Bogdan Raonić,Tobias Rohner,Roger Käppeli,Roberto Molinaro,Emmanuel de Bézenac,Siddhartha Mishra
关键词: learning the solution, Poseidon, solution operators, PDEs, introduce Poseidon
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce Poseidon, a foundation model for learning the solution operators of PDEs. It is based on a multiscale operator transformer, with time-conditioned layer norms that enable continuous-in-time evaluations. A novel training strategy leveraging the semi-group property of time-dependent PDEs to allow for significant scaling-up of the training data is also proposed. Poseidon is pretrained on a diverse, large scale dataset for the governing equations of fluid dynamics. It is then evaluated on a suite of 15 challenging downstream tasks that include a wide variety of PDE types and operators. We show that Poseidon exhibits excellent performance across the board by outperforming baselines significantly, both in terms of sample efficiency and accuracy. Poseidon also generalizes very well to new physics that is not seen during pretraining. Moreover, Poseidon scales with respect to model and data size, both for pretraining and for downstream tasks. Taken together, our results showcase the surprising ability of Poseidon to learn effective representations from a very small set of PDEs during pretraining in order to generalize well to unseen and unrelated PDEs downstream, demonstrating its potential as an effective, general purpose PDE foundation model. Finally, the Poseidon model as well as underlying pretraining and downstream datasets are open sourced, with code being available at this https URL and pretrained models and datasets at this https URL.
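The semi-group trick for scaling up training data can be shown in a few lines: because the solution operator of a time-dependent PDE satisfies S(t+s) = S(t)S(s), every ordered pair of snapshots from one trajectory is a valid (input, target, lead-time) training example, turning one trajectory of n snapshots into n(n-1)/2 examples:

```python
import numpy as np

def semigroup_pairs(trajectory, dt=1.0):
    """Expand one PDE trajectory into many (input, target, lead-time) pairs.

    By the semi-group property, any snapshot u(t_i) can serve as an
    initial condition with target u(t_j), j > i, and lead time t_j - t_i.
    """
    n = len(trajectory)
    return [(trajectory[i], trajectory[j], (j - i) * dt)
            for i in range(n) for j in range(i + 1, n)]

# 5 snapshots of a (toy) solution on an 8-point grid.
traj = [np.full(8, t, dtype=float) for t in range(5)]
pairs = semigroup_pairs(traj)
```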

[LG-40] Efficient Black-box Adversarial Attacks via Bayesian Optimization Guided by a Function Prior

链接: https://arxiv.org/abs/2405.19098
作者: Shuyu Cheng,Yibo Miao,Yinpeng Dong,Xiao Yang,Xiao-Shan Gao,Jun Zhu
关键词: challenging black-box adversarial, Prior-guided Bayesian Optimization, studies the challenging, aims to generate, output feedback
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
*备注: ICML 2024

点击查看摘要

Abstract:This paper studies the challenging black-box adversarial attack that aims to generate adversarial examples against a black-box model by only using output feedback of the model to input queries. Some previous methods improve the query efficiency by incorporating the gradient of a surrogate white-box model into query-based attacks due to the adversarial transferability. However, the localized gradient is not informative enough, making these methods still query-intensive. In this paper, we propose a Prior-guided Bayesian Optimization (P-BO) algorithm that leverages the surrogate model as a global function prior in black-box adversarial attacks. As the surrogate model contains rich prior information of the black-box one, P-BO models the attack objective with a Gaussian process whose mean function is initialized as the surrogate model’s loss. Our theoretical analysis on the regret bound indicates that the performance of P-BO may be affected by a bad prior. Therefore, we further propose an adaptive integration strategy to automatically adjust a coefficient on the function prior by minimizing the regret bound. Extensive experiments on image classifiers and large vision-language models demonstrate the superiority of the proposed algorithm in reducing queries and improving attack success rates compared with the state-of-the-art black-box attacks. Code is available at this https URL.
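The core of using a "function prior" in Bayesian optimization is a Gaussian process whose prior mean is the surrogate model's loss: the GP regresses only the residuals, so queries far from observed data fall back to the surrogate's prediction. A minimal 1-D sketch (the paper additionally derives an adaptive coefficient on the prior from a regret bound, omitted here):

```python
import numpy as np

def gp_posterior_mean(x_train, y_train, x_test, prior_mean, length=1.0, noise=1e-6):
    """GP regression with a non-zero prior mean function.

    The posterior is computed on residuals y - prior_mean(x), so away
    from data the prediction reverts to the prior (surrogate) model.
    """
    def rbf(a, b):
        return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length ** 2)

    k = rbf(x_train, x_train) + noise * np.eye(len(x_train))
    resid = y_train - prior_mean(x_train)
    alpha = np.linalg.solve(k, resid)
    return prior_mean(x_test) + rbf(x_test, x_train) @ alpha

surrogate = lambda x: np.sin(x)        # stand-in for the surrogate's loss
x_tr = np.array([0.0, 1.0])
y_tr = np.sin(x_tr) + 0.1              # true objective = surrogate + offset
mu = gp_posterior_mean(x_tr, y_tr, np.array([0.5, 10.0]), surrogate)
```

Near the data the posterior tracks the true objective; at x = 10, far from all observations, it reverts to the surrogate prior sin(10).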

[LG-41] OMPO: A Unified Framework for RL under Policy and Dynamics Shifts

链接: https://arxiv.org/abs/2405.19080
作者: Yu Luo,Tianying Ji,Fuchun Sun,Jianwei Zhang,Huazhe Xu,Xianyuan Zhan
关键词: Training reinforcement learning, interaction data collected, Training reinforcement, reinforcement learning policies, environment interaction data
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Training reinforcement learning policies using environment interaction data collected from varying policies or dynamics presents a fundamental challenge. Existing works often overlook the distribution discrepancies induced by policy or dynamics shifts, or rely on specialized algorithms with task priors, thus often resulting in suboptimal policy performances and high learning variances. In this paper, we identify a unified strategy for online RL policy learning under diverse settings of policy and dynamics shifts: transition occupancy matching. In light of this, we introduce a surrogate policy learning objective by considering the transition occupancy discrepancies and then cast it into a tractable min-max optimization problem through dual reformulation. Our method, dubbed Occupancy-Matching Policy Optimization (OMPO), features a specialized actor-critic structure equipped with a distribution discriminator and a small-size local buffer. We conduct extensive experiments based on the OpenAI Gym, Meta-World, and Panda Robots environments, encompassing policy shifts under stationary and nonstationary dynamics, as well as domain adaptation. The results demonstrate that OMPO outperforms the specialized baselines from different categories in all settings. We also find that OMPO exhibits particularly strong performance when combined with domain randomization, highlighting its potential in RL-based robotics applications.

[LG-42] Cephalo: Multi-Modal Vision-Language Models for Bio-Inspired Materials Analysis and Design

链接: https://arxiv.org/abs/2405.19076
作者: Markus J. Buehler
关键词: multimodal vision large, multi-agent AI frameworks, present Cephalo, vision large language, series of multimodal
类目: Computer Vision and Pattern Recognition (cs.CV); Mesoscale and Nanoscale Physics (cond-mat.mes-hall); Materials Science (cond-mat.mtrl-sci); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present Cephalo, a series of multimodal vision large language models (V-LLMs) designed for materials science applications, integrating visual and linguistic data for enhanced understanding and interaction within human-AI and multi-agent AI frameworks. A key innovation of Cephalo is its advanced dataset generation method, which employs a sophisticated algorithm to accurately detect and separate images and their corresponding textual descriptions from PDF documents, such as scientific papers. The method includes a careful refinement of image-text pairs through integrated vision and language processing, ensuring high-quality, contextually relevant, and well reasoned training data. Cephalo, trained on integrated image and text data extracted from thousands of scientific papers and science-focused Wikipedia pages, can interpret complex visual scenes, generate precise language descriptions, and answer queries about images effectively. The combination of a vision encoder with an autoregressive transformer supports complex natural language understanding in an integrated model, which can be coupled with other generative methods to create an image-to-text-to-image or image-to-text-to-3D pipeline. To explore the development of larger models from smaller ones, we merge sets of layers that originate from different pre-trained source models. This hybrid approach allows us to leverage the domain-specific expertise and general conversational capabilities to harness the strengths of multiple models. We examine the models in diverse use cases that incorporate biological materials, fracture and engineering analysis, protein biophysics, and bio-inspired design based on insect behavior. Generative applications include bio-inspired designs, including pollen-inspired architected materials, as well as the synthesis of bio-inspired material microstructures from a photograph of a solar eclipse.

[LG-43] Relevance-aware Algorithmic Recourse

链接: https://arxiv.org/abs/2405.19072
作者: Dongwhi Kim,Nuno Moniz
关键词: machine learning continues, gain prominence, transparency and explainability, increasingly critical, machine learning
类目: Machine Learning (cs.LG)
*备注: 5 pages (4 content, 1 references)

点击查看摘要

Abstract:As machine learning continues to gain prominence, transparency and explainability are increasingly critical. Without an understanding of these models, they can replicate and worsen human bias, adversely affecting marginalized communities. Algorithmic recourse emerges as a tool for clarifying decisions made by predictive models, providing actionable insights to alter outcomes. They answer, 'What do I have to change?' to achieve the desired result. Despite their importance, current algorithmic recourse methods treat all domain values equally, which is unrealistic in real-world settings. In this paper, we propose a novel framework, Relevance-Aware Algorithmic Recourse (RAAR), that leverages the concept of relevance in applying algorithmic recourse to regression tasks. We conducted multiple experiments on 15 datasets to outline how relevance influences recourses. Results show that relevance-aware recourse yields recourses comparable to well-known baselines, with greater efficiency and lower relative costs.

[LG-44] xTern: Energy-Efficient Ternary Neural Network Inference on RISC-V-Based Edge Systems

链接: https://arxiv.org/abs/2405.19065
作者: Georg Rutishauser,Joan Mihali,Moritz Scherer,Luca Benini
关键词: Ternary neural networks, superior accuracy-energy trade-off, accuracy-energy trade-off compared, Ternary neural, binary neural networks
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注: Accepted for publication at IEEE ASAP 2024

点击查看摘要

Abstract:Ternary neural networks (TNNs) offer a superior accuracy-energy trade-off compared to binary neural networks. However, until now, they have required specialized accelerators to realize their efficiency potential, which has hindered widespread adoption. To address this, we present xTern, a lightweight extension of the RISC-V instruction set architecture (ISA) targeted at accelerating TNN inference on general-purpose cores. To complement the ISA extension, we developed a set of optimized kernels leveraging xTern, achieving 67% higher throughput than their 2-bit equivalents. Power consumption is only marginally increased by 5.2%, resulting in an energy efficiency improvement by 57.1%. We demonstrate that the proposed xTern extension, integrated into an octa-core compute cluster, incurs a minimal silicon area overhead of 0.9% with no impact on timing. In end-to-end benchmarks, we demonstrate that xTern enables the deployment of TNNs achieving up to 1.6 percentage points higher CIFAR-10 classification accuracy than 2-bit networks at equal inference latency. Our results show that xTern enables RISC-V-based ultra-low-power edge AI platforms to benefit from the efficiency potential of TNNs.
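xTern accelerates inference of networks whose weights are already ternary; the quantization itself happens upstream at training time. For context, a common threshold-based ternarization recipe (from Ternary Weight Networks, Li & Liu 2016, not part of xTern itself) maps each weight to {-1, 0, +1} using a threshold proportional to the mean absolute weight:

```python
import numpy as np

def ternarize(w, delta_scale=0.7):
    """Threshold-based ternarization of a weight tensor to {-1, 0, +1}.

    The threshold delta is a fraction of the mean absolute weight, a
    standard heuristic; values within (-delta, delta) are zeroed.
    """
    delta = delta_scale * np.mean(np.abs(w))
    t = np.zeros_like(w)
    t[w > delta] = 1.0
    t[w < -delta] = -1.0
    return t

w = np.array([0.9, -0.05, 0.4, -0.8, 0.02, -0.3])
tw = ternarize(w)
```

The resulting 2-bit-per-weight tensors (with one value unused) are what an ISA extension like xTern can pack and process with specialized instructions.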

[LG-45] SIG: Efficient Self-Interpretable Graph Neural Network for Continuous-time Dynamic Graphs

链接: https://arxiv.org/abs/2405.19062
作者: Lanting Fang,Yulian Yang,Kai Wang,Shanshan Feng,Kaiyu Feng,Jie Gui,Shuliang Wang,Yew-Soon Ong
关键词: graph neural networks, continuous-time dynamic graphs, dynamic graph neural, neural networks, networks have shown
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 19 pages

点击查看摘要

Abstract:While dynamic graph neural networks have shown promise in various applications, explaining their predictions on continuous-time dynamic graphs (CTDGs) is difficult. This paper investigates a new research task: self-interpretable GNNs for CTDGs. We aim to predict future links within the dynamic graph while simultaneously providing causal explanations for these predictions. There are two key challenges: (1) capturing the underlying structural and temporal information that remains consistent across both independent and identically distributed (IID) and out-of-distribution (OOD) data, and (2) efficiently generating high-quality link prediction results and explanations. To tackle these challenges, we propose a novel causal inference model, namely the Independent and Confounded Causal Model (ICCM). ICCM is then integrated into a deep learning architecture that considers both effectiveness and efficiency. Extensive experiments demonstrate that our proposed model significantly outperforms existing methods across link prediction accuracy, explanation quality, and robustness to shortcut features. Our code and datasets are anonymously released at this https URL.

[LG-46] Robust Entropy Search for Safe Efficient Bayesian Optimization

链接: https://arxiv.org/abs/2405.19059
作者: Dorina Weichert,Alexander Kister,Patrick Link,Sebastian Houben,Gunar Ernis
关键词: imposes special requirements, high sampling efficiency, engineering applications imposes, applications imposes special, Bayesian Optimization
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:The practical use of Bayesian Optimization (BO) in engineering applications imposes special requirements: high sampling efficiency on the one hand and finding a robust solution on the other hand. We address the case of adversarial robustness, where all parameters are controllable during the optimization process, but a subset of them is uncontrollable or even adversely perturbed at the time of application. To this end, we develop an efficient information-based acquisition function that we call Robust Entropy Search (RES). We empirically demonstrate its benefits in experiments on synthetic and real-life data. The results show that RES reliably finds robust optima, outperforming state-of-the-art algorithms.

[LG-47] Multiscale Spatio-Temporal Enhanced Short-term Load Forecasting of Electric Vehicle Charging Stations

链接: https://arxiv.org/abs/2405.19053
作者: Zongbao Zhang,Jiao Hao,Wenmeng Zhao,Yan Liu,Yaohui Huang,Xinhang Luo
关键词: electric vehicle charging, vehicle charging stations, electric vehicles, load forecasting, expansion of electric
类目: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 5 pages, 1 figure, AEEES 2024

点击查看摘要

Abstract:The rapid expansion of electric vehicles (EVs) has rendered the load forecasting of electric vehicle charging stations (EVCS) increasingly critical. The primary challenge in achieving precise load forecasting for EVCS lies in accounting for the nonlinearity of charging behaviors, the spatial interactions among different stations, and the intricate temporal variations in usage patterns. To address these challenges, we propose a Multiscale Spatio-Temporal Enhanced Model (MSTEM) for effective load forecasting at EVCS. MSTEM incorporates a multiscale graph neural network to discern hierarchical nonlinear temporal dependencies across various time scales. Besides, it also integrates a recurrent learning component and a residual fusion mechanism, enhancing its capability to accurately capture spatial and temporal variations in charging patterns. The effectiveness of the proposed MSTEM has been validated through comparative analysis with six baseline models using three evaluation metrics. The case studies utilize real-world datasets for both fast and slow charging loads at EVCS in Perth, UK. The experimental results demonstrate the superiority of MSTEM in short-term continuous load forecasting for EVCS.

[LG-48] Statistical Context Detection for Deep Lifelong Reinforcement Learning

链接: https://arxiv.org/abs/2405.19047
作者: Jeffery Dick,Saptarshi Nath,Christos Peridis,Eseoghene Benjamin,Soheil Kolouri,Andrea Soltoggio
关键词: involves labeling segments, detection involves labeling, Task labels, involves labeling, labeling segments
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 10 pages excluding references and bibliography. Submitted to CoLLAs 2024

点击查看摘要

Abstract:Context detection involves labeling segments of an online stream of data as belonging to different tasks. Task labels are used in lifelong learning algorithms to perform consolidation or other procedures that prevent catastrophic forgetting. Inferring task labels from online experiences remains a challenging problem. Most approaches assume finite and low-dimension observation spaces or a preliminary training phase during which task labels are learned. Moreover, changes in the transition or reward functions can be detected only in combination with a policy, and therefore are more difficult to detect than changes in the input distribution. This paper presents an approach to learning both policies and labels in an online deep reinforcement learning setting. The key idea is to use distance metrics, obtained via optimal transport methods, i.e., Wasserstein distance, on suitable latent action-reward spaces to measure distances between sets of data points from past and current streams. Such distances can then be used for statistical tests based on an adapted Kolmogorov-Smirnov calculation to assign labels to sequences of experiences. A rollback procedure is introduced to learn multiple policies by ensuring that only the appropriate data is used to train the corresponding policy. The combination of task detection and policy deployment allows for the optimization of lifelong reinforcement learning agents without an oracle that provides task labels. The approach is tested using two benchmarks and the results show promising performance when compared with related context detection algorithms. The results suggest that optimal transport statistical methods provide an explainable and justifiable procedure for online context detection and reward optimization in lifelong reinforcement learning.
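The detection recipe (Wasserstein distances between past and current experience windows, turned into a decision by a two-sample Kolmogorov-Smirnov test) can be sketched in 1-D with SciPy. The paper operates on learned latent action-reward spaces and adapts the KS calculation; the sketch below uses raw 1-D samples and the stock test:

```python
import numpy as np
from scipy.stats import ks_2samp, wasserstein_distance

def detect_context_shift(past, current, alpha=0.05):
    """Flag a task change by comparing two windows of experience.

    Returns the 1-D Wasserstein distance between the windows and a
    boolean shift decision from a two-sample KS test at level alpha.
    """
    dist = wasserstein_distance(past, current)
    stat, p_value = ks_2samp(past, current)
    return dist, p_value < alpha

rng = np.random.default_rng(0)
same = detect_context_shift(rng.normal(0, 1, 500), rng.normal(0, 1, 500))
shift = detect_context_shift(rng.normal(0, 1, 500), rng.normal(3, 1, 500))
```

In a lifelong RL loop, a positive decision would trigger the label assignment and policy-rollback machinery the abstract describes.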

[LG-49] CiliaGraph: Enabling Expression-enhanced Hyper-Dimensional Computation in Ultra-Lightweight and One-Shot Graph Classification on Edge

链接: https://arxiv.org/abs/2405.19033
作者: Yuxi Han,Jihe Wang,Danghui Wang
关键词: Graph Neural Networks, Neural Networks, involving multiple rounds, resource-constrained edge scenarios, edge scenarios due
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Graph Neural Networks (GNNs) are computationally demanding and inefficient when applied to graph classification tasks in resource-constrained edge scenarios due to their inherent process, involving multiple rounds of forward and backward propagation. As a lightweight alternative, Hyper-Dimensional Computing (HDC), which leverages high-dimensional vectors for data encoding and processing, offers a more efficient solution by addressing this computational bottleneck. However, current HDC methods primarily focus on static graphs and neglect to effectively capture node attributes and structural information, which leads to poor accuracy. In this work, we propose CiliaGraph, an enhanced expressive yet ultra-lightweight HDC model for graph classification. This model introduces a novel node encoding strategy that preserves relative distance isomorphism for accurate node connection representation. In addition, node distances are utilized as edge weights for information aggregation, and the encoded node attributes and structural information are concatenated to obtain a comprehensive graph representation. Furthermore, we explore the relationship between orthogonality and dimensionality to reduce the dimensions, thereby further enhancing computational efficiency. Compared to the SOTA GNNs, extensive experiments show that CiliaGraph reduces memory usage and accelerates training speed by an average of 292 times (up to 2341 times) and 103 times (up to 313 times) respectively while maintaining comparable accuracy.
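The HDC substrate such a model builds on is easy to sketch: a random bipolar codeword per node, binding (elementwise product) to represent an edge, and bundling (sum followed by a sign threshold) to collapse all edges into one graph hypervector. The distance-isomorphic node encoding and edge-weight scheme that make CiliaGraph accurate are beyond this sketch; the edge weights below are an illustrative stand-in:

```python
import numpy as np

def hdc_encode_graph(adj, node_attrs, dim=10_000, seed=0):
    """Encode a small attributed graph as one bipolar hypervector.

    Each node gets a random bipolar codeword; an edge binds its two
    endpoint codewords, weighted here (as a placeholder) by the product
    of the endpoint attributes; bundling + sign yields the graph vector.
    """
    rng = np.random.default_rng(seed)
    n = len(adj)
    codes = rng.choice([-1, 1], size=(n, dim))
    acc = np.zeros(dim)
    for i in range(n):
        for j in range(i + 1, n):
            if adj[i][j]:
                acc += codes[i] * codes[j] * node_attrs[i] * node_attrs[j]
    return np.sign(acc)

adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]  # path graph on 3 nodes
hv = hdc_encode_graph(adj, node_attrs=[1.0, 2.0, 3.0])
```

Classification then reduces to comparing such hypervectors against class prototypes with a cheap similarity measure, which is what makes HDC attractive on edge hardware.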

[LG-50] Large Language Models for Code Summarization

链接: https://arxiv.org/abs/2405.19032
作者: Balázs Szalontai,Gergő Szalay,Tamás Márton,Anna Sike,Balázs Pintér,Tibor Gregorics
关键词: software engineering, including tasks, increasing activity, deep learning, learning for software
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Programming Languages (cs.PL); Software Engineering (cs.SE)
*备注: technical report with 11 pages, 1 figure, 10 tables

点击查看摘要

Abstract:Recently, there has been increasing activity in using deep learning for software engineering, including tasks like code generation and summarization. In particular, the most recent coding Large Language Models seem to perform well on these problems. In this technical report, we aim to review how these models perform in code explanation/summarization, while also investigating their code generation capabilities (based on natural language descriptions).

[LG-51] DiveR-CT: Diversity-enhanced Red Teaming with Relaxing Constraints

链接: https://arxiv.org/abs/2405.19026
作者: Andrew Zhao,Quentin Xu,Matthieu Lin,Shenzhi Wang,Yong-jin Liu,Zilong Zheng,Gao Huang
关键词: raising significant concerns, Recent advances, large language models, made them indispensable, raising significant
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Recent advances in large language models (LLMs) have made them indispensable, raising significant concerns over managing their safety. Automated red teaming offers a promising alternative to labor-intensive and error-prone manual probing for vulnerabilities, providing more consistent and scalable safety evaluations. However, existing approaches often compromise diversity by focusing on maximizing the attack success rate. Additionally, methods that reward semantic diversity by decreasing cosine similarity to historical embeddings suffer novelty stagnation as the history grows. To address these issues, we introduce DiveR-CT, which relaxes conventional constraints on the objective and semantic reward, granting the policy greater freedom to enhance diversity. Our experiments demonstrate DiveR-CT's marked superiority over baselines by 1) generating data that perform better on various diversity metrics across different attack success rate levels, 2) better enhancing the resiliency of blue-team models through safety tuning on the collected data, 3) allowing dynamic control of objective weights for reliable and controllable attack success rates, and 4) reducing susceptibility to reward overoptimization. Project details and code can be found at this https URL.
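The novelty-stagnation effect the abstract refers to is easy to see in a toy cosine-similarity diversity reward. The embedding dimension, data, and exact reward form below are illustrative assumptions, not DiveR-CT's actual reward:

```python
import numpy as np

def novelty_reward(embedding, history):
    """Semantic-diversity reward (sketch): 1 minus the largest cosine
    similarity to any previously generated embedding. As the history
    grows, the maximum similarity tends to rise, producing the novelty
    stagnation that DiveR-CT's relaxed formulation sets out to avoid."""
    if not history:
        return 1.0
    e = embedding / np.linalg.norm(embedding)
    H = np.stack(history)
    H = H / np.linalg.norm(H, axis=1, keepdims=True)
    return float(1.0 - np.max(H @ e))

rng = np.random.default_rng(1)
history = [rng.normal(size=8) for _ in range(3)]
# Re-submitting something already in the history earns (near) zero reward.
stale = novelty_reward(history[0], history)
```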

[LG-52] Inverse Concave-Utility Reinforcement Learning is Inverse Game Theory

链接: https://arxiv.org/abs/2405.19024
作者: Mustafa Mert Çelikok,Frans A. Oliehoek,Jan-Willem van de Meent
关键词: Utility Reinforcement Learning, Concave Utility Reinforcement, inverse reinforcement learning, reinforcement learning, reinforcement learning problems
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

Abstract:We consider inverse reinforcement learning problems with concave utilities. Concave Utility Reinforcement Learning (CURL) is a generalisation of the standard RL objective that employs a concave function of the state occupancy measure, rather than a linear function. CURL has garnered recent attention for its ability to represent instances of many important applications, including standard RL, imitation learning, pure exploration, constrained MDPs, offline RL, human-regularized RL, and others. Inverse reinforcement learning is a powerful paradigm that focuses on recovering an unknown reward function that can rationalize the observed behaviour of an agent. There have been recent theoretical advances in inverse RL, where the problem is formulated as identifying the set of feasible reward functions. However, inverse RL for CURL problems has not been considered previously. In this paper we show that most of the standard IRL results do not apply to CURL in general, since CURL invalidates the classical Bellman equations. This calls for a new theoretical framework for the inverse CURL problem. Using a recent equivalence result between CURL and mean-field games, we propose a new definition of the feasible rewards for I-CURL by proving that this problem is equivalent to an inverse game theory problem in a subclass of mean-field games. We present initial query and sample complexity results for the I-CURL problem under assumptions such as Lipschitz continuity. Finally, we outline future directions and applications in human-AI collaboration enabled by our results.

[LG-53] Towards Standardizing AI Bias Exploration

链接: https://arxiv.org/abs/2405.19022
作者: Emmanouil Krasanakis,Symeon Papadopoulos
关键词: Creating fair, fair AI systems, complex problem, problem that involves, involves the assessment
类目: Machine Learning (cs.LG); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
*备注: Workshop on AI bias: Measurements, Mitigation, Explanation Strategies (AIMMES 2024)

点击查看摘要

Abstract:Creating fair AI systems is a complex problem that involves the assessment of context-dependent bias concerns. Existing research and programming libraries express specific concerns as measures of bias that they aim to constrain or mitigate. In practice, one should explore a wide variety of (sometimes incompatible) measures before deciding which ones warrant corrective action, but their narrow scope means that most new situations can only be examined after devising new measures. In this work, we present a mathematical framework that distils literature measures of bias into building blocks, thereby facilitating new combinations to cover a wide range of fairness concerns, such as classification or recommendation differences across multiple multi-value sensitive attributes (e.g., many genders and races, and their intersections). We show how this framework generalizes existing concepts and present frequently used blocks. We provide an open-source implementation of our framework as a Python library, called FairBench, that facilitates systematic and extensible exploration of potential bias concerns.

[LG-54] Efficient Exploration in Average-Reward Constrained Reinforcement Learning: Achieving Near-Optimal Regret With Posterior Sampling

链接: https://arxiv.org/abs/2405.19017
作者: Danil Provodin,Maurits Kaptein,Mykola Pechenizkiy
关键词: Markov Decision Processes, Constrained Markov Decision, Decision Processes, Markov Decision, infinite-horizon undiscounted setting
类目: Machine Learning (cs.LG)
*备注: To appear at ICML’24

点击查看摘要

Abstract:We present a new algorithm based on posterior sampling for learning in Constrained Markov Decision Processes (CMDP) in the infinite-horizon undiscounted setting. The algorithm achieves near-optimal regret bounds while being advantageous empirically compared to the existing algorithms. Our main theoretical result is a Bayesian regret bound for each cost component of \tilde{O}(DS\sqrt{AT}) for any communicating CMDP with S states, A actions, and diameter D. This regret bound matches the lower bound in order of time horizon T and is the best-known regret bound for communicating CMDPs achieved by a computationally tractable algorithm. Empirical results show that our posterior sampling algorithm outperforms the existing algorithms for constrained reinforcement learning.

[LG-55] Distributed Management of Fluctuating Energy Resources in Dynamic Networked Systems

链接: https://arxiv.org/abs/2405.19015
作者: Xiaotong Cheng,Ioannis Tsetis,Setareh Maghsudi
关键词: Modern power systems, Modern power, power systems integrate, systems integrate renewable, ever-increasing demands
类目: ystems and Control (eess.SY); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

Abstract:Modern power systems integrate renewable distributed energy resources (DERs) as an environment-friendly enhancement to meet the ever-increasing demands. However, the inherent unreliability of renewable energy renders developing DER management algorithms imperative. We study the energy-sharing problem in a system consisting of several DERs. Each agent harvests and distributes renewable energy in its neighborhood to optimize the network’s performance while minimizing energy waste. We model this problem as a bandit convex optimization problem with constraints that correspond to each node’s limitations for energy production. We propose distributed decision-making policies to solve the formulated problem, where we utilize the notion of dynamic regret as the performance metric. We also include an adjustment strategy in our developed algorithm to reduce the constraint violations. Besides, we design a policy that deals with the non-stationary environment. Theoretical analysis shows the effectiveness of our proposed algorithm. Numerical experiments using a real-world dataset show superior performance of our proposal compared to state-of-the-art methods.

[LG-56] Trust the Model Where It Trusts Itself – Model-Based Actor-Critic with Uncertainty-Aware Rollout Adaption

链接: https://arxiv.org/abs/2405.19014
作者: Bernd Frauenknecht,Artur Eisele,Devdutt Subhasish,Friedrich Solowjow,Sebastian Trimpe
关键词: combines model-free agents, Dyna-style model-based reinforcement, model-based reinforcement learning, Dyna-style model-based, predictive transition models
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Dyna-style model-based reinforcement learning (MBRL) combines model-free agents with predictive transition models through model-based rollouts. This combination raises a critical question: ‘When to trust your model?’; i.e., which rollout length results in the model providing useful data? Janner et al. (2019) address this question by gradually increasing rollout lengths throughout the training. While theoretically tempting, uniform model accuracy is a fallacy that collapses at the latest when extrapolating. Instead, we propose asking the question ‘Where to trust your model?’. Using inherent model uncertainty to consider local accuracy, we obtain the Model-Based Actor-Critic with Uncertainty-Aware Rollout Adaption (MACURA) algorithm. We propose an easy-to-tune rollout mechanism and demonstrate substantial improvements in data efficiency and performance compared to state-of-the-art deep MBRL methods on the MuJoCo benchmark.

[LG-57] On Dissipativity of Cross-Entropy Loss in Training ResNets

链接: https://arxiv.org/abs/2405.19013
作者: Jens Püttschneider,Timm Faulwasser
关键词: neural ODEs, optimal control, formulated and analyzed, perspective of optimal, ResNets and neural
类目: Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:The training of ResNets and neural ODEs can be formulated and analyzed from the perspective of optimal control. This paper proposes a dissipative formulation of the training of ResNets and neural ODEs for classification problems by including a variant of the cross-entropy as a regularization in the stage cost. Based on this dissipative formulation, we prove that the trained ResNet exhibits the turnpike phenomenon. We then illustrate the turnpike phenomenon empirically by training on the two-spirals and MNIST datasets. This can be used to find very shallow networks suitable for a given classification task.

[LG-58] FedMAP: Unlocking Potential in Personalized Federated Learning through Bi-Level MAP Optimization

链接: https://arxiv.org/abs/2405.19000
作者: Fan Zhang,Carlos Esteve-Yagüe,Sören Dittmer,Carola-Bibiane Schönlieb,Michael Roberts
关键词: Federated Learning, enables collaborative training, machine learning models, machine learning, preserving data privacy
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Federated Learning (FL) enables collaborative training of machine learning models on decentralized data while preserving data privacy. However, data across clients often differs significantly due to class imbalance, feature distribution skew, sample size imbalance, and other phenomena. Leveraging information from these not identically distributed (non-IID) datasets poses substantial challenges. FL methods based on a single global model cannot effectively capture the variations in client data and underperform in non-IID settings. Consequently, Personalized FL (PFL) approaches that adapt to each client’s data distribution but leverage other clients’ data are essential but currently underexplored. We propose a novel Bayesian PFL framework using bi-level optimization to tackle the data heterogeneity challenges. Our proposed framework utilizes the global model as a prior distribution within a Maximum A Posteriori (MAP) estimation of personalized client models. This approach facilitates PFL by integrating shared knowledge from the prior, thereby enhancing local model performance, generalization ability, and communication efficiency. We extensively evaluated our bi-level optimization approach on real-world and synthetic datasets, demonstrating significant improvements in model accuracy compared to existing methods while reducing communication overhead. This study contributes to PFL by establishing a solid theoretical foundation for the proposed method and offering a robust, ready-to-use framework that effectively addresses the challenges posed by non-IID data in FL.
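To make the MAP-with-global-prior idea concrete: under a Gaussian prior centered at the global model, the personalized MAP objective reduces to the local loss plus an L2 proximity term. The ridge-regression setting, the `lam` strength, and the plain gradient-descent solver below are illustrative assumptions of this sketch, not the paper's bi-level procedure:

```python
import numpy as np

def local_map_update(w, w_global, X, y, lam=0.5, lr=0.02, steps=300):
    # MAP objective under a Gaussian prior N(w_global, lam^{-1} I):
    #   0.5 * ||X w - y||^2 + (lam / 2) * ||w - w_global||^2
    # The prior term is where the shared global knowledge enters.
    for _ in range(steps):
        grad = X.T @ (X @ w - y) + lam * (w - w_global)
        w = w - lr * grad
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=20)
w_global = np.zeros(3)  # stand-in for the aggregated global model
w_personal = local_map_update(np.zeros(3), w_global, X, y)
```

In this quadratic case the MAP solution has a closed form, `(X^T X + lam I)^{-1} (X^T y + lam w_global)`, which makes the shrinkage toward the global model explicit.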

[LG-59] Robust Optimization in Protein Fitness Landscapes Using Reinforcement Learning in Latent Space

链接: https://arxiv.org/abs/2405.18986
作者: Minji Lee,Luiz Felipe Vecchietti,Hyunkyu Jung,Hyun Joo Ro,Meeyoung Cha,Ho Min Kim
关键词: complex molecules responsible, functions in nature, complex molecules, molecules responsible, Abstract
类目: Machine Learning (cs.LG); Biomolecules (q-bio.BM); Quantitative Methods (q-bio.QM)
*备注: ICML 2024

点击查看摘要

Abstract:Proteins are complex molecules responsible for different functions in nature. Enhancing the functionality of proteins and cellular fitness can significantly impact various industries. However, protein optimization using computational methods remains challenging, especially when starting from low-fitness sequences. We propose LatProtRL, an optimization method to efficiently traverse a latent space learned by an encoder-decoder leveraging a large protein language model. To escape local optima, our optimization is modeled as a Markov decision process using reinforcement learning acting directly in latent space. We evaluate our approach on two important fitness optimization tasks, demonstrating its ability to achieve comparable or superior fitness over baseline methods. Our findings and in vitro evaluation show that the generated sequences can reach high-fitness regions, suggesting a substantial potential of LatProtRL in lab-in-the-loop scenarios.

[LG-60] Optimizing Vehicular Networks with Variational Quantum Circuits-based Reinforcement Learning

链接: https://arxiv.org/abs/2405.18984
作者: Zijiang Yan,Ramsundar Tanikella,Hina Tabassum
关键词: dependable network connectivity, ensuring both road, utmost importance, Variational Quantum Circuit, road safety
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
*备注: Accepted By INFOCOM 2024 Poster - 2024 IEEE International Conference on Computer Communications

点击查看摘要

Abstract:In vehicular networks (VNets), ensuring both road safety and dependable network connectivity is of utmost importance. Achieving this necessitates the creation of resilient and efficient decision-making policies that prioritize multiple objectives. In this paper, we develop a Variational Quantum Circuit (VQC)-based multi-objective reinforcement learning (MORL) framework to characterize efficient network selection and autonomous driving policies in a vehicular network (VNet). Numerical results showcase notable enhancements in both convergence rates and rewards when compared to conventional deep-Q networks (DQNs), validating the efficacy of the VQC-MORL solution.

[LG-61] Federated Learning under Partially Class-Disjoint Data via Manifold Reshaping

链接: https://arxiv.org/abs/2405.18983
作者: Ziqing Fan,Jiangchao Yao,Ruipeng Zhang,Lingjuan Lyu,Ya Zhang,Yanfeng Wang
关键词: Statistical heterogeneity severely, heterogeneity severely limits, MOON and FedDyn, Statistical heterogeneity, motivating several explorations
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Statistical heterogeneity severely limits the performance of federated learning (FL), motivating several explorations, e.g., FedProx, MOON and FedDyn, to alleviate this problem. Despite their effectiveness, the scenario they consider generally requires samples from almost all classes during the local training of each client, although some covariate shifts may exist among clients. In fact, the natural case of partially class-disjoint data (PCDD), where each client contributes a few classes (instead of all classes) of samples, is practical yet underexplored. Specifically, the unique collapse and invasion characteristics of PCDD can induce a biased optimization direction in local training, which undermines the efficiency of federated learning. To address this dilemma, we propose a manifold reshaping approach called FedMR to calibrate the feature space of local training. Our FedMR adds two interplaying losses to vanilla federated learning: an intra-class loss that decorrelates feature dimensions to prevent collapse, and an inter-class loss that guarantees a proper margin among categories in the feature expansion. We conduct extensive experiments on a range of datasets to demonstrate that our FedMR achieves much higher accuracy and better communication efficiency. Source code is available at: this https URL.
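An anti-collapse decorrelation penalty of the kind the intra-class loss describes can be sketched as the sum of squared off-diagonal feature correlations. This is a generic formulation under our own assumptions, not FedMR's exact loss:

```python
import numpy as np

def decorrelation_penalty(features):
    """Anti-collapse penalty (sketch): standardize each feature dimension,
    form the correlation matrix, and penalize its off-diagonal mass so
    that dimensions stay decorrelated instead of collapsing together."""
    Z = features - features.mean(axis=0)
    Z = Z / (Z.std(axis=0) + 1e-8)
    C = (Z.T @ Z) / len(Z)
    off_diag = C - np.diag(np.diag(C))
    return float(np.sum(off_diag ** 2))

rng = np.random.default_rng(0)
healthy = rng.normal(size=(256, 8))                    # decorrelated features
collapsed = np.outer(rng.normal(size=256), np.ones(8)) # all dims identical
```

A collapsed representation (all dimensions perfectly correlated) gets the maximal penalty, while independent dimensions get a near-zero one, which is the gradient signal such a loss provides during local training.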

[LG-62] MANO: Exploiting Matrix Norm for Unsupervised Accuracy Estimation Under Distribution Shifts

链接: https://arxiv.org/abs/2405.18979
作者: Renchunzi Xie,Ambroise Odonnat,Vasilii Feofanov,Weijian Deng,Jianfeng Zhang,Bo An
关键词: ground truth labels, pre-trained neural network, Leveraging the models’, models’ outputs, samples without requiring
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: The three first authors contributed equally

点击查看摘要

Abstract:Leveraging the models' outputs, specifically the logits, is a common approach to estimating the test accuracy of a pre-trained neural network on out-of-distribution (OOD) samples without requiring access to the corresponding ground-truth labels. Despite their ease of implementation and computational efficiency, current logit-based methods are vulnerable to overconfidence issues, leading to prediction bias, especially under natural shift. In this work, we first study the relationship between logits and generalization performance from the view of the low-density separation assumption. Our findings motivate our proposed method MaNo, which (1) applies a data-dependent normalization on the logits to reduce prediction bias, and (2) takes the L_p norm of the matrix of normalized logits as the estimation score. Our theoretical analysis highlights the connection between the provided score and the model's uncertainty. We conduct an extensive empirical study on common unsupervised accuracy estimation benchmarks and demonstrate that MaNo achieves state-of-the-art performance across various architectures in the presence of synthetic, natural, or subpopulation shifts.
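The "normalize the logits, then take a matrix norm" recipe can be sketched as follows. The softmax normalization and `p = 4` here are stand-ins (MaNo defines its own data-dependent normalization), so treat this purely as an illustration of a logit-based estimation score:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def logit_norm_score(logits, p=4):
    """Illustrative unsupervised accuracy score: normalize each row of
    the logit matrix, then take an entrywise L_p norm averaged over
    samples. Sharper (more confident, well-separated) predictions raise
    the score; near-uniform predictions lower it."""
    probs = softmax(logits)
    return float((np.sum(probs ** p) / probs.shape[0]) ** (1 / p))

confident = np.array([[8.0, 0.0, 0.0], [0.0, 9.0, 0.0]])
uncertain = np.array([[0.3, 0.2, 0.1], [0.1, 0.2, 0.3]])
```

The score is label-free: it is computed from the model's outputs on unlabeled OOD inputs alone, which is what makes this family of estimators attractive in deployment.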

[LG-63] Hierarchical Classification Auxiliary Network for Time Series Forecasting

链接: https://arxiv.org/abs/2405.18975
作者: Yanru Sun,Zongxia Xie,Dongyue Chen,Emadeldeen Eldele,Qinghua Hu
关键词: capture sequence relationships, Deep learning, significantly advanced time, time series, advanced time series
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deep learning has significantly advanced time series forecasting through its powerful capacity to capture sequence relationships. However, training these models with the Mean Square Error (MSE) loss often results in over-smooth predictions, making it challenging to handle the complexity and learn high-entropy features from time series data with high variability and unpredictability. In this work, we introduce a novel approach by tokenizing time series values to train forecasting models via cross-entropy loss, while considering the continuous nature of time series data. Specifically, we propose Hierarchical Classification Auxiliary Network, HCAN, a general model-agnostic component that can be integrated with any forecasting model. HCAN is based on a Hierarchy-Aware Attention module that integrates multi-granularity high-entropy features at different hierarchy levels. At each level, we assign a class label for timesteps to train an Uncertainty-Aware Classifier. This classifier mitigates the over-confidence in softmax loss via evidence theory. We also implement a Hierarchical Consistency Loss to maintain prediction consistency across hierarchy levels. Extensive experiments integrating HCAN with state-of-the-art forecasting models demonstrate substantial improvements over baselines on several real-world datasets. Code is available at: this https URL.
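The basic move of training a forecaster with cross-entropy, tokenizing continuous values into class labels, can be sketched with quantile binning. The bin count and quantile scheme are illustrative choices of this sketch; HCAN additionally organizes such labels into a multi-level hierarchy:

```python
import numpy as np

def tokenize_series(values, n_bins=16):
    """Turn continuous targets into class labels so a forecaster can be
    trained with a cross-entropy-style loss instead of MSE. Quantile bin
    edges keep the classes roughly balanced regardless of the series'
    marginal distribution."""
    edges = np.quantile(values, np.linspace(0, 1, n_bins + 1)[1:-1])
    labels = np.digitize(values, edges)  # labels in 0..n_bins-1
    return labels, edges

rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(size=1000))  # toy random-walk series
labels, edges = tokenize_series(series)
```

A predicted label can be mapped back to a numeric forecast (e.g., the bin midpoint), which is one simple way to reconcile classification training with the continuous nature of the data.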

[LG-64] Federated Learning with Bilateral Curation for Partially Class-Disjoint Data

链接: https://arxiv.org/abs/2405.18972
作者: Ziqing Fan,Ruipeng Zhang,Jiangchao Yao,Bo Han,Ya Zhang,Yanfeng Wang
关键词: Partially class-disjoint data, Partially class-disjoint, under-explored data formation, locally existing classes, locally missing classes
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Partially class-disjoint data (PCDD), a common yet under-explored data formation where each client contributes a part of classes (instead of all classes) of samples, severely challenges the performance of federated algorithms. Without full classes, the local objective will contradict the global objective, yielding the angle collapse problem for locally missing classes and the space waste problem for locally existing classes. As far as we know, none of the existing methods can intrinsically mitigate PCDD challenges to achieve holistic improvement in the bilateral views (both global view and local view) of federated learning. To address this dilemma, we are inspired by the strong generalization of the simplex Equiangular Tight Frame (ETF) on imbalanced data, and propose a novel approach called FedGELA, where the classifier is globally fixed as a simplex ETF while locally adapted to the personal distributions. Globally, FedGELA provides fair and equal discrimination for all classes and avoids inaccurate updates of the classifier, while locally it utilizes the space of locally missing classes for locally existing classes. We conduct extensive experiments on a range of datasets to demonstrate that our FedGELA achieves promising performance (an average improvement of 3.9% over FedAvg and 1.5% over the best baselines) and provide both local and global convergence guarantees. Source code is available at: this https URL.
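A simplex ETF classifier of the kind described above can be built directly from the standard construction M = sqrt(k/(k-1)) · U (I - 11^T/k), with U having orthonormal columns; the dimensions and seed below are arbitrary, and FedGELA's local adaptation step is omitted:

```python
import numpy as np

def simplex_etf(k, d, seed=0):
    """Construct a k-class simplex Equiangular Tight Frame in d >= k-1
    dimensions. Any two class prototypes (columns) then meet at cosine
    -1/(k-1): equal, maximal pairwise separation, which is why fixing
    the classifier to this geometry treats all classes symmetrically."""
    assert d >= k - 1
    rng = np.random.default_rng(seed)
    U, _ = np.linalg.qr(rng.normal(size=(d, k)))  # orthonormal columns
    return np.sqrt(k / (k - 1)) * U @ (np.eye(k) - np.ones((k, k)) / k)

M = simplex_etf(4, 8)  # columns = 4 class prototypes in R^8
G = M.T @ M            # Gram matrix: 1 on the diagonal, -1/3 elsewhere
```

Because the geometry is fixed in advance, no client's skewed data can drag the classifier toward its locally present classes, which is the global half of the bilateral view.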

[LG-65] UniIF: Unified Molecule Inverse Folding

链接: https://arxiv.org/abs/2405.18968
作者: Zhangyang Gao,Jue Wang,Cheng Tan,Lirong Wu,Yufei Huang,Siyuan Li,Zhirui Ye,Stan Z. Li
关键词: revolutionize drug discovery, Molecule inverse folding, chemistry and biology, long-standing challenge, challenge in chemistry
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:Molecule inverse folding has been a long-standing challenge in chemistry and biology, with the potential to revolutionize drug discovery and material science. Although specialized models have been proposed for different small or macro molecules, few have attempted to unify the learning process, resulting in redundant efforts. Complementary to recent advancements in molecular structure prediction, such as RoseTTAFold All-Atom and AlphaFold3, we propose the unified model UniIF for the inverse folding of all molecules. We achieve this unification at two levels: 1) Data-Level: we propose a unified block graph data form for all molecules, including local frame building and geometric feature initialization. 2) Model-Level: we introduce a geometric block attention network, comprising geometric interaction, interactive attention and virtual long-term dependency modules, to capture the 3D interactions of all molecules. Through comprehensive evaluations across various tasks such as protein design, RNA design, and material design, we demonstrate that our proposed method surpasses state-of-the-art methods on all tasks. UniIF offers a versatile and effective solution for general molecule inverse folding.

[LG-66] MAGIC: Modular Auto-encoder for Generalisable Model Inversion with Bias Corrections

链接: https://arxiv.org/abs/2405.18953
作者: Yihang She,Clement Atzberger,Andrew Blake,Adriano Gualandi,Srinivasan Keshav
关键词: understand the natural, natural world, world and uncover, model, Scientists
类目: Machine Learning (cs.LG)
*备注: arXiv admin note: text overlap with arXiv:2403.02922

点击查看摘要

Abstract:Scientists often model physical processes to understand the natural world and uncover the causation behind observations. Due to unavoidable simplification, discrepancies often arise between model predictions and actual observations, in the form of systematic biases, whose impact varies with model completeness. Classical model inversion methods such as Bayesian inference or regressive neural networks tend either to overlook biases or make assumptions about their nature during data preprocessing, potentially leading to implausible results. Inspired by recent work in inverse graphics, we replace the decoder stage of a standard autoencoder with a physical model followed by a bias-correction layer. This generalisable approach simultaneously inverts the model and corrects its biases in an end-to-end manner without making strong assumptions about the nature of the biases. We demonstrate the effectiveness of our approach using two physical models from disparate domains: a complex radiative transfer model from remote sensing; and a volcanic deformation model from geodesy. Our method matches or surpasses results from classical approaches without requiring biases to be explicitly filtered out, suggesting an effective pathway for understanding the causation of various physical processes.

[LG-67] Are You Sure? Rank Them Again: Repeated Ranking For Better Preference Datasets

链接: https://arxiv.org/abs/2405.18952
作者: Peter Devine
关键词: Reinforcement Learning, aligns model outputs, Training Large Language, Large Language Models, Training Large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Training Large Language Models (LLMs) with Reinforcement Learning from AI Feedback (RLAIF) aligns model outputs more closely with human preferences. This involves an evaluator model ranking multiple candidate responses to user prompts. However, the rankings from popular evaluator models such as GPT-4 can be inconsistent. We propose the Repeat Ranking method - where we evaluate the same responses multiple times and train only on those responses which are consistently ranked. Using 2,714 prompts in 62 languages, we generated responses from 7 top multilingual LLMs and had GPT-4 rank them five times each. Evaluating on MT-Bench chat benchmarks in six languages, our method outperformed the standard practice of training on all available prompts. Our work highlights the quality versus quantity trade-off in RLAIF dataset generation and offers a stackable strategy for enhancing dataset and thus model quality.
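The filtering step at the heart of Repeat Ranking amounts to keeping only prompts whose candidate rankings agree across evaluator repeats. The prompt names and rankings below are hypothetical, and strict agreement is just one possible criterion (majority agreement is an obvious relaxation):

```python
def consistently_ranked(rankings):
    """Keep a prompt only if the evaluator produced the identical ranking
    of its candidate responses on every repeat (strict-agreement sketch
    of the Repeat Ranking filter)."""
    return len(set(map(tuple, rankings))) == 1

# Hypothetical evaluator output: 3 prompts, each ranked 3 times.
repeats = {
    "p1": [[0, 2, 1], [0, 2, 1], [0, 2, 1]],  # consistent -> keep
    "p2": [[1, 0, 2], [0, 1, 2], [1, 0, 2]],  # inconsistent -> drop
    "p3": [[2, 1, 0], [2, 1, 0], [2, 1, 0]],  # consistent -> keep
}
kept = [p for p, r in repeats.items() if consistently_ranked(r)]
```

Training then proceeds only on the kept prompts, trading dataset size for ranking reliability, which is exactly the quality-versus-quantity trade-off the abstract highlights.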

[LG-68] Learning to Recover from Plan Execution Errors during Robot Manipulation: A Neuro-symbolic Approach

链接: https://arxiv.org/abs/2405.18948
作者: Namasivayam Kalithasan,Arnav Tuli,Vishal Bindal,Himanshu Gaurav Singh,Parag Singla,Rohan Paul
关键词: Automatically detecting, autonomous robots, important but challenging, challenging problem, problem for autonomous
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:Automatically detecting and recovering from failures is an important but challenging problem for autonomous robots. Most of the recent work on learning to plan from demonstrations lacks the ability to detect and recover from errors in the absence of an explicit state representation and/or a (sub-)goal check function. We propose an approach (blending learning with symbolic search) for automated error discovery and recovery, without needing annotated data of failures. Central to our approach is a neuro-symbolic state representation, in the form of a dense scene graph, structured based on the objects present within the environment. This enables efficient learning of the transition function and a discriminator that not only identifies failures but also localizes them, facilitating fast re-planning via computation of a heuristic distance function. We also present an anytime version of our algorithm, where instead of recovering to the last correct state, we search for a sub-goal in the original plan minimizing the total distance to the goal given a re-planning budget. Experiments on a physics simulator with a variety of simulated failures show the effectiveness of our approach compared to existing baselines, both in terms of efficiency as well as accuracy of our recovery mechanism.

[LG-69] WTTFNet: A Weather-Time-Trajectory Fusion Network for Pedestrian Trajectory Prediction in Urban Complex

链接: https://arxiv.org/abs/2405.18945
作者: Ho Chun Wu,Esther Hoi Shan Lau,Paul Yuen,Kevin Hung,John Kwok Tai Chui,Andrew Kwok Fai Lui
关键词: urban complex, complex is challenging, Pedestrian trajectory modelling, Pacific Trade Center, affect pedestrian behavior
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 12 pages, 7 figures

点击查看摘要

Abstract:Pedestrian trajectory modelling in an urban complex is challenging because pedestrians can have many possible destinations, such as shops, escalators, and attractions. Moreover, weather and time-of-day may affect pedestrian behavior. In this paper, a new weather-time-trajectory fusion network (WTTFNet) is proposed to improve the performance of a baseline deep neural network architecture. By incorporating weather and time-of-day information as an embedding structure, a novel WTTFNet based on a gated multimodal unit is used to fuse the multimodal information and the deep representation of trajectories. A joint loss function based on focal loss is used to co-optimize both the deep trajectory features and the final classifier, which helps to improve the accuracy in predicting the intended destination of pedestrians, and hence the trajectories, under possible class imbalances. Experimental results using the Osaka Asia and Pacific Trade Center (ATC) dataset show improved performance of the proposed approach over state-of-the-art algorithms, with a 23.67% increase in classification accuracy and 9.16% and 7.07% reductions in average and final displacement error, respectively. The proposed approach may serve as an attractive way of improving existing baseline trajectory prediction models when they are applied to scenarios influenced by weather and time-of-day conditions. It can be employed in numerous applications such as pedestrian facility engineering, public space development and technology-driven retail.
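The focal loss underlying the joint objective down-weights easy, well-classified examples relative to plain cross-entropy, which is what helps under destination-class imbalance. A minimal sketch (the fusion network and joint weighting are omitted; the probabilities below are made up):

```python
import numpy as np

def focal_loss(probs, labels, gamma=2.0):
    """Focal loss FL = -(1 - p_t)^gamma * log(p_t), averaged over samples.
    The (1 - p_t)^gamma factor shrinks the contribution of confident,
    easy examples so training concentrates on hard or minority classes.
    gamma = 0 recovers standard cross-entropy."""
    p_t = probs[np.arange(len(labels)), labels]
    return float(np.mean(-((1.0 - p_t) ** gamma) * np.log(p_t)))

probs = np.array([[0.9, 0.1],   # easy example, true class 0 (p_t = 0.9)
                  [0.6, 0.4]])  # hard example, true class 1 (p_t = 0.4)
labels = np.array([0, 1])
ce = float(np.mean(-np.log(probs[np.arange(2), labels])))
fl = focal_loss(probs, labels)
```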

[LG-70] Verifiably Robust Conformal Prediction

链接: https://arxiv.org/abs/2405.18942
作者: Linus Jeary,Tom Kuipers,Mehran Hosseini,Nicola Paoletti
关键词: popular uncertainty quantification, statistically valid prediction, valid prediction sets, uncertainty quantification method, statistically valid
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Conformal Prediction (CP) is a popular uncertainty quantification method that provides distribution-free, statistically valid prediction sets, assuming that training and test data are exchangeable. In such a case, CP's prediction sets are guaranteed to cover the (unknown) true test output with a user-specified probability. Nevertheless, this guarantee is violated when the data is subjected to adversarial attacks, which often result in a significant loss of coverage. Recently, several approaches have been put forward to recover CP guarantees in this setting. These approaches leverage variations of randomised smoothing to produce conservative sets which account for the effect of the adversarial perturbations. They are, however, limited in that they only support \ell^2-bounded perturbations and classification tasks. This paper introduces VRCP (Verifiably Robust Conformal Prediction), a new framework that leverages recent neural network verification methods to recover coverage guarantees under adversarial attacks. Our VRCP method is the first to support perturbations bounded by arbitrary norms, including \ell^1, \ell^2, and \ell^\infty, as well as regression tasks. We evaluate and compare our approach on image classification tasks (CIFAR10, CIFAR100, and TinyImageNet) and on regression tasks for deep reinforcement learning environments. In every case, VRCP achieves above-nominal coverage and yields significantly more efficient and informative prediction regions than the SotA.
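For context, the vanilla split conformal procedure that such robust variants build on looks like this. The scores and miscoverage level are illustrative, and the verification-based adjustment that provides adversarial robustness is not shown:

```python
import numpy as np

def conformal_set(cal_scores, test_scores, alpha=0.1):
    """Vanilla split conformal prediction (no adversarial robustness):
    take the ceil((n+1)(1-alpha))/n empirical quantile of calibration
    nonconformity scores, then include every candidate label whose
    score does not exceed it. Under exchangeability the resulting set
    covers the true label with probability >= 1 - alpha."""
    n = len(cal_scores)
    q = np.quantile(cal_scores, np.ceil((n + 1) * (1 - alpha)) / n,
                    method="higher")
    return [k for k, s in enumerate(test_scores) if s <= q]

rng = np.random.default_rng(0)
# Nonconformity = 1 - softmax probability of the true class (calibration split).
cal_scores = rng.uniform(0, 0.5, size=200)
# Scores of each candidate label for one test point (3 classes).
test_scores = [0.05, 0.30, 0.93]
pred_set = conformal_set(cal_scores, test_scores)
```

A robust variant effectively replaces the plain comparison `s <= q` with one that accounts for how much an adversarial perturbation could shift `s`, producing conservative (larger) sets.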

[LG-71] Content-Agnostic Moderation for Stance-Neutral Recommendation

链接: https://arxiv.org/abs/2405.18941
作者: Nan Li,Bo Kang,Tijl De Bie
关键词: exacerbating opinion polarization, Personalized recommendation systems, Personalized recommendation, exacerbating opinion, moderation
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Personalized recommendation systems often drive users towards more extreme content, exacerbating opinion polarization. While (content-aware) moderation has been proposed to mitigate these effects, such approaches risk curtailing the freedom of speech and of information. To address this concern, we propose and explore the feasibility of content-agnostic moderation as an alternative approach for reducing polarization. Content-agnostic moderation does not rely on the actual content being moderated, arguably making it less prone to forms of censorship. We establish theoretically that content-agnostic moderation cannot be guaranteed to work in a fully generic setting. However, we show that it can often be effectively achieved in practice with plausible assumptions. We introduce two novel content-agnostic moderation methods that modify the recommendations from the content recommender to disperse user-item co-clusters without relying on content features. To evaluate the potential of content-agnostic moderation in controlled experiments, we built a simulation environment to analyze the closed-loop behavior of a system with a given set of users, recommendation system, and moderation approach. Through comprehensive experiments in this environment, we show that our proposed moderation methods significantly enhance stance neutrality and maintain high recommendation quality across various data scenarios. Our results indicate that achieving stance neutrality without direct content information is not only feasible but can also help in developing more balanced and informative recommendation systems without substantially degrading user engagement.

[LG-72] LSPI: Heterogeneous Graph Neural Network Classification Aggregation Algorithm Based on Size Neighbor Path Identification

链接: https://arxiv.org/abs/2405.18933
作者: Yufei Zhao,Shiduo Wang,Hua Duan
关键词: Existing heterogeneous graph, large neighbor paths, heterogeneous graph neural, graph neural network, small neighbor paths
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Existing heterogeneous graph neural network algorithms (HGNNs) mostly rely on meta-paths to capture the rich semantic information contained in heterogeneous graphs (also known as heterogeneous information networks (HINs)), but most of these HGNNs focus on different ways of feature aggregation and ignore the properties of the meta-paths themselves. This paper studies meta-paths in three commonly used data sets and finds that there are huge differences in the number of neighbors connected by different meta-paths. At the same time, the noise information contained in large neighbor paths will have an adverse impact on model performance. Therefore, this paper proposes a Heterogeneous Graph Neural Network Classification and Aggregation Algorithm Based on Large and Small Neighbor Path Identification (LSPI). LSPI first divides the meta-paths into large and small neighbor paths through a path discriminator, and, to reduce the noise interference in large neighbor paths, selects neighbor nodes with higher similarity from both topology and feature perspectives. Small neighbor paths and the filtered large neighbor paths are passed through different graph convolution components and aggregated to obtain feature information under different subgraphs, and LSPI then uses subgraph-level attention to fuse the feature information from the different subgraphs into the final node embedding. Finally, this paper verifies the superiority of the method through extensive experiments and also gives experiment-based suggestions on the number of nodes to retain in large neighbor paths. The complete reproducible code and data have been published at: this https URL.

[LG-73] Federated Continual Learning Goes Online: Leveraging Uncertainty for Modality-Agnostic Class-Incremental Learning

链接: https://arxiv.org/abs/2405.18925
作者: Giuseppe Serra,Florian Buettner
关键词: Federated Continual Learning, Federated Continual, increasingly investigated recently, Continual Learning, investigated recently
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Given the ability to model more realistic and dynamic problems, Federated Continual Learning (FCL) has been increasingly investigated recently. A well-known problem encountered in this setting is the so-called catastrophic forgetting, for which the learning model is inclined to focus on more recent tasks while forgetting the previously learned knowledge. The majority of the current approaches in FCL propose generative-based solutions to solve said problem. However, this setting requires multiple training epochs over the data, implying an offline setting where datasets are stored locally and remain unchanged over time. Furthermore, the proposed solutions are tailored for vision tasks solely. To overcome these limitations, we propose a new modality-agnostic approach to deal with the online scenario where new data arrive in streams of mini-batches that can only be processed once. To solve catastrophic forgetting, we propose an uncertainty-aware memory-based approach. In particular, we suggest using an estimator based on the Bregman Information (BI) to compute the model’s variance at the sample level. Through measures of predictive uncertainty, we retrieve samples with specific characteristics, and - by retraining the model on such samples - we demonstrate the potential of this approach to reduce the forgetting effect in realistic settings.

[LG-74] GLANCE: Global Actions in a Nutshell for Counterfactual Explainability

链接: https://arxiv.org/abs/2405.18921
作者: Ioannis Emiris,Dimitris Fotakis,Giorgos Giannopoulos,Dimitrios Gunopulos,Loukas Kavouras,Kleopatra Markou,Eleni Psaroudaki,Dimitrios Rontogiannis,Dimitris Sacharidis,Nikolaos Theologitis,Dimitrios Tomaras,Konstantinos Tsopelas
关键词: machine learning models, audit complex machine, complex machine learning, tool to understand, learning models
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Counterfactual explanations have emerged as an important tool to understand, debug, and audit complex machine learning models. To offer global counterfactual explainability, state-of-the-art methods construct summaries of local explanations, offering a trade-off among conciseness, counterfactual effectiveness, and counterfactual cost or burden imposed on instances. In this work, we provide a concise formulation of the problem of identifying global counterfactuals and establish principled criteria for comparing solutions, drawing inspiration from Pareto dominance. We introduce innovative algorithms designed to address the challenge of finding global counterfactuals for either the entire input space or specific partitions, employing clustering and decision trees as key components. Additionally, we conduct a comprehensive experimental evaluation, considering various instances of the problem and comparing our proposed algorithms with state-of-the-art methods. The results highlight the consistent capability of our algorithms to generate meaningful and interpretable global counterfactual explanations.

[LG-75] Causal Action Influence Aware Counterfactual Data Augmentation

链接: https://arxiv.org/abs/2405.18917
作者: Núria Armengol Urpí,Marco Bagatella,Marin Vlastelica,Georg Martius
关键词: robots complex behaviors, teaching robots complex, complex behaviors, valuable and practical, practical resources
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: Accepted in 41st International Conference on Machine Learning (ICML 2024)

点击查看摘要

Abstract:Offline data are both valuable and practical resources for teaching robots complex behaviors. Ideally, learning agents should not be constrained by the scarcity of available demonstrations, but rather generalize beyond the training distribution. However, the complexity of real-world scenarios typically requires huge amounts of data to prevent neural network policies from picking up on spurious correlations and learning non-causal relationships. We propose CAIAC, a data augmentation method that can create feasible synthetic transitions from a fixed dataset without having access to online environment interactions. By utilizing principled methods for quantifying causal influence, we are able to perform counterfactual reasoning by swapping action-unaffected parts of the state-space between independent trajectories in the dataset. We empirically show that this leads to a substantial increase in robustness of offline learning algorithms against distributional shift.
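The core counterfactual swap described above can be sketched as follows. The per-dimension causal influence scores are assumed to be precomputed (a hypothetical input here; the paper derives them with principled causal-influence measures):

```python
import numpy as np

def counterfactual_swap(traj_a, traj_b, influence, threshold=0.1):
    """Create a synthetic trajectory by swapping action-unaffected state dims.

    traj_a, traj_b: (T, D) state trajectories from independent episodes
    influence:      (D,)   per-dimension causal influence of the action
                    (assumed precomputed, e.g. via a causal influence score)
    Dimensions the action does not influence are taken from traj_b instead.
    """
    unaffected = influence < threshold
    synthetic = traj_a.copy()
    synthetic[:, unaffected] = traj_b[:, unaffected]
    return synthetic

a = np.zeros((4, 3))
b = np.ones((4, 3))
infl = np.array([0.9, 0.01, 0.02])  # the action only influences dim 0
syn = counterfactual_swap(a, b, infl)
assert (syn[:, 0] == 0).all() and (syn[:, 1:] == 1).all()
```

Because the swapped dimensions are causally independent of the action, the synthetic transition remains feasible, which is what makes it safe training data.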

[LG-76] Leveraging Time-Series Foundation Models in Smart Agriculture for Soil Moisture Forecasting

链接: https://arxiv.org/abs/2405.18913
作者: Boje Deforce,Bart Baesens,Estefanía Serral Asensio
关键词: natural language processing, psi, mathrm, recent surge, natural language
类目: Machine Learning (cs.LG)
*备注: 7 pages

点击查看摘要

Abstract:The recent surge in foundation models for natural language processing and computer vision has fueled innovation across various domains. Inspired by this progress, we explore the potential of foundation models for time-series forecasting in smart agriculture, a field often plagued by limited data availability. Specifically, this work presents a novel application of TimeGPT, a state-of-the-art (SOTA) time-series foundation model, to predict soil water potential (ψ_soil), a key indicator of field water status that is typically used for irrigation advice. Traditionally, this task relies on a wide array of input variables. We explore TimeGPT's ability to forecast ψ_soil in: (i) a zero-shot setting, (ii) a fine-tuned setting relying solely on historic ψ_soil measurements, and (iii) a fine-tuned setting where we also add exogenous variables to the model. We compare TimeGPT's performance to established SOTA baseline models for forecasting ψ_soil. Our results demonstrate that TimeGPT achieves competitive forecasting accuracy using only historical ψ_soil data, highlighting its remarkable potential for agricultural applications. This research paves the way for foundation time-series models for sustainable development in agriculture by enabling forecasting tasks that were traditionally reliant on extensive data collection and domain expertise.

[LG-77] Language Generation with Strictly Proper Scoring Rules

链接: https://arxiv.org/abs/2405.18906
作者: Chenze Shao,Fandong Meng,Yijin Liu,Jie Zhou
关键词: maximum likelihood estimation, logarithmic score, maximum likelihood, likelihood estimation, proper scoring rules
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: ICML 2024

点击查看摘要

Abstract:Language generation based on maximum likelihood estimation (MLE) has become the fundamental approach for text generation. Maximum likelihood estimation is typically performed by minimizing the log-likelihood loss, also known as the logarithmic score in statistical decision theory. The logarithmic score is strictly proper in the sense that it encourages honest forecasts, where the expected score is maximized only when the model reports true probabilities. Although many strictly proper scoring rules exist, the logarithmic score is the only local scoring rule among them that depends exclusively on the probability of the observed sample, making it capable of handling the exponentially large sample space of natural text. In this work, we propose a straightforward strategy for adapting scoring rules to language generation, allowing for language modeling with any non-local scoring rules. Leveraging this strategy, we train language generation models using two classic strictly proper scoring rules, the Brier score and the Spherical score, as alternatives to the logarithmic score. Experimental results indicate that simply substituting the loss function, without adjusting other hyperparameters, can yield substantial improvements in the model's generation capabilities. Moreover, these improvements can scale up to large language models (LLMs) such as LLaMA-7B and LLaMA-13B. Source code: this https URL.
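The three scoring rules discussed (logarithmic, Brier, and Spherical) are easy to state for a predicted distribution p and an observed class y; the snippet below also checks the "honest forecast" property of strict propriety on a toy distribution:

```python
import numpy as np

def log_score(p, y):
    return np.log(p[y])                    # local: depends only on p[y]

def brier_score(p, y):
    e_y = np.eye(len(p))[y]
    return -np.sum((p - e_y) ** 2)         # strictly proper, non-local

def spherical_score(p, y):
    return p[y] / np.linalg.norm(p)        # strictly proper, non-local

def expected(score, q, p_true):
    """Expected score when reporting q while y ~ p_true."""
    return sum(p_true[y] * score(q, y) for y in range(len(p_true)))

p_honest = np.array([0.7, 0.2, 0.1])
p_dishonest = np.array([0.9, 0.05, 0.05])
# Strict propriety: reporting the true distribution maximizes the expected score.
for score in (log_score, brier_score, spherical_score):
    assert expected(score, p_honest, p_honest) > expected(score, p_dishonest, p_honest)
```

The locality comment is the paper's key point: only the log score can be evaluated from p[y] alone, so using Brier or Spherical scores for text requires the adaptation strategy the paper proposes.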

[LG-78] A Causal Framework for Evaluating Deferring Systems

链接: https://arxiv.org/abs/2405.18902
作者: Filippo Palomba,Andrea Pugnana,José Manuel Alvarez,Salvatore Ruggieri
关键词: supervised Machine Learning, extend supervised Machine, Machine Learning, supervised Machine, systems extend supervised
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Deferring systems extend supervised Machine Learning (ML) models with the possibility to defer predictions to human experts. However, evaluating the impact of a deferring strategy on system accuracy is still an overlooked area. This paper fills this gap by evaluating deferring systems through a causal lens. We link the potential outcomes framework for causal inference with deferring systems. This allows us to identify the causal impact of the deferring strategy on predictive accuracy. We distinguish two scenarios. In the first one, we can access both the human and the ML model predictions for the deferred instances. In such a case, we can identify the individual causal effects for deferred instances and aggregates of them. In the second scenario, only human predictions are available for the deferred instances. In this case, we can resort to regression discontinuity design to estimate a local causal effect. We empirically evaluate our approach on synthetic and real datasets for seven deferring systems from the literature.
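In the first scenario, where both the human and the model predictions are observed on the deferred instances, the aggregate causal effect reduces to an accuracy contrast on the deferred set. A minimal sketch (an illustration of that scenario, not the paper's full estimator):

```python
import numpy as np

def deferral_effect(y_true, model_pred, human_pred, deferred):
    """Average effect of deferring, when both predictions are observed.

    Positive values mean deferring to the human improved accuracy
    on the deferred instances.
    """
    d = np.asarray(deferred, dtype=bool)
    human_acc = np.mean(human_pred[d] == y_true[d])
    model_acc = np.mean(model_pred[d] == y_true[d])
    return human_acc - model_acc

y = np.array([1, 0, 1, 1])
model = np.array([1, 1, 0, 1])
human = np.array([1, 0, 1, 0])
deferred = np.array([False, True, True, False])
assert deferral_effect(y, model, human, deferred) == 1.0
```

In the second scenario only human predictions exist for deferred instances, so a contrast like this is unidentifiable and the paper resorts to regression discontinuity around the deferral threshold instead.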

[LG-79] Unit-Aware Genetic Programming for the Development of Empirical Equations

链接: https://arxiv.org/abs/2405.18896
作者: Julia Reuter,Viktor Martinek,Roland Herzog,Sanaz Mostaghim
关键词: domain experts require, domain experts, physical laws, experts require, accurate and adhere
类目: Machine Learning (cs.LG); Symbolic Computation (cs.SC)
*备注: Submitted to Conference Proceedings of PPSN2024

点击查看摘要

Abstract:When developing empirical equations, domain experts require these to be accurate and adhere to physical laws. Often, constants with unknown units need to be discovered alongside the equations. Traditional unit-aware genetic programming (GP) approaches cannot be used when unknown constants with undetermined units are included. This paper presents a method for dimensional analysis that propagates unknown units as “jokers” and returns the magnitude of unit violations. We propose three methods, namely evolutive culling, a repair mechanism, and a multi-objective approach, to integrate the dimensional analysis in the GP algorithm. Experiments on datasets with ground truth demonstrate comparable performance of evolutive culling and the multi-objective approach to a baseline without dimensional analysis. Extensive analysis of the results on datasets without ground truth reveals that the unit-aware algorithms make only small sacrifices in accuracy, while producing unit-adherent solutions. Overall, we present a promising novel approach for developing unit-adherent empirical equations.
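One way to propagate unknown units as "jokers" and to measure unit violations, as the abstract describes, is to represent units as exponent vectors over the base dimensions (a simplified sketch under that representation, not the authors' implementation):

```python
# Units as exponent vectors over SI base dimensions, e.g. (m, s, kg).
# A "joker" (an unknown constant's unit) is represented by None and matches anything.

def mul_units(a, b):
    """Multiplication adds exponents; a joker operand propagates as a joker."""
    if a is None or b is None:
        return None
    return tuple(x + y for x, y in zip(a, b))

def add_units(a, b):
    """Addition requires equal units; returns (unit, violation_magnitude)."""
    if a is None or b is None:
        return (a if b is None else b), 0
    violation = sum(abs(x - y) for x, y in zip(a, b))
    return a, violation

METER = (1, 0, 0)
SECOND = (0, 1, 0)
velocity = mul_units(METER, (0, -1, 0))      # m * s^-1
_, bad = add_units(velocity, SECOND)         # m/s + s: unit violation
_, ok = add_units(velocity, None)            # joker constant: always allowed
assert bad > 0 and ok == 0
```

The violation magnitude is exactly the kind of signal the three proposed integration methods (culling, repair, multi-objective) can then act on during evolution.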

[LG-80] Few-Shot Testing: Estimating Uncertainty of Memristive Deep Neural Networks Using One Bayesian Test Vector

链接: https://arxiv.org/abs/2405.18894
作者: Soyed Tuhin Ahmed,Mehdi Tahoori
关键词: increased tremendously recently, deep learning algorithms, neural networks, tremendously recently, deep learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
*备注:

点击查看摘要

Abstract:The performance of deep learning algorithms such as neural networks (NNs) has increased tremendously recently, and they can achieve state-of-the-art performance in many domains. However, due to memory and computation resource constraints, implementing NNs on edge devices is a challenging task. Therefore, hardware accelerators such as computation-in-memory (CIM) with memristive devices have been developed to accelerate the most common operations, i.e., matrix-vector multiplication. However, due to inherent device properties, external environmental factors such as temperature, and an immature fabrication process, memristors suffer from various non-idealities, including defects and variations occurring during manufacturing and runtime. Consequently, there is a lack of complete confidence in the predictions made by the model. To improve confidence in NN predictions made by hardware accelerators in the presence of device non-idealities, in this paper, we propose a Bayesian test vector generation framework that can estimate the model uncertainty of NNs implemented on memristor-based CIM hardware. Compared to the conventional point estimate test vector generation method, our method is more generalizable across different model dimensions and requires storing only one Bayesian test vector in the hardware. Our method is evaluated on different model dimensions, tasks, fault rates, and variation noise to show that it can consistently achieve 100% coverage with only 0.024 MB of memory overhead.

[LG-81] Locally Estimated Global Perturbations are Better than Local Perturbations for Federated Sharpness-aware Minimization

链接: https://arxiv.org/abs/2405.18890
作者: Ziqing Fan,Shengchao Hu,Jiangchao Yao,Gang Niu,Ya Zhang,Masashi Sugiyama,Yanfeng Wang
关键词: resulted global model, global loss landscape, sharper minima, local loss landscapes, multi-step update
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:In federated learning (FL), the multi-step update and data heterogeneity among clients often lead to a loss landscape with sharper minima, degrading the performance of the resulting global model. Prevalent federated approaches incorporate sharpness-aware minimization (SAM) into local training to mitigate this problem. However, the local loss landscapes may not accurately reflect the flatness of the global loss landscape in heterogeneous environments; as a result, minimizing local sharpness and calculating perturbations on client data might not align the efficacy of SAM in FL with centralized training. To overcome this challenge, we propose FedLESAM, a novel algorithm that locally estimates the direction of global perturbation on the client side as the difference between the global models received in the previous active and current rounds. Besides the improved quality, FedLESAM also speeds up federated SAM-based approaches, since it performs backpropagation only once in each iteration. Theoretically, we prove a slightly tighter bound than the original FedSAM by ensuring consistent perturbation. Empirically, we conduct comprehensive experiments on four federated benchmark datasets under three partition strategies to demonstrate the superior performance and efficiency of FedLESAM.
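FedLESAM's locally estimated global perturbation, the difference between the global models of the previous active round and the current round rescaled to the SAM radius, can be sketched as follows (sign convention inferred from the abstract, so treat it as an assumption):

```python
import numpy as np

def fedlesam_perturbation(global_prev, global_curr, rho=0.05):
    """Locally estimated global perturbation (no extra backprop required).

    global_prev: flattened global model from the previous active round
    global_curr: flattened global model from the current round
    Returns a SAM-style perturbation of norm rho along (prev - curr), the
    estimated ascent direction of the global loss (since the global update
    itself moved along the descent direction).
    """
    direction = global_prev - global_curr
    norm = np.linalg.norm(direction) + 1e-12
    return rho * direction / norm

prev = np.array([1.0, 2.0, 3.0])
curr = np.array([1.0, 1.0, 3.0])
eps = fedlesam_perturbation(prev, curr, rho=0.05)
assert np.isclose(np.linalg.norm(eps), 0.05)
```

Because the direction comes from already-received global models, the client skips the extra ascent-step backpropagation that vanilla SAM needs, which is where the speedup comes from.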

[LG-82] Proactive Load-Shaping Strategies with Privacy-Cost Trade-offs in Residential Households based on Deep Reinforcement Learning

链接: https://arxiv.org/abs/2405.18888
作者: Ruichang Zhang,Youcheng Sun,Mustafa A. Mustafa
关键词: Smart meters play, potentially revealing detailed, Smart meters, raise significant privacy, revealing detailed user
类目: Systems and Control (eess.SY); Machine Learning (cs.LG)
*备注: 7 pages

点击查看摘要

Abstract:Smart meters play a crucial role in enhancing energy management and efficiency, but they raise significant privacy concerns by potentially revealing detailed user behaviors through energy consumption patterns. Recent scholarly efforts have focused on developing battery-aided load-shaping techniques to protect user privacy while balancing costs. This paper proposes a novel deep reinforcement learning-based load-shaping algorithm (PLS-DQN) designed to protect user privacy by proactively creating artificial load signatures that mislead potential attackers. We evaluate our proposed algorithm against a non-intrusive load monitoring (NILM) adversary. The results demonstrate that our approach not only effectively conceals real energy usage patterns but also outperforms state-of-the-art methods in enhancing user privacy while maintaining cost efficiency.

[LG-83] Compressing Large Language Models using Low Rank and Low Precision Decomposition

链接: https://arxiv.org/abs/2405.18886
作者: Rajarshi Saha,Naomi Sagan,Varun Srivastava,Andrea J. Goldsmith,Mert Pilanci
关键词: Large Language Models, mathbf, Large Language, memory-constrained edge devices, sizes of Large
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: 30 pages, 9 figures, 7 tables

点击查看摘要

Abstract:The prohibitive sizes of Large Language Models (LLMs) today make it difficult to deploy them on memory-constrained edge devices. This work introduces CALDERA – a new post-training LLM compression algorithm that harnesses the inherent low-rank structure of a weight matrix W by approximating it via a low-rank, low-precision decomposition as W ≈ Q + LR. Here, L and R are low-rank factors, and the entries of Q, L and R are quantized. The model is compressed by substituting each layer with its Q + LR decomposition, and the zero-shot performance of the compressed model is evaluated. Additionally, L and R are readily amenable to low-rank adaptation, consequently enhancing the zero-shot performance. CALDERA obtains this decomposition by formulating it as an optimization problem min_{Q,L,R} ||(Q + LR − W)Xᵀ||²_F, where X is the calibration data, and Q, L, R are constrained to be representable using low-precision formats. Theoretical upper bounds on the approximation error of CALDERA are established using a rank-constrained regression framework, and the tradeoff between compression ratio and model performance is studied by analyzing the impact of target rank and quantization bit budget. Results illustrate that compressing LLaMa-2 7B/70B and LLaMa-3 8B models using CALDERA outperforms existing post-training LLM compression techniques in the regime of less than 2.5 bits per parameter. The implementation is available at: this https URL.
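A toy version of such a low-rank, low-precision decomposition can be built by alternating a rank-k SVD with uniform quantization. This is a simplified stand-in: unlike CALDERA, it ignores the calibration data X and leaves L and R unquantized.

```python
import numpy as np

def quantize(m, bits=4):
    """Uniform symmetric quantization to 2^bits levels (a simple stand-in)."""
    scale = np.abs(m).max() / (2 ** (bits - 1) - 1) + 1e-12
    return np.round(m / scale) * scale

def lowrank_lowprec(W, rank=2, bits=4, iters=10):
    """Alternate L, R = rank-k SVD of (W - Q) and Q = quantize(W - LR)."""
    Q = np.zeros_like(W)
    for _ in range(iters):
        U, s, Vt = np.linalg.svd(W - Q, full_matrices=False)
        L = U[:, :rank] * s[:rank]
        R = Vt[:rank]
        Q = quantize(W - L @ R, bits)
    return Q, L, R

# If W truly has low rank, the decomposition recovers it almost exactly,
# whereas direct quantization alone would incur a nonzero error.
rng = np.random.default_rng(0)
W = 5.0 * np.outer(rng.standard_normal(8), rng.standard_normal(8))
Q, L, R = lowrank_lowprec(W, rank=2)
assert np.linalg.norm(W - (Q + L @ R)) < 1e-6
```

CALDERA's actual objective additionally weights the residual by the calibration data and quantizes all three factors, but the alternating structure is the same basic idea.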

[LG-84] Tuning-Free Alignment of Diffusion Models with Direct Noise Optimization

链接: https://arxiv.org/abs/2405.18881
作者: Zhiwei Tang,Jiangweizhi Peng,Jiasheng Tang,Mingyi Hong,Fan Wang,Tsung-Hui Chang
关键词: represents specific objectives, improving human preference, diffusion models, continuous reward function, Direct Noise Optimization
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In this work, we focus on the alignment problem of diffusion models with a continuous reward function, which represents specific objectives for downstream tasks, such as improving human preference. The central goal of the alignment problem is to adjust the distribution learned by diffusion models such that the generated samples maximize the target reward function. We propose a novel alignment approach, named Direct Noise Optimization (DNO), that optimizes the injected noise during the sampling process of diffusion models. By design, DNO is tuning-free and prompt-agnostic, as the alignment occurs in an online fashion during generation. We rigorously study the theoretical properties of DNO and also propose variants to deal with non-differentiable reward functions. Furthermore, we identify that naive implementation of DNO occasionally suffers from the out-of-distribution reward hacking problem, where optimized samples have high rewards but are no longer in the support of the pretrained distribution. To remedy this issue, we leverage classical high-dimensional statistics theory and propose to augment the DNO loss with certain probability regularization. We conduct extensive experiments on several popular reward functions trained on human feedback data and demonstrate that the proposed DNO approach achieves state-of-the-art reward scores as well as high image quality, all within a reasonable time budget for generation.
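The essence of DNO, treating the injected noise as the optimization variable and ascending the reward, can be shown on a toy differentiable "sampler". Finite-difference gradients are used here for self-containment; the paper backpropagates through the actual diffusion sampling process.

```python
import numpy as np

def direct_noise_optimization(generator, reward, z0, steps=200, lr=0.1, eps=1e-3):
    """Optimize the injected noise z to maximize reward(generator(z)).

    Toy stand-in: finite-difference gradient ascent so any black-box
    reward works; a real implementation would use exact gradients.
    """
    z = z0.astype(float).copy()
    for _ in range(steps):
        base = reward(generator(z))
        grad = np.zeros_like(z)
        for i in range(len(z)):
            zp = z.copy()
            zp[i] += eps
            grad[i] = (reward(generator(zp)) - base) / eps
        z += lr * grad
    return z

gen = lambda z: 2.0 * z                    # toy "sampler": a linear map of the noise
rew = lambda x: -np.sum((x - 1.0) ** 2)    # reward peaks when outputs are near 1
z = direct_noise_optimization(gen, rew, np.zeros(3))
assert np.allclose(gen(z), 1.0, atol=0.05)
```

Note that nothing in the generator is updated: alignment happens purely through the noise, which is why the approach is tuning-free and prompt-agnostic. The probability regularization the paper adds would penalize z drifting far from the standard normal.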

[LG-85] Spatiotemporal Forecasting Meets Efficiency: Causal Graph Process Neural Networks

链接: https://arxiv.org/abs/2405.18879
作者: Aref Einizade,Fragkiskos D. Malliaros,Jhony H. Giraldo
关键词: Recurrent Neural Networks, Process Neural Network, Graph Neural Networks, Neural Networks, leveraging relational inductive
类目: Machine Learning (cs.LG)
*备注: 21 pages, 4 figures

点击查看摘要

Abstract:Graph Neural Networks (GNNs) have advanced spatiotemporal forecasting by leveraging relational inductive biases among sensors (or any other measuring scheme) represented as nodes in a graph. However, current methods often rely on Recurrent Neural Networks (RNNs), leading to increased runtimes and memory use. Moreover, these methods typically operate within 1-hop neighborhoods, exacerbating the reduction of the receptive field. Causal Graph Processes (CGPs) offer an alternative, using graph filters instead of MLP layers to reduce parameters and minimize memory consumption. This paper introduces the Causal Graph Process Neural Network (CGProNet), a non-linear model combining CGPs and GNNs for spatiotemporal forecasting. CGProNet employs higher-order graph filters, optimizing the model with fewer parameters, reducing memory usage, and improving runtime efficiency. We present a comprehensive theoretical and experimental stability analysis, highlighting key aspects of CGProNet. Experiments on synthetic and real data demonstrate CGProNet’s superior efficiency, minimizing memory and time requirements while maintaining competitive forecasting performance.
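The higher-order graph filters that CGProNet uses in place of MLP layers compute y = Σ_k h_k S^k x for a graph shift operator S (adjacency or Laplacian). A minimal sketch of such a polynomial filter:

```python
import numpy as np

def graph_filter(S, x, h):
    """Apply the polynomial graph filter y = sum_k h[k] * S^k @ x."""
    y = np.zeros_like(x)
    power = np.eye(S.shape[0])     # S^0
    for coeff in h:
        y = y + coeff * (power @ x)
        power = power @ S          # next power of the shift operator
    return y

# 3-node path graph adjacency as the shift operator
S = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
x = np.array([1., 0., 0.])
y = graph_filter(S, x, h=[0.5, 0.3, 0.2])
# An order-2 filter reaches 2-hop neighbours: node 2 receives signal via S^2.
assert np.allclose(y, [0.7, 0.3, 0.2])
```

An order-K filter has only K+1 parameters and a K-hop receptive field in a single layer, which is the parameter and memory saving the abstract refers to relative to stacked 1-hop layers with MLPs.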

[LG-86] Privacy Preserving Data Imputation via Multi-party Computation for Medical Applications

链接: https://arxiv.org/abs/2405.18878
作者: Julia Jentsch,Ali Burak Ünal,Şeyma Selcan Mağara,Mete Akgün
关键词: Handling missing data, Handling missing, methods, machine learning, crucial in machine
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: Submitted to IEEE International Conference on E-health Networking, Application Services

点击查看摘要

Abstract:Handling missing data is crucial in machine learning, but many datasets contain gaps due to errors or non-response. Unlike traditional methods such as listwise deletion, which are simple but inadequate, the literature offers more sophisticated and effective methods, thereby improving sample size and accuracy. However, these methods require access to the whole dataset, which contradicts privacy regulations when the data is distributed among multiple sources. Especially in the medical and healthcare domain, such access reveals sensitive information about patients. This study addresses privacy-preserving imputation methods for sensitive data using secure multi-party computation, enabling secure computations without revealing any party’s sensitive information. In this study, we realized the mean, median, regression, and kNN imputation methods in a privacy-preserving way. We specifically target the medical and healthcare domains considering the significance of protecting patient data, showcasing our methods on a diabetes dataset. Experiments on the diabetes dataset validated the correctness of our privacy-preserving imputation methods, yielding a largest error of around 3×10⁻³, closely matching plaintext methods. We also analyzed the scalability of our methods to varying numbers of samples, showing their applicability to real-world healthcare problems. Our analysis demonstrated that all our methods scale linearly with the number of samples. Except for kNN, the runtime of all our methods indicates that they can be utilized for large datasets.
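The simplest case, privacy-preserving mean imputation, can be illustrated with additive secret sharing. This is an illustrative sketch only: real MPC frameworks also hide the counts, use fixed-point encodings for fractional data, and cover the harder median, regression, and kNN protocols.

```python
import random

MODULUS = 2**31 - 1  # a prime modulus for additive shares

def share(value, n_parties, modulus=MODULUS):
    """Split an integer into n additive shares mod a prime modulus."""
    shares = [random.randrange(modulus) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % modulus)
    return shares

def reconstruct(shares, modulus=MODULUS):
    return sum(shares) % modulus

random.seed(0)
# Each party holds a local sum of the non-missing values of one feature.
local_sums, local_counts = [120, 75, 45], [10, 5, 5]
sum_shares = [share(v, 3) for v in local_sums]   # each row: shares of one party's sum
# Party j aggregates the j-th shares it receives; no single party sees another's input.
agg = [sum(col) % MODULUS for col in zip(*sum_shares)]
total = reconstruct(agg)
mean = total / sum(local_counts)                 # imputation value for missing entries
assert total == 240 and mean == 12.0
```

Only the final aggregate is opened, so each hospital's raw feature sums stay hidden while every party learns the same imputation value.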

[LG-87] Continuous Product Graph Neural Networks

链接: https://arxiv.org/abs/2405.18877
作者: Aref Einizade,Fragkiskos D. Malliaros,Jhony H. Giraldo
关键词: Processing multidomain data, holds significant potential, Processing multidomain, multidomain data defined, graphs holds significant
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 17 pages, 3 figures

点击查看摘要

Abstract:Processing multidomain data defined on multiple graphs holds significant potential in various practical applications in computer science. However, current methods are mostly limited to discrete graph filtering operations. Tensorial partial differential equations on graphs (TPDEGs) provide a principled framework for modeling structured data across multiple interacting graphs, addressing the limitations of the existing discrete methodologies. In this paper, we introduce Continuous Product Graph Neural Networks (CITRUS) that emerge as a natural solution to the TPDEG. CITRUS leverages the separability of continuous heat kernels from Cartesian graph products to efficiently implement graph spectral decomposition. We conduct thorough theoretical analyses of the stability and over-smoothing properties of CITRUS in response to domain-specific graph perturbations and graph spectra effects on the performance. We evaluate CITRUS on well-known traffic and weather spatiotemporal forecasting datasets, demonstrating superior performance over existing approaches.

[LG-88] DFAMiner: Mining minimal separating DFAs from labelled samples

链接: https://arxiv.org/abs/2405.18871
作者: Daniele Dell’Erba,Yong Li,Sven Schewe
关键词: separating deterministic finite, deterministic finite automata, deterministic finite, minimal separating deterministic, separating
类目: Formal Languages and Automata Theory (cs.FL); Machine Learning (cs.LG)
*备注: 24 pages including appendices and references; version for LearnAut workshop

点击查看摘要

Abstract:We propose DFAMiner, a passive learning tool for learning minimal separating deterministic finite automata (DFA) from a set of labelled samples. Separating automata are an interesting class of automata that occurs generally in regular model checking and has raised interest in foundational questions of parity game solving. We first propose a simple and linear-time algorithm that incrementally constructs a three-valued DFA (3DFA) from a set of labelled samples given in the usual lexicographical order. This 3DFA has accepting and rejecting states as well as don’t-care states, so that it can exactly recognise the labelled examples. We then apply our tool to mining a minimal separating DFA for the labelled samples by minimising the constructed automata via a reduction to solving SAT problems. Empirical evaluation shows that our tool outperforms current state-of-the-art tools significantly on standard benchmarks for learning minimal separating DFAs from samples. Progress in the efficient construction of separating DFAs can also lead to finding the lower bound of parity game solving, where we show that DFAMiner can create optimal separating automata for simple languages with up to 7 colours. Future improvements might offer inroads to better data structures.
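The first stage, incrementally building a three-valued DFA (accepting, rejecting, and don't-care states) that exactly recognises the labelled samples, can be approximated with a simple prefix-tree construction. This is a simplification: DFAMiner's linear-time algorithm produces a more compact 3DFA, and the SAT-based minimisation step is omitted entirely.

```python
def build_3dfa(samples):
    """Prefix-tree 3DFA: label is +1 (accept), -1 (reject), 0 (don't-care)."""
    root = {"label": 0, "next": {}}
    for word, accept in samples:
        state = root
        for ch in word:
            state = state["next"].setdefault(ch, {"label": 0, "next": {}})
        state["label"] = 1 if accept else -1
    return root

def classify(dfa, word):
    state = dfa
    for ch in word:
        if ch not in state["next"]:
            return 0               # unseen prefix: don't-care
        state = state["next"][ch]
    return state["label"]

dfa = build_3dfa([("ab", True), ("a", False)])
assert classify(dfa, "ab") == 1    # labelled accepting
assert classify(dfa, "a") == -1    # labelled rejecting
assert classify(dfa, "b") == 0     # never labelled: don't-care
```

The don't-care states are what give the subsequent minimisation its freedom: any DFA consistent with the +1/-1 labels, however it resolves the 0 states, is a valid separating automaton.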

[LG-89] Towards Data-Driven Electricity Management: Multi-Region Harmonized Data and Knowledge Graph

链接: https://arxiv.org/abs/2405.18869
作者: Vid Hanžel,Blaž Bertalanič,Carolina Fortuna
关键词: global electricity consumption, Due to growing, emissions are increasing, electricity consumption, technological advances
类目: Machine Learning (cs.LG)
*备注: Submitted to: Scientific Data

点击查看摘要

Abstract:Due to a growing population and technological advances, global electricity consumption, and consequently CO2 emissions, are increasing. The residential sector makes up 25% of global electricity consumption and has great potential to increase efficiency and reduce the CO2 footprint without sacrificing comfort. However, a lack of uniform consumption data at the household level spanning multiple regions hinders large-scale studies and robust multi-region model development. This paper introduces a multi-region dataset compiled from publicly available sources and presented in a uniform format. This data enables machine learning tasks such as disaggregation, demand forecasting, appliance ON/OFF classification, etc. Furthermore, we develop an RDF knowledge graph that characterizes the electricity consumption of the households and contextualizes it with household-related properties, enabling semantic queries and interoperability with other open knowledge bases like Wikidata and DBpedia. This structured data can be utilized to inform various stakeholders towards data-driven policy and business development.

[LG-90] Domain-Inspired Sharpness-Aware Minimization Under Domain Shifts

Link: https://arxiv.org/abs/2405.18861
Authors: Ruipeng Zhang,Ziqing Fan,Jiangchao Yao,Ya Zhang,Yanfeng Wang
Keywords: Domain-Inspired Sharpness-Aware Minimization, Sharpness-Aware Minimization, paper presents, presents a Domain-Inspired, Domain-Inspired Sharpness-Aware
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*Comments: Published as a conference paper at ICLR 2024

Click to view abstract

Abstract:This paper presents a Domain-Inspired Sharpness-Aware Minimization (DISAM) algorithm for optimization under domain shifts. It is motivated by the inconsistent convergence degree of SAM across different domains, which induces optimization bias towards certain domains and thus impairs the overall convergence. To address this issue, we consider the domain-level convergence consistency in the sharpness estimation to prevent the overwhelming (deficient) perturbations for less (well) optimized domains. Specifically, DISAM introduces the constraint of minimizing variance in the domain loss, which allows the elastic gradient calibration in perturbation generation: when one domain is optimized above the averaging level w.r.t. loss, the gradient perturbation towards that domain will be weakened automatically, and vice versa. Under this mechanism, we theoretically show that DISAM can achieve faster overall convergence and improved generalization in principle when inconsistent convergence emerges. Extensive experiments on various domain generalization benchmarks show the superiority of DISAM over a range of state-of-the-art methods. Furthermore, we show the superior efficiency of DISAM in parameter-efficient fine-tuning combined with the pretraining models. The source code is released at this https URL.
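The "elastic gradient calibration" can be illustrated with a toy weighting rule (a hedged sketch under our own simplifications, not the authors' implementation): minimizing the variance of domain losses adds a per-domain term proportional to the loss's deviation from the average, so a domain whose loss is above average gets a damped perturbation and one below average gets an amplified one.

```python
# Illustrative DISAM-style weighting (names and clamping are our assumptions).
# Each domain's contribution to the sharpness perturbation is scaled by how
# far its loss sits below the average, per the variance-minimizing constraint.

def perturbation_weights(domain_losses, lam=1.0):
    mean = sum(domain_losses) / len(domain_losses)
    # The gradient of lam * Var(losses) shifts each weight by -lam*(l_d - mean);
    # clamp at zero so weights stay non-negative.
    return [max(0.0, 1.0 - lam * (l - mean)) for l in domain_losses]
```

With losses [1.0, 2.0, 3.0] the best-optimized domain (loss 1.0) receives the largest perturbation weight and the worst-optimized (loss 3.0) the smallest, matching the behaviour described in the abstract.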

[LG-91] Anomaly Detection by Context Contrasting

Link: https://arxiv.org/abs/2405.18848
Authors: Alain Ryser,Thomas M. Sutter,Alexander Marx,Julia E. Vogt
Keywords: Anomaly Detection focuses, Anomaly Detection, focuses on identifying, identifying samples, samples that deviate
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:Anomaly Detection focuses on identifying samples that deviate from the norm. When working with high-dimensional data such as images, a crucial requirement for detecting anomalous patterns is learning lower-dimensional representations that capture normal concepts seen during training. Recent advances in self-supervised learning have shown great promise in this regard. However, many of the most successful self-supervised anomaly detection methods assume prior knowledge about the structure of anomalies and leverage synthetic anomalies during training. Yet, in many real-world applications, we do not know what to expect from unseen data, and we can solely leverage knowledge about normal data. In this work, we propose Con2, which addresses this problem by setting normal training data into distinct contexts while preserving its normal properties, letting us observe the data from different perspectives. Unseen normal data consequently adheres to learned context representations while anomalies fail to do so, letting us detect them without any knowledge about anomalies during training. Our experiments demonstrate that our approach achieves state-of-the-art performance on various benchmarks while exhibiting superior performance in a more realistic healthcare setting, where knowledge about potential anomalies is often scarce.

[LG-92] Simulation, Modelling and Classification of Wiki Contributors: Spotting The Good, The Bad and The Ugly

Link: https://arxiv.org/abs/2405.18845
Authors: Silvia García Méndez,Fátima Leal,Benedita Malheiro,Juan Carlos Burguillo Rial,Bruno Veloso,Adriana E. Chis,Horacio González Vélez
Keywords: data acquisition process, highly relevant data, relevant data ranging, voluntary contributors feed, contributors feed platforms
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Data crowdsourcing is a data acquisition process where groups of voluntary contributors feed platforms with highly relevant data ranging from news, comments, and media to knowledge and classifications. It typically processes user-generated data streams to provide and refine popular services such as wikis, collaborative maps, e-commerce sites, and social networks. Nevertheless, this modus operandi raises severe concerns regarding ill-intentioned data manipulation in adversarial environments. This paper presents a simulation, modelling, and classification approach to automatically identify human and non-human (bots) as well as benign and malign contributors by using data fabrication to balance classes within experimental data sets, data stream modelling to build and update contributor profiles and, finally, autonomic data stream classification. By employing WikiVoyage - a free worldwide wiki travel guide open to contribution from the general public - as a testbed, our approach proves to significantly boost the confidence and quality of the classifier by using a class-balanced data stream, comprising both real and synthetic data. Our empirical results show that the proposed method distinguishes between benign and malign bots as well as human contributors with a classification accuracy of up to 92%.

[LG-93] Data-driven Machinery Fault Detection: A Comprehensive Review

Link: https://arxiv.org/abs/2405.18843
Authors: Dhiraj Neupane,Mohamed Reda Bouadjenek,Richard Dazeley,Sunil Aryal
Keywords: diagnose machine faults, efficient operation, era of advanced, guarantee their safe, safe and efficient
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:In this era of advanced manufacturing, it is now more crucial than ever to diagnose machine faults as early as possible to guarantee their safe and efficient operation. With the massive surge in industrial big data and advancement in sensing and computational technologies, data-driven Machinery Fault Diagnosis (MFD) solutions based on machine/deep learning approaches have been used ubiquitously in manufacturing. Timely and accurate identification of faulty machine signals is vital in industrial applications, for which many relevant solutions have been proposed and reviewed in many articles. Despite the availability of numerous solutions and reviews on MFD, existing works often lack several aspects. Most of the available literature has limited applicability in a wide range of manufacturing settings due to its concentration on a particular type of equipment or method of analysis. Additionally, discussions regarding the challenges associated with implementing data-driven approaches, such as dealing with noisy data, selecting appropriate features, and adapting models to accommodate new or unforeseen faults, are often superficial or completely overlooked. Thus, this survey provides a comprehensive review of articles that use different types of machine learning approaches for the detection and diagnosis of various types of machinery faults, highlights their strengths and limitations, reviews the methods used for condition-based analyses, comprehensively discusses the available machinery fault datasets, introduces future researchers to the challenges they may encounter while using these approaches for MFD, and recommends probable solutions to mitigate those problems. Future research prospects are also pointed out for a better understanding of the field. We believe this article will help researchers and contribute to the further development of the field.

[LG-94] MoNDE: Mixture of Near-Data Experts for Large-Scale Sparse Models

Link: https://arxiv.org/abs/2405.18832
Authors: Taehyun Kim,Kwanseok Choi,Youngmock Cho,Jaehoon Cho,Hyuk-Jae Lee,Jaewoong Sim
Keywords: large language models, GPU memory capacity, requiring costly parameter, MoE LLM inference, costly parameter movement
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
*Comments: Accepted to DAC 2024

Click to view abstract

Abstract:Mixture-of-Experts (MoE) large language models (LLM) have memory requirements that often exceed the GPU memory capacity, requiring costly parameter movement from secondary memories to the GPU for expert computation. In this work, we present Mixture of Near-Data Experts (MoNDE), a near-data computing solution that efficiently enables MoE LLM inference. MoNDE reduces the volume of MoE parameter movement by transferring only the "hot" experts to the GPU, while computing the remaining "cold" experts inside the host memory device. By replacing the transfers of massive expert parameters with the ones of small activations, MoNDE enables far more communication-efficient MoE inference, thereby resulting in substantial speedups over the existing parameter offloading frameworks for both encoder and decoder operations.
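The hot/cold split can be pictured with a small routing-count heuristic (our illustrative sketch, not MoNDE's actual policy): experts that receive the most tokens are worth the parameter transfer to the GPU, while the long tail is computed near the data.

```python
# Hypothetical partitioning of MoE experts by per-batch routing counts.
# "Hot" experts have their parameters moved to the GPU; "cold" experts are
# computed inside the host memory device, moving only small activations.

def partition_experts(token_counts, gpu_budget):
    """token_counts: {expert_id: tokens routed this batch}."""
    ranked = sorted(token_counts, key=token_counts.get, reverse=True)
    hot = set(ranked[:gpu_budget])       # transfer parameters to GPU
    cold = set(ranked[gpu_budget:])      # near-data computation in host memory
    return hot, cold
```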

[LG-95] Evaluating Zero-Shot GPT-4V Performance on 3D Visual Question Answering Benchmarks

Link: https://arxiv.org/abs/2405.18831
Authors: Simranjit Singh,Georgios Pavlakos,Dimitrios Stamoulis
Keywords: Visual Question Answering, Visual Question, Question Answering, paradigms influence existing, foundation models grows
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*Comments: Accepted at 1st Workshop on Multimodalities for 3D Scenes CVPR 2024

Click to view abstract

Abstract:As interest in “reformulating” the 3D Visual Question Answering (VQA) problem in the context of foundation models grows, it is imperative to assess how these new paradigms influence existing closed-vocabulary datasets. In this case study, we evaluate the zero-shot performance of foundational models (GPT-4 Vision and GPT-4) on well-established 3D VQA benchmarks, namely 3D-VQA and ScanQA. We provide an investigation to contextualize the performance of GPT-based agents relative to traditional modeling approaches. We find that GPT-based agents without any fine-tuning perform on par with the closed vocabulary approaches. Our findings corroborate recent results that “blind” models establish a surprisingly strong baseline in closed-vocabulary settings. We demonstrate that agents benefit significantly from scene-specific vocabulary via in-context textual grounding. By presenting a preliminary comparison with previous baselines, we hope to inform the community’s ongoing efforts to refine multi-modal 3D benchmarks.

[LG-96] Flow Priors for Linear Inverse Problems via Iterative Corrupted Trajectory Matching

Link: https://arxiv.org/abs/2405.18816
Authors: Yasi Zhang,Peiyu Yu,Yaxuan Zhu,Yingshan Chang,Feng Gao,Ying Nian Wu,Oscar Leong
Keywords: Generative models based, attracted significant attention, Generative models, high-resolution image synthesis, attracted significant
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Generative models based on flow matching have attracted significant attention for their simplicity and superior performance in high-resolution image synthesis. By leveraging the instantaneous change-of-variables formula, one can directly compute image likelihoods from a learned flow, making them enticing candidates as priors for downstream tasks such as inverse problems. In particular, a natural approach would be to incorporate such image probabilities in a maximum-a-posteriori (MAP) estimation problem. A major obstacle, however, lies in the slow computation of the log-likelihood, as it requires backpropagating through an ODE solver, which can be prohibitively slow for high-dimensional problems. In this work, we propose an iterative algorithm to approximate the MAP estimator efficiently to solve a variety of linear inverse problems. Our algorithm is mathematically justified by the observation that the MAP objective can be approximated by a sum of N "local MAP" objectives, where N is the number of function evaluations. By leveraging Tweedie's formula, we show that we can perform gradient steps to sequentially optimize these objectives. We validate our approach for various linear inverse problems, such as super-resolution, deblurring, inpainting, and compressed sensing, and demonstrate that we can outperform other methods based on flow matching.
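For readers unfamiliar with the key identity the abstract leans on, Tweedie's formula (written here in our own notation, for a Gaussian corruption; the paper's exact formulation may differ) expresses the posterior mean of the clean signal through the score of the corrupted marginal, which is what makes cheap gradient steps possible:

```latex
% Tweedie's formula: for x_t = x_0 + \sigma_t \varepsilon with
% \varepsilon \sim \mathcal{N}(0, I), the posterior mean of x_0 given x_t
% is recovered from the score of the corrupted marginal p_t:
\mathbb{E}[x_0 \mid x_t] \;=\; x_t + \sigma_t^2 \, \nabla_{x_t} \log p_t(x_t)
```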

[LG-97] Semiring Activation in Neural Networks

Link: https://arxiv.org/abs/2405.18805
Authors: Bart M.N. Smets,Peter D. Donker,Jim W. Portegies,Remco Duits
Keywords: neural networks, nonlinear operators based, networks, convolutional neural networks, introduce a class
Subjects: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:We introduce a class of trainable nonlinear operators based on semirings that are suitable for use in neural networks. These operators generalize the traditional alternation of linear operators with activation functions in neural networks. Semirings are algebraic structures that describe a generalised notion of linearity, greatly expanding the range of trainable operators that can be included in neural networks. In fact, max- or min-pooling operations are convolutions in the tropical semiring with a fixed kernel. We perform experiments where we replace the activation functions with trainable semiring-based operators to show that these are viable operations to include in fully connected as well as convolutional neural networks (ConvNeXt). We discuss some of the challenges of replacing traditional activation functions with trainable semiring activations and the trade-offs of doing so.
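The tropical-semiring observation in the abstract is easy to verify concretely: in the (max, +) semiring, "multiplication" is addition and "addition" is max, so a tropical convolution takes max-of-sums instead of sum-of-products, and with an all-zero kernel it reduces exactly to max-pooling. A minimal sketch (our own toy code, not the paper's):

```python
# Tropical (max-plus) 1-D convolution: out[i] = max_j (x[i+j] + kernel[j]).
# With a zero kernel this is exactly max-pooling over a sliding window.

def tropical_conv1d(x, kernel):
    k = len(kernel)
    return [max(x[i + j] + kernel[j] for j in range(k))
            for i in range(len(x) - k + 1)]

def max_pool1d(x, window):
    return [max(x[i:i + window]) for i in range(len(x) - window + 1)]
```

Making the kernel entries trainable then generalizes pooling, which is the kind of operator the paper studies.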

[LG-98] Adaptive Discretization-based Non-Episodic Reinforcement Learning in Metric Spaces

Link: https://arxiv.org/abs/2405.18793
Authors: Avik Kar,Rahul Singh
Keywords: non-episodic Reinforcement Learning, Reinforcement Learning, study non-episodic Reinforcement, epsilon, Lipschitz functions
Subjects: Machine Learning (cs.LG)
*Comments: 38 pages, 2 figures

Click to view abstract

Abstract:We study non-episodic Reinforcement Learning for Lipschitz MDPs in which the state-action space is a metric space, and the transition kernel and rewards are Lipschitz functions. We develop a computationally efficient UCB-based algorithm, ZoRL-\epsilon, that adaptively discretizes the state-action space, and show that its regret as compared with an \epsilon-optimal policy is bounded as \mathcal{O}(\epsilon^{-(2 d_{\mathcal{S}} + d^{\epsilon}_{z} + 1)}\log(T)), where d^{\epsilon}_{z} is the \epsilon-zooming dimension. In contrast, if one uses the vanilla UCRL-2 on a fixed discretization of the MDP, the regret w.r.t. an \epsilon-optimal policy scales as \mathcal{O}(\epsilon^{-(2 d_{\mathcal{S}} + d + 1)}\log(T)), so that the adaptivity gains are huge when d^{\epsilon}_{z} \ll d. Note that the absolute regret of any 'uniformly good' algorithm for a large family of continuous MDPs asymptotically scales as at least \Omega(\log(T)). Though adaptive discretization has been shown to yield \tilde{\mathcal{O}}(H^{2.5} K^{\frac{d_z + 1}{d_z + 2}}) regret in episodic RL, an attempt to extend this to the non-episodic case by employing constant-duration episodes whose duration increases with T is futile, since d_z \to d as T \to \infty. The current work shows how to obtain adaptivity gains for non-episodic RL. The theoretical results are supported by simulations on two systems, where the performance of ZoRL-\epsilon is compared with that of 'UCRL-C', the fixed-discretization-based extension of UCRL-2 for systems with continuous state-action spaces.

[LG-99] Kernel Metric Learning for In-Sample Off-Policy Evaluation of Deterministic RL Policies

Link: https://arxiv.org/abs/2405.18792
Authors: Haanvid Lee,Tri Wahyu Guntara,Jongmin Lee,Yung-Kyun Noh,Kee-Eung Kim
Keywords: deterministic target policies, continuous action spaces, deterministic target, OPE, target policies
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments: 23 pages, 2 figures, Accepted at ICLR 2024 (spotlight)

Click to view abstract

Abstract:We consider off-policy evaluation (OPE) of deterministic target policies for reinforcement learning (RL) in environments with continuous action spaces. While it is common to use importance sampling for OPE, it suffers from high variance when the behavior policy deviates significantly from the target policy. In order to address this issue, some recent works on OPE proposed in-sample learning with importance resampling. Yet, these approaches are not applicable to deterministic target policies for continuous action spaces. To address this limitation, we propose to relax the deterministic target policy using a kernel and learn the kernel metrics that minimize the overall mean squared error of the estimated temporal difference update vector of an action value function, where the action value function is used for policy evaluation. We derive the bias and variance of the estimation error due to this relaxation and provide analytic solutions for the optimal kernel metric. In empirical studies using various test domains, we show that the OPE with in-sample learning using the kernel with optimized metric achieves significantly improved accuracy than other baselines.

[LG-100] MOKD: Cross-domain Finetuning for Few-shot Classification via Maximizing Optimized Kernel Dependence

Link: https://arxiv.org/abs/2405.18786
Authors: Hongduan Tian,Feng Liu,Tongliang Liu,Bo Du,Yiu-ming Cheung,Bo Han
Keywords: cross-domain few-shot classification, nearest centroid classifier, few-shot classification, cross-domain few-shot, construct a metric
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Click to view abstract

Abstract:In cross-domain few-shot classification, the nearest centroid classifier (NCC) aims to learn representations to construct a metric space where few-shot classification can be performed by measuring the similarities between samples and the prototype of each class. An intuition behind NCC is that each sample is pulled closer to the class centroid it belongs to while pushed away from those of other classes. However, in this paper, we find that there exist high similarities between NCC-learned representations of two samples from different classes. In order to address this problem, we propose a bi-level optimization framework, maximizing optimized kernel dependence (MOKD), to learn a set of class-specific representations that match the cluster structures indicated by labeled data of the given task. Specifically, MOKD first optimizes the kernel adopted in the Hilbert-Schmidt independence criterion (HSIC) to obtain the optimized kernel HSIC (opt-HSIC) that can capture the dependence more precisely. Then, an optimization problem regarding the opt-HSIC is addressed to simultaneously maximize the dependence between representations and labels and minimize the dependence among all samples. Extensive experiments on Meta-Dataset demonstrate that MOKD can not only achieve better generalization performance on unseen domains in most cases but also learn better data representation clusters. The project repository of MOKD is available at: this https URL.
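For context, the empirical HSIC that MOKD builds on has a compact estimator; a minimal dependency-free sketch (using the standard biased form trace(K H L H)/(n-1)^2 with centering matrix H = I - 11^T/n; MOKD additionally *optimizes* the kernel, which is not shown here):

```python
# Biased empirical HSIC estimator between two kernel (Gram) matrices K and L.

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def hsic(K, L):
    n = len(K)
    # Centering matrix H = I - (1/n) * ones
    H = [[(1.0 if i == j else 0.0) - 1.0 / n for j in range(n)]
         for i in range(n)]
    M = matmul(matmul(K, H), matmul(L, H))          # K H L H
    return sum(M[i][i] for i in range(n)) / (n - 1) ** 2
```

Dependent variables (e.g. identical linear kernels) yield a positive value, while a constant variable (rank-one all-ones Gram matrix) yields zero after centering.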

[LG-101] On the Role of Attention Masks and LayerNorm in Transformers

Link: https://arxiv.org/abs/2405.18781
Authors: Xinyi Wu,Amir Ajorlou,Yifei Wang,Stefanie Jegelka,Ali Jadbabaie
Keywords: essential building blocks, modern foundation models, rank collapse, rank, key mechanism
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments:

Click to view abstract

Abstract:Self-attention is the key mechanism of transformers, which are the essential building blocks of modern foundation models. Recent studies have shown that pure self-attention suffers from an increasing degree of rank collapse as depth increases, limiting model expressivity and further utilization of model depth. The existing literature on rank collapse, however, has mostly overlooked other critical components in transformers that may alleviate the rank collapse issue. In this paper, we provide a general analysis of rank collapse under self-attention, taking into account the effects of attention masks and layer normalization (LayerNorm). In particular, we find that although pure masked attention still suffers from exponential collapse to a rank one subspace, local masked attention can provably slow down the collapse rate. In the case of self-attention with LayerNorm, we first show that for certain classes of value matrices, collapse to a rank one subspace still happens exponentially. However, through construction of nontrivial counterexamples, we then establish that with proper choice of value matrices, a general class of sequences may not converge to a rank one subspace, and the self-attention dynamics with LayerNorm can simultaneously possess a rich set of equilibria with any possible rank between one and full. Our result refutes the previous hypothesis that LayerNorm plays no role in the rank collapse of self-attention and suggests that self-attention with LayerNorm constitutes a much more expressive, versatile nonlinear dynamical system than what was originally thought.

[LG-102] Quantitative Certification of Bias in Large Language Models

Link: https://arxiv.org/abs/2405.18780
Authors: Isha Chaudhary,Qian Hu,Manoj Kumar,Morteza Ziyadi,Rahul Gupta,Gagandeep Singh
Keywords: Large Language Models, Language Models, Large Language, exhibit social biases, support stereotypes
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Large Language Models (LLMs) can produce responses that exhibit social biases and support stereotypes. However, conventional benchmarking is insufficient to thoroughly evaluate LLM bias, as it cannot scale to large sets of prompts and provides no guarantees. Therefore, we propose a novel certification framework, QuaCer-B (Quantitative Certification of Bias), that provides formal guarantees on obtaining unbiased responses from target LLMs under large sets of prompts. A certificate consists of high-confidence bounds on the probability of obtaining biased responses from the LLM for any set of prompts containing sensitive attributes, sampled from a distribution. We illustrate the bias certification in LLMs for prompts with various prefixes drawn from given distributions. We consider distributions of random token sequences, mixtures of manual jailbreaks, and jailbreaks in the LLM’s embedding space to certify its bias. We certify popular LLMs with QuaCer-B and present novel insights into their biases.
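To give a feel for what a "high-confidence bound on the probability of a biased response" looks like, here is a simplified concentration-bound sketch (our illustration using a plain Hoeffding bound, not the QuaCer-B procedure itself): sample n prompts from the distribution, count k biased responses, and bound the true bias probability from above.

```python
# High-confidence (1 - delta) upper bound on the probability of a biased
# response, via Hoeffding's inequality over n i.i.d. sampled prompts.
import math

def upper_bound_bias(k, n, delta=0.05):
    """k biased responses observed out of n sampled prompts."""
    return k / n + math.sqrt(math.log(1.0 / delta) / (2.0 * n))
```

The bound tightens as more prompts are sampled, which is why certification can scale where exhaustive benchmarking cannot.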

[LG-103] LMO-DP: Optimizing the Randomization Mechanism for Differentially Private Fine-Tuning (Large) Language Models

Link: https://arxiv.org/abs/2405.18776
Authors: Qin Yang,Meisam Mohammad,Han Wang,Ali Payani,Ashish Kundu,Kai Shu,Yan Yan,Yuan Hong
Keywords: Differentially Private Stochastic, Stochastic Gradient Descent, Private Stochastic Gradient, Differentially Private, Private Stochastic
Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
*Comments: 18 pages, 15 figures

Click to view abstract

Abstract:Differentially Private Stochastic Gradient Descent (DP-SGD) and its variants have been proposed to ensure rigorous privacy for fine-tuning large-scale pre-trained language models. However, they rely heavily on the Gaussian mechanism, which may overly perturb the gradients and degrade the accuracy, especially in stronger privacy regimes (e.g., the privacy budget \epsilon < 3). To address such limitations, we propose a novel Language Model-based Optimal Differential Privacy (LMO-DP) mechanism, which takes the first step to enable the tight composition of accurately fine-tuning (large) language models with a sub-optimal DP mechanism, even in strong privacy regimes (e.g., 0.1 \leq \epsilon < 3). Furthermore, we propose a novel offline optimal noise search method to efficiently derive the sub-optimal DP that significantly reduces the noise magnitude. For instance, fine-tuning RoBERTa-large (with 300M parameters) on the SST-2 dataset can achieve an accuracy of 92.20% (given \epsilon = 0.3, \delta = 10^{-10}) by drastically outperforming the Gaussian mechanism (e.g., around 50% for small \epsilon and \delta). We also draw similar findings on the text generation tasks on GPT-2. Finally, to our best knowledge, LMO-DP is also the first solution to accurately fine-tune Llama-2 with strong differential privacy guarantees. The code will be released soon and available upon request.
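As background, the Gaussian mechanism inside DP-SGD that LMO-DP aims to improve on can be sketched in a few lines (a minimal dependency-free illustration with hypothetical names, not the paper's code): clip each per-example gradient to norm C, sum, then add Gaussian noise of scale sigma * C.

```python
# Sketch of one DP-SGD aggregation step with the Gaussian mechanism.
import math
import random

def dp_sgd_aggregate(per_example_grads, clip_norm, sigma, rng=random):
    d = len(per_example_grads[0])
    total = [0.0] * d
    for g in per_example_grads:
        norm = math.sqrt(sum(v * v for v in g))
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for i in range(d):
            total[i] += g[i] * scale              # clipped contribution
    # Gaussian noise calibrated to the clipping norm (the part LMO-DP replaces
    # with an optimized, non-Gaussian randomization).
    return [v + rng.gauss(0.0, sigma * clip_norm) for v in total]
```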

[LG-104] Large Brain Model for Learning Generic Representations with Tremendous EEG Data in BCI

Link: https://arxiv.org/abs/2405.18765
Authors: Wei-Bang Jiang,Li-Ming Zhao,Bao-Liang Lu
Keywords: EEG, Large Language Models, Large EEG Models, EEG channel patches, based deep learning
Subjects: Machine Learning (cs.LG)
*Comments: The Twelfth International Conference on Learning Representations

Click to view abstract

Abstract:The current electroencephalogram (EEG) based deep learning models are typically designed for specific datasets and applications in brain-computer interaction (BCI), limiting the scale of the models and thus diminishing their perceptual capabilities and generalizability. Recently, Large Language Models (LLMs) have achieved unprecedented success in text processing, prompting us to explore the capabilities of Large EEG Models (LEMs). We hope that LEMs can break through the limitations of different task types of EEG datasets, and obtain universal perceptual capabilities of EEG signals through unsupervised pre-training. Then the models can be fine-tuned for different downstream tasks. However, compared to text data, the volume of EEG datasets is generally small and the format varies widely. For example, there can be mismatched numbers of electrodes, unequal length data samples, varied task designs, and low signal-to-noise ratio. To overcome these challenges, we propose a unified foundation model for EEG called Large Brain Model (LaBraM). LaBraM enables cross-dataset learning by segmenting the EEG signals into EEG channel patches. Vector-quantized neural spectrum prediction is used to train a semantically rich neural tokenizer that encodes continuous raw EEG channel patches into compact neural codes. We then pre-train neural Transformers by predicting the original neural codes for the masked EEG channel patches. The LaBraMs were pre-trained on about 2,500 hours of various types of EEG signals from around 20 datasets and validated on multiple different types of downstream tasks. Experiments on abnormal detection, event type classification, emotion recognition, and gait prediction show that our LaBraM outperforms all compared SOTA methods in their respective fields. Our code is available at this https URL.

[LG-105] FDQN: A Flexible Deep Q-Network Framework for Game Automation

Link: https://arxiv.org/abs/2405.18761
Authors: Prabhath Reddy Gujavarthy
Keywords: domains require real-time, require real-time online, real-time online interaction, Flexible Deep Q-Network, Chrome Dino game
Subjects: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:In reinforcement learning, it is often difficult to automate high-dimensional, rapid decision-making in dynamic environments, especially when domains require real-time online interaction and adaptive strategies, such as web-based games. This work proposes a state-of-the-art Flexible Deep Q-Network (FDQN) framework that addresses this challenge with a self-adaptive approach: it processes high-dimensional sensory data in real time using a CNN, dynamically adapts the model architecture to the varying action spaces of different gaming environments, and outperforms previous baseline models in various Atari games and the Chrome Dino game. Using the epsilon-greedy policy, it effectively balances exploration and exploitation for improved performance, and it has been designed with a modular structure so that it can be easily adapted to other HTML-based games without touching the core part of the framework. It is demonstrated that the FDQN framework can successfully solve a well-defined task under laboratory conditions; more importantly, the work also discusses potential applications to more challenging real-world cases and serves as a starting point for future exploration into automated game play and beyond.
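The epsilon-greedy policy mentioned above is a standard component and fits in a few lines (a generic sketch; the function name and signature are ours, not FDQN's):

```python
# Epsilon-greedy action selection: with probability epsilon explore a random
# action, otherwise exploit the action with the highest Q-value.
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                      # explore
    return max(range(len(q_values)), key=q_values.__getitem__)   # exploit
```

In practice, epsilon is typically annealed from near 1 toward a small floor as training progresses, shifting the balance from exploration to exploitation.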

[LG-106] Learning to Continually Learn with the Bayesian Principle

Link: https://arxiv.org/abs/2405.18758
Authors: Soochan Lee,Hyeonseong Jeon,Jaehyeon Son,Gunhee Kim
Keywords: stochastic gradient descent, sequential Bayesian update, Bayesian update rules, present era, era of deep
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments: ICML 2024

Click to view abstract

Abstract:In the present era of deep learning, continual learning research is mainly focused on mitigating forgetting when training a neural network with stochastic gradient descent on a non-stationary stream of data. On the other hand, in the more classical literature of statistical machine learning, many models have sequential Bayesian update rules that yield the same learning outcome as the batch training, i.e., they are completely immune to catastrophic forgetting. However, they are often overly simple to model complex real-world data. In this work, we adopt the meta-learning paradigm to combine the strong representational power of neural networks and simple statistical models’ robustness to forgetting. In our novel meta-continual learning framework, continual learning takes place only in statistical models via ideal sequential Bayesian update rules, while neural networks are meta-learned to bridge the raw data and the statistical models. Since the neural networks remain fixed during continual learning, they are protected from catastrophic forgetting. This approach not only achieves significantly improved performance but also exhibits excellent scalability. Since our approach is domain-agnostic and model-agnostic, it can be applied to a wide range of problems and easily integrated with existing model architectures.
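The key property the framework exploits — an ideal sequential Bayesian update reaches exactly the batch solution, so nothing is forgotten — can be demonstrated with the simplest possible statistical model (our toy example: the running mean, which is the posterior-mean recursion for a Gaussian with known variance and flat prior):

```python
# Sequential update whose final result equals batch training, regardless of
# the order in which the data stream arrives.

def sequential_mean(stream):
    mean, n = 0.0, 0
    for x in stream:
        n += 1
        mean += (x - mean) / n   # one-sample posterior-mean update
    return mean
```

Neural networks, by contrast, lack such an update rule under SGD, which is why the paper keeps them fixed during continual learning and confines the updates to the statistical model.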

[LG-107] Provable Contrastive Continual Learning

Link: https://arxiv.org/abs/2405.18756
Authors: Yichen Wen,Zhiquan Tan,Kaipeng Zheng,Chuanlong Xie,Weiran Huang
Keywords: dynamic data distributions, Continual learning, requires learning incremental, contrastive continual learning, Continual learning requires
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Applications (stat.AP); Machine Learning (stat.ML)
*Comments: Accepted by ICML 2024

Click to view abstract

Abstract:Continual learning requires learning incremental tasks with dynamic data distributions. So far, it has been observed that employing a combination of contrastive loss and distillation loss for training in continual learning yields strong performance. To the best of our knowledge, however, this contrastive continual learning framework lacks convincing theoretical explanations. In this work, we fill this gap by establishing theoretical performance guarantees, which reveal how the performance of the model is bounded by training losses of previous tasks in the contrastive continual learning framework. Our theoretical explanations further support the idea that pre-training can benefit continual learning. Inspired by our theoretical analysis of these guarantees, we propose a novel contrastive continual learning algorithm called CILA, which uses adaptive distillation coefficients for different tasks. These distillation coefficients are easily computed by the ratio between average distillation losses and average contrastive losses from previous tasks. Our method shows great improvement on standard benchmarks and achieves new state-of-the-art performance.
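The adaptive coefficient described in the abstract is simple to compute; a minimal sketch (variable names are ours, and the exact loss bookkeeping in CILA may differ):

```python
# CILA-style adaptive distillation coefficient: the ratio between the average
# distillation loss and the average contrastive loss from previous tasks.

def cila_coefficient(distill_losses, contrastive_losses):
    avg_d = sum(distill_losses) / len(distill_losses)
    avg_c = sum(contrastive_losses) / len(contrastive_losses)
    return avg_d / avg_c

def total_loss(contrastive, distill, coeff):
    # Current task's training objective combining both terms.
    return contrastive + coeff * distill
```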

[LG-108] GIST: Greedy Independent Set Thresholding for Diverse Data Summarization

Link: https://arxiv.org/abs/2405.18754
Authors: Matthew Fahrbach,Srikumar Ramalingam,Morteza Zadimoghaddam,Sara Ahmadian,Gui Citovsky,Giulia DeSalvo
Keywords: task called min-distance, called min-distance diverse, diverse data summarization, min-distance diverse data, selection task called
Subjects: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*Comments: 15 pages, 1 figure

Click to view abstract

Abstract:We propose a novel subset selection task called min-distance diverse data summarization (MDDS), which has a wide variety of applications in machine learning, e.g., data sampling and feature selection. Given a set of points in a metric space, the goal is to maximize an objective that combines the total utility of the points and a diversity term that captures the minimum distance between any pair of selected points, subject to the constraint |S| \le k. For example, the points may correspond to training examples in a data sampling problem, e.g., learned embeddings of images extracted from a deep neural network. This work presents the GIST algorithm, which achieves a 2/3-approximation guarantee for MDDS by approximating a series of maximum independent set problems with a bicriteria greedy algorithm. We also prove a complementary (2/3 + \varepsilon)-hardness of approximation, for any \varepsilon > 0. Finally, we provide an empirical study demonstrating that GIST outperforms existing methods for MDDS on synthetic data, and also for a real-world image classification experiment that studies single-shot subset selection for ImageNet.
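The flavor of the greedy subroutine can be conveyed with a simple threshold-greedy pass (our hedged sketch of one distance threshold, not the paper's full algorithm, which sweeps thresholds and solves independent-set subproblems): repeatedly take the highest-utility point whose distance to every already-selected point is at least tau.

```python
# Threshold-greedy diverse selection: utility-ordered greedy subject to a
# pairwise minimum-distance constraint tau and cardinality constraint k.

def greedy_diverse(points, utility, dist, k, tau):
    selected = []
    for p in sorted(points, key=utility, reverse=True):
        if len(selected) == k:
            break
        if all(dist(p, q) >= tau for q in selected):
            selected.append(p)
    return selected
```

Every returned subset satisfies both the size bound and the min-distance diversity constraint by construction.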

[LG-109] Confronting the Reproducibility Crisis: A Case Study in Validating Certified Robustness

链接: https://arxiv.org/abs/2405.18753
作者: Richard H. Moulton,Gary A. McCully,John D. Hastings
关键词: deep neural networks, adversarial robustness, neural networks, enabling validation, deep neural
类目: Machine Learning (cs.LG)
*备注: 9 pages, 0 figures, submitted to ACSAC (Annual Computer Security Applications Conference) 2024

点击查看摘要

Abstract:Reproducibility is a cornerstone of scientific research, enabling validation, extension, and progress. However, the rapidly evolving nature of software and dependencies poses significant challenges to reproducing research results, particularly in fields like adversarial robustness for deep neural networks, where complex codebases and specialized toolkits are utilized. This paper presents a case study of attempting to validate the results on certified adversarial robustness in “SoK: Certified Robustness for Deep Neural Networks” using the VeriGauge toolkit. Despite following the documented methodology, numerous software and hardware compatibility issues were encountered, including outdated or unavailable dependencies, version conflicts, and driver incompatibilities. While a subset of the original results could be run, key findings related to the empirical robust accuracy of various verification methods proved elusive due to these technical obstacles, as well as slight discrepancies in the test results. This practical experience sheds light on the reproducibility crisis afflicting adversarial robustness research, where a lack of reproducibility threatens scientific integrity and hinders progress. The paper discusses the broader implications of this crisis, proposing potential solutions such as containerization, software preservation, and comprehensive documentation practices. Furthermore, it highlights the need for collaboration and standardization efforts within the research community to develop robust frameworks for reproducible research. By addressing the reproducibility crisis head-on, this work aims to contribute to the ongoing discourse on scientific reproducibility and advocate for best practices that ensure the reliability and validity of research findings within not only adversarial robustness, but security and technology research as a whole.

[LG-110] A SARS-CoV-2 Interaction Dataset and VHH Sequence Corpus for Antibody Language Models

链接: https://arxiv.org/abs/2405.18749
作者: Hirofumi Tsuruta,Hiroyuki Yamazaki,Ryota Maeda,Ryotaro Tamura,Akihiro Imura
关键词: treating human diseases, eliminate harmful foreign, harmful foreign substances, Antibodies are crucial, pivotal therapeutic agents
类目: Machine Learning (cs.LG); Genomics (q-bio.GN)
*备注:

点击查看摘要

Abstract:Antibodies are crucial proteins produced by the immune system to eliminate harmful foreign substances and have become pivotal therapeutic agents for treating human diseases. To accelerate the discovery of antibody therapeutics, there is growing interest in constructing language models using antibody sequences. However, the applicability of pre-trained language models for antibody discovery has not been thoroughly evaluated due to the scarcity of labeled datasets. To overcome these limitations, we introduce AVIDa-SARS-CoV-2, a dataset featuring the antigen-variable domain of heavy chain of heavy chain antibody (VHH) interactions obtained from two alpacas immunized with severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) spike proteins. AVIDa-SARS-CoV-2 includes binary labels indicating the binding or non-binding of diverse VHH sequences to 12 SARS-CoV-2 mutants, such as the Delta and Omicron variants. Furthermore, we release VHHCorpus-2M, a pre-training dataset for antibody language models, containing over two million VHH sequences. We report benchmark results for predicting SARS-CoV-2-VHH binding using VHHBERT pre-trained on VHHCorpus-2M and existing general protein and antibody-specific pre-trained language models. These results confirm that AVIDa-SARS-CoV-2 provides valuable benchmarks for evaluating the representation capabilities of antibody language models for binding prediction, thereby facilitating the development of AI-driven antibody discovery. The datasets are available at this https URL.

[LG-111] Preferred-Action-Optimized Diffusion Policies for Offline Reinforcement Learning

链接: https://arxiv.org/abs/2405.18729
作者: Tianle Zhang,Jiayi Guan,Lin Zhao,Yihang Li,Dongjiang Li,Zecui Zeng,Lei Sun,Yue Chen,Xuelong Wei,Lusong Li,Xiaodong He
关键词: Offline reinforcement learning, learn optimal policies, previously collected datasets, reinforcement learning, aims to learn
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Offline reinforcement learning (RL) aims to learn optimal policies from previously collected datasets. Recently, due to their powerful representational capabilities, diffusion models have shown significant potential as policy models for offline RL issues. However, previous offline RL algorithms based on diffusion policies generally adopt weighted regression to improve the policy. This approach optimizes the policy only using the collected actions and is sensitive to Q-values, which limits the potential for further performance enhancement. To this end, we propose a novel preferred-action-optimized diffusion policy for offline RL. In particular, an expressive conditional diffusion model is utilized to represent the diverse distribution of a behavior policy. Meanwhile, based on the diffusion model, preferred actions within the same behavior distribution are automatically generated through the critic function. Moreover, an anti-noise preference optimization is designed to achieve policy improvement by using the preferred actions, which can adapt to noise-preferred actions for stable training. Extensive experiments demonstrate that the proposed method provides competitive or superior performance compared to previous state-of-the-art offline RL methods, particularly in sparse reward tasks such as Kitchen and AntMaze. Additionally, we empirically prove the effectiveness of anti-noise preference optimization.

[LG-112] Can We Enhance the Quality of Mobile Crowdsensing Data Without Ground Truth?

链接: https://arxiv.org/abs/2405.18725
作者: Jiajie Li,Bo Gu,Shimin Gong,Zhou Su,Mohsen Guizani
关键词: data, Mobile crowdsensing, sensing data, prominent trend, sensing data submitted
类目: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

Abstract:Mobile crowdsensing (MCS) has emerged as a prominent trend across various domains. However, ensuring the quality of the sensing data submitted by mobile users (MUs) remains a complex and challenging problem. To address this challenge, an advanced method is required to detect low-quality sensing data and identify malicious MUs that may disrupt the normal operations of an MCS system. Therefore, this article proposes a prediction- and reputation-based truth discovery (PRBTD) framework, which can separate low-quality data from high-quality data in sensing tasks. First, we apply a correlation-focused spatial-temporal transformer network to predict the ground truth of the input sensing data. Then, we extract the sensing errors of the data as features based on the prediction results to calculate the implications among the data. Finally, we design a reputation-based truth discovery (TD) module for identifying low-quality data with their implications. Given sensing data submitted by MUs, PRBTD can eliminate the data with heavy noise and identify malicious MUs with high accuracy. Extensive experimental results demonstrate that PRBTD outperforms the existing methods in terms of identification accuracy and data quality enhancement.
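The reputation-based truth discovery idea can be illustrated with a classic iterative scheme: alternate between estimating the truth as a reputation-weighted average and updating each user's reputation from their error. This is a toy sketch only, not the PRBTD module; the transformer-based prediction step and the error-implication features are omitted, and the inverse-squared-error reputation rule is an assumption.

```python
def truth_discovery(reports, n_iters=20):
    """Iterative truth discovery over scalar reports {user: value}.
    Users whose reports stay close to the current truth estimate
    accumulate high reputation; outliers are down-weighted."""
    reputations = {u: 1.0 for u in reports}
    truth = 0.0
    for _ in range(n_iters):
        total = sum(reputations.values())
        # Reputation-weighted truth estimate.
        truth = sum(reputations[u] * v for u, v in reports.items()) / total
        # Update reputations: smaller error -> larger reputation.
        for u, v in reports.items():
            reputations[u] = 1.0 / (1e-6 + (v - truth) ** 2)
    return truth, reputations
```

With two consistent users and one outlier, the estimate converges toward the consistent cluster and the outlier's reputation collapses, which is the behavior a malicious-MU detector relies on.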

[LG-113] Conformal Depression Prediction

链接: https://arxiv.org/abs/2405.18723
作者: Yonghong Li,Shan Qu,Xiuzhuang Zhou
关键词: deep learning show, learning show promise, existing depression recognition, depression recognition, black box
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:While existing depression recognition methods based on deep learning show promise, their practical application is hindered by the lack of trustworthiness, as these deep models are often deployed as \textit{black box} models, leaving us uncertain about the confidence of the model predictions. For high-risk clinical applications like depression recognition, uncertainty quantification is essential in decision-making. In this paper, we introduce conformal depression prediction (CDP), a depression recognition method with uncertainty quantification based on conformal prediction (CP), giving valid confidence intervals with theoretical coverage guarantees for the model predictions. CDP is a plug-and-play module that requires neither model retraining nor an assumption about the depression data distribution. As CDP provides only an average performance guarantee across all inputs rather than a per-input performance guarantee, we propose CDP-ACC, an improved conformal prediction with approximate conditional coverage. CDP-ACC first estimates the prediction distribution through neighborhood relaxation, and then introduces a conformal score function by constructing nested sequences, so as to provide a tighter prediction interval for each specific input. We empirically demonstrate the application of uncertainty quantification in depression recognition, and the effectiveness and superiority of CDP and CDP-ACC on the AVEC 2013 and AVEC 2014 datasets.
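The conformal prediction machinery underlying CDP is standard split conformal prediction. The sketch below shows that generic recipe only (calibrate absolute residuals, then widen the point prediction by their finite-sample quantile); CDP and CDP-ACC build more refined score functions on top of this, which is not reproduced here.

```python
import math

def split_conformal_interval(cal_preds, cal_labels, test_pred, alpha=0.1):
    """Split conformal prediction for regression: compute absolute
    residuals on a held-out calibration set, take the finite-sample
    corrected (1 - alpha) quantile, and return an interval with
    marginal coverage >= 1 - alpha."""
    scores = sorted(abs(y - p) for y, p in zip(cal_labels, cal_preds))
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # finite-sample correction
    q = scores[min(rank, n) - 1]
    return test_pred - q, test_pred + q
```

Note the coverage guarantee is marginal (on average over inputs), which is exactly the limitation that motivates the approximate conditional coverage of CDP-ACC.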

[LG-114] To FP8 and Back Again: Quantifying the Effects of Reducing Precision on LLM Training Stability

链接: https://arxiv.org/abs/2405.18710
作者: Joonhyung Lee,Jeongin Bae,Byeongwook Kim,Se Jung Kwon,Dongsoo Lee
关键词: massive computational costs, spurred great interest, LLM training, pretraining have spurred, accelerate the process
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The massive computational costs associated with large language model (LLM) pretraining have spurred great interest in reduced-precision floating-point representations to accelerate the process. As a result, the BrainFloat16 (BF16) precision has become the de facto standard for LLM training, with hardware support included in recent accelerators. This trend has gone even further in the latest processors, where FP8 has recently been introduced. However, prior experience with FP16, which was found to be less stable than BF16, raises concerns as to whether FP8, with even fewer bits than FP16, can be a cost-effective option for LLM training. We argue that reduced-precision training schemes must have similar training stability and hyperparameter sensitivities to their higher-precision counterparts in order to be cost-effective. However, we find that currently available methods for FP8 training are not robust enough to allow their use as economical replacements. This prompts us to investigate the stability of reduced-precision LLM training in terms of robustness across random seeds and learning rates. To this end, we propose new evaluation techniques and a new metric for quantifying loss landscape sharpness in autoregressive language models. By simulating incremental bit reductions in floating-point representations, we analyze the relationship between representational power and training stability with the intent of aiding future research into the field.
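The idea of simulating incremental bit reductions can be illustrated by zeroing low-order mantissa bits of a binary64 value. This is a toy sketch under stated assumptions: real FP8/BF16 formats also narrow the exponent field and apply rounding rather than truncation, neither of which is modeled here.

```python
import struct

def truncate_mantissa(x, keep_bits):
    """Simulate reduced precision by zeroing the low mantissa bits of a
    binary64 float (binary64 carries a 52-bit mantissa)."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    drop = 52 - keep_bits
    mask = ~((1 << drop) - 1) & 0xFFFFFFFFFFFFFFFF
    (y,) = struct.unpack("<d", struct.pack("<Q", bits & mask))
    return y
```

Sweeping `keep_bits` downward on model weights or gradients gives a crude way to probe how representational power interacts with training stability.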

[LG-115] Adaptive and Parallel Split Federated Learning in Vehicular Edge Computing

链接: https://arxiv.org/abs/2405.18707
作者: Xianke Qiang,Zheng Chang,Yun Hu,Lei Liu,Timo Hamalainen
关键词: accommodating artificial intelligence, intelligent transportation systems, enabling future intelligent, future intelligent transportation, Vehicular edge intelligence
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
*备注:

点击查看摘要

Abstract:Vehicular edge intelligence (VEI) is a promising paradigm for enabling future intelligent transportation systems by accommodating artificial intelligence (AI) at the vehicular edge computing (VEC) system. Federated learning (FL) stands as one of the fundamental technologies facilitating collaborative local model training and aggregation, while safeguarding the privacy of vehicle data in VEI. However, traditional FL faces challenges in adapting to vehicle heterogeneity, training large models on resource-constrained vehicles, and remaining susceptible to model weight privacy leakage. Meanwhile, split learning (SL) is proposed as a promising collaborative learning framework which can mitigate the risk of model weights leakage, and reduce the training workload on vehicles. SL sequentially trains a model between a vehicle and an edge cloud (EC) by dividing the entire model into a vehicle-side model and an EC-side model at a given cut layer. In this work, we combine the advantages of SL and FL to develop an Adaptive Split Federated Learning scheme for Vehicular Edge Computing (ASFV). The ASFV scheme adaptively splits the model and parallelizes the training process, taking into account mobile vehicle selection and resource allocation. Our extensive simulations, conducted on non-independent and identically distributed data, demonstrate that the proposed ASFV solution significantly reduces training latency compared to existing benchmarks, while adapting to network dynamics and vehicles’ mobility.

[LG-116] Spectral-Risk Safe Reinforcement Learning with Convergence Guarantees

链接: https://arxiv.org/abs/2405.18698
作者: Dohyeong Kim,Taehyun Cho,Seungyub Han,Hojun Chung,Kyungjae Lee,Songhwai Oh
关键词: risk-constrained reinforcement learning, reinforcement learning, explicitly handling, field of risk-constrained, risk-constrained reinforcement
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 26 pages

点击查看摘要

Abstract:The field of risk-constrained reinforcement learning (RCRL) has been developed to effectively reduce the likelihood of worst-case scenarios by explicitly handling risk-measure-based constraints. However, the nonlinearity of risk measures makes it challenging to achieve convergence and optimality. To overcome the difficulties posed by the nonlinearity, we propose a spectral risk measure-constrained RL algorithm, spectral-risk-constrained policy optimization (SRCPO), a bilevel optimization approach that utilizes the duality of spectral risk measures. In the bilevel optimization structure, the outer problem involves optimizing dual variables derived from the risk measures, while the inner problem involves finding an optimal policy given these dual variables. The proposed method, to the best of our knowledge, is the first to guarantee convergence to an optimum in the tabular setting. Furthermore, the proposed method has been evaluated on continuous control tasks and showed the best performance among other RCRL algorithms satisfying the constraints.
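A discrete spectral risk measure, with CVaR as its best-known special case, can be sketched as a weighted average of sorted losses. This is illustrative only; SRCPO works with the dual representation of such measures inside a bilevel optimization, which is not shown here.

```python
def spectral_risk(losses, weights):
    """Discrete spectral risk: sort losses ascending and take a weighted
    average, where `weights` is non-decreasing and sums to 1 so that
    worse outcomes receive more weight."""
    assert abs(sum(weights) - 1.0) < 1e-9
    return sum(w * l for w, l in zip(weights, sorted(losses)))

def cvar_weights(n, alpha):
    """CVaR at level alpha as a special spectrum: uniform weight on the
    worst (1 - alpha) fraction of the n outcomes, zero elsewhere."""
    tail = max(1, round(n * (1 - alpha)))
    return [0.0] * (n - tail) + [1.0 / tail] * tail
```

The non-decreasing weight vector is what makes the measure risk-averse: it strictly emphasizes the tail of the loss distribution.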

[LG-117] DeepHGNN: Study of Graph Neural Network based Forecasting Methods for Hierarchically Related Multivariate Time Series

链接: https://arxiv.org/abs/2405.18693
作者: Abishek Sriramulu,Nicolas Fourrier,Christoph Bergmeir
关键词: Graph Neural Networks, Graph Neural, Neural Networks, gained significant traction, intra-series temporal correlations
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Graph Neural Networks (GNN) have gained significant traction in the forecasting domain, especially for their capacity to simultaneously account for intra-series temporal correlations and inter-series relationships. This paper introduces a novel Hierarchical GNN (DeepHGNN) framework, explicitly designed for forecasting in complex hierarchical structures. The uniqueness of DeepHGNN lies in its innovative graph-based hierarchical interpolation and an end-to-end reconciliation mechanism. This approach ensures forecast accuracy and coherence across various hierarchical levels while sharing signals across them, addressing a key challenge in hierarchical forecasting. A critical insight in hierarchical time series is the variance in forecastability across levels, with upper levels typically presenting more predictable components. DeepHGNN capitalizes on this insight by pooling and leveraging knowledge from all hierarchy levels, thereby enhancing the overall forecast accuracy. Our comprehensive evaluation against several state-of-the-art models confirms the superior performance of DeepHGNN. This research not only demonstrates DeepHGNN’s effectiveness in achieving significantly improved forecast accuracy but also contributes to the understanding of graph-based methods in hierarchical time series forecasting.

[LG-118] Efficient Preference-based Reinforcement Learning via Aligned Experience Estimation

链接: https://arxiv.org/abs/2405.18688
作者: Fengshuo Bai,Rui Zhao,Hongming Zhang,Sijia Cui,Ying Wen,Yaodong Yang,Bo Xu,Lei Han
关键词: Preference-based reinforcement learning, shown impressive capabilities, Preference-based reinforcement, shown impressive, impressive capabilities
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Preference-based reinforcement learning (PbRL) has shown impressive capabilities in training agents without reward engineering. However, a notable limitation of PbRL is its dependency on substantial human feedback. This dependency stems from the learning loop, which entails accurate reward learning compounded with value/policy learning, necessitating a considerable number of samples. To boost the learning loop, we propose SEER, an efficient PbRL method that integrates label smoothing and policy regularization techniques. Label smoothing reduces overfitting of the reward model by smoothing human preference labels. Additionally, we bootstrap a conservative estimate \widehat{Q} using well-supported state-action pairs from the current replay memory to mitigate overestimation bias and utilize it for policy learning regularization. Our experimental results across a variety of complex tasks, both in online and offline settings, demonstrate that our approach improves feedback efficiency, outperforming state-of-the-art methods by a large margin. Ablation studies further reveal that SEER achieves a more accurate Q-function compared to prior work.
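The label-smoothing component can be sketched in one line with the generic binary smoothing rule; the exact smoothing coefficient and label convention used by SEER are assumptions here, not taken from the paper.

```python
def smooth_preference(label, epsilon=0.1):
    """Binary label smoothing: pull hard 0/1 preference labels toward
    0.5 so the reward model does not overfit annotator noise."""
    return label * (1.0 - epsilon) + 0.5 * epsilon
```

Training the reward model against these softened targets penalizes overconfident predictions on noisy human comparisons.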

[LG-119] Advancing Household Robotics: Deep Interactive Reinforcement Learning for Efficient Training and Enhanced Performance

链接: https://arxiv.org/abs/2405.18687
作者: Arpita Soni,Sujatha Alla,Suresh Dodda,Hemanth Volikatla
关键词: robots relieve people, domestic robots made, everyday responsibilities, made to perform, relieve people
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The market for domestic robots made to perform household chores is growing as these robots relieve people of everyday responsibilities. Domestic robots are generally welcomed for their role in easing human labor, in contrast to industrial robots, which are frequently criticized for displacing human workers. But before these robots can carry out domestic chores, they need to become proficient in several minor activities, such as recognizing their surroundings, making decisions, and picking up on human behaviors. Reinforcement learning, or RL, has emerged as a key robotics technology that enables robots to interact with their environment and learn how to optimize their actions to maximize rewards. However, the goal of Deep Reinforcement Learning is to address more complicated, continuous action-state spaces in real-world settings by combining RL with Neural Networks. The efficacy of DeepRL can be further augmented through interactive feedback, in which a trainer offers real-time guidance to expedite the robot’s learning process. Nevertheless, the current methods have drawbacks, namely the transient application of guidance that results in repeated learning under identical conditions. Therefore, we present a novel method to preserve and reuse information and advice via Deep Interactive Reinforcement Learning, which utilizes a persistent rule-based system. This method not only expedites the training process but also lessens the number of repetitions that instructors will have to carry out. This study has the potential to advance the development of household robots and improve their effectiveness and efficiency as learners.

[LG-120] Can GPT Redefine Medical Understanding? Evaluating GPT on Biomedical Machine Reading Comprehension

链接: https://arxiv.org/abs/2405.18682
作者: Shubham Vatsal,Ayush Singh
关键词: shown remarkable performance, Large language models, Large language, shown remarkable, closed-book biomedical MRC
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) have shown remarkable performance on many tasks in different domains. However, their performance in closed-book biomedical machine reading comprehension (MRC) has not been evaluated in depth. In this work, we evaluate GPT on four closed-book biomedical MRC benchmarks. We experiment with different conventional prompting techniques as well as introduce our own novel prompting method. To solve some of the retrieval problems inherent to LLMs, we propose a prompting strategy named Implicit Retrieval Augmented Generation (RAG) that alleviates the need for using vector databases to retrieve important chunks in traditional RAG setups. Moreover, we report qualitative assessments on the natural language generation outputs from our approach. The results show that our new prompting technique achieves the best performance on two of the four datasets and ranks second on the rest. Experiments show that modern-day LLMs like GPT, even in a zero-shot setting, can outperform supervised models, leading to new state-of-the-art (SoTA) results on two of the benchmarks.

[LG-121] Navigable Graphs for High-Dimensional Nearest Neighbor Search: Constructions and Limits

链接: https://arxiv.org/abs/2405.18680
作者: Haya Diwan,Jinrui Gou,Cameron Musco,Christopher Musco,Torsten Suel
关键词: significant recent interest, neighbor search methods, graph-based nearest neighbor, nearest neighbor search, search methods
类目: Data Structures and Algorithms (cs.DS); Computational Geometry (cs.CG); Databases (cs.DB); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:There has been significant recent interest in graph-based nearest neighbor search methods, many of which are centered on the construction of navigable graphs over high-dimensional point sets. A graph is navigable if we can successfully move from any starting node to any target node using a greedy routing strategy where we always move to the neighbor that is closest to the destination according to a given distance function. The complete graph is navigable for any point set, but the important question for applications is if sparser graphs can be constructed. While this question is fairly well understood in low dimensions, we establish some of the first upper and lower bounds for high-dimensional point sets. First, we give a simple and efficient way to construct a navigable graph with average degree O(\sqrt{n \log n}) for any set of n points, in any dimension, for any distance function. We complement this result with a nearly matching lower bound: even under the Euclidean metric in O(\log n) dimensions, a random point set has no navigable graph with average degree O(n^\alpha) for any \alpha < 1/2 . Our lower bound relies on sharp anti-concentration bounds for binomial random variables, which we use to show that the near-neighborhoods of a set of random points do not overlap significantly, forcing any navigable graph to have many edges.
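The greedy routing strategy that defines navigability can be written down directly. This is a minimal sketch with illustrative names (`points` as coordinate tuples, `graph` as an adjacency dict); it reports failure when greedy routing gets stuck in a local minimum, i.e., when the graph is not navigable for that query.

```python
import math

def greedy_route(points, graph, start, target):
    """Greedy routing: repeatedly move to the neighbor closest to the
    target under Euclidean distance. Returns (path, succeeded)."""
    current = start
    path = [current]
    while current != target:
        best = min(graph[current],
                   key=lambda v: math.dist(points[v], points[target]))
        if math.dist(points[best], points[target]) >= \
                math.dist(points[current], points[target]):
            return path, False  # stuck: no neighbor is strictly closer
        current = best
        path.append(current)
    return path, True
```

On the complete graph this always succeeds in one hop; the paper's question is how sparse a graph can be while this loop still succeeds for every start/target pair.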

[LG-122] Deep Bayesian Filter for Bayes-faithful Data Assimilation

链接: https://arxiv.org/abs/2405.18674
作者: Yuta Tarumi,Keisuke Fukuda,Shin-ichi Maeda
关键词: Deep Bayesian Filtering, nonlinear state space, state space models, DBF, Gaussian
类目: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph); Data Analysis, Statistics and Probability (physics.data-an)
*备注: Main text 9 pages

点击查看摘要

Abstract:State estimation for nonlinear state space models is a challenging task. Existing assimilation methodologies predominantly assume Gaussian posteriors on physical space, where true posteriors become inevitably non-Gaussian. We propose Deep Bayesian Filtering (DBF) for data assimilation on nonlinear state space models (SSMs). DBF constructs new latent variables h_t on a new latent ("fancy") space and assimilates observations o_t . By (i) constraining the state transition on fancy space to be linear and (ii) learning a Gaussian inverse observation operator q(h_t|o_t) , posteriors always remain Gaussian for DBF. Quite distinctively, the structured design of posteriors provides an analytic formula for the recursive computation of posteriors without accumulating Monte-Carlo sampling errors over time steps. DBF seeks the Gaussian inverse observation operators q(h_t|o_t) and other latent SSM parameters (e.g., dynamics matrix) by maximizing the evidence lower bound. Experiments show that DBF outperforms model-based approaches and latent assimilation methods in various tasks and conditions.
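The reason the posteriors stay Gaussian is the classic linear-Gaussian recursion: a linear predict step followed by fusion with a Gaussian "inverse observation" keeps the posterior Gaussian in closed form. A scalar sketch of that recursion is below; it is illustrative only, since DBF learns the operators with neural networks and works in a multivariate latent space.

```python
def gaussian_fuse(mu1, var1, mu2, var2):
    """Product of two Gaussian densities (up to normalization) is again
    Gaussian: precision-weighted mean, summed precision."""
    prec = 1.0 / var1 + 1.0 / var2
    return (mu1 / var1 + mu2 / var2) / prec, 1.0 / prec

def filter_step(mu, var, a, q, obs_mu, obs_var):
    """One recursion: linear-Gaussian predict (h_{t+1} = a * h_t + noise
    with variance q), then fuse with a Gaussian inverse observation."""
    pred_mu, pred_var = a * mu, a * a * var + q
    return gaussian_fuse(pred_mu, pred_var, obs_mu, obs_var)
```

Because each step is an analytic Gaussian update, no Monte-Carlo sampling error accumulates over time, which is the property the abstract highlights.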

[LG-123] Watermarking Counterfactual Explanations

链接: https://arxiv.org/abs/2405.18671
作者: Hangzhi Guo,Amulya Yadav
关键词: Explainable Artificial Intelligence, Artificial Intelligence, Explainable Artificial, modern-day machine learning, underlie modern-day machine
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:The field of Explainable Artificial Intelligence (XAI) focuses on techniques for providing explanations to end-users about the decision-making processes that underlie modern-day machine learning (ML) models. Within the vast universe of XAI techniques, counterfactual (CF) explanations are often preferred by end-users as they help explain the predictions of ML models by providing an easy-to-understand actionable recourse (or contrastive) case to individual end-users who are adversely impacted by predicted outcomes. However, recent studies have shown significant security concerns with using CF explanations in real-world applications; in particular, malicious adversaries can exploit CF explanations to perform query-efficient model extraction attacks on proprietary ML models. In this paper, we propose a model-agnostic watermarking framework (for adding watermarks to CF explanations) that can be leveraged to detect unauthorized model extraction attacks (which rely on the watermarked CF explanations). Our novel framework solves a bi-level optimization problem to embed an indistinguishable watermark into the generated CF explanation such that any future model extraction attacks that rely on these watermarked CF explanations can be detected using a null hypothesis significance testing (NHST) scheme, while ensuring that these embedded watermarks do not compromise the quality of the generated CF explanations. We evaluate this framework’s performance across a diverse set of real-world datasets, CF explanation methods, and model extraction techniques, and show that our watermarking detection system can be used to accurately identify extracted ML models that are trained using the watermarked CF explanations. Our work paves the way for the secure adoption of CF explanations in real-world applications.

[LG-124] Adapting Differentially Private Synthetic Data to Relational Databases

链接: https://arxiv.org/abs/2405.18670
作者: Kaveh Alimohammadi,Hao Wang,Ojas Gulati,Akash Srivastava,Navid Azizan
关键词: Existing differentially private, generation mechanisms typically, mechanisms typically assume, differentially private, typically assume
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Databases (cs.DB)
*备注:

点击查看摘要

Abstract:Existing differentially private (DP) synthetic data generation mechanisms typically assume a single-source table. In practice, data is often distributed across multiple tables with relationships across tables. In this paper, we introduce the first-of-its-kind algorithm that can be combined with any existing DP mechanisms to generate synthetic relational databases. Our algorithm iteratively refines the relationship between individual synthetic tables to minimize their approximation errors in terms of low-order marginal distributions while maintaining referential integrity. Finally, we provide both DP and theoretical utility guarantees for our algorithm.

[LG-125] Zipper: A Multi-Tower Decoder Architecture for Fusing Modalities

链接: https://arxiv.org/abs/2405.18669
作者: Vicky Zayats,Peter Chen,Melissa Merrari,Dirk Padfield
关键词: Integrating multiple generative, poses significant challenges, parts poses significant, Integrating multiple, multiple generative foundation
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
*备注: Under review at NeurIPS

点击查看摘要

Abstract:Integrating multiple generative foundation models, especially those trained on different modalities, into something greater than the sum of its parts poses significant challenges. Two key hurdles are the availability of aligned data (concepts that contain similar meaning but are expressed differently in different modalities), and effectively leveraging unimodal representations in cross-domain generative tasks, without compromising their original unimodal capabilities. We propose Zipper, a multi-tower decoder architecture that addresses these concerns by using cross-attention to flexibly compose multimodal generative models from independently pre-trained unimodal decoders. In our experiments fusing speech and text modalities, we show the proposed architecture performs very competitively in scenarios with limited aligned text-speech data. We also showcase the flexibility of our model to selectively maintain unimodal (e.g., text-to-text) generation performance by freezing the corresponding modal tower (e.g., text). In cross-modal tasks such as automatic speech recognition (ASR), where the output modality is text, we show that freezing the text backbone results in negligible performance degradation. In cross-modal tasks such as text-to-speech generation (TTS), where the output modality is speech, we show that using a pre-trained speech backbone results in superior performance to the baseline.

[LG-126] Fast Explainability via Feasible Concept Sets Generator

链接: https://arxiv.org/abs/2405.18664
作者: Deng Pan,Nuno Moniz,Nitesh Chawla
关键词: long-standing dilemma prevents, long-standing dilemma, dilemma prevents, prevents the broader, broader application
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:A long-standing dilemma prevents the broader application of explanation methods: general applicability and inference speed. On the one hand, existing model-agnostic explanation methods usually make minimal pre-assumptions about the prediction models to be explained. Still, they require additional queries to the model through propagation or back-propagation to approximate the models’ behaviors, resulting in slow inference and hindering their use in time-sensitive tasks. On the other hand, various model-dependent explanations have been proposed that achieve low-cost, fast inference but at the expense of limiting their applicability to specific model structures. In this study, we bridge the gap between the universality of model-agnostic approaches and the efficiency of model-specific approaches by proposing a novel framework without assumptions on the prediction model’s structures, achieving high efficiency during inference and allowing for real-time explanations. To achieve this, we first define explanations through a set of human-comprehensible concepts and propose a framework to elucidate model predictions via minimal feasible concept sets. Second, we show that a minimal feasible set generator can be learned as a companion explainer to the prediction model, generating explanations for predictions. Finally, we validate this framework by implementing a novel model-agnostic method that provides robust explanations while facilitating real-time inference. Our claims are substantiated by comprehensive experiments, highlighting the effectiveness and efficiency of our approach.

[LG-127] Understanding Intrinsic Socioeconomic Biases in Large Language Models

链接: https://arxiv.org/abs/2405.18662
作者: Mina Arzaghi,Florian Carichon,Golnoosh Farnadi
关键词: Large Language Models, Large Language, critical decision-making processes, decision-making processes, Language Models
类目: Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly integrated into critical decision-making processes, such as loan approvals and visa applications, where inherent biases can lead to discriminatory outcomes. In this paper, we examine the nuanced relationship between demographic attributes and socioeconomic biases in LLMs, a crucial yet understudied area of fairness in LLMs. We introduce a novel dataset of one million English sentences to systematically quantify socioeconomic biases across various demographic groups. Our findings reveal pervasive socioeconomic biases in both established models such as GPT-2 and state-of-the-art models like Llama 2 and Falcon. We demonstrate that these biases are significantly amplified when considering intersectionality, with LLMs exhibiting a remarkable capacity to extract multiple demographic attributes from names and then correlate them with specific socioeconomic biases. This research highlights the urgent necessity for proactive and robust bias mitigation techniques to safeguard against discriminatory outcomes when deploying these powerful models in critical real-world applications.

[LG-128] CAVACHON: a hierarchical variational autoencoder to integrate multi-modal single-cell data

链接: https://arxiv.org/abs/2405.18655
作者: Ping-Han Hsieh,Ru-Xiu Hsiao,Katalin Ferenc,Anthony Mathelier,Rebekka Burkholz,Chien-Yu Chen,Geir Kjetil Sandve,Tatiana Belova,Marieke Lydia Kuijjer
关键词: sequencing technologies enable, single-cell sequencing technologies, Paired single-cell sequencing, enable the simultaneous, simultaneous measurement
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Genomics (q-bio.GN)
*备注:

点击查看摘要

Abstract:Paired single-cell sequencing technologies enable the simultaneous measurement of complementary modalities of molecular data at single-cell resolution. Along with the advances in these technologies, many methods based on variational autoencoders have been developed to integrate these data. However, these methods do not explicitly incorporate prior biological relationships between the data modalities, which could significantly enhance modeling and interpretation. We propose a novel probabilistic learning framework that explicitly incorporates conditional independence relationships between multi-modal data as a directed acyclic graph using a generalized hierarchical variational autoencoder. We demonstrate the versatility of our framework across various applications pertinent to single-cell multi-omics data integration. These include the isolation of common and distinct information from different modalities, modality-specific differential analysis, and integrated cell clustering. We anticipate that the proposed framework can facilitate the construction of highly flexible graphical models that can capture the complexities of biological hypotheses and unravel the connections between different biological data types, such as different modalities of paired single-cell multi-omics data. The implementation of the proposed framework can be found in the repository this https URL.

[LG-129] Lazy Safety Alignment for Large Language Models against Harmful Fine-tuning

链接: https://arxiv.org/abs/2405.18641
作者: Tiansheng Huang,Sihao Hu,Fatih Ilhan,Selim Furkan Tekin,Ling Liu
关键词: Large Language Models, Recent studies show, Language Models, Recent studies, Large Language
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent studies show that Large Language Models (LLMs) with safety alignment can be jail-broken by fine-tuning on a dataset mixed with harmful data. For the first time in the literature, we show that the jail-broken effect can be mitigated by separating states in the finetuning stage to optimize the alignment and user datasets. Unfortunately, our subsequent study shows that this simple Bi-State Optimization (BSO) solution experiences convergence instability when the steps invested in its alignment state are too few, leading to downgraded alignment performance. By statistical analysis, we show that the excess drift towards consensus could be a probable reason for the instability. To remedy this issue, we propose Lazy(i) safety alignment (Lisa), which introduces a proximal term to constrain the drift of each state. Theoretically, the benefit of the proximal term is supported by the convergence analysis, wherein we show that a sufficiently large proximal factor is necessary to guarantee Lisa's convergence. Empirically, our results on four downstream finetuning tasks show that Lisa with a proximal term can significantly increase alignment performance while maintaining the LLM's accuracy on the user tasks. Code is available at this https URL.
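The proximal term can be sketched as a penalty that pulls each state's update back toward its own reference point. Below is a minimal hypothetical illustration on a toy quadratic objective; the objective, proximal factor, and step size are assumptions for illustration, not the paper's LLM training setup:

```python
import numpy as np

def proximal_step(theta, grad, anchor, lr=0.1, rho=1.0):
    """One gradient step on loss(theta) + (rho/2) * ||theta - anchor||^2,
    constraining drift away from the state's anchor point."""
    return theta - lr * (grad(theta) + rho * (theta - anchor))

# Toy loss for the current state: pulls theta toward `target`.
target = np.array([4.0, 4.0])
grad = lambda th: th - target

anchor = np.zeros(2)          # parameters kept from the other state
theta = anchor.copy()
for _ in range(200):
    theta = proximal_step(theta, grad, anchor, rho=1.0)

# Fixed point solves (theta - target) + rho * (theta - anchor) = 0,
# i.e. theta = (target + rho * anchor) / (1 + rho) = [2, 2] for rho = 1.
```

A larger `rho` keeps the state closer to its anchor, which mirrors the paper's finding that a sufficiently large proximal factor is needed to guarantee convergence.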

[LG-130] When and How Does In-Distribution Label Help Out-of-Distribution Detection?

链接: https://arxiv.org/abs/2405.18635
作者: Xuefeng Du,Yiyou Sun,Yixuan Li
关键词: Detecting data points, ensuring reliable machine, reliable machine learning, OOD detection, data points deviating
类目: Machine Learning (cs.LG)
*备注: ICML 2024

点击查看摘要

Abstract:Detecting data points deviating from the training distribution is pivotal for ensuring reliable machine learning. Extensive research has been dedicated to the challenge, spanning classical anomaly detection techniques to contemporary out-of-distribution (OOD) detection approaches. While OOD detection commonly relies on supervised learning from a labeled in-distribution (ID) dataset, anomaly detection may treat the entire ID data as a single class and disregard ID labels. This fundamental distinction raises a significant question that has yet to be rigorously explored: when and how does ID label help OOD detection? This paper bridges this gap by offering a formal understanding to theoretically delineate the impact of ID labels on OOD detection. We employ a graph-theoretic approach, rigorously analyzing the separability of ID data from OOD data in a closed-form manner. Key to our approach is the characterization of data representations through spectral decomposition on the graph. Leveraging these representations, we establish a provable error bound that compares the OOD detection performance with and without ID labels, unveiling conditions for achieving enhanced OOD detection. Lastly, we present empirical results on both simulated and real datasets, validating theoretical guarantees and reinforcing our insights. Code is publicly available at this https URL.

[LG-131] A Theoretical Understanding of Self-Correction through In-context Alignment

链接: https://arxiv.org/abs/2405.18634
作者: Yifei Wang,Yuyang Wu,Zeming Wei,Stefanie Jegelka,Yisen Wang
关键词: limited human experiences, recent studies show, mimicking limited human, studies show initial, show initial evidence
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Going beyond mimicking limited human experiences, recent studies show initial evidence that, like humans, large language models (LLMs) are capable of improving their abilities purely by self-correction, i.e., correcting previous responses through self-examination, in certain circumstances. Nevertheless, little is known about how such capabilities arise. In this work, based on a simplified setup akin to an alignment task, we theoretically analyze self-correction from an in-context learning perspective, showing that when LLMs give relatively accurate self-examinations as rewards, they are capable of refining responses in an in-context way. Notably, going beyond previous theories on over-simplified linear transformers, our theoretical construction underpins the roles of several key designs of realistic transformers for self-correction: softmax attention, multi-head attention, and the MLP block. We validate these findings extensively on synthetic datasets. Inspired by these findings, we also illustrate novel applications of self-correction, such as defending against LLM jailbreaks, where a simple self-correction step does make a large difference. We believe that these findings will inspire further research on understanding, exploiting, and enhancing self-correction for building better foundation models.

[LG-132] Hardware-Aware Parallel Prompt Decoding for Memory-Efficient Acceleration of LLM Inference

链接: https://arxiv.org/abs/2405.18628
作者: Hao (Mark)Chen,Wayne Luk,Ka Fai Cedric Yiu,Rui Li,Konstantin Mishchenko,Stylianos I. Venieris,Hongxiang Fan
关键词: Large Language Models, Language Models, Large Language, results in significant, hardware performance
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
*备注: The code for this implementation is available at this https URL

点击查看摘要

Abstract:The auto-regressive decoding of Large Language Models (LLMs) results in significant overheads in their hardware performance. While recent research has investigated various speculative decoding techniques for multi-token generation, these efforts have primarily focused on improving processing speed such as throughput. Crucially, they often neglect other metrics essential for real-life deployments, such as memory consumption and training cost. To overcome these limitations, we propose parallel prompt decoding (PPD), a novel scheme that requires only 0.0002% trainable parameters, enabling efficient training on a single A100-40GB GPU in just 16 hours. Inspired by the human natural language generation process, PPD approximates outputs generated at future timesteps in parallel by using multiple prompt tokens. This approach partially recovers the missing conditional dependency information necessary for multi-token generation, resulting in up to a 28% higher acceptance rate for long-range predictions. Furthermore, we present a hardware-aware dynamic sparse tree technique that adaptively optimizes this decoding scheme to fully leverage the computational capacities on different GPUs. Through extensive experiments across LLMs ranging from MobileLlama to Vicuna-13B on a wide range of benchmarks, our approach demonstrates up to a 2.49× speedup and maintains a minimal runtime memory overhead of just 0.0004%. More importantly, our parallel prompt decoding can serve as an orthogonal optimization for synergistic integration with existing speculative decoding, showing up to a 1.22× further speed improvement. Our code is available at this https URL.
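The acceptance logic behind multi-token schemes of this kind can be sketched generically: tokens proposed for future timesteps are kept only up to the first disagreement with the base model. This is a simplified greedy-verification sketch of speculative-style decoding in general, not PPD's prompt-token or sparse-tree mechanism; the toy model and names are assumptions:

```python
def verify_draft(base_model_next, context, draft):
    """Accept draft tokens until the first one the base model would not
    have produced greedily; return the accepted prefix."""
    accepted = []
    for token in draft:
        expected = base_model_next(context + accepted)
        if token != expected:
            break
        accepted.append(token)
    return accepted

# Toy deterministic "model": always continues an arithmetic sequence.
def toy_next(seq):
    return seq[-1] + 1

context = [1, 2, 3]
accepted = verify_draft(toy_next, context, draft=[4, 5, 9, 7])
# The first two draft tokens match greedy decoding; 9 does not.
```

The fraction of draft tokens that survive this check is the acceptance rate the abstract refers to; a higher rate means more tokens per base-model call.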

[LG-133] PureGen: Universal Data Purification for Train-Time Poison Defense via Generative Model Dynamics

链接: https://arxiv.org/abs/2405.18627
作者: Sunay Bhat,Jeffrey Jiang,Omead Pooladzandi,Alexander Branch,Gregory Pottie
关键词: Train-time data poisoning, threaten machine learning, leading to misclassification, machine learning models, Denoising Diffusion Probabilistic
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Train-time data poisoning attacks threaten machine learning models by introducing adversarial examples during training, leading to misclassification. Current defense methods often reduce generalization performance, are attack-specific, and impose significant training overhead. To address this, we introduce a set of universal data purification methods using a stochastic transform, \Psi(x) , realized via iterative Langevin dynamics of Energy-Based Models (EBMs), Denoising Diffusion Probabilistic Models (DDPMs), or both. These approaches purify poisoned data with minimal impact on classifier generalization. Our specially trained EBMs and DDPMs provide state-of-the-art defense against various attacks (including Narcissus, Bullseye Polytope, Gradient Matching) on CIFAR-10, Tiny-ImageNet, and CINIC-10, without needing attack or classifier-specific information. We discuss performance trade-offs and show that our methods remain highly effective even with poisoned or distributionally shifted generative model training data.
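The iterative Langevin dynamics used for purification can be sketched on a toy energy function; in the paper, a trained EBM's learned energy over clean data takes the place of the quadratic here, and the step size, noise scale, and iteration count below are illustrative assumptions:

```python
import numpy as np

def langevin_purify(x, grad_energy, steps=100, step_size=0.05, noise_scale=0.05):
    """Iterative Langevin dynamics: descend the energy landscape with
    injected Gaussian noise, nudging a (possibly poisoned) sample back
    toward high-density regions of the modeled distribution."""
    rng = np.random.default_rng(0)
    for _ in range(steps):
        x = (x - step_size * grad_energy(x)
             + noise_scale * rng.standard_normal(x.shape))
    return x

# Toy energy E(x) = 0.5 * ||x||^2 with its minimum at the origin,
# standing in for an EBM trained on clean data.
grad_energy = lambda x: x

poisoned = np.full(8, 5.0)   # sample perturbed far from the data manifold
purified = langevin_purify(poisoned, grad_energy)
```

The purified sample ends up near the low-energy region while retaining its dimensionality, which is why the transform can be applied before training with minimal impact on clean data.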

[LG-134] Causal Contextual Bandits with Adaptive Context

链接: https://arxiv.org/abs/2405.18626
作者: Rahul Madhavan,Aurghya Maiti,Gaurav Sinha,Siddharth Barman
关键词: initial intervention chosen, chosen based, intervention chosen, initial intervention, causal contextual bandits
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Reinforcement Learning Conference (RLC) 2024, 10 pages (31 pages including appendix), 8 plots. arXiv admin note: text overlap with arXiv:2111.00886

点击查看摘要

Abstract:We study a variant of causal contextual bandits where the context is chosen based on an initial intervention chosen by the learner. At the beginning of each round, the learner selects an initial action, depending on which a stochastic context is revealed by the environment. Following this, the learner then selects a final action and receives a reward. Given T rounds of interactions with the environment, the objective of the learner is to learn a policy (of selecting the initial and the final action) with maximum expected reward. In this paper we study the specific situation where every action corresponds to intervening on a node in some known causal graph. We extend prior work from the deterministic context setting to obtain simple regret minimization guarantees. This is achieved through an instance-dependent causal parameter, \lambda , which characterizes our upper bound. Furthermore, we prove that our simple regret is essentially tight for a large class of instances. A key feature of our work is that we use convex optimization to address the bandit exploration problem. We also conduct experiments to validate our theoretical results, and release our code at our project GitHub repository: this https URL.

[LG-135] Multi-Armed Bandits with Network Interference

链接: https://arxiv.org/abs/2405.18621
作者: Abhineet Agarwal,Anish Agarwal,Lorenzo Masoero,Justin Whitehouse
关键词: adaptive clinical trials, Online experimentation, trials in medicine, common challenge, challenge in modern
类目: Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Online experimentation with interference is a common challenge in modern applications such as e-commerce and adaptive clinical trials in medicine. For example, in online marketplaces, the revenue of a good depends on discounts applied to competing goods. Statistical inference with interference is widely studied in the offline setting, but far less is known about how to adaptively assign treatments to minimize regret. We address this gap by studying a multi-armed bandit (MAB) problem where a learner (e-commerce platform) sequentially assigns one of \mathcal{A} possible actions (discounts) to N units (goods) over T rounds to minimize regret (maximize revenue). Unlike traditional MAB problems, the reward of each unit depends on the treatments assigned to other units, i.e., there is interference across the underlying network of units. With \mathcal{A} actions and N units, minimizing regret is combinatorially difficult since the action space grows as \mathcal{A}^N. To overcome this issue, we study a sparse network interference model, where the reward of a unit is only affected by the treatments assigned to s neighboring units. We use tools from discrete Fourier analysis to develop a sparse linear representation of the unit-specific reward r_n: [\mathcal{A}]^N \rightarrow \mathbb{R}, and propose simple, linear regression-based algorithms to minimize regret. Importantly, our algorithms achieve provably low regret both when the learner observes the interference neighborhood for all units and when it is unknown. This significantly generalizes other works on this topic which impose strict conditions on the strength of interference on a known network, and also compare regret to a markedly weaker optimal action. Empirically, we corroborate our theoretical findings via numerical simulations.
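Under the sparse interference model, a unit's reward can be regressed on the joint action of its small neighborhood. The following is a minimal hypothetical least-squares sketch of that idea; the toy environment, neighborhood size, and one-hot encoding are assumptions and stand in for the paper's discrete Fourier representation:

```python
import itertools
import numpy as np

# Unit 0's reward depends only on the binary actions of its s=2 neighbors.
true_reward = lambda a: 1.0 + 2.0 * a[0] - 1.5 * a[1]

joints = list(itertools.product([0, 1], repeat=2))

def one_hot(joint_action):
    """Encode a joint neighborhood action as a one-hot feature vector."""
    vec = np.zeros(len(joints))
    vec[joints.index(tuple(joint_action))] = 1.0
    return vec

rng = np.random.default_rng(0)
X, y = [], []
for _ in range(200):                          # exploration rounds
    a = rng.integers(0, 2, size=2)
    X.append(one_hot(a))
    y.append(true_reward(a) + 0.01 * rng.standard_normal())

theta, *_ = np.linalg.lstsq(np.array(X), np.array(y), rcond=None)

# Exploit: pick the neighborhood action with the highest estimated reward.
best = joints[int(np.argmax(theta))]
```

The one-hot feature dimension grows as \mathcal{A}^s; the paper's Fourier-based sparse representation is what keeps the regression tractable while avoiding the full \mathcal{A}^N blow-up.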

[LG-136] Augmented Physics: A Machine Learning-Powered Tool for Creating Interactive Physics Simulations from Static Diagrams

链接: https://arxiv.org/abs/2405.18614
作者: Aditya Gunturu,Yi Wen,Jarin Thundathil,Nandi Zhang,Rubaiat Habib Kazi,Ryo Suzuki
关键词: machine learning-powered tool, learning-powered tool designed, introduce Augmented Physics, machine learning-powered, learning-powered tool
类目: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce Augmented Physics, a machine learning-powered tool designed for creating interactive physics simulations from static textbook diagrams. Leveraging computer vision techniques, such as Segment Anything and OpenCV, our web-based system enables users to semi-automatically extract diagrams from physics textbooks and then generate interactive simulations based on the extracted content. These interactive diagrams are seamlessly integrated into scanned textbook pages, facilitating interactive and personalized learning experiences across various physics concepts, including gravity, optics, circuits, and kinematics. Drawing on an elicitation study with seven physics instructors, we explore four key augmentation techniques: 1) augmented experiments, 2) animated diagrams, 3) bi-directional manipulatives, and 4) parameter visualization. We evaluate our system through technical evaluation, a usability study (N=12), and expert interviews (N=12). The study findings suggest that our system can facilitate more engaging and personalized learning experiences in physics education.

[LG-137] GLOCON Database: Design Decisions and User Manual (v1.0)

链接: https://arxiv.org/abs/2405.18613
作者: Ali Hürriyetoğlu,Osman Mutlu,Fırat Duruşan,Erdem Yörük
关键词: contentious events automatically, events automatically extracted, multiple languages, database of contentious, automatically extracted
类目: Computation and Language (cs.CL); Computers and Society (cs.CY); Databases (cs.DB); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:GLOCON is a database of contentious events automatically extracted from national news sources from various countries in multiple languages. National news sources are utilized, and complete news archives are processed to create an event list for each source. Automation is achieved using a gold standard corpus sampled randomly from complete news archives (Yörük et al. 2022) and all annotated by at least two domain experts based on the event definition provided in Duruşan et al. (2022).

[LG-138] DTR-Bench: An in silico Environment and Benchmark Platform for Reinforcement Learning Based Dynamic Treatment Regime

链接: https://arxiv.org/abs/2405.18610
作者: Zhiyao Luo,Mingcheng Zhu,Fenglin Liu,Jiali Li,Yangchen Pan,Jiandong Zhou,Tingting Zhu
关键词: garnered increasing recognition, drug dosage prescriptions, Reinforcement learning, dynamic treatment regimes, optimise dynamic treatment
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 13 pages for main content

点击查看摘要

Abstract:Reinforcement learning (RL) has garnered increasing recognition for its potential to optimise dynamic treatment regimes (DTRs) in personalised medicine, particularly for drug dosage prescriptions and medication recommendations. However, a significant challenge persists: the absence of a unified framework for simulating diverse healthcare scenarios and a comprehensive analysis to benchmark the effectiveness of RL algorithms within these contexts. To address this gap, we introduce DTR-Bench, a benchmarking platform comprising four distinct simulation environments tailored to common DTR applications, including cancer chemotherapy, radiotherapy, glucose management in diabetes, and sepsis treatment. We evaluate various state-of-the-art RL algorithms across these settings, particularly highlighting their performance amidst real-world challenges such as pharmacokinetic/pharmacodynamic (PK/PD) variability, noise, and missing data. Our experiments reveal varying degrees of performance degradation among RL algorithms in the presence of noise and patient variability, with some algorithms failing to converge. Additionally, we observe that using temporal observation representations does not consistently lead to improved performance in DTR settings. Our findings underscore the necessity of developing robust, adaptive RL algorithms capable of effectively managing these complexities to enhance patient-specific healthcare. We have open-sourced our benchmark and code at this https URL.

[LG-139] Artificial Intelligence in Industry 4.0: A Review of Integration Challenges for Industrial Systems

链接: https://arxiv.org/abs/2405.18580
作者: Alexander Windmann,Philipp Wittenberg,Marvin Schieseck,Oliver Niggemann
关键词: generate vast data, Artificial Intelligence, vast data sets, applications including predictive, including predictive maintenance
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 16 pages, 4 figures

点击查看摘要

Abstract:In Industry 4.0, Cyber-Physical Systems (CPS) generate vast data sets that can be leveraged by Artificial Intelligence (AI) for applications including predictive maintenance and production planning. However, despite the demonstrated potential of AI, its widespread adoption in sectors like manufacturing remains limited. Our comprehensive review of recent literature, including standards and reports, pinpoints key challenges: system integration, data-related issues, managing workforce-related concerns and ensuring trustworthy AI. A quantitative analysis highlights particular challenges and topics that are important for practitioners but still need to be sufficiently investigated by academics. The paper briefly discusses existing solutions to these challenges and proposes avenues for future research. We hope that this survey serves as a resource for practitioners evaluating the cost-benefit implications of AI in CPS and for researchers aiming to address these urgent challenges.

[LG-140] Low-rank finetuning for LLMs: A fairness perspective

链接: https://arxiv.org/abs/2405.18572
作者: Saswat Das,Marco Romanelli,Cuong Tran,Zarreen Reza,Bhavya Kailkhura,Ferdinando Fioretto
关键词: Large Language Models, fine-tuning Large Language, Large Language, Low-rank approximation techniques, Language Models
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Low-rank approximation techniques have become the de facto standard for fine-tuning Large Language Models (LLMs) due to their reduced computational and memory requirements. This paper investigates the effectiveness of these methods in capturing the shift of fine-tuning datasets from the initial pre-trained data distribution. Our findings reveal that there are cases in which low-rank fine-tuning falls short in learning such shifts. This, in turn, produces non-negligible side effects, especially when fine-tuning is adopted for toxicity mitigation in pre-trained models, or in scenarios where it is important to provide fair models. Through comprehensive empirical evidence on several models, datasets, and tasks, we show that low-rank fine-tuning inadvertently preserves undesirable biases and toxic behaviors. We also show that this extends to sequential decision-making tasks, emphasizing the need for careful evaluation to promote responsible LLMs development.
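The low-rank fine-tuning the paper studies (e.g., LoRA-style adapters) updates a weight matrix through the product of two thin matrices, so any shift it can represent is rank-limited. A minimal numeric sketch, with dimensions and rank chosen arbitrarily for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4                      # hidden size and adapter rank, r << d

W = rng.standard_normal((d, d))   # frozen pre-trained weight
A = rng.standard_normal((d, r))   # trainable down-projection
B = rng.standard_normal((r, d))   # trainable up-projection

delta = A @ B                     # the only update low-rank finetuning expresses
W_adapted = W + delta

rank_of_update = np.linalg.matrix_rank(delta)
```

Because `rank_of_update` can never exceed `r`, distribution shifts requiring a higher-rank change to `W` (such as unlearning broadly distributed toxic behavior) cannot be fully captured, which is consistent with the failure mode the paper characterizes.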

[LG-141] Its Not a Modality Gap: Characterizing and Addressing the Contrastive Gap

链接: https://arxiv.org/abs/2405.18570
作者: Abrar Fahim,Alex Murphy,Alona Fyshe
关键词: embedding input images, contrastive models, Multi-modal contrastive models, contrastive, gap
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multi-modal contrastive models such as CLIP achieve state-of-the-art performance in zero-shot classification by embedding input images and texts on a joint representational space. Recently, a modality gap has been reported in two-encoder contrastive models like CLIP, meaning that the image and text embeddings reside in disjoint areas of the latent space. Previous studies suggest that this gap exists due to 1) the cone effect, 2) mismatched pairs in the dataset, and 3) insufficient training. We show that, even when accounting for all these factors, and even when using the same modality, the contrastive loss actually creates a gap during training. As a result, we propose that the modality gap is inherent to the two-encoder contrastive loss and rename it the contrastive gap. We present evidence that attributes this contrastive gap to low uniformity in CLIP space, resulting in embeddings that occupy only a small portion of the latent space. To close the gap, we adapt the uniformity and alignment properties of unimodal contrastive loss to the multi-modal setting and show that simply adding these terms to the CLIP loss distributes the embeddings more uniformly in the representational space, closing the gap. In our experiments, we show that the modified representational space achieves better performance than default CLIP loss in downstream tasks such as zero-shot image classification and multi-modal arithmetic.
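The alignment and uniformity terms adapted from unimodal contrastive learning can be written down directly. Below is a hypothetical NumPy sketch following the standard (Wang and Isola-style) definitions; how the terms are weighted against the CLIP loss is an assumption left out here:

```python
import numpy as np

def alignment_loss(x, y):
    """Mean squared distance between matched image/text embedding pairs."""
    return np.mean(np.sum((x - y) ** 2, axis=1))

def uniformity_loss(z, t=2.0):
    """Log of the mean Gaussian potential over all distinct pairs:
    lower when embeddings spread evenly over the hypersphere."""
    sq_dists = np.sum((z[:, None, :] - z[None, :, :]) ** 2, axis=-1)
    mask = ~np.eye(len(z), dtype=bool)
    return np.log(np.mean(np.exp(-t * sq_dists[mask])))

rng = np.random.default_rng(0)
img = rng.standard_normal((16, 8))
img /= np.linalg.norm(img, axis=1, keepdims=True)   # unit-normalized, CLIP-style
txt = rng.standard_normal((16, 8))
txt /= np.linalg.norm(txt, axis=1, keepdims=True)

align = alignment_loss(img, txt)
uniform = uniformity_loss(np.concatenate([img, txt]))
```

Minimizing `align` pulls matched pairs together while minimizing `uniform` spreads all embeddings over the sphere, which is the mechanism the paper uses to close the contrastive gap.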

[LG-142] Warm-starting Push-Relabel

链接: https://arxiv.org/abs/2405.18568
作者: Sami Davies,Sergei Vassilvitskii,Yuyan Wang
关键词: celebrated network flow, celebrated network, network flow algorithms, flow, network flow
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Push-Relabel is one of the most celebrated network flow algorithms. Maintaining a pre-flow that saturates a cut, it enjoys better theoretical and empirical running time than other flow algorithms, such as Ford-Fulkerson. In practice, Push-Relabel is even faster than what theoretical guarantees can promise, in part because of the use of good heuristics for seeding and updating the iterative algorithm. However, it remains unclear how to run Push-Relabel on an arbitrary initialization that is not necessarily a pre-flow or cut-saturating. We provide the first theoretical guarantees for warm-starting Push-Relabel with a predicted flow, where our learning-augmented version benefits from fast running time when the predicted flow is close to an optimal flow, while maintaining robust worst-case guarantees. Interestingly, our algorithm uses the gap relabeling heuristic, which has long been employed in practice, even though prior to our work there was no rigorous theoretical justification for why it can lead to run-time improvements. We then provide experiments that show our warm-started Push-Relabel also works well in practice.

[LG-143] Counterfactual Explanations for Multivariate Time-Series without Training Datasets

链接: https://arxiv.org/abs/2405.18563
作者: Xiangyu Sun,Raquel Aoki,Kevin H. Wilson
关键词: Machine learning, experienced significant growth, high-impact real-world domains, past decade, experienced significant
类目: Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Machine learning (ML) methods have experienced significant growth in the past decade, yet their practical application in high-impact real-world domains has been hindered by their opacity. When ML methods are responsible for making critical decisions, stakeholders often require insights into how to alter these decisions. Counterfactual explanations (CFEs) have emerged as a solution, offering interpretations of opaque ML models and providing a pathway to transition from one decision to another. However, most existing CFE methods require access to the model’s training dataset, few methods can handle multivariate time-series, and none can handle multivariate time-series without training datasets. These limitations can be formidable in many scenarios. In this paper, we present CFWoT, a novel reinforcement-learning-based CFE method that generates CFEs when training datasets are unavailable. CFWoT is model-agnostic and suitable for both static and multivariate time-series datasets with continuous and discrete features. Users have the flexibility to specify non-actionable, immutable, and preferred features, as well as causal constraints which CFWoT guarantees will be respected. We demonstrate the performance of CFWoT against four baselines on several datasets and find that, despite not having access to a training dataset, CFWoT finds CFEs that make significantly fewer and significantly smaller changes to the input time-series. These properties make CFEs more actionable, as the magnitude of change required to alter an outcome is vastly reduced.

[LG-144] Potential Field Based Deep Metric Learning

链接: https://arxiv.org/abs/2405.18560
作者: Shubhang Bhatnagar,Narendra Ahuja
关键词: Deep metric learning, meaningful representation space, semantically meaningful representation, Deep metric, involves training
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:Deep metric learning (DML) involves training a network to learn a semantically meaningful representation space. Many current approaches mine n-tuples of examples and model interactions within each tuple. We present a novel, compositional DML model, inspired by electrostatic fields in physics, that, instead of operating on tuples, represents the influence of each example (embedding) by a continuous potential field, and superposes the fields to obtain their combined global potential field. We use attractive/repulsive potential fields to represent interactions among embeddings from images of the same/different classes. Contrary to typical learning methods, where mutual influence of samples is proportional to their distance, we enforce reduction in such influence with distance, leading to a decaying field. We show that such decay helps improve performance on real world datasets with large intra-class variations and label noise. Like other proxy-based methods, we also use proxies to succinctly represent sub-populations of examples. We evaluate our method on three standard DML benchmarks- Cars-196, CUB-200-2011, and SOP datasets where it outperforms state-of-the-art baselines.
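The decaying attractive/repulsive field can be sketched as a superposition of per-example potentials whose influence falls off with distance. This is a hypothetical toy form of such a field with an exponential decay, not the paper's exact potential:

```python
import numpy as np

def field_potential(query, embeddings, labels, query_label, decay=1.0):
    """Superpose per-example potentials at a query embedding: same-class
    examples attract (negative potential), other classes repel (positive),
    with influence that decays with distance rather than growing."""
    dists = np.linalg.norm(embeddings - query, axis=1)
    sign = np.where(labels == query_label, -1.0, 1.0)
    return np.sum(sign * np.exp(-decay * dists))

embeddings = np.array([[0.0, 0.0], [0.1, 0.0],    # class 0 cluster
                       [5.0, 5.0], [5.1, 5.0]])   # class 1 cluster
labels = np.array([0, 0, 1, 1])

near_own = field_potential(np.array([0.05, 0.0]), embeddings, labels, 0)
near_other = field_potential(np.array([5.05, 5.0]), embeddings, labels, 0)
```

Because exp(-decay·d) vanishes for distant examples, far-away outliers and mislabeled points exert little influence on the field, which is the robustness argument the abstract makes for decay.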

[LG-145] Reinforcement Learning in Dynamic Treatment Regimes Needs Critical Reexamination

链接: https://arxiv.org/abs/2405.18556
作者: Zhiyao Luo,Yangchen Pan,Peter Watkinson,Tingting Zhu
关键词: changing healthcare landscape, rapidly changing healthcare, offline reinforcement learning, healthcare landscape, presents a mix
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted at ICML 2024. 9 pages for main content, 34 pages in total

点击查看摘要

Abstract:In the rapidly changing healthcare landscape, the implementation of offline reinforcement learning (RL) in dynamic treatment regimes (DTRs) presents a mix of unprecedented opportunities and challenges. This position paper offers a critical examination of the current status of offline RL in the context of DTRs. We argue for a reassessment of applying RL in DTRs, citing concerns such as inconsistent and potentially inconclusive evaluation metrics, the absence of naive and supervised learning baselines, and the diverse choice of RL formulation in existing research. Through a case study with more than 17,000 evaluation experiments using a publicly available Sepsis dataset, we demonstrate that the performance of RL algorithms can significantly vary with changes in evaluation metrics and Markov Decision Process (MDP) formulations. Surprisingly, it is observed that in some instances, RL algorithms can be surpassed by random baselines subjected to policy evaluation methods and reward design. This calls for more careful policy evaluation and algorithm development in future DTR works. Additionally, we discuss potential enhancements toward more reliable development of RL-based dynamic treatment regimes and invite further discussion within the community. Code is available at this https URL.

[LG-146] Scalable Surrogate Verification of Image-based Neural Network Control Systems using Composition and Unrolling

链接: https://arxiv.org/abs/2405.18554
作者: Feiyang Cai,Chuchu Fan,Stanley Bak
关键词: Verifying safety, difficult problem, mathematically model, system, Verifying
类目: Machine Learning (cs.LG); Robotics (cs.RO); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Verifying safety of neural network control systems that use images as input is a difficult problem because, from a given system state, there is no known way to mathematically model what images are possible in the real-world. We build on recent work that considers a surrogate verification approach, training a conditional generative adversarial network (cGAN) as an image generator in place of the real world. This enables set-based formal analysis of the closed-loop system, providing analysis beyond simulation and testing. While existing work is effective on small examples, excessive overapproximation both within a single control period and across multiple control periods limits its scalability. We propose approaches to overcome these two sources of error. First, we overcome one-step error by composing the system’s dynamics along with the cGAN and neural network controller, without losing the dependencies between input states and the control outputs as in the monotonic analysis of the system dynamics. Second, we reduce multi-step error by repeating the single-step composition, essentially unrolling multiple steps of the control loop into a large neural network. We then leverage existing network verification tools to compute accurate reachable sets for multiple steps, avoiding the accumulation of abstraction error at each step. We demonstrate the effectiveness of our approach in terms of both accuracy and scalability using two case studies: an autonomous aircraft taxiing system and an advanced emergency braking system. On the aircraft taxiing system, the converged reachable set is 175% larger using the prior baseline method compared with our proposed approach. On the emergency braking system, with 24x the number of image output variables from the cGAN, the baseline method fails to prove any states are safe, whereas our improvements enable set-based safety analysis.

[LG-147] SGD method for entropy error function with smoothing l0 regularization for neural networks

链接: https://arxiv.org/abs/2405.18552
作者: Trong-Tuan Nguyen,Van-Dat Thang,Nguyen Van Thin,Phuong T. Nguyen
关键词: entropy error function, error function, neural networks, entropy error, error function generally
类目: Machine Learning (cs.LG)
*备注: The paper has been peer-reviewed and accepted for publication with Springer Applied Intelligence

点击查看摘要

Abstract:The entropy error function has been widely used in neural networks. Nevertheless, network training based on this error function generally leads to a slow convergence rate, and can easily be trapped in a local minimum or even suffer from the incorrect saturation problem in practice. In fact, there are many results based on the entropy error function in neural networks and their applications. However, the theory of such an algorithm and its convergence has not been fully studied so far. To tackle the issue, we propose a novel entropy function with smoothing l0 regularization for feed-forward neural networks. Using real-world datasets, we performed an empirical evaluation to demonstrate that the newly conceived algorithm allows us to substantially improve the prediction performance of the considered neural networks. More importantly, the experimental results also show that our proposed function brings in more precise classifications, compared to well-founded baselines. Our work is novel as it enables neural networks to learn effectively, producing more accurate predictions compared to state-of-the-art algorithms. In this respect, we expect that the algorithm will contribute to existing studies in the field, advancing research in Machine Learning and Deep Learning.
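The key ingredient here is a differentiable surrogate for the l0 "norm" (the count of nonzero weights), which is itself non-differentiable. The Gaussian-style smoothing below is one common choice and only an assumption; the paper's exact smoothing function may differ:

```python
import numpy as np

def smoothed_l0(w, sigma=0.1):
    """Smooth surrogate for the l0 count of nonzero weights.

    Each weight contributes ~0 when |w| << sigma and ~1 when |w| >> sigma,
    so the sum approximates the number of nonzero entries while remaining
    differentiable everywhere.
    """
    w = np.asarray(w, dtype=float)
    return float(np.sum(1.0 - np.exp(-(w ** 2) / (2.0 * sigma ** 2))))

def smoothed_l0_grad(w, sigma=0.1):
    """Gradient of the surrogate, usable inside an SGD weight update."""
    w = np.asarray(w, dtype=float)
    return (w / sigma ** 2) * np.exp(-(w ** 2) / (2.0 * sigma ** 2))

w = np.array([0.0, 0.001, 2.0, -3.0])
print(smoothed_l0(w))  # close to 2: only two weights are clearly nonzero
```

Adding `smoothed_l0_grad` (scaled by a regularization coefficient) to the entropy-loss gradient penalizes dense weight vectors without breaking gradient-based training.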

[LG-148] Learning from Uncertain Data: From Possible Worlds to Possible Models

链接: https://arxiv.org/abs/2405.18549
作者: Jiongli Zhu,Su Feng,Boris Glavic,Babak Salimi
关键词: learning linear models, leading to predictive, predictive multiplicity, introduce an efficient, learning linear
类目: Machine Learning (cs.LG); Databases (cs.DB); Symbolic Computation (cs.SC)
*备注:

点击查看摘要

Abstract:We introduce an efficient method for learning linear models from uncertain data, where uncertainty is represented as a set of possible variations in the data, leading to predictive multiplicity. Our approach leverages abstract interpretation and zonotopes, a type of convex polytope, to compactly represent these dataset variations, enabling the symbolic execution of gradient descent on all possible worlds simultaneously. We develop techniques to ensure that this process converges to a fixed point and derive closed-form solutions for this fixed point. Our method provides sound over-approximations of all possible optimal models and viable prediction ranges. We demonstrate the effectiveness of our approach through theoretical and empirical analysis, highlighting its potential to reason about model and prediction uncertainty due to data quality issues in training data.
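Zonotopes make the "all possible worlds at once" idea concrete: an affine map acts exactly on a zonotope by transforming its center and generators, which is what lets gradient-descent steps (affine in the linear-model case) be executed symbolically. A minimal sketch with illustrative names, not the authors' code:

```python
import numpy as np

class Zonotope:
    """A zonotope: the set {c + G @ eps : eps in [-1, 1]^k}.

    c has shape (n,); G has shape (n, k), one generator per column.
    """
    def __init__(self, center, generators):
        self.c = np.asarray(center, dtype=float)
        self.G = np.asarray(generators, dtype=float)

    def affine(self, W, b):
        # An affine map W x + b acts exactly on a zonotope:
        # center -> W c + b, generators -> W G.
        W = np.asarray(W, dtype=float)
        return Zonotope(W @ self.c + np.asarray(b, dtype=float), W @ self.G)

    def interval(self):
        # Tight per-coordinate bounds: c +/- sum_i |g_i|.
        r = np.abs(self.G).sum(axis=1)
        return self.c - r, self.c + r

# A 1-D data value known only up to +/-0.5, pushed through y = 2x + 1:
z = Zonotope(center=[3.0], generators=[[0.5]])
lo, hi = z.affine(W=[[2.0]], b=[1.0]).interval()
print(lo, hi)  # -> [6.] [8.]
```

Concretizing the final zonotope of model parameters gives the sound over-approximation of all optimal models that the abstract describes.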

[LG-149] The Computational Complexity of Formal Reasoning for Encoder-Only Transformers

链接: https://arxiv.org/abs/2405.18548
作者: Marco Sälzer,Eric Alsmann,Martin Lange
关键词: formal reasoning, encoder-only transformers, meaning sound, interpreting behaviour, investigate challenges
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We investigate challenges and possibilities of formal reasoning for encoder-only transformers (EOT), meaning sound and complete methods for verifying or interpreting behaviour. In detail, we condense related formal reasoning tasks in the form of a naturally occurring satisfiability problem (SAT). We find that SAT is undecidable if we consider EOT, commonly considered in the expressiveness community. Furthermore, we identify practical scenarios where SAT is decidable and establish corresponding complexity bounds. Besides trivial cases, we find that quantized EOT, namely those restricted by some fixed-width arithmetic, lead to the decidability of SAT due to their limited attention capabilities. However, the problem remains difficult, as we establish those scenarios where SAT is NEXPTIME-hard and those where we can show that it is solvable in NEXPTIME for quantized EOT. To complement our theoretical results, we put our findings and their implications in the overall perspective of formal reasoning.

[LG-150] Automatic detection of cognitive impairment in elderly people using an entertainment chatbot with Natural Language Processing capabilities

链接: https://arxiv.org/abs/2405.18542
作者: Francisco de Arriba-Pérez,Silvia García-Méndez,Francisco J. González-Castaño,Enrique Costa-Montenegro
关键词: Previous researchers, cognitive impairment, researchers have proposed, therapeutic monitoring, proposed intelligent systems
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Previous researchers have proposed intelligent systems for therapeutic monitoring of cognitive impairments. However, most existing practical approaches for this purpose are based on manual tests. This raises issues such as excessive caretaking effort and the white-coat effect. To avoid these issues, we present an intelligent conversational system for entertaining elderly people with news of their interest that monitors cognitive impairment transparently. Automatic chatbot dialogue stages allow assessing content description skills and detecting cognitive impairment with Machine Learning algorithms. We create these dialogue flows automatically from updated news items using Natural Language Generation techniques. The system also infers the gold standard of the answers to the questions, so it can assess cognitive capabilities automatically by comparing these answers with the user responses. It employs a similarity metric with values in [0, 1], where higher values indicate greater similarity. To evaluate the performance and usability of our approach, we have conducted field tests with a test group of 30 elderly people in the earliest stages of dementia, under the supervision of gerontologists. In the experiments, we have analysed the effect of stress and concentration in these users. Those without cognitive impairment performed up to five times better. In particular, the similarity metric varied between 0.03, for stressed and unfocused participants, and 0.36, for relaxed and focused users. Finally, we developed a Machine Learning algorithm based on textual analysis features for automatic cognitive impairment detection, which attained accuracy, F-measure and recall levels above 80%. We have thus validated the automatic approach to detect cognitive impairment in elderly people based on entertainment content.

[LG-151] Learning diverse attacks on large language models for robust red-teaming and safety tuning

链接: https://arxiv.org/abs/2405.18540
作者: Seanie Lee,Minsu Kim,Lynn Cherif,David Dobre,Juho Lee,Sung Ju Hwang,Kenji Kawaguchi,Gauthier Gidel,Yoshua Bengio,Nikolay Malkin,Moksh Jain
关键词: elicit harmful responses, large language models, critical step, step in ensuring, ensuring the safe
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Red-teaming, or identifying prompts that elicit harmful responses, is a critical step in ensuring the safe and responsible deployment of large language models (LLMs). Developing effective protection against many modes of attack prompts requires discovering diverse attacks. Automated red-teaming typically uses reinforcement learning to fine-tune an attacker language model to generate prompts that elicit undesirable responses from a target LLM, as measured, for example, by an auxiliary toxicity classifier. We show that even with explicit regularization to favor novelty and diversity, existing approaches suffer from mode collapse or fail to generate effective attacks. As a flexible and probabilistically principled alternative, we propose to use GFlowNet fine-tuning, followed by a secondary smoothing phase, to train the attacker model to generate diverse and effective attack prompts. We find that the attacks generated by our method are effective against a wide range of target LLMs, both with and without safety tuning, and transfer well between target LLMs. Finally, we demonstrate that models safety-tuned using a dataset of red-teaming prompts generated by our method are robust to attacks from other RL-based red-teaming approaches.

[LG-152] Augmented Conversation with Embedded Speech-Driven On-the-Fly Referencing in AR

链接: https://arxiv.org/abs/2405.18537
作者: Shivesh Jadon,Mehrad Faridan,Edward Mah,Rajan Vaish,Wesley Willett,Ryo Suzuki
关键词: support co-located in-person, co-located in-person conversations, aims to support, support co-located, co-located in-person
类目: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper introduces the concept of augmented conversation, which aims to support co-located in-person conversations via embedded speech-driven on-the-fly referencing in augmented reality (AR). Today, computing technologies like smartphones allow quick access to a variety of references during the conversation. However, these tools often create distractions, reducing eye contact and forcing users to focus their attention on phone screens and manually enter keywords to access relevant information. In contrast, AR-based on-the-fly referencing provides relevant visual references in real-time, based on keywords extracted automatically from the spoken conversation. By embedding these visual references in AR around the conversation partner, augmented conversation reduces distraction and friction, allowing users to maintain eye contact and supporting more natural social interactions. To demonstrate this concept, we developed \system, a Hololens-based interface that leverages real-time speech recognition, natural language processing and gaze-based interactions for on-the-fly embedded visual referencing. In this paper, we explore the design space of visual referencing for conversations, and describe our implementation, building on seven design guidelines identified through a user-centered design process. An initial user study confirms that our system decreases distraction and friction in conversations compared to smartphone searches, while providing highly useful and relevant information.

[LG-153] Data-Driven Simulator for Mechanical Circulatory Support with Domain Adversarial Neural Process

链接: https://arxiv.org/abs/2405.18536
作者: Sophia Sun,Wenyuan Chen,Zihao Zhou,Sonia Fereidooni,Elise Jortberg,Rose Yu
关键词: Mechanical Circulatory Support, Circulatory Support, probabilistic deep sequence, deep sequence model, Mechanical Circulatory
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We propose a data-driven simulator for Mechanical Circulatory Support (MCS) devices, implemented as a probabilistic deep sequence model. Existing mechanical simulators for MCS rely on oversimplifying assumptions and are insensitive to patient-specific behavior, limiting their applicability to real-world treatment scenarios. To address these shortcomings, our model Domain Adversarial Neural Process (DANP) employs a neural process architecture, allowing it to capture the probabilistic relationship between MCS pump levels and aortic pressure measurements with uncertainty. We use domain adversarial training to combine simulation data with real-world observations, resulting in a more realistic and diverse representation of potential outcomes. Empirical results with an improvement of 19% in non-stationary trend prediction establish DANP as an effective tool for clinicians to understand and make informed decisions regarding MCS patient treatment.

[LG-154] Offline-Boosted Actor-Critic: Adaptively Blending Optimal Historical Behaviors in Deep Off-Policy RL

链接: https://arxiv.org/abs/2405.18520
作者: Yu Luo,Tianying Ji,Fuchun Sun,Jianwei Zhang,Huazhe Xu,Xianyuan Zhan
关键词: achieved notable success, leveraging previously collected, previously collected data, Off-policy reinforcement learning, complex real-world tasks
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Off-policy reinforcement learning (RL) has achieved notable success in tackling many complex real-world tasks, by leveraging previously collected data for policy learning. However, most existing off-policy RL algorithms fail to maximally exploit the information in the replay buffer, limiting sample efficiency and policy performance. In this work, we discover that concurrently training an offline RL policy based on the shared online replay buffer can sometimes outperform the original online learning policy, though the occurrence of such performance gains remains uncertain. This motivates a new possibility of harnessing the emergent outperforming offline optimal policy to improve online policy learning. Based on this insight, we present Offline-Boosted Actor-Critic (OBAC), a model-free online RL framework that elegantly identifies the outperforming offline policy through value comparison, and uses it as an adaptive constraint to guarantee stronger policy learning performance. Our experiments demonstrate that OBAC outperforms other popular model-free RL baselines and rivals advanced model-based RL methods in terms of sample efficiency and asymptotic performance across 53 tasks spanning 6 task suites.

[LG-155] LSTM-COX Model: A Concise and Efficient Deep Learning Approach for Handling Recurrent Events

链接: https://arxiv.org/abs/2405.18518
作者: Zhang Runquan,Shi Xiaoping
关键词: analyzing recurrent events, complex time-dependent data, analyzing recurrent, Long Short-Term Memory, Akaike Information Criterion
类目: Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:In the current field of clinical medicine, traditional methods for analyzing recurrent events have limitations when dealing with complex time-dependent data. This study combines Long Short-Term Memory networks (LSTM) with the Cox model to enhance the model’s performance in analyzing recurrent events with dynamic temporal information. Compared to classical models, the LSTM-Cox model significantly improves the accuracy of extracting clinical risk features and exhibits lower Akaike Information Criterion (AIC) values, while maintaining good performance on simulated datasets. In an empirical analysis of bladder cancer recurrence data, the model successfully reduced the mean squared error during the training phase and achieved a Concordance index of up to 0.90 on the test set. Furthermore, the model effectively distinguished between high and low-risk patient groups, and the identified recurrence risk features such as the number of tumor recurrences and maximum size were consistent with other research and clinical trial results. This study not only provides a straightforward and efficient method for analyzing recurrent data and extracting features but also offers a convenient pathway for integrating deep learning techniques into clinical risk prediction systems.
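The Concordance index reported above can be computed directly from predicted risk scores. A minimal sketch of Harrell's C-index (illustrative, not the paper's implementation):

```python
def concordance_index(times, scores, events):
    """Harrell's C-index: over comparable pairs (the earlier time is an
    observed event), the fraction where the higher risk score belongs to
    the subject with the earlier time. Score ties count as 0.5.
    """
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable if subject i has an observed event
            # strictly before subject j's time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if scores[i] > scores[j]:
                    concordant += 1
                elif scores[i] == scores[j]:
                    concordant += 0.5
    return concordant / comparable

# Perfectly ordered risks (earlier events get higher scores) -> 1.0
print(concordance_index([2, 4, 6], [3.0, 2.0, 1.0], [1, 1, 1]))  # 1.0
```

A value of 0.90, as in the abstract, means 90% of comparable patient pairs are ranked correctly by the model's risk scores.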

[LG-156] Atlas3D: Physically Constrained Self-Supporting Text-to-3D for Simulation and Fabrication

链接: https://arxiv.org/abs/2405.18515
作者: Yunuo Chen,Tianyi Xie,Zeshun Zong,Xuan Li,Feng Gao,Yin Yang,Ying Nian Wu,Chenfanfu Jiang
关键词: producing visually realistic, visually realistic shapes, methods primarily focus, shapes and appearances, primarily focus
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Existing diffusion-based text-to-3D generation methods primarily focus on producing visually realistic shapes and appearances, often neglecting the physical constraints necessary for downstream tasks. Generated models frequently fail to maintain balance when placed in physics-based simulations or 3D printed. This balance is crucial for satisfying user design intentions in interactive gaming, embodied AI, and robotics, where stable models are needed for reliable interaction. Additionally, stable models ensure that 3D-printed objects, such as figurines for home decoration, can stand on their own without requiring additional supports. To fill this gap, we introduce Atlas3D, an automatic and easy-to-implement method that enhances existing Score Distillation Sampling (SDS)-based text-to-3D tools. Atlas3D ensures the generation of self-supporting 3D models that adhere to physical laws of stability under gravity, contact, and friction. Our approach combines a novel differentiable simulation-based loss function with physically inspired regularization, serving as either a refinement or a post-processing module for existing frameworks. We verify Atlas3D’s efficacy through extensive generation tasks and validate the resulting 3D models in both simulated and real-world environments.

[LG-157] Understanding Transformer Reasoning Capabilities via Graph Algorithms

链接: https://arxiv.org/abs/2405.18512
作者: Clayton Sanford,Bahare Fatemi,Ethan Hall,Anton Tsitsulin,Mehran Kazemi,Jonathan Halcrow,Bryan Perozzi,Vahab Mirrokni
关键词: Abstract, regimes, algorithmic, scaling regimes, perfectly solve
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 43 pages, 8 figures

点击查看摘要

Abstract:Which transformer scaling regimes are able to perfectly solve different classes of algorithmic problems? While tremendous empirical advances have been attained by transformer-based neural networks, a theoretical understanding of their algorithmic reasoning capabilities in realistic parameter regimes is lacking. We investigate this question in terms of the network’s depth, width, and number of extra tokens for algorithm execution. Our novel representational hierarchy separates 9 algorithmic reasoning problems into classes solvable by transformers in different realistic parameter scaling regimes. We prove that logarithmic depth is necessary and sufficient for tasks like graph connectivity, while single-layer transformers with small embedding dimensions can solve contextual retrieval tasks. We also support our theoretical analysis with ample empirical evidence using the GraphQA benchmark. These results show that transformers excel at many graph reasoning tasks, even outperforming specialized graph neural networks.

[LG-158] Injecting Hierarchical Biological Priors into Graph Neural Networks for Flow Cytometry Prediction

链接: https://arxiv.org/abs/2405.18507
作者: Fatemeh Nassajian Mojarrad,Lorenzo Bini,Thomas Matthes,Stéphane Marchand-Maillet
关键词: presents profound challenges, bone marrow derived, cell-level prediction presents, prediction presents profound, flow cytometry
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
*备注: 14 pages, ICML Conference Workshop 2024. arXiv admin note: text overlap with arXiv:2402.18610

点击查看摘要

Abstract:In the complex landscape of hematologic samples such as peripheral blood or bone marrow derived from flow cytometry (FC) data, cell-level prediction presents profound challenges. This work explores injecting hierarchical prior knowledge into graph neural networks (GNNs) for single-cell multi-class classification of tabular cellular data. By representing the data as graphs and encoding hierarchical relationships between classes, we propose our hierarchical plug-in method to be applied to several GNN models, namely, FCHC-GNN, and effectively designed to capture neighborhood information crucial for single-cell FC domain. Extensive experiments on our cohort of 19 distinct patients, demonstrate that incorporating hierarchical biological constraints boosts performance significantly across multiple metrics compared to baseline GNNs without such priors. The proposed approach highlights the importance of structured inductive biases for gaining improved generalization in complex biological prediction tasks.

[LG-159] SoundCTM: Uniting Score-based and Consistency Models for Text-to-Sound Generation

链接: https://arxiv.org/abs/2405.18503
作者: Koichi Saito,Dongjun Kim,Takashi Shibuya,Chieh-Hsin Lai,Zhi Zhong,Yuhta Takida,Yuki Mitsufuji
关键词: video games, indispensable element, element for multimedia, multimedia works, Sound
类目: ound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Sound content is an indispensable element for multimedia works such as video games, music, and films. Recent high-quality diffusion-based sound generation models can serve as valuable tools for the creators. However, despite producing high-quality sounds, these models often suffer from slow inference speeds. This drawback burdens creators, who typically refine their sounds through trial and error to align them with their artistic intentions. To address this issue, we introduce Sound Consistency Trajectory Models (SoundCTM). Our model enables flexible transitioning between high-quality 1-step sound generation and superior sound quality through multi-step generation. This allows creators to initially control sounds with 1-step samples before refining them through multi-step generation. While CTM fundamentally achieves flexible 1-step and multi-step generation, its impressive performance heavily depends on an additional pretrained feature extractor and an adversarial loss, which are expensive to train and not always available in other domains. Thus, we reframe CTM’s training framework and introduce a novel feature distance by utilizing the teacher’s network for a distillation loss. Additionally, while distilling classifier-free guided trajectories, we train conditional and unconditional student models simultaneously and interpolate between these models during inference. We also propose training-free controllable frameworks for SoundCTM, leveraging its flexible sampling capability. SoundCTM achieves both promising 1-step and multi-step real-time sound generation without using any extra off-the-shelf networks. Furthermore, we demonstrate SoundCTM’s capability of controllable sound generation in a training-free manner.

[LG-160] The Unified Balance Theory of Second-Moment Exponential Scaling Optimizers in Visual Tasks

链接: https://arxiv.org/abs/2405.18498
作者: Gongyue Zhang,Honghai Liu
关键词: Second-Moment Exponential Scaling, variable Second-Moment Exponential, Exponential Scaling, unifying first-order optimizers, identified a potential
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We have identified a potential method for unifying first-order optimizers through the use of variable Second-Moment Exponential Scaling (SMES). We begin with backpropagation, addressing classic phenomena such as gradient vanishing and explosion, as well as issues related to dataset sparsity, and introduce the theory of balance in optimization. Through this theory, we suggest that SGD and adaptive optimizers can be unified under a broader inference, employing variable moving exponential scaling to achieve a balanced approach within a generalized formula for first-order optimizers. We conducted tests on some classic datasets and networks to confirm the impact of different balance coefficients on the overall training process.
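One plausible reading of "variable second-moment exponential scaling" is an exponent p applied to the second-moment term, interpolating between SGD with momentum (p = 0, denominator becomes 1) and an Adam-style update (p = 0.5). The sketch below is a hypothetical illustration of that interpolation, not the paper's method; all names and the bias-correction omission are assumptions:

```python
import numpy as np

def smes_step(w, grad, state, lr=1e-3, beta1=0.9, beta2=0.999,
              p=0.5, eps=1e-8):
    """One step of a first-order update with exponent p on the second
    moment. p = 0 recovers momentum SGD; p = 0.5 recovers an Adam-style
    update (bias correction omitted for brevity). Hypothetical sketch.
    """
    m = beta1 * state["m"] + (1 - beta1) * grad      # first moment
    v = beta2 * state["v"] + (1 - beta2) * grad ** 2  # second moment
    state["m"], state["v"] = m, v
    return w - lr * m / (v ** p + eps)

w = np.array([1.0, -2.0])
g = np.array([0.1, -0.3])
w_sgd = smes_step(w, g, {"m": np.zeros(2), "v": np.zeros(2)}, p=0.0)
w_adam = smes_step(w, g, {"m": np.zeros(2), "v": np.zeros(2)}, p=0.5)
```

Sweeping p between 0 and 0.5 would then play the role of a "balance coefficient" between the two regimes.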

[LG-161] Why Algorithms Remain Unjust: Power Structures Surrounding Algorithmic Activity

链接: https://arxiv.org/abs/2405.18461
作者: Andrew Balch
关键词: Algorithmic Activity, algorithmic, play an increasingly-significant, increasingly-significant role, Activity
类目: Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注: 12 pages, 2 figures, submitted to 2024 AAAI/ACM conference on AI, Ethics, and Society (AIES)

点击查看摘要

Abstract:Algorithms play an increasingly-significant role in our social lives. Unfortunately, they often perpetuate social injustices while doing so. The popular means of addressing these algorithmic injustices has been through algorithmic reformism: fine-tuning the algorithm itself to be more fair, accountable, and transparent. While commendable, the emerging discipline of critical algorithm studies shows that reformist approaches have failed to curtail algorithmic injustice because they ignore the power structure surrounding algorithms. Heeding calls from critical algorithm studies to analyze this power structure, I employ a framework developed by Erik Olin Wright to examine the configuration of power surrounding Algorithmic Activity: the ways in which algorithms are researched, developed, trained, and deployed within society. I argue that the reason Algorithmic Activity is unequal, undemocratic, and unsustainable is that the power structure shaping it is one of economic empowerment rather than social empowerment. For Algorithmic Activity to be socially just, we need to transform this power configuration to empower the people at the other end of an algorithm. To this end, I explore Wright’s symbiotic, interstitial, and ruptural transformations in the context of Algorithmic Activity, as well as how they may be applied in a hypothetical research project that uses algorithms to address a social issue. I conclude with my vision for socially just Algorithmic Activity, asking that future work strives to integrate the proposed transformations and develop new mechanisms for social empowerment.

[LG-162] Probing the Information Theoretical Roots of Spatial Dependence Measures

链接: https://arxiv.org/abs/2405.18459
作者: Zhangyu Wang,Krzysztof Janowicz,Gengchen Mai,Ivan Majic
关键词: measures of entropy, information theoretical measures, spatial, spatial data, Intuitively
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Methodology (stat.ME)
*备注: COSIT-2024 Conference Proceedings

点击查看摘要

Abstract:Intuitively, there is a relation between measures of spatial dependence and information theoretical measures of entropy. For instance, we can provide an intuition of why spatial data is special by stating that, on average, spatial data samples contain less than expected information. Similarly, spatial data, e.g., remotely sensed imagery, that is easy to compress is also likely to show significant spatial autocorrelation. Formulating our (highly specific) core concepts of spatial information theory in the widely used language of information theory opens new perspectives on their differences and similarities and also fosters cross-disciplinary collaboration, e.g., with the broader AI/ML communities. Interestingly, however, this intuitive relation is challenging to formalize and generalize, leading prior work to rely mostly on experimental results, e.g., for describing landscape patterns. In this work, we will explore the information theoretical roots of spatial autocorrelation, more specifically Moran’s I, through the lens of self-information (also known as surprisal) and provide both formal proofs and experiments.
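Moran's I itself is a short computation: a normalized cross-product of mean-centered values under a spatial weight matrix W. A minimal sketch:

```python
import numpy as np

def morans_i(x, W):
    """Moran's I for values x under spatial weight matrix W:
    I = (n / sum(W)) * (x_c^T W x_c) / (x_c^T x_c), with x_c = x - mean(x).
    Values near +1 indicate positive spatial autocorrelation, near -1
    negative, and near -1/(n-1) no autocorrelation.
    """
    x = np.asarray(x, dtype=float)
    W = np.asarray(W, dtype=float)
    xc = x - x.mean()
    n = len(x)
    return (n / W.sum()) * (xc @ W @ xc) / (xc @ xc)

# Four cells on a line, adjacent cells are neighbors; a smooth gradient
# of values is positively autocorrelated.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(morans_i([1.0, 2.0, 3.0, 4.0], W))  # ~1/3: positive autocorrelation
```

The paper's connection to surprisal can then be probed empirically: fields with high I (smooth, predictable) compress well and carry low average self-information.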

[LG-163] Asymmetrical estimator for training grey-box deep photonic neural networks

链接: https://arxiv.org/abs/2405.18458
作者: Yizhi Wang,Minjia Chen,Chunhui Yao,Jie Ma,Ting Yan,Richard Penty,Qixiang Cheng
关键词: in-propagation analogue processing, network acceleration due, neural network acceleration, in-propagation analogue, analogue processing
类目: Machine Learning (cs.LG); Optics (physics.optics)
*备注: 17 pages, 5 figures

点击查看摘要

Abstract:Physical neural networks (PNNs) are emerging paradigms for neural network acceleration due to their high-bandwidth, in-propagation analogue processing. Despite the advantages of PNN for inference, training remains a challenge. The imperfect information of the physical transformation means the failure of conventional gradient-based updates from backpropagation (BP). Here, we present the asymmetrical training (AT) method, which treats the PNN structure as a grey box. AT performs training while only knowing the last layer output and neuron topological connectivity of a deep neural network structure, not requiring information about the physical control-transformation mapping. We experimentally demonstrated the AT method on deep grey-box PNNs implemented by uncalibrated photonic integrated circuits (PICs), improving the classification accuracy of Iris flower and modified MNIST hand-written digits from random guessing to near theoretical maximum. We also showcased the consistently enhanced performance of AT over BP for different datasets, including MNIST, fashion-MNIST, and Kuzushiji-MNIST. The AT method demonstrated successful training with minimal hardware overhead and reduced computational overhead, serving as a robust light-weight training alternative to fully explore the advantages of physical computation.

[LG-164] Improving Linear System Solvers for Hyperparameter Optimisation in Iterative Gaussian Processes

链接: https://arxiv.org/abs/2405.18457
作者: Jihao Andreas Lin,Shreyas Padhy,Bruno Mlodozeniec,Javier Antorán,José Miguel Hernández-Lobato
关键词: Gaussian process community, Scaling hyperparameter optimisation, large datasets remains, Scaling hyperparameter, Gaussian process
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: arXiv admin note: text overlap with arXiv:2405.18328

点击查看摘要

Abstract:Scaling hyperparameter optimisation to very large datasets remains an open problem in the Gaussian process community. This paper focuses on iterative methods, which use linear system solvers, like conjugate gradients, alternating projections or stochastic gradient descent, to construct an estimate of the marginal likelihood gradient. We discuss three key improvements which are applicable across solvers: (i) a pathwise gradient estimator, which reduces the required number of solver iterations and amortises the computational cost of making predictions, (ii) warm starting linear system solvers with the solution from the previous step, which leads to faster solver convergence at the cost of negligible bias, (iii) early stopping linear system solvers after a limited computational budget, which synergises with warm starting, allowing solver progress to accumulate over multiple marginal likelihood steps. These techniques provide speed-ups of up to 72× when solving to tolerance, and decrease the average residual norm by up to 7× when stopping early.
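Improvement (ii), warm starting, is easy to illustrate with a plain conjugate-gradient solver: reusing the previous solution shrinks the initial residual, so fewer iterations are needed when the system changes only slightly between marginal-likelihood steps. A self-contained sketch (not the authors' code):

```python
import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-8, max_iters=1000):
    """Solve A x = b (A symmetric positive definite) by conjugate
    gradients, returning the solution and iteration count. Passing the
    previous step's solution as x0 ('warm starting') shrinks the initial
    residual and typically cuts the iterations required.
    """
    x = np.zeros_like(b) if x0 is None else x0.copy()
    r = b - A @ x
    p = r.copy()
    rs = r @ r
    for k in range(max_iters):
        if np.sqrt(rs) < tol:
            return x, k
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x, max_iters

rng = np.random.default_rng(0)
M = rng.standard_normal((50, 50))
A = M @ M.T + 50 * np.eye(50)   # stand-in for a kernel matrix plus noise
b1 = rng.standard_normal(50)
x1, cold_iters = conjugate_gradient(A, b1)
# Slightly perturbed right-hand side, as between marginal-likelihood steps:
b2 = b1 + 0.01 * rng.standard_normal(50)
_, warm_iters = conjugate_gradient(A, b2, x0=x1)
print(cold_iters, warm_iters)  # the warm start should need fewer iterations
```

Early stopping, improvement (iii), corresponds to simply lowering `max_iters`; with warm starting, the leftover residual is worked off across subsequent outer steps.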

[LG-165] InversionView: A General-Purpose Method for Reading Information from Neural Activations

链接: https://arxiv.org/abs/2405.17653
作者: Xinting Huang,Madhur Panwar,Navin Goyal,Michael Hahn
关键词: neural networks, fully decipher, information encoded, workings of neural, encoded in neural
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:The inner workings of neural networks can be better understood if we can fully decipher the information encoded in neural activations. In this paper, we argue that this information is embodied by the subset of inputs that give rise to similar activations. Computing such subsets is nontrivial as the input space is exponentially large. We propose InversionView, which allows us to practically inspect this subset by sampling from a trained decoder model conditioned on activations. This helps uncover the information content of activation vectors, and facilitates understanding of the algorithms implemented by transformer models. We present three case studies where we investigate models ranging from small transformers to GPT-2. In these studies, we demonstrate the characteristics of our method, show the distinctive advantages it offers, and provide causally verified circuits.

[LG-166] A Dataset for Research on Water Sustainability

链接: https://arxiv.org/abs/2405.17469
作者: Pranjol Sen Gupta,Md Rajib Hossen,Pengfei Li,Shaolei Ren,Mohammad A. Islam
关键词: requires collective efforts, Freshwater scarcity, industry sectors, global problem, problem that requires
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Performance (cs.PF)
*备注: Accepted by ACM e-Energy 2024

点击查看摘要

Abstract:Freshwater scarcity is a global problem that requires collective efforts across all industry sectors. Nevertheless, a lack of access to operational water footprint data bars many applications from exploring optimization opportunities hidden within the temporal and spatial variations. To break this barrier for research in water sustainability, we build a dataset of operational direct water usage in cooling systems and indirect water embedded in electricity generation. Our dataset consists of the hourly water efficiency of major U.S. cities and states from 2019 to 2023. We also offer cooling system models that capture the impact of weather on water efficiency. We present a preliminary analysis of our dataset and discuss three potential applications that can benefit from it. Our dataset is publicly available at the Open Science Framework (OSF).

[LG-167] OSLO: One-Shot Label-Only Membership Inference Attacks

链接: https://arxiv.org/abs/2405.16978
作者: Yuefeng Peng,Jaechul Roh,Subhransu Maji,Amir Houmansadr
关键词: predicted hard label, target model training, membership inference attacks, introduce One-Shot Label-Only, model training set
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:We introduce One-Shot Label-Only (OSLO) membership inference attacks (MIAs), which accurately infer a given sample’s membership in a target model’s training set with high precision using just a single query, where the target model only returns the predicted hard label. This is in contrast to state-of-the-art label-only attacks which require ~6,000 queries, yet get attack precisions lower than OSLO’s. OSLO leverages transfer-based black-box adversarial attacks. The core idea is that a member sample exhibits more resistance to adversarial perturbations than a non-member. We compare OSLO against state-of-the-art label-only attacks and demonstrate that, despite requiring only one query, our method significantly outperforms previous attacks in terms of precision and true positive rate (TPR) under the same false positive rates (FPR). For example, compared to previous label-only MIAs, OSLO achieves a TPR that is 7× to 28× stronger under a 0.1% FPR on CIFAR10 for a ResNet model. We evaluated multiple defense mechanisms against OSLO.
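The core idea, that members resist adversarial perturbation more than non-members, can be caricatured with a 1-D toy model. Note this deliberately simplified sketch probes several perturbation budgets, unlike the actual one-query transfer-based attack; all names are hypothetical:

```python
def predict_label(x, boundary=0.0):
    """Toy target model that returns only a hard label."""
    return int(x > boundary)

def min_flip_perturbation(x, budgets):
    """Smallest tested perturbation magnitude that flips the hard label."""
    y0 = predict_label(x)
    for eps in sorted(budgets):
        x_adv = x - eps if y0 == 1 else x + eps   # push towards the boundary
        if predict_label(x_adv) != y0:
            return eps
    return float("inf")

def membership_decision(x, tau, budgets):
    """Flag x as a training member if it resists perturbations larger than tau."""
    return min_flip_perturbation(x, budgets) > tau
```

In OSLO proper, the perturbation is crafted once on surrogate models, so only the final hard-label query touches the target.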

[LG-168] Functional Programming Paradigm of Python for Scientific Computation Pipeline Integration

链接: https://arxiv.org/abs/2405.16956
作者: Chen Zhang,Lecheng Jia,Wei Zhang,Ning Wen
关键词: modern data processing, tendency towards interdisciplinarity, technical approaches, advent of modern, processing has led
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Programming Languages (cs.PL); Software Engineering (cs.SE)
*备注:

点击查看摘要

Abstract:The advent of modern data processing has led to an increasing tendency towards interdisciplinarity, which frequently involves the importation of different technical approaches. Consequently, there is an urgent need for a unified data control system to facilitate the integration of varying libraries. This integration is of profound significance in accelerating prototype verification, optimising algorithm performance and minimising maintenance costs. This paper presents a novel functional programming (FP) paradigm based on the Python architecture and associated suites in programming practice, designed for the integration of pipelines of different data mapping operations. In particular, the solution is intended for the integration of scientific computation flows, which affords a robust yet flexible solution for the aforementioned challenges.
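As a flavor of the functional style the paper advocates (the paper's own architecture is not reproduced here, so this is a generic sketch with hypothetical stage names), data-mapping operations can be composed into a single pipeline callable:

```python
from functools import reduce

def pipeline(*stages):
    """Compose data-mapping stages left to right into a single callable."""
    return lambda data: reduce(lambda acc, stage: stage(acc), stages, data)

clean = pipeline(
    lambda xs: [x for x in xs if x is not None],   # drop missing values
    lambda xs: [float(x) for x in xs],             # unify numeric types
    lambda xs: [x * 2 for x in xs],                # example scientific transform
)
```

Each stage stays a pure function, so stages from different libraries can be swapped or reordered without touching the rest of the flow.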

[LG-169] Trusting Fair Data: Leveraging Quality in Fairness-Driven Data Removal Techniques

链接: https://arxiv.org/abs/2405.12926
作者: Manh Khoi Duong,Stefan Conrad
关键词: bias mitigation techniques, specific data points, remove specific data, training set, deal with bias
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In this paper, we deal with bias mitigation techniques that remove specific data points from the training set to aim for a fair representation of the population in that set. Machine learning models are trained on these pre-processed datasets, and their predictions are expected to be fair. However, such approaches may exclude relevant data, making the attained subsets less trustworthy for further usage. To enhance the trustworthiness of prior methods, we propose additional requirements and objectives that the subsets must fulfill in addition to fairness: (1) group coverage, and (2) minimal data loss. While removing entire groups may improve the measured fairness, this practice is very problematic as failing to represent every group cannot be considered fair. In our second concern, we advocate for the retention of data while minimizing discrimination. By introducing a multi-objective optimization problem that considers fairness and data loss, we propose a methodology to find Pareto-optimal solutions that balance these objectives. By identifying such solutions, users can make informed decisions about the trade-off between fairness and data quality and select the most suitable subset for their application.
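Finding Pareto-optimal trade-offs between unfairness and data loss can be sketched generically. The brute-force filter below assumes both objectives are measured so that lower is better; it is a textbook dominance check, not the paper's own method:

```python
def pareto_optimal(candidates):
    """Keep (unfairness, data_loss) pairs not dominated in both objectives (lower is better)."""
    front = []
    for i, (u_i, l_i) in enumerate(candidates):
        dominated = any(
            u_j <= u_i and l_j <= l_i and (u_j < u_i or l_j < l_i)
            for j, (u_j, l_j) in enumerate(candidates)
            if j != i
        )
        if not dominated:
            front.append((u_i, l_i))
    return front
```

A user would then pick a point on the returned front according to how much data loss they can tolerate for a given fairness level.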

[LG-170] A Recipe for Charge Density Prediction

链接: https://arxiv.org/abs/2405.19276
作者: Xiang Fu,Andrew Rosen,Kyle Bystrom,Rui Wang,Albert Musaelian,Boris Kozinsky,Tess Smidt,Tommi Jaakkola
关键词: density functional theory, functional theory, core attribute, chemical properties, charge density
类目: Computational Physics (physics.comp-ph); Machine Learning (cs.LG)
*备注: 15 pages

点击查看摘要

Abstract:In density functional theory, charge density is the core attribute of atomic systems from which all chemical properties can be derived. Machine learning methods are promising in significantly accelerating charge density prediction, yet existing approaches either lack accuracy or scalability. We propose a recipe that can achieve both. In particular, we identify three key ingredients: (1) representing the charge density with atomic and virtual orbitals (spherical fields centered at atom/virtual coordinates); (2) using expressive and learnable orbital basis sets (basis function for the spherical fields); and (3) using high-capacity equivariant neural network architecture. Our method achieves state-of-the-art accuracy while being more than an order of magnitude faster than existing methods. Furthermore, our method enables flexible efficiency-accuracy trade-offs by adjusting the model/basis sizes.

[LG-171] Valid Conformal Prediction for Dynamic GNNs

链接: https://arxiv.org/abs/2405.19230
作者: Ed Davis,Ian Gallagher,Daniel John Lawson,Patrick Rubin-Delanchy
关键词: Graph neural networks, powerful black-box models, shown impressive empirical, neural networks, powerful black-box
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 17 pages, 8 figures

点击查看摘要

Abstract:Graph neural networks (GNNs) are powerful black-box models which have shown impressive empirical performance. However, without any form of uncertainty quantification, it can be difficult to trust such models in high-risk scenarios. Conformal prediction aims to address this problem, however, an assumption of exchangeability is required for its validity which has limited its applicability to static graphs and transductive regimes. We propose to use unfolding, which allows any existing static GNN to output a dynamic graph embedding with exchangeability properties. Using this, we extend the validity of conformal prediction to dynamic GNNs in both transductive and semi-inductive regimes. We provide a theoretical guarantee of valid conformal prediction in these cases and demonstrate the empirical validity, as well as the performance gains, of unfolded GNNs against standard GNN architectures on both simulated and real datasets.
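The conformal machinery being extended here is, at its core, a calibration-quantile computation. Below is a minimal split-conformal sketch on a toy regression task (standard conformal prediction, not the unfolding construction of the paper; the constant predictor is a stand-in for any model):

```python
import numpy as np

def conformal_quantile(cal_scores, alpha=0.1):
    """Finite-sample-corrected quantile of calibration nonconformity scores."""
    n = len(cal_scores)
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    return np.quantile(cal_scores, level, method="higher")

rng = np.random.default_rng(0)
y_cal = rng.normal(size=1000)                 # calibration targets
scores = np.abs(y_cal - 0.0)                  # nonconformity: |y - y_hat|, toy y_hat = 0
qhat = conformal_quantile(scores, alpha=0.1)
y_test = rng.normal(size=1000)
coverage = np.mean(np.abs(y_test) <= qhat)    # should be close to 1 - alpha
```

The validity argument requires exchangeability between calibration and test scores, which is exactly what the paper's unfolded embeddings restore for dynamic graphs.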

[LG-172] Domain adaptation in small-scale and heterogeneous biological datasets

链接: https://arxiv.org/abs/2405.19221
作者: Seyedmehdi Orouji,Martin C. Liu,Tal Korem,Megan A. K. Peters
关键词: Machine learning techniques, build predictive models, Machine learning, Domain adaptation, discover patterns
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
*备注: main manuscript + supplement

点击查看摘要

Abstract:Machine learning techniques are steadily becoming more important in modern biology, and are used to build predictive models, discover patterns, and investigate biological problems. However, models trained on one dataset are often not generalizable to other datasets from different cohorts or laboratories, due to differences in the statistical properties of these datasets. These could stem from technical differences, such as the measurement technique used, or from relevant biological differences between the populations studied. Domain adaptation, a type of transfer learning, can alleviate this problem by aligning the statistical distributions of features and samples among different datasets so that similar models can be applied across them. However, a majority of state-of-the-art domain adaptation methods are designed to work with large-scale data, mostly text and images, while biological datasets often suffer from small sample sizes, and possess complexities such as heterogeneity of the feature space. This Review aims to synthetically discuss domain adaptation methods in the context of small-scale and highly heterogeneous biological data. We describe the benefits and challenges of domain adaptation in biological research and critically discuss some of its objectives, strengths, and weaknesses through key representative methodologies. We argue for the incorporation of domain adaptation techniques to the computational biologist’s toolkit, with further development of customized approaches.

[LG-173] Matrix Manifold Neural Networks

链接: https://arxiv.org/abs/2405.19206
作者: Xuan Son Nguyen,Shuo Yang,Aymeric Histace
关键词: garnered increasing interest, Deep neural networks, Riemannian manifolds, Deep neural, applied areas
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deep neural networks (DNNs) on Riemannian manifolds have garnered increasing interest in various applied areas. For instance, DNNs on spherical and hyperbolic manifolds have been designed to solve a wide range of computer vision and natural language processing tasks. One of the key factors that contribute to the success of these networks is that spherical and hyperbolic manifolds have the rich algebraic structures of gyrogroups and gyrovector spaces. This enables principled and effective generalizations of the most successful DNNs to these manifolds. Recently, some works have shown that many concepts in the theory of gyrogroups and gyrovector spaces can also be generalized to matrix manifolds such as Symmetric Positive Definite (SPD) and Grassmann manifolds. As a result, some building blocks for SPD and Grassmann neural networks, e.g., isometric models and multinomial logistic regression (MLR) can be derived in a way that is fully analogous to their spherical and hyperbolic counterparts. Building upon these works, we design fully-connected (FC) and convolutional layers for SPD neural networks. We also develop MLR on Symmetric Positive Semi-definite (SPSD) manifolds, and propose a method for performing backpropagation with the Grassmann logarithmic map in the projector perspective. We demonstrate the effectiveness of the proposed approach in the human action recognition and node classification tasks.
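SPD building blocks of this kind rest on matrix logarithm and exponential maps, which for symmetric matrices reduce to eigendecompositions. A minimal NumPy sketch of these generic Riemannian utilities (not the paper's layers):

```python
import numpy as np

def spd_log(P):
    """Matrix logarithm of a symmetric positive-definite matrix via eigendecomposition."""
    w, V = np.linalg.eigh(P)
    return V @ np.diag(np.log(w)) @ V.T

def spd_exp(S):
    """Matrix exponential of a symmetric matrix, mapping back onto the SPD manifold."""
    w, V = np.linalg.eigh(S)
    return V @ np.diag(np.exp(w)) @ V.T
```

Layers can operate in the flat tangent space via `spd_log` and return to the manifold with `spd_exp`, which is one common way such networks stay on the SPD manifold.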

[LG-174] Model-independent cosmological inference post DESI DR1 BAO measurements

链接: https://arxiv.org/abs/2405.19178
作者: Purba Mukherjee,Anjan Ananda Sen
关键词: implement Gaussian process, Gaussian process regression, DESI BAO, implement Gaussian, Gaussian process
类目: Cosmology and Nongalactic Astrophysics (astro-ph.CO); Machine Learning (cs.LG); General Relativity and Quantum Cosmology (gr-qc)
*备注: 5 pages, 5 sets of figures. Comments are welcome

点击查看摘要

Abstract:In this work, we implement Gaussian process regression to reconstruct the expansion history of the universe in a model-agnostic manner, using the Pantheon-Plus SN-Ia compilation in combination with two different BAO measurements (SDSS-IV and DESI DR1). In both the reconstructions, the ΛCDM model is always included in the 95% confidence intervals. We find evidence that the DESI LRG data at z_eff = 0.51 is not an outlier within our model-independent framework. We study the Om diagnostics and the evolution of the total equation of state (EoS) of our universe, which hint towards the possibility of a quintessence-like dark energy scenario with a very slowly varying EoS, and a phantom-crossing in higher z. The entire exercise is later complemented by considering two more SN-Ia compilations - DES-5YR and Union3 - in combination with DESI BAO. Reconstruction with the DESI BAO + DES-5YR SN data sets predicts that the ΛCDM model lies outside the 3σ confidence levels, whereas with DESI BAO + Union3 data, the ΛCDM model is always included within 1σ. We also report constraints on H_0 r_d from our model-agnostic analysis, independent of the pre-recombination physics. Our results point towards an ≈2σ discrepancy between the DESI + Pantheon-Plus and DESI + DES-5YR data sets, which calls for further investigation.

[LG-175] I Bet You Did Not Mean That: Testing Semantic Importance via Betting

链接: https://arxiv.org/abs/2405.19146
作者: Jacopo Teneggi,Jeremias Sulam
关键词: black-box predictive model, works have extended, extended notions, notions of feature, inherently interpretable
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent works have extended notions of feature importance to semantic concepts that are inherently interpretable to the users interacting with a black-box predictive model. Yet, precise statistical guarantees, such as false positive rate control, are needed to communicate findings transparently and to avoid unintended consequences in real-world scenarios. In this paper, we formalize the global (i.e., over a population) and local (i.e., for a sample) statistical importance of semantic concepts for the predictions of opaque models, by means of conditional independence, which allows for rigorous testing. We use recent ideas of sequential kernelized testing (SKIT) to induce a rank of importance across concepts, and showcase the effectiveness and flexibility of our framework on synthetic datasets as well as on image classification tasks using vision-language models such as CLIP.

[LG-176] State Space Models are Comparable to Transformers in Estimating Functions with Dynamic Smoothness

链接: https://arxiv.org/abs/2405.19036
作者: Naoki Nishikawa,Taiji Suzuki
关键词: Deep neural networks, state space models, neural networks based, Deep neural, space models
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 33 pages, 2 figures

点击查看摘要

Abstract:Deep neural networks based on state space models (SSMs) are attracting much attention in sequence modeling since their computational cost is significantly smaller than that of Transformers. While the capabilities of SSMs have been primarily investigated through experimental comparisons, theoretical understanding of SSMs is still limited. In particular, there is a lack of statistical and quantitative evaluation of whether SSMs can replace Transformers. In this paper, we theoretically explore in which tasks SSMs can be alternatives to Transformers from the perspective of estimating sequence-to-sequence functions. We consider the setting where the target function has direction-dependent smoothness and prove that SSMs can estimate such functions with the same convergence rate as Transformers. Additionally, we prove that SSMs can estimate the target function, even if the smoothness changes depending on the input sequence, as well as Transformers. Our results show the possibility that SSMs can replace Transformers when estimating the functions in certain classes that appear in practice.
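The SSM layer analyzed here is, in its simplest discrete form, a linear recurrence over the sequence. A minimal sketch of that recurrence (a generic single-input SSM, not the specific parameterization studied in the paper):

```python
import numpy as np

def ssm_scan(A, B, C, u):
    """Linear state space recurrence: x_t = A x_{t-1} + B u_t, y_t = C x_t."""
    x = np.zeros(A.shape[0])
    ys = []
    for u_t in u:
        x = A @ x + B * u_t
        ys.append(C @ x)
    return np.array(ys)
```

The cost per step is independent of sequence length, which is the source of the efficiency advantage over attention.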

[LG-177] Physics-Aware Neural Implicit Solvers for multiscale parametric PDEs with applications in heterogeneous media

链接: https://arxiv.org/abs/2405.19019
作者: Matthaios Chatzopoulos,Phaedon-Stelios Koutsourelakis
关键词: Partial Differential Equations, parametrized Partial Differential, Neural Implicit Solvers, propose Physics-Aware Neural, Differential Equations
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We propose Physics-Aware Neural Implicit Solvers (PANIS), a novel, data-driven framework for learning surrogates for parametrized Partial Differential Equations (PDEs). It consists of a probabilistic, learning objective in which weighted residuals are used to probe the PDE and provide a source of virtual data, i.e. the actual PDE never needs to be solved. This is combined with a physics-aware implicit solver that consists of a much coarser, discretized version of the original PDE, which provides the requisite information bottleneck for high-dimensional problems and enables generalization in out-of-distribution settings (e.g. different boundary conditions). We demonstrate its capability in the context of random heterogeneous materials where the input parameters represent the material microstructure. We extend the framework to multiscale problems and show that a surrogate can be learned for the effective (homogenized) solution without ever solving the reference problem. We further demonstrate how the proposed framework can accommodate and generalize several existing learning objectives and architectures while yielding probabilistic surrogates that can quantify predictive uncertainty.

[LG-178] Kernel Semi-Implicit Variational Inference

链接: https://arxiv.org/abs/2405.18997
作者: Ziheng Cheng,Longlin Yu,Tianyu Xie,Shiyue Zhang,Cheng Zhang
关键词: extends traditional variational, traditional variational families, semi-implicit distributions defined, extends traditional, semi-implicit distributions
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: ICML 2024 camera ready

点击查看摘要

Abstract:Semi-implicit variational inference (SIVI) extends traditional variational families with semi-implicit distributions defined in a hierarchical manner. Due to the intractable densities of semi-implicit distributions, classical SIVI often resorts to surrogates of evidence lower bound (ELBO) that would introduce biases for training. A recent advancement in SIVI, named SIVI-SM, utilizes an alternative score matching objective made tractable via a minimax formulation, albeit requiring an additional lower-level optimization. In this paper, we propose kernel SIVI (KSIVI), a variant of SIVI-SM that eliminates the need for lower-level optimization through kernel tricks. Specifically, we show that when optimizing over a reproducing kernel Hilbert space (RKHS), the lower-level problem has an explicit solution. This way, the upper-level objective becomes the kernel Stein discrepancy (KSD), which is readily computable for stochastic gradient descent due to the hierarchical structure of semi-implicit variational distributions. An upper bound for the variance of the Monte Carlo gradient estimators of the KSD objective is derived, which allows us to establish novel convergence guarantees of KSIVI. We demonstrate the effectiveness and efficiency of KSIVI on both synthetic distributions and a variety of real data Bayesian inference tasks.
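The upper-level objective in KSIVI, the kernel Stein discrepancy, can be estimated from samples using only the target's score function. Below is a minimal V-statistic estimator for a standard normal target with an RBF kernel (a textbook one-dimensional KSD sketch, not the paper's implementation):

```python
import numpy as np

def ksd_gaussian_target(x, h=1.0):
    """V-statistic estimate of the squared kernel Stein discrepancy to N(0, 1).

    Uses the standard normal score s(x) = -x and an RBF kernel of bandwidth h;
    no density of the sampling distribution is needed."""
    d = x[:, None] - x[None, :]
    k = np.exp(-d**2 / (2 * h**2))
    s = -x
    stein_kernel = (
        s[:, None] * s[None, :] * k
        + s[:, None] * (d / h**2) * k       # s(x) * d_y k(x, y)
        + s[None, :] * (-d / h**2) * k      # s(y) * d_x k(x, y)
        + (1 / h**2 - d**2 / h**4) * k      # d_x d_y k(x, y)
    )
    return stein_kernel.mean()
```

Because the estimate only needs samples from the candidate distribution, it suits semi-implicit families whose densities are intractable.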

[LG-179] Predicting Many Properties of Crystals by a Single Deep Learning Model

链接: https://arxiv.org/abs/2405.18944
作者: Haosheng Xu,Dongheng Qian,Jing Wang
关键词: encounters significant challenges, machine learning methods, crystalline materials encounters, materials encounters significant, output versatility
类目: Materials Science (cond-mat.mtrl-sci); Mesoscale and Nanoscale Physics (cond-mat.mes-hall); Machine Learning (cs.LG)
*备注: 7 pages, 4 figures. The codes are available upon reasonable request

点击查看摘要

Abstract:The use of machine learning methods for predicting the properties of crystalline materials encounters significant challenges, primarily related to input encoding, output versatility, and interpretability. Here, we introduce CrystalBERT, an adaptable transformer-based framework with novel structure that integrates space group, elemental, and unit cell information. The method’s adaptability lies not only in its ability to seamlessly combine diverse features but also in its capability to accurately predict a wide range of physically important properties, including topological properties, superconducting transition temperatures, dielectric constants, and more. CrystalBERT also provides insightful physical interpretations regarding the features that most significantly influence the target properties. Our findings indicate that space group and elemental information are more important for predicting topological and superconducting properties, in contrast to some properties that primarily depend on the unit cell information. This underscores the intricate nature of topological and superconducting properties. By incorporating all these features, we achieve a high accuracy of 91% in topological classification, surpassing prior studies and identifying previously misclassified topological materials, further demonstrating the effectiveness of our model.

[LG-180] HLOB – Information Persistence and Structure in Limit Order Books

链接: https://arxiv.org/abs/2405.18938
作者: Antonio Briola,Silvia Bartolucci,Tomaso Aste
关键词: Limit Order Book, Order Book mid-price, Convolutional Neural Networks, Maximally Filtered Graph, Triangulated Maximally Filtered
类目: Trading and Market Microstructure (q-fin.TR); Machine Learning (cs.LG)
*备注: 34 pages, 7 figures, 7 tables, 3 equations

点击查看摘要

Abstract:We introduce a novel large-scale deep learning model for Limit Order Book mid-price changes forecasting, and we name it 'HLOB'. This architecture (i) exploits the information encoded by an Information Filtering Network, namely the Triangulated Maximally Filtered Graph, to unveil deeper and non-trivial dependency structures among volume levels; and (ii) guarantees deterministic design choices to handle the complexity of the underlying system by drawing inspiration from the groundbreaking class of Homological Convolutional Neural Networks. We test our model against 9 state-of-the-art deep learning alternatives on 3 real-world Limit Order Book datasets, each including 15 stocks traded on the NASDAQ exchange, and we systematically characterize the scenarios where HLOB outperforms state-of-the-art architectures. Our approach sheds new light on the spatial distribution of information in Limit Order Books and on its degradation over increasing prediction horizons, narrowing the gap between microstructural modeling and deep learning-based forecasting in high-frequency financial markets.

[LG-181] A Mallows-like Criterion for Anomaly Detection with Random Forest Implementation

链接: https://arxiv.org/abs/2405.18932
作者: Gaoxiang Zhao,Lu Wang,Xiaoqiang Wang
关键词: significantly undermined, inherent uncertainty, uncertainty of relying, anomaly signal detection, Random Forest algorithm
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The effectiveness of anomaly signal detection can be significantly undermined by the inherent uncertainty of relying on one specified model. Under the framework of model average methods, this paper proposes a novel criterion to select the weights on aggregation of multiple models, wherein the focal loss function accounts for the classification of extremely imbalanced data. This strategy is further integrated into Random Forest algorithm by replacing the conventional voting method. We have evaluated the proposed method on benchmark datasets across various domains, including network intrusion. The findings indicate that our proposed method not only surpasses the model averaging with typical loss functions but also outstrips common anomaly detection algorithms in terms of accuracy and robustness.
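The focal loss used inside the proposed criterion down-weights easy, well-classified examples relative to cross-entropy, which is what makes it suitable for extremely imbalanced anomaly data. A minimal binary version (the exact weighting scheme and γ used by the paper may differ):

```python
import numpy as np

def focal_loss(p, y, gamma=2.0):
    """Binary focal loss: (1 - p_t)^gamma down-weights easy, well-classified examples."""
    p_t = np.where(y == 1, p, 1 - p)
    return -np.mean((1 - p_t) ** gamma * np.log(p_t))
```

Setting `gamma=0` recovers ordinary cross-entropy; larger values focus the loss on hard (likely anomalous) samples.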

[LG-182] EntProp: High Entropy Propagation for Improving Accuracy and Robustness

链接: https://arxiv.org/abs/2405.18931
作者: Shohei Enomoto
关键词: Deep neural networks, Deep neural, samples, struggle to generalize, impressive performance
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted to UAI2024

点击查看摘要

Abstract:Deep neural networks (DNNs) struggle to generalize to out-of-distribution domains that are different from those in training despite their impressive performance. In practical applications, it is important for DNNs to have both high standard accuracy and robustness against out-of-distribution domains. One technique that achieves both of these improvements is disentangled learning with mixture distribution via auxiliary batch normalization layers (ABNs). This technique treats clean and transformed samples as different domains, allowing a DNN to learn better features from mixed domains. However, if we distinguish the domains of the samples based on entropy, we find that some transformed samples are drawn from the same domain as clean samples, and these samples are not completely different domains. To generate samples drawn from a completely different domain than clean samples, we hypothesize that transforming clean high-entropy samples to further increase the entropy generates out-of-distribution samples that are much further away from the in-distribution domain. On the basis of the hypothesis, we propose high entropy propagation (EntProp), which feeds high-entropy samples to the network that uses ABNs. We introduce two techniques, data augmentation and free adversarial training, that increase entropy and bring the sample further away from the in-distribution domain. These techniques do not require additional training costs. Our experimental results show that EntProp achieves higher standard accuracy and robustness with a lower training cost than the baseline methods. In particular, EntProp is highly effective at training on small datasets.
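The entropy-based routing at the heart of EntProp starts from predictive entropy per sample. A minimal sketch of computing entropies and selecting the high-entropy fraction (hypothetical helper names; the full pipeline with ABNs and the two entropy-increasing techniques is not reproduced):

```python
import numpy as np

def predictive_entropy(probs):
    """Shannon entropy of each row of predicted class probabilities."""
    return -np.sum(probs * np.log(probs + 1e-12), axis=1)

def select_high_entropy(probs, frac=0.5):
    """Indices of the `frac` highest-entropy samples, candidates for further transformation."""
    ent = predictive_entropy(probs)
    k = max(1, int(len(ent) * frac))
    return np.argsort(ent)[::-1][:k]
```

The selected samples would then be transformed to push them further out-of-distribution before being fed through the auxiliary batch-norm branch.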

[LG-183] Deep Positive-Unlabeled Anomaly Detection for Contaminated Unlabeled Data

链接: https://arxiv.org/abs/2405.18929
作者: Hiroshi Takahashi,Tomoharu Iwata,Atsutoshi Kumagai,Yuuki Yamanaka
关键词: anomaly, anomaly data, unlabeled data, data, anomaly detector
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Under review. Code is available at this https URL

点击查看摘要

Abstract:Semi-supervised anomaly detection, which aims to improve the performance of the anomaly detector by using a small amount of anomaly data in addition to unlabeled data, has attracted attention. Existing semi-supervised approaches assume that unlabeled data are mostly normal. They train the anomaly detector to minimize the anomaly scores for the unlabeled data, and to maximize those for the anomaly data. However, in practice, the unlabeled data are often contaminated with anomalies. This weakens the effect of maximizing the anomaly scores for anomalies, and prevents us from improving the detection performance. To solve this problem, we propose the positive-unlabeled autoencoder, which is based on positive-unlabeled learning and the anomaly detector such as the autoencoder. With our approach, we can approximate the anomaly scores for normal data using the unlabeled and anomaly data. Therefore, without the labeled normal data, we can train the anomaly detector to minimize the anomaly scores for normal data, and to maximize those for the anomaly data. In addition, our approach is applicable to various anomaly detectors such as the DeepSVDD. Experiments on various datasets show that our approach achieves better detection performance than existing approaches.
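The key identity behind the positive-unlabeled trick is that the unlabeled score distribution is a mixture of normal and anomalous ones, so the normal-data mean score can be recovered when the contamination rate π is known. A minimal sketch of that decomposition (π is assumed given; the paper's autoencoder-based training is not reproduced):

```python
import numpy as np

def normal_score_mean(scores_unlabeled, scores_anomaly, pi):
    """Recover E[score | normal] from unlabeled and labeled-anomaly scores.

    Uses E_u[s] = pi * E_a[s] + (1 - pi) * E_n[s], where pi is the (assumed
    known) anomaly fraction contaminating the unlabeled data."""
    return (np.mean(scores_unlabeled) - pi * np.mean(scores_anomaly)) / (1 - pi)
```

With this estimate, the detector can be trained to minimize scores for (approximated) normal data even though no normal labels exist.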

[LG-184] Computing low-thrust transfers in the asteroid belt a comparison between astrodynamical manipulations and a machine learning approach

链接: https://arxiv.org/abs/2405.18918
作者: Giacomo Acciarini,Laurent Beauregard,Dario Izzo
关键词: optimizing scientific output, Low-thrust trajectories play, play a crucial, crucial role, role in optimizing
类目: Earth and Planetary Astrophysics (astro-ph.EP); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG); Space Physics (physics.space-ph)
*备注: Paper presented and published in the proceedings of the 29th ISSFD Conference, Darmstadt, Germany, 2024

点击查看摘要

Abstract:Low-thrust trajectories play a crucial role in optimizing scientific output and cost efficiency in asteroid belt missions. Unlike high-thrust transfers, low-thrust trajectories require solving complex optimal control problems. This complexity grows exponentially with the number of asteroids visited due to orbital mechanics intricacies. In the literature, methods for approximating low-thrust transfers without full optimization have been proposed, including analytical and machine learning techniques. In this work, we propose new analytical approximations and compare their accuracy and performance to machine learning methods. While analytical approximations leverage orbit theory to estimate trajectory costs, machine learning employs a more black-box approach, utilizing neural networks to predict optimal transfers based on various attributes. We build a dataset of about 3 million transfers, found by solving the time and fuel optimal control problems, for different times of flight, which we also release open-source. Comparison between the two methods on this database reveals the superiority of machine learning, especially for longer transfers. Despite challenges such as multi-revolution transfers, both approaches maintain accuracy within a few percent in the final mass errors, on a database of trajectories involving numerous asteroids. This work contributes to the efficient exploration of mission opportunities in the asteroid belt, providing insights into the strengths and limitations of different approximation strategies.

[LG-185] Do Finetti: On Causal Effects for Exchangeable Data

链接: https://arxiv.org/abs/2405.18836
作者: Siyuan Guo,Chi Zhang,Karthika Mohan,Ferenc Huszár,Bernhard Schölkopf
关键词: data, causal, effect estimation, study causal effect, exchangeable data
类目: Methodology (stat.ME); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study causal effect estimation in a setting where the data are not i.i.d. (independent and identically distributed). We focus on exchangeable data satisfying an assumption of independent causal mechanisms. Traditional causal effect estimation frameworks, e.g., relying on structural causal models and do-calculus, are typically limited to i.i.d. data and do not extend to more general exchangeable generative processes, which naturally arise in multi-environment data. To address this gap, we develop a generalized framework for exchangeable data and introduce a truncated factorization formula that facilitates both the identification and estimation of causal effects in our setting. To illustrate potential applications, we introduce a causal Pólya urn model and demonstrate how intervention propagates effects in exchangeable data settings. Finally, we develop an algorithm that performs simultaneous causal discovery and effect estimation given multi-environment data.
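As a concrete point of reference, the classical (non-causal) Pólya urn that the paper's causal variant builds on can be simulated in a few lines. Everything below is standard textbook material, not the paper's model, and contains no interventions:

```python
import random

def polya_urn(steps, red=1, blue=1, seed=0):
    """Classical Polya urn: draw a ball, return it together with one extra
    ball of the same colour. The resulting draw sequence is exchangeable,
    which is the setting the paper's causal variant builds on."""
    rng = random.Random(seed)
    for _ in range(steps):
        if rng.random() < red / (red + blue):
            red += 1
        else:
            blue += 1
    return red, blue

red, blue = polya_urn(1000)
```

Each draw reinforces the drawn colour, which is what makes the sequence exchangeable rather than i.i.d.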

[LG-186] Federated Q-Learning with Reference-Advantage Decomposition: Almost Optimal Regret and Logarithmic Communication Cost

链接: https://arxiv.org/abs/2405.18795
作者: Zhong Zheng,Haochen Zhang,Lingzhou Xue
关键词: Markov decision processes, tabular episodic Markov, episodic Markov decision, episodic Markov, Markov decision
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this paper, we consider model-free federated reinforcement learning for tabular episodic Markov decision processes. Under the coordination of a central server, multiple agents collaboratively explore the environment and learn an optimal policy without sharing their raw data. Despite recent advances in federated Q-learning algorithms achieving near-linear regret speedup with low communication cost, existing algorithms only attain suboptimal regrets compared to the information bound. We propose a novel model-free federated Q-learning algorithm, termed FedQ-Advantage. Our algorithm leverages reference-advantage decomposition for variance reduction and operates under two distinct mechanisms: synchronization between the agents and the server, and policy update, both triggered by events. We prove that our algorithm not only requires a lower logarithmic communication cost but also achieves an almost optimal regret, reaching the information bound up to a logarithmic factor and near-linear regret speedup compared to its single-agent counterpart when the time horizon is sufficiently large.

[LG-187] SPABA: A Single-Loop and Probabilistic Stochastic Bilevel Algorithm Achieving Optimal Sample Complexity

链接: https://arxiv.org/abs/2405.18777
作者: Tianshu Chu,Dachuan Xu,Wei Yao,Jin Zhang
关键词: addressing large-scale nested, nested optimization problems, large-scale nested optimization, optimal complexity bounds, solving bilevel optimization
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注: Accepted by ICML 2024

点击查看摘要

Abstract:While stochastic bilevel optimization methods have been extensively studied for addressing large-scale nested optimization problems in machine learning, it remains an open question whether the optimal complexity bounds for solving bilevel optimization are the same as those in single-level optimization. Our main result resolves this question: SPABA, an adaptation of the PAGE method for nonconvex optimization in (Li et al., 2021) to the bilevel setting, can achieve optimal sample complexity in both the finite-sum and expectation settings. We show the optimality of SPABA by proving that there is no gap in complexity analysis between stochastic bilevel and single-level optimization when implementing PAGE. Notably, as indicated by the results of (Dagréou et al., 2022), there might exist a gap in complexity analysis when implementing other stochastic gradient estimators, like SGD and SAGA. In addition to SPABA, we propose several other single-loop stochastic bilevel algorithms, that either match or improve the state-of-the-art sample complexity results, leveraging our convergence rate and complexity analysis. Numerical experiments demonstrate the superior practical performance of the proposed methods.

[LG-188] RNAFlow: RNA Structure and Sequence Design via Inverse Folding-Based Flow Matching

链接: https://arxiv.org/abs/2405.18768
作者: Divya Nori,Wengong Jin
关键词: diverse biological applications, RNA, structure-based RNA design, RNA design, growing significance
类目: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
*备注: Accepted to ICML 2024

点击查看摘要

Abstract:The growing significance of RNA engineering in diverse biological applications has spurred interest in developing AI methods for structure-based RNA design. While diffusion models have excelled in protein design, adapting them for RNA presents new challenges due to RNA’s conformational flexibility and the computational cost of fine-tuning large structure prediction models. To this end, we propose RNAFlow, a flow matching model for protein-conditioned RNA sequence-structure design. Its denoising network integrates an RNA inverse folding model and a pre-trained RosettaFold2NA network for generation of RNA sequences and structures. The integration of inverse folding in the structure denoising process allows us to simplify training by fixing the structure prediction network. We further enhance the inverse folding model by conditioning it on inferred conformational ensembles to model dynamic RNA conformations. Evaluation on protein-conditioned RNA structure and sequence generation tasks demonstrates RNAFlow’s advantage over existing RNA design methods.

[LG-189] STIQ: Safeguarding Training and Inferencing of Quantum Neural Networks from Untrusted Cloud

链接: https://arxiv.org/abs/2405.18746
作者: Satwik Kundu,Swaroop Ghosh
关键词: current quantum cloud, Quantum Neural Networks, cheaper cloud-based quantum, cloud-based quantum services, potentially untrusted providers
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The high expenses imposed by current quantum cloud providers, coupled with the escalating need for quantum resources, may incentivize the emergence of cheaper cloud-based quantum services from potentially untrusted providers. Deploying or hosting quantum models, such as Quantum Neural Networks (QNNs), on these untrusted platforms introduces a myriad of security concerns, with the most critical one being model theft. This vulnerability stems from the cloud provider’s full access to these circuits during training and/or inference. In this work, we introduce STIQ, a novel ensemble-based strategy designed to safeguard QNNs against such cloud-based adversaries. Our method innovatively trains two distinct QNNs concurrently, hosting them on the same or different platforms, in a manner that each network yields obfuscated outputs, rendering the individual QNNs ineffective for adversaries operating within cloud environments. However, when these outputs are combined locally (using an aggregate function), they reveal the correct result. Through extensive experiments across various QNNs and datasets, our technique has proven to effectively mask the accuracy and losses of the individually hosted models by up to 76%, albeit at the expense of a \leq 2\times increase in the total computational overhead. This trade-off, however, is a small price to pay for the enhanced security and integrity of QNNs in a cloud-based environment prone to untrusted adversaries. We also demonstrated STIQ’s practical application by evaluating it on real 127-qubit IBM_Sherbrooke hardware, showing that STIQ achieves up to 60% obfuscation, with combined performance comparable to an unobfuscated model.
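The local-aggregation idea can be illustrated with a plain additive-masking toy in NumPy. Note this is a hypothetical classical sketch: in STIQ the obfuscation emerges from how the two QNNs are trained, not from an explicit shared random mask as below:

```python
import numpy as np

rng = np.random.default_rng(0)

def clean_logits(x):
    """Stand-in for the outputs of a single unprotected model."""
    return np.array([0.1 * x, 0.5 - 0.2 * x, 0.3])

def obfuscated_outputs(x):
    """Two hosted models whose individual outputs are shifted by a random
    share; neither output alone reveals the clean logits to the provider."""
    mask = rng.normal(scale=10.0, size=3)
    return clean_logits(x) + mask, -mask

def aggregate(o1, o2):
    """Local aggregation (here: addition) cancels the masks."""
    return o1 + o2

o1, o2 = obfuscated_outputs(1.0)
recovered = aggregate(o1, o2)
```

Individually, `o1` and `o2` are dominated by the large mask; only their local sum matches the clean model.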

[LG-190] Gemini Physical World: Large Language Models Can Estimate the Intensity of Earthquake Shaking from Multi-Modal Social Media Posts

链接: https://arxiv.org/abs/2405.18732
作者: S. Mostafa Mousavi,Marc Stogaitis,Tajinder Gadh,Richard M Allen,Alexei Barski,Robert Bosch,Patrick Robertson,Nivetha Thiruverahan,Youngmin Cho
关键词: ground shaking intensity, CCTV footage, Modified Mercalli Intensity, paper presents, estimating the ground
类目: Geophysics (physics.geo-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Applied Physics (physics.app-ph)
*备注:

点击查看摘要

Abstract:This paper presents a novel approach for estimating the ground shaking intensity using social media data and CCTV footage. Employing the Gemini Pro (Reid et al. 2024) model, a multi-modal language model, we demonstrate the ability to extract relevant information from unstructured data utilizing generative AI and natural language processing. The model output, in the form of Modified Mercalli Intensity (MMI) values, aligns well with independent observational data. Furthermore, our results suggest that beyond its advanced visual and auditory understanding abilities, Gemini appears to utilize additional sources of knowledge, including a simplified understanding of the general relationship between earthquake magnitude, distance, and MMI intensity, which it presumably acquired during its training, in its reasoning and decision-making processes. These findings raise intriguing questions about the extent of Gemini’s general understanding of the physical world and its phenomena. The ability of Gemini to generate results consistent with established scientific knowledge highlights the potential of LLMs like Gemini in augmenting our understanding of complex physical phenomena such as earthquakes. More specifically, the results of this study highlight the potential of LLMs like Gemini to revolutionize citizen seismology by enabling rapid, effective, and flexible analysis of crowdsourced data from eyewitness accounts for assessing earthquake impact and providing crisis situational awareness. This approach holds great promise for improving early warning systems, disaster response, and overall resilience in earthquake-prone regions. This study provides a significant step toward harnessing the power of social media and AI for earthquake disaster mitigation.

[LG-191] Adapting Differential Molecular Representation with Hierarchical Prompts for Multi-label Property Prediction

链接: https://arxiv.org/abs/2405.18724
作者: Linjia Kang,Songhua Zhou,Shuyan Fang,Shichao Liu,Wen Zhang
关键词: molecular representation learning, molecular representation, representation learning, Representation Learning Framework, Prompted Molecular Representation
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurate prediction of molecular properties is critical in the field of drug discovery. However, existing methods do not fully consider the fact that molecules in the real world usually possess multiple property labels, and complex high-order relationships may exist among these labels. Therefore, molecular representation learning models should generate differential molecular representations that consider multi-granularity correlation information among tasks. To this end, our research introduces a Hierarchical Prompted Molecular Representation Learning Framework (HiPM), which enhances the differential expression of tasks in molecular representations through task-aware prompts, and utilizes shared information among labels to mitigate negative transfer between different tasks. HiPM primarily consists of two core components: the Molecular Representation Encoder (MRE) and the Task-Aware Prompter (TAP). The MRE employs a hierarchical message-passing network architecture to capture molecular features at both the atomic and motif levels, while the TAP uses agglomerative hierarchical clustering to build a prompt tree that reflects the affinity and distinctiveness of tasks, enabling the model to effectively handle the complexity of multi-label property predictions. Extensive experiments demonstrate that HiPM achieves state-of-the-art performance across various multi-label datasets, offering a new perspective on multi-label molecular representation learning.
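The TAP's prompt tree rests on agglomerative hierarchical clustering of tasks. A minimal single-linkage merge over toy task embeddings (the textbook algorithm, not HiPM's actual implementation) looks like:

```python
import numpy as np

# hypothetical embeddings of four property-prediction tasks (one row each)
emb = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])

def agglomerate(points):
    """Single-linkage agglomerative clustering; returns the merge order."""
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: distance between closest members
                d = min(np.linalg.norm(points[a] - points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merges.append((clusters[i], clusters[j]))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] \
            + [clusters[i] + clusters[j]]
    return merges

merges = agglomerate(emb)
```

Closely related tasks (rows 0/1 and 2/3) merge first, mirroring how a prompt tree groups tasks by affinity before higher-level prompts are attached.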

[LG-192] Rejection via Learning Density Ratios

链接: https://arxiv.org/abs/2405.18686
作者: Alexander Soen,Hisham Husain,Philip Schulz,Vu Nguyen
关键词: abstain from making, making predictions, learning paradigm, Classification, Classification with rejection
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Classification with rejection emerges as a learning paradigm which allows models to abstain from making predictions. The predominant approach is to alter the supervised learning pipeline by augmenting typical loss functions, letting model rejection incur a lower loss than an incorrect prediction. Instead, we propose a different distributional perspective, where we seek to find an idealized data distribution which maximizes a pretrained model’s performance. This can be formalized via the optimization of a loss’s risk with a \phi-divergence regularization term. Through this idealized distribution, a rejection decision can be made by utilizing the density ratio between this distribution and the data distribution. We focus on the setting where our \phi-divergences are specified by the family of \alpha-divergences. Our framework is tested empirically over clean and noisy datasets.
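The resulting rejection rule is simple once a density ratio is in hand. A toy sketch, with two hand-picked Gaussians standing in for the idealized and data distributions (both hypothetical, and the threshold `tau` is illustrative):

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    """Univariate normal density."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def predict_or_reject(x, tau=0.5):
    """Abstain when the idealized-to-data density ratio falls below tau.
    A narrow Gaussian stands in for the idealized distribution (where the
    pretrained model performs well), a wide one for the data distribution."""
    ratio = gauss_pdf(x, mu=0.0, sigma=1.0) / gauss_pdf(x, mu=0.0, sigma=3.0)
    return "predict" if ratio >= tau else "reject"
```

Near the mode the ratio exceeds 1 and the model predicts; far in the tails the idealized density vanishes faster than the data density, so the input is rejected.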

[LG-193] Improving Speech Decoding from ECoG with Self-Supervised Pretraining

链接: https://arxiv.org/abs/2405.18639
作者: Brian A. Yuan,Joseph G. Makin
关键词: intracranial brain-machine interfaces, Recent work, deep neural networks, high accuracy, essentially by treating
类目: Neurons and Cognition (q-bio.NC); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Recent work on intracranial brain-machine interfaces has demonstrated that spoken speech can be decoded with high accuracy, essentially by treating the problem as an instance of supervised learning and training deep neural networks to map from neural activity to text. However, such networks pay for their expressiveness with very large amounts of labeled data, a requirement that is particularly burdensome for invasive neural recordings acquired from human patients. On the other hand, these patients typically produce speech outside of the experimental blocks used for training decoders. Making use of such data, and data from other patients, to improve decoding would ease the burden of data collection – especially onerous for dys- and anarthric patients. Here we demonstrate that this is possible, by reengineering wav2vec – a simple, self-supervised, fully convolutional model that learns latent representations of audio using a noise-contrastive loss – for electrocorticographic (ECoG) data. We train this model on unlabelled ECoG recordings, and subsequently use it to transform ECoG from labeled speech sessions into wav2vec’s representation space, before finally training a supervised encoder-decoder to map these representations to text. We experiment with various numbers of labeled blocks; for almost all choices, the new representations yield superior decoding performance to the original ECoG data, and in no cases do they yield worse. Performance can also be improved in some cases by pretraining wav2vec on another patient’s data. In the best cases, wav2vec’s representations decrease word error rates over the original data by upwards of 50%.

[LG-194] Biclustering a dataset using photonic quantum computing

链接: https://arxiv.org/abs/2405.18622
作者: Ajinkya Borle,Ameya Bhave
关键词: Gaussian boson sampling, machine learning, learning and data, data mining, mining that seeks
类目: Quantum Physics (quant-ph); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Optics (physics.optics)
*备注: 32 pages, 5 figures, 6 tables

点击查看摘要

Abstract:Biclustering is a problem in machine learning and data mining that seeks to group together rows and columns of a dataset according to certain criteria. In this work, we highlight the natural relation that quantum computing models like boson sampling and Gaussian boson sampling (GBS) have to this problem. We first explore the use of boson sampling to identify biclusters based on matrix permanents. We then propose a heuristic that finds clusters in a dataset using Gaussian boson sampling by (i) converting the dataset into a bipartite graph and then (ii) running GBS to find the densest sub-graph(s) within the larger bipartite graph. Our simulations for the above proposed heuristics show promising results for future exploration in this area.
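Step (ii) can be mimicked classically: Charikar's greedy peeling heuristic for densest subgraph, shown below, is a conventional stand-in for the sub-graph search that the paper delegates to GBS:

```python
def densest_subgraph(edges, n):
    """Charikar's greedy peeling: repeatedly delete a minimum-degree vertex
    and keep the vertex set with the highest edge density seen."""
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    alive = set(range(n))
    m = len(edges)
    best, best_density = set(alive), m / len(alive)
    while len(alive) > 1:
        v = min(alive, key=lambda u: len(adj[u]))  # minimum-degree vertex
        for u in adj[v]:
            adj[u].discard(v)
        m -= len(adj[v])  # drop v's incident edges
        adj[v] = set()
        alive.discard(v)
        density = m / len(alive)
        if density > best_density:
            best, best_density = set(alive), density
    return best

# a 4-clique plus one pendant vertex: the clique is the densest part
sub = densest_subgraph([(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (0, 4)], 5)
```

This 2-approximation runs in near-linear time; the paper's premise is that GBS samples dense subgraphs of the bipartite data graph natively.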

[LG-195] From Conformal Predictions to Confidence Regions

链接: https://arxiv.org/abs/2405.18601
作者: Charles Guille-Escuret,Eugene Ndiaye
关键词: significantly advanced, advanced the quantification, quantification of uncertainties, uncertainties in predictive, Conformal prediction
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Conformal prediction methodologies have significantly advanced the quantification of uncertainties in predictive models. Yet, the construction of confidence regions for model parameters presents a notable challenge, often necessitating stringent assumptions regarding data distribution or merely providing asymptotic guarantees. We introduce a novel approach termed CCR, which employs a combination of conformal prediction intervals for the model outputs to establish confidence regions for model parameters. We present coverage guarantees that hold under minimal assumptions on the noise and are valid in the finite-sample regime. Our approach is applicable to both split conformal predictions and black-box methodologies including full or cross-conformal approaches. In the specific case of linear models, the derived confidence region manifests as the feasible set of a Mixed-Integer Linear Program (MILP), facilitating the deduction of confidence intervals for individual parameters and enabling robust optimization. We empirically compare CCR to recent advancements in challenging settings such as with heteroskedastic and non-Gaussian noise.
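For orientation, the split conformal prediction intervals that CCR combines can be computed as follows. This is the standard split-conformal recipe on absolute residuals, not CCR itself:

```python
import numpy as np

def split_conformal_interval(cal_residuals, y_hat, alpha=0.1):
    """Split conformal: the (1 - alpha) adjusted quantile of calibration
    residuals |y_i - f(x_i)| gives the half-width of the prediction band."""
    n = len(cal_residuals)
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(cal_residuals, level)
    return y_hat - q, y_hat + q

rng = np.random.default_rng(1)
# calibration residuals from a hypothetical fitted model, here |N(0, 1)| noise
residuals = np.abs(rng.normal(size=999))
lo, hi = split_conformal_interval(residuals, y_hat=2.0)
```

On fresh points with the same noise, the interval covers roughly a 1 − alpha fraction; CCR intersects such output intervals to carve out a region in parameter space.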

[LG-196] A Margin-based Multiclass Generalization Bound via Geometric Complexity

链接: https://arxiv.org/abs/2405.18590
作者: Michael Munn,Benoit Dherin,Javier Gonzalvo
关键词: deep neural networks, neural networks, considerable effort, capabilities of deep, unlock a theoretical
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Accepted as an ICML 2023 workshop paper (Topology, Algebra and Geometry in Machine Learning)

点击查看摘要

Abstract:There has been considerable effort to better understand the generalization capabilities of deep neural networks both as a means to unlock a theoretical understanding of their success as well as providing directions for further improvements. In this paper, we investigate margin-based multiclass generalization bounds for neural networks which rely on a recent complexity measure, the geometric complexity, developed for neural networks. We derive a new upper bound on the generalization error which scales with the margin-normalized geometric complexity of the network and which holds for a broad family of data distributions and model classes. Our generalization bound is empirically investigated for a ResNet-18 model trained with SGD on the CIFAR-10 and CIFAR-100 datasets with both original and random labels.

[LG-197] Single-loop Stochastic Algorithms for Difference of Max-Structured Weakly Convex Functions

链接: https://arxiv.org/abs/2405.18577
作者: Quanqi Hu,Qi Qi,Zhaosong Lu,Tianbao Yang
关键词: weakly convex functions, strongly concave functions, non-smooth non-convex problems, max, weakly convex
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:In this paper, we study a class of non-smooth non-convex problems in the form of \min_x [\max_{y\in Y}\phi(x, y) - \max_{z\in Z}\psi(x, z)], where both \Phi(x) = \max_{y\in Y}\phi(x, y) and \Psi(x) = \max_{z\in Z}\psi(x, z) are weakly convex functions, and \phi(x, y), \psi(x, z) are strongly concave functions in terms of y and z, respectively. It covers two families of problems that have been studied but are missing single-loop stochastic algorithms, i.e., difference of weakly convex functions and weakly convex strongly-concave min-max problems. We propose a stochastic Moreau envelope approximate gradient method dubbed SMAG, the first single-loop algorithm for solving these problems, and provide a state-of-the-art non-asymptotic convergence rate. The key idea of the design is to compute an approximate gradient of the Moreau envelopes of \Phi and \Psi using only one step of stochastic gradient update of the primal and dual variables. Empirically, we conduct experiments on positive-unlabeled (PU) learning and partial area under ROC curve (pAUC) optimization with an adversarial fairness regularizer to validate the effectiveness of our proposed algorithms.
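A minimal deterministic sketch of the single-loop idea on a quadratic toy problem, where Phi(x) - Psi(x) = 3x^2/8 is minimized at x = 0. This illustrates only the one-step primal/dual updates, not the full SMAG method with Moreau envelopes and stochastic gradients:

```python
# phi(x, y) = x*y - y**2/2      -> Phi(x) = max_y phi = x**2/2
# psi(x, z) = 0.5*x*z - z**2/2  -> Psi(x) = max_z psi = x**2/8
# objective Phi(x) - Psi(x) = 3*x**2/8, minimized at x = 0
eta = 0.05
x, y, z = 2.0, 0.0, 0.0
for _ in range(5000):
    y += eta * (x - y)        # one ascent step on the dual variable y
    z += eta * (0.5 * x - z)  # one ascent step on the dual variable z
    x -= eta * (y - 0.5 * z)  # descent step using grad_x phi - grad_x psi
```

Because y and z track their argmax values (x and x/2), the x-update approximates the true gradient (3/4)x of the outer objective without an inner optimization loop.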

[LG-198] Large Margin Discriminative Loss for Classification

链接: https://arxiv.org/abs/2405.18499
作者: Hai-Vy Nguyen,Fabrice Gamboa,Sixin Zhang,Reda Chhaibi,Serge Gratton,Thierry Giaccone
关键词: Deep Learning, context of Deep, discriminative loss function, loss, Learning
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this paper, we introduce a novel discriminative loss function with large margin in the context of Deep Learning. This loss boosts the discriminative power of neural nets, represented by intra-class compactness and inter-class separability. On the one hand, the class compactness is ensured by close distance of samples of the same class to each other. On the other hand, the inter-class separability is boosted by a margin loss that ensures the minimum distance of each class to its closest boundary. All the terms in our loss have an explicit meaning, giving a direct view of the feature space obtained. We analyze mathematically the relation between compactness and margin term, giving a guideline about the impact of the hyper-parameters on the learned features. Moreover, we also analyze properties of the gradient of the loss with respect to the parameters of the neural net. Based on this, we design a strategy called partial momentum updating that enjoys simultaneously stability and consistency in training. Furthermore, we also investigate generalization errors to have better theoretical insights. Our loss function systematically boosts the test accuracy of models compared to the standard softmax loss in our experiments.
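A crude centroid-based analogue of the two loss terms (intra-class compactness plus a hinge on inter-class separation) can be written in a few lines. The paper's actual margin term is defined against class boundaries rather than centroids, so this is only illustrative:

```python
import numpy as np

def discriminative_loss(feats, labels, margin=1.0, lam=0.5):
    """Toy surrogate: intra-class compactness plus a hinge penalty that
    pushes class centroids at least `margin` apart."""
    classes = np.unique(labels)
    centers = np.array([feats[labels == c].mean(axis=0) for c in classes])
    # compactness: squared distance of samples to their class centroid
    compact = np.mean([np.sum((feats[labels == c] - centers[i]) ** 2)
                       for i, c in enumerate(classes)])
    # separability: hinge on pairwise centroid distances
    sep = 0.0
    for i in range(len(classes)):
        for j in range(i + 1, len(classes)):
            d = np.linalg.norm(centers[i] - centers[j])
            sep += max(0.0, margin - d) ** 2
    return compact + lam * sep

labels = np.array([0, 0, 1, 1])
separated = np.array([[0.0, 0.0], [0.0, 0.1], [5.0, 5.0], [5.0, 5.1]])
crowded = np.array([[0.0, 0.0], [0.0, 0.1], [0.2, 0.0], [0.2, 0.1]])
```

Well-separated features incur no hinge penalty, so the loss prefers compact, mutually distant classes, the behavior the paper formalizes and analyzes.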

[LG-199] Predicting Ground State Properties: Constant Sample Complexity and Deep Learning Algorithms

链接: https://arxiv.org/abs/2405.18489
作者: Marc Wanner,Laura Lewis,Chiranjib Bhattacharyya,Devdatt Dubhashi,Alexandru Gheorghiu
关键词: quantum many-body physics, finding ground states, ground state properties, learning ground states, ground state
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注: 11 pages, 7 figures + 36-page appendix

点击查看摘要

Abstract:A fundamental problem in quantum many-body physics is that of finding ground states of local Hamiltonians. A number of recent works gave provably efficient machine learning (ML) algorithms for learning ground states. Specifically, [Huang et al. Science 2022], introduced an approach for learning properties of the ground state of an n-qubit gapped local Hamiltonian H from only n^{\mathcal{O}(1)} data points sampled from Hamiltonians in the same phase of matter. This was subsequently improved by [Lewis et al. Nature Communications 2024], to \mathcal{O}(\log n) samples when the geometry of the n-qubit system is known. In this work, we introduce two approaches that achieve a constant sample complexity, independent of system size n, for learning ground state properties. Our first algorithm consists of a simple modification of the ML model used by Lewis et al. and applies to a property of interest known beforehand. Our second algorithm, which applies even if a description of the property is not known, is a deep neural network model. While empirical results showing the performance of neural networks have been demonstrated, to our knowledge, this is the first rigorous sample complexity bound on a neural network model for predicting ground state properties. We also perform numerical experiments that confirm the improved scaling of our approach compared to earlier results.

[LG-200] Symbolic Regression for Beyond the Standard Model Physics

链接: https://arxiv.org/abs/2405.18471
作者: Shehu AbdusSalam,Steve Abel,Miguel Crispim Romao
关键词: Standard Model physics, Supersymmetric Standard Model, Minimal Supersymmetric Standard, propose symbolic regression, Standard Model
类目: High Energy Physics - Phenomenology (hep-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); High Energy Physics - Theory (hep-th); Computational Physics (physics.comp-ph)
*备注: 6 pages, 7 figures, for associated code and symbolic expressions see this https URL

点击查看摘要

Abstract:We propose symbolic regression as a powerful tool for studying Beyond the Standard Model physics. As a benchmark model, we consider the so-called Constrained Minimal Supersymmetric Standard Model, which has a four-dimensional parameter space defined at the GUT scale. We provide a set of analytical expressions that reproduce three low-energy observables of interest in terms of the parameters of the theory: the Higgs mass, the contribution to the anomalous magnetic moment of the muon, and the cold dark matter relic density. To demonstrate the power of the approach, we employ the symbolic expressions in a global fits analysis to derive the posterior probability densities of the parameters, which are obtained extremely rapidly in comparison with conventional methods.

[LG-201] Adaptive Multiscale Retinal Diagnosis: A Hybrid Trio-Model Approach for Comprehensive Fundus Multi-Disease Detection Leveraging Transfer Learning and Siamese Networks

链接: https://arxiv.org/abs/2405.18449
作者: Yavuz Selim Inan
关键词: billion people worldwide, visual disorders, media haze, people worldwide, worldwide are suffering
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:WHO has declared that more than 2.2 billion people worldwide are suffering from visual disorders, such as media haze, glaucoma, and drusen. At least 1 billion of these cases could have been either prevented or successfully treated, yet they remain unaddressed due to poverty, a lack of specialists, inaccurate ocular fundus diagnoses by ophthalmologists, or the presence of a rare disease. To address this, the research has developed the Hybrid Trio-Network Model Algorithm for accurately diagnosing 12 distinct common and rare eye diseases. This algorithm utilized the RFMiD dataset of 3,200 fundus images and the Binary Relevance Method to detect diseases separately, ensuring expandability and avoiding incorrect correlations. Each detector, incorporating finely tuned hyperparameters to optimize performance, consisted of three feature components: A classical transfer learning CNN model, a two-stage CNN model, and a Siamese Network. The diagnosis was made using features extracted through this Trio-Model with Ensembled Machine Learning algorithms. The proposed model achieved an average accuracy of 97% and an AUC score of 0.96. Compared to past benchmark studies, an increase of over 10% in the F1-score was observed for most diseases. Furthermore, using the Siamese Network, the model successfully made predictions in diseases like optic disc pallor, which past studies failed to predict due to low confidence. This diagnostic tool presents a stable, adaptive, cost-effective, efficient, accessible, and fast solution for globalizing early detection of both common and rare diseases.

[LG-202] Discovering deposition process regimes: leveraging unsupervised learning for process insights surrogate modeling and sensitivity analysis

链接: https://arxiv.org/abs/2405.18444
作者: Geremy Loachamín Suntaxi,Paris Papavasileiou,Eleni D. Koronaki,Dimitrios G. Giovanis,Georgios Gakis,Ioannis G. Aviziotis,Martin Kathrein,Gabriele Pozzetti,Christoph Czettl,Stéphane P.A. Bordas,Andreas G. Boudouvis
关键词: Chemical Vapor Deposition, Chemical Vapor, Vapor Deposition, comprehensive approach utilizing, deposition process regimes
类目: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG); Systems and Control (eess.SY); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:This work introduces a comprehensive approach utilizing data-driven methods to elucidate the deposition process regimes in Chemical Vapor Deposition (CVD) reactors and the interplay of the physical mechanisms that dominate in each of them. Through this work, we address three key objectives. Firstly, our methodology relies on process outcomes, derived from a detailed CFD model, to identify clusters of “outcomes” corresponding to distinct process regimes, wherein the relative influence of input variables undergoes notable shifts. This phenomenon is experimentally validated through Arrhenius plot analysis, affirming the efficacy of our approach. Secondly, we demonstrate the development of an efficient surrogate model, based on Polynomial Chaos Expansion (PCE), that maintains accuracy, facilitating streamlined computational analyses. Finally, as a result of PCE, sensitivity analysis is made possible by means of Sobol’ indices, that quantify the impact of process inputs across identified regimes. The insights gained from our analysis contribute to the formulation of hypotheses regarding phenomena occurring beyond the transition regime. Notably, the significance of temperature even in the diffusion-limited regime, as evidenced by the Arrhenius plot, suggests activation of gas phase reactions at elevated temperatures. Importantly, our proposed methods yield insights that align with experimental observations and theoretical principles, aiding decision-making in process design and optimization. By circumventing the need for costly and time-consuming experiments, our approach offers a pragmatic pathway towards enhanced process efficiency. Moreover, this study underscores the potential of data-driven computational methods for innovating reactor design paradigms.

信息检索

[IR-0] A Multi-Source Retrieval Question Answering Framework Based on RAG

链接: https://arxiv.org/abs/2405.19207
作者: Ridong Wu,Shuhong Chen,Xiangbiao Su,Yuankai Zhu,Yifei Liao,Jianming Wu
关键词: large-scale language models, Retrieval-Augmented Generation, language models, widely adopted, retrieval information
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注: 4 pages,3 figures

点击查看摘要

Abstract:With the rapid development of large-scale language models, Retrieval-Augmented Generation (RAG) has been widely adopted. However, existing RAG paradigms are inevitably influenced by erroneous retrieval information, thereby reducing the reliability and correctness of generated results. Therefore, to improve the relevance of retrieval information, this study proposes a method that replaces traditional retrievers with GPT-3.5, leveraging its vast corpus knowledge to generate retrieval information. We also propose a web-retrieval-based method to implement fine-grained knowledge retrieval, utilizing the powerful reasoning capability of GPT-3.5 to realize semantic partitioning. To mitigate the hallucination of GPT retrieval and reduce noise in web retrieval, we propose a multi-source retrieval framework, named MSRAG, which combines GPT retrieval with web retrieval. Experiments on multiple knowledge-intensive QA datasets demonstrate that the proposed framework performs better than existing RAG frameworks in enhancing the overall efficiency and accuracy of QA systems.
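The abstract does not spell out how the two retrieval sources are fused. A common baseline for merging ranked result lists, reciprocal-rank fusion, gives the flavor (the document names below are hypothetical):

```python
def rrf_merge(ranked_lists, k=60):
    """Reciprocal-rank fusion: each source contributes 1/(k + rank) per
    document, so documents found by several sources float to the top."""
    scores = {}
    for lst in ranked_lists:
        for rank, doc in enumerate(lst, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

gpt_hits = ["doc_a", "doc_b", "doc_c"]  # hypothetical GPT-generated retrieval
web_hits = ["doc_b", "doc_d", "doc_a"]  # hypothetical web retrieval
merged = rrf_merge([gpt_hits, web_hits])
```

Here `doc_b`, ranked well by both sources, outranks `doc_a`; single-source hits keep their relative order further down.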

[IR-1] Learning from Litigation: Graphs and LLMs for Retrieval and Reasoning in eDiscovery

链接: https://arxiv.org/abs/2405.19164
作者: Sounak Lahiri,Sumit Pai,Tim Weninger,Sanmitra Bhattacharya
关键词: involves identifying relevant, vast collection based, legal production requests, identifying relevant documents, Electronic Discovery
类目: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注: 8 pages, 2 tables, 6 figures

点击查看摘要

Abstract:Electronic Discovery (eDiscovery) involves identifying relevant documents from a vast collection based on legal production requests. The integration of artificial intelligence (AI) and natural language processing (NLP) has transformed this process, streamlining document review and enhancing efficiency and cost-effectiveness. Although traditional approaches like BM25 or fine-tuned pre-trained models are common in eDiscovery, they face performance, computational, and interpretability challenges. In contrast, Large Language Model (LLM)-based methods prioritize interpretability but sacrifice performance and throughput. This paper introduces DISCOvery Graph (DISCOG), a hybrid approach that combines the strengths of both worlds: a heterogeneous graph-based method for accurate document relevance prediction and a subsequent LLM-driven approach for reasoning. Graph representational learning generates embeddings and predicts links, ranking the corpus for a given request, and the LLMs provide reasoning for document relevance. Our approach handles datasets with balanced and imbalanced distributions, outperforming baselines in F1-score, precision, and recall by an average of 12%, 3%, and 16%, respectively. In an enterprise context, our approach drastically reduces document review costs by 99.9% compared to manual processes and by 95% compared to LLM-based classification methods.

[IR-2] CaLa: Complementary Association Learning for Augmenting Composed Image Retrieval

链接: https://arxiv.org/abs/2405.19149
作者: Xintong Jiang,Yaxiong Wang,Mengjian Li,Yujiao Wu,Bingwen Hu,Xueming Qian
关键词: image-text pair query, Composed Image Retrieval, involves searching, image-text pair, target images based
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注: arXiv admin note: text overlap with arXiv:2309.02169

点击查看摘要

Abstract:Composed Image Retrieval (CIR) involves searching for target images based on an image-text pair query. While current methods treat this as a query-target matching problem, we argue that CIR triplets contain additional associations beyond this primary relation. In our paper, we identify two new relations within triplets, treating each triplet as a graph node. Firstly, we introduce the concept of text-bridged image alignment, where the query text serves as a bridge between the query image and the target image. We propose a hinge-based cross-attention mechanism to incorporate this relation into network learning. Secondly, we explore complementary text reasoning, considering CIR as a form of cross-modal retrieval where two images compose to reason about complementary text. To integrate these perspectives effectively, we design a twin attention-based compositor. By combining these complementary associations with the explicit query pair-target image relation, we establish a comprehensive set of constraints for CIR. Our framework, CaLa (Complementary Association Learning for Augmenting Composed Image Retrieval), leverages these insights. We evaluate CaLa on CIRR and FashionIQ benchmarks with multiple backbones, demonstrating its superiority in composed image retrieval.

[IR-3] Multi-stage Retrieve and Re-rank Model for Automatic Medical Coding Recommendation

链接: https://arxiv.org/abs/2405.19093
作者: Xindi Wang,Robert E. Mercer,Frank Rudzicz
关键词: classification system encompassing, definitive medical classification, medical classification system, International Classification, range of diseases
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
*备注: Accepted to NAACL 2024 – camera-ready version

点击查看摘要

Abstract:The International Classification of Diseases (ICD) serves as a definitive medical classification system encompassing a wide range of diseases and conditions. The primary objective of ICD indexing is to allocate a subset of ICD codes to a medical record, which facilitates standardized documentation and management of various health conditions. Most existing approaches have suffered from selecting the proper label subsets from an extremely large ICD collection with a heavy long-tailed label distribution. In this paper, we leverage a multi-stage “retrieve and re-rank” framework as a novel solution to ICD indexing, via a hybrid discrete retrieval method, and re-rank retrieved candidates with contrastive learning that allows the model to make more accurate predictions from a simplified label space. The retrieval model is a hybrid of auxiliary knowledge of the electronic health records (EHR) and a discrete retrieval method (BM25), which efficiently collects high-quality candidates. In the last stage, we propose a label co-occurrence guided contrastive re-ranking model, which re-ranks the candidate labels by pulling together the clinical notes with positive ICD codes. Experimental results show the proposed method achieves state-of-the-art performance on a number of measures on the MIMIC-III benchmark.
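The discrete first-stage retriever mentioned above (BM25) can be sketched in a few lines. The documents and query below are made-up stand-ins for clinical notes and code descriptions, not the paper's MIMIC-III pipeline:

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Toy BM25 scorer: returns one relevance score per document."""
    tokenized = [d.lower().split() for d in docs]
    N = len(docs)
    avgdl = sum(len(d) for d in tokenized) / N
    df = Counter(t for d in tokenized for t in set(d))  # document frequency
    scores = []
    for d in tokenized:
        tf = Counter(d)
        s = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation (k1) and length normalization (b).
            s += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

docs = ["diabetes mellitus type two",
        "acute kidney failure",
        "type two diabetes with kidney disease"]
scores = bm25_scores("type two diabetes", docs)
best = max(range(len(docs)), key=scores.__getitem__)
```

The shorter document matching all query terms wins, since BM25 penalizes longer documents for the same term counts.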

[IR-4] An engine not a camera: Measuring performative power of online search

链接: https://arxiv.org/abs/2405.19073
作者: Celestine Mendler-Dünner,Gabriele Carovano,Moritz Hardt
关键词: major ongoing policy, performative power, regulatory efforts, power, center of major
类目: Computers and Society (cs.CY); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:The power of digital platforms is at the center of major ongoing policy and regulatory efforts. To advance existing debates, we designed and executed an experiment to measure the power of online search providers, building on the recent definition of performative power. Instantiated in our setting, performative power quantifies the ability of a search engine to steer web traffic by rearranging results. To operationalize this definition we developed a browser extension that performs unassuming randomized experiments in the background. These randomized experiments emulate updates to the search algorithm and identify the causal effect of different content arrangements on clicks. We formally relate these causal effects to performative power. Analyzing tens of thousands of clicks, we discuss what our robust quantitative findings say about the power of online search engines. More broadly, we envision our work to serve as a blueprint for how performative power and online experiments can be integrated with future investigations into the economic power of digital platforms.
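The randomized-experiment logic described above reduces to a difference in click rates between the control arrangement and the swapped arrangement. The simulation below uses made-up click probabilities, not the study's data; a large gap would be a signature of performative power:

```python
import random

def average_treatment_effect(logs):
    """Causal effect of rearranging results on clicks, from randomized logs.
    Each entry is (arm, clicked_original_top_result)."""
    control = [c for arm, c in logs if arm == "original"]
    treated = [c for arm, c in logs if arm == "swapped"]
    return sum(treated) / len(treated) - sum(control) / len(control)

random.seed(0)
# Hypothetical behavior: users click the top slot largely regardless of
# content, so demoting the original top result sharply reduces its clicks.
logs = [("original", random.random() < 0.6) for _ in range(5000)] + \
       [("swapped", random.random() < 0.2) for _ in range(5000)]
ate = average_treatment_effect(logs)
```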

[IR-5] Continual Collaborative Distillation for Recommender System

链接: https://arxiv.org/abs/2405.19046
作者: Gyuseok Lee,SeongKu Kang,Wonbin Kweon,Hwanjo Yu
关键词: deploying large-scale recommender, large-scale recommender systems, promising technique, technique for addressing, deploying large-scale
类目: Information Retrieval (cs.IR)
*备注: KDD 2024. 9 pages + appendix (1 page). 5 figures

点击查看摘要

Abstract:Knowledge distillation (KD) has emerged as a promising technique for addressing the computational challenges associated with deploying large-scale recommender systems. KD transfers the knowledge of a massive teacher system to a compact student model, to reduce the huge computational burdens for inference while retaining high accuracy. The existing KD studies primarily focus on one-time distillation in static environments, leaving a substantial gap in their applicability to real-world scenarios dealing with continuously incoming users, items, and their interactions. In this work, we delve into a systematic approach to operating the teacher-student KD in a non-stationary data stream. Our goal is to enable efficient deployment through a compact student, which preserves the high performance of the massive teacher, while effectively adapting to continuously incoming data. We propose the Continual Collaborative Distillation (CCD) framework, where both the teacher and the student continually and collaboratively evolve along the data stream. CCD facilitates the student in effectively adapting to new data, while also enabling the teacher to fully leverage accumulated knowledge. We validate the effectiveness of CCD through extensive quantitative, ablative, and exploratory experiments on two real-world datasets. We expect this research direction to contribute to narrowing the gap between existing KD studies and practical applications, thereby enhancing the applicability of KD in real-world systems.
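The teacher-to-student transfer at the core of KD, which CCD extends to a non-stationary stream, is typically a KL divergence between temperature-softened score distributions. A generic sketch (the temperature and logits are illustrative, not CCD's actual objective):

```python
import numpy as np

def softmax(z, t=1.0):
    """Numerically stable softmax with temperature t."""
    z = np.asarray(z, dtype=float) / t
    e = np.exp(z - z.max())
    return e / e.sum()

def distillation_loss(teacher_logits, student_logits, t=2.0):
    """KL(teacher || student) on temperature-softened item scores,
    scaled by t^2 as is conventional so gradients stay comparable."""
    p = softmax(teacher_logits, t)
    q = softmax(student_logits, t)
    return float(np.sum(p * np.log(p / q))) * t * t

loss_same = distillation_loss([2.0, 1.0, 0.1], [2.0, 1.0, 0.1])
loss_diff = distillation_loss([2.0, 1.0, 0.1], [0.1, 1.0, 2.0])
```

The loss is zero when the student already matches the teacher's ranking distribution and grows as their score profiles diverge.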

[IR-6] SynerGraph: An Integrated Graph Convolution Network for Multimodal Recommendation

链接: https://arxiv.org/abs/2405.19031
作者: Mert Burabak,Tevfik Aytekin
关键词: focusing on integrating, article presents, integrating and purifying, multimodal recommendation systems, purifying multimodal data
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:This article presents a novel approach to multimodal recommendation systems, focusing on integrating and purifying multimodal data. Our methodology starts by developing a filter to remove noise from various types of data, making the recommendations more reliable. We studied the impact of top-K sparsification on different datasets, finding optimal values that strike a balance between underfitting and overfitting concerns. The study emphasizes the significant role of textual information compared to visual data in providing a deep understanding of items. We conducted sensitivity analyses to understand how different modalities and the use of purifier circle loss affect the efficiency of the model. The findings indicate that systems that incorporate multiple modalities perform better than those relying on just one modality. Our approach highlights the importance of modality purifiers in filtering out irrelevant data, ensuring that user preferences remain relevant. Models without modality purifiers showed reduced performance, emphasizing the need for effective integration of pre-extracted features. The proposed model, which includes a novel self-supervised auxiliary task, shows promise in accurately capturing user preferences. The main goal of the fusion technique is to enhance the modeling of user preferences by combining knowledge with item information, utilizing sophisticated language models. Extensive experiments show that our model produces better results than the existing state-of-the-art multimodal recommendation systems.
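The top-K sparsification studied above keeps only each item's K strongest similarities and zeros out the rest. A minimal sketch on a small similarity matrix (the values are illustrative, not drawn from the paper's datasets):

```python
import numpy as np

def topk_sparsify(sim, k):
    """Keep the k largest entries in each row of a similarity matrix;
    everything else is set to zero."""
    out = np.zeros_like(sim)
    idx = np.argsort(sim, axis=1)[:, -k:]      # indices of the k largest per row
    rows = np.arange(sim.shape[0])[:, None]
    out[rows, idx] = sim[rows, idx]
    return out

sim = np.array([[0.9, 0.1, 0.5],
                [0.2, 0.8, 0.3],
                [0.4, 0.6, 0.7]])
sparse = topk_sparsify(sim, k=2)
```

Too small a k prunes genuine item relations (underfitting the graph); too large a k lets noisy weak similarities back in, which is the trade-off the paper tunes per dataset.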

[IR-7] Evaluating the External and Parametric Knowledge Fusion of Large Language Models

链接: https://arxiv.org/abs/2405.19010
作者: Hao Zhang,Yuyang Zhang,Xiaoguang Li,Wenxuan Shi,Haonan Xu,Huanshuo Liu,Yasheng Wang,Lifeng Shang,Qun Liu,Yong Liu,Ruiming Tang
关键词: Integrating external knowledge, large language models, static parametric memory, Integrating external, parametric knowledge
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注: 15 pages, 3 figures, 3 tables

点击查看摘要

Abstract:Integrating external knowledge into large language models (LLMs) presents a promising solution to overcome the limitations imposed by their antiquated and static parametric memory. Prior studies, however, have tended to over-rely on external knowledge, underestimating the valuable contributions of an LLM’s intrinsic parametric knowledge. The efficacy of LLMs in blending external and parametric knowledge remains largely unexplored, especially in cases where external knowledge is incomplete and necessitates supplementation by their parametric knowledge. We propose to deconstruct knowledge fusion into four distinct scenarios, offering the first thorough investigation of LLM behavior across each. We develop a systematic pipeline for data construction and knowledge infusion to simulate these fusion scenarios, facilitating a series of controlled experiments. Our investigation reveals that enhancing parametric knowledge within LLMs can significantly bolster their capability for knowledge integration. Nonetheless, we identify persistent challenges in memorizing and eliciting parametric knowledge, and determining parametric knowledge boundaries. Our findings aim to steer future explorations on harmonizing external and parametric knowledge within LLMs.

[IR-8] Mitigate Position Bias with Coupled Ranking Bias on CTR Prediction

链接: https://arxiv.org/abs/2405.18971
作者: Yao Zhao,Zhining Liu,Tianchi Cai,Haipeng Zhang,Chenyi Zhuang,Jinjie Gu
关键词: recommender system literature, position bias estimation, Position bias, placing position, ranking bias
类目: Information Retrieval (cs.IR)
*备注: 5 pages, 3 figures

点击查看摘要

Abstract:Position bias, i.e., users’ preference of an item is affected by its placing position, is well studied in the recommender system literature. However, most existing methods ignore the widely coupled ranking bias, which is also related to the placing position of the item. Using both synthetic and industrial datasets, we first show how this widely coexisted ranking bias deteriorates the performance of the existing position bias estimation methods. To mitigate the position bias with the presence of the ranking bias, we propose a novel position bias estimation method, namely gradient interpolation, which fuses two estimation methods using a fusing weight. We further propose an adaptive method to automatically determine the optimal fusing weight. Extensive experiments on both synthetic and industrial datasets demonstrate the superior performance of the proposed methods.
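The fusion step itself is a convex combination of two per-position bias estimates. The curves below are made-up click propensities, and the paper's adaptive selection of the fusing weight is not shown:

```python
def fuse_estimates(est_a, est_b, w):
    """Fuse two per-position bias estimates with a weight w in [0, 1],
    in the spirit of the 'gradient interpolation' idea (illustrative only)."""
    return [w * a + (1 - w) * b for a, b in zip(est_a, est_b)]

# Two hypothetical position-bias curves (click propensity per slot):
examination_model = [1.0, 0.6, 0.4]
ranking_aware = [1.0, 0.8, 0.7]
fused = fuse_estimates(examination_model, ranking_aware, w=0.5)
```

At w=1 the fused curve reduces to the first estimator and at w=0 to the second; the adaptive method searches for the weight between those extremes that best fits the observed clicks.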

[IR-9] Content-Agnostic Moderation for Stance-Neutral Recommendation

链接: https://arxiv.org/abs/2405.18941
作者: Nan Li,Bo Kang,Tijl De Bie
关键词: exacerbating opinion polarization, Personalized recommendation systems, Personalized recommendation, exacerbating opinion, moderation
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Personalized recommendation systems often drive users towards more extreme content, exacerbating opinion polarization. While (content-aware) moderation has been proposed to mitigate these effects, such approaches risk curtailing the freedom of speech and of information. To address this concern, we propose and explore the feasibility of content-agnostic moderation as an alternative approach for reducing polarization. Content-agnostic moderation does not rely on the actual content being moderated, arguably making it less prone to forms of censorship. We establish theoretically that content-agnostic moderation cannot be guaranteed to work in a fully generic setting. However, we show that it can often be effectively achieved in practice with plausible assumptions. We introduce two novel content-agnostic moderation methods that modify the recommendations from the content recommender to disperse user-item co-clusters without relying on content features. To evaluate the potential of content-agnostic moderation in controlled experiments, we built a simulation environment to analyze the closed-loop behavior of a system with a given set of users, recommendation system, and moderation approach. Through comprehensive experiments in this environment, we show that our proposed moderation methods significantly enhance stance neutrality and maintain high recommendation quality across various data scenarios. Our results indicate that achieving stance neutrality without direct content information is not only feasible but can also help in developing more balanced and informative recommendation systems without substantially degrading user engagement.

[IR-10] Leveraging Many-To-Many Relationships for Defending Against Visual-Language Adversarial Attacks

链接: https://arxiv.org/abs/2405.18770
作者: Futa Waseda,Antonio Tejero-de-Pablos
关键词: Recent studies, revealed that vision-language, ITR, adversarial attacks, adversarial
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注: Under review

点击查看摘要

Abstract:Recent studies have revealed that vision-language (VL) models are vulnerable to adversarial attacks for image-text retrieval (ITR). However, existing defense strategies for VL models primarily focus on zero-shot image classification, which do not consider the simultaneous manipulation of image and text, as well as the inherent many-to-many (N:N) nature of ITR, where a single image can be described in numerous ways, and vice versa. To this end, this paper studies defense strategies against adversarial attacks on VL models for ITR for the first time. Particularly, we focus on how to leverage the N:N relationship in ITR to enhance adversarial robustness. We found that, although adversarial training easily overfits to specific one-to-one (1:1) image-text pairs in the train data, diverse augmentation techniques to create one-to-many (1:N) / many-to-one (N:1) image-text pairs can significantly improve adversarial robustness in VL models. Additionally, we show that the alignment of the augmented image-text pairs is crucial for the effectiveness of the defense strategy, and that inappropriate augmentations can even degrade the model’s performance. Based on these findings, we propose a novel defense strategy that leverages the N:N relationship in ITR, which effectively generates diverse yet highly-aligned N:N pairs using basic augmentations and generative model-based augmentations. This work provides a novel perspective on defending against adversarial attacks in VL tasks and opens up new research directions for future work.

[IR-11] CtrlA: Adaptive Retrieval-Augmented Generation via Probe-Guided Control

链接: https://arxiv.org/abs/2405.18727
作者: Huanshuo Liu,Hao Zhang,Zhijiang Guo,Kuicai Dong,Xiangyang Li,Yi Quan Lee,Cong Zhang,Yong Liu
关键词: large language models, Adaptive RAG, adaptive RAG methods, existing adaptive RAG, Retrieval-augmented generation
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注: 28 pages, 7 figures, 9 tables

点击查看摘要

Abstract:Retrieval-augmented generation (RAG) has emerged as a promising solution for mitigating hallucinations of large language models (LLMs) with retrieved external knowledge. Adaptive RAG enhances this approach by dynamically assessing the retrieval necessity, aiming to balance external and internal knowledge usage. However, existing adaptive RAG methods primarily realize retrieval on demand by relying on superficially verbalize-based or probability-based feedback of LLMs, or directly fine-tuning LLMs via carefully crafted datasets, resulting in unreliable retrieval necessity decisions, heavy extra costs, and sub-optimal response generation. We present the first attempts to delve into the internal states of LLMs to mitigate such issues by introducing an effective probe-guided adaptive RAG framework, termed CtrlA. Specifically, CtrlA employs an honesty probe to regulate the LLM’s behavior by manipulating its representations for increased honesty, and a confidence probe to monitor the internal states of LLM and assess confidence levels, determining the retrieval necessity during generation. Experiments show that CtrlA is superior to existing adaptive RAG methods on a diverse set of tasks, the honesty control can effectively make LLMs more honest and confidence monitoring is proven to be a promising indicator of retrieval trigger. Our codes are available at this https URL.

[IR-12] Cognitive Evolutionary Learning to Select Feature Interactions for Recommender Systems

链接: https://arxiv.org/abs/2405.18708
作者: Runlong Yu,Qixiang Shao,Qi Liu,Huan Liu,Enhong Chen
关键词: commercial recommender systems, recommender systems, fundamental problem, problem in commercial, commercial recommender
类目: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:Feature interaction selection is a fundamental problem in commercial recommender systems. Most approaches equally enumerate all features and interactions by the same pre-defined operation under expert guidance. Their recommendation is unsatisfactory sometimes due to the following issues: (1) They cannot ensure the learning abilities of models because their architectures are poorly adaptable to tasks and data; (2) Useless features and interactions can bring unnecessary noise and complicate the training process. In this paper, we aim to adaptively evolve the model to select appropriate operations, features, and interactions under task guidance. Inspired by the evolution and functioning of natural organisms, we propose a novel Cognitive EvoLutionary Learning (CELL) framework, where cognitive ability refers to a property of organisms that allows them to react and survive in diverse environments. It consists of three stages, i.e., DNA search, genome search, and model functioning. Specifically, if we regard the relationship between models and tasks as the relationship between organisms and natural environments, interactions of feature pairs can be analogous to double-stranded DNA, of which relevant features and interactions can be analogous to genomes. Along this line, we diagnose the fitness of the model on operations, features, and interactions to simulate the survival rates of organisms for natural selection. We show that CELL can adaptively evolve into different models for different tasks and data, which enables practitioners to access off-the-shelf models. Extensive experiments on four real-world datasets demonstrate that CELL significantly outperforms state-of-the-art baselines. Also, we conduct synthetic experiments to ascertain that CELL can consistently discover the pre-defined interaction patterns for feature pairs.

[IR-13] BioBERT-based Deep Learning and Merged ChemProt-DrugProt for Enhanced Biomedical Relation Extraction

链接: https://arxiv.org/abs/2405.18605
作者: Bridget T. McInnes,Jiawei Tang,Darshini Mahendran,Mai H. Nguyen
关键词: enhancing relation extraction, focusing specifically, chemical-gene interactions, paper presents, presents a methodology
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR); Molecular Networks (q-bio.MN)
*备注:

点击查看摘要

Abstract:This paper presents a methodology for enhancing relation extraction from biomedical texts, focusing specifically on chemical-gene interactions. Leveraging the BioBERT model and a multi-layer fully connected network architecture, our approach integrates the ChemProt and DrugProt datasets using a novel merging strategy. Through extensive experimentation, we demonstrate significant performance improvements, particularly in CPR groups shared between the datasets. The findings underscore the importance of dataset merging in augmenting sample counts and improving model accuracy. Moreover, the study highlights the potential of automated information extraction in biomedical research and clinical practice.

[IR-14] It's Not a Modality Gap: Characterizing and Addressing the Contrastive Gap

链接: https://arxiv.org/abs/2405.18570
作者: Abrar Fahim,Alex Murphy,Alona Fyshe
关键词: embedding input images, contrastive models, Multi-modal contrastive models, contrastive, gap
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multi-modal contrastive models such as CLIP achieve state-of-the-art performance in zero-shot classification by embedding input images and texts on a joint representational space. Recently, a modality gap has been reported in two-encoder contrastive models like CLIP, meaning that the image and text embeddings reside in disjoint areas of the latent space. Previous studies suggest that this gap exists due to 1) the cone effect, 2) mismatched pairs in the dataset, and 3) insufficient training. We show that, even when accounting for all these factors, and even when using the same modality, the contrastive loss actually creates a gap during training. As a result, we propose that the modality gap is inherent to the two-encoder contrastive loss and rename it the contrastive gap. We present evidence that attributes this contrastive gap to low uniformity in CLIP space, resulting in embeddings that occupy only a small portion of the latent space. To close the gap, we adapt the uniformity and alignment properties of unimodal contrastive loss to the multi-modal setting and show that simply adding these terms to the CLIP loss distributes the embeddings more uniformly in the representational space, closing the gap. In our experiments, we show that the modified representational space achieves better performance than default CLIP loss in downstream tasks such as zero-shot image classification and multi-modal arithmetic.
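The uniformity and alignment terms borrowed from unimodal contrastive learning can be written down directly. Here they are checked on synthetic unit-norm embeddings rather than CLIP outputs; the cluster widths and sample sizes are arbitrary:

```python
import numpy as np

def alignment(x, y):
    """Mean squared distance between matched embedding pairs (lower is better)."""
    return float(np.mean(np.sum((x - y) ** 2, axis=1)))

def uniformity(x, t=2.0):
    """Log of the mean Gaussian potential over all distinct pairs;
    lower means the embeddings cover the hypersphere more evenly."""
    d2 = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    iu = np.triu_indices(len(x), k=1)
    return float(np.log(np.mean(np.exp(-t * d2[iu]))))

def normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

rng = np.random.default_rng(0)
clustered = normalize(rng.normal([5.0, 0.0, 0.0], 0.1, size=(64, 3)))  # narrow cone
spread = normalize(rng.normal(0.0, 1.0, size=(64, 3)))                 # near-uniform
```

Embeddings bunched in a narrow cone, as in the reported gap, score much worse (higher) on uniformity than embeddings spread over the sphere, which is the quantity the modified loss pushes down.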

[IR-15] Potential Field Based Deep Metric Learning

链接: https://arxiv.org/abs/2405.18560
作者: Shubhang Bhatnagar,Narendra Ahuja
关键词: Deep metric learning, meaningful representation space, semantically meaningful representation, Deep metric, involves training
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:Deep metric learning (DML) involves training a network to learn a semantically meaningful representation space. Many current approaches mine n-tuples of examples and model interactions within each tuple. We present a novel, compositional DML model, inspired by electrostatic fields in physics, that, instead of in tuples, represents the influence of each example (embedding) by a continuous potential field, and superposes the fields to obtain their combined global potential field. We use attractive/repulsive potential fields to represent interactions among embeddings from images of the same/different classes. Contrary to typical learning methods, where mutual influence of samples is proportional to their distance, we enforce reduction in such influence with distance, leading to a decaying field. We show that such decay helps improve performance on real world datasets with large intra-class variations and label noise. Like other proxy-based methods, we also use proxies to succinctly represent sub-populations of examples. We evaluate our method on three standard DML benchmarks, Cars-196, CUB-200-2011, and SOP, where it outperforms state-of-the-art baselines.
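The decaying-field idea can be sketched as a scalar potential in which same-class embeddings attract (negative contribution) and other-class embeddings repel, with an exponential decay so distant samples barely matter. The functional form and all numbers below are illustrative, not the paper's exact field:

```python
import numpy as np

def decaying_potential(query, embeddings, labels, q_label, alpha=1.0):
    """Superposed attractive/repulsive potential felt at `query`.
    Influence decays exponentially with distance (unlike a 1/r Coulomb field)."""
    d = np.linalg.norm(embeddings - query, axis=1) + 1e-8
    sign = np.where(labels == q_label, -1.0, 1.0)  # same class attracts (lowers potential)
    return float(np.sum(sign * np.exp(-alpha * d)))

emb = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
labels = np.array([0, 0, 1])
near_same = decaying_potential(np.array([0.05, 0.0]), emb, labels, q_label=0)
near_other = decaying_potential(np.array([5.0, 4.9]), emb, labels, q_label=0)
```

A query near its own class sits in a potential well (negative), while one near the other class sits on a potential hill (positive); gradient descent on such a field pulls embeddings toward their class and away from others.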

人工智能

[AI-0] LLMs Meet Multimodal Generation and Editing: A Survey

链接: https://arxiv.org/abs/2405.19334
作者: Yingqing He,Zhaoyang Liu,Jingye Chen,Zeyue Tian,Hongyu Liu,Xiaowei Chi,Runtao Liu,Ruibin Yuan,Yazhou Xing,Wenhai Wang,Jifeng Dai,Yong Zhang,Wei Xue,Qifeng Liu,Yike Guo,Qifeng Chen
关键词: large language models, large language, combining LLMs, growing interest, interest in combining
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注: 51 Pages with 16 Figures, 12 Tables, and 534 References. GitHub Repository at: this https URL

点击查看摘要

Abstract:With the recent advancement in large language models (LLMs), there is a growing interest in combining LLMs with multimodal learning. Previous surveys of multimodal large language models (MLLMs) mainly focus on understanding. This survey elaborates on multimodal generation across different domains, including image, video, 3D, and audio, where we highlight the notable advancements with milestone works in these fields. Specifically, we exhaustively investigate the key technical components behind methods and multimodal datasets utilized in these studies. Moreover, we dig into tool-augmented multimodal agents that can use existing generative models for human-computer interaction. Lastly, we also comprehensively discuss the advancement in AI safety and investigate emerging applications as well as future prospects. Our work provides a systematic and insightful overview of multimodal generation, which is expected to advance the development of Artificial Intelligence for Generative Content (AIGC) and world models. A curated list of all related papers can be found at this https URL

[AI-1] Self-Exploring Language Models: Active Preference Elicitation for Online Alignment

链接: https://arxiv.org/abs/2405.19332
作者: Shenao Zhang,Donghan Yu,Hiteshi Sharma,Ziyi Yang,Shuohang Wang,Hany Hassan,Zhaoran Wang
关键词: aligning Large Language, Reinforcement Learning, achieved significant success, Large Language Models, aligning Large
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Preference optimization, particularly through Reinforcement Learning from Human Feedback (RLHF), has achieved significant success in aligning Large Language Models (LLMs) to adhere to human intentions. Unlike offline alignment with a fixed dataset, online feedback collection from humans or AI on model generations typically leads to more capable reward models and better-aligned LLMs through an iterative process. However, achieving a globally accurate reward model requires systematic exploration to generate diverse responses that span the vast space of natural language. Random sampling from standard reward-maximizing LLMs alone is insufficient to fulfill this requirement. To address this issue, we propose a bilevel objective optimistically biased towards potentially high-reward responses to actively explore out-of-distribution regions. By solving the inner-level problem with the reparameterized reward function, the resulting algorithm, named Self-Exploring Language Models (SELM), eliminates the need for a separate RM and iteratively updates the LLM with a straightforward objective. Compared to Direct Preference Optimization (DPO), the SELM objective reduces indiscriminate favor of unseen extrapolations and enhances exploration efficiency. Our experimental results demonstrate that when finetuned on Zephyr-7B-
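For reference, the DPO objective that SELM compares against scores a single preference pair as follows. This is a standard-form sketch; SELM's reparameterized reward and added optimism term are not shown:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (chosen, rejected) pair. logp_* are summed token
    log-probabilities under the policy and the frozen reference model."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

neutral = dpo_loss(-7.0, -7.0, -7.0, -7.0)   # policy equals reference: no signal
learned = dpo_loss(-5.0, -10.0, -7.0, -7.0)  # policy already favors the chosen answer
```

The loss shrinks as the policy's implicit reward margin over the reference grows for the chosen response, which is the gradient signal SELM's bilevel objective then biases toward unexplored high-reward regions.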