This post presents the latest paper listings fetched daily from the arXiv website, automatically updated every morning at around 11:30, organized into five broad areas: NLP, CV, ML, AI, and IR. If you would like to receive the daily listing by email, leave your email address in the comments; emails are likewise sent automatically at around 11:30 each day.

Table of Contents

Overview (2024-06-14)

501 papers updated today, including:

  • 88 in Natural Language Processing (Computation and Language, cs.CL)
  • 142 in Computer Vision (Computer Vision and Pattern Recognition, cs.CV)
  • 120 in Artificial Intelligence (cs.AI)
  • 183 in Machine Learning (cs.LG)

Natural Language Processing

[NLP-0] MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding

Link: https://arxiv.org/abs/2406.09411
Authors: Fei Wang,Xingyu Fu,James Y. Huang,Zekun Li,Qin Liu,Xiaogeng Liu,Mingyu Derek Ma,Nan Xu,Wenxuan Zhou,Kai Zhang,Tianyi Lorena Yan,Wenjie Jacky Mo,Hsiang-Hui Liu,Pan Lu,Chunyuan Li,Chaowei Xiao,Kai-Wei Chang,Dan Roth,Sheng Zhang,Hoifung Poon,Muhao Chen
Keywords: comprehensive benchmark, benchmark that focuses, focuses on robust, robust multi-image understanding, multi-image understanding capabilities
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:We introduce MuirBench, a comprehensive benchmark that focuses on robust multi-image understanding capabilities of multimodal LLMs. MuirBench consists of 12 diverse multi-image tasks (e.g., scene understanding, ordering) that involve 10 categories of multi-image relations (e.g., multiview, temporal relations). Comprising 11,264 images and 2,600 multiple-choice questions, MuirBench is created in a pairwise manner, where each standard instance is paired with an unanswerable variant that has minimal semantic differences, in order for a reliable assessment. Evaluated upon 20 recent multi-modal LLMs, our results reveal that even the best-performing models like GPT-4o and Gemini Pro find it challenging to solve MuirBench, achieving 68.0% and 49.3% in accuracy. Open-source multimodal LLMs trained on single images can hardly generalize to multi-image questions, hovering below 33.3% in accuracy. These results highlight the importance of MuirBench in encouraging the community to develop multimodal LLMs that can look beyond a single image, suggesting potential pathways for future improvements.
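The pairwise design described in the abstract can be sketched as follows: a model is credited only when it both answers the standard instance correctly and gives the expected response (here, an abstain option "E") on the unanswerable variant. The scoring rule and option letters below are invented for illustration, not taken from the paper.

```python
# Hypothetical pairwise scoring: each item pairs a standard question with an
# unanswerable variant; credit requires getting BOTH members of the pair right.
def pairwise_accuracy(predictions, gold):
    """predictions/gold: lists of (standard_answer, variant_answer) tuples."""
    correct = sum(
        1 for pred, ref in zip(predictions, gold)
        if pred[0] == ref[0] and pred[1] == ref[1]
    )
    return correct / len(gold)

preds = [("A", "E"), ("B", "C"), ("D", "E")]
refs  = [("A", "E"), ("B", "E"), ("C", "E")]
acc = pairwise_accuracy(preds, refs)  # only the first pair is fully correct
```

This kind of paired scoring penalizes models that guess an answer even when no answer exists, which is why it gives a more reliable assessment than standard accuracy.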

[NLP-1] Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models

Link: https://arxiv.org/abs/2406.09403
Authors: Yushi Hu,Weijia Shi,Xingyu Fu,Dan Roth,Mari Ostendorf,Luke Zettlemoyer,Noah A Smith,Ranjay Krishna
Keywords: limited-capacity working memory, solving geometry problems, working memory, sketches to amplify, amplify our ideas
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: 26 pages

Abstract:Humans draw to facilitate reasoning: we draw auxiliary lines when solving geometry problems; we mark and circle when reasoning on maps; we use sketches to amplify our ideas and relieve our limited-capacity working memory. However, such actions are missing in current multimodal language models (LMs). Current chain-of-thought and tool-use paradigms only use text as intermediate reasoning steps. In this work, we introduce Sketchpad, a framework that gives multimodal LMs a visual sketchpad and tools to draw on the sketchpad. The LM conducts planning and reasoning according to the visual artifacts it has drawn. Different from prior work, which uses text-to-image models to enable LMs to draw, Sketchpad enables LMs to draw with lines, boxes, marks, etc., which is closer to human sketching and better facilitates reasoning. Sketchpad can also use specialist vision models during the sketching process (e.g., draw bounding boxes with object detection models, draw masks with segmentation models), to further enhance visual perception and reasoning. We experiment with a wide range of math tasks (including geometry, functions, graphs, and chess) and complex visual reasoning tasks. Sketchpad substantially improves performance on all tasks over strong base models with no sketching, yielding an average gain of 12.7% on math tasks, and 8.6% on vision tasks. GPT-4o with Sketchpad sets a new state of the art on all tasks, including VBench (80.3%), BLINK spatial reasoning (83.9%), and visual correspondence (80.8%). All codes and data are in this https URL.

[NLP-2] Improving Autoregressive Training with Dynamic Oracles

Link: https://arxiv.org/abs/2406.09393
Authors: Jianing Yang,Harshine Visvanathan,Yilin Wang,Xinyi Hu,Matthew Gormley
Keywords: sequential decision problems, ranging from sequence, framed as sequential, sequential decision, sequence tagging
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Many tasks within NLP can be framed as sequential decision problems, ranging from sequence tagging to text generation. However, for many tasks, the standard training methods, including maximum likelihood (teacher forcing) and scheduled sampling, suffer from exposure bias and a mismatch between metrics employed during training and inference. DAgger provides a solution to mitigate these problems, yet it requires a metric-specific dynamic oracle algorithm, which does not exist for many common metrics like span-based F1, ROUGE, and BLEU. In this paper, we develop these novel dynamic oracles and show they maintain DAgger’s no-regret guarantee for decomposable metrics like span-based F1. We evaluate the algorithm’s performance on named entity recognition (NER), text summarization, and machine translation (MT). While DAgger with dynamic oracle yields less favorable results in our MT experiments, it outperforms the baseline techniques in NER and text summarization.
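A dynamic oracle for span-based F1 presupposes the metric itself; for readers unfamiliar with it, a minimal micro-averaged span F1 over (start, end, label) triples — the standard NER definition, not code from the paper — looks like:

```python
def span_f1(pred_spans, gold_spans):
    """Micro F1 over predicted vs. gold (start, end, label) spans, as in NER."""
    pred, gold = set(pred_spans), set(gold_spans)
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)                  # exact-match true positives
    precision = tp / len(pred)
    recall = tp / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

pred = [(0, 2, "PER"), (5, 7, "LOC")]
gold = [(0, 2, "PER"), (5, 7, "ORG")]      # second span has the wrong label
score = span_f1(pred, gold)                # tp=1, P=R=0.5, so F1=0.5
```

Because the metric decomposes over spans, partial credit is well defined at every prefix of a decoding, which is what makes a no-regret dynamic oracle possible for it.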

[NLP-3] DiscreteSLU: A Large Language Model with Self-Supervised Discrete Speech Units for Spoken Language Understanding

Link: https://arxiv.org/abs/2406.09345
Authors: Suwon Shon,Kwangyoun Kim,Yi-Te Hsu,Prashant Sridhar,Shinji Watanabe,Karen Livescu
Keywords: pre-trained text-based large, text-based large language, large language models, enabled instruction-following capabilities, speech
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments:

Abstract:The integration of pre-trained text-based large language models (LLM) with speech input has enabled instruction-following capabilities for diverse speech tasks. This integration requires the use of a speech encoder, a speech adapter, and an LLM, trained on diverse tasks. We propose the use of discrete speech units (DSU), rather than continuous-valued speech encoder outputs, that are converted to the LLM token embedding space using the speech adapter. We generate DSU using a self-supervised speech encoder followed by k-means clustering. The proposed model shows robust performance on speech inputs from seen/unseen domains and instruction-following capability in spoken question answering. We also explore various types of DSU extracted from different layers of the self-supervised speech encoder, as well as Mel frequency Cepstral Coefficients (MFCC). Our findings suggest that the ASR task and datasets are not crucial in instruction-tuning for spoken question answering tasks.
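The unit-extraction step — self-supervised encoder features followed by k-means — can be sketched with a toy Lloyd's-algorithm clustering, where synthetic "frames" stand in for real encoder outputs:

```python
import numpy as np

def kmeans_units(features, k, iters=10, seed=0):
    """Cluster frame-level encoder features with k-means (Lloyd's algorithm)
    and return each frame's nearest-centroid index: its discrete speech unit."""
    rng = np.random.default_rng(seed)
    centroids = features[rng.choice(len(features), size=k, replace=False)].copy()
    units = np.zeros(len(features), dtype=int)
    for _ in range(iters):
        # assign every frame to its nearest centroid
        dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=-1)
        units = dists.argmin(axis=1)
        # move each centroid to the mean of its assigned frames
        for j in range(k):
            if np.any(units == j):
                centroids[j] = features[units == j].mean(axis=0)
    return units

# two well-separated synthetic "frame" clusters stand in for encoder outputs
frames = np.vstack([np.zeros((5, 4)), np.ones((5, 4))])
units = kmeans_units(frames, k=2)
```

The resulting unit sequence can then be treated like text tokens and mapped into the LLM's embedding space by the speech adapter.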

[NLP-4] ProxyLM: Predicting Language Model Performance on Multilingual Tasks via Proxy Models

Link: https://arxiv.org/abs/2406.09334
Authors: David Anugraha,Genta Indra Winata,Chenyue Li,Patrick Amadeus Irawan,En-Shiun Annie Lee
Keywords: mitigating computational costs, data for fine-tuning, capacity and data, proxy models, Performance
Subjects: Computation and Language (cs.CL)
Comments: Preprint

Abstract:Performance prediction is a method to estimate the performance of multilingual language models (LMs), mitigating computational costs associated with model capacity and data for fine-tuning. Our paper introduces ProxyLM, a scalable framework for predicting LM performance using proxy models in multilingual tasks. These proxy models act as surrogates, approximating the performance of fine-tuned LMs on specific downstream natural language processing (NLP) tasks. By leveraging proxy models, ProxyLM significantly reduces computational overhead on task evaluations, achieving up to a 37.08x speedup compared to traditional methods, even with our smallest proxy models. Additionally, our methodology showcases adaptability to previously unseen languages in pre-trained LMs, outperforming the state-of-the-art performance by 1.89x as measured by root-mean-square-error (RMSE). This framework streamlines model selection, enabling efficient deployment and iterative LM enhancements without extensive computational resources.
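The core idea — fit a cheap regressor that maps proxy-model signals to the fine-tuned model's score, then evaluate it by RMSE — can be illustrated with ordinary least squares. All numbers below are made up, and the actual ProxyLM features and estimator are richer than this.

```python
import numpy as np

# Invented toy numbers: a small proxy model's scores and a data-size feature
# are used to predict the fine-tuned large model's score via least squares.
proxy_scores  = np.array([0.41, 0.55, 0.62, 0.70, 0.78])
log_data_size = np.array([3.0, 3.5, 4.0, 4.5, 5.0])
target_scores = np.array([0.50, 0.63, 0.69, 0.76, 0.85])  # fine-tuned LM scores

# design matrix with an intercept column
X = np.column_stack([proxy_scores, log_data_size, np.ones_like(proxy_scores)])
coef, *_ = np.linalg.lstsq(X, target_scores, rcond=None)
pred = X @ coef
rmse = float(np.sqrt(np.mean((pred - target_scores) ** 2)))  # metric used in the paper
```

Because the regressor only needs proxy runs as input, the expensive fine-tuning of the large model can be skipped at selection time, which is where the reported speedup comes from.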

[NLP-5] Learning from Natural Language Explanations for Generalizable Entity Matching

Link: https://arxiv.org/abs/2406.09330
Authors: Somin Wadhwa,Adit Krishnan,Runhui Wang,Byron C. Wallace,Chris Kong
Keywords: Entity matching, sources that refer, Entity, linking records, matching
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Entity matching is the task of linking records from different sources that refer to the same real-world entity. Past work has primarily treated entity linking as a standard supervised learning problem. However, supervised entity matching models often do not generalize well to new data, and collecting exhaustive labeled training data is often cost prohibitive. Further, recent efforts have adopted LLMs for this task in few/zero-shot settings, exploiting their general knowledge. But LLMs are prohibitively expensive for performing inference at scale for real-world entity matching tasks. As an efficient alternative, we re-cast entity matching as a conditional generation task as opposed to binary classification. This enables us to “distill” LLM reasoning into smaller entity matching models via natural language explanations. This approach achieves strong performance, especially on out-of-domain generalization tests (10.85% F-1) where standalone generative methods struggle. We perform ablations that highlight the importance of explanations, both for performance and model robustness.
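Recasting a record pair as a conditional generation example, with the match decision and explanation in the target rather than a 0/1 label, might look like the following sketch. The serialization format, field names, and example records here are invented for illustration.

```python
def to_generation_example(record_a, record_b, label, explanation):
    """Hypothetical formatting: serialize an entity pair as a seq2seq source,
    and the match decision plus a natural language explanation as the target,
    instead of a binary classification label."""
    source = (
        "Do these records refer to the same entity? "
        f"Record A: {record_a} Record B: {record_b}"
    )
    target = f"{'yes' if label else 'no'}, because {explanation}"
    return source, target

src, tgt = to_generation_example(
    {"title": "iPhone 13 128GB"},
    {"title": "Apple iPhone 13, 128 GB"},
    label=True,
    explanation="both describe the same phone model and storage size",
)
```

Training a small seq2seq model on targets like `tgt` is one way to "distill" LLM-produced explanations into a compact matcher.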

[NLP-6] REVS: Unlearning Sensitive Information in Language Models via Rank Editing in the Vocabulary Space

Link: https://arxiv.org/abs/2406.09325
Authors: Tomer Ashuach,Martin Tutek,Yonatan Belinkov
Keywords: causing privacy concerns, Large language models, risk inadvertently memorizing, Large language, personally identifiable information
Subjects: Computation and Language (cs.CL)
Comments: 18 pages, 3 figures

Abstract:Large language models (LLMs) risk inadvertently memorizing and divulging sensitive or personally identifiable information (PII) seen in training data, causing privacy concerns. Current approaches to address this issue involve costly dataset scrubbing, or model filtering through unlearning and model editing, which can be bypassed through extraction attacks. We propose REVS, a novel model editing method for unlearning sensitive information from LLMs. REVS identifies and modifies a small subset of neurons relevant for each piece of sensitive information. By projecting these neurons to the vocabulary space (unembedding), we pinpoint the components driving its generation. We then compute a model edit based on the pseudo-inverse of the unembedding matrix, and apply it to de-promote generation of the targeted sensitive data. To adequately evaluate our method on truly sensitive information, we curate two datasets: an email dataset inherently memorized by GPT-J, and a synthetic social security number dataset that we tune the model to memorize. Compared to other state-of-the-art model editing methods, REVS demonstrates superior performance in both eliminating sensitive information and robustness to extraction attacks, while retaining integrity of the underlying model. The code and a demo notebook are available at this https URL.
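The "projecting neurons to the vocabulary space" step is in the spirit of the logit-lens idea: score every vocabulary token against a hidden direction using the unembedding matrix, and read off the tokens that direction promotes. Below is a deliberately tiny sketch with a toy unembedding matrix, not the paper's GPT-J setup.

```python
import numpy as np

vocab = ["the", "cat", "alice@example.com", "dog", "42"]
d_model = len(vocab)
# Toy unembedding matrix: identity plus a small constant, so each hidden
# dimension roughly "writes" one token (a deliberately simple assumption).
W_U = np.eye(d_model) + 0.01

# a neuron direction aligned with the hidden dimension of the email token
neuron_direction = np.zeros(d_model)
neuron_direction[2] = 1.0

logits = W_U @ neuron_direction            # project into vocabulary space
top_token = vocab[int(np.argmax(logits))]  # token this neuron promotes most
```

Once the promoting components are identified this way, an edit (in REVS, one based on the pseudo-inverse of the unembedding matrix) can demote the sensitive token's generation.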

[NLP-7] Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs

Link: https://arxiv.org/abs/2406.09324
Authors: Zhao Xu,Fan Liu,Hao Liu
Keywords: Large Language Models, Language Models, Large Language, produce harmful outputs, demonstrated significant capabilities
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Although Large Language Models (LLMs) have demonstrated significant capabilities in executing complex tasks in a zero-shot manner, they are susceptible to jailbreak attacks and can be manipulated to produce harmful outputs. Recently, a growing body of research has categorized jailbreak attacks into token-level and prompt-level attacks. However, previous work primarily overlooks the diverse key factors of jailbreak attacks, with most studies concentrating on LLM vulnerabilities and lacking exploration of defense-enhanced LLMs. To address these issues, we evaluate the impact of various attack settings on LLM performance and provide a baseline benchmark for jailbreak attacks, encouraging the adoption of a standardized evaluation framework. Specifically, we evaluate the eight key factors of implementing jailbreak attacks on LLMs from both target-level and attack-level perspectives. We further conduct seven representative jailbreak attacks on six defense methods across two widely used datasets, encompassing approximately 320 experiments with about 50,000 GPU hours on A800-80G. Our experimental results highlight the need for standardized benchmarking to evaluate these attacks on defense-enhanced LLMs. Our code is available at this https URL.

[NLP-8] JailbreakEval: An Integrated Toolkit for Evaluating Jailbreak Attempts Against Large Language Models

Link: https://arxiv.org/abs/2406.09321
Authors: Delong Ran,Jinyuan Liu,Yichen Gong,Jingyi Zheng,Xinlei He,Tianshuo Cong,Anyu Wang
Keywords: Large Language Models, induce Large Language, Language Models, Large Language, presenting severe misuse
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Our code is available at this https URL

Abstract:Jailbreak attacks aim to induce Large Language Models (LLMs) to generate harmful responses for forbidden instructions, presenting severe misuse threats to LLMs. Up to now, research into jailbreak attacks and defenses is emerging, however, there is (surprisingly) no consensus on how to evaluate whether a jailbreak attempt is successful. In other words, the methods to assess the harmfulness of an LLM’s response are varied, such as manual annotation or prompting GPT-4 in specific ways. Each approach has its own set of strengths and weaknesses, impacting their alignment with human values, as well as the time and financial cost. This diversity in evaluation presents challenges for researchers in choosing suitable evaluation methods and conducting fair comparisons across different jailbreak attacks and defenses. In this paper, we conduct a comprehensive analysis of jailbreak evaluation methodologies, drawing from nearly ninety jailbreak research released between May 2023 and April 2024. Our study introduces a systematic taxonomy of jailbreak evaluators, offering in-depth insights into their strengths and weaknesses, along with the current status of their adaptation. Moreover, to facilitate subsequent research, we propose JailbreakEval, a user-friendly toolkit focusing on the evaluation of jailbreak attempts. It includes various well-known evaluators out-of-the-box, so that users can obtain evaluation results with only a single command. JailbreakEval also allows users to customize their own evaluation workflow in a unified framework with the ease of development and comparison. In summary, we regard JailbreakEval to be a catalyst that simplifies the evaluation process in jailbreak research and fosters an inclusive standard for jailbreak evaluation within the community.

[NLP-9] Khmer Semantic Search Engine: Digital Information Access and Document Retrieval

Link: https://arxiv.org/abs/2406.09320
Authors: Nimol Thuon
Keywords: Khmer, search, semantic, process is crucial, Semantic search
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments:

Abstract:The search engine process is crucial for document content retrieval. For Khmer documents, a tool is needed to extract essential keywords. Despite the daily generation of significant Khmer content, Cambodians struggle to find necessary documents due to the lack of an effective semantic searching tool. Even Google does not deliver high accuracy for Khmer content. Semantic search engines improve search results by employing advanced algorithms to understand various content types. With the rise in Khmer digital content such as reports, articles, and social media feedback enhanced search capabilities are essential. This research proposes the first Khmer Semantic Search Engine (KSE), designed to improve traditional Khmer search methods. Utilizing semantic matching techniques and formally annotated semantic content, our tool extracts meaningful keywords from user queries performs precise matching, and provides the best matching offline documents and online URL documents. We propose two semantic search frameworks based on keyword extraction and semantic search matching. Additionally, we developed tools for data preparation, including document addition and manual keyword extraction. To evaluate performance, we created a ground truth dataset and discussed issues related to searching and semantic search. Our findings show how understanding search term semantics can lead to more accurate results.
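The core of semantic matching — ranking documents by embedding similarity to the query rather than by exact keyword overlap — reduces to cosine similarity. The toy 2-d vectors below are placeholders; a real system would use Khmer-capable sentence embeddings.

```python
import numpy as np

def semantic_search(query_vec, doc_vecs, top_k=2):
    """Rank documents by cosine similarity to the query embedding."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q                       # cosine similarity per document
    order = np.argsort(-scores)[:top_k]  # best-matching documents first
    return [(int(i), float(scores[i])) for i in order]

docs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])  # toy document embeddings
query = np.array([1.0, 0.0])
ranking = semantic_search(query, docs)
```

The same ranking step works for both offline documents and online URL documents once each has been embedded and indexed.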

[NLP-10] Transformers meet Neural Algorithmic Reasoners

Link: https://arxiv.org/abs/2406.09308
Authors: Wilfried Bounsi,Borja Ibarz,Andrew Dudzik,Jessica B. Hamrick,Larisa Markeeva,Alex Vitvitskyi,Razvan Pascanu,Petar Veličković
Keywords: revolutionized machine learning, revolutionized machine, machine learning, Transformer language understanding, language understanding
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: To appear at CVPR 2024 Multimodal Algorithmic Reasoning (MAR) Workshop. 10 pages, 5 figures

Abstract:Transformers have revolutionized machine learning with their simple yet effective architecture. Pre-training Transformers on massive text datasets from the Internet has led to unmatched generalization for natural language understanding (NLU) tasks. However, such language models remain fragile when tasked with algorithmic forms of reasoning, where computations must be precise and robust. To address this limitation, we propose a novel approach that combines the Transformer’s language understanding with the robustness of graph neural network (GNN)-based neural algorithmic reasoners (NARs). Such NARs proved effective as generic solvers for algorithmic tasks, when specified in graph form. To make their embeddings accessible to a Transformer, we propose a hybrid architecture with a two-phase training procedure, allowing the tokens in the language model to cross-attend to the node embeddings from the NAR. We evaluate our resulting TransNAR model on CLRS-Text, the text-based version of the CLRS-30 benchmark, and demonstrate significant gains over Transformer-only models for algorithmic reasoning, both in and out of distribution.
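The hybrid's key operation — language-model tokens cross-attending to NAR node embeddings — is ordinary cross-attention. Here is a single-head sketch without learned projections (a simplifying assumption; the paper's architecture uses trained attention layers):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(token_emb, node_emb):
    """Tokens (queries) attend to NAR node embeddings (keys/values)."""
    d = token_emb.shape[-1]
    scores = token_emb @ node_emb.T / np.sqrt(d)  # (T, N) attention logits
    weights = softmax(scores, axis=-1)            # each token's mix over nodes
    return weights @ node_emb                     # (T, d) node info per token

tokens = np.random.default_rng(0).normal(size=(4, 8))  # toy token embeddings
nodes = np.random.default_rng(1).normal(size=(6, 8))   # toy NAR node embeddings
out = cross_attend(tokens, nodes)
```

Each output row is a convex combination of node embeddings, which is how graph-executed algorithmic state gets injected into the language model's token stream.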

[NLP-11] AlignMMBench: Evaluating Chinese Multimodal Alignment in Large Vision-Language Models

Link: https://arxiv.org/abs/2406.09295
Authors: Yuhang Wu,Wenmeng Yu,Yean Cheng,Yan Wang,Xiaohan Zhang,Jiazheng Xu,Ming Ding,Yuxiao Dong
Keywords: large Vision-Language Models, Vision-Language Models, helpful assistants, large Vision-Language, essential for determining
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Evaluating the alignment capabilities of large Vision-Language Models (VLMs) is essential for determining their effectiveness as helpful assistants. However, existing benchmarks primarily focus on basic abilities using nonverbal methods, such as yes-no and multiple-choice questions. In this paper, we address this gap by introducing AlignMMBench, a comprehensive alignment benchmark specifically designed for emerging Chinese VLMs. This benchmark is meticulously curated from real-world scenarios and Chinese Internet sources, encompassing thirteen specific tasks across three categories, and includes both single-turn and multi-turn dialogue scenarios. Incorporating a prompt rewrite strategy, AlignMMBench encompasses 1,054 images and 4,978 question-answer pairs. To facilitate the evaluation pipeline, we propose CritiqueVLM, a rule-calibrated evaluator that exceeds GPT-4’s evaluation ability. Finally, we report the performance of representative VLMs on AlignMMBench, offering insights into the capabilities and limitations of different VLM architectures. All evaluation codes and data are available on this https URL.

[NLP-12] Understanding Jailbreak Success: A Study of Latent Space Dynamics in Large Language Models

Link: https://arxiv.org/abs/2406.09289
Authors: Sarah Ball,Frauke Kreuter,Nina Rimsky
Keywords: Conversational Large Language, Conversational Large, answer harmful questions, Large Language Models, Large Language
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Conversational Large Language Models are trained to refuse to answer harmful questions. However, emergent jailbreaking techniques can still elicit unsafe outputs, presenting an ongoing challenge for model alignment. To better understand how different jailbreak types circumvent safeguards, this paper analyses model activations on different jailbreak inputs. We find that it is possible to extract a jailbreak vector from a single class of jailbreaks that works to mitigate jailbreak effectiveness from other classes. This may indicate that different kinds of effective jailbreaks operate via similar internal mechanisms. We investigate a potential common mechanism of harmfulness feature suppression, and provide evidence for its existence by looking at the harmfulness vector component. These findings offer actionable insights for developing more robust jailbreak countermeasures and lay the groundwork for a deeper, mechanistic understanding of jailbreak dynamics in language models.
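Extracting a "jailbreak vector" from one class of jailbreaks amounts to a difference-of-means computation over model activations, in the spirit of activation steering. The activations below are synthetic; real work uses residual-stream activations extracted from the model.

```python
import numpy as np

rng = np.random.default_rng(0)
harmless_acts = rng.normal(0.0, 0.1, size=(20, 16))  # toy activations
shift = np.zeros(16)
shift[3] = 2.0                                       # the class's shared direction
jailbreak_acts = harmless_acts + shift               # same prompts, jailbroken

# difference of means recovers the shared direction
jailbreak_vector = jailbreak_acts.mean(axis=0) - harmless_acts.mean(axis=0)

# steering: subtract the direction from a new jailbreak activation
steered = jailbreak_acts[0] - jailbreak_vector
```

That a vector computed from one jailbreak class transfers to others is the paper's evidence that different jailbreaks share internal mechanisms.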

[NLP-13] On the Effects of Heterogeneous Data Sources on Speech-to-Text Foundation Models

Link: https://arxiv.org/abs/2406.09282
Authors: Jinchuan Tian,Yifan Peng,William Chen,Kwanghee Choi,Karen Livescu,Shinji Watanabe
Keywords: Open Whisper-style Speech, Whisper-style Speech Model, achieve full transparency, Whisper-style Speech, Open Whisper-style
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments:

Abstract:The Open Whisper-style Speech Model (OWSM) series was introduced to achieve full transparency in building advanced speech-to-text (S2T) foundation models. To this end, OWSM models are trained on 25 public speech datasets, which are heterogeneous in multiple ways. In this study, we advance the OWSM series by introducing OWSM v3.2, which improves on prior models by investigating and addressing the impacts of this data heterogeneity. Our study begins with a detailed analysis of each dataset, from which we derive two key strategies: data filtering with proxy task to enhance data quality, and the incorporation of punctuation and true-casing using an open large language model (LLM). With all other configurations staying the same, OWSM v3.2 improves performance over the OWSM v3.1 baseline while using 15% less training data.

[NLP-14] Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback

Link: https://arxiv.org/abs/2406.09279
Authors: Hamish Ivison,Yizhong Wang,Jiacheng Liu,Zeqiu Wu,Valentina Pyatkin,Nathan Lambert,Noah A. Smith,Yejin Choi,Hannaneh Hajishirzi
Keywords: Learning, learning algorithm, https URL, essential step, step for improving
Subjects: Computation and Language (cs.CL)
Comments: Preprint

Abstract:Learning from preference feedback has emerged as an essential step for improving the generation quality and performance of modern language models (LMs). Despite its widespread use, the way preference-based learning is applied varies wildly, with differing data, learning algorithms, and evaluations used, making disentangling the impact of each aspect difficult. In this work, we identify four core aspects of preference-based learning: preference data, learning algorithm, reward model, and policy training prompts, systematically investigate the impact of these components on downstream model performance, and suggest a recipe for strong learning for preference feedback. Our findings indicate that all aspects are important for performance, with better preference data leading to the largest improvements, followed by the choice of learning algorithm, the use of improved reward models, and finally the use of additional unlabeled prompts for policy training. Notably, PPO outperforms DPO by up to 2.5% in math and 1.2% in general domains. High-quality preference data leads to improvements of up to 8% in instruction following and truthfulness. Despite significant gains of up to 5% in mathematical evaluation when scaling up reward models, we surprisingly observe marginal improvements in other categories. We publicly release the code used for training (this https URL) and evaluating (this https URL) our models, along with the models and datasets themselves (this https URL).
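For reference, the DPO side of the comparison optimizes a simple per-pair objective: the negative log-sigmoid of the scaled difference in policy-vs-reference log-ratios between the chosen and rejected responses. The log-probabilities below are toy values.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair:
    -log sigmoid(beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l)))."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# if the policy already prefers the chosen response more than the reference
# does, the margin is positive and the loss falls below log(2)
loss = dpo_loss(logp_w=-4.0, logp_l=-6.0, ref_logp_w=-5.0, ref_logp_l=-5.0)
```

Unlike PPO, this objective needs no separate reward model or on-policy rollouts, which is the trade-off the paper's experiments probe.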

[NLP-15] Sharing Matters: Analysing Neurons Across Languages and Tasks in LLMs
[NLP-15] 共享事项:分析LLM中跨语言和任务的神经元

链接: https://arxiv.org/abs/2406.09265
作者: Weixuan Wang,Barry Haddow,Wei Peng,Alexandra Birch
关键词: large language models, Multilingual large language, greatly increased, increased the ceiling, ceiling of performance
中文关键词: 大语言模型,多语言大语言,大大增加,提高了天花板,性能天花板
类目: Computation and Language (cs.CL)
备注:

Abstract:Multilingual large language models (LLMs) have greatly increased the ceiling of performance on non-English tasks. However the mechanisms behind multilingualism in these LLMs are poorly understood. Of particular interest is the degree to which internal representations are shared between languages. Recent work on neuron analysis of LLMs has focused on the monolingual case, and the limited work on the multilingual case has not considered the interaction between tasks and linguistic representations. In our work, we investigate how neuron activation is shared across languages by categorizing neurons into four distinct groups according to their responses across different languages for a particular input: all-shared, partial-shared, specific, and non-activated. This categorization is combined with a study of neuron attribution, i.e. the importance of a neuron w.r.t an output. Our analysis reveals the following insights: (i) the linguistic sharing patterns are strongly affected by the type of task, but neuron behaviour changes across different inputs even for the same task; (ii) all-shared neurons play a key role in generating correct responses; (iii) boosting multilingual alignment by increasing all-shared neurons can enhance accuracy on multilingual tasks. The code is available at this https URL.
摘要:多语言大语言模型(LLM)极大地提高了非英语任务的性能上限。然而,这些LLM实现多语言能力背后的机制仍然知之甚少。特别令人感兴趣的是内部表示在不同语言之间的共享程度。最近关于LLM神经元分析的工作主要集中在单语情形,而针对多语言情形的有限研究并未考虑任务与语言表示之间的相互作用。在我们的工作中,我们根据神经元对特定输入在不同语言下的响应,将其分为四个不同的组:全共享、部分共享、特定和未激活,以此研究神经元激活如何在语言间共享。这种分类还结合了对神经元归因(即神经元对输出的重要性)的研究。我们的分析揭示了以下几点:(i)语言共享模式受任务类型的强烈影响,但即使在同一任务中,神经元行为也会随输入不同而变化;(ii)全共享神经元在生成正确回答方面起关键作用;(iii)通过增加全共享神经元来增强多语言对齐,可以提高多语言任务的准确性。代码可在此HTTPS URL上获取。
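
上文摘要将神经元按跨语言激活情况分为全共享、部分共享、特定和未激活四类。下面给出按这一思路划分神经元的极简示意(激活矩阵的形状、激活阈值均为假设,并非论文的实际实现):

```python
import numpy as np

def categorize_neurons(acts, threshold=0.0):
    """acts: [语言数, 神经元数] 同一输入在不同语言下的神经元激活值。
    按"被多少种语言激活"将每个神经元归入四类之一。"""
    active = acts > threshold            # 各语言下神经元是否被激活
    n_lang = acts.shape[0]
    n_active = active.sum(axis=0)        # 每个神经元被激活的语言数
    cats = np.full(acts.shape[1], "partial-shared", dtype=object)
    cats[n_active == n_lang] = "all-shared"
    cats[n_active == 1] = "specific"
    cats[n_active == 0] = "non-activated"
    return cats
```

例如三种语言、四个神经元的激活矩阵,若某神经元在所有语言下均被激活则归为 all-shared,仅一种语言激活则归为 specific。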

[NLP-16] Towards Bidirectional Human-AI Alignment: A Systematic Review for Clarifications, Framework, and Future Directions
[NLP-16] 迈向双向人机对齐:澄清、框架与未来方向的系统性综述

链接: https://arxiv.org/abs/2406.09264
作者: Hua Shen,Tiffany Knearem,Reshmi Ghosh,Kenan Alkiek,Kundan Krishna,Yachuan Liu,Ziqiao Ma,Savvas Petridis,Yi-Hao Peng,Li Qiwei,Sushrita Rakshit,Chenglei Si,Yutong Xie,Jeffrey P. Bigham,Frank Bentley,Joyce Chai,Zachary Lipton,Qiaozhu Mei,Rada Mihalcea,Michael Terry,Diyi Yang,Meredith Ringel Morris,Paul Resnick,David Jurgens
关键词: concept broadly recognized, ethical principles, Recent advancements, highlighted the importance, importance of guiding
中文关键词: 概念被广泛认可,道德原则,最近的进展,强调了指导的重要性
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 56 pages

Abstract:Recent advancements in general-purpose AI have highlighted the importance of guiding AI systems towards the intended goals, ethical principles, and values of individuals and groups, a concept broadly recognized as alignment. However, the lack of clarified definitions and scopes of human-AI alignment poses a significant obstacle, hampering collaborative efforts across research domains to achieve this alignment. In particular, ML- and philosophy-oriented alignment research often views AI alignment as a static, unidirectional process (i.e., aiming to ensure that AI systems’ objectives match humans) rather than an ongoing, mutual alignment problem [429]. This perspective largely neglects the long-term interaction and dynamic changes of alignment. To understand these gaps, we introduce a systematic review of over 400 papers published between 2019 and January 2024, spanning multiple domains such as Human-Computer Interaction (HCI), Natural Language Processing (NLP), Machine Learning (ML), and others. We characterize, define and scope human-AI alignment. From this, we present a conceptual framework of “Bidirectional Human-AI Alignment” to organize the literature from a human-centered perspective. This framework encompasses both 1) conventional studies of aligning AI to humans that ensures AI produces the intended outcomes determined by humans, and 2) a proposed concept of aligning humans to AI, which aims to help individuals and society adjust to AI advancements both cognitively and behaviorally. Additionally, we articulate the key findings derived from literature analysis, including discussions about human values, interaction techniques, and evaluations. To pave the way for future studies, we envision three key challenges for future directions and propose examples of potential future solutions.
摘要:通用人工智能的最新进展突显了引导AI系统朝着个人和群体的预期目标、伦理原则和价值观发展的重要性,这一概念被广泛称为“对齐”(alignment)。然而,人类-AI对齐缺乏明确的定义和范围,构成了一个重大障碍,阻碍了跨研究领域为实现这种对齐而开展的协作。特别是,面向ML和哲学的对齐研究往往将AI对齐视为一个静态的、单向的过程(即旨在确保AI系统的目标与人类相匹配),而不是一个持续的、相互对齐的问题[429]。这种视角在很大程度上忽视了对齐的长期交互和动态变化。为了理解这些差距,我们对2019年至2024年1月间发表的400多篇论文进行了系统综述,涉及人机交互(HCI)、自然语言处理(NLP)、机器学习(ML)等多个领域。我们对人类-AI对齐进行了刻画、定义和范围界定。在此基础上,我们提出了“双向人类-AI对齐”的概念框架,从以人为中心的视角组织文献。该框架既包括1)将AI对齐到人类的传统研究,以确保AI产生由人类决定的预期结果,也包括2)将人类对齐到AI的新概念,旨在帮助个人和社会在认知和行为上适应AI的进步。此外,我们阐述了从文献分析中得出的关键发现,包括关于人类价值观、交互技术和评估的讨论。为了给未来研究铺平道路,我们设想了未来方向的三个关键挑战,并提出了潜在未来解决方案的示例。

[NLP-17] Self-Training for Sample-Efficient Active Learning for Text Classification with Pre-Trained Language Models
[NLP-17] 使用预训练语言模型进行文本分类的样本高效主动学习的自我训练

链接: https://arxiv.org/abs/2406.09206
作者: Christopher Schröder,Gerhard Heyer
关键词: iterative labeling process, small labeled subset, Active learning, labeled data, iterative labeling
中文关键词: 迭代标记过程,小标记子集,主动学习,标记数据,迭代标记
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

Abstract:Active learning is an iterative labeling process that is used to obtain a small labeled subset, despite the absence of labeled data, thereby enabling to train a model for supervised tasks such as text classification. While active learning has made considerable progress in recent years due to improvements provided by pre-trained language models, there is untapped potential in the often neglected unlabeled portion of the data, although it is available in considerably larger quantities than the usually small set of labeled data. Here we investigate how self-training, a semi-supervised approach where a model is used to obtain pseudo-labels from the unlabeled data, can be used to improve the efficiency of active learning for text classification. Starting with an extensive reproduction of four previous self-training approaches, some of which are evaluated for the first time in the context of active learning or natural language processing, we devise HAST, a new and effective self-training strategy, which is evaluated on four text classification benchmarks, on which it outperforms the reproduced self-training approaches and reaches classification results comparable to previous experiments for three out of four datasets, using only 25% of the data.
摘要:主动学习是一种迭代标注过程,用于在缺乏标注数据的情况下获得较小的标注子集,从而能够为文本分类等监督任务训练模型。虽然近年来得益于预训练语言模型的改进,主动学习取得了相当大的进步,但数据中经常被忽视的未标注部分仍有未开发的潜力,尽管其数量通常远大于规模较小的标注数据集。在这里,我们研究自训练(一种利用模型从未标注数据中获得伪标签的半监督方法)如何用于提高文本分类主动学习的效率。我们首先广泛复现了四种已有的自训练方法,其中一些是首次在主动学习或自然语言处理的背景下进行评估;在此基础上,我们设计了一种新的有效自训练策略HAST,并在四个文本分类基准上对其进行评估。HAST优于所复现的自训练方法,并在四个数据集中的三个上仅使用25%的数据就取得了与以往实验相当的分类结果。
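
作为背景,自训练伪标签的一般单轮流程可示意如下(这里用简单的质心分类器代替预训练语言模型,置信度阈值等均为假设,与论文HAST策略的具体细节无关):

```python
import numpy as np

def self_training_round(X_lab, y_lab, X_unlab, conf_threshold=0.8):
    """自训练单轮示意:用已标注数据训练质心分类器,
    对未标注数据打伪标签,仅保留高置信度样本并入训练集。"""
    classes = np.unique(y_lab)
    centroids = np.stack([X_lab[y_lab == c].mean(axis=0) for c in classes])
    # 到各类质心的距离经 softmax 转为"置信度"
    d = np.linalg.norm(X_unlab[:, None, :] - centroids[None], axis=-1)
    p = np.exp(-d) / np.exp(-d).sum(axis=1, keepdims=True)
    conf, pseudo = p.max(axis=1), classes[p.argmax(axis=1)]
    keep = conf >= conf_threshold
    return (np.concatenate([X_lab, X_unlab[keep]]),
            np.concatenate([y_lab, pseudo[keep]]))
```

靠近某一类质心的未标注样本获得高置信度伪标签,模糊样本(置信度低于阈值)被丢弃,留待后续轮次处理。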

[NLP-18] ReadCtrl: Personalizing text generation with readability-controlled instruction learning
[NLP-18] ReadCtrl:通过可读性控制的指令学习实现个性化文本生成

链接: https://arxiv.org/abs/2406.09205
作者: Hieu Tran,Zonghai Yao,Lingxi Li,Hong Yu
关键词: Readability-Controlled Instruction Learning, conditioning on users, Content generation conditioning, Instruction Learning, LLMs
中文关键词: 可读性控制的指令学习、以用户为条件、内容生成条件化、指令学习、LLM
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 9 pages

Abstract:Content generation conditioning on users’s readability is an important application for personalization. In an era of large language models (LLMs), readability-controlled text generation based on LLMs has become increasingly important. This paper introduces a novel methodology called “Readability-Controlled Instruction Learning (ReadCtrl),” which aims to instruction-tune LLMs to tailor users’ readability levels. Unlike the traditional methods, which primarily focused on categorical readability adjustments typically classified as high, medium, and low or expert and layperson levels with limited success, ReadCtrl introduces a dynamic framework that enables LLMs to generate content at various (near continuous level) complexity levels, thereby enhancing their versatility across different applications. Our results show that the ReadCtrl-Mistral-7B models significantly outperformed strong baseline models such as GPT-4 and Claude-3, with a win rate of 52.1%:35.7% against GPT-4 in human evaluations. Furthermore, Read-Ctrl has shown significant improvements in automatic evaluations, as evidenced by better readability metrics (e.g., FOG, FKGL) and generation quality metrics (e.g., BLEU, SARI, SummaC-Factuality, UniEval-Consistency and Coherence). These results underscore Read-Ctrl’s effectiveness and tenacity in producing high-quality, contextually appropriate outputs that closely align with targeted readability levels, marking a significant advancement in personalized content generation using LLMs.
摘要:以用户的可读性水平为条件生成内容,是个性化的一个重要应用。在大语言模型(LLM)时代,基于LLM的可读性受控文本生成变得越来越重要。本文提出了一种名为“可读性控制的指令学习(ReadCtrl)”的新方法,旨在通过指令微调使LLM适配用户的可读性水平。传统方法主要依赖高、中、低或专家与外行等粗粒度的分类式可读性调整,且效果有限;与之不同,ReadCtrl引入了一个动态框架,使LLM能够生成各种(接近连续的)复杂度水平的内容,从而增强其在不同应用中的通用性。我们的结果表明,ReadCtrl-Mistral-7B模型显著优于GPT-4和Claude-3等强基线模型,在人工评估中对GPT-4的胜率为52.1%:35.7%。此外,ReadCtrl在自动评估中也表现出显著改进,更好的可读性指标(如FOG、FKGL)和生成质量指标(如BLEU、SARI、SummaC-Factuality、UniEval-Consistency与Coherence)证明了这一点。这些结果突显了ReadCtrl在生成与目标可读性水平紧密一致的高质量、符合上下文的输出方面的有效性和稳健性,标志着使用LLM进行个性化内容生成的重大进步。

[NLP-19] Language Complexity and Speech Recognition Accuracy: Orthographic Complexity Hurts, Phonological Complexity Doesn't
[NLP-19] 语言复杂性与语音识别准确率:正字法复杂性有损,音系复杂性无碍

链接: https://arxiv.org/abs/2406.09202
作者: Chihiro Taguchi,David Chiang
关键词: Automatic Speech Recognition, Speech Recognition, Automatic Speech, linguistic factors affect, performance of Automatic
中文关键词: 自动语音识别,语音识别,自动语音,语言因素影响,自动的性能
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 11 pages, 5 figures, 5 tables, submitted to ACL 2024

Abstract:We investigate what linguistic factors affect the performance of Automatic Speech Recognition (ASR) models. We hypothesize that orthographic and phonological complexities both degrade accuracy. To examine this, we fine-tune the multilingual self-supervised pretrained model Wav2Vec2-XLSR-53 on 25 languages with 15 writing systems, and we compare their ASR accuracy, number of graphemes, unigram grapheme entropy, logographicity (how much word/morpheme-level information is encoded in the writing system), and number of phonemes. The results demonstrate that orthographic complexities significantly correlate with low ASR accuracy, while phonological complexity shows no significant correlation.
摘要:我们研究了哪些语言学因素会影响自动语音识别(ASR)模型的性能。我们假设正字法复杂性和音系复杂性都会降低准确率。为检验这一点,我们在使用15种书写系统的25种语言上微调了多语言自监督预训练模型Wav2Vec2-XLSR-53,并比较了它们的ASR准确率、字素数量、一元字素熵、语标性(书写系统中编码了多少词/语素级信息)以及音素数量。结果表明,正字法复杂性与较低的ASR准确率显著相关,而音系复杂性则没有显著相关性。
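
其中“一元字素熵”可由字符的一元分布直接计算。下面是一个简化示意(以Unicode字符近似字素并忽略空白,真实的字素切分会更复杂):

```python
import math
from collections import Counter

def unigram_grapheme_entropy(text):
    """一元字素熵的简化计算:熵越高,
    书写系统的字符分布越分散(以字符近似字素)。"""
    chars = [c for c in text if not c.isspace()]
    counts = Counter(chars)
    total = len(chars)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

例如仅由两个等频字符组成的文本,其一元字素熵恰为1比特。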

[NLP-20] Orthogonality and isotropy of speaker and phonetic information in self-supervised speech representations
[NLP-20] 自监督语音表示中说话人与语音信息的正交性和各向同性

链接: https://arxiv.org/abs/2406.09200
作者: Mukhtar Mohamed,Oli Danyi Liu,Hao Tang,Sharon Goldwater
关键词: downstream speech technologies, benefit downstream speech, hugely benefit downstream, Self-supervised speech representations, Cumulative Residual Variance
中文关键词: 下游语音技术,使下游语音受益,使下游受益,自我监督语音表示,累积剩余方差
类目: Computation and Language (cs.CL)
备注: Accepted to Interspeech

Abstract:Self-supervised speech representations can hugely benefit downstream speech technologies, yet the properties that make them useful are still poorly understood. Two candidate properties related to the geometry of the representation space have been hypothesized to correlate well with downstream tasks: (1) the degree of orthogonality between the subspaces spanned by the speaker centroids and phone centroids, and (2) the isotropy of the space, i.e., the degree to which all dimensions are effectively utilized. To study them, we introduce a new measure, Cumulative Residual Variance (CRV), which can be used to assess both properties. Using linear classifiers for speaker and phone ID to probe the representations of six different self-supervised models and two untrained baselines, we ask whether either orthogonality or isotropy correlate with linear probing accuracy. We find that both measures correlate with phonetic probing accuracy, though our results on isotropy are more nuanced.
摘要:自监督语音表示可以极大地造福下游语音技术,但人们对使其有用的属性仍知之甚少。与表示空间几何结构相关的两个候选属性被假设与下游任务密切相关:(1)由说话人质心和音素质心张成的子空间之间的正交程度,以及(2)空间的各向同性,即所有维度得到有效利用的程度。为了研究它们,我们引入了一种新的度量——累积残差方差(CRV),可用于评估这两种属性。我们使用说话人ID和音素ID的线性分类器来探测六个不同自监督模型和两个未训练基线的表示,考察正交性或各向同性是否与线性探测准确率相关。我们发现这两种度量都与语音探测准确率相关,尽管关于各向同性的结果更为微妙。
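
论文中CRV的精确定义请见原文;作为直观参考,下面用PCA残差方差曲线给出衡量表示空间各向同性的一个常见近似(仅为示意性的假设实现,并非论文中的CRV):

```python
import numpy as np

def residual_variance_curve(X):
    """对表示矩阵 X(样本×维度)做PCA,返回去掉前 k 个主成分后
    剩余方差占比的序列;分布越接近各向同性,该曲线下降越慢。"""
    Xc = X - X.mean(axis=0)
    eig = np.linalg.svd(Xc, compute_uv=False) ** 2  # 各主成分解释的方差
    eig /= eig.sum()
    return 1.0 - np.cumsum(eig)
```

对完全各向同性的二维数据,去掉第一主成分后仍剩一半方差;而方差集中在单一方向的数据,剩余方差立即趋于零。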

[NLP-21] ReMI: A Dataset for Reasoning with Multiple Images
[NLP-21] ReMI:用于多个图像推理的数据集

链接: https://arxiv.org/abs/2406.09175
作者: Mehran Kazemi,Nishanth Dikkala,Ankit Anand,Petar Devic,Ishita Dasgupta,Fangyu Liu,Bahare Fatemi,Pranjal Awasthi,Dee Guo,Sreenivas Gollapudi,Ahmed Qureshi
关键词: large language models, continuous advancement, advancement of large, large language, essential to create
中文关键词: 大型语言模型,持续进步,大型语言的进步,创建必不可少
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

Abstract:With the continuous advancement of large language models (LLMs), it is essential to create new benchmarks to effectively evaluate their expanding capabilities and identify areas for improvement. This work focuses on multi-image reasoning, an emerging capability in state-of-the-art LLMs. We introduce ReMI, a dataset designed to assess LLMs’ ability to Reason with Multiple Images. This dataset encompasses a diverse range of tasks, spanning various reasoning domains such as math, physics, logic, code, table/chart understanding, and spatial and temporal reasoning. It also covers a broad spectrum of characteristics found in multi-image reasoning scenarios. We have benchmarked several cutting-edge LLMs using ReMI and found a substantial gap between their performance and human-level proficiency. This highlights the challenges in multi-image reasoning and the need for further research. Our analysis also reveals the strengths and weaknesses of different models, shedding light on the types of reasoning that are currently attainable and areas where future models require improvement. To foster further research in this area, we are releasing ReMI publicly: this https URL.
摘要:随着大型语言模型(LLM)的不断发展,有必要建立新的基准来有效评估其不断扩展的能力并确定需要改进的领域。这项工作聚焦多图像推理,这是最先进LLM中的一项新兴能力。我们介绍了ReMI,一个旨在评估LLM多图像推理能力的数据集。该数据集包含多样的任务,跨越数学、物理、逻辑、代码、表格/图表理解以及空间和时间推理等不同的推理领域,还涵盖了多图像推理场景中的广泛特征。我们使用ReMI对几个前沿LLM进行了基准测试,发现它们的表现与人类水平之间存在很大差距。这突显了多图像推理的挑战和进一步研究的必要性。我们的分析还揭示了不同模型的优势和劣势,阐明了目前可以实现的推理类型以及未来模型需要改进的领域。为促进该领域的进一步研究,我们公开发布ReMI:此HTTPS URL。

[NLP-22] Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning
[NLP-22] Test of Time:评估LLM时态推理能力的基准

链接: https://arxiv.org/abs/2406.09170
作者: Bahare Fatemi,Mehran Kazemi,Anton Tsitsulin,Karishma Malkan,Jinyeong Yim,John Palowitch,Sungyong Seo,Jonathan Halcrow,Bryan Perozzi
关键词: Large language models, Large language, remarkable reasoning capabilities, complex temporal logic, showcased remarkable reasoning
中文关键词: 大型语言模型,大型语言,非凡的推理能力,复杂的时态逻辑,展现了非凡的推理能力
类目: Computation and Language (cs.CL)
备注:

Abstract:Large language models (LLMs) have showcased remarkable reasoning capabilities, yet they remain susceptible to errors, particularly in temporal reasoning tasks involving complex temporal logic. Existing research has explored LLM performance on temporal reasoning using diverse datasets and benchmarks. However, these studies often rely on real-world data that LLMs may have encountered during pre-training or employ anonymization techniques that can inadvertently introduce factual inconsistencies. In this work, we address these limitations by introducing novel synthetic datasets specifically designed to assess LLM temporal reasoning abilities in various scenarios. The diversity of question types across these datasets enables systematic investigation into the impact of the problem structure, size, question type, fact order, and other factors on LLM performance. Our findings provide valuable insights into the strengths and weaknesses of current LLMs in temporal reasoning tasks. To foster further research in this area, we are open-sourcing the datasets and evaluation framework used in our experiments: this https URL.
摘要:大型语言模型(LLM)已展现出显著的推理能力,但它们仍然容易出错,特别是在涉及复杂时态逻辑的时态推理任务中。现有研究已使用不同的数据集和基准来探索LLM在时态推理上的表现。然而,这些研究往往依赖LLM在预训练期间可能已接触过的真实世界数据,或使用可能无意中引入事实不一致的匿名化技术。在这项工作中,我们通过引入专为评估各种场景下LLM时态推理能力而设计的新型合成数据集来解决这些局限。这些数据集中问题类型的多样性,使我们能够系统研究问题结构、规模、问题类型、事实顺序等因素对LLM性能的影响。我们的发现为当前LLM在时态推理任务中的优势和劣势提供了有价值的见解。为促进该领域的进一步研究,我们开源了实验中使用的数据集和评估框架:此HTTPS URL。

[NLP-23] DefAn: Definitive Answer Dataset for LLMs Hallucination Evaluation
[NLP-23] DefAn:LLM幻觉评估的有效答案数据集

链接: https://arxiv.org/abs/2406.09155
作者: A B M Ashikur Rahman,Saeed Anwar,Muhammad Usman,Ajmal Mian
关键词: Large Language Models, Large Language, Language Models, demonstrated remarkable capabilities, daily life applications
中文关键词: 大型语言模型,大型语言,语言模型,展示了非凡的能力,日常生活应用
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities, revolutionizing the integration of AI in daily life applications. However, they are prone to hallucinations, generating claims that contradict established facts, deviating from prompts, and producing inconsistent responses when the same prompt is presented multiple times. Addressing these issues is challenging due to the lack of comprehensive and easily assessable benchmark datasets. Most existing datasets are small and rely on multiple-choice questions, which are inadequate for evaluating the generative prowess of LLMs. To measure hallucination in LLMs, this paper introduces a comprehensive benchmark dataset comprising over 75,000 prompts across eight domains. These prompts are designed to elicit definitive, concise, and informative answers. The dataset is divided into two segments: one publicly available for testing and assessing LLM performance and a hidden segment for benchmarking various LLMs. In our experiments, we tested six LLMs (GPT-3.5, LLama 2, LLama 3, Gemini, Mixtral, and Zephyr), revealing that overall factual hallucination ranges from 59% to 82% on the public dataset and 57% to 76% in the hidden benchmark. Prompt misalignment hallucination ranges from 6% to 95% in the public dataset and 17% to 94% in the hidden counterpart. Average consistency ranges from 21% to 61% and 22% to 63%, respectively. Domain-wise analysis shows that LLM performance significantly deteriorates when asked for specific numeric information while performing moderately with person, location, and date queries. Our dataset demonstrates its efficacy and serves as a comprehensive benchmark for LLM performance evaluation. Our dataset and LLMs responses are available at this https URL.
摘要:大型语言模型(LLM)已经展现出卓越的能力,彻底改变了AI在日常生活应用中的融合。然而,它们容易产生幻觉,生成与既定事实相矛盾的主张,偏离提示,并在同一提示被多次呈现时产生不一致的回答。由于缺乏全面且易于评估的基准数据集,解决这些问题颇具挑战性。大多数现有数据集规模较小且依赖多项选择题,不足以评估LLM的生成能力。为了测量LLM中的幻觉,本文引入了一个全面的基准数据集,包含横跨八个领域的超过75,000个提示。这些提示旨在引出明确、简洁且信息丰富的答案。数据集分为两部分:一部分公开,用于测试和评估LLM性能;另一部分隐藏,用于对各种LLM进行基准测试。在实验中,我们测试了六个LLM(GPT-3.5、LLama 2、LLama 3、Gemini、Mixtral和Zephyr),结果显示,在公开数据集上总体事实性幻觉的比例为59%至82%,在隐藏基准中为57%至76%。提示错位幻觉在公开数据集中为6%至95%,在隐藏数据集中为17%至94%。平均一致性分别为21%至61%和22%至63%。按领域的分析表明,当被要求提供特定数字信息时,LLM的性能显著下降,而对人物、地点和日期的查询表现一般。我们的数据集证明了其有效性,可作为LLM性能评估的综合基准。我们的数据集和LLM回答可在此HTTPS URL获取。

[NLP-24] Diffusion Gaussian Mixture Audio Denoise
[NLP-24] 扩散高斯混合音频降噪

链接: https://arxiv.org/abs/2406.09154
作者: Pu Wang,Junhui Li,Jialu Li,Liangdong Guo,Youshan Zhang
关键词: Recent diffusion models, achieved promising performances, Gaussian mixture model, Recent diffusion, Gaussian mixture
中文关键词: 最近的扩散模型,取得了有希望的性能,高斯混合模型,最近的扩散,高斯混合
类目: ound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: INTERSPEECH 2024

Abstract:Recent diffusion models have achieved promising performances in audio-denoising tasks. The unique property of the reverse process could recover clean signals. However, the distribution of real-world noises does not comply with a single Gaussian distribution and is even unknown. The sampling of Gaussian noise conditions limits its application scenarios. To overcome these challenges, we propose a DiffGMM model, a denoising model based on the diffusion and Gaussian mixture models. We employ the reverse process to estimate parameters for the Gaussian mixture model. Given a noisy audio signal, we first apply a 1D-U-Net to extract features and train linear layers to estimate parameters for the Gaussian mixture model, and we approximate the real noise distributions. The noisy signal is continuously subtracted from the estimated noise to output clean audio signals. Extensive experimental results demonstrate that the proposed DiffGMM model achieves state-of-the-art performance.
摘要:最近的扩散模型在音频去噪任务中取得了令人鼓舞的性能。反向过程的独特性质可以恢复干净的信号。然而,现实世界噪声的分布并不符合单一高斯分布,甚至是未知的。高斯噪声条件的采样限制了其应用场景。为了克服这些挑战,我们提出了DiffGMM模型,一种基于扩散模型和高斯混合模型的去噪模型。我们利用反向过程来估计高斯混合模型的参数。给定含噪音频信号,我们首先应用1D-U-Net提取特征,并训练线性层来估计高斯混合模型的参数,以逼近真实的噪声分布。然后从含噪信号中连续减去估计的噪声,输出干净的音频信号。大量实验结果表明,所提出的DiffGMM模型达到了最先进的性能。

[NLP-25] LASER: Learning by Aligning Self-supervised Representations of Speech for Improving Content-related Tasks
[NLP-25] 激光:通过调整语音的自我监督表示来学习以改善内容相关任务

链接: https://arxiv.org/abs/2406.09153
作者: Amit Meghanani,Thomas Hain
关键词: full-stack speech processing, Aligning Self-supervised Representations, SSL, speech processing, speech
中文关键词: 全栈语音处理、对齐自监督表示、SSL、语音处理、语音
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Accepted at Interspeech 2024

Abstract:Self-supervised learning (SSL)-based speech models are extensively used for full-stack speech processing. However, it has been observed that improving SSL-based speech representations using unlabeled speech for content-related tasks is challenging and computationally expensive. Recent attempts have been made to address this issue with cost-effective self-supervised fine-tuning (SSFT) approaches. Continuing in this direction, a cost-effective SSFT method named “LASER: Learning by Aligning Self-supervised Representations” is presented. LASER is based on the soft-DTW alignment loss with temporal regularisation term. Experiments are conducted with HuBERT and WavLM models and evaluated on the SUPERB benchmark for two content-related tasks: automatic speech recognition (ASR) and phoneme recognition (PR). A relative improvement of 3.7% and 8.2% for HuBERT, and 4.1% and 11.7% for WavLM are observed, for the ASR and PR tasks respectively, with only 3 hours of fine-tuning on a single GPU.
摘要:基于自监督学习(SSL)的语音模型被广泛用于全栈语音处理。然而,已有观察表明,针对内容相关任务利用未标注语音来改进基于SSL的语音表示既有挑战性,计算代价也很高。最近的一些工作尝试用高性价比的自监督微调(SSFT)方法来解决这一问题。沿着这一方向,本文提出了一种高性价比的SSFT方法“LASER:通过对齐自监督表示进行学习”。LASER基于带时间正则项的soft-DTW对齐损失。实验在HuBERT和WavLM模型上进行,并在SUPERB基准的两个内容相关任务——自动语音识别(ASR)和音素识别(PR)上进行评估。仅在单个GPU上微调3小时,在ASR和PR任务上,HuBERT分别获得3.7%和8.2%的相对提升,WavLM分别获得4.1%和11.7%的相对提升。
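
LASER所用对齐损失的核心是soft-DTW递推,其中softmin_γ(a,b,c) = -γ·log(e^(-a/γ)+e^(-b/γ)+e^(-c/γ))。下面是该递推的朴素numpy示意(采用欧氏平方距离,无批处理与梯度,论文中的时间正则项未包含):

```python
import numpy as np

def soft_dtw(x, y, gamma=0.1):
    """soft-DTW 距离的极简实现:用可微的 softmin
    替代经典 DTW 递推中的硬 min。"""
    def softmin(a, b, c):
        vals = np.array([a, b, c]) / -gamma
        m = vals.max()
        return -gamma * (m + np.log(np.exp(vals - m).sum()))
    n, m_ = len(x), len(y)
    R = np.full((n + 1, m_ + 1), np.inf)
    R[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m_ + 1):
            d = (x[i - 1] - y[j - 1]) ** 2          # 局部代价
            R[i, j] = d + softmin(R[i - 1, j], R[i, j - 1], R[i - 1, j - 1])
    return R[n, m_]
```

当γ趋近于0时,soft-DTW退化为经典DTW;两条相同序列的距离接近0,差异大的序列对距离显著增大。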

[NLP-26] Investigating the translation capabilities of Large Language Models trained on parallel data only
[NLP-26] 调查仅在并行数据上训练的大型语言模型的翻译能力

链接: https://arxiv.org/abs/2406.09140
作者: Javier García Gilabert,Carlos Escolano,Aleix Sant Savall,Francesca De Luca Fornaciari,Audrey Mash,Xixian Liao,Maite Melero
关键词: Natural Language Processing, Large Language Models, including Machine Translation, demonstrated exceptional proficiency, including Machine
中文关键词: 自然语言处理、大型语言模型(包括机器翻译)表现出出色的熟练程度,包括机器翻译
类目: Computation and Language (cs.CL)
备注: We release our code at: this https URL

Abstract:In recent years, Large Language Models (LLMs) have demonstrated exceptional proficiency across a broad spectrum of Natural Language Processing (NLP) tasks, including Machine Translation. However, previous methods predominantly relied on iterative processes such as instruction fine-tuning or continual pre-training, leaving unexplored the challenges of training LLMs solely on parallel data. In this work, we introduce PLUME (Parallel Language Model), a collection of three 2B LLMs featuring varying vocabulary sizes (32k, 128k, and 256k) trained exclusively on Catalan-centric parallel examples. These models perform comparably to previous encoder-decoder architectures on 16 supervised translation directions and 56 zero-shot ones. Utilizing this set of models, we conduct a thorough investigation into the translation capabilities of LLMs, probing their performance, the impact of the different elements of the prompt, and their cross-lingual representation space.
摘要:近年来,大型语言模型(LLM)在包括机器翻译在内的广泛自然语言处理(NLP)任务中表现出色。然而,以往的方法主要依赖指令微调或持续预训练等迭代过程,尚未探索仅用平行数据训练LLM的挑战。在这项工作中,我们引入了PLUME(Parallel Language Model),这是三个具有不同词表规模(32k、128k和256k)的2B参数LLM,完全基于以加泰罗尼亚语为中心的平行语料训练。这些模型在16个有监督翻译方向和56个零样本方向上的表现与以往的编码器-解码器架构相当。利用这组模型,我们对LLM的翻译能力进行了深入研究,考察了它们的性能、提示中不同元素的影响以及它们的跨语言表示空间。

[NLP-27] Leveraging Explicit Reasoning for Inference Integration in Commonsense-Augmented Dialogue Models
[NLP-27] 利用显式推理在常识增强对话模型中进行推理集成

链接: https://arxiv.org/abs/2406.09138
作者: Sarah E. Finch,Jinho D. Choi
关键词: grasp social commonsense, human users, grasp social, understand and respond, respond effectively
中文关键词: 掌握社会常识,人类用户,掌握社会,理解并响应,有效响应
类目: Computation and Language (cs.CL)
备注:

Abstract:Open-domain dialogue systems need to grasp social commonsense to understand and respond effectively to human users. Commonsense-augmented dialogue models have been proposed that aim to infer commonsense knowledge from dialogue contexts in order to improve response quality. However, existing approaches to commonsense-augmented dialogue rely on implicit reasoning to integrate commonsense inferences during response generation. In this study, we explore the impact of explicit reasoning against implicit reasoning over commonsense for dialogue response generation. Our findings demonstrate that separating commonsense reasoning into explicit steps for generating, selecting, and integrating commonsense into responses leads to better dialogue interactions, improving naturalness, engagement, specificity, and overall quality. Subsequent analyses of these findings unveil insights into the effectiveness of various types of commonsense in generating responses and the particular response traits enhanced through explicit reasoning for commonsense integration. Our work advances research in open-domain dialogue by achieving a new state-of-the-art in commonsense-augmented response generation.
摘要:开放领域对话系统需要掌握社会常识,才能有效地理解和响应人类用户。已经提出了常识增强的对话模型,其目的是从对话上下文中推断常识知识,以提高响应质量。然而,现有的常识增强对话方法依赖于隐含推理来在响应生成过程中整合常识推理。在本研究中,我们探讨了外显推理与内隐推理相比常识推理对对话反应生成的影响。我们的发现表明,将常识推理分成明确的步骤来生成、选择常识并将其整合到回答中会导致更好的对话互动,提高自然性、参与性、专一性和整体质量。随后对这些发现的分析揭示了各种类型常识在产生反应方面的有效性,以及通过对常识整合的明确推理而增强的特殊反应特征。我们的工作通过在常识增强的响应生成方面实现了一种新的艺术状态来推进开放领域对话的研究。

[NLP-28] Chain of Preference Optimization: Improving Chain-of-Thought Reasoning in LLMs
[NLP-28] 偏好链优化:改进LLM中的思维链推理

链接: https://arxiv.org/abs/2406.09136
作者: Xuan Zhang,Chao Du,Tianyu Pang,Qian Liu,Wei Gao,Min Lin
关键词: large language models, enabled large language, generate explicit logical, explicit logical reasoning, language models
中文关键词: 大型语言模型,启用大型语言,生成显式逻辑、显式逻辑推理、语言模型
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

Abstract:The recent development of chain-of-thought (CoT) decoding has enabled large language models (LLMs) to generate explicit logical reasoning paths for complex problem-solving. However, research indicates that these paths are not always deliberate and optimal. The tree-of-thought (ToT) method employs tree-searching to extensively explore the reasoning space and find better reasoning paths that CoT decoding might overlook. This deliberation, however, comes at the cost of significantly increased inference complexity. In this work, we demonstrate that fine-tuning LLMs leveraging the search tree constructed by ToT allows CoT to achieve similar or better performance, thereby avoiding the substantial inference burden. This is achieved through Chain of Preference Optimization (CPO), where LLMs are fine-tuned to align each step of the CoT reasoning paths with those of ToT using the inherent preference information in the tree-search process. Extensive experimental results show that CPO significantly improves LLM performance in solving a variety of complex problems, including question answering, fact verification, and arithmetic reasoning, demonstrating its effectiveness. Our code is available at this https URL.
摘要:思维链(CoT)解码的最新发展使大型语言模型(LLM)能够为复杂问题求解生成显式的逻辑推理路径。然而,研究表明这些路径并不总是深思熟虑和最优的。思维树(ToT)方法通过树搜索来广泛探索推理空间,找到CoT解码可能忽略的更好的推理路径。然而,这种深思熟虑是以显著增加的推理复杂度为代价的。在这项工作中,我们证明了利用ToT构造的搜索树对LLM进行微调,可以让CoT达到相似或更好的性能,从而避免巨大的推理负担。这是通过偏好链优化(CPO)实现的:利用树搜索过程中固有的偏好信息,对LLM进行微调,使CoT推理路径的每一步与ToT的路径对齐。大量实验结果表明,CPO显著提升了LLM在问答、事实验证和算术推理等多种复杂问题上的性能,证明了其有效性。我们的代码可在此HTTPS URL获取。
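
CPO的关键在于利用树搜索中各候选步骤蕴含的偏好信息。下面用一个高度简化的草图示意如何从带评分的候选步骤构造DPO式训练所需的偏好对(评分来源与配对策略均为假设,并非论文的实际流程):

```python
def preference_pairs_from_tree(candidates):
    """从同一推理步的候选(树搜索得到, 含评估分数)构造偏好对,
    形式为 (偏好的步骤, 被拒绝的步骤),供 DPO 式训练使用。
    candidates: [(text, score), ...]"""
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    best = ranked[0]
    return [(best[0], other[0]) for other in ranked[1:]]
```

即把评分最高的候选步骤作为被偏好项,与其余候选逐一配成偏好对。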

[NLP-29] RH-SQL: Refined Schema and Hardness Prompt for Text-to-SQL
[NLP-29] RH-SQL:文本转SQL的细化模式和硬度提示

链接: https://arxiv.org/abs/2406.09133
作者: Jiawen Yi,Guo Chen,Zixiang Shen
关键词: converts natural language, natural language queries, technology that converts, converts natural, query language SQL
中文关键词: 转换自然语言、自然语言查询、转换、转换自然、查询语言SQL的技术
类目: Computation and Language (cs.CL)
备注: 4 pages, 2 figures, 2024 6th International Conference on Electronic Engineering and Informatics (EEI 2024)

Abstract:Text-to-SQL is a technology that converts natural language queries into the structured query language SQL. A novel research approach that has recently gained attention focuses on methods based on the complexity of SQL queries, achieving notable performance improvements. However, existing methods entail significant storage and training costs, which hampers their practical application. To address this issue, this paper introduces a method for Text-to-SQL based on Refined Schema and Hardness Prompt. By filtering out low-relevance schema information with a refined schema and identifying query hardness through a Language Model (LM) to form prompts, this method reduces storage and training costs while maintaining performance. It’s worth mentioning that this method is applicable to any sequence-to-sequence (seq2seq) LM. Our experiments on the Spider dataset, specifically with large-scale LMs, achieved an exceptional Execution accuracy (EX) of 82.6%, demonstrating the effectiveness and greater suitability of our method for real-world applications.
摘要:Text-to-SQL是一种将自然语言查询转换为结构化查询语言SQL的技术。最近受到关注的一类新研究方法基于SQL查询的复杂度,取得了显著的性能改进。然而,现有方法需要大量的存储和训练成本,阻碍了它们的实际应用。针对这一问题,本文提出了一种基于精化模式(Refined Schema)和难度提示(Hardness Prompt)的Text-to-SQL方法。该方法通过精化模式过滤掉低相关性的模式信息,并通过语言模型(LM)识别查询难度以构造提示,从而在保持性能的同时降低了存储和训练成本。值得一提的是,该方法适用于任何序列到序列(seq2seq)LM。我们在Spider数据集上的实验(特别是使用大规模LM)取得了82.6%的出色执行准确率(EX),证明了我们方法的有效性及其对实际应用的更强适用性。
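
“精化模式”的思路是先滤除与问题低相关的模式信息,再交给模型生成SQL。下面用词面重叠给出一个假设性的粗糙示意(实际工作更可能用模型打分;表名、列名均为虚构示例):

```python
def refine_schema(question, schema, top_k=2):
    """按问题与表名/列名词片的重叠度对表排序,
    仅保留最相关的 top_k 个表。schema: {table: [columns]}"""
    q = set(question.lower().split())
    def score(table, cols):
        toks = set(table.lower().split("_"))
        toks |= {t for c in cols for t in c.lower().split("_")}
        return len(q & toks)
    ranked = sorted(schema.items(), key=lambda kv: score(*kv), reverse=True)
    return dict(ranked[:top_k])
```

与问题无词面交集的表被过滤掉,缩短了输入模型的模式描述,从而降低开销。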

[NLP-30] CoastTerm: a Corpus for Multidisciplinary Term Extraction in Coastal Scientific Literature
[NLP-30] CoastTerm:沿海科学文献多学科术语抽取语料库

链接: https://arxiv.org/abs/2406.09128
作者: Julien Delaunay,Hanh Thi Hong Tran,Carlos-Emiliano González-Gallardo,Georgeta Bordea,Mathilde Ducos,Nicolas Sidere,Antoine Doucet,Senja Pollak,Olivier De Viron
关键词: environmental protection policies, formulate effective environmental, effective environmental protection, coastal areas, Automatic Term Extraction
中文关键词: 环境保护政策,制定有效的环境,有效的环境保护,沿海地区,自动术语提取
类目: Computation and Language (cs.CL)
备注:

Abstract:The growing impact of climate change on coastal areas, particularly active but fragile regions, necessitates collaboration among diverse stakeholders and disciplines to formulate effective environmental protection policies. We introduce a novel specialized corpus comprising 2,491 sentences from 410 scientific abstracts concerning coastal areas, for the Automatic Term Extraction (ATE) and Classification (ATC) tasks. Inspired by the ARDI framework, focused on the identification of Actors, Resources, Dynamics and Interactions, we automatically extract domain terms and their distinct roles in the functioning of coastal systems by leveraging monolingual and multilingual transformer models. The evaluation demonstrates consistent results, achieving an F1 score of approximately 80% for automated term extraction and F1 of 70% for extracting terms and their labels. These findings are promising and signify an initial step towards the development of a specialized Knowledge Base dedicated to coastal areas.
摘要:气候变化对沿海地区,特别是活跃而脆弱的地区的影响越来越大,需要不同利益相关方和学科之间开展合作,以制定有效的环境保护政策。我们介绍了一个新的专门语料库,包含来自410篇沿海地区科学摘要的2,491个句子,用于自动术语提取(ATE)和术语分类(ATC)任务。受专注于识别参与者、资源、动态和交互的ARDI框架启发,我们利用单语和多语Transformer模型自动提取领域术语及其在沿海系统运作中的不同角色。评估结果一致:自动术语提取的F1得分约为80%,术语及其标签联合提取的F1得分约为70%。这些结果令人鼓舞,标志着朝着建立专门面向沿海地区的知识库迈出了第一步。

[NLP-31] INS-MMBench: A Comprehensive Benchmark for Evaluating LVLMs Performance in Insurance
[NLP-31] INS-MMBench:评估LVLM在保险领域表现的综合基准

链接: https://arxiv.org/abs/2406.09105
作者: Chenwei Lin,Hanjia Lyu,Xian Xu,Jiebo Luo
关键词: Large Vision-Language Models, Large Vision-Language, shown promising potential, insurance, insurance domain
中文关键词: 大型视觉语言模型,大型视觉语言,显示出有前途的潜力,保险,保险领域
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large Vision-Language Models (LVLMs) have demonstrated outstanding performance in various general multimodal applications such as image recognition and visual reasoning, and have also shown promising potential in specialized domains. However, the application potential of LVLMs in the insurance domain-characterized by rich application scenarios and abundant multimodal data-has not been effectively explored. There is no systematic review of multimodal tasks in the insurance domain, nor a benchmark specifically designed to evaluate the capabilities of LVLMs in insurance. This gap hinders the development of LVLMs within the insurance domain. In this paper, we systematically review and distill multimodal tasks for four representative types of insurance: auto insurance, property insurance, health insurance, and agricultural insurance. We propose INS-MMBench, the first comprehensive LVLMs benchmark tailored for the insurance domain. INS-MMBench comprises a total of 2.2K thoroughly designed multiple-choice questions, covering 12 meta-tasks and 22 fundamental tasks. Furthermore, we evaluate multiple representative LVLMs, including closed-source models such as GPT-4o and open-source models like BLIP-2. This evaluation not only validates the effectiveness of our benchmark but also provides an in-depth performance analysis of current LVLMs on various multimodal tasks in the insurance domain. We hope that INS-MMBench will facilitate the further application of LVLMs in the insurance domain and inspire interdisciplinary development. Our dataset and evaluation code are available at this https URL.
摘要:大型视觉语言模型(LVLM)在图像识别、视觉推理等多种通用多模态应用中表现出优异的性能,在专业领域也显示出巨大的潜力。然而,保险领域以丰富的应用场景和海量多模态数据为特征,LVLM在该领域的应用潜力尚未得到有效挖掘。目前既没有对保险领域多模态任务的系统梳理,也没有专门用于评估LVLM保险能力的基准,这一空白阻碍了LVLM在保险领域的发展。本文系统地梳理和提炼了汽车保险、财产保险、健康保险和农业保险四种代表性险种的多模态任务,并提出了INS-MMBench,这是首个为保险领域量身定制的综合性LVLM基准。INS-MMBench包含2.2K道精心设计的多项选择题,涵盖12个元任务和22个基本任务。此外,我们评估了多个具有代表性的LVLM,包括GPT-4o等闭源模型和BLIP-2等开源模型。该评估不仅验证了我们基准的有效性,还对当前LVLM在保险领域各种多模态任务上的表现进行了深入分析。我们希望INS-MMBench能促进LVLM在保险领域的进一步应用,并推动跨学科发展。我们的数据集和评估代码可在此https URL获得。

[NLP-32] Chain-of-Thought (CoT) prompting strategies for medical error detection and correction
[NLP-32] 用于医疗错误检测和纠正的思维链(CoT)提示策略

链接: https://arxiv.org/abs/2406.09103
作者: Zhaolong Wu,Abul Hasan,Jinge Wu,Yunsoo Kim,Jason P.Y. Cheung,Teng Zhang,Honghan Wu
关键词: correcting medical errors, paper describes, automatically detecting, detecting and correcting, correcting medical
中文关键词: 纠正医疗错误,论文描述,自动检测,检测和纠正,纠正医疗
类目: Computation and Language (cs.CL)
备注: accepted as NAACL workshop

点击查看摘要

Abstract:This paper describes our submission to the MEDIQA-CORR 2024 shared task for automatically detecting and correcting medical errors in clinical notes. We report results for three methods of few-shot In-Context Learning (ICL) augmented with Chain-of-Thought (CoT) and reason prompts using a large language model (LLM). In the first method, we manually analyse a subset of train and validation dataset to infer three CoT prompts by examining error types in the clinical notes. In the second method, we utilise the training dataset to prompt the LLM to deduce reasons about their correctness or incorrectness. The constructed CoTs and reasons are then augmented with ICL examples to solve the tasks of error detection, span identification, and error correction. Finally, we combine the two methods using a rule-based ensemble method. Across the three sub-tasks, our ensemble method achieves a ranking of 3rd for both sub-task 1 and 2, while securing 7th place in sub-task 3 among all submissions.
摘要:本文描述了我们为MEDIQA-CORR 2024共享任务(自动检测和纠正临床笔记中的医疗错误)提交的系统。我们报告了使用大型语言模型(LLM)、以思维链(CoT)和原因提示增强的三种少样本上下文学习(ICL)方法的结果。在第一种方法中,我们手动分析训练集和验证集的一个子集,通过检查临床笔记中的错误类型归纳出三个CoT提示。在第二种方法中,我们利用训练集提示LLM推断笔记正确或不正确的原因。随后将构建的CoT和原因与ICL示例相结合,以解决错误检测、片段定位和错误纠正任务。最后,我们用基于规则的集成方法将两种方法结合起来。在三个子任务中,我们的集成方法在子任务1和2上均排名第3,在子任务3上在所有提交中排名第7。
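摘要描述的"CoT 指令 + 少样本 ICL 示例"的提示组装过程大致如下。以下为示意代码:字段名、提示格式与演示样例均为本文假设,并非该参赛系统的真实实现:

```python
def build_icl_prompt(cot_instruction, demonstrations, clinical_note):
    # Few-shot ICL prompt: a CoT instruction, then (note, reason, correction)
    # demonstrations, then the query note ending at "Reason:" so the LLM
    # writes its reasoning before emitting the correction.
    parts = [cot_instruction]
    for demo in demonstrations:
        parts.append(f"Note: {demo['note']}\n"
                     f"Reason: {demo['reason']}\n"
                     f"Correction: {demo['correction']}")
    parts.append(f"Note: {clinical_note}\nReason:")
    return "\n\n".join(parts)

prompt = build_icl_prompt(
    "Find the medical error in the note, explain why it is wrong, then correct it.",
    [{"note": "Patient given 500mg paracetamol IV for fever.",
      "reason": "Route is wrong: the order specified oral administration.",
      "correction": "Patient given 500mg paracetamol orally for fever."}],
    "BP 120/80, started on insulin for hypertension.",
)
print(prompt)
```

论文中三个 CoT 提示由人工分析错误类型归纳得到,原因(reason)则由 LLM 在训练集上生成后再拼入提示。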

[NLP-33] SciKnowEval: Evaluating Multi-level Scientific Knowledge of Large Language Models
[NLP-33] SciKnowEval:评估大型语言模型的多层次科学知识

链接: https://arxiv.org/abs/2406.09098
作者: Kehua Feng,Keyan Ding,Weijie Wang,Xiang Zhuang,Zeyuan Wang,Ming Qin,Yu Zhao,Jianhua Yao,Qiang Zhang,Huajun Chen
关键词: Large Language Models, Language Models, Large Language, necessitates advanced benchmarks, advanced benchmarks capable
中文关键词: 大型语言模型,语言模型,大型语言,需要高级基准,高级基准能够
类目: Computation and Language (cs.CL)
备注: 48 pages, 2 figures

点击查看摘要

Abstract:The burgeoning utilization of Large Language Models (LLMs) in scientific research necessitates advanced benchmarks capable of evaluating their understanding and application of scientific knowledge comprehensively. To address this need, we introduce the SciKnowEval benchmark, a novel framework that systematically evaluates LLMs across five progressive levels of scientific knowledge: studying extensively, inquiring earnestly, thinking profoundly, discerning clearly, and practicing assiduously. These levels aim to assess the breadth and depth of scientific knowledge in LLMs, including knowledge coverage, inquiry and exploration capabilities, reflection and reasoning abilities, ethic and safety considerations, as well as practice proficiency. Specifically, we take biology and chemistry as the two instances of SciKnowEval and construct a dataset encompassing 50K multi-level scientific problems and solutions. By leveraging this dataset, we benchmark 20 leading open-source and proprietary LLMs using zero-shot and few-shot prompting strategies. The results reveal that despite achieving state-of-the-art performance, the proprietary LLMs still have considerable room for improvement, particularly in addressing scientific computations and applications. We anticipate that SciKnowEval will establish a comprehensive standard for benchmarking LLMs in science research and discovery, and promote the development of LLMs that integrate scientific knowledge with strong safety awareness. The dataset and code are publicly available at this https URL .
摘要:大型语言模型(LLM)在科学研究中的应用日益广泛,需要能够全面评估其科学知识理解与应用能力的先进基准。为此,我们引入了SciKnowEval基准,这是一个新框架,从五个递进的科学知识层次系统地评估LLM:广泛学习、认真探究、深刻思考、洞察明辨和勤奋实践。这些层次旨在评估LLM科学知识的广度和深度,包括知识覆盖面、探究和探索能力、反思和推理能力、伦理与安全考量以及实践熟练程度。具体而言,我们以生物和化学作为SciKnowEval的两个实例,构建了一个包含5万个多层次科学问题及解答的数据集。利用该数据集,我们采用零样本和少样本提示策略对20个领先的开源和专有LLM进行了基准测试。结果表明,尽管达到了最先进的性能,专有LLM仍有相当大的改进空间,尤其是在处理科学计算和应用方面。我们预计SciKnowEval将为科学研究与发现中的LLM基准测试建立一个全面的标准,并推动兼具科学知识与强安全意识的LLM的发展。数据集和代码可在此https URL公开获取。

[NLP-34] Modeling Comparative Logical Relation with Contrastive Learning for Text Generation
[NLP-34] 用对比学习建模比较逻辑关系以生成文本

链接: https://arxiv.org/abs/2406.09095
作者: Yuhao Dan,Junfeng Tian,Jie Zhou,Ming Yan,Ji Zhang,Qin Chen,Liang He
关键词: comparative logical relations, classic natural language, comparative logical, logical relations, natural language generation
中文关键词: 比较逻辑关系,经典自然语言,比较逻辑,逻辑关系,自然语言生成
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Data-to-Text Generation (D2T), a classic natural language generation problem, aims at producing fluent descriptions for structured input data, such as a table. Existing D2T works mainly focus on describing the superficial associative relations among entities, while ignoring the deep comparative logical relations, such as A is better than B in a certain aspect with a corresponding opinion, which is quite common in our daily life. In this paper, we introduce a new D2T task named comparative logical relation generation (CLRG). Additionally, we propose a Comparative Logic (CoLo) based text generation method, which generates texts following specific comparative logical relations with contrastive learning. Specifically, we first construct various positive and negative samples by fine-grained perturbations in entities, aspects and opinions. Then, we perform contrastive learning in the encoder layer to have a better understanding of the comparative logical relations, and integrate it in the decoder layer to guide the model to correctly generate the relations. Noting the data scarcity problem, we construct a Chinese Comparative Logical Relation Dataset (CLRD), which is a high-quality human-annotated dataset and challenging for text generation with descriptions of multiple entities and annotations on their comparative logical relations. Extensive experiments show that our method achieves impressive performance in both automatic and human evaluations.
摘要:数据到文本生成(D2T)是一个经典的自然语言生成问题,其目标是为表格等结构化输入数据生成流畅的描述。现有的D2T研究主要集中于描述实体之间的表层关联关系,而忽略了深层的比较逻辑关系,例如"A在某一方面优于B"并附带相应观点,这在我们的日常生活中相当常见。本文引入了一种新的D2T任务:比较逻辑关系生成(CLRG)。此外,我们还提出了一种基于比较逻辑(CoLo)的文本生成方法,该方法通过对比学习生成符合特定比较逻辑关系的文本。具体地说,我们首先通过对实体、方面和观点的细粒度扰动来构造各种正负样本。然后,我们在编码器层进行对比学习,以更好地理解比较逻辑关系,并将其整合到解码器层,以引导模型正确地生成关系。针对数据稀缺问题,我们构建了中文比较逻辑关系数据集(CLRD),这是一个高质量的人工标注数据集,包含多个实体的描述及其比较逻辑关系标注,对文本生成具有挑战性。大量实验表明,我们的方法在自动和人工评估中都取得了令人印象深刻的性能。
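摘要中编码器层的对比学习通常采用 InfoNCE 式目标:把符合目标比较逻辑关系的正样本拉近,把实体/方面/观点被扰动的负样本推远。下面是一个基于相似度分数的最小示意(温度等超参数为假设值,并非论文原实现):

```python
import math

def info_nce_loss(sim_pos, sims_neg, temperature=0.1):
    # -log( exp(s+/t) / (exp(s+/t) + sum_i exp(s-_i/t)) ),
    # computed with the usual max-subtraction for numerical stability.
    logits = [sim_pos / temperature] + [s / temperature for s in sims_neg]
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - sim_pos / temperature

# A higher positive similarity yields a lower loss.
print(info_nce_loss(0.9, [0.1, 0.2]) < info_nce_loss(0.3, [0.1, 0.2]))  # → True
```

相似度一般取编码器表示之间的余弦相似度;正负样本即摘要所述对实体、方面、观点做细粒度扰动得到的文本对。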

[NLP-35] 3M: Multi-modal Multi-task Multi-teacher Learning for Game Event Detection
[NLP-35] 3M:用于游戏事件检测的多模态多任务多教师学习

链接: https://arxiv.org/abs/2406.09076
作者: Thye Shan Ng,Feiqi Cao,Soyeon Caren Han
关键词: Esports has rapidly, game event detection, rapidly emerged, global phenomenon, ever-expanding audience
中文关键词: 电子竞技发展迅速,游戏事件检测,迅速出现,全球现象,受众不断扩大
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Esports has rapidly emerged as a global phenomenon with an ever-expanding audience via platforms, like YouTube. Due to the inherent complexity nature of the game, it is challenging for newcomers to comprehend what the event entails. The chaotic nature of online chat, the fast-paced speech of the game commentator, and the game-specific user interface further compound the difficulty for users in comprehending the gameplay. To overcome these challenges, it is crucial to integrate the Multi-Modal (MM) information from the platform and understand the event. The paper introduces a new MM multi-teacher-based game event detection framework, with the ultimate goal of constructing a comprehensive framework that enhances the comprehension of the ongoing game situation. While conventional MM models typically prioritise aligning MM data through concurrent training towards a unified objective, our framework leverages multiple teachers trained independently on different tasks to accomplish the Game Event Detection. The experiment clearly shows the effectiveness of the proposed MM multi-teacher framework.
摘要:电子竞技已迅速成为一种全球现象,通过YouTube等平台的受众不断扩大。由于游戏本身固有的复杂性,新手很难理解赛事的内容。在线聊天的混乱性质、游戏解说员快节奏的解说,以及特定于游戏的用户界面,进一步加大了用户理解比赛过程的难度。要克服这些挑战,整合来自平台的多模态(MM)信息并理解赛事至关重要。本文介绍了一种新的基于MM多教师的游戏事件检测框架,最终目标是构建一个全面的框架,以增强对当前比赛局势的理解。传统的MM模型通常通过面向统一目标的同步训练来对齐MM数据,而我们的框架利用在不同任务上独立训练的多名教师来完成游戏事件检测。实验清楚地表明了所提出的MM多教师框架的有效性。

[NLP-36] Living in the Moment: Can Large Language Models Grasp Co-Temporal Reasoning?
[NLP-36] 活在当下:大型语言模型能否掌握共时推理?

链接: https://arxiv.org/abs/2406.09072
作者: Zhaochen Su,Juntao Li,Jun Zhang,Tong Zhu,Xiaoye Qu,Pan Zhou,Yan Bowen,Yu Cheng,Min zhang
关键词: large language models, comprehend the world, fundamental for large, large language, Temporal
中文关键词: 大型语言模型,理解世界,大型语言的基础,时态
类目: Computation and Language (cs.CL)
备注: This paper has been accepted to the ACL 2024 main conference

点击查看摘要

Abstract:Temporal reasoning is fundamental for large language models (LLMs) to comprehend the world. Current temporal reasoning datasets are limited to questions about single or isolated events, falling short in mirroring the realistic temporal characteristics involving concurrent nature and intricate temporal interconnections. In this paper, we introduce CoTempQA, a comprehensive co-temporal Question Answering (QA) benchmark containing four co-temporal scenarios (Equal, Overlap, During, Mix) with 4,748 samples for evaluating the co-temporal comprehension and reasoning abilities of LLMs. Our extensive experiments reveal a significant gap between the performance of current LLMs and human-level reasoning on CoTempQA tasks. Even when enhanced with Chain of Thought (CoT) methodologies, models consistently struggle with our task. In our preliminary exploration, we discovered that mathematical reasoning plays a significant role in handling co-temporal events and proposed a strategy to boost LLMs’ co-temporal reasoning from a mathematical perspective. We hope that our CoTempQA datasets will encourage further advancements in improving the co-temporal reasoning capabilities of LLMs. Our code is available at this https URL.
摘要:时间推理是大语言模型(LLM)理解世界的基础。当前的时间推理数据集仅限于关于单个或孤立事件的问题,不能反映涉及并发性质和复杂时间关联的现实时间特征。本文介绍了CoTempQA,这是一个包含四种共时场景(相等、重叠、期间、混合)、共4,748个样本的综合共时问答(QA)基准,用于评估LLM的共时理解和推理能力。我们的大量实验表明,在CoTempQA任务上,当前LLM的性能与人类水平的推理之间存在显著差距。即使使用思维链(CoT)方法进行增强,模型在我们的任务上也始终表现不佳。在初步探索中,我们发现数学推理在处理共时事件中起着重要作用,并从数学角度提出了一种提升LLM共时推理能力的策略。我们希望CoTempQA数据集能推动LLM共时推理能力的进一步提升。我们的代码可在此https URL获得。

[NLP-37] How structured are the representations in transformer-based vision encoders? An analysis of multi-object representations in vision-language models
[NLP-37] 基于Transformer的视觉编码器中的表示的结构化程度如何?视觉语言模型中多对象表示的分析

链接: https://arxiv.org/abs/2406.09067
作者: Tarun Khajuria,Braian Olmiro Dias,Jaan Aru
关键词: considered essential, essential for generalising, symbol-like structured representations, Forming, representations
中文关键词: 被认为是必不可少的,对于概括化、类似符号的结构化表示、形成、表示
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Neurons and Cognition (q-bio.NC)
备注:

点击查看摘要

Abstract:Forming and using symbol-like structured representations for reasoning has been considered essential for generalising over novel inputs. The primary tool that allows generalisation outside training data distribution is the ability to abstract away irrelevant information into a compact form relevant to the task. An extreme form of such abstract representations is symbols. Humans make use of symbols to bind information while abstracting away irrelevant parts to utilise the information consistently and meaningfully. This work estimates the state of such structured representations in vision encoders. Specifically, we evaluate image encoders in large vision-language pre-trained models to address the question of which desirable properties their representations lack by applying the criteria of symbolic structured reasoning described for LLMs to the image models. We test the representation space of image encoders like VIT, BLIP, CLIP, and FLAVA to characterise the distribution of the object representations in these models. In particular, we create decoding tasks using multi-object scenes from the COCO dataset, relating the token space to its input content for various objects in the scene. We use these tasks to characterise the network’s token and layer-wise information modelling. Our analysis highlights that the CLS token, used for the downstream task, only focuses on a few objects necessary for the trained downstream task. Still, other individual objects are well-modelled separately by the tokens in the network originating from those objects. We further observed a widespread distribution of scene information. This demonstrates that information is far more entangled in tokens than optimal for representing objects similar to symbols. Given these symbolic properties, we show the network dynamics that cause failure modes of these models on basic downstream tasks in a multi-object scene.
摘要:形成并使用类似符号的结构化表征来进行推理,被认为是对新输入进行泛化的关键。允许在训练数据分布之外进行泛化的主要工具,是将不相关的信息抽象成与任务相关的紧凑形式的能力。这种抽象表征的一种极端形式就是符号。人类利用符号来绑定信息,同时抽象掉不相关的部分,以便一致且有意义地利用信息。本工作评估了视觉编码器中这种结构化表征的状况。具体而言,我们将为LLM描述的符号化结构推理标准应用于图像模型,评估大型视觉语言预训练模型中的图像编码器,以回答其表征缺乏哪些理想特性的问题。我们测试了ViT、BLIP、CLIP和FLAVA等图像编码器的表示空间,以刻画对象表征在这些模型中的分布。特别地,我们使用COCO数据集中的多对象场景创建解码任务,将令牌空间与场景中各对象的输入内容相关联。我们用这些任务刻画网络在令牌和层级上的信息建模。我们的分析强调,用于下游任务的CLS令牌只关注训练好的下游任务所需的少数对象;而其他单个对象则由网络中源自这些对象的令牌分别较好地建模。我们进一步观察到场景信息的广泛分布。这表明信息在令牌中的纠缠程度,远超以类似符号的方式表示对象的最优水平。基于这些符号属性,我们展示了在多对象场景的基本下游任务上导致这些模型失效模式的网络动力学。

[NLP-38] CUDRT: Benchmarking the Detection of Human vs. Large Language Models Generated Texts
[NLP-38] CUDRT:对人类与大型语言模型生成文本的检测进行基准测试

链接: https://arxiv.org/abs/2406.09056
作者: Zhen Tao,Zhiyu Li,Dinghao Xi,Wei Xu
关键词: significantly enhanced text, large language models, proliferation of large, significantly enhanced, AI-generated text detectors
中文关键词: 显着增强的文本、大型语言模型、大型显着增强的人工智能生成文本检测器的激增
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 32 pages

点击查看摘要

Abstract:The proliferation of large language models (LLMs) has significantly enhanced text generation capabilities across various industries. However, these models’ ability to generate human-like text poses substantial challenges in discerning between human and AI authorship. Despite the effectiveness of existing AI-generated text detectors, their development is hindered by the lack of comprehensive, publicly available benchmarks. Current benchmarks are limited to specific scenarios, such as question answering and text polishing, and predominantly focus on English texts, failing to capture the diverse applications and linguistic nuances of LLMs. To address these limitations, this paper constructs a comprehensive bilingual benchmark in both Chinese and English to evaluate mainstream AI-generated text detectors. We categorize LLM text generation into five distinct operations: Create, Update, Delete, Rewrite, and Translate (CUDRT), encompassing all current LLMs activities. We also establish a robust benchmark evaluation framework to support scalable and reproducible experiments. For each CUDRT category, we have developed extensive datasets to thoroughly assess detector performance. By employing the latest mainstream LLMs specific to each language, our datasets provide a thorough evaluation environment. Extensive experimental results offer critical insights for optimizing AI-generated text detectors and suggest future research directions to improve detection accuracy and generalizability across various scenarios.
摘要:大型语言模型(LLM)的激增极大地增强了各个行业的文本生成能力。然而,这些模型生成类似人类的文本的能力给区分人类和人工智能作者带来了巨大的挑战。尽管现有的人工智能生成的文本检测器很有效,但由于缺乏全面的、公开可用的基准,它们的发展受到阻碍。目前的基准仅限于特定的情景,如问题回答和文本润色,并且主要侧重于英语文本,未能捕捉到LLMS的多样化应用和语言细微差别。为了解决这些局限性,本文构建了一个全面的中英文双语基准来评估主流的人工智能生成的文本检测器。我们将LLM文本生成分为五个不同的操作:创建、更新、删除、重写和翻译(CUDRT),包括所有当前的LLMS活动。我们还建立了一个健壮的基准评估框架,以支持可扩展和可重现的实验。对于每个CUDRT类别,我们都开发了广泛的数据集,以彻底评估探测器的性能。通过使用针对每种语言的最新主流LLM,我们的数据集提供了一个全面的评估环境。广泛的实验结果为优化人工智能生成的文本检测器提供了关键的见解,并提出了未来的研究方向,以提高检测的准确性和泛化能力。

[NLP-39] MiLoRA: Harnessing Minor Singular Components for Parameter-Efficient LLM Finetuning
[NLP-39] MiLoRA:利用较小奇异分量实现参数高效的LLM微调

链接: https://arxiv.org/abs/2406.09044
作者: Hanqing Wang,Zeguan Xiao,Yixia Li,Shuo Wang,Guanhua Chen,Yun Chen
关键词: large language models, Efficient finetuning, aims to adapt, memory cost, large language
中文关键词: 大型语言模型、高效微调、旨在适应、内存成本、大型语言
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Efficient finetuning of large language models (LLMs) aims to adapt the LLMs with reduced computation and memory cost. Previous LoRA-based approaches initialize the low-rank matrices with gaussian distribution and zero values, while keeping the original weight matrices frozen. However, the trainable model parameters optimized in an unguided subspace might have interference with the well-learned subspace of the pretrained weight matrix. In this paper, we propose MiLoRA, a simple yet effective LLM finetuning approach that only updates the minor singular components of the weight matrix while keeping the principle singular components frozen. It is observed that the minor matrix corresponds to the noisy or long-tail information, while the principle matrix contains important knowledge. The MiLoRA initializes the low-rank matrices within a subspace that is orthogonal to the principle matrix, thus the pretrained knowledge is expected to be well preserved. During finetuning, MiLoRA makes the most use of the less-optimized subspace for learning the finetuning dataset. Extensive experiments on commonsense reasoning, math reasoning and instruction following benchmarks present the superior performance of our method.
摘要:大型语言模型(LLM)的高效微调旨在以更低的计算和内存开销来适配LLM。以往基于LoRA的方法用高斯分布和零值初始化低秩矩阵,同时保持原始权重矩阵冻结。然而,在无引导子空间中优化的可训练模型参数,可能会干扰预训练权重矩阵中已学习良好的子空间。本文提出了MiLoRA,一种简单而有效的LLM微调方法,它只更新权重矩阵的次要奇异分量,而保持主要奇异分量冻结。可以观察到,次要矩阵对应于噪声或长尾信息,而主要矩阵包含重要知识。MiLoRA在与主要矩阵正交的子空间内初始化低秩矩阵,因此预训练知识有望得到很好的保留。在微调过程中,MiLoRA充分利用优化程度较低的子空间来学习微调数据集。在常识推理、数学推理和指令跟随基准上的大量实验表明,该方法具有优越的性能。
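MiLoRA 的初始化可以用 SVD 直接示意:对权重矩阵做奇异值分解,头部的主要奇异成分冻结,尾部的次要奇异成分用于初始化可训练的低秩因子。以下是基于摘要描述的示意实现(非官方代码,因子的具体缩放方式为常见假设):

```python
import numpy as np

def milora_init(weight, rank):
    # W = W_principal + B @ A: the top singular components form the frozen
    # principal part; the bottom `rank` (minor) singular components
    # initialize the trainable low-rank factors A and B.
    U, S, Vt = np.linalg.svd(weight, full_matrices=False)
    principal = U[:, :-rank] @ np.diag(S[:-rank]) @ Vt[:-rank, :]  # frozen
    B = U[:, -rank:] * np.sqrt(S[-rank:])                          # trainable
    A = np.sqrt(S[-rank:])[:, None] * Vt[-rank:, :]                # trainable
    return principal, B, A

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 6))
principal, B, A = milora_init(W, rank=2)
# Before any finetuning step the decomposition reconstructs W exactly.
print(np.allclose(principal + B @ A, W))  # → True
```

与标准 LoRA 的"高斯 + 零"初始化不同,这里 B@A 恰好张成与主要子空间正交的次要子空间,因此微调更新尽量不触碰承载预训练知识的方向。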

[NLP-40] Language Models are Crossword Solvers
[NLP-40] 语言模型是填字游戏求解器

链接: https://arxiv.org/abs/2406.09043
作者: Soumadeep Saha,Sutanoya Chakraborty,Saptarshi Saha,Utpal Garain
关键词: natural language understanding, Large Language Models, world knowledge, length constraints, form of word
中文关键词: 自然语言理解、大型语言模型、世界知识、长度限制、词形式
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Crosswords are a form of word puzzle that require a solver to demonstrate a high degree of proficiency in natural language understanding, wordplay, reasoning, and world knowledge, along with adherence to character and length constraints. In this paper we tackle the challenge of solving crosswords with Large Language Models (LLMs). We demonstrate that the current generation of state-of-the art (SoTA) language models show significant competence at deciphering cryptic crossword clues, and outperform previously reported SoTA results by a factor of 2-3 in relevant benchmarks. We also develop a search algorithm that builds off this performance to tackle the problem of solving full crossword grids with LLMs for the very first time, achieving an accuracy of 93% on New York Times crossword puzzles. Contrary to previous work in this area which concluded that LLMs lag human expert performance significantly, our research suggests this gap is a lot narrower.
摘要:填字游戏是一种文字谜题,要求解谜者在自然语言理解、文字游戏、推理和世界知识方面表现出高度的熟练程度,同时遵守字符和长度约束。在本文中,我们应对了用大型语言模型(LLM)求解填字游戏的挑战。我们证明,当前一代最先进(SoTA)的语言模型在破译隐晦的填字游戏线索方面表现出出色的能力,在相关基准测试中比之前报告的SoTA结果高出2-3倍。我们还在此基础上开发了一种搜索算法,首次用LLM求解完整的填字游戏网格,在《纽约时报》填字游戏上达到了93%的准确率。该领域之前的工作得出结论认为LLM显著落后于人类专家的表现,与此相反,我们的研究表明这一差距要小得多。
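求解完整网格的搜索中,最基本的一步是用长度与交叉字母约束过滤 LLM 给出的候选答案,可以用正则表达式直接示意(示意代码,并非论文搜索算法的实际实现):

```python
import re

def fits_slot(candidate, pattern):
    # `pattern` encodes one grid slot: fixed crossing letters plus '.' for
    # open cells, e.g. "C..SS...D" for a 9-letter slot. fullmatch enforces
    # the length and the character constraints at once.
    return re.fullmatch(pattern, candidate.upper()) is not None

candidates = ["CROSSWORD", "CHECKERED", "PUZZLE"]
print([c for c in candidates if fits_slot(c, "C..SS...D")])  # → ['CROSSWORD']
```

搜索算法随后在这些通过约束检查的候选之间回溯,直到所有横纵槽位相互一致。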

[NLP-41] ME-Switch: A Memory-Efficient Expert Switching Framework for Large Language Models
[NLP-41] ME-Switch:用于大型语言模型的内存高效专家交换框架

链接: https://arxiv.org/abs/2406.09041
作者: Jing Liu,Ruihao Gong,Mingyang Zhang,Yefei He,Jianfei Cai,Bohan Zhuang
关键词: developing LLMs involves, LLMs involves pre-training, create specialized experts, general foundation model, massive data
中文关键词: 开发LLM涉及,LLM涉及预培训、创建专业专家、通用基础模型、海量数据
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Tech report

点击查看摘要

Abstract:The typical process for developing LLMs involves pre-training a general foundation model on massive data, followed by fine-tuning on task-specific data to create specialized experts. Serving these experts poses challenges, as loading all experts onto devices is impractical, and frequent switching between experts in response to user requests incurs substantial I/O costs, increasing latency and expenses. Previous approaches decompose expert weights into pre-trained model weights and residual delta weights, then quantize the delta weights to reduce model size. However, these methods often lead to significant quantization errors at extremely low bitwidths and assume the appropriate model for a user request is known in advance, which is not practical. To address these issues, we introduce ME-Switch, a memory-efficient expert switching framework for LLM serving. ME-Switch uses mixed-precision quantization, selectively quantizing non-salient input channels of delta weights to extremely low bits while keeping salient ones intact, significantly reducing storage demands while maintaining performance. Additionally, we develop a routing method that efficiently directs user queries to the most suitable expert by transforming the model selection problem into a domain classification problem. Extensive experiments show ME-Switch’s promising memory efficiency and routing performance. For example, when serving three models from the Mistral-7B family, ME-Switch reduces model size by 1.74x while maintaining nearly lossless performance on instruction, mathematical reasoning, and code generation tasks. Furthermore, ME-Switch can efficiently serve 16 models from the Mistral-7B family on a single NVIDIA A100 GPU.
摘要:开发LLMS的典型过程包括对海量数据进行一般基础模型的预培训,然后对特定任务的数据进行微调以创建专门的专家。为这些专家服务带来了挑战,因为将所有专家加载到设备上是不切实际的,而且响应用户请求在专家之间频繁切换会导致大量的I/O成本,增加延迟和费用。以前的方法将专家权重分解为预先训练的模型权重和残差增量权重,然后对增量权重进行量化以减小模型的规模。然而,这些方法经常在极低的位宽度处导致显著的量化误差,并且假设用户请求的适当模型是预先已知的,这是不切实际的。为了解决这些问题,我们引入了ME-Switch,一个用于LLM服务的内存效率高的专家交换框架。ME-Switch使用混合精度量化,选择性地将Delta权重的非显著输入通道量化到极低的位,同时保持显著位不变,在保持性能的同时显著减少存储需求。此外,我们还开发了一种路由方法,通过将模型选择问题转化为领域分类问题,有效地将用户查询定向到最合适的专家。大量实验表明ME-Switch具有良好的存储效率和路由性能。例如,在为Mistral-7B系列的三个型号提供服务时,ME-Switch将模型大小减少了1.74倍,同时在指令、数学推理和代码生成任务方面保持了几乎无损的性能。此外,ME-Switch可以在单个NVIDIA A100图形处理器上高效地为Mistral-7B系列的16个型号提供服务。
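ME-Switch 对 delta 权重的混合精度量化思路可用如下草图说明:按通道范数挑出显著输入通道保持全精度,其余通道量化到极低比特。以下为示意代码,keep_ratio 的取值、按 L2 范数选显著通道、逐通道均匀量化等细节均为本文假设,并非论文的实际方案:

```python
import numpy as np

def quantize_delta(delta, keep_ratio=0.25, bits=2):
    # Rank input channels (columns) by L2 norm; keep the top keep_ratio
    # salient channels in full precision and round the rest onto a
    # uniform low-bit grid (2**bits levels per channel).
    n_keep = max(1, int(keep_ratio * delta.shape[1]))
    salient = set(np.argsort(np.linalg.norm(delta, axis=0))[-n_keep:].tolist())
    q = delta.copy()
    levels = 2 ** bits - 1
    for c in range(delta.shape[1]):
        if c in salient:
            continue
        lo, hi = delta[:, c].min(), delta[:, c].max()
        scale = (hi - lo) / levels if hi > lo else 1.0
        q[:, c] = np.round((delta[:, c] - lo) / scale) * scale + lo
    return q, salient

rng = np.random.default_rng(1)
delta = rng.standard_normal((16, 8))
q, salient = quantize_delta(delta)
```

推理时只需存储低比特的非显著通道、少量全精度显著通道以及共享的预训练权重,多个专家即可共用同一基座模型。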

[NLP-42] Bayesian Statistical Modeling with Predictors from LLMs
[NLP-42] 使用来自LLM的预测因子的Bayesian统计建模

链接: https://arxiv.org/abs/2406.09012
作者: Michael Franke,Polina Tsvilodub,Fausto Carcassi
关键词: art large language, large language models, LLM-based predictions serve, shown impressive performance, larger applications
中文关键词: 艺术大型语言、大型语言模型、基于LLM的预测服务、表现出令人印象深刻的性能、更大的应用
类目: Computation and Language (cs.CL)
备注: 20 pages, 10 figures, parallel submission to a journal

点击查看摘要

Abstract:State of the art large language models (LLMs) have shown impressive performance on a variety of benchmark tasks and are increasingly used as components in larger applications, where LLM-based predictions serve as proxies for human judgements or decision. This raises questions about the human-likeness of LLM-derived information, alignment with human intuition, and whether LLMs could possibly be considered (parts of) explanatory models of (aspects of) human cognition or language use. To shed more light on these issues, we here investigate the human-likeness of LLMs’ predictions for multiple-choice decision tasks from the perspective of Bayesian statistical modeling. Using human data from a forced-choice experiment on pragmatic language use, we find that LLMs do not capture the variance in the human data at the item-level. We suggest different ways of deriving full distributional predictions from LLMs for aggregate, condition-level data, and find that some, but not all ways of obtaining condition-level predictions yield adequate fits to human data. These results suggests that assessment of LLM performance depends strongly on seemingly subtle choices in methodology, and that LLMs are at best predictors of human behavior at the aggregate, condition-level, for which they are, however, not designed to, or usually used to, make predictions in the first place.
摘要:大型语言模型(LLM)在各种基准任务上表现出了令人印象深刻的性能,并越来越多地被用作更大应用中的组件,其中基于LLM的预测充当人类判断或决策的代理。这就提出了一些问题:LLM来源的信息是否与人类相似,是否与人类的直觉一致,以及LLM是否可能被视为(部分)人类认知或语言使用(方面)的解释性模型。为了更好地阐明这些问题,我们从贝叶斯统计模型的角度研究了LLMS对多项选择决策任务的预测是否与人类相似。使用来自语用语言使用的强制选择实验的人类数据,我们发现LLMS没有捕捉到人类数据在项目水平上的差异。我们建议了不同的方法来从LLMS中推导出对聚集的条件级别数据的完整分布预测,并发现一些但不是所有获得条件级别预测的方法产生了与人类数据足够的拟合。这些结果表明,对LLM性能的评估强烈依赖于方法上看似微妙的选择,LLM充其量是对人类行为的总体、条件水平的预测,然而,它们并不是被设计为或通常一开始就用来预测的。
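从 LLM 得到多项选择任务的"完整分布预测"的最直接做法之一,是对各选项续写的对数概率做 softmax 归一化。以下示意假设已拿到每个选项(求和后的)token 对数概率;如摘要所述,这只是多种聚合方式中的一种,不同方式与人类数据的拟合程度并不相同:

```python
import math

def choice_distribution(option_logprobs):
    # Numerically stable softmax over per-option log-probabilities,
    # yielding a normalized distribution over the forced choices.
    m = max(option_logprobs.values())
    exps = {k: math.exp(v - m) for k, v in option_logprobs.items()}
    z = sum(exps.values())
    return {k: e / z for k, e in exps.items()}

dist = choice_distribution({"A": -1.2, "B": -0.3, "C": -2.5})
print(max(dist, key=dist.get))  # → B
```

得到的分布随后可作为贝叶斯统计模型中的预测量,与人类强制选择实验数据在条件级别上进行拟合比较。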

[NLP-43] LLM Reading Tea Leaves: Automatically Evaluating Topic Models with Large Language Models
[NLP-43] LLM阅读茶叶:使用大型语言模型自动评估主题模型

链接: https://arxiv.org/abs/2406.09008
作者: Xiaohao Yang,He Zhao,Dinh Phung,Wray Buntine,Lan Du
关键词: unsupervised text analysis, text analysis, tool for unsupervised, unsupervised text, Topic
中文关键词: 无监督文本分析,文本分析,无监督工具,无监督文本,主题
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Topic modeling has been a widely used tool for unsupervised text analysis. However, comprehensive evaluations of a topic model remain challenging. Existing evaluation methods are either less comparable across different models (e.g., perplexity) or focus on only one specific aspect of a model (e.g., topic quality or document representation quality) at a time, which is insufficient to reflect the overall model performance. In this paper, we propose WALM (Words Agreement with Language Model), a new evaluation method for topic modeling that comprehensively considers the semantic quality of document representations and topics in a joint manner, leveraging the power of large language models (LLMs). With extensive experiments involving different types of topic models, WALM is shown to align with human judgment and can serve as a complementary evaluation method to the existing ones, bringing a new perspective to topic modeling. Our software package will be available at this https URL, which can be integrated with many widely used topic models.
摘要:主题建模是一种广泛使用的无监督文本分析工具。然而,对主题模型的全面评估仍然具有挑战性。现有的评估方法要么在不同模型之间可比性较差(例如困惑度),要么一次只关注模型的某一个方面(例如主题质量或文档表示质量),不足以反映模型的整体性能。在本文中,我们提出了一种新的主题建模评价方法WALM(Words Agreement with Language Model),它利用大型语言模型(LLM)的能力,以联合的方式综合考虑文档表示和主题的语义质量。通过涉及不同类型主题模型的大量实验,WALM被证明与人类的判断一致,可以作为现有评估方法的补充,为主题建模带来新的视角。我们的软件包将在此https URL上提供,它可以与许多广泛使用的主题模型集成。

[NLP-44] Multi-Agent Software Development through Cross-Team Collaboration
[NLP-44] 通过跨团队协作进行多代理软件开发

链接: https://arxiv.org/abs/2406.08979
作者: Zhuoyun Du,Chen Qian,Wei Liu,Zihao Xie,Yifei Wang,Yufan Dang,Weize Chen,Cheng Yang
关键词: Large Language Models, Large Language, catalyzed profound transformations, Language Models, breakthroughs in Large
中文关键词: 大型语言模型,大型语言,催化了深刻的转变,语言模型,大型突破
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Software Engineering (cs.SE)
备注: Work in progress

点击查看摘要

Abstract:The latest breakthroughs in Large Language Models (LLMs), eg., ChatDev, have catalyzed profound transformations, particularly through multi-agent collaboration for software development. LLM agents can collaborate in teams like humans, and follow the waterfall model to sequentially work on requirements analysis, development, review, testing, and other phases to perform autonomous software generation. However, for an agent team, each phase in a single development process yields only one possible outcome. This results in the completion of only one development chain, thereby losing the opportunity to explore multiple potential decision paths within the solution space. Consequently, this may lead to obtaining suboptimal results. To address this challenge, we introduce Cross-Team Collaboration (CTC), a scalable multi-team framework that enables orchestrated teams to jointly propose various decisions and communicate with their insights in a cross-team collaboration environment for superior content generation. Experimental results in software development reveal a notable increase in quality compared to state-of-the-art baselines, underscoring the efficacy of our framework. The significant improvements in story generation demonstrate the promising generalization ability of our framework across various domains. We anticipate that our work will guide LLM agents towards a cross-team paradigm and contribute to their significant growth in but not limited to software development. The code and data will be available at this https URL.
摘要:大型语言模型(LLM)的最新突破,如ChatDev,已经催化了深刻的变革,特别是通过软件开发的多代理协作。LLM代理可以像人类一样在团队中协作,并遵循瀑布模型顺序地进行需求分析、开发、审查、测试和其他阶段的工作,以执行自主的软件生成。然而,对于代理团队来说,单个开发过程中的每个阶段只产生一个可能的结果。这导致只完成一个开发链,从而失去了在解决方案空间内探索多条潜在决策路径的机会。因此,这可能导致获得不太理想的结果。为了应对这一挑战,我们引入了跨团队协作(CTC),这是一个可扩展的多团队框架,使协调的团队能够在跨团队协作环境中共同提出各种决策并与他们的见解进行沟通,以获得卓越的内容生成。软件开发的实验结果显示,与最先进的基线相比,质量有了显著的提高,这突显了我们框架的有效性。在故事生成方面的显著改进证明了我们的框架在不同领域具有良好的泛化能力。我们预计,我们的工作将引导LLM代理走向跨团队范例,并为他们在但不限于软件开发方面的显着增长做出贡献。代码和数据将在此HTTPS URL上提供。

[NLP-45] Word Order in English-Japanese Simultaneous Interpretation: Analyses and Evaluation using Chunk-wise Monotonic Translation
[NLP-45] 英日同传中的语序:基于块式单调翻译的分析与评价

链接: https://arxiv.org/abs/2406.08940
作者: Kosuke Doi,Yuka Ko,Mana Makinae,Katsuhito Sudoh,Satoshi Nakamura
关键词: Chunk-wise Monotonic Translation, Monotonic Translation Evaluation, Translation Evaluation Dataset, monotonic translations, Chunk-wise Monotonic
中文关键词: 按块单调翻译,单调翻译评估,翻译评估数据集,单调翻译,按块单调
类目: Computation and Language (cs.CL)
备注: Accepted to IWSLT2024

点击查看摘要

Abstract:This paper analyzes the features of monotonic translations, which follow the word order of the source language, in simultaneous interpreting (SI). The word order differences are one of the biggest challenges in SI, especially for language pairs with significant structural differences like English and Japanese. We analyzed the characteristics of monotonic translations using the NAIST English-to-Japanese Chunk-wise Monotonic Translation Evaluation Dataset and found some grammatical structures that make monotonic translation difficult in English-Japanese SI. We further investigated the features of monotonic translations through evaluating the output from the existing speech translation (ST) and simultaneous speech translation (simulST) models on NAIST English-to-Japanese Chunk-wise Monotonic Translation Evaluation Dataset as well as on existing test sets. The results suggest that the existing SI-based test set underestimates the model performance. We also found that the monotonic-translation-based dataset would better evaluate simulST models, while using an offline-based test set for evaluating simulST models underestimates the model performance.
摘要:本文分析了同声传译(SI)中遵循源语语序的单调翻译的特点。语序差异是同声传译中最大的挑战之一,尤其是对于英语和日语这类结构差异显著的语言对。我们利用NAIST英日分块单调翻译评估数据集分析了单调翻译的特点,发现了一些使英日同传中单调翻译变得困难的语法结构。我们进一步通过在该数据集以及现有测试集上评估现有语音翻译(ST)和同步语音翻译(simulST)模型的输出,研究了单调翻译的特征。结果表明,现有的基于SI的测试集低估了模型性能。我们还发现,基于单调翻译的数据集能够更好地评估simulST模型,而使用基于离线数据的测试集评估simulST模型会低估其性能。

[NLP-46] Exploring Multilingual Unseen Speaker Emotion Recognition: Leveraging Co-Attention Cues in Multitask Learning
[NLP-46] 探索多语言未见说话人情感识别:在多任务学习中利用共同注意力线索

链接: https://arxiv.org/abs/2406.08931
作者: Arnav Goel,Medha Hira,Anubha Gupta
关键词: Speech Emotion Recognition, Emotion Recognition, Speech Emotion, Advent of modern, modern deep learning
中文关键词: 语音情感识别,情感识别,语音情感,现代深度学习的到来
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: 5 pages, Accepted to INTERSPEECH 2024

点击查看摘要

Abstract:Advent of modern deep learning techniques has given rise to advancements in the field of Speech Emotion Recognition (SER). However, most systems prevalent in the field fail to generalize to speakers not seen during training. This study focuses on handling challenges of multilingual SER, specifically on unseen speakers. We introduce CAMuLeNet, a novel architecture leveraging co-attention based fusion and multitask learning to address this problem. Additionally, we benchmark pretrained encoders of Whisper, HuBERT, Wav2Vec2.0, and WavLM using 10-fold leave-speaker-out cross-validation on five existing multilingual benchmark datasets: IEMOCAP, RAVDESS, CREMA-D, EmoDB and CaFE and, release a novel dataset for SER on the Hindi language (BhavVani). CAMuLeNet shows an average improvement of approximately 8% over all benchmarks on unseen speakers determined by our cross-validation strategy.
摘要:现代深度学习技术的出现推动了语音情感识别(SER)领域的进步。然而,该领域流行的大多数系统都难以泛化到训练期间未见过的说话人。本研究重点应对多语言SER的挑战,特别是针对未见过的说话人。我们提出了CAMuLeNet,一种利用基于共同注意力的融合和多任务学习来解决该问题的新颖架构。此外,我们在五个现有的多语言基准数据集(IEMOCAP、RAVDESS、CREMA-D、EmoDB和CaFE)上,使用10折留出说话人(leave-speaker-out)交叉验证对Whisper、HuBERT、Wav2Vec2.0和WavLM的预训练编码器进行了基准测试,并发布了一个新的印地语SER数据集(BhavVani)。在由交叉验证策略确定的未见过说话人上,CAMuLeNet在所有基准上平均提升约8%。
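上文提到的基于共同注意力(co-attention)的融合,其核心计算可以用下面的 NumPy 草图体会:两种模态的特征序列通过一个亲和度矩阵相互加权。这只是原理示意,并非 CAMuLeNet 的实际模块,其中的序列长度与特征维度均为假设取值。

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def co_attention(feat_a, feat_b):
    """两种模态特征序列之间的简化共同注意力融合(仅为原理示意)。"""
    # 亲和度矩阵,形状为 (Ta, Tb),按特征维度缩放以稳定 softmax
    affinity = feat_a @ feat_b.T / np.sqrt(feat_a.shape[1])
    fused_a = softmax(affinity, axis=1) @ feat_b    # 模态 A 关注模态 B
    fused_b = softmax(affinity.T, axis=1) @ feat_a  # 模态 B 关注模态 A
    return fused_a, fused_b

rng = np.random.default_rng(0)
speech = rng.normal(size=(50, 64))   # 假设:50 帧语音特征
text = rng.normal(size=(12, 64))     # 假设:12 个文本 token 特征
fused_speech, fused_text = co_attention(speech, text)
print(fused_speech.shape, fused_text.shape)
```

融合后的两路表示各自保留原序列长度,可再拼接或池化后送入下游情感分类头。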

[NLP-47] Navigating the Shadows: Unveiling Effective Disturbances for Modern AI Content Detectors
[NLP-47] 探索阴影:揭示现代人工智能内容检测器的有效干扰

链接: https://arxiv.org/abs/2406.08922
作者: Ying Zhou,Ben He,Le Sun
关键词: attracted global attention, large language models, launch of ChatGPT, large language, global attention
中文关键词: 吸引全球关注,大型语言模型,ChatGPT推出,大型语言,全球关注
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted by ACL 2024, Main Conference

点击查看摘要

Abstract:With the launch of ChatGPT, large language models (LLMs) have attracted global attention. In the realm of article writing, LLMs have witnessed extensive utilization, giving rise to concerns related to intellectual property protection, personal privacy, and academic integrity. In response, AI-text detection has emerged to distinguish between human and machine-generated content. However, recent research indicates that these detection systems often lack robustness and struggle to effectively differentiate perturbed texts. Currently, there is a lack of systematic evaluations regarding detection performance in real-world applications, and a comprehensive examination of perturbation techniques and detector robustness is also absent. To bridge this gap, our work simulates real-world scenarios in both informal and professional writing, exploring the out-of-the-box performance of current detectors. Additionally, we have constructed 12 black-box text perturbation methods to assess the robustness of current detection models across various perturbation granularities. Furthermore, through adversarial learning experiments, we investigate the impact of perturbation data augmentation on the robustness of AI-text detectors. We have released our code and data at this https URL.
摘要:随着ChatGPT的推出,大型语言模型(LLM)引起了全球关注。在文章写作领域,LLM得到了广泛应用,引发了人们对知识产权保护、个人隐私和学术诚信的担忧。作为回应,AI文本检测应运而生,用于区分人类和机器生成的内容。然而,最近的研究表明,这些检测系统往往缺乏稳健性,难以有效区分经过扰动的文本。目前,对于实际应用中的检测性能缺乏系统的评估,也缺乏对扰动技术和检测器稳健性的全面考察。为了弥补这一差距,我们的工作模拟了非正式和专业写作的真实场景,探索了当前检测器的开箱即用性能。此外,我们构建了12种黑盒文本扰动方法,以评估当前检测模型在不同扰动粒度下的稳健性。进一步地,通过对抗性学习实验,我们研究了扰动数据增强对AI文本检测器稳健性的影响。我们已在此https URL上发布了代码和数据。
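为了直观理解"黑盒文本扰动"的概念,下面给出一个玩具级的相邻字符交换扰动函数。注意:这只是示意性草图,论文实际构建了 12 种不同粒度的扰动方法,具体实现与此不同。

```python
import random

def perturb_text(text: str, swap_rate: float = 0.05, seed: int = 0) -> str:
    """随机交换单词内部的相邻字母,模拟字符级黑盒扰动(玩具示例)。"""
    rng = random.Random(seed)
    chars = list(text)
    for i in range(len(chars) - 1):
        # 只交换字母对,保留空格和标点的位置
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < swap_rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

original = "Large language models attract global attention."
print(perturb_text(original, swap_rate=0.3, seed=42))
```

这类扰动不改变字符多重集,人类读者通常仍可理解,却足以干扰脆弱的检测器。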

[NLP-48] An Initial Investigation of Language Adaptation for TTS Systems under Low-resource Scenarios
[NLP-48] 低资源场景下TTS系统语言适应的初步研究

链接: https://arxiv.org/abs/2406.08911
作者: Cheng Gong,Erica Cooper,Xin Wang,Chunyu Qiang,Mengzhe Geng,Dan Wells,Longbiao Wang,Jianwu Dang,Marc Tessier,Aidan Pine,Korin Richmond,Junichi Yamagishi
关键词: Self-supervised learning, massively multilingual models, multilingual models offer, representations from massively, models offer
中文关键词: 自我监督学习、大规模多语言模型、多语言模型提供、来自大规模的表示、模型提供
类目: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: Accepted to Interspeech 2024

点击查看摘要

Abstract:Self-supervised learning (SSL) representations from massively multilingual models offer a promising solution for low-resource language speech tasks. Despite advancements, language adaptation in TTS systems remains an open problem. This paper explores the language adaptation capability of ZMM-TTS, a recent SSL-based multilingual TTS system proposed in our previous work. We conducted experiments on 12 languages using limited data with various fine-tuning configurations. We demonstrate that the similarity in phonetics between the pre-training and target languages, as well as the language category, affects the target language’s adaptation performance. Additionally, we find that the fine-tuning dataset size and number of speakers influence adaptability. Surprisingly, we also observed that using paired data for fine-tuning is not always optimal compared to audio-only data. Beyond speech intelligibility, our analysis covers speaker similarity, language identification, and predicted MOS.
摘要:来自大规模多语言模型的自监督学习(SSL)表示为低资源语言的语音任务提供了一种很有前景的解决方案。尽管已取得进展,TTS系统中的语言适应仍是一个悬而未决的问题。本文研究了ZMM-TTS的语言适应能力,这是我们先前工作中提出的一个基于SSL的多语言TTS系统。我们使用有限的数据和多种微调配置在12种语言上进行了实验。我们证明,预训练语言与目标语言之间的语音相似性以及语言类别会影响目标语言的适应性能。此外,我们发现微调数据集的大小和说话人数量会影响适应能力。令人惊讶的是,我们还观察到,与仅使用音频数据相比,使用配对数据进行微调并不总是最优的。除了语音可懂度,我们的分析还涵盖说话人相似度、语言识别和预测MOS。

[NLP-49] Delta-CoMe: Training-Free Delta-Compression with Mixed-Precision for Large Language Models
[NLP-49] Delta-CoMe:面向大型语言模型的免训练混合精度Delta压缩

链接: https://arxiv.org/abs/2406.08903
作者: Bowen Ping,Shuo Wang,Hanqing Wang,Xu Han,Yuzhuang Xu,Yukun Yan,Yun Chen,Baobao Chang,Zhiyuan Liu,Maosong Sun
关键词: adapting large language, large language models, diverse applications, crucial process, process for adapting
中文关键词: 适应大语言、大语言模型、多样化的应用、关键过程、适应过程
类目: Computation and Language (cs.CL)
备注: 12 pages

点击查看摘要

Abstract:Fine-tuning is a crucial process for adapting large language models (LLMs) to diverse applications. In certain scenarios, such as multi-tenant serving, deploying multiple LLMs becomes necessary to meet complex demands. Recent studies suggest decomposing a fine-tuned LLM into a base model and corresponding delta weights, which are then compressed using low-rank or low-bit approaches to reduce costs. In this work, we observe that existing low-rank and low-bit compression methods can significantly harm the model performance for task-specific fine-tuned LLMs (e.g., WizardMath for math problems). Motivated by the long-tail distribution of singular values in the delta weights, we propose a delta quantization approach using mixed-precision. This method employs higher-bit representation for singular vectors corresponding to larger singular values. We evaluate our approach on various fine-tuned LLMs, including math LLMs, code LLMs, chat LLMs, and even VLMs. Experimental results demonstrate that our approach performs comparably to full fine-tuned LLMs, surpassing both low-rank and low-bit baselines by a considerable margin. Additionally, we show that our method is compatible with various backbone LLMs, such as Llama-2, Llama-3, and Mistral, highlighting its generalizability.
摘要:微调是使大型语言模型(LLM)适应不同应用的关键过程。在某些场景中,例如多租户服务,需要部署多个LLM来满足复杂的需求。最近的研究建议将微调后的LLM分解为基础模型和相应的增量(delta)权重,然后使用低秩或低比特方法进行压缩以降低成本。在这项工作中,我们观察到现有的低秩和低比特压缩方法会显著损害面向特定任务微调的LLM(例如用于数学问题的WizardMath)的模型性能。受delta权重中奇异值长尾分布的启发,我们提出了一种混合精度的delta量化方法。该方法对对应于较大奇异值的奇异向量采用更高比特的表示。我们在各种微调LLM上评估了我们的方法,包括数学LLM、代码LLM、聊天LLM,甚至VLM。实验结果表明,我们的方法性能与完全微调的LLM相当,并大幅超过低秩和低比特基线。此外,我们还证明了我们的方法与各种主干LLM(如Llama-2、Llama-3和Mistral)兼容,突显了其通用性。
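论文的核心思路(对 delta 权重做 SVD,对应较大奇异值的奇异向量用更高比特表示)可以用如下 NumPy 草图演示。其中的秩划分 (8, 56)、比特宽度 (8, 3) 只是示例取值,量化函数也是简化的均匀量化,均非 Delta-CoMe 的实际配置。

```python
import numpy as np

def quantize(mat, bits):
    """把矩阵均匀量化到给定比特宽度(示意用)。"""
    levels = 2 ** bits - 1
    lo, hi = mat.min(), mat.max()
    scale = (hi - lo) / levels if hi > lo else 1.0
    return np.round((mat - lo) / scale) * scale + lo

def mixed_precision_delta(delta, ranks=(8, 56), bits=(8, 3)):
    """SVD 后,头部奇异方向用高比特、长尾用低比特重构 delta 权重。"""
    U, s, Vt = np.linalg.svd(delta, full_matrices=False)
    out = np.zeros_like(delta)
    start = 0
    for r, b in zip(ranks, bits):
        end = start + r
        Uq = quantize(U[:, start:end], b)       # 奇异向量按比特宽度量化
        Vq = quantize(Vt[start:end, :], b)
        out += (Uq * s[start:end]) @ Vq         # 按奇异值加权重构
        start = end
    return out

rng = np.random.default_rng(0)
# 构造一个奇异值呈长尾分布的"delta 权重"矩阵
delta = rng.normal(size=(64, 64)) * np.exp(-np.arange(64) / 8)
approx = mixed_precision_delta(delta)
err = np.linalg.norm(delta - approx) / np.linalg.norm(delta)
print(f"relative reconstruction error: {err:.3f}")
```

由于能量集中在头部奇异方向,高比特只需覆盖很小的秩,长尾部分即使量化到 3 比特,整体重构误差仍然可控。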

[NLP-50] No perspective no perception!! Perspective-aware Healthcare Answer Summarization
[NLP-50] 没有视角没有感知!!视角感知医疗保健答案总结

链接: https://arxiv.org/abs/2406.08881
作者: Gauri Naik,Sharad Chandakacherla,Shweta Yadav,Md. Shad Akhtar
关键词: Healthcare Community Question, individuals seeking information, Community Question Answering, answering others’ questions, Healthcare Community
中文关键词: 医疗保健社区问答、个人寻求信息、社区问答、回答他人问题、医疗保健社区
类目: Computation and Language (cs.CL)
备注: ACL 2024 Findings

点击查看摘要

Abstract:Healthcare Community Question Answering (CQA) forums offer an accessible platform for individuals seeking information on various healthcare-related topics. People find such platforms suitable for self-disclosure, seeking medical opinions, finding simplified explanations for their medical conditions, and answering others’ questions. However, answers on these forums are typically diverse and prone to off-topic discussions. It can be challenging for readers to sift through numerous answers and extract meaningful insights, making answer summarization a crucial task for CQA forums. While several efforts have been made to summarize the community answers, most of them are limited to the open domain and overlook the different perspectives offered by these answers. To address this problem, this paper proposes a novel task of perspective-specific answer summarization. We identify various perspectives, within healthcare-related responses and frame a perspective-driven abstractive summary covering all responses. To achieve this, we annotate 3167 CQA threads with 6193 perspective-aware summaries in our PUMA dataset. Further, we propose PLASMA, a prompt-driven controllable summarization model. To encapsulate the perspective-specific conditions, we design an energy-controlled loss function for the optimization. We also leverage the prefix tuner to learn the intricacies of the health-care perspective summarization. Our evaluation against five baselines suggests the superior performance of PLASMA by a margin of 1.5-21% improvement. We supplement our experiments with ablation and qualitative analysis.
摘要:医疗保健社区问答(CQA)论坛为寻求各种医疗保健相关话题信息的个人提供了一个易于访问的平台。人们发现此类平台适合自我披露、征求医疗意见、为自身病情寻找通俗解释以及回答他人的问题。然而,这些论坛上的答案通常多种多样,且容易偏离主题。读者要筛选大量答案并提取有意义的见解颇具挑战,这使得答案摘要成为CQA论坛的一项关键任务。虽然已有若干工作尝试总结社区答案,但大多局限于开放领域,且忽略了这些答案所提供的不同视角。为了解决这一问题,本文提出了一种新的面向特定视角的答案摘要任务。我们在医疗保健相关的回答中识别不同视角,并构建一个涵盖所有回答的视角驱动的抽象摘要。为此,我们在PUMA数据集中为3167个CQA帖子标注了6193个视角感知摘要。进一步地,我们提出了PLASMA,一种提示驱动的可控摘要模型。为了刻画特定视角的条件,我们设计了能量控制的损失函数进行优化。我们还利用前缀调优器(prefix tuner)来学习医疗保健视角摘要的复杂性。与五个基线的对比评估表明,PLASMA的性能领先1.5%-21%。我们还通过消融实验和定性分析对实验进行了补充。

[NLP-51] Plan Generate and Complicate: Improving Low-resource Dialogue State Tracking via Easy-to-Difficult Zero-shot Data Augmentation
[NLP-51] 规划、生成与复杂化:通过由易到难的零样本数据增强改进低资源对话状态跟踪

链接: https://arxiv.org/abs/2406.08860
作者: Ming Gu,Yan Yang
关键词: low-resource dialogue state, Zero-shot Data Augmentation, dialogue state tracking, Data augmentation, Data augmentation methods
中文关键词: 低资源对话状态、零镜头数据增强、对话状态跟踪、数据增强、数据增强方法
类目: Computation and Language (cs.CL)
备注: Accepted by ACL 2024 Findings

点击查看摘要

Abstract:Data augmentation methods have been a promising direction to improve the performance of small models for low-resource dialogue state tracking. However, traditional methods rely on pre-defined user goals and neglect the importance of data complexity in this task. In this paper, we propose EDZ-DA, an Easy-to-Difficult Zero-shot Data Augmentation framework for low-resource dialogue state tracking that utilizes large language models to automatically catch the relationships of different domains and then generate the dialogue data. We also complicate the dialogues based on the domain relation to enhance the model’s capability for co-reference slot tracking. Furthermore, we permute slot values to mitigate the influence of output orders and the problem of incomplete value generation. Experimental results illustrate the superiority of our proposed method compared to previous strong data augmentation baselines on MultiWOZ.
摘要:数据增强方法一直是提高低资源对话状态跟踪中小模型性能的一个有前景的方向。然而,传统方法依赖于预定义的用户目标,并忽视了数据复杂性在该任务中的重要性。在本文中,我们提出了EDZ-DA,一个由易到难的零样本数据增强框架,用于低资源对话状态跟踪。它利用大型语言模型自动捕捉不同领域之间的关系,然后生成对话数据。我们还根据领域关系使对话复杂化,以增强模型的共指槽跟踪能力。此外,我们对槽值进行置换,以减轻输出顺序的影响以及值生成不完整的问题。实验结果表明,在MultiWOZ上,我们提出的方法优于之前的强数据增强基线。
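摘要中提到的"对槽值进行置换以减轻输出顺序影响",直观做法可以草绘如下:为同一对话状态枚举多种槽-值输出顺序用于训练。以下实现仅为示意,并非论文原始代码。

```python
import itertools

def slot_order_variants(state, max_variants=6):
    """枚举对话状态中槽-值对的多种输出顺序(示意性数据增强)。"""
    pairs = sorted(state.items())
    variants = []
    for perm in itertools.permutations(pairs):
        variants.append("; ".join(f"{slot}={value}" for slot, value in perm))
        if len(variants) == max_variants:
            break
    return variants

state = {"hotel-area": "north", "hotel-stars": "4", "hotel-parking": "yes"}
for v in slot_order_variants(state):
    print(v)
```

训练时让同一状态以不同顺序出现,可降低模型对某一固定生成顺序的依赖。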

[NLP-52] An Approach to Build Zero-Shot Slot-Filling System for Industry-Grade Conversational Assistants
[NLP-52] 一种构建行业级会话助理零样本槽填充系统的方法

链接: https://arxiv.org/abs/2406.08848
作者: G P Shrivatsa Bhargav,Sumit Neelam,Udit Sharma,Shajith Ikbal,Dheeraj Sreedhar,Hima Karanam,Sachindra Joshi,Pankaj Dhoolia,Dinesh Garg,Kyle Croutwater,Haode Qi,Eric Wayne,J William Murdock
关键词: Dialogue State Tracking, build Large Language, perform Dialogue State, Large Language Model, Large Language
中文关键词: 对话状态跟踪、构建大型语言、执行对话状态、大型语言模型、大型语言
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We present an approach to build Large Language Model (LLM) based slot-filling system to perform Dialogue State Tracking in conversational assistants serving across a wide variety of industry-grade applications. Key requirements of this system include: 1) usage of smaller-sized models to meet low latency requirements and to enable convenient and cost-effective cloud and customer premise deployments, and 2) zero-shot capabilities to serve across a wide variety of domains, slot types and conversational scenarios. We adopt a fine-tuning approach where a pre-trained LLM is fine-tuned into a slot-filling model using task specific data. The fine-tuning data is prepared carefully to cover a wide variety of slot-filling task scenarios that the model is expected to face across various domains. We give details of the data preparation and model building process. We also give a detailed analysis of the results of our experimental evaluations. Results show that our prescribed approach for slot-filling model building has resulted in 6.9% relative improvement of F1 metric over the best baseline on a realistic benchmark, while at the same time reducing the latency by 57%. More over, the data we prepared has helped improve F1 on an average by 4.2% relative across various slot-types.
摘要:我们提出了一种构建基于大型语言模型(LLM)的槽填充系统的方法,用于在服务于各种工业级应用的会话助理中执行对话状态跟踪。该系统的主要要求包括:1)使用较小规模的模型,以满足低延迟要求,并支持方便且经济高效的云端和客户本地部署;2)具备零样本能力,以服务于各种领域、槽类型和会话场景。我们采用微调方法,使用特定任务的数据将预训练LLM微调为槽填充模型。微调数据经过精心准备,以覆盖模型在各个领域预计会面对的各种槽填充任务场景。我们详细介绍了数据准备和模型构建过程,并对实验评估结果进行了详细分析。结果表明,我们提出的槽填充模型构建方法在一个真实基准上使F1指标相对最佳基线提升了6.9%,同时将延迟降低了57%。此外,我们准备的数据使F1在各种槽类型上平均相对提升了4.2%。
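零样本槽填充微调数据的一种常见构造方式,是把槽名及其自然语言描述写入提示,使推理时可以用同样方式描述未见过的槽。下面是一个假设性的序列化示例,字段名与格式均为示意,并非论文实际采用的格式:

```python
def make_slot_filling_example(utterance: str, slot_descriptions: dict, state: dict):
    """构造一条(提示, 目标)微调样本;提示中携带槽描述以支持零样本泛化。"""
    slots = "\n".join(f"- {name}: {desc}" for name, desc in sorted(slot_descriptions.items()))
    prompt = f"可填槽位:\n{slots}\n用户: {utterance}\n对话状态:"
    target = "; ".join(f"{k}={v}" for k, v in sorted(state.items()))
    return prompt, target

prompt, target = make_slot_filling_example(
    "帮我订一家北边的四星酒店",
    {"hotel-area": "酒店所在方位", "hotel-stars": "酒店星级"},
    {"hotel-area": "north", "hotel-stars": "4"},
)
print(prompt)
print(target)
```

推理时只需替换槽位清单,即可让同一模型处理训练中未见过的领域。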

[NLP-53] ContraSolver: Self-Alignment of Language Models by Resolving Internal Preference Contradictions
[NLP-53] ContraSolver:通过解决内部偏好矛盾实现语言模型的自我对齐

链接: https://arxiv.org/abs/2406.08842
作者: Xu Zhang,Xunjian Yin,Xiaojun Wan
关键词: large language models, developing large language, language models, substantial advancements, made in developing
中文关键词: 大型语言模型,开发大型语言,语言模型,重大进步,开发中取得
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:While substantial advancements have been made in developing large language models (LLMs), achieving control over their behavior can be difficult. Direct preference optimization (DPO) assumes the existence of a latent reward function to evaluate the responses of LLMs. This assumption indicates a strict preference ordering of different responses to the same input. However, there always exist contradictions of preference in LLMs according to our experimental observations. In this paper, we construct a graph structure of the preference relationship among different responses with self-annotation to find contradictions in the preference order. We propose ContraSolver, an algorithm that traverses all edges on the preference graph to identify those that might cause contradictions. ContraSolver initializes the graph with a maximum spanning tree and identifies contradictory edges, prioritizing the resolution of low-confidence preferences while preserving high-confidence ones. Experimental results on four different generation tasks show that the performance of different LLMs can be largely improved through our completely unsupervised self-alignment. Furthermore, by analyzing the preference graphs of LLMs with and without self-alignment by ContraSolver, we quantify the reduction in contradictions, suggesting that resolving preference contradictions is crucial for achieving better alignment performance.
摘要:尽管大型语言模型(LLM)的开发已取得实质性进展,但控制其行为仍然困难。直接偏好优化(DPO)假设存在一个潜在的奖励函数来评价LLM的回答。该假设意味着对同一输入的不同回答存在严格的偏好排序。然而,根据我们的实验观察,LLM中始终存在偏好矛盾。在本文中,我们通过自标注构建了不同回答之间偏好关系的图结构,以发现偏好顺序中的矛盾。我们提出了ContraSolver算法,该算法遍历偏好图上的所有边,识别可能导致矛盾的边。ContraSolver使用最大生成树初始化该图并识别矛盾边,优先解决低置信度偏好,同时保留高置信度偏好。在四个不同生成任务上的实验结果表明,通过我们完全无监督的自我对齐,不同LLM的性能都能得到显著提升。此外,通过使用ContraSolver分析自我对齐前后LLM的偏好图,我们量化了矛盾的减少,表明解决偏好矛盾是获得更好对齐性能的关键。
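ContraSolver 在偏好图上消解矛盾的思想,可以用一个简化的贪心版本体会:按置信度从高到低接受偏好边,丢弃会构成环(即与已接受顺序矛盾)的低置信度边。注意,原算法基于最大生成树初始化,此处仅为思路示意。

```python
from collections import defaultdict

def resolve_contradictions(edges):
    """贪心消解偏好矛盾的简化示意;edges 为 (胜者, 败者, 置信度) 三元组列表。"""
    graph = defaultdict(set)

    def reachable(src, dst):
        # 在已接受的有向偏好边中判断 src 能否到达 dst
        stack, seen = [src], set()
        while stack:
            node = stack.pop()
            if node == dst:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(graph[node])
        return False

    kept, dropped = [], []
    for winner, loser, conf in sorted(edges, key=lambda e: -e[2]):
        if reachable(loser, winner):   # 加入 winner->loser 会构成环,即产生矛盾
            dropped.append((winner, loser, conf))
        else:
            graph[winner].add(loser)
            kept.append((winner, loser, conf))
    return kept, dropped

# c > a 与已接受的 a > b > c 矛盾,且置信度最低,应被丢弃
prefs = [("a", "b", 0.9), ("b", "c", 0.8), ("c", "a", 0.3)]
kept, dropped = resolve_contradictions(prefs)
print(dropped)
```

消解后剩下的无环偏好集即可直接用于 DPO 等偏好优化训练。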

[NLP-54] Research on Optimization of Natural Language Processing Model Based on Multimodal Deep Learning
[NLP-54] 基于多模式深度学习的自然语言处理模型优化研究

链接: https://arxiv.org/abs/2406.08838
作者: Dan Sun,Yaxin Liang,Yining Yang,Yuhan Ma,Qishi Zhan,Erdi Gao
关键词: multimodal data, based on attention, attention mechanism, mechanism and multimodal, image representation based
中文关键词: 多模式数据,基于注意力、注意力机制、机制和多模式、基于图像表示
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This project intends to study the image representation based on attention mechanism and multimodal data. By adding multiple pattern layers to the attribute model, the semantic and hidden layers of image content are integrated. The word vector is quantified by the Word2Vec method and then evaluated by a word embedding convolutional neural network. The published experimental results of the two groups were tested. The experimental results show that this method can convert discrete features into continuous characters, thus reducing the complexity of feature preprocessing. Word2Vec and natural language processing technology are integrated to achieve the goal of direct evaluation of missing image features. The robustness of the image feature evaluation model is improved by using the excellent feature analysis characteristics of a convolutional neural network. This project intends to improve the existing image feature identification methods and eliminate the subjective influence in the evaluation process. The findings from the simulation indicate that the novel approach has developed is viable, effectively augmenting the features within the produced representations.
摘要:本项目旨在研究基于注意力机制和多模态数据的图像表示。通过在属性模型中添加多个模式层,整合了图像内容的语义层和隐藏层。词向量通过Word2Vec方法量化,然后由词嵌入卷积神经网络进行评估。对两组已发表的实验结果进行了检验。实验结果表明,该方法可以将离散特征转化为连续特征,从而降低特征预处理的复杂度。将Word2Vec与自然语言处理技术相结合,实现了对缺失图像特征的直接评估。利用卷积神经网络优秀的特征分析特性,提高了图像特征评估模型的鲁棒性。本项目旨在改进现有的图像特征识别方法,消除评估过程中的主观影响。仿真结果表明,所提出的新方法是可行的,能够有效增强所生成表示中的特征。

[NLP-55] LLM-Driven Robots Risk Enacting Discrimination Violence and Unlawful Actions
[NLP-55] LLM驱动的机器人存在实施歧视、暴力和非法行为的风险

链接: https://arxiv.org/abs/2406.08824
作者: Rumaisa Azeem,Andrew Hundt,Masoumeh Mansouri,Martim Brandão
关键词: proposed Large Language, common sense reasoning, Artificial Intelligence, Large Language Models, natural language interactions
中文关键词: 提出的大型语言、常识推理、人工智能、大型语言模型、自然语言交互
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: 40 pages (52 with references), 21 Figures, 6 Tables

点击查看摘要

Abstract:Members of the Human-Robot Interaction (HRI) and Artificial Intelligence (AI) communities have proposed Large Language Models (LLMs) as a promising resource for robotics tasks such as natural language interactions, doing household and workplace tasks, approximating 'common sense reasoning', and modeling humans. However, recent research has raised concerns about the potential for LLMs to produce discriminatory outcomes and unsafe behaviors in real-world robot experiments and applications. To address these concerns, we conduct an HRI-based evaluation of discrimination and safety criteria on several highly-rated LLMs. Our evaluation reveals that LLMs currently lack robustness when encountering people across a diverse range of protected identity characteristics (e.g., race, gender, disability status, nationality, religion, and their intersections), producing biased outputs consistent with directly discriminatory outcomes -- e.g. 'gypsy' and 'mute' people are labeled untrustworthy, but not 'european' or 'able-bodied' people. Furthermore, we test models in settings with unconstrained natural language (open vocabulary) inputs, and find they fail to act safely, generating responses that accept dangerous, violent, or unlawful instructions -- such as incident-causing misstatements, taking people's mobility aids, and sexual predation. Our results underscore the urgent need for systematic, routine, and comprehensive risk assessments and assurances to improve outcomes and ensure LLMs only operate on robots when it is safe, effective, and just to do so. Data and code will be made available.
摘要:人机交互(HRI)和人工智能(AI)领域的成员提出将大型语言模型(LLM)作为机器人任务的一种有前景的资源,例如自然语言交互、家庭和工作场所任务、近似"常识推理"以及对人类建模。然而,最近的研究引发了对LLM在现实世界机器人实验和应用中产生歧视性结果和不安全行为的担忧。为了解决这些担忧,我们对几个评分较高的LLM进行了基于HRI的歧视与安全标准评估。我们的评估显示,LLM在面对具有各种受保护身份特征(例如种族、性别、残疾状况、国籍、宗教及其交叉)的人时缺乏稳健性,产生与直接歧视性结果一致的有偏见的输出,例如"吉普赛人"和"哑人"被贴上不可信的标签,而"欧洲人"或"健全人"则不会。此外,我们在不受限制的自然语言(开放词汇)输入环境中测试模型,发现它们无法安全行事,会生成接受危险、暴力或非法指令的回应,例如造成事故的错误陈述、夺走人们的行动辅助设备以及性掠夺。我们的结果强调,迫切需要系统、常规和全面的风险评估与保障,以改善结果,并确保只有在安全、有效且公正的情况下才让LLM操控机器人。数据和代码将会公开。

[NLP-56] Linguistic Bias in ChatGPT: Language Models Reinforce Dialect Discrimination
[NLP-56] ChatGPT中的语言偏见:语言模型强化方言歧视

链接: https://arxiv.org/abs/2406.08818
作者: Eve Fleisig,Genevieve Smith,Madeline Bossi,Ishita Rustagi,Xavier Yin,Dan Klein
关键词: Standard American English, Standard British English, American English, British English, ChatGPT covering ten
中文关键词: 标准美式英语、标准英式英语、美式英语、英式英语、ChatGPT涵盖十
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:We present a large-scale study of linguistic bias exhibited by ChatGPT covering ten dialects of English (Standard American English, Standard British English, and eight widely spoken non-“standard” varieties from around the world). We prompted GPT-3.5 Turbo and GPT-4 with text by native speakers of each variety and analyzed the responses via detailed linguistic feature annotation and native speaker evaluation. We find that the models default to “standard” varieties of English; based on evaluation by native speakers, we also find that model responses to non-“standard” varieties consistently exhibit a range of issues: lack of comprehension (10% worse compared to “standard” varieties), stereotyping (16% worse), demeaning content (22% worse), and condescending responses (12% worse). We also find that if these models are asked to imitate the writing style of prompts in non-“standard” varieties, they produce text that exhibits lower comprehension of the input and is especially prone to stereotyping. GPT-4 improves on GPT-3.5 in terms of comprehension, warmth, and friendliness, but it also results in a marked increase in stereotyping (+17%). The results suggest that GPT-3.5 Turbo and GPT-4 exhibit linguistic discrimination in ways that can exacerbate harms for speakers of non-“standard” varieties.
摘要:我们介绍了一项关于ChatGPT语言偏见的大规模研究,涵盖十种英语方言(标准美国英语、标准英国英语以及来自世界各地的八种被广泛使用的非"标准"变体)。我们用每种变体母语者撰写的文本提示GPT-3.5 Turbo和GPT-4,并通过详细的语言特征标注和母语者评估分析其回答。我们发现这些模型默认使用"标准"英语变体;根据母语者的评估,我们还发现模型对非"标准"变体的回答始终存在一系列问题:缺乏理解(比"标准"变体差10%)、刻板印象(差16%)、贬低性内容(差22%)和居高临下的回应(差12%)。我们还发现,如果要求这些模型模仿非"标准"变体提示语的写作风格,其生成的文本对输入的理解程度更低,且尤其容易产生刻板印象。GPT-4在理解力、亲和力和友好度方面优于GPT-3.5,但刻板印象也显著增加(+17%)。结果表明,GPT-3.5 Turbo和GPT-4表现出的语言歧视可能会加剧对非"标准"变体说话者的伤害。

[NLP-57] Automated Essay Scoring Using Grammatical Variety and Errors with Multi-Task Learning and Item Response Theory
[NLP-57] 利用语法多样性和错误以及多任务学习和项目反应理论进行自动论文评分

链接: https://arxiv.org/abs/2406.08817
作者: Kosuke Doi,Katsuhito Sudoh,Satoshi Nakamura
关键词: automatic essay scoring, grammatical features, grammatical, study examines, examines the effect
中文关键词: 自动论文评分、语法特征、语法、学习检查、检查效果
类目: Computation and Language (cs.CL)
备注: Accepted to BEA2024

点击查看摘要

Abstract:This study examines the effect of grammatical features in automatic essay scoring (AES). We use two kinds of grammatical features as input to an AES model: (1) grammatical items that writers used correctly in essays, and (2) the number of grammatical errors. Experimental results show that grammatical features improve the performance of AES models that predict the holistic scores of essays. Multi-task learning with the holistic and grammar scores, alongside using grammatical features, resulted in a larger improvement in model performance. We also show that a model using grammar abilities estimated using Item Response Theory (IRT) as the labels for the auxiliary task achieved comparable performance to when we used grammar scores assigned by human raters. In addition, we weight the grammatical features using IRT to consider the difficulty of grammatical items and writers’ grammar abilities. We found that weighting grammatical features with the difficulty led to further improvement in performance.
摘要:本研究考察了语法特征在作文自动评分中的作用。我们使用两种语法特征作为AES模型的输入:(1)作者在文章中正确使用的语法项目,(2)语法错误的数量。实验结果表明,语法特征提高了AES模型预测作文整体分数的性能。结合整体和语法分数的多任务学习,加上使用语法特征,导致了模型绩效的较大改善。我们还表明,使用项目反应理论(IRT)估计的语法能力作为辅助任务的标签的模型取得了与使用人类评分员分配的语法分数时相当的成绩。此外,我们使用IRT对语法特征进行加权,以考虑语法项目的难度和作者的语法能力。我们发现,将语法特征与难度加权会导致成绩的进一步提高。
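用 IRT 难度为语法特征加权的核心,可以用 Rasch(1PL)模型草绘如下:正确使用越难的语法项,得到的权重越大。这里的权重公式 1 - P(correct) 只是一个示意性选择,难度数值也是假设的,并非论文的实际加权方案。

```python
import math

def rasch_correct_prob(theta, difficulty):
    """Rasch(1PL)模型:能力为 theta 的写作者正确使用某难度语法项的概率。"""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))

def weighted_grammar_score(used_items, difficulties, theta=0.0):
    """难度加权的语法特征:正确使用难项贡献更大(权重 = 1 - P(correct),仅为示意)。"""
    return sum(1.0 - rasch_correct_prob(theta, difficulties[item]) for item in used_items)

# 假设的语法项难度(IRT 标度,数值越大越难)
difficulties = {"past_perfect": 1.5, "articles": -1.0, "relative_clause": 0.5}
print(weighted_grammar_score(["past_perfect", "articles"], difficulties))
```

这样,正确使用过去完成时(难项)对分数的贡献远大于正确使用冠词(易项),与"按难度和能力加权"的直觉一致。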

[NLP-58] Mixture-of-Skills: Learning to Optimize Data Usage for Fine-Tuning Large Language Models
[NLP-58] 技能混合:学习优化数据使用以微调大型语言模型

链接: https://arxiv.org/abs/2406.08811
作者: Minghao Wu,Thuy-Trang Vu,Lizhen Qu,Gholamreza Haffari
关键词: Large language models, Large language, typically fine-tuned, origins to develop, extensive datasets sourced
中文关键词: 大型语言模型,大型语言,通常经过微调,起源开发,来源广泛的数据集
类目: Computation and Language (cs.CL)
备注: Work in progress; 15 pages, 7 tables, 4 figures

点击查看摘要

Abstract:Large language models (LLMs) are typically fine-tuned on diverse and extensive datasets sourced from various origins to develop a comprehensive range of skills, such as writing, reasoning, chatting, coding, and more. Each skill has unique characteristics, and these datasets are often heterogeneous and imbalanced, making the fine-tuning process highly challenging. Balancing the development of each skill while ensuring the model maintains its overall performance requires sophisticated techniques and careful dataset curation. In this work, we propose a general, model-agnostic, reinforcement learning framework, Mixture-of-Skills (MoS), that learns to optimize data usage automatically during the fine-tuning process. This framework ensures the optimal comprehensive skill development of LLMs by dynamically adjusting the focus on different datasets based on their current learning state. To validate the effectiveness of MoS, we conduct extensive experiments using three diverse LLM backbones on two widely used benchmarks and demonstrate that MoS substantially enhances model performance. Building on the success of MoS, we propose MoSpec, an adaptation for task-specific fine-tuning, which harnesses the utilities of various datasets for a specific purpose. Our work underlines the significance of dataset rebalancing and present MoS as a powerful, general solution for optimizing data usage in the fine-tuning of LLMs for various purposes.
摘要:大型语言模型(LLM)通常在来自不同来源的多样而庞大的数据集上进行微调,以发展写作、推理、聊天、编码等广泛的技能。每项技能都有其独特的特点,而这些数据集往往是异质且不平衡的,使微调过程极具挑战性。在保证模型整体性能的同时平衡每项技能的发展,需要复杂的技术和细致的数据集管理。在这项工作中,我们提出了一个通用的、与模型无关的强化学习框架,即技能混合(MoS),它学习在微调过程中自动优化数据使用。该框架根据LLM当前的学习状态动态调整对不同数据集的关注,从而确保LLM获得最佳的综合技能发展。为了验证MoS的有效性,我们使用三个不同的LLM主干在两个广泛使用的基准上进行了大量实验,证明MoS能显著提升模型性能。在MoS成功的基础上,我们提出了MoSpec,一种面向特定任务微调的改进方案,它为特定目的利用各种数据集的效用。我们的工作强调了数据集再平衡的重要性,并表明MoS是一个强大而通用的解决方案,可在面向各种目的微调LLM时优化数据使用。
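MoS"根据当前学习状态动态调整对不同数据集的关注"的思路,可以用一个 EXP3 风格的多臂老虎机草图体会:把每个数据集视为一个臂,用标量奖励(例如验证集上的增益)更新采样分布。MoS 的实际目标函数与更新规则与此不同,以下仅为示意。

```python
import math
import random

class DataMixtureBandit:
    """EXP3 风格的示意草图:从标量奖励中学习各微调数据集的采样分布。"""

    def __init__(self, datasets, lr=0.1, seed=0):
        self.datasets = list(datasets)
        self.weights = {d: 0.0 for d in self.datasets}
        self.lr = lr
        self.rng = random.Random(seed)

    def probs(self):
        # softmax,减去最大值保证数值稳定
        m = max(self.weights.values())
        exps = {d: math.exp(w - m) for d, w in self.weights.items()}
        z = sum(exps.values())
        return {d: e / z for d, e in exps.items()}

    def sample(self):
        r, acc = self.rng.random(), 0.0
        p = self.probs()
        for d in self.datasets:
            acc += p[d]
            if r <= acc:
                return d
        return self.datasets[-1]

    def update(self, dataset, reward):
        # 重要性加权,使低概率数据集的奖励不被系统性低估
        self.weights[dataset] += self.lr * reward / self.probs()[dataset]

bandit = DataMixtureBandit(["code", "math", "chat"])
for _ in range(200):
    d = bandit.sample()
    reward = 1.0 if d == "math" else 0.1   # 假设:math 数据带来最大验证增益
    bandit.update(d, reward)
print(bandit.probs())
```

经过若干轮交互后,采样分布会向奖励最高的数据集集中,对应"把训练预算动态倾斜到最有用的数据"这一直觉。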

[NLP-59] Deep Exploration of Cross-Lingual Zero-Shot Generalization in Instruction Tuning
[NLP-59] 指令调优中跨语言零样本泛化的深入探索

链接: https://arxiv.org/abs/2406.08796
作者: Janghoon Han,Changho Lee,Joongbo Shin,Stanley Jungkyu Choi,Honglak Lee,Kyunghoon Bae
关键词: significantly boosting zero-shot, Instruction tuning, boosting zero-shot performance, powerful technique, significantly boosting
中文关键词: 显著提升零样本性能、指令调优、提升零样本性能、强大的技术、显著提升
类目: Computation and Language (cs.CL)
备注: Findings of ACL 2024 (Camera-ready), by Janghoon Han and Changho Lee, with equal contribution

点击查看摘要

Abstract:Instruction tuning has emerged as a powerful technique, significantly boosting zero-shot performance on unseen tasks. While recent work has explored cross-lingual generalization by applying instruction tuning to multilingual models, previous studies have primarily focused on English, with a limited exploration of non-English tasks. For an in-depth exploration of cross-lingual generalization in instruction tuning, we perform instruction tuning individually for two distinct language meta-datasets. Subsequently, we assess the performance on unseen tasks in a language different from the one used for training. To facilitate this investigation, we introduce a novel non-English meta-dataset named “KORANI” (Korean Natural Instruction), comprising 51 Korean benchmarks. Moreover, we design cross-lingual templates to mitigate discrepancies in language and instruction-format of the template between training and inference within the cross-lingual setting. Our experiments reveal consistent improvements through cross-lingual generalization in both English and Korean, outperforming baseline by average scores of 20.7% and 13.6%, respectively. Remarkably, these enhancements are comparable to those achieved by monolingual instruction tuning and even surpass them in some tasks. The result underscores the significance of relevant data acquisition across languages over linguistic congruence with unseen tasks during instruction tuning.
摘要:指令调优已成为一种强大的技术,能显著提升在未见任务上的零样本性能。虽然最近的工作通过将指令调优应用于多语言模型来探索跨语言泛化,但以往的研究主要集中在英语上,对非英语任务的探索有限。为了深入探索指令调优中的跨语言泛化,我们分别针对两个不同的语言元数据集进行指令调优,随后在不同于训练所用语言的语言上评估未见任务的性能。为便于这项研究,我们引入了一个名为"KORANI"(Korean Natural Instruction)的新的非英语元数据集,包含51个韩语基准。此外,我们设计了跨语言模板,以缓解跨语言设定下训练与推理之间在语言和模板指令格式上的差异。我们的实验表明,跨语言泛化在英语和韩语上都带来了一致的改进,平均分数分别比基线高出20.7%和13.6%。值得注意的是,这些提升可与单语指令调优相媲美,甚至在某些任务中超过后者。该结果强调,在指令调优期间,跨语言获取相关数据比与未见任务的语言一致性更为重要。

[NLP-60] MMFakeBench: A Mixed-Source Multimodal Misinformation Detection Benchmark for LVLMs
[NLP-60] MMFakeBench:LVLM的混合源多模式错误信息检测基准

链接: https://arxiv.org/abs/2406.08772
作者: Xuannan Liu,Zekun Li,Peipei Li,Shuhan Xia,Xing Cui,Linzhi Huang,Huaibo Huang,Weihong Deng,Zhaofeng He
关键词: assume a single, insufficient for real-world, real-world scenarios, scenarios where multiple, forgery sources coexist
中文关键词: 假设一个不足以满足现实世界的现实世界场景,多个伪造源共存的场景
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Current multimodal misinformation detection (MMD) methods often assume a single source and type of forgery for each sample, which is insufficient for real-world scenarios where multiple forgery sources coexist. The lack of a benchmark for mixed-source misinformation has hindered progress in this field. To address this, we introduce MMFakeBench, the first comprehensive benchmark for mixed-source MMD. MMFakeBench includes 3 critical sources: textual veracity distortion, visual veracity distortion, and cross-modal consistency distortion, along with 12 sub-categories of misinformation forgery types. We further conduct an extensive evaluation of 6 prevalent detection methods and 15 large vision-language models (LVLMs) on MMFakeBench under a zero-shot setting. The results indicate that current methods struggle under this challenging and realistic mixed-source MMD setting. Additionally, we propose an innovative unified framework, which integrates rationales, actions, and tool-use capabilities of LVLM agents, significantly enhancing accuracy and generalization. We believe this study will catalyze future research into more realistic mixed-source multimodal misinformation and provide a fair evaluation of misinformation detection methods.
摘要:当前的多模态错误信息检测(MMD)方法通常假设每个样本只有单一的伪造来源和类型,这对于多个伪造来源共存的现实场景是不够的。混合来源错误信息基准的缺失阻碍了该领域的进展。为此,我们引入了MMFakeBench,这是第一个针对混合来源MMD的综合基准。MMFakeBench包括3类关键来源:文本真实性失真、视觉真实性失真和跨模态一致性失真,以及12个子类别的错误信息伪造类型。我们进一步在零样本设置下,在MMFakeBench上对6种流行的检测方法和15个大型视觉语言模型(LVLM)进行了广泛评估。结果表明,当前方法在这种具有挑战性且贴近现实的混合来源MMD设置下表现不佳。此外,我们提出了一个创新的统一框架,集成了LVLM智能体的推理、行动和工具使用能力,显著提高了准确性和泛化能力。我们相信这项研究将促进未来对更现实的混合来源多模态错误信息的研究,并为错误信息检测方法提供公平的评估。

[NLP-61] SRFUND: A Multi-Granularity Hierarchical Structure Reconstruction Benchmark in Form Understanding
[NLP-61] SRFUND:形式理解的多粒度分层结构重建基准

链接: https://arxiv.org/abs/2406.08757
作者: Jiefeng Ma,Yan Wang,Chenyu Liu,Jun Du,Yu Hu,Zhenrong Zhang,Pengfei Hu,Qing Wang,Jianshu Zhang
关键词: organizing textual content, Accurately identifying, FUNSD and XFUND, form understanding, identifying and organizing
中文关键词: 组织文本内容,准确识别、FUNSD和XFUND,形成理解、识别和组织
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: NeurIPS 2024 Track on Datasets and Benchmarks under review

点击查看摘要

Abstract:Accurately identifying and organizing textual content is crucial for the automation of document processing in the field of form understanding. Existing datasets, such as FUNSD and XFUND, support entity classification and relationship prediction tasks but are typically limited to local and entity-level annotations. This limitation overlooks the hierarchically structured representation of documents, constraining comprehensive understanding of complex forms. To address this issue, we present the SRFUND, a hierarchically structured multi-task form understanding benchmark. SRFUND provides refined annotations on top of the original FUNSD and XFUND datasets, encompassing five tasks: (1) word to text-line merging, (2) text-line to entity merging, (3) entity category classification, (4) item table localization, and (5) entity-based full-document hierarchical structure recovery. We meticulously supplemented the original dataset with missing annotations at various levels of granularity and added detailed annotations for multi-item table regions within the forms. Additionally, we introduce global hierarchical structure dependencies for entity relation prediction tasks, surpassing traditional local key-value associations. The SRFUND dataset includes eight languages including English, Chinese, Japanese, German, French, Spanish, Italian, and Portuguese, making it a powerful tool for cross-lingual form understanding. Extensive experimental results demonstrate that the SRFUND dataset presents new challenges and significant opportunities in handling diverse layouts and global hierarchical structures of forms, thus providing deep insights into the field of form understanding. The original dataset and implementations of baseline methods are available at this https URL
摘要:在表格理解领域,文本内容的准确识别和组织是实现文档处理自动化的关键。现有的数据集,如FUNSD和XFUND,支持实体分类和关系预测任务,但通常仅限于本地和实体级别的注释。这种限制忽略了文档的层次结构表示,限制了对复杂表单的全面理解。为了解决这个问题,我们提出了SRFUND,一个层次结构的多任务表单理解基准。SRFUND在原始FUNSD和XFUND数据集的基础上提供了改进的注释,包括五个任务:(1)字到文本行的合并,(2)文本行到实体的合并,(3)实体类别分类,(4)项目表本地化,以及(5)基于实体的全文档层次结构恢复。我们在不同的粒度级别用缺失的注释仔细地补充了原始数据集,并为表单中的多项表区域添加了详细的注释。此外,我们在实体关系预测任务中引入了全局层次结构依赖关系,超越了传统的局部键-值关联。SRFUND数据集包括英语、汉语、日语、德语、法语、西班牙语、意大利语和葡萄牙语等八种语言,使其成为跨语言形式理解的强大工具。大量的实验结果表明,SRFUND数据集在处理表单的不同布局和全局层次结构方面提出了新的挑战和重要的机会,从而为表单理解领域提供了深刻的见解。基线方法的原始数据集和实现可在此HTTPS URL中获得
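
SRFUND所述的自底向上层次合并(词→文本行→实体)可以用一个极简示意来说明。以下是一个假设性的Python草图(数据结构和阈值均为示意性假设,并非该基准的真实格式),按纵坐标近似分组将词框合并为文本行:

```python
# 假设性草图:SRFUND式的自底向上合并第一步(词→文本行)。
# 词用 (文本, x, y) 三元组表示;阈值 y_tol 为示意值。

def merge_words_to_lines(words, y_tol=5):
    """按相近的y坐标将词框分组为文本行,行内再按x排序拼接。"""
    lines = []
    for word in sorted(words, key=lambda w: (w[2], w[1])):  # 先按y、再按x排序
        for line in lines:
            if abs(line[-1][2] - word[2]) <= y_tol:  # 与该行末词基线相近
                line.append(word)
                break
        else:
            lines.append([word])  # 开启新的一行
    return [" ".join(w[0] for w in sorted(line, key=lambda w: w[1]))
            for line in lines]

words = [("Name:", 10, 20), ("Alice", 60, 21), ("Date:", 10, 40), ("2024", 55, 41)]
print(merge_words_to_lines(words))  # 重建出两条文本行
```

后续的文本行→实体合并可在此基础上用类似的邻近性规则继续分组。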

[NLP-62] StructuralSleight: Automated Jailbreak Attacks on Large Language Models Utilizing Uncommon Text-Encoded Structure
[NLP-62] StructuralSleight:利用不常见的文本编码结构对大型语言模型进行自动越狱攻击

链接: https://arxiv.org/abs/2406.08754
作者: Bangxin Li,Hengrui Xing,Chao Huang,Jin Qian,Huangqing Xiao,Linfeng Feng,Cong Tian
关键词: Large Language Models, natural language processing, Large Language, Language Models, generate harmful content
中文关键词: 大型语言模型、自然语言处理、大型语言、语言模型、生成有害内容
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注: 12 pages, 4 figures

点击查看摘要

Abstract:Large Language Models (LLMs) are widely used in natural language processing but face the risk of jailbreak attacks that maliciously induce them to generate harmful content. Existing jailbreak attacks, including character-level and context-level attacks, mainly focus on the prompt of the plain text without specifically exploring the significant influence of its structure. In this paper, we focus on studying how prompt structure contributes to the jailbreak attack. We introduce a novel structure-level attack method based on tail structures that are rarely used during LLM training, which we refer to as Uncommon Text-Encoded Structure (UTES). We extensively study 12 UTESs templates and 6 obfuscation methods to build an effective automated jailbreak tool named StructuralSleight that contains three escalating attack strategies: Structural Attack, Structural and Character/Context Obfuscation Attack, and Fully Obfuscated Structural Attack. Extensive experiments on existing LLMs show that StructuralSleight significantly outperforms baseline methods. In particular, the attack success rate reaches 94.62% on GPT-4o, which has not been addressed by state-of-the-art techniques.
摘要:大语言模型(LLM)在自然语言处理中得到了广泛应用,但也面临越狱攻击的风险,这类攻击会恶意诱导模型生成有害内容。现有的越狱攻击,包括字符级和上下文级攻击,主要集中在明文提示上,没有具体探讨提示结构的重大影响。本文重点研究提示结构如何促成越狱攻击。我们提出了一种基于LLM训练中很少使用的尾部结构的新型结构级攻击方法,称为非常见文本编码结构(UTES)。我们深入研究了12个UTES模板和6种混淆方法,构建了一个有效的自动化越狱工具StructuralSleight,它包含三种逐步升级的攻击策略:结构攻击、结构与字符/上下文混淆攻击,以及完全混淆结构攻击。在现有LLM上的大量实验表明,StructuralSleight的性能明显优于基线方法。特别是,其在GPT-4o上的攻击成功率达到94.62%,这是最新技术尚未解决的问题。

[NLP-63] StreamBench: Towards Benchmarking Continuous Improvement of Language Agents
[NLP-63] StreamBench:迈向语言代理的持续改进基准

链接: https://arxiv.org/abs/2406.08747
作者: Cheng-Kuang Wu,Zhi Rui Tam,Chieh-Yen Lin,Yun-Nung Chen,Hung-yi Lee
关键词: large language model, continuous enhancement post-deployment, language model, enhancement post-deployment, Recent works
中文关键词: 大型语言模型,部署后持续增强,语言模型,部署后增强,最近的作品
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent works have shown that large language model (LLM) agents are able to improve themselves from experience, which is an important ability for continuous enhancement post-deployment. However, existing benchmarks primarily evaluate their innate capabilities and do not assess their ability to improve over time. To address this gap, we introduce StreamBench, a pioneering benchmark designed to evaluate the continuous improvement of LLM agents over an input-feedback sequence. StreamBench simulates an online learning environment where LLMs receive a continuous flow of feedback stream and iteratively enhance their performance. In addition, we propose several simple yet effective baselines for improving LLMs on StreamBench, and provide a comprehensive analysis to identify critical components that contribute to successful streaming strategies. Our work serves as a stepping stone towards developing effective online learning strategies for LLMs, paving the way for more adaptive AI systems in streaming scenarios.
摘要:最近的研究表明,大型语言模型(LLM)智能体能够从经验中自我改进,这是部署后持续增强的重要能力。然而,现有基准主要评估其固有能力,而不评估其随时间改进的能力。为弥补这一差距,我们引入了StreamBench,这是一个开创性的基准,旨在评估LLM智能体在输入-反馈序列上的持续改进。StreamBench模拟了一个在线学习环境:LLM接收连续的反馈流,并迭代地提高其性能。此外,我们提出了几个简单而有效的基线方法来改进LLM在StreamBench上的表现,并提供了全面的分析,以识别促成成功流式策略的关键组件。我们的工作是为LLM开发有效在线学习策略的垫脚石,为流式场景中更具适应性的AI系统铺平了道路。
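
StreamBench所刻画的"输入-反馈流"设定,可以用一个极简草图来示意(并非StreamBench的真实协议):智能体在回答每个查询后收到真实答案作为反馈,并将纠正存入记忆,从而在重复查询上表现提升:

```python
# 示意性草图(并非StreamBench的真实协议):智能体回答查询流,
# 每次回答后收到正确答案作为反馈并存入记忆,
# 使得重复出现的查询能够被正确回答。

class StreamAgent:
    def __init__(self):
        self.memory = {}  # 查询 -> 从反馈中学到的正确答案

    def answer(self, query):
        return self.memory.get(query, "unknown")  # 收到反馈前只能回退到默认值

    def receive_feedback(self, query, correct_answer):
        self.memory[query] = correct_answer

stream = [("2+2", "4"), ("capital of FR", "Paris"), ("2+2", "4")]
agent, correct = StreamAgent(), 0
for query, truth in stream:
    correct += agent.answer(query) == truth  # 先作答,再接收反馈
    agent.receive_feedback(query, truth)
print(correct)  # 重复的查询在第二次出现时被答对
```

真实基准中的"记忆"可以替换为检索库、提示精炼等更复杂的改进机制,但评测回路的形状与上面一致。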

[NLP-64] Standard Language Ideology in AI-Generated Language
[NLP-64] 人工智能生成语言中的标准语言意识形态

链接: https://arxiv.org/abs/2406.08726
作者: Genevieve Smith,Eve Fleisig,Madeline Bossi,Ishita Rustagi,Xavier Yin
关键词: standard language ideology, large language models, language ideology, explore standard language, language
中文关键词: 标准语言意识形态,大语言模型,语言意识形态,探索标准语言,语言
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In this position paper, we explore standard language ideology in language generated by large language models (LLMs). First, we outline how standard language ideology is reflected and reinforced in LLMs. We then present a taxonomy of open problems regarding standard language ideology in AI-generated language with implications for minoritized language communities. We introduce the concept of standard AI-generated language ideology, the process by which AI-generated language regards Standard American English (SAE) as a linguistic default and reinforces a linguistic bias that SAE is the most “appropriate” language. Finally, we discuss tensions that remain, including reflecting on what desirable system behavior looks like, as well as advantages and drawbacks of generative AI tools imitating–or often not–different English language varieties. Throughout, we discuss standard language ideology as a manifestation of existing global power structures in and through AI-generated language before ending with questions to move towards alternative, more emancipatory digital futures.
摘要:在这篇立场论文中,我们探讨了大型语言模型(LLM)生成语言中的标准语言意识形态。首先,我们概述标准语言意识形态如何在LLM中得到反映和强化。然后,我们提出了一个关于AI生成语言中标准语言意识形态的未决问题分类,并讨论其对少数化语言社区的影响。我们引入了"标准AI生成语言意识形态"的概念,即AI生成语言将标准美国英语(SAE)视为语言默认值,并强化SAE是最"合适"语言这一语言偏见的过程。最后,我们讨论仍然存在的张力,包括反思理想的系统行为是什么样子,以及生成式AI工具模仿(或往往并不模仿)不同英语变体的优点和缺点。全文将标准语言意识形态视为现有全球权力结构在AI生成语言中的体现,并以指向替代性的、更具解放性的数字未来的问题作结。

[NLP-65] ECBD: Evidence-Centered Benchmark Design for NLP
[NLP-65] ECBD:以证据为中心的NLP基准设计

链接: https://arxiv.org/abs/2406.08723
作者: Yu Lu Liu,Su Lin Blodgett,Jackie Chi Kit Cheung,Q. Vera Liao,Alexandra Olteanu,Ziang Xiao
关键词: progress in NLP, benchmark, critical to assessing, assessing progress, Benchmark Design
中文关键词: NLP进展,基准,对评估至关重要,评估进展,基准设计
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Benchmarking is seen as critical to assessing progress in NLP. However, creating a benchmark involves many design decisions (e.g., which datasets to include, which metrics to use) that often rely on tacit, untested assumptions about what the benchmark is intended to measure or is actually measuring. There is currently no principled way of analyzing these decisions and how they impact the validity of the benchmark’s measurements. To address this gap, we draw on evidence-centered design in educational assessments and propose Evidence-Centered Benchmark Design (ECBD), a framework which formalizes the benchmark design process into five modules. ECBD specifies the role each module plays in helping practitioners collect evidence about capabilities of interest. Specifically, each module requires benchmark designers to describe, justify, and support benchmark design choices – e.g., clearly specifying the capabilities the benchmark aims to measure or how evidence about those capabilities is collected from model responses. To demonstrate the use of ECBD, we conduct case studies with three benchmarks: BoolQ, SuperGLUE, and HELM. Our analysis reveals common trends in benchmark design and documentation that could threaten the validity of benchmarks’ measurements.
摘要:基准测试被认为是评估NLP进展的关键。然而,创建基准涉及许多设计决策(例如,要包括哪些数据集、使用哪些度量标准),这些决策通常依赖于关于基准要测量什么或实际正在测量什么的隐含的、未经检验的假设。目前还没有一种原则性的方法来分析这些决策以及它们如何影响基准测量的有效性。为了弥补这一差距,我们借鉴了教育评价中以证据为中心的设计,并提出了以证据为中心的基准设计(ECBD),这是一个将基准设计过程形式化为五个模块的框架。ECBD规定了每个模块在帮助从业者收集有关目标能力的证据方面所扮演的角色。具体地说,每个模块都要求基准设计者描述、论证和支持基准设计选择,例如,清楚地指定基准旨在测量的能力,或者如何从模型响应中收集关于这些能力的证据。为了演示ECBD的使用,我们使用三个基准进行了案例研究:BoolQ、SuperGLUE和HELM。我们的分析揭示了基准设计和文档方面的共同趋势,这些趋势可能会威胁基准测量的有效性。

[NLP-66] Enhancing Psychotherapy Counseling: A Data Augmentation Pipeline Leveraging Large Language Models for Counseling Conversations
[NLP-66] 加强心理治疗咨询:利用大型语言模型进行咨询对话的数据增强管道

链接: https://arxiv.org/abs/2406.08718
作者: Jun-Woo Kim,Ji-Eun Han,Jun-Seok Koh,Hyeon-Tae Seo,Du-Seong Chang
关键词: Large Language Models, leverages Large Language, Language Models, Large Language, transform single-turn psychotherapy
中文关键词: 大型语言模型,利用大型语言,语言模型,大型语言,转变单轮心理治疗
类目: Computation and Language (cs.CL)
备注: IJCAI 2024 AI4Research workshop

点击查看摘要

Abstract:We introduce a pipeline that leverages Large Language Models (LLMs) to transform single-turn psychotherapy counseling sessions into multi-turn interactions. While AI-supported online counseling services for individuals with mental disorders exist, they are often constrained by the limited availability of multi-turn training datasets and frequently fail to fully utilize therapists’ expertise. Our proposed pipeline effectively addresses these limitations. The pipeline comprises two main steps: 1) Information Extraction and 2) Multi-turn Counseling Generation. Each step is meticulously designed to extract and generate comprehensive multi-turn counseling conversations from the available datasets. Experimental results from both zero-shot and few-shot generation scenarios demonstrate that our approach significantly enhances the ability of LLMs to produce higher quality multi-turn dialogues in the context of mental health counseling. Our pipeline and dataset are publicly available this https URL.
摘要:我们介绍了一种利用大型语言模型(LLM)将单轮心理治疗咨询会话转变为多轮互动的流水线。虽然已有面向精神障碍患者的AI支持在线咨询服务,但它们往往受到多轮训练数据集可用性有限的制约,且常常无法充分利用治疗师的专业知识。我们提出的流水线有效地解决了这些限制。该流水线包括两个主要步骤:1)信息提取和2)多轮咨询生成。每一步都经过精心设计,以从可用数据集中提取并生成全面的多轮咨询对话。零样本和少样本两种生成情景下的实验结果表明,我们的方法显著提高了LLM在心理健康咨询背景下生成高质量多轮对话的能力。我们的流水线和数据集已通过该HTTPS URL公开。

[NLP-67] mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus
[NLP-67] mOSCAR:大规模多语言多模态文档级语料库

链接: https://arxiv.org/abs/2406.08707
作者: Matthieu Futeral,Armel Zebaze,Pedro Ortiz Suarez,Julien Abadji,Rémi Lacroix,Cordelia Schmid,Rachel Bawden,Benoît Sagot
关键词: large amount, Multimodal Large Language, Large Language Models, Multimodal Large, Large Language
中文关键词: 大量,多模式大型语言,大型语言模型,多模式大型,大型语言
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint. Under review

点击查看摘要

Abstract:Multimodal Large Language Models (mLLMs) are trained on a large amount of text-image data. While most mLLMs are trained on caption-like data only, Alayrac et al. [2022] showed that additionally training them on interleaved sequences of text and images can lead to the emergence of in-context learning capabilities. However, the dataset they used, M3W, is not public and is only in English. There have been attempts to reproduce their results but the released datasets are English-only. In contrast, current multilingual and multimodal datasets are either composed of caption-like only or medium-scale or fully private data. This limits mLLM research for the 7,000 other languages spoken in the world. We therefore introduce mOSCAR, to the best of our knowledge the first large-scale multilingual and multimodal document corpus crawled from the web. It covers 163 languages, 315M documents, 214B tokens and 1.2B images. We carefully conduct a set of filtering and evaluation steps to make sure mOSCAR is sufficiently safe, diverse and of good quality. We additionally train two types of multilingual model to prove the benefits of mOSCAR: (1) a model trained on a subset of mOSCAR and captioning data and (2) a model train on captioning data only. The model additionally trained on mOSCAR shows a strong boost in few-shot learning performance across various multilingual image-text tasks and benchmarks, confirming previous findings for English-only mLLMs.
摘要:多模态大语言模型(mLLM)是在大量文本-图像数据上训练的。虽然大多数mLLM只在类字幕数据上训练,但Alayrac等人[2022]表明,额外在文本和图像交错序列上训练可以催生上下文学习能力。然而,他们使用的数据集M3W并不公开,且仅有英语。有人尝试复现其结果,但发布的数据集仅限英语。相比之下,当前的多语言多模态数据集要么只包含类字幕数据,要么规模中等或完全私有。这限制了针对世界上其他7000种语言的mLLM研究。因此,我们推出了mOSCAR,据我们所知,这是第一个从网络爬取的大规模多语言多模态文档语料库。它涵盖163种语言、3.15亿份文档、2140亿个词元和12亿张图像。我们仔细执行了一系列过滤和评估步骤,以确保mOSCAR足够安全、多样且高质量。我们还训练了两类多语言模型来证明mOSCAR的好处:(1)在mOSCAR子集和字幕数据上训练的模型;(2)仅在字幕数据上训练的模型。额外在mOSCAR上训练的模型在各种多语言图文任务和基准上显示出少样本学习性能的大幅提升,证实了此前针对纯英语mLLM的研究结果。

[NLP-68] VLind-Bench: Measuring Language Priors in Large Vision-Language Models
[NLP-68] VLind-Bench:测量大型视觉语言模型中的语言先验

链接: https://arxiv.org/abs/2406.08702
作者: Kang-il Lee,Minbeom Kim,Seunghyun Yoon,Minsung Kim,Dongryeol Lee,Hyukhun Koh,Kyomin Jung
关键词: Large Vision-Language Models, demonstrated outstanding performance, Large Vision-Language, language priors, language
中文关键词: 大型视觉语言模型,表现出出色的性能,大型视觉语言,语言先验,语言
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Large Vision-Language Models (LVLMs) have demonstrated outstanding performance across various multimodal tasks. However, they suffer from a problem known as language prior, where responses are generated based solely on textual patterns while disregarding image information. Addressing the issue of language prior is crucial, as it can lead to undesirable biases or hallucinations when dealing with images that are out of training distribution. Despite its importance, current methods for accurately measuring language priors in LVLMs are poorly studied. Although existing benchmarks based on counterfactual or out-of-distribution images can partially be used to measure language priors, they fail to disentangle language priors from other confounding factors. To this end, we propose a new benchmark called VLind-Bench, which is the first benchmark specifically designed to measure the language priors, or blindness, of LVLMs. It not only includes tests on counterfactual images to assess language priors but also involves a series of tests to evaluate more basic capabilities such as commonsense knowledge, visual perception, and commonsense biases. For each instance in our benchmark, we ensure that all these basic tests are passed before evaluating the language priors, thereby minimizing the influence of other factors on the assessment. The evaluation and analysis of recent LVLMs in our benchmark reveal that almost all models exhibit a significant reliance on language priors, presenting a strong challenge in the field.
摘要:大型视觉语言模型(LVLM)在各种多模态任务中表现出优异的性能。然而,它们受到一个被称为"语言先验"的问题困扰:模型仅根据文本模式生成响应,而忽略图像信息。解决语言先验问题至关重要,因为在处理训练分布之外的图像时,它可能导致不良偏见或幻觉。尽管该问题很重要,但目前对如何准确测量LVLM中语言先验的方法研究很少。虽然现有的基于反事实或分布外图像的基准可以部分地用来衡量语言先验,但它们无法将语言先验与其他混淆因素分开。为此,我们提出了一个名为VLind-Bench的新基准,这是第一个专门为测量LVLM的语言先验(或"盲视")而设计的基准。它不仅包括对反事实图像的测试来评估语言先验,还包括一系列测试来评估更基本的能力,如常识知识、视觉感知和常识偏见。对于基准中的每个实例,我们确保在评估语言先验之前先通过所有这些基本测试,从而将其他因素对评估的影响降至最低。对近期LVLM的评估和分析表明,几乎所有模型都严重依赖语言先验,这是该领域面临的一大挑战。

[NLP-69] Analyzing Large Language Models for Classroom Discussion Assessment
[NLP-69] 分析课堂讨论评估的大型语言模型

链接: https://arxiv.org/abs/2406.08680
作者: Nhat Tran,Benjamin Pierce,Diane Litman,Richard Correnti,Lindsay Clare Matsumura
关键词: Automatically assessing classroom, large language models, assessing classroom discussion, classroom discussion quality, Automatically assessing
中文关键词: 自动评估课堂、大型语言模型、评估课堂讨论、课堂讨论质量、自动评估
类目: Computation and Language (cs.CL)
备注: EDM 2024 Short Paper

点击查看摘要

Abstract:Automatically assessing classroom discussion quality is becoming increasingly feasible with the help of new NLP advancements such as large language models (LLMs). In this work, we examine how the assessment performance of 2 LLMs interacts with 3 factors that may affect performance: task formulation, context length, and few-shot examples. We also explore the computational efficiency and predictive consistency of the 2 LLMs. Our results suggest that the 3 aforementioned factors do affect the performance of the tested LLMs and there is a relation between consistency and performance. We recommend a LLM-based assessment approach that has a good balance in terms of predictive performance, computational efficiency, and consistency.
摘要:借助大型语言模型(LLM)等NLP新进展,自动评估课堂讨论质量正变得越来越可行。在这项工作中,我们考察了2个LLM的评估性能如何与3个可能影响性能的因素相互作用:任务形式、上下文长度和少样本示例。我们还探索了这2个LLM的计算效率和预测一致性。结果表明,上述3个因素确实会影响被测LLM的性能,且一致性与性能之间存在关联。我们推荐一种在预测性能、计算效率和一致性之间取得良好平衡的基于LLM的评估方法。

[NLP-70] HelpSteer2: Open-source dataset for training top-performing reward models
[NLP-70] HelpSteer2:用于训练最佳性能奖励模型的开源数据集

链接: https://arxiv.org/abs/2406.08673
作者: Zhilin Wang,Yi Dong,Olivier Delalleau,Jiaqi Zeng,Gerald Shen,Daniel Egert,Jimmy J. Zhang,Makesh Narsimhan Sreedhar,Oleksii Kuchaiev
关键词: guide large language, generating high-quality responses, High-quality preference datasets, high-quality responses aligned, generating high-quality
中文关键词: 引导大型语言,生成高质量响应,高质量偏好数据集,对齐高质量响应,生成高质量
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:High-quality preference datasets are essential for training reward models that can effectively guide large language models (LLMs) in generating high-quality responses aligned with human preferences. As LLMs become stronger and better aligned, permissively licensed preference datasets, such as Open Assistant, HH-RLHF, and HelpSteer need to be updated to remain effective for reward modeling. Methods that distil preference data from proprietary LLMs such as GPT-4 have restrictions on commercial usage imposed by model providers. To improve upon both generated responses and attribute labeling quality, we release HelpSteer2, a permissively licensed preference dataset (CC-BY-4.0). Using a powerful internal base model trained on HelpSteer2, we are able to achieve the SOTA score (92.0%) on Reward-Bench’s primary dataset, outperforming currently listed open and proprietary models, as of June 12th, 2024. Notably, HelpSteer2 consists of only ten thousand response pairs, an order of magnitude fewer than existing preference datasets (e.g., HH-RLHF), which makes it highly efficient for training reward models. Our extensive experiments demonstrate that reward models trained with HelpSteer2 are effective in aligning LLMs. In particular, we propose SteerLM 2.0, a model alignment approach that can effectively make use of the rich multi-attribute score predicted by our reward models. HelpSteer2 is available at this https URL and code is available at this https URL
摘要:高质量的偏好数据集对于训练奖励模型至关重要,这些模型可以有效地指导大型语言模型(LLM)生成符合人类偏好的高质量回复。随着LLM变得更强、对齐程度更高,许可宽松的偏好数据集(如Open Assistant、HH-RLHF和HelpSteer)需要更新才能对奖励建模保持有效。从GPT-4等专有LLM中蒸馏偏好数据的方法,受到模型提供方对商业用途的限制。为了同时提高生成回复和属性标注的质量,我们发布了HelpSteer2,一个许可宽松的偏好数据集(CC-BY-4.0)。使用在HelpSteer2上训练的强大内部基础模型,截至2024年6月12日,我们在Reward-Bench的主要数据集上取得了SOTA分数(92.0%),优于目前榜单上的开放和专有模型。值得注意的是,HelpSteer2只包含一万个回复对,比现有偏好数据集(如HH-RLHF)少一个数量级,这使其训练奖励模型非常高效。我们的大量实验表明,用HelpSteer2训练的奖励模型能有效对齐LLM。特别地,我们提出了SteerLM 2.0,这是一种能够有效利用奖励模型预测的丰富多属性分数的模型对齐方法。HelpSteer2可通过此HTTPS URL获取,代码可通过此HTTPS URL获取。
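
SteerLM式的多属性奖励聚合思路可以用如下草图示意:将各属性分数加权求和为标量奖励,用于在候选回复间排序。属性名沿用HelpSteer的标注维度(此处仅取其子集),但权重为示意性假设,并非论文实际使用的数值:

```python
# 假设性草图:多属性分数的加权求和。属性名来自HelpSteer的标注维度,
# 权重仅为示意,并非论文的真实配置。

ATTRIBUTE_WEIGHTS = {
    "helpfulness": 0.65, "correctness": 0.2,
    "coherence": 0.1, "complexity": 0.05,
}

def scalar_reward(scores):
    """各属性分数(0-4量表)的加权和,作为排序用的标量奖励。"""
    return sum(ATTRIBUTE_WEIGHTS[a] * scores[a] for a in ATTRIBUTE_WEIGHTS)

response_a = {"helpfulness": 4, "correctness": 4, "coherence": 4, "complexity": 2}
response_b = {"helpfulness": 2, "correctness": 3, "coherence": 4, "complexity": 1}
print(scalar_reward(response_a), scalar_reward(response_b))
```

实际训练中,这类标量通常由奖励模型端到端预测或在多属性头之上学习组合,而非手工设定权重。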

[NLP-71] Fine-Tuned Small LLMs (Still) Significantly Outperform Zero-Shot Generative AI Models in Text Classification
[NLP-71] 微调的小型LLM(仍然)在文本分类方面显著优于零样本生成AI模型

链接: https://arxiv.org/abs/2406.08660
作者: Martin Juan José Bucher,Marco Martini
关键词: offers a simple, prompt-based alternative, smaller BERT-style LLMs, Generative AI offers, fine-tuning smaller BERT-style
中文关键词: Generative AI提供了一种简单的、基于预算的替代方案、更小的BERT式LLM,微调更小的BERT式
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Generative AI offers a simple, prompt-based alternative to fine-tuning smaller BERT-style LLMs for text classification tasks. This promises to eliminate the need for manually labeled training data and task-specific model training. However, it remains an open question whether tools like ChatGPT can deliver on this promise. In this paper, we show that smaller, fine-tuned LLMs (still) consistently and significantly outperform larger, zero-shot prompted models in text classification. We compare three major generative AI models (ChatGPT with GPT-3.5/GPT-4 and Claude Opus) with several fine-tuned LLMs across a diverse set of classification tasks (sentiment, approval/disapproval, emotions, party positions) and text categories (news, tweets, speeches). We find that fine-tuning with application-specific training data achieves superior performance in all cases. To make this approach more accessible to a broader audience, we provide an easy-to-use toolkit alongside this paper. Our toolkit, accompanied by non-technical step-by-step guidance, enables users to select and fine-tune BERT-like LLMs for any classification task with minimal technical and computational effort.
摘要:生成式人工智能为文本分类任务提供了一种简单、基于提示的替代方案,可以微调较小的BERT样式的LLM。这有望消除对手动标记训练数据和特定于任务的模型训练的需要。然而,像ChatGPT这样的工具能否兑现这一承诺仍然是一个悬而未决的问题。在这篇文章中,我们表明,在文本分类中,较小的、微调的LLMS(仍然)一致且显著地优于较大的、零镜头提示模型。我们比较了三个主要的生成性人工智能模型(ChatGPT与GPT-3.5/GPT-4和Claude Opus)与几个微调的LLM,跨越不同的分类任务集(情感、赞成/反对、情感、政党立场)和文本类别(新闻、推文、演讲)。我们发现,使用特定于应用程序的训练数据进行微调在所有情况下都能获得卓越的性能。为了使这种方法更容易为更广泛的受众所接受,我们在本文旁边提供了一个易于使用的工具包。我们的工具包,伴随着非技术性的逐步指导,使用户能够选择和微调任何分类任务的类似BERT的LLM,只需最少的技术和计算工作。
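
比较微调模型与零样本基线时,常用宏平均F1(macro-F1)作为分类指标。以下是其标准定义的极简实现(并非该论文的实际评测代码),示例预测数据为虚构:

```python
# macro-F1的标准定义:对每个类别分别计算F1,再取算术平均。
# 示例中的标签与两组预测均为虚构数据,仅用于演示指标计算。

def macro_f1(y_true, y_pred):
    labels = set(y_true) | set(y_pred)
    f1s = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        f1s.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(f1s) / len(f1s)

y_true     = ["pos", "neg", "pos", "neg", "pos"]
fine_tuned = ["pos", "neg", "pos", "neg", "neg"]  # 虚构的微调模型预测
zero_shot  = ["pos", "pos", "neg", "neg", "pos"]  # 虚构的零样本基线预测
print(macro_f1(y_true, fine_tuned), macro_f1(y_true, zero_shot))
```

宏平均对各类别一视同仁,因此在类别不平衡的分类任务中比简单准确率更能反映模型差异。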

[NLP-72] Mistral-C2F: Coarse to Fine Actor for Analytical and Reasoning Enhancement in RLHF and Effective-Merged LLMs
[NLP-72] Mistral-C2F:用于增强RLHF和有效合并LLM的分析与推理能力的从粗到细Actor模型

链接: https://arxiv.org/abs/2406.08657
作者: Chen Zheng,Ke Sun,Xun Zhou
关键词: Coarse Actor, Policy-based Coarse Actor, Large Language Models, advances in Large, Large Language
中文关键词: 粗演员、基于策略的粗演员、大型语言模型、大型语言的进步
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Despite the advances in Large Language Models (LLMs), exemplified by models like GPT-4 and Claude, smaller-scale LLMs such as Llama and Mistral often struggle with generating in-depth and coherent dialogues. This paper presents a novel two-step Coarse-to-Fine Actor model to address the inherent limitations in conversational and analytical capabilities of small-sized LLMs. Our approach begins with the Policy-based Coarse Actor, employing a technique we term “Continuous Maximization”. The Coarse Actor establishes an enhanced, knowledge-rich pool adept at aligning with human preference styles in analysis and reasoning. Through the RLHF process, it employs Continuous Maximization, a strategy that dynamically and adaptively extends the output length limit, enabling the generation of more detailed and analytical content. Subsequently, the Fine Actor refines this analytical content, addressing the generation of excessively redundant information from the Coarse Actor. We introduce a “Knowledge Residue Merger” approach, refining the content from the Coarse Actor and merging it with an existing Instruction model to improve quality, correctness, and reduce redundancies. We applied our methodology to the popular Mistral model, creating Mistral-C2F, which has demonstrated exceptional performance across 11 general language tasks and the MT-Bench Dialogue task, outperforming similar-scale models and even larger models with 13B and 30B parameters. Our model has significantly improved conversational and analytical reasoning abilities.
摘要:尽管以GPT-4和Claude为代表的大型语言模型(LLM)取得了很大进展,但Llama和Mistral等较小规模的LLM往往难以生成深入连贯的对话。本文提出了一种新颖的两步从粗到细Actor模型,以解决小型LLM在会话和分析能力上的固有局限。我们的方法从基于策略的粗粒度Actor开始,采用一种我们称之为"连续最大化"的技术。粗粒度Actor建立了一个增强的、知识丰富的池,擅长在分析和推理中对齐人类偏好风格。在RLHF过程中,它采用连续最大化策略,动态自适应地扩展输出长度限制,从而生成更详细、更具分析性的内容。随后,细粒度Actor对该分析内容进行提炼,解决粗粒度Actor产生过多冗余信息的问题。我们引入了"知识残余合并"方法,提炼粗粒度Actor的内容,并将其与现有指令模型合并,以提高质量、正确性并减少冗余。我们将该方法应用于流行的Mistral模型,创建了Mistral-C2F,它在11个通用语言任务和MT-Bench对话任务中表现出色,优于同等规模的模型,甚至优于具有13B和30B参数的更大模型。我们的模型显著提高了会话和分析推理能力。

[NLP-73] TC-Bench: Benchmarking Temporal Compositionality in Text-to-Video and Image-to-Video Generation
[NLP-73] TC-Bench:文本到视频和图像到视频生成中的时间组合性基准

链接: https://arxiv.org/abs/2406.08656
作者: Weixi Feng,Jiachen Li,Michael Saxon,Tsu-jui Fu,Wenhu Chen,William Yang Wang
关键词: unique challenges, Video generation, video generation models, image generation, videos
中文关键词: 独特的挑战、视频生成、视频生成模型、图像生成、视频
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Video generation has many unique challenges beyond those of image generation. The temporal dimension introduces extensive possible variations across frames, over which consistency and continuity may be violated. In this study, we move beyond evaluating simple actions and argue that generated videos should incorporate the emergence of new concepts and their relation transitions like in real-world videos as time progresses. To assess the Temporal Compositionality of video generation models, we propose TC-Bench, a benchmark of meticulously crafted text prompts, corresponding ground truth videos, and robust evaluation metrics. The prompts articulate the initial and final states of scenes, effectively reducing ambiguities for frame development and simplifying the assessment of transition completion. In addition, by collecting aligned real-world videos corresponding to the prompts, we expand TC-Bench’s applicability from text-conditional models to image-conditional ones that can perform generative frame interpolation. We also develop new metrics to measure the completeness of component transitions in generated videos, which demonstrate significantly higher correlations with human judgments than existing metrics. Our comprehensive experimental results reveal that most video generators achieve less than 20% of the compositional changes, highlighting enormous space for future improvement. Our analysis indicates that current video generation models struggle to interpret descriptions of compositional changes and synthesize various components across different time steps.
摘要:除了图像生成之外,视频生成还有许多独特的挑战。时间维度在帧之间引入了广泛的可能变化,其间一致性和连续性可能被破坏。在这项研究中,我们超越了对简单动作的评估,认为生成的视频应像现实世界视频一样,随时间推移包含新概念的出现及其关系的转变。为了评估视频生成模型的时间组合性,我们提出了TC-Bench,这是一个由精心制作的文本提示、对应的真实视频和稳健的评估指标组成的基准。提示清楚地表达了场景的初始和最终状态,有效减少了帧生成的模糊性,并简化了对过渡完成度的评估。此外,通过收集与提示对应的真实世界视频,我们将TC-Bench的适用范围从文本条件模型扩展到可以执行生成式帧插值的图像条件模型。我们还开发了新的指标来衡量生成视频中组件过渡的完成度,其与人类判断的相关性显著高于现有指标。综合实验结果表明,大多数视频生成器实现的组合变化不到20%,这突出了未来巨大的改进空间。我们的分析表明,当前的视频生成模型难以解释对组合变化的描述,也难以在不同时间步上合成各种组件。

[NLP-74] ML-SUPERB 2.0: Benchmarking Multilingual Speech Models Across Modeling Constraints Languages and Datasets
[NLP-74] ML-SUPERB 2.0:跨建模约束、语言和数据集的多语言语音模型基准测试

链接: https://arxiv.org/abs/2406.08641
作者: Jiatong Shi,Shih-Heng Wang,William Chen,Martijn Bartelds,Vanya Bannihatti Kumar,Jinchuan Tian,Xuankai Chang,Dan Jurafsky,Karen Livescu,Hung-yi Lee,Shinji Watanabe
关键词: evaluates self-supervised learning, automatic speech recognition, ML-SUPERB evaluates self-supervised, self-supervised learning, evaluates self-supervised
中文关键词: 评估自监督学习、自动语音识别、ML-SUPERB评估自监督、自监督学习、评估自监督
类目: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: Accepted by Interspeech 2024

点击查看摘要

Abstract:ML-SUPERB evaluates self-supervised learning (SSL) models on the tasks of language identification and automatic speech recognition (ASR). This benchmark treats the models as feature extractors and uses a single shallow downstream model, which can be fine-tuned for a downstream task. However, real-world use cases may require different configurations. This paper presents ML-SUPERB~2.0, which is a new benchmark for evaluating pre-trained SSL and supervised speech models across downstream models, fine-tuning setups, and efficient model adaptation approaches. We find performance improvements over the setup of ML-SUPERB. However, performance depends on the downstream model design. Also, we find large performance differences between languages and datasets, suggesting the need for more targeted approaches to improve multilingual ASR performance.
摘要:ML-SUPERB在语种识别和自动语音识别(ASR)任务上评估自监督学习(SSL)模型。该基准将模型视为特征提取器,并使用单个可针对下游任务微调的浅层下游模型。然而,现实世界的用例可能需要不同的配置。本文介绍了ML-SUPERB 2.0,这是一个新基准,用于在不同的下游模型、微调设置和高效模型自适应方法下评估预训练SSL和有监督语音模型。我们发现相对于ML-SUPERB的原有设置有性能提升。然而,性能取决于下游模型的设计。此外,我们发现不同语言和数据集之间存在很大的性能差异,这表明需要采取更有针对性的方法来提高多语言ASR性能。

[NLP-75] Unraveling Code-Mixing Patterns in Migration Discourse: Automated Detection and Analysis of Online Conversations on Reddit
[NLP-75] 揭开移民话语中的代码混合模式:Reddit上在线对话的自动检测和分析

链接: https://arxiv.org/abs/2406.08633
作者: Fedor Vitiugin,Sunok Lee,Henna Paakki,Anastasiia Chizhikova,Nitin Sawhney
关键词: trustworthy public services, integrating migrants seamlessly, global migration patterns, migration patterns underscores, surge in global
中文关键词: 值得信赖的公共服务、无缝融合移民、全球移民模式、移民模式强调、全球移民激增
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
备注: 10 pages, 3 figures, Workshop Proceedings of the 18th International AAAI Conference on Web and Social Media

点击查看摘要

Abstract:The surge in global migration patterns underscores the imperative of integrating migrants seamlessly into host communities, necessitating inclusive and trustworthy public services. Despite the Nordic countries’ robust public sector infrastructure, recent immigrants often encounter barriers to accessing these services, exacerbating social disparities and eroding trust. Addressing digital inequalities and linguistic diversity is paramount in this endeavor. This paper explores the utilization of code-mixing, a communication strategy prevalent among multilingual speakers, in migration-related discourse on social media platforms such as Reddit. We present Ensemble Learning for Multilingual Identification of Code-mixed Texts (ELMICT), a novel approach designed to automatically detect code-mixed messages in migration-related discussions. Leveraging ensemble learning techniques for combining multiple tokenizers’ outputs and pre-trained language models, ELMICT demonstrates high performance (with F1 more than 0.95) in identifying code-mixing across various languages and contexts, particularly in cross-lingual zero-shot conditions (with avg. F1 more than 0.70). Moreover, the utilization of ELMICT helps to analyze the prevalence of code-mixing in migration-related threads compared to other thematic categories on Reddit, shedding light on the topics of concern to migrant communities. Our findings reveal insights into the communicative strategies employed by migrants on social media platforms, offering implications for the development of inclusive digital public services and conversational systems. By addressing the research questions posed in this study, we contribute to the understanding of linguistic diversity in migration discourse and pave the way for more effective tools for building trust in multicultural societies.
摘要:全球移民模式的激增突显了让移民无缝融入收容社区的必要性,这需要包容且值得信赖的公共服务。尽管北欧国家拥有强大的公共部门基础设施,但新移民在获得这些服务方面经常遇到障碍,加剧了社会差距并侵蚀了信任。在这项努力中,解决数字不平等和语言多样性问题至关重要。本文探讨了在Reddit等社交媒体平台上与移民相关的话语中,多语使用者普遍采用的交际策略(代码混合)的运用。我们提出了用于代码混合文本多语言识别的集成学习方法(ELMICT),这是一种旨在自动检测移民相关讨论中代码混合消息的新方法。通过集成学习技术组合多个分词器的输出和预训练语言模型,ELMICT在跨语言、跨语境识别代码混合方面表现出高性能(F1超过0.95),尤其是在跨语言零样本条件下(平均F1超过0.70)。此外,借助ELMICT可以分析与Reddit上其他主题类别相比,移民相关主题帖中代码混合的普遍程度,从而揭示移民群体关注的话题。我们的发现揭示了移民在社交媒体平台上使用的沟通策略,为包容性数字公共服务和对话系统的发展提供了启示。通过回答本研究提出的研究问题,我们有助于理解移民话语中的语言多样性,并为在多元文化社会中建立信任的更有效工具铺平道路。
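作为补充说明,上文的集成学习思路可以用一个极简的多数表决草图来示意。注意这只是示意性假设:ELMICT实际组合的是多个分词器的输出与预训练语言模型的表示,下面的标签名也是假设的示例,具体实现请以原文为准。

```python
from collections import Counter

def majority_vote(predictions):
    """对多个基分类器就同一条消息给出的标签做多数表决(示意性实现)。"""
    counts = Counter(predictions)
    # most_common(1) 返回出现次数最多的 (标签, 次数) 对
    return counts.most_common(1)[0][0]

# 假设三个基分类器对一条 Reddit 消息的判断(标签为假设示例)
votes = ["code-mixed", "code-mixed", "monolingual"]
print(majority_vote(votes))  # code-mixed
```

实际的集成方法通常还会对各基模型的置信度加权,而不是简单计票。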

[NLP-76] Time-MMD: A New Multi-Domain Multimodal Dataset for Time Series Analysis
[NLP-76] Time-MMD:用于时间序列分析的新型多领域多模态数据集

链接: https://arxiv.org/abs/2406.08627
作者: Haoxin Liu,Shangqing Xu,Zhiyuan Zhao,Lingkai Kong,Harshavardhan Kamarthi,Aditya B. Sasanur,Megha Sharma,Jiaming Cui,Qingsong Wen,Chao Zhang,B. Aditya Prakash
关键词: Time series, real-world time series, wide range, series, time series analysis
中文关键词: 时间序列,现实世界时间序列,广泛范围,系列,时间序列分析
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Time series data are ubiquitous across a wide range of real-world domains. While real-world time series analysis (TSA) requires human experts to integrate numerical series data with multimodal domain-specific knowledge, most existing TSA models rely solely on numerical data, overlooking the significance of information beyond numerical series. This oversight is due to the untapped potential of textual series data and the absence of a comprehensive, high-quality multimodal dataset. To overcome this obstacle, we introduce Time-MMD, the first multi-domain, multimodal time series dataset covering 9 primary data domains. Time-MMD ensures fine-grained modality alignment, eliminates data contamination, and provides high usability. Additionally, we develop MM-TSFlib, the first multimodal time-series forecasting (TSF) library, seamlessly pipelining multimodal TSF evaluations based on Time-MMD for in-depth analyses. Extensive experiments conducted on Time-MMD through MM-TSFlib demonstrate significant performance enhancements by extending unimodal TSF to multimodality, evidenced by over 15% mean squared error reduction in general, and up to 40% in domains with rich textual data. More importantly, our datasets and library revolutionize broader applications, impacts, research topics to advance TSA. The dataset and library are available at this https URL and this https URL.
摘要:时间序列数据广泛存在于现实世界的各个领域。现实世界的时间序列分析(TSA)需要人类专家将数值序列数据与多模态领域知识相结合,而现有的大多数TSA模型仅依赖数值数据,忽略了数值序列以外信息的重要性。这一疏忽源于文本序列数据的潜力尚未开发,以及缺乏全面、高质量的多模态数据集。为克服这一障碍,我们引入了Time-MMD,这是第一个覆盖9个主要数据领域的多领域、多模态时间序列数据集。Time-MMD确保了细粒度的模态对齐,消除了数据污染,并提供了高可用性。此外,我们开发了第一个多模态时间序列预测(TSF)库MM-TSFlib,可基于Time-MMD无缝地流水线化多模态TSF评估以进行深入分析。通过MM-TSFlib在Time-MMD上进行的大量实验表明,将单模态TSF扩展到多模态可显著提升性能:均方误差总体上降低15%以上,在文本数据丰富的领域中最高可降低40%。更重要的是,我们的数据集和库将带动更广泛的应用、影响和研究主题,以推进TSA。数据集和库可在此https URL和此https URL获得。

[NLP-77] Self-Supervised Speech Representations are More Phonetic than Semantic
[NLP-77] 自监督语音表示更偏语音而非语义

链接: https://arxiv.org/abs/2406.08619
作者: Kwanghee Choi,Ankita Pasad,Tomohiko Nakamura,Satoru Fukayama,Karen Livescu,Shinji Watanabe
关键词: Self-supervised speech models, Self-supervised speech, effective backbone, Self-supervised, speech applications
中文关键词: 自我监督语音模型、自我监督语音、有效主干、自我监督、语音应用
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
备注: Accepted to Interspeech 2024. Source code at this https URL

点击查看摘要

Abstract:Self-supervised speech models (S3Ms) have become an effective backbone for speech applications. Various analyses suggest that S3Ms encode linguistic properties. In this work, we seek a more fine-grained analysis of the word-level linguistic properties encoded in S3Ms. Specifically, we curate a novel dataset of near homophone (phonetically similar) and synonym (semantically similar) word pairs and measure the similarities between S3M word representation pairs. Our study reveals that S3M representations consistently and significantly exhibit more phonetic than semantic similarity. Further, we question whether widely used intent classification datasets such as Fluent Speech Commands and Snips Smartlights are adequate for measuring semantic abilities. Our simple baseline, using only the word identity, surpasses S3M-based models. This corroborates our findings and suggests that high scores on these datasets do not necessarily guarantee the presence of semantic content.
摘要:自监督语音模型(S3M)已成为语音应用的有效支柱。各种分析表明S3M编码了语言属性。在这项工作中,我们对S3M中编码的词级语言属性进行更细粒度的分析。具体来说,我们构建了一个由近同音(语音相似)和同义(语义相似)词对组成的新数据集,并测量S3M词表示对之间的相似性。我们的研究表明,S3M表示一致且显著地表现出更强的语音相似性而非语义相似性。此外,我们质疑Fluent Speech Commands和Snips Smartlights等广泛使用的意图分类数据集是否足以衡量语义能力。我们仅使用词身份的简单基线就超越了基于S3M的模型。这证实了我们的发现,并表明在这些数据集上的高分并不一定保证语义内容的存在。

[NLP-78] Reversing the Forget-Retain Objectives: An Efficient LLM Unlearning Framework from Logit Difference
[NLP-78] 逆转遗忘-保留目标:基于Logit差异的高效LLM遗忘学习框架

链接: https://arxiv.org/abs/2406.08607
作者: Jiabao Ji,Yujian Liu,Yang Zhang,Gaowen Liu,Ramana Rao Kompella,Sijia Liu,Shiyu Chang
关键词: Large Language Models, Large Language, increasingly important research, important research area, LLM unlearning
中文关键词: 大型语言模型,大型语言,日益重要的研究,重要的研究领域,法学硕士取消学习
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 21 pages, 11 figures

点击查看摘要

Abstract:As Large Language Models (LLMs) demonstrate extensive capability in learning from documents, LLM unlearning becomes an increasingly important research area to address concerns of LLMs in terms of privacy, copyright, etc. A conventional LLM unlearning task typically involves two goals: (1) The target LLM should forget the knowledge in the specified forget documents, and (2) it should retain the other knowledge that the LLM possesses, for which we assume access to a small number of retain documents. To achieve both goals, a mainstream class of LLM unlearning methods introduces an optimization framework with a combination of two objectives - maximizing the prediction loss on the forget documents while minimizing that on the retain documents, which suffers from two challenges, degenerated output and catastrophic forgetting. In this paper, we propose a novel unlearning framework called Unlearning from Logit Difference (ULD), which introduces an assistant LLM that aims to achieve the opposite of the unlearning goals: remembering the forget documents and forgetting the retain knowledge. ULD then derives the unlearned LLM by computing the logit difference between the target and the assistant LLMs. We show that such reversed objectives would naturally resolve both aforementioned challenges while significantly improving the training efficiency. Extensive experiments demonstrate that our method efficiently achieves the intended forgetting while preserving the LLM’s overall capabilities, reducing training time by more than threefold. Notably, our method loses 0% of model utility on the ToFU benchmark, whereas baseline methods may sacrifice 17% of utility on average to achieve comparable forget quality. Our code will be publicly available at this https URL.
摘要:随着大型语言模型(LLM)展现出强大的从文档中学习的能力,LLM遗忘学习日益成为解决LLM隐私、版权等问题的重要研究领域。传统的LLM遗忘学习任务通常涉及两个目标:(1)目标LLM应忘记指定遗忘文档中的知识;(2)它应保留LLM拥有的其他知识,为此我们假设可以访问少量保留文档。为同时实现这两个目标,一类主流的LLM遗忘学习方法引入了结合两个目标的优化框架:最大化遗忘文档上的预测损失,同时最小化保留文档上的预测损失,但这类方法面临输出退化和灾难性遗忘两个挑战。本文提出了一种称为基于Logit差异的遗忘学习(ULD)的新框架,它引入一个辅助LLM,其目标与遗忘学习目标相反:记住遗忘文档并忘记保留知识。随后,ULD通过计算目标LLM与辅助LLM之间的Logit差异得到遗忘后的LLM。我们表明,这种反转的目标自然地化解了上述两个挑战,同时显著提高了训练效率。大量实验表明,我们的方法在保持LLM整体能力的同时有效实现了预期的遗忘,并将训练时间减少三倍以上。值得注意的是,我们的方法在ToFU基准上的模型效用损失为0%,而基线方法平均可能牺牲17%的效用才能达到相近的遗忘质量。我们的代码将在此https URL上公开提供。
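按摘要的描述,ULD的核心是用目标模型与辅助模型的logit之差来构造遗忘后的输出分布。下面是一个示意性草图,其中缩放系数 alpha 与玩具数值均为假设,论文中的具体形式请以原文为准:

```python
import numpy as np

def uld_logits(target_logits, assistant_logits, alpha=1.0):
    """用目标 LLM 与辅助 LLM 的 logit 差近似"遗忘后"的 logit(示意)。"""
    return np.asarray(target_logits, dtype=float) - alpha * np.asarray(assistant_logits, dtype=float)

def softmax(x):
    # 数值稳定的 softmax
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()

# 辅助模型"记住"了要遗忘的内容,因此在对应 token(索引 0)上 logit 偏高
target = [2.0, 1.0, 0.0]
assistant = [3.0, 0.0, 0.0]
probs = softmax(uld_logits(target, assistant))
print(int(np.argmax(probs)))  # 1:被遗忘的 token 不再是最高概率
```

直观上,辅助模型在"应遗忘"的内容上分配的概率越高,相减之后目标模型在这些内容上的概率就被压得越低。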

[NLP-79] End-to-End Argument Mining as Augmented Natural Language Generation
[NLP-79] 端到端论辩挖掘作为增强自然语言生成

链接: https://arxiv.org/abs/2406.08606
作者: Nilmadhab Das,Vishal Choudhary,V. Vijaya Saradhi,Ashish Anand
关键词: Argument Mining, Argumentative Components, Argumentative Relations, computational argumentation, crucial aspect
中文关键词: 论点挖掘、论点成分、论点关系、计算论证、关键方面
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Argument Mining (AM) is a crucial aspect of computational argumentation, which deals with the identification and extraction of Argumentative Components (ACs) and their corresponding Argumentative Relations (ARs). Most prior works have solved these problems by dividing them into multiple subtasks. And the available end-to-end setups are mostly based on the dependency parsing approach. This work proposes a unified end-to-end framework based on a generative paradigm, in which the argumentative structures are framed into label-augmented text, called Augmented Natural Language (ANL). Additionally, we explore the role of different types of markers in solving AM tasks. Through different marker-based fine-tuning strategies, we present an extensive study by integrating marker knowledge into our generative model. The proposed framework achieves competitive results to the state-of-the-art (SoTA) model and outperforms several baselines.
摘要:论辩挖掘(AM)是计算论证的一个重要方面,涉及论辩成分(AC)及其相应论辩关系(AR)的识别和提取。大多数以往的工作通过将这些问题划分为多个子任务来求解。现有的端到端方案大多基于依存句法分析方法。本工作提出了一个基于生成范式的统一端到端框架,其中论辩结构被组织为标签增强文本,称为增强自然语言(ANL)。此外,我们还探讨了不同类型的标记在解决AM任务中的作用。通过不同的基于标记的微调策略,我们将标记知识集成到生成模型中,开展了广泛的研究。所提框架取得了可与最先进(SoTA)模型竞争的结果,并优于多个基线。

[NLP-80] Language Model Council: Benchmarking Foundation Models on Highly Subjective Tasks by Consensus
[NLP-80] 语言模型委员会:通过共识对高度主观性任务的基础模型进行基准测试

链接: https://arxiv.org/abs/2406.08598
作者: Justin Zhao,Flor Miriam Plaza-del-Arco,Amanda Cercas Curry
关键词: Large Language Models, advancement of Large, Large Language, Chatbot Arena rank, Language Model Council
中文关键词: 大型语言模型、大型、大型语言的进步、Chatbot Arena排名、语言模型委员会
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The rapid advancement of Large Language Models (LLMs) necessitates robust and challenging benchmarks. Leaderboards like Chatbot Arena rank LLMs based on how well their responses align with human preferences. However, many tasks such as those related to emotional intelligence, creative writing, or persuasiveness, are highly subjective and often lack majoritarian human agreement. Judges may have irreconcilable disagreements about what constitutes a better response. To address the challenge of ranking LLMs on highly subjective tasks, we propose a novel benchmarking framework, the Language Model Council (LMC). The LMC operates through a democratic process to: 1) formulate a test set through equal participation, 2) administer the test among council members, and 3) evaluate responses as a collective jury. We deploy a council of 20 newest LLMs on an open-ended emotional intelligence task: responding to interpersonal dilemmas. Our results show that the LMC produces rankings that are more separable, robust, and less biased than those from any individual LLM judge, and is more consistent with a human-established leaderboard compared to other benchmarks.
摘要:大型语言模型(LLM)的快速发展需要稳健且具有挑战性的基准。Chatbot Arena等排行榜根据LLM的回答与人类偏好的契合程度对其进行排名。然而,许多任务,如与情商、创造性写作或说服力相关的任务,都是高度主观的,往往缺乏多数人的共识。对于什么是更好的回答,评判者之间可能存在不可调和的分歧。为应对在高度主观任务上对LLM排名的挑战,我们提出了一个新的基准测试框架:语言模型委员会(LMC)。LMC通过民主程序运作:1)通过平等参与制定测试集;2)在委员会成员中施测;3)作为集体评审团评估回答。我们让由20个最新LLM组成的委员会执行一项开放式情商任务:应对人际困境。结果表明,LMC产生的排名比任何单个LLM评委的排名更可分、更稳健、偏差更小,并且与其他基准相比,更符合人类建立的排行榜。

[NLP-81] CS-Bench: A Comprehensive Benchmark for Large Language Models towards Computer Science Mastery
[NLP-81] CS-Bench:大型语言模型迈向计算机科学掌握的综合基准

链接: https://arxiv.org/abs/2406.08587
作者: Xiaoshuai Song,Muxi Diao,Guanting Dong,Zhengyang Wang,Yujia Fu,Runqi Qiao,Zhexu Wang,Dayuan Fu,Huangxuan Wu,Bin Liang,Weihao Zeng,Yejie Wang,Zhuoma GongQue,Jianing Yu,Qiuna Tan,Weiran Xu
关键词: Computer Science, human intelligence, artificial intelligence, profoundly advancing, modern society
中文关键词: 计算机科学、人类智能、人工智能、深刻推进、现代社会
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Work in progress

点击查看摘要

Abstract:Computer Science (CS) stands as a testament to the intricacies of human intelligence, profoundly advancing the development of artificial intelligence and modern society. However, the current community of large language models (LLMs) overly focuses on benchmarks for analyzing specific foundational skills (e.g. mathematics and code generation), neglecting an all-round evaluation of the computer science field. To bridge this gap, we introduce CS-Bench, the first bilingual (Chinese-English) benchmark dedicated to evaluating the performance of LLMs in computer science. CS-Bench comprises approximately 5K meticulously curated test samples, covering 26 subfields across 4 key areas of computer science, encompassing various task forms and divisions of knowledge and reasoning. Utilizing CS-Bench, we conduct a comprehensive evaluation of over 30 mainstream LLMs, revealing the relationship between CS performance and model scales. We also quantitatively analyze the reasons for failures in existing LLMs and highlight directions for improvements, including knowledge supplementation and CS-specific reasoning. Further cross-capability experiments show a high correlation between LLMs’ capabilities in computer science and their abilities in mathematics and coding. Moreover, expert LLMs specialized in mathematics and coding also demonstrate strong performances in several CS subfields. Looking ahead, we envision CS-Bench serving as a cornerstone for LLM applications in the CS field and paving new avenues in assessing LLMs’ diverse reasoning capabilities. The CS-Bench data and evaluation code are available at this https URL.
摘要:计算机科学(CS)见证了人类智能的精妙,深刻推动了人工智能和现代社会的发展。然而,当前的大型语言模型(LLM)社区过于关注分析特定基础技能(如数学和代码生成)的基准,而忽视了对计算机科学领域的全面评估。为弥补这一差距,我们引入了CS-Bench,这是第一个致力于评估LLM计算机科学能力的双语(中英)基准。CS-Bench由约5K个精心挑选的测试样本组成,涵盖计算机科学4个关键领域的26个子领域,包含多种任务形式以及知识与推理的划分。利用CS-Bench,我们对30多个主流LLM进行了综合评估,揭示了CS性能与模型规模之间的关系。我们还定量分析了现有LLM失败的原因,并指出了改进方向,包括知识补充和面向CS的推理。进一步的跨能力实验表明,LLM的计算机科学能力与其数学和编码能力之间高度相关。此外,专精于数学和编码的专家LLM在若干CS子领域也表现出色。展望未来,我们期望CS-Bench成为LLM在CS领域应用的基石,并为评估LLM多样化推理能力开辟新途径。CS-Bench的数据和评估代码可在此https URL获得。

[NLP-82] Exploring Fact Memorization and Style Imitation in LLMs Using QLoRA: An Experimental Study and Quality Assessment Methods
[NLP-82] 使用QLoRA探索LLM中的事实记忆和风格模仿:实验研究和质量评估方法

链接: https://arxiv.org/abs/2406.08582
作者: Eugene Vyborov,Oleksiy Osypenko,Serge Sotnyk
关键词: adapting LLMs, PEFT methods, methods, Abstract, common methods
中文关键词: 适应LLM、PEFT方法、方法、摘要、常用方法
类目: Computation and Language (cs.CL)
备注: 16 pages, 5 tables

点击查看摘要

Abstract:There are various methods for adapting LLMs to different domains. The most common methods are prompting, finetuning, and RAG. In this work, we explore the possibility of adapting a model using one of the PEFT methods - QLoRA. The experiment aims to simulate human responses based on their interviews. The simulation quality is assessed by comparing the quality of the style and the quality of the generated facts.
摘要:有多种方法可以将LLM适应不同的领域。最常见的方法是提示、微调和RAG。在这项工作中,我们探索了使用PEFT方法之一- QLoRA调整模型的可能性。该实验旨在根据人类的采访模拟人类的反应。通过比较风格的质量和生成事实的质量来评估模拟质量。

[NLP-83] Automated Question Generation for Science Tests in Arabic Language Using NLP Techniques
[NLP-83] 使用NLP技术自动生成阿拉伯语科学测试问题

链接: https://arxiv.org/abs/2406.08520
作者: Mohammad Tami,Huthaifa I. Ashqar,Mohammed Elhenawy
关键词: artificial intelligence applied, growing field, field within artificial, artificial intelligence, intelligence applied
中文关键词: 人工智能应用,成长领域,人工领域,人工智能,智能应用
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Question generation for education assessments is a growing field within artificial intelligence applied to education. These question-generation tools have significant importance in the educational technology domain, such as intelligent tutoring systems and dialogue-based platforms. The automatic generation of assessment questions, which entail clear-cut answers, usually relies on syntactical and semantic indications within declarative sentences, which are then transformed into questions. Recent research has explored the generation of assessment educational questions in Arabic. The reported performance has been adversely affected by inherent errors, including sentence parsing inaccuracies, named entity recognition issues, and errors stemming from rule-based question transformation. Furthermore, the complexity of lengthy Arabic sentences has contributed to these challenges. This research presents an innovative Arabic question-generation system built upon a three-stage process: keywords and key phrases extraction, question generation, and subsequent ranking. The aim is to tackle the difficulties associated with automatically generating assessment questions in the Arabic language. The proposed approach and results show a precision of 83.50%, a recall of 78.68%, and an F1 score of 80.95%, indicating the framework's high efficiency. Human evaluation further confirmed the model's efficiency, receiving an average rating of 84%.
摘要:在人工智能应用于教育的领域中,面向教育评估的问题生成是一个不断发展的方向。这些问题生成工具在教育技术领域具有重要意义,例如智能辅导系统和基于对话的平台。需要明确答案的评估问题的自动生成,通常依赖陈述句中的句法和语义线索,然后将其转换为问题。最近的研究探索了阿拉伯语评估类教育问题的生成。已报告的性能受到固有错误的不利影响,包括句子解析不准确、命名实体识别问题以及基于规则的问题转换产生的错误。此外,冗长的阿拉伯语句子的复杂性也加剧了这些挑战。本研究提出了一个创新的阿拉伯语问题生成系统,该系统建立在三阶段流程之上:关键词与关键短语提取、问题生成以及后续排序。其目的是解决阿拉伯语评估问题自动生成所面临的困难。所提方法的精确率为83.50%,召回率为78.68%,F1分数为80.95%,表明该框架具有较高的效率。人工评估进一步证实了模型的有效性,平均评分为84%。
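顺带说明,上述F1分数即精确率与召回率的调和平均。用摘要给出的总体精确率和召回率直接计算可得约81.0%,与报告的80.95%接近(细微差异可能来自按类别取平均等统计方式,此处仅为示意):

```python
def f1_score(precision, recall):
    """精确率与召回率的调和平均(F1)。"""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(round(f1_score(0.8350, 0.7868), 4))  # 0.8102
```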

[NLP-84] Question-Answering (QA) Model for a Personalized Learning Assistant for Arabic Language
[NLP-84] 面向阿拉伯语个性化学习助手的问答(QA)模型

链接: https://arxiv.org/abs/2406.08519
作者: Mohammad Sammoudi,Ahmad Habaybeh,Huthaifa I. Ashqar,Mohammed Elhenawy
关键词: BERT transformers customized, personalized learning assistant, describes the creation, BERT transformers, paper describes
中文关键词: BERT变形器定制、个性化学习助手、描述创建、BERT变形器、论文描述
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:This paper describes the creation, optimization, and assessment of a question-answering (QA) model for a personalized learning assistant that uses BERT transformers customized for the Arabic language. The model was particularly finetuned on science textbooks in Palestinian curriculum. Our approach uses BERT’s brilliant capabilities to automatically produce correct answers to questions in the field of science education. The model’s ability to understand and extract pertinent information is improved by finetuning it using 11th and 12th grade biology book in Palestinian curriculum. This increases the model’s efficacy in producing enlightening responses. Exact match (EM) and F1 score metrics are used to assess the model’s performance; the results show an EM score of 20% and an F1 score of 51%. These findings show that the model can comprehend and react to questions in the context of Palestinian science book. The results demonstrate the potential of BERT-based QA models to support learning and understanding Arabic students questions.
摘要:本文描述了一个个性化学习助手的问答(QA)模型的创建、优化和评估,该模型使用了为阿拉伯语定制的BERT转换器。该模型特别在巴勒斯坦课程的科学教科书上进行了微调。我们的方法利用BERT的出色能力,自动为科学教育领域的问题生成正确答案。通过使用巴勒斯坦课程中11年级和12年级的生物教科书对模型进行微调,提高了模型理解和提取相关信息的能力。这增强了模型生成有启发性回答的效果。使用精确匹配(EM)和F1分数指标来评估模型的性能;结果显示EM分数为20%,F1分数为51%。这些发现表明,该模型能够在巴勒斯坦科学教科书的语境中理解并回答问题。结果展示了基于BERT的问答模型在支持阿拉伯语学生学习与理解方面的潜力。
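作为背景补充,问答评测中常用的EM与F1指标(SQuAD风格)可按如下方式计算。这是一个通用示意,原文针对阿拉伯语的具体归一化细节(如变音符号处理)请以论文为准:

```python
import re
from collections import Counter

def normalize(text):
    """小写并压缩空白(示意性归一化)。"""
    return re.sub(r"\s+", " ", text.lower().strip())

def exact_match(pred, gold):
    """归一化后完全一致记 1,否则记 0。"""
    return int(normalize(pred) == normalize(gold))

def token_f1(pred, gold):
    """预测与参考答案在词级重叠上的 F1。"""
    p, g = normalize(pred).split(), normalize(gold).split()
    common = Counter(p) & Counter(g)  # 多重集合交集:逐词取最小出现次数
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The cell", "the cell"))  # 1
print(round(token_f1("the cell membrane", "cell membrane"), 2))  # 0.8
```

在整个测试集上,通常对每个问题取与各参考答案的最大得分,再求平均。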

[NLP-85] Exploring Spoken Language Identification Strategies for Automatic Transcription of Multilingual Broadcast and Institutional Speech
[NLP-85] 探索用于多语言广播和机构语音自动转录的语种识别策略

链接: https://arxiv.org/abs/2406.09290
作者: Martina Valente,Fabio Brugnara,Giovanni Morrone,Enrico Zovato,Leonardo Badino
关键词: real application scenarios, paper addresses spoken, addresses spoken language, spoken language identification, SLI literature
中文关键词: 真实应用场景、纸质地址、口语地址、口语识别、SLI文献
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
备注: Accepted to Interspeech 2024

点击查看摘要

Abstract:This paper addresses spoken language identification (SLI) and speech recognition of multilingual broadcast and institutional speech, real application scenarios that have been rarely addressed in the SLI literature. Observing that in these domains language changes are mostly associated with speaker changes, we propose a cascaded system consisting of speaker diarization and language identification and compare it with more traditional language identification and language diarization systems. Results show that the proposed system often achieves lower language classification and language diarization error rates (up to 10% relative language diarization error reduction and 60% relative language confusion reduction) and leads to lower WERs on multilingual test sets (more than 8% relative WER reduction), while at the same time does not negatively affect speech recognition on monolingual audio (with an absolute WER increase between 0.1% and 0.7% w.r.t. monolingual ASR).
摘要:本文研究多语言广播和机构语音的语种识别(SLI)与语音识别,这些真实应用场景在SLI文献中很少被讨论。观察到在这些领域中语言切换大多与说话人切换相关,我们提出了一个由说话人日志(speaker diarization)和语种识别组成的级联系统,并将其与更传统的语种识别和语言日志系统进行比较。结果表明,所提系统通常能取得更低的语言分类和语言日志错误率(相对语言日志错误最多降低10%,相对语言混淆最多降低60%),并在多语言测试集上带来更低的词错误率(相对WER降低超过8%),同时不会对单语音频的语音识别产生负面影响(相对于单语ASR,绝对WER增加在0.1%到0.7%之间)。

[NLP-86] End-to-end Streaming model for Low-Latency Speech Anonymization
[NLP-86] 低延迟语音匿名化的端到端流式模型

链接: https://arxiv.org/abs/2406.09277
作者: Waris Quamer,Ricardo Gutierrez-Osuna
关键词: preserving linguistic content, Speaker anonymization aims, aims to conceal, conceal cues, preserving linguistic
中文关键词: 保留语言内容,说话者匿名化的目标,旨在隐藏,隐藏线索,保留语言
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Speaker anonymization aims to conceal cues to speaker identity while preserving linguistic content. Current machine learning based approaches require substantial computational resources, hindering real-time streaming applications. To address these concerns, we propose a streaming model that achieves speaker anonymization with low latency. The system is trained in an end-to-end autoencoder fashion using a lightweight content encoder that extracts HuBERT-like information, a pretrained speaker encoder that extract speaker identity, and a variance encoder that injects pitch and energy information. These three disentangled representations are fed to a decoder that resynthesizes the speech signal. We present evaluation results from two implementations of our system, a full model that achieves a latency of 230ms, and a lite version (0.1x in size) that further reduces latency to 66ms while maintaining state-of-the-art performance in naturalness, intelligibility, and privacy preservation.
摘要:说话人匿名化旨在隐藏说话人身份线索,同时保留语言内容。当前基于机器学习的方法需要大量计算资源,阻碍了实时流式应用。针对这些问题,我们提出了一种以低延迟实现说话人匿名化的流式模型。该系统以端到端自编码器的方式训练,使用提取类HuBERT信息的轻量级内容编码器、提取说话人身份的预训练说话人编码器,以及注入音高和能量信息的方差编码器。这三个解耦的表示被送入解码器以重新合成语音信号。我们给出了系统两个实现版本的评估结果:完整模型的延迟为230ms;精简版(大小为0.1倍)在自然度、可懂度和隐私保护方面保持最先进性能的同时,将延迟进一步降至66ms。

[NLP-87] DisfluencySpeech – Single-Speaker Conversational Speech Dataset with Paralanguage
[NLP-87] DisfluencySpeech --具有副语言的单说话者对话语音数据集

链接: https://arxiv.org/abs/2406.08820
作者: Kyra Wang,Dorien Herremans
关键词: direct lexical meaning, crucial propositional context, provide crucial propositional, contribute any direct, direct lexical
中文关键词: 直接的词汇意义,关键的命题上下文,提供关键的命题,贡献任何直接的词汇
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
备注: 4 pages, 1 figure, submitted to IEEE TENCON 2024

点击查看摘要

Abstract:Laughing, sighing, stuttering, and other forms of paralanguage do not contribute any direct lexical meaning to speech, but they provide crucial propositional context that aids semantic and pragmatic processes such as irony. It is thus important for artificial social agents to both understand and be able to generate speech with semantically-important paralanguage. Most speech datasets do not include transcribed non-lexical speech sounds and disfluencies, while those that do are typically multi-speaker datasets where each speaker provides relatively little audio. This makes it challenging to train conversational Text-to-Speech (TTS) synthesis models that include such paralinguistic components. We thus present DisfluencySpeech, a studio-quality labeled English speech dataset with paralanguage. A single speaker recreates nearly 10 hours of expressive utterances from the Switchboard-1 Telephone Speech Corpus (Switchboard), simulating realistic informal conversations. To aid the development of a TTS model that is able to predictively synthesise paralanguage from text without such components, we provide three different transcripts at different levels of information removal (removal of non-speech events, removal of non-sentence elements, and removal of false starts), as well as benchmark TTS models trained on each of these levels. Comments: 4 pages, 1 figure, submitted to IEEE TENCON 2024 Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL) Cite as: arXiv:2406.08820 [eess.AS] (or arXiv:2406.08820v1 [eess.AS] for this version)
摘要:大笑、叹息、结巴及其他形式的副语言并不为言语贡献任何直接的词汇意义,但它们提供了关键的命题语境,有助于反讽等语义和语用过程。因此,人工社交代理既要理解、也要能够生成带有语义上重要的副语言的语音。大多数语音数据集不包含转录的非词汇语音声音和不流利现象,而包含这些内容的数据集通常是多说话人数据集,其中每个说话人提供的音频相对较少。这使得训练包含此类副语言成分的对话式文本到语音(TTS)合成模型颇具挑战。因此,我们提出了DisfluencySpeech,一个录音室质量、带副语言标注的英语语音数据集。单个说话人重现了Switchboard-1电话语音语料库(Switchboard)中近10小时富有表现力的话语,模拟真实的非正式对话。为帮助开发能够从不含这些成分的文本中预测性地合成副语言的TTS模型,我们提供了三种不同信息移除级别的转写文本(移除非语音事件、移除非句子成分、移除错误起始),以及在每个级别上训练的基准TTS模型。评论:4页,1图,已提交至IEEE TENCON 2024。主题:音频和语音处理(eess.AS);计算和语言(cs.CL)。引用为:arXiv:2406.08820 [eess.AS](或本版本arXiv:2406.08820v1 [eess.AS])

计算机视觉

[CV-0] VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding

链接: https://arxiv.org/abs/2406.09418
作者: Muhammad Maaz,Hanoona Rasheed,Salman Khan,Fahad Khan
关键词: Large Multimodal Models, Large Language Models, Large Multimodal, contributed significant improvements, advanced Large Language
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Technical Report

点击查看摘要

Abstract:Building on the advances of language models, Large Multimodal Models (LMMs) have contributed significant improvements in video understanding. While the current video LMMs utilize advanced Large Language Models (LLMs), they rely on either image or video encoders to process visual inputs, each of which has its own limitations. Image encoders excel at capturing rich spatial details from frame sequences but lack explicit temporal context, which can be important in videos with intricate action sequences. On the other hand, video encoders provide temporal context but are often limited by computational constraints that lead to processing only sparse frames at lower resolutions, resulting in reduced contextual and spatial understanding. To this end, we introduce VideoGPT+, which combines the complementary benefits of the image encoder (for detailed spatial understanding) and the video encoder (for global temporal context modeling). The model processes videos by dividing them into smaller segments and applies an adaptive pooling strategy on features extracted by both image and video encoders. Our architecture showcases improved performance across multiple video benchmarks, including VCGBench, MVBench and Zero-shot question-answering. Further, we develop 112K video-instruction set using a novel semi-automatic annotation pipeline which further improves the model performance. Additionally, to comprehensively evaluate video LMMs, we present VCGBench-Diverse, covering 18 broad video categories such as lifestyle, sports, science, gaming, and surveillance videos. This benchmark with 4,354 question-answer pairs evaluates the generalization of existing LMMs on dense video captioning, spatial and temporal understanding, and complex reasoning, ensuring comprehensive assessment across diverse video types and dynamics. Code: this https URL.

[CV-1] Rethinking Score Distillation as a Bridge Between Image Distributions

链接: https://arxiv.org/abs/2406.09417
作者: David McAllister,Songwei Ge,Jia-Bin Huang,David W. Jacobs,Alexei A. Efros,Aleksander Holynski,Angjoo Kanazawa
关键词: large-scale diffusion priors, important tool, large-scale diffusion, diffusion priors, priors for tasks
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
*备注: Project webpage: this https URL

点击查看摘要

Abstract:Score distillation sampling (SDS) has proven to be an important tool, enabling the use of large-scale diffusion priors for tasks operating in data-poor domains. Unfortunately, SDS has a number of characteristic artifacts that limit its usefulness in general-purpose applications. In this paper, we make progress toward understanding the behavior of SDS and its variants by viewing them as solving an optimal-cost transport path from a source distribution to a target distribution. Under this new interpretation, these methods seek to transport corrupted images (source) to the natural image distribution (target). We argue that current methods’ characteristic artifacts are caused by (1) linear approximation of the optimal path and (2) poor estimates of the source distribution. We show that calibrating the text conditioning of the source distribution can produce high-quality generation and translation results with little extra overhead. Our method can be easily applied across many domains, matching or beating the performance of specialized methods. We demonstrate its utility in text-to-2D, text-based NeRF optimization, translating paintings to real images, optical illusion generation, and 3D sketch-to-real. We compare our method to existing approaches for score distillation sampling and show that it can produce high-frequency details with realistic colors.
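As background, the SDS gradient that this work reinterprets as an optimal-cost transport path is usually written as follows; note this is the standard formulation from the score-distillation literature, not a formula taken from this abstract:

```latex
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}(\theta)
  = \mathbb{E}_{t,\epsilon}\!\left[
      w(t)\,\big(\hat{\epsilon}_\phi(\mathbf{x}_t;\, y, t) - \epsilon\big)\,
      \frac{\partial \mathbf{x}}{\partial \theta}
    \right]
```

Here \(\mathbf{x} = g(\theta)\) is the rendered image, \(\mathbf{x}_t\) its noised version at timestep \(t\), \(\hat{\epsilon}_\phi\) the pretrained diffusion model's noise prediction conditioned on prompt \(y\), and \(w(t)\) a timestep-dependent weight. Under the paper's reading, each such update transports the corrupted source distribution toward the natural-image target distribution.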

[CV-2] Alleviating Distortion in Image Generation via Multi-Resolution Diffusion Models

Link: https://arxiv.org/abs/2406.09416
Authors: Qihao Liu,Zhanpeng Zeng,Ju He,Qihang Yu,Xiaohui Shen,Liang-Chieh Chen
Keywords: paper presents innovative, presents innovative enhancements, paper presents, presents innovative, innovative enhancements
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Introducing DiMR, a new diffusion backbone that surpasses all existing image generation models of various sizes on ImageNet 256 with only 505M parameters. Project page: this https URL

Click to view abstract

Abstract:This paper presents innovative enhancements to diffusion models by integrating a novel multi-resolution network and time-dependent layer normalization. Diffusion models have gained prominence for their effectiveness in high-fidelity image generation. While conventional approaches rely on convolutional U-Net architectures, recent Transformer-based designs have demonstrated superior performance and scalability. However, Transformer architectures, which tokenize input data (via “patchification”), face a trade-off between visual fidelity and computational complexity due to the quadratic nature of self-attention operations concerning token length. While larger patch sizes enable attention computation efficiency, they struggle to capture fine-grained visual details, leading to image distortions. To address this challenge, we propose augmenting the Diffusion model with the Multi-Resolution network (DiMR), a framework that refines features across multiple resolutions, progressively enhancing detail from low to high resolution. Additionally, we introduce Time-Dependent Layer Normalization (TD-LN), a parameter-efficient approach that incorporates time-dependent parameters into layer normalization to inject time information and achieve superior performance. Our method’s efficacy is demonstrated on the class-conditional ImageNet generation benchmark, where DiMR-XL variants outperform prior diffusion models, setting new state-of-the-art FID scores of 1.70 on ImageNet 256 x 256 and 2.89 on ImageNet 512 x 512. Project page: this https URL
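
Time-Dependent Layer Normalization can be illustrated with a minimal sketch. The linear-in-t form of the affine parameters below is an assumption chosen to match the abstract's "parameter-efficient" description; the paper's exact parameterization may differ.

```python
import numpy as np

def td_layer_norm(x, t, w_gamma, b_gamma, w_beta, b_beta, eps=1e-5):
    """LayerNorm whose affine parameters are functions of the timestep t.

    Assumed form (illustrative): gamma(t) = w_gamma * t + b_gamma and
    beta(t) = w_beta * t + b_beta, so time conditioning adds only 4*d
    parameters per layer rather than a full timestep-embedding MLP.
    """
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return x_hat * (w_gamma * t + b_gamma) + (w_beta * t + b_beta)

d = 4
x = np.random.default_rng(0).normal(size=(3, d))
out = td_layer_norm(x, t=0.5,
                    w_gamma=np.zeros(d), b_gamma=np.ones(d),
                    w_beta=np.zeros(d), b_beta=np.zeros(d))
# With gamma(t) = 1 and beta(t) = 0 this reduces to plain LayerNorm.
print(np.allclose(out.mean(axis=-1), 0.0, atol=1e-6))  # True
```

Nonzero `w_gamma`/`w_beta` let the normalization's scale and shift track the diffusion timestep, injecting time information without extra attention or MLP blocks.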

[CV-3] An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels

Link: https://arxiv.org/abs/2406.09415
Authors: Duy-Kien Nguyen,Mahmoud Assran,Unnat Jain,Martin R. Oswald,Cees G. M. Snoek,Xinlei Chen
Keywords: computer vision, inductive bias, vision, modern computer vision
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Technical report, 23 pages

Click to view abstract

Abstract:This work does not introduce a new method. Instead, we present an interesting finding that questions the necessity of the inductive bias – locality in modern computer vision architectures. Concretely, we find that vanilla Transformers can operate by directly treating each individual pixel as a token and achieve highly performant results. This is substantially different from the popular design in Vision Transformer, which maintains the inductive bias from ConvNets towards local neighborhoods (e.g. by treating each 16x16 patch as a token). We mainly showcase the effectiveness of pixels-as-tokens across three well-studied tasks in computer vision: supervised learning for object classification, self-supervised learning via masked autoencoding, and image generation with diffusion models. Although directly operating on individual pixels is less computationally practical, we believe the community must be aware of this surprising piece of knowledge when devising the next generation of neural architectures for computer vision.
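
The pixels-as-tokens idea, contrasted with standard ViT patchification, amounts to a reshape of the input; a sketch of the resulting sequence shapes (function names are illustrative):

```python
import numpy as np

def pixels_to_tokens(image):
    """Treat every pixel as its own token: (H, W, C) -> (H*W, C)."""
    h, w, c = image.shape
    return image.reshape(h * w, c)

def patches_to_tokens(image, p):
    """Standard ViT patchification for comparison: (H, W, C) -> (HW/p^2, p*p*C)."""
    h, w, c = image.shape
    x = image.reshape(h // p, p, w // p, p, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, p * p * c)

img = np.zeros((32, 32, 3))
print(pixels_to_tokens(img).shape)       # (1024, 3): a 256x longer sequence...
print(patches_to_tokens(img, 16).shape)  # (4, 768):  ...than 16x16 patch tokens
```

The 256x longer sequence (and quadratic attention cost on top of it) is exactly why the authors call the approach computationally impractical while still instructive about locality as an inductive bias.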

[CV-4] Depth Anything V2

Link: https://arxiv.org/abs/2406.09414
Authors: Lihe Yang,Bingyi Kang,Zilong Huang,Zhen Zhao,Xiaogang Xu,Jiashi Feng,Hengshuang Zhao
Keywords: work presents Depth, work presents, Depth, presents Depth, models
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Click to view abstract

Abstract:This work presents Depth Anything V2. Without pursuing fancy techniques, we aim to reveal crucial findings to pave the way towards building a powerful monocular depth estimation model. Notably, compared with V1, this version produces much finer and more robust depth predictions through three key practices: 1) replacing all labeled real images with synthetic images, 2) scaling up the capacity of our teacher model, and 3) teaching student models via the bridge of large-scale pseudo-labeled real images. Compared with the latest models built on Stable Diffusion, our models are significantly more efficient (more than 10x faster) and more accurate. We offer models of different scales (ranging from 25M to 1.3B params) to support extensive scenarios. Benefiting from their strong generalization capability, we fine-tune them with metric depth labels to obtain our metric depth models. In addition to our models, considering the limited diversity and frequent noise in current test sets, we construct a versatile evaluation benchmark with precise annotations and diverse scenes to facilitate future research.

[CV-5] Interpreting the Weight Space of Customized Diffusion Models

Link: https://arxiv.org/abs/2406.09413
Authors: Amil Dravid,Yossi Gandelsman,Kuan-Chieh Wang,Rameen Abdal,Gordon Wetzstein,Alexei A. Efros,Kfir Aberman
Keywords: large collection, collection of customized, space, customized diffusion models, identity
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
Comments: Project Page: this https URL

Click to view abstract

Abstract:We investigate the space of weights spanned by a large collection of customized diffusion models. We populate this space by creating a dataset of over 60,000 models, each of which is a base model fine-tuned to insert a different person’s visual identity. We model the underlying manifold of these weights as a subspace, which we term weights2weights. We demonstrate three immediate applications of this space – sampling, editing, and inversion. First, as each point in the space corresponds to an identity, sampling a set of weights from it results in a model encoding a novel identity. Next, we find linear directions in this space corresponding to semantic edits of the identity (e.g., adding a beard). These edits persist in appearance across generated samples. Finally, we show that inverting a single image into this space reconstructs a realistic identity, even if the input image is out of distribution (e.g., a painting). Our results indicate that the weight space of fine-tuned diffusion models behaves as an interpretable latent space of identities.
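
A toy numerical analogue of the weights2weights construction: flatten each fine-tuned model's weights into a vector, fit a low-rank affine subspace by SVD, and sample or project inside it. The sizes and the plain-PCA construction are illustrative assumptions; the paper builds its subspace from over 60,000 real fine-tuned diffusion models.

```python
import numpy as np

rng = np.random.default_rng(0)
n_models, dim, rank = 200, 50, 5

# Pretend each row is one fine-tuned model's flattened weight vector,
# generated to lie in a rank-5 subspace (standing in for identity weights).
W = rng.normal(size=(n_models, rank)) @ rng.normal(size=(rank, dim))

mean = W.mean(axis=0)
_, _, Vt = np.linalg.svd(W - mean, full_matrices=False)
subspace = Vt[:rank]                      # orthonormal principal directions

def sample_identity():
    """Sampling a point in the subspace decodes to a full weight vector."""
    z = rng.normal(size=rank)
    return mean + z @ subspace

w_new = sample_identity()
# Projecting a subspace sample back into the subspace is lossless:
recon = mean + ((w_new - mean) @ subspace.T) @ subspace
print(np.allclose(w_new, recon))  # True
```

In this picture, "editing" is moving along a chosen subspace direction, and "inversion" is the projection step applied to weights obtained from a single image.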

[CV-6] Explore the Limits of Omni-modal Pretraining at Scale

Link: https://arxiv.org/abs/2406.09412
Authors: Yiyuan Zhang,Handong Li,Jing Liu,Xiangyu Yue
Keywords: learning universal representations, universal representations, named Multimodal Context, build omni-modal intelligence, Multimodal Context
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: Project Website: this https URL

Click to view abstract

Abstract:We propose to build omni-modal intelligence, which is capable of understanding any modality and learning universal representations. In specific, we propose a scalable pretraining paradigm, named Multimodal Context (MiCo), which can scale up the numbers of modalities and amount of data, together with the model parameters, in the pretraining process. With MiCo, the pretrained models show significant emergent abilities in multimodal learning, which are evaluated on the following tasks: i) single-modality perception benchmarks of 10 different modalities, ii) 25 cross-modality understanding tasks of retrieval, question-answering, captioning, and iii) 18 multimodal large language model benchmarks. Our models establish 37 new records for state-of-the-art performance. We hope that our research could contribute to the development of omni-modal intelligence. Code and Models are at this https URL

[CV-7] MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding

Link: https://arxiv.org/abs/2406.09411
Authors: Fei Wang,Xingyu Fu,James Y. Huang,Zekun Li,Qin Liu,Xiaogeng Liu,Mingyu Derek Ma,Nan Xu,Wenxuan Zhou,Kai Zhang,Tianyi Lorena Yan,Wenjie Jacky Mo,Hsiang-Hui Liu,Pan Lu,Chunyuan Li,Chaowei Xiao,Kai-Wei Chang,Dan Roth,Sheng Zhang,Hoifung Poon,Muhao Chen
Keywords: comprehensive benchmark, benchmark that focuses, focuses on robust, robust multi-image understanding, multi-image understanding capabilities
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:We introduce MuirBench, a comprehensive benchmark that focuses on robust multi-image understanding capabilities of multimodal LLMs. MuirBench consists of 12 diverse multi-image tasks (e.g., scene understanding, ordering) that involve 10 categories of multi-image relations (e.g., multiview, temporal relations). Comprising 11,264 images and 2,600 multiple-choice questions, MuirBench is created in a pairwise manner, where each standard instance is paired with an unanswerable variant that has minimal semantic differences, in order for a reliable assessment. Evaluated upon 20 recent multi-modal LLMs, our results reveal that even the best-performing models like GPT-4o and Gemini Pro find it challenging to solve MuirBench, achieving 68.0% and 49.3% in accuracy. Open-source multimodal LLMs trained on single images can hardly generalize to multi-image questions, hovering below 33.3% in accuracy. These results highlight the importance of MuirBench in encouraging the community to develop multimodal LLMs that can look beyond a single image, suggesting potential pathways for future improvements.
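
One way the answerable/unanswerable pairing supports reliable assessment is pair-level scoring: credit a model only when it handles both members of a pair, so a lucky guess on one side does not count. This scoring rule is an illustrative assumption, not necessarily the metric behind the accuracies quoted above.

```python
def pair_accuracy(results):
    """Credit a pair only if both the standard instance and its
    minimally-different unanswerable variant are answered correctly."""
    return sum(a and b for a, b in results) / len(results)

# Hypothetical outcomes for four pairs: (standard correct, variant correct)
results = [(True, True), (True, False), (False, True), (True, True)]
print(pair_accuracy(results))  # 0.5
```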

[CV-8] Scene Graph Generation in Large-Size VHR Satellite Imagery: A Large-Scale Dataset and A Context-Aware Approach

Link: https://arxiv.org/abs/2406.09410
Authors: Yansheng Li,Linlin Wang,Tingzhu Wang,Xue Yang,Junwei Luo,Qi Wang,Youming Deng,Wenbin Wang,Xian Sun,Haifeng Li,Bo Dang,Yongjun Zhang,Yi Yu,Junchi Yan
Keywords: large-size VHR SAI, VHR SAI, large-size VHR, Scene graph generation, benefits promoting intelligent
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: This paper releases a SAI-oriented SGG toolkit with about 30 OBD methods and 10 SGG methods, and develops a benchmark based on RSG where our HOD-Net and RPCM significantly outperform the state-of-the-art methods in both OBD and SGG tasks. The RSG dataset and SAI-oriented toolkit will be made publicly available at this https URL

Click to view abstract

Abstract:Scene graph generation (SGG) in satellite imagery (SAI) benefits promoting intelligent understanding of geospatial scenarios from perception to cognition. In SAI, objects exhibit great variations in scales and aspect ratios, and there exist rich relationships between objects (even between spatially disjoint objects), which makes it necessary to holistically conduct SGG in large-size very-high-resolution (VHR) SAI. However, the lack of SGG datasets with large-size VHR SAI has constrained the advancement of SGG in SAI. Due to the complexity of large-size VHR SAI, mining triplets (subject, relationship, object) in large-size VHR SAI heavily relies on long-range contextual reasoning. Consequently, SGG models designed for small-size natural imagery are not directly applicable to large-size VHR SAI. To address the scarcity of datasets, this paper constructs a large-scale dataset for SGG in large-size VHR SAI with image sizes ranging from 512 x 768 to 27,860 x 31,096 pixels, named RSG, encompassing over 210,000 objects and more than 400,000 triplets. To realize SGG in large-size VHR SAI, we propose a context-aware cascade cognition (CAC) framework to understand SAI at three levels: object detection (OBD), pair pruning and relationship prediction. As a fundamental prerequisite for SGG in large-size SAI, a holistic multi-class object detection network (HOD-Net) that can flexibly integrate multi-scale contexts is proposed. With the consideration that there exist a huge amount of object pairs in large-size SAI but only a minority of object pairs contain meaningful relationships, we design a pair proposal generation (PPG) network via adversarial reconstruction to select high-value pairs. Furthermore, a relationship prediction network with context-aware messaging (RPCM) is proposed to predict the relationship types of these pairs.

[CV-9] CodedEvents: Optimal Point-Spread-Function Engineering for 3D-Tracking with Event Cameras

Link: https://arxiv.org/abs/2406.09409
Authors: Sachin Shah,Matthew Albert Chan,Haoming Cai,Jingxi Chen,Sakshum Kulshrestha,Chahat Deep Singh,Yiannis Aloimonos,Christopher Metzler
Keywords: CMOS image sensors, conventional CMOS image, embed extra information, well-established computational imaging, computational imaging technique
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Click to view abstract

Abstract:Point-spread-function (PSF) engineering is a well-established computational imaging technique that uses phase masks and other optical elements to embed extra information (e.g., depth) into the images captured by conventional CMOS image sensors. To date, however, PSF-engineering has not been applied to neuromorphic event cameras; a powerful new image sensing technology that responds to changes in the log-intensity of light. This paper establishes theoretical limits (Cramér Rao bounds) on 3D point localization and tracking with PSF-engineered event cameras. Using these bounds, we first demonstrate that existing Fisher phase masks are already near-optimal for localizing static flashing point sources (e.g., blinking fluorescent molecules). We then demonstrate that existing designs are sub-optimal for tracking moving point sources and proceed to use our theory to design optimal phase masks and binary amplitude masks for this task. To overcome the non-convexity of the design problem, we leverage novel implicit neural representation based parameterizations of the phase and amplitude masks. We demonstrate the efficacy of our designs through extensive simulations. We also validate our method with a simple prototype.
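
The Cramér-Rao machinery behind such bounds can be illustrated on a generic 1-D localization problem with Gaussian pixel noise (a simplification of the paper's event-camera model; all names here are assumptions): the bound on localization variance is the inverse of the Fisher information accumulated over pixels.

```python
import numpy as np

def crb_localization(psf_width, sigma_noise, pixels):
    """Cramér-Rao lower bound on var(x0_hat) for a Gaussian-shaped PSF
    observed with i.i.d. Gaussian pixel noise:
        CRB = 1 / (sum_i (d mu_i / d x0)^2 / sigma^2),
    where mu_i is the expected PSF intensity at pixel i.
    """
    x0 = 0.0
    mu = lambda x: np.exp(-(pixels - x) ** 2 / (2.0 * psf_width ** 2))
    eps = 1e-5
    dmu = (mu(x0 + eps) - mu(x0 - eps)) / (2.0 * eps)  # per-pixel sensitivity
    fisher = np.sum(dmu ** 2) / sigma_noise ** 2
    return 1.0 / fisher

pixels = np.linspace(-5.0, 5.0, 101)
wide = crb_localization(psf_width=2.0, sigma_noise=0.1, pixels=pixels)
narrow = crb_localization(psf_width=0.5, sigma_noise=0.1, pixels=pixels)
print(narrow < wide)  # True: under this model, a sharper PSF localizes better
```

PSF engineering plays the same game in reverse: choose a phase mask whose PSF maximizes the Fisher information for the quantity of interest (here, 3D position over time).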

[CV-10] Data Attribution for Text-to-Image Models by Unlearning Synthesized Images

Link: https://arxiv.org/abs/2406.09408
Authors: Sheng-Yu Wang,Aaron Hertzmann,Alexei A. Efros,Jun-Yan Zhu,Richard Zhang
Keywords: goal of data, data attribution, images, influential images, image
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Project page: this https URL Code: this https URL

Click to view abstract

Abstract:The goal of data attribution for text-to-image models is to identify the training images that most influence the generation of a new image. We can define “influence” by saying that, for a given output, if a model is retrained from scratch without that output’s most influential images, the model should then fail to generate that output image. Unfortunately, directly searching for these influential images is computationally infeasible, since it would require repeatedly retraining from scratch. We propose a new approach that efficiently identifies highly-influential images. Specifically, we simulate unlearning the synthesized image, proposing a method to increase the training loss on the output image, without catastrophic forgetting of other, unrelated concepts. Then, we find training images that are forgotten by proxy, identifying ones with significant loss deviations after the unlearning process, and label these as influential. We evaluate our method with a computationally intensive but “gold-standard” retraining from scratch and demonstrate our method’s advantages over previous methods.
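
A toy sketch of attribution by unlearning, on a linear model for clarity: raise the training loss on one synthesized output, then flag the training points whose own loss deviates most afterwards. The one-hot setup and step sizes are illustrative assumptions; the paper applies this recipe to text-to-image diffusion models while guarding against catastrophic forgetting.

```python
import numpy as np

X = np.eye(10)                     # 10 training inputs (one-hot for clarity)
w = np.arange(10, dtype=float)     # fitted weights: predictions X @ w hit y exactly
y = X @ w

target = X[7]                      # a synthesized output dominated by training point 7
w_hat = w.copy()
w_hat += 0.5 * target              # nudge off the exact minimum (gradient is 0 there)...
for _ in range(5):                 # ...then gradient-ascend the loss on `target`
    w_hat += 0.2 * (w_hat @ target - y[7]) * target

# Per-training-point loss deviation after "unlearning" the output:
delta = (X @ w_hat - y) ** 2 - (X @ w - y) ** 2
print(int(np.argmax(delta)))       # 7: the influential point is flagged
```

Only point 7's loss moves, so the loss-deviation ranking recovers the influence without ever retraining from scratch, which is the efficiency argument in the abstract.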

[CV-11] Towards Evaluating the Robustness of Visual State Space Models

Link: https://arxiv.org/abs/2406.09407
Authors: Hashmat Shadab Malik,Fahad Shamshad,Muzammal Naseer,Karthik Nandakumar,Fahad Shahbaz Khan,Salman Khan
Keywords: Vision State Space, State Space Models, Vision State, State Space, efficiently capturing long-range
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Vision State Space Models (VSSMs), a novel architecture that combines the strengths of recurrent neural networks and latent variable models, have demonstrated remarkable performance in visual perception tasks by efficiently capturing long-range dependencies and modeling complex visual dynamics. However, their robustness under natural and adversarial perturbations remains a critical concern. In this work, we present a comprehensive evaluation of VSSMs’ robustness under various perturbation scenarios, including occlusions, image structure, common corruptions, and adversarial attacks, and compare their performance to well-established architectures such as transformers and Convolutional Neural Networks. Furthermore, we investigate the resilience of VSSMs to object-background compositional changes on sophisticated benchmarks designed to test model performance in complex visual scenes. We also assess their robustness on object detection and segmentation tasks using corrupted datasets that mimic real-world scenarios. To gain a deeper understanding of VSSMs’ adversarial robustness, we conduct a frequency analysis of adversarial attacks, evaluating their performance against low-frequency and high-frequency perturbations. Our findings highlight the strengths and limitations of VSSMs in handling complex visual corruptions, offering valuable insights for future research and improvements in this promising field. Our code and models will be available at this https URL.

[CV-12] 4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities

Link: https://arxiv.org/abs/2406.09406
Authors: Roman Bachmann,Oğuzhan Fatih Kar,David Mizrahi,Ali Garjani,Mingfei Gao,David Griffiths,Jiaming Hu,Afshin Dehghan,Amir Zamir
Keywords: accept diverse inputs, multitask foundation models, show promising results, UnifiedIO show promising, Current multimodal
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Project page at this http URL

Click to view abstract

Abstract:Current multimodal and multitask foundation models like 4M or UnifiedIO show promising results, but in practice their out-of-the-box abilities to accept diverse inputs and perform diverse tasks are limited by the (usually rather small) number of modalities and tasks they are trained on. In this paper, we expand upon their capabilities by training a single model on tens of highly diverse modalities and by performing co-training on large-scale multimodal datasets and text corpora. This includes training on several semantic and geometric modalities, feature maps from recent state of the art models like DINOv2 and ImageBind, pseudo labels of specialist models like SAM and 4DHumans, and a range of new modalities that allow for novel ways to interact with the model and steer the generation, for example image metadata or color palettes. A crucial step in this process is performing discrete tokenization on various modalities, whether they are image-like, neural network feature maps, vectors, structured data like instance segmentation or human poses, or data that can be represented as text. Through this, we expand on the out-of-the-box capabilities of multimodal models and specifically show the possibility of training one model to solve at least 3x more tasks/modalities than existing ones and doing so without a loss in performance. This enables more fine-grained and controllable multimodal generation capabilities and allows us to study the distillation of models trained on diverse data and objectives into a unified model. We successfully scale the training to a three billion parameter model using tens of modalities and different datasets. The resulting models and training code are open sourced at this http URL.

[CV-13] ConsistDreamer: 3D-Consistent 2D Diffusion for High-Fidelity Scene Editing

Link: https://arxiv.org/abs/2406.09404
Authors: Jun-Kun Chen,Samuel Rota Bulò,Norman Müller,Lorenzo Porzi,Peter Kontschieder,Yu-Xiong Wang
Keywords: enabling high-fidelity instruction-guided, diffusion models, paper proposes ConsistDreamer, framework that lifts, diffusion
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: CVPR 2024

Click to view abstract

Abstract:This paper proposes ConsistDreamer - a novel framework that lifts 2D diffusion models with 3D awareness and 3D consistency, thus enabling high-fidelity instruction-guided scene editing. To overcome the fundamental limitation of missing 3D consistency in 2D diffusion models, our key insight is to introduce three synergetic strategies that augment the input of the 2D diffusion model to become 3D-aware and to explicitly enforce 3D consistency during the training process. Specifically, we design surrounding views as context-rich input for the 2D diffusion model, and generate 3D-consistent, structured noise instead of image-independent noise. Moreover, we introduce self-supervised consistency-enforcing training within the per-scene editing procedure. Extensive evaluation shows that our ConsistDreamer achieves state-of-the-art performance for instruction-guided scene editing across various scenes and editing instructions, particularly in complicated large-scale indoor scenes from ScanNet++, with significantly improved sharpness and fine-grained textures. Notably, ConsistDreamer stands as the first work capable of successfully editing complex (e.g., plaid/checkered) patterns. Our project page is at this http URL.

[CV-14] Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models

Link: https://arxiv.org/abs/2406.09403
Authors: Yushi Hu,Weijia Shi,Xingyu Fu,Dan Roth,Mari Ostendorf,Luke Zettlemoyer,Noah A Smith,Ranjay Krishna
Keywords: limited-capacity working memory, solving geometry problems, working memory, sketches to amplify, amplify our ideas
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: 26 pages

Click to view abstract

Abstract:Humans draw to facilitate reasoning: we draw auxiliary lines when solving geometry problems; we mark and circle when reasoning on maps; we use sketches to amplify our ideas and relieve our limited-capacity working memory. However, such actions are missing in current multimodal language models (LMs). Current chain-of-thought and tool-use paradigms only use text as intermediate reasoning steps. In this work, we introduce Sketchpad, a framework that gives multimodal LMs a visual sketchpad and tools to draw on the sketchpad. The LM conducts planning and reasoning according to the visual artifacts it has drawn. Different from prior work, which uses text-to-image models to enable LMs to draw, Sketchpad enables LMs to draw with lines, boxes, marks, etc., which is closer to human sketching and better facilitates reasoning. Sketchpad can also use specialist vision models during the sketching process (e.g., draw bounding boxes with object detection models, draw masks with segmentation models), to further enhance visual perception and reasoning. We experiment with a wide range of math tasks (including geometry, functions, graphs, and chess) and complex visual reasoning tasks. Sketchpad substantially improves performance on all tasks over strong base models with no sketching, yielding an average gain of 12.7% on math tasks, and 8.6% on vision tasks. GPT-4o with Sketchpad sets a new state of the art on all tasks, including V*Bench (80.3%), BLINK spatial reasoning (83.9%), and visual correspondence (80.8%). All codes and data are in this https URL.

[CV-15] Instruct 4D-to-4D: Editing 4D Scenes as Pseudo-3D Scenes Using 2D Diffusion

Link: https://arxiv.org/abs/2406.09402
Authors: Linzhan Mou,Jun-Kun Chen,Yu-Xiong Wang
Keywords: paper proposes Instruct, generate high-quality instruction-guided, high-quality instruction-guided dynamic, dynamic scene editing, diffusion models
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: CVPR 2024

Click to view abstract

Abstract:This paper proposes Instruct 4D-to-4D that achieves 4D awareness and spatial-temporal consistency for 2D diffusion models to generate high-quality instruction-guided dynamic scene editing results. Traditional applications of 2D diffusion models in dynamic scene editing often result in inconsistency, primarily due to their inherent frame-by-frame editing methodology. Addressing the complexities of extending instruction-guided editing to 4D, our key insight is to treat a 4D scene as a pseudo-3D scene, decoupled into two sub-problems: achieving temporal consistency in video editing and applying these edits to the pseudo-3D scene. Following this, we first enhance the Instruct-Pix2Pix (IP2P) model with an anchor-aware attention module for batch processing and consistent editing. Additionally, we integrate optical flow-guided appearance propagation in a sliding window fashion for more precise frame-to-frame editing and incorporate depth-based projection to manage the extensive data of pseudo-3D scenes, followed by iterative editing to achieve convergence. We extensively evaluate our approach in various scenes and editing instructions, and demonstrate that it achieves spatially and temporally consistent editing results, with significantly enhanced detail and sharpness over the prior art. Notably, Instruct 4D-to-4D is general and applicable to both monocular and challenging multi-camera scenes. Code and more results are available at this http URL.

[CV-16] MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations

Link: https://arxiv.org/abs/2406.09401
Authors: Ruiyuan Lyu,Tai Wang,Jingli Lin,Shuai Yang,Xiaohan Mao,Yilun Chen,Runsen Xu,Haifeng Huang,Chenming Zhu,Dahua Lin,Jiangmiao Pang
Keywords: makes rapid progress, perception attracts, rapid progress, attracts more attention, attention due
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: Follow-up of EmbodiedScan. A multi-modal 3D dataset with the most-ever comprehensive language annotations for 3D-LLMs. Project page: this https URL

Click to view abstract

Abstract:With the emergence of LLMs and their integration with other data modalities, multi-modal 3D perception attracts more attention due to its connectivity to the physical world and makes rapid progress. However, limited by existing datasets, previous works mainly focus on understanding object properties or inter-object spatial relationships in a 3D scene. To tackle this problem, this paper builds the first largest ever multi-modal 3D scene dataset and benchmark with hierarchical grounded language annotations, MMScan. It is constructed based on a top-down logic, from region to object level, from a single target to inter-target relationships, covering holistic aspects of spatial and attribute understanding. The overall pipeline incorporates powerful VLMs via carefully designed prompts to initialize the annotations efficiently and further involve humans’ correction in the loop to ensure the annotations are natural, correct, and comprehensive. Built upon existing 3D scanning data, the resulting multi-modal 3D dataset encompasses 1.4M meta-annotated captions on 109k objects and 7.7k regions as well as over 3.04M diverse samples for 3D visual grounding and question-answering benchmarks. We evaluate representative baselines on our benchmarks, analyze their capabilities in different aspects, and showcase the key problems to be addressed in the future. Furthermore, we use this high-quality dataset to train state-of-the-art 3D visual grounding and LLMs and obtain remarkable performance improvement both on existing benchmarks and in-the-wild evaluation. Codes, datasets, and benchmarks will be available at this https URL.

[CV-17] YoLLaVA: Your Personalized Language and Vision Assistant

Link: https://arxiv.org/abs/2406.09400
Authors: Thao Nguyen,Haotian Liu,Yuheng Li,Mu Cai,Utkarsh Ojha,Yong Jae Lee
Keywords: Large Multimodal Models, Large Multimodal, Multimodal Models, shown remarkable capabilities, visual question answering
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Project page: this https URL

Click to view abstract

Abstract:Large Multimodal Models (LMMs) have shown remarkable capabilities across a variety of tasks (e.g., image captioning, visual question answering). While broad, their knowledge remains generic (e.g., recognizing a dog), and they are unable to handle personalized subjects (e.g., recognizing a user’s pet dog). Human reasoning, in contrast, typically operates within the context of specific subjects in our surroundings. For example, one might ask, “What should I buy for my dog’s birthday?”; as opposed to a generic inquiry about “What should I buy for a dog’s birthday?”. Similarly, when looking at a friend’s image, the interest lies in seeing their activities (e.g., “my friend is holding a cat”), rather than merely observing generic human actions (e.g., “a man is holding a cat”). In this paper, we introduce the novel task of personalizing LMMs, so that they can have conversations about a specific subject. We propose Yo’LLaVA, which learns to embed a personalized subject into a set of latent tokens given a handful of example images of the subject. Our qualitative and quantitative analyses reveal that Yo’LLaVA can learn the concept more efficiently using fewer tokens and more effectively encode the visual attributes compared to strong prompting baselines (e.g., LLaVA).

[CV-18] OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation

Link: https://arxiv.org/abs/2406.09399
Authors: Junke Wang,Yi Jiang,Zehuan Yuan,Binyue Peng,Zuxuan Wu,Yu-Gang Jiang
Keywords: compact latent space, latent space, intricate visual data, translator to map, map the intricate
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Tokenizer, serving as a translator to map the intricate visual data into a compact latent space, lies at the core of visual generative models. Based on the finding that existing tokenizers are tailored to image or video inputs, this paper presents OmniTokenizer, a transformer-based tokenizer for joint image and video tokenization. OmniTokenizer is designed with a spatial-temporal decoupled architecture, which integrates window and causal attention for spatial and temporal modeling. To exploit the complementary nature of image and video data, we further propose a progressive training strategy, where OmniTokenizer is first trained on image data on a fixed resolution to develop the spatial encoding capacity and then jointly trained on image and video data on multiple resolutions to learn the temporal dynamics. OmniTokenizer, for the first time, handles both image and video inputs within a unified framework and proves the possibility of realizing their synergy. Extensive experiments demonstrate that OmniTokenizer achieves state-of-the-art (SOTA) reconstruction performance on various image and video datasets, e.g., 1.11 reconstruction FID on ImageNet and 42 reconstruction FVD on UCF-101, beating the previous SOTA methods by 13% and 26%, respectively. Additionally, we also show that when integrated with OmniTokenizer, both language model-based approaches and diffusion models can realize advanced visual synthesis performance, underscoring the superiority and versatility of our method. Code is available at this https URL.
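
The spatial-temporal decoupled attention described above can be sketched as a mask: full attention among tokens of the same frame (spatial/window), causal attention across frames (temporal). This is a simplification of the actual window + causal design, and all names are illustrative.

```python
import numpy as np

def decoupled_mask(n_frames, tokens_per_frame):
    """Sketch of spatial-temporal decoupled attention: every token may attend
    to all tokens of its own frame and to tokens of earlier frames.
    Returns an (n, n) 0/1 matrix where 1 = attention allowed."""
    n = n_frames * tokens_per_frame
    frame = np.arange(n) // tokens_per_frame   # frame index of each token
    return (frame[:, None] >= frame[None, :]).astype(int)

m = decoupled_mask(n_frames=3, tokens_per_frame=2)
print(m[0])  # a frame-0 token sees only frame 0: [1 1 0 0 0 0]
print(m[4])  # a frame-2 token sees all earlier frames too: [1 1 1 1 1 1]
```

With `n_frames=1` the mask is all ones, i.e. a still image degenerates to ordinary full attention, which is one way a single tokenizer can cover both image and video inputs.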

[CV-19] Real-Time Deepfake Detection in the Real-World

Link: https://arxiv.org/abs/2406.09398
Authors: Bar Cavia,Eliahu Horwitz,Tal Reiss,Yedid Hoshen
Keywords: made synthesizing fake, develop accurate techniques, Recent improvements, fake images easy, synthesizing fake images
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Recent improvements in generative AI made synthesizing fake images easy; as they can be used to cause harm, it is crucial to develop accurate techniques to identify them. This paper introduces “Locally Aware Deepfake Detection Algorithm” (LaDeDa), that accepts a single 9x9 image patch and outputs its deepfake score. The image deepfake score is the pooled score of its patches. With merely patch-level information, LaDeDa significantly improves over the state-of-the-art, achieving around 99% mAP on current benchmarks. Owing to the patch-level structure of LaDeDa, we hypothesize that the generation artifacts can be detected by a simple model. We therefore distill LaDeDa into Tiny-LaDeDa, a highly efficient model consisting of only 4 convolutional layers. Remarkably, Tiny-LaDeDa has 375x fewer FLOPs and is 10,000x more parameter-efficient than LaDeDa, allowing it to run efficiently on edge devices with a minor decrease in accuracy. These almost-perfect scores raise the question: is the task of deepfake detection close to being solved? Perhaps surprisingly, our investigation reveals that current training protocols prevent methods from generalizing to real-world deepfakes extracted from social media. To address this issue, we introduce WildRF, a new deepfake detection dataset curated from several popular social networks. Our method achieves the top performance of 93.7% mAP on WildRF, however the large gap from perfect accuracy shows that reliable real-world deepfake detection is still unsolved.
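
The patch-score pooling at the core of LaDeDa can be sketched as follows. The mean pooling and the toy high-frequency scorer below are illustrative assumptions; the actual model is a learned classifier over 9x9 patches.

```python
import numpy as np

def image_deepfake_score(image, patch_scorer, patch=9):
    """Pool per-patch deepfake scores into one image-level score."""
    h, w = image.shape
    scores = [patch_scorer(image[i:i + patch, j:j + patch])
              for i in range(0, h - patch + 1, patch)
              for j in range(0, w - patch + 1, patch)]
    return float(np.mean(scores))   # mean pooling; the paper's pooling may differ

def toy_scorer(p):
    """Hypothetical stand-in: score = local high-frequency energy."""
    return float(np.abs(np.diff(p, axis=0)).mean())

smooth = np.zeros((27, 27))
noisy = np.random.default_rng(0).normal(size=(27, 27))
print(image_deepfake_score(noisy, toy_scorer) > image_deepfake_score(smooth, toy_scorer))  # True
```

Because the decision is built from tiny local patches, the authors can distill the scorer into a 4-layer model (Tiny-LaDeDa) that runs on edge devices.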

[CV-20] Aligning Vision Models with Human Aesthetics in Retrieval: Benchmarks and Algorithms

链接: https://arxiv.org/abs/2406.09397
作者: Miaosen Zhang,Yixuan Wei,Zhen Xing,Yifei Ma,Zuxuan Wu,Ji Li,Zheng Zhang,Qi Dai,Chong Luo,Xin Geng,Baining Guo
关键词: Modern vision models, vision models, Modern vision, models, aesthetic
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 28 pages, 26 figures, under review

点击查看摘要

Abstract:Modern vision models are trained on very large noisy datasets. While these models acquire strong capabilities, they may not follow the user’s intent to output the desired results in certain aspects, e.g., visual aesthetic, preferred style, and responsibility. In this paper, we target the realm of visual aesthetics and aim to align vision models with human aesthetic standards in a retrieval system. Advanced retrieval systems usually adopt a cascade of aesthetic models as re-rankers or filters, which are limited to low-level features like saturation and perform poorly when stylistic, cultural or knowledge contexts are involved. We find that utilizing the reasoning ability of large language models (LLMs) to rephrase the search query and extend the aesthetic expectations can make up for this shortcoming. Based on the above findings, we propose a preference-based reinforcement learning method that fine-tunes the vision models to distill the knowledge from both LLMs reasoning and the aesthetic models to better align the vision models with human aesthetics. Meanwhile, given the scarcity of benchmarks designed for evaluating retrieval systems, we leverage large multi-modality models (LMMs), with their strong capabilities, to evaluate aesthetic performance. As aesthetic assessment is one of the most subjective tasks, to validate the robustness of LMM, we further propose a novel dataset named HPIR to benchmark the alignment with human aesthetics. Experiments demonstrate that our method significantly enhances the aesthetic behaviors of the vision models, under several metrics. We believe the proposed algorithm can be a general practice for aligning vision models with human values.

[CV-21] Too Many Frames, Not All Useful: Efficient Strategies for Long-Form Video QA

链接: https://arxiv.org/abs/2406.09396
作者: Jongwoo Park,Kanchana Ranasinghe,Kumara Kahatapitiya,Wonjeong Ryoo,Donghyun Kim,Michael S. Ryoo
关键词: wide temporal intervals, multiple distinct events, span across wide, wide temporal, temporal intervals
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Long-form videos that span wide temporal intervals are highly information-redundant and contain multiple distinct events or entities that are often loosely related. Therefore, when performing long-form video question answering (LVQA), all information necessary to generate a correct response can often be contained within a small subset of frames. Recent literature explores the use of large language models (LLMs) in LVQA benchmarks, achieving exceptional performance, while relying on vision language models (VLMs) to convert all visual content within videos into natural language. Such VLMs often independently caption a large number of frames uniformly sampled from long videos, which is inefficient and mostly redundant. Questioning these design choices, we explore optimal strategies for key-frame selection and sequence-aware captioning that can significantly reduce these redundancies. We propose two novel approaches that improve each of these aspects, namely a Hierarchical Keyframe Selector and a Sequential Visual LLM. Our resulting framework, termed LVNet, achieves state-of-the-art performance across three benchmark LVQA datasets. Our code will be released publicly.
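The core intuition (a few non-redundant frames suffice) can be illustrated with a simple farthest-point selection over per-frame features. This is a generic stand-in for the paper's Hierarchical Keyframe Selector, whose actual design we have not seen; it only demonstrates redundancy reduction.

```python
import numpy as np

def select_keyframes(frame_features, budget):
    """Greedy farthest-point selection: start from frame 0, then
    repeatedly add the frame farthest (in L2 distance) from the
    already-selected set, skipping near-duplicate frames."""
    feats = np.asarray(frame_features, dtype=float)
    selected = [0]
    while len(selected) < budget:
        dist_to_set = [min(np.linalg.norm(f - feats[s]) for s in selected)
                       for f in feats]
        selected.append(int(np.argmax(dist_to_set)))
    return sorted(selected)

# Six frames with 1-D features: three near-duplicates, then new content.
keyframes = select_keyframes([[0], [0], [0], [5], [5], [10]], budget=3)
```

With a budget of 3, the duplicates at the start are collapsed into one representative and the two distinct later segments are each covered.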

[CV-22] Modeling Ambient Scene Dynamics for Free-view Synthesis

链接: https://arxiv.org/abs/2406.09395
作者: Meng-Li Shih,Jia-Bin Huang,Changil Kim,Rajvi Shah,Johannes Kopf,Chen Gao
关键词: monocular capture bringing, viewing experience, dynamic free-view synthesis, bringing a immersive, Gaussian Splatting
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We introduce a novel method for dynamic free-view synthesis of ambient scenes from a monocular capture, bringing an immersive quality to the viewing experience. Our method builds upon the recent advancements in 3D Gaussian Splatting (3DGS) that can faithfully reconstruct complex static scenes. Previous attempts to extend 3DGS to represent dynamics have been confined to bounded scenes or require multi-camera captures, and often fail to generalize to unseen motions, limiting their practical application. Our approach overcomes these constraints by leveraging the periodicity of ambient motions to learn the motion trajectory model, coupled with careful regularization. We also propose important practical strategies to improve the visual quality of the baseline 3DGS static reconstructions and to improve memory efficiency critical for GPU-memory intensive learning. We demonstrate high-quality photorealistic novel view synthesis of several ambient natural scenes with intricate textures and fine structural elements.

[CV-23] WonderWorld: Interactive 3D Scene Generation from a Single Image

链接: https://arxiv.org/abs/2406.09394
作者: Hong-Xing Yu,Haoyi Duan,Charles Herrmann,William T. Freeman,Jiajun Wu
关键词: Fast Gaussian Surfels, user-specified text, explore and shape, single input image, virtual environments based
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
*备注: Project website: this https URL

点击查看摘要

Abstract:We present WonderWorld, a novel framework for interactive 3D scene extrapolation that enables users to explore and shape virtual environments based on a single input image and user-specified text. While significant improvements have been made to the visual quality of scene generation, existing methods are run offline, taking tens of minutes to hours to generate a scene. By leveraging Fast Gaussian Surfels and a guided diffusion-based depth estimation method, WonderWorld generates geometrically consistent extrapolation while significantly reducing computational time. Our framework generates connected and diverse 3D scenes in less than 10 seconds on a single A6000 GPU, enabling real-time user interaction and exploration. We demonstrate the potential of WonderWorld for applications in virtual reality, gaming, and creative design, where users can quickly generate and navigate immersive, potentially infinite virtual worlds from a single image. Our approach represents a significant advancement in interactive 3D scene generation, opening up new possibilities for user-driven content creation and exploration in virtual environments. We will release full code and software for reproducibility. Project website: this https URL

[CV-24] LLAVIDAL: Benchmarking Large Language Vision Models for Daily Activities of Living

链接: https://arxiv.org/abs/2406.09390
作者: Rajatsubhra Chakraborty,Arkaprava Sinha,Dominick Reilly,Manish Kumar Govind,Pu Wang,Francois Bremond,Srijan Das
关键词: Large Language Vision, Language Vision Models, Daily Living, Activities of Daily, processing internet videos
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Language Vision Models (LLVMs) have demonstrated effectiveness in processing internet videos, yet they struggle with the visually perplexing dynamics present in Activities of Daily Living (ADL) due to limited pertinent datasets and models tailored to relevant cues. To this end, we propose a framework for curating ADL multiview datasets to fine-tune LLVMs, resulting in the creation of ADL-X, comprising 100K RGB video-instruction pairs, language descriptions, 3D skeletons, and action-conditioned object trajectories. We introduce LLAVIDAL, an LLVM capable of incorporating 3D poses and relevant object trajectories to understand the intricate spatiotemporal relationships within ADLs. Furthermore, we present a novel benchmark, ADLMCQ, for quantifying LLVM effectiveness in ADL scenarios. When trained on ADL-X, LLAVIDAL consistently achieves state-of-the-art performance across all ADL evaluation metrics. Qualitative analysis reveals LLAVIDAL’s temporal reasoning capabilities in understanding ADL. The link to the dataset is provided at: this https URL

[CV-25] Exploring the Spectrum of Visio-Linguistic Compositionality and Recognition

链接: https://arxiv.org/abs/2406.09388
作者: Youngtaek Oh,Pyunghwan Ahn,Jinhyung Kim,Gwangmo Song,Soonyoung Lee,In So Kweon,Junmo Kim
关键词: fine-grained image-text alignment, Vision and language, showcased remarkable zero-shot, image-text alignment, showcased remarkable
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted to CVPRW 2024 on ‘What is Next in Multimodal Foundation Models?’. Code: this https URL

点击查看摘要

Abstract:Vision and language models (VLMs) such as CLIP have showcased remarkable zero-shot recognition abilities yet face challenges in visio-linguistic compositionality, particularly in linguistic comprehension and fine-grained image-text alignment. This paper explores the intricate relationship between compositionality and recognition – two pivotal aspects of VLM capability. We conduct a comprehensive evaluation of existing VLMs, covering both pre-training approaches aimed at recognition and the fine-tuning methods designed to improve compositionality. Our evaluation employs 12 benchmarks for compositionality, along with 21 zero-shot classification and two retrieval benchmarks for recognition. In our analysis from 274 CLIP model checkpoints, we reveal patterns and trade-offs that emerge between compositional understanding and recognition accuracy. Ultimately, this necessitates strategic efforts towards developing models that improve both capabilities, as well as the meticulous formulation of benchmarks for compositionality. We open our evaluation framework at this https URL.

[CV-26] SimGen: Simulator-conditioned Driving Scene Generation

链接: https://arxiv.org/abs/2406.09386
作者: Yunsong Zhou,Michael Simon,Zhenghao Peng,Sicheng Mo,Hongzi Zhu,Minyi Guo,Bolei Zhou
关键词: Controllable synthetic data, autonomous driving research, Controllable synthetic, research and development, substantially lower
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Controllable synthetic data generation can substantially lower the annotation cost of training data in autonomous driving research and development. Prior works use diffusion models to generate driving images conditioned on the 3D object layout. However, those models are trained on small-scale datasets like nuScenes, which lack appearance and layout diversity. Moreover, the trained models can only generate images based on the real-world layout data from the validation set of the same dataset, where overfitting might happen. In this work, we introduce a simulator-conditioned scene generation framework called SimGen that can learn to generate diverse driving scenes by mixing data from the simulator and the real world. It uses a novel cascade diffusion pipeline to address challenging sim-to-real gaps and multi-condition conflicts. A driving video dataset DIVA is collected to enhance the generative diversity of SimGen, which contains over 147.5 hours of real-world driving videos from 73 locations worldwide and simulated driving data from the MetaDrive simulator. SimGen achieves superior generation quality and diversity while preserving controllability based on the text prompt and the layout pulled from a simulator. We further demonstrate the improvements brought by SimGen for synthetic data augmentation on the BEV detection and segmentation task and showcase its capability in safety-critical data generation. Code, data, and models will be made available.

[CV-27] Towards Vision-Language Geo-Foundation Model: A Survey

链接: https://arxiv.org/abs/2406.09385
作者: Yue Zhou,Litong Feng,Yiping Ke,Xue Jiang,Junchi Yan,Xue Yang,Wayne Zhang
关键词: visual question answering, made remarkable progress, Vision-Language Foundation Models, visual grounding, Foundation Models
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 18 pages, 4 figures

点击查看摘要

Abstract:Vision-Language Foundation Models (VLFMs) have made remarkable progress on various multimodal tasks, such as image captioning, image-text retrieval, visual question answering, and visual grounding. However, most methods rely on training with general image datasets, and the lack of geospatial data leads to poor performance on earth observation. Numerous geospatial image-text pair datasets and VLFMs fine-tuned on them have been proposed recently. These new approaches aim to leverage large-scale, multimodal geospatial data to build versatile intelligent models with diverse geo-perceptive capabilities, which we refer to as Vision-Language Geo-Foundation Models (VLGFMs). This paper thoroughly reviews VLGFMs, summarizing and analyzing recent developments in the field. In particular, we introduce the background and motivation behind the rise of VLGFMs, highlighting their unique research significance. Then, we systematically summarize the core technologies employed in VLGFMs, including data construction, model architectures, and applications of various multimodal geospatial tasks. Finally, we conclude with insights, issues, and discussions regarding future research directions. To the best of our knowledge, this is the first comprehensive literature review of VLGFMs. We keep tracing related works at this https URL.

[CV-28] Reflecting on the State of Rehearsal-free Continual Learning with Pretrained Models

链接: https://arxiv.org/abs/2406.09384
作者: Lukas Thede,Karsten Roth,Olivier J. Hénaff,Matthias Bethge,Zeynep Akata
关键词: foundation models, pretrained models, ubiquity of foundation, success on rehearsal-free, models
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: 3rd Conference on Lifelong Learning Agents (CoLLAs) 2024

点击查看摘要

Abstract:With the advent and recent ubiquity of foundation models, continual learning (CL) has recently shifted from continual training from scratch to the continual adaptation of pretrained models, seeing particular success on rehearsal-free CL benchmarks (RFCL). To achieve this, most proposed methods adapt and restructure parameter-efficient finetuning techniques (PEFT) to suit the continual nature of the problem. Based most often on input-conditional query-mechanisms or regularizations on top of prompt- or adapter-based PEFT, these PEFT-style RFCL (P-RFCL) approaches report peak performances; often convincingly outperforming existing CL techniques. However, on the other end, critical studies have recently highlighted competitive results by training on just the first task or via simple non-parametric baselines. Consequently, questions arise about the relationship between methodological choices in P-RFCL and their reported high benchmark scores. In this work, we tackle these questions to better understand the true drivers behind strong P-RFCL performances, their placement w.r.t. recent first-task adaptation studies, and their relation to preceding CL standards such as EWC or SI. In particular, we show: (1) P-RFCL techniques relying on input-conditional query mechanisms work not because, but rather despite them by collapsing towards standard PEFT shortcut solutions. (2) Indeed, we show how most often, P-RFCL techniques can be matched by a simple and lightweight PEFT baseline. (3) Using this baseline, we identify the implicit bound on tunable parameters when deriving RFCL approaches from PEFT methods as a potential denominator behind P-RFCL efficacy. Finally, we (4) better disentangle continual versus first-task adaptation, and (5) motivate standard RFCL techniques s.a. EWC or SI in light of recent P-RFCL methods.
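The "simple and lightweight PEFT baseline" idea from the abstract — a frozen backbone plus a small set of shared tunable parameters, trained sequentially with no input-conditional query mechanism — can be sketched as follows. The class name, architecture (a fixed random projection as the frozen "backbone", one shared prompt vector, a linear head), and all hyperparameters are illustrative, not the paper's.

```python
import numpy as np

class PromptTuneBaseline:
    """Continual-learning sketch: only `prompt` and `head` are trained;
    the backbone projection stays frozen across all tasks."""

    def __init__(self, in_dim, feat_dim, n_classes, seed=0):
        rng = np.random.default_rng(seed)
        self.W_frozen = rng.normal(size=(in_dim, feat_dim)) / np.sqrt(in_dim)
        self.prompt = np.zeros(feat_dim)             # only tuned PEFT params
        self.head = np.zeros((feat_dim, n_classes))  # shared classifier

    def features(self, x):
        return np.tanh(x @ self.W_frozen + self.prompt)

    def train_task(self, X, y, lr=0.1, epochs=50):
        for _ in range(epochs):
            f = self.features(X)
            logits = f @ self.head
            p = np.exp(logits - logits.max(1, keepdims=True))
            p /= p.sum(1, keepdims=True)
            p[np.arange(len(y)), y] -= 1.0           # softmax CE gradient
            grad_prompt = ((p @ self.head.T) * (1 - f ** 2)).mean(0)
            self.head -= lr * f.T @ p / len(y)
            self.prompt -= lr * grad_prompt

    def predict(self, X):
        return (self.features(X) @ self.head).argmax(1)

model = PromptTuneBaseline(in_dim=2, feat_dim=8, n_classes=4)
model.train_task(np.array([[2., 0.], [-2., 0.]]), np.array([0, 1]))  # task 1
model.train_task(np.array([[0., 2.], [0., -2.]]), np.array([2, 3]))  # task 2
```

After sequential training, the most recent task is classified correctly; how much the baseline forgets task 1 is exactly the kind of question the paper's analysis probes.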

[CV-29] Multiagent Multitraversal Multimodal Self-Driving: Open MARS Dataset

链接: https://arxiv.org/abs/2406.09383
作者: Yiming Li,Zhiheng Li,Nuo Chen,Moonjun Gong,Zonglin Lyu,Zehong Wang,Peili Jiang,Chen Feng
关键词: fueled recent advancements, Large-scale datasets, fueled recent, recent advancements, advancements in AI-based
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by CVPR 2024

点击查看摘要

Abstract:Large-scale datasets have fueled recent advancements in AI-based autonomous vehicle research. However, these datasets are usually collected from a single vehicle’s one-time pass of a certain location, lacking multiagent interactions or repeated traversals of the same place. Such information could lead to transformative enhancements in autonomous vehicles’ perception, prediction, and planning capabilities. To bridge this gap, in collaboration with the self-driving company May Mobility, we present the MARS dataset which unifies scenarios that enable MultiAgent, multitraveRSal, and multimodal autonomous vehicle research. More specifically, MARS is collected with a fleet of autonomous vehicles driving within a certain geographical area. Each vehicle has its own route and different vehicles may appear at nearby locations. Each vehicle is equipped with a LiDAR and surround-view RGB cameras. We curate two subsets in MARS: one facilitates collaborative driving with multiple vehicles simultaneously present at the same location, and the other enables memory retrospection through asynchronous traversals of the same location by multiple vehicles. We conduct experiments in place recognition and neural reconstruction. More importantly, MARS introduces new research opportunities and challenges such as multitraversal 3D reconstruction, multiagent perception, and unsupervised object discovery. Our data and codes can be found at this https URL.

[CV-30] GGHead: Fast and Generalizable 3D Gaussian Heads

链接: https://arxiv.org/abs/2406.09377
作者: Tobias Kirschstein,Simon Giebenhain,Jiapeng Tang,Markos Georgopoulos,Matthias Nießner
关键词: human modeling, important step, large image resolutions, Learning, Generative Gaussian Heads
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Project Page: this https URL ; YouTube Video: this https URL

点击查看摘要

Abstract:Learning 3D head priors from large 2D image collections is an important step towards high-quality 3D-aware human modeling. A core requirement is an efficient architecture that scales well to large-scale datasets and large image resolutions. Unfortunately, existing 3D GANs struggle to scale to generate samples at high resolutions due to their relatively slow train and render speeds, and typically have to rely on 2D superresolution networks at the expense of global 3D consistency. To address these challenges, we propose Generative Gaussian Heads (GGHead), which adopts the recent 3D Gaussian Splatting representation within a 3D GAN framework. To generate a 3D representation, we employ a powerful 2D CNN generator to predict Gaussian attributes in the UV space of a template head mesh. This way, GGHead exploits the regularity of the template’s UV layout, substantially facilitating the challenging task of predicting an unstructured set of 3D Gaussians. We further improve the geometric fidelity of the generated 3D representations with a novel total variation loss on rendered UV coordinates. Intuitively, this regularization encourages that neighboring rendered pixels should stem from neighboring Gaussians in the template’s UV space. Taken together, our pipeline can efficiently generate 3D heads trained only from single-view 2D image observations. Our proposed framework matches the quality of existing 3D head GANs on FFHQ while being both substantially faster and fully 3D consistent. As a result, we demonstrate real-time generation and rendering of high-quality 3D-consistent heads at 1024^2 resolution for the first time.
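The total variation regularizer on rendered UV coordinates described in the abstract can be sketched directly: it is large when neighboring rendered pixels map to distant points in the template's UV space. A plain L1 total variation is used below; the paper's exact form and weighting may differ.

```python
import numpy as np

def uv_total_variation(uv):
    """L1 total variation of a rendered UV-coordinate image of shape
    (H, W, 2): penalizes adjacent pixels whose UV coordinates differ,
    encouraging them to stem from neighboring Gaussians in UV space."""
    d_row = np.abs(uv[1:, :, :] - uv[:-1, :, :]).sum()
    d_col = np.abs(uv[:, 1:, :] - uv[:, :-1, :]).sum()
    return (d_row + d_col) / (uv.shape[0] * uv.shape[1])

smooth = np.zeros((2, 2, 2))   # constant UVs -> zero penalty
jumpy = np.zeros((2, 2, 2))
jumpy[:, 1, 0] = 1.0           # right column maps far away in U
```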

[CV-31] Scale-Invariant Monocular Depth Estimation via SSI Depth

链接: https://arxiv.org/abs/2406.09374
作者: S. Mahdi H. Miangoleh,Mahesh Reddy,Yağız Aksoy
关键词: scale-invariant monocular depth, monocular depth estimation, depth estimation, hindering generalizability, methods for scale-invariant
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: To appear in Proc. SIGGRAPH, 2024. Project webpage: this https URL

点击查看摘要

Abstract:Existing methods for scale-invariant monocular depth estimation (SI MDE) often struggle due to the complexity of the task, and limited and non-diverse datasets, hindering generalizability in real-world scenarios. Meanwhile, shift-and-scale-invariant (SSI) depth estimation, which simplifies the task and enables training with abundant stereo datasets, achieves high performance. We present a novel approach that leverages SSI inputs to enhance SI depth estimation, streamlining the network’s role and facilitating in-the-wild generalization for SI depth estimation while only using a synthetic dataset for training. Emphasizing the generation of high-resolution details, we introduce a novel sparse ordinal loss that substantially improves detail generation in SSI MDE, addressing critical limitations in existing approaches. Through in-the-wild qualitative examples and zero-shot evaluation we substantiate the practical utility of our approach in computational photography applications, showcasing its ability to generate highly detailed SI depth maps and achieve generalization in diverse scenarios.
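What separates an SSI prediction from an SI one is exactly two undetermined degrees of freedom: a shift and a scale. The standard way to recover them against a reference is least squares, sketched below (this is the conventional evaluation-time alignment, not code from the paper).

```python
import numpy as np

def fit_shift_scale(ssi_pred, reference):
    """Solve min over (s, t) of || s * ssi_pred + t - reference ||^2,
    i.e. the shift and scale an SSI depth map leaves undetermined."""
    A = np.stack([ssi_pred.ravel(), np.ones(ssi_pred.size)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, reference.ravel(), rcond=None)
    return s, t

ssi = np.array([[0.0, 1.0], [2.0, 3.0]])
s, t = fit_shift_scale(ssi, 2.0 * ssi + 5.0)  # reference differs by s=2, t=5
```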

[CV-32] LRM-Zero: Training Large Reconstruction Models with Synthesized Data

链接: https://arxiv.org/abs/2406.09371
作者: Desai Xie,Sai Bi,Zhixin Shu,Kai Zhang,Zexiang Xu,Yi Zhou,Sören Pirk,Arie Kaufman,Xin Sun,Hao Tan
关键词: achieving high-quality sparse-view, Large Reconstruction Model, achieving high-quality, high-quality sparse-view, Large Reconstruction
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 23 pages, 8 figures. Our code and interactive visualization are available at: this https URL

点击查看摘要

Abstract:We present LRM-Zero, a Large Reconstruction Model (LRM) trained entirely on synthesized 3D data, achieving high-quality sparse-view 3D reconstruction. The core of LRM-Zero is our procedural 3D dataset, Zeroverse, which is automatically synthesized from simple primitive shapes with random texturing and augmentations (e.g., height fields, boolean differences, and wireframes). Unlike previous 3D datasets (e.g., Objaverse) which are often captured or crafted by humans to approximate real 3D data, Zeroverse completely ignores realistic global semantics but is rich in complex geometric and texture details that are locally similar to or even more intricate than real objects. We demonstrate that our LRM-Zero, trained with our fully synthesized Zeroverse, can achieve high visual quality in the reconstruction of real-world objects, competitive with models trained on Objaverse. We also analyze several critical design choices of Zeroverse that contribute to LRM-Zero’s capability and training stability. Our work demonstrates that 3D reconstruction, one of the core tasks in 3D vision, can potentially be addressed without the semantics of real-world objects. The Zeroverse’s procedural synthesis code and interactive visualization are available at: this https URL.
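The Zeroverse recipe (primitive shapes plus random texturing and augmentations such as height fields, boolean differences, and wireframes) lends itself to a tiny procedural sampler. Every field name, range, and probability below is illustrative, not the paper's actual generation code.

```python
import random

def sample_zeroverse_shape(rng):
    """Sample one procedural asset: a primitive with random scale,
    texture seed, and a random subset of augmentations."""
    return {
        "primitive": rng.choice(["cube", "sphere", "cylinder", "cone", "torus"]),
        "scale": [round(rng.uniform(0.2, 2.0), 3) for _ in range(3)],
        "texture_seed": rng.randrange(10_000),
        "augmentations": [a for a in ("height_field", "boolean_difference",
                                      "wireframe") if rng.random() < 0.5],
    }

rng = random.Random(42)
scene = [sample_zeroverse_shape(rng) for _ in range(4)]
```

The point of the paper is that assets like these, with no real-world semantics at all, are enough to train a strong sparse-view reconstructor.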

[CV-33] CLIPAway: Harmonizing Focused Embeddings for Removing Objects via Diffusion Models

链接: https://arxiv.org/abs/2406.09368
作者: Yigit Ekin,Ahmet Burak Yildirim,Erdem Eren Caglar,Aykut Erdem,Erkut Erdem,Aysegul Dundar
关键词: Advanced image editing, preserving visual integrity, seamlessly removing unwanted, Advanced image, removing unwanted elements
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Project page: this https URL

点击查看摘要

Abstract:Advanced image editing techniques, particularly inpainting, are essential for seamlessly removing unwanted elements while preserving visual integrity. Traditional GAN-based methods have achieved notable success, but recent advancements in diffusion models have produced superior results due to their training on large-scale datasets, enabling the generation of remarkably realistic inpainted images. Despite their strengths, diffusion models often struggle with object removal tasks without explicit guidance, leading to unintended hallucinations of the removed object. To address this issue, we introduce CLIPAway, a novel approach leveraging CLIP embeddings to focus on background regions while excluding foreground elements. CLIPAway enhances inpainting accuracy and quality by identifying embeddings that prioritize the background, thus achieving seamless object removal. Unlike other methods that rely on specialized training datasets or costly manual annotations, CLIPAway provides a flexible, plug-and-play solution compatible with various diffusion-based inpainting techniques.
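One plausible way to read "identifying embeddings that prioritize the background" is as embedding arithmetic: project the foreground direction out of the image embedding. The sketch below is purely illustrative linear algebra; CLIPAway's actual mechanism may differ.

```python
import numpy as np

def background_embedding(e_image, e_foreground):
    """Remove the (unit-normalized) foreground direction from an image
    embedding and renormalize, leaving a background-focused embedding
    orthogonal to the foreground direction."""
    f = e_foreground / np.linalg.norm(e_foreground)
    e_bg = e_image - (e_image @ f) * f      # drop the foreground component
    return e_bg / np.linalg.norm(e_bg)

e_bg = background_embedding(np.array([1.0, 1.0]), np.array([1.0, 0.0]))
```

The result carries no component along the foreground direction, which is what lets a diffusion inpainter be conditioned away from hallucinating the removed object.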

[CV-34] Needle In A Video Haystack: A Scalable Synthetic Framework for Benchmarking Video MLLMs

链接: https://arxiv.org/abs/2406.09367
作者: Zijia Zhao,Haoyu Lu,Yuqi Huo,Yifan Du,Tongtian Yue,Longteng Guo,Bingning Wang,Weipeng Chen,Jing Liu
关键词: Video, crucial next step, models, VideoNIAH, Video understanding
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Video understanding is a crucial next step for multimodal large language models (MLLMs). To probe specific aspects of video understanding ability, existing video benchmarks typically require careful video selection based on the target capability, along with laborious annotation of query-response pairs to match the specific video content. This process is both challenging and resource-intensive. In this paper, we propose VideoNIAH (Video Needle In A Haystack), a benchmark construction framework through synthetic video generation. VideoNIAH decouples test video content from their query-responses by inserting unrelated image/text ‘needles’ into original videos. It generates annotations solely from these needles, ensuring diversity in video sources and a variety of query-responses. Additionally, by inserting multiple needles, VideoNIAH rigorously evaluates the temporal understanding capabilities of models. We utilized VideoNIAH to compile a video benchmark VNBench, including tasks such as retrieval, ordering, and counting. VNBench can efficiently evaluate the fine-grained understanding ability and spatio-temporal modeling ability of a video model, while also supporting the long-context evaluation. Additionally, we evaluated recent video-centric multimodal large language models (MLLMs), both open-source and proprietary, providing a comprehensive analysis. We found that although proprietary models have significant advantages over open-source models, all existing video models still perform poorly on long-distance dependency tasks. VideoNIAH is a simple yet highly scalable benchmark construction framework, and we believe it will inspire future video benchmark works. The code and data are available at this https URL.
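The VideoNIAH construction — insert unrelated "needles" into a video and derive annotations solely from the needles — is simple to sketch at the frame-list level. Frames are arbitrary objects here; real needles would be images or text rendered into frames.

```python
import random

def insert_needles(video_frames, needles, seed=0):
    """Insert needle frames at random slots and return the new frame
    list plus the ground truth (needle -> final index), generated
    entirely from the needles rather than the video content."""
    rng = random.Random(seed)
    frames = list(video_frames)
    slots = sorted(rng.sample(range(len(frames) + 1), len(needles)))
    # Insert back-to-front so earlier insertions do not shift later slots.
    for needle, slot in sorted(zip(needles, slots), key=lambda p: -p[1]):
        frames.insert(slot, needle)
    # The needle at the i-th smallest slot ends up at slot + i.
    annotation = {n: slot + i for i, (n, slot) in enumerate(zip(needles, slots))}
    return frames, annotation

haystack, truth = insert_needles(["f0", "f1", "f2", "f3"], ["N1", "N2"])
```

With several needles inserted, ordering and counting queries over them become temporal-understanding probes, exactly as in VNBench.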

[CV-35] Towards an Improved Understanding and Utilization of Maximum Manifold Capacity Representations

链接: https://arxiv.org/abs/2406.09366
作者: Rylan Schaeffer,Victor Lecomte,Dhruv Bhandarkar Pai,Andres Carranza,Berivan Isik,Alyssa Unell,Mikail Khona,Thomas Yerxa,Yann LeCun,SueYeon Chung,Andrey Gromov,Ravid Shwartz-Ziv,Sanmi Koyejo
关键词: Maximum Manifold Capacity, Manifold Capacity Representations, Capacity Representations, multi-view self-supervised learning, recent multi-view self-supervised
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC)
*备注:

点击查看摘要

Abstract:Maximum Manifold Capacity Representations (MMCR) is a recent multi-view self-supervised learning (MVSSL) method that matches or surpasses other leading MVSSL methods. MMCR is intriguing because it does not fit neatly into any of the commonplace MVSSL lineages, instead originating from a statistical mechanical perspective on the linear separability of data manifolds. In this paper, we seek to improve our understanding and our utilization of MMCR. To better understand MMCR, we leverage tools from high dimensional probability to demonstrate that MMCR incentivizes alignment and uniformity of learned embeddings. We then leverage tools from information theory to show that such embeddings maximize a well-known lower bound on mutual information between views, thereby connecting the geometric perspective of MMCR to the information-theoretic perspective commonly discussed in MVSSL. To better utilize MMCR, we mathematically predict and experimentally confirm non-monotonic changes in the pretraining loss akin to double descent but with respect to atypical hyperparameters. We also discover compute scaling laws that enable predicting the pretraining loss as a function of gradients steps, batch size, embedding dimension and number of views. We then show that MMCR, originally applied to image data, is performant on multimodal image-text data. By more deeply understanding the theoretical and empirical behavior of MMCR, our work reveals insights on improving MVSSL methods.
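The two properties the paper shows MMCR incentivizes — alignment and uniformity of embeddings — have standard quantitative definitions (Wang & Isola), sketched below. Note MMCR's own loss is a different (nuclear-norm-based) objective; these are just the diagnostic metrics.

```python
import numpy as np

def alignment(x, y, alpha=2):
    """Mean distance between positive pairs of L2-normalized
    embeddings (lower is better)."""
    return float((np.linalg.norm(x - y, axis=1) ** alpha).mean())

def uniformity(x, t=2):
    """Log of the mean pairwise Gaussian potential: lower means the
    embeddings are spread more evenly over the hypersphere."""
    d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
    i, j = np.triu_indices(len(x), k=1)
    return float(np.log(np.exp(-t * d2[i, j]).mean()))

views = np.array([[1.0, 0.0], [-1.0, 0.0]])  # antipodal unit vectors
```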

[CV-36] CMC-Bench: Towards a New Paradigm of Visual Signal Compression

链接: https://arxiv.org/abs/2406.09356
作者: Chunyi Li,Xiele Wu,Haoning Wu,Donghui Feng,Zicheng Zhang,Guo Lu,Xiongkuo Min,Xiaohong Liu,Guangtao Zhai,Weisi Lin
关键词: Cross Modality Compression, Large Multimodal Models, demanding topic, challenging and demanding, Large Multimodal
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:Ultra-low bitrate image compression is a challenging and demanding topic. With the development of Large Multimodal Models (LMMs), a Cross Modality Compression (CMC) paradigm of Image-Text-Image has emerged. Compared with traditional codecs, this semantic-level compression can reduce image data size to 0.1% or even lower, which has strong potential applications. However, CMC has certain defects in consistency with the original image and perceptual quality. To address this problem, we introduce CMC-Bench, a benchmark of the cooperative performance of Image-to-Text (I2T) and Text-to-Image (T2I) models for image compression. This benchmark covers 18,000 and 40,000 images respectively to verify 6 mainstream I2T and 12 T2I models, including 160,000 subjective preference scores annotated by human experts. At ultra-low bitrates, this paper proves that the combination of some I2T and T2I models has surpassed the most advanced visual signal codecs; meanwhile, it highlights where LMMs can be further optimized toward the compression task. We encourage LMM developers to participate in this test to promote the evolution of visual signal codec protocols.
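The "reduce image data size to 0.1% or even lower" claim is easy to make concrete: in the Image-Text-Image paradigm, the transmitted payload is just the caption. A back-of-the-envelope bitrate calculation (raw UTF-8, a conservative upper bound since a real system would entropy-code the text):

```python
def cmc_bits_per_pixel(caption: str, width: int, height: int) -> float:
    """Effective bitrate when an image is 'compressed' into a UTF-8
    caption, measured in bits per pixel."""
    return 8 * len(caption.encode("utf-8")) / (width * height)

# A 128-character caption standing in for a 1024x1024 image:
bpp = cmc_bits_per_pixel("x" * 128, 1024, 1024)
ratio_vs_raw_rgb = bpp / 24  # fraction of an uncompressed 24-bpp image
```

Even this crude accounting lands far below 0.1% of the raw RGB size, which is why semantic-level compression is so attractive at ultra-low bitrates.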

[CV-37] Enhancing Domain Adaptation through Prompt Gradient Alignment

链接: https://arxiv.org/abs/2406.09353
作者: Hoang Phan,Lam Tran,Quyen Tran,Trung Le
关键词: Prior Unsupervised Domain, Unsupervised Domain Adaptation, Prior Unsupervised, sufficiently discriminative features, learning sufficiently discriminative
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: 26 pages, 4 figures, 4 tables

点击查看摘要

Abstract:Prior Unsupervised Domain Adaptation (UDA) methods often aim to train a domain-invariant feature extractor, which may hinder the model from learning sufficiently discriminative features. To tackle this, a line of works based on prompt learning leverages the power of large-scale pre-trained vision-language models to learn both domain-invariant and specific features through a set of domain-agnostic and domain-specific learnable prompts. Those studies typically enforce invariant constraints on representation, output, or prompt space to learn such prompts. Differently, we cast UDA as a multiple-objective optimization problem in which each objective is represented by a domain loss. Under this new framework, we propose aligning per-objective gradients to foster consensus between them. Additionally, to prevent potential overfitting when fine-tuning this deep learning architecture, we penalize the norm of these gradients. To achieve these goals, we devise a practical gradient update procedure that can work under both single-source and multi-source UDA. Empirically, our method consistently surpasses other prompt-based baselines by a large margin on different UDA benchmarks.
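"Aligning per-objective gradients to foster consensus" plus a gradient-norm penalty can be sketched with a PCGrad-style surrogate: project out, for each pair, the component of one gradient that conflicts (has negative dot product) with another, then shrink the combined update. This is a schematic stand-in; the paper's actual procedure differs in detail.

```python
import numpy as np

def consensus_gradient(grads, norm_penalty=0.0):
    """Combine per-domain-objective gradients after removing pairwise
    conflicting components; optional shrinkage models a norm penalty."""
    projected = []
    for i, g in enumerate(grads):
        g = np.asarray(g, dtype=float).copy()
        for j, h in enumerate(grads):
            h = np.asarray(h, dtype=float)
            if i != j and g @ h < 0:
                g -= (g @ h) / (h @ h) * h   # drop the conflicting part
        projected.append(g)
    return (1.0 - norm_penalty) * np.mean(projected, axis=0)
```

Non-conflicting gradients pass through as a plain average; conflicting ones are rotated toward agreement before averaging.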

[CV-38] Memory-Efficient Sparse Pyramid Attention Networks for Whole Slide Image Analysis

Link: https://arxiv.org/abs/2406.09333
Authors: Weiyi Wu,Chongyang Gao,Xinwen Xu,Siting Li,Jiang Gui
Keywords: Slide Images, modern pathological diagnosis, pose significant computational, significant computational challenges, regions pose significant
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Whole Slide Images (WSIs) are crucial for modern pathological diagnosis, yet their gigapixel-scale resolutions and sparse informative regions pose significant computational challenges. Traditional dense attention mechanisms, widely used in computer vision and natural language processing, are impractical for WSI analysis due to the substantial data scale and the redundant processing of uninformative areas. To address these challenges, we propose Memory-Efficient Sparse Pyramid Attention Networks with Shifted Windows (SPAN), drawing inspiration from state-of-the-art sparse attention techniques in other domains. SPAN introduces a sparse pyramid attention architecture that hierarchically focuses on informative regions within the WSI, aiming to reduce memory overhead while preserving critical features. Additionally, the incorporation of shifted windows enables the model to capture long-range contextual dependencies essential for accurate classification. We evaluated SPAN on multiple public WSI datasets, observing its competitive performance. Unlike existing methods that often struggle to model spatial and contextual information due to memory constraints, our approach enables the accurate modeling of these crucial features. Our study also highlights the importance of key design elements in attention mechanisms, such as the shifted-window scheme and the hierarchical structure, which contribute substantially to the effectiveness of SPAN in WSI analysis. The potential of SPAN for memory-efficient and effective analysis of WSI data is thus demonstrated, and the code will be made publicly available following the publication of this work.
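The shifted-window partition at the heart of such architectures can be sketched in a few lines of NumPy; the sparse selection of informative windows that SPAN adds on top is omitted here, and window/shift sizes are illustrative:

```python
import numpy as np

def shifted_windows(feat, win=2, shift=1):
    """Partition an (H, W, C) feature map into non-overlapping win x win
    windows after a cyclic shift, as in shifted-window attention.
    Assumes H and W are divisible by win."""
    h, w, c = feat.shape
    shifted = np.roll(feat, (-shift, -shift), axis=(0, 1))  # cyclic shift
    windows = (shifted
               .reshape(h // win, win, w // win, win, c)
               .transpose(0, 2, 1, 3, 4)   # group window rows/cols together
               .reshape(-1, win * win, c))  # (num_windows, win*win, C)
    return windows
```

Attention is then computed independently inside each window; alternating shifted and unshifted layers lets information flow across window borders.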

[CV-39] PianoMotion10M: Dataset and Benchmark for Hand Motion Generation in Piano Performance

Link: https://arxiv.org/abs/2406.09326
Authors: Qijun Gan,Song Wang,Shengtao Wu,Jianke Zhu
Keywords: artificial intelligence techniques, received increasing attentions, instrument instructing systems, effective music instrument, music instrument instructing
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Comments: Codes and Dataset: this https URL

Abstract:Recently, artificial intelligence techniques for education have received increasing attention, yet designing effective music instrument instruction systems remains an open problem. Although key presses can be directly derived from sheet music, the transitional movements among key presses require more extensive guidance in piano performance. In this work, we construct a piano-hand motion generation benchmark to guide hand movements and fingerings for piano playing. To this end, we collect an annotated dataset, PianoMotion10M, consisting of 116 hours of piano playing videos from a bird’s-eye view with 10 million annotated hand poses. We also introduce a powerful baseline model that generates hand motions from piano audio through a position predictor and a position-guided gesture generator. Furthermore, a series of evaluation metrics are designed to assess the performance of the baseline model, including motion similarity, smoothness, positional accuracy of left and right hands, and overall fidelity of movement distribution. Although piano key presses with respect to music scores or audio are already accessible, PianoMotion10M aims to provide guidance on piano fingering for instruction purposes. The dataset and source code can be accessed at this https URL.

[CV-40] Vertical LoRA: Dense Expectation-Maximization Interpretation of Transformers

Link: https://arxiv.org/abs/2406.09315
Authors: Zhuolin Fu
Keywords: Bayesian Nets, dense Expectation-Maximization algorithms, Expectation-Maximization algorithms performed, performed on Bayesian, interpreted as dense
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:In this paper, we show how Transformers can be interpreted as dense Expectation-Maximization algorithms performed on Bayesian Nets. Based on the above interpretation, we propose a new model design paradigm, namely Vertical LoRA (VLoRA), which reduces the parameter count dramatically while preserving performance. In VLoRA, a model consists of layers, each of which recursively learns an increment based on the previous layer. We then apply LoRA decomposition to the increments. VLoRA works on the base model, which is orthogonal to LoRA, meaning they can be used together. We do experiments on various tasks and models. The results show that 1) with VLoRA, the Transformer model parameter count can be reduced dramatically and 2) the performance of the original model is preserved. The source code is available at this https URL
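The recursive low-rank increment idea is easy to demonstrate numerically. In the sketch below (dimensions, rank, and initialization are all illustrative assumptions, not the paper's configuration), only layer 0 stores a full weight matrix; every later layer stores just a rank-r factor pair:

```python
import numpy as np

rng = np.random.default_rng(0)
d, rank, n_layers = 64, 4, 6

# Layer 0 holds a full weight; each later layer stores only a low-rank
# increment (A_l, B_l) applied on top of the previous layer's weight.
W0 = rng.normal(size=(d, d)) / np.sqrt(d)
increments = [(rng.normal(size=(d, rank)) * 0.01,
               rng.normal(size=(rank, d)) * 0.01) for _ in range(n_layers - 1)]

def layer_weights():
    """Materialize per-layer weights: W_l = W_{l-1} + A_l @ B_l."""
    W = W0
    weights = [W]
    for A, B in increments:
        W = W + A @ B
        weights.append(W)
    return weights

# Parameter counts: full stack vs. base weight + low-rank increments.
full_params = n_layers * d * d
vlora_params = d * d + (n_layers - 1) * 2 * d * rank
```

Here the stored parameters shrink from n_layers·d² to d² + 2·(n_layers−1)·d·r, which is the source of the claimed reduction.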

[CV-41] Toffee: Efficient Million-Scale Dataset Construction for Subject-Driven Text-to-Image Generation

Link: https://arxiv.org/abs/2406.09305
Authors: Yufan Zhou,Ruiyi Zhang,Kaizhi Zheng,Nanxuan Zhao,Jiuxiang Gu,Zichao Wang,Xin Eric Wang,Tong Sun
Keywords: achieved superior performance, recent works, dataset, works have achieved, achieved superior
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:In subject-driven text-to-image generation, recent works have achieved superior performance by training the model on synthetic datasets containing numerous image pairs. Trained on these datasets, generative models can produce text-aligned images for a specific subject from an arbitrary testing image in a zero-shot manner. They even outperform methods which require additional fine-tuning on testing images. However, the cost of creating such datasets is prohibitive for most researchers. To generate a single training pair, current methods fine-tune a pre-trained text-to-image model on the subject image to capture fine-grained details, then use the fine-tuned model to create images for the same subject based on creative text prompts. Consequently, constructing a large-scale dataset with millions of subjects can require hundreds of thousands of GPU hours. To tackle this problem, we propose Toffee, an efficient method to construct datasets for subject-driven editing and generation. Specifically, our dataset construction does not need any subject-level fine-tuning. After pre-training two generative models, we are able to generate an infinite number of high-quality samples. We construct the first large-scale dataset for subject-driven image editing and generation, which contains 5 million image pairs, text prompts, and masks. Our dataset is 5 times the size of the previous largest dataset, yet our cost is tens of thousands of GPU hours lower. To test the proposed dataset, we also propose a model which is capable of both subject-driven image editing and generation. By simply training the model on our proposed dataset, it obtains competitive results, illustrating the effectiveness of the proposed dataset construction framework.

[CV-42] Parameter-Efficient Active Learning for Foundational models

Link: https://arxiv.org/abs/2406.09296
Authors: Athmanarayanan Lakshmi Narayanan,Ranganath Krishnan,Amrutha Machireddy,Mahesh Subedar
Keywords: Foundational vision transformer, Foundational vision, vision transformer models, vision transformer, shown impressive
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted for CVPR2024 Transformers for Vision Workshop

Abstract:Foundational vision transformer models have shown impressive few shot performance on many vision tasks. This research presents a novel investigation into the application of parameter efficient fine-tuning methods within an active learning (AL) framework, to advance the sampling selection process in extremely budget constrained classification tasks. The focus on image datasets, known for their out-of-distribution characteristics, adds a layer of complexity and relevance to our study. Through a detailed evaluation, we illustrate the improved AL performance on these challenging datasets, highlighting the strategic advantage of merging parameter efficient fine tuning methods with foundation models. This contributes to the broader discourse on optimizing AL strategies, presenting a promising avenue for future exploration in leveraging foundation models for efficient and effective data annotation in specialized domains.

[CV-43] AlignMMBench: Evaluating Chinese Multimodal Alignment in Large Vision-Language Models

Link: https://arxiv.org/abs/2406.09295
Authors: Yuhang Wu,Wenmeng Yu,Yean Cheng,Yan Wang,Xiaohan Zhang,Jiazheng Xu,Ming Ding,Yuxiao Dong
Keywords: large Vision-Language Models, Vision-Language Models, helpful assistants, large Vision-Language, essential for determining
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Evaluating the alignment capabilities of large Vision-Language Models (VLMs) is essential for determining their effectiveness as helpful assistants. However, existing benchmarks primarily focus on basic abilities using nonverbal methods, such as yes-no and multiple-choice questions. In this paper, we address this gap by introducing AlignMMBench, a comprehensive alignment benchmark specifically designed for emerging Chinese VLMs. This benchmark is meticulously curated from real-world scenarios and Chinese Internet sources, encompassing thirteen specific tasks across three categories, and includes both single-turn and multi-turn dialogue scenarios. Incorporating a prompt rewrite strategy, AlignMMBench encompasses 1,054 images and 4,978 question-answer pairs. To facilitate the evaluation pipeline, we propose CritiqueVLM, a rule-calibrated evaluator that exceeds GPT-4’s evaluation ability. Finally, we report the performance of representative VLMs on AlignMMBench, offering insights into the capabilities and limitations of different VLM architectures. All evaluation codes and data are available on this https URL.

[CV-44] You Don't Need Data-Augmentation in Self-Supervised Learning

Link: https://arxiv.org/abs/2406.09294
Authors: Théo Moutakanni,Maxime Oquab,Marc Szafraniec,Maria Vakalopoulou,Piotr Bojanowski
Keywords: Joint-Embedding Predictive Architectures, led to outstanding, Predictive Architectures, outstanding performances, Joint-Embedding Architectures
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Self-Supervised learning (SSL) with Joint-Embedding Architectures (JEA) has led to outstanding performances. All instantiations of this paradigm were trained using strong and well-established hand-crafted data augmentations, leading to the general belief that they are required for the proper training and performance of such models. On the other hand, generative reconstruction-based models such as BEIT and MAE or Joint-Embedding Predictive Architectures such as I-JEPA have shown strong performance without using data augmentations except masking. In this work, we challenge the importance of invariance and data-augmentation in JEAs at scale. By running a case-study on a recent SSL foundation model - DINOv2 - we show that strong image representations can be obtained with JEAs and only cropping without resizing provided the training data is large enough, reaching state-of-the-art results and using the least amount of augmentation in the literature. Through this study, we also discuss the impact of compute constraints on the outcomes of experimental deep learning research, showing that they can lead to very different conclusions.

[CV-45] StableMaterials: Enhancing Diversity in Material Generation via Semi-Supervised Learning

Link: https://arxiv.org/abs/2406.09293
Authors: Giuseppe Vecchio
Keywords: photorealistic physical-based rendering, generating photorealistic physical-based, integrate semi-supervised learning, physical-based rendering, generating photorealistic
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Abstract:We introduce StableMaterials, a novel approach for generating photorealistic physical-based rendering (PBR) materials that integrate semi-supervised learning with Latent Diffusion Models (LDMs). Our method employs adversarial training to distill knowledge from existing large-scale image generation models, minimizing the reliance on annotated data and enhancing the diversity in generation. This distillation approach aligns the distribution of the generated materials with that of image textures from an SDXL model, enabling the generation of novel materials that are not present in the initial training dataset. Furthermore, we employ a diffusion-based refiner model to improve the visual quality of the samples and achieve high-resolution generation. Finally, we distill a latent consistency model for fast generation in just four steps and propose a new tileability technique that removes visual artifacts typically associated with fewer diffusion steps. We detail the architecture and training process of StableMaterials, the integration of semi-supervised training within existing LDM frameworks and show the advantages of our approach. Comparative evaluations with state-of-the-art methods show the effectiveness of StableMaterials, highlighting its potential applications in computer graphics and beyond. StableMaterials is publicly available at this https URL.

[CV-46] Neural Assets: 3D-Aware Multi-Object Scene Synthesis with Image Diffusion Models

Link: https://arxiv.org/abs/2406.09292
Authors: Ziyi Wu,Yulia Rubanova,Rishabh Kabra,Drew A. Hudson,Igor Gilitschenski,Yusuf Aytar,Sjoerd van Steenkiste,Kelsey R. Allen,Thomas Kipf
Keywords: Neural Assets, address the problem, Assets, Neural, pose
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Additional details and video results are available at this https URL

Abstract:We address the problem of multi-object 3D pose control in image diffusion models. Instead of conditioning on a sequence of text tokens, we propose to use a set of per-object representations, Neural Assets, to control the 3D pose of individual objects in a scene. Neural Assets are obtained by pooling visual representations of objects from a reference image, such as a frame in a video, and are trained to reconstruct the respective objects in a different image, e.g., a later frame in the video. Importantly, we encode object visuals from the reference image while conditioning on object poses from the target frame. This enables learning disentangled appearance and pose features. Combining visual and 3D pose representations in a sequence-of-tokens format allows us to keep the text-to-image architecture of existing models, with Neural Assets in place of text tokens. By fine-tuning a pre-trained text-to-image diffusion model with this information, our approach enables fine-grained 3D pose and placement control of individual objects in a scene. We further demonstrate that Neural Assets can be transferred and recomposed across different scenes. Our model achieves state-of-the-art multi-object editing results on both synthetic 3D scene datasets, as well as two real-world video datasets (Objectron, Waymo Open).
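The token construction described above can be sketched abstractly: pool an appearance vector for each object from the reference frame and pair it with that object's pose from the target frame. The pooling choice (a mean) and all dimensions here are assumptions for illustration, not the paper's implementation:

```python
import numpy as np

def neural_asset_tokens(obj_feats_ref, obj_poses_tgt):
    """Build per-object conditioning tokens: appearance pooled from the
    reference frame, concatenated with the target-frame pose vector.
    Illustrative sketch; the paper's encoder and pose encoding differ."""
    tokens = []
    for feat, pose in zip(obj_feats_ref, obj_poses_tgt):
        appearance = feat.mean(axis=0)               # pool the object's features
        tokens.append(np.concatenate([appearance, pose]))
    return np.stack(tokens)                          # (num_objects, d_app + d_pose)
```

Because appearance comes from one frame and pose from another, the two factors are forced apart during training, which is what enables pose editing at test time.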

[CV-47] Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos

Link: https://arxiv.org/abs/2406.09272
Authors: Changan Chen,Puyuan Peng,Ami Baid,Zihui Xue,Wei-Ning Hsu,David Harwath,Kristen Grauman
Keywords: Generating realistic audio, Generating realistic, creating sound effects, virtual reality games, human interactions
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Project page: this https URL

Abstract:Generating realistic audio for human interactions is important for many applications, such as creating sound effects for films or virtual reality games. Existing approaches implicitly assume total correspondence between the video and audio during training, yet many sounds happen off-screen and have weak to no correspondence with the visuals – resulting in uncontrolled ambient sounds or hallucinations at test time. We propose a novel ambient-aware audio generation model, AV-LDM. We devise a novel audio-conditioning mechanism to learn to disentangle foreground action sounds from the ambient background sounds in in-the-wild training videos. Given a novel silent video, our model uses retrieval-augmented generation to create audio that matches the visual content both semantically and temporally. We train and evaluate our model on two in-the-wild egocentric video datasets Ego4D and EPIC-KITCHENS. Our model outperforms an array of existing methods, allows controllable generation of the ambient sound, and even shows promise for generalizing to computer graphics game clips. Overall, our work is the first to focus video-to-audio generation faithfully on the observed visual content despite training from uncurated clips with natural background sounds.

[CV-48] Deep Transformer Network for Monocular Pose Estimation of Ship-Based UAV

Link: https://arxiv.org/abs/2406.09260
Authors: Maneesha Wickramasuriya,Taeyoung Lee,Murray Snyder
Keywords: Unmanned Aerial Vehicle, Aerial Vehicle, Unmanned Aerial, deep transformer network, Transformer Neural Network
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO); Image and Video Processing (eess.IV)
Comments: 23 pages, 25 figures, 3 tables

Abstract:This paper introduces a deep transformer network for estimating the relative 6D pose of a Unmanned Aerial Vehicle (UAV) with respect to a ship using monocular images. A synthetic dataset of ship images is created and annotated with 2D keypoints of multiple ship parts. A Transformer Neural Network model is trained to detect these keypoints and estimate the 6D pose of each part. The estimates are integrated using Bayesian fusion. The model is tested on synthetic data and in-situ flight experiments, demonstrating robustness and accuracy in various lighting conditions. The position estimation error is approximately 0.8% and 1.0% of the distance to the ship for the synthetic data and the flight experiments, respectively. The method has potential applications for ship-based autonomous UAV landing and navigation.
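The per-part estimates are combined with Bayesian fusion; for Gaussian estimates this is the standard inverse-covariance weighting rule, sketched below (a generic textbook formulation, not the paper's code):

```python
import numpy as np

def fuse_estimates(means, covs):
    """Fuse independent Gaussian estimates (e.g., one pose per detected ship
    part) by inverse-covariance weighting: more certain parts count more."""
    info = sum(np.linalg.inv(C) for C in covs)        # total information matrix
    fused_cov = np.linalg.inv(info)
    fused_mean = fused_cov @ sum(np.linalg.inv(C) @ m
                                 for m, C in zip(means, covs))
    return fused_mean, fused_cov
```

Two equally confident estimates average, and the fused covariance shrinks, reflecting the gain from combining multiple keypoint-derived poses.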

[CV-49] Assessing Model Generalization in Vicinity

Link: https://arxiv.org/abs/2406.09257
Authors: Yuchi Liu,Yifan Sun,Jingdong Wang,Liang Zheng
Keywords: ground truth labels, paper evaluates, ability of classification, depending on ground, ground truth
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Abstract:This paper evaluates the generalization ability of classification models on out-of-distribution test sets without depending on ground truth labels. Common approaches often calculate an unsupervised metric related to a specific model property, like confidence or invariance, which correlates with out-of-distribution accuracy. However, these metrics are typically computed for each test sample individually, leading to potential issues caused by spurious model responses, such as overly high or low confidence. To tackle this challenge, we propose incorporating responses from neighboring test samples into the correctness assessment of each individual sample. In essence, if a model consistently demonstrates high correctness scores for nearby samples, it increases the likelihood of correctly predicting the target sample, and vice versa. The resulting scores are then averaged across all test samples to provide a holistic indication of model accuracy. Developed under the vicinal risk formulation, this approach, named vicinal risk proxy (VRP), computes accuracy without relying on labels. We show that applying the VRP method to existing generalization indicators, such as average confidence and effective invariance, consistently improves over these baselines both methodologically and experimentally. This yields a stronger correlation with model accuracy, especially on challenging out-of-distribution test sets.
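A simplified reading of the vicinal smoothing step can be sketched as follows: replace each sample's unsupervised correctness score with the mean over its feature-space neighbours, then average for a dataset-level accuracy proxy. The brute-force distance computation and k value are illustrative:

```python
import numpy as np

def vicinal_scores(feats, scores, k=3):
    """Smooth per-sample correctness scores over each sample's k nearest
    neighbours in feature space (self included), then average across the set.
    Simplified sketch of the vicinal-risk-proxy idea, not the paper's code."""
    d = ((feats[:, None, :] - feats[None, :, :]) ** 2).sum(-1)  # pairwise dists
    nbrs = np.argsort(d, axis=1)[:, :k]
    smoothed = scores[nbrs].mean(axis=1)        # per-sample vicinal score
    return smoothed, smoothed.mean()            # dataset-level proxy
```

Spuriously high or low individual scores get averaged away when their neighbours disagree, which is the intended robustness gain.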

[CV-50] MirrorCheck: Efficient Adversarial Defense for Vision-Language Models

Link: https://arxiv.org/abs/2406.09250
Authors: Samar Fares,Klea Ziu,Toluwani Aremu,Nikita Durasov,Martin Takáč,Pascal Fua,Karthik Nandakumar,Ivan Laptev
Keywords: increasingly vulnerable, Vision-Language Models, adversarial, adversarial threats, VLMs
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Vision-Language Models (VLMs) are becoming increasingly vulnerable to adversarial attacks as various novel attack strategies are being proposed against these models. While existing defenses excel in unimodal contexts, they currently fall short in safeguarding VLMs against adversarial threats. To mitigate this vulnerability, we propose a novel, yet elegantly simple approach for detecting adversarial samples in VLMs. Our method leverages Text-to-Image (T2I) models to generate images based on captions produced by target VLMs. Subsequently, we calculate the similarities of the embeddings of both input and generated images in the feature space to identify adversarial samples. Empirical evaluations conducted on different datasets validate the efficacy of our approach, outperforming baseline methods adapted from image classification domains. Furthermore, we extend our methodology to classification tasks, showcasing its adaptability and model-agnostic nature. Theoretical analyses and empirical findings also show the resilience of our approach against adaptive attacks, positioning it as an excellent defense mechanism for real-world deployment against adversarial threats.
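Once the T2I model has generated an image from the VLM's caption, the decision reduces to an embedding-similarity threshold. The sketch below assumes precomputed embeddings and a placeholder threshold; the paper uses real T2I and feature-extraction models:

```python
import numpy as np

def mirrorcheck_flag(img_emb, gen_emb, threshold=0.5):
    """Flag an input as adversarial when the embedding of the image generated
    from the caption is dissimilar to the input image's embedding.
    Threshold and embeddings are placeholders for illustration."""
    cos = float(img_emb @ gen_emb /
                (np.linalg.norm(img_emb) * np.linalg.norm(gen_emb)))
    return cos < threshold, cos
```

The intuition: an adversarial image yields a caption describing something other than its visual content, so the regenerated image lands far away in feature space.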

[CV-51] Comparison Visual Instruction Tuning

Link: https://arxiv.org/abs/2406.09240
Authors: Wei Lin,Muhammad Jehanzeb Mirza,Sivan Doveh,Rogerio Feris,Raja Giryes,Sepp Hochreiter,Leonid Karlinsky
Keywords: Commonalities and Differences, terms of Commonalities, advanced visual reasoning, Large Multimodal Models, reasoning and interpretation
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL ; Huggingface dataset repo: this https URL

Abstract:Comparing two images in terms of Commonalities and Differences (CaD) is a fundamental human capability that forms the basis of advanced visual reasoning and interpretation. It is essential for the generation of detailed and contextually relevant descriptions, performing comparative analysis, novelty detection, and making informed decisions based on visual data. However, surprisingly, little attention has been given to these fundamental concepts in the best current mimic of human visual intelligence - Large Multimodal Models (LMMs). We develop and contribute a new two-phase approach CaD-VI for collecting synthetic visual instructions, together with an instruction-following dataset CaD-Inst containing 349K image pairs with CaD instructions collected using CaD-VI. Our approach significantly improves the CaD spotting capabilities in LMMs, advancing the SOTA on a diverse set of related tasks by up to 17.5%. It is also complementary to existing difference-only instruction datasets, allowing automatic targeted refinement of those resources increasing their effectiveness for CaD tuning by up to 10%. Additionally, we propose an evaluation benchmark with 7.5K open-ended QAs to assess the CaD understanding abilities of LMMs.

[CV-52] MGRQ: Post-Training Quantization For Vision Transformer With Mixed Granularity Reconstruction

Link: https://arxiv.org/abs/2406.09229
Authors: Lianwei Yang,Zhikai Li,Junrui Xiao,Haisong Gong,Qingyi Gu
Keywords: efficiently compresses vision, Mixed Granularity Reconstruction, Post-training quantization, Mixed Granularity, Global Supervision
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by 2024 IEEE International Conference on Image Processing

Abstract:Post-training quantization (PTQ) efficiently compresses vision models, but unfortunately, it accompanies a certain degree of accuracy degradation. Reconstruction methods aim to enhance model performance by narrowing the gap between the quantized model and the full-precision model, often yielding promising results. However, efforts to significantly improve the performance of PTQ through reconstruction in the Vision Transformer (ViT) have shown limited efficacy. In this paper, we conduct a thorough analysis of the reasons for this limited effectiveness and propose MGRQ (Mixed Granularity Reconstruction Quantization) as a solution to address this issue. Unlike previous reconstruction schemes, MGRQ introduces a mixed granularity reconstruction approach. Specifically, MGRQ enhances the performance of PTQ by introducing Extra-Block Global Supervision and Intra-Block Local Supervision, building upon Optimized Block-wise Reconstruction. Extra-Block Global Supervision considers the relationship between block outputs and the model’s output, aiding block-wise reconstruction through global supervision. Meanwhile, Intra-Block Local Supervision reduces generalization errors by aligning the distribution of outputs at each layer within a block. Subsequently, MGRQ is further optimized for reconstruction through Mixed Granularity Loss Fusion. Extensive experiments conducted on various ViT models illustrate the effectiveness of MGRQ. Notably, MGRQ demonstrates robust performance in low-bit quantization, thereby enhancing the practicality of the quantized model.
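The two ingredients above can be sketched in a few lines: a uniform symmetric quantizer (the basic PTQ building block) and a fused reconstruction loss combining a block-level (global) term with per-layer (local) terms. The weighting and loss forms are simplified stand-ins for MGRQ's supervision scheme, not its implementation:

```python
import numpy as np

def quantize(x, bits=4):
    """Uniform symmetric quantization of a tensor (basic PTQ building block)."""
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1)
    q = np.round(x / scale).clip(-(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
    return q * scale

def mixed_granularity_loss(fp_block_out, q_block_out,
                           fp_layer_outs, q_layer_outs,
                           w_global=1.0, w_local=0.5):
    """Fuse block-level (global) and per-layer (local) reconstruction terms,
    a simplified stand-in for MGRQ's mixed-granularity supervision."""
    global_term = np.mean((fp_block_out - q_block_out) ** 2)
    local_term = np.mean([np.mean((f - q) ** 2)
                          for f, q in zip(fp_layer_outs, q_layer_outs)])
    return w_global * global_term + w_local * local_term
```

Minimizing the fused loss nudges the quantized block toward both the full-precision block output and the full-precision per-layer distributions.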

[CV-53] WildlifeReID-10k: Wildlife re-identification dataset with 10k individual animals

Link: https://arxiv.org/abs/2406.09211
Authors: Lukáš Adam,Vojtěch Čermák,Kostas Papafitsoros,Lukas Picek
Keywords: wildlife re-identification, existing wildlife re-identification, wildlife re-identification dataset, individual animals, re-identification
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:We introduce a new wildlife re-identification dataset WildlifeReID-10k with more than 214k images of 10k individual animals. It is a collection of 30 existing wildlife re-identification datasets with additional processing steps. WildlifeReID-10k contains animals as diverse as marine turtles, primates, birds, African herbivores, marine mammals and domestic animals. Due to the ubiquity of similar images in datasets, we argue that the standard (random) splits into training and testing sets are inadequate for wildlife re-identification and propose a new similarity-aware split based on the similarity of extracted features. To promote fair method comparison, we include similarity-aware splits both for closed-set and open-set settings, use MegaDescriptor - a foundational model for wildlife re-identification - for baseline performance and host a leaderboard with the best results. We publicly publish the dataset and the codes used to create it in the wildlife-datasets library, making WildlifeReID-10k both highly curated and easy to use.
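A similarity-aware split can be sketched greedily: group images whose extracted features are nearly identical, then assign whole groups to train or test so near-duplicates never straddle the split. The grouping rule and threshold below are illustrative, not the wildlife-datasets library's exact procedure:

```python
import numpy as np

def similarity_aware_split(feats, test_fraction=0.3, threshold=0.9):
    """Greedy similarity-aware split on an (N, D) feature matrix.
    Groups samples whose cosine similarity to a seed exceeds the threshold,
    then fills the test set with whole groups up to the requested fraction."""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = f @ f.T
    group = -np.ones(len(feats), dtype=int)
    g = 0
    for i in range(len(feats)):
        if group[i] == -1:                       # new seed: claim its neighbours
            group[(sim[i] > threshold) & (group == -1)] = g
            g += 1
    is_test = np.zeros(len(feats), dtype=bool)
    budget = test_fraction * len(feats)
    for gi in range(g):
        idx = group == gi
        if is_test.sum() + idx.sum() <= budget:  # whole group or nothing
            is_test[idx] = True
    return is_test, group
```

A random split would happily put one of two near-duplicate photos in train and the other in test, inflating accuracy; grouping first removes that leak.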

[CV-54] Optimizing Visual Question Answering Models for Driving: Bridging the Gap Between Human and Machine Attention Patterns

Link: https://arxiv.org/abs/2406.09203
Authors: Kaavya Rekanar,Martin Hayes,Ganesh Sistu,Ciaran Eising
Keywords: Visual Question Answering, analyze visual inputs, fostering natural interaction, autonomous driving systems, visual inputs alongside
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Visual Question Answering (VQA) models play a critical role in enhancing the perception capabilities of autonomous driving systems by allowing vehicles to analyze visual inputs alongside textual queries, fostering natural interaction and trust between the vehicle and its occupants or other road users. This study investigates the attention patterns of humans compared to a VQA model when answering driving-related questions, revealing disparities in the objects observed. We propose an approach integrating filters to optimize the model’s attention mechanisms, prioritizing relevant objects and improving accuracy. Utilizing the LXMERT model for a case study, we compare attention patterns of the pre-trained and Filter Integrated models, alongside human answers using images from the NuImages dataset, gaining insights into feature prioritization. We evaluated the models using a Subjective scoring framework which shows that the integration of the feature encoder filter has enhanced the performance of the VQA model by refining its attention mechanisms.

[CV-55] Enhanced Object Detection: A Study on Vast Vocabulary Object Detection Track for V3Det Challenge 2024

Link: https://arxiv.org/abs/2406.09201
Authors: Peixi Wu,Bosong Chai,Xuan Nie,Longquan Yan,Zeyu Wang,Qifan Zhou,Boning Wang
Keywords: Vocabulary Visual Detection, Vast Vocabulary Visual, Visual Detection task, Visual Detection, Vocabulary Object Detection
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:In this technical report, we present our findings from the research conducted on the Vast Vocabulary Visual Detection (V3Det) dataset for Supervised Vast Vocabulary Visual Detection task. How to deal with complex categories and detection boxes has become a difficulty in this track. The original supervised detector is not suitable for this task. We have designed a series of improvements, including adjustments to the network structure, changes to the loss function, and design of training strategies. Our model has shown improvement over the baseline and achieved excellent rankings on the Leaderboard for both the Vast Vocabulary Object Detection (Supervised) track and the Open Vocabulary Object Detection (OVD) track of the V3Det Challenge 2024.

[CV-56] CLIP-Driven Cloth-Agnostic Feature Learning for Cloth-Changing Person Re-Identification

Link: https://arxiv.org/abs/2406.09198
Authors: Shuang Li,Jiaxu Leng,Guozhang Li,Ji Gan,Haosheng Chen,Xinbo Gao
Keywords: short-term Person Re-Identification, Cloth-Changing Person Re-Identification, Contrastive Language-Image Pre-Training, Person Re-Identification, faces challenges due
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Contrastive Language-Image Pre-Training (CLIP) has shown impressive performance in short-term Person Re-Identification (ReID) due to its ability to extract high-level semantic features of pedestrians, yet its direct application to Cloth-Changing Person Re-Identification (CC-ReID) faces challenges due to CLIP’s image encoder overly focusing on clothes clues. To address this, we propose a novel framework called CLIP-Driven Cloth-Agnostic Feature Learning (CCAF) for CC-ReID. Accordingly, two modules were custom-designed: the Invariant Feature Prompting (IFP) and the Clothes Feature Minimization (CFM). These modules guide the model to extract cloth-agnostic features positively and attenuate clothes-related features negatively. Specifically, IFP is designed to extract fine-grained semantic features unrelated to clothes from the raw image, guided by the cloth-agnostic text prompts. This module first covers the clothes in the raw image at the pixel level to obtain the shielding image and then utilizes CLIP’s knowledge to generate cloth-agnostic text prompts. Subsequently, it aligns the raw image-text and the raw image-shielding image in the feature space, emphasizing discriminative clues related to identity but unrelated to clothes. Furthermore, CFM is designed to examine and weaken the image encoder’s ability to extract clothes features. It first generates text prompts corresponding to clothes pixels. Then, guided by these clothes text prompts, it iteratively examines and disentangles clothes features from pedestrian features, ultimately retaining inherent discriminative features. Extensive experiments have demonstrated the effectiveness of the proposed CCAF, achieving new state-of-the-art performance on several popular CC-ReID benchmarks without any additional inference time.

[CV-57] Adaptive Slot Attention: Object Discovery with Dynamic Slot Number

链接: https://arxiv.org/abs/2406.09196
作者: Ke Fan,Zechen Bai,Tianjun Xiao,Tong He,Max Horn,Yanwei Fu,Francesco Locatello,Zheng Zhang
关键词: low-level perceptual features, abstracting low-level perceptual, perceptual features, exceptional blend, blend of flexibility
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: CVPR 2024

点击查看摘要

Abstract:Object-centric learning (OCL) extracts the representation of objects with slots, offering an exceptional blend of flexibility and interpretability for abstracting low-level perceptual features. A widely adopted method within OCL is slot attention, which utilizes attention mechanisms to iteratively refine slot representations. However, a major drawback of most object-centric models, including slot attention, is their reliance on predefining the number of slots. This not only necessitates prior knowledge of the dataset but also overlooks the inherent variability in the number of objects present in each instance. To overcome this fundamental limitation, we present a novel complexity-aware object auto-encoder framework. Within this framework, we introduce an adaptive slot attention (AdaSlot) mechanism that dynamically determines the optimal number of slots based on the content of the data. This is achieved by proposing a discrete slot sampling module that is responsible for selecting an appropriate number of slots from a candidate list. Furthermore, we introduce a masked slot decoder that suppresses unselected slots during the decoding process. Our framework, tested extensively on object discovery tasks with various datasets, shows performance matching or exceeding top fixed-slot models. Moreover, our analysis substantiates that our method exhibits the capability to dynamically adapt the slot number according to each instance’s complexity, offering the potential for further exploration in slot attention research. Project will be available at this https URL
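For context, the fixed-K slot attention update that AdaSlot builds on can be sketched in a few lines of numpy. The learned q/k/v projections and GRU update of the real method, as well as AdaSlot's discrete slot-sampling module and masked decoder, are omitted here:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention_step(slots, inputs, eps=1e-8):
    """One simplified iteration of vanilla slot attention.

    slots: (K, D) current slot vectors; inputs: (N, D) perceptual features.
    Attention is normalized over slots (so slots compete for inputs),
    then each slot is updated as a weighted mean of its assigned inputs.
    """
    attn = softmax(inputs @ slots.T / np.sqrt(slots.shape[1]), axis=1)
    attn = attn / (attn.sum(axis=0, keepdims=True) + eps)  # per-slot normalization
    return attn.T @ inputs                                 # weighted-mean update

rng = np.random.default_rng(0)
inputs = rng.normal(size=(64, 32))   # 64 perceptual features of dim 32
slots = rng.normal(size=(4, 32))     # K = 4 slots, fixed in vanilla slot attention
for _ in range(3):
    slots = slot_attention_step(slots, inputs)
```

AdaSlot's contribution is precisely to make K data-dependent instead of the fixed value used above.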

[CV-58] Reducing Task Discrepancy of Text Encoders for Zero-Shot Composed Image Retrieval

链接: https://arxiv.org/abs/2406.09188
作者: Jaeseok Byun,Seokhyeon Jeong,Wonjae Kim,Sanghyuk Chun,Taesup Moon
关键词: enabling controllable searches, Composed Image Retrieval, Image, text, CLIP image
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
*备注: 17 pages

点击查看摘要

Abstract:Composed Image Retrieval (CIR) aims to retrieve a target image based on a reference image and conditioning text, enabling controllable searches. Due to the expensive dataset construction cost for CIR triplets, a zero-shot (ZS) CIR setting has been actively studied to eliminate the need for human-collected triplet datasets. The mainstream of ZS-CIR employs an efficient projection module that projects a CLIP image embedding to the CLIP text token embedding space, while fixing the CLIP encoders. Using the projected image embedding, these methods generate image-text composed features by using the pre-trained text encoder. However, their CLIP image and text encoders suffer from the task discrepancy between the pre-training task (text ↔ image) and the target CIR task (image + text ↔ image). Conceptually, we need expensive triplet samples to reduce the discrepancy, but we use cheap text triplets instead and update the text encoder. To that end, we introduce the Reducing Task Discrepancy of text encoders for Composed Image Retrieval (RTD), a plug-and-play training scheme for the text encoder that enhances its capability using a novel target-anchored text contrastive learning. We also propose two additional techniques to improve the proposed learning scheme: a hard negatives-based refined batch sampling strategy and a sophisticated concatenation scheme. Integrating RTD into the state-of-the-art projection-based ZS-CIR methods significantly improves performance across various datasets and backbones, demonstrating its efficiency and generalizability.

[CV-59] Thoracic Surgery Video Analysis for Surgical Phase Recognition

链接: https://arxiv.org/abs/2406.09185
作者: Syed Abdul Mateen,Niharika Malvia,Syed Abdul Khader,Danny Wang,Deepti Srinivasan,Chi-Fu Jeffrey Yang,Lana Schumacher,Sandeep Manjanna
关键词: surgical phase recognition, automated workflow analysis, phase recognition, surgical phase, aiming to provide
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 2 pages, 2 figures

点击查看摘要

Abstract:This paper presents an approach for surgical phase recognition using video data, aiming to provide a comprehensive understanding of surgical procedures for automated workflow analysis. The advent of robotic surgery, digitized operating rooms, and the generation of vast amounts of data have opened doors for the application of machine learning and computer vision in the analysis of surgical videos. Among these advancements, Surgical Phase Recognition (SPR) stands out as an emerging technology that has the potential to recognize and assess the ongoing surgical scenario, summarize the surgery, evaluate surgical skills, offer surgical decision support, and facilitate medical training. In this paper, we analyse and evaluate both frame-based and video clipping-based phase recognition on a thoracic surgery dataset consisting of 11 classes of phases. Specifically, we utilize ImageNet ViT for image-based classification and VideoMAE as the baseline model for video-based classification. We show that Masked Video Distillation (MVD) exhibits superior performance, achieving a top-1 accuracy of 72.9%, compared to 52.31% achieved by ImageNet ViT. These findings underscore the efficacy of video-based classifiers over their image-based counterparts in surgical phase recognition tasks.

[CV-60] A Large-scale Universal Evaluation Benchmark For Face Forgery Detection

链接: https://arxiv.org/abs/2406.09181
作者: Yijun Bei,Hengrui Lou,Jinsong Geng,Erteng Liu,Lechao Cheng,Jie Song,Mingli Song,Zunlei Feng
关键词: deceive human visual, human visual perception, realistic fake facial, face forgery detection, forgery detection
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: This is a paper about constructing a large-scale universal evaluation benchmark for face forgery detection.The main text is 9 pages and the full text is 30 pages

点击查看摘要

Abstract:With the rapid development of AI-generated content (AIGC) technology, the production of realistic fake facial images and videos that deceive human visual perception has become possible. Consequently, various face forgery detection techniques have been proposed to identify such fake facial content. However, evaluating the effectiveness and generalizability of these detection techniques remains a significant challenge. To address this, we have constructed a large-scale evaluation benchmark called DeepFaceGen, aimed at quantitatively assessing the effectiveness of face forgery detection and facilitating the iterative development of forgery detection technology. DeepFaceGen consists of 776,990 real face image/video samples and 773,812 face forgery image/video samples, generated using 34 mainstream face generation techniques. During the construction process, we carefully consider important factors such as content diversity, fairness across ethnicities, and availability of comprehensive labels, in order to ensure the versatility and convenience of DeepFaceGen. Subsequently, DeepFaceGen is employed in this study to evaluate and analyze the performance of 13 mainstream face forgery detection techniques from various perspectives. Through extensive experimental analysis, we derive significant findings and propose potential directions for future research. The code and dataset for DeepFaceGen are available at https://anonymous.4open.science/r/DeepFaceGen-47D1.

[CV-61] ReMI: A Dataset for Reasoning with Multiple Images

链接: https://arxiv.org/abs/2406.09175
作者: Mehran Kazemi,Nishanth Dikkala,Ankit Anand,Petar Devic,Ishita Dasgupta,Fangyu Liu,Bahare Fatemi,Pranjal Awasthi,Dee Guo,Sreenivas Gollapudi,Ahmed Qureshi
关键词: large language models, continuous advancement, advancement of large, large language, essential to create
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:With the continuous advancement of large language models (LLMs), it is essential to create new benchmarks to effectively evaluate their expanding capabilities and identify areas for improvement. This work focuses on multi-image reasoning, an emerging capability in state-of-the-art LLMs. We introduce ReMI, a dataset designed to assess LLMs’ ability to Reason with Multiple Images. This dataset encompasses a diverse range of tasks, spanning various reasoning domains such as math, physics, logic, code, table/chart understanding, and spatial and temporal reasoning. It also covers a broad spectrum of characteristics found in multi-image reasoning scenarios. We have benchmarked several cutting-edge LLMs using ReMI and found a substantial gap between their performance and human-level proficiency. This highlights the challenges in multi-image reasoning and the need for further research. Our analysis also reveals the strengths and weaknesses of different models, shedding light on the types of reasoning that are currently attainable and areas where future models require improvement. To foster further research in this area, we are releasing ReMI publicly: this https URL.

[CV-62] Fine-Grained Domain Generalization with Feature Structuralization

链接: https://arxiv.org/abs/2406.09166
作者: Wenlong Yu,Dongyue Chen,Qilong Wang,Qinghua Hu
关键词: Fine-grained domain generalization, Structuralized Domain Generalization, large intra-class disparities, challenging task due, small inter-class variations
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Fine-grained domain generalization (FGDG) is a more challenging task due to its small inter-class variations and relatively large intra-class disparities. When domain distribution changes, the fragility of subtle features leads to a pronounced deterioration in model performance. Nevertheless, humans inherently demonstrate the capacity for generalizing to out-of-distribution data, leveraging structured multi-granularity knowledge that emerges from discerning both the commonality and specificity within categories. Likewise, we propose a Feature Structuralized Domain Generalization (FSDG) model, wherein features experience structuralization into common, specific, and confounding segments, harmoniously aligned with their relevant semantic concepts, to elevate performance in FGDG. Specifically, feature structuralization (FS) is achieved through a decorrelation function on disentangled segments, constraints on common feature consistency, specific feature distinctiveness, and a prediction calibration operation across granularities. By imposing these stipulations, FSDG is prompted to disentangle and align features based on multi-granularity knowledge, facilitating robust subtle distinctions among categories. Extensive experimentation on three benchmarks consistently validates the superiority of FSDG over state-of-the-art counterparts, with an average improvement of 6.1% in terms of FGDG performance. Beyond that, the explainability analysis and experiments on various mainstream model architectures confirm the validity of FS.
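The decorrelation idea can be illustrated with a simple cross-covariance penalty between two feature segments. This is a generic stand-in, not the paper's exact decorrelation function:

```python
import numpy as np

def decorrelation_penalty(common, specific):
    """Penalize statistical dependence between two feature segments.

    The penalty is the mean squared entry of the cross-covariance between
    the centered common and specific segments; it is near zero when the
    segments carry uncorrelated information.
    """
    c = common - common.mean(axis=0)
    s = specific - specific.mean(axis=0)
    cov = c.T @ s / (len(c) - 1)   # (Dc, Ds) cross-covariance matrix
    return float((cov ** 2).mean())

rng = np.random.default_rng(0)
x = rng.normal(size=(256, 8))
corr_pen = decorrelation_penalty(x, x + 0.01 * rng.normal(size=(256, 8)))  # nearly identical
indep_pen = decorrelation_penalty(x, rng.normal(size=(256, 8)))            # independent
```

Highly correlated segments receive a much larger penalty than independent ones, which is the pressure FS uses to keep segments disentangled.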

[CV-63] EMMA: Your Text-to-Image Diffusion Model Can Secretly Accept Multi-Modal Prompts

链接: https://arxiv.org/abs/2406.09162
作者: Yucheng Han,Rui Wang,Chi Zhang,Juntao Hu,Pei Cheng,Bin Fu,Hanwang Zhang
关键词: Recent advancements, enabled the creation, creation of high-quality, diffusion model, image generation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: this https URL

点击查看摘要

Abstract:Recent advancements in image generation have enabled the creation of high-quality images from text conditions. However, when facing multi-modal conditions, such as text combined with reference appearances, existing methods struggle to balance multiple conditions effectively, typically showing a preference for one modality over others. To address this challenge, we introduce EMMA, a novel image generation model accepting multi-modal prompts built upon the state-of-the-art text-to-image (T2I) diffusion model, ELLA. EMMA seamlessly incorporates additional modalities alongside text to guide image generation through an innovative Multi-modal Feature Connector design, which effectively integrates textual and supplementary modal information using a special attention mechanism. By freezing all parameters in the original T2I diffusion model and only adjusting some additional layers, we reveal an interesting finding that the pre-trained T2I diffusion model can secretly accept multi-modal prompts. This interesting property facilitates easy adaptation to different existing frameworks, making EMMA a flexible and effective tool for producing personalized and context-aware images and even videos. Additionally, we introduce a strategy to assemble learned EMMA modules to produce images conditioned on multiple modalities simultaneously, eliminating the need for additional training with mixed multi-modal prompts. Extensive experiments demonstrate the effectiveness of EMMA in maintaining high fidelity and detail in generated images, showcasing its potential as a robust solution for advanced multi-modal conditional image generation tasks.
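The core mechanism of such a feature connector, text queries attending over tokens from another modality, can be sketched as single-head cross-attention in numpy. EMMA's actual connector design (projections, gating, and how it plugs into ELLA) is not reproduced here:

```python
import numpy as np

def cross_attention(text_tokens, modal_tokens):
    """Single-head cross-attention: text tokens attend over extra-modality tokens.

    Illustrates the general mechanism behind a multimodal feature connector,
    with a residual connection fusing the attended modal information back in.
    """
    d = text_tokens.shape[1]
    scores = text_tokens @ modal_tokens.T / np.sqrt(d)   # (T, M) attention scores
    scores -= scores.max(axis=1, keepdims=True)
    w = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    return text_tokens + w @ modal_tokens                # residual fusion

rng = np.random.default_rng(0)
text = rng.normal(size=(7, 32))    # 7 text tokens
modal = rng.normal(size=(5, 32))   # 5 tokens from a reference-image modality
fused = cross_attention(text, modal)
```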

[CV-64] Beyond the Frontier: Predicting Unseen Walls from Occupancy Grids by Learning from Floor Plans

链接: https://arxiv.org/abs/2406.09160
作者: Ludvig Ericson,Patric Jensfelt
关键词: occupancy grids integrated, partially observed environment, LIDAR sensor, tackle the challenge, partially observed
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注: RA-L, 8 pages

点击查看摘要

Abstract:In this paper, we tackle the challenge of predicting the unseen walls of a partially observed environment as a set of 2D line segments, conditioned on occupancy grids integrated along the trajectory of a 360° LIDAR sensor. A dataset of such occupancy grids and their corresponding target wall segments is collected by navigating a virtual robot between a set of randomly sampled waypoints in a collection of office-scale floor plans from a university campus. The line segment prediction task is formulated as an autoregressive sequence prediction task, and an attention-based deep network is trained on the dataset. The sequence-based autoregressive formulation is evaluated through predicted information gain, as in frontier-based autonomous exploration, demonstrating significant improvements over both non-predictive estimation and convolution-based image prediction found in the literature. Ablations on key components are evaluated, as well as sensor range and the occupancy grid’s metric area. Finally, model generality is validated by predicting walls in a novel floor plan reconstructed on-the-fly in a real-world office environment.

[CV-65] Towards Multilingual Audio-Visual Question Answering

链接: https://arxiv.org/abs/2406.09156
作者: Orchid Chetia Phukan,Priyabrata Mallick,Swarup Ranjan Behera,Aalekhya Satya Narayani,Arun Balaji Buduru,Rajesh Sharma
关键词: Audio-Visual Question Answering, extending Audio-Visual Question, Question Answering, AVQA, extending Audio-Visual
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注: Accepted to Interspeech 2024

点击查看摘要

Abstract:In this paper, we work towards extending Audio-Visual Question Answering (AVQA) to multilingual settings. Existing AVQA research has predominantly revolved around English and replicating it for addressing AVQA in other languages requires a substantial allocation of resources. As a scalable solution, we leverage machine translation and present two multilingual AVQA datasets for eight languages created from existing benchmark AVQA datasets. This prevents extra human annotation efforts of collecting questions and answers manually. To this end, we propose, MERA framework, by leveraging state-of-the-art (SOTA) video, audio, and textual foundation models for AVQA in multiple languages. We introduce a suite of models namely MERA-L, MERA-C, MERA-T with varied model architectures to benchmark the proposed datasets. We believe our work will open new research directions and act as a reference benchmark for future works in multilingual AVQA.

[CV-66] DefAn: Definitive Answer Dataset for LLMs Hallucination Evaluation

链接: https://arxiv.org/abs/2406.09155
作者: A B M Ashikur Rahman,Saeed Anwar,Muhammad Usman,Ajmal Mian
关键词: Large Language Models, Large Language, Language Models, demonstrated remarkable capabilities, daily life applications
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities, revolutionizing the integration of AI in daily life applications. However, they are prone to hallucinations, generating claims that contradict established facts, deviating from prompts, and producing inconsistent responses when the same prompt is presented multiple times. Addressing these issues is challenging due to the lack of comprehensive and easily assessable benchmark datasets. Most existing datasets are small and rely on multiple-choice questions, which are inadequate for evaluating the generative prowess of LLMs. To measure hallucination in LLMs, this paper introduces a comprehensive benchmark dataset comprising over 75,000 prompts across eight domains. These prompts are designed to elicit definitive, concise, and informative answers. The dataset is divided into two segments: one publicly available for testing and assessing LLM performance and a hidden segment for benchmarking various LLMs. In our experiments, we tested six LLMs (GPT-3.5, LLama 2, LLama 3, Gemini, Mixtral, and Zephyr), revealing that overall factual hallucination ranges from 59% to 82% on the public dataset and 57% to 76% in the hidden benchmark. Prompt misalignment hallucination ranges from 6% to 95% in the public dataset and 17% to 94% in the hidden counterpart. Average consistency ranges from 21% to 61% and 22% to 63%, respectively. Domain-wise analysis shows that LLM performance significantly deteriorates when asked for specific numeric information while performing moderately with person, location, and date queries. Our dataset demonstrates its efficacy and serves as a comprehensive benchmark for LLM performance evaluation. Our dataset and LLMs' responses are available at this https URL.
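A consistency score of the kind reported above can be computed by checking how often repeated responses agree with the majority answer. This is one plausible formulation, not necessarily DefAn's exact metric:

```python
from collections import Counter

def consistency(responses):
    """Share of repeated answers that agree with the majority answer.

    A simple illustrative consistency score for repeated-prompt evaluation;
    the benchmark's actual metric definition may differ.
    """
    majority_count = Counter(responses).most_common(1)[0][1]
    return majority_count / len(responses)

# five runs of the same prompt; four agree
runs = ["Paris", "Paris", "Lyon", "Paris", "Paris"]
score = consistency(runs)
```

Here the model answers "Paris" in 4 of 5 runs, giving a consistency of 0.8.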

[CV-67] Generative AI-based Prompt Evolution Engineering Design Optimization With Vision-Language Model

链接: https://arxiv.org/abs/2406.09143
作者: Melvin Wong,Thiago Rios,Stefan Menzel,Yew Soon Ong
关键词: Engineering design optimization, performance evaluation method, Engineering design, vision-language model, design optimization requires
类目: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: Accepted and to be published in IEEE Congress on Evolutionary Computation 2024

点击查看摘要

Abstract:Engineering design optimization requires an efficient combination of a 3D shape representation, an optimization algorithm, and a design performance evaluation method, which is often computationally expensive. We present a prompt evolution design optimization (PEDO) framework contextualized in a vehicle design scenario that leverages a vision-language model for penalizing impractical car designs synthesized by a generative model. The backbone of our framework is an evolutionary strategy coupled with an optimization objective function that comprises a physics-based solver and a vision-language model for practical or functional guidance in the generated car designs. In the prompt evolutionary search, the optimizer iteratively generates a population of text prompts, which embed user specifications on the aerodynamic performance and visual preferences of the 3D car designs. Then, in addition to the computational fluid dynamics simulations, the pre-trained vision-language model is used to penalize impractical designs and, thus, foster the evolutionary algorithm to seek more viable designs. Our investigations on a car design optimization problem show a wide spread of potential car designs generated at the early phase of the search, which indicates a good diversity of designs in the initial populations, and an increase of over 20% in the probability of generating practical designs compared to a baseline framework without using a vision-language model. Visual inspection of the designs against the performance results demonstrates prompt evolution as a very promising paradigm for finding novel designs with good optimization performance while providing ease of use in specifying design specifications and preferences via a natural language interface.
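The prompt-evolution loop can be sketched generically: mutate a population of text prompts, score them with a fitness function, and keep the best. The toy word-shuffle mutation and toy fitness below are placeholders for PEDO's actual mutation operators and its combined physics-solver plus vision-language scoring:

```python
import random

def evolve_prompts(seed_prompts, fitness, generations=5, pop_size=8, rng=None):
    """Generic (mu+lambda)-style evolutionary loop over text prompts.

    fitness() stands in for an expensive design evaluation; mutation here
    is a toy word shuffle. Purely illustrative, not the PEDO framework.
    """
    rng = rng or random.Random(0)

    def mutate(prompt):
        words = prompt.split()
        rng.shuffle(words)
        return " ".join(words)

    population = list(seed_prompts)
    for _ in range(generations):
        offspring = [mutate(rng.choice(population)) for _ in range(pop_size)]
        # keep the pop_size fittest of parents + offspring
        population = sorted(population + offspring, key=fitness, reverse=True)[:pop_size]
    return population[0]

# toy fitness: prefer prompts mentioning "aerodynamic", with a small diversity bonus
best = evolve_prompts(
    ["a sleek aerodynamic car", "car boxy a"],
    fitness=lambda p: p.count("aerodynamic") + len(set(p.split())) * 0.01,
)
```

In PEDO the fitness would instead combine a CFD simulation score with a vision-language model's practicality judgment of the generated design.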

[CV-68] AdaRevD: Adaptive Patch Exiting Reversible Decoder Pushes the Limit of Image Deblurring

链接: https://arxiv.org/abs/2406.09135
作者: Xintian Mao,Qingli Li,Yan Wang
关键词: Adaptive Patch Exiting, limited decoding capability, decoding capability constrains, Patch Exiting Reversible, Exiting Reversible Decoder
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Despite the recent progress in enhancing the efficacy of image deblurring, the limited decoding capability constrains the upper limit of State-Of-The-Art (SOTA) methods. This paper proposes a pioneering work, Adaptive Patch Exiting Reversible Decoder (AdaRevD), to explore their insufficient decoding capability. By inheriting the weights of the well-trained encoder, we refactor a reversible decoder which scales up the single-decoder training to multi-decoder training while remaining GPU memory-friendly. Meanwhile, we show that our reversible structure gradually disentangles high-level degradation degree and low-level blur pattern (residual of the blur image and its sharp counterpart) from compact degradation representation. Besides, due to the spatially-variant motion blur kernels, different blur patches have various deblurring difficulties. We further introduce a classifier to learn the degradation degree of image patches, enabling them to exit at different sub-decoders for speedup. Experiments show that our AdaRevD pushes the limit of image deblurring, e.g., achieving 34.60 dB in PSNR on GoPro dataset.
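The PSNR metric behind the 34.60 dB figure is straightforward to compute; a minimal numpy version:

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB between a restored and a sharp image."""
    mse = np.mean((pred - target) ** 2)
    return 10 * np.log10(max_val ** 2 / mse)

# a uniform error of 0.01 on images in [0, 1] gives MSE = 1e-4, i.e. 40 dB
clean = np.ones((8, 8)) * 0.5
noisy = clean + 0.01
value = psnr(noisy, clean)
```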

[CV-69] Auto-Vocabulary Segmentation for LiDAR Points

链接: https://arxiv.org/abs/2406.09126
作者: Weijie Wei,Osman Ülger,Fatemeh Karimi Najadasl,Theo Gevers,Martin R. Oswald
关键词: Existing perception methods, autonomous driving fall, driving fall short, recognizing unknown entities, Existing perception
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by CVPR 2024 OpenSun3D Workshop

点击查看摘要

Abstract:Existing perception methods for autonomous driving fall short of recognizing unknown entities not covered in the training data. Open-vocabulary methods offer promising capabilities in detecting any object but are limited by user-specified queries representing target classes. We propose AutoVoc3D, a framework for automatic object class recognition and open-ended segmentation. Evaluation on nuScenes showcases AutoVoc3D’s ability to generate precise semantic classes and accurate point-wise segmentation. Moreover, we introduce Text-Point Semantic Similarity, a new metric to assess the semantic similarity between text and point cloud without eliminating novel classes.
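One plausible reading of a text-point semantic similarity score is the mean cosine similarity between a text embedding and a point cloud's per-point embeddings; the paper's precise formulation may differ, so treat this as an illustrative sketch:

```python
import numpy as np

def text_point_similarity(text_emb, point_embs):
    """Mean cosine similarity between one text embedding and per-point embeddings.

    An assumed, simplified formulation of a text-to-point-cloud similarity score.
    """
    t = text_emb / np.linalg.norm(text_emb)
    p = point_embs / np.linalg.norm(point_embs, axis=1, keepdims=True)
    return float((p @ t).mean())

rng = np.random.default_rng(0)
t = rng.normal(size=16)
same = text_point_similarity(t, np.tile(t, (10, 1)))   # identical embeddings
rand = text_point_similarity(t, rng.normal(size=(10, 16)))
```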

[CV-70] MMRel: A Relation Understanding Dataset and Benchmark in the MLLM Era

链接: https://arxiv.org/abs/2406.09121
作者: Jiahao Nie,Gongjie Zhang,Wenbin An,Yap-Peng Tan,Alex C. Kot,Shijian Lu
关键词: Large Language Models, Language Models, Multi-modal Large Language, Large Language, Models
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Despite the recent advancements in Multi-modal Large Language Models (MLLMs), understanding inter-object relations, i.e., interactions or associations between distinct objects, remains a major challenge for such models. This issue significantly hinders their advanced reasoning capabilities and is primarily due to the lack of large-scale, high-quality, and diverse multi-modal data essential for training and evaluating MLLMs. In this paper, we provide a taxonomy of inter-object relations and introduce Multi-Modal Relation Understanding (MMRel), a comprehensive dataset designed to bridge this gap by providing large-scale, high-quality and diverse data for studying inter-object relations with MLLMs. MMRel features three distinctive attributes: (i) It includes over 15K question-answer pairs, which are sourced from three distinct domains, ensuring large scale and high diversity; (ii) It contains a subset featuring highly unusual relations, on which MLLMs often fail due to hallucinations, thus are very challenging; (iii) It provides manually verified high-quality labels for inter-object relations. Thanks to these features, MMRel is ideal for evaluating MLLMs on relation understanding, as well as being used to fine-tune MLLMs to enhance relation understanding and even benefit overall performance in various vision-language tasks. Extensive experiments on various popular MLLMs validate the effectiveness of MMRel. Both MMRel dataset and the complete labeling scripts have been made publicly available.

[CV-71] PC-LoRA: Low-Rank Adaptation for Progressive Model Compression with Knowledge Distillation

链接: https://arxiv.org/abs/2406.09117
作者: Injoon Hwang,Haewon Park,Youngwan Lee,Jooyoung Yang,SunJae Maeng
关键词: frozen pre-trained weights, pre-trained weights, Progressive Compression LoRA, adds a small, small number
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: Accepted at T4V@CVPR

点击查看摘要

Abstract:Low-rank adaptation (LoRA) is a prominent method that adds a small number of learnable parameters to the frozen pre-trained weights for parameter-efficient fine-tuning. Prompted by the question, "Can we make the representation sufficient with LoRA weights alone at the final phase of fine-tuning, without the pre-trained weights?", we introduce Progressive Compression LoRA (PC-LoRA), which utilizes low-rank adaptation (LoRA) to simultaneously perform model compression and fine-tuning. The PC-LoRA method gradually removes the pre-trained weights during the training process, eventually leaving only the low-rank adapters in the end. Thus, these low-rank adapters replace the whole pre-trained weights, achieving the goals of compression and fine-tuning at the same time. Empirical analysis across various models demonstrates that PC-LoRA achieves parameter and FLOPs compression rates of 94.36%/89.1% for vision models, e.g., ViT-B, and 93.42%/84.2% parameter and FLOPs compression for language models, e.g., BERT.
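The progressive-compression idea can be sketched as a forward pass whose frozen pre-trained weight is scaled by a decay factor that goes from 1 to 0 over training, leaving only the low-rank path at the end. The linear-layer setup and schedule here are assumptions, not the paper's exact recipe:

```python
import numpy as np

def pc_lora_forward(x, w0, lora_a, lora_b, decay):
    """Linear layer with a progressively decayed pre-trained weight.

    decay goes 1 -> 0 over training; at decay = 0 only the low-rank path
    B @ A remains, which is the compressed model.
    """
    return decay * (x @ w0.T) + x @ (lora_b @ lora_a).T

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 32))
w0 = rng.normal(size=(16, 32))        # frozen pre-trained weight (16 out, 32 in)
a = rng.normal(size=(4, 32)) * 0.1    # rank-4 adapter: A
b = rng.normal(size=(16, 4)) * 0.1    # rank-4 adapter: B
start = pc_lora_forward(x, w0, a, b, decay=1.0)  # training start: full weight + adapter
end = pc_lora_forward(x, w0, a, b, decay=0.0)    # training end: adapter only
```

At `decay=0.0` the output depends only on the small adapters, which is what yields the reported parameter and FLOPs compression.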

[CV-72] Large-Scale Evaluation of Open-Set Image Classification Techniques

链接: https://arxiv.org/abs/2406.09112
作者: Halil Bisgin,Andres Palechor,Mike Suter,Manuel Günther
关键词: correctly assign labels, Maximum Logit Scores, correctly assign, Maximum SoftMax Scores, samples
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The goal for classification is to correctly assign labels to unseen samples. However, most methods misclassify samples with unseen labels and assign them to one of the known classes. Open-Set Classification (OSC) algorithms aim to maximize both closed and open-set recognition capabilities. Recent studies showed the utility of such algorithms on small-scale data sets, but limited experimentation makes it difficult to assess their performances in real-world problems. Here, we provide a comprehensive comparison of various OSC algorithms, including training-based (SoftMax, Garbage, EOS) and post-processing methods (Maximum SoftMax Scores, Maximum Logit Scores, OpenMax, EVM, PROSER), the latter applied to features from the former. We perform our evaluation on three large-scale protocols that mimic real-world challenges, where we train on known and negative open-set samples, and test on known and unknown instances. Our results show that EOS helps to improve performance of almost all post-processing algorithms. Particularly, OpenMax and PROSER are able to exploit better-trained networks, demonstrating the utility of hybrid models. However, while most algorithms work well on negative test samples (samples of open-set classes seen during training), they tend to perform poorly when tested on samples of previously unseen unknown classes, especially in challenging conditions.
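The simplest post-processing baseline mentioned above, Maximum SoftMax Scores, rejects a sample as unknown when the top class probability falls below a threshold; the threshold value below is an arbitrary illustration:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def mss_reject(logits, threshold=0.5):
    """Maximum SoftMax Score open-set rule: predict the argmax class,
    or reject the sample as unknown (label -1) when the top probability is low."""
    probs = softmax(logits)
    preds = probs.argmax(axis=1)
    preds[probs.max(axis=1) < threshold] = -1
    return preds

logits = np.array([[5.0, 0.0, 0.0],    # confident -> accepted as class 0
                   [0.4, 0.5, 0.3]])   # nearly flat -> rejected as unknown
out = mss_reject(logits)
```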

[CV-73] INS-MMBench: A Comprehensive Benchmark for Evaluating LVLMs Performance in Insurance

链接: https://arxiv.org/abs/2406.09105
作者: Chenwei Lin,Hanjia Lyu,Xian Xu,Jiebo Luo
关键词: Large Vision-Language Models, Large Vision-Language, shown promising potential, insurance, insurance domain
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Vision-Language Models (LVLMs) have demonstrated outstanding performance in various general multimodal applications such as image recognition and visual reasoning, and have also shown promising potential in specialized domains. However, the application potential of LVLMs in the insurance domain, characterized by rich application scenarios and abundant multimodal data, has not been effectively explored. There is no systematic review of multimodal tasks in the insurance domain, nor a benchmark specifically designed to evaluate the capabilities of LVLMs in insurance. This gap hinders the development of LVLMs within the insurance domain. In this paper, we systematically review and distill multimodal tasks for four representative types of insurance: auto insurance, property insurance, health insurance, and agricultural insurance. We propose INS-MMBench, the first comprehensive LVLMs benchmark tailored for the insurance domain. INS-MMBench comprises a total of 2.2K thoroughly designed multiple-choice questions, covering 12 meta-tasks and 22 fundamental tasks. Furthermore, we evaluate multiple representative LVLMs, including closed-source models such as GPT-4o and open-source models like BLIP-2. This evaluation not only validates the effectiveness of our benchmark but also provides an in-depth performance analysis of current LVLMs on various multimodal tasks in the insurance domain. We hope that INS-MMBench will facilitate the further application of LVLMs in the insurance domain and inspire interdisciplinary development. Our dataset and evaluation code are available at this https URL.

[CV-74] Suitability of KANs for Computer Vision: A preliminary investigation

链接: https://arxiv.org/abs/2406.09087
作者: Basim Azam,Naveed Akhtar
关键词: implements learnable functions, traditional node-centric activations, introduce a paradigm, implements learnable, learnable functions
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Kolmogorov-Arnold Networks (KANs) introduce a paradigm of neural modeling that implements learnable functions on the edges of the networks, diverging from the traditional node-centric activations in neural networks. This work assesses the applicability and efficacy of KANs in visual modeling, focusing on the image recognition task. We mainly analyze the performance and efficiency of different network architectures built using KAN concepts along with conventional building blocks of convolutional and linear layers, enabling a comparative analysis with the conventional models. Our findings are aimed at contributing to understanding the potential of KANs in computer vision, highlighting both their strengths and areas for further research. Our evaluation shows that whereas KAN-based architectures perform in line with the original claims of the KAN paper for performance and model complexity in the case of simpler vision datasets like MNIST, the advantages seem to diminish even for slightly more complex datasets like CIFAR-10.
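The core KAN idea, a learnable function on each edge instead of a fixed nodewise activation, can be sketched with a polynomial basis. Real KAN implementations use B-spline bases on a grid, so this is a simplified stand-in:

```python
import numpy as np

def kan_edge(x, coeffs):
    """A learnable univariate edge function as a mix of polynomial basis terms.

    Each edge of a KAN carries one such function; its coefficients are the
    trainable parameters. (KANs proper use B-splines, not polynomials.)
    """
    basis = np.stack([x ** k for k in range(len(coeffs))], axis=-1)
    return basis @ coeffs

def kan_node(xs, edge_coeffs):
    """A KAN node simply sums its incoming edge functions (no nodewise activation)."""
    return sum(kan_edge(x, c) for x, c in zip(xs, edge_coeffs))

x = np.linspace(-1, 1, 5)
out = kan_node(
    [x, x],
    [np.array([0.0, 1.0, 0.0]),   # edge 1 learns f(x) = x
     np.array([0.0, 0.0, 1.0])],  # edge 2 learns f(x) = x^2
)
```

With the coefficients above, the node computes x + x^2, showing how expressiveness lives on the edges rather than in the nodes.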

[CV-75] EquiPrompt: Debiasing Diffusion Models via Iterative Bootstrapping in Chain of Thoughts

链接: https://arxiv.org/abs/2406.09070
作者: Zahraa Al Sahili,Ioannis Patras,Matthew Purver
关键词: training datasets poses, datasets poses significant, significant ethical challenges, poses significant ethical, generative models
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In the domain of text-to-image generative models, the inadvertent propagation of biases inherent in training datasets poses significant ethical challenges, particularly in the generation of socially sensitive content. This paper introduces EquiPrompt, a novel method employing Chain of Thought (CoT) reasoning to reduce biases in text-to-image generative models. EquiPrompt uses iterative bootstrapping and bias-aware exemplar selection to balance creativity and ethical responsibility. It integrates iterative reasoning refinement with controlled evaluation techniques, addressing zero-shot CoT issues in sensitive contexts. Experiments on several generation tasks show EquiPrompt effectively lowers bias while maintaining generative quality, advancing ethical AI and socially responsible creative processes. Code will be publicly available.

[CV-76] How structured are the representations in transformer-based vision encoders? An analysis of multi-object representations in vision-language models

链接: https://arxiv.org/abs/2406.09067
作者: Tarun Khajuria,Braian Olmiro Dias,Jaan Aru
关键词: considered essential, essential for generalising, symbol-like structured representations, Forming, representations
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Neurons and Cognition (q-bio.NC)
*备注:

点击查看摘要

Abstract:Forming and using symbol-like structured representations for reasoning has been considered essential for generalising over novel inputs. The primary tool that allows generalisation outside training data distribution is the ability to abstract away irrelevant information into a compact form relevant to the task. An extreme form of such abstract representations is symbols. Humans make use of symbols to bind information while abstracting away irrelevant parts to utilise the information consistently and meaningfully. This work estimates the state of such structured representations in vision encoders. Specifically, we evaluate image encoders in large vision-language pre-trained models to address the question of which desirable properties their representations lack by applying the criteria of symbolic structured reasoning described for LLMs to the image models. We test the representation space of image encoders like VIT, BLIP, CLIP, and FLAVA to characterise the distribution of the object representations in these models. In particular, we create decoding tasks using multi-object scenes from the COCO dataset, relating the token space to its input content for various objects in the scene. We use these tasks to characterise the network’s token and layer-wise information modelling. Our analysis highlights that the CLS token, used for the downstream task, only focuses on a few objects necessary for the trained downstream task. Still, other individual objects are well-modelled separately by the tokens in the network originating from those objects. We further observed a widespread distribution of scene information. This demonstrates that information is far more entangled in tokens than optimal for representing objects similar to symbols. Given these symbolic properties, we show the network dynamics that cause failure modes of these models on basic downstream tasks in a multi-object scene.

[CV-77] FacEnhance: Facial Expression Enhancing with Recurrent DDPMs

链接: https://arxiv.org/abs/2406.09040
作者: Hamza Bouzid,Lahoucine Ballihi
关键词: non-verbal human communication, computer vision fields, facial expression generation, facial expression, Facial
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: submitted to Multimedia Tools and Applications

点击查看摘要

Abstract:Facial expressions, vital in non-verbal human communication, have found applications in various computer vision fields like virtual reality, gaming, and emotional AI assistants. Despite advancements, many facial expression generation models encounter challenges such as low resolution (e.g., 32x32 or 64x64 pixels), poor quality, and the absence of background details. In this paper, we introduce FacEnhance, a novel diffusion-based approach addressing constraints in existing low-resolution facial expression generation models. FacEnhance enhances low-resolution facial expression videos (64x64 pixels) to higher resolutions (192x192 pixels), incorporating background details and improving overall quality. Leveraging conditional denoising within a diffusion framework, guided by a background-free low-resolution video and a single neutral expression high-resolution image, FacEnhance generates a video incorporating the facial expression from the low-resolution video performed by the individual with background from the neutral image. By complementing lightweight low-resolution models, FacEnhance strikes a balance between computational efficiency and desirable image resolution and quality. Extensive experiments on the MUG facial expression database demonstrate the efficacy of FacEnhance in enhancing low-resolution model outputs to state-of-the-art quality while preserving content and identity consistency. FacEnhance represents significant progress towards resource-efficient, high-fidelity facial expression generation, renewing outdated low-resolution methods to up-to-date standards.

[CV-78] Steganalysis on Digital Watermarking: Is Your Defense Truly Impervious?

链接: https://arxiv.org/abs/2406.09026
作者: Pei Yang,Hai Ci,Yiren Song,Mike Zheng Shou
关键词: Digital watermarking techniques, Digital watermarking, generative AI models, crucial for copyright, copyright protection
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Digital watermarking techniques are crucial for copyright protection and source identification of images, especially in the era of generative AI models. However, many existing watermarking methods, particularly content-agnostic approaches that embed fixed patterns regardless of image content, are vulnerable to steganalysis attacks that can extract and remove the watermark with minimal perceptual distortion. In this work, we categorize watermarking algorithms into content-adaptive and content-agnostic ones, and demonstrate how averaging a collection of watermarked images could reveal the underlying watermark pattern. We then leverage this extracted pattern for effective watermark removal under both graybox and blackbox settings, even when the collection contains multiple watermark patterns. For some algorithms like Tree-Ring watermarks, the extracted pattern can also forge convincing watermarks on clean images. Our quantitative and qualitative evaluations across twelve watermarking methods highlight the threat posed by steganalysis to content-agnostic watermarks and the importance of designing watermarking techniques resilient to such analytical attacks. We propose security guidelines calling for using content-adaptive watermarking strategies and performing security evaluation against steganalysis. We also suggest multi-key assignments as potential mitigations against steganalysis vulnerabilities.
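The averaging attack described above can be sketched on synthetic data. The additive embedding, image statistics, and array shapes below are assumptions for illustration, not the paper's actual setup:

```python
import numpy as np

rng = np.random.default_rng(1)
pattern = rng.normal(scale=2.0, size=(8, 8))        # hypothetical fixed watermark
clean = rng.normal(127.0, 10.0, size=(2000, 8, 8))  # stand-in image collection
watermarked = clean + pattern                        # content-agnostic additive embedding

# Averaging cancels the (varying) image content but not the fixed pattern;
# subtracting the global mean removes the unknown average brightness.
estimate = watermarked.mean(axis=0) - watermarked.mean()

# Graybox removal: subtract the estimated pattern from every image.
recovered = watermarked - estimate

err = np.abs(estimate - (pattern - pattern.mean())).mean()
print(err < 0.5)
```

Content-adaptive schemes resist this attack because the embedded signal varies with image content and no longer survives the average.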

[CV-79] A PCA based Keypoint Tracking Approach to Automated Facial Expressions Encoding

链接: https://arxiv.org/abs/2406.09017
作者: Shivansh Chandra Tripathi,Rahul Garg
关键词: Action Coding System, Facial Action Coding, Coding System, requires significant effort, Action Coding
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: This preprint has not undergone peer review or any post-submission improvements or corrections. The Version of Record of this contribution is published in [LNCS,volume 14301], and is available online at this https URL

点击查看摘要

Abstract:The Facial Action Coding System (FACS) for studying facial expressions is manual and requires significant effort and expertise. This paper explores the use of automated techniques to generate Action Units (AUs) for studying facial expressions. We propose an unsupervised approach based on Principal Component Analysis (PCA) and facial keypoint tracking to generate data-driven AUs called PCA AUs using the publicly available DISFA dataset. The PCA AUs comply with the direction of facial muscle movements and are capable of explaining over 92.83 percent of the variance in other public test datasets (BP4D-Spontaneous and CK+), indicating their capability to generalize facial expressions. The PCA AUs are also comparable to a keypoint-based equivalence of FACS AUs in terms of variance explained on the test datasets. In conclusion, our research demonstrates the potential of automated techniques to be an alternative to manual FACS labeling which could lead to efficient real-time analysis of facial expressions in psychology and related fields. To promote further research, we have made code repository publicly available.
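A minimal sketch of the underlying idea, assuming synthetic keypoint displacements (the paper tracks real keypoints on DISFA; all shapes and names here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical data: 300 frames of 68 tracked keypoints (x, y), whose
# displacements are driven by 3 latent "muscle movement" directions.
latents = rng.normal(size=(300, 3))
directions = rng.normal(size=(3, 136))
displacements = latents @ directions + 0.01 * rng.normal(size=(300, 136))

# PCA via SVD of the centered displacement matrix.
X = displacements - displacements.mean(axis=0)
_, s, _ = np.linalg.svd(X, full_matrices=False)
explained = (s ** 2) / np.sum(s ** 2)

# The leading components play the role of data-driven "PCA AUs";
# here they should capture nearly all of the variance.
print(float(explained[:3].sum()))
```

On real data the number of retained components would be chosen by a variance threshold, mirroring the 92.83 percent figure reported above.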

[CV-80] Cross-Modal Learning for Anomaly Detection in Fused Magnesium Smelting Process: Methodology and Benchmark

链接: https://arxiv.org/abs/2406.09016
作者: Gaochang Wu,Yapeng Zhang,Lan Deng,Jingxin Zhang,Tianyou Chai
关键词: Fused Magnesium Furnace, anomaly detection plays, Magnesium Furnace, anomaly detection, fused magnesium smelting
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 14 pages, 6 figures, 5 tables. Submitted to IEEE

点击查看摘要

Abstract:Fused Magnesium Furnace (FMF) is a crucial industrial equipment in the production of magnesia, and anomaly detection plays a pivotal role in ensuring its efficient, stable, and secure operation. Existing anomaly detection methods primarily focus on analyzing dominant anomalies using the process variables (such as arc current) or constructing neural networks based on abnormal visual features, while overlooking the intrinsic correlation of cross-modal information. This paper proposes a cross-modal Transformer (dubbed FmFormer), designed to facilitate anomaly detection in fused magnesium smelting processes by exploring the correlation between visual features (video) and process variables (current). Our approach introduces a novel tokenization paradigm to effectively bridge the substantial dimensionality gap between the 3D video modality and the 1D current modality in a multiscale manner, enabling a hierarchical reconstruction of pixel-level anomaly detection. Subsequently, the FmFormer leverages self-attention to learn internal features within each modality and bidirectional cross-attention to capture correlations across modalities. To validate the effectiveness of the proposed method, we also present a pioneering cross-modal benchmark of the fused magnesium smelting process, featuring synchronously acquired video and current data for over 2.2 million samples. Leveraging cross-modal learning, the proposed FmFormer achieves state-of-the-art performance in detecting anomalies, particularly under extreme interferences such as current fluctuations and visual occlusion caused by heavy water mist. The presented methodology and benchmark may be applicable to other industrial applications with some amendments. The benchmark will be released at this https URL.
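The bidirectional cross-attention at the core of such a design can be sketched generically. This is plain scaled dot-product attention over random stand-in tokens, not the FmFormer implementation; all shapes and weights are hypothetical:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(queries, context, wq, wk, wv):
    # Scaled dot-product attention with queries from one modality
    # and keys/values from the other.
    q, k, v = queries @ wq, context @ wk, context @ wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return attn @ v

rng = np.random.default_rng(3)
d = 16
video_tokens = rng.normal(size=(10, d))    # stand-in tokens from the 3D video modality
current_tokens = rng.normal(size=(50, d))  # stand-in tokens from the 1D current modality
wq, wk, wv = [0.1 * rng.normal(size=(d, d)) for _ in range(3)]

# Bidirectional: each modality attends to the other.
video_ctx = cross_attention(video_tokens, current_tokens, wq, wk, wv)
current_ctx = cross_attention(current_tokens, video_tokens, wq, wk, wv)
print(video_ctx.shape, current_ctx.shape)
```

Each output keeps the shape of its query sequence while mixing in information from the other modality, which is what lets a shared model correlate video frames with current readings.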

[CV-81] AMSA-UNet: An Asymmetric Multiple Scales U-net Based on Self-attention for Deblurring

链接: https://arxiv.org/abs/2406.09015
作者: Yingying Wang
关键词: traditional single-scale U-Net, deblurring accuracy, loss of spatial, single-scale U-Net, affects the deblurring
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 15pages, 6figures

点击查看摘要

Abstract:The traditional single-scale U-Net often leads to the loss of spatial information during deblurring, which affects the deblurring accuracy. Additionally, due to the convolutional method's limitation in capturing long-range dependencies, the quality of the recovered image is degraded. To address the above problems, an asymmetric multi-scale U-Net based on self-attention (AMSA-UNet) is proposed to improve accuracy and reduce computational complexity. By introducing a multi-scale U-shaped architecture, the network can focus on blurry regions at the global level and better recover image details at the local level. To overcome the limitations of traditional convolutional methods in capturing the long-range dependencies of information, a self-attention mechanism is introduced into the decoder part of the backbone network, which significantly increases the model's receptive field, enabling it to pay more attention to the semantic information of the image and thereby produce more accurate and visually pleasing deblurred images. Furthermore, a frequency domain-based computation method is introduced to reduce the amount of computation. The experimental results demonstrate that the proposed method exhibits significant improvements in both accuracy and speed compared to eight state-of-the-art methods.
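The frequency-domain computation mentioned at the end typically rests on the convolution theorem. The following is a generic 1D sketch (not the AMSA-UNet code) comparing a direct circular convolution with its FFT counterpart:

```python
import numpy as np

def circular_conv_direct(x, k):
    # O(n * m) circular convolution by definition.
    n = len(x)
    return np.array([sum(x[(i - j) % n] * k[j] for j in range(len(k)))
                     for i in range(n)])

def circular_conv_fft(x, k):
    # Convolution theorem: multiply spectra pointwise, O(n log n).
    n = len(x)
    k_padded = np.zeros(n)
    k_padded[: len(k)] = k
    return np.fft.irfft(np.fft.rfft(x) * np.fft.rfft(k_padded), n=n)

x = np.arange(8, dtype=float)
k = np.array([1.0, -1.0, 0.5])
print(np.allclose(circular_conv_direct(x, k), circular_conv_fft(x, k)))
```

For large kernels the O(n log n) spectral route is much cheaper than the O(n^2) direct sum, which is the usual motivation for moving computation into the frequency domain.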

[CV-82] Enhancing Cross-Modal Fine-Tuning with Gradually Intermediate Modality Generation

链接: https://arxiv.org/abs/2406.09003
作者: Lincan Cai,Shuang Li,Wenxuan Ma,Jingxuan Kang,Binhui Xie,Zixun Sun,Chengwei Zhu
关键词: proven immensely valuable, handling data-intensive modalities, Large-scale pretrained models, text and image, Large-scale pretrained
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large-scale pretrained models have proven immensely valuable in handling data-intensive modalities like text and image. However, fine-tuning these models for certain specialized modalities, such as protein sequence and cosmic ray, poses challenges due to the significant modality discrepancy and scarcity of labeled data. In this paper, we propose an end-to-end method, PaRe, to enhance cross-modal fine-tuning, aiming to transfer a large-scale pretrained model to various target modalities. PaRe employs a gating mechanism to select key patches from both source and target data. Through a modality-agnostic Patch Replacement scheme, these patches are preserved and combined to construct data-rich intermediate modalities ranging from easy to hard. By gradually intermediate modality generation, we can not only effectively bridge the modality gap to enhance stability and transferability of cross-modal fine-tuning, but also address the challenge of limited data in the target modality by leveraging enriched intermediate modality data. Compared with hand-designed, general-purpose, task-specific, and state-of-the-art cross-modal fine-tuning approaches, PaRe demonstrates superior performance across three challenging benchmarks, encompassing more than ten modalities.

[CV-83] Adaptive Temporal Motion Guided Graph Convolution Network for Micro-expression Recognition

链接: https://arxiv.org/abs/2406.08997
作者: Fengyuan Zhang,Zhaopei Huang,Xinjie Zhang,Qin Jin
关键词: genuine emotional states, understanding individuals’ genuine, individuals’ genuine emotional, Adaptive Temporal Motion, emotional states
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by ICME 2024

点击查看摘要

Abstract:Micro-expressions serve as essential cues for understanding individuals’ genuine emotional states. Recognizing micro-expressions attracts increasing research attention due to its various applications in fields such as business negotiation and psychotherapy. However, the intricate and transient nature of micro-expressions poses a significant challenge to their accurate recognition. Most existing works either neglect temporal dependencies or suffer from redundancy issues in clip-level recognition. In this work, we propose a novel framework for micro-expression recognition, named the Adaptive Temporal Motion Guided Graph Convolution Network (ATM-GCN). Our framework excels at capturing temporal dependencies between frames across the entire clip, thereby enhancing micro-expression recognition at the clip level. Specifically, the integration of Adaptive Temporal Motion layers empowers our method to aggregate global and local motion features inherent in micro-expressions. Experimental results demonstrate that ATM-GCN not only surpasses existing state-of-the-art methods, particularly on the Composite dataset, but also achieves superior performance on the latest micro-expression dataset CAS(ME)^3.

[CV-84] AirPlanes: Accurate Plane Estimation via 3D-Consistent Embeddings

链接: https://arxiv.org/abs/2406.08960
作者: Jamie Watson,Filippo Aleotti,Mohamed Sayed,Zawar Qureshi,Oisin Mac Aodha,Gabriel Brostow,Michael Firman,Sara Vicente
关键词: augmented reality, Extracting planes, robotics and augmented, Extracting, downstream tasks
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Extracting planes from a 3D scene is useful for downstream tasks in robotics and augmented reality. In this paper we tackle the problem of estimating the planar surfaces in a scene from posed images. Our first finding is that a surprisingly competitive baseline results from combining popular clustering algorithms with recent improvements in 3D geometry estimation. However, such purely geometric methods are understandably oblivious to plane semantics, which are crucial to discerning distinct planes. To overcome this limitation, we propose a method that predicts multi-view consistent plane embeddings that complement geometry when clustering points into planes. We show through extensive evaluation on the ScanNetV2 dataset that our new method outperforms existing approaches and our strong geometric baseline for the task of plane estimation.

[CV-85] Preserving Identity with Variational Score for General-purpose 3D Editing

链接: https://arxiv.org/abs/2406.08953
作者: Duong H. Le,Tuan Pham,Aniruddha Kembhavi,Stephan Mandt,Wei-Chiu Ma,Jiasen Lu
关键词: Variational Score Distillation, Delta Denoising Score, present Piva, Variational Score, Preserving Identity
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 22 pages, 14 figures

点击查看摘要

Abstract:We present Piva (Preserving Identity with Variational Score Distillation), a novel optimization-based method for editing images and 3D models based on diffusion models. Specifically, our approach is inspired by the recently proposed method for 2D image editing - Delta Denoising Score (DDS). We pinpoint the limitations in DDS for 2D and 3D editing, which causes detail loss and over-saturation. To address this, we propose an additional score distillation term that enforces identity preservation. This results in a more stable editing process, gradually optimizing NeRF models to match target prompts while retaining crucial input characteristics. We demonstrate the effectiveness of our approach in zero-shot image and neural field editing. Our method successfully alters visual attributes, adds both subtle and substantial structural elements, translates shapes, and achieves competitive results on standard 2D and 3D editing benchmarks. Additionally, our method imposes no constraints like masking or pre-training, making it compatible with a wide range of pre-trained diffusion models. This allows for versatile editing without needing neural field-to-mesh conversion, offering a more user-friendly experience.

[CV-86] Neural NeRF Compression

链接: https://arxiv.org/abs/2406.08943
作者: Tuan Pham,Stephan Mandt
关键词: Neural Radiance Fields, Radiance Fields, continuous volumetric representations, Neural Radiance, capturing detailed
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted to ICML 2024

点击查看摘要

Abstract:Neural Radiance Fields (NeRFs) have emerged as powerful tools for capturing detailed 3D scenes through continuous volumetric representations. Recent NeRFs utilize feature grids to improve rendering quality and speed; however, these representations introduce significant storage overhead. This paper presents a novel method for efficiently compressing a grid-based NeRF model, addressing the storage overhead concern. Our approach is based on the non-linear transform coding paradigm, employing neural compression for compressing the model’s feature grids. Due to the lack of training data involving many i.i.d scenes, we design an encoder-free, end-to-end optimized approach for individual scenes, using lightweight decoders. To leverage the spatial inhomogeneity of the latent feature grids, we introduce an importance-weighted rate-distortion objective and a sparse entropy model employing a masking mechanism. Our experimental results validate that our proposed method surpasses existing works in terms of grid-based NeRF compression efficacy and reconstruction quality.

[CV-87] Step-by-Step Diffusion: An Elementary Tutorial

链接: https://arxiv.org/abs/2406.08929
作者: Preetum Nakkiran,Arwen Bradley,Hattie Zhou,Madhu Advani
关键词: machine learning, diffusion experience, present an accessible, models and flow, flow matching
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
*备注: 35 pages, 11 figures

点击查看摘要

Abstract:We present an accessible first course on diffusion models and flow matching for machine learning, aimed at a technical audience with no diffusion experience. We try to simplify the mathematical details as much as possible (sometimes heuristically), while retaining enough precision to derive correct algorithms.
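The core closed-form identities such a tutorial builds on can be checked in a few lines. This is a generic DDPM sketch with an oracle noise prediction, not material from the tutorial itself; the schedule and shapes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
T = 100
betas = np.linspace(1e-4, 0.02, T)     # linear noise schedule
alpha_bar = np.cumprod(1.0 - betas)    # cumulative product, "alpha-bar_t"

x0 = rng.normal(size=5)                # toy "image"
t = 60
eps = rng.normal(size=5)               # the noise a trained model must predict

# Closed-form forward process q(x_t | x_0):
x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

# With an oracle noise prediction, the x_0 estimate used inside each
# reverse (sampling) step is exact:
x0_hat = (x_t - np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alpha_bar[t])
print(np.allclose(x0_hat, x0))
```

In practice a network replaces the oracle `eps`, and the quality of sampling hinges on how accurately it predicts the injected noise at each timestep.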

[CV-88] Multiple Prior Representation Learning for Self-Supervised Monocular Depth Estimation via Hybrid Transformer

链接: https://arxiv.org/abs/2406.08928
作者: Guodong Sun,Junjie Liu,Mingxuan Liu,Moyun Liu,Yang Zhang
关键词: Self-supervised monocular depth, infer depth information, aims to infer, depth estimation aims, labeled data
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注: 28 pages, 12 figures

点击查看摘要

Abstract:Self-supervised monocular depth estimation aims to infer depth information without relying on labeled data. However, the lack of labeled information poses a significant challenge to the model’s representation, limiting its ability to capture the intricate details of the scene accurately. Prior information can potentially mitigate this issue, enhancing the model’s understanding of scene structure and texture. Nevertheless, solely relying on a single type of prior information often falls short when dealing with complex scenes, necessitating improvements in generalization performance. To address these challenges, we introduce a novel self-supervised monocular depth estimation model that leverages multiple priors to bolster representation capabilities across spatial, context, and semantic dimensions. Specifically, we employ a hybrid transformer and a lightweight pose network to obtain long-range spatial priors in the spatial dimension. Then, the context prior attention is designed to improve generalization, particularly in complex structures or untextured areas. In addition, semantic priors are introduced by leveraging semantic boundary loss, and semantic prior attention is supplemented, further refining the semantic features extracted by the decoder. Experiments on three diverse datasets demonstrate the effectiveness of the proposed model. It integrates multiple priors to comprehensively enhance the representation ability, improving the accuracy and reliability of depth estimation. Codes are available at this https URL.

[CV-89] Learning Images Across Scales Using Adversarial Training

链接: https://arxiv.org/abs/2406.08924
作者: Krzysztof Wolski,Adarsh Djeacoumar,Alireza Javanmardi,Hans-Peter Seidel,Christian Theobalt,Guillaume Cordonnier,Karol Myszkowski,George Drettakis,Xingang Pan,Thomas Leimkühler
关键词: real world exhibits, world exhibits rich, exhibits rich structure, real world, world exhibits
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: SIGGRAPH 2024; project page: this https URL

点击查看摘要

Abstract:The real world exhibits rich structure and detail across many scales of observation. It is difficult, however, to capture and represent a broad spectrum of scales using ordinary images. We devise a novel paradigm for learning a representation that captures an orders-of-magnitude variety of scales from an unstructured collection of ordinary images. We treat this collection as a distribution of scale-space slices to be learned using adversarial training, and additionally enforce coherency across slices. Our approach relies on a multiscale generator with carefully injected procedural frequency content, which allows to interactively explore the emerging continuous scale space. Training across vastly different scales poses challenges regarding stability, which we tackle using a supervision scheme that involves careful sampling of scales. We show that our generator can be used as a multiscale generative model, and for reconstructions of scale spaces from unstructured patches. Significantly outperforming the state of the art, we demonstrate zoom-in factors of up to 256x at high quality and scale consistency.

[CV-90] A Label-Free and Non-Monotonic Metric for Evaluating Denoising in Event Cameras

链接: https://arxiv.org/abs/2406.08909
作者: Chenyang Shi,Shasha Guo,Boyi Wei,Hanxiao Liu,Yibo Zhang,Ningfang Song,Jing Jin
关键词: high efficiency due, outputting a sparse, asynchronous stream, efficiency due, due to outputting
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Event cameras are renowned for their high efficiency due to outputting a sparse, asynchronous stream of events. However, they are plagued by noisy events, especially in low light conditions. Denoising is an essential task for event cameras, but evaluating denoising performance is challenging. Label-dependent denoising metrics involve artificially adding noise to clean sequences, complicating evaluations. Moreover, the majority of these metrics are monotonic, which can inflate scores by removing substantial noise and valid events. To overcome these limitations, we propose the first label-free and non-monotonic evaluation metric, the area of the continuous contrast curve (AOCC), which utilizes the area enclosed by event frame contrast curves across different time intervals. This metric is inspired by how events capture the edge contours of scenes or objects with high temporal resolution. An effective denoising method removes noise without eliminating these edge-contour events, thus preserving the contrast of event frames. Consequently, contrast across various time ranges serves as a metric to assess denoising effectiveness. As the time interval lengthens, the curve will initially rise and then fall. The proposed metric is validated through both theoretical and experimental evidence.
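A toy sketch of the idea behind such a contrast-curve metric, assuming a simple count-frame contrast (the actual AOCC definition in the paper may differ; all event data here is synthetic):

```python
import numpy as np

def frame_contrast(events, interval, shape=(32, 32)):
    # Accumulate events with timestamp < interval into a count frame;
    # use its standard deviation as a simple contrast proxy.
    frame = np.zeros(shape)
    for t, x, y in events:
        if t < interval:
            frame[y, x] += 1
    return frame.std()

def aocc(events, intervals):
    # Area under the contrast-vs-interval curve (trapezoidal rule).
    c = np.array([frame_contrast(events, dt) for dt in intervals])
    return float(np.sum((c[1:] + c[:-1]) / 2.0 * np.diff(intervals)))

rng = np.random.default_rng(5)
# Edge-like events concentrated on one column vs. uniformly scattered noise.
edge = [(t, 16, int(y)) for t, y in
        zip(rng.uniform(0, 1, 400), rng.integers(0, 32, 400))]
noise = [(t, int(x), int(y)) for t, x, y in
         zip(rng.uniform(0, 1, 400), rng.integers(0, 32, 400), rng.integers(0, 32, 400))]

intervals = np.linspace(0.1, 1.0, 10)
print(aocc(edge, intervals) > aocc(noise, intervals))
```

Edge-concentrated events keep the frames high-contrast at every interval, so a denoiser that removes scattered noise while preserving edge events raises the enclosed area.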

[CV-91] Dual Attribute-Spatial Relation Alignment for 3D Visual Grounding

链接: https://arxiv.org/abs/2406.08907
作者: Yue Xu,Kaizhi Yang,Jiebo Luo,Xuejin Chen
关键词: achieving embodied intelligence, emerging research area, research area dedicated, relation Alignment Network, physical world
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
*备注:

点击查看摘要

Abstract:3D visual grounding is an emerging research area dedicated to making connections between the 3D physical world and natural language, which is crucial for achieving embodied intelligence. In this paper, we propose DASANet, a Dual Attribute-Spatial relation Alignment Network that separately models and aligns object attributes and spatial relation features between language and 3D vision modalities. We decompose both the language and 3D point cloud input into two separate parts and design a dual-branch attention module to separately model the decomposed inputs while preserving global context in attribute-spatial feature fusion by cross attentions. Our DASANet achieves the highest grounding accuracy of 65.1% on the Nr3D dataset, 1.3% higher than the best competitor. Besides, the visualization of the two branches proves that our method is efficient and highly interpretable.

[CV-92] Computer Vision Approaches for Automated Bee Counting Application

链接: https://arxiv.org/abs/2406.08898
作者: Simon Bilik,Ilona Janakova,Adam Ligocki,Dominik Ficek,Karel Horak
关键词: computer vision techniques, colony health state, bee colony health, health state monitoring, colony health
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Many applications in bee colony health state monitoring can be efficiently solved using computer vision techniques. One such challenge is finding an efficient way to count the numbers of incoming and outgoing bees, which can be used to further analyse many trends, such as the bee colony health state, blooming periods, or the effects of agricultural spraying. In this paper, we compare three methods for automated bee counting over two of our own datasets. The best performing method is based on a ResNet-50 convolutional neural network classifier, which achieved an accuracy of 87% on the BUT1 dataset and 93% on the BUT2 dataset.

[CV-93] OpenMaterial: A Comprehensive Dataset of Complex Materials for 3D Reconstruction

链接: https://arxiv.org/abs/2406.08894
作者: Zheng Dang,Jialu Huang,Fei Wang,Mathieu Salzmann
关键词: neural radiance fields, implicit neural representations, Recent advances, neural radiance, radiance fields
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Recent advances in deep learning such as neural radiance fields and implicit neural representations have significantly propelled the field of 3D reconstruction. However, accurately reconstructing objects with complex optical properties, such as metals and glass, remains a formidable challenge due to their unique specular and light-transmission characteristics. To facilitate the development of solutions to these challenges, we introduce the OpenMaterial dataset, comprising 1001 objects made of 295 distinct materials (including conductors, dielectrics, plastics, and their roughened variants) and captured under 723 diverse lighting conditions. To this end, we utilized physics-based rendering with laboratory-measured Indices of Refraction (IOR) and generated high-fidelity multiview images that closely replicate real-world objects. OpenMaterial provides comprehensive annotations, including 3D shape, material type, camera pose, depth, and object mask. It stands as the first large-scale dataset enabling quantitative evaluations of existing algorithms on objects with diverse and challenging materials, thereby paving the way for the development of 3D reconstruction algorithms capable of handling complex material properties.

[CV-94] The Penalized Inverse Probability Measure for Conformal Classification

链接: https://arxiv.org/abs/2406.08884
作者: Paul Melki(IMS),Lionel Bombrun(IMS),Boubacar Diallo,Jérôme Dias,Jean-Pierre da Costa(IMS)
关键词: machine learning systems, box neural networks, trustworthy machine learning, complex black box, black box neural
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:The deployment of safe and trustworthy machine learning systems, and particularly complex black box neural networks, in real-world applications requires reliable and certified guarantees on their performance. The conformal prediction framework offers such formal guarantees by transforming any point predictor into a set predictor with valid, finite-sample guarantees on the coverage of the true label at a chosen level of confidence. Central to this methodology is the notion of the nonconformity score function that assigns to each example a measure of “strangeness” in comparison with the previously seen observations. While the coverage guarantees are maintained regardless of the nonconformity measure, the point predictor and the dataset, previous research has shown that the performance of a conformal model, as measured by its efficiency (the average size of the predicted sets) and its informativeness (the proportion of prediction sets that are singletons), is influenced by the choice of the nonconformity score function. The current work introduces the Penalized Inverse Probability (PIP) nonconformity score, and its regularized version RePIP, that allow the joint optimization of both efficiency and informativeness. Through toy examples and empirical results on the task of crop and weed image classification in agricultural robotics, the current work shows how PIP-based conformal classifiers exhibit precisely the desired behavior in comparison with other nonconformity measures and strike a good balance between informativeness and efficiency.
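A minimal split-conformal sketch using the plain inverse-probability score (the PIP penalty term and the RePIP regularization are omitted here; the calibration data and shapes are synthetic assumptions, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(6)
n_cal, n_classes, alpha = 500, 4, 0.1

# Hypothetical calibration set: softmax outputs and true labels.
logits = rng.normal(size=(n_cal, n_classes))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
labels = rng.integers(0, n_classes, size=n_cal)

# Plain inverse-probability nonconformity score: 1 / p(true class).
scores = 1.0 / probs[np.arange(n_cal), labels]

# Split-conformal threshold: the ceil((n+1)(1-alpha))/n empirical quantile.
q = np.quantile(scores, np.ceil((n_cal + 1) * (1 - alpha)) / n_cal)

def prediction_set(p):
    # All classes whose score falls below the calibrated threshold.
    return [c for c in range(n_classes) if 1.0 / p[c] <= q]

coverage = np.mean([labels[i] in prediction_set(probs[i]) for i in range(n_cal)])
print(coverage >= 1 - alpha)
```

Efficiency (average set size) and informativeness (fraction of singleton sets) are then measured on these prediction sets, which is the trade-off the PIP and RePIP scores are designed to optimize jointly.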

[CV-95] EgoExo-Fitness: Towards Egocentric and Exocentric Full-Body Action Understanding

链接: https://arxiv.org/abs/2406.08877
作者: Yuan-Ming Li,Wei-Jin Huang,An-Lan Wang,Ling-An Zeng,Jing-Ke Meng,Wei-Shi Zheng
关键词: full-body action understanding, featuring fitness sequence, action understanding dataset, action, action understanding
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 33 pages, 9 figures

点击查看摘要

Abstract:We present EgoExo-Fitness, a new full-body action understanding dataset, featuring fitness sequence videos recorded from synchronized egocentric and fixed exocentric (third-person) cameras. Compared with existing full-body action understanding datasets, EgoExo-Fitness not only contains videos from first-person perspectives, but also provides rich annotations. Specifically, two-level temporal boundaries are provided to localize single action videos along with sub-steps of each action. More importantly, EgoExo-Fitness introduces innovative annotations for interpretable action judgement–including technical keypoint verification, natural language comments on action execution, and action quality scores. Combining all of these, EgoExo-Fitness provides new resources to study egocentric and exocentric full-body action understanding across dimensions of “what”, “when”, and “how well”. To facilitate research on egocentric and exocentric full-body action understanding, we construct benchmarks on a suite of tasks (i.e., action classification, action localization, cross-view sequence verification, cross-view skill determination, and a newly proposed task of guidance-based execution verification), together with detailed analysis. Code and data will be available at this https URL.

[CV-96] Zoom and Shift are All You Need

链接: https://arxiv.org/abs/2406.08866
作者: Jiahao Qin
关键词: fusing multimodal data, Feature alignment serves, primary mechanism, mechanism for fusing, alignment serves
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: 8 pages, 3 figures

点击查看摘要

Abstract:Feature alignment serves as the primary mechanism for fusing multimodal data. We put forth a feature alignment approach that achieves full integration of multimodal information. This is accomplished via an alternating process of shifting and expanding feature representations across modalities to obtain a consistent unified representation in a joint feature space. The proposed technique can reliably capture high-level interplay between features originating from distinct modalities. Consequently, substantial gains in multimodal learning performance are attained. Additionally, we demonstrate the superiority of our approach over other prevalent multimodal fusion schemes on a range of tasks. Extensive experimental evaluation conducted on multimodal datasets comprising time series, image, and text demonstrates that our method achieves state-of-the-art results.

[CV-97] Self-supervised Graph Neural Network for Mechanical CAD Retrieval

链接: https://arxiv.org/abs/2406.08863
作者: Yuhan Quan,Huan Zhao,Jinfeng Yi,Yuqiang Chen
关键词: similar-shaped CAD parts, CAD, Computer-Aided Design, plays a crucial, crucial role
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:CAD (Computer-Aided Design) plays a crucial role in mechanical industry, where large numbers of similar-shaped CAD parts are often created. Efficiently reusing these parts is key to reducing design and production costs for enterprises. Retrieval systems are vital for achieving CAD reuse, but the complex shapes of CAD models are difficult to accurately describe using text or keywords, making traditional retrieval methods ineffective. While existing representation learning approaches have been developed for CAD, manually labeling similar samples in these methods is expensive. Additionally, CAD models’ unique parameterized data structure presents challenges for applying existing 3D shape representation learning techniques directly. In this work, we propose GC-CAD, a self-supervised contrastive graph neural network-based method for mechanical CAD retrieval that directly models parameterized CAD raw files. GC-CAD consists of two key modules: structure-aware representation learning and contrastive graph learning framework. The method leverages graph neural networks to extract both geometric and topological information from CAD models, generating feature representations. We then introduce a simple yet effective contrastive graph learning framework approach, enabling the model to train without manual labels and generate retrieval-ready representations. Experimental results on four datasets including human evaluation demonstrate that the proposed method achieves significant accuracy improvements and up to 100 times efficiency improvement over the baseline methods.
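
The label-free contrastive training described above can be illustrated with a standard NT-Xent objective (the abstract does not spell out GC-CAD's exact loss, so this is an assumed stand-in): embeddings of two views of the same CAD model attract, while every other pair in the batch repels.

```python
import numpy as np

def nt_xent_loss(z1, z2, temperature=0.5):
    # Two views of the same item (z1[i], z2[i]) are positives; all other
    # pairs in the concatenated batch act as negatives.
    z = np.concatenate([z1, z2])
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    n = len(z1)
    sim = z @ z.T / temperature
    np.fill_diagonal(sim, -np.inf)               # an item is not its own negative
    targets = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    logprob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return float(-logprob[np.arange(2 * n), targets].mean())

rng = np.random.default_rng(0)
z = rng.standard_normal((8, 16))
aligned = nt_xent_loss(z, z + 0.01 * rng.standard_normal((8, 16)))   # matched views
random_pairs = nt_xent_loss(z, rng.standard_normal((8, 16)))         # unrelated views
```

Lower loss for matched views is exactly what lets such a model train without manual similarity labels.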

[CV-98] Fusion of regional and sparse attention in Vision Transformers

链接: https://arxiv.org/abs/2406.08859
作者: Nabil Ibtehaz,Ning Yan,Masood Mortazavi,Daisuke Kihara
关键词: leverage visually inspired, Modern vision transformers, transformers leverage visually, visually inspired local, Modern vision
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted as a Workshop Paper at T4V@CVPR2024. arXiv admin note: substantial text overlap with arXiv:2403.04200

点击查看摘要

Abstract:Modern vision transformers leverage visually inspired local interaction between pixels through attention computed within window or grid regions, in contrast to the global attention employed in the original ViT. Regional attention restricts pixel interactions within specific regions, while sparse attention disperses them across sparse grids. These differing approaches pose a challenge between maintaining hierarchical relationships vs. capturing a global context. In this study, drawing inspiration from atrous convolution, we propose Atrous Attention, a blend of regional and sparse attention that dynamically integrates both local and global information while preserving hierarchical structures. Based on this, we introduce a versatile, hybrid vision transformer backbone called ACC-ViT, tailored for standard vision tasks. Our compact model achieves approximately 84% accuracy on ImageNet-1K with fewer than 28.5 million parameters, outperforming the state-of-the-art MaxViT by 0.42% while requiring 8.4% fewer parameters.

[CV-99] OmniH2O: Universal and Dexterous Human-to-Humanoid Whole-Body Teleoperation and Learning

链接: https://arxiv.org/abs/2406.08858
作者: Tairan He,Zhengyi Luo,Xialin He,Wenli Xiao,Chong Zhang,Weinan Zhang,Kris Kitani,Changliu Liu,Guanya Shi
关键词: learning-based system, Omni, whole-body humanoid teleoperation, humanoid, humanoid whole-body
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: Project page: this https URL

点击查看摘要

Abstract:We present OmniH2O (Omni Human-to-Humanoid), a learning-based system for whole-body humanoid teleoperation and autonomy. Using kinematic pose as a universal control interface, OmniH2O enables various ways for a human to control a full-sized humanoid with dexterous hands, including using real-time teleoperation through VR headset, verbal instruction, and RGB camera. OmniH2O also enables full autonomy by learning from teleoperated demonstrations or integrating with frontier models such as GPT-4. OmniH2O demonstrates versatility and dexterity in various real-world whole-body tasks through teleoperation or autonomy, such as playing multiple sports, moving and manipulating objects, and interacting with humans. We develop an RL-based sim-to-real pipeline, which involves large-scale retargeting and augmentation of human motion datasets, learning a real-world deployable policy with sparse sensor input by imitating a privileged teacher policy, and reward designs to enhance robustness and stability. We release the first humanoid whole-body control dataset, OmniH2O-6, containing six everyday tasks, and demonstrate humanoid whole-body skill learning from teleoperated datasets.

[CV-100] COVE: Unleashing the Diffusion Feature Correspondence for Consistent Video Editing

链接: https://arxiv.org/abs/2406.08850
作者: Jiangshan Wang,Yue Ma,Jiayi Guo,Yicheng Xiao,Gao Huang,Xiu Li
关键词: Video editing, current methods adopt, diffusion model, emerging task, zero-shot manner
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Video editing is an emerging task, in which most current methods adopt the pre-trained text-to-image (T2I) diffusion model to edit the source video in a zero-shot manner. Despite extensive efforts, maintaining the temporal consistency of edited videos remains challenging due to the lack of temporal constraints in the regular T2I diffusion model. To address this issue, we propose COrrespondence-guided Video Editing (COVE), leveraging the inherent diffusion feature correspondence to achieve high-quality and consistent video editing. Specifically, we propose an efficient sliding-window-based strategy to calculate the similarity among tokens in the diffusion features of source videos, identifying the tokens with high correspondence across frames. During the inversion and denoising process, we sample the tokens in noisy latent based on the correspondence and then perform self-attention within them. To save GPU memory usage and accelerate the editing process, we further introduce the temporal-dimensional token merging strategy, which can effectively reduce redundancy. COVE can be seamlessly integrated into the pre-trained T2I diffusion model without the need for extra training or optimization. Extensive experiment results demonstrate that COVE achieves state-of-the-art performance in various video editing scenarios, outperforming existing methods both quantitatively and qualitatively. The code will be released at this https URL
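
The sliding-window correspondence step can be sketched as a windowed cosine-similarity search between the token grids of two frames; `best_correspondence` below is a hypothetical minimal version, not the authors' implementation.

```python
import numpy as np

def best_correspondence(feats_a, feats_b, window=1):
    # For each token on frame A's HxW grid, find the most similar token in
    # frame B within a (2*window+1)^2 spatial window (cosine similarity).
    H, W, C = feats_a.shape
    a = feats_a / np.linalg.norm(feats_a, axis=-1, keepdims=True)
    b = feats_b / np.linalg.norm(feats_b, axis=-1, keepdims=True)
    matches = np.zeros((H, W, 2), dtype=int)
    sims = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            i0, i1 = max(0, i - window), min(H, i + window + 1)
            j0, j1 = max(0, j - window), min(W, j + window + 1)
            local = b[i0:i1, j0:j1] @ a[i, j]        # similarities in the window
            k = np.unravel_index(np.argmax(local), local.shape)
            matches[i, j] = (i0 + k[0], j0 + k[1])
            sims[i, j] = local[k]
    return matches, sims

# A frame and a copy shifted by one column should match along the shift.
rng = np.random.default_rng(0)
fa = rng.standard_normal((8, 8, 16))
fb = np.roll(fa, shift=1, axis=1)
matches, sims = best_correspondence(fa, fb)
```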

[CV-101] Rethinking Human Evaluation Protocol for Text-to-Video Models: Enhancing Reliability, Reproducibility and Practicality

链接: https://arxiv.org/abs/2406.08845
作者: Tianle Zhang,Langtian Ma,Yuchen Yan,Yuchen Zhang,Kai Wang,Yue Yang,Ziyao Guo,Wenqi Shao,Yang You,Yu Qiao,Ping Luo,Kaipeng Zhang
关键词: technology advancements, applicability and popularity, significantly broadened, broadened its applicability, Recent
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Recent text-to-video (T2V) technology advancements, as demonstrated by models such as Gen2, Pika, and Sora, have significantly broadened its applicability and popularity. Despite these strides, evaluating these models poses substantial challenges. Primarily, due to the limitations inherent in automatic metrics, manual evaluation is often considered a superior method for assessing T2V generation. However, existing manual evaluation protocols face reproducibility, reliability, and practicality issues. To address these challenges, this paper introduces the Text-to-Video Human Evaluation (T2VHE) protocol, a comprehensive and standardized protocol for T2V models. The T2VHE protocol includes well-defined metrics, thorough annotator training, and an effective dynamic evaluation module. Experimental results demonstrate that this protocol not only ensures high-quality annotations but can also reduce evaluation costs by nearly 50%. We will open-source the entire setup of the T2VHE protocol, including the complete protocol workflow, the dynamic evaluation component details, and the annotation interface code. This will help communities establish more sophisticated human assessment protocols.

[CV-102] Conceptual Learning via Embedding Approximations for Reinforcing Interpretability and Transparency

链接: https://arxiv.org/abs/2406.08840
作者: Maor Dikter,Tsachi Blau,Chaim Baskin
关键词: emerged as critical, critical tools, tools in domains, underline, textbf
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Concept bottleneck models (CBMs) have emerged as critical tools in domains where interpretability is paramount. These models rely on predefined textual descriptions, referred to as concepts, to inform their decision-making process and offer more accurate reasoning. As a result, the selection of concepts used in the model is of utmost significance. This study proposes Conceptual Learning via Embedding Approximations for Reinforcing Interpretability and Transparency, abbreviated as CLEAR, a framework for constructing a CBM for image classification. Using score matching and Langevin sampling, we approximate the embedding of concepts within the latent space of a vision-language model (VLM) by learning the scores associated with the joint distribution of images and concepts. A concept selection process is then employed to optimize the similarity between the learned embeddings and the predefined ones. The derived bottleneck offers insights into the CBM’s decision-making process, enabling more comprehensive interpretations. Our approach was evaluated through extensive experiments and achieved state-of-the-art performance on various benchmarks. The code for our experiments is available at this https URL
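
Langevin sampling, which CLEAR uses to draw concept embeddings from a learned score, follows one simple update rule. The sketch below runs unadjusted Langevin dynamics with an analytic score (that of a standard Gaussian) standing in for the learned score network.

```python
import numpy as np

def langevin_sample(score_fn, x0, step=0.05, n_steps=500, seed=0):
    # Unadjusted Langevin dynamics:
    #   x <- x + step * score(x) + sqrt(2 * step) * noise
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        x = x + step * score_fn(x) + np.sqrt(2 * step) * rng.standard_normal(x.shape)
    return x

# The score of a standard Gaussian N(0, I) is grad log p(x) = -x, so the
# chains should drift from their start at 5.0 toward samples from N(0, I).
x0 = np.full((500, 2), 5.0)          # 500 parallel 2-d chains
samples = langevin_sample(lambda x: -x, x0)
```

Swapping the analytic score for a network trained with score matching gives the sampling loop the paper relies on.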

[CV-103] NeRF Director: Revisiting View Selection in Neural Volume Rendering

链接: https://arxiv.org/abs/2406.08839
作者: Wenhui Xiao,Rodrigo Santa Cruz,David Ahmedt-Aristizabal,Olivier Salvado,Clinton Fookes,Leo Lebrat
关键词: Neural Rendering representations, computer vision, Neural Rendering, representations have significantly, significantly contributed
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: CVPR2024

点击查看摘要

Abstract:Neural Rendering representations have significantly contributed to the field of 3D computer vision. Given their potential, considerable efforts have been invested to improve their performance. Nonetheless, the essential question of selecting training views is yet to be thoroughly investigated. This key aspect plays a vital role in achieving high-quality results and aligns with the well-known tenet of deep learning: “garbage in, garbage out”. In this paper, we first illustrate the importance of view selection by demonstrating how a simple rotation of the test views within the most pervasive NeRF dataset can lead to consequential shifts in the performance rankings of state-of-the-art techniques. To address this challenge, we introduce a unified framework for view selection methods and devise a thorough benchmark to assess its impact. Significant improvements can be achieved without leveraging error or uncertainty estimation but focusing on uniform view coverage of the reconstructed object, resulting in a training-free approach. Using this technique, we show that high-quality renderings can be achieved faster by using fewer views. We conduct extensive experiments on both synthetic datasets and realistic data to demonstrate the effectiveness of our proposed method compared with random, conventional error-based, and uncertainty-guided view selection.
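
A training-free, uniform-coverage view selector can be approximated by farthest-point sampling over candidate camera directions; the abstract does not name the exact algorithm, so this is an illustrative heuristic rather than the paper's method.

```python
import numpy as np

def farthest_point_views(candidates, k, start=0):
    # Greedily pick views that maximize the minimum distance to the views
    # already selected, approximating uniform coverage of the object.
    chosen = [start]
    d = np.linalg.norm(candidates - candidates[start], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(d))
        chosen.append(nxt)
        d = np.minimum(d, np.linalg.norm(candidates - candidates[nxt], axis=1))
    return chosen

rng = np.random.default_rng(0)
dirs = rng.standard_normal((200, 3))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)   # candidate views on a sphere
views = farthest_point_views(dirs, 10)
```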

[CV-104] Improving Adversarial Robustness via Feature Pattern Consistency Constraint

链接: https://arxiv.org/abs/2406.08829
作者: Jiacong Hu,Jingwen Ye,Zunlei Feng,Jiazhen Yang,Shunyu Liu,Xiaotian Yu,Lingxiang Jia,Mingli Song
关键词: Convolutional Neural Networks, Convolutional Neural, posing significant security, significant security concerns, Neural Networks
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Convolutional Neural Networks (CNNs) are well-known for their vulnerability to adversarial attacks, posing significant security concerns. In response to these threats, various defense methods have emerged to bolster the model’s robustness. However, most existing methods either focus on learning from adversarial perturbations, leading to overfitting to the adversarial examples, or aim to eliminate such perturbations during inference, inevitably increasing computational burdens. Conversely, clean training, which strengthens the model’s robustness by relying solely on clean examples, can address the aforementioned issues. In this paper, we align with this methodological stream and enhance its generalizability to unknown adversarial examples. This enhancement is achieved by scrutinizing the behavior of latent features within the network. Recognizing that a correct prediction relies on the correctness of the latent feature’s pattern, we introduce a novel and effective Feature Pattern Consistency Constraint (FPCC) method to reinforce the latent feature’s capacity to maintain the correct feature pattern. Specifically, we propose Spatial-wise Feature Modification and Channel-wise Feature Selection to enhance latent features. Subsequently, we employ the Pattern Consistency Loss to constrain the similarity between the feature pattern of the latent features and the correct feature pattern. Our experiments demonstrate that the FPCC method empowers latent features to uphold correct feature patterns even in the face of adversarial examples, resulting in inherent adversarial robustness surpassing state-of-the-art models.
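
The idea of constraining latent features toward a correct per-class pattern can be sketched as a cosine-similarity loss; the exact form of the Pattern Consistency Loss is not given in the abstract, so `pattern_consistency_loss` below is a hypothetical simplification.

```python
import numpy as np

def pattern_consistency_loss(features, class_patterns, labels):
    # Penalize features whose channel-wise pattern drifts away from their
    # class's reference pattern: loss = mean(1 - cosine similarity).
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    p = class_patterns[labels]
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    return float(np.mean(1.0 - np.sum(f * p, axis=1)))

patterns = np.eye(3, 8)                  # toy "correct pattern" per class, 8 channels
aligned = patterns[[0, 1, 2]] + 0.01     # features close to their class pattern
shuffled = patterns[[1, 2, 0]] + 0.01    # features matching the wrong class
low = pattern_consistency_loss(aligned, patterns, np.array([0, 1, 2]))
high = pattern_consistency_loss(shuffled, patterns, np.array([0, 1, 2]))
```

Adversarially perturbed features that break the pattern would, like `shuffled`, incur a large loss during clean training.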

[CV-105] Computer vision-based model for detecting turning lane features on Florida's public roadways

链接: https://arxiv.org/abs/2406.08822
作者: Richard Boadu Antwi,Samuel Takyi,Kimollo Michael,Alican Karaer,Eren Erman Ozguven,Ren Moses,Maxim A. Dulebenets,Thobias Sando
关键词: critical to transportation, transportation agencies, current roadway geometry, data collection, data
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Efficient and current roadway geometry data collection is critical to transportation agencies in road planning, maintenance, design, and rehabilitation. Data collection methods are divided into land-based and aerial-based. Land-based methods for extensive highway networks are tedious, costly, and pose safety risks. Therefore, there is a need for efficient, safe, and economical data acquisition methodologies. The rise of computer vision and object detection technologies has made automated extraction of roadway geometry features feasible. This study detects roadway features on Florida's public roads from high-resolution aerial images using AI. The developed model achieved an average accuracy of 80.4 percent when compared with ground truth data. The extracted roadway geometry data can be integrated with crash and traffic data to provide valuable insights to policymakers and roadway users.

[CV-106] ToSA: Token Selective Attention for Efficient Vision Transformers

链接: https://arxiv.org/abs/2406.08816
作者: Manish Kumar Singh,Rajeev Yasarla,Hong Cai,Mingu Lee,Fatih Porikli
关键词: selective attention approach, token selective attention, attention approach, skip a transformer, attention maps
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at CVPRW 2024

点击查看摘要

Abstract:In this paper, we propose a novel token selective attention approach, ToSA, which can identify tokens that need to be attended as well as those that can skip a transformer layer. More specifically, a token selector parses the current attention maps and predicts the attention maps for the next layer, which are then used to select the important tokens that should participate in the attention operation. The remaining tokens simply bypass the next layer and are concatenated with the attended ones to re-form a complete set of tokens. In this way, we reduce the quadratic computation and memory costs as fewer tokens participate in self-attention while maintaining the features for all the image patches throughout the network, which allows it to be used for dense prediction tasks. Our experiments show that by applying ToSA, we can significantly reduce computation costs while maintaining accuracy on the ImageNet classification benchmark. Furthermore, we evaluate on the dense prediction task of monocular depth estimation on NYU Depth V2, and show that we can achieve similar depth prediction accuracy using a considerably lighter backbone with ToSA.
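
The token-selection mechanism can be sketched in a few lines: keep the top-k tokens by a predicted importance score, run self-attention only on those, and let the rest bypass the layer unchanged. This is a simplified stand-in for ToSA's learned token selector, which predicts the next layer's attention maps.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def token_selective_attention(tokens, importance, keep_ratio=0.5):
    # Attend only over the top-k important tokens; the remaining tokens
    # bypass the layer unchanged, so the full token set is preserved.
    N, C = tokens.shape
    k = max(1, int(N * keep_ratio))
    keep = np.argsort(importance)[-k:]        # indices of attended tokens
    x = tokens[keep]
    attn = softmax(x @ x.T / np.sqrt(C))      # plain self-attention weights
    out = tokens.copy()                       # bypassed tokens pass through
    out[keep] = attn @ x
    return out, keep

rng = np.random.default_rng(0)
tokens = rng.standard_normal((16, 8))
importance = rng.random(16)
out, keep = token_selective_attention(tokens, importance)
```

Because every token survives to the output, dense prediction heads downstream still see features for all image patches.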

[CV-107] Skim then Focus: Integrating Contextual and Fine-grained Views for Repetitive Action Counting

链接: https://arxiv.org/abs/2406.08814
作者: Zhengqi Zhao,Xiaohu Huang,Hao Zhou,Kun Yao,Errui Ding,Jingdong Wang,Xinggang Wang,Wenyu Liu,Bin Feng
关键词: action, repetitive actions, branch, focus branch, target action
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 13 pages, 9 figures

点击查看摘要

Abstract:The key to action counting is accurately locating each video’s repetitive actions. Instead of estimating the probability of each frame belonging to an action directly, we propose a dual-branch network, i.e., SkimFocusNet, working in a two-step manner. The model draws inspiration from empirical observations indicating that humans typically engage in coarse skimming of entire sequences to grasp the general action pattern initially, followed by a finer, frame-by-frame focus to determine if it aligns with the target action. Specifically, SkimFocusNet incorporates a skim branch and a focus branch. The skim branch scans the global contextual information throughout the sequence to identify potential target actions for guidance. Subsequently, the focus branch utilizes the guidance to diligently identify repetitive actions using a long-short adaptive guidance (LSAG) block. Additionally, we have observed that videos in existing datasets often feature only one type of repetitive action, which inadequately represents real-world scenarios. To more accurately describe real-life situations, we establish the Multi-RepCount dataset, which includes videos containing multiple repetitive motions. On Multi-RepCount, our SkimFocusNet can perform specified action counting, that is, counting a particular action type by referencing an exemplary video. This capability substantially exhibits the robustness of our method. Extensive experiments demonstrate that SkimFocusNet achieves state-of-the-art performance with significant improvements. We also conduct a thorough ablation study to evaluate the network components. The source code will be published upon acceptance.

[CV-108] Few-Shot Anomaly Detection via Category-Agnostic Registration Learning

链接: https://arxiv.org/abs/2406.08810
作者: Chaoqin Huang,Haoyan Guan,Aofan Jiang,Yanfeng Wang,Michael Spratling,Xinchao Wang,Ya Zhang
关键词: existing anomaly detection, normal images, anomaly detection, existing anomaly, categories
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Most existing anomaly detection methods require a dedicated model for each category. Such a paradigm, despite its promising results, is computationally expensive and inefficient, thereby failing to meet the requirements for real-world applications. Inspired by how humans detect anomalies, by comparing a query image to known normal ones, this paper proposes a novel few-shot anomaly detection (FSAD) framework. Using a training set of normal images from various categories, registration, aiming to align normal images of the same categories, is leveraged as the proxy task for self-supervised category-agnostic representation learning. At test time, an image and its corresponding support set, consisting of a few normal images from the same category, are supplied, and anomalies are identified by comparing the registered features of the test image to its corresponding support image features. Such a setup enables the model to generalize to novel test categories. It is, to our best knowledge, the first FSAD method that requires no model fine-tuning for novel categories: enabling a single model to be applied to all categories. Extensive experiments demonstrate the effectiveness of the proposed method. Particularly, it improves the current state-of-the-art for FSAD by 11.3% and 8.3% on the MVTec and MPDD benchmarks, respectively. The source code is available at this https URL.
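
At test time, the comparison of registered query features against the few-shot support set reduces to a nearest-neighbor distance; a minimal sketch (ignoring the registration network itself, which is the paper's contribution):

```python
import numpy as np

def anomaly_score(test_feat, support_feats):
    # Score a test feature by its distance to the nearest registered normal
    # (support) feature; a larger score means more anomalous.
    d = np.linalg.norm(support_feats - test_feat, axis=1)
    return float(d.min())

rng = np.random.default_rng(0)
support = rng.standard_normal((20, 32))              # a few normal examples
normal_query = support[3] + 0.01 * rng.standard_normal(32)
anomalous_query = support[3] + 5.0                   # far from every normal example
s_normal = anomaly_score(normal_query, support)
s_anom = anomaly_score(anomalous_query, support)
```

Because nothing here is category-specific, the same scoring applies unchanged to novel test categories, which is the point of the category-agnostic setup.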

[CV-109] Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation

链接: https://arxiv.org/abs/2406.08801
作者: Mingwang Xu,Hui Li,Qingkun Su,Hanlin Shang,Liwei Zhang,Ce Liu,Jingdong Wang,Luc Van Gool,Yao Yao,Siyu Zhu
关键词: experienced significant advancements, portrait image animation, speech audio input, dynamic portraits, driven by speech
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 20 pages

点击查看摘要

Abstract:The field of portrait image animation, driven by speech audio input, has experienced significant advancements in the generation of realistic and dynamic portraits. This research delves into the complexities of synchronizing facial movements and creating visually appealing, temporally consistent animations within the framework of diffusion-based methodologies. Moving away from traditional paradigms that rely on parametric models for intermediate facial representations, our innovative approach embraces the end-to-end diffusion paradigm and introduces a hierarchical audio-driven visual synthesis module to enhance the precision of alignment between audio inputs and visual outputs, encompassing lip, expression, and pose motion. Our proposed network architecture seamlessly integrates diffusion-based generative models, a UNet-based denoiser, temporal alignment techniques, and a reference network. The proposed hierarchical audio-driven visual synthesis offers adaptive control over expression and pose diversity, enabling more effective personalization tailored to different identities. Through a comprehensive evaluation that incorporates both qualitative and quantitative analyses, our approach demonstrates obvious enhancements in image and video quality, lip synchronization precision, and motion diversity. Further visualization and access to the source code can be found at: this https URL.

[CV-110] FouRA: Fourier Low Rank Adaptation

链接: https://arxiv.org/abs/2406.08798
作者: Shubhankar Borse,Shreya Kadambi,Nilesh Prasad Pandey,Kartikeya Bhardwaj,Viswanath Ganapathy,Sweta Priyadarshi,Risheek Garrepalli,Rafael Esteves,Munawar Hayat,Fatih Porikli
关键词: observed training samples, efficiently fine-tuning large, diffusion models lack, fine-tuning large models, models lack diversity
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:While Low-Rank Adaptation (LoRA) has proven beneficial for efficiently fine-tuning large models, LoRA fine-tuned text-to-image diffusion models lack diversity in the generated images, as the model tends to copy data from the observed training samples. This effect becomes more pronounced at higher values of adapter strength and for adapters with higher ranks which are fine-tuned on smaller datasets. To address these challenges, we present FouRA, a novel low-rank method that learns projections in the Fourier domain along with learning a flexible input-dependent adapter rank selection strategy. Through extensive experiments and analysis, we show that FouRA successfully solves the problems related to data copying and distribution collapse while significantly improving the generated image quality. We demonstrate that FouRA enhances the generalization of fine-tuned models thanks to its adaptive rank selection. We further show that the learned projections in the frequency domain are decorrelated and prove effective when merging multiple adapters. While FouRA is motivated for vision tasks, we also demonstrate its merits for language tasks on the GLUE benchmark.
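
The core idea of learning the low-rank update in the Fourier domain can be sketched as FFT, low-rank projection, inverse FFT, added to a frozen weight path. The layout below (which axis is transformed, taking the real part after the inverse FFT) is an assumption for illustration, not FouRA's exact architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
d, rank = 16, 4
W = rng.standard_normal((d, d)) * 0.1     # frozen base weight
A = rng.standard_normal((rank, d)) * 0.1  # trainable down-projection
B = np.zeros((d, rank))                   # trainable up-projection, zero-initialized

def foura_forward(x, scale=1.0):
    # Low-rank update applied in the frequency domain:
    # FFT -> down/up projection -> inverse FFT, added to the frozen path.
    xf = np.fft.fft(x)
    delta = np.fft.ifft(B @ (A @ xf)).real
    return W @ x + scale * delta

x = rng.standard_normal(d)
base = W @ x
out_zero = foura_forward(x)               # B = 0, so the adapter is a no-op
B = rng.standard_normal((d, rank)) * 0.1  # after "training", B is non-zero
out_tuned = foura_forward(x)
```

As with standard LoRA, zero-initializing `B` guarantees the adapter starts as an identity on the residual path, and `scale` plays the role of adapter strength.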

[CV-111] BEVSpread: Spread Voxel Pooling for Bird's-Eye-View Representation in Vision-based Roadside 3D Object Detection

链接: https://arxiv.org/abs/2406.08785
作者: Wenjie Wang,Yehao Lu,Guangcong Zheng,Shuigen Zhan,Xiaoqing Ye,Zichang Tan,Jingdong Wang,Gaoang Wang,Xi Li
关键词: autonomous driving domain, expanding perception range, attracted rising attention, encompasses inherent advantages, reducing blind spots
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Vision-based roadside 3D object detection has attracted rising attention in the autonomous driving domain, since it encompasses inherent advantages in reducing blind spots and expanding the perception range. However, previous work mainly focuses on accurately estimating depth or height for 2D-to-3D mapping, ignoring the position approximation error in the voxel pooling process. Inspired by this insight, we propose a novel voxel pooling strategy to reduce such error, dubbed BEVSpread. Specifically, instead of bringing the image features contained in a frustum point to a single BEV grid, BEVSpread considers each frustum point as a source and spreads the image features to the surrounding BEV grids with adaptive weights. To achieve superior propagation performance, a specific weight function is designed to dynamically control the decay speed of the weights according to distance and depth. Aided by customized CUDA parallel acceleration, BEVSpread achieves comparable inference time to the original voxel pooling. Extensive experiments on two large-scale roadside benchmarks demonstrate that, as a plug-in, BEVSpread can significantly improve the performance of existing frustum-based BEV methods by a large margin of (1.12, 5.26, 3.01) AP for vehicles, pedestrians and cyclists.
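
The spread pooling idea is essentially bilinear scattering: rather than snapping a frustum point's feature to one BEV cell, distribute it over the surrounding cells with weights that decay with distance. A 2D sketch with plain bilinear weights (the paper's weight function also depends on depth, which is omitted here):

```python
import numpy as np

def spread_voxel_pooling(points, feats, grid_size):
    # Spread each point's feature to the 4 surrounding BEV cells with
    # bilinear weights instead of assigning it to a single cell.
    H, W = grid_size
    C = feats.shape[1]
    bev = np.zeros((H, W, C))
    for (x, y), f in zip(points, feats):
        x0, y0 = int(np.floor(x)), int(np.floor(y))
        for dx in (0, 1):
            for dy in (0, 1):
                xi, yi = x0 + dx, y0 + dy
                if 0 <= xi < H and 0 <= yi < W:
                    w = (1 - abs(x - xi)) * (1 - abs(y - yi))  # decays with distance
                    bev[xi, yi] += w * f
    return bev

points = np.array([[2.5, 3.5]])          # a point exactly between 4 cells
feats = np.array([[1.0, 2.0]])
bev = spread_voxel_pooling(points, feats, (8, 8))
```

Since the weights per point sum to one (away from the border), total feature mass is preserved while the position approximation error shrinks.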

[CV-112] ALINA: Advanced Line Identification and Notation Algorithm

链接: https://arxiv.org/abs/2406.08775
作者: Mohammed Abdul Hafeez Khan,Parth Ganeriwala,Siddhartha Bhattacharyya,Natasha Neogi,Raja Muthalagu
关键词: supervised machine learning, machine learning algorithms, machine learning, Advanced Line Identification, supervised machine
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Paper has been accepted to The 3rd CVPR Workshop on Vision Datasets Understanding, 2024

点击查看摘要

Abstract:Labels are the cornerstone of supervised machine learning algorithms. Most visual recognition methods are fully supervised, using bounding boxes or pixel-wise segmentations for object localization. Traditional labeling methods, such as crowd-sourcing, are prohibitive due to cost, data privacy, amount of time, and potential errors on large datasets. To address these issues, we propose a novel annotation framework, Advanced Line Identification and Notation Algorithm (ALINA), which can be used for labeling taxiway datasets that consist of different camera perspectives and variable weather attributes (sunny and cloudy). Additionally, the CIRCular threshoLd pixEl Discovery And Traversal (CIRCLEDAT) algorithm has been proposed, which is an integral step in determining the pixels corresponding to taxiway line markings. Once the pixels are identified, ALINA generates corresponding pixel coordinate annotations on the frame. Using this approach, 60,249 frames from the taxiway dataset, AssistTaxi have been labeled. To evaluate the performance, a context-based edge map (CBEM) set was generated manually based on edge features and connectivity. The detection rate after testing the annotated labels with the CBEM set was recorded as 98.45%, attesting its dependability and effectiveness.

[CV-113] DenoiseReID: Denoising Model for Representation Learning of Person Re-Identification

链接: https://arxiv.org/abs/2406.08773
作者: Zhengrui Xu,Guan’an Wang,Xiaowen Huang,Jitao Sang
关键词: Model for Representation, Representation Learning, Person Re-Identification, Denoising Model, joint feature extraction
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In this paper, we propose a novel Denoising Model for Representation Learning and take Person Re-Identification (ReID) as a benchmark task, named DenoiseReID, to improve feature discriminability with joint feature extraction and denoising. In the deep learning epoch, backbones consisting of cascaded embedding layers (e.g. convolutions or transformers) that progressively extract useful features have become popular. We first view each embedding layer in a backbone as a denoising layer, processing the cascaded embedding layers as if we are recursively denoising features step-by-step. This unifies the frameworks of feature extraction and feature denoising, where the former progressively embeds features from low-level to high-level, and the latter recursively denoises features step-by-step. Then we design a novel Feature Extraction and Feature Denoising Fusion Algorithm (FEFDFA) and theoretically demonstrate its equivalence before and after fusion. FEFDFA merges parameters of the denoising layers into existing embedding layers, thus making feature denoising computation-free. The algorithm enjoys two advantages: 1) it is a computation-free and label-free plugin for incrementally improving ReID features; 2) it is complementary to the label if the label is available. Experimental results on 4 ReID datasets and various backbones (transformers and convolutions) show the stability and impressive improvements of our method. We also extend the proposed method to large-scale (ImageNet) and fine-grained (e.g. CUB200) classification tasks, where similar improvements are proven.
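
The parameter-merging trick is easiest to see in the linear case: if a denoising step is itself linear, it folds exactly into the preceding embedding layer's weights, so denoising adds no inference cost. A minimal sketch under that linearity assumption (the names `D` and `forward_two_pass` are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W = rng.standard_normal((d, d))        # existing embedding layer
D = 0.1 * rng.standard_normal((d, d))  # hypothetical linear denoising step

def forward_two_pass(x):
    y = W @ x                          # embed
    return y + D @ y                   # then denoise: y + D y = (I + D) W x

# Fold the denoiser into the embedding weights: one matmul at inference.
W_merged = (np.eye(d) + D) @ W

x = rng.standard_normal(d)
```

The merged weight reproduces the two-pass computation exactly, which is the "equivalence before and after fusion" the abstract refers to, here in its simplest linear form.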

[CV-114] MMFakeBench: A Mixed-Source Multimodal Misinformation Detection Benchmark for LVLMs

链接: https://arxiv.org/abs/2406.08772
作者: Xuannan Liu,Zekun Li,Peipei Li,Shuhan Xia,Xing Cui,Linzhi Huang,Huaibo Huang,Weihong Deng,Zhaofeng He
关键词: assume a single, insufficient for real-world, real-world scenarios, scenarios where multiple, forgery sources coexist
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Current multimodal misinformation detection (MMD) methods often assume a single source and type of forgery for each sample, which is insufficient for real-world scenarios where multiple forgery sources coexist. The lack of a benchmark for mixed-source misinformation has hindered progress in this field. To address this, we introduce MMFakeBench, the first comprehensive benchmark for mixed-source MMD. MMFakeBench includes 3 critical sources: textual veracity distortion, visual veracity distortion, and cross-modal consistency distortion, along with 12 sub-categories of misinformation forgery types. We further conduct an extensive evaluation of 6 prevalent detection methods and 15 large vision-language models (LVLMs) on MMFakeBench under a zero-shot setting. The results indicate that current methods struggle under this challenging and realistic mixed-source MMD setting. Additionally, we propose an innovative unified framework, which integrates rationales, actions, and tool-use capabilities of LVLM agents, significantly enhancing accuracy and generalization. We believe this study will catalyze future research into more realistic mixed-source multimodal misinformation and provide a fair evaluation of misinformation detection methods.

[CV-115] Gaussian-Forest: Hierarchical-Hybrid 3D Gaussian Splatting for Compressed Scene Modeling

链接: https://arxiv.org/abs/2406.08759
作者: Fengyi Zhang,Tianjun Zhang,Lin Zhang,Helen Huang,Yadan Luo
关键词: Gaussian Splatting, renders through rasterization, novel-view synthesis, synthesis has recently, recently witnessed
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
*备注:

点击查看摘要

Abstract:The field of novel-view synthesis has recently witnessed the emergence of 3D Gaussian Splatting, which represents scenes in a point-based manner and renders through rasterization. This methodology, in contrast to Radiance Fields that rely on ray tracing, demonstrates superior rendering quality and speed. However, the explicit and unstructured nature of 3D Gaussians poses a significant storage challenge, impeding its broader application. To address this challenge, we introduce the Gaussian-Forest modeling framework, which hierarchically represents a scene as a forest of hybrid 3D Gaussians. Each hybrid Gaussian retains its unique explicit attributes while sharing implicit ones with its sibling Gaussians, thus optimizing parameterization with significantly fewer variables. Moreover, adaptive growth and pruning strategies are designed, ensuring detailed representation in complex regions and a notable reduction in the number of required Gaussians. Extensive experiments demonstrate that Gaussian-Forest not only maintains comparable speed and quality but also achieves a compression rate surpassing 10 times, marking a significant advancement in efficient scene modeling. Codes are available at this https URL.
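A rough back-of-envelope sketch of why sharing implicit attributes among sibling Gaussians compresses storage; all sizes and the tree fan-out below are invented for illustration and do not reflect the paper's actual parameterization:

```python
# Hypothetical attribute split: a few explicit values kept per Gaussian,
# a larger implicit feature vector shared once per tree node.
EXPLICIT = 4    # e.g. position + opacity, per Gaussian
IMPLICIT = 48   # e.g. appearance features, shared among siblings

def params_flat(n_gaussians):
    """Every Gaussian stores all attributes (plain 3D Gaussian Splatting)."""
    return n_gaussians * (EXPLICIT + IMPLICIT)

def params_forest(n_gaussians, siblings_per_node=8):
    """Explicit attributes per Gaussian, implicit ones once per tree node."""
    n_nodes = -(-n_gaussians // siblings_per_node)  # ceiling division
    return n_gaussians * EXPLICIT + n_nodes * IMPLICIT

n = 1_000_000
ratio = params_flat(n) / params_forest(n)
assert ratio > 5  # sharing shrinks storage several-fold under these numbers
```

Under these made-up numbers a million Gaussians drop from 52M to 10M stored values, which is the flavor of saving the reported >10x compression rate relies on (together with pruning).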

[CV-116] Comparative Analysis of Deep Convolutional Neural Networks for Detecting Medical Image Deepfakes

链接: https://arxiv.org/abs/2406.08758
作者: Abdel Rahman Alsabbagh,Omar Al-Kadi
关键词: Generative Adversarial Networks, Generative Adversarial, Convolutional Neural Network, exhibited noteworthy advancements, Adversarial Networks
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 11 pages, 7 figures, 4 tables

点击查看摘要

Abstract:Generative Adversarial Networks (GANs) have exhibited noteworthy advancements across various applications, including medical imaging. While numerous state-of-the-art Deep Convolutional Neural Network (DCNN) architectures are renowned for their proficient feature extraction, this paper investigates their efficacy in the context of medical image deepfake detection. The primary objective is to effectively distinguish real from tampered or manipulated medical images by employing a comprehensive evaluation of 13 state-of-the-art DCNNs. Performance is assessed across diverse evaluation metrics, encompassing considerations of time efficiency and computational resource requirements. Our findings reveal that ResNet50V2 excels in precision and specificity, whereas DenseNet169 is distinguished by its accuracy, recall, and F1-score. We investigate the specific scenarios in which one model would be more favorable than another. Additionally, MobileNetV3Large offers competitive performance, emerging as the swiftest among the considered DCNN models while maintaining a relatively small parameter count. We also assess the latent space separability quality across the examined DCNNs, with the DenseNet and EfficientNet model families showing superior separability, which suggests a deeper understanding of medical image deepfakes. The experimental analysis in this research contributes valuable insights to the field of deepfake image detection in the medical imaging domain.

[CV-117] Batch-Instructed Gradient for Prompt Evolution: Systematic Prompt Optimization for Enhanced Text-to-Image Synthesis

链接: https://arxiv.org/abs/2406.08713
作者: Xinrui Yang,Zhuohan Wang,Anthony Hu
关键词: shown remarkable progress, shown remarkable, remarkable progress, Stable Diffusion model, prompts
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Text-to-image models have shown remarkable progress in generating high-quality images from user-provided prompts. Despite this, the quality of these images varies due to the models’ sensitivity to human language nuances. With advancements in large language models, there are new opportunities to enhance prompt design for image generation tasks. Existing research primarily focuses on optimizing prompts for direct interaction, while less attention is given to scenarios involving intermediary agents, like the Stable Diffusion model. This study proposes a Multi-Agent framework to optimize input prompts for text-to-image generation models. Central to this framework is a prompt generation mechanism that refines initial queries using dynamic instructions, which evolve through iterative performance feedback. High-quality prompts are then fed into a state-of-the-art text-to-image model. A professional prompts database serves as a benchmark to guide the instruction modifier towards generating high-caliber prompts. A scoring system evaluates the generated images, and an LLM generates new instructions based on calculated gradients. This iterative process is managed by the Upper Confidence Bound (UCB) algorithm and assessed using the Human Preference Score version 2 (HPS v2). Preliminary ablation studies highlight the effectiveness of various system components and suggest areas for future improvements.
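The iterative loop the abstract describes as "managed by the Upper Confidence Bound (UCB) algorithm" can be sketched as a standard UCB1 bandit over candidate instruction modifiers. The candidate instructions, the reward model (a stand-in for the HPS v2 scorer), and all constants below are invented for illustration and are not taken from the paper:

```python
import math
import random

def ucb_select(counts, rewards, total, c=0.5):
    """Pick the arm maximizing mean reward plus an exploration bonus."""
    best, best_score = 0, float("-inf")
    for i, (n, r) in enumerate(zip(counts, rewards)):
        if n == 0:
            return i  # try every candidate instruction at least once
        score = r / n + c * math.sqrt(math.log(total) / n)
        if score > best_score:
            best, best_score = i, score
    return best

random.seed(0)
true_quality = [0.2, 0.8, 0.4]   # hidden image-quality score per instruction
counts = [0, 0, 0]
rewards = [0.0, 0.0, 0.0]
for t in range(1, 301):
    arm = ucb_select(counts, rewards, t)
    noisy_score = true_quality[arm] + random.uniform(-0.05, 0.05)
    counts[arm] += 1
    rewards[arm] += noisy_score

# The best instruction ends up being selected most of the time.
assert counts[1] > counts[0] and counts[1] > counts[2]
```

The exploration bonus shrinks as an instruction is tried more often, so the loop keeps probing weak candidates occasionally while concentrating image-generation budget on the instruction whose prompts score best.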

[CV-118] mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus

链接: https://arxiv.org/abs/2406.08707
作者: Matthieu Futeral,Armel Zebaze,Pedro Ortiz Suarez,Julien Abadji,Rémi Lacroix,Cordelia Schmid,Rachel Bawden,Benoît Sagot
关键词: large amount, Multimodal Large Language, Large Language Models, Multimodal Large, Large Language
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注: Preprint. Under review

点击查看摘要

Abstract:Multimodal Large Language Models (mLLMs) are trained on a large amount of text-image data. While most mLLMs are trained on caption-like data only, Alayrac et al. [2022] showed that additionally training them on interleaved sequences of text and images can lead to the emergence of in-context learning capabilities. However, the dataset they used, M3W, is not public and is only in English. There have been attempts to reproduce their results, but the released datasets are English-only. In contrast, current multilingual and multimodal datasets are either composed of caption-like data only, are medium-scale, or are fully private. This limits mLLM research for the 7,000 other languages spoken in the world. We therefore introduce mOSCAR, to the best of our knowledge the first large-scale multilingual and multimodal document corpus crawled from the web. It covers 163 languages, 315M documents, 214B tokens and 1.2B images. We carefully conduct a set of filtering and evaluation steps to make sure mOSCAR is sufficiently safe, diverse and of good quality. We additionally train two types of multilingual models to prove the benefits of mOSCAR: (1) a model trained on a subset of mOSCAR and captioning data and (2) a model trained on captioning data only. The model additionally trained on mOSCAR shows a strong boost in few-shot learning performance across various multilingual image-text tasks and benchmarks, confirming previous findings for English-only mLLMs.

[CV-119] VLind-Bench: Measuring Language Priors in Large Vision-Language Models

链接: https://arxiv.org/abs/2406.08702
作者: Kang-il Lee,Minbeom Kim,Seunghyun Yoon,Minsung Kim,Dongryeol Lee,Hyukhun Koh,Kyomin Jung
关键词: Large Vision-Language Models, demonstrated outstanding performance, Large Vision-Language, language priors, language
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Large Vision-Language Models (LVLMs) have demonstrated outstanding performance across various multimodal tasks. However, they suffer from a problem known as language prior, where responses are generated based solely on textual patterns while disregarding image information. Addressing the issue of language prior is crucial, as it can lead to undesirable biases or hallucinations when dealing with images that are out of training distribution. Despite its importance, current methods for accurately measuring language priors in LVLMs are poorly studied. Although existing benchmarks based on counterfactual or out-of-distribution images can partially be used to measure language priors, they fail to disentangle language priors from other confounding factors. To this end, we propose a new benchmark called VLind-Bench, which is the first benchmark specifically designed to measure the language priors, or blindness, of LVLMs. It not only includes tests on counterfactual images to assess language priors but also involves a series of tests to evaluate more basic capabilities such as commonsense knowledge, visual perception, and commonsense biases. For each instance in our benchmark, we ensure that all these basic tests are passed before evaluating the language priors, thereby minimizing the influence of other factors on the assessment. The evaluation and analysis of recent LVLMs in our benchmark reveal that almost all models exhibit a significant reliance on language priors, presenting a strong challenge in the field.

[CV-120] UnO: Unsupervised Occupancy Fields for Perception and Forecasting

链接: https://arxiv.org/abs/2406.08691
作者: Ben Agro,Quinlan Sykora,Sergio Casas,Thomas Gilles,Raquel Urtasun
关键词: future state, Perceiving the world, Perceiving, world, critical task
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Perceiving the world and forecasting its future state is a critical task for self-driving. Supervised approaches leverage annotated object labels to learn a model of the world – traditionally with object detections and trajectory predictions, or temporal bird’s-eye-view (BEV) occupancy fields. However, these annotations are expensive and typically limited to a set of predefined categories that do not cover everything we might encounter on the road. Instead, we learn to perceive and forecast a continuous 4D (spatio-temporal) occupancy field with self-supervision from LiDAR data. This unsupervised world model can be easily and effectively transferred to downstream tasks. We tackle point cloud forecasting by adding a lightweight learned renderer and achieve state-of-the-art performance in Argoverse 2, nuScenes, and KITTI. To further showcase its transferability, we fine-tune our model for BEV semantic occupancy forecasting and show that it outperforms the fully supervised state-of-the-art, especially when labeled data is scarce. Finally, when compared to prior state-of-the-art on spatio-temporal geometric occupancy prediction, our 4D world model achieves a much higher recall of objects from classes relevant to self-driving.

[CV-121] Vivid-ZOO: Multi-View Video Generation with Diffusion Model

链接: https://arxiv.org/abs/2406.08659
作者: Bing Li,Cheng Zheng,Wenxuan Zhu,Jinjie Mai,Biao Zhang,Peter Wonka,Bernard Ghanem
关键词: shown impressive performance, generation remains underexplored, multi-view, multi-view videos, remains underexplored
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Our project page is at this https URL

点击查看摘要

Abstract:While diffusion models have shown impressive performance in 2D image/video generation, diffusion-based Text-to-Multi-view-Video (T2MVid) generation remains underexplored. The new challenges posed by T2MVid generation lie in the lack of massive captioned multi-view videos and the complexity of modeling such multi-dimensional distribution. To this end, we propose a novel diffusion-based pipeline that generates high-quality multi-view videos centered around a dynamic 3D object from text. Specifically, we factor the T2MVid problem into viewpoint-space and time components. Such factorization allows us to combine and reuse layers of advanced pre-trained multi-view image and 2D video diffusion models to ensure multi-view consistency as well as temporal coherence for the generated multi-view videos, largely reducing the training cost. We further introduce alignment modules to align the latent spaces of layers from the pre-trained multi-view and the 2D video diffusion models, addressing the reused layers’ incompatibility that arises from the domain gap between 2D and multi-view data. In support of this and future research, we further contribute a captioned multi-view video dataset. Experimental results demonstrate that our method generates high-quality multi-view videos, exhibiting vivid motions, temporal coherence, and multi-view consistency, given a variety of text prompts.

[CV-122] TC-Bench: Benchmarking Temporal Compositionality in Text-to-Video and Image-to-Video Generation

链接: https://arxiv.org/abs/2406.08656
作者: Weixi Feng,Jiachen Li,Michael Saxon,Tsu-jui Fu,Wenhu Chen,William Yang Wang
关键词: unique challenges, Video generation, video generation models, image generation, videos
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Video generation has many unique challenges beyond those of image generation. The temporal dimension introduces extensive possible variations across frames, over which consistency and continuity may be violated. In this study, we move beyond evaluating simple actions and argue that generated videos should incorporate the emergence of new concepts and their relation transitions like in real-world videos as time progresses. To assess the Temporal Compositionality of video generation models, we propose TC-Bench, a benchmark of meticulously crafted text prompts, corresponding ground truth videos, and robust evaluation metrics. The prompts articulate the initial and final states of scenes, effectively reducing ambiguities for frame development and simplifying the assessment of transition completion. In addition, by collecting aligned real-world videos corresponding to the prompts, we expand TC-Bench’s applicability from text-conditional models to image-conditional ones that can perform generative frame interpolation. We also develop new metrics to measure the completeness of component transitions in generated videos, which demonstrate significantly higher correlations with human judgments than existing metrics. Our comprehensive experimental results reveal that most video generators achieve less than 20% of the compositional changes, highlighting enormous space for future improvement. Our analysis indicates that current video generation models struggle to interpret descriptions of compositional changes and synthesize various components across different time steps.

[CV-123] How to Distinguish AI-Generated Images from Authentic Photographs

链接: https://arxiv.org/abs/2406.08651
作者: Negar Kamali,Karyn Nakamura,Angelos Chatzimparmpas,Jessica Hullman,Matthew Groh
关键词: Firefly makes, Stable Diffusion, high level, level of photorealism, untrained humans
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: 54 pages, 189 Figures

点击查看摘要

Abstract:The high level of photorealism in state-of-the-art diffusion models like Midjourney, Stable Diffusion, and Firefly makes it difficult for untrained humans to distinguish between real photographs and AI-generated images. To address this problem, we designed a guide to help readers develop a more critical eye toward identifying artifacts, inconsistencies, and implausibilities that often appear in AI-generated images. The guide is organized into five categories of artifacts and implausibilities: anatomical, stylistic, functional, violations of physics, and sociocultural. For this guide, we generated 138 images with diffusion models, curated 9 images from social media, and curated 42 real photographs. These images showcase the kinds of cues that prompt suspicion towards the possibility that an image is AI-generated, and why it is often difficult to draw conclusions about an image’s provenance without any context beyond the pixels in an image. Human-perceptible artifacts are not always present in AI-generated images, but this guide reveals artifacts and implausibilities that often emerge. By drawing attention to these kinds of artifacts and implausibilities, we aim to better equip people to distinguish AI-generated images from real photographs in the future.

[CV-124] FSBI: Deepfakes Detection with Frequency Enhanced Self-Blended Images

链接: https://arxiv.org/abs/2406.08625
作者: Ahmed Abul Hasanaath,Hamzah Luqman,Raed Katib,Saeed Anwar
关键词: perfect manipulations undetectable, deepfakes detection tools, research have led, perfect manipulations, manipulations undetectable
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: The paper is under consideration at Pattern Recognition Letters

点击查看摘要

Abstract:Advances in deepfake research have led to the creation of almost perfect manipulations that are undetectable by human eyes and by some deepfake detection tools. Recently, several techniques have been proposed to differentiate deepfakes from realistic images and videos. This paper introduces a Frequency Enhanced Self-Blended Images (FSBI) approach for deepfake detection. The proposed approach utilizes Discrete Wavelet Transforms (DWT) to extract discriminative features from self-blended images (SBIs), which are then used to train a convolutional network architecture. The SBIs blend an image with itself after introducing several forgery artifacts in a copy of the image. This prevents the classifier from overfitting to specific artifacts by learning more generic representations. The blended images are then fed into the frequency feature extractor to detect artifacts that cannot be detected easily in the time domain. The proposed approach has been evaluated on the FF++ and Celeb-DF datasets, and the obtained results outperform state-of-the-art techniques under the cross-dataset evaluation protocol.
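The frequency-feature idea can be illustrated with the simplest DWT, a one-level 2D Haar transform in plain Python. The toy patch below is ours; the paper feeds such frequency bands to a CNN rather than inspecting them by hand:

```python
def haar_1d(row):
    """One Haar step: pairwise averages (low-pass) then differences (high-pass)."""
    avg = [(row[i] + row[i + 1]) / 2 for i in range(0, len(row), 2)]
    diff = [(row[i] - row[i + 1]) / 2 for i in range(0, len(row), 2)]
    return avg + diff

def haar_2d(image):
    """Apply the 1D Haar step to every row, then to every column."""
    rows = [haar_1d(r) for r in image]
    cols = [haar_1d(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]

patch = [
    [10, 10, 20, 20],
    [10, 10, 20, 20],
    [10, 10, 90, 20],   # one anomalous pixel, e.g. a blending seam
    [10, 10, 20, 20],
]
bands = haar_2d(patch)
# The three high-frequency quadrants are zero everywhere except around the
# anomaly, which is what makes such artifacts stand out in the DWT domain.
assert bands[2][2] == 0 and bands[3][3] == 17.5
```

Smooth regions produce zero detail coefficients, so a local manipulation that is hard to see in pixel space leaves an isolated spike in the detail bands.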

[CV-125] LayeredDoc: Domain Adaptive Document Restoration with a Layer Separation Approach

链接: https://arxiv.org/abs/2406.08610
作者: Maria Pilligua,Nil Biescas,Javier Vazquez-Corral,Josep Lladós,Ernest Valveny,Sanket Biswas
关键词: processing systems demands, systems demands robust, intelligent document processing, demands robust solutions, extensive retraining
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted to ICDAR 2024 (Athens, Greece) Workshop on Automatically Domain-Adapted and Personalized Document Analysis (ADAPDA)

点击查看摘要

Abstract:The rapid evolution of intelligent document processing systems demands robust solutions that adapt to diverse domains without extensive retraining. Traditional methods often falter with variable document types, leading to poor performance. To overcome these limitations, this paper introduces a text-graphic layer separation approach that enhances domain adaptability in document image restoration (DIR) systems. We propose LayeredDoc, which utilizes two layers of information: the first targets coarse-grained graphic components, while the second refines machine-printed textual content. This hierarchical DIR framework dynamically adjusts to the characteristics of the input document, facilitating effective domain adaptation. We evaluated our approach both qualitatively and quantitatively using a new real-world dataset, LayeredDocDB, developed for this study. Initially trained on a synthetically generated dataset, our model demonstrates strong generalization capabilities for the DIR task, offering a promising solution for handling variability in real-world data. Our code is accessible on GitHub.

[CV-126] FakeInversion: Learning to Detect Images from Unseen Text-to-Image Models by Inverting Stable Diffusion

链接: https://arxiv.org/abs/2406.08603
作者: George Cazenavette,Avneesh Sud,Thomas Leung,Ben Usman
关键词: GenAI systems, Stable Diffusion, potential for abuse, abuse of GenAI, task of detecting
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Project page: this https URL

点击查看摘要

Abstract:Due to the high potential for abuse of GenAI systems, the task of detecting synthetic images has recently become of great interest to the research community. Unfortunately, existing image-space detectors quickly become obsolete as new high-fidelity text-to-image models are developed at blinding speed. In this work, we propose a new synthetic image detector that uses features obtained by inverting an open-source pre-trained Stable Diffusion model. We show that these inversion features enable our detector to generalize well to unseen generators of high visual fidelity (e.g., DALL-E 3) even when the detector is trained only on lower fidelity fake images generated via Stable Diffusion. This detector achieves new state-of-the-art across multiple training and evaluation setups. Moreover, we introduce a new challenging evaluation protocol that uses reverse image search to mitigate stylistic and thematic biases in the detector evaluation. We show that the resulting evaluation scores align well with detectors’ in-the-wild performance, and release these datasets as public benchmarks for future research.

[CV-127] LLM-assisted Concept Discovery: Automatically Identifying and Explaining Neuron Functions

链接: https://arxiv.org/abs/2406.08572
作者: Nhat Hoang-Xuan,Minh Vu,My T. Thai
关键词: textual concept-based explanations, Providing textual concept-based, DNN model works, textual concept-based, importance in understanding
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Providing textual concept-based explanations for neurons in deep neural networks (DNNs) is of importance in understanding how a DNN model works. Prior works have associated concepts with neurons based on examples of concepts or a pre-defined set of concepts, thus limiting possible explanations to what the user expects, especially in discovering new concepts. Furthermore, defining the set of concepts requires manual work from the user, either by directly specifying them or collecting examples. To overcome these, we propose to leverage multimodal large language models for automatic and open-ended concept discovery. We show that, without a restricted set of pre-defined concepts, our method gives rise to novel interpretable concepts that are more faithful to the model’s behavior. To quantify this, we validate each concept by generating examples and counterexamples and evaluating the neuron’s response on this new set of images. Collectively, our method can discover concepts and simultaneously validate them, providing a credible automated tool to explain deep neural networks.

[CV-128] DiTFastAttn: Attention Compression for Diffusion Transformer Models

链接: https://arxiv.org/abs/2406.08552
作者: Zhihang Yuan,Pu Lu,Hanling Zhang,Xuefei Ning,Linfeng Zhang,Tianchen Zhao,Shengen Yan,Guohao Dai,Yu Wang
关键词: Diffusion Transformers, self-attention quadratic complexity, face computational challenges, computational challenges due, quadratic complexity
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Diffusion Transformers (DiT) excel at image and video generation but face computational challenges due to self-attention’s quadratic complexity. We propose DiTFastAttn, a novel post-training compression method to alleviate DiT’s computational bottleneck. We identify three key redundancies in the attention computation during DiT inference: 1. spatial redundancy, where many attention heads focus on local information; 2. temporal redundancy, with high similarity between neighboring steps’ attention outputs; 3. conditional redundancy, where conditional and unconditional inferences exhibit significant similarity. To tackle these redundancies, we propose three techniques: 1. Window Attention with Residual Caching to reduce spatial redundancy; 2. Temporal Similarity Reduction to exploit the similarity between steps; 3. Conditional Redundancy Elimination to skip redundant computations during conditional generation. To demonstrate the effectiveness of DiTFastAttn, we apply it to DiT, PixArt-Sigma for image generation tasks, and OpenSora for video generation tasks. Evaluation results show that for image generation, our method reduces up to 88% of the FLOPs and achieves up to 1.6x speedup at high resolution generation.
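The "temporal redundancy" technique can be sketched as follows: neighboring denoising steps produce near-identical attention inputs, so the previous step's output can be reused when the input has barely drifted. The toy attention function, tolerance, and inputs below are illustrative, not DiT code:

```python
calls = {"full": 0}

def expensive_attention(x):
    calls["full"] += 1
    return [v * 0.5 for v in x]  # stand-in for a real attention block

def cached_attention(x, cache, tol=1e-2):
    """Recompute only when the input drifts beyond `tol`; otherwise reuse."""
    if cache["x"] is not None:
        drift = max(abs(a - b) for a, b in zip(x, cache["x"]))
        if drift < tol:
            return cache["y"]
    y = expensive_attention(x)
    cache["x"], cache["y"] = list(x), y
    return y

cache = {"x": None, "y": None}
inputs = [[1.0, 2.0], [1.001, 2.001], [1.002, 2.0], [3.0, 4.0]]
outs = [cached_attention(x, cache) for x in inputs]
assert calls["full"] == 2  # only the first and last inputs were recomputed
```

The same reuse pattern applies to the paper's conditional redundancy: when the conditional and unconditional branches of classifier-free guidance are similar enough, one branch's result can stand in for the other.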

[CV-129] RVT-2: Learning Precise Manipulation from Few Demonstrations

链接: https://arxiv.org/abs/2406.08545
作者: Ankit Goyal,Valts Blukis,Jie Xu,Yijie Guo,Yu-Wei Chao,Dieter Fox
关键词: solve multiple, language instructions, build a robotic, robotic system, tasks requiring high
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to RSS 2024

点击查看摘要

Abstract:In this work, we study how to build a robotic system that can solve multiple 3D manipulation tasks given language instructions. To be useful in industrial and household domains, such a system should be capable of learning new tasks with few demonstrations and solving them precisely. Prior works, like PerAct and RVT, have studied this problem, however, they often struggle with tasks requiring high precision. We study how to make them more effective, precise, and fast. Using a combination of architectural and system-level improvements, we propose RVT-2, a multitask 3D manipulation model that is 6X faster in training and 2X faster in inference than its predecessor RVT. RVT-2 achieves a new state-of-the-art on RLBench, improving the success rate from 65% to 82%. RVT-2 is also effective in the real world, where it can learn tasks requiring high precision, like picking up and inserting plugs, with just 10 demonstrations. Visual results, code, and trained model are provided at: this https URL.

[CV-130] Adaptive Teaching with a Shared Classifier for Knowledge Distillation

链接: https://arxiv.org/abs/2406.08528
作者: Jaeyeon Jang,Young-Ik Kim,Jisu Lim,Hyeonseong Lee
关键词: student network, teacher network, less-parameterized student network, Knowledge distillation, overparameterized teacher network
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Knowledge distillation (KD) is a technique used to transfer knowledge from an overparameterized teacher network to a less-parameterized student network, thereby minimizing the incurred performance loss. KD methods can be categorized into offline and online approaches. Offline KD leverages a powerful pretrained teacher network, while online KD allows the teacher network to be adjusted dynamically to enhance the learning effectiveness of the student network. Recently, it has been discovered that sharing the classifier of the teacher network can significantly boost the performance of the student network with only a minimal increase in the number of network parameters. Building on these insights, we propose adaptive teaching with a shared classifier (ATSC). In ATSC, the pretrained teacher network self-adjusts to better align with the learning needs of the student network based on its capabilities, and the student network benefits from the shared classifier, enhancing its performance. Additionally, we extend ATSC to environments with multiple teachers. We conduct extensive experiments, demonstrating the effectiveness of the proposed KD method. Our approach achieves state-of-the-art results on the CIFAR-100 and ImageNet datasets in both single-teacher and multi-teacher scenarios, with only a modest increase in the number of required model parameters. The source code is publicly available at this https URL.
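For context, the generic distillation signal that methods such as ATSC build on is a Hinton-style KL term between temperature-softened class distributions. The sketch below shows only this standard loss; the shared-classifier mechanism itself is not reproduced, and all logits and the temperature are made up:

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return temperature ** 2 * kl

teacher = [8.0, 2.0, 1.0]
aligned = [7.5, 2.2, 0.9]     # student already close to the teacher
mismatched = [1.0, 6.0, 2.0]  # student disagrees with the teacher
assert kd_loss(aligned, teacher) < kd_loss(mismatched, teacher)
```

The temperature flattens the teacher's distribution so the student is also supervised by the relative probabilities of wrong classes, not just the argmax.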

[CV-131] Sagiri: Low Dynamic Range Image Enhancement with Generative Diffusion Prior

链接: https://arxiv.org/abs/2406.09389
作者: Baiang Li,Sizhuo Ma,Yanhong Zeng,Xiaogang Xu,Youqing Fang,Zhao Zhang,Jian Wang,Kai Chen
关键词: Capturing High Dynamic, High Dynamic Range, Capturing High, low bit-depth compression, skewed color distributions
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: this https URL

点击查看摘要

Abstract:Capturing High Dynamic Range (HDR) scenery using 8-bit cameras often suffers from over-/underexposure, loss of fine details due to low bit-depth compression, skewed color distributions, and strong noise in dark areas. Traditional LDR image enhancement methods primarily focus on color mapping, which enhances the visual representation by expanding the image’s color range and adjusting the brightness. However, these approaches fail to effectively restore content in dynamic range extremes, which are regions with pixel values close to 0 or 255. To address the full scope of challenges in HDR imaging and surpass the limitations of current models, we propose a novel two-stage approach. The first stage maps the color and brightness to an appropriate range while keeping the existing details, and the second stage utilizes a diffusion prior to generate content in dynamic range extremes lost during capture. This generative refinement module can also be used as a plug-and-play module to enhance and complement existing LDR enhancement models. The proposed method markedly improves the quality and details of LDR images, demonstrating superior performance through rigorous experimental validation. The project page is at this https URL

[CV-132] Instance-level quantitative saliency in multiple sclerosis lesion segmentation

链接: https://arxiv.org/abs/2406.09335
作者: Federico Spagnolo,Nataliia Molchanova,Roger Schaer,Meritxell Bach Cuadra,Mario Ocampo Pineda,Lester Melie-Garcia,Cristina Granziera,Vincent Andrearczyk,Adrien Depeursinge
关键词: describe models’ decision, models’ decision mechanisms, recent years, artificial intelligence, classification tasks
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In recent years, explainable methods for artificial intelligence (XAI) have tried to reveal and describe models’ decision mechanisms in the case of classification tasks. However, XAI for semantic segmentation and in particular for single instances has been little studied to date. Understanding the process underlying automatic segmentation of single instances is crucial to reveal what information was used to detect and segment a given object of interest. In this study, we proposed two instance-level explanation maps for semantic segmentation based on SmoothGrad and Grad-CAM++ methods. Then, we investigated their relevance for the detection and segmentation of white matter lesions (WML), a magnetic resonance imaging (MRI) biomarker in multiple sclerosis (MS). 687 patients diagnosed with MS for a total of 4043 FLAIR and MPRAGE MRI scans were collected at the University Hospital of Basel, Switzerland. Data were randomly split into training, validation and test sets to train a 3D U-Net for MS lesion segmentation. We observed 3050 true positive (TP), 1818 false positive (FP), and 789 false negative (FN) cases. We generated instance-level explanation maps for semantic segmentation, by developing two XAI methods based on SmoothGrad and Grad-CAM++. We investigated: 1) the distribution of gradients in saliency maps with respect to both input MRI sequences; 2) the model’s response in the case of synthetic lesions; 3) the amount of perilesional tissue needed by the model to segment a lesion. Saliency maps (based on SmoothGrad) in FLAIR showed positive values inside a lesion and negative in its neighborhood. Peak values of saliency maps generated for these four groups of volumes presented distributions that differ significantly from one another, suggesting a quantitative nature of the proposed saliency. Contextual information of 7mm around the lesion border was required for their segmentation.

[CV-133] Towards AI Lesion Tracking in PET/CT Imaging: A Siamese-based CNN Pipeline applied on PSMA PET/CT Scans

Link: https://arxiv.org/abs/2406.09327
Authors: Stefan P. Hein,Manuel Schultheiss,Andrei Gafita,Raphael Zaum,Farid Yagubbayli,Isabel Rauscher,Matthias Eiber,Franz Pfeiffer,Wolfgang A. Weber
Keywords: Assessing tumor response, Assessing tumor, Siamese CNN, main applications, PET
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
Comments: 25 pages, 9 figures, 3 tables

Abstract:Assessing tumor response to systemic therapies is one of the main applications of PET/CT. Routinely, only a small subset of index lesions out of multiple lesions is analyzed. However, this operator-dependent selection may bias the results due to possible significant inter-metastatic heterogeneity of response to therapy. Automated, AI-based approaches for lesion tracking hold promise in enabling the analysis of many more lesions and thus providing a better assessment of tumor response. This work introduces a Siamese CNN approach for lesion tracking between PET/CT scans. Our approach is applied on the laborious task of tracking a high number of bone lesions in full-body baseline and follow-up [68Ga]Ga- or [18F]F-PSMA PET/CT scans after two cycles of [177Lu]Lu-PSMA therapy of metastatic castration resistant prostate cancer patients. Data preparation includes lesion segmentation and affine registration. Our algorithm extracts suitable lesion patches and forwards them into a Siamese CNN trained to classify the lesion patch pairs as corresponding or non-corresponding lesions. Experiments have been performed with different input patch types and a Siamese network in 2D and 3D. The CNN model successfully learned to classify lesion assignments, reaching a lesion tracking accuracy of 83 % in its best configuration with an AUC = 0.91. For remaining lesions the pipeline accomplished a re-identification rate of 89 %. We showed that a CNN can facilitate the tracking of multiple lesions in PSMA PET/CT scans. Future clinical studies are needed to determine whether this improves the prediction of therapy outcomes.
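The core of a Siamese setup is that both patches pass through the same shared weights, and the pair is scored by comparing the resulting embeddings. A toy sketch of that shared-weight comparison (hand-made linear "branch" and 2-D patches, purely illustrative, not the paper's CNN):

```python
def embed(patch, weights):
    """Shared branch: the same linear map is applied to both patches."""
    return [sum(w * p for w, p in zip(row, patch)) for row in weights]

def pair_distance(patch_a, patch_b, weights):
    """Squared Euclidean distance between embeddings; small = same lesion."""
    ea, eb = embed(patch_a, weights), embed(patch_b, weights)
    return sum((a - b) ** 2 for a, b in zip(ea, eb))

W = [[1.0, 0.0], [0.0, 1.0]]  # illustrative weights only
same = pair_distance([0.9, 0.1], [1.0, 0.0], W)  # similar patches
diff = pair_distance([0.9, 0.1], [0.0, 1.0], W)  # dissimilar patches
print(same < diff)  # corresponding pair scores closer
```

In the paper the linear map is replaced by a trained 2D or 3D CNN and the distance by a learned classification head, but the weight sharing between branches is the same.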

[CV-134] Common and Rare Fundus Diseases Identification Using Vision-Language Foundation Model with Knowledge of Over 400 Diseases

Link: https://arxiv.org/abs/2406.09317
Authors: Meng Wang,Tian Lin,Kai Yu,Aidi Lin,Yuanyuan Peng,Lianyu Wang,Cheng Chen,Ke Zou,Huiyu Liang,Man Chen,Xue Yao,Meiqin Zhang,Binwei Huang,Chaoxin Zheng,Wei Chen,Yilong Luo,Yifan Chen,Jingcheng Wang,Yih Chung Tham,Dianbo Liu,Wendy Wong,Sahil Thakur,Beau Fenner,Yanda Meng,Yukun Zhou,Zehua Jiang,Minghui Qiu,Changqing Zhang,Xinjian Chen,Sophia Y. Wang,Cecilia S. Lee,Lucia Sobrin,Pearse A. Keane,Ching-Yu Cheng,Haoyu Chen,Huazhu Fu
Keywords: current retinal artificial, retinal artificial intelligence, artificial intelligence models, fundus diseases, limited category
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The current retinal artificial intelligence models were trained using data with a limited category of diseases and limited knowledge. In this paper, we present a retinal vision-language foundation model (RetiZero) with knowledge of over 400 fundus diseases. Specifically, we collected 341,896 fundus images paired with text descriptions from 29 publicly available datasets, 180 ophthalmic books, and online resources, encompassing over 400 fundus diseases across multiple countries and ethnicities. RetiZero achieved outstanding performance across various downstream tasks, including zero-shot retinal disease recognition, image-to-image retrieval, internal domain and cross-domain retinal disease classification, and few-shot fine-tuning. Notably, in the zero-shot scenario, RetiZero achieved a Top5 score of 0.8430 and 0.7561 on 15 and 52 fundus diseases respectively. In the image-retrieval task, RetiZero achieved a Top5 score of 0.9500 and 0.8860 on 15 and 52 retinal diseases respectively. Furthermore, clinical evaluations by ophthalmology experts from different countries demonstrate that RetiZero can achieve performance comparable to experienced ophthalmologists using zero-shot and image retrieval methods without requiring model retraining. These retinal disease identification capabilities strengthen our RetiZero foundation model for clinical implementation.

[CV-135] SR-CACO-2: A Dataset for Confocal Fluorescence Microscopy Image Super-Resolution

Link: https://arxiv.org/abs/2406.09168
Authors: Soufiane Belharbi,Mara KM Whitford,Phuong Hoang,Shakeeb Murtaza,Luke McCaffrey,Eric Granger
Keywords: Scanning confocal microscopy, Confocal fluorescence microscopy, biological processes, confocal microscopy, accessible and widely
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 23 pages, 13 figures

Abstract:Confocal fluorescence microscopy is one of the most accessible and widely used imaging techniques for the study of biological processes. Scanning confocal microscopy allows the capture of high-quality images from 3D samples, yet suffers from well-known limitations such as photobleaching and phototoxicity of specimens caused by intense light exposure, which limits its use in some applications, especially for living cells. Cellular damage can be alleviated by changing imaging parameters to reduce light exposure, often at the expense of image quality. Machine/deep learning methods for single-image super-resolution (SISR) can be applied to restore image quality by upscaling lower-resolution (LR) images to produce high-resolution images (HR). These SISR methods have been successfully applied to photo-realistic images due partly to the abundance of publicly available data. In contrast, the lack of publicly available data partly limits their application and success in scanning confocal microscopy. In this paper, we introduce a large scanning confocal microscopy dataset named SR-CACO-2 that is comprised of low- and high-resolution image pairs marked for three different fluorescent markers. It allows the evaluation of performance of SISR methods on three different upscaling levels (X2, X4, X8). SR-CACO-2 contains the human epithelial cell line Caco-2 (ATCC HTB-37), and it is composed of 22 tiles that have been translated in the form of 9,937 image patches for experiments with SISR methods. Given the new SR-CACO-2 dataset, we also provide benchmarking results for 15 state-of-the-art methods that are representative of the main SISR families. Results show that these methods have limited success in producing high-resolution textures, indicating that SR-CACO-2 represents a challenging problem. Our dataset, code and pretrained weights are available: this https URL.
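SISR benchmarks of this kind are typically scored with fidelity metrics such as PSNR. The dataset paper does not prescribe this exact snippet, but a minimal PSNR computation over flattened pixel values looks like:

```python
import math

def psnr(reference, reconstruction, max_val=255.0):
    """Peak signal-to-noise ratio between two equal-length pixel lists."""
    mse = sum((r - x) ** 2 for r, x in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return float("inf")  # identical images: no noise at all
    return 10.0 * math.log10(max_val ** 2 / mse)

print(psnr([0, 128, 255], [0, 128, 255]))  # inf: identical images
print(psnr([0, 0], [255, 255]))            # 0.0: error saturated at max_val
```

Higher PSNR means a reconstruction closer to the high-resolution ground truth, which is how the X2/X4/X8 upscaling levels would be compared.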

[CV-136] Blind Super-Resolution via Meta-learning and Markov Chain Monte Carlo Simulation

Link: https://arxiv.org/abs/2406.08896
Authors: Jingyuan Xia,Zhixiong Yang,Shengxi Li,Shuanghui Zhang,Yaowen Fu,Deniz Gündüz,Xiang Li
Keywords: witnessed great successes, Chain Monte Carlo, Markov Chain Monte, handcrafted kernel priors, single image super-resolution
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper has been accepted for publication in IEEE Transactions on Pattern Analysis and Machine Intelligence (2024)

Abstract:Learning-based approaches have witnessed great successes in blind single image super-resolution (SISR) tasks, however, handcrafted kernel priors and learning based kernel priors are typically required. In this paper, we propose a Meta-learning and Markov Chain Monte Carlo (MCMC) based SISR approach to learn kernel priors from organized randomness. In concrete, a lightweight network is adopted as kernel generator, and is optimized via learning from the MCMC simulation on random Gaussian distributions. This procedure provides an approximation for the rational blur kernel, and introduces a network-level Langevin dynamics into SISR optimization processes, which contributes to preventing bad local optimal solutions for kernel estimation. Meanwhile, a meta-learning-based alternating optimization procedure is proposed to optimize the kernel generator and image restorer, respectively. In contrast to the conventional alternating minimization strategy, a meta-learning-based framework is applied to learn an adaptive optimization strategy, which is less-greedy and results in better convergence performance. These two procedures are iteratively processed in a plug-and-play fashion, for the first time, realizing a learning-based but plug-and-play blind SISR solution in unsupervised inference. Extensive simulations demonstrate the superior performance and generalization ability of the proposed approach when comparing with state-of-the-arts on synthesis and real-world datasets. The code is available at this https URL.
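The "network-level Langevin dynamics" mentioned above builds on the standard Langevin update x ← x − η∇U(x) + √(2η)·ξ, which samples from the density exp(−U). A minimal scalar sketch under a Gaussian target (toy step size and target, not the paper's kernel-generator optimization):

```python
import random

def langevin_chain(grad_u, x0=0.0, step=0.01, n_steps=20000, seed=0):
    """Unadjusted Langevin dynamics targeting the density exp(-U(x))."""
    rng = random.Random(seed)
    x, samples = x0, []
    for _ in range(n_steps):
        x = x - step * grad_u(x) + (2.0 * step) ** 0.5 * rng.gauss(0.0, 1.0)
        samples.append(x)
    return samples

# Target: standard Gaussian, U(x) = x**2 / 2, so grad U(x) = x.
chain = langevin_chain(lambda x: x)
mean = sum(chain) / len(chain)
var = sum((s - mean) ** 2 for s in chain) / len(chain)
```

The injected noise term is what helps the iterates escape bad local optima, which is the role the paper assigns to it during kernel estimation.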

[CV-137] Research on Deep Learning Model of Feature Extraction Based on Convolutional Neural Network

Link: https://arxiv.org/abs/2406.08837
Authors: Houze Liu,Iris Li,Yaxin Liang,Dan Sun,Yining Yang,Haowei Yang
Keywords: convolutional neural networks, Neural networks, accurately identifying pneumonia, deep neural networks, shallow layers
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Neural networks with relatively shallow layers and simple structures may have limited ability in accurately identifying pneumonia. In addition, deep neural networks also have a large demand for computing resources, which may cause convolutional neural networks to be unable to be implemented on terminals. Therefore, this paper will carry out the optimal classification of convolutional neural networks. Firstly, according to the characteristics of pneumonia images, AlexNet and InceptionV3 were selected to obtain better image recognition results. Combining the features of medical images, the forward neural network with deeper and more complex structure is learned. Finally, knowledge extraction technology is used to extract the obtained data into the AlexNet model to achieve the purpose of improving computing efficiency and reducing computing costs. The results showed that the prediction accuracy, specificity, and sensitivity of the trained AlexNet model increased by 4.25 percentage points, 7.85 percentage points, and 2.32 percentage points, respectively. The graphics processing usage has decreased by 51% compared to the InceptionV3 model.
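The "knowledge extraction" step described here is commonly implemented as knowledge distillation: the small model is trained to match the temperature-softened outputs of the larger one. A minimal sketch of that soft-target loss (toy logits and temperature chosen for illustration, not the paper's exact recipe):

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with a temperature that flattens the distribution when > 1."""
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """Cross-entropy between softened teacher and student distributions."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher, student))

teacher = [4.0, 1.0, 0.5]
matched = distillation_loss(teacher, teacher)            # student equals teacher
mismatched = distillation_loss([0.5, 1.0, 4.0], teacher)
print(matched < mismatched)  # loss is minimized when outputs match
```

By Gibbs' inequality the cross-entropy is minimized exactly when the student reproduces the teacher's softened distribution, which is what drives the compact AlexNet toward InceptionV3's behavior.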

[CV-138] Hybrid Spatial-spectral Neural Network for Hyperspectral Image Denoising

Link: https://arxiv.org/abs/2406.08782
Authors: Hao Liang,Chengjie,Kun Li,Xin Tian
Keywords: HSI applications, Hyperspectral image, procedure for HSI, essential procedure, HSI
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Hyperspectral image (HSI) denoising is an essential procedure for HSI applications. Unfortunately, the existing Transformer-based methods mainly focus on non-local modeling, neglecting the importance of locality in image denoising. Moreover, deep learning methods employ complex spectral learning mechanisms, thus introducing large computation costs. To address these problems, we propose a hybrid spatial-spectral denoising network (HSSD), in which we design a novel hybrid dual-path network inspired by CNN and Transformer characteristics, leading to capturing both local and non-local spatial details while suppressing noise efficiently. Furthermore, to reduce computational complexity, we adopt a simple but effective decoupling strategy that disentangles the learning of space and spectral channels, where multilayer perception with few parameters is utilized to learn the global correlations among spectra. The synthetic and real experiments demonstrate that our proposed method outperforms state-of-the-art methods on spatial and spectral reconstruction. The code and details are available on this https URL.

[CV-139] AGFA-Net: Attention-Guided and Feature-Aggregated Network for Coronary Artery Segmentation using Computed Tomography Angiography

Link: https://arxiv.org/abs/2406.08724
Authors: Xinyun Liu,Chen Zhao
Keywords: posing significant health, prevalent cardiovascular condition, health risks worldwide, significant health risks, Coronary artery disease
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 7 figures

Abstract:Coronary artery disease (CAD) remains a prevalent cardiovascular condition, posing significant health risks worldwide. This pathology, characterized by plaque accumulation in coronary artery walls, leads to myocardial ischemia and various symptoms, including chest pain and shortness of breath. Accurate segmentation of coronary arteries from coronary computed tomography angiography (CCTA) images is crucial for diagnosis and treatment planning. Traditional segmentation methods face challenges in handling low-contrast images and complex anatomical structures. In this study, we propose an attention-guided, feature-aggregated 3D deep network (AGFA-Net) for coronary artery segmentation using CCTA images. AGFA-Net leverages attention mechanisms and feature refinement modules to capture salient features and enhance segmentation accuracy. Evaluation on a dataset comprising 1,000 CCTA scans demonstrates AGFA-Net’s superior performance, achieving an average Dice coefficient similarity of 86.74% and a Hausdorff distance of 0.23 mm during 5-fold cross-validation. Ablation studies further validate the effectiveness of the proposed modules, highlighting their contributions to improved segmentation accuracy. Overall, AGFA-Net offers a robust and reliable solution for coronary artery segmentation, addressing challenges posed by varying vessel sizes, complex anatomies, and low image contrast.
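The Dice coefficient reported above measures overlap between predicted and ground-truth masks. A minimal computation over flattened binary masks (toy masks for illustration, not the study's evaluation code):

```python
def dice_coefficient(pred, target):
    """Dice similarity between two flat binary masks (lists of 0/1)."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Convention: two empty masks count as a perfect match.
    return 2.0 * intersection / total if total else 1.0

pred   = [1, 1, 0, 1, 0, 0]
target = [1, 0, 0, 1, 1, 0]
print(dice_coefficient(pred, target))  # 2*2 / (3+3) ≈ 0.667
```

A score of 1.0 means perfect overlap; AGFA-Net's reported 86.74% average Dice corresponds to a high but imperfect overlap with expert annotations.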

[CV-140] Unveiling Incomplete Modality Brain Tumor Segmentation: Leveraging Masked Predicted Auto-Encoder and Divergence Learning

Link: https://arxiv.org/abs/2406.08634
Authors: Zhongao Sun,Jiameng Li,Yuhan Wang,Jiarong Cheng,Qing Zhou,Chun Li
Keywords: Brain tumor segmentation, reduced segmentation accuracy, magnetic resonance imaging, tumor segmentation remains, multi-modal magnetic resonance
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Brain tumor segmentation remains a significant challenge, particularly in the context of multi-modal magnetic resonance imaging (MRI) where missing modality images are common in clinical settings, leading to reduced segmentation accuracy. To address this issue, we propose a novel strategy, which is called masked predicted pre-training, enabling robust feature learning from incomplete modality data. Additionally, in the fine-tuning phase, we utilize a knowledge distillation technique to align features between complete and missing modality data, simultaneously enhancing model robustness. Notably, we leverage the Hölder pseudo-divergence instead of the KLD for the distillation loss, offering improved mathematical interpretability and properties. Extensive experiments on the BRATS2018 and BRATS2020 datasets demonstrate significant performance enhancements compared to existing state-of-the-art methods.
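For reference, the KLD baseline that the Hölder pseudo-divergence replaces aligns two distributions via KL(p‖q). A minimal discrete version (toy distributions; the paper's Hölder variant is not reproduced here):

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

uniform = [0.25, 0.25, 0.25, 0.25]
skewed  = [0.70, 0.10, 0.10, 0.10]
print(kl_divergence(uniform, uniform))      # 0.0: identical distributions
print(kl_divergence(skewed, uniform) > 0)   # divergence is non-negative
```

Note the asymmetry: KL(p‖q) generally differs from KL(q‖p), one of the properties that motivates exploring alternative divergences for distillation.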

[CV-141] GRU-Net for breast histopathology image segmentation

Link: https://arxiv.org/abs/2406.08604
Authors: Ayush Roy,Payel Pramanik,Sohom Ghosal,Daria Valenkova,Dmitrii Kaplun,Ram Sarkar
Keywords: global health concern, major global health, health concern, major global, global health
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Breast cancer is a major global health concern. Pathologists face challenges in analyzing complex features from pathological images, which is a time-consuming and labor-intensive task. Therefore, efficient computer-based diagnostic tools are needed for early detection and treatment planning. This paper presents a modified version of MultiResU-Net for histopathology image segmentation, which is selected as the backbone for its ability to analyze and segment complex features at multiple scales and ensure effective feature flow via skip connections. The modified version also utilizes the Gaussian distribution-based Attention Module (GdAM) to incorporate histopathology-relevant text information in a Gaussian distribution. The sampled features from the Gaussian text feature-guided distribution highlight specific spatial regions based on prior knowledge. Finally, using the Controlled Dense Residual Block (CDRB) on skip connections of MultiResU-Net, the information is transferred from the encoder layers to the decoder layers in a controlled manner using a scaling parameter derived from the extracted spatial features. We validate our approach on two diverse breast cancer histopathology image datasets: TNBC and MonuSeg, demonstrating superior segmentation performance compared to state-of-the-art methods. The code for our proposed model is available on this https URL.

机器学习 (Machine Learning)

[LG-0] Rethinking Score Distillation as a Bridge Between Image Distributions

Link: https://arxiv.org/abs/2406.09417
Authors: David McAllister,Songwei Ge,Jia-Bin Huang,David W. Jacobs,Alexei A. Efros,Aleksander Holynski,Angjoo Kanazawa
Keywords: large-scale diffusion priors, important tool, large-scale diffusion, diffusion priors, priors for tasks
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
Comments: Project webpage: this https URL

Abstract:Score distillation sampling (SDS) has proven to be an important tool, enabling the use of large-scale diffusion priors for tasks operating in data-poor domains. Unfortunately, SDS has a number of characteristic artifacts that limit its usefulness in general-purpose applications. In this paper, we make progress toward understanding the behavior of SDS and its variants by viewing them as solving an optimal-cost transport path from a source distribution to a target distribution. Under this new interpretation, these methods seek to transport corrupted images (source) to the natural image distribution (target). We argue that current methods’ characteristic artifacts are caused by (1) linear approximation of the optimal path and (2) poor estimates of the source distribution. We show that calibrating the text conditioning of the source distribution can produce high-quality generation and translation results with little extra overhead. Our method can be easily applied across many domains, matching or beating the performance of specialized methods. We demonstrate its utility in text-to-2D, text-based NeRF optimization, translating paintings to real images, optical illusion generation, and 3D sketch-to-real. We compare our method to existing approaches for score distillation sampling and show that it can produce high-frequency details with realistic colors.

[LG-1] An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels

Link: https://arxiv.org/abs/2406.09415
Authors: Duy-Kien Nguyen,Mahmoud Assran,Unnat Jain,Martin R. Oswald,Cees G. M. Snoek,Xinlei Chen
Keywords: computer vision, inductive bias, vision, modern computer vision, Abstract
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Technical report, 23 pages

Abstract:This work does not introduce a new method. Instead, we present an interesting finding that questions the necessity of the inductive bias – locality in modern computer vision architectures. Concretely, we find that vanilla Transformers can operate by directly treating each individual pixel as a token and achieve highly performant results. This is substantially different from the popular design in Vision Transformer, which maintains the inductive bias from ConvNets towards local neighborhoods (e.g. by treating each 16x16 patch as a token). We mainly showcase the effectiveness of pixels-as-tokens across three well-studied tasks in computer vision: supervised learning for object classification, self-supervised learning via masked autoencoding, and image generation with diffusion models. Although directly operating on individual pixels is less computationally practical, we believe the community must be aware of this surprising piece of knowledge when devising the next generation of neural architectures for computer vision.
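The sequence-length gap between ViT-style 16x16 patching and the pixels-as-tokens setup discussed above is easy to make concrete (standard 224x224 input assumed):

```python
def patch_token_count(height, width, patch=16):
    """Tokens when the image is tiled into patch x patch squares (ViT-style)."""
    assert height % patch == 0 and width % patch == 0
    return (height // patch) * (width // patch)

def pixel_token_count(height, width):
    """Tokens when every single pixel is its own token."""
    return height * width

print(patch_token_count(224, 224))  # 196
print(pixel_token_count(224, 224))  # 50176, i.e. 256x longer sequences
```

The 256x longer sequence (and quadratic attention cost over it) is why the authors note that operating on individual pixels is less computationally practical, despite the strong results.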

[LG-2] Interpreting the Weight Space of Customized Diffusion Models

Link: https://arxiv.org/abs/2406.09413
Authors: Amil Dravid,Yossi Gandelsman,Kuan-Chieh Wang,Rameen Abdal,Gordon Wetzstein,Alexei A. Efros,Kfir Aberman
Keywords: large collection, collection of customized, space, customized diffusion models, identity
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
Comments: Project Page: this https URL

Abstract:We investigate the space of weights spanned by a large collection of customized diffusion models. We populate this space by creating a dataset of over 60,000 models, each of which is a base model fine-tuned to insert a different person’s visual identity. We model the underlying manifold of these weights as a subspace, which we term weights2weights. We demonstrate three immediate applications of this space – sampling, editing, and inversion. First, as each point in the space corresponds to an identity, sampling a set of weights from it results in a model encoding a novel identity. Next, we find linear directions in this space corresponding to semantic edits of the identity (e.g., adding a beard). These edits persist in appearance across generated samples. Finally, we show that inverting a single image into this space reconstructs a realistic identity, even if the input image is out of distribution (e.g., a painting). Our results indicate that the weight space of fine-tuned diffusion models behaves as an interpretable latent space of identities.
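The semantic edits described above amount to moving a flattened weight vector along a linear direction in the learned subspace. A minimal sketch of that operation (toy 3-dimensional "weights" and a hypothetical direction; real fine-tuned weights are vastly higher-dimensional):

```python
def apply_edit(weights, direction, alpha):
    """Shift a flattened weight vector along a semantic direction by alpha."""
    return [w + alpha * d for w, d in zip(weights, direction)]

w = [0.2, -0.5, 1.0]                 # toy flattened model weights
beard_direction = [0.1, 0.0, -0.3]   # hypothetical semantic direction
edited = apply_edit(w, beard_direction, alpha=2.0)
print(edited)  # ≈ [0.4, -0.5, 0.4]
```

The interesting empirical finding is that such linear moves in weight space produce consistent appearance changes across all samples the edited model generates.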

[LG-3] Explore the Limits of Omni-modal Pretraining at Scale

Link: https://arxiv.org/abs/2406.09412
Authors: Yiyuan Zhang,Handong Li,Jing Liu,Xiangyu Yue
Keywords: learning universal representations, universal representations, named Multimodal Context, build omni-modal intelligence, Multimodal Context
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: Project Website: this https URL

Abstract:We propose to build omni-modal intelligence, which is capable of understanding any modality and learning universal representations. In specific, we propose a scalable pretraining paradigm, named Multimodal Context (MiCo), which can scale up the numbers of modalities and amount of data, together with the model parameters, in the pretraining process. With MiCo, the pretrained models show significant emergent abilities in multimodal learning, which are evaluated on the following tasks: i) single-modality perception benchmarks of 10 different modalities, ii) 25 cross-modality understanding tasks of retrieval, question-answering, captioning, and iii) 18 multimodal large language model benchmarks. Our models establish 37 new records for state-of-the-art performance. We hope that our research could contribute to the development of omni-modal intelligence. Code and Models are at this https URL

[LG-4] Data Attribution for Text-to-Image Models by Unlearning Synthesized Images

Link: https://arxiv.org/abs/2406.09408
Authors: Sheng-Yu Wang,Aaron Hertzmann,Alexei A. Efros,Jun-Yan Zhu,Richard Zhang
Keywords: goal of data, data attribution, images, influential images, image
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Project page: this https URL Code: this https URL

Abstract:The goal of data attribution for text-to-image models is to identify the training images that most influence the generation of a new image. We can define “influence” by saying that, for a given output, if a model is retrained from scratch without that output’s most influential images, the model should then fail to generate that output image. Unfortunately, directly searching for these influential images is computationally infeasible, since it would require repeatedly retraining from scratch. We propose a new approach that efficiently identifies highly-influential images. Specifically, we simulate unlearning the synthesized image, proposing a method to increase the training loss on the output image, without catastrophic forgetting of other, unrelated concepts. Then, we find training images that are forgotten by proxy, identifying ones with significant loss deviations after the unlearning process, and label these as influential. We evaluate our method with a computationally intensive but “gold-standard” retraining from scratch and demonstrate our method’s advantages over previous methods.

[LG-5] 4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities

Link: https://arxiv.org/abs/2406.09406
Authors: Roman Bachmann,Oğuzhan Fatih Kar,David Mizrahi,Ali Garjani,Mingfei Gao,David Griffiths,Jiaming Hu,Afshin Dehghan,Amir Zamir
Keywords: accept diverse inputs, multitask foundation models, show promising results, UnifiedIO show promising, Current multimodal
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Project page at this http URL

Abstract:Current multimodal and multitask foundation models like 4M or UnifiedIO show promising results, but in practice their out-of-the-box abilities to accept diverse inputs and perform diverse tasks are limited by the (usually rather small) number of modalities and tasks they are trained on. In this paper, we expand upon the capabilities of them by training a single model on tens of highly diverse modalities and by performing co-training on large-scale multimodal datasets and text corpora. This includes training on several semantic and geometric modalities, feature maps from recent state of the art models like DINOv2 and ImageBind, pseudo labels of specialist models like SAM and 4DHumans, and a range of new modalities that allow for novel ways to interact with the model and steer the generation, for example image metadata or color palettes. A crucial step in this process is performing discrete tokenization on various modalities, whether they are image-like, neural network feature maps, vectors, structured data like instance segmentation or human poses, or data that can be represented as text. Through this, we expand on the out-of-the-box capabilities of multimodal models and specifically show the possibility of training one model to solve at least 3x more tasks/modalities than existing ones and doing so without a loss in performance. This enables more fine-grained and controllable multimodal generation capabilities and allows us to study the distillation of models trained on diverse data and objectives into a unified model. We successfully scale the training to a three billion parameter model using tens of modalities and different datasets. The resulting models and training code are open sourced at this http URL.

[LG-6] Why Warmup the Learning Rate? Underlying Mechanisms and Improvements

Link: https://arxiv.org/abs/2406.09405
Authors: Dayal Singh Kalra,Maissam Barkeshli
Keywords: eta, learning rate, deep learning, text, predetermined target
Subjects: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (stat.ML)
Comments: 11+22 pages, 7+24 figures

Abstract:It is common in deep learning to warm up the learning rate \eta, often by a linear schedule between \eta_{\text{init}} = 0 and a predetermined target \eta_{\text{trgt}}. In this paper, we show through systematic experiments using SGD and Adam that the overwhelming benefit of warmup arises from allowing the network to tolerate larger \eta_{\text{trgt}} by forcing the network to more well-conditioned areas of the loss landscape. The ability to handle larger \eta_{\text{trgt}} makes hyperparameter tuning more robust while improving the final performance. We uncover different regimes of operation during the warmup period, depending on whether training starts off in a progressive sharpening or sharpness reduction phase, which in turn depends on the initialization and parameterization. Using these insights, we show how \eta_{\text{init}} can be properly chosen by utilizing the loss catapult mechanism, which saves on the number of warmup steps, in some cases completely eliminating the need for warmup. We also suggest an initialization for the variance in Adam which provides benefits similar to warmup.
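The linear warmup schedule analyzed in the abstract interpolates the learning rate from \eta_{\text{init}} to \eta_{\text{trgt}} over a fixed number of steps. A minimal sketch (illustrative values, not the paper's experimental settings):

```python
def warmup_lr(step, warmup_steps, eta_init=0.0, eta_trgt=0.1):
    """Linear learning-rate warmup from eta_init to eta_trgt, then constant."""
    if step >= warmup_steps:
        return eta_trgt
    return eta_init + (step / warmup_steps) * (eta_trgt - eta_init)

# Sample the schedule every 50 steps across and past a 100-step warmup.
schedule = [warmup_lr(s, warmup_steps=100) for s in range(0, 201, 50)]
print(schedule)  # [0.0, 0.05, 0.1, 0.1, 0.1]
```

The paper's finding is that this ramp mainly buys tolerance for a larger \eta_{\text{trgt}}, and that a well-chosen \eta_{\text{init}} can shorten or remove the ramp entirely.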

[LG-7] ConsistDreamer: 3D-Consistent 2D Diffusion for High-Fidelity Scene Editing

Link: https://arxiv.org/abs/2406.09404
Authors: Jun-Kun Chen,Samuel Rota Bulò,Norman Müller,Lorenzo Porzi,Peter Kontschieder,Yu-Xiong Wang
Keywords: enabling high-fidelity instruction-guided, diffusion models, paper proposes ConsistDreamer, framework that lifts, diffusion
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: CVPR 2024

Abstract:This paper proposes ConsistDreamer - a novel framework that lifts 2D diffusion models with 3D awareness and 3D consistency, thus enabling high-fidelity instruction-guided scene editing. To overcome the fundamental limitation of missing 3D consistency in 2D diffusion models, our key insight is to introduce three synergetic strategies that augment the input of the 2D diffusion model to become 3D-aware and to explicitly enforce 3D consistency during the training process. Specifically, we design surrounding views as context-rich input for the 2D diffusion model, and generate 3D-consistent, structured noise instead of image-independent noise. Moreover, we introduce self-supervised consistency-enforcing training within the per-scene editing procedure. Extensive evaluation shows that our ConsistDreamer achieves state-of-the-art performance for instruction-guided scene editing across various scenes and editing instructions, particularly in complicated large-scale indoor scenes from ScanNet++, with significantly improved sharpness and fine-grained textures. Notably, ConsistDreamer stands as the first work capable of successfully editing complex (e.g., plaid/checkered) patterns. Our project page is at this http URL.

[LG-8] Instruct 4D-to-4D: Editing 4D Scenes as Pseudo-3D Scenes Using 2D Diffusion

Link: https://arxiv.org/abs/2406.09402
Authors: Linzhan Mou,Jun-Kun Chen,Yu-Xiong Wang
Keywords: paper proposes Instruct, generate high-quality instruction-guided, high-quality instruction-guided dynamic, dynamic scene editing, diffusion models
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: CVPR 2024

Abstract:This paper proposes Instruct 4D-to-4D that achieves 4D awareness and spatial-temporal consistency for 2D diffusion models to generate high-quality instruction-guided dynamic scene editing results. Traditional applications of 2D diffusion models in dynamic scene editing often result in inconsistency, primarily due to their inherent frame-by-frame editing methodology. Addressing the complexities of extending instruction-guided editing to 4D, our key insight is to treat a 4D scene as a pseudo-3D scene, decoupled into two sub-problems: achieving temporal consistency in video editing and applying these edits to the pseudo-3D scene. Following this, we first enhance the Instruct-Pix2Pix (IP2P) model with an anchor-aware attention module for batch processing and consistent editing. Additionally, we integrate optical flow-guided appearance propagation in a sliding window fashion for more precise frame-to-frame editing and incorporate depth-based projection to manage the extensive data of pseudo-3D scenes, followed by iterative editing to achieve convergence. We extensively evaluate our approach in various scenes and editing instructions, and demonstrate that it achieves spatially and temporally consistent editing results, with significantly enhanced detail and sharpness over the prior art. Notably, Instruct 4D-to-4D is general and applicable to both monocular and challenging multi-camera scenes. Code and more results are available at this http URL.

[LG-9] YoLLaVA: Your Personalized Language and Vision Assistant

Link: https://arxiv.org/abs/2406.09400
Authors: Thao Nguyen, Haotian Liu, Yuheng Li, Mu Cai, Utkarsh Ojha, Yong Jae Lee
Keywords: Large Multimodal Models, Large Multimodal, Multimodal Models, shown remarkable capabilities, visual question answering
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Project page: this https URL

Click to view abstract

Abstract:Large Multimodal Models (LMMs) have shown remarkable capabilities across a variety of tasks (e.g., image captioning, visual question answering). While broad, their knowledge remains generic (e.g., recognizing a dog), and they are unable to handle personalized subjects (e.g., recognizing a user’s pet dog). Human reasoning, in contrast, typically operates within the context of specific subjects in our surroundings. For example, one might ask, “What should I buy for my dog’s birthday?”; as opposed to a generic inquiry about “What should I buy for a dog’s birthday?”. Similarly, when looking at a friend’s image, the interest lies in seeing their activities (e.g., “my friend is holding a cat”), rather than merely observing generic human actions (e.g., “a man is holding a cat”). In this paper, we introduce the novel task of personalizing LMMs, so that they can have conversations about a specific subject. We propose Yo’LLaVA, which learns to embed a personalized subject into a set of latent tokens given a handful of example images of the subject. Our qualitative and quantitative analyses reveal that Yo’LLaVA can learn the concept more efficiently using fewer tokens and more effectively encode the visual attributes compared to strong prompting baselines (e.g., LLaVA).

[LG-10] Improving Autoregressive Training with Dynamic Oracles

Link: https://arxiv.org/abs/2406.09393
Authors: Jianing Yang, Harshine Visvanathan, Yilin Wang, Xinyi Hu, Matthew Gormley
Keywords: sequential decision problems, ranging from sequence, framed as sequential, sequential decision, sequence tagging
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Many tasks within NLP can be framed as sequential decision problems, ranging from sequence tagging to text generation. However, for many tasks, the standard training methods, including maximum likelihood (teacher forcing) and scheduled sampling, suffer from exposure bias and a mismatch between metrics employed during training and inference. DAgger provides a solution to mitigate these problems, yet it requires a metric-specific dynamic oracle algorithm, which does not exist for many common metrics like span-based F1, ROUGE, and BLEU. In this paper, we develop these novel dynamic oracles and show they maintain DAgger’s no-regret guarantee for decomposable metrics like span-based F1. We evaluate the algorithm’s performance on named entity recognition (NER), text summarization, and machine translation (MT). While DAgger with dynamic oracle yields less favorable results in our MT experiments, it outperforms the baseline techniques in NER and text summarization.
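To make the notion of a dynamic oracle concrete, the sketch below implements the simplest case: a dynamic oracle for token-level Hamming loss in sequence tagging, used inside a DAgger-style roll-in. The paper's contribution is oracles for harder metrics (span-based F1, ROUGE, BLEU); this is only the textbook warm-up, and all function and variable names here are hypothetical.

```python
import random

def hamming_dynamic_oracle(gold, prefix):
    """Dynamic oracle for token-level Hamming loss: whatever the model
    has predicted so far, the loss-minimizing next action is simply the
    gold tag at the current position."""
    return gold[len(prefix)]

def dagger_roll_in(gold, policy, mix_prob, rng):
    """One DAgger-style roll-in: follow the learned policy with
    probability mix_prob (else the oracle), and always record the
    oracle's action as the supervision target for that state."""
    prefix, targets = [], []
    for _ in range(len(gold)):
        oracle_action = hamming_dynamic_oracle(gold, prefix)
        targets.append(oracle_action)
        action = policy(prefix) if rng.random() < mix_prob else oracle_action
        prefix.append(action)
    return prefix, targets

gold = ["B", "I", "O", "B", "O"]
weak_policy = lambda prefix: "O"          # hypothetical, deliberately weak
pred, targets = dagger_roll_in(gold, weak_policy, mix_prob=0.5,
                               rng=random.Random(0))
```

Note that for Hamming loss the oracle target is the gold tag regardless of the (possibly wrong) prefix; the point of DAgger is that the model is still trained on the states its own mistakes produce.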

[LG-11] A More Practical Approach to Machine Unlearning

Link: https://arxiv.org/abs/2406.09391
Authors: David Zagardo
Keywords: incorporate vast amounts, unlearning, raising significant privacy, Machine unlearning, raising significant
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Machine learning models often incorporate vast amounts of data, raising significant privacy concerns. Machine unlearning, the ability to remove the influence of specific data points from a trained model, addresses these concerns. This paper explores practical methods for implementing machine unlearning, focusing on a first-epoch gradient-ascent approach. Key findings include: 1. Single vs. Multi-Epoch Unlearning: First-epoch gradient unlearning is more effective than multi-epoch gradients. 2. Layer-Based Unlearning: The embedding layer in GPT-2 is crucial for effective unlearning. Gradients from the output layers (11 and 12) have no impact. Efficient unlearning can be achieved using only the embedding layer, halving space complexity. 3. Influence Functions Scoring: Techniques like Hessian Vector Product and the dot product of activations and tensors are used for quantifying unlearning. 4. Gradient Ascent Considerations: Calibration is necessary to avoid overexposing the model to specific data points during unlearning, which could prematurely terminate the process. 5. Fuzzy Matching vs. Iterative Unlearning: Fuzzy matching techniques shift the model to a new optimum, while iterative unlearning provides a more complete modality. Our empirical evaluation confirms that first-epoch gradient ascent for machine unlearning is more effective than whole-model gradient ascent. These results highlight the potential of machine unlearning for enhancing data privacy and compliance with regulations such as GDPR and CCPA. The study underscores the importance of formal methods to comprehensively evaluate the unlearning process.
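The core gradient-ascent idea can be sketched on a toy model. Below, "unlearning" a point means re-applying its loss gradients with the sign flipped, approximately undoing the descent steps that point contributed. This is a minimal logistic-regression illustration, not the paper's GPT-2 setup; all names are illustrative.

```python
import math

def grad_logloss(w, x, y):
    """Gradient of binary log-loss for one example (y in {0, 1})."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    p = 1.0 / (1.0 + math.exp(-z))
    return [(p - y) * xi for xi in x]

def sgd_epoch(w, data, lr=0.5):
    """One epoch of plain SGD (gradient descent)."""
    for x, y in data:
        w = [wi - lr * gi for wi, gi in zip(w, grad_logloss(w, x, y))]
    return w

def unlearn_by_ascent(w, forget_set, lr=0.5):
    """Gradient *ascent* on the forget set, approximately undoing the
    descent steps those points contributed during training."""
    for x, y in forget_set:
        w = [wi + lr * gi for wi, gi in zip(w, grad_logloss(w, x, y))]
    return w

data = [([1.0, 0.0], 1), ([0.0, 1.0], 0), ([1.0, 1.0], 1)]
w = sgd_epoch([0.0, 0.0], data)
w_unlearned = unlearn_by_ascent(w, [data[2]])  # forget the third example
```

After the ascent step, the model's logit on the forgotten positive example decreases, which is the effect the calibration point (4) above warns must not be overdone.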

[LG-12] LLAVIDAL: Benchmarking Large Language Vision Models for Daily Activities of Living

Link: https://arxiv.org/abs/2406.09390
Authors: Rajatsubhra Chakraborty, Arkaprava Sinha, Dominick Reilly, Manish Kumar Govind, Pu Wang, Francois Bremond, Srijan Das
Keywords: Large Language Vision, Language Vision Models, Daily Living, Activities of Daily, processing internet videos
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Large Language Vision Models (LLVMs) have demonstrated effectiveness in processing internet videos, yet they struggle with the visually perplexing dynamics present in Activities of Daily Living (ADL) due to limited pertinent datasets and models tailored to relevant cues. To this end, we propose a framework for curating ADL multiview datasets to fine-tune LLVMs, resulting in the creation of ADL-X, comprising 100K RGB video-instruction pairs, language descriptions, 3D skeletons, and action-conditioned object trajectories. We introduce LLAVIDAL, an LLVM capable of incorporating 3D poses and relevant object trajectories to understand the intricate spatiotemporal relationships within ADLs. Furthermore, we present a novel benchmark, ADLMCQ, for quantifying LLVM effectiveness in ADL scenarios. When trained on ADL-X, LLAVIDAL consistently achieves state-of-the-art performance across all ADL evaluation metrics. Qualitative analysis reveals LLAVIDAL’s temporal reasoning capabilities in understanding ADL. The link to the dataset is provided at: this https URL

[LG-13] Exploring the Spectrum of Visio-Linguistic Compositionality and Recognition

Link: https://arxiv.org/abs/2406.09388
Authors: Youngtaek Oh, Pyunghwan Ahn, Jinhyung Kim, Gwangmo Song, Soonyoung Lee, In So Kweon, Junmo Kim
Keywords: fine-grained image-text alignment, Vision and language, showcased remarkable zero-shot, image-text alignment, showcased remarkable
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to CVPRW 2024 on ‘What is Next in Multimodal Foundation Models?’. Code: this https URL

Click to view abstract

Abstract:Vision and language models (VLMs) such as CLIP have showcased remarkable zero-shot recognition abilities yet face challenges in visio-linguistic compositionality, particularly in linguistic comprehension and fine-grained image-text alignment. This paper explores the intricate relationship between compositionality and recognition – two pivotal aspects of VLM capability. We conduct a comprehensive evaluation of existing VLMs, covering both pre-training approaches aimed at recognition and the fine-tuning methods designed to improve compositionality. Our evaluation employs 12 benchmarks for compositionality, along with 21 zero-shot classification and two retrieval benchmarks for recognition. In our analysis from 274 CLIP model checkpoints, we reveal patterns and trade-offs that emerge between compositional understanding and recognition accuracy. Ultimately, this necessitates strategic efforts towards developing models that improve both capabilities, as well as the meticulous formulation of benchmarks for compositionality. We open our evaluation framework at this https URL.

[LG-14] Reflecting on the State of Rehearsal-free Continual Learning with Pretrained Models

Link: https://arxiv.org/abs/2406.09384
Authors: Lukas Thede, Karsten Roth, Olivier J. Hénaff, Matthias Bethge, Zeynep Akata
Keywords: foundation models, pretrained models, ubiquity of foundation, success on rehearsal-free, models
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 3rd Conference on Lifelong Learning Agents (CoLLAs) 2024

Click to view abstract

Abstract:With the advent and recent ubiquity of foundation models, continual learning (CL) has recently shifted from continual training from scratch to the continual adaptation of pretrained models, seeing particular success on rehearsal-free CL benchmarks (RFCL). To achieve this, most proposed methods adapt and restructure parameter-efficient finetuning techniques (PEFT) to suit the continual nature of the problem. Based most often on input-conditional query-mechanisms or regularizations on top of prompt- or adapter-based PEFT, these PEFT-style RFCL (P-RFCL) approaches report peak performances; often convincingly outperforming existing CL techniques. However, on the other end, critical studies have recently highlighted competitive results by training on just the first task or via simple non-parametric baselines. Consequently, questions arise about the relationship between methodological choices in P-RFCL and their reported high benchmark scores. In this work, we tackle these questions to better understand the true drivers behind strong P-RFCL performances, their placement w.r.t. recent first-task adaptation studies, and their relation to preceding CL standards such as EWC or SI. In particular, we show: (1) P-RFCL techniques relying on input-conditional query mechanisms work not because, but rather despite them by collapsing towards standard PEFT shortcut solutions. (2) Indeed, we show how most often, P-RFCL techniques can be matched by a simple and lightweight PEFT baseline. (3) Using this baseline, we identify the implicit bound on tunable parameters when deriving RFCL approaches from PEFT methods as a potential denominator behind P-RFCL efficacy. Finally, we (4) better disentangle continual versus first-task adaptation, and (5) motivate standard RFCL techniques s.a. EWC or SI in light of recent P-RFCL methods.

[LG-15] Efficient Discrepancy Testing for Learning with Distribution Shift

Link: https://arxiv.org/abs/2406.09373
Authors: Gautam Chandrasekaran, Adam R. Klivans, Vasilis Kontonis, Konstantinos Stavropoulos, Arsen Vasilyan
Keywords: discrepancy distance, fundamental notion, field of domain, domain adaptation, localized discrepancy distance
Subjects: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
Comments: 45 pages, 3 figures

Click to view abstract

Abstract:A fundamental notion of distance between train and test distributions from the field of domain adaptation is discrepancy distance. While in general hard to compute, here we provide the first set of provably efficient algorithms for testing localized discrepancy distance, where discrepancy is computed with respect to a fixed output classifier. These results imply a broad set of new, efficient learning algorithms in the recently introduced model of Testable Learning with Distribution Shift (TDS learning) due to Klivans et al. (2023). Our approach generalizes and improves all prior work on TDS learning: (1) we obtain universal learners that succeed simultaneously for large classes of test distributions, (2) achieve near-optimal error rates, and (3) give exponential improvements for constant depth circuits. Our methods further extend to semi-parametric settings and imply the first positive results for low-dimensional convex sets. Additionally, we separate learning and testing phases and obtain algorithms that run in fully polynomial time at test time.

[LG-16] LRM-Zero: Training Large Reconstruction Models with Synthesized Data

Link: https://arxiv.org/abs/2406.09371
Authors: Desai Xie, Sai Bi, Zhixin Shu, Kai Zhang, Zexiang Xu, Yi Zhou, Sören Pirk, Arie Kaufman, Xin Sun, Hao Tan
Keywords: achieving high-quality sparse-view, Large Reconstruction Model, achieving high-quality, high-quality sparse-view, Large Reconstruction
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 23 pages, 8 figures. Our code and interactive visualization are available at: this https URL

Click to view abstract

Abstract:We present LRM-Zero, a Large Reconstruction Model (LRM) trained entirely on synthesized 3D data, achieving high-quality sparse-view 3D reconstruction. The core of LRM-Zero is our procedural 3D dataset, Zeroverse, which is automatically synthesized from simple primitive shapes with random texturing and augmentations (e.g., height fields, boolean differences, and wireframes). Unlike previous 3D datasets (e.g., Objaverse) which are often captured or crafted by humans to approximate real 3D data, Zeroverse completely ignores realistic global semantics but is rich in complex geometric and texture details that are locally similar to or even more intricate than real objects. We demonstrate that our LRM-Zero, trained with our fully synthesized Zeroverse, can achieve high visual quality in the reconstruction of real-world objects, competitive with models trained on Objaverse. We also analyze several critical design choices of Zeroverse that contribute to LRM-Zero’s capability and training stability. Our work demonstrates that 3D reconstruction, one of the core tasks in 3D vision, can potentially be addressed without the semantics of real-world objects. The Zeroverse’s procedural synthesis code and interactive visualization are available at: this https URL.

[LG-17] Data-dependent and Oracle Bounds on Forgetting in Continual Learning

Link: https://arxiv.org/abs/2406.09370
Authors: Lior Friedman, Ron Meir
Keywords: maintaining good transfer, continual learning, maintaining good, future tasks, preserved and re-used
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:In continual learning, knowledge must be preserved and re-used between tasks, maintaining good transfer to future tasks and minimizing forgetting of previously learned ones. While several practical algorithms have been devised for this setting, there have been few theoretical works aiming to quantify and bound the degree of Forgetting in general settings. We provide both data-dependent and oracle upper bounds that apply regardless of model and algorithm choice, as well as bounds for Gibbs posteriors. We derive an algorithm inspired by our bounds and demonstrate empirically that our approach yields improved forward and backward transfer.

[LG-18] Towards an Improved Understanding and Utilization of Maximum Manifold Capacity Representations

Link: https://arxiv.org/abs/2406.09366
Authors: Rylan Schaeffer, Victor Lecomte, Dhruv Bhandarkar Pai, Andres Carranza, Berivan Isik, Alyssa Unell, Mikail Khona, Thomas Yerxa, Yann LeCun, SueYeon Chung, Andrey Gromov, Ravid Shwartz-Ziv, Sanmi Koyejo
Keywords: Maximum Manifold Capacity, Manifold Capacity Representations, Capacity Representations, multi-view self-supervised learning, recent multi-view self-supervised
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC)
Comments:

Click to view abstract

Abstract:Maximum Manifold Capacity Representations (MMCR) is a recent multi-view self-supervised learning (MVSSL) method that matches or surpasses other leading MVSSL methods. MMCR is intriguing because it does not fit neatly into any of the commonplace MVSSL lineages, instead originating from a statistical mechanical perspective on the linear separability of data manifolds. In this paper, we seek to improve our understanding and our utilization of MMCR. To better understand MMCR, we leverage tools from high dimensional probability to demonstrate that MMCR incentivizes alignment and uniformity of learned embeddings. We then leverage tools from information theory to show that such embeddings maximize a well-known lower bound on mutual information between views, thereby connecting the geometric perspective of MMCR to the information-theoretic perspective commonly discussed in MVSSL. To better utilize MMCR, we mathematically predict and experimentally confirm non-monotonic changes in the pretraining loss akin to double descent but with respect to atypical hyperparameters. We also discover compute scaling laws that enable predicting the pretraining loss as a function of gradients steps, batch size, embedding dimension and number of views. We then show that MMCR, originally applied to image data, is performant on multimodal image-text data. By more deeply understanding the theoretical and empirical behavior of MMCR, our work reveals insights on improving MVSSL methods.
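The alignment and uniformity properties the abstract refers to are commonly measured with the two metrics of Wang & Isola (2020): mean distance between embeddings of two views of the same datum, and a Gaussian-kernel spread measure over all embeddings. A minimal pure-Python version of those diagnostics (not MMCR's actual loss) might look like:

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def alignment(pairs):
    """Mean squared distance between unit embeddings of two views of
    the same datum; lower means better aligned."""
    total = 0.0
    for u, v in pairs:
        u, v = normalize(u), normalize(v)
        total += sum((a - b) ** 2 for a, b in zip(u, v))
    return total / len(pairs)

def uniformity(embs, t=2.0):
    """Log of the mean Gaussian-kernel similarity over all pairs;
    lower means the embeddings spread more uniformly on the sphere."""
    embs = [normalize(e) for e in embs]
    vals = []
    for i in range(len(embs)):
        for j in range(i + 1, len(embs)):
            d2 = sum((a - b) ** 2 for a, b in zip(embs[i], embs[j]))
            vals.append(math.exp(-t * d2))
    return math.log(sum(vals) / len(vals))

perfect = [([1.0, 0.0], [1.0, 0.0]), ([0.0, 1.0], [0.0, 1.0])]
spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
clustered = [[1.0, 0.01], [1.0, -0.01], [1.0, 0.0], [0.99, 0.0]]
```

Perfectly matched view pairs give zero alignment loss, and embeddings spread around the circle score lower (better) uniformity than a tight cluster, which is the behavior the paper shows MMCR incentivizes.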

[LG-19] ElicitationGPT: Text Elicitation Mechanisms via Language Models

Link: https://arxiv.org/abs/2406.09363
Authors: Yifan Wu, Jason Hartline
Keywords: fundamental building block, machine learning models, rules evaluate probabilistic, evaluate probabilistic forecasts, Scoring rules evaluate
Subjects: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Scoring rules evaluate probabilistic forecasts of an unknown state against the realized state and are a fundamental building block in the incentivized elicitation of information and the training of machine learning models. This paper develops mechanisms for scoring elicited text against ground truth text using domain-knowledge-free queries to a large language model (specifically ChatGPT) and empirically evaluates their alignment with human preferences. The empirical evaluation is conducted on peer reviews from a peer-grading dataset and in comparison to manual instructor scores for the peer reviews.
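As background on the scoring rules the abstract builds on: a *proper* scoring rule maximizes a forecaster's expected score exactly when they report their true belief. The quadratic (Brier) rule is the standard example; this is textbook background, not the paper's text-scoring mechanism.

```python
def brier_score(forecast, outcome):
    """Quadratic (Brier) scoring rule for a binary forecast p in [0, 1],
    written so that higher is better."""
    return -(forecast - outcome) ** 2

def expected_score(report, true_p):
    """Expected score of reporting `report` when the true probability
    of outcome 1 is `true_p`."""
    return (true_p * brier_score(report, 1)
            + (1 - true_p) * brier_score(report, 0))
```

Reporting the true probability (0.7 below) beats both under- and over-reporting in expectation, which is what makes the rule an incentive for truthful elicitation.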

[LG-20] Understanding Hallucinations in Diffusion Models through Mode Interpolation

Link: https://arxiv.org/abs/2406.09358
Authors: Sumukh K Aithal, Pratyush Maini, Zachary C. Lipton, J. Zico Kolter
Keywords: Colloquially speaking, diffusion models, processes are frequently, image generation models, diffusion
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Colloquially speaking, image generation models based upon diffusion processes are frequently said to exhibit “hallucinations,” samples that could never occur in the training data. But where do such hallucinations come from? In this paper, we study a particular failure mode in diffusion models, which we term mode interpolation. Specifically, we find that diffusion models smoothly “interpolate” between nearby data modes in the training set, to generate samples that are completely outside the support of the original training distribution; this phenomenon leads diffusion models to generate artifacts that never existed in real data (i.e., hallucinations). We systematically study the reasons for, and the manifestation of this phenomenon. Through experiments on 1D and 2D Gaussians, we show how a discontinuous loss landscape in the diffusion model’s decoder leads to a region where any smooth approximation will cause such hallucinations. Through experiments on artificial datasets with various shapes, we show how hallucination leads to the generation of combinations of shapes that never existed. Finally, we show that diffusion models in fact know when they go out of support and hallucinate. This is captured by the high variance in the trajectory of the generated sample towards the final few backward sampling process. Using a simple metric to capture this variance, we can remove over 95% of hallucinations at generation time while retaining 96% of in-support samples. We conclude our exploration by showing the implications of such hallucination (and its removal) on the collapse (and stabilization) of recursive training on synthetic data with experiments on MNIST and 2D Gaussians dataset. We release our code at this https URL.
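The "simple metric to capture this variance" can be sketched in a few lines: measure the variance of each sample's trajectory over the final backward-sampling steps and discard high-variance samples. The 1-D trajectories, window size, and threshold below are illustrative stand-ins, not the paper's exact procedure.

```python
def trajectory_variance(trajectory, window=3):
    """Variance of the last `window` states of a (1-D) reverse-diffusion
    trajectory; hallucinated samples show unusually high variance near
    the end of sampling."""
    tail = trajectory[-window:]
    mean = sum(tail) / len(tail)
    return sum((x - mean) ** 2 for x in tail) / len(tail)

def filter_hallucinations(trajectories, threshold):
    """Keep final samples whose late-trajectory variance stays below
    the threshold; discard the rest as likely hallucinations."""
    return [tr[-1] for tr in trajectories
            if trajectory_variance(tr) < threshold]

stable = [0.9, 1.0, 1.01, 1.0]   # settles onto a data mode
wobbly = [0.1, 0.9, -0.5, 0.6]   # keeps jumping between modes
kept = filter_hallucinations([stable, wobbly], threshold=0.05)
```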

[LG-21] Advancing Graph Generation through Beta Diffusion

Link: https://arxiv.org/abs/2406.09357
Authors: Yilin He, Xinyang Liu, Bo Chen, Mingyuan Zhou
Keywords: generating natural images, effectiveness in generating, generating natural, natural images, extended to generate
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Click to view abstract

Abstract:Diffusion models have demonstrated effectiveness in generating natural images and have been extended to generate diverse data types, including graphs. This new generation of diffusion-based graph generative models has demonstrated significant performance improvements over methods that rely on variational autoencoders or generative adversarial networks. It’s important to recognize, however, that most of these models employ Gaussian or categorical diffusion processes, which can struggle with sparse and long-tailed data distributions. In our work, we introduce Graph Beta Diffusion (GBD), a diffusion-based generative model particularly adept at capturing diverse graph structures. GBD utilizes a beta diffusion process, tailored for the sparse and range-bounded characteristics of graph adjacency matrices. Furthermore, we have developed a modulation technique that enhances the realism of the generated graphs by stabilizing the generation of critical graph structures, while preserving flexibility elsewhere. The outstanding performance of GBD across three general graph benchmarks and two biochemical graph benchmarks highlights its capability to effectively capture the complexities of real-world graph data. The code will be made available at this https URL

[LG-22] Enhancing Domain Adaptation through Prompt Gradient Alignment

Link: https://arxiv.org/abs/2406.09353
Authors: Hoang Phan, Lam Tran, Quyen Tran, Trung Le
Keywords: Prior Unsupervised Domain, Unsupervised Domain Adaptation, Prior Unsupervised, sufficiently discriminative features, learning sufficiently discriminative
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 26 pages, 4 figures, 4 tables

Click to view abstract

Abstract:Prior Unsupervised Domain Adaptation (UDA) methods often aim to train a domain-invariant feature extractor, which may hinder the model from learning sufficiently discriminative features. To tackle this, a line of works based on prompt learning leverages the power of large-scale pre-trained vision-language models to learn both domain-invariant and specific features through a set of domain-agnostic and domain-specific learnable prompts. Those studies typically enforce invariant constraints on representation, output, or prompt space to learn such prompts. Differently, we cast UDA as a multiple-objective optimization problem in which each objective is represented by a domain loss. Under this new framework, we propose aligning per-objective gradients to foster consensus between them. Additionally, to prevent potential overfitting when fine-tuning this deep learning architecture, we penalize the norm of these gradients. To achieve these goals, we devise a practical gradient update procedure that can work under both single-source and multi-source UDA. Empirically, our method consistently surpasses other prompt-based baselines by a large margin on different UDA benchmarks
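One way to picture "aligning per-objective gradients to foster consensus" is the PCGrad-style projection: when two domain losses produce conflicting gradients, project each onto the normal plane of the other so the combined update hurts neither objective. This is a generic sketch of that idea, not the paper's specific update rule.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def align_gradients(g1, g2):
    """If the two per-domain gradients conflict (negative inner
    product), project each onto the normal plane of the other;
    otherwise return them unchanged."""
    d = dot(g1, g2)
    if d >= 0:
        return g1, g2
    g1p = [a - d / dot(g2, g2) * b for a, b in zip(g1, g2)]
    g2p = [b - d / dot(g1, g1) * a for a, b in zip(g1, g2)]
    return g1p, g2p

g_src, g_tgt = [1.0, 0.0], [-1.0, 1.0]   # conflicting domain gradients
a, b = align_gradients(g_src, g_tgt)
```

After projection each gradient is orthogonal to the other objective's original gradient, so neither domain's loss increases to first order; the paper additionally penalizes gradient norms to curb overfitting.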

[LG-23] On the Expressibility of the Reconstructional Color Refinement

Link: https://arxiv.org/abs/2406.09351
Authors: V. Arvind, Johannes Köbler, Oleg Verbitsky
Keywords: famous Ulam reconstruction, basic facts related, Ulam reconstruction conjecture, famous Ulam, Reconstruction Graph Neural
Subjects: Computational Complexity (cs.CC); Discrete Mathematics (cs.DM); Machine Learning (cs.LG); Combinatorics (math.CO)
Comments: 9 pages

Click to view abstract

Abstract:One of the most basic facts related to the famous Ulam reconstruction conjecture is that the connectedness of a graph can be determined by the deck of its vertex-deleted subgraphs, which are considered up to isomorphism. We strengthen this result by proving that connectedness can still be determined when the subgraphs in the deck are given up to equivalence under the color refinement isomorphism test. Consequently, this implies that connectedness is recognizable by Reconstruction Graph Neural Networks, a recently introduced GNN architecture inspired by the reconstruction conjecture (Cotta, Morris, Ribeiro 2021).
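The color refinement isomorphism test mentioned above (also known as 1-WL) iteratively re-colors each vertex by its current color together with the multiset of its neighbors' colors until the partition stabilizes. A minimal implementation of the test itself (the paper's deck-based connectedness result is not reproduced here):

```python
def color_refinement(adj):
    """1-WL color refinement: repeatedly re-color every vertex by its
    current color together with the multiset of its neighbors' colors,
    until the color partition stops changing."""
    colors = {v: 0 for v in adj}
    while True:
        # Signature = (own color, sorted multiset of neighbor colors).
        sigs = {v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
                for v in adj}
        palette = {s: i for i, s in enumerate(sorted(set(sigs.values())))}
        new = {v: palette[sigs[v]] for v in adj}
        if new == colors:
            return colors
        colors = new

path3 = {0: [1], 1: [0, 2], 2: [1]}           # path: endpoints vs. middle
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}  # vertex-transitive
stable_path = color_refinement(path3)
stable_tri = color_refinement(triangle)
```

On the 3-vertex path the two endpoints receive one color and the middle vertex another, while all vertices of the triangle stay in a single color class, illustrating the equivalence classes up to which the paper's deck subgraphs are given.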

[LG-24] Separations in the Representational Capabilities of Transformers and Recurrent Architectures

Link: https://arxiv.org/abs/2406.09347
Authors: Satwik Bhattamishra, Michael Hahn, Phil Blunsom, Varun Kanade
Keywords: bounded Dyck languages, widely adopted, adopted in foundation, Transformers, linear size
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Preprint

Click to view abstract

Abstract:Transformer architectures have been widely adopted in foundation models. Due to their high inference costs, there is renewed interest in exploring the potential of efficient recurrent architectures (RNNs). In this paper, we analyze the differences in the representational capabilities of Transformers and RNNs across several tasks of practical relevance, including index lookup, nearest neighbor, recognizing bounded Dyck languages, and string equality. For the tasks considered, our results show separations based on the size of the model required for different architectures. For example, we show that a one-layer Transformer of logarithmic width can perform index lookup, whereas an RNN requires a hidden state of linear size. Conversely, while constant-size RNNs can recognize bounded Dyck languages, we show that one-layer Transformers require a linear size for this task. Furthermore, we show that two-layer Transformers of logarithmic size can perform decision tasks such as string equality or disjointness, whereas both one-layer Transformers and recurrent models require linear size for these tasks. We also show that a log-size two-layer Transformer can implement the nearest neighbor algorithm in its forward pass; on the other hand recurrent models require linear size. Our constructions are based on the existence of N nearly orthogonal vectors in O(\log N) dimensional space and our lower bounds are based on reductions from communication complexity problems. We supplement our theoretical results with experiments that highlight the differences in the performance of these architectures on practical-size sequences.
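The key geometric fact behind the constructions, the existence of N nearly orthogonal vectors in O(log N) dimensions, is easy to demonstrate empirically: random unit sign vectors in a dimension that is only logarithmic in N already have small pairwise inner products with high probability. The parameters below are illustrative, not the paper's.

```python
import math, random

def random_sign_vectors(n, dim, seed=0):
    """n unit vectors with i.i.d. +-1/sqrt(dim) entries: the standard
    probabilistic construction of nearly orthogonal vectors."""
    rng = random.Random(seed)
    return [[rng.choice((-1.0, 1.0)) / math.sqrt(dim) for _ in range(dim)]
            for _ in range(n)]

def max_cross_inner(vectors):
    """Largest |<u, v>| over distinct pairs of vectors."""
    worst = 0.0
    for i, u in enumerate(vectors):
        for v in vectors[i + 1:]:
            worst = max(worst, abs(sum(a * b for a, b in zip(u, v))))
    return worst

vecs = random_sign_vectors(n=64, dim=200)  # dim ~ C * log(n) for a constant C
worst = max_cross_inner(vecs)              # concentrates around 1/sqrt(dim)
```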

[LG-25] Scoreformer: A Surrogate Model For Large-Scale Prediction of Docking Scores

Link: https://arxiv.org/abs/2406.09346
Authors: Álvaro Ciudad, Adrián Morales-Pastor, Laura Malo, Isaac Filella-Mercè, Victor Guallar, Alexis Molina
Keywords: high-throughput virtual screening, optimizing high-throughput virtual, accurately predict molecular, Principal Neighborhood Aggregation, Walk Positional Encodings
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)
Comments:

Click to view abstract

Abstract:In this study, we present ScoreFormer, a novel graph transformer model designed to accurately predict molecular docking scores, thereby optimizing high-throughput virtual screening (HTVS) in drug discovery. The architecture integrates Principal Neighborhood Aggregation (PNA) and Learnable Random Walk Positional Encodings (LRWPE), enhancing the model’s ability to understand complex molecular structures and their relationship with their respective docking scores. This approach significantly surpasses traditional HTVS methods and recent Graph Neural Network (GNN) models in both recovery and efficiency due to a wider coverage of the chemical space and enhanced performance. Our results demonstrate that ScoreFormer achieves competitive performance in docking score prediction and offers a substantial 1.65-fold reduction in inference time compared to existing models. We evaluated ScoreFormer across multiple datasets under various conditions, confirming its robustness and reliability in identifying potential drug candidates rapidly.

[LG-26] Learning the Influence Graph of a High-Dimensional Markov Process with Memory

Link: https://arxiv.org/abs/2406.09338
Authors: Smita Bagewadi, Avhishek Chatterjee
Keywords: financial risk analysis, Motivated by multiple, high-dimensional multivariate discrete-time, influence graph, discrete-time Markov process
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments:

Click to view abstract

Abstract:Motivated by multiple applications in social networks, nervous systems, and financial risk analysis, we consider the problem of learning the underlying (directed) influence graph or causal graph of a high-dimensional multivariate discrete-time Markov process with memory. At any discrete time instant, each observed variable of the multivariate process is a binary string of random length, which is parameterized by an unobservable or hidden [0,1]-valued scalar. The hidden scalars corresponding to the variables evolve according to discrete-time linear stochastic dynamics dictated by the underlying influence graph whose nodes are the variables. We extend an existing algorithm for learning i.i.d. graphical models to this Markovian setting with memory and prove that it can learn the influence graph based on the binary observations using logarithmic (in number of variables or nodes) samples when the degree of the influence graph is bounded. The crucial analytical contribution of this work is the derivation of the sample complexity result by upper and lower bounding the rate of convergence of the observed Markov process with memory to its stationary distribution in terms of the parameters of the influence graph.

[LG-27] Is Value Learning Really the Main Bottleneck in Offline RL?

Link: https://arxiv.org/abs/2406.09329
Authors: Seohong Park, Kevin Frans, Sergey Levine, Aviral Kumar
Keywords: offline, learning requires access, imitation learning requires, substantially lower data, lower data quality
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:While imitation learning requires access to high-quality data, offline reinforcement learning (RL) should, in principle, perform similarly or better with substantially lower data quality by using a value function. However, current results indicate that offline RL often performs worse than imitation learning, and it is often unclear what holds back the performance of offline RL. Motivated by this observation, we aim to understand the bottlenecks in current offline RL algorithms. While poor performance of offline RL is typically attributed to an imperfect value function, we ask: is the main bottleneck of offline RL indeed in learning the value function, or something else? To answer this question, we perform a systematic empirical study of (1) value learning, (2) policy extraction, and (3) policy generalization in offline RL problems, analyzing how these components affect performance. We make two surprising observations. First, we find that the choice of a policy extraction algorithm significantly affects the performance and scalability of offline RL, often more so than the value learning objective. For instance, we show that common value-weighted behavioral cloning objectives (e.g., AWR) do not fully leverage the learned value function, and switching to behavior-constrained policy gradient objectives (e.g., DDPG+BC) often leads to substantial improvements in performance and scalability. Second, we find that a big barrier to improving offline RL performance is often imperfect policy generalization on test-time states out of the support of the training data, rather than policy learning on in-distribution states. We then show that the use of suboptimal but high-coverage data or test-time policy training techniques can address this generalization issue in practice. Specifically, we propose two simple test-time policy improvement methods and show that these methods lead to better performance.
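To make the two policy-extraction objectives contrasted above concrete: AWR clones dataset actions with exponential advantage weights, whereas DDPG+BC directly maximizes the learned Q-value under a behavioral-cloning penalty. The one-line surrogates below are schematic sketches of those objectives (scalar actions, hypothetical names), not the paper's implementations.

```python
import math

def awr_weights(advantages, beta=1.0):
    """Advantage-weighted regression: behavioral cloning where each
    dataset action is weighted by exp(advantage / beta)."""
    return [math.exp(a / beta) for a in advantages]

def ddpg_bc_objective(q_value, pi_action, data_action, alpha=1.0):
    """Behavior-constrained policy-gradient surrogate (DDPG+BC style):
    maximize Q(s, pi(s)) minus an alpha-weighted cloning penalty that
    keeps pi(s) near the dataset action."""
    return q_value - alpha * (pi_action - data_action) ** 2
```

The structural difference is visible even at this level: AWR only reweights dataset actions, while DDPG+BC lets the policy move toward higher-Q actions as long as the cloning penalty permits, which is the leverage of the value function the paper argues AWR leaves unused.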

[LG-28] Active Inference Meeting Energy-Efficient Control of Parallel and Identical Machines

链接: https://arxiv.org/abs/2406.09322
作者: Yavar Taheri Yeganeh,Mohsen Jafari,Andrea Matta
关键词: developing energy-efficient control, active inference, energy-efficient control agents, deep active inference, manufacturing systems
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We investigate the application of active inference in developing energy-efficient control agents for manufacturing systems. Active inference, rooted in neuroscience, provides a unified probabilistic framework integrating perception, learning, and action, with inherent uncertainty quantification elements. Our study explores deep active inference, an emerging field that combines deep learning with the active inference decision-making framework. Leveraging a deep active inference agent, we focus on controlling parallel and identical machine workstations to enhance energy efficiency. We address challenges posed by the problem’s stochastic nature and delayed policy response by introducing tailored enhancements to existing agent architectures. Specifically, we introduce multi-step transition and hybrid horizon methods to mitigate the need for complex planning. Our experimental results demonstrate the effectiveness of these enhancements and highlight the potential of the active inference-based approach.

[LG-29] Vertical LoRA: Dense Expectation-Maximization Interpretation of Transformers

链接: https://arxiv.org/abs/2406.09315
作者: Zhuolin Fu
关键词: Bayesian Nets, dense Expectation-Maximization algorithms, Expectation-Maximization algorithms performed, performed on Bayesian, interpreted as dense
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this paper, we show how Transformers can be interpreted as dense Expectation-Maximization algorithms performed on Bayesian Nets. Based on the above interpretation, we propose a new model design paradigm, namely Vertical LoRA (VLoRA), which reduces the parameter count dramatically while preserving performance. In VLoRA, a model consists of layers, each of which recursively learns an increment based on the previous layer. We then apply LoRA decomposition to the increments. VLoRA works on the base model, which is orthogonal to LoRA, meaning they can be used together. We do experiments on various tasks and models. The results show that 1) with VLoRA, the Transformer model parameter count can be reduced dramatically and 2) the performance of the original model is preserved. The source code is available at this https URL
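The recursive-increment idea in the abstract (each layer's weight is the previous layer's plus a low-rank increment) can be sketched with small matrices. The dimensions, the rank-1 factors, and the recursion itself are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of the VLoRA idea: W_l = W_{l-1} + A_l @ B_l, where
# only the small low-rank factors A_l, B_l are stored per layer.

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def madd(X, Y):
    return [[X[i][j] + Y[i][j] for j in range(len(X[0]))] for i in range(len(X))]

def vlora_weights(W0, increments):
    """Reconstruct per-layer weights from a base W0 and low-rank (A, B) pairs."""
    weights, W = [], W0
    for A, B in increments:          # A is d x r, B is r x d (here r = 1)
        W = madd(W, matmul(A, B))    # W_l = W_{l-1} + A_l B_l
        weights.append(W)
    return weights

W0 = [[1.0, 0.0], [0.0, 1.0]]                 # 2x2 base weight
incs = [([[1.0], [0.0]], [[0.0, 2.0]]),       # rank-1 increment, layer 1
        ([[0.0], [1.0]], [[3.0, 0.0]])]       # rank-1 increment, layer 2
Ws = vlora_weights(W0, incs)
print(Ws[0])  # [[1.0, 2.0], [0.0, 1.0]]
print(Ws[1])  # [[1.0, 2.0], [3.0, 1.0]]
```

With rank r and d-dimensional layers, each layer stores 2dr parameters instead of d², which is where the claimed parameter reduction comes from.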

[LG-30] Transformers meet Neural Algorithmic Reasoners

链接: https://arxiv.org/abs/2406.09308
作者: Wilfried Bounsi,Borja Ibarz,Andrew Dudzik,Jessica B. Hamrick,Larisa Markeeva,Alex Vitvitskyi,Razvan Pascanu,Petar Veličković
关键词: revolutionized machine learning, revolutionized machine, machine learning, Transformer language understanding, language understanding
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: To appear at CVPR 2024 Multimodal Algorithmic Reasoning (MAR) Workshop. 10 pages, 5 figures

点击查看摘要

Abstract:Transformers have revolutionized machine learning with their simple yet effective architecture. Pre-training Transformers on massive text datasets from the Internet has led to unmatched generalization for natural language understanding (NLU) tasks. However, such language models remain fragile when tasked with algorithmic forms of reasoning, where computations must be precise and robust. To address this limitation, we propose a novel approach that combines the Transformer’s language understanding with the robustness of graph neural network (GNN)-based neural algorithmic reasoners (NARs). Such NARs proved effective as generic solvers for algorithmic tasks, when specified in graph form. To make their embeddings accessible to a Transformer, we propose a hybrid architecture with a two-phase training procedure, allowing the tokens in the language model to cross-attend to the node embeddings from the NAR. We evaluate our resulting TransNAR model on CLRS-Text, the text-based version of the CLRS-30 benchmark, and demonstrate significant gains over Transformer-only models for algorithmic reasoning, both in and out of distribution.
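The cross-attention the abstract describes (language tokens attending to the NAR's node embeddings) reduces to standard dot-product attention with queries from one modality and keys/values from the other. Below is a single-query sketch; dimensions and the absence of learned projections are simplifying assumptions.

```python
import math

# Illustrative sketch (not the paper's code) of the hybrid idea: a language
# token query cross-attends over node embeddings from a graph-based reasoner.

def cross_attend(token_q, node_embs):
    """One token query over NAR node embeddings via dot-product attention."""
    scores = [sum(q * k for q, k in zip(token_q, n)) for n in node_embs]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]   # numerically stable softmax
    z = sum(exps)
    attn = [e / z for e in exps]
    # weighted sum of node embeddings (keys double as values here)
    return [sum(a * n[d] for a, n in zip(attn, node_embs))
            for d in range(len(node_embs[0]))]

nodes = [[1.0, 0.0], [0.0, 1.0]]
out = cross_attend([10.0, 0.0], nodes)   # query aligned with the first node
print([round(x, 3) for x in out])        # [1.0, 0.0]
```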

[LG-31] A tutorial on fairness in machine learning in healthcare

链接: https://arxiv.org/abs/2406.09307
作者: Jianhui Gao,Benson Chou,Zachary R. McCaw,Hilary Thurston,Paul Varghese,Chuan Hong,Jessica Gronsbell
关键词: clinical decision making, existing healthcare inequities, machine learning, algorithms are safe, safe and effective
类目: Machine Learning (cs.LG); Computers and Society (cs.CY); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:OBJECTIVE: Ensuring that machine learning (ML) algorithms are safe and effective within all patient groups, and do not disadvantage particular patients, is essential to clinical decision making and preventing the reinforcement of existing healthcare inequities. The objective of this tutorial is to introduce the medical informatics community to the common notions of fairness within ML, focusing on clinical applications and implementation in practice. TARGET AUDIENCE: As gaps in fairness arise in a variety of healthcare applications, this tutorial is designed to provide an understanding of fairness, without assuming prior knowledge, to researchers and clinicians who make use of modern clinical data. SCOPE: We describe the fundamental concepts and methods used to define fairness in ML, including an overview of why models in healthcare may be unfair, a summary and comparison of the metrics used to quantify fairness, and a discussion of some ongoing research. We illustrate some of the fairness methods introduced through a case study of mortality prediction in a publicly available electronic health record dataset. Finally, we provide a user-friendly R package for comprehensive group fairness evaluation, enabling researchers and clinicians to assess fairness in their own ML work.
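Two of the standard group-fairness metrics such tutorials compare can be computed in a few lines. The toy data and the choice of exactly these two metrics (demographic parity difference and the true-positive-rate gap, one component of equalized odds) are illustrative assumptions.

```python
# Minimal sketch of two group-fairness metrics on toy binary predictions.

def positive_rate(preds):
    return sum(preds) / len(preds)

def tpr(preds, labels):
    """True-positive rate: fraction of actual positives predicted positive."""
    pos = [p for p, y in zip(preds, labels) if y == 1]
    return sum(pos) / len(pos)

def demographic_parity_diff(preds_a, preds_b):
    """Gap in positive-prediction rates between groups A and B."""
    return abs(positive_rate(preds_a) - positive_rate(preds_b))

def tpr_gap(preds_a, labels_a, preds_b, labels_b):
    """Gap in true-positive rates between groups (equal opportunity)."""
    return abs(tpr(preds_a, labels_a) - tpr(preds_b, labels_b))

# Group A is predicted positive 3/4 of the time, group B only 1/4.
pa, ya = [1, 1, 1, 0], [1, 1, 0, 0]
pb, yb = [1, 0, 0, 0], [1, 1, 0, 0]
print(demographic_parity_diff(pa, pb))  # 0.5
print(tpr_gap(pa, ya, pb, yb))          # 0.5
```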

[LG-32] MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding

链接: https://arxiv.org/abs/2406.09297
作者: Zayd Muhammad Kawakibi Zuhri,Muhammad Farid Adilazuarda,Ayu Purwarianti,Alham Fikri Aji
关键词: sequence length grow, major memory bottlenecks, transformers benefit greatly, Auto-regressive inference, benefit greatly
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Auto-regressive inference of transformers benefits greatly from Key-Value (KV) caching, but can lead to major memory bottlenecks as model size, batch size, and sequence length grow at scale. We introduce Multi-Layer Key-Value (MLKV) sharing, a novel approach extending KV sharing across transformer layers to reduce memory usage beyond what was possible with Multi-Query Attention (MQA) and Grouped-Query Attention (GQA). Evaluations on various NLP benchmarks and inference metrics using uptrained Pythia-160M variants demonstrate that MLKV significantly reduces memory usage with minimal performance loss, reducing KV cache size down to a factor of 6x compared to MQA. These results highlight MLKV’s potential for efficient deployment of transformer models at scale. We provide code at this https URL
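The memory arithmetic behind cross-layer KV sharing is straightforward to sketch. The grouping scheme below (consecutive layers share one set of KV heads) and the function name are illustrative assumptions; the paper's exact MLKV layout may differ.

```python
# Hedged sketch of KV-cache size under per-layer heads (MHA), one head per
# layer (MQA), and MLKV-style sharing of one head across groups of layers.

def kv_cache_entries(n_layers, kv_heads_per_layer, layers_per_group=1):
    """Number of cached KV head-sets: layers within a group reuse one set."""
    n_groups = -(-n_layers // layers_per_group)   # ceiling division
    return n_groups * kv_heads_per_layer

print(kv_cache_entries(12, 12))      # 144  (MHA: 12 layers x 12 KV heads)
print(kv_cache_entries(12, 1))       # 12   (MQA: 1 KV head per layer)
print(kv_cache_entries(12, 1, 6))    # 2    (sharing across 6-layer groups)
```

The last configuration is 6x smaller than MQA, matching the order of reduction the abstract reports.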

[LG-33] You Don't Need Data-Augmentation in Self-Supervised Learning

链接: https://arxiv.org/abs/2406.09294
作者: Théo Moutakanni,Maxime Oquab,Marc Szafraniec,Maria Vakalopoulou,Piotr Bojanowski
关键词: Joint-Embedding Predictive Architectures, led to outstanding, Predictive Architectures, outstanding performances, Joint-Embedding Architectures
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Self-Supervised learning (SSL) with Joint-Embedding Architectures (JEA) has led to outstanding performances. All instantiations of this paradigm were trained using strong and well-established hand-crafted data augmentations, leading to the general belief that they are required for the proper training and performance of such models. On the other hand, generative reconstruction-based models such as BEIT and MAE or Joint-Embedding Predictive Architectures such as I-JEPA have shown strong performance without using data augmentations except masking. In this work, we challenge the importance of invariance and data-augmentation in JEAs at scale. By running a case-study on a recent SSL foundation model - DINOv2 - we show that strong image representations can be obtained with JEAs and only cropping without resizing provided the training data is large enough, reaching state-of-the-art results and using the least amount of augmentation in the literature. Through this study, we also discuss the impact of compute constraints on the outcomes of experimental deep learning research, showing that they can lead to very different conclusions.

[LG-34] Neural Assets: 3D-Aware Multi-Object Scene Synthesis with Image Diffusion Models

链接: https://arxiv.org/abs/2406.09292
作者: Ziyi Wu,Yulia Rubanova,Rishabh Kabra,Drew A. Hudson,Igor Gilitschenski,Yusuf Aytar,Sjoerd van Steenkiste,Kelsey R. Allen,Thomas Kipf
关键词: Neural Assets, address the problem, Assets, Neural, pose
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Additional details and video results are available at this https URL

点击查看摘要

Abstract:We address the problem of multi-object 3D pose control in image diffusion models. Instead of conditioning on a sequence of text tokens, we propose to use a set of per-object representations, Neural Assets, to control the 3D pose of individual objects in a scene. Neural Assets are obtained by pooling visual representations of objects from a reference image, such as a frame in a video, and are trained to reconstruct the respective objects in a different image, e.g., a later frame in the video. Importantly, we encode object visuals from the reference image while conditioning on object poses from the target frame. This enables learning disentangled appearance and pose features. Combining visual and 3D pose representations in a sequence-of-tokens format allows us to keep the text-to-image architecture of existing models, with Neural Assets in place of text tokens. By fine-tuning a pre-trained text-to-image diffusion model with this information, our approach enables fine-grained 3D pose and placement control of individual objects in a scene. We further demonstrate that Neural Assets can be transferred and recomposed across different scenes. Our model achieves state-of-the-art multi-object editing results on both synthetic 3D scene datasets, as well as two real-world video datasets (Objectron, Waymo Open).

[LG-35] A Flexible Equivariant Framework for Subgraph GNNs via Graph Products and Graph Coarsening

链接: https://arxiv.org/abs/2406.09291
作者: Guy Bar-Shalom,Yam Eitan,Fabrizio Frasca,Haggai Maron
关键词: Graph Neural Networks, Neural Networks, Subgraph Graph Neural, Graph Neural, enhance the expressivity
类目: Machine Learning (cs.LG)
*备注: Preprint, under review

点击查看摘要

Abstract:Subgraph Graph Neural Networks (Subgraph GNNs) enhance the expressivity of message-passing GNNs by representing graphs as sets of subgraphs. They have shown impressive performance on several tasks, but their complexity limits applications to larger graphs. Previous approaches suggested processing only subsets of subgraphs, selected either randomly or via learnable sampling. However, they make suboptimal subgraph selections or can only cope with very small subset sizes, inevitably incurring performance degradation. This paper introduces a new Subgraph GNNs framework to address these issues. We employ a graph coarsening function to cluster nodes into super-nodes with induced connectivity. The product between the coarsened and the original graph reveals an implicit structure whereby subgraphs are associated with specific sets of nodes. By running generalized message-passing on such graph product, our method effectively implements an efficient, yet powerful Subgraph GNN. Controlling the coarsening function enables meaningful selection of any number of subgraphs while, contrary to previous methods, being fully compatible with standard training techniques. Notably, we discover that the resulting node feature tensor exhibits new, unexplored permutation symmetries. We leverage this structure, characterize the associated linear equivariant layers and incorporate them into the layers of our Subgraph GNN architecture. Extensive experiments on multiple graph learning benchmarks demonstrate that our method is significantly more flexible than previous approaches, as it can seamlessly handle any number of subgraphs, while consistently outperforming baseline approaches.

[LG-36] Understanding Jailbreak Success: A Study of Latent Space Dynamics in Large Language Models

链接: https://arxiv.org/abs/2406.09289
作者: Sarah Ball,Frauke Kreuter,Nina Rimsky
关键词: Conversational Large Language, Conversational Large, answer harmful questions, Large Language Models, Large Language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Conversational Large Language Models are trained to refuse to answer harmful questions. However, emergent jailbreaking techniques can still elicit unsafe outputs, presenting an ongoing challenge for model alignment. To better understand how different jailbreak types circumvent safeguards, this paper analyses model activations on different jailbreak inputs. We find that it is possible to extract a jailbreak vector from a single class of jailbreaks that works to mitigate jailbreak effectiveness from other classes. This may indicate that different kinds of effective jailbreaks operate via similar internal mechanisms. We investigate a potential common mechanism of harmfulness feature suppression, and provide evidence for its existence by looking at the harmfulness vector component. These findings offer actionable insights for developing more robust jailbreak countermeasures and lay the groundwork for a deeper, mechanistic understanding of jailbreak dynamics in language models.
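A common way to extract a direction like the "jailbreak vector" described above is a difference of mean activations between two prompt classes, which can then be projected out of a new activation. This is a generic steering-vector sketch under assumed toy activations, not the paper's code.

```python
# Minimal sketch: direction = mean(jailbreak activations) - mean(benign
# activations); mitigation removes that component from an activation.

def mean_vec(vectors):
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def extract_direction(jailbreak_acts, benign_acts):
    mj, mb = mean_vec(jailbreak_acts), mean_vec(benign_acts)
    return [a - b for a, b in zip(mj, mb)]

def project_out(x, direction):
    """Remove the component of x along direction (ablation/steering)."""
    dd = sum(d * d for d in direction)
    coef = sum(a * d for a, d in zip(x, direction)) / dd
    return [a - coef * d for a, d in zip(x, direction)]

jb = [[2.0, 0.0], [4.0, 0.0]]          # toy jailbreak-prompt activations
benign = [[0.0, 0.0], [0.0, 2.0]]      # toy benign-prompt activations
v = extract_direction(jb, benign)
x = project_out([3.0, -1.0], v)        # activation aligned with v
print(v)                               # [3.0, -1.0]
print(x)                               # component along v fully removed
```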

[LG-37] Zero-Shot Learning Over Large Output Spaces : Utilizing Indirect Knowledge Extraction from Large Language Models

链接: https://arxiv.org/abs/2406.09288
作者: Jinbin Zhang,Nasib Ullah,Rohit Babbar
关键词: Extreme Multi-label Learning, Multi-label Learning, Extreme Zero-shot XMC, Extreme Multi-label, predefined label set
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Extreme Multi-label Learning (XMC) is a task that allocates the most relevant labels for an instance from a predefined label set. Extreme Zero-shot XMC (EZ-XMC) is a special setting of XMC wherein no supervision is provided; only the instances (raw text of the document) and the predetermined label set are given. The scenario is designed to address cold-start problems in categorization and recommendation. Traditional state-of-the-art methods extract pseudo labels from the document title or segments. These labels from the document are used to train a zero-shot bi-encoder model. The main issue with these generated labels is their misalignment with the tagging task. In this work, we propose a framework to train a small bi-encoder model via the feedback from the large language model (LLM); the bi-encoder model encodes the document and labels into embeddings for retrieval. Our approach leverages the zero-shot ability of LLM to assess the correlation between labels and the document instead of using the low-quality labels extracted from the document itself. Our method also guarantees fast inference without the involvement of LLM. The performance of our approach outperforms the SOTA methods on various datasets while retaining a similar training time for large datasets.

[LG-38] Flexible Heteroscedastic Count Regression with Deep Double Poisson Networks

链接: https://arxiv.org/abs/2406.09262
作者: Spencer Young,Porter Jenkins,Lonchao Da,Jeff Dotson,Hua Wei
关键词: input-conditional uncertainty representations, real-world applications, Deep Double Poisson, representations are critical, critical for real-world
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Neural networks that can produce accurate, input-conditional uncertainty representations are critical for real-world applications. Recent progress on heteroscedastic continuous regression has shown great promise for calibrated uncertainty quantification on complex tasks, like image regression. However, when these methods are applied to discrete regression tasks, such as crowd counting, ratings prediction, or inventory estimation, they tend to produce predictive distributions with numerous pathologies. We propose to address these issues by training a neural network to output the parameters of a Double Poisson distribution, which we call the Deep Double Poisson Network (DDPN). In contrast to existing methods that are trained to minimize Gaussian negative log likelihood (NLL), DDPNs produce a proper probability mass function over discrete output. Additionally, DDPNs naturally model under-, over-, and equi-dispersion, unlike networks trained with the more rigid Poisson and Negative Binomial parameterizations. We show DDPNs 1) vastly outperform existing discrete models; 2) meet or exceed the accuracy and flexibility of networks trained with Gaussian NLL; 3) produce proper predictive distributions over discrete counts; and 4) exhibit superior out-of-distribution detection. DDPNs can easily be applied to a variety of count regression datasets including tabular, image, point cloud, and text data.
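The Double Poisson family the abstract builds on (Efron, 1986) has mean approximately mu and variance approximately mu/phi, so one parameterization covers under- (phi > 1), over- (phi < 1), and equi-dispersion (phi = 1). The sketch below only verifies that moment relationship by brute-force normalization; it is not the paper's DDPN code.

```python
import math

# Hedged sketch: unnormalized Double Poisson mass, normalized numerically
# to check mean ~ mu and variance ~ mu / phi.

def double_poisson_unnorm(y, mu, phi):
    """Unnormalized Double Poisson mass at count y."""
    if y == 0:
        return math.sqrt(phi) * math.exp(-phi * mu)
    base = math.exp(-y) * (y ** y / math.factorial(y))
    tilt = (math.e * mu / y) ** (phi * y)
    return math.sqrt(phi) * math.exp(-phi * mu) * base * tilt

def moments(mu, phi, n_max=60):
    ps = [double_poisson_unnorm(y, mu, phi) for y in range(n_max)]
    z = sum(ps)
    mean = sum(y * p for y, p in enumerate(ps)) / z
    var = sum(y * y * p for y, p in enumerate(ps)) / z - mean ** 2
    return mean, var

m1, v1 = moments(5.0, 1.0)   # phi = 1 reduces exactly to Poisson(5)
m2, v2 = moments(5.0, 2.0)   # under-dispersed: variance ~ 5 / 2
print(round(m1, 2), round(v1, 2))   # 5.0 5.0
print(round(m2, 2), round(v2, 2))
```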

[LG-39] Assessing Model Generalization in Vicinity

链接: https://arxiv.org/abs/2406.09257
作者: Yuchi Liu,Yifan Sun,Jingdong Wang,Liang Zheng
关键词: ground truth labels, paper evaluates, ability of classification, depending on ground, ground truth
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This paper evaluates the generalization ability of classification models on out-of-distribution test sets without depending on ground truth labels. Common approaches often calculate an unsupervised metric related to a specific model property, like confidence or invariance, which correlates with out-of-distribution accuracy. However, these metrics are typically computed for each test sample individually, leading to potential issues caused by spurious model responses, such as overly high or low confidence. To tackle this challenge, we propose incorporating responses from neighboring test samples into the correctness assessment of each individual sample. In essence, if a model consistently demonstrates high correctness scores for nearby samples, it increases the likelihood of correctly predicting the target sample, and vice versa. The resulting scores are then averaged across all test samples to provide a holistic indication of model accuracy. Developed under the vicinal risk formulation, this approach, named vicinal risk proxy (VRP), computes accuracy without relying on labels. We show that applying the VRP method to existing generalization indicators, such as average confidence and effective invariance, consistently improves over these baselines both methodologically and experimentally. This yields a stronger correlation with model accuracy, especially on challenging out-of-distribution test sets.
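The vicinal idea above (score each sample by its neighbors' proxy scores, not its own alone) can be sketched directly. The choice of proxy (model confidence), the distance, and k are all illustrative assumptions, not the paper's exact VRP construction.

```python
# Simplified sketch: average an unsupervised correctness proxy over each
# test sample's k nearest neighbors in feature space, damping spuriously
# high or low per-sample responses.

def vicinal_scores(features, confidences, k=2):
    scores = []
    for f in features:
        order = sorted(range(len(features)),
                       key=lambda j: sum((a - b) ** 2
                                         for a, b in zip(f, features[j])))
        neighbors = order[:k]          # includes the sample itself
        scores.append(sum(confidences[j] for j in neighbors) / k)
    return scores

feats = [[0.0], [0.1], [5.0]]
confs = [0.9, 0.1, 0.8]               # sample 1 is spuriously low-confidence
scores = vicinal_scores(feats, confs)
print([round(s, 2) for s in scores])  # [0.5, 0.5, 0.45]
```

Averaging the resulting scores over all test samples gives the label-free accuracy estimate described in the abstract.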

[LG-40] MirrorCheck: Efficient Adversarial Defense for Vision-Language Models

链接: https://arxiv.org/abs/2406.09250
作者: Samar Fares,Klea Ziu,Toluwani Aremu,Nikita Durasov,Martin Takáč,Pascal Fua,Karthik Nandakumar,Ivan Laptev
关键词: increasingly vulnerable, Vision-Language Models, adversarial, adversarial threats, VLMs
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Vision-Language Models (VLMs) are becoming increasingly vulnerable to adversarial attacks as various novel attack strategies are being proposed against these models. While existing defenses excel in unimodal contexts, they currently fall short in safeguarding VLMs against adversarial threats. To mitigate this vulnerability, we propose a novel, yet elegantly simple approach for detecting adversarial samples in VLMs. Our method leverages Text-to-Image (T2I) models to generate images based on captions produced by target VLMs. Subsequently, we calculate the similarities of the embeddings of both input and generated images in the feature space to identify adversarial samples. Empirical evaluations conducted on different datasets validate the efficacy of our approach, outperforming baseline methods adapted from image classification domains. Furthermore, we extend our methodology to classification tasks, showcasing its adaptability and model-agnostic nature. Theoretical analyses and empirical findings also show the resilience of our approach against adaptive attacks, positioning it as an excellent defense mechanism for real-world deployment against adversarial threats.
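The detection rule described above reduces to an embedding-similarity test once the T2I model and image encoder are abstracted away as precomputed embeddings. The threshold and the toy embeddings below are illustrative assumptions.

```python
import math

# Sketch of the MirrorCheck-style decision: regenerate an image from the
# VLM's caption, embed both images, and flag low similarity as adversarial.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def is_adversarial(input_emb, regenerated_emb, threshold=0.5):
    """Low similarity => the caption (and hence the VLM) drifted from the
    actual image content, suggesting an adversarial input."""
    return cosine(input_emb, regenerated_emb) < threshold

clean = ([1.0, 0.0], [0.9, 0.1])      # caption matches image content
attacked = ([1.0, 0.0], [0.0, 1.0])   # caption led the T2I model elsewhere
print(is_adversarial(*clean))     # False
print(is_adversarial(*attacked))  # True
```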

[LG-41] OpenVLA: An Open-Source Vision-Language-Action Model

链接: https://arxiv.org/abs/2406.09246
作者: Moo Jin Kim,Karl Pertsch,Siddharth Karamcheti,Ted Xiao,Ashwin Balakrishna,Suraj Nair,Rafael Rafailov,Ethan Foster,Grace Lam,Pannag Sanketi,Quan Vuong,Thomas Kollar,Benjamin Burchfiel,Russ Tedrake,Dorsa Sadigh,Sergey Levine,Percy Liang,Chelsea Finn
关键词: Large policies pretrained, Internet-scale vision-language data, Large policies, combination of Internet-scale, Internet-scale vision-language
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: Website: this https URL

点击查看摘要

Abstract:Large policies pretrained on a combination of Internet-scale vision-language data and diverse robot demonstrations have the potential to change how we teach robots new skills: rather than training new behaviors from scratch, we can fine-tune such vision-language-action (VLA) models to obtain robust, generalizable policies for visuomotor control. Yet, widespread adoption of VLAs for robotics has been challenging as 1) existing VLAs are largely closed and inaccessible to the public, and 2) prior work fails to explore methods for efficiently fine-tuning VLAs for new tasks, a key component for adoption. Addressing these challenges, we introduce OpenVLA, a 7B-parameter open-source VLA trained on a diverse collection of 970k real-world robot demonstrations. OpenVLA builds on a Llama 2 language model combined with a visual encoder that fuses pretrained features from DINOv2 and SigLIP. As a product of the added data diversity and new model components, OpenVLA demonstrates strong results for generalist manipulation, outperforming closed models such as RT-2-X (55B) by 16.5% in absolute task success rate across 29 tasks and multiple robot embodiments, with 7x fewer parameters. We further show that we can effectively fine-tune OpenVLA for new settings, with especially strong generalization results in multi-task environments involving multiple objects and strong language grounding abilities, and outperform expressive from-scratch imitation learning methods such as Diffusion Policy by 20.4%. We also explore compute efficiency; as a separate contribution, we show that OpenVLA can be fine-tuned on consumer GPUs via modern low-rank adaptation methods and served efficiently via quantization without a hit to downstream success rate. Finally, we release model checkpoints, fine-tuning notebooks, and our PyTorch codebase with built-in support for training VLAs at scale on Open X-Embodiment datasets.

[LG-42] Investigating potential causes of Sepsis with Bayesian network structure learning

链接: https://arxiv.org/abs/2406.09207
作者: Bruno Petrungaro,Neville K. Kitson,Anthony C. Constantinou
关键词: global health issue, Sepsis, structure, Obstructive Pulmonary Disease, causal structure
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Sepsis is a life-threatening and serious global health issue. This study combines knowledge with available hospital data to investigate the potential causes of Sepsis that can be affected by policy decisions. We investigate the underlying causal structure of this problem by combining clinical expertise with score-based, constraint-based, and hybrid structure learning algorithms. A novel approach to model averaging and knowledge-based constraints was implemented to arrive at a consensus structure for causal inference. The structure learning process highlighted the importance of exploring data-driven approaches alongside clinical expertise. This includes discovering unexpected, although reasonable, relationships from a clinical perspective. Hypothetical interventions on Chronic Obstructive Pulmonary Disease, Alcohol dependence, and Diabetes suggest that the presence of any of these risk factors in patients increases the likelihood of Sepsis. This finding, alongside measuring the effect of these risk factors on Sepsis, has potential policy implications. Recognising the importance of prediction in improving Sepsis related health outcomes, the model built is also assessed in its ability to predict Sepsis. The predictions generated by the consensus model were assessed for their accuracy, sensitivity, and specificity. These three indicators all had results around 70%, and the AUC was 80%, which means the causal structure of the model is reasonably accurate given that the models were trained on data available for commissioning purposes only.

[LG-43] Self-Training for Sample-Efficient Active Learning for Text Classification with Pre-Trained Language Models

链接: https://arxiv.org/abs/2406.09206
作者: Christopher Schröder,Gerhard Heyer
关键词: iterative labeling process, small labeled subset, Active learning, labeled data, iterative labeling
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Active learning is an iterative labeling process that is used to obtain a small labeled subset, despite the absence of labeled data, thereby enabling to train a model for supervised tasks such as text classification. While active learning has made considerable progress in recent years due to improvements provided by pre-trained language models, there is untapped potential in the often neglected unlabeled portion of the data, although it is available in considerably larger quantities than the usually small set of labeled data. Here we investigate how self-training, a semi-supervised approach where a model is used to obtain pseudo-labels from the unlabeled data, can be used to improve the efficiency of active learning for text classification. Starting with an extensive reproduction of four previous self-training approaches, some of which are evaluated for the first time in the context of active learning or natural language processing, we devise HAST, a new and effective self-training strategy, which is evaluated on four text classification benchmarks, on which it outperforms the reproduced self-training approaches and reaches classification results comparable to previous experiments for three out of four datasets, using only 25% of the data.
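The self-training step the abstract investigates can be sketched generically: pseudo-label unlabeled samples whose predicted confidence clears a threshold, then add them to the labeled pool. The stub classifier and threshold are assumptions; HAST's actual strategy involves more than this.

```python
# Generic confidence-thresholded pseudo-labeling, the core loop of
# self-training approaches like those reproduced in the paper.

def pseudo_label(unlabeled, predict_proba, threshold=0.9):
    """Return (sample, label) pairs for confident predictions only."""
    selected = []
    for x in unlabeled:
        probs = predict_proba(x)
        label = max(range(len(probs)), key=probs.__getitem__)
        if probs[label] >= threshold:
            selected.append((x, label))
    return selected

# Stub classifier: confident on inputs > 1, uncertain otherwise.
def stub_proba(x):
    return [0.05, 0.95] if x > 1 else [0.55, 0.45]

pool = [-0.5, 0.2, 3.0, 7.0]
print(pseudo_label(pool, stub_proba))  # [(3.0, 1), (7.0, 1)]
```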

[LG-44] Adaptive Slot Attention: Object Discovery with Dynamic Slot Number

链接: https://arxiv.org/abs/2406.09196
作者: Ke Fan,Zechen Bai,Tianjun Xiao,Tong He,Max Horn,Yanwei Fu,Francesco Locatello,Zheng Zhang
关键词: low-level perceptual features, abstracting low-level perceptual, perceptual features, exceptional blend, blend of flexibility
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: CVPR 2024

点击查看摘要

Abstract:Object-centric learning (OCL) extracts the representation of objects with slots, offering an exceptional blend of flexibility and interpretability for abstracting low-level perceptual features. A widely adopted method within OCL is slot attention, which utilizes attention mechanisms to iteratively refine slot representations. However, a major drawback of most object-centric models, including slot attention, is their reliance on predefining the number of slots. This not only necessitates prior knowledge of the dataset but also overlooks the inherent variability in the number of objects present in each instance. To overcome this fundamental limitation, we present a novel complexity-aware object auto-encoder framework. Within this framework, we introduce an adaptive slot attention (AdaSlot) mechanism that dynamically determines the optimal number of slots based on the content of the data. This is achieved by proposing a discrete slot sampling module that is responsible for selecting an appropriate number of slots from a candidate list. Furthermore, we introduce a masked slot decoder that suppresses unselected slots during the decoding process. Our framework, tested extensively on object discovery tasks with various datasets, shows performance matching or exceeding top fixed-slot models. Moreover, our analysis substantiates that our method exhibits the capability to dynamically adapt the slot number according to each instance’s complexity, offering the potential for further exploration in slot attention research. Project will be available at this https URL

[LG-45] GuardAgent: Safeguard LLM Agents by a Guard Agent via Knowledge-Enabled Reasoning

链接: https://arxiv.org/abs/2406.09187
作者: Zhen Xiang,Linzhi Zheng,Yanjie Li,Junyuan Hong,Qinbin Li,Han Xie,Jiawei Zhang,Zidi Xiong,Chulin Xie,Carl Yang,Dawn Song,Bo Li
关键词: large language models, LLM, language models, numerous applications, raising new concerns
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The rapid advancement of large language models (LLMs) has catalyzed the deployment of LLM-powered agents across numerous applications, raising new concerns regarding their safety and trustworthiness. Existing methods for enhancing the safety of LLMs are not directly transferable to LLM-powered agents due to their diverse objectives and output modalities. In this paper, we propose GuardAgent, the first LLM agent as a guardrail to other LLM agents. Specifically, GuardAgent oversees a target LLM agent by checking whether its inputs/outputs satisfy a set of given guard requests defined by the users. GuardAgent comprises two steps: 1) creating a task plan by analyzing the provided guard requests, and 2) generating guardrail code based on the task plan and executing the code by calling APIs or using external engines. In both steps, an LLM is utilized as the core reasoning component, supplemented by in-context demonstrations retrieved from a memory module. Such knowledge-enabled reasoning allows GuardAgent to understand various textual guard requests and accurately “translate” them into executable code that provides reliable guardrails. Furthermore, GuardAgent is equipped with an extendable toolbox containing functions and APIs and requires no additional LLM training, which underscores its generalization capabilities and low operational overhead. Additionally, we propose two novel benchmarks: an EICU-AC benchmark for assessing privacy-related access control for healthcare agents and a Mind2Web-SC benchmark for safety evaluation for web agents. We show the effectiveness of GuardAgent on these two benchmarks with 98.7% and 90.0% accuracy in moderating invalid inputs and outputs for the two types of agents, respectively. We also show that GuardAgent is able to define novel functions in adaption to emergent LLM agents and guard requests, which underscores its strong generalization capabilities.

[LG-46] Detection-Rate-Emphasized Multi-objective Evolutionary Feature Selection for Network Intrusion Detection

链接: https://arxiv.org/abs/2406.09180
作者: Zi-Hang Cheng,Haopu Shang,Chao Qian
关键词: machine learning techniques, Network intrusion detection, build intrusion detection, cyber security, intrusion detection
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Network intrusion detection is one of the most important issues in the field of cyber security, and various machine learning techniques have been applied to build intrusion detection systems. However, since the number of features to describe the network connections is often large, where some features are redundant or noisy, feature selection is necessary in such scenarios, which can both improve the efficiency and accuracy. Recently, some researchers focus on using multi-objective evolutionary algorithms (MOEAs) to select features. But usually, they only consider the number of features and classification accuracy as the objectives, resulting in unsatisfactory performance on a critical metric, detection rate. This will lead to the missing of many real attacks and bring huge losses to the network system. In this paper, we propose DR-MOFS to model the feature selection problem in network intrusion detection as a three-objective optimization problem, where the number of features, accuracy and detection rate are optimized simultaneously, and use MOEAs to solve it. Experiments on two popular network intrusion detection datasets NSL-KDD and UNSW-NB15 show that in most cases the proposed method can outperform previous methods, i.e., lead to fewer features, higher accuracy and detection rate.
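The three objectives DR-MOFS optimizes can be sketched as a single fitness tuple for an MOEA to minimize. This is an illustrative formulation, not the paper's code; the function names and the label convention (attacks labeled 1) are assumptions:

```python
import numpy as np

def detection_rate(y_true, y_pred):
    """Recall on the attack class (label 1): fraction of real attacks flagged."""
    attacks = y_true == 1
    return float(np.mean(y_pred[attacks] == 1))

def fitness(mask, y_true, y_pred):
    """Three objectives to minimize jointly, DR-MOFS-style:
    feature count, error rate, and missed-attack rate."""
    return (int(mask.sum()),                        # number of selected features
            float(np.mean(y_true != y_pred)),       # 1 - accuracy
            1.0 - detection_rate(y_true, y_pred))   # 1 - detection rate
```

An MOEA such as NSGA-II would evolve binary feature masks and retain the Pareto front of these tuples, so that detection rate is traded off explicitly rather than ignored.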

[LG-47] Unlearning with Control: Assessing Real-world Utility for Large Language Model Unlearning

链接: https://arxiv.org/abs/2406.09179
作者: Qizhou Wang,Bo Han,Puning Yang,Jianing Zhu,Tongliang Liu,Masashi Sugiyama
关键词: eradicating undesirable data, usual model functioning, preserving usual model, large language models, undesirable data behaviors
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The compelling goal of eradicating undesirable data behaviors, while preserving usual model functioning, underscores the significance of machine unlearning within the domain of large language models (LLMs). Recent research has begun to approach LLM unlearning via gradient ascent (GA) – increasing the prediction risk for those training strings targeted to be unlearned, thereby erasing their parameterized responses. Despite their simplicity and efficiency, we suggest that GA-based methods face the propensity towards excessive unlearning, resulting in various undesirable model behaviors, such as catastrophic forgetting, that diminish their practical utility. In this paper, we suggest a set of metrics that can capture multiple facets of real-world utility and propose several controlling methods that can regulate the extent of excessive unlearning. Accordingly, we suggest a general framework to better reflect the practical efficacy of various unlearning methods – we begin by controlling the unlearning procedures/unlearned models such that no excessive unlearning occurs and follow by the evaluation for unlearning efficacy. Our experimental analysis on established benchmarks revealed that GA-based methods are far from perfect in practice, as strong unlearning is at the high cost of hindering the model utility. We conclude that there is still a long way towards practical and effective LLM unlearning, and more efforts are required in this field.
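The gradient-ascent (GA) unlearning idea can be shown on a toy model. Below, a logistic-regression stand-in replaces the LLM; all names and the learning rate are illustrative, and this is only the uncontrolled GA step the paper warns can over-unlearn:

```python
import numpy as np

def nll(w, x, y):
    """Cross-entropy loss of a logistic model on one example."""
    p = 1.0 / (1.0 + np.exp(-x @ w))
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def ga_unlearn_step(w, x, y, lr=0.1):
    """One gradient-ascent step on a forget example: move the weights to
    INCREASE the loss, eroding the memorized response (the reverse of SGD)."""
    p = 1.0 / (1.0 + np.exp(-x @ w))
    grad = (p - y) * x            # d(loss)/dw for cross-entropy
    return w + lr * grad          # ascend instead of descend
```

Repeating this step raises the prediction risk on the targeted example; the paper's point is that without regulating how far the ascent goes, the same updates also damage behavior on everything else.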

[LG-48] Potion: Towards Poison Unlearning

链接: https://arxiv.org/abs/2406.09173
作者: Stefan Schoepf,Jack Foster,Alexandra Brintrup
关键词: pose significant risks, machine learning systems, Adversarial attacks, introducing poison triggers, poison
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Adversarial attacks by malicious actors on machine learning systems, such as introducing poison triggers into training datasets, pose significant risks. The challenge in resolving such an attack arises in practice when only a subset of the poisoned data can be identified. This necessitates the development of methods to remove, i.e. unlearn, poison triggers from already trained models with only a subset of the poison data available. The requirements for this task significantly deviate from privacy-focused unlearning where all of the data to be forgotten by the model is known. Previous work has shown that the undiscovered poisoned samples lead to a failure of established unlearning methods, with only one method, Selective Synaptic Dampening (SSD), showing limited success. Even full retraining, after the removal of the identified poison, cannot address this challenge as the undiscovered poison samples lead to a reintroduction of the poison trigger in the model. Our work addresses two key challenges to advance the state of the art in poison unlearning. First, we introduce a novel outlier-resistant method, based on SSD, that significantly improves model protection and unlearning performance. Second, we introduce Poison Trigger Neutralisation (PTN) search, a fast, parallelisable, hyperparameter search that utilises the characteristic “unlearning versus model protection” trade-off to find suitable hyperparameters in settings where the forget set size is unknown and the retain set is contaminated. We benchmark our contributions using ResNet-9 on CIFAR10 and WideResNet-28x10 on CIFAR100. Experimental results show that our method heals 93.72% of poison compared to SSD with 83.41% and full retraining with 40.68%. We achieve this while also lowering the average model accuracy drop caused by unlearning from 5.68% (SSD) to 1.41% (ours).

[LG-49] Towards Multilingual Audio-Visual Question Answering

链接: https://arxiv.org/abs/2406.09156
作者: Orchid Chetia Phukan,Priyabrata Mallick,Swarup Ranjan Behera,Aalekhya Satya Narayani,Arun Balaji Buduru,Rajesh Sharma
关键词: Audio-Visual Question Answering, extending Audio-Visual Question, Question Answering, AVQA, extending Audio-Visual
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注: Accepted to Interspeech 2024

点击查看摘要

Abstract:In this paper, we work towards extending Audio-Visual Question Answering (AVQA) to multilingual settings. Existing AVQA research has predominantly revolved around English, and replicating it to address AVQA in other languages requires a substantial allocation of resources. As a scalable solution, we leverage machine translation and present two multilingual AVQA datasets for eight languages created from existing benchmark AVQA datasets. This avoids the extra human annotation effort of collecting questions and answers manually. To this end, we propose the MERA framework, leveraging state-of-the-art (SOTA) video, audio, and textual foundation models for AVQA in multiple languages. We introduce a suite of models, namely MERA-L, MERA-C, and MERA-T, with varied model architectures to benchmark the proposed datasets. We believe our work will open new research directions and act as a reference benchmark for future works in multilingual AVQA.

[LG-50] DefAn: Definitive Answer Dataset for LLMs Hallucination Evaluation

链接: https://arxiv.org/abs/2406.09155
作者: A B M Ashikur Rahman,Saeed Anwar,Muhammad Usman,Ajmal Mian
关键词: Large Language Models, Large Language, Language Models, demonstrated remarkable capabilities, daily life applications
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities, revolutionizing the integration of AI in daily life applications. However, they are prone to hallucinations, generating claims that contradict established facts, deviating from prompts, and producing inconsistent responses when the same prompt is presented multiple times. Addressing these issues is challenging due to the lack of comprehensive and easily assessable benchmark datasets. Most existing datasets are small and rely on multiple-choice questions, which are inadequate for evaluating the generative prowess of LLMs. To measure hallucination in LLMs, this paper introduces a comprehensive benchmark dataset comprising over 75,000 prompts across eight domains. These prompts are designed to elicit definitive, concise, and informative answers. The dataset is divided into two segments: one publicly available for testing and assessing LLM performance and a hidden segment for benchmarking various LLMs. In our experiments, we tested six LLMs (GPT-3.5, LLama 2, LLama 3, Gemini, Mixtral, and Zephyr), revealing that overall factual hallucination ranges from 59% to 82% on the public dataset and 57% to 76% in the hidden benchmark. Prompt misalignment hallucination ranges from 6% to 95% in the public dataset and 17% to 94% in the hidden counterpart. Average consistency ranges from 21% to 61% and 22% to 63%, respectively. Domain-wise analysis shows that LLM performance significantly deteriorates when asked for specific numeric information while performing moderately with person, location, and date queries. Our dataset demonstrates its efficacy and serves as a comprehensive benchmark for LLM performance evaluation. Our dataset and LLM responses are available at this https URL.

[LG-51] EncCluster: Scalable Functional Encryption in Federated Learning through Weight Clustering and Probabilistic Filters

链接: https://arxiv.org/abs/2406.09152
作者: Vasileios Tsouvalas,Samaneh Mohammadi,Ali Balador,Tanir Ozcelebi,Francesco Flammini,Nirvana Meratnia
关键词: Federated Learning, communicating solely local, enables model training, solely local model, local model updates
类目: Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: 21 pages, 4 figures

点击查看摘要

Abstract:Federated Learning (FL) enables model training across decentralized devices by communicating solely local model updates to an aggregation server. Although such limited data sharing makes FL more secure than centralized approaches, FL remains vulnerable to inference attacks during model update transmissions. Existing secure aggregation approaches rely on differential privacy or cryptographic schemes like Functional Encryption (FE) to safeguard individual client data. However, such strategies can reduce performance or introduce unacceptable computational and communication overheads on clients running on edge devices with limited resources. In this work, we present EncCluster, a novel method that integrates model compression through weight clustering with recent decentralized FE and privacy-enhancing data encoding using probabilistic filters to deliver strong privacy guarantees in FL without affecting model performance or adding unnecessary burdens to clients. We performed a comprehensive evaluation, spanning various datasets and architectures, to demonstrate EncCluster’s scalability across encryption levels. Our findings reveal that EncCluster significantly reduces communication costs - below even conventional FedAvg - and accelerates encryption by more than four times over all baselines; at the same time, it maintains high model accuracy and enhanced privacy assurances.
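The weight-clustering compression step the abstract describes can be sketched as follows; the centroid count and names are illustrative, and the functional-encryption and probabilistic-filter components are omitted:

```python
import numpy as np

def cluster_weights(w, centroids):
    """Replace each weight by its nearest centroid. A client can then
    transmit only the small centroid codebook plus log2(k)-bit indices
    instead of full-precision model updates."""
    idx = np.argmin(np.abs(w[:, None] - centroids[None, :]), axis=1)
    return idx, centroids[idx]
```

Because only indices and the codebook travel over the network, the payload shrinks sharply, which is the lever EncCluster combines with encryption to keep overheads low.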

[LG-52] Weakly-supervised anomaly detection for multimodal data distributions

链接: https://arxiv.org/abs/2406.09147
作者: Xu Tan,Junqi Chen,Sylwan Rahardja,Jiawei Yang,Susanto Rahardja
关键词: attracts increasing attention, Weakly-supervised anomaly detection, outperform existing unsupervised, existing weakly-supervised anomaly, existing unsupervised methods
类目: Machine Learning (cs.LG)
*备注: 5 pages, 3 figures. Accepted by 2024 IEEE International Conference on Signal Processing, Communications and Computing (ICSPCC)

点击查看摘要

Abstract:Weakly-supervised anomaly detection can outperform existing unsupervised methods with the assistance of a very small number of labeled anomalies, which attracts increasing attention from researchers. However, existing weakly-supervised anomaly detection methods are limited as they do not factor in the multimodal nature of the real-world data distribution. To mitigate this, we propose the Weakly-supervised Variational-mixture-model-based Anomaly Detector (WVAD). WVAD excels on multimodal datasets. It consists of two components: a deep variational mixture model, and an anomaly score estimator. The deep variational mixture model captures various features of the data from different clusters, then these features are delivered to the anomaly score estimator to assess the anomaly levels. Experimental results on three real-world datasets demonstrate WVAD’s superiority.

[LG-53] Generative AI-based Prompt Evolution Engineering Design Optimization With Vision-Language Model

链接: https://arxiv.org/abs/2406.09143
作者: Melvin Wong,Thiago Rios,Stefan Menzel,Yew Soon Ong
关键词: Engineering design optimization, performance evaluation method, Engineering design, vision-language model, design optimization requires
类目: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: Accepted and to be published in IEEE Congress on Evolutionary Computation 2024

点击查看摘要

Abstract:Engineering design optimization requires an efficient combination of a 3D shape representation, an optimization algorithm, and a design performance evaluation method, which is often computationally expensive. We present a prompt evolution design optimization (PEDO) framework contextualized in a vehicle design scenario that leverages a vision-language model for penalizing impractical car designs synthesized by a generative model. The backbone of our framework is an evolutionary strategy coupled with an optimization objective function that comprises a physics-based solver and a vision-language model for practical or functional guidance in the generated car designs. In the prompt evolutionary search, the optimizer iteratively generates a population of text prompts, which embed user specifications on the aerodynamic performance and visual preferences of the 3D car designs. Then, in addition to the computational fluid dynamics simulations, the pre-trained vision-language model is used to penalize impractical designs and, thus, foster the evolutionary algorithm to seek more viable designs. Our investigations on a car design optimization problem show a wide spread of potential car designs generated at the early phase of the search, which indicates a good diversity of designs in the initial populations, and an increase of over 20% in the probability of generating practical designs compared to a baseline framework without using a vision-language model. Visual inspection of the designs against the performance results demonstrates prompt evolution as a very promising paradigm for finding novel designs with good optimization performance while providing ease of use in specifying design specifications and preferences via a natural language interface.

[LG-54] Optimal Control of Agent-Based Dynamics under Deep Galerkin Feedback Laws

链接: https://arxiv.org/abs/2406.09141
作者: Frederik Kelbel
关键词: adequately address high-dimensional, Deep Neural Networks, address high-dimensional control, Deep Galerkin Method, Neural Networks promises
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Ever since the concepts of dynamic programming were introduced, one of the most difficult challenges has been to adequately address high-dimensional control problems. With growing dimensionality, the utilisation of Deep Neural Networks promises to circumvent the issue of an otherwise exponentially increasing complexity. The paper specifically investigates the sampling issues the Deep Galerkin Method is subjected to. It proposes a drift relaxation-based sampling approach to alleviate the symptoms of high-variance policy approximations. This is validated on mean-field control problems; namely, the variations of the opinion dynamics presented by the Sznajd and the Hegselmann-Krause model. The resulting policies induce a significant cost reduction over manually optimised control functions and show improvements on the Linear-Quadratic Regulator problem over the Deep FBSDE approach.

[LG-55] Dynamic Correlation Clustering in Sublinear Update Time

链接: https://arxiv.org/abs/2406.09137
作者: Vincent Cohen-Addad,Silvio Lattanzi,Andreas Maggiori,Nikos Parotsidis
关键词: study the classic, classic problem, problem of correlation, correlation clustering, clustering in dynamic
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注: ICML’24 (spotlight)

点击查看摘要

Abstract:We study the classic problem of correlation clustering in dynamic node streams. In this setting, nodes are either added or randomly deleted over time, and each node pair is connected by a positive or negative edge. The objective is to continuously find a partition which minimizes the sum of positive edges crossing clusters and negative edges within clusters. We present an algorithm that maintains an O(1)-approximation with O(polylog n) amortized update time. Prior to our work, Behnezhad, Charikar, Ma, and L. Tan achieved a 5-approximation with O(1) expected update time in edge streams, which translates in node streams to an O(D) update time, where D is the maximum possible degree. Finally, we complement our theoretical analysis with experiments on real-world data.
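The disagreement objective being minimized can be written down directly; this is the textbook correlation-clustering cost, not the paper's dynamic algorithm:

```python
def disagreements(edges, part):
    """Correlation-clustering cost: '+' edges cut between clusters
    plus '-' edges kept inside a cluster.
    edges: iterable of (u, v, sign); part: node -> cluster id."""
    cost = 0
    for u, v, sign in edges:
        same = part[u] == part[v]
        if (sign == '+' and not same) or (sign == '-' and same):
            cost += 1
    return cost
```

The dynamic-stream challenge is maintaining a partition whose cost stays within a constant factor of the optimum as nodes arrive and depart, without recomputing from scratch.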

[LG-56] Chain of Preference Optimization: Improving Chain-of-Thought Reasoning in LLMs

链接: https://arxiv.org/abs/2406.09136
作者: Xuan Zhang,Chao Du,Tianyu Pang,Qian Liu,Wei Gao,Min Lin
关键词: large language models, enabled large language, generate explicit logical, explicit logical reasoning, language models
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The recent development of chain-of-thought (CoT) decoding has enabled large language models (LLMs) to generate explicit logical reasoning paths for complex problem-solving. However, research indicates that these paths are not always deliberate and optimal. The tree-of-thought (ToT) method employs tree-searching to extensively explore the reasoning space and find better reasoning paths that CoT decoding might overlook. This deliberation, however, comes at the cost of significantly increased inference complexity. In this work, we demonstrate that fine-tuning LLMs leveraging the search tree constructed by ToT allows CoT to achieve similar or better performance, thereby avoiding the substantial inference burden. This is achieved through Chain of Preference Optimization (CPO), where LLMs are fine-tuned to align each step of the CoT reasoning paths with those of ToT using the inherent preference information in the tree-search process. Extensive experimental results show that CPO significantly improves LLM performance in solving a variety of complex problems, including question answering, fact verification, and arithmetic reasoning, demonstrating its effectiveness. Our code is available at this https URL.

[LG-57] Jacobian-Enhanced Neural Networks

链接: https://arxiv.org/abs/2406.09132
作者: Steven H. Berguin
关键词: Jacobian-Enhanced Neural Networks, connected multi-layer perceptrons, densely connected multi-layer, Neural Networks, standard neural networks
类目: Machine Learning (cs.LG)
*备注: 10 pages, 5 figures

点击查看摘要

Abstract:Jacobian-Enhanced Neural Networks (JENN) are densely connected multi-layer perceptrons, whose training process is modified to predict partial derivatives accurately. Their main benefit is better accuracy with fewer training points compared to standard neural networks. These attributes are particularly desirable in the field of computer-aided design, where there is often the need to replace computationally expensive, physics-based models with fast running approximations, known as surrogate models or meta-models. Since a surrogate emulates the original model accurately in near-real time, it yields a speed benefit that can be used to carry out orders of magnitude more function calls quickly. However, in the special case of gradient-enhanced methods, there is the additional value proposition that partial derivatives are accurate, which is a critical property for one important use-case: surrogate-based optimization. This work derives the complete theory and exemplifies its superiority over standard neural nets for surrogate-based optimization.
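A gradient-enhanced training objective of the kind described can be sketched as a value term plus a Jacobian term; the weighting `lam` and the equal-weight mean are assumptions for illustration, not the paper's exact formulation:

```python
import numpy as np

def jenn_style_loss(y_pred, y_true, J_pred, J_true, lam=1.0):
    """Squared error on predicted values plus a penalty on
    partial-derivative (Jacobian) error, so the surrogate fits
    both responses and gradients of the expensive model."""
    value_term = np.mean((y_pred - y_true) ** 2)
    grad_term = np.mean((J_pred - J_true) ** 2)
    return float(value_term + lam * grad_term)
```

Training against this loss is what makes the surrogate's derivatives trustworthy enough to drive gradient-based surrogate optimization.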

[LG-58] OLGA: One-cLass Graph Autoencoder

链接: https://arxiv.org/abs/2406.09131
作者: M. P. S. Gôlo,J. G. B. M. Junior,D. F. Silva,R. M. Marcacini
关键词: OCL, OLGA, set of techniques, techniques applied, applied when real-world
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:One-class learning (OCL) comprises a set of techniques applied when real-world problems have a single class of interest. The usual procedure for OCL is learning a hypersphere that comprises instances of this class and, ideally, repels unseen instances from any other classes. Besides, several OCL algorithms for graphs have been proposed since graph representation learning has succeeded in various fields. These methods may use a two-step strategy, initially representing the graph and, in a second step, classifying its nodes. On the other hand, end-to-end methods learn the node representations while classifying the nodes in one learning process. We highlight three main gaps in the literature on OCL for graphs: (i) non-customized representations for OCL; (ii) the lack of constraints on hypersphere parameters learning; and (iii) the methods’ lack of interpretability and visualization. We propose One-cLass Graph Autoencoder (OLGA). OLGA is end-to-end and learns the representations for the graph nodes while encapsulating the interest instances by combining two loss functions. We propose a new hypersphere loss function to encapsulate the interest instances. OLGA combines this new hypersphere loss with the graph autoencoder reconstruction loss to improve model learning. OLGA achieved state-of-the-art results and outperformed six other methods with a statistically significant difference from five methods. Moreover, OLGA learns low-dimensional representations maintaining the classification performance with an interpretable model representation learning and results.
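The hypersphere idea can be sketched as an SVDD-style pull toward a shared center; OLGA's actual objective combines a new hypersphere term with the autoencoder reconstruction loss, so this shows only the generic ingredient:

```python
import numpy as np

def hypersphere_loss(z, center):
    """Mean squared distance of interest-class embeddings to a common
    center; minimizing it encapsulates the class in a tight hypersphere."""
    return float(np.mean(np.sum((z - center) ** 2, axis=1)))
```

At test time, a node whose embedding falls outside the learned sphere radius would be scored as not belonging to the class of interest.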

[LG-59] Time-Series Forecasting for Out-of-Distribution Generalization Using Invariant Learning

链接: https://arxiv.org/abs/2406.09130
作者: Haoxin Liu,Harshavardhan Kamarthi,Lingkai Kong,Zhiyuan Zhao,Chao Zhang,B. Aditya Prakash
关键词: finds broad applications, TSF, invariant learning, finds broad, real-world scenarios
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 14 pages

点击查看摘要

Abstract:Time-series forecasting (TSF) finds broad applications in real-world scenarios. Due to the dynamic nature of time-series data, it is crucial to equip TSF models with out-of-distribution (OOD) generalization abilities, as historical training data and future test data can have different distributions. In this paper, we aim to alleviate the inherent OOD problem in TSF via invariant learning. We identify fundamental challenges of invariant learning for TSF. First, the target variables in TSF may not be sufficiently determined by the input due to unobserved core variables in TSF, breaking the conventional assumption of invariant learning. Second, time-series datasets lack adequate environment labels, while existing environmental inference methods are not suitable for TSF. To address these challenges, we propose FOIL, a model-agnostic framework that enables timeseries Forecasting for Out-of-distribution generalization via Invariant Learning. FOIL employs a novel surrogate loss to mitigate the impact of unobserved variables. Further, FOIL implements a joint optimization by alternately inferring environments effectively with a multi-head network while preserving the temporal adjacency structure, and learning invariant representations across inferred environments for OOD generalized TSF. We demonstrate that the proposed FOIL significantly improves the performance of various TSF models, achieving gains of up to 85%.

[LG-60] Injective Flows for parametric hypersurfaces

链接: https://arxiv.org/abs/2406.09116
作者: Marcello Massimo Negri,Jonathan Aellen,Volker Roth
关键词: Normalizing Flows, Jacobian determinant, powerful and efficient, injective flows, Jacobian
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Normalizing Flows (NFs) are powerful and efficient models for density estimation. When modeling densities on manifolds, NFs can be generalized to injective flows but the Jacobian determinant becomes computationally prohibitive. Current approaches either consider bounds on the log-likelihood or rely on some approximations of the Jacobian determinant. In contrast, we propose injective flows for parametric hypersurfaces and show that for such manifolds we can compute the Jacobian determinant exactly and efficiently, with the same cost as NFs. Furthermore, we show that for the subclass of star-like manifolds we can extend the proposed framework to always allow for a Cartesian representation of the density. We showcase the relevance of modeling densities on hypersurfaces in two settings. Firstly, we introduce a novel Objective Bayesian approach to penalized likelihood models by interpreting level-sets of the penalty as star-like manifolds. Secondly, we consider Bayesian mixture models and introduce a general method for variational inference by defining the posterior of mixture weights on the probability simplex.

[LG-61] Large-Scale Evaluation of Open-Set Image Classification Techniques

链接: https://arxiv.org/abs/2406.09112
作者: Halil Bisgin,Andres Palechor,Mike Suter,Manuel Günther
关键词: correctly assign labels, Maximum Logit Scores, correctly assign, Maximum SoftMax Scores, samples
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The goal for classification is to correctly assign labels to unseen samples. However, most methods misclassify samples with unseen labels and assign them to one of the known classes. Open-Set Classification (OSC) algorithms aim to maximize both closed and open-set recognition capabilities. Recent studies showed the utility of such algorithms on small-scale data sets, but limited experimentation makes it difficult to assess their performances in real-world problems. Here, we provide a comprehensive comparison of various OSC algorithms, including training-based (SoftMax, Garbage, EOS) and post-processing methods (Maximum SoftMax Scores, Maximum Logit Scores, OpenMax, EVM, PROSER), where the latter are applied on features from the former. We perform our evaluation on three large-scale protocols that mimic real-world challenges, where we train on known and negative open-set samples, and test on known and unknown instances. Our results show that EOS helps to improve performance of almost all post-processing algorithms. Particularly, OpenMax and PROSER are able to exploit better-trained networks, demonstrating the utility of hybrid models. However, while most algorithms work well on negative test samples – samples of open-set classes seen during training – they tend to perform poorly when tested on samples of previously unseen unknown classes, especially in challenging conditions.

[LG-62] INS-MMBench: A Comprehensive Benchmark for Evaluating LVLMs Performance in Insurance

链接: https://arxiv.org/abs/2406.09105
作者: Chenwei Lin,Hanjia Lyu,Xian Xu,Jiebo Luo
关键词: Large Vision-Language Models, Large Vision-Language, shown promising potential, insurance, insurance domain
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Vision-Language Models (LVLMs) have demonstrated outstanding performance in various general multimodal applications such as image recognition and visual reasoning, and have also shown promising potential in specialized domains. However, the application potential of LVLMs in the insurance domain, characterized by rich application scenarios and abundant multimodal data, has not been effectively explored. There is no systematic review of multimodal tasks in the insurance domain, nor a benchmark specifically designed to evaluate the capabilities of LVLMs in insurance. This gap hinders the development of LVLMs within the insurance domain. In this paper, we systematically review and distill multimodal tasks for four representative types of insurance: auto insurance, property insurance, health insurance, and agricultural insurance. We propose INS-MMBench, the first comprehensive LVLMs benchmark tailored for the insurance domain. INS-MMBench comprises a total of 2.2K thoroughly designed multiple-choice questions, covering 12 meta-tasks and 22 fundamental tasks. Furthermore, we evaluate multiple representative LVLMs, including closed-source models such as GPT-4o and open-source models like BLIP-2. This evaluation not only validates the effectiveness of our benchmark but also provides an in-depth performance analysis of current LVLMs on various multimodal tasks in the insurance domain. We hope that INS-MMBench will facilitate the further application of LVLMs in the insurance domain and inspire interdisciplinary development. Our dataset and evaluation code are available at this https URL.

[LG-63] DiffPoGAN: Diffusion Policies with Generative Adversarial Networks for Offline Reinforcement Learning

链接: https://arxiv.org/abs/2406.09089
作者: Xuemin Hu,Shen Li,Yingfen Xu,Bo Tang,Long Chen
关键词: extrapolation error issue, learn optimal policies, generative adversarial networks, learn optimal, extrapolation error
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Offline reinforcement learning (RL) can learn optimal policies from pre-collected offline datasets without interacting with the environment, but the sampled actions of the agent cannot often cover the action distribution under a given state, resulting in the extrapolation error issue. Recent works address this issue by employing generative adversarial networks (GANs). However, these methods often suffer from insufficient constraints on policy exploration and inaccurate representation of behavior policies. Moreover, the generator in GANs fails in fooling the discriminator while maximizing the expected returns of a policy. Inspired by the diffusion, a generative model with powerful feature expressiveness, we propose a new offline RL method named Diffusion Policies with Generative Adversarial Networks (DiffPoGAN). In this approach, the diffusion serves as the policy generator to generate diverse distributions of actions, and a regularization method based on maximum likelihood estimation (MLE) is developed to generate data that approximate the distribution of behavior policies. Besides, we introduce an additional regularization term based on the discriminator output to effectively constrain policy exploration for policy improvement. Comprehensive experiments are conducted on the datasets for deep data-driven reinforcement learning (D4RL), and experimental results show that DiffPoGAN outperforms state-of-the-art methods in offline RL.

[LG-64] Latent Assistance Networks: Rediscovering Hyperbolic Tangents in RL

链接: https://arxiv.org/abs/2406.09079
作者: Jacob E. Kooi,Mark Hoogendoorn,Vincent François-Lavet
关键词: Activation functions, key components, effective rank, continuously differentiable activations, dead neurons
类目: Machine Learning (cs.LG)
*备注: 22 pages, 17 figures, 4 tables

点击查看摘要

Abstract:Activation functions are one of the key components of a neural network. The most commonly used activation functions can be classed into the category of continuously differentiable (e.g. tanh) and linear-unit functions (e.g. ReLU), both having their own strengths and drawbacks with respect to downstream performance and representation capacity through learning (e.g. measured by the number of dead neurons and the effective rank). In reinforcement learning, the performance of continuously differentiable activations often falls short as compared to linear-unit functions. From the perspective of the activations in the last hidden layer, this paper provides insights regarding this sub-optimality and explores how activation functions influence the occurrence of dead neurons and the magnitude of the effective rank. Additionally, a novel neural architecture is proposed that leverages the product of independent activation values. In the Atari domain, we show faster learning, a reduction in dead neurons and increased effective rank.
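The two diagnostics the paper studies, dead neurons and effective rank, can be computed from a batch of last-hidden-layer activations. The effective-rank definition below is the usual entropy-based one (Roy and Vetterli); treating it as the paper's exact metric is an assumption:

```python
import numpy as np

def dead_fraction(acts):
    """Fraction of units that never activate (<= 0) over the whole batch."""
    return float(np.mean(np.all(acts <= 0, axis=0)))

def effective_rank(acts):
    """exp(entropy) of the normalized singular values of the
    batch x units activation matrix; low values signal collapse."""
    s = np.linalg.svd(acts, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]
    return float(np.exp(-(p * np.log(p)).sum()))
```

Monitoring both quantities during training is how one would check the claimed reduction in dead neurons and increase in effective rank for the proposed product-of-activations architecture.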

[LG-65] Are we making progress in unlearning? Findings from the first NeurIPS unlearning competition

链接: https://arxiv.org/abs/2406.09073
作者: Eleni Triantafillou,Peter Kairouz,Fabian Pedregosa,Jamie Hayes,Meghdad Kurmanji,Kairan Zhao,Vincent Dumoulin,Julio Jacques Junior,Ioannis Mitliagkas,Jun Wan,Lisheng Sun Hosoya,Sergio Escalera,Gintare Karolina Dziugaite,Peter Triantafillou,Isabelle Guyon
关键词: robust evaluation methodologies, sought to stimulate, evaluation, initiate discussions, NeurIPS competition
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present the findings of the first NeurIPS competition on unlearning, which sought to stimulate the development of novel algorithms and initiate discussions on formal and robust evaluation methodologies. The competition was highly successful: nearly 1,200 teams from across the world participated, and a wealth of novel, imaginative solutions with different characteristics were contributed. In this paper, we analyze top solutions and delve into discussions on benchmarking unlearning, which itself is a research problem. The evaluation methodology we developed for the competition measures forgetting quality according to a formal notion of unlearning, while incorporating model utility for a holistic evaluation. We analyze the effectiveness of different instantiations of this evaluation framework vis-a-vis the associated compute cost, and discuss implications for standardizing evaluation. We find that the ranking of leading methods remains stable under several variations of this framework, pointing to avenues for reducing the cost of evaluation. Overall, our findings indicate progress in unlearning, with top-performing competition entries surpassing existing algorithms under our evaluation framework. We analyze trade-offs made by different algorithms and strengths or weaknesses in terms of generalizability to new datasets, paving the way for advancing both benchmarking and algorithm development in this important area.

[LG-66] FlamePINN-1D: Physics-informed neural networks to solve forward and inverse problems of 1D laminar flames

链接: https://arxiv.org/abs/2406.09071
作者: Jiahao Wu,Su Zhang,Yuxin Wu,Guihua Zhang,Xin Li,Hai Zhang
关键词: necessitate distinct methods, inverse problems, critically needed, studies and applications, applications that necessitate
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Given the existence of various forward and inverse problems in combustion studies and applications that necessitate distinct methods for resolution, a framework to solve them in a unified way is critically needed. A promising approach is the integration of machine learning methods with governing equations of combustion systems, which exhibits superior generality and few-shot learning ability compared to purely data-driven methods. In this work, the FlamePINN-1D framework is proposed to solve the forward and inverse problems of 1D laminar flames based on physics-informed neural networks. Three cases with increasing complexity have been tested: Case 1 involves freely-propagating premixed (FPP) flames with simplified physical models, while Cases 2 and 3 involve FPP and counterflow premixed (CFP) flames with detailed models, respectively. For forward problems, FlamePINN-1D aims to solve the flame fields and infer the unknown eigenvalues (such as laminar flame speeds) under the constraints of governing equations and boundary conditions. For inverse problems, FlamePINN-1D aims to reconstruct the continuous fields and infer the unknown parameters (such as transport and chemical kinetics parameters) from noisy sparse observations of the flame. Our results strongly validate these capabilities of FlamePINN-1D across various flames and working conditions. Compared to traditional methods, FlamePINN-1D is differentiable and mesh-free, exhibits no discretization errors, and is easier to implement for inverse problems. The inverse problem results also indicate the possibility of optimizing chemical mechanisms from measurements of laboratory 1D flames. Furthermore, some proposed strategies, such as hard constraints and thin-layer normalization, are proven to be essential for the robust learning of FlamePINN-1D. The code for this paper is partially available at this https URL.
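
One of the strategies the abstract highlights, hard constraints, can be illustrated with a minimal sketch: the Dirichlet boundary conditions are built into the trial solution so they hold exactly regardless of the network's output, and the optimizer only fits the PDE residual. The trial-function form and names below are illustrative assumptions, not FlamePINN-1D's actual code:

```python
import numpy as np

def trial_solution(x, u_left, u_right, raw_net):
    """Hard-constraint trial solution on [0, 1]: the linear interpolant
    hits both boundary values, and the x*(1-x) factor zeroes the network's
    contribution exactly at the boundaries."""
    return (1.0 - x) * u_left + x * u_right + x * (1.0 - x) * raw_net(x)

# Stand-in for an (untrained) neural network -- hypothetical placeholder.
raw_net = lambda x: np.sin(3.0 * x)

x = np.linspace(0.0, 1.0, 101)
u = trial_solution(x, u_left=300.0, u_right=1800.0, raw_net=raw_net)
```

Whatever `raw_net` outputs during training, `u(0)` and `u(1)` match the prescribed boundary values exactly, which is why such constraints aid robust learning.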

[LG-67] EquiPrompt: Debiasing Diffusion Models via Iterative Bootstrapping in Chain of Thoughts

链接: https://arxiv.org/abs/2406.09070
作者: Zahraa Al Sahili,Ioannis Patras,Matthew Purver
关键词: training datasets poses, datasets poses significant, significant ethical challenges, poses significant ethical, generative models
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In the domain of text-to-image generative models, the inadvertent propagation of biases inherent in training datasets poses significant ethical challenges, particularly in the generation of socially sensitive content. This paper introduces EquiPrompt, a novel method employing Chain of Thought (CoT) reasoning to reduce biases in text-to-image generative models. EquiPrompt uses iterative bootstrapping and bias-aware exemplar selection to balance creativity and ethical responsibility. It integrates iterative reasoning refinement with controlled evaluation techniques, addressing zero-shot CoT issues in sensitive contexts. Experiments on several generation tasks show EquiPrompt effectively lowers bias while maintaining generative quality, advancing ethical AI and socially responsible creative processes. Code will be publicly available.

[LG-68] On the Robustness of Global Feature Effect Explanations

链接: https://arxiv.org/abs/2406.09069
作者: Hubert Baniecki,Giuseppe Casalicchio,Bernd Bischl,Przemyslaw Biecek
关键词: global post-hoc explanations, predictive models trained, global post-hoc, post-hoc explanations, explanations for predictive
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted at ECML PKDD 2024

点击查看摘要

Abstract:We study the robustness of global post-hoc explanations for predictive models trained on tabular data. Effects of predictor features in black-box supervised learning are an essential diagnostic tool for model debugging and scientific discovery in applied sciences. However, how vulnerable they are to data and model perturbations remains an open research question. We introduce several theoretical bounds for evaluating the robustness of partial dependence plots and accumulated local effects. Our experimental results with synthetic and real-world datasets quantify the gap between the best and worst-case scenarios of (mis)interpreting machine learning predictions globally.
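
A partial dependence curve, one of the two explanation types whose robustness the paper bounds, is computed by clamping one feature to each grid value across the whole dataset and averaging the model's predictions. A minimal sketch with a generic model function (not the authors' code):

```python
import numpy as np

def partial_dependence(model, X, feature, grid):
    """1D partial dependence: for each grid value v, set column `feature`
    to v for every row and average the predictions."""
    pd = []
    for v in grid:
        Xv = X.copy()
        Xv[:, feature] = v
        pd.append(model(Xv).mean())
    return np.array(pd)

# Toy model: linear in feature 0, nonlinear in feature 1.
model = lambda X: 2.0 * X[:, 0] + np.sin(X[:, 1])

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))
grid = np.linspace(-2.0, 2.0, 9)
pd0 = partial_dependence(model, X, feature=0, grid=grid)
```

For this additive toy model the curve for feature 0 is exactly linear with slope 2, which is what the robustness analysis would compare against a perturbed version.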

[LG-69] Dispelling the Mirage of Progress in Offline MARL through Standardised Baselines and Evaluation

链接: https://arxiv.org/abs/2406.09068
作者: Claude Formanek,Callum Rhys Tilbury,Louise Beyers,Jonathan Shock,Arnu Pretorius
关键词: multi-agent reinforcement learning, Offline multi-agent reinforcement, offline MARL, offline MARL work, published offline MARL
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Offline multi-agent reinforcement learning (MARL) is an emerging field with great promise for real-world applications. Unfortunately, the current state of research in offline MARL is plagued by inconsistencies in baselines and evaluation protocols, which ultimately makes it difficult to accurately assess progress, trust newly proposed innovations, and allow researchers to easily build upon prior work. In this paper, we firstly identify significant shortcomings in existing methodologies for measuring the performance of novel algorithms through a representative study of published offline MARL work. Secondly, by directly comparing to this prior work, we demonstrate that simple, well-implemented baselines can achieve state-of-the-art (SOTA) results across a wide range of tasks. Specifically, we show that on 35 out of 47 datasets used in prior work (almost 75% of cases), we match or surpass the performance of the current purported SOTA. Strikingly, our baselines often substantially outperform these more sophisticated algorithms. Finally, we correct for the shortcomings highlighted from this prior work by introducing a straightforward standardised methodology for evaluation and by providing our baseline implementations with statistically robust results across several scenarios, useful for comparisons in future work. Our proposal includes simple and sensible steps that are easy to adopt, which in combination with solid baselines and comparative results, could substantially improve the overall rigour of empirical science in offline MARL moving forward.

[LG-70] State-Space Modeling in Long Sequence Processing: A Survey on Recurrence in the Transformer Era

链接: https://arxiv.org/abs/2406.09062
作者: Matteo Tiezzi,Michele Casoni,Alessandro Betti,Marco Gori,Stefano Melacci
关键词: Artificial Intelligence, case of long, Recurrent Neural Nets, long sequences, goal of Artificial
类目: Machine Learning (cs.LG)
*备注: Currently under review

点击查看摘要

Abstract:Effectively learning from sequential data is a longstanding goal of Artificial Intelligence, especially in the case of long sequences. From the dawn of Machine Learning, several researchers engaged in the search of algorithms and architectures capable of processing sequences of patterns, retaining information about the past inputs while still leveraging the upcoming data, without losing precious long-term dependencies and correlations. While such an ultimate goal is inspired by the human hallmark of continuous real-time processing of sensory information, several solutions simplified the learning paradigm by artificially limiting the processed context or dealing with sequences of limited length, given in advance. These solutions were further emphasized by the large ubiquity of Transformers, that have initially shaded the role of Recurrent Neural Nets. However, recurrent networks are facing a strong recent revival due to the growing popularity of (deep) State-Space models and novel instances of large-context Transformers, which are both based on recurrent computations to go beyond several limits of currently ubiquitous technologies. In fact, the fast development of Large Language Models enhanced the interest in efficient solutions to process data over time. This survey provides an in-depth summary of the latest approaches that are based on recurrent models for sequential data processing. A complete taxonomy over the latest trends in architectural and algorithmic solutions is reported and discussed, guiding researchers in this appealing research field. The emerging picture suggests that there is room for thinking of novel routes, constituted by learning algorithms which depart from the standard Backpropagation Through Time, towards a more realistic scenario where patterns are effectively processed online, leveraging local-forward computations, opening to further research on this topic.
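
The recurrent computation at the core of the (deep) State-Space models surveyed here is a linear recurrence x_t = A x_{t-1} + B u_t with readout y_t = C x_t. A minimal discrete-time scan, with toy matrices chosen purely for illustration:

```python
import numpy as np

def ssm_scan(A, B, C, u):
    """Run the linear state-space recurrence x_t = A x_{t-1} + B u_t,
    y_t = C x_t over an input sequence u of shape (T, input_dim)."""
    x = np.zeros(A.shape[0])
    ys = []
    for u_t in u:
        x = A @ x + B @ u_t
        ys.append(C @ x)
    return np.stack(ys)

# Tiny stable system: eigenvalues of A (0.9 and 0.8) inside the unit circle.
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[1.0], [0.5]])
C = np.array([[1.0, -1.0]])

u = np.ones((50, 1))          # constant input
y = ssm_scan(A, B, C, u)      # output converges to C (I - A)^{-1} B = 10
```

Deep SSMs parallelize this scan and learn A, B, C; the sequential form above is the recurrence that lets them process arbitrarily long contexts with constant memory.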

[LG-71] Data-Free Generative Replay for Class-Incremental Learning on Imbalanced Data

链接: https://arxiv.org/abs/2406.09052
作者: Sohaib Younis,Bernhard Seeger
关键词: problem in machine, learning, DFGR, challenging problem, data
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Continual learning is a challenging problem in machine learning, especially for image classification tasks with imbalanced datasets. It becomes even more challenging when it involves learning new classes incrementally. One method for incremental class learning, addressing dataset imbalance, is rehearsal using previously stored data. In rehearsal-based methods, access to previous data is required for either training the classifier or the generator, but it may not be feasible due to storage, legal, or data access constraints. Although there are many rehearsal-free alternatives for class incremental learning, such as parameter or loss regularization, knowledge distillation, and dynamic architectures, they do not consistently achieve good results, especially on imbalanced data. This paper proposes a new approach called Data-Free Generative Replay (DFGR) for class incremental learning, where the generator is trained without access to real data. In addition, DFGR also addresses dataset imbalance in continual learning of an image classifier. Instead of using training data, DFGR trains a generator using mean and variance statistics of batch-norm and feature maps derived from a pre-trained classification model. The results of our experiments demonstrate that DFGR performs significantly better than other data-free methods and reveal the performance impact of specific parameter settings. DFGR achieves up to 88.5% and 46.6% accuracy on MNIST and FashionMNIST datasets, respectively. Our code is available at this https URL
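
DFGR trains its generator against the batch-norm mean/variance statistics of the pretrained classifier instead of real data. A minimal sketch of a statistics-matching penalty; the squared-error form and all names are assumptions, and DFGR's actual loss may differ:

```python
import numpy as np

def bn_statistics_loss(features, running_mean, running_var):
    """Data-free penalty: push the batch statistics of generated features
    toward the classifier's stored batch-norm running statistics."""
    mu = features.mean(axis=0)
    var = features.var(axis=0)
    return float(((mu - running_mean) ** 2).sum()
                 + ((var - running_var) ** 2).sum())

rng = np.random.default_rng(2)
running_mean = np.zeros(8)   # stored BN statistics (8 channels)
running_var = np.ones(8)

matched = rng.normal(loc=0.0, scale=1.0, size=(4096, 8))     # close to stored stats
mismatched = rng.normal(loc=3.0, scale=2.0, size=(4096, 8))  # far from stored stats

loss_matched = bn_statistics_loss(matched, running_mean, running_var)
loss_mismatched = bn_statistics_loss(mismatched, running_mean, running_var)
```

Minimizing such a penalty steers the generator toward feature distributions the classifier saw during training, without ever touching the original images.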

[LG-72] Efficiently Deciding Algebraic Equivalence of Bow-Free Acyclic Path Diagrams

链接: https://arxiv.org/abs/2406.09049
作者: Thijs van Ommen
关键词: enable causal discovery, conditional independences exist, causal discovery, causal discovery algorithms, latent confounders
类目: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注: To appear in the proceedings of the 40th Conference on Uncertainty in Artificial Intelligence (UAI 2024)

点击查看摘要

Abstract:For causal discovery in the presence of latent confounders, constraints beyond conditional independences exist that can enable causal discovery algorithms to distinguish more pairs of graphs. Such constraints are not well-understood yet. In the setting of linear structural equation models without bows, we study algebraic constraints and argue that these provide the most fine-grained resolution achievable. We propose efficient algorithms that decide whether two graphs impose the same algebraic constraints, or whether the constraints imposed by one graph are a subset of those imposed by another graph.

[LG-73] ExioML: Eco-economic dataset for Machine Learning in Global Sectoral Sustainability

链接: https://arxiv.org/abs/2406.09046
作者: Yanming Guo,Jin Ma
关键词: Environmental Extended Multi-Regional, Extended Multi-Regional Input-Output, Environmental Extended, Ecological Economics research, Ecological Economics
类目: Machine Learning (cs.LG); Databases (cs.DB)
*备注:

点击查看摘要

Abstract:The Environmental Extended Multi-Regional Input-Output analysis is the predominant framework in Ecological Economics for assessing the environmental impact of economic activities. This paper introduces ExioML, the first Machine Learning benchmark dataset designed for sustainability analysis, aimed at lowering barriers and fostering collaboration between Machine Learning and Ecological Economics research. A crucial greenhouse gas emission regression task was conducted to evaluate sectoral sustainability and demonstrate the usability of the dataset. We compared the performance of traditional shallow models with deep learning models, utilizing a diverse Factor Accounting table and incorporating various categorical and numerical features. Our findings reveal that ExioML, with its high usability, enables deep and ensemble models to achieve low mean square errors, establishing a baseline for future Machine Learning research. Through ExioML, we aim to build a foundational dataset supporting various Machine Learning applications and promote climate actions and sustainable investment decisions.

[LG-74] ME-Switch: A Memory-Efficient Expert Switching Framework for Large Language Models

链接: https://arxiv.org/abs/2406.09041
作者: Jing Liu,Ruihao Gong,Mingyang Zhang,Yefei He,Jianfei Cai,Bohan Zhuang
关键词: developing LLMs involves, LLMs involves pre-training, create specialized experts, general foundation model, massive data
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Tech report

点击查看摘要

Abstract:The typical process for developing LLMs involves pre-training a general foundation model on massive data, followed by fine-tuning on task-specific data to create specialized experts. Serving these experts poses challenges, as loading all experts onto devices is impractical, and frequent switching between experts in response to user requests incurs substantial I/O costs, increasing latency and expenses. Previous approaches decompose expert weights into pre-trained model weights and residual delta weights, then quantize the delta weights to reduce model size. However, these methods often lead to significant quantization errors at extremely low bitwidths and assume the appropriate model for a user request is known in advance, which is not practical. To address these issues, we introduce ME-Switch, a memory-efficient expert switching framework for LLM serving. ME-Switch uses mixed-precision quantization, selectively quantizing non-salient input channels of delta weights to extremely low bits while keeping salient ones intact, significantly reducing storage demands while maintaining performance. Additionally, we develop a routing method that efficiently directs user queries to the most suitable expert by transforming the model selection problem into a domain classification problem. Extensive experiments show ME-Switch’s promising memory efficiency and routing performance. For example, when serving three models from the Mistral-7B family, ME-Switch reduces model size by 1.74x while maintaining nearly lossless performance on instruction, mathematical reasoning, and code generation tasks. Furthermore, ME-Switch can efficiently serve 16 models from the Mistral-7B family on a single NVIDIA A100 GPU.
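
The mixed-precision idea in ME-Switch, quantizing non-salient input channels of the delta weights to very low bitwidths while keeping salient channels intact, can be sketched as follows. Selecting salient channels by norm is an assumption here; the paper's saliency criterion may differ:

```python
import numpy as np

def quantize_uniform(w, bits):
    """Symmetric uniform quantization of a vector to the given bitwidth."""
    levels = 2 ** (bits - 1) - 1
    w_max = np.abs(w).max()
    scale = w_max / levels if w_max > 0 else 1.0
    return np.round(w / scale) * scale

def mixed_precision_delta(delta, keep_ratio=0.1, bits=2):
    """Keep the highest-norm input channels of the delta weights in full
    precision; quantize the rest to `bits` bits."""
    norms = np.linalg.norm(delta, axis=0)         # one norm per input channel
    k = max(1, int(keep_ratio * delta.shape[1]))
    salient = np.argsort(norms)[-k:]
    out = np.empty_like(delta)
    for j in range(delta.shape[1]):
        out[:, j] = delta[:, j] if j in salient else quantize_uniform(delta[:, j], bits)
    return out, salient

rng = np.random.default_rng(3)
delta = rng.normal(scale=0.01, size=(64, 32))
delta[:, 5] *= 50.0                               # one clearly salient channel
q_delta, salient = mixed_precision_delta(delta, keep_ratio=0.05, bits=2)
```

The salient channel survives losslessly while the remaining channels shrink to 2-bit codes, which is the storage/accuracy trade the abstract describes.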

[LG-75] CGP++: A Modern C++ Implementation of Cartesian Genetic Programming

链接: https://arxiv.org/abs/2406.09038
作者: Roman Kalkreuth,Thomas Baeck
关键词: Cartesian Genetic Programming, Cartesian Genetic, Genetic Programming, reference implementation, implementation
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*备注: Accepted as a full paper in the BBSR track at the Genetic and Evolutionary Computation Conference (GECCO’24), July 14-18, 2024, Melbourne, Australia

点击查看摘要

Abstract:The reference implementation of Cartesian Genetic Programming (CGP) was written in the C programming language. C inherently follows a procedural programming paradigm, which entails challenges in providing a reusable and scalable implementation model for complex structures and methods. Moreover, due to the limiting factors of C, the reference implementation of CGP does not provide a generic framework and is therefore restricted to a set of predefined evaluation types. Besides the reference implementation, we also observe that other existing implementations are limited with respect to the features provided. In this work, we therefore propose the first version of a modern C++ implementation of CGP that pursues object-oriented design and generic programming paradigm to provide an efficient implementation model that can facilitate the discovery of new problem domains and the implementation of complex advanced methods that have been proposed for CGP over time. With the proposal of our new implementation, we aim to generally promote interpretability, accessibility and reproducibility in the field of CGP.
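
For readers unfamiliar with CGP itself: a genotype is a fixed-length list of function nodes whose connection genes index earlier values (program inputs come first). A minimal sketch of decoding and evaluating such a genome, with an illustrative encoding unrelated to the CGP++ codebase:

```python
import operator

# Tiny function set for a Cartesian Genetic Programming sketch.
FUNCS = [operator.add, operator.sub, operator.mul]

def evaluate_cgp(genome, output_gene, inputs):
    """Decode a feed-forward CGP genome. Each node is (func_idx, a, b),
    where a and b index earlier values; output_gene picks the result."""
    values = list(inputs)
    for func_idx, a, b in genome:
        values.append(FUNCS[func_idx](values[a], values[b]))
    return values[output_gene]

# Genome computing (x + y) * (x - y) = x^2 - y^2 from inputs x, y.
genome = [
    (0, 0, 1),  # node 2: x + y
    (1, 0, 1),  # node 3: x - y
    (2, 2, 3),  # node 4: node2 * node3
]
result = evaluate_cgp(genome, output_gene=4, inputs=[3.0, 2.0])
```

The generic-programming point of CGP++ is that the value type and function set become template parameters rather than the fixed types of the C reference implementation.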

[LG-76] A Comprehensive Graph Pooling Benchmark: Effectiveness Robustness and Generalizability

链接: https://arxiv.org/abs/2406.09031
作者: Pengyun Wang,Junyu Luo,Yanxin Shen,Siyu Heng,Xiao Luo
关键词: Graph pooling, graph pooling approaches, obtain effective node, graph pooling methods, Graph
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Graph pooling has gained attention for its ability to obtain effective node and graph representations for various downstream tasks. Despite the recent surge in graph pooling approaches, there is a lack of standardized experimental settings and fair benchmarks to evaluate their performance. To address this issue, we have constructed a comprehensive benchmark that includes 15 graph pooling methods and 21 different graph datasets. This benchmark systematically assesses the performance of graph pooling methods in three dimensions, i.e., effectiveness, robustness, and generalizability. We first evaluate the performance of these graph pooling approaches across different tasks including graph classification, graph regression and node classification. Then, we investigate their performance under potential noise attacks and out-of-distribution shifts in real-world scenarios. We also involve detailed efficiency analysis and parameter analysis. Extensive experiments validate the strong capability and applicability of graph pooling approaches in various scenarios, which can provide valuable insights and guidance for deep geometric learning research. The source code of our benchmark is available at this https URL.

[LG-77] CUER: Corrected Uniform Experience Replay for Off-Policy Continuous Deep Reinforcement Learning Algorithms

链接: https://arxiv.org/abs/2406.09030
作者: Arda Sarp Yenicesu,Furkan B. Mutlu,Suleyman S. Kozat,Ozgur S. Oguz
关键词: experience replay, replay mechanism enables, mechanism enables agents, effectively leverage, Uniform Experience Replay
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The utilization of the experience replay mechanism enables agents to effectively leverage their experiences on several occasions. In previous studies, the sampling probability of the transitions was modified based on their relative significance. The process of reassigning sample probabilities for every transition in the replay buffer after each iteration is considered extremely inefficient. Hence, in order to enhance computing efficiency, experience replay prioritization algorithms reassess the importance of a transition as it is sampled. However, the relative importance of the transitions undergoes dynamic adjustments when the agent’s policy and value function are iteratively updated. Furthermore, experience replay is a mechanism that retains the transitions generated by the agent’s past policies, which could potentially diverge significantly from the agent’s most recent policy. An increased deviation from the agent’s most recent policy results in a greater frequency of off-policy updates, which has a negative impact on the agent’s performance. In this paper, we develop a novel algorithm, Corrected Uniform Experience Replay (CUER), which stochastically samples the stored experience while considering the fairness among all other experiences without ignoring the dynamic nature of the transition importance by making sampled state distribution more on-policy. CUER provides promising improvements for off-policy continuous control algorithms in terms of sample efficiency, final performance, and stability of the policy during the training.
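
For context, the sample-time priority reassessment the abstract describes (the background mechanism CUER corrects, not CUER's own corrected sampling) can be sketched as a buffer that recomputes a transition's priority only when it is drawn; all names here are illustrative:

```python
import numpy as np

class PrioritizedReplayBuffer:
    """Generic prioritized replay sketch: priorities are reassessed lazily,
    at sample time, rather than for every stored transition each iteration."""
    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.transitions, self.priorities = [], []
        self.rng = np.random.default_rng(seed)

    def add(self, transition, priority=1.0):
        if len(self.transitions) >= self.capacity:   # drop oldest when full
            self.transitions.pop(0)
            self.priorities.pop(0)
        self.transitions.append(transition)
        self.priorities.append(priority)

    def sample(self, new_priority_fn):
        p = np.array(self.priorities)
        idx = int(self.rng.choice(len(p), p=p / p.sum()))
        # Reassess this transition's importance only now that it is drawn.
        self.priorities[idx] = new_priority_fn(self.transitions[idx])
        return self.transitions[idx]

buf = PrioritizedReplayBuffer(capacity=3)
for i in range(5):
    buf.add({"step": i}, priority=1.0 + i)
sampled = buf.sample(new_priority_fn=lambda t: 0.5)
```

The staleness this lazy update introduces, combined with transitions generated by old policies, is exactly the off-policy drift CUER's fairness-aware sampling aims to reduce.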

[LG-78] From Biased to Unbiased Dynamics: An Infinitesimal Generator Approach

链接: https://arxiv.org/abs/2406.09028
作者: Timothée Devergne,Vladimir Kostic,Michele Parrinello,Massimiliano Pontil
关键词: time-reversal invariant stochastic, invariant stochastic processes, Langevin equation, molecular dynamics, time-reversal invariant
类目: Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
*备注:

点击查看摘要

Abstract:We investigate learning the eigenfunctions of evolution operators for time-reversal invariant stochastic processes, a prime example being the Langevin equation used in molecular dynamics. Many physical or chemical processes described by this equation involve transitions between metastable states separated by high potential barriers that can hardly be crossed during a simulation. To overcome this bottleneck, data are collected via biased simulations that explore the state space more rapidly. We propose a framework for learning from biased simulations rooted in the infinitesimal generator of the process and the associated resolvent operator. We contrast our approach to more common ones based on the transfer operator, showing that it can provably learn the spectral properties of the unbiased system from biased data. In experiments, we highlight the advantages of our method over transfer operator approaches and recent developments based on generator learning, demonstrating its effectiveness in estimating eigenfunctions and eigenvalues. Importantly, we show that even with datasets containing only a few relevant transitions due to sub-optimal biasing, our approach recovers relevant information about the transition mechanism.
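
For concreteness, the overdamped Langevin equation together with its infinitesimal generator and resolvent, written under standard conventions (an assumption about the paper's exact setup), are:

```latex
% Overdamped Langevin dynamics at inverse temperature \beta:
dX_t = -\nabla V(X_t)\,dt + \sqrt{2\beta^{-1}}\,dW_t
% Infinitesimal generator acting on a test function f:
(\mathcal{L} f)(x) = -\nabla V(x)\cdot\nabla f(x) + \beta^{-1}\,\Delta f(x)
% Associated resolvent operator, for \eta > 0:
R_\eta = \left(\eta\,\mathrm{Id} - \mathcal{L}\right)^{-1}
```

Learning spectral properties through the generator and resolvent, rather than the transfer operator, is what lets the method compensate for the bias introduced by accelerated sampling.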

[LG-79] Schur's Positive-Definite Network: Deep Learning in the SPD cone with structure

链接: https://arxiv.org/abs/2406.09023
作者: Can Pouliquen,Mathurin Massias,Titouan Vayer
关键词: Estimating matrices, symmetric positive-definite, applications ranging, ranging from computer, computer vision
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 13 pages, 7 figures

点击查看摘要

Abstract:Estimating matrices in the symmetric positive-definite (SPD) cone is of interest for many applications ranging from computer vision to graph learning. While there exist various convex optimization-based estimators, they remain limited in expressivity due to their model-based approach. The success of deep learning has thus led many to use neural networks to learn to estimate SPD matrices in a data-driven fashion. For learning structured outputs, one promising strategy involves architectures designed by unrolling iterative algorithms, which potentially benefit from inductive bias properties. However, designing correct unrolled architectures for SPD learning is difficult: they either do not guarantee that their output has all the desired properties, rely on heavy computations, or are overly restrained to specific matrices which hinders their expressivity. In this paper, we propose a novel and generic learning module with guaranteed SPD outputs called SpodNet, that also enables learning a larger class of functions than existing approaches. Notably, it solves the challenging task of learning jointly SPD and sparse matrices. Our experiments demonstrate the versatility of SpodNet layers.
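
A standard way to guarantee SPD outputs from an unconstrained network head, useful for contrast with SpodNet (which additionally handles joint sparsity and is not reproduced here), is a Cholesky-style parameterization:

```python
import numpy as np

def spd_from_raw(raw, eps=1e-6):
    """Map an unconstrained square matrix to an SPD matrix via its
    lower-triangular part: A = L L^T + eps*I. This is a standard trick,
    not SpodNet's actual layer."""
    L = np.tril(raw)
    n = raw.shape[0]
    return L @ L.T + eps * np.eye(n)

rng = np.random.default_rng(4)
raw = rng.normal(size=(5, 5))      # e.g. the output of a network head
A = spd_from_raw(raw)
eigvals = np.linalg.eigvalsh(A)    # all strictly positive by construction
```

The limitation the paper points out is visible here: such parameterizations guarantee positive-definiteness but offer no control over sparsity, which SpodNet's unrolled layers address jointly.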

[LG-80] Deep learning empowered sensor fusion to improve infant movement classification

链接: https://arxiv.org/abs/2406.09014
作者: Tomas Kulvicius,Dajie Zhang,Luise Poustka,Sven Bölte,Lennart Jahn,Sarah Flügge,Marc Kraft,Markus Zweckstetter,Karin Nielsen-Saines,Florentin Wörgötter,Peter B Marschik
关键词: established clinical tools, enhance diagnostic procedures, recent boom, solutions to facilitate, diagnostic procedures
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:There is a recent boom in the development of AI solutions to facilitate and enhance diagnostic procedures for established clinical tools. To assess the integrity of the developing nervous system, the Prechtl general movement assessment (GMA) is recognized for its clinical value in the diagnosis of neurological impairments in early infancy. GMA has been increasingly augmented through machine learning approaches intended to scale up its application, circumvent costs in the training of human assessors and further standardize classification of spontaneous motor patterns. Available deep learning tools, all of which are based on single sensor modalities, are, however, still considerably inferior to well-trained human assessors. These approaches are hardly comparable as all models are designed, trained and evaluated on proprietary/silo data sets. We propose a sensor fusion approach for assessing fidgety movements (FMs) comparing three different sensor modalities (pressure, inertial, and visual sensors). Various combinations and two sensor fusion approaches (late and early fusion) for infant movement classification were tested to evaluate whether a multi-sensor system outperforms single modality assessments. The performance of the three-sensor fusion (classification accuracy of 94.5%) was significantly higher than that of any single modality evaluated, suggesting the sensor fusion approach is a promising avenue for automated classification of infant motor patterns. The development of a robust sensor fusion system may significantly enhance AI-based early recognition of neurofunctions, ultimately facilitating early implementation of automated detection of neurodevelopmental conditions.
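
The two fusion schemes compared in the paper can be sketched generically: early fusion concatenates per-modality features before a single classifier, while late fusion combines per-modality class probabilities. The numbers below are illustrative, not from the study:

```python
import numpy as np

def late_fusion(prob_per_modality):
    """Average class-probability vectors predicted independently per
    modality, then pick the argmax."""
    fused = np.mean(prob_per_modality, axis=0)
    return fused, int(np.argmax(fused))

def early_fusion_features(feat_per_modality):
    """Concatenate per-modality feature vectors into one input for a
    single downstream classifier."""
    return np.concatenate(feat_per_modality)

# Two modalities favor class 1, one weakly favors class 0 (toy values).
probs = np.array([
    [0.45, 0.55],   # pressure
    [0.30, 0.70],   # inertial
    [0.55, 0.45],   # visual
])
fused, pred = late_fusion(probs)
```

Late fusion lets each modality keep its own model; early fusion lets one model exploit cross-modality correlations, which is the trade-off the study evaluates empirically.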

[LG-81] Fredformer: Frequency Debiased Transformer for Time Series Forecasting

链接: https://arxiv.org/abs/2406.09009
作者: Xihao Piao,Zheng Chen,Taichi Murayama,Yasuko Matsubara,Yasushi Sakurai
关键词: time series forecasting, Transformer model, shown leading performance, shown leading, time series
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: This paper has been accepted by SIGKDD2024

点击查看摘要

Abstract:The Transformer model has shown leading performance in time series forecasting. Nevertheless, in some complex scenarios, it tends to learn low-frequency features in the data and overlook high-frequency features, showing a frequency bias. This bias prevents the model from accurately capturing important high-frequency data features. In this paper, we undertook empirical analyses to understand this bias and discovered that frequency bias results from the model disproportionately focusing on frequency features with higher energy. Based on our analysis, we formulate this bias and propose Fredformer, a Transformer-based framework designed to mitigate frequency bias by learning features equally across different frequency bands. This approach prevents the model from overlooking lower amplitude features important for accurate forecasting. Extensive experiments show the effectiveness of our proposed approach, which can outperform other baselines in different real-world time-series datasets. Furthermore, we introduce a lightweight variant of the Fredformer with an attention matrix approximation, which achieves comparable performance but with much fewer parameters and lower computation costs. The code is available at: this https URL
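
The energy imbalance the abstract identifies as the source of frequency bias is easy to see on a toy series: a strong low-frequency component dwarfs a weak high-frequency one in the spectrum. A minimal sketch (the band split at bin 10 is an arbitrary illustrative choice):

```python
import numpy as np

# Strong low-frequency trend plus a weak high-frequency ripple -- the kind
# of energy imbalance that makes a model over-focus on low frequencies.
t = np.arange(256)
series = 5.0 * np.sin(2 * np.pi * 2 * t / 256) \
       + 0.2 * np.sin(2 * np.pi * 60 * t / 256)

spectrum = np.fft.rfft(series)
energy = np.abs(spectrum) ** 2
low_band_energy = energy[:10].sum()    # low-frequency band
high_band_energy = energy[10:].sum()   # everything above it
```

Fredformer's remedy is to learn features equally across such bands so the low-energy (but forecast-relevant) high frequencies are not drowned out.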

[LG-82] Enhancing Cross-Modal Fine-Tuning with Gradually Intermediate Modality Generation

链接: https://arxiv.org/abs/2406.09003
作者: Lincan Cai,Shuang Li,Wenxuan Ma,Jingxuan Kang,Binhui Xie,Zixun Sun,Chengwei Zhu
关键词: proven immensely valuable, handling data-intensive modalities, Large-scale pretrained models, text and image, Large-scale pretrained
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large-scale pretrained models have proven immensely valuable in handling data-intensive modalities like text and image. However, fine-tuning these models for certain specialized modalities, such as protein sequence and cosmic ray, poses challenges due to the significant modality discrepancy and scarcity of labeled data. In this paper, we propose an end-to-end method, PaRe, to enhance cross-modal fine-tuning, aiming to transfer a large-scale pretrained model to various target modalities. PaRe employs a gating mechanism to select key patches from both source and target data. Through a modality-agnostic Patch Replacement scheme, these patches are preserved and combined to construct data-rich intermediate modalities ranging from easy to hard. By gradually intermediate modality generation, we can not only effectively bridge the modality gap to enhance stability and transferability of cross-modal fine-tuning, but also address the challenge of limited data in the target modality by leveraging enriched intermediate modality data. Compared with hand-designed, general-purpose, task-specific, and state-of-the-art cross-modal fine-tuning approaches, PaRe demonstrates superior performance across three challenging benchmarks, encompassing more than ten modalities.
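
The Patch Replacement idea can be sketched schematically: replace a fraction of target patches with source patches, with the fraction controlling how "intermediate" the constructed sample is. PaRe selects key patches via a learned gate; the random selection below is a simplifying assumption:

```python
import numpy as np

def patch_replace(source, target, ratio, patch=4, seed=0):
    """Build an intermediate-modality sample by replacing a fraction of
    target patches with source patches (difficulty grows with `ratio`)."""
    rng = np.random.default_rng(seed)
    out = target.copy()
    gh, gw = source.shape[0] // patch, source.shape[1] // patch
    n_replace = int(ratio * gh * gw)
    cells = rng.choice(gh * gw, size=n_replace, replace=False)
    for c in cells:
        i, j = divmod(int(c), gw)
        out[i * patch:(i + 1) * patch, j * patch:(j + 1) * patch] = \
            source[i * patch:(i + 1) * patch, j * patch:(j + 1) * patch]
    return out

source = np.zeros((16, 16))   # stand-in for a source-modality sample
target = np.ones((16, 16))    # stand-in for a target-modality sample
mixed = patch_replace(source, target, ratio=0.25)
```

Sweeping `ratio` from small to large yields the easy-to-hard curriculum of intermediate modalities the abstract describes.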

[LG-83] Classic GNNs are Strong Baselines: Reassessing GNNs for Node Classification

链接: https://arxiv.org/abs/2406.08993
作者: Yuankai Luo,Lei Shi,Xiao-Ming Wu
关键词: message-passing Graph Neural, theoretically superior expressiveness, Graph Neural Networks, significantly outperforming GNNs, traditional message-passing Graph
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph Transformers (GTs) have recently emerged as popular alternatives to traditional message-passing Graph Neural Networks (GNNs), due to their theoretically superior expressiveness and impressive performance reported on standard node classification benchmarks, often significantly outperforming GNNs. In this paper, we conduct a thorough empirical analysis to reevaluate the performance of three classic GNN models (GCN, GAT, and GraphSAGE) against GTs. Our findings suggest that the previously reported superiority of GTs may have been overstated due to suboptimal hyperparameter configurations in GNNs. Remarkably, with slight hyperparameter tuning, these classic GNN models achieve state-of-the-art performance, matching or even exceeding that of recent GTs across 17 out of the 18 diverse datasets examined. Additionally, we conduct detailed ablation studies to investigate the influence of various GNN configurations, such as normalization, dropout, residual connections, network depth, and jumping knowledge mode, on node classification performance. Our study aims to promote a higher standard of empirical rigor in the field of graph machine learning, encouraging more accurate comparisons and evaluations of model capabilities.
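
该文消融的各项经典 GNN 配置(归一化、dropout、残差连接、网络深度等)中,最核心的是 GCN 的逐层传播规则 H' = σ(D̂^(-1/2)(A+I)D̂^(-1/2)HW)。下面给出一个 numpy 示意草图(残差连接等细节为本文的假设写法,并非原作者代码):

```python
import numpy as np

def gcn_layer(A, H, W, residual=True):
    """One GCN layer: H' = ReLU(D^-1/2 (A+I) D^-1/2 H W), optionally with a residual.

    A: (n, n) adjacency matrix, H: (n, d) node features, W: (d, d) weights.
    Normalization, residual connections, and depth are exactly the knobs the
    paper's ablations vary. (Illustrative sketch, not the authors' code.)
    """
    n = A.shape[0]
    A_hat = A + np.eye(n)                       # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    H_new = np.maximum(A_norm @ H @ W, 0.0)     # ReLU activation
    return H_new + H if residual else H_new

# Tiny 3-node path graph with 2-d features and identity weights.
A = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
H = np.eye(3, 2)
out = gcn_layer(A, H, np.eye(2))
```

堆叠若干这样的层再接一个线性分类头,即可得到论文重新调参的 GCN 基线骨架。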

[LG-84] BTS: Building Timeseries Dataset: Empowering Large-Scale Building Analytics

链接: https://arxiv.org/abs/2406.08990
作者: Arian Prabowo,Xiachong Lin,Imran Razzak,Hao Xue,Emily W. Yap,Matthew Amos,Flora D. Salim
关键词: influencing occupant comfort, influencing occupant, occupant comfort, play a crucial, crucial role
类目: Machine Learning (cs.LG)
*备注: 21 pages, 2 figures, 9 tables, under review

点击查看摘要

Abstract:Buildings play a crucial role in human well-being, influencing occupant comfort, health, and safety. Additionally, they contribute significantly to global energy consumption, accounting for one-third of total energy usage, and carbon emissions. Optimizing building performance presents a vital opportunity to combat climate change and promote human flourishing. However, research in building analytics has been hampered by the lack of accessible, available, and comprehensive real-world datasets on multiple building operations. In this paper, we introduce the Building TimeSeries (BTS) dataset. Our dataset covers three buildings over a three-year period, comprising more than ten thousand timeseries data points with hundreds of unique ontologies. Moreover, the metadata is standardized using the Brick schema. To demonstrate the utility of this dataset, we performed benchmarks on two tasks: timeseries ontology classification and zero-shot forecasting. These tasks represent an essential initial step in addressing challenges related to interoperability in building analytics. Access to the dataset and the code used for benchmarking are available here: this https URL .

[LG-85] XLand-100B: A Large-Scale Multi-Task Dataset for In-Context Reinforcement Learning

链接: https://arxiv.org/abs/2406.08973
作者: Alexander Nikulin,Ilya Zisman,Alexey Zemtsov,Viacheslav Sinii,Vladislav Kurenkov,Sergey Kolesnikov
关键词: computer vision models, in-context reinforcement learning, recently emerging field, in-context learning paradigm, vision models
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Following the success of the in-context learning paradigm in large-scale language and computer vision models, the recently emerging field of in-context reinforcement learning is experiencing a rapid growth. However, its development has been held back by the lack of challenging benchmarks, as all the experiments have been carried out in simple environments and on small-scale datasets. We present XLand-100B, a large-scale dataset for in-context reinforcement learning based on the XLand-MiniGrid environment, as a first step to alleviate this problem. It contains complete learning histories for nearly 30,000 different tasks, covering 100B transitions and 2.5B episodes. It took 50,000 GPU hours to collect the dataset, which is beyond the reach of most academic labs. Along with the dataset, we provide the utilities to reproduce or expand it even further. With this substantial effort, we aim to democratize research in the rapidly growing field of in-context reinforcement learning and provide a solid foundation for further scaling. The code is open-source and available under Apache 2.0 licence at this https URL.

[LG-86] Separation Power of Equivariant Neural Networks

链接: https://arxiv.org/abs/2406.08966
作者: Marco Pacini,Xiaowen Dong,Bruno Lepri,Gabriele Santin
关键词: machine learning model, learning model refers, separation power, distinguish distinct inputs, machine learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 9 pages of main text, 2 figures

点击查看摘要

Abstract:The separation power of a machine learning model refers to its capacity to distinguish distinct inputs, and it is often employed as a proxy for its expressivity. In this paper, we propose a theoretical framework to investigate the separation power of equivariant neural networks with point-wise activations. Using the proposed framework, we can derive an explicit description of inputs indistinguishable by a family of neural networks with given architecture, demonstrating that it remains unaffected by the choice of non-polynomial activation function employed. We are able to understand the role played by activation functions in separability. Indeed, we show that all non-polynomial activations, such as ReLU and sigmoid, are equivalent in terms of expressivity, and that they reach maximum discrimination capacity. We demonstrate how assessing the separation power of an equivariant neural network can be simplified to evaluating the separation power of minimal representations. We conclude by illustrating how these minimal components form a hierarchy in separation power.

[LG-87] An Unsupervised Approach to Achieve Supervised-Level Explainability in Healthcare Records

链接: https://arxiv.org/abs/2406.08958
作者: Joakim Edin,Maria Maistro,Lars Maaløe,Lasse Borgholt,Jakob D. Havtorn,Tuukka Ruotsalo
关键词: Electronic healthcare records, Electronic healthcare, document conditions, vital for patient, patient safety
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Electronic healthcare records are vital for patient safety as they document conditions, plans, and procedures in both free text and medical codes. Language models have significantly enhanced the processing of such records, streamlining workflows and reducing manual data entry, thereby saving healthcare providers significant resources. However, the black-box nature of these models often leaves healthcare professionals hesitant to trust them. State-of-the-art explainability methods increase model transparency but rely on human-annotated evidence spans, which are costly. In this study, we propose an approach to produce plausible and faithful explanations without needing such annotations. We demonstrate on the automated medical coding task that adversarial robustness training improves explanation plausibility and introduce AttInGrad, a new explanation method superior to previous ones. By combining both contributions in a fully unsupervised setup, we produce explanations of comparable quality, or better, to that of a supervised approach. We release our code and model weights.

[LG-88] Preserving Identity with Variational Score for General-purpose 3D Editing

链接: https://arxiv.org/abs/2406.08953
作者: Duong H. Le,Tuan Pham,Aniruddha Kembhavi,Stephan Mandt,Wei-Chiu Ma,Jiasen Lu
关键词: Variational Score Distillation, Delta Denoising Score, present Piva, Variational Score, Preserving Identity
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 22 pages, 14 figures

点击查看摘要

Abstract:We present Piva (Preserving Identity with Variational Score Distillation), a novel optimization-based method for editing images and 3D models based on diffusion models. Specifically, our approach is inspired by the recently proposed method for 2D image editing - Delta Denoising Score (DDS). We pinpoint the limitations in DDS for 2D and 3D editing, which causes detail loss and over-saturation. To address this, we propose an additional score distillation term that enforces identity preservation. This results in a more stable editing process, gradually optimizing NeRF models to match target prompts while retaining crucial input characteristics. We demonstrate the effectiveness of our approach in zero-shot image and neural field editing. Our method successfully alters visual attributes, adds both subtle and substantial structural elements, translates shapes, and achieves competitive results on standard 2D and 3D editing benchmarks. Additionally, our method imposes no constraints like masking or pre-training, making it compatible with a wide range of pre-trained diffusion models. This allows for versatile editing without needing neural field-to-mesh conversion, offering a more user-friendly experience.

[LG-89] Neural NeRF Compression

链接: https://arxiv.org/abs/2406.08943
作者: Tuan Pham,Stephan Mandt
关键词: Neural Radiance Fields, Radiance Fields, continuous volumetric representations, Neural Radiance, capturing detailed
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted to ICML 2024

点击查看摘要

Abstract:Neural Radiance Fields (NeRFs) have emerged as powerful tools for capturing detailed 3D scenes through continuous volumetric representations. Recent NeRFs utilize feature grids to improve rendering quality and speed; however, these representations introduce significant storage overhead. This paper presents a novel method for efficiently compressing a grid-based NeRF model, addressing the storage overhead concern. Our approach is based on the non-linear transform coding paradigm, employing neural compression for compressing the model’s feature grids. Due to the lack of training data involving many i.i.d scenes, we design an encoder-free, end-to-end optimized approach for individual scenes, using lightweight decoders. To leverage the spatial inhomogeneity of the latent feature grids, we introduce an importance-weighted rate-distortion objective and a sparse entropy model employing a masking mechanism. Our experimental results validate that our proposed method surpasses existing works in terms of grid-based NeRF compression efficacy and reconstruction quality.
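
文中的 importance-weighted rate-distortion 目标本质上仍是经典的 L = D + λR 形式,只是对各 latent 特征的码率项加了重要性权重。以下为示意性 numpy 草图(权重方案与变量名均为本文假设,并非论文实现):

```python
import numpy as np

def weighted_rd_loss(distortion, bits_per_latent, importance, lam=0.01):
    """Importance-weighted rate-distortion objective: L = D + lam * sum(w_i * R_i).

    distortion: scalar reconstruction error (e.g. MSE of rendered pixels)
    bits_per_latent: estimated code length of each latent (from an entropy model)
    importance: per-latent weights emphasizing regions that matter for rendering
    (Illustrative sketch; the paper's exact weighting and entropy model differ.)
    """
    rate = np.sum(importance * bits_per_latent)
    return distortion + lam * rate

rate_bits = np.array([4.0, 2.0, 8.0])
weights = np.array([1.0, 0.5, 0.25])   # down-weight latents the renderer rarely uses
loss = weighted_rd_loss(0.1, rate_bits, weights, lam=0.01)
```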

[LG-90] LaCoOT: Layer Collapse through Optimal Transport

链接: https://arxiv.org/abs/2406.08933
作者: Victor Quétu,Nour Hezbri,Enzo Tartaglione
关键词: tackling complex tasks, posing energy-consumption issues, computational resources remains, deep neural networks, over-parametrized deep neural
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Although deep neural networks are well-known for their remarkable performance in tackling complex tasks, their hunger for computational resources remains a significant hurdle, posing energy-consumption issues and restricting their deployment on resource-constrained devices, which stalls their widespread adoption. In this paper, we present an optimal transport method to reduce the depth of over-parametrized deep neural networks, alleviating their computational burden. More specifically, we propose a new regularization strategy based on the Max-Sliced Wasserstein distance to minimize the distance between the intermediate feature distributions in the neural network. We show that minimizing this distance enables the complete removal of intermediate layers in the network, with almost no performance loss and without requiring any finetuning. We assess the effectiveness of our method on traditional image classification setups. We commit to releasing the source code upon acceptance of the article.
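
论文最小化的 Max-Sliced Wasserstein 距离可以概括为:把两组特征投影到若干随机单位方向上,逐方向计算一维 Wasserstein-1 距离,再取最大值。下面是一个示意性 numpy 草图(仅为说明该距离本身,并非作者实现):

```python
import numpy as np

def max_sliced_w1(X, Y, n_directions=128, seed=0):
    """Max-Sliced Wasserstein-1 distance between two equally-sized feature batches.

    Project both batches onto random unit directions, compute the 1-D
    Wasserstein-1 distance (mean absolute difference of sorted samples) per
    direction, and return the maximum. The paper minimizes this distance
    between consecutive layers' feature distributions; illustrative sketch only.
    """
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    dirs = rng.normal(size=(n_directions, d))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    best = 0.0
    for u in dirs:
        px, py = np.sort(X @ u), np.sort(Y @ u)
        best = max(best, np.mean(np.abs(px - py)))
    return best

X = np.random.default_rng(1).normal(size=(256, 8))
d_same = max_sliced_w1(X, X)          # identical distributions -> 0
d_diff = max_sliced_w1(X, X + 5.0)    # shifted distribution -> large
```

直观上,当某一中间层输入与输出的特征分布在该距离下趋于 0 时,该层即可被整体移除而几乎不损失性能,这正是论文的正则化动机。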

[LG-91] Efficient Multi-View Fusion and Flexible Adaptation to View Missing in Cardiovascular System Signals

链接: https://arxiv.org/abs/2406.08930
作者: Qihan Hu,Daomiao Wang,Hong Wu,Jian Liu,Cuiwei Yang
关键词: automatic multi-view fusion, facilitated automatic multi-view, amalgamates CVS signals, prevalent MVF model, multi-view fusion
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 16 pages,12 figures

点击查看摘要

Abstract:The progression of deep learning and the widespread adoption of sensors have facilitated automatic multi-view fusion (MVF) about the cardiovascular system (CVS) signals. However, prevalent MVF model architecture often amalgamates CVS signals from the same temporal step but different views into a unified representation, disregarding the asynchronous nature of cardiovascular events and the inherent heterogeneity across views, leading to catastrophic view confusion. Efficient training strategies specifically tailored for MVF models to attain comprehensive representations need simultaneous consideration. Crucially, real-world data frequently arrives with incomplete views, an aspect rarely noticed by researchers. Thus, the View-Centric Transformer (VCT) and Multitask Masked Autoencoder (M2AE) are specifically designed to emphasize the centrality of each view and harness unlabeled data to achieve superior fused representations. Additionally, we systematically define the missing-view problem for the first time and introduce prompt techniques to aid pretrained MVF models in flexibly adapting to various missing-view scenarios. Rigorous experiments involving atrial fibrillation detection, blood pressure estimation, and sleep staging (typical health monitoring tasks) demonstrate the remarkable advantage of our method in MVF compared to prevailing methodologies. Notably, the prompt technique requires finetuning less than 3% of the entire model’s data, substantially fortifying the model’s resilience to view missing while circumventing the need for complete retraining. The results demonstrate the effectiveness of our approaches, highlighting their potential for practical applications in cardiovascular health monitoring. Codes and models are released at URL.

[LG-92] Step-by-Step Diffusion: An Elementary Tutorial

链接: https://arxiv.org/abs/2406.08929
作者: Preetum Nakkiran,Arwen Bradley,Hattie Zhou,Madhu Advani
关键词: machine learning, diffusion experience, present an accessible, models and flow, flow matching
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
*备注: 35 pages, 11 figures

点击查看摘要

Abstract:We present an accessible first course on diffusion models and flow matching for machine learning, aimed at a technical audience with no diffusion experience. We try to simplify the mathematical details as much as possible (sometimes heuristically), while retaining enough precision to derive correct algorithms.
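
作为该教程所推导内容的一个最小示例,DDPM 前向加噪过程可以闭式采样:x_t = √ᾱ_t·x_0 + √(1−ᾱ_t)·ε,其中 ᾱ_t 是 (1−β) 的累积乘积。以下 numpy 草图仅为示意(线性噪声日程等均为常见默认设置,并非摘自该教程):

```python
import numpy as np

def forward_diffuse(x0, t, betas, rng):
    """Sample x_t ~ q(x_t | x_0) in closed form.

    x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps, where a_bar_t is the
    cumulative product of (1 - beta). Standard DDPM forward process, written
    as a minimal sketch (not code from the tutorial itself).
    """
    a_bar = np.cumprod(1.0 - betas)[t]
    eps = rng.normal(size=x0.shape)
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps, eps

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)                          # linear schedule
x0 = rng.normal(size=(4,))
x_late, _ = forward_diffuse(x0, t=999, betas=betas, rng=rng)   # nearly pure noise
x_early, _ = forward_diffuse(x0, t=0, betas=betas, rng=rng)    # close to x0
```

训练去噪网络时,正是让网络从 x_t 和 t 预测此处采样出的 ε。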

[LG-93] Learning Images Across Scales Using Adversarial Training

链接: https://arxiv.org/abs/2406.08924
作者: Krzysztof Wolski,Adarsh Djeacoumar,Alireza Javanmardi,Hans-Peter Seidel,Christian Theobalt,Guillaume Cordonnier,Karol Myszkowski,George Drettakis,Xingang Pan,Thomas Leimkühler
关键词: real world exhibits, world exhibits rich, exhibits rich structure, real world, world exhibits
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: SIGGRAPH 2024; project page: this https URL

点击查看摘要

Abstract:The real world exhibits rich structure and detail across many scales of observation. It is difficult, however, to capture and represent a broad spectrum of scales using ordinary images. We devise a novel paradigm for learning a representation that captures an orders-of-magnitude variety of scales from an unstructured collection of ordinary images. We treat this collection as a distribution of scale-space slices to be learned using adversarial training, and additionally enforce coherency across slices. Our approach relies on a multiscale generator with carefully injected procedural frequency content, which allows to interactively explore the emerging continuous scale space. Training across vastly different scales poses challenges regarding stability, which we tackle using a supervision scheme that involves careful sampling of scales. We show that our generator can be used as a multiscale generative model, and for reconstructions of scale spaces from unstructured patches. Significantly outperforming the state of the art, we demonstrate zoom-in factors of up to 256x at high quality and scale consistency.

[LG-94] Navigating the Shadows: Unveiling Effective Disturbances for Modern AI Content Detectors

链接: https://arxiv.org/abs/2406.08922
作者: Ying Zhou,Ben He,Le Sun
关键词: attracted global attention, large language models, launch of ChatGPT, large language, global attention
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted by ACL 2024, Main Conference

点击查看摘要

Abstract:With the launch of ChatGPT, large language models (LLMs) have attracted global attention. In the realm of article writing, LLMs have witnessed extensive utilization, giving rise to concerns related to intellectual property protection, personal privacy, and academic integrity. In response, AI-text detection has emerged to distinguish between human and machine-generated content. However, recent research indicates that these detection systems often lack robustness and struggle to effectively differentiate perturbed texts. Currently, there is a lack of systematic evaluations regarding detection performance in real-world applications, and a comprehensive examination of perturbation techniques and detector robustness is also absent. To bridge this gap, our work simulates real-world scenarios in both informal and professional writing, exploring the out-of-the-box performance of current detectors. Additionally, we have constructed 12 black-box text perturbation methods to assess the robustness of current detection models across various perturbation granularities. Furthermore, through adversarial learning experiments, we investigate the impact of perturbation data augmentation on the robustness of AI-text detectors. We have released our code and data at this https URL.

[LG-95] Beyond the Calibration Point: Mechanism Comparison in Differential Privacy

链接: https://arxiv.org/abs/2406.08918
作者: Georgios Kaissis,Stefan Kolek,Borja Balle,Jamie Hayes,Daniel Rueckert
关键词: delta, machine learning, differentially private, reported and compared, varepsilon
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注: ICML 2024

点击查看摘要

Abstract:In differentially private (DP) machine learning, the privacy guarantees of DP mechanisms are often reported and compared on the basis of a single (\varepsilon, \delta)-pair. This practice overlooks that DP guarantees can vary substantially even between mechanisms sharing a given (\varepsilon, \delta), and potentially introduces privacy vulnerabilities which can remain undetected. This motivates the need for robust, rigorous methods for comparing DP guarantees in such cases. Here, we introduce the \Delta-divergence between mechanisms, which quantifies the worst-case excess privacy vulnerability of choosing one mechanism over another in terms of (\varepsilon, \delta), f-DP, and a newly presented Bayesian interpretation. Moreover, as a generalisation of the Blackwell theorem, it is endowed with strong decision-theoretic foundations. Through application examples, we show that our techniques can facilitate informed decision-making and reveal gaps in the current understanding of privacy risks, as current practices in DP-SGD often result in choosing mechanisms with high excess privacy vulnerabilities.
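
为直观说明"仅比较单一 (ε, δ) 标定点会掩盖机制间差异"这一点,可以对比两个机制的完整 δ(ε) 曲线。下面使用高斯机制已知的解析 δ(ε) 公式做曲线对比示意(这只是动机演示,并非论文提出的 Δ-divergence 本身):

```python
import math

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gaussian_delta(eps, mu):
    """delta(eps) curve of a Gaussian mechanism, mu = sensitivity / sigma.

    Known analytic formula: delta = Phi(mu/2 - eps/mu) - exp(eps) * Phi(-mu/2 - eps/mu).
    Used only to illustrate comparing whole privacy profiles rather than one
    calibration point; the paper's Delta-divergence is a more general quantity.
    """
    return phi(mu / 2 - eps / mu) - math.exp(eps) * phi(-mu / 2 - eps / mu)

# Less noise (larger mu) gives a uniformly worse delta(eps) curve; comparing
# only one point on each curve would hide how the gap evolves with eps.
eps_grid = (0.5, 1.0, 2.0, 4.0)
curve_a = [gaussian_delta(e, mu=1.0) for e in eps_grid]
curve_b = [gaussian_delta(e, mu=1.5) for e in eps_grid]
```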

[LG-96] Predicting Fault-Ride-Through Probability of Inverter-Dominated Power Grids using Machine Learning

链接: https://arxiv.org/abs/2406.08917
作者: Christian Nauck,Anna Büttner,Sebastian Liemann,Frank Hellmann,Michael Lindner
关键词: grids gains importance, power grids gains, power grids, gains importance, power
类目: Systems and Control (eess.SY); Machine Learning (cs.LG)
*备注: 11 pages, 6 figures

点击查看摘要

Abstract:Due to the increasing share of renewables, the analysis of the dynamical behavior of power grids gains importance. Effective risk assessments necessitate the analysis of large number of fault scenarios. The computational costs inherent in dynamic simulations impose constraints on the number of configurations that can be analyzed. Machine Learning (ML) has proven to efficiently predict complex power grid properties. Hence, we analyze the potential of ML for predicting dynamic stability of future power grids with large shares of inverters. For this purpose, we generate a new dataset consisting of synthetic power grid models and perform dynamical simulations. As targets for the ML training, we calculate the fault-ride-through probability, which we define as the probability of staying within a ride-through curve after a fault at a bus has been cleared. Importantly, we demonstrate that ML models accurately predict the fault-ride-through probability of synthetic power grids. Finally, we also show that the ML models generalize to an IEEE-96 Test System, which emphasizes the potential of deploying ML methods to study probabilistic stability of power grids.

[LG-97] Transcription-Free Fine-Tuning of Speech Separation Models for Noisy and Reverberant Multi-Speaker Automatic Speech Recognition

链接: https://arxiv.org/abs/2406.08914
作者: William Ravenscroft,George Close,Stefan Goetze,Thomas Hain,Mohammad Soleymanpour,Anurag Chowdhury,Mark C. Fuhs
关键词: automatic speech recognition, speech recognition, automatic speech, separate speech, solution to automatic
类目: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: 5 pages, 3 Figures, 3 Tables, Accepted for Interspeech 2024

点击查看摘要

Abstract:One solution to automatic speech recognition (ASR) of overlapping speakers is to separate speech and then perform ASR on the separated signals. Commonly, the separator produces artefacts which often degrade ASR performance. Addressing this issue typically requires reference transcriptions to jointly train the separation and ASR networks. This is often not viable for training on real-world in-domain audio where reference transcript information is not always available. This paper proposes a transcription-free method for joint training using only audio signals. The proposed method uses embedding differences of pre-trained ASR encoders as a loss with a proposed modification to permutation invariant training (PIT) called guided PIT (GPIT). The method achieves a 6.4% improvement in word error rate (WER) measures over a signal-level loss and also shows enhancement improvements in perceptual measures such as short-time objective intelligibility (STOI).
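
论文采用的 PIT(permutation invariant training)思想是:在所有"估计信号-参考信号"的配对方式中取损失最小的一种。下面给出普通 PIT 的最小示意实现(论文的 guided PIT 与 ASR encoder 嵌入距离更复杂,此处距离函数由调用者提供,仅为示意):

```python
from itertools import permutations

def pit_loss(est_signals, ref_signals, dist):
    """Permutation invariant training (PIT) loss over separated outputs.

    Returns the minimum, over all pairings of estimated and reference
    signals, of the summed pairwise distances. The paper's guided PIT (GPIT)
    replaces `dist` with differences of pre-trained ASR encoder embeddings;
    this is a minimal sketch of plain PIT.
    """
    n = len(est_signals)
    return min(
        sum(dist(est_signals[i], ref_signals[p[i]]) for i in range(n))
        for p in permutations(range(n))
    )

# Toy example: 1-D "signals", absolute-difference distance.
est = [0.9, 2.1]
ref = [2.0, 1.0]
loss = pit_loss(est, ref, dist=lambda a, b: abs(a - b))  # best pairing swaps the outputs
```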

[LG-98] AdaPTwin: Low-Cost Adaptive Compression of Product Twins in Transformers

链接: https://arxiv.org/abs/2406.08904
作者: Emil Biju,Anirudh Sriram,Mert Pilanci
关键词: computational requirements make, speaker-independent speech recognition, exhibited remarkable performance, large transformer-based models, resource-constrained settings
类目: Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注: 12 pages, 3 figures, submitted to NeurIPS 2024

点击查看摘要

Abstract:While large transformer-based models have exhibited remarkable performance in speaker-independent speech recognition, their large size and computational requirements make them expensive or impractical to use in resource-constrained settings. In this work, we propose a low-rank adaptive compression technique called AdaPTwin that jointly compresses product-dependent pairs of weight matrices in the transformer attention layer. Our approach can prioritize the compressed model’s performance on a specific speaker while maintaining generalizability to new speakers and acoustic conditions. Notably, our technique requires only 8 hours of speech data for fine-tuning, which can be accomplished in under 20 minutes, making it highly cost-effective compared to other compression methods. We demonstrate the efficacy of our approach by compressing the Whisper and Distil-Whisper models by up to 45% while incurring less than a 2% increase in word error rate.
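
AdaPTwin 联合压缩注意力层中"乘积相关"的权重矩阵对(例如注意力分数只依赖乘积 W_Q W_Kᵀ)。以下用截断 SVD 给出这一思路的假设性草图(论文的自适应参数化与说话人定向微调流程并非如此简单,此处仅为示意):

```python
import numpy as np

def compress_pair(Wq, Wk, rank):
    """Jointly compress a product-dependent pair of attention weight matrices.

    Attention scores depend only on the product Wq @ Wk.T, so we take a
    truncated SVD of that product and return low-rank "twin" factors that
    reproduce it. Hedged sketch of the joint-compression idea; AdaPTwin's
    actual adaptive, fine-tuned parameterization differs.
    """
    U, s, Vt = np.linalg.svd(Wq @ Wk.T, full_matrices=False)
    A = U[:, :rank] * np.sqrt(s[:rank])        # (d, r) left twin
    B = Vt[:rank, :].T * np.sqrt(s[:rank])     # (d, r) right twin
    return A, B                                # A @ B.T approximates Wq @ Wk.T

rng = np.random.default_rng(0)
d, r = 16, 4
# Build a pair whose product is exactly rank 4, so rank-4 twins are lossless.
Wq = rng.normal(size=(d, r)) @ rng.normal(size=(r, d))
Wk = np.eye(d)
A, B = compress_pair(Wq, Wk, rank=r)
err = np.linalg.norm(A @ B.T - Wq @ Wk.T)
```

当乘积近似低秩时,两个 (d, r) 因子即可替代两个 (d, d) 矩阵,这正是压缩收益的来源。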

[LG-99] Computer Vision Approaches for Automated Bee Counting Application

链接: https://arxiv.org/abs/2406.08898
作者: Simon Bilik,Ilona Janakova,Adam Ligocki,Dominik Ficek,Karel Horak
关键词: computer vision techniques, colony health state, bee colony health, health state monitoring, colony health
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Many applications in bee colony health state monitoring could be efficiently solved using computer vision techniques. One such challenge is an efficient way of counting the number of incoming and outgoing bees, which could be used to further analyse trends such as the bee colony health state, blooming periods, or the effects of agricultural spraying. In this paper, we compare three methods for automated bee counting over two of our own datasets. The best-performing method is based on a ResNet-50 convolutional neural network classifier, which achieved an accuracy of 87% on the BUT1 dataset and 93% on the BUT2 dataset.

[LG-100] Motif-driven Subgraph Structure Learning for Graph Classification

链接: https://arxiv.org/abs/2406.08897
作者: Zhiyao Zhou,Sheng Zhou,Bochao Mao,Jiawei Chen,Qingyun Sun,Yan Feng,Chun Chen,Can Wang
关键词: graph classification, Structure Learning, subgraph structure learning, improve graph structure, structure
类目: Machine Learning (cs.LG)
*备注: 16 pages, 8 figures

点击查看摘要

Abstract:To mitigate the suboptimal nature of graph structure, Graph Structure Learning (GSL) has emerged as a promising approach to improve graph structure and boost performance in downstream tasks. Despite the proposal of numerous GSL methods, progress in this field has mostly concentrated on node-level tasks, while graph-level tasks (e.g., graph classification) remain largely unexplored. Notably, applying node-level GSL to graph classification is non-trivial due to the lack of fine-grained guidance for intricate structure learning. Inspired by the vital role of subgraphs in graph classification, in this paper we explore the potential of subgraph structure learning for graph classification by tackling the challenges of key subgraph selection and structure optimization. We propose a novel Motif-driven Subgraph Structure Learning method for Graph Classification (MOSGSL). Specifically, MOSGSL incorporates a subgraph structure learning module which can adaptively select important subgraphs. A motif-driven structure guidance module is further introduced to capture key subgraph-level structural patterns (motifs) and facilitate personalized structure learning. Extensive experiments demonstrate a significant and consistent improvement over baselines, as well as its flexibility and generalizability for various backbones and learning procedures.

[LG-101] The Penalized Inverse Probability Measure for Conformal Classification

链接: https://arxiv.org/abs/2406.08884
作者: Paul Melki(IMS),Lionel Bombrun(IMS),Boubacar Diallo,Jérôme Dias,Jean-Pierre da Costa(IMS)
关键词: machine learning systems, box neural networks, trustworthy machine learning, complex black box, black box neural
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:The deployment of safe and trustworthy machine learning systems, and particularly complex black box neural networks, in real-world applications requires reliable and certified guarantees on their performance. The conformal prediction framework offers such formal guarantees by transforming any point into a set predictor with valid, finite-set, guarantees on the coverage of the true at a chosen level of confidence. Central to this methodology is the notion of the nonconformity score function that assigns to each example a measure of ‘‘strangeness’’ in comparison with the previously seen observations. While the coverage guarantees are maintained regardless of the nonconformity measure, the point predictor and the dataset, previous research has shown that the performance of a conformal model, as measured by its efficiency (the average size of the predicted sets) and its informativeness (the proportion of prediction sets that are singletons), is influenced by the choice of the nonconformity score function. The current work introduces the Penalized Inverse Probability (PIP) nonconformity score, and its regularized version RePIP, that allow the joint optimization of both efficiency and informativeness. Through toy examples and empirical results on the task of crop and weed image classification in agricultural robotics, the current work shows how PIP-based conformal classifiers exhibit precisely the desired behavior in comparison with other nonconformity measures and strike a good balance between informativeness and efficiency.
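
作为 PIP/RePIP 的基线,经典的 inverse probability 非一致性得分为 s(x, y) = 1/p̂(y|x):在校准集上取分位数阈值,预测集合包含所有得分不超过阈值的类别。以下为 split conformal 分类的示意性 numpy 草图(PIP 在此得分上增加惩罚项,此处未实现):

```python
import numpy as np

def conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    """Split conformal classification with an inverse-probability nonconformity score.

    Score s(x, y) = 1 / p_hat(y | x): "strange" examples are those the model
    assigns low probability to. Calibrate a quantile on held-out data, then
    include in each prediction set every label whose score is below it.
    The paper's PIP / RePIP scores add a penalty on top of this baseline.
    """
    n = len(cal_labels)
    scores = 1.0 / cal_probs[np.arange(n), cal_labels]
    k = int(np.ceil((n + 1) * (1 - alpha)))        # conformal quantile index
    q = np.sort(scores)[min(k, n) - 1]
    return [np.where(1.0 / p <= q)[0] for p in test_probs]

cal_probs = np.array([[0.8, 0.2], [0.7, 0.3], [0.9, 0.1], [0.6, 0.4]])
cal_labels = np.array([0, 0, 0, 1])
test_probs = np.array([[0.85, 0.15], [0.5, 0.5]])
sets = conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.2)
```

论文衡量的 efficiency(预测集合平均大小)与 informativeness(单元素集合比例)正是对这里返回的集合作统计。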

[LG-102] CIMRL: Combining IMitation and Reinforcement Learning for Safe Autonomous Driving

链接: https://arxiv.org/abs/2406.08878
作者: Jonathan Booher,Khashayar Rohanimanesh,Junhong Xu,Aleksandr Petiushko
关键词: learned components trained, Modern approaches, driving rely heavily, Reinforcement Learning, large amounts
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Modern approaches to autonomous driving rely heavily on learned components trained with large amounts of human driving data via imitation learning. However, these methods require large amounts of expensive data collection and even then face challenges with safely handling long-tail scenarios and compounding errors over time. At the same time, pure Reinforcement Learning (RL) methods can fail to learn performant policies in sparse, constrained, and challenging-to-define reward settings like driving. Both of these challenges make deploying purely cloned policies in safety critical applications like autonomous vehicles challenging. In this paper we propose Combining IMitation and Reinforcement Learning (CIMRL) approach – a framework that enables training driving policies in simulation through leveraging imitative motion priors and safety constraints. CIMRL does not require extensive reward specification and improves on the closed loop behavior of pure cloning methods. By combining RL and imitation, we demonstrate that our method achieves state-of-the-art results in closed loop simulation driving benchmarks.

[LG-103] Research on Early Warning Model of Cardiovascular Disease Based on Computer Deep Learning

链接: https://arxiv.org/abs/2406.08864
作者: Yuxiang Hu,Jinxin Hu,Ting Xu,Bo Zhang,Jiajie Yuan,Haozhang Deng
关键词: risk early warning, early warning model, convolutional neural network, warning model based, cardiovascular disease risk
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 6 pages

点击查看摘要

Abstract:This project studies a cardiovascular disease risk early warning model based on one-dimensional convolutional neural networks. First, missing values of 13 physiological and symptom indicators, such as patient age, blood glucose, cholesterol, and chest pain, were filled in, and the data were Z-score standardized. The input is arranged as a 2D matrix for the convolutional neural network, convolution kernels of size 1, 3, and 5 are used for the first-order convolution operation, and the Max Pooling algorithm is adopted for dimension reduction. The learning rate and output rate are set, and the model is optimized by the Adam algorithm. The classification result is output by a softmax classifier. This study was conducted on the Statlog and heart disease datasets in the UCI repository. The empirical results indicate that the forecasting precision of this technique is enhanced by 11.2% relative to conventional approaches, with a significant improvement in logarithmic curve fitting. The efficacy and applicability of the approach are corroborated through these experiments with the one-dimensional convolutional neural network.
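
摘要描述的流程(Z-score 标准化 → 尺寸为 1/3/5 的一维卷积 → Max Pooling 降维)可以用如下 numpy 草图示意(卷积核取均值核、池化窗口取 2 均为本文假设,并非论文参数):

```python
import numpy as np

def zscore(x):
    """Z-score standardization of a feature vector."""
    return (x - x.mean()) / x.std()

def conv1d_valid(x, kernel):
    """First-order 1-D convolution with valid padding, as in the described pipeline."""
    k = len(kernel)
    return np.array([np.dot(x[i:i + k], kernel) for i in range(len(x) - k + 1)])

def max_pool(x, size=2):
    """Max pooling for dimension reduction (non-overlapping windows)."""
    return np.array([x[i:i + size].max() for i in range(0, len(x) - size + 1, size)])

# 13 standardized indicators, filtered with kernel sizes 1, 3, and 5, then pooled.
x = zscore(np.arange(13, dtype=float))
feats = [max_pool(conv1d_valid(x, np.ones(k) / k)) for k in (1, 3, 5)]
```

将三路池化后的特征拼接并送入分类头,即得到摘要中多核宽度一维 CNN 的基本骨架。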

[LG-104] Cognitively Inspired Energy-Based World Models

链接: https://arxiv.org/abs/2406.08862
作者: Alexi Gladstone,Ganesh Nanduru,Md Mofijul Islam,Aman Chadha,Jundong Li,Tariq Iqbal
关键词: Natural Language Processing, Large Language Models, Language Processing, Natural Language, Large Language
类目: Machine Learning (cs.LG)
*备注: 23 pages, 6 figures

点击查看摘要

Abstract:One of the predominant methods for training world models is autoregressive prediction in the output space of the next element of a sequence. In Natural Language Processing (NLP), this takes the form of Large Language Models (LLMs) predicting the next token; in Computer Vision (CV), this takes the form of autoregressive models predicting the next frame/token/pixel. However, this approach differs from human cognition in several respects. First, human predictions about the future actively influence internal cognitive processes. Second, humans naturally evaluate the plausibility of predictions regarding future states. Third, building on this capability, humans allocate a dynamic amount of time to make a prediction by assessing when it is sufficient. This adaptive process is analogous to System 2 thinking in psychology. All these capabilities are fundamental to the success of humans at high-level reasoning and planning. Therefore, to address the limitations of traditional autoregressive models lacking these human-like capabilities, we introduce Energy-Based World Models (EBWM). EBWM involves training an Energy-Based Model (EBM) to predict the compatibility of a given context and a predicted future state. In doing so, EBWM enables models to achieve all three facets of human cognition described. Moreover, we developed a variant of the traditional autoregressive transformer tailored for Energy-Based models, termed the Energy-Based Transformer (EBT). Our results demonstrate that EBWM scales better with data and GPU hours than traditional autoregressive transformers in CV, and that EBWM offers promising early scaling in NLP. Consequently, this approach offers an exciting path toward training future models capable of System 2 thinking and intelligently searching across state spaces.

[LG-105] OmniH2O: Universal and Dexterous Human-to-Humanoid Whole-Body Teleoperation and Learning

链接: https://arxiv.org/abs/2406.08858
作者: Tairan He,Zhengyi Luo,Xialin He,Wenli Xiao,Chong Zhang,Weinan Zhang,Kris Kitani,Changliu Liu,Guanya Shi
关键词: learning-based system, Omni, whole-body humanoid teleoperation, humanoid, humanoid whole-body
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: Project page: this https URL

点击查看摘要

Abstract:We present OmniH2O (Omni Human-to-Humanoid), a learning-based system for whole-body humanoid teleoperation and autonomy. Using kinematic pose as a universal control interface, OmniH2O enables various ways for a human to control a full-sized humanoid with dexterous hands, including using real-time teleoperation through VR headset, verbal instruction, and RGB camera. OmniH2O also enables full autonomy by learning from teleoperated demonstrations or integrating with frontier models such as GPT-4. OmniH2O demonstrates versatility and dexterity in various real-world whole-body tasks through teleoperation or autonomy, such as playing multiple sports, moving and manipulating objects, and interacting with humans. We develop an RL-based sim-to-real pipeline, which involves large-scale retargeting and augmentation of human motion datasets, learning a real-world deployable policy with sparse sensor input by imitating a privileged teacher policy, and reward designs to enhance robustness and stability. We release the first humanoid whole-body control dataset, OmniH2O-6, containing six everyday tasks, and demonstrate humanoid whole-body skill learning from teleoperated datasets.

[LG-106] Current applications and potential future directions of reinforcement learning-based Digital Twins in agriculture

链接: https://arxiv.org/abs/2406.08854
作者: Georg Goldenits,Kevin Mallinger,Sebastian Raubitzek,Thomas Neubauer
关键词: Digital Twins, agricultural Digital Twin, Digital Twin implementations, reinforcement learning, Digital
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Digital Twins have gained attention in various industries for simulation, monitoring, and decision-making, relying on ever-improving machine learning models. However, agricultural Digital Twin implementations are limited compared to other industries. Meanwhile, machine learning, particularly reinforcement learning, has shown potential in agricultural applications like optimizing decision-making, task automation, and resource management. A key aspect of Digital Twins is representing physical assets or systems in a virtual environment, which aligns well with reinforcement learning’s need for environment representations to learn the best policy for a task. Reinforcement learning in agriculture can thus enable various Digital Twin applications in agricultural domains. This review aims to categorize existing research employing reinforcement learning in agricultural settings by application domains like robotics, greenhouse management, irrigation systems, and crop management, identifying potential future areas for reinforcement learning-based Digital Twins. It also categorizes the reinforcement learning techniques used, including tabular methods, Deep Q-Networks (DQN), Policy Gradient methods, and Actor-Critic algorithms, to overview currently employed models. The review seeks to provide insights into the state-of-the-art in integrating Digital Twins and reinforcement learning in agriculture, identifying gaps and opportunities for future research, and exploring synergies to tackle agricultural challenges and optimize farming, paving the way for more efficient and sustainable farming methodologies.

[LG-107] Inverse Probability of Treatment Weighting with Deep Sequence Models Enables Accurate treatment effect Estimation from Electronic Health Records

链接: https://arxiv.org/abs/2406.08851
作者: Junghwan Lee,Simin Ma,Nicoleta Serban,Shihao Yang
关键词: Observational data, treatment effect, electronic health records, treatment, estimate treatment effect
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Observational data have been actively used to estimate treatment effect, driven by the growing availability of electronic health records (EHRs). However, EHRs typically consist of longitudinal records, often introducing time-dependent confounding that hinders the unbiased estimation of treatment effect. Inverse probability of treatment weighting (IPTW) is a widely used propensity score method since it provides unbiased treatment effect estimation and its derivation is straightforward. In this study, we aim to utilize IPTW to estimate treatment effect in the presence of time-dependent confounding using claims records. Previous studies have utilized propensity score methods with features derived from claims records through feature processing, which generally requires domain knowledge and additional resources to extract information to accurately estimate propensity scores. Deep sequence models, particularly recurrent neural networks and self-attention-based architectures, have demonstrated good performance in modeling EHRs for various downstream tasks. We propose that these deep sequence models can provide accurate IPTW estimation of treatment effect by directly estimating the propensity scores from claims records without the need for feature processing. We empirically demonstrate this by conducting comprehensive evaluations using synthetic and semi-synthetic datasets.
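The IPTW estimator at the core of this approach is simple to state: each subject is weighted by the inverse of the probability of receiving the treatment it actually received, and the weighted outcome means of the two groups are compared. A minimal sketch, assuming propensity scores have already been estimated (in the paper by a deep sequence model; the toy records below are invented):

```python
def iptw_ate(records):
    """Stabilized IPTW estimate of the average treatment effect.

    Each record is (treated, outcome, propensity), where the propensity
    e(x) = P(T=1 | X=x) comes from a fitted model (given here).
    """
    treated = sum(t * y / e for t, y, e in records) / \
              sum(t / e for t, y, e in records)
    control = sum((1 - t) * y / (1 - e) for t, y, e in records) / \
              sum((1 - t) / (1 - e) for t, y, e in records)
    return treated - control

# toy cohort: (treatment indicator, outcome, estimated propensity score)
cohort = [(1, 5.0, 0.8), (1, 4.0, 0.6), (0, 3.0, 0.4), (0, 2.0, 0.2)]
ate = iptw_ate(cohort)
```

This is the Hájek (self-normalized) form, which is more stable than the raw Horvitz-Thompson weighting when some propensities are extreme.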

[LG-108] Roping in Uncertainty: Robustness and Regularization in Markov Games

链接: https://arxiv.org/abs/2406.08847
作者: Jeremy McMahan,Giovanni Artiglio,Qiaomin Xie
关键词: study robust Markov, robust Markov games, robust Nash equilibrium, robust Markov, Markov games
类目: Computer Science and Game Theory (cs.GT); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注: Accepted to ICML 2024

点击查看摘要

Abstract:We study robust Markov games (RMG) with s-rectangular uncertainty. We show a general equivalence between computing a robust Nash equilibrium (RNE) of an s-rectangular RMG and computing a Nash equilibrium (NE) of an appropriately constructed regularized MG. The equivalence result yields a planning algorithm for solving s-rectangular RMGs, as well as provable robustness guarantees for policies computed using regularized methods. However, we show that even for just reward-uncertain two-player zero-sum matrix games, computing an RNE is PPAD-hard. Consequently, we derive a special uncertainty structure called efficient player-decomposability and show that RNE for two-player zero-sum RMGs in this class can be provably solved in polynomial time. This class includes commonly used uncertainty sets such as L1 and L∞ ball uncertainty sets.

[LG-109] Conceptual Learning via Embedding Approximations for Reinforcing Interpretability and Transparency

链接: https://arxiv.org/abs/2406.08840
作者: Maor Dikter,Tsachi Blau,Chaim Baskin
关键词: emerged as critical, critical tools, tools in domains
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Concept bottleneck models (CBMs) have emerged as critical tools in domains where interpretability is paramount. These models rely on predefined textual descriptions, referred to as concepts, to inform their decision-making process and offer more accurate reasoning. As a result, the selection of concepts used in the model is of utmost significance. This study proposes Conceptual Learning via Embedding Approximations for Reinforcing Interpretability and Transparency, abbreviated as CLEAR, a framework for constructing a CBM for image classification. Using score matching and Langevin sampling, we approximate the embedding of concepts within the latent space of a vision-language model (VLM) by learning the scores associated with the joint distribution of images and concepts. A concept selection process is then employed to optimize the similarity between the learned embeddings and the predefined ones. The derived bottleneck offers insights into the CBM’s decision-making process, enabling more comprehensive interpretations. Our approach was evaluated through extensive experiments and achieved state-of-the-art performance on various benchmarks. The code for our experiments is available at this https URL

[LG-110] Research on Optimization of Natural Language Processing Model Based on Multimodal Deep Learning

链接: https://arxiv.org/abs/2406.08838
作者: Dan Sun,Yaxin Liang,Yining Yang,Yuhan Ma,Qishi Zhan,Erdi Gao
关键词: multimodal data, based on attention, attention mechanism, mechanism and multimodal, image representation based
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This project intends to study image representation based on attention mechanisms and multimodal data. By adding multiple pattern layers to the attribute model, the semantic and hidden layers of image content are integrated. The word vector is quantified by the Word2Vec method and then evaluated by a word embedding convolutional neural network. The published experimental results of the two groups were tested. The experimental results show that this method can convert discrete features into continuous characters, thus reducing the complexity of feature preprocessing. Word2Vec and natural language processing technology are integrated to achieve the goal of direct evaluation of missing image features. The robustness of the image feature evaluation model is improved by using the excellent feature analysis characteristics of a convolutional neural network. This project intends to improve the existing image feature identification methods and eliminate the subjective influence in the evaluation process. The findings from the simulation indicate that the novel approach developed here is viable, effectively augmenting the features within the produced representations.

[LG-111] Center-Sensitive Kernel Optimization for Efficient On-Device Incremental Learning

链接: https://arxiv.org/abs/2406.08830
作者: Dingwen Zhang,Yan Li,De Cheng,Nannan Wang,Junwei Han
关键词: on-device incremental learning, on-device training methods, limited computation resource, incremental learning constrained, on-device training
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:To facilitate the evolution of edge intelligence in ever-changing environments, we study on-device incremental learning constrained in limited computation resource in this paper. Current on-device training methods just focus on efficient training without considering the catastrophic forgetting, preventing the model getting stronger when continually exploring the world. To solve this problem, a direct solution is to involve the existing incremental learning mechanisms into the on-device training framework. Unfortunately, such a manner cannot work well as those mechanisms usually introduce large additional computational cost to the network optimization process, which would inevitably exceed the memory capacity of the edge devices. To address this issue, this paper makes an early effort to propose a simple but effective edge-friendly incremental learning framework. Based on an empirical study on the knowledge intensity of the kernel elements of the neural network, we find that the center kernel is the key for maximizing the knowledge intensity for learning new data, while freezing the other kernel elements would get a good balance on the model’s capacity for overcoming catastrophic forgetting. Upon this finding, we further design a center-sensitive kernel optimization framework to largely alleviate the cost of the gradient computation and back-propagation. Besides, a dynamic channel element selection strategy is also proposed to facilitate a sparse orthogonal gradient projection for further reducing the optimization complexity, upon the knowledge explored from the new task data. Extensive experiments validate our method is efficient and effective, e.g., our method achieves average accuracy boost of 38.08% with even less memory and approximate computation compared to existing on-device training methods, indicating its significant potential for on-device incremental learning.
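The key finding, that updating only the center kernel element while freezing the rest balances plasticity and forgetting, can be illustrated with a single gradient step on a 3x3 kernel (a toy sketch with invented values; the actual framework also includes dynamic channel selection and sparse orthogonal gradient projection, which are not shown):

```python
def masked_update(kernel, grads, lr=0.1):
    """Apply a gradient step to the center element of a square kernel only.

    Freezing the off-center elements limits back-propagation cost while
    keeping the element with the highest knowledge intensity trainable.
    """
    c = len(kernel) // 2
    out = [row[:] for row in kernel]  # copy; frozen entries stay intact
    out[c][c] -= lr * grads[c][c]
    return out

k = [[0.1, 0.2, 0.1],
     [0.0, 0.5, 0.0],
     [0.1, 0.2, 0.1]]
g = [[1.0] * 3 for _ in range(3)]  # pretend gradient from the new task
k2 = masked_update(k, g)
```

Only `k2[1][1]` changes; all frozen entries (and the original kernel) are untouched, which is what saves the gradient-computation and memory cost on edge devices.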

[LG-112] AIM: Attributing Interpreting Mitigating Data Unfairness

链接: https://arxiv.org/abs/2406.08819
作者: Zhining Liu,Ruizhong Qiu,Zhichen Zeng,Yada Zhu,Hendrik Hamann,Hanghang Tong
关键词: encapsulates historical discrimination, real world, world often encapsulates, discrimination against disadvantaged, Data collected
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: 12 pages, 6 figures, accepted by ACM SIGKDD 2024

点击查看摘要

Abstract:Data collected in the real world often encapsulates historical discrimination against disadvantaged groups and individuals. Existing fair machine learning (FairML) research has predominantly focused on mitigating discriminative bias in the model prediction, with far less effort dedicated towards exploring how to trace biases present in the data, despite its importance for the transparency and interpretability of FairML. To fill this gap, we investigate a novel research problem: discovering samples that reflect biases/prejudices from the training data. Grounding on the existing fairness notions, we lay out a sample bias criterion and propose practical algorithms for measuring and countering sample bias. The derived bias score provides intuitive sample-level attribution and explanation of historical bias in data. On this basis, we further design two FairML strategies via sample-bias-informed minimal data editing. They can mitigate both group and individual unfairness at the cost of minimal or zero predictive utility loss. Extensive experiments and analyses on multiple real-world datasets demonstrate the effectiveness of our methods in explaining and mitigating unfairness. Code is available at this https URL.

[LG-113] A Dual Approach to Imitation Learning from Observations with Offline Datasets

链接: https://arxiv.org/abs/2406.08805
作者: Harshit Sikchi,Caleb Chuck,Amy Zhang,Scott Niekum
关键词: effective alternative, alternative to task, task specification, designing a reward, learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: Under submission. 23 pages

点击查看摘要

Abstract:Demonstrations are an effective alternative to task specification for learning agents in settings where designing a reward function is difficult. However, demonstrating expert behavior in the action space of the agent becomes unwieldy when robots have complex, unintuitive morphologies. We consider the practical setting where an agent has a dataset of prior interactions with the environment and is provided with observation-only expert demonstrations. Typical learning from observations approaches have required either learning an inverse dynamics model or a discriminator as intermediate steps of training. Errors in these intermediate one-step models compound during downstream policy learning or deployment. We overcome these limitations by directly learning a multi-step utility function that quantifies how each action impacts the agent’s divergence from the expert’s visitation distribution. Using the principle of duality, we derive DILO (Dual Imitation Learning from Observations), an algorithm that can leverage arbitrary suboptimal data to learn imitating policies without requiring expert actions. DILO reduces the learning from observations problem to that of simply learning an actor and a critic, bearing similar complexity to vanilla offline RL. This allows DILO to gracefully scale to high dimensional observations, and demonstrate improved performance across the board. Project page (code and videos): this https URL

[LG-114] Can Synthetic Audio From Generative Foundation Models Assist Audio Recognition and Speech Modeling?

链接: https://arxiv.org/abs/2406.08800
作者: Tiantian Feng,Dimitrios Dimitriadis,Shrikanth Narayanan
关键词: produce high-fidelity sounds, Recent advances, enabled audio-generative models, Frechet Audio Distance, human actions
类目: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: Accepted to 2024 INTERSPEECH

点击查看摘要

Abstract:Recent advances in foundation models have enabled audio-generative models that produce high-fidelity sounds associated with music, events, and human actions. Despite the success achieved in modern audio-generative models, the conventional approach to assessing the quality of the audio generation relies heavily on distance metrics like Frechet Audio Distance. In contrast, we aim to evaluate the quality of audio generation by examining the effectiveness of using them as training data. Specifically, we conduct studies to explore the use of synthetic audio for audio recognition. Moreover, we investigate whether synthetic audio can serve as a resource for data augmentation in speech-related modeling. Our comprehensive experiments demonstrate the potential of using synthetic audio for audio recognition and speech-related modeling. Our code is available at this https URL.

[LG-115] Pareto Front-Diverse Batch Multi-Objective Bayesian Optimization

链接: https://arxiv.org/abs/2406.08799
作者: Alaleh Ahmadianshalchi,Syrine Belakaria,Janardhan Rao Doppa
关键词: diverse Pareto fronts, goal of discovering, discovering high-quality, allowed to evaluate, Pareto fronts
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Published at AAAI Conference on Artificial Intelligence, 2024

点击查看摘要

Abstract:We consider the problem of multi-objective optimization (MOO) of expensive black-box functions with the goal of discovering high-quality and diverse Pareto fronts where we are allowed to evaluate a batch of inputs. This problem arises in many real-world applications including penicillin production where diversity of solutions is critical. We solve this problem in the framework of Bayesian optimization (BO) and propose a novel approach referred to as Pareto front-Diverse Batch Multi-Objective BO (PDBO). PDBO tackles two important challenges: 1) How to automatically select the best acquisition function in each BO iteration, and 2) How to select a diverse batch of inputs by considering multiple objectives. We propose principled solutions to address these two challenges. First, PDBO employs a multi-armed bandit approach to select one acquisition function from a given library. We solve a cheap MOO problem by assigning the selected acquisition function for each expensive objective function to obtain a candidate set of inputs for evaluation. Second, it utilizes Determinantal Point Processes (DPPs) to choose a Pareto-front-diverse batch of inputs for evaluation from the candidate set obtained from the first step. The key parameters for the methods behind these two steps are updated after each round of function evaluations. Experiments on multiple MOO benchmarks demonstrate that PDBO outperforms prior methods in terms of both the quality and diversity of Pareto solutions.
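The second step above, DPP-based batch selection, is commonly approximated by greedily adding the candidate that most increases the determinant of the kernel submatrix of the chosen set. A minimal sketch under an RBF kernel (the candidate points, batch size, and `gamma` are invented; PDBO's bandit-based acquisition selection is not shown):

```python
import math

def rbf(u, v, gamma=1.0):
    """Gaussian similarity kernel between two points."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

def det(m):
    """Determinant via Laplace expansion (fine for small batches)."""
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * m[0][j] *
               det([row[:j] + row[j + 1:] for row in m[1:]])
               for j in range(len(m)))

def greedy_dpp(points, batch_size):
    """Greedy MAP approximation for a Determinantal Point Process:
    repeatedly add the point maximizing det of the kernel submatrix,
    which favors diverse (dissimilar) batches."""
    chosen = []
    while len(chosen) < batch_size:
        best = max((p for p in points if p not in chosen),
                   key=lambda p: det([[rbf(a, b) for b in chosen + [p]]
                                      for a in chosen + [p]]))
        chosen.append(best)
    return chosen

candidates = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (2.0, 0.0)]
batch = greedy_dpp(candidates, 2)
```

With these toy points, the greedy step skips the near-duplicate (0.1, 0.0) and pairs the two most mutually distant candidates, which is exactly the diversity behavior PDBO relies on.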

[LG-116] Understanding the Generalizability of Link Predictors Under Distribution Shifts on Graphs

链接: https://arxiv.org/abs/2406.08788
作者: Jay Revolinsky,Harry Shomer,Jiliang Tang
关键词: multiple models proposed, demonstrate impressive results, multiple models, link prediction, demonstrate impressive
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 19 pages, 4 figures, 14 tables, submitted to NeurIPS - Datasets Benchmarks Track 2024

点击查看摘要

Abstract:Recently, multiple models proposed for link prediction (LP) demonstrate impressive results on benchmark datasets. However, many popular benchmark datasets often assume that dataset samples are drawn from the same distribution (i.e., IID samples). In real-world situations, this assumption is often incorrect; since uncontrolled factors may lead train and test samples to come from separate distributions. To tackle the distribution shift problem, recent work focuses on creating datasets that feature distribution shifts and designing generalization methods that perform well on the new data. However, those studies only consider distribution shifts that affect node- and graph-level tasks, thus ignoring link-level tasks. Furthermore, relatively few LP generalization methods exist. To bridge this gap, we introduce a set of LP-specific data splits which utilizes structural properties to induce a controlled distribution shift. We verify the shift’s effect empirically through evaluation of different SOTA LP methods and subsequently couple these methods with generalization techniques. Interestingly, LP-specific methods frequently generalize poorly relative to heuristics or basic GNN methods. Finally, this work provides analysis to uncover insights for enhancing LP generalization. Our code is available at: this https URL

[LG-117] LLM-based Knowledge Pruning for Time Series Data Analytics on Edge-computing Devices

链接: https://arxiv.org/abs/2406.08765
作者: Ruibing Jin,Qing Xu,Min Wu,Yuecong Xu,Dan Li,Xiaoli Li,Zhenghua Chen
关键词: time series data, series data, time series, neural networks trained, show unsatisfacotry performances
类目: Machine Learning (cs.LG)
*备注: 12 pages, 5 figures

点击查看摘要

Abstract:Limited by the scale and diversity of time series data, the neural networks trained on time series data often overfit and show unsatisfactory performances. In comparison, large language models (LLMs) recently exhibit impressive generalization in diverse fields. Although massive LLM based approaches are proposed for time series tasks, these methods require loading the whole LLM in both training and inference. These high computational demands limit practical applications in resource-constrained settings, like edge-computing and IoT devices. To address this issue, we propose Knowledge Pruning (KP), a novel paradigm for time series learning in this paper. For a specific downstream task, we argue that the world knowledge learned by LLMs is much redundant and only the related knowledge termed as “pertinent knowledge” is useful. Unlike other methods, our KP aims to prune the redundant knowledge and only distill the pertinent knowledge into the target model. This reduces model size and computational costs significantly. Additionally, different from existing LLM based approaches, our KP does not require loading the LLM in the process of training and testing, further easing computational burdens. With our proposed KP, a lightweight network can effectively learn the pertinent knowledge, achieving satisfactory performances with a low computation cost. To verify the effectiveness of our KP, two fundamental tasks on edge-computing devices are investigated in our experiments, where eight diverse environments or benchmarks with different networks are used to verify the generalization of our KP. Through experiments, our KP demonstrates effective learning of pertinent knowledge, achieving notable performance improvements in regression (19.7% on average) and classification (up to 13.7%) tasks, showcasing state-of-the-art results.
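Knowledge Pruning distills only the "pertinent knowledge" into a lightweight model; the temperature-softened distillation loss such a transfer typically builds on can be sketched as follows (illustrative logits and temperature; the pruning/selection step itself is the paper's contribution and is not shown):

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    zs = [z / temperature for z in logits]
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    s = sum(exps)
    return [e / s for e in exps]

def distill_kl(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, the usual
    objective for distilling a large model into a small one."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# identical predictions cost nothing; disagreement is penalized
loss_same = distill_kl([2.0, 0.5, -1.0], [2.0, 0.5, -1.0])
loss_diff = distill_kl([2.0, 0.5, -1.0], [-1.0, 0.5, 2.0])
```

The higher temperature flattens the teacher distribution so that the student also learns the relative ordering of wrong classes, not just the argmax.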

[LG-118] Optimizing Large Model Training through Overlapped Activation Recomputation

链接: https://arxiv.org/abs/2406.08756
作者: Ping Chen,Wenjie Zhang,Shuibing He,Yingjie Gu,Zhuwei Peng,Kexin Huang,Xuan Zhan,Weijian Chen,Yi Zheng,Zhefeng Wang,Yanlong Yin,Gang Chen
关键词: parallelism of data, alleviate the memory, memory pressure, pressure and pipelining, pipelining to exploit
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: 13 pages

点击查看摘要

Abstract:Large model training has been using recomputation to alleviate the memory pressure and pipelining to exploit the parallelism of data, tensor, and devices. The existing recomputation approaches may incur up to 40% overhead when training real-world models, e.g., the GPT model with 22B parameters. This is because they are executed on demand in the critical training path. In this paper, we design a new recomputation framework, Lynx, to reduce the overhead by overlapping the recomputation with communication occurring in training pipelines. It consists of an optimal scheduling algorithm (OPT) and a heuristic-based scheduling algorithm (HEU). OPT achieves a global optimum but suffers from a long search time. HEU was designed based on our observation that there are identical structures in large DNN models so that we can apply the same scheduling policy to all identical structures. HEU achieves a local optimum but reduces the search time by 99% compared to OPT. Our comprehensive evaluation using GPT models with 1.3B-20B parameters shows that both OPT and HEU outperform the state-of-the-art recomputation approaches (e.g., Megatron-LM and Checkmate) by 1.02-1.53x. HEU achieves a similar performance as OPT with a search time of 0.16s on average.
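The core idea of hiding recomputation inside pipeline communication can be sketched with a first-fit heuristic (all durations below are invented; Lynx's OPT and HEU schedulers are considerably more sophisticated than this):

```python
def overlap_schedule(recompute_ops, comm_windows):
    """First-fit heuristic: hide recomputation ops inside idle
    communication windows of the training pipeline. Whatever does not
    fit stays exposed on the critical training path.
    """
    windows = comm_windows[:]      # remaining idle time per window
    exposed = 0.0                  # recompute time left on critical path
    for op in recompute_ops:
        for i, free in enumerate(windows):
            if op <= free:
                windows[i] = free - op   # op runs during communication
                break
        else:
            exposed += op                # no window fits: pay full cost
    return exposed

# recomputation costs (ms) vs. idle time during pipeline communication
exposed = overlap_schedule([3.0, 2.0, 4.0], [5.0, 2.5])
```

Here 5 ms of recomputation is hidden inside the first window and only the 4 ms op remains exposed, illustrating why overlap-aware scheduling shrinks the recomputation overhead.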

[LG-119] Mathematical models for off-ball scoring prediction in basketball

链接: https://arxiv.org/abs/2406.08749
作者: Rikako Kono,Keisuke Fujii
关键词: off-ball scoring opportunities, off-ball scoring, scoring opportunities based, scoring opportunities, scoring
类目: Machine Learning (cs.LG)
*备注: 12 pages

点击查看摘要

Abstract:In professional basketball, the accurate prediction of scoring opportunities based on strategic decision-making is crucial for space and player evaluations. However, traditional models often face challenges in accounting for the complexities of off-ball movements, which are essential for accurate predictive performance. In this study, we propose two mathematical models to predict off-ball scoring opportunities in basketball, considering both pass-to-score and dribble-to-score movements: the Ball Movement for Off-ball Scoring (BMOS) and the Ball Intercept and Movement for Off-ball Scoring (BIMOS) models. The BMOS adapts principles from the Off-Ball Scoring Opportunities (OBSO) model, originally designed for soccer, to basketball, whereas the BIMOS also incorporates the likelihood of interception during ball movements. We evaluated these models using player tracking data from 630 NBA games in the 2015-2016 regular season, demonstrating that the BIMOS outperforms the BMOS in terms of scoring prediction accuracy. Thus, our models provide valuable insights for tactical analysis and player evaluation in basketball.
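The BMOS-style decomposition, adapted from soccer's OBSO model, scores a court location as the product of three probabilities: the ball moving there, the off-ball player controlling it there, and scoring from there. A toy sketch (all location names and probabilities below are invented for illustration):

```python
def off_ball_score_prob(transition, control, score):
    """OBSO/BMOS-style factorization of an off-ball scoring opportunity:
    P(opportunity at r) = P(ball moves to r) * P(control at r) * P(score | r).
    """
    return transition * control * score

# toy court grid: (P(ball moves there), P(player controls), P(score))
grid = {
    "corner_three":    (0.20, 0.70, 0.38),
    "restricted_area": (0.10, 0.50, 0.62),
    "mid_range":       (0.30, 0.80, 0.40),
}
best = max(grid, key=lambda loc: off_ball_score_prob(*grid[loc]))
```

The BIMOS variant would multiply in one more factor, the probability that the pass or dribble is not intercepted en route, which is why it is reported to predict scoring more accurately.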

[LG-120] Learning in Feature Spaces via Coupled Covariances: Asymmetric Kernel SVD and Nyström method

链接: https://arxiv.org/abs/2406.08748
作者: Qinghua Tao,Francesco Tonin,Alex Lambert,Yingyi Chen,Panagiotis Patrinos,Johan A.K. Suykens
关键词: Principal Component Analysis, Kernel Principal Component, Singular Value Decomposition, Mercer kernel-based approaches, Component Analysis
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: 19 pages, 9 tables, 6 figures

点击查看摘要

Abstract:In contrast with Mercer kernel-based approaches as used, e.g., in Kernel Principal Component Analysis (KPCA), it was previously shown that Singular Value Decomposition (SVD) inherently relates to asymmetric kernels, and Asymmetric Kernel Singular Value Decomposition (KSVD) has been proposed. However, the existing formulation of KSVD cannot work with infinite-dimensional feature mappings, the variational objective can be unbounded, and further numerical evaluation and exploration towards machine learning are needed. In this work, i) we introduce a new asymmetric learning paradigm based on coupled covariance eigenproblem (CCE) through covariance operators, allowing infinite-dimensional feature maps. The solution to CCE is ultimately obtained from the SVD of the induced asymmetric kernel matrix, providing links to KSVD. ii) Starting from the integral equations corresponding to a pair of coupled adjoint eigenfunctions, we formalize the asymmetric Nyström method through a finite sample approximation to speed up training. iii) We provide the first empirical evaluations verifying the practical utility and benefits of KSVD and compare with methods resorting to symmetrization or linear SVD across multiple tasks.

[LG-121] Generalizable Implicit Neural Representation As a Universal Spatiotemporal Traffic Data Learner

链接: https://arxiv.org/abs/2406.08743
作者: Tong Nie,Guoyang Qin,Wei Ma,Jian Sun
关键词: Spatiotemporal Traffic Data, Spatiotemporal Implicit Neural, Generalized Traffic Data, Traffic Data Learner, Traffic Data
类目: Machine Learning (cs.LG)
*备注: Accepted by the Conference in Emerging Technologies in Transportation Systems (TRC-30). arXiv admin note: substantial text overlap with arXiv:2405.03185

点击查看摘要

Abstract: This is the conference version of our paper: Spatiotemporal Implicit Neural Representation as a Generalized Traffic Data Learner. Spatiotemporal Traffic Data (STTD) measures the complex dynamical behaviors of the multiscale transportation system. Existing methods aim to reconstruct STTD using low-dimensional models. However, they are limited to data-specific dimensions or source-dependent patterns, restricting them from unifying representations. Here, we present a novel paradigm to address the STTD learning problem by parameterizing STTD as an implicit neural representation. To discern the underlying dynamics in low-dimensional regimes, coordinate-based neural networks that can encode high-frequency structures are employed to directly map coordinates to traffic variables. To unravel the entangled spatial-temporal interactions, the variability is decomposed into separate processes. We further enable modeling in irregular spaces such as sensor graphs using spectral embedding. Through continuous representations, our approach enables the modeling of a variety of STTD with a unified input, thereby serving as a generalized learner of the underlying traffic dynamics. It is also shown that it can learn implicit low-rank priors and smoothness regularization from the data, making it versatile for learning different dominating data patterns. We validate its effectiveness through extensive experiments in real-world scenarios, showcasing applications from corridor to network scales. Empirical results not only indicate that our model has significant superiority over conventional low-rank models, but also highlight the versatility of the approach. We anticipate that this pioneering modeling perspective could lay the foundation for universal representation of STTD in various real-world tasks. The full version can be found at: this https URL.

[LG-122] An AI Architecture with the Capability to Explain Recognition Results

链接: https://arxiv.org/abs/2406.08740
作者: Paul Whitten,Francis Wolff,Chris Papachristou
关键词: machine learning, needed to establish, establish confidence, machine learning models, machine learning methods
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Explainability is needed to establish confidence in machine learning results. Some explainable methods take a post hoc approach to explain the weights of machine learning models; others highlight areas of the input contributing to decisions. These methods do not adequately explain decisions in plain terms. Explainable property-based systems have been shown to provide explanations in plain terms; however, they have not performed as well as leading unexplainable machine learning methods. This research focuses on the importance of metrics to explainability and contributes two methods yielding performance gains. The first method introduces a combination of explainable and unexplainable flows, proposing a metric to characterize the explainability of a decision. The second method compares classic metrics for estimating the effectiveness of neural networks in the system, posing a new metric as the leading performer. Results from the new methods and examples from handwritten datasets are presented.

[LG-123] At the edge of a generative cultural precipice

链接: https://arxiv.org/abs/2406.08739
作者: Diego Porres,Alex Gomez-Villa
关键词: Stable Diffusion, threatened and stolen, large generative models, NFTs and large, jobs threatened
类目: Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注: Accepted at the CVPR Fourth Workshop on Ethical Considerations in Creative applications of Computer Vision

点击查看摘要

Abstract:Since NFTs and large generative models (such as DALLE2 and Stable Diffusion) have been publicly available, artists have seen their jobs threatened and stolen. While artists depend on sharing their art on online platforms such as Deviantart, Pixiv, and Artstation, many slowed down sharing their work or downright removed their past work therein, especially if these platforms fail to provide certain guarantees regarding the copyright of their uploaded work. Text-to-image (T2I) generative models are trained using human-produced content to better guide the style and themes they can produce. Still, if the trend continues where data found online is generated by a machine instead of a human, this will have vast repercussions in culture. Inspired by recent work in generative models, we wish to tell a cautionary tale and ask what will happen to the visual arts if generative models continue on the path to be (eventually) trained solely on generated content.

[LG-124] Introducing Diminutive Causal Structure into Graph Representation Learning

链接: https://arxiv.org/abs/2406.08709
作者: Hang Gao,Peng Qiao,Yifan Jin,Fengge Wu,Jiangmeng Li,Changwen Zheng
关键词: Graph Neural Networks, Neural Networks, accurately capturing authentic, Graph Neural, capturing authentic data
类目: Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:When engaging in end-to-end graph representation learning with Graph Neural Networks (GNNs), the intricate causal relationships and rules inherent in graph data pose a formidable challenge for the model in accurately capturing authentic data relationships. A proposed mitigating strategy involves the direct integration of rules or relationships corresponding to the graph data into the model. However, within the domain of graph representation learning, the inherent complexity of graph data obstructs the derivation of a comprehensive causal structure that encapsulates universal rules or relationships governing the entire dataset. Instead, only specialized diminutive causal structures, delineating specific causal relationships within constrained subsets of graph data, emerge as discernible. Motivated by empirical insights, it is observed that GNN models exhibit a tendency to converge towards such specialized causal structures during the training process. Consequently, we posit that the introduction of these specific causal structures is advantageous for the training of GNN models. Building upon this proposition, we introduce a novel method that enables GNN models to glean insights from these specialized diminutive causal structures, thereby enhancing overall performance. Our method specifically extracts causal knowledge from the model representation of these diminutive causal structures and incorporates interchange intervention to optimize the learning process. Theoretical analysis serves to corroborate the efficacy of our proposed method. Furthermore, empirical experiments consistently demonstrate significant performance improvements across diverse datasets.

[LG-125] Global AI Governance in Healthcare: A Cross-Jurisdictional Regulatory Analysis

链接: https://arxiv.org/abs/2406.08695
作者: Attrayee Chakraborty,Mandar Karhade
关键词: Artificial Intelligence, AI-enabled medical devices, North America dominate, North America, South East Asia
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 32 pages, 8 figures, 5 tables

点击查看摘要

Abstract:Artificial Intelligence (AI) is being adopted across the world and promises a new revolution in healthcare. While AI-enabled medical devices in North America dominate 42.3% of the global market, the use of AI-enabled medical devices in other countries is still a story waiting to be unfolded. We aim to delve deeper into global regulatory approaches towards AI use in healthcare, with a focus on how common themes are emerging globally. We compare these themes to the World Health Organization’s (WHO) regulatory considerations and principles on ethical use of AI for healthcare applications. Our work seeks to take a global perspective on AI policy by analyzing 14 legal jurisdictions including countries representative of various regions in the world (North America, South America, South East Asia, Middle East, Africa, Australia, and the Asia-Pacific). Our eventual goal is to foster a global conversation on the ethical use of AI in healthcare and the regulations that will guide it. We propose solutions to promote international harmonization of AI regulations and examine the requirements for regulating generative AI, using China and Singapore as examples of countries with well-developed policies in this area.

[LG-126] UnO: Unsupervised Occupancy Fields for Perception and Forecasting

链接: https://arxiv.org/abs/2406.08691
作者: Ben Agro,Quinlan Sykora,Sergio Casas,Thomas Gilles,Raquel Urtasun
关键词: future state, Perceiving the world, Perceiving, world, critical task
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Perceiving the world and forecasting its future state is a critical task for self-driving. Supervised approaches leverage annotated object labels to learn a model of the world – traditionally with object detections and trajectory predictions, or temporal bird’s-eye-view (BEV) occupancy fields. However, these annotations are expensive and typically limited to a set of predefined categories that do not cover everything we might encounter on the road. Instead, we learn to perceive and forecast a continuous 4D (spatio-temporal) occupancy field with self-supervision from LiDAR data. This unsupervised world model can be easily and effectively transferred to downstream tasks. We tackle point cloud forecasting by adding a lightweight learned renderer and achieve state-of-the-art performance in Argoverse 2, nuScenes, and KITTI. To further showcase its transferability, we fine-tune our model for BEV semantic occupancy forecasting and show that it outperforms the fully supervised state-of-the-art, especially when labeled data is scarce. Finally, when compared to prior state-of-the-art on spatio-temporal geometric occupancy prediction, our 4D world model achieves a much higher recall of objects from classes relevant to self-driving.

[LG-127] HelpSteer2: Open-source dataset for training top-performing reward models

链接: https://arxiv.org/abs/2406.08673
作者: Zhilin Wang,Yi Dong,Olivier Delalleau,Jiaqi Zeng,Gerald Shen,Daniel Egert,Jimmy J. Zhang,Makesh Narsimhan Sreedhar,Oleksii Kuchaiev
关键词: guide large language, generating high-quality responses, High-quality preference datasets, high-quality responses aligned, generating high-quality
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:High-quality preference datasets are essential for training reward models that can effectively guide large language models (LLMs) in generating high-quality responses aligned with human preferences. As LLMs become stronger and better aligned, permissively licensed preference datasets, such as Open Assistant, HH-RLHF, and HelpSteer need to be updated to remain effective for reward modeling. Methods that distil preference data from proprietary LLMs such as GPT-4 have restrictions on commercial usage imposed by model providers. To improve upon both generated responses and attribute labeling quality, we release HelpSteer2, a permissively licensed preference dataset (CC-BY-4.0). Using a powerful internal base model trained on HelpSteer2, we are able to achieve the SOTA score (92.0%) on Reward-Bench’s primary dataset, outperforming currently listed open and proprietary models, as of June 12th, 2024. Notably, HelpSteer2 consists of only ten thousand response pairs, an order of magnitude fewer than existing preference datasets (e.g., HH-RLHF), which makes it highly efficient for training reward models. Our extensive experiments demonstrate that reward models trained with HelpSteer2 are effective in aligning LLMs. In particular, we propose SteerLM 2.0, a model alignment approach that can effectively make use of the rich multi-attribute score predicted by our reward models. HelpSteer2 is available at this https URL and code is available at this https URL
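For context on how preference pairs like those in HelpSteer2 train a reward model, here is a minimal numpy sketch of the standard Bradley-Terry pairwise loss. Note this generic loss is only an illustrative assumption: the paper's SteerLM 2.0 approach instead exploits the dataset's multi-attribute scores.

```python
import numpy as np

def pairwise_reward_loss(r_chosen, r_rejected):
    """Bradley-Terry style loss: push the reward of the preferred response
    above the rejected one. Inputs are arrays of scalar rewards."""
    margin = np.asarray(r_chosen, dtype=float) - np.asarray(r_rejected, dtype=float)
    return float(np.mean(np.log1p(np.exp(-margin))))  # mean of -log sigmoid(margin)

# Toy rewards from a hypothetical reward model on 3 response pairs.
good = np.array([2.0, 1.5, 0.3])
bad = np.array([0.5, 1.0, 0.4])
loss = pairwise_reward_loss(good, bad)
print(loss > 0)  # loss shrinks as preferred responses score higher
```

The larger the reward margin in favor of the chosen response, the smaller the loss, which is what drives the reward model toward human preference rankings.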

[LG-128] Interventional Causal Discovery in a Mixture of DAGs

链接: https://arxiv.org/abs/2406.08666
作者: Burak Varıcı,Dmitriy Katz-Rogozhnikov,Dennis Wei,Prasanna Sattigeri,Ali Tajer
关键词: single causal graph, Causal interactions, co-existing causal graphs, Causal, learning causal interactions
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Causal interactions among a group of variables are often modeled by a single causal graph. In some domains, however, these interactions are best described by multiple co-existing causal graphs, e.g., in dynamical systems or genomics. This paper addresses the hitherto unknown role of interventions in learning causal interactions among variables governed by a mixture of causal systems, each modeled by one directed acyclic graph (DAG). Causal discovery from mixtures is fundamentally more challenging than single-DAG causal discovery. Two major difficulties stem from (i) inherent uncertainty about the skeletons of the component DAGs that constitute the mixture and (ii) possibly cyclic relationships across these component DAGs. This paper addresses these challenges and aims to identify edges that exist in at least one component DAG of the mixture, referred to as true edges. First, it establishes matching necessary and sufficient conditions on the size of interventions required to identify the true edges. Next, guided by the necessity results, an adaptive algorithm is designed that learns all true edges using O(n^2) interventions, where n is the number of nodes. Remarkably, the size of the interventions is optimal if the underlying mixture model does not contain cycles across its components. More generally, the gap between the intervention size used by the algorithm and the optimal size is quantified. It is shown to be bounded by the cyclic complexity number of the mixture model, defined as the size of the minimal intervention that can break the cycles in the mixture, which is upper bounded by the number of cycles among the ancestors of a node.

[LG-129] MOTIVE: A Drug-Target Interaction Graph For Inductive Link Prediction

链接: https://arxiv.org/abs/2406.08649
作者: John Arevalo,Ellen Su,Anne E Carpenter,Shantanu Singh
关键词: Cell Painting features, Cell Painting, Drug-target interaction, mechanisms of action, comprises Cell Painting
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Drug-target interaction (DTI) prediction is crucial for identifying new therapeutics and detecting mechanisms of action. While structure-based methods accurately model physical interactions between a drug and its protein target, cell-based assays such as Cell Painting can better capture complex DTI interactions. This paper introduces MOTIVE, a Morphological cOmpound Target Interaction Graph dataset that comprises Cell Painting features for 11,000 genes and 3,600 compounds along with their relationships extracted from seven publicly available databases. We provide random, cold-source (new drugs), and cold-target (new genes) data splits to enable rigorous evaluation under realistic use cases. Our benchmark results show that graph neural networks that use Cell Painting features consistently outperform those that learn from graph structure alone, feature-based models, and topological heuristics. MOTIVE accelerates both graph ML research and drug discovery by promoting the development of more reliable DTI prediction models. MOTIVE resources are available at this https URL.
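A cold-source ("new drugs") split of the kind described above can be sketched in a few lines: hold out a fraction of drugs so that no test drug ever appears in training. The edge list and helper below are hypothetical, not the dataset's actual tooling.

```python
import random

def cold_source_split(edges, holdout_frac=0.2, seed=0):
    """Split (drug, gene) interaction edges so that test-set drugs are
    never seen during training: an inductive, 'new drug' setting."""
    drugs = sorted({d for d, _ in edges})
    random.Random(seed).shuffle(drugs)
    n_test = max(1, int(len(drugs) * holdout_frac))
    test_drugs = set(drugs[:n_test])
    train = [e for e in edges if e[0] not in test_drugs]
    test = [e for e in edges if e[0] in test_drugs]
    return train, test

# Toy interaction edges (drug, gene); real splits would use the full graph.
edges = [("d1", "g1"), ("d1", "g2"), ("d2", "g1"), ("d3", "g3"), ("d4", "g2")]
train, test = cold_source_split(edges)
```

Cold-target splits are symmetric: hold out genes instead of drugs, testing whether the model generalizes to proteins absent from training.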

[LG-130] Conditional Similarity Triplets Enable Covariate-Informed Representations of Single-Cell Data

链接: https://arxiv.org/abs/2406.08638
作者: Chi-Jane Chen,Haidong Yi,Natalie Stanley
关键词: Single-cell technologies enable, technologies enable comprehensive, enable comprehensive profiling, diverse immune cell-types, Single-cell technologies
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Single-cell technologies enable comprehensive profiling of diverse immune cell-types through the measurement of multiple genes or proteins per cell. In order to translate data from immune profiling assays into powerful diagnostics, machine learning approaches are used to compute per-sample immunological summaries, or featurizations that can be used as inputs to models for outcomes of interest. Current supervised learning approaches for computing per-sample representations are optimized based only on the outcome variable to be predicted and do not take into account clinically-relevant covariates that are likely to also be measured. Here we expand the optimization problem to also take into account such additional patient covariates to directly inform the learned per-sample representations. To do this, we introduce CytoCoSet, a set-based encoding method, which formulates a loss function with an additional triplet term penalizing samples with similar covariates from having disparate embedding results in per-sample representations. Overall, incorporating clinical covariates leads to improved prediction of clinical phenotypes.
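The covariate-informed triplet term described above can be illustrated with a standard hinge triplet loss: samples with similar clinical covariates should embed closer together than samples with dissimilar covariates. This is a generic sketch, not CytoCoSet's exact formulation.

```python
import numpy as np

def covariate_triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge triplet term: the anchor sample should embed closer to a
    sample with similar covariates (positive) than to one with dissimilar
    covariates (negative), by at least `margin`."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return float(max(0.0, d_pos - d_neg + margin))

a = np.array([0.0, 0.0])   # per-sample embedding (toy 2-D)
p = np.array([0.1, 0.0])   # similar covariates -> should be close
n = np.array([2.0, 2.0])   # dissimilar covariates -> should be far
print(covariate_triplet_loss(a, p, n))  # 0.0, constraint already satisfied
```

In training, this term would be added to the usual outcome-prediction loss, so the learned per-sample representations reflect both the target phenotype and the measured covariates.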

[LG-131] Towards Integrating Personal Knowledge into Test-Time Predictions

链接: https://arxiv.org/abs/2406.08636
作者: Isaac Lage,Sonali Parbhoo,Finale Doshi-Velez
关键词: make decisions based, missing personal knowledge, Machine learning, amounts of data, make decisions
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Machine learning (ML) models can make decisions based on large amounts of data, but they can be missing personal knowledge available to human users about whom predictions are made. For example, a model trained to predict psychiatric outcomes may know nothing about a patient’s social support system, and social support may look different for different patients. In this work, we introduce the problem of human feature integration, which provides a way to incorporate important personal-knowledge from users without domain expertise into ML predictions. We characterize this problem through illustrative user stories and comparisons to existing approaches; we formally describe this problem in a way that paves the ground for future technical solutions; and we provide a proof-of-concept study of a simple version of a solution to this problem in a semi-realistic setting.

[LG-132] Time-MMD: A New Multi-Domain Multimodal Dataset for Time Series Analysis

链接: https://arxiv.org/abs/2406.08627
作者: Haoxin Liu,Shangqing Xu,Zhiyuan Zhao,Lingkai Kong,Harshavardhan Kamarthi,Aditya B. Sasanur,Megha Sharma,Jiaming Cui,Qingsong Wen,Chao Zhang,B. Aditya Prakash
关键词: Time series, real-world time series, wide range, series, time series analysis
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Time series data are ubiquitous across a wide range of real-world domains. While real-world time series analysis (TSA) requires human experts to integrate numerical series data with multimodal domain-specific knowledge, most existing TSA models rely solely on numerical data, overlooking the significance of information beyond numerical series. This oversight is due to the untapped potential of textual series data and the absence of a comprehensive, high-quality multimodal dataset. To overcome this obstacle, we introduce Time-MMD, the first multi-domain, multimodal time series dataset covering 9 primary data domains. Time-MMD ensures fine-grained modality alignment, eliminates data contamination, and provides high usability. Additionally, we develop MM-TSFlib, the first multimodal time-series forecasting (TSF) library, seamlessly pipelining multimodal TSF evaluations based on Time-MMD for in-depth analyses. Extensive experiments conducted on Time-MMD through MM-TSFlib demonstrate significant performance enhancements by extending unimodal TSF to multimodality, evidenced by over 15% mean squared error reduction in general, and up to 40% in domains with rich textual data. More importantly, our datasets and library open up broader applications, impacts, and research topics to advance TSA. The dataset and library are available at this https URL and this https URL.

[LG-133] Emotion Manipulation Through Music – A Deep Learning Interactive Visual Approach

链接: https://arxiv.org/abs/2406.08623
作者: Adel N. Abdalla,Jared Osborne,Razvan Andonie
关键词: Music evokes emotion, evokes emotion, Music evokes, Russel Circumplex model, Music
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Music evokes emotion in many people. We introduce a novel way to manipulate the emotional content of a song using AI tools. Our goal is to achieve the desired emotion while leaving the original melody as intact as possible. For this, we create an interactive pipeline capable of shifting an input song into a diametrically opposed emotion and visualize this result through Russel’s Circumplex model. Our approach is a proof-of-concept for Semantic Manipulation of Music, a novel field aimed at modifying the emotional content of existing music. We design a deep learning model able to assess the accuracy of our modifications to key, SoundFont instrumentation, and other musical features. The accuracy of our model is in-line with the current state of the art techniques on the 4Q Emotion dataset. With further refinement, this research may contribute to on-demand custom music generation, the automated remixing of existing work, and music playlists tuned for emotional progression.

[LG-134] Self-Supervised Speech Representations are More Phonetic than Semantic

链接: https://arxiv.org/abs/2406.08619
作者: Kwanghee Choi,Ankita Pasad,Tomohiko Nakamura,Satoru Fukayama,Karen Livescu,Shinji Watanabe
关键词: Self-supervised speech models, Self-supervised speech, effective backbone, Self-supervised, speech applications
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: Accepted to Interspeech 2024. Source code at this https URL

点击查看摘要

Abstract:Self-supervised speech models (S3Ms) have become an effective backbone for speech applications. Various analyses suggest that S3Ms encode linguistic properties. In this work, we seek a more fine-grained analysis of the word-level linguistic properties encoded in S3Ms. Specifically, we curate a novel dataset of near homophone (phonetically similar) and synonym (semantically similar) word pairs and measure the similarities between S3M word representation pairs. Our study reveals that S3M representations consistently and significantly exhibit more phonetic than semantic similarity. Further, we question whether widely used intent classification datasets such as Fluent Speech Commands and Snips Smartlights are adequate for measuring semantic abilities. Our simple baseline, using only the word identity, surpasses S3M-based models. This corroborates our findings and suggests that high scores on these datasets do not necessarily guarantee the presence of semantic content.
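Measuring the similarity between word-representation pairs, as in the study above, reduces to cosine similarity over pooled embeddings. Below is a deterministic toy sketch with made-up vectors; the paper's finding corresponds to homophone pairs scoring consistently higher than synonym pairs.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical pooled word-level representations from a speech model.
base = np.array([1.0, 0.5, -0.2, 0.3])
homophone = np.array([0.9, 0.55, -0.15, 0.25])  # phonetically similar word
synonym = np.array([-0.5, 1.0, 0.8, -0.1])      # semantically similar word

# S3M representations tend to behave like the first comparison:
print(cosine(base, homophone) > cosine(base, synonym))  # True for these toys
```

Repeating this comparison over a curated dataset of near-homophone and synonym pairs is what lets the authors quantify "more phonetic than semantic."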

[LG-135] LayeredDoc: Domain Adaptive Document Restoration with a Layer Separation Approach

链接: https://arxiv.org/abs/2406.08610
作者: Maria Pilligua,Nil Biescas,Javier Vazquez-Corral,Josep Lladós,Ernest Valveny,Sanket Biswas
关键词: processing systems demands, systems demands robust, intelligent document processing, demands robust solutions, extensive retraining
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted to ICDAR 2024 (Athens, Greece) Workshop on Automatically Domain-Adapted and Personalized Document Analysis (ADAPDA)

点击查看摘要

Abstract:The rapid evolution of intelligent document processing systems demands robust solutions that adapt to diverse domains without extensive retraining. Traditional methods often falter with variable document types, leading to poor performance. To overcome these limitations, this paper introduces a text-graphic layer separation approach that enhances domain adaptability in document image restoration (DIR) systems. We propose LayeredDoc, which utilizes two layers of information: the first targets coarse-grained graphic components, while the second refines machine-printed textual content. This hierarchical DIR framework dynamically adjusts to the characteristics of the input document, facilitating effective domain adaptation. We evaluated our approach both qualitatively and quantitatively using a new real-world dataset, LayeredDocDB, developed for this study. Initially trained on a synthetically generated dataset, our model demonstrates strong generalization capabilities for the DIR task, offering a promising solution for handling variability in real-world data. Our code is accessible on GitHub.

[LG-136] FakeInversion: Learning to Detect Images from Unseen Text-to-Image Models by Inverting Stable Diffusion

链接: https://arxiv.org/abs/2406.08603
作者: George Cazenavette,Avneesh Sud,Thomas Leung,Ben Usman
关键词: GenAI systems, Stable Diffusion, potential for abuse, abuse of GenAI, task of detecting
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Project page: this https URL

点击查看摘要

Abstract:Due to the high potential for abuse of GenAI systems, the task of detecting synthetic images has recently become of great interest to the research community. Unfortunately, existing image-space detectors quickly become obsolete as new high-fidelity text-to-image models are developed at blinding speed. In this work, we propose a new synthetic image detector that uses features obtained by inverting an open-source pre-trained Stable Diffusion model. We show that these inversion features enable our detector to generalize well to unseen generators of high visual fidelity (e.g., DALL-E 3) even when the detector is trained only on lower fidelity fake images generated via Stable Diffusion. This detector achieves new state-of-the-art across multiple training and evaluation setups. Moreover, we introduce a new challenging evaluation protocol that uses reverse image search to mitigate stylistic and thematic biases in the detector evaluation. We show that the resulting evaluation scores align well with detectors’ in-the-wild performance, and release these datasets as public benchmarks for future research.

[LG-137] CS-Bench: A Comprehensive Benchmark for Large Language Models towards Computer Science Mastery

链接: https://arxiv.org/abs/2406.08587
作者: Xiaoshuai Song,Muxi Diao,Guanting Dong,Zhengyang Wang,Yujia Fu,Runqi Qiao,Zhexu Wang,Dayuan Fu,Huangxuan Wu,Bin Liang,Weihao Zeng,Yejie Wang,Zhuoma GongQue,Jianing Yu,Qiuna Tan,Weiran Xu
关键词: Computer Science, human intelligence, artificial intelligence, profoundly advancing, modern society
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Work in progress

点击查看摘要

Abstract:Computer Science (CS) stands as a testament to the intricacies of human intelligence, profoundly advancing the development of artificial intelligence and modern society. However, the current community of large language models (LLMs) overly focuses on benchmarks for analyzing specific foundational skills (e.g. mathematics and code generation), neglecting an all-round evaluation of the computer science field. To bridge this gap, we introduce CS-Bench, the first bilingual (Chinese-English) benchmark dedicated to evaluating the performance of LLMs in computer science. CS-Bench comprises approximately 5K meticulously curated test samples, covering 26 subfields across 4 key areas of computer science, encompassing various task forms and divisions of knowledge and reasoning. Utilizing CS-Bench, we conduct a comprehensive evaluation of over 30 mainstream LLMs, revealing the relationship between CS performance and model scales. We also quantitatively analyze the reasons for failures in existing LLMs and highlight directions for improvements, including knowledge supplementation and CS-specific reasoning. Further cross-capability experiments show a high correlation between LLMs’ capabilities in computer science and their abilities in mathematics and coding. Moreover, expert LLMs specialized in mathematics and coding also demonstrate strong performances in several CS subfields. Looking ahead, we envision CS-Bench serving as a cornerstone for LLM applications in the CS field and paving new avenues in assessing LLMs’ diverse reasoning capabilities. The CS-Bench data and evaluation code are available at this https URL.

[LG-138] Using Quality Attribute Scenarios for ML Model Test Case Generation

链接: https://arxiv.org/abs/2406.08575
作者: Rachel Brower-Sinning,Grace A. Lewis,Sebastían Echeverría,Ipek Ozkaya
关键词: machine learning, practitioners alike, challenge identified, identified by researchers, researchers and practitioners
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Paper accepted and presented in SAML 2024, the 3rd International Workshop on Software Architecture and Machine Learning, co-located with ICSA 2024, the 21st IEEE International Conference on Software Architecture

点击查看摘要

Abstract:Testing of machine learning (ML) models is a known challenge identified by researchers and practitioners alike. Unfortunately, current practice for ML model testing prioritizes testing for model performance, while often neglecting the requirements and constraints of the ML-enabled system that integrates the model. This limited view of testing leads to failures during integration, deployment, and operations, contributing to the difficulties of moving models from development to production. This paper presents an approach based on quality attribute (QA) scenarios to elicit and define system- and model-relevant test cases for ML models. The QA-based approach described in this paper has been integrated into MLTE, a process and tool to support ML model test and evaluation. Feedback from users of MLTE highlights its effectiveness in testing beyond model performance and identifying failures early in the development process.

[LG-139] HDNet: Physics-Inspired Neural Network for Flow Estimation based on Helmholtz Decomposition

链接: https://arxiv.org/abs/2406.08570
作者: Miao Qi,Ramzi Idoughi,Wolfgang Heidrich
关键词: Flow estimation problems, Inspired Neural Network, Flow estimation, scientific imaging, ubiquitous in scientific
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Flow estimation problems are ubiquitous in scientific imaging. Often, the underlying flows are subject to physical constraints that can be exploited in the flow estimation; for example, incompressible (divergence-free) flows are expected for many fluid experiments, while irrotational (curl-free) flows arise in the analysis of optical distortions and wavefront sensing. In this work, we propose a Physics-Inspired Neural Network (PINN) named HDNet, which performs a Helmholtz decomposition of an arbitrary flow field, i.e., it decomposes the input flow into a divergence-only and a curl-only component. HDNet can be trained exclusively on synthetic data generated by reverse Helmholtz decomposition, which we call Helmholtz synthesis. As a PINN, HDNet is fully differentiable and can easily be integrated into arbitrary flow estimation problems.
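The "Helmholtz synthesis" idea (generating training flows by reverse Helmholtz decomposition) can be sketched in 2-D with numpy: the gradient of a scalar potential gives a curl-free field, and the rotated gradient of a stream function gives a divergence-free field; their sum is a synthetic flow with known decomposition. This is a toy illustration with hand-picked scalar fields, not the paper's pipeline.

```python
import numpy as np

n = 64
x = np.linspace(0, 2 * np.pi, n)
X, Y = np.meshgrid(x, x, indexing="ij")
h = x[1] - x[0]

# Smooth scalar fields to derive the two components from.
potential = np.sin(X) * np.cos(Y)      # phi
stream = np.cos(2 * X) * np.sin(Y)     # psi

def grad(f, h):
    """Central-difference gradient: (df/dx, df/dy) on the grid."""
    return np.gradient(f, h, axis=0), np.gradient(f, h, axis=1)

pot_x, pot_y = grad(potential, h)
str_x, str_y = grad(stream, h)

cf_u, cf_v = pot_x, pot_y              # curl-free part:      grad(phi)
df_u, df_v = str_y, -str_x             # divergence-free part: rotated grad(psi)

flow_u = cf_u + df_u                   # synthesized training flow with
flow_v = cf_v + df_v                   # known ground-truth decomposition

# Defining properties hold to machine precision because discrete
# difference operators along different axes commute.
div = np.gradient(df_u, h, axis=0) + np.gradient(df_v, h, axis=1)
curl = np.gradient(cf_v, h, axis=0) - np.gradient(cf_u, h, axis=1)
```

A network like HDNet would be trained to map `(flow_u, flow_v)` back to the two components, with the synthesis providing unlimited labeled pairs.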

[LG-140] Noise-Aware Differentially Private Regression via Meta-Learning

链接: https://arxiv.org/abs/2406.08569
作者: Ossi Räisä,Stratis Markou,Matthew Ashman,Wessel P. Bruinsma,Marlon Tobaben,Antti Honkela,Richard E. Turner
关键词: protect user privacy, user privacy, high-stakes applications require, applications require machine, protecting user privacy
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Many high-stakes applications require machine learning models that protect user privacy and provide well-calibrated, accurate predictions. While Differential Privacy (DP) is the gold standard for protecting user privacy, standard DP mechanisms typically significantly impair performance. One approach to mitigating this issue is pre-training models on simulated data before DP learning on the private data. In this work we go a step further, using simulated data to train a meta-learning model that combines the Convolutional Conditional Neural Process (ConvCNP) with an improved functional DP mechanism of Hall et al. [2013] yielding the DPConvCNP. DPConvCNP learns from simulated data how to map private data to a DP predictive model in one forward pass, and then provides accurate, well-calibrated predictions. We compare DPConvCNP with a DP Gaussian Process (GP) baseline with carefully tuned hyperparameters. The DPConvCNP outperforms the GP baseline, especially on non-Gaussian data, yet is much faster at test time and requires less tuning.
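For context on the noise calibration that standard DP mechanisms require, here is the classic scalar (epsilon, delta) Gaussian mechanism. Note the paper uses an improved functional mechanism (Hall et al., 2013) rather than this scalar form, so the sketch below is only an illustrative assumption.

```python
import math
import numpy as np

def gaussian_mechanism(value, sensitivity, epsilon, delta, rng):
    """Classic (eps, delta)-DP Gaussian mechanism for a scalar statistic:
    sigma = sensitivity * sqrt(2 ln(1.25/delta)) / epsilon."""
    sigma = sensitivity * math.sqrt(2 * math.log(1.25 / delta)) / epsilon
    return value + rng.normal(scale=sigma), sigma

rng = np.random.default_rng(0)
# The mean of n values each clipped to [0, 1] has sensitivity 1/n,
# so the required noise shrinks as the dataset grows.
n = 1000
noisy_mean, sigma = gaussian_mechanism(0.42, 1.0 / n, epsilon=1.0, delta=1e-5, rng=rng)
print(sigma < 0.01)  # True: modest noise at this scale
```

The "significantly impair performance" problem in the abstract comes from exactly this trade-off: tighter privacy (smaller epsilon, delta) or smaller datasets force larger sigma.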

[LG-141] Adaptive Teaching with a Shared Classifier for Knowledge Distillation

链接: https://arxiv.org/abs/2406.08528
作者: Jaeyeon Jang,Young-Ik Kim,Jisu Lim,Hyeonseong Lee
关键词: student network, teacher network, less-parameterized student network, Knowledge distillation, overparameterized teacher network
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Knowledge distillation (KD) is a technique used to transfer knowledge from an overparameterized teacher network to a less-parameterized student network, thereby minimizing the incurred performance loss. KD methods can be categorized into offline and online approaches. Offline KD leverages a powerful pretrained teacher network, while online KD allows the teacher network to be adjusted dynamically to enhance the learning effectiveness of the student network. Recently, it has been discovered that sharing the classifier of the teacher network can significantly boost the performance of the student network with only a minimal increase in the number of network parameters. Building on these insights, we propose adaptive teaching with a shared classifier (ATSC). In ATSC, the pretrained teacher network self-adjusts to better align with the learning needs of the student network based on its capabilities, and the student network benefits from the shared classifier, enhancing its performance. Additionally, we extend ATSC to environments with multiple teachers. We conduct extensive experiments, demonstrating the effectiveness of the proposed KD method. Our approach achieves state-of-the-art results on the CIFAR-100 and ImageNet datasets in both single-teacher and multiteacher scenarios, with only a modest increase in the number of required model parameters. The source code is publicly available at this https URL.
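The distillation setting above typically minimizes a temperature-softened KL term between teacher and student outputs. Below is a minimal numpy sketch of the standard Hinton-style loss (a common baseline, not ATSC's exact objective).

```python
import numpy as np

def softmax(z, T):
    """Temperature-softened softmax over the last axis."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, T=4.0):
    """KL(teacher || student) on softened distributions, scaled by T^2
    so gradients stay comparable across temperatures."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = np.sum(p_t * (np.log(p_t) - np.log(p_s)), axis=-1)
    return float(T * T * kl.mean())

teacher = np.array([[4.0, 1.0, -2.0]])   # toy logits for one example
student = np.array([[2.5, 0.5, -1.0]])
print(distill_loss(student, teacher) >= 0.0)  # True: KL is non-negative
```

Sharing the teacher's classifier, as in ATSC, changes what the student must match: only the pre-classifier features need to align, which is what yields the reported gains at near-zero parameter cost.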

[LG-142] Optimized Feature Generation for Tabular Data via LLMs with Decision Tree Reasoning

链接: https://arxiv.org/abs/2406.08527
作者: Jaehyun Nam,Kyuyoung Kim,Seunghyuk Oh,Jihoon Tack,Jaehyung Kim,Jinwoo Shin
关键词: deep learning methods, data is crucial, success of deep, raw data, raw column features
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 18 pages

点击查看摘要

Abstract:Learning effective representations from raw data is crucial for the success of deep learning methods. However, in the tabular domain, practitioners often prefer augmenting raw column features over using learned representations, as conventional tree-based algorithms frequently outperform competing approaches. As a result, feature engineering methods that automatically generate candidate features have been widely used. While these approaches are often effective, there remains ambiguity in defining the space over which to search for candidate features. Moreover, they often rely solely on validation scores to select good features, neglecting valuable feedback from past experiments that could inform the planning of future experiments. To address the shortcomings, we propose a new tabular learning framework based on large language models (LLMs), coined Optimizing Column feature generator with decision Tree reasoning (OCTree). Our key idea is to leverage LLMs’ reasoning capabilities to find good feature generation rules without manually specifying the search space and provide language-based reasoning information highlighting past experiments as feedback for iterative rule improvements. Here, we choose a decision tree as reasoning as it can be interpreted in natural language, effectively conveying knowledge of past experiments (i.e., the prediction models trained with the generated features) to the LLM. Our empirical results demonstrate that this simple framework consistently enhances the performance of various prediction models across diverse tabular benchmarks, outperforming competing automatic feature engineering methods.

[LG-143] IMFL-AIGC: Incentive Mechanism Design for Federated Learning Empowered by Artificial Intelligence Generated Content

链接: https://arxiv.org/abs/2406.08526
作者: Guangjing Huang,Qiong Wu,Jingyi Li,Xu Chen
关键词: Federated learning, shared global model, promising paradigm, paradigm that enables, collaboratively train
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Computer Science and Game Theory (cs.GT)
*备注: The paper has been accepted by IEEE Transactions on Mobile Computing

点击查看摘要

Abstract:Federated learning (FL) has emerged as a promising paradigm that enables clients to collaboratively train a shared global model without uploading their local data. To alleviate the heterogeneous data quality among clients, artificial intelligence-generated content (AIGC) can be leveraged as a novel data synthesis technique for FL model performance enhancement. Due to various costs incurred by AIGC-empowered FL (e.g., costs of local model computation and data synthesis), however, clients are usually reluctant to participate in FL without adequate economic incentives, which leads to an unexplored critical issue for enabling AIGC-empowered FL. To fill this gap, we first devise a data quality assessment method for data samples generated by AIGC and rigorously analyze the convergence performance of the FL model trained using a blend of authentic and AI-generated data samples. We then propose a data quality-aware incentive mechanism to encourage clients’ participation. In light of information asymmetry incurred by clients’ private multi-dimensional attributes, we investigate clients’ behavior patterns and derive the server’s optimal incentive strategies to minimize the server’s cost in terms of both model accuracy loss and incentive payments for both complete and incomplete information scenarios. Numerical results demonstrate that our proposed mechanism exhibits the highest training accuracy and reduces up to 53.34% of the server’s cost with real-world datasets, compared with existing benchmark mechanisms.

[LG-144] A Mathematical Certification for Positivity Conditions in Neural Networks with Applications to Partial Monotonicity and Ethical AI

链接: https://arxiv.org/abs/2406.08525
作者: Alejandro Polo-Molina,David Alfaya,Jose Portela
关键词: Artificial Neural Networks, Artificial Neural, Neural Networks, modeling complex relationships, ANN
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 15 pages, 5 figures

点击查看摘要

Abstract:Artificial Neural Networks (ANNs) have become a powerful tool for modeling complex relationships in large-scale datasets. However, their black-box nature poses ethical challenges. In certain situations, ensuring ethical predictions might require following specific partial monotonic constraints. However, certifying if an already-trained ANN is partially monotonic is challenging. Therefore, ANNs are often disregarded in some critical applications, such as credit scoring, where partial monotonicity is required. To address this challenge, this paper presents a novel algorithm (LipVor) that certifies if a black-box model, such as an ANN, is positive based on a finite number of evaluations. Therefore, as partial monotonicity can be stated as a positivity condition of the partial derivatives, the LipVor Algorithm can certify whether an already trained ANN is partially monotonic. To do so, for every positively evaluated point, the Lipschitzianity of the black-box model is used to construct a specific neighborhood where the function remains positive. Next, based on the Voronoi diagram of the evaluated points, a sufficient condition is stated to certify if the function is positive in the domain. Compared to prior methods, our approach is able to mathematically certify if an ANN is partially monotonic without needing constrained ANN architectures or piece-wise linear activation functions. Therefore, LipVor could open up the possibility of using unconstrained ANNs in some critical fields. Moreover, some other properties of an ANN, such as convexity, can be posed as positivity conditions, and therefore, LipVor could also be applied.
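
The core geometric step — turning one positive evaluation plus Lipschitz continuity into a whole neighborhood of guaranteed positivity — can be illustrated in one dimension (a toy sketch of the idea, not the paper's Voronoi-based algorithm; the greedy interval sweep here stands in for the Voronoi covering condition):

```python
def positivity_radius(fx, L):
    """If f is L-Lipschitz and f(x) = fx > 0, then f > 0 on the open
    ball of this radius around x."""
    return fx / L

def certify_interval(f, L, a, b, tol=1e-9):
    """Greedily cover [a, b] with positivity balls; returns True only if
    every evaluation is positive and the balls reach b."""
    x = a
    while x < b:
        fx = f(x)
        if fx <= 0:
            return False  # a non-positive evaluation refutes positivity
        r = positivity_radius(fx, L)
        if r < tol:
            return False  # radius too small to make progress: no certificate
        x += r            # f > 0 is guaranteed on (x - r, x + r)
    return True
```

Because every step is justified by the Lipschitz bound, a `True` answer is a mathematical certificate, not a heuristic.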

[LG-145] Federated Incomplete Multi-View Clustering with Heterogeneous Graph Neural Networks

链接: https://arxiv.org/abs/2406.08524
作者: Xueming Yan,Ziqi Wang,Yaochu Jin
关键词: multi-view clustering offers, multiple devices, offers the potential, potential to develop, multi-view data
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Federated multi-view clustering offers the potential to develop a global clustering model using data distributed across multiple devices. However, current methods face challenges due to the absence of label information and the paramount importance of data privacy. A significant issue is the feature heterogeneity across multi-view data, which complicates the effective mining of complementary clustering information. Additionally, the inherent incompleteness of multi-view data in a distributed setting can further complicate the clustering process. To address these challenges, we introduce a federated incomplete multi-view clustering framework with heterogeneous graph neural networks (FIM-GNNs). In the proposed FIM-GNNs, autoencoders built on heterogeneous graph neural network models are employed for feature extraction of multi-view data at each client site. At the server level, heterogeneous features from overlapping samples of each client are aggregated into a global feature representation. Global pseudo-labels are generated at the server to enhance the handling of incomplete view data, where these labels serve as a guide for integrating and refining the clustering process across different data views. Comprehensive experiments have been conducted on public benchmark datasets to verify the performance of the proposed FIM-GNNs in comparison with state-of-the-art algorithms.

[LG-146] Predicting Cascading Failures with a Hyperparametric Diffusion Model

链接: https://arxiv.org/abs/2406.08522
作者: Bin Xiang,Bogdan Cautis,Xiaokui Xiao,Olga Mula,Dusit Niyato,Laks V.S. Lakshmanan
关键词: study cascading failures, cascading failures, diffusion, model, power grid
类目: ocial and Information Networks (cs.SI); Machine Learning (cs.LG)
*备注: KDD 2024

点击查看摘要

Abstract:In this paper, we study cascading failures in power grids through the lens of information diffusion models. Similar to the spread of rumors or influence in an online social network, it has been observed that failures (outages) in a power grid can spread contagiously, driven by viral spread mechanisms. We employ a stochastic diffusion model that is Markovian (memoryless) and local (the activation of one node, i.e., transmission line, can only be caused by its neighbors). Our model integrates viral diffusion principles with physics-based concepts, by correlating the diffusion weights (contagion probabilities between transmission lines) with the hyperparametric Information Cascades (IC) model. We show that this diffusion model can be learned from traces of cascading failures, enabling accurate modeling and prediction of failure propagation. This approach facilitates actionable information through well-understood and efficient graph analysis methods and graph diffusion simulations. Furthermore, by leveraging the hyperparametric model, we can predict diffusion and mitigate the risks of cascading failures even in unseen grid configurations, whereas existing methods falter due to a lack of training data. Extensive experiments based on a benchmark power grid and simulations therein show that our approach effectively captures the failure diffusion phenomena and guides decisions to strengthen the grid, reducing the risk of large-scale cascading failures. Additionally, we characterize our model’s sample complexity, improving upon the existing bound.
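
The underlying Independent Cascade dynamics — each newly failed line gets a single chance to trip each neighbor with some contagion probability — can be simulated in a few lines (a generic IC simulator; in the paper the edge probabilities would come from the learned hyperparametric model):

```python
import random

def simulate_ic(neighbors, prob, seeds, seed=0):
    """One run of the Independent Cascade model.
    neighbors: dict node -> list of downstream nodes
    prob: dict (u, v) -> contagion probability
    seeds: initially failed nodes."""
    rng = random.Random(seed)
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        nxt = []
        for u in frontier:
            for v in neighbors.get(u, []):
                # each newly active u makes a single activation attempt on v
                if v not in active and rng.random() < prob.get((u, v), 0.0):
                    active.add(v)
                    nxt.append(v)
        frontier = nxt
    return active
```

Averaging the final outage set over many runs estimates the expected cascade size from a given set of initial failures.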

[LG-147] Enhanced Anomaly Detection in Automotive Systems Using SAAD: Statistical Aggregated Anomaly Detection

链接: https://arxiv.org/abs/2406.08516
作者: Dacian Goina,Eduard Hogea,George Maties
关键词: detection methodology termed, methodology termed Statistical, anomaly detection methodology, termed Statistical Aggregated, anomaly detection
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:This paper presents a novel anomaly detection methodology termed Statistical Aggregated Anomaly Detection (SAAD). The SAAD approach integrates advanced statistical techniques with machine learning, and its efficacy is demonstrated through validation on real sensor data from a Hardware-in-the-Loop (HIL) environment within the automotive domain. The key innovation of SAAD lies in its ability to significantly enhance the accuracy and robustness of anomaly detection when combined with Fully Connected Networks (FCNs) augmented by dropout layers. Comprehensive experimental evaluations indicate that the standalone statistical method achieves an accuracy of 72.1%, whereas the deep learning model alone attains an accuracy of 71.5%. In contrast, the aggregated method achieves a superior accuracy of 88.3% and an F1 score of 0.921, thereby outperforming the individual models. These results underscore the effectiveness of SAAD, demonstrating its potential for broad application in various domains, including automotive systems.
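
The abstract does not spell out the aggregation rule, but the flavor of combining a statistical score with a learned model's score can be sketched as follows (a hypothetical blend; the weight `w` and the z-score threshold are illustrative choices, not SAAD's actual rule):

```python
def z_score(x, mean, std):
    """Plain statistical anomaly score: deviation in units of std."""
    return abs(x - mean) / std

def aggregated_score(x, mean, std, model_prob, z_thresh=3.0, w=0.5):
    """Blend a capped z-score with a model's anomaly probability.
    Returns a value in [0, 1]; higher means more anomalous."""
    stat = min(z_score(x, mean, std) / z_thresh, 1.0)
    return w * stat + (1.0 - w) * model_prob
```

Thresholding the blended score lets either detector compensate for the other's misses, which is the intuition behind the aggregated method outperforming each component alone.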

[LG-148] Operational Latent Spaces

链接: https://arxiv.org/abs/2406.02699
作者: Scott H. Hawley,Austin R. Tackett
关键词: semantically meaningful operations, support semantically meaningful, operational latent spaces, latent spaces, operational latent
类目: Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注: 7 pages, 6 figures. Accepted to AES International Symposium on AI and the Musician

点击查看摘要

Abstract:We investigate the construction of latent spaces through self-supervised learning to support semantically meaningful operations. Analogous to operational amplifiers, these “operational latent spaces” (OpLaS) not only demonstrate semantic structure such as clustering but also support common transformational operations with inherent semantic meaning. Some operational latent spaces are found to have arisen “unintentionally” in the progress toward some (other) self-supervised learning objective, in which unintended but still useful properties are discovered among the relationships of points in the space. Other spaces may be constructed “intentionally” by developers stipulating certain kinds of clustering or transformations intended to produce the desired structure. We focus on the intentional creation of operational latent spaces via self-supervised learning, including the introduction of rotation operators via a novel “FiLMR” layer, which can be used to enable ring-like symmetries found in some musical constructions.
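
A minimal example of an intentionally constructed transformation operator is a rotation acting on coordinate pairs of a latent vector — the kind of ring symmetry the "FiLMR" layer is meant to enable (a toy stand-in, not the paper's layer):

```python
import numpy as np

def rotate_latent(z, theta):
    """Rotate consecutive coordinate pairs of a latent vector by theta,
    giving the latent space an explicit ring (circle-group) symmetry."""
    pairs = np.asarray(z, dtype=float).reshape(-1, 2)
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s], [s, c]])
    return (pairs @ R.T).ravel()
```

Applying the operator twelve times with theta = 2π/12 returns to the starting point, mirroring the pitch-class cycles found in musical constructions.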

[LG-149] Learning conditional distributions on continuous spaces

链接: https://arxiv.org/abs/2406.09375
作者: Cyril Bénézet,Ziteng Cheng,Sebastian Jaimungal
关键词: multi-dimensional unit boxes, investigate sample-based learning, unit boxes, investigate sample-based, sample-based learning
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:We investigate sample-based learning of conditional distributions on multi-dimensional unit boxes, allowing for different dimensions of the feature and target spaces. Our approach involves clustering data near varying query points in the feature space to create empirical measures in the target space. We employ two distinct clustering schemes: one based on a fixed-radius ball and the other on nearest neighbors. We establish upper bounds for the convergence rates of both methods and, from these bounds, deduce optimal configurations for the radius and the number of neighbors. We propose to incorporate the nearest neighbors method into neural network training, as our empirical analysis indicates it has better performance in practice. For efficiency, our training process utilizes approximate nearest neighbors search with random binary space partitioning. Additionally, we employ the Sinkhorn algorithm and a sparsity-enforced transport plan. Our empirical findings demonstrate that, with a suitably designed structure, the neural network has the ability to adapt to a suitable level of Lipschitz continuity locally. For reproducibility, our code is available at this https URL.
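
The nearest-neighbors construction of an empirical conditional measure is straightforward (a plain kNN sketch; the paper additionally uses approximate search and Sinkhorn-based training, which are omitted here):

```python
import numpy as np

def knn_conditional_measure(X, Y, x_query, k):
    """Empirical conditional law of Y given X = x_query: uniform weights
    on the Y-values attached to the k nearest feature points."""
    X = np.atleast_2d(np.asarray(X, dtype=float))
    d = np.linalg.norm(X - np.asarray(x_query, dtype=float), axis=1)
    idx = np.argsort(d)[:k]
    return np.asarray(Y)[idx], np.full(k, 1.0 / k)
```

The returned atoms and weights form the discrete measure that the neural network is then trained to match under an optimal-transport loss.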

[LG-150] Instance-level quantitative saliency in multiple sclerosis lesion segmentation

链接: https://arxiv.org/abs/2406.09335
作者: Federico Spagnolo,Nataliia Molchanova,Roger Schaer,Meritxell Bach Cuadra,Mario Ocampo Pineda,Lester Melie-Garcia,Cristina Granziera,Vincent Andrearczyk,Adrien Depeursinge
关键词: describe models’ decision, models’ decision mechanisms, recent years, artificial intelligence, classification tasks
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In recent years, explainable methods for artificial intelligence (XAI) have tried to reveal and describe models’ decision mechanisms in the case of classification tasks. However, XAI for semantic segmentation and in particular for single instances has been little studied to date. Understanding the process underlying automatic segmentation of single instances is crucial to reveal what information was used to detect and segment a given object of interest. In this study, we proposed two instance-level explanation maps for semantic segmentation based on SmoothGrad and Grad-CAM++ methods. Then, we investigated their relevance for the detection and segmentation of white matter lesions (WML), a magnetic resonance imaging (MRI) biomarker in multiple sclerosis (MS). 687 patients diagnosed with MS for a total of 4043 FLAIR and MPRAGE MRI scans were collected at the University Hospital of Basel, Switzerland. Data were randomly split into training, validation and test sets to train a 3D U-Net for MS lesion segmentation. We observed 3050 true positive (TP), 1818 false positive (FP), and 789 false negative (FN) cases. We generated instance-level explanation maps for semantic segmentation, by developing two XAI methods based on SmoothGrad and Grad-CAM++. We investigated: 1) the distribution of gradients in saliency maps with respect to both input MRI sequences; 2) the model’s response in the case of synthetic lesions; 3) the amount of perilesional tissue needed by the model to segment a lesion. Saliency maps (based on SmoothGrad) in FLAIR showed positive values inside a lesion and negative in its neighborhood. Peak values of saliency maps generated for these four groups of volumes presented distributions that differ significantly from one another, suggesting a quantitative nature of the proposed saliency. Contextual information of 7mm around the lesion border was required for their segmentation.

[LG-151] Neural networks in non-metric spaces

链接: https://arxiv.org/abs/2406.09310
作者: Luca Galimberti
关键词: universal approximation property, universal approximation theorems, approximation property shown, universal approximation, Leveraging the infinite
类目: Functional Analysis (math.FA); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Leveraging the infinite dimensional neural network architecture we proposed in arXiv:2109.13512v4 and which can process inputs from Fréchet spaces, and using the universal approximation property shown therein, we now largely extend the scope of this architecture by proving several universal approximation theorems for a vast class of input and output spaces. More precisely, the input space \mathfrak X is allowed to be a general topological space satisfying only a mild condition (“quasi-Polish”), and the output space can be either another quasi-Polish space \mathfrak Y or a topological vector space E . Similarly to arXiv:2109.13512v4, we show furthermore that our neural network architectures can be projected down to “finite dimensional” subspaces with any desirable accuracy, thus obtaining approximating networks that are easy to implement and allow for fast computation and fitting. The resulting neural network architecture is therefore applicable for prediction tasks based on functional data. To the best of our knowledge, this is the first result which deals with such a wide class of input/output spaces and simultaneously guarantees the numerical feasibility of the ensuing architectures. Finally, we prove an obstruction result which indicates that the category of quasi-Polish spaces is in a certain sense the correct category to work with if one aims at constructing approximating architectures on infinite-dimensional spaces \mathfrak X which, at the same time, have sufficient expressive power to approximate continuous functions on \mathfrak X , are specified by a finite number of parameters only and are “stable” with respect to these parameters.

[LG-152] End-to-end Streaming model for Low-Latency Speech Anonymization

链接: https://arxiv.org/abs/2406.09277
作者: Waris Quamer,Ricardo Gutierrez-Osuna
关键词: preserving linguistic content, Speaker anonymization aims, aims to conceal, conceal cues, preserving linguistic
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Speaker anonymization aims to conceal cues to speaker identity while preserving linguistic content. Current machine learning based approaches require substantial computational resources, hindering real-time streaming applications. To address these concerns, we propose a streaming model that achieves speaker anonymization with low latency. The system is trained in an end-to-end autoencoder fashion using a lightweight content encoder that extracts HuBERT-like information, a pretrained speaker encoder that extracts speaker identity, and a variance encoder that injects pitch and energy information. These three disentangled representations are fed to a decoder that resynthesizes the speech signal. We present evaluation results from two implementations of our system, a full model that achieves a latency of 230ms, and a lite version (0.1x in size) that further reduces latency to 66ms while maintaining state-of-the-art performance in naturalness, intelligibility, and privacy preservation.

[LG-153] Generative Inverse Design of Crystal Structures via Diffusion Models with Transformers

链接: https://arxiv.org/abs/2406.09263
作者: Izumi Takahara,Kiyou Shibata,Teruyasu Mizoguchi
关键词: Recent advances, advances in deep, deep learning, learning have enabled, crystal structures
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent advances in deep learning have enabled the generation of realistic data by training generative models on large datasets of text, images, and audio. While these models have demonstrated exceptional performance in generating novel and plausible data, it remains an open question whether they can effectively accelerate scientific discovery through the data generation and drive significant advancements across various scientific fields. In particular, the discovery of new inorganic materials with promising properties poses a critical challenge, both scientifically and for industrial applications. However, unlike textual or image data, materials, or more specifically crystal structures, consist of multiple types of variables - including lattice vectors, atom positions, and atomic species. This complexity in the data gives rise to a variety of approaches for representing and generating such data. Consequently, the design choices of generative models for crystal structures remain an open question. In this study, we explore a new type of diffusion model for the generative inverse design of crystal structures, with a backbone based on a Transformer architecture. We demonstrate our models are superior to previous methods in their versatility for generating crystal structures with desired properties. Furthermore, our empirical results suggest that the optimal conditioning methods vary depending on the dataset.

[LG-154] Deep Sketched Output Kernel Regression for Structured Prediction

链接: https://arxiv.org/abs/2406.09253
作者: Tamim El Ahmad,Junjie Yang,Pierre Laforgue,Florence d’Alché-Buc
关键词: kernel trick, kernel-induced losses provide, define structured output, provide a principled, wide variety
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:By leveraging the kernel trick in the output space, kernel-induced losses provide a principled way to define structured output prediction tasks for a wide variety of output modalities. In particular, they have been successfully used in the context of surrogate non-parametric regression, where the kernel trick is typically exploited in the input space as well. However, when inputs are images or texts, more expressive models such as deep neural networks seem more suited than non-parametric methods. In this work, we tackle the question of how to train neural networks to solve structured output prediction tasks, while still benefiting from the versatility and relevance of kernel-induced losses. We design a novel family of deep neural architectures, whose last layer predicts in a data-dependent finite-dimensional subspace of the infinite-dimensional output feature space deriving from the kernel-induced loss. This subspace is chosen as the span of the eigenfunctions of a randomly-approximated version of the empirical kernel covariance operator. Interestingly, this approach unlocks the use of gradient descent algorithms (and consequently of any neural architecture) for structured prediction. Experiments on synthetic tasks as well as real-world supervised graph prediction problems show the relevance of our method.
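
The finite-dimensional target subspace can be obtained from an eigendecomposition of the (centered) output Gram matrix, which approximates the eigenfunctions of the empirical kernel covariance operator (a basic, unsketched version; the paper uses a randomly-approximated operator for efficiency):

```python
import numpy as np

def output_subspace_basis(K, r):
    """Top-r eigenpairs of the doubly centered output Gram matrix K;
    the eigenvectors span the data-dependent prediction subspace."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    w, V = np.linalg.eigh(H @ K @ H)
    order = np.argsort(w)[::-1][:r]      # eigh returns ascending order; keep the largest r
    return w[order], V[:, order]
```

The network's last layer then predicts r coordinates in this basis, which is what makes gradient-descent training possible despite the infinite-dimensional output feature space.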

[LG-155] What is the long-run distribution of stochastic gradient descent? A large deviations analysis

链接: https://arxiv.org/abs/2406.09241
作者: Waïss Azizian,Franck Iutzeler,Jérôme Malick,Panayotis Mertikopoulos
关键词: stochastic gradient descent, gradient descent, stochastic gradient, long-run distribution, non-convex problems
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Probability (math.PR); Machine Learning (stat.ML)
*备注: 70 pages, 3 figures; to be published in the proceedings of ICML 2024

点击查看摘要

Abstract:In this paper, we examine the long-run distribution of stochastic gradient descent (SGD) in general, non-convex problems. Specifically, we seek to understand which regions of the problem’s state space are more likely to be visited by SGD, and by how much. Using an approach based on the theory of large deviations and randomly perturbed dynamical systems, we show that the long-run distribution of SGD resembles the Boltzmann-Gibbs distribution of equilibrium thermodynamics with temperature equal to the method’s step-size and energy levels determined by the problem’s objective and the statistics of the noise. In particular, we show that, in the long run, (a) the problem’s critical region is visited exponentially more often than any non-critical region; (b) the iterates of SGD are exponentially concentrated around the problem’s minimum energy state (which does not always coincide with the global minimum of the objective); (c) all other connected components of critical points are visited with frequency that is exponentially proportional to their energy level; and, finally (d) any component of local maximizers or saddle points is “dominated” by a component of local minimizers which is visited exponentially more often.
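
The Boltzmann-Gibbs picture can be checked numerically on the simplest objective f(x) = x²/2 with additive gradient noise: the long-run variance of the iterates should be close to a "temperature" proportional to the step-size (an illustrative simulation, not the paper's large-deviations analysis):

```python
import numpy as np

def sgd_on_quadratic(eta=0.1, sigma=1.0, steps=200_000, seed=0):
    """Run SGD on f(x) = x^2 / 2 with additive Gaussian gradient noise.
    The stationary law is approximately Gibbs: exp(-f(x)/T), T ~ eta*sigma^2/2."""
    rng = np.random.default_rng(seed)
    x, xs = 0.0, np.empty(steps)
    for t in range(steps):
        noisy_grad = x + sigma * rng.standard_normal()
        x -= eta * noisy_grad
        xs[t] = x
    return xs
```

With eta = 0.1 and sigma = 1 the Gibbs prediction is a variance of about T = 0.05, close to the exact value eta·sigma²/(2 − eta) ≈ 0.053 for this linear recursion.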

[LG-156] Precise analysis of ridge interpolators under heavy correlations – a Random Duality Theory view

链接: https://arxiv.org/abs/2406.09199
作者: Mihailo Stojnic
关键词: column-correlated linear regression, minimum norm interpolators, including minimum norm, Random Duality Theory, linear regression models
类目: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:We consider fully row/column-correlated linear regression models and study several classical estimators (including minimum norm interpolators (GLS), ordinary least squares (LS), and ridge regressors). We show that Random Duality Theory (RDT) can be utilized to obtain precise closed-form characterizations of all estimator-related optimizing quantities of interest, including the prediction risk (testing or generalization error). On a qualitative level, our results recover the risk’s well-known non-monotonic (so-called double-descent) behavior as the number of features/sample size ratio increases. On a quantitative level, our closed-form results show how the risk explicitly depends on all key model parameters, including the problem dimensions and covariance matrices. Moreover, a special case of our results, obtained when intra-sample (or time-series) correlations are not present, precisely matches the corresponding ones obtained via spectral methods in [6,16,17,24].
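
The two estimators at the center of the analysis — the ridge regressor and its λ → 0 limit, the minimum-norm interpolator — have simple closed forms (a standard textbook sketch, independent of the paper's RDT machinery):

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge estimator; for lam = 0 in the over-parameterized
    regime (n < d) returns the minimum-norm interpolator instead."""
    n, d = X.shape
    if lam == 0.0 and n < d:
        # min-norm solution via the kernel form X^T (X X^T)^{-1} y
        return X.T @ np.linalg.solve(X @ X.T, y)
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```

Sweeping the feature/sample ratio with these formulas on synthetic data is the usual way to visualize the double-descent curve the abstract describes.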

[LG-157] Benign overfitting in Fixed Dimension via Physics-Informed Learning with Smooth Inductive Bias

链接: https://arxiv.org/abs/2406.09194
作者: Honam Wong,Wendao Wu,Fanghui Liu,Yiping Lu
关键词: Recent advances, learning theory showed, machine learning theory, over-parameterized machine learning, machine learning algorithms
类目: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG); Numerical Analysis (math.NA); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:Recent advances in machine learning theory showed that interpolation to noisy samples using over-parameterized machine learning algorithms always leads to inconsistency. However, this work surprisingly discovers that interpolated machine learning can exhibit benign overfitting and consistency when using physics-informed learning for supervised tasks governed by partial differential equations (PDEs) describing laws of physics. An analysis provides an asymptotic Sobolev norm learning curve for kernel ridge(less) regression addressing linear inverse problems involving elliptic PDEs. The results reveal that the PDE operators can stabilize variance and lead to benign overfitting for fixed-dimensional problems, contrasting standard regression settings. The impact of various inductive biases introduced by minimizing different Sobolev norms as implicit regularization is also examined. Notably, the convergence rate is independent of the specific (smooth) inductive bias for both ridge and ridgeless regression. For regularized least squares estimators, all (smooth enough) inductive biases can achieve optimal convergence rates when the regularization parameter is properly chosen. The smoothness requirement recovers a condition previously found in the Bayesian setting and extends conclusions to minimum norm interpolation estimators.

[LG-158] Ridge interpolators in correlated factor regression models – exact risk analysis

链接: https://arxiv.org/abs/2406.09183
作者: Mihailo Stojnic
关键词: Random Duality Theory, analyze the performance, classical ridge interpolators, Random Duality
类目: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:We consider correlated factor regression models (FRM) and analyze the performance of classical ridge interpolators. Utilizing the powerful Random Duality Theory (RDT) mathematical engine, we obtain precise closed-form characterizations of the underlying optimization problems and all associated optimizing quantities. In particular, we provide excess prediction risk characterizations that clearly show the dependence on all key model parameters, covariance matrices, loadings, and dimensions. As a function of the over-parametrization ratio, the generalized least squares (GLS) risk also exhibits the well-known double-descent (non-monotonic) behavior. Similarly to the classical linear regression models (LRM), we demonstrate that such an FRM phenomenon can be smoothened out by optimally tuned ridge regularization. The theoretical results are supplemented by numerical simulations, and an excellent agreement between the two is observed. Moreover, we note that “ridge smoothening” is often of limited effect already for over-parametrization ratios above 5 and of virtually no effect for those above 10. This solidifies the notion that one of the recently most popular neural network paradigms – zero-training (interpolating) generalizes well – enjoys wider applicability, including within the FRM estimation/prediction context.

[LG-159] Federated Contrastive Learning for Personalized Semantic Communication

链接: https://arxiv.org/abs/2406.09182
作者: Yining Wang,Wanli Ni,Wenqiang Yi,Xiaodong Xu,Ping Zhang,Arumugam Nallanathan
关键词: personalized semantic communication, supporting personalized semantic, design a federated, aimed at supporting, supporting personalized
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注: IEEE Communications Letters

点击查看摘要

Abstract:In this letter, we design a federated contrastive learning (FedCL) framework aimed at supporting personalized semantic communication. Our FedCL enables collaborative training of local semantic encoders across multiple clients and a global semantic decoder owned by the base station. This framework supports heterogeneous semantic encoders since it does not require client-side model aggregation. Furthermore, to tackle the semantic imbalance issue arising from heterogeneous datasets across distributed clients, we employ contrastive learning to train a semantic centroid generator (SCG). This generator obtains representative global semantic centroids that exhibit intra-semantic compactness and inter-semantic separability. Consequently, it provides superior supervision for learning discriminative local semantic features. Additionally, we conduct theoretical analysis to quantify the convergence performance of FedCL. Simulation results verify the superiority of the proposed FedCL framework compared to other distributed learning benchmarks in terms of task performance and robustness under different numbers of clients and channel conditions, especially in low signal-to-noise ratio and highly heterogeneous data scenarios.

[LG-160] Scalable and Flexible Causal Discovery with an Efficient Test for Adjacency

链接: https://arxiv.org/abs/2406.09177
作者: Alan Nawzad Amin,Andrew Gordon Wilson
关键词: understand mechanisms, Differentiable Adjacency Test, causal, variables, graph
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: ICML 2024; Code at this https URL

点击查看摘要

Abstract:To make accurate predictions, understand mechanisms, and design interventions in systems of many variables, we wish to learn causal graphs from large scale data. Unfortunately the space of all possible causal graphs is enormous so scalably and accurately searching for the best fit to the data is a challenge. In principle we could substantially decrease the search space, or learn the graph entirely, by testing the conditional independence of variables. However, deciding if two variables are adjacent in a causal graph may require an exponential number of tests. Here we build a scalable and flexible method to evaluate if two variables are adjacent in a causal graph, the Differentiable Adjacency Test (DAT). DAT replaces an exponential number of tests with a provably equivalent relaxed problem. It then solves this problem by training two neural networks. We build a graph learning method based on DAT, DAT-Graph, that can also learn from data with interventions. DAT-Graph can learn graphs of 1000 variables with state of the art accuracy. Using the graph learned by DAT-Graph, we also build models that make much more accurate predictions of the effects of interventions on large scale RNA sequencing data.

[LG-161] Generative vs. Discriminative modeling under the lens of uncertainty quantification

链接: https://arxiv.org/abs/2406.09172
作者: Elouan Argouarc’h,François Desbouvries,Eric Barat,Eiji Kawasaki
关键词: parametric conditional probability, capture intrinsic dependencies, conditional probability distribution, parametric conditional, enables to capture
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Learning a parametric model from a given dataset enables us to capture intrinsic dependencies between random variables via a parametric conditional probability distribution and, in turn, to predict the value of a label variable given observed variables. In this paper, we undertake a comparative analysis of generative and discriminative approaches, which differ in their construction and in the structure of the underlying inference problem. Our objective is to compare the ability of both approaches to leverage information from various sources in an epistemic-uncertainty-aware inference via the posterior predictive distribution. We assess the role of a prior distribution, explicit in the generative case and implicit in the discriminative case, leading to a discussion about discriminative models suffering from imbalanced datasets. We next examine the double role played by the observed variables in the generative case, and discuss the compatibility of both approaches with semi-supervised learning. We also provide practical insights and examine how the modeling choice impacts the sampling from the posterior predictive distribution. With regard to this, we propose a general sampling scheme enabling supervised learning for both approaches, as well as semi-supervised learning when compatible with the considered modeling approach. Throughout this paper, we illustrate our arguments and conclusions using the example of affine regression, and validate our comparative analysis through classification simulations using neural network based models.
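The "explicit prior" of the generative route can be made concrete with a tiny example: a generative classifier forms the label posterior via Bayes' rule, p(y|x) ∝ p(x|y) p(y), whereas a discriminative model (e.g. logistic regression) parameterizes p(y|x) directly and leaves the prior implicit in the data. The Gaussian class-conditionals and priors below are hypothetical choices for illustration.

```python
import math

def gaussian_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def generative_posterior(x, class_params, priors):
    """Label posterior via the generative route: p(y|x) ∝ p(x|y) p(y).

    class_params: {label: (mu, sigma)} -- assumed Gaussian class-conditionals.
    priors: {label: p(y)} -- the explicit prior distribution over labels.
    """
    joint = {y: gaussian_pdf(x, *class_params[y]) * priors[y] for y in priors}
    z = sum(joint.values())
    return {y: v / z for y, v in joint.items()}

params = {0: (-1.0, 1.0), 1: (1.0, 1.0)}   # hypothetical class-conditionals
posterior = generative_posterior(0.5, params, {0: 0.5, 1: 0.5})
```

With an imbalanced dataset, one would simply change `priors` here; in the discriminative case no such explicit knob exists, which is the point of the discussion in the abstract.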

[LG-162] SR-CACO-2: A Dataset for Confocal Fluorescence Microscopy Image Super-Resolution

链接: https://arxiv.org/abs/2406.09168
作者: Soufiane Belharbi,Mara KM Whitford,Phuong Hoang,Shakeeb Murtaza,Luke McCaffrey,Eric Granger
关键词: Scanning confocal microscopy, Confocal fluorescence microscopy, biological processes, confocal microscopy, accessible and widely
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 23 pages, 13 figures

点击查看摘要

Abstract:Confocal fluorescence microscopy is one of the most accessible and widely used imaging techniques for the study of biological processes. Scanning confocal microscopy allows the capture of high-quality images from 3D samples, yet suffers from well-known limitations such as photobleaching and phototoxicity of specimens caused by intense light exposure, which limits its use in some applications, especially for living cells. Cellular damage can be alleviated by changing imaging parameters to reduce light exposure, often at the expense of image quality. Machine/deep learning methods for single-image super-resolution (SISR) can be applied to restore image quality by upscaling lower-resolution (LR) images to produce high-resolution images (HR). These SISR methods have been successfully applied to photo-realistic images due partly to the abundance of publicly available data. In contrast, the lack of publicly available data partly limits their application and success in scanning confocal microscopy. In this paper, we introduce a large scanning confocal microscopy dataset named SR-CACO-2 that is comprised of low- and high-resolution image pairs marked for three different fluorescent markers. It allows the evaluation of performance of SISR methods on three different upscaling levels (X2, X4, X8). SR-CACO-2 contains the human epithelial cell line Caco-2 (ATCC HTB-37), and it is composed of 22 tiles that have been translated in the form of 9,937 image patches for experiments with SISR methods. Given the new SR-CACO-2 dataset, we also provide benchmarking results for 15 state-of-the-art methods that are representative of the main SISR families. Results show that these methods have limited success in producing high-resolution textures, indicating that SR-CACO-2 represents a challenging problem. Our dataset, code and pretrained weights are available: this https URL.

[LG-163] Operator-informed score matching for Markov diffusion models

链接: https://arxiv.org/abs/2406.09084
作者: Zheyang Shen,Chris J. Oates
关键词: Markov diffusion models, typically trained, score matching, Diffusion models, diffusion models enjoy
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Preprint; 19 pages, 5 figures

点击查看摘要

Abstract:Diffusion models are typically trained using score matching, yet score matching is agnostic to the particular forward process that defines the model. This paper argues that Markov diffusion models enjoy an advantage over other types of diffusion model, as their associated operators can be exploited to improve the training process. In particular, (i) there exists an explicit formal solution to the forward process as a sequence of time-dependent kernel mean embeddings; and (ii) the derivation of score-matching and related estimators can be streamlined. Building upon (i), we propose Riemannian diffusion kernel smoothing, which ameliorates the need for neural score approximation, at least in the low-dimensional context; Building upon (ii), we propose operator-informed score matching, a variance reduction technique that is straightforward to implement in both low- and high-dimensional diffusion modeling and is demonstrated to improve score matching in an empirical proof-of-concept.
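The idea that kernel smoothing can replace neural score approximation in low dimensions can be sketched in 1-D: for a Gaussian-smoothed empirical distribution, the score has an exact closed form as a responsibility-weighted average, so no network is needed. This is only a simplified illustration of the low-dimensional case, not the paper's Riemannian diffusion kernel smoothing.

```python
import math

def smoothed_score(x, data, sigma):
    """Exact score d/dx log p_sigma(x) of a Gaussian-smoothed empirical distribution.

    p_sigma is the empirical data convolved with N(0, sigma^2); its score is a
    softmax-weighted average of (x_i - x) / sigma^2, computed here in closed form.
    """
    logw = [-((x - xi) ** 2) / (2 * sigma ** 2) for xi in data]
    m = max(logw)
    w = [math.exp(lw - m) for lw in logw]      # numerically stable weights
    z = sum(w)
    return sum(wi * (xi - x) for wi, xi in zip(w, data)) / (z * sigma ** 2)

data = [-1.0, 1.0]                            # toy two-point dataset
s_mid = smoothed_score(0.0, data, sigma=0.5)  # symmetry => score is 0 here
s_out = smoothed_score(1.5, data, sigma=0.5)  # points back toward the data
```

A score-matching network would be trained to reproduce exactly this function; in low dimensions the closed form above makes that approximation step unnecessary.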

[LG-164] Bayesian Structural Model Updating with Multimodal Variational Autoencoder

链接: https://arxiv.org/abs/2406.09051
作者: Tatsuya Itoi,Kazuho Amishiki,Sangwon Lee,Taro Yaoyama
关键词: multimodal variational autoencoder, surrogate unimodal encoders, Bayesian structural model, framework for Bayesian, Bayesian structural
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP)
*备注: 20 pages, 11 figures

点击查看摘要

Abstract:This paper presents a novel framework for Bayesian structural model updating and proposes a method that utilizes the surrogate unimodal encoders of a multimodal variational autoencoder. This method facilitates an efficient nonparametric estimation of the likelihood describing the observed data. It is particularly suitable for high-dimensional correlated simultaneous observations applicable to various dynamic analysis models. The proposed approach is benchmarked using a numerical model of a single-story frame building with acceleration and dynamic strain measurements.

[LG-165] Central Limit Theorem for Bayesian Neural Network trained with Variational Inference

链接: https://arxiv.org/abs/2406.09048
作者: Arnaud Descours(MAGNET),Tom Huix(X),Arnaud Guillin(LMBP),Manon Michel(LMBP),Éric Moulines(X),Boris Nectoux(LMBP)
关键词: Central Limit Theorems, Bayesian two-layerneural networks, derive Central Limit, rigorously derive Central, derive Central
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:In this paper, we rigorously derive Central Limit Theorems (CLT) for Bayesian two-layer neural networks in the infinite-width limit, trained by variational inference on a regression task. The different networks are trained via different maximization schemes of the regularized evidence lower bound: (i) the idealized case with exact estimation of a multiple Gaussian integral from the reparametrization trick, (ii) a minibatch scheme using Monte Carlo sampling, commonly known as Bayes-by-Backprop, and (iii) a computationally cheaper algorithm named Minimal VI. The latter was recently introduced by leveraging the information obtained at the level of the mean-field limit. Laws of large numbers are already rigorously proven for the three schemes, which admit the same asymptotic limit. By deriving the CLT, this work shows that the idealized and Bayes-by-Backprop schemes have similar fluctuation behavior that is different from the Minimal VI one. Numerical experiments then illustrate that the Minimal VI scheme is still more efficient, in spite of bigger variances, thanks to its important gain in computational complexity.

[LG-166] From Theory to Therapy: Reframing SBDD Model Evaluation via Practical Metrics

链接: https://arxiv.org/abs/2406.08980
作者: Bowen Gao,Haichuan Tan,Yanwen Huang,Minsi Ren,Xiao Huang,Wei-Ying Ma,Ya-Qin Zhang,Yanyan Lan
关键词: specific protein pockets, bind specific protein, structure-based drug design, Recent advancements, generating molecules tailored
类目: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent advancements in structure-based drug design (SBDD) have significantly enhanced the efficiency and precision of drug discovery by generating molecules tailored to bind specific protein pockets. Despite these technological strides, their practical application in real-world drug development remains challenging due to the complexities of synthesizing and testing these molecules. The reliability of the Vina docking score, the current standard for assessing binding abilities, is increasingly questioned due to its susceptibility to overfitting. To address these limitations, we propose a comprehensive evaluation framework that includes assessing the similarity of generated molecules to known active compounds, introducing a virtual screening-based metric for practical deployment capabilities, and re-evaluating binding affinity more rigorously. Our experiments reveal that while current SBDD models achieve high Vina scores, they fall short in practical usability metrics, highlighting a significant gap between theoretical predictions and real-world applicability. Our proposed metrics and dataset aim to bridge this gap, enhancing the practical applicability of future SBDD models and aligning them more closely with the needs of pharmaceutical research and development.
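One of the practical metrics the abstract proposes, similarity of generated molecules to known actives, is typically computed as a Tanimoto similarity over molecular fingerprints. Real pipelines derive fingerprints (e.g. ECFP) with a cheminformatics toolkit; the toy on-bit sets below are stand-ins to show the metric itself.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between binary fingerprints given as sets of on-bits."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0

def max_similarity_to_actives(generated_fp, active_fps):
    # nearest-active similarity: how close a generated molecule is
    # to any known active compound
    return max(tanimoto(generated_fp, fp) for fp in active_fps)

actives = [{1, 4, 7, 9}, {2, 4, 8}]   # hypothetical fingerprints of known actives
generated = {1, 4, 7}
score = max_similarity_to_actives(generated, actives)
```

Ranking generated molecules by such a metric, rather than by a docking score alone, is one way to probe the theory-to-practice gap the paper highlights.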

[LG-167] SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction

链接: https://arxiv.org/abs/2406.08961
作者: Yanwen Huang,Bowen Gao,Yinjun Jia,Hongbo Ma,Wei-Ying Ma,Ya-Qin Zhang,Yanyan Lan
关键词: Small molecules play, small molecule-protein interactions, modern medicine, play a pivotal, pivotal role
类目: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Small molecules play a pivotal role in modern medicine, and scrutinizing their interactions with protein targets is essential for the discovery and development of novel, life-saving therapeutics. The term “bioactivity” encompasses various biological effects resulting from these interactions, including both binding and functional responses. The magnitude of bioactivity dictates the therapeutic or toxic pharmacological outcomes of small molecules, rendering accurate bioactivity prediction crucial for the development of safe and effective drugs. However, existing structural datasets of small molecule-protein interactions are often limited in scale and lack systematically organized bioactivity labels, thereby impeding our understanding of these interactions and precise bioactivity prediction. In this study, we introduce a comprehensive dataset of small molecule-protein interactions, consisting of over a million binding structures, each annotated with real biological activity labels. This dataset is designed to facilitate unbiased bioactivity prediction. We evaluated several classical models on this dataset, and the results demonstrate that the task of unbiased bioactivity prediction is challenging yet essential.

[LG-168] Mirror and Preconditioned Gradient Descent in Wasserstein Space

链接: https://arxiv.org/abs/2406.08938
作者: Clément Bonet,Théo Uscidda,Adam David,Pierre-Cyril Aubin-Frankowski,Anna Korba
关键词: Wasserstein space encompasses, Wasserstein space, machine learning, problem of minimizing, encompasses many applications
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:As the problem of minimizing functionals on the Wasserstein space encompasses many applications in machine learning, different optimization algorithms on \mathbb{R}^d have received their counterpart analog on the Wasserstein space. We focus here on lifting two explicit algorithms: mirror descent and preconditioned gradient descent. These algorithms have been introduced to better capture the geometry of the function to minimize and are provably convergent under appropriate (namely relative) smoothness and convexity conditions. Adapting these notions to the Wasserstein space, we prove guarantees of convergence of some Wasserstein-gradient-based discrete-time schemes for new pairings of objective functionals and regularizers. The difficulty here is to carefully select along which curves the functionals should be smooth and convex. We illustrate the advantages of adapting the geometry induced by the regularizer on ill-conditioned optimization tasks, and showcase the improvement of choosing different discrepancies and geometries in a computational biology task of aligning single-cells.
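The Wasserstein-space scheme lifts the classical finite-dimensional algorithm; as background, here is standard mirror descent with the negative-entropy mirror map on the probability simplex (the multiplicative-weights update), which is the Euclidean-side prototype, not the paper's Wasserstein construction.

```python
import math

def entropic_mirror_descent(grad, x0, step, iters):
    """Mirror descent on the simplex with negative-entropy mirror map:
    x_{t+1}[i] ∝ x_t[i] * exp(-step * grad(x_t)[i])  (multiplicative weights)."""
    x = list(x0)
    for _ in range(iters):
        g = grad(x)
        w = [xi * math.exp(-step * gi) for xi, gi in zip(x, g)]
        z = sum(w)
        x = [wi / z for wi in w]  # renormalize back onto the simplex
    return x

# minimize the linear objective f(x) = <c, x> over the simplex;
# the optimum puts all mass on the smallest entry of c
c = [3.0, 1.0, 2.0]
x = entropic_mirror_descent(lambda x: c, [1 / 3] * 3, step=0.5, iters=200)
```

The entropy mirror map keeps every iterate on the simplex without projection; the paper's contribution is choosing analogous geometry-adapted maps for functionals over probability measures.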

[LG-169] Assessment of Uncertainty Quantification in Universal Differential Equations

链接: https://arxiv.org/abs/2406.08853
作者: Nina Schmid,David Fernandes del Pozo,Willem Waegeman,Jan Hasenauer
关键词: Scientific Machine Learning, Scientific Machine, uncovering governing equations, Machine Learning, integrate physical knowledge
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注: Shared last authorship between W.W. and J.H

点击查看摘要

Abstract:Scientific Machine Learning is a new class of approaches that integrate physical knowledge and mechanistic models with data-driven techniques for uncovering governing equations of complex processes. Among the available approaches, Universal Differential Equations (UDEs) are used to combine prior knowledge in the form of mechanistic formulations with universal function approximators, like neural networks. Integral to the efficacy of UDEs is the joint estimation of parameters within mechanistic formulations and the universal function approximators using empirical data. The robustness and applicability of resultant models, however, hinge upon the rigorous quantification of uncertainties associated with these parameters, as well as the predictive capabilities of the overall model or its constituent components. With this work, we provide a formalisation of uncertainty quantification (UQ) for UDEs and investigate important frequentist and Bayesian methods. By analysing three synthetic examples of varying complexity, we evaluate the validity and efficiency of ensembles, variational inference and Markov chain Monte Carlo sampling as epistemic UQ methods for UDEs.

[LG-170] Research on Deep Learning Model of Feature Extraction Based on Convolutional Neural Network

链接: https://arxiv.org/abs/2406.08837
作者: Houze Liu,Iris Li,Yaxin Liang,Dan Sun,Yining Yang,Haowei Yang
关键词: convolutional neural networks, Neural networks, accurately identifying pneumonia, deep neural networks, shallow layers
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Neural networks with relatively shallow layers and simple structures may have limited ability in accurately identifying pneumonia. In addition, deep neural networks also have a large demand for computing resources, which may cause convolutional neural networks to be unable to be implemented on terminals. Therefore, this paper will carry out the optimal classification of convolutional neural networks. Firstly, according to the characteristics of pneumonia images, AlexNet and InceptionV3 were selected to obtain better image recognition results. Combining the features of medical images, the forward neural network with deeper and more complex structure is learned. Finally, knowledge extraction technology is used to extract the obtained data into the AlexNet model to achieve the purpose of improving computing efficiency and reducing computing costs. The results showed that the prediction accuracy, specificity, and sensitivity of the trained AlexNet model increased by 4.25 percentage points, 7.85 percentage points, and 2.32 percentage points, respectively. The graphics processing usage has decreased by 51% compared to the InceptionV3 model.
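The "knowledge extraction" step described here is usually instantiated as knowledge distillation: the smaller student (AlexNet) is trained to match the temperature-softened outputs of the larger teacher. The sketch below shows only the soft-label matching loss, with hypothetical logits; it is not the paper's code.

```python
import math

def softmax(logits, T=1.0):
    m = max(logits)
    exps = [math.exp((l - m) / T) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher_T || student_T): the soft-label matching term of
    knowledge distillation, with both distributions softened by temperature T."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

teacher = [2.0, 0.5, -1.0]   # hypothetical logits from the deeper network
student = [1.5, 0.7, -0.9]   # hypothetical logits from the AlexNet student
loss = distillation_loss(student, teacher)
```

In practice this term is combined with the ordinary cross-entropy on hard labels; the temperature `T > 1` exposes the teacher's inter-class similarity structure to the student.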

[LG-171] Orthogonalized Estimation of Difference of Q-functions

链接: https://arxiv.org/abs/2406.08697
作者: Angela Zhou
关键词: Offline reinforcement learning, policies online due, Offline reinforcement, due to safety, observational data
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Offline reinforcement learning is important in many settings with available observational data but the inability to deploy new policies online due to safety, cost, and other concerns. Many recent advances in causal inference and machine learning target estimation of causal contrast functions such as CATE, which is sufficient for optimizing decisions and can adapt to potentially smoother structure. We develop a dynamic generalizati