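This post lists the latest papers retrieved from arXiv each day. It is updated automatically at about 11:30 every morning and groups papers into five areas: NLP, CV, ML, AI, and IR.

To receive the daily paper list by email, leave your email address in the comments; the email is also sent automatically at about 11:30 each day.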

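Overview (2024-06-25)

414 papers were updated today, including:

  • Natural Language Processing: 100 papers (Computation and Language (cs.CL))
  • Computer Vision: 79 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Artificial Intelligence: 124 papers (Artificial Intelligence (cs.AI))
  • Machine Learning: 112 papers (Machine Learning (cs.LG))

Natural Language Processing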

[NLP-0] A SMART Mnemonic Sounds like “Glue Tonic”: Mixing LLMs with Student Feedback to Make Mnemonic Learning Stick

Link: https://arxiv.org/abs/2406.15352
Authors: Nishant Balepur, Matthew Shu, Alexander Hoyle, Alison Robey, Shi Feng, Seraphina Goldfarb-Tarrant, Jordan Boyd-Graber
Keywords: simpler keywords, SMART, memorable explanations, explanations that link, Keyword mnemonics
Categories: Computation and Language (cs.CL)
Comments: In-Progress Preprint

Abstract:Keyword mnemonics are memorable explanations that link new terms to simpler keywords. Prior works generate mnemonics for students, but they do not guide models toward mnemonics students prefer and aid learning. We build SMART, a mnemonic generator trained on feedback from real students learning new terms. To train SMART, we first fine-tune LLaMA-2 on a curated set of user-written mnemonics. We then use LLM alignment to enhance SMART: we deploy mnemonics generated by SMART in a flashcard app to find preferences on mnemonics students favor. We gather 2684 preferences from 45 students across two types: expressed (inferred from ratings) and observed (inferred from student learning), yielding three key findings. First, expressed and observed preferences disagree; what students think is helpful does not fully capture what is truly helpful. Second, Bayesian models can synthesize complementary data from multiple preference types into a single effectiveness signal. SMART is tuned via Direct Preference Optimization on this signal, which we show resolves ties and missing labels in the typical method of pairwise comparisons, augmenting data for LLM output quality gains. Third, mnemonic experts assess SMART as matching GPT-4, at much lower deployment costs, showing the utility of capturing diverse student feedback to align LLMs in education.
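
The abstract above says SMART is tuned via Direct Preference Optimization (DPO). As a rough illustration of what that objective computes, here is a minimal sketch of the standard DPO loss for a single preference pair. The function name, the argument names, and the default `beta` are assumptions made for this example; how the paper derives its preference pairs from the Bayesian effectiveness signal is not shown here.

```python
import math


def dpo_pair_loss(policy_logp_chosen, policy_logp_rejected,
                  ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair of responses.

    All inputs are sequence log-probabilities. The names and the default
    beta are illustrative assumptions, not values taken from the paper.
    """
    # Margin: how much more the policy favours the chosen response over the
    # rejected one, relative to the frozen reference model.
    margin = ((policy_logp_chosen - ref_logp_chosen)
              - (policy_logp_rejected - ref_logp_rejected))
    # -log(sigmoid(beta * margin)), computed as softplus(-beta * margin)
    # in a numerically stable form.
    z = -beta * margin
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


# When the policy equals the reference, the margin is 0 and the loss is log(2).
baseline = dpo_pair_loss(-5.0, -5.0, -5.0, -5.0)
# Raising the chosen response's log-probability lowers the loss.
improved = dpo_pair_loss(-3.0, -5.0, -5.0, -5.0)
```

The loss falls as the policy shifts probability toward the preferred response faster than the reference model does, which is the behaviour the tuning step relies on.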

[NLP-1] Multimodal Task Vectors Enable Many-Shot Multimodal In-Context Learning

Link: https://arxiv.org/abs/2406.15334
Authors: Brandon Huang, Chancharik Mitra, Assaf Arbelle, Leonid Karlinsky, Trevor Darrell, Roei Herzig
Keywords: interleaved Large Multimodal, Large Multimodal Models, interleaved Large, Large Multimodal, multimodal ICL setting
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:The recent success of interleaved Large Multimodal Models (LMMs) in few-shot learning suggests that in-context learning (ICL) with many examples can be promising for learning new tasks. However, this many-shot multimodal ICL setting has one crucial problem: it is fundamentally limited by the model’s context length set at pretraining. The problem is especially prominent in the multimodal domain, which processes both text and images, requiring additional tokens. This motivates the need for a multimodal method to compress many shots into fewer tokens without finetuning. In this work, we enable LMMs to perform multimodal, many-shot in-context learning by leveraging Multimodal Task Vectors (MTV)–compact implicit representations of in-context examples compressed in the model’s attention heads. Specifically, we first demonstrate the existence of such MTV in LMMs and then leverage these extracted MTV to enable many-shot in-context learning for various vision-and-language tasks. Our experiments suggest that MTV can scale in performance with the number of compressed shots and generalize to similar out-of-domain tasks without additional context length for inference.

[NLP-2] Gradient-Mask Tuning Elevates the Upper Limits of LLM Performance

Link: https://arxiv.org/abs/2406.15330
Authors: Haoling Li, Xin Zhang, Xiao Liu, Yeyun Gong, Yifan Wang, Yujiu Yang, Qi Chen, Peng Cheng
Keywords: Large language models, Large language, language models, revolutionized lots, lots of fields
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:Large language models (LLMs) have revolutionized lots of fields of research. Although it is well-known that fine-tuning is essential for enhancing the capabilities of LLMs, existing research suggests that there is potential redundancy in the fine-tuning process and therefore proposes to update only a subset of parameters. However, these methods fail to leverage the task-specific information to identify important parameters during training. Based on the insight that gradients inherently contain information on task-specific data, we propose Gradient-Mask Tuning (GMT), a method that selectively updates parameters during training based on their gradient information. Specifically, we compute the absolute values of the gradients and apply masking to those with relatively smaller magnitudes. Our empirical results across various tasks demonstrate that GMT not only outperforms traditional fine-tuning methods but also elevates the upper limits of LLM performance. Further analysis indicates that GMT exhibits insensitivity to mask ratio and possesses computational efficiency comparable to vanilla SFT.
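
The abstract above describes masking gradients with relatively small absolute values before updating parameters. A minimal sketch of that idea on a flat list of parameters follows. It assumes plain SGD and a single global mask ratio; the function name, the learning rate, and the flat-list representation are assumptions for illustration, and the paper's actual granularity and optimizer may differ.

```python
def gradient_mask_step(params, grads, lr=0.1, mask_ratio=0.5):
    """One SGD step that skips the smallest-magnitude gradients.

    Illustrative sketch only: `mask_ratio` is the fraction of entries whose
    update is masked out, chosen by absolute gradient value.
    """
    n_masked = int(len(grads) * mask_ratio)
    # Indices ordered by absolute gradient, smallest first.
    order = sorted(range(len(grads)), key=lambda i: abs(grads[i]))
    masked = set(order[:n_masked])
    # Masked entries keep their old value; the rest take a normal SGD step.
    return [p if i in masked else p - lr * g
            for i, (p, g) in enumerate(zip(params, grads))]


updated = gradient_mask_step(
    params=[1.0, 1.0, 1.0, 1.0],
    grads=[0.1, -2.0, 0.05, 1.0],
    lr=0.5,
    mask_ratio=0.5,
)
# The two smallest-|gradient| entries (indices 0 and 2) are left unchanged.
```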

[NLP-3] LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs

Link: https://arxiv.org/abs/2406.15319
Authors: Ziyan Jiang, Xueguang Ma, Wenhu Chen
Keywords: traditional RAG framework, basic retrieval units, traditional RAG, units, DPR normally work
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Technical Report

Abstract:In the traditional RAG framework, the basic retrieval units are normally short. Common retrievers like DPR normally work with 100-word Wikipedia paragraphs. Such a design forces the retriever to search over a large corpus to find the “needle” unit. In contrast, the readers only need to extract answers from the short retrieved units. Such an imbalanced “heavy” retriever and “light” reader design can lead to sub-optimal performance. In order to alleviate the imbalance, we propose a new framework LongRAG, consisting of a “long retriever” and a “long reader”. LongRAG processes the entire Wikipedia into 4K-token units, which is 30x longer than before. By increasing the unit size, we significantly reduce the total units from 22M to 700K. This significantly lowers the burden of the retriever, which leads to a remarkable retrieval score: answer recall@1=71% on NQ (previously 52%) and answer recall@2=72% (previously 47%) on HotpotQA (full-wiki). Then we feed the top-k retrieved units (≈ 30K tokens) to an existing long-context LLM to perform zero-shot answer extraction. Without requiring any training, LongRAG achieves an EM of 62.7% on NQ, which is the best known result. LongRAG also achieves 64.3% on HotpotQA (full-wiki), which is on par with the SoTA model. Our study offers insights into the future roadmap for combining RAG with long-context LLMs.

[NLP-4] STARD: A Chinese Statute Retrieval Dataset with Real Queries Issued by Non-professionals

Link: https://arxiv.org/abs/2406.15313
Authors: Weihang Su, Yiran Hu, Anzhe Xie, Qingyao Ai, Zibing Que, Ning Zheng, Yun Liu, Weixing Shen, Yiqun Liu
Keywords: find relevant statutory, Statute retrieval aims, Statute retrieval, Existing statute retrieval, aims to find
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Abstract:Statute retrieval aims to find relevant statutory articles for specific queries. This process is the basis of a wide range of legal applications such as legal advice, automated judicial decisions, legal document drafting, etc. Existing statute retrieval benchmarks focus on formal and professional queries from sources like bar exams and legal case documents, thereby neglecting non-professional queries from the general public, which often lack precise legal terminology and references. To address this gap, we introduce the STAtute Retrieval Dataset (STARD), a Chinese dataset comprising 1,543 query cases collected from real-world legal consultations and 55,348 candidate statutory articles. Unlike existing statute retrieval datasets, which primarily focus on professional legal queries, STARD captures the complexity and diversity of real queries from the general public. Through a comprehensive evaluation of various retrieval baselines, we reveal that existing retrieval approaches all fall short of these real queries issued by non-professional users. The best method only achieves a Recall@100 of 0.907, suggesting the necessity for further exploration and additional research in this area. All the code and datasets are available at: this https URL
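
The STARD abstract above reports results as Recall@100. For readers unfamiliar with the metric, here is a minimal sketch of Recall@k for a single query under its common definition: the fraction of relevant items that appear in the top k results. The function name and the toy identifiers are assumptions for this example, and the paper's exact evaluation protocol is not reproduced.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant items found among the top-k ranked results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("recall is undefined when there are no relevant items")
    hits = len(relevant & set(ranked_ids[:k]))
    return hits / len(relevant)


# Toy example: three statutory articles are relevant, one is never retrieved.
ranking = ["art_12", "art_7", "art_30", "art_2"]
relevant = ["art_7", "art_2", "art_99"]
top2 = recall_at_k(ranking, relevant, k=2)
top4 = recall_at_k(ranking, relevant, k=4)
```

A dataset-level score is usually the mean of this value over all queries.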

[NLP-5] Advanced Multimodal Deep Learning Architecture for Image-Text Matching

Link: https://arxiv.org/abs/2406.15306
Authors: Jinyin Wang, Haijing Zhang, Yihao Zhong, Yingbin Liang, Rongwei Ji, Yiru Cang
Keywords: key multimodal task, multimodal deep learning, image-text matching models, text, key multimodal
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: arXiv admin note: text overlap with arXiv:2405.17460 by other authors

Abstract:Image-text matching is a key multimodal task that aims to model the semantic association between images and text as a matching relationship. With the advent of the multimedia information age, image, and text data show explosive growth, and how to accurately realize the efficient and accurate semantic correspondence between them has become the core issue of common concern in academia and industry. In this study, we delve into the limitations of current multimodal deep learning models in processing image-text pairing tasks. Therefore, we innovatively design an advanced multimodal deep learning architecture, which combines the high-level abstract representation ability of deep neural networks for visual information with the advantages of natural language processing models for text semantic understanding. By introducing a novel cross-modal attention mechanism and hierarchical feature fusion strategy, the model achieves deep fusion and two-way interaction between image and text feature space. In addition, we also optimize the training objectives and loss functions to ensure that the model can better map the potential association structure between images and text during the learning process. Experiments show that compared with existing image-text matching models, the optimized new model has significantly improved performance on a series of benchmark data sets. In addition, the new model also shows excellent generalization and robustness on large and diverse open scenario datasets and can maintain high matching performance even in the face of previously unseen complex situations.

[NLP-6] NLP-KG: A System for Exploratory Search of Scientific Literature in Natural Language Processing

Link: https://arxiv.org/abs/2406.15294
Authors: Tim Schopf, Florian Matthes
Keywords: Scientific literature searches, interested in learning, Scientific literature, Scientific, literature
Categories: Computation and Language (cs.CL)
Comments: Accepted to ACL 2024 System Demonstrations

Abstract:Scientific literature searches are often exploratory, whereby users are not yet familiar with a particular field or concept but are interested in learning more about it. However, existing systems for scientific literature search are typically tailored to keyword-based lookup searches, limiting the possibilities for exploration. We propose NLP-KG, a feature-rich system designed to support the exploration of research literature in unfamiliar natural language processing (NLP) fields. In addition to a semantic search, NLP-KG allows users to easily find survey papers that provide a quick introduction to a field of interest. Further, a Fields of Study hierarchy graph enables users to familiarize themselves with a field and its related areas. Finally, a chat interface allows users to ask questions about unfamiliar concepts or specific articles in NLP and obtain answers grounded in knowledge retrieved from scientific publications. Our system provides users with comprehensive exploration possibilities, supporting them in investigating the relationships between different fields, understanding unfamiliar concepts in NLP, and finding relevant research literature. Demo, video, and code are available at: this https URL.

[NLP-7] The Greek podcast corpus: Competitive speech models for low-resourced languages with weakly supervised data

Link: https://arxiv.org/abs/2406.15284
Authors: Georgios Paraskevopoulos, Chara Tsoukala, Athanasios Katsamanis, Vassilis Katsouros
Keywords: poses significant challenges, limited digital representation, digital representation poses, representation poses significant, significant challenges
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: To be presented at Interspeech 2024

Abstract:The development of speech technologies for languages with limited digital representation poses significant challenges, primarily due to the scarcity of available data. This issue is exacerbated in the era of large, data-intensive models. Recent research has underscored the potential of leveraging weak supervision to augment the pool of available data. In this study, we compile an 800-hour corpus of Modern Greek from podcasts and employ Whisper large-v3 to generate silver transcriptions. This corpus is utilized to fine-tune our models, aiming to assess the efficacy of this approach in enhancing ASR performance. Our analysis spans 16 distinct podcast domains, alongside evaluations on established datasets for Modern Greek. The findings indicate consistent WER improvements, correlating with increases in both data volume and model size. Our study confirms that assembling large, weakly supervised corpora serves as a cost-effective strategy for advancing speech technologies in under-resourced languages.
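
The abstract above measures progress in word error rate (WER). A minimal sketch of the usual definition follows: the word-level edit distance between reference and hypothesis, divided by the number of reference words. The function name and whitespace tokenization are assumptions for this example; real ASR evaluations normally apply text normalization first, which is omitted here.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```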

[NLP-8] Cross-Modality Safety Alignment

Link: https://arxiv.org/abs/2406.15279
Authors: Siyin Wang, Xingsong Ye, Qinyuan Cheng, Junwen Duan, Shimin Li, Jinlan Fu, Xipeng Qiu, Xuanjing Huang
Keywords: Artificial General Intelligence, General Intelligence, Artificial General, human life, systems is paramount
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:As Artificial General Intelligence (AGI) becomes increasingly integrated into various facets of human life, ensuring the safety and ethical alignment of such systems is paramount. Previous studies primarily focus on single-modality threats, which may not suffice given the integrated and complex nature of cross-modality interactions. We introduce a novel safety alignment challenge called Safe Inputs but Unsafe Output (SIUO) to evaluate cross-modality safety alignment. Specifically, it considers cases where single modalities are safe independently but could potentially lead to unsafe or unethical outputs when combined. To empirically investigate this problem, we developed the SIUO, a cross-modality benchmark encompassing 9 critical safety domains, such as self-harm, illegal activities, and privacy violations. Our findings reveal substantial safety vulnerabilities in both closed- and open-source LVLMs, such as GPT-4V and LLaVA, underscoring the inadequacy of current models to reliably interpret and respond to complex, real-world scenarios.

[NLP-9] Cognitive Map for Language Models: Optimal Planning via Verbally Representing the World Model

Link: https://arxiv.org/abs/2406.15275
Authors: Doyoung Kim, Jongwon Lee, Jinho Park, Minjoon Seo
Keywords: requiring multi-step simulations, demonstrated impressive capabilities, natural language processing, tasks requiring multi-step, multi-step simulations
Categories: Computation and Language (cs.CL)

Abstract:Language models have demonstrated impressive capabilities across various natural language processing tasks, yet they struggle with planning tasks requiring multi-step simulations. Inspired by human cognitive processes, this paper investigates the optimal planning power of language models that can construct a cognitive map of a given environment. Our experiments demonstrate that a cognitive map significantly enhances the performance of both optimal and reachable planning generation ability in the Gridworld path planning task. We observe that our method showcases two key characteristics similar to human cognition: generalization of its planning ability to extrapolated environments and rapid adaptation with limited training data. We hope our findings in the Gridworld task provide insights into modeling human cognitive processes in language models, potentially leading to the development of more advanced and robust systems that better resemble human cognition.

[NLP-10] Evaluating Diversity in Automatic Poetry Generation

Link: https://arxiv.org/abs/2406.15267
Authors: Yanran Chen, Hannes Gröner, Sina Zarrieß, Steffen Eger
Keywords: Natural Language Generation, Natural Language, impactful research fields, Language Generation, automatic poetry generation
Categories: Computation and Language (cs.CL)
Comments: init version

Abstract:Natural Language Generation (NLG), and more generally generative AI, are among the currently most impactful research fields. Creative NLG, such as automatic poetry generation, is a fascinating niche in this area. While most previous research has focused on forms of the Turing test when evaluating automatic poetry generation - can humans distinguish between automatic and human generated poetry - we evaluate the diversity of automatically generated poetry, by comparing distributions of generated poetry to distributions of human poetry along structural, lexical, semantic and stylistic dimensions, assessing different model types (word vs. character-level, general purpose LLMs vs. poetry-specific models), including the very recent LLaMA3, and types of fine-tuning (conditioned vs. unconditioned). We find that current automatic poetry systems are considerably underdiverse along multiple dimensions - they often do not rhyme sufficiently, are semantically too uniform and even do not match the length distribution of human poetry. Our experiments reveal, however, that style-conditioning and character-level modeling clearly increases diversity across virtually all dimensions we explore. Our identified limitations may serve as the basis for more genuinely diverse future poetry generation models.

[NLP-11] Perception of Phonological Assimilation by Neural Speech Recognition Models

Link: https://arxiv.org/abs/2406.15265
Authors: Charlotte Pouw, Marianne de Heer Kloots, Afra Alishahi, Willem Zuidema
Keywords: listeners effortlessly compensate, Human listeners effortlessly, Automatic Speech Recognition, unconsciously inferring, inferring the intended
Categories: Computation and Language (cs.CL)
Comments: Accepted for publication in Computational Linguistics (Special Issue on Language Learning, Representation, and Processing in Humans and Machines)

Abstract:Human listeners effortlessly compensate for phonological changes during speech perception, often unconsciously inferring the intended sounds. For example, listeners infer the underlying /n/ when hearing an utterance such as “clea[m] pan”, where [m] arises from place assimilation to the following labial [p]. This article explores how the neural speech recognition model Wav2Vec2 perceives assimilated sounds, and identifies the linguistic knowledge that is implemented by the model to compensate for assimilation during Automatic Speech Recognition (ASR). Using psycholinguistic stimuli, we systematically analyze how various linguistic context cues influence compensation patterns in the model’s output. Complementing these behavioral experiments, our probing experiments indicate that the model shifts its interpretation of assimilated sounds from their acoustic form to their underlying form in its final layers. Finally, our causal intervention experiments suggest that the model relies on minimal phonological context cues to accomplish this shift. These findings represent a step towards better understanding the similarities and differences in phonological processing between neural ASR models and humans.

[NLP-12] Towards Fine-Grained Citation Evaluation in Generated Text: A Comparative Analysis of Faithfulness Metrics

Link: https://arxiv.org/abs/2406.15264
Authors: Weijia Zhang, Mohammad Aliannejadi, Yifei Yuan, Jiahuan Pei, Jia-Hong Huang, Evangelos Kanoulas
Keywords: Large language models, Large language, language models, unverifiable information, produce unsupported
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: 12 pages, 3 figures

Abstract:Large language models (LLMs) often produce unsupported or unverifiable information, known as “hallucinations.” To mitigate this, retrieval-augmented LLMs incorporate citations, grounding the content in verifiable sources. Despite such developments, manually assessing how well a citation supports the associated statement remains a major challenge. Previous studies use faithfulness metrics to estimate citation support automatically but are limited to binary classification, overlooking fine-grained citation support in practical scenarios. To investigate the effectiveness of faithfulness metrics in fine-grained scenarios, we propose a comparative evaluation framework that assesses the metric effectiveness in distinguishing citations between three-category support levels: full, partial, and no support. Our framework employs correlation analysis, classification evaluation, and retrieval evaluation to measure the alignment between metric scores and human judgments comprehensively. Our results show no single metric consistently excels across all evaluations, revealing the complexity of assessing fine-grained support. Based on the findings, we provide practical recommendations for developing more effective metrics.

[NLP-13] Unsupervised Morphological Tree Tokenizer

Link: https://arxiv.org/abs/2406.15245
Authors: Qingyang Zhu, Xiang Hu, Pengyu Ji, Wei Wu, Kewei Tu
Keywords: pre-defined atomic units, involves segmenting text, segmenting text inputs, tokenization involves segmenting, atomic units
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:As a cornerstone in language modeling, tokenization involves segmenting text inputs into pre-defined atomic units. Conventional statistical tokenizers often disrupt constituent boundaries within words, thereby corrupting semantic information. To address this drawback, we introduce morphological structure guidance to tokenization and propose a deep model to induce character-level structures of words. Specifically, the deep model jointly encodes internal structures and representations of words with a mechanism named MorphOverriding to ensure the indecomposability of morphemes. By training the model with self-supervised objectives, our method is capable of inducing character-level structures that align with morphological rules without annotated training data. Based on the induced structures, our algorithm tokenizes words through vocabulary matching in a top-down manner. Empirical results indicate that the proposed method effectively retains complete morphemes and outperforms widely adopted methods such as BPE and WordPiece on both morphological segmentation tasks and language modeling tasks. The code will be released later.

[NLP-14] Detecting Synthetic Lyrics with Few-Shot Inference

Link: https://arxiv.org/abs/2406.15231
Authors: Yanis Labrak, Gabriel Meseguer-Brocal, Elena V. Epure
Keywords: gained significant popularity, produce human-like lyrics, large language models, recent years, significant popularity
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Under review

Abstract:In recent years, generated content in music has gained significant popularity, with large language models being effectively utilized to produce human-like lyrics in various styles, themes, and linguistic structures. This technological advancement supports artists in their creative processes but also raises issues of authorship infringement, consumer satisfaction and content spamming. To address these challenges, methods for detecting generated lyrics are necessary. However, existing works have not yet focused on this specific modality or on creative text in general regarding machine-generated content detection methods and datasets. In response, we have curated the first dataset of high-quality synthetic lyrics and conducted a comprehensive quantitative evaluation of various few-shot content detection approaches, testing their generalization capabilities and complementing this with a human evaluation. Our best few-shot detector, based on LLM2Vec, surpasses stylistic and statistical methods, which are shown competitive in other domains at distinguishing human-written from machine-generated content. It also shows good generalization capabilities to new artists and models, and effectively detects post-generation paraphrasing. This study emphasizes the need for further research on creative content detection, particularly in terms of generalization and scalability with larger song catalogs. All datasets, pre-processing scripts, and code are available publicly on GitHub and Hugging Face under the Apache 2.0 license.

[NLP-15] A LLM-Based Ranking Method for the Evaluation of Automatic Counter-Narrative Generation
[NLP-15] 基于LLM的自动反叙事生成评估排名方法

链接: https://arxiv.org/abs/2406.15227
作者: Irune Zubiaga,Aitor Soroa,Rodrigo Agerri
关键词: effective Counter Narrative, Counter Narrative, effective Counter, harmful narratives, generation techniques
中文关键词: 有效的反叙事,反叙事,有效的反叙事,有害叙事,生成技术
类目: Computation and Language (cs.CL)
备注:

Abstract:The proliferation of misinformation and harmful narratives in online discourse has underscored the critical need for effective Counter Narrative (CN) generation techniques. However, existing automatic evaluation methods often lack interpretability and fail to capture the nuanced relationship between generated CNs and human perception. Aiming to achieve a higher correlation with human judgments, this paper proposes a novel approach to assess generated CNs that consists of the use of a Large Language Model (LLM) as an evaluator. By comparing generated CNs pairwise in a tournament-style format, we establish a model ranking pipeline that achieves a correlation of 0.88 with human preference. As an additional contribution, we leverage LLMs as zero-shot (ZS) CN generators and conduct a comparative analysis of chat, instruct, and base models, exploring their respective strengths and limitations. Through meticulous evaluation, including fine-tuning experiments, we elucidate the differences in performance and responsiveness to domain-specific data. We conclude that chat-aligned models in ZS are the best option for carrying out the task, provided they do not refuse to generate an answer due to security concerns.
摘要:网络话语中错误信息和有害叙事的泛滥凸显了对有效的反叙事(CN)生成技术的迫切需求。然而,现有的自动评估方法往往缺乏可解释性,无法捕捉到生成的CN与人类感知之间的细微关系。为了获得与人类判断更高的相关性,本文提出了一种使用大型语言模型(LLM)作为评估器来评估生成的CN的新方法。通过将生成的CN以锦标赛的形式进行两两比较,我们建立了一个模型排名管道,其与人类偏好的相关性达到0.88。作为额外的贡献,我们利用LLM作为零样本(ZS)CN生成器,并对聊天、指令和基础模型进行比较分析,探索它们各自的优势和局限性。通过细致的评估,包括微调实验,我们阐明了各模型在性能以及对特定领域数据的响应性方面的差异。我们的结论是,零样本设置下的聊天对齐模型是执行该任务的最佳选择,前提是它们不会因为安全考虑而拒绝生成答案。
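
The tournament-style pairwise comparison described in this abstract can be sketched as a round-robin over model outputs. This is only a minimal illustration under stated assumptions, not the authors' pipeline: `rank_by_tournament`, `toy_judge`, and the sample outputs are hypothetical, and the toy judge (which simply prefers the longer text) stands in for the LLM evaluator.

```python
from collections import Counter
from itertools import combinations


def rank_by_tournament(outputs, judge):
    """Rank models by the number of pairwise comparisons they win.

    outputs: dict mapping a model name to its generated texts, aligned by prompt.
    judge:   callable (text_a, text_b) -> 0 if text_a is preferred, 1 otherwise.
    """
    wins = Counter({name: 0 for name in outputs})
    # Every pair of models meets once per prompt (round-robin).
    for model_a, model_b in combinations(outputs, 2):
        for text_a, text_b in zip(outputs[model_a], outputs[model_b]):
            winner = model_a if judge(text_a, text_b) == 0 else model_b
            wins[winner] += 1
    # Most wins first; ties are broken alphabetically for determinism.
    return sorted(wins, key=lambda name: (-wins[name], name))


def toy_judge(text_a, text_b):
    # Placeholder only (assumption): prefers the longer text.
    # A real setup would prompt an LLM to pick the better counter-narrative.
    return 0 if len(text_a) >= len(text_b) else 1


outputs = {
    "model_a": ["short", "tiny"],
    "model_b": ["a much longer reply", "another long reply"],
    "model_c": ["medium reply", "mid reply"],
}
ranking = rank_by_tournament(outputs, toy_judge)
print(ranking)
```

The resulting ranking could then be correlated with a human-preference ranking, which is the kind of agreement the abstract reports.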

[NLP-16] Unsupervised Extraction of Dialogue Policies from Conversations
[NLP-16] 无监督地从对话中提取对话政策

链接: https://arxiv.org/abs/2406.15214
作者: Makesh Narsimhan Sreedhar,Traian Rebedea,Christopher Parisien
关键词: typically require substantial, require substantial effort, task-oriented dialogue systems, Dialogue policies, Dialogue policies play
中文关键词: 通常需要大量、需要大量努力、面向任务的对话系统、对话政策、对话政策发挥
类目: Computation and Language (cs.CL)
备注:

Abstract:Dialogue policies play a crucial role in developing task-oriented dialogue systems, yet their development and maintenance are challenging and typically require substantial effort from experts in dialogue modeling. While in many situations, large amounts of conversational data are available for the task at hand, people lack an effective solution able to extract dialogue policies from this data. In this paper, we address this gap by first illustrating how Large Language Models (LLMs) can be instrumental in extracting dialogue policies from datasets, through the conversion of conversations into a unified intermediate representation consisting of canonical forms. We then propose a novel method for generating dialogue policies utilizing a controllable and interpretable graph-based methodology. By combining canonical forms across conversations into a flow network, we find that running graph traversal algorithms helps in extracting dialogue flows. These flows are a better representation of the underlying interactions than flows extracted by prompting LLMs. Our technique focuses on giving conversation designers greater control, offering a productivity tool to improve the process of developing dialogue policies.
摘要:对话策略在开发面向任务的对话系统中起着至关重要的作用,但其开发和维护具有挑战性,通常需要对话建模方面的专家做出大量努力。虽然在许多情况下,大量的对话数据可用于手头的任务,但人们缺乏能够从这些数据中提取对话策略的有效解决方案。在本文中,我们首先通过说明大型语言模型(LLM)如何通过将会话转换为由规范形式组成的统一中间表示来从数据集中提取对话策略来解决这一差距。然后,我们提出了一种新的方法来生成对话策略,该方法使用了一种可控制和可解释的基于图的方法。通过将会话间的规范形式组合成一个流网络,我们发现运行图遍历算法有助于提取会话流。与通过提示LLM提取的流相比,这些流更好地表示底层交互。我们的技术侧重于给予对话设计者更大的控制权,提供一个提高对话策略开发过程的生产力的工具。
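
The idea of merging canonical forms from many conversations into a graph and traversing it to recover a dialogue flow can be sketched as follows. This is a simplified illustration under stated assumptions rather than the paper's method: the canonical forms are hypothetical strings, edge weights are plain transition counts, and the traversal is a greedy walk along the most frequent transitions.

```python
from collections import defaultdict


def build_flow_graph(conversations):
    """Count transitions between consecutive canonical forms."""
    graph = defaultdict(lambda: defaultdict(int))
    for turns in conversations:
        path = ["<start>"] + turns + ["<end>"]
        for src, dst in zip(path, path[1:]):
            graph[src][dst] += 1
    return graph


def most_common_flow(graph, max_steps=20):
    """Greedy traversal: always follow the most frequent unseen transition."""
    flow, node, seen = [], "<start>", {"<start>"}
    for _ in range(max_steps):
        candidates = {d: w for d, w in graph[node].items() if d not in seen}
        if not candidates:
            break
        # Sorting first makes ties resolve alphabetically.
        node = max(sorted(candidates), key=candidates.get)
        if node == "<end>":
            break
        flow.append(node)
        seen.add(node)
    return flow


# Hypothetical canonical forms for three banking conversations.
conversations = [
    ["user greet", "bot greet", "user ask balance", "bot give balance"],
    ["user greet", "bot greet", "user ask balance", "bot ask verification"],
    ["user greet", "bot greet", "user ask balance", "bot give balance"],
]
flow = most_common_flow(build_flow_graph(conversations))
print(flow)
```

Because the flow comes from explicit counts on a graph, a conversation designer can inspect or edit individual edges, which is the kind of control the abstract emphasizes.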

[NLP-17] How Effective is GPT-4 Turbo in Generating School-Level Questions from Textbooks Based on Bloom’s Revised Taxonomy?
[NLP-17] GPT-4 Turbo在根据布鲁姆修订分类法从教科书中生成学校级别的问题方面有多有效?

链接: https://arxiv.org/abs/2406.15211
作者: Subhankar Maity,Aniket Deroy,Sudeshna Sarkar
关键词: Bloom Revised Taxonomy, NCERT textbooks, Bloom Revised, Revised Taxonomy, zero-shot mode
中文关键词: 布鲁姆修订分类法,NCERT教科书,布鲁姆修订,修订分类法,零样本模式
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at Learnersourcing: Student-Generated Content @ Scale 2024

Abstract:We evaluate the effectiveness of GPT-4 Turbo in generating educational questions from NCERT textbooks in zero-shot mode. Our study highlights GPT-4 Turbo’s ability to generate questions that require higher-order thinking skills, especially at the “understanding” level according to Bloom’s Revised Taxonomy. While we find a notable consistency between questions generated by GPT-4 Turbo and those assessed by humans in terms of complexity, there are occasional differences. Our evaluation also uncovers variations in how humans and machines evaluate question quality, with a trend inversely related to Bloom’s Revised Taxonomy levels. These findings suggest that while GPT-4 Turbo is a promising tool for educational question generation, its efficacy varies across different cognitive levels, indicating a need for further refinement to fully meet educational standards.
摘要:我们评估GPT-4 Turbo在零样本模式下从NCERT教科书生成教育问题的有效性。我们的研究强调了GPT-4 Turbo生成需要更高阶思维技能的问题的能力,特别是在布鲁姆修订分类法的“理解”层面。虽然我们发现GPT-4 Turbo生成的问题与人类评估的问题在复杂性方面存在显著一致性,但偶尔也会存在差异。我们的评估还揭示了人类和机器在评估问题质量方面的差异,其趋势与布鲁姆修订分类法的层级呈负相关。这些发现表明,虽然GPT-4 Turbo是一种有前途的教育问题生成工具,但其功效在不同的认知水平上有所不同,这表明需要进一步改进以完全满足教育标准。

[NLP-18] Reward Steering with Evolutionary Heuristics for Decoding-time Alignment
[NLP-18] 利用进化启发式方法进行奖励引导以实现解码时对齐

链接: https://arxiv.org/abs/2406.15193
作者: Chia-Yu Hung,Navonil Majumder,Ambuj Mehrish,Soujanya Poria
关键词: align LLM responses, align LLM, widespread applicability, applicability and increasing, increasing omnipresence
中文关键词: 调整LLM响应、调整LLM、广泛适用性、适用性和日益普遍的存在性
类目: Computation and Language (cs.CL)
备注:

Abstract:The widespread applicability and increasing omnipresence of LLMs have instigated a need to align LLM responses to user and stakeholder preferences. Many preference optimization approaches have been proposed that fine-tune LLM parameters to achieve good alignment. However, such parameter tuning is known to interfere with model performance on many tasks. Moreover, keeping up with shifting user preferences is tricky in such a situation. Decoding-time alignment with reward model guidance solves these issues at the cost of increased inference time. However, most of such methods fail to strike the right balance between exploration and exploitation of reward (often due to the conflated formulation of these two aspects) to give well-aligned responses. To remedy this we decouple these two aspects and implement them in an evolutionary fashion: exploration is enforced by decoding from mutated instructions and exploitation is represented as the periodic replacement of poorly-rewarded generations with well-rewarded ones. Empirical evidence indicates that this strategy outperforms many preference optimization and decode-time alignment approaches on two widely accepted alignment benchmarks AlpacaEval 2 and MT-Bench. Our implementation will be available at: this https URL.
摘要:LLM的广泛适用性和日益无处不在的存在促使人们需要使LLM的响应与用户和利益相关者的偏好保持一致。已经提出了许多通过微调LLM参数来实现良好对齐的偏好优化方法。然而,众所周知,这种参数调整会干扰模型在许多任务上的性能。此外,在这种情况下,跟上用户偏好的变化是很棘手的。采用奖励模型指导的解码时对齐以增加推理时间为代价解决了这些问题。然而,大多数这类方法未能在奖励的探索和利用之间取得适当的平衡(这往往是由于将这两个方面的表述混为一谈),因而无法给出良好对齐的响应。为了纠正这一点,我们将这两个方面解耦,并以一种进化的方式实现它们:探索通过从变异的指令中解码来强制执行,而利用则表现为用高奖励的生成结果周期性地替换低奖励的生成结果。实验证据表明,在AlpacaEval 2和MT-Bench这两个广泛接受的对齐基准上,该策略的性能优于许多偏好优化和解码时对齐方法。我们的实现将在以下地址提供:this https URL。

[NLP-19] Hybrid Alignment Training for Large Language Models
[NLP-19] 大型语言模型的混合对齐训练

链接: https://arxiv.org/abs/2406.15178
作者: Chenglong Wang,Hang Zhou,Kaiyan Chang,Bei Li,Yongyu Mu,Tong Xiao,Tongran Liu,Jingbo Zhu
关键词: large language models, enabling large language, language models, Alignment training, crucial for enabling
中文关键词: 大型语言模型,支持大型语言,语言模型,对齐训练,对于支持至关重要
类目: Computation and Language (cs.CL)
备注: accepted by ACL (Findings) 2024

Abstract:Alignment training is crucial for enabling large language models (LLMs) to cater to human intentions and preferences. It is typically performed based on two stages with different objectives: instruction-following alignment and human-preference alignment. However, aligning LLMs with these objectives in sequence suffers from an inherent problem: the objectives may conflict, and the LLMs cannot guarantee to simultaneously align with the instructions and human preferences well. To respond to these, in this work, we propose a Hybrid Alignment Training (Hbat) approach, based on alternating alignment and modified elastic weight consolidation methods. The basic idea is to alternate between different objectives during alignment training, so that better collaboration can be achieved between the two alignment tasks. We experiment with Hbat on summarization and dialogue tasks. Experimental results show that the proposed Hbat can significantly outperform all baselines. Notably, Hbat yields consistent performance gains over the traditional two-stage alignment training when using both proximal policy optimization and direct preference optimization.
摘要:对齐训练对于使大型语言模型(LLM)能够迎合人类的意图和偏好至关重要。它通常分为两个目标不同的阶段进行:指令遵循对齐和人类偏好对齐。然而,将LLM与这些目标按顺序对齐存在一个固有的问题:目标可能会冲突,并且LLM不能保证同时很好地与指令和人类偏好保持一致。针对这些问题,本文提出了一种基于交替对齐和改进的弹性权重巩固方法的混合对齐训练方法(Hbat)。基本思想是在对齐训练过程中在不同目标之间交替进行,以便在两个对齐任务之间实现更好的协作。我们在摘要和对话任务上对Hbat进行了实验。实验结果表明,所提出的Hbat明显优于所有基线。值得注意的是,无论使用近端策略优化还是直接偏好优化,Hbat相比传统的两阶段对齐训练都带来了一致的性能提升。

[NLP-20] Enhancing Idiomatic Representation in Multiple Languages via an Adaptive Contrastive Triplet Loss
[NLP-20] 通过自适应对比三元组损失增强多种语言的习语表示

链接: https://arxiv.org/abs/2406.15175
作者: Wei He,Marco Idiart,Carolina Scarton,Aline Villavicencio
关键词: Natural Language Processing, Accurately modeling idiomatic, Accurately modeling, Language Processing, Natural Language
中文关键词: 自然语言处理,准确建模习语,准确建模,语言处理,自然语言
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

Abstract:Accurately modeling idiomatic or non-compositional language has been a longstanding challenge in Natural Language Processing (NLP). This is partly because these expressions do not derive their meanings solely from their constituent words, but also due to the scarcity of relevant data resources, and their impact on the performance of downstream tasks such as machine translation and simplification. In this paper we propose an approach to model idiomaticity effectively using a triplet loss that incorporates the asymmetric contribution of component words to an idiomatic meaning for training language models by using adaptive contrastive learning and resampling miners to build an idiomatic-aware learning objective. Our proposed method is evaluated on a SemEval challenge and outperforms previous alternatives significantly in many metrics.
摘要:准确建模习语性或非组合性语言一直是自然语言处理(NLP)中的一个长期挑战。这部分是因为这些表达的含义不仅仅来自其组成词,还因为相关数据资源的稀缺性,以及它们对机器翻译和简化等下游任务的性能的影响。在本文中,我们提出了一种使用三元组损失来有效地建模习语性的方法,该损失将成分词对习语意义的不对称贡献纳入语言模型的训练,并通过自适应对比学习和重采样挖掘器来构建习语感知的学习目标。我们提出的方法在SemEval挑战中进行了评估,并且在许多指标上显著优于之前的替代方案。
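
For readers unfamiliar with triplet losses, the standard margin-based form is $L = \max(0,\ d(a, p) - d(a, n) + m)$, where $a$ is the anchor, $p$ a positive example, $n$ a negative example, and $m$ the margin. The sketch below implements only this textbook version with Euclidean distance; it is not the adaptive, asymmetric variant the paper proposes, and the vectors are made-up toy embeddings.

```python
import math


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss with Euclidean distance.

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is.
    """
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Toy 2-D embeddings (hypothetical values, for illustration only).
anchor = (0.0, 0.0)
positive = (0.0, 1.0)       # distance 1 from the anchor
far_negative = (3.0, 4.0)   # distance 5 from the anchor
near_negative = (1.0, 0.0)  # distance 1 from the anchor

easy = triplet_loss(anchor, positive, far_negative)   # negative already far away
hard = triplet_loss(anchor, positive, near_negative)  # negative as close as the positive
print(easy, hard)
```

In an idiom setting, the anchor might be an idiomatic expression in context, the positive a paraphrase of its figurative meaning, and the negative a literal reading; that mapping is an assumption for illustration.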

[NLP-21] A Syntax-Injected Approach for Faster and More Accurate Sentiment Analysis
[NLP-21] 一种注入语法的方法,用于更快、更准确的情绪分析

链接: https://arxiv.org/abs/2406.15163
作者: Muhammad Imran,Olga Kellert,Carlos Gómez-Rodríguez
关键词: Natural Language Processing, Language Processing, Natural Language, addressing subjective assessments, aspect of Natural
中文关键词: 自然语言处理,语言处理,自然语言,解决主观评估,自然方面
类目: Computation and Language (cs.CL)
备注:

Abstract:Sentiment Analysis (SA) is a crucial aspect of Natural Language Processing (NLP), addressing subjective assessments in textual content. Syntactic parsing is useful in SA because explicit syntactic information can improve accuracy while providing explainability, but it tends to be a computational bottleneck in practice due to the slowness of parsing algorithms. This paper addresses said bottleneck by using a SEquence Labeling Syntactic Parser (SELSP) to inject syntax into SA. By treating dependency parsing as a sequence labeling problem, we greatly enhance the speed of syntax-based SA. SELSP is trained and evaluated on a ternary polarity classification task, demonstrating its faster performance and better accuracy in polarity prediction tasks compared to conventional parsers like Stanza and to heuristic approaches that use shallow syntactic rules for SA like VADER. This increased speed and improved accuracy make SELSP particularly appealing to SA practitioners in both research and industry. In addition, we test several sentiment dictionaries on our SELSP to see which one improves the performance in polarity prediction tasks. Moreover, we compare the SELSP with Transformer-based models trained on a 5-label classification task. The results show that dictionaries that capture polarity judgment variation provide better results than dictionaries that ignore polarity judgment variation. Moreover, we show that SELSP is considerably faster than Transformer-based models in polarity prediction tasks.
摘要:情感分析(SA)是自然语言处理(NLP)的一个重要方面,它处理文本内容中的主观评估。句法分析在SA中很有用,因为显式的句法信息可以在提供可解释性的同时提高准确性,但由于句法分析算法的缓慢,它在实践中往往是一个计算瓶颈。针对这一瓶颈,本文使用序列标注句法分析器(SELSP)将句法注入到SA中。通过将依存分析问题转化为序列标注问题,大大提高了基于句法的SA的速度。SELSP在一个三元极性分类任务上进行了训练和评估,表明它在极性预测任务中比传统的句法分析器(如Stanza)和使用浅层句法规则的启发式方法(如VADER)具有更快的速度和更好的准确性。这种更快的速度和更高的准确性使SELSP对研究界和工业界的SA从业者特别有吸引力。此外,我们在我们的SELSP上测试了几个情感词典,看看哪一个在极性预测任务中提高了性能。此外,我们将SELSP与在5标签分类任务上训练的基于Transformer的模型进行了比较。结果表明,捕捉极性判断变异的词典比忽略极性判断变异的词典具有更好的效果。此外,我们还表明,在极性预测任务中,SELSP比基于Transformer的模型要快得多。

[NLP-22] Assessing Good, Bad and Ugly Arguments Generated by ChatGPT: a New Dataset, its Methodology and Associated Tasks
[NLP-22] 评估ChatGPT生成的好、坏和丑论点:新数据集、方法论和相关任务

链接: https://arxiv.org/abs/2406.15130
作者: Victor Hugo Nascimento Rocha,Igor Cataneo Silveira,Paulo Pirozelli,Denis Deratani Mauá,Fabio Gagliardi Cozman
关键词: Large Language Models, Large Language, Language Models, success of Large, spread misinformation
中文关键词: 大型语言模型,大型语言,语言模型,大型的成功,传播错误信息
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

Abstract:The recent success of Large Language Models (LLMs) has sparked concerns about their potential to spread misinformation. As a result, there is a pressing need for tools to identify “fake arguments” generated by such models. To create these tools, examples of texts generated by LLMs are needed. This paper introduces a methodology to obtain good, bad and ugly arguments from argumentative essays produced by ChatGPT, OpenAI’s LLM. We then describe a novel dataset containing a set of diverse arguments, ArGPT. We assess the effectiveness of our dataset and establish baselines for several argumentation-related tasks. Finally, we show that the artificially generated data relates well to human argumentation and thus is useful as a tool to train and test systems for the defined tasks.
摘要:大型语言模型(LLM)最近的成功引发了人们对其传播错误信息潜力的担忧。因此,迫切需要工具来识别此类模型生成的“假论点”。要创建这些工具,需要LLM生成的文本示例。本文介绍了一种从OpenAI的LLM ChatGPT生成的议论文中获取好、坏和丑陋论点的方法。然后,我们描述了一个包含一组多样论点的新颖数据集ArGPT。我们评估数据集的有效性,并为几项与论辩相关的任务建立基线。最后,我们表明人工生成的数据与人类论证有很好的相关性,因此可以作为针对定义任务训练和测试系统的工具。

[NLP-23] On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation: A Survey
[NLP-23] LLM驱动的合成数据生成、策划和评估:概览

链接: https://arxiv.org/abs/2406.15126
作者: Lin Long,Rui Wang,Ruixuan Xiao,Junbo Zhao,Xiao Ding,Gang Chen,Haobo Wang
关键词: Large Language Models, synthetic data generation, deep learning, long-standing problem, data generation
中文关键词: 大型语言模型、合成数据生成、深度学习、长期存在的问题、数据生成
类目: Computation and Language (cs.CL)
备注: A survey on LLMs-driven synthetic data generation, curation and evaluation

Abstract:Within the evolving landscape of deep learning, the dilemma of data quantity and quality has been a long-standing problem. The recent advent of Large Language Models (LLMs) offers a data-centric solution to alleviate the limitations of real-world data with synthetic data generation. However, current investigations into this field lack a unified framework and mostly stay on the surface. Therefore, this paper provides an organization of relevant studies based on a generic workflow of synthetic data generation. By doing so, we highlight the gaps within existing research and outline prospective avenues for future study. This work aims to shepherd the academic and industrial communities towards deeper, more methodical inquiries into the capabilities and applications of LLMs-driven synthetic data generation.
摘要:在深度学习不断变化的格局中,数据数量和质量的困境一直是一个长期存在的问题。大型语言模型(LLM)的最近出现提供了一种以数据为中心的解决方案,通过合成数据生成来缓解现实世界数据的局限性。然而,目前对该领域的研究缺乏统一的框架,大多停留在表面。因此,本文基于合成数据生成的通用工作流程提供了相关研究的组织。通过这样做,我们强调了现有研究中的差距,并概述了未来研究的潜在途径。这项工作旨在引导学术界和工业界对LLM驱动的合成数据生成的能力和应用进行更深入、更系统的研究。

[NLP-24] Investigating the impact of 2D gesture representation on co-speech gesture generation
[NLP-24] 调查2D手势表示对共语音手势生成的影响

链接: https://arxiv.org/abs/2406.15111
作者: Teo Guichoux,Laure Soulier,Nicolas Obin,Catherine Pelachaud
关键词: embodied conversational agents, Co-speech gestures play, natural co-speech gestures, Co-speech gestures, co-speech gestures synchronized
中文关键词: 嵌入式对话代理、共语音手势播放、自然共语音手势、共语音手势、同步共语音手势
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages. Paper accepted at WACAI 2024

Abstract:Co-speech gestures play a crucial role in the interactions between humans and embodied conversational agents (ECA). Recent deep learning methods enable the generation of realistic, natural co-speech gestures synchronized with speech, but such approaches require large amounts of training data. “In-the-wild” datasets, which compile videos from sources such as YouTube through human pose detection models, offer a solution by providing 2D skeleton sequences that are paired with speech. Concurrently, innovative lifting models have emerged, capable of transforming these 2D pose sequences into their 3D counterparts, leading to large and diverse datasets of 3D gestures. However, the derived 3D pose estimation is essentially a pseudo-ground truth, with the actual ground truth being the 2D motion data. This distinction raises questions about the impact of gesture representation dimensionality on the quality of generated motions, a topic that, to our knowledge, remains largely unexplored. In this work, we evaluate the impact of the dimensionality of the training data, 2D or 3D joint coordinates, on the performance of a multimodal speech-to-gesture deep generative model. We use a lifting model to convert 2D-generated sequences of body pose to 3D. Then, we compare the sequence of gestures generated directly in 3D to the gestures generated in 2D and lifted to 3D as post-processing.
摘要:伴随语音的手势在人类与具身对话智能体(ECA)的交互中起着至关重要的作用。最近的深度学习方法能够生成与语音同步的真实、自然的伴随语音手势,但这种方法需要大量的训练数据。“野外”数据集通过人体姿态检测模型从YouTube等来源汇编视频,通过提供与语音配对的2D骨架序列提供了一种解决方案。与此同时,出现了创新的提升(lifting)模型,能够将这些2D姿态序列转换为3D姿态序列,从而产生大量且多样化的3D手势数据集。然而,导出的3D姿态估计本质上是伪真值,而实际的真值是2D运动数据。这一区别引发了关于手势表示维度对生成运动质量的影响的问题,据我们所知,这一主题在很大程度上仍未被探索。在这项工作中,我们评估了训练数据的维度(2D或3D关节坐标)对多模态语音到手势深度生成模型的性能的影响。我们使用提升模型将2D生成的身体姿态序列转换为3D。然后,我们将直接在3D中生成的手势序列与在2D中生成并作为后处理提升到3D的手势序列进行比较。

[NLP-25] Brain-Like Language Processing via a Shallow Untrained Multihead Attention Network
[NLP-25] 通过浅层未经训练的多头注意力网络进行类大脑语言处理

链接: https://arxiv.org/abs/2406.15109
作者: Badr AlKhamissi,Greta Tuckute,Antoine Bosselut,Martin Schrimpf
关键词: Large Language Models, Large Language, brain, alignment, predicting most explainable
中文关键词: 大型语言模型,大型语言,大脑,对齐,预测最可解释的
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Preprint

Abstract:Large Language Models (LLMs) have been shown to be effective models of the human language system, with some models predicting most explainable variance of brain activity in current datasets. Even in untrained models, the representations induced by architectural priors can exhibit reasonable alignment to brain data. In this work, we investigate the key architectural components driving the surprising alignment of untrained models. To estimate LLM-to-brain similarity, we first select language-selective units within an LLM, similar to how neuroscientists identify the language network in the human brain. We then benchmark the brain alignment of these LLM units across five different brain recording datasets. By isolating critical components of the Transformer architecture, we identify tokenization strategy and multihead attention as the two major components driving brain alignment. A simple form of recurrence further improves alignment. We further demonstrate this quantitative brain alignment of our model by reproducing landmark studies in the language neuroscience field, showing that localized model units, just like language voxels measured empirically in the human brain, discriminate more reliably between lexical than syntactic differences, and exhibit similar response profiles under the same experimental conditions. Finally, we demonstrate the utility of our model’s representations for language modeling, achieving improved sample and parameter efficiency over comparable architectures. Our model’s estimates of surprisal set a new state-of-the-art in the behavioral alignment to human reading times. Taken together, we propose a highly brain- and behaviorally-aligned model that conceptualizes the human language system as an untrained shallow feature encoder, with structural priors, combined with a trained decoder to achieve efficient and performant language processing.
摘要:大型语言模型(LLM)已被证明是人类语言系统的有效模型,一些模型预测了当前数据集中大脑活动的大部分可解释方差。即使在未经训练的模型中,由架构先验诱导的表征也可以显示出与大脑数据合理的一致性。在这项工作中,我们研究了驱动未经训练的模型产生这种惊人对齐的关键架构组件。为了估计LLM与大脑的相似性,我们首先选择LLM中的语言选择性单元,类似于神经科学家识别人脑中的语言网络的方式。然后,我们在五个不同的大脑记录数据集上对这些LLM单元的大脑对齐进行基准测试。通过隔离Transformer架构的关键组件,我们确定分词策略和多头注意力是驱动大脑对齐的两个主要组件。一种简单的循环形式可以进一步改进对齐。我们通过重现语言神经科学领域的里程碑式研究进一步证明了我们模型的这种定量大脑对齐,表明局部化的模型单元(就像在人脑中经验测量的语言体素一样)对词汇差异的区分比对句法差异的区分更可靠,并在相同的实验条件下显示出类似的反应曲线。最后,我们展示了我们的模型表征用于语言建模的实用性,与同类架构相比,实现了更高的样本和参数效率。我们的模型对惊异度(surprisal)的估计在与人类阅读时间的行为对齐方面达到了新的最先进水平。综上所述,我们提出了一个与大脑和行为高度一致的模型,该模型将人类语言系统概念化为一个带有结构先验的未经训练的浅层特征编码器,与训练过的解码器相结合,以实现高效和高性能的语言处理。

[NLP-26] A Unified Framework for Input Feature Attribution Analysis
[NLP-26] 输入特征归因分析的统一框架

链接: https://arxiv.org/abs/2406.15085
作者: Jingyi Sun,Pepa Atanasova,Isabelle Augenstein
关键词: Explaining the decision-making, Louvain Span Interactions, reliability and fairness, Bivariate Shapley, Integrated Gradients
中文关键词: 解释决策、Louvain Span交互、可靠性和公平性、二元Shapley、积分梯度
类目: Computation and Language (cs.CL)
备注:

Abstract:Explaining the decision-making process of machine learning models is crucial for ensuring their reliability and fairness. One popular explanation form highlights key input features, such as i) tokens (e.g., Shapley Values and Integrated Gradients), ii) interactions between tokens (e.g., Bivariate Shapley and Attention-based methods), or iii) interactions between spans of the input (e.g., Louvain Span Interactions). However, these explanation types have only been studied in isolation, making it difficult to judge their respective applicability. To bridge this gap, we propose a unified framework that facilitates a direct comparison between highlight and interactive explanations comprised of four diagnostic properties. Through extensive analysis across these three types of input feature explanations–each utilizing three different explanation techniques–across two datasets and two models, we reveal that each explanation type excels in terms of different diagnostic properties. In our experiments, highlight explanations are the most faithful to a model’s prediction, and interactive explanations provide better utility for learning to simulate a model’s predictions. These insights further highlight the need for future research to develop combined methods that enhance all diagnostic properties.
摘要:解释机器学习模型的决策过程是保证其可靠性和公平性的关键。一种流行的解释形式突出了关键输入特征,例如i)令牌(例如Shapley值和积分梯度),ii)令牌之间的交互(例如,二元Shapley和基于注意力的方法),或iii)输入跨度之间的交互(例如,Louvain Span交互)。然而,这些解释类型仅被孤立地研究,因此很难判断它们各自的适用性。为了弥补这一差距,我们提出了一个统一的框架,该框架便于在突出显示和交互解释之间进行直接比较,包括四个诊断属性。通过对这三种类型的输入特征解释的广泛分析–每种类型都使用了三种不同的解释技术–跨越两个数据集和两个模型,我们揭示了每种解释类型在不同的诊断属性方面的优势。在我们的实验中,突出显示的解释是对模型预测最忠实的,交互式解释为学习模拟模型的预测提供了更好的效用。这些见解进一步强调了未来研究的必要性,以开发增强所有诊断特性的组合方法。

[NLP-27] Cross-lingual paraphrase identification
[NLP-27] 跨语言重述识别

链接: https://arxiv.org/abs/2406.15066
作者: Inessa Fedorova,Aleksei Musatow
关键词: involves measuring semantic, measuring semantic similarity, short sentences, task involves measuring, involves measuring
中文关键词: 涉及测量语义,测量语义相似性,短句,任务涉及测量,涉及测量
类目: Computation and Language (cs.CL)
备注:

Abstract:The paraphrase identification task involves measuring semantic similarity between two short sentences. It is a tricky task, and multilingual paraphrase identification is even more challenging. In this work, we train a bi-encoder model in a contrastive manner to detect hard paraphrases across multiple languages. This approach allows us to use model-produced embeddings for various tasks, such as semantic search. We evaluate our model on downstream tasks and also assess embedding space quality. Our performance is comparable to state-of-the-art cross-encoders, with only a minimal relative drop of 7-10% on the chosen dataset, while keeping decent quality of embeddings.
摘要:重述识别任务涉及测量两个短句之间的语义相似性。这是一项棘手的任务,多语言重述识别更具挑战性。在这项工作中,我们以对比的方式训练双编码器模型,以检测跨多种语言的硬转述。这种方法允许我们使用模型生成的嵌入来执行各种任务,例如语义搜索。我们评估下游任务的模型,并评估嵌入空间质量。我们的性能与最先进的交叉编码器相当,所选数据集的相对下降仅为7-10%,同时保持不错的嵌入质量。

[NLP-28] PARIKSHA : A Large-Scale Investigation of Human-LLM Evaluator Agreement on Multilingual and Multi-Cultural Data
[NLP-28] PARIKSHA:对多语言和多文化数据的人类LLM评估者协议的大规模调查

链接: https://arxiv.org/abs/2406.15053
作者: Ishaan Watts,Varun Gumma,Aditya Yadavalli,Vivek Seshadri,Manohar Swaminathan,Sunayana Sitaram
关键词: Large Language Models, sufficient linguistic diversity, multilingual Large Language, LLM pre-training data, Large Language
中文关键词: 大型语言模型、足够的语言多样性、多语言大型语言、LLM预训练数据、大型语言
类目: Computation and Language (cs.CL)
备注: Work in progress

Abstract:Evaluation of multilingual Large Language Models (LLMs) is challenging due to a variety of factors – the lack of benchmarks with sufficient linguistic diversity, contamination of popular benchmarks into LLM pre-training data and the lack of local, cultural nuances in translated benchmarks. In this work, we study human and LLM-based evaluation in a multilingual, multi-cultural setting. We evaluate 30 models across 10 Indic languages by conducting 90K human evaluations and 30K LLM-based evaluations and find that models such as GPT-4o and Llama-3 70B consistently perform best for most Indic languages. We build leaderboards for two evaluation settings - pairwise comparison and direct assessment and analyse the agreement between humans and LLMs. We find that humans and LLMs agree fairly well in the pairwise setting but the agreement drops for direct assessment evaluation especially for languages such as Bengali and Odia. We also check for various biases in human and LLM-based evaluation and find evidence of self-bias in the GPT-based evaluator. Our work presents a significant step towards scaling up multilingual evaluation of LLMs.
摘要:由于多种因素的影响,多语言大型语言模型(LLM)的评估具有挑战性:缺乏具有足够语言多样性的基准,流行的基准混入LLM的预训练数据,以及翻译的基准缺乏当地的文化细微差别。在这项工作中,我们研究了多语言、多文化背景下的人类评估和基于LLM的评估。我们通过进行90K人工评估和30K基于LLM的评估,对10种印度语言的30个模型进行了评估,发现GPT-4o和Llama-3 70B等模型在大多数印度语言上始终表现最好。我们建立了两种评估设置(成对比较和直接评估)的排行榜,并分析了人类与LLM之间的一致性。我们发现人类和LLM在成对比较设置下的一致性相当好,但在直接评估中一致性下降,特别是对孟加拉语和奥里亚语这样的语言。我们还检查了人类和基于LLM的评估中的各种偏见,并在基于GPT的评估器中发现了自我偏见的证据。我们的工作朝着扩大LLM多语言评估的方向迈出了重要的一步。

[NLP-29] Tri-VQA: Triangular Reasoning Medical Visual Question Answering for Multi-Attribute Analysis
[NLP-29] Tri-VQA:用于多属性分析的三角推理医学视觉问答

链接: https://arxiv.org/abs/2406.15050
作者: Lin Fan,Xun Gong,Cenyang Zheng,Yafei Ou
关键词: Visual Question Answering, challenging research topic, advantages including patient, including patient engagement, clinical expert involvement
中文关键词: 视觉问题解答、具有挑战性的研究主题、包括患者在内的优势,包括患者参与、临床专家参与
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

Abstract:The intersection of medical Visual Question Answering (Med-VQA) is a challenging research topic with advantages including patient engagement and clinical expert involvement for second opinions. However, existing Med-VQA methods based on joint embedding fail to explain whether their provided results are based on correct reasoning or coincidental answers, which undermines the credibility of VQA answers. In this paper, we investigate the construction of a more cohesive and stable Med-VQA structure. Motivated by causal effect, we propose a novel Triangular Reasoning VQA (Tri-VQA) framework, which constructs reverse causal questions from the perspective of “Why this answer?” to elucidate the source of the answer and stimulate more reasonable forward reasoning processes. We evaluate our method on the Endoscopic Ultrasound (EUS) multi-attribute annotated dataset from five centers, and test it on medical VQA datasets. Experimental results demonstrate the superiority of our approach over existing methods. Our codes and pre-trained models are available at https://anonymous.4open.science/r/Tri_VQA.
摘要:医学视觉问答(Med-VQA)是一个极具挑战性的研究课题,其优点包括提高患者参与度以及让临床专家参与提供第二意见。然而,现有的基于联合嵌入的Med-VQA方法不能解释它们所提供的结果是基于正确的推理还是基于巧合的答案,这破坏了VQA答案的可信度。在本文中,我们研究了一个更有内聚力和稳定性的Med-VQA结构的构建。在因果效应的启发下,我们提出了一种新的三角推理VQA(Tri-VQA)框架,它从“为什么是这个答案?”的角度来构造反向因果问题,以阐明答案的来源,并激发更合理的正向推理过程。我们在来自五个中心的超声内镜(EUS)多属性标注数据集上对我们的方法进行了评估,并在医学VQA数据集上进行了测试。实验结果表明,我们的方法优于现有方法。我们的代码和预训练模型可在 https://anonymous.4open.science/r/Tri_VQA 上获得。

[NLP-30] Harnessing Knowledge Retrieval with Large Language Models for Clinical Report Error Correction
[NLP-30] 利用大型语言模型的知识检索进行临床报告错误纠正

链接: https://arxiv.org/abs/2406.15045
作者: Jinge Wu,Zhaolong Wu,Abul Hasan,Yunsoo Kim,Jason P.Y. Cheung,Teng Zhang,Honghan Wu
关键词: large language models, leveraging large language, leveraging large, language models, retrieval-augmented generation
中文关键词: 大型语言模型、利用大型语言、利用大型语言模型、检索增强生成
类目: Computation and Language (cs.CL)
备注:

Abstract:This study proposes an approach for error correction in clinical radiology reports, leveraging large language models (LLMs) and retrieval-augmented generation (RAG) techniques. The proposed framework employs internal and external retrieval mechanisms to extract relevant medical entities and relations from the report and external knowledge sources. A three-stage inference process is introduced, decomposing the task into error detection, localization, and correction subtasks, which enhances the explainability and performance of the system. The effectiveness of the approach is evaluated using a benchmark dataset created by corrupting real-world radiology reports with realistic errors, guided by domain experts. Experimental results demonstrate the benefits of the proposed methods, with the combination of internal and external retrieval significantly improving the accuracy of error detection, localization, and correction across various state-of-the-art LLMs. The findings contribute to the development of more robust and reliable error correction systems for clinical documentation.
摘要:这项研究提出了一种利用大型语言模型(LLM)和检索增强生成(RAG)技术来纠正临床放射学报告中的错误的方法。该框架使用内部和外部检索机制,从报告和外部知识源中提取相关的医疗实体和关系。引入了一个三阶段推理过程,将任务分解为错误检测、定位和纠正三个子任务,增强了系统的可解释性和性能。该方法的有效性是使用一个基准数据集来评估的,该基准数据集是在领域专家的指导下,通过向真实世界的放射学报告中注入符合实际的错误而创建的。实验结果证明了所提出的方法的好处,内部和外部检索的结合显著提高了各种最先进的LLM在错误检测、定位和纠正方面的准确性。这些发现有助于为临床文档开发更强大和更可靠的纠错系统。

[NLP-31] Online detection and infographic explanation of spam reviews with data drift adaptation
[NLP-31] 具有数据漂移适应的垃圾邮件评论的在线检测和信息图表解释

链接: https://arxiv.org/abs/2406.15038
作者: Francisco de Arriba-Pérez,Silvia García-Méndez,Fátima Leal,Benedita Malheiro,J. C. Burguillo
关键词: online platforms due, impact on reputation, platforms due, significant impact, Spam
中文关键词: 在线平台到期,对声誉的影响,平台到期,重大影响,垃圾邮件
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Social and Information Networks (cs.SI)
备注:

Abstract:Spam reviews are a pervasive problem on online platforms due to their significant impact on reputation. However, research into spam detection in data streams is scarce. Another concern lies in their need for transparency. Consequently, this paper addresses those problems by proposing an online solution for identifying and explaining spam reviews, incorporating data drift adaptation. It integrates (i) incremental profiling, (ii) data drift detection and adaptation, and (iii) identification of spam reviews employing Machine Learning. The explainable mechanism displays a visual and textual prediction explanation in a dashboard. The best results obtained reached up to 87% spam F-measure.
摘要:垃圾评论因其对声誉的重大影响而成为在线平台上普遍存在的问题。然而,针对数据流中垃圾信息检测的研究很少。另一个问题是这类方法对透明度的需求。因此,本文提出了一种用于识别和解释垃圾评论的在线解决方案,并结合数据漂移适应来解决这些问题。它集成了(i)增量画像、(ii)数据漂移检测与适应以及(iii)使用机器学习识别垃圾评论。可解释机制在仪表板中显示视觉和文本形式的预测解释。获得的最佳结果达到了87%的垃圾评论F值。
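The F-measure reported above combines precision and recall on the spam class. As a minimal sketch (the counts below are hypothetical and not taken from the paper, and the function name `f_measure` is only illustrative), it can be computed from confusion counts like this:

```python
def f_measure(tp, fp, fn, beta=1.0):
    """F-beta score for the positive (spam) class from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical counts for illustration only.
score = f_measure(tp=80, fp=20, fn=10)
print(round(score, 4))
```

In a streaming setting the counts would typically be updated after each incoming review, so the score reflects performance over time rather than on a fixed test set; how the authors do this is not stated in the abstract.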

[NLP-32] GiusBERTo: A Legal Language Model for Personal Data De-identification in Italian Court of Auditors Decisions
[NLP-32] GiusBERTo:意大利审计法院判决中个人数据去身份化的法律语言模型

链接: https://arxiv.org/abs/2406.15032
作者: Giulio Salierno,Rosamaria Bertè,Luca Attias,Carla Morrone,Dario Pettazzoni,Daniela Battisti
关键词: Natural Language Processing, Natural Language, Language Processing, Recent advances, pretrained language models
中文关键词: 自然语言处理,自然语言,语言处理,最新进展,预训练语言模型
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 14 pages, 4 figures, 6 Tables

Abstract:Recent advances in Natural Language Processing have demonstrated the effectiveness of pretrained language models like BERT for a variety of downstream tasks. We present GiusBERTo, the first BERT-based model specialized for anonymizing personal data in Italian legal documents. GiusBERTo is trained on a large dataset of Court of Auditors decisions to recognize entities to anonymize, including names, dates, locations, while retaining contextual relevance. We evaluate GiusBERTo on a held-out test set and achieve 97% token-level accuracy. GiusBERTo provides the Italian legal community with an accurate and tailored BERT model for de-identification, balancing privacy and data protection.
摘要:自然语言处理的最新进展已经证明了BERT等预训练语言模型对于各种下游任务的有效性。我们提出GiusBERTo,这是第一个基于BERT、专门用于对意大利法律文件中的个人数据进行匿名化的模型。GiusBERTo在审计法院裁决的大型数据集上训练,用于识别需要匿名化的实体,包括姓名、日期、地点,同时保留上下文相关性。我们在留出的测试集上评估GiusBERTo,达到了97%的词元级准确率。GiusBERTo为意大利法律界提供了准确且量身定制的BERT去标识化模型,兼顾隐私与数据保护。

[NLP-33] MedOdyssey: A Medical Domain Benchmark for Long Context Evaluation Up to 200K Tokens
[NLP-33] MedOdyssey:面向最长20万词元长上下文评估的医学领域基准

链接: https://arxiv.org/abs/2406.15019
作者: Yongqi Fan,Hongli Sun,Kui Xue,Xiaofan Zhang,Shaoting Zhang,Tong Ruan
关键词: Large Language Models, Numerous advanced Large, advanced Large Language, Language Models, Large Language
中文关键词: 大型语言模型,众多高级大型,高级大型语言,语言模型,大型语言
类目: Computation and Language (cs.CL)
备注:

Abstract:Numerous advanced Large Language Models (LLMs) now support context lengths up to 128K, and some extend to 200K. Some benchmarks in the generic domain have also followed up on evaluating long-context capabilities. In the medical domain, tasks are distinctive due to the unique contexts and need for domain expertise, necessitating further evaluation. However, despite the frequent presence of long texts in medical scenarios, evaluation benchmarks of long-context capabilities for LLMs in this field are still rare. In this paper, we propose MedOdyssey, the first medical long-context benchmark with seven length levels ranging from 4K to 200K tokens. MedOdyssey consists of two primary components: the medical-context “needles in a haystack” task and a series of tasks specific to medical applications, together comprising 10 datasets. The first component includes challenges such as counter-intuitive reasoning and novel (unknown) facts injection to mitigate knowledge leakage and data contamination of LLMs. The second component confronts the challenge of requiring professional medical expertise. Especially, we design the “Maximum Identical Context” principle to improve fairness by guaranteeing that different LLMs observe as many identical contexts as possible. Our experiment evaluates advanced proprietary and open-source LLMs tailored for processing long contexts and presents detailed performance analyses. This highlights that LLMs still face challenges and need for further research in this area. Our code and data are released in the repository: this https URL.
摘要:许多先进的大型语言模型(LLM)现在支持高达128K的上下文长度,有些甚至扩展到200K。通用领域的一些基准也已跟进评估长上下文能力。在医学领域,由于独特的上下文和对领域专业知识的需要,任务具有特殊性,需要进一步评估。然而,尽管医疗场景中经常出现长文本,该领域针对LLM长上下文能力的评估基准仍然很少。在本文中,我们提出了MedOdyssey,这是第一个医学长上下文基准,具有从4K到200K词元的七个长度级别。MedOdyssey由两个主要部分组成:医学上下文“大海捞针”任务和一系列特定于医疗应用的任务,共包含10个数据集。第一部分包含反直觉推理和注入新的(未知)事实等挑战,以减轻LLM的知识泄漏和数据污染。第二部分面对的是需要专业医学知识的挑战。特别地,我们设计了“最大相同上下文”原则,通过保证不同的LLM尽可能多地观察到相同的上下文来提高公平性。我们的实验评估了为处理长上下文而定制的先进专有和开源LLM,并给出了详细的性能分析。这突出表明,LLM仍然面临挑战,该领域需要进一步研究。我们的代码和数据已在存储库中发布:this https URL。

[NLP-34] GraLMatch: Matching Groups of Entities with Graphs and Language Models
[NLP-34] GraLMatch:将实体组与图形和语言模型进行匹配

链接: https://arxiv.org/abs/2406.15015
作者: Fernando De Meer Pardo,Claude Lehmann,Dennis Gehrig,Andrea Nagy,Stefano Nicoli,Branka Hadji Misheva,Martin Braschler,Kurt Stockinger
关键词: entity group matching, records, call entity group, multiple data sources, Matching
中文关键词: 实体组匹配、记录、调用实体组、多个数据源、匹配
类目: Databases (cs.DB); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 12 pages, 4 figures, accepted as research paper at EDBT 2025

Abstract:In this paper, we present an end-to-end multi-source Entity Matching problem, which we call entity group matching, where the goal is to assign to the same group, records originating from multiple data sources but representing the same real-world entity. We focus on the effects of transitively matched records, i.e. the records connected by paths in the graph G = (V,E) whose nodes and edges represent the records and whether they are a match or not. We present a real-world instance of this problem, where the challenge is to match records of companies and financial securities originating from different data providers. We also introduce two new multi-source benchmark datasets that present similar matching challenges as real-world records. A distinctive characteristic of these records is that they are regularly updated following real-world events, but updates are not applied uniformly across data sources. This phenomenon makes the matching of certain groups of records only possible through the use of transitive information. In our experiments, we illustrate how considering transitively matched records is challenging since a limited amount of false positive pairwise match predictions can throw off the group assignment of large quantities of records. Thus, we propose GraLMatch, a method that can partially detect and remove false positive pairwise predictions through graph-based properties. Finally, we showcase how fine-tuning a Transformer-based model (DistilBERT) on a reduced number of labeled samples yields a better final entity group matching than training on more samples and/or incorporating fine-tuning optimizations, illustrating how precision becomes the deciding factor in the entity group matching of large volumes of records. 
摘要:本文提出了一个端到端的多源实体匹配问题,我们称之为实体组匹配,其目标是将来自多个数据源但代表同一现实世界实体的记录分配到同一组。我们关注传递匹配记录的影响,即通过图G=(V,E)中的路径相连的记录,该图的节点代表记录,边表示记录之间是否匹配。我们给出了这个问题的一个真实实例,其中的挑战是匹配来自不同数据提供商的公司和金融证券记录。我们还介绍了两个新的多源基准数据集,它们提出了与真实世界记录相似的匹配挑战。这些记录的一个显著特点是,它们会随现实世界事件定期更新,但更新并不是在各数据源之间统一应用的。这种现象使某些记录组的匹配只能通过使用传递信息来实现。在我们的实验中,我们说明了考虑传递匹配记录为何具有挑战性:少量假阳性的成对匹配预测就可能打乱大量记录的分组分配。因此,我们提出了GraLMatch方法,该方法可以通过基于图的性质检测并移除部分假阳性成对预测。最后,我们展示了在较少的标注样本上微调基于Transformer的模型(DistilBERT),如何比在更多样本上训练和/或结合微调优化得到更好的最终实体组匹配,说明了精确率如何成为大量记录的实体组匹配的决定性因素。
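To make the point about transitive matches concrete, here is a minimal sketch that groups records as connected components of the match graph G = (V, E) using union-find. It is not the GraLMatch method itself (the abstract does not describe its graph-based cleanup in detail); the record ids and pairs are hypothetical, and the function name `group_records` is only illustrative. It shows how a single false positive pair can merge two otherwise separate groups.

```python
from collections import defaultdict

def group_records(num_records, matched_pairs):
    """Group records into connected components of the pairwise-match graph."""
    parent = list(range(num_records))

    def find(x):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(list)
    for record in range(num_records):
        groups[find(record)].append(record)
    return sorted(groups.values())

# Hypothetical data: records 0-2 are one entity, records 3-5 another.
true_pairs = [(0, 1), (1, 2), (3, 4), (4, 5)]
clean = group_records(6, true_pairs)
# One false positive pair (2, 3) transitively merges both groups.
noisy = group_records(6, true_pairs + [(2, 3)])
print(clean)
print(noisy)
```

With the clean pairs the two entities stay separate; adding the single wrong edge collapses all six records into one group, which is why pairwise precision matters so much at the group level.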

[NLP-35] Unveiling the Impact of Multi-Modal Interactions on User Engagement: A Comprehensive Evaluation in AI-driven Conversations
[NLP-35] 揭示多模式交互对用户参与度的影响:人工智能驱动对话中的综合评估

链接: https://arxiv.org/abs/2406.15000
作者: Lichao Zhang,Jia Yu,Shuai Zhang,Long Li,Yangyang Zhong,Guanbao Liang,Yuming Yan,Qing Ma,Fangsheng Weng,Fayu Pan,Jing Li,Renjun Xu,Zhenzhong Lan
关键词: Large Language Models, Large Language, Language Models, advanced user-bot interactions, significantly advanced user-bot
中文关键词: 大型语言模型、大型语言、语言模型、高级用户机器人交互、显着高级用户机器人
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

Abstract:Large Language Models (LLMs) have significantly advanced user-bot interactions, enabling more complex and coherent dialogues. However, the prevalent text-only modality might not fully exploit the potential for effective user engagement. This paper explores the impact of multi-modal interactions, which incorporate images and audio alongside text, on user engagement in chatbot conversations. We conduct a comprehensive analysis using a diverse set of chatbots and real-user interaction data, employing metrics such as retention rate and conversation length to evaluate user engagement. Our findings reveal a significant enhancement in user engagement with multi-modal interactions compared to text-only dialogues. Notably, the incorporation of a third modality significantly amplifies engagement beyond the benefits observed with just two modalities. These results suggest that multi-modal interactions optimize cognitive processing and facilitate richer information comprehension. This study underscores the importance of multi-modality in chatbot design, offering valuable insights for creating more engaging and immersive AI communication experiences and informing the broader AI community about the benefits of multi-modal interactions in enhancing user engagement.
摘要:大型语言模型(LLM)极大地促进了用户与机器人的交互,使对话更加复杂和连贯。然而,流行的纯文本模式可能没有充分利用有效用户参与的潜力。本文探讨了将图像和音频与文本结合在一起的多模式交互对用户参与聊天机器人对话的影响。我们使用一组不同的聊天机器人和真实的用户交互数据进行了全面的分析,使用保留率和对话时长等指标来评估用户参与度。我们的发现显示,与纯文本对话相比,多模式交互在用户参与度方面有显著增强。值得注意的是,纳入第三种模式大大扩大了参与度,而不仅仅是两种模式所带来的好处。这些结果表明,多通道交互作用优化了认知加工,促进了更丰富的信息理解。这项研究强调了多模式在聊天机器人设计中的重要性,为创造更具吸引力和身临其境的人工智能通信体验提供了有价值的见解,并向更广泛的人工智能社区告知多模式交互在提高用户参与度方面的好处。

[NLP-36] Disability Representations: Finding Biases in Automatic Image Generation
[NLP-36] 残疾表示:发现自动图像生成中的偏差

链接: https://arxiv.org/abs/2406.14993
作者: Yannis Tevissen
关键词: enabled widespread access, Recent advancements, AI-generated imagery, visual content, technology have enabled
中文关键词: 实现了广泛的访问,最近的进步、人工智能生成的图像、视觉内容、技术使
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: Presented at AVA Workshop of CVPR 2024

Abstract:Recent advancements in image generation technology have enabled widespread access to AI-generated imagery, prominently used in advertising, entertainment, and progressively in every form of visual content. However, these technologies often perpetuate societal biases. This study investigates the representation biases in popular image generation models towards people with disabilities (PWD). Through a comprehensive experiment involving several popular text-to-image models, we analyzed the depiction of disability. The results indicate a significant bias, with most generated images portraying disabled individuals as old, sad, and predominantly using manual wheelchairs. These findings highlight the urgent need for more inclusive AI development, ensuring diverse and accurate representation of PWD in generated images. This research underscores the importance of addressing and mitigating biases in AI models to foster equitable and realistic representations.
摘要:图像生成技术的最新进步使人们能够广泛访问人工智能生成的图像,这些图像主要用于广告、娱乐,并逐渐用于各种形式的视觉内容。然而,这些技术往往会导致社会偏见永久化。本研究调查了流行图像生成模型中针对残疾人(PWD)的代表偏见。通过涉及几种流行的文本到图像模型的综合实验,我们分析了残疾的描述。结果表明存在显着的偏见,大多数生成的图像将残疾人描绘成年老、悲伤的人,并且主要使用手动轮椅。这些发现凸显了更加包容性的人工智能开发的迫切需要,确保在生成的图像中多样化、准确地表示PWD。这项研究强调了解决和减轻人工智能模型中的偏见以促进公平和现实的表现的重要性。

[NLP-37] SpreadsheetBench: Towards Challenging Real World Spreadsheet Manipulation
[NLP-37] SpreadsheetBench:迈向具有挑战性的真实世界电子表格操作

链接: https://arxiv.org/abs/2406.14991
作者: Zeyao Ma,Bohan Zhang,Jing Zhang,Jifan Yu,Xiaokang Zhang,Xiaohan Zhang,Sijia Luo,Xi Wang,Jie Tang
关键词: immerse current large, current large language, manipulation benchmark exclusively, benchmark exclusively derived, challenging spreadsheet manipulation
中文关键词: 沉浸当前大型、当前大型语言、独家操纵基准、独家衍生基准、具有挑战性的电子表格操纵
类目: Computation and Language (cs.CL); Software Engineering (cs.SE)
备注: Homepage: this https URL

Abstract:We introduce SpreadsheetBench, a challenging spreadsheet manipulation benchmark exclusively derived from real-world scenarios, designed to immerse current large language models (LLMs) in the actual workflow of spreadsheet users. Unlike existing benchmarks that rely on synthesized queries and simplified spreadsheet files, SpreadsheetBench is built from 912 real questions gathered from online Excel forums, which reflect the intricate needs of users. The associated spreadsheets from the forums contain a variety of tabular data such as multiple tables, non-standard relational tables, and abundant non-textual elements. Furthermore, we propose a more reliable evaluation metric akin to online judge platforms, where multiple spreadsheet files are created as test cases for each instruction, ensuring the evaluation of robust solutions capable of handling spreadsheets with varying values. Our comprehensive evaluation of various LLMs under both single-round and multi-round inference settings reveals a substantial gap between the state-of-the-art (SOTA) models and human performance, highlighting the benchmark’s difficulty.
摘要:我们介绍了SpreadsheetBench,这是一个具有挑战性的电子表格操作基准,完全来自真实世界的场景,旨在让当前的大型语言模型(LLM)置身于电子表格用户的实际工作流中。与依赖合成查询和简化电子表格文件的现有基准不同,SpreadsheetBench由从在线Excel论坛收集的912个真实问题构建而成,这些问题反映了用户的复杂需求。来自论坛的相关电子表格包含各种表格数据,例如多个表格、非标准关系表格和丰富的非文本元素。此外,我们提出了一种类似于在线评测平台的更可靠的评估指标,其中为每条指令创建多个电子表格文件作为测试用例,确保评估的是能够处理不同取值电子表格的稳健解决方案。我们在单轮和多轮推理设置下对各种LLM的综合评估显示,最先进的(SOTA)模型与人类表现之间存在巨大差距,突显了该基准的难度。

[NLP-38] Do Large Language Models Exhibit Cognitive Dissonance? Studying the Difference Between Revealed Beliefs and Stated Answers
[NLP-38] 大型语言模型是否表现出认知失调?研究揭示的信念和陈述的答案之间的差异

链接: https://arxiv.org/abs/2406.14986
作者: Manuel Mondal,Ljiljana Dolamic,Gérôme Bovet,Philippe Cudré-Mauroux
关键词: Large Language Models, Multiple Choices Questions, Choices Questions, Language Models, Large Language
中文关键词: 大型语言模型、多重选择问题、选择问题、语言模型、大型语言
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

Abstract:Prompting and Multiple Choices Questions (MCQ) have become the preferred approach to assess the capabilities of Large Language Models (LLMs), due to their ease of manipulation and evaluation. Such experimental appraisals have pointed toward the LLMs’ apparent ability to perform causal reasoning or to grasp uncertainty. In this paper, we investigate whether these abilities are measurable outside of tailored prompting and MCQ by reformulating these issues as direct text completion - the foundation of LLMs. To achieve this goal, we define scenarios with multiple possible outcomes and we compare the prediction made by the LLM through prompting (their Stated Answer) to the probability distributions they compute over these outcomes during next token prediction (their Revealed Belief). Our findings suggest that the Revealed Belief of LLMs significantly differs from their Stated Answer and hint at multiple biases and misrepresentations that their beliefs may yield in many scenarios and outcomes. As text completion is at the core of LLMs, these results suggest that common evaluation methods may only provide a partial picture and that more research is needed to assess the extent and nature of their capabilities.
摘要:提示(Prompting)和多项选择题(MCQ)因其易于操作和评价而成为评估大型语言模型(LLM)能力的首选方法。这类实验评估表明,LLM似乎具备进行因果推理或把握不确定性的能力。在本文中,我们通过将这些问题重新表述为直接文本补全(LLM的基础),来考察这些能力是否可以在定制提示和MCQ之外被测量。为了实现这一目标,我们定义了具有多个可能结果的情景,并将LLM通过提示做出的预测(它们的陈述答案)与它们在下一词元预测期间对这些结果计算的概率分布(它们的显露信念)进行比较。我们的发现表明,LLM的显露信念与其陈述答案显著不同,并暗示它们的信念在许多情景和结果中可能带来多重偏见和错误表征。由于文本补全是LLM的核心,这些结果表明,常用的评估方法可能只能提供部分情况,需要更多的研究来评估其能力的程度和性质。
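The comparison described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' procedure: the logits are hypothetical numbers rather than model outputs, restricting the next-token distribution to the outcome tokens and renormalizing with a softmax is an assumption, and the names `softmax` and `revealed_vs_stated` are only illustrative.

```python
import math

def softmax(logits):
    """Normalize a dict of outcome logits into a probability distribution."""
    peak = max(logits.values())
    exps = {k: math.exp(v - peak) for k, v in logits.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}

def revealed_vs_stated(outcome_logits, stated_answer):
    """Return the revealed distribution, its top outcome, and whether it agrees."""
    belief = softmax(outcome_logits)
    top = max(belief, key=belief.get)
    return belief, top, top == stated_answer

# Hypothetical next-token logits for three outcomes of one scenario.
logits = {"heads": 2.0, "tails": 1.0, "edge": -1.0}
belief, top, agrees = revealed_vs_stated(logits, stated_answer="tails")
print(belief, top, agrees)
```

Here the prompted answer ("tails") differs from the outcome that receives the most probability mass ("heads"), which is the kind of mismatch the paper studies.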

[NLP-39] Retrieve-Plan-Generation: An Iterative Planning and Answering Framework for Knowledge-Intensive LLM Generation
[NLP-39] 检索-计划-生成:面向知识密集型LLM生成的迭代规划与回答框架

链接: https://arxiv.org/abs/2406.14979
作者: Yuanjie Lyu,Zihan Niu,Zheyong Xie,Chao Zhang,Tong Xu,Yang Wang,Enhong Chen
关键词: produce factual errors, limited internal knowledge, factual errors due, large language models, significant progress
中文关键词: 产生事实错误、内部知识有限、事实错误、大型语言模型、重大进展
类目: Computation and Language (cs.CL)
备注:

Abstract:Despite the significant progress of large language models (LLMs) in various tasks, they often produce factual errors due to their limited internal knowledge. Retrieval-Augmented Generation (RAG), which enhances LLMs with external knowledge sources, offers a promising solution. However, these methods can be misled by irrelevant paragraphs in retrieved documents. Due to the inherent uncertainty in LLM generation, inputting the entire document may introduce off-topic information, causing the model to deviate from the central topic and affecting the relevance of the generated content. To address these issues, we propose the Retrieve-Plan-Generation (RPG) framework. RPG generates plan tokens to guide subsequent generation in the plan stage. In the answer stage, the model selects relevant fine-grained paragraphs based on the plan and uses them for further answer generation. This plan-answer process is repeated iteratively until completion, enhancing generation relevance by focusing on specific topics. To implement this framework efficiently, we utilize a simple but effective multi-task prompt-tuning method, enabling the existing LLMs to handle both planning and answering. We comprehensively compare RPG with baselines across 5 knowledge-intensive generation tasks, demonstrating the effectiveness of our approach.
摘要:尽管大型语言模型(LLM)在各种任务中取得了长足的进步,但由于其内部知识有限,它们往往会产生事实错误。检索增强生成(RAG)利用外部知识源增强LLM,提供了一个很有前途的解决方案。然而,这些方法可能会被检索到的文档中不相关的段落所误导。由于LLM生成中的固有不确定性,输入整个文档可能会引入离题信息,导致模型偏离中心主题,并影响生成内容的相关性。为了解决这些问题,我们提出了检索-计划-生成(RPG)框架。在计划阶段,RPG生成计划词元以指导后续生成。在回答阶段,模型根据计划选择相关的细粒度段落,并使用它们进一步生成答案。这个计划-回答过程迭代重复直到完成,通过聚焦特定主题来增强生成的相关性。为了高效地实现这个框架,我们使用了一种简单但有效的多任务提示调优方法,使现有的LLM能够同时处理规划和回答。我们在5个知识密集型生成任务中将RPG与基线进行了全面比较,证明了我们方法的有效性。

[NLP-40] A Tale of Trust and Accuracy: Base vs. Instruct LLMs in RAG Systems
[NLP-40] 信任与准确性的故事:RAG系统中的基础LLM与指令微调LLM

链接: https://arxiv.org/abs/2406.14972
作者: Florin Cuconasu,Giovanni Trappolini,Nicola Tonellotto,Fabrizio Silvestri
关键词: Retrieval Augmented Generation, Augmented Generation, artificial intelligence combining, Retrieval Augmented, large language models
中文关键词: 检索增强生成、增强生成、人工智能结合、检索增强、大型语言模型
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:

Abstract:Retrieval Augmented Generation (RAG) represents a significant advancement in artificial intelligence combining a retrieval phase with a generative phase, with the latter typically being powered by large language models (LLMs). The current common practices in RAG involve using “instructed” LLMs, which are fine-tuned with supervised training to enhance their ability to follow instructions and are aligned with human preferences using state-of-the-art techniques. Contrary to popular belief, our study demonstrates that base models outperform their instructed counterparts in RAG tasks by 20% on average under our experimental settings. This finding challenges the prevailing assumptions about the superiority of instructed LLMs in RAG applications. Further investigations reveal a more nuanced situation, questioning fundamental aspects of RAG and suggesting the need for broader discussions on the topic; or, as Fromm would have it, “Seldom is a glance at the statistics enough to understand the meaning of the figures”.
摘要:检索增强生成(RAG)代表着人工智能领域的重大进步,它将检索阶段和生成阶段结合在一起,后者通常由大语言模型(LLM)提供支持。目前RAG的常见做法是使用“指令微调”的LLM,这类模型通过监督训练进行微调,以增强其遵循指令的能力,并使用最先进的技术与人类偏好对齐。与普遍看法相反,我们的研究表明,在我们的实验设置下,基础模型在RAG任务中的表现平均比指令微调模型高出20%。这一发现挑战了关于指令微调LLM在RAG应用中更具优势的主流假设。进一步的调查揭示了一种更加微妙的情况,对RAG的基本方面提出了质疑,并表明有必要就这一主题进行更广泛的讨论;或者,正如弗洛姆所说,“仅仅看一眼统计数据,很少足以理解数字的含义”。

[NLP-41] Domain Adaptation of Llama3-70B-Instruct through Continual Pre-Training and Model Merging: A Comprehensive Evaluation
[NLP-41] 通过持续预训练和模型合并对Llama3-70B-Instruct进行领域适应:综合评估

链接: https://arxiv.org/abs/2406.14971
作者: Shamane Siriwardhana,Mark McQuade,Thomas Gauthier,Lucas Atkins,Fernando Fernandes Neto,Luke Meyers,Anneketh Vij,Tyler Odenthal,Charles Goddard,Mary MacCarthy,Jacob Solawetz
关键词: conducted extensive experiments, SEC data, exploring its performance, conducted extensive, extensive experiments
中文关键词: 进行了广泛的实验,SEC数据,探索其性能,进行了广泛的实验
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 8 pages, 6 figures

Abstract:We conducted extensive experiments on domain adaptation of the Meta-Llama-3-70B-Instruct model on SEC data, exploring its performance on both general and domain-specific benchmarks. Our focus included continual pre-training (CPT) and model merging, aiming to enhance the model’s domain-specific capabilities while mitigating catastrophic forgetting. Through this study, we evaluated the impact of integrating financial regulatory data into a robust language model and examined the effectiveness of our model merging techniques in preserving and improving the model’s instructive abilities. The model is accessible at hugging face: this https URL, arcee-ai/Llama-3-SEC-Base. This is an intermediate checkpoint of our final model, which has seen 20B tokens so far. The full model is still in the process of training. This is a preprint technical report with thorough evaluations to understand the entire process.
摘要:我们对Meta-Llama-3-70B-Instruct模型在SEC数据上的领域适应进行了广泛的实验,探索了其在通用基准和特定领域基准上的性能。我们的重点包括持续预训练(CPT)和模型合并,旨在增强模型的特定领域能力,同时减轻灾难性遗忘。通过这项研究,我们评估了将金融监管数据集成到稳健语言模型中的影响,并检验了我们的模型合并技术在保留和提高模型指令能力方面的有效性。该模型可在Hugging Face上获取:this https URL,arcee-ai/Llama-3-SEC-Base。这是我们最终模型的中间检查点,到目前为止已训练了20B词元。完整模型仍在训练中。这是一份预印本技术报告,包含全面的评估,以便理解整个过程。

[NLP-42] Unlocking the Global Synergies in Low-Rank Adapters
[NLP-42] 释放低秩适配器中的全局协同效应

链接: https://arxiv.org/abs/2406.14956
作者: Zixi Zhang,Cheng Zhang,Xitong Gao,Robert D. Mullins,George A. Constantinides,Yiren Zhao
关键词: Low-rank Adaption, de-facto parameter-efficient fine-tuning, parameter-efficient fine-tuning technique, large language models, de-facto parameter-efficient
中文关键词: 低秩适应、事实上的参数高效微调、参数高效微调技术、大型语言模型、事实上的参数高效
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: Accepted at ICML2024 ES-FoMo-II Workshop

Abstract:Low-rank Adaption (LoRA) has been the de-facto parameter-efficient fine-tuning technique for large language models. We present HeteroLoRA, a light-weight search algorithm that leverages zero-cost proxies to allocate the limited LoRA trainable parameters across the model for better fine-tuned performance. In addition to the allocation for the standard LoRA-adapted models, we also demonstrate the efficacy of HeteroLoRA by performing the allocation in a more challenging search space that includes LoRA modules and LoRA-adapted shortcut connections. Experiments show that HeteroLoRA enables improvements in model performance given the same parameter budget. For example, on MRPC, we see an improvement of 1.6% in accuracy with similar training parameter budget. We will open-source our algorithm once the paper is accepted.
摘要:低秩适应(LoRA)一直是大型语言模型事实上的参数高效微调技术。我们提出了HeteroLoRA,这是一种轻量级搜索算法,它利用零成本代理在整个模型中分配有限的LoRA可训练参数,以获得更好的微调性能。除了为标准LoRA适配模型进行分配外,我们还在包含LoRA模块和LoRA适配快捷连接的更具挑战性的搜索空间中执行分配,以证明HeteroLoRA的有效性。实验表明,在相同的参数预算下,HeteroLoRA可以提高模型性能。例如,在MRPC上,在相近的训练参数预算下,准确率提高了1.6%。一旦论文被接受,我们将开源我们的算法。
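The idea of a fixed "parameter budget" can be illustrated with simple arithmetic. A LoRA adapter of rank r on a weight matrix of shape (d_out, d_in) adds two factors of shapes (d_out, r) and (r, d_in), i.e. r * (d_out + d_in) trainable parameters. The sketch below only counts parameters for two hypothetical rank allocations with the same total; it does not implement HeteroLoRA's zero-cost-proxy search, and the layer shapes and ranks are assumptions chosen for illustration.

```python
def lora_params(d_out, d_in, rank):
    """Trainable parameters added by one LoRA adapter of the given rank."""
    return rank * (d_out + d_in)

def total_params(layer_shapes, ranks):
    """Total LoRA parameters for a per-layer rank allocation."""
    return sum(lora_params(d_out, d_in, r)
               for (d_out, d_in), r in zip(layer_shapes, ranks))

# Hypothetical layer shapes (d_out, d_in); not taken from the paper.
shapes = [(768, 768), (768, 768), (3072, 768), (768, 3072)]

uniform = total_params(shapes, [8, 8, 8, 8])          # same rank everywhere
heterogeneous = total_params(shapes, [16, 0, 8, 8])   # rank moved between layers
print(uniform, heterogeneous)
```

Both allocations use the same number of trainable parameters, so any difference in fine-tuned accuracy would come from where the rank is placed rather than from how much is spent.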

[NLP-43] ICLEval: Evaluating In-Context Learning Ability of Large Language Models
[NLP-43] ICLEval:评估大型语言模型的上下文学习能力

链接: https://arxiv.org/abs/2406.14955
作者: Wentong Chen,Yankai Lin,ZhenHao Zhou,HongYun Huang,Yantao Jia,Zhao Cao,Ji-Rong Wen
关键词: Large Language Models, capability of Large, Large Language, ICL ability, ICL
中文关键词: 大型语言模型,大型能力,大型语言,ICL能力,ICL
类目: Computation and Language (cs.CL)
备注:

Abstract:In-Context Learning (ICL) is a critical capability of Large Language Models (LLMs) as it empowers them to comprehend and reason across interconnected inputs. Evaluating the ICL ability of LLMs can enhance their utilization and deepen our understanding of how this ability is acquired at the training stage. However, existing evaluation frameworks primarily focus on language abilities and knowledge, often overlooking the assessment of ICL ability. In this work, we introduce the ICLEval benchmark to evaluate the ICL abilities of LLMs, which encompasses two key sub-abilities: exact copying and rule learning. Through the ICLEval benchmark, we demonstrate that ICL ability is universally present in different LLMs, and model size is not the sole determinant of ICL efficacy. Surprisingly, we observe that ICL abilities, particularly copying, develop early in the pretraining process and stabilize afterward. Our source codes and benchmark are released at this https URL.
摘要:上下文学习(ICL)是大型语言模型(LLM)的一项关键能力,因为它使模型能够理解相互关联的输入并在其间进行推理。评估LLM的ICL能力可以提高其利用率,并加深我们对这种能力如何在训练阶段获得的理解。然而,现有的评估框架主要关注语言能力和知识,往往忽视了对ICL能力的评估。在这项工作中,我们引入了ICLEval基准来评估LLM的ICL能力,其中包括两个关键子能力:精确复制和规则学习。通过ICLEval基准,我们证明ICL能力普遍存在于不同的LLM中,并且模型规模并不是ICL效果的唯一决定因素。令人惊讶的是,我们观察到ICL能力,尤其是复制能力,在预训练过程的早期形成并在之后趋于稳定。我们的源代码和基准已在 this https URL 发布。

[NLP-44] ESC-Eval: Evaluating Emotion Support Conversations in Large Language Models
[NLP-44] ESC-Eval:评估大型语言模型中的情感支持对话

链接: https://arxiv.org/abs/2406.14952
作者: Haiquan Zhao,Lingyu Li,Shisong Chen,Shuqi Kong,Jiaan Wang,Kexin Huang,Tianle Gu,Yixu Wang,Dandan Liang,Zhixu Li,Yan Teng,Yanghua Xiao,Yingchun Wang
关键词: Emotion Support Conversation, Emotion Support, Support Conversation, offer emotional guidance, ESC models
中文关键词: 情感支持对话,情感支持,支持对话,提供情感指导,电子稳定控制模型
类目: Computation and Language (cs.CL)
备注: Pre-print

Abstract:Emotion Support Conversation (ESC) is a crucial application, which aims to reduce human stress, offer emotional guidance, and ultimately enhance human mental and physical well-being. With the advancement of Large Language Models (LLMs), many researchers have employed LLMs as the ESC models. However, the evaluation of these LLM-based ESCs remains uncertain. Inspired by the awesome development of role-playing agents, we propose an ESC Evaluation framework (ESC-Eval), which uses a role-playing agent to interact with ESC models, followed by a manual evaluation of the interactive dialogues. In detail, we first re-organize 2,801 role-playing cards from seven existing datasets to define the roles of the role-playing agent. Second, we train a specific role-playing model called ESC-Role which behaves more like a confused person than GPT-4. Third, through ESC-Role and organized role cards, we systematically conduct experiments using 14 LLMs as the ESC models, including general AI-assistant LLMs (ChatGPT) and ESC-oriented LLMs (ExTES-Llama). We conduct comprehensive human annotations on interactive multi-turn dialogues of different ESC models. The results show that ESC-oriented LLMs exhibit superior ESC abilities compared to general AI-assistant LLMs, but there is still a gap behind human performance. Moreover, to automate the scoring process for future ESC models, we developed ESC-RANK, which trained on the annotated data, achieving a scoring performance surpassing 35 points of GPT-4. Our data and code are available at this https URL.
摘要:情感支持对话(ESC)是一种重要的应用,旨在减轻人类的压力,提供情感指导,最终提高人类的身心健康。随着大语言模型(LLM)的发展,许多研究者使用LLM作为ESC模型。然而,对这些基于LLM的ESC模型的评估仍不明确。受角色扮演智能体发展的启发,我们提出了一个ESC评估框架(ESC-Eval),该框架使用角色扮演智能体与ESC模型进行交互,然后对交互对话进行人工评估。具体来说,我们首先从七个现有的数据集中重新整理出2,801张角色扮演卡,以定义角色扮演智能体的角色。其次,我们训练了一个特定的角色扮演模型,称为ESC-Role,它的行为比GPT-4更像一个困惑的人。第三,通过ESC-Role和整理好的角色卡,我们系统地使用14个LLM作为ESC模型进行实验,包括通用人工智能助手LLM(ChatGPT)和面向ESC的LLM(ExTES-Llama)。我们对不同ESC模型的交互式多轮对话进行了全面的人工标注。结果表明,面向ESC的LLM具有比通用人工智能助手LLM更好的ESC能力,但与人类的表现相比仍有差距。此外,为了自动为未来的ESC模型评分,我们开发了ESC-RANK,它在标注的数据上进行训练,获得了超过GPT-4的35分的评分性能。我们的数据和代码可在 this https URL 获取。

[NLP-45] Towards Retrieval Augmented Generation over Large Video Libraries
[NLP-45] 面向大型视频库的检索增强生成

链接: https://arxiv.org/abs/2406.14938
作者: Yannis Tevissen,Khalil Guetari,Frédéric Petitpont
关键词: requires complex manual, Library Question Answering, automated searches, creators need efficient, efficient tools
中文关键词: 需要复杂的手册、图书馆问题解答、自动搜索,创作者需要高效、高效的工具
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted in IEEE HSI 2024

Abstract:Video content creators need efficient tools to repurpose content, a task that often requires complex manual or automated searches. Crafting a new video from large video libraries remains a challenge. In this paper we introduce the task of Video Library Question Answering (VLQA) through an interoperable architecture that applies Retrieval Augmented Generation (RAG) to video libraries. We propose a system that uses large language models (LLMs) to generate search queries, retrieving relevant video moments indexed by speech and visual metadata. An answer generation module then integrates user queries with this metadata to produce responses with specific video timestamps. This approach shows promise in multimedia content retrieval, and AI-assisted video content creation.
摘要:视频内容创作者需要高效的工具来重新利用内容,这项任务通常需要复杂的手动或自动搜索。从大型视频库中制作新视频仍然是一项挑战。在本文中,我们通过将检索增强生成(RAG)应用于视频库的可互操作架构来介绍视频库问题解答(VLQA)的任务。我们提出了一种使用大型语言模型(LLM)来生成搜索查询、检索由语音和视觉元数据索引的相关视频时刻的系统。然后,答案生成模块将用户查询与此元数据集成,以生成具有特定视频时间戳的响应。这种方法在多媒体内容检索和人工智能辅助视频内容创建方面展现了前景。

[NLP-46] Autonomous Agents for Collaborative Task under Information Asymmetry
[NLP-46] 信息不对称下协作任务的自治代理

链接: https://arxiv.org/abs/2406.14928
作者: Wei Liu,Chenxi Wang,Yifei Wang,Zihao Xie,Rennai Qiu,Yufan Dang,Zhuoyun Du,Weize Chen,Cheng Yang,Chen Qian
关键词: Large Language Model, Language Model Multi-Agent, Large Language, Language Model, Model Multi-Agent Systems
中文关键词: 大型语言模型,语言模型多代理,大型语言,语言模型,模型多代理系统
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA); Social and Information Networks (cs.SI)
备注: 16 pages, 8 figures, 5 tables, Work in progress

Abstract:Large Language Model Multi-Agent Systems (LLM-MAS) have achieved great progress in solving complex tasks. It performs communication among agents within the system to collaboratively solve tasks, under the premise of shared information. However, when agents’ communication is leveraged to enhance human cooperation, a new challenge arises due to information asymmetry, since each agent can only access the information of its human user. Previous MAS struggle to complete tasks under this condition. To address this, we propose a new MAS paradigm termed iAgents, which denotes Informative Multi-Agent Systems. In iAgents, the human social network is mirrored in the agent network, where agents proactively exchange human information necessary for task resolution, thereby overcoming information asymmetry. iAgents employs a novel agent reasoning mechanism, InfoNav, to navigate agents’ communication towards effective information exchange. Together with InfoNav, iAgents organizes human information in a mixed memory to provide agents with accurate and comprehensive information for exchange. Additionally, we introduce InformativeBench, the first benchmark tailored for evaluating LLM agents’ task-solving ability under information asymmetry. Experimental results show that iAgents can collaborate within a social network of 140 individuals and 588 relationships, autonomously communicate over 30 turns, and retrieve information from nearly 70,000 messages to complete tasks within 3 minutes.
摘要:大语言模型多智能体系统(LLM-MAS)在解决复杂任务方面取得了很大进展。它在信息共享的前提下,通过系统内智能体之间的通信来协同解决任务。然而,当利用智能体的通信来加强人类合作时,由于信息不对称,新的挑战出现了,因为每个智能体只能访问其人类用户的信息。以前的MAS在这种情况下很难完成任务。为了解决这一问题,我们提出了一种新的MAS范式,称为iAgents,即信息型多智能体系统。在iAgents中,人际社交网络被镜像到智能体网络中,智能体在其中主动交换任务解决所需的人类信息,从而克服信息不对称。iAgents采用了一种新颖的智能体推理机制InfoNav,以引导智能体之间的通信走向有效的信息交换。与InfoNav一起,iAgents在混合记忆中组织人类信息,为智能体提供准确和全面的信息以供交换。此外,我们引入了InformativeBench,这是第一个为评估信息不对称下LLM智能体的任务解决能力而量身定制的基准。实验结果表明,iAgents可以在一个由140个人和588个关系组成的社交网络中进行协作,自主交流超过30轮,并从近7万条消息中检索信息,在3分钟内完成任务。

[NLP-47] LLM2FEA: Discover Novel Designs with Generative Evolutionary Multitasking
[NLP-47] LLM2FEA:通过生成式进化多任务处理发现新颖设计

链接: https://arxiv.org/abs/2406.14917
作者: Melvin Wong,Jiao Liu,Thiago Rios,Stefan Menzel,Yew Soon Ong
关键词: generative artificial intelligence, high-quality images, rapid research, research and development, artificial intelligence
中文关键词: 生成式人工智能、高质量图像、快速研究、研发、人工智能
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
备注: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

点击查看摘要

Abstract:The rapid research and development of generative artificial intelligence has enabled the generation of high-quality images, text, and 3D models from text prompts. This advancement impels an inquiry into whether these models can be leveraged to create digital artifacts for both creative and engineering applications. Drawing on innovative designs from other domains may be one answer to this question, much like the historical practice of "bionics", where humans have sought inspiration from nature’s exemplary designs. This raises the intriguing possibility of using generative models to simultaneously tackle design tasks across multiple domains, facilitating cross-domain learning and resulting in a series of innovative design solutions. In this paper, we propose LLM2FEA as the first attempt to discover novel designs in generative models by transferring knowledge across multiple domains. By utilizing a multi-factorial evolutionary algorithm (MFEA) to drive a large language model, LLM2FEA integrates knowledge from various fields to generate prompts that guide the generative model in discovering novel and practical objects. Experimental results in the context of 3D aerodynamic design verify the discovery capabilities of the proposed LLM2FEA. The designs generated by LLM2FEA not only satisfy practicality requirements to a certain degree but also feature novel and aesthetically pleasing shapes, demonstrating the potential applications of LLM2FEA in discovery tasks.
摘要:生成式人工智能的快速研究和发展使得从文本提示生成高质量的图像、文本和3D模型成为可能。这一进步促使人们调查这些模型是否可以被用来为创意和工程应用创造数字艺术品。借鉴其他领域的创新设计可能是这个问题的一个答案,很像历史上的仿生学实践,人类从自然的模范设计中寻找灵感。这增加了使用生成性模型同时处理跨多个领域的设计任务的有趣可能性,促进了跨领域学习,并产生了一系列创新的设计解决方案。在本文中,我们提出了LLM2FEA,作为通过跨多个领域转移知识来发现产生式模型中的新设计的首次尝试。通过使用多因素进化算法(MFEA)来驱动大型语言模型,LLM2FEA集成了来自各个领域的知识来生成提示,指导生成模型发现新颖和实用的对象。三维气动设计的实验结果验证了所提出的LLM2FEA的发现能力。LLM2FEA生成的设计不仅在一定程度上满足了实用性要求,而且造型新颖美观,展示了LLM2FEA在发现任务中的潜在应用。

[NLP-48] MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression
[NLP-48] MoA:用于自动大型语言模型压缩的稀疏注意力混合

链接: https://arxiv.org/abs/2406.14909
作者: Tianyu Fu,Haofeng Huang,Xuefei Ning,Genghan Zhang,Boju Chen,Tianqi Wu,Hongyi Wang,Zixiao Huang,Shiyao Li,Shengen Yan,Guohao Dai,Huazhong Yang,Yu Wang
关键词: Large Language Models, Large Language, demands of Large, Language Models, attention
中文关键词: 大型语言模型,大型语言,大型需求,语言模型,关注
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 10 pages

点击查看摘要

Abstract:Sparse attention can effectively mitigate the significant memory and throughput demands of Large Language Models (LLMs) in long contexts. Existing methods typically employ a uniform sparse attention mask, applying the same sparse pattern across different attention heads and input lengths. However, this uniform approach fails to capture the diverse attention patterns inherent in LLMs, ignoring their distinct accuracy-latency trade-offs. To address this challenge, we propose the Mixture of Attention (MoA), which automatically tailors distinct sparse attention configurations to different heads and layers. MoA constructs and navigates a search space of various attention patterns and their scaling rules relative to input sequence lengths. It profiles the model, evaluates potential configurations, and pinpoints the optimal sparse attention compression plan. MoA adapts to varying input sizes, revealing that some attention heads expand their focus to accommodate longer sequences, while other heads consistently concentrate on fixed-length local contexts. Experiments show that MoA increases the effective context length by 3.9× with the same average attention span, boosting retrieval accuracy by 1.5-7.1× over the uniform-attention baseline across Vicuna-7B, Vicuna-13B, and Llama3-8B models. Moreover, MoA narrows the capability gaps between sparse and dense models, reducing the maximum relative performance drop from 9%-36% to within 5% across two long-context understanding benchmarks. MoA achieves a 1.2-1.4× GPU memory reduction and boosts decode throughput by 5.5-6.7× for 7B and 13B dense models on a single GPU, with minimal impact on performance.
摘要:稀疏注意力可以有效地缓解大语言模型(LLM)在长上下文中对内存和吞吐量的巨大需求。现有的方法通常使用统一的稀疏注意力掩码,在不同的注意力头和输入长度上应用相同的稀疏模式。然而,这种统一的方法未能捕捉到LLM固有的多样化注意力模式,忽略了它们各不相同的准确性-延迟权衡。为了应对这一挑战,我们提出了注意力混合(MoA),它自动为不同的头和层定制不同的稀疏注意力配置。MoA构建并搜索由各种注意力模式及其相对于输入序列长度的缩放规则组成的搜索空间。它对模型进行性能剖析,评估可能的配置,并确定最优的稀疏注意力压缩方案。MoA适应不同的输入大小,揭示出一些注意力头会扩大关注范围以适应更长的序列,而另一些头则始终专注于固定长度的局部上下文。实验表明,在平均注意力跨度相同的情况下,MoA将有效上下文长度提高了3.9倍,在Vicuna-7B、Vicuna-13B和Llama3-8B模型上的检索准确率比统一注意力基线提高了1.5-7.1倍。此外,MoA缩小了稀疏模型和稠密模型之间的能力差距,将两个长上下文理解基准上的最大相对性能降幅从9%-36%降至5%以内。MoA在单个GPU上为7B和13B稠密模型实现了1.2-1.4倍的GPU内存缩减,并将解码吞吐量提高了5.5-6.7倍,而对性能的影响最小。
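
The idea of giving each attention head its own sparse pattern, with a span that may or may not grow with input length, can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the linear rule `alpha + beta * seq_len`, the two head names, and their parameters are hypothetical and are not taken from the MoA paper or its code.

```python
def window_span(alpha, beta, seq_len):
    """Hypothetical scaling rule: span = alpha + beta * seq_len, clamped to [1, seq_len]."""
    return max(1, min(seq_len, int(alpha + beta * seq_len)))


def causal_window_mask(seq_len, span):
    """mask[i][j] is True when query i may attend to key j (causal, last `span` keys only)."""
    return [[(j <= i) and (i - j < span) for j in range(seq_len)]
            for i in range(seq_len)]


# Two illustrative heads: one keeps a fixed local window, one widens with the input.
HEAD_RULES = {
    "local_head": (4, 0.0),   # fixed span of 4 tokens (assumed values)
    "global_head": (0, 0.5),  # span grows to half the sequence (assumed values)
}


def build_masks(seq_len):
    """Build one sliding-window mask per head from its own scaling rule."""
    return {
        name: causal_window_mask(seq_len, window_span(alpha, beta, seq_len))
        for name, (alpha, beta) in HEAD_RULES.items()
    }
```

With these assumed rules, doubling the sequence length leaves the mask of `local_head` at the same width while the mask of `global_head` doubles, which mirrors the behavior the abstract describes for heads that concentrate on fixed-length local context versus heads that expand their focus.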

[NLP-49] Safely Learning with Private Data: A Federated Learning Framework for Large Language Model
[NLP-49] 利用私人数据安全学习:大型语言模型的联邦学习框架

链接: https://arxiv.org/abs/2406.14898
作者: JiaYing Zheng,HaiNan Zhang,LingXiang Wang,WangJie Qiu,HongWei Zheng,ZhiMing Zheng
关键词: greatly improve large, improve large language, large language models, LLM, larger and quality-higher
中文关键词: 大幅改进大型、改进大型语言、大型语言模型、LLM、更大、更高质量
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Private data, being larger and quality-higher than public data, can greatly improve large language models (LLM). However, due to privacy concerns, this data is often dispersed in multiple silos, making its secure utilization for LLM training a challenge. Federated learning (FL) is an ideal solution for training models with distributed private data, but traditional frameworks like FedAvg are unsuitable for LLM due to their high computational demands on clients. An alternative, split learning, offloads most training parameters to the server while training embedding and output layers locally, making it more suitable for LLM. Nonetheless, it faces significant challenges in security and efficiency. Firstly, the gradients of embeddings are prone to attacks, leading to potential reverse engineering of private data. Furthermore, the server’s limitation of handle only one client’s training request at a time hinders parallel training, severely impacting training efficiency. In this paper, we propose a Federated Learning framework for LLM, named FL-GLM, which prevents data leakage caused by both server-side and peer-client attacks while improving training efficiency. Specifically, we first place the input block and output block on local client to prevent embedding gradient attacks from server. Secondly, we employ key-encryption during client-server communication to prevent reverse engineering attacks from peer-clients. Lastly, we employ optimization methods like client-batching or server-hierarchical, adopting different acceleration methods based on the actual computational capabilities of the server. Experimental results on NLU and generation tasks demonstrate that FL-GLM achieves comparable metrics to centralized chatGLM model, validating the effectiveness of our federated learning framework.
摘要:私有数据比公共数据规模更大、质量更高,可以极大地改进大型语言模型(LLM)。然而,出于隐私方面的考虑,这些数据通常分散在多个数据孤岛中,这使得将其安全地用于LLM训练成为一项挑战。联邦学习(FL)是利用分布式私有数据训练模型的理想解决方案,但FedAvg等传统框架由于对客户端的计算要求较高而不适用于LLM。另一种选择是拆分学习(split learning),它将大部分训练参数卸载到服务器,同时在本地训练嵌入层和输出层,因而更适合LLM。尽管如此,它在安全和效率方面仍面临重大挑战。首先,嵌入的梯度容易受到攻击,从而可能导致私有数据被逆向还原。此外,服务器一次只能处理一个客户端的训练请求,这一限制阻碍了并行训练,严重影响了训练效率。本文提出了一种用于LLM的联邦学习框架FL-GLM,该框架在提高训练效率的同时,防止了服务器端攻击和对等客户端攻击引起的数据泄漏。具体地说,我们首先将输入块和输出块放置在本地客户端,以防止来自服务器的嵌入梯度攻击。其次,我们在客户端-服务器通信过程中使用密钥加密,以防止来自对等客户端的逆向工程攻击。最后,我们采用了客户端批处理或服务器分层等优化方法,根据服务器的实际计算能力采用不同的加速方法。在NLU和生成任务上的实验结果表明,FL-GLM达到了与集中式ChatGLM模型相当的指标,验证了我们的联邦学习框架的有效性。

[NLP-50] Talking the Talk Does Not Entail Walking the Walk: On the Limits of Large Language Models in Lexical Entailment Recognition
[NLP-50] 空谈并不意味着走路:论大型语言模型在词汇蕴含识别中的局限性

链接: https://arxiv.org/abs/2406.14894
作者: Candida M. Greco,Lucio La Cava,Andrea Tagarelli
关键词: providing the structure, form the backbone, Large Language Models, lexical entailment, Verbs form
中文关键词: 提供结构、形式支柱、大型语言模型、词汇蕴含、动词形式
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Information Retrieval (cs.IR); Physics and Society (physics.soc-ph)
备注:

点击查看摘要

Abstract:Verbs form the backbone of language, providing the structure and meaning to sentences. Yet, their intricate semantic nuances pose a longstanding challenge. Understanding verb relations through the concept of lexical entailment is crucial for comprehending sentence meanings and grasping verb dynamics. This work investigates the capabilities of eight Large Language Models in recognizing lexical entailment relations among verbs through differently devised prompting strategies and zero-/few-shot settings over verb pairs from two lexical databases, namely WordNet and HyperLex. Our findings unveil that the models can tackle the lexical entailment recognition task with moderately good performance, although at varying degree of effectiveness and under different conditions. Also, utilizing few-shot prompting can enhance the models’ performance. However, perfectly solving the task arises as an unmet challenge for all examined LLMs, which raises an emergence for further research developments on this topic.
摘要:动词是语言的脊梁,为句子提供结构和意义。然而,它们错综复杂的语义差异构成了一个长期存在的挑战。通过词汇蕴涵的概念来理解动词关系,对于理解句子意义和掌握动词动态是至关重要的。本文通过不同设计的提示策略以及零样本/少样本设置,在来自WordNet和HyperLex两个词汇数据库的动词对上,考察了八种大型语言模型识别动词间词汇蕴涵关系的能力。我们的研究结果表明,这些模型能够以中等偏好的性能完成词汇蕴涵识别任务,尽管其有效程度各不相同,且依赖于不同的条件。此外,使用少样本提示可以提高模型的性能。然而,对于所有被考察的LLM而言,完美地解决这一任务仍是一个尚未攻克的挑战,这表明亟需在该主题上开展进一步研究。

[NLP-51] Generate-then-Ground in Retrieval-Augmented Generation for Multi-hop Question Answering
[NLP-51] 用于多跳问题回答的检索增强生成中先生成然后接地

链接: https://arxiv.org/abs/2406.14891
作者: Zhengliang Shi,Shuo Zhang,Weiwei Sun,Shen Gao,Pengjie Ren,Zhumin Chen,Zhaochun Ren
关键词: Multi-Hop Question Answering, intensive knowledge required, large language models, Question Answering, tasks present
中文关键词: 多跳问题解答、需要密集的知识、大型语言模型、问题解答、存在的任务
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: ACL 2024 (main conference)

点击查看摘要

Abstract:Multi-Hop Question Answering (MHQA) tasks present a significant challenge for large language models (LLMs) due to the intensive knowledge required. Current solutions, like Retrieval-Augmented Generation, typically retrieve potential documents from an external corpus to read an answer. However, the performance of this retrieve-then-read paradigm is constrained by the retriever and the inevitable noise in the retrieved documents. To mitigate these challenges, we introduce a novel generate-then-ground (GenGround) framework, synergizing the parametric knowledge of LLMs and external documents to solve a multi-hop question. GenGround empowers LLMs to alternate two phases until the final answer is derived: (1) formulate a simpler, single-hop question and directly generate the answer; (2) ground the question-answer pair in retrieved documents, amending any wrong predictions in the answer. We also propose an instructional grounding distillation method to generalize our method into smaller models. Extensive experiments conducted on four datasets illustrate the superiority of our method.
摘要:多跳问答(MHQA)任务对大型语言模型(LLM)来说是一个巨大的挑战,因为需要密集的知识。目前的解决方案,如检索增强生成,通常从外部语料库检索可能相关的文档,再从中读出答案。然而,这种先检索后阅读范式的性能受到检索器和检索文档中不可避免的噪声的限制。为了缓解这些挑战,我们引入了一种新的先生成后溯源(generate-then-ground,GenGround)框架,将LLM的参数化知识和外部文档协同起来解决多跳问题。GenGround使LLM交替执行两个阶段,直到得出最终答案:(1)提出一个更简单的单跳问题并直接生成答案;(2)将问题-答案对与检索到的文档进行比对溯源,修正答案中的任何错误预测。我们还提出了一种指令式溯源蒸馏方法,将我们的方法推广到更小的模型中。在四个数据集上进行的大量实验表明了该方法的优越性。
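
The two alternating phases described in the abstract can be sketched as a small control loop. This is only an illustration under stated assumptions: `propose_step` and `ground` stand in for LLM and retriever calls, and the toy question, drafts, and corpus are invented for the example rather than taken from the paper.

```python
def generate_then_ground(question, propose_step, ground, max_hops=4):
    """Alternate the two phases until propose_step returns None (final answer reached)."""
    history = []  # list of (sub_question, grounded_answer) pairs
    for _ in range(max_hops):
        # Phase 1: formulate a single-hop sub-question and draft an answer directly.
        step = propose_step(question, history)
        if step is None:
            break
        sub_question, draft = step
        # Phase 2: check the draft against retrieved evidence and amend it if needed.
        history.append((sub_question, ground(sub_question, draft)))
    final = history[-1][1] if history else None
    return final, history


# Hypothetical stand-ins for the LLM and the retriever, for illustration only.
TOY_CORPUS = {"Who wrote Book X?": "Bob", "Where was Bob born?": "Lyon"}


def toy_propose(question, history):
    if len(history) == 0:
        return "Who wrote Book X?", "Alice"  # deliberately wrong draft
    if len(history) == 1:
        author = history[0][1]
        return f"Where was {author} born?", "Paris"  # wrong draft again
    return None


def toy_ground(sub_question, draft):
    # Keep the draft unless the toy corpus holds evidence that contradicts it.
    return TOY_CORPUS.get(sub_question, draft)


answer, trace = generate_then_ground(
    "Where was the author of Book X born?", toy_propose, toy_ground
)
```

In this toy run the first draft is corrected during grounding, and the corrected answer then shapes the second sub-question, which is why the grounding step sits inside the loop rather than after it.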

[NLP-52] InterBiasing: Boost Unseen Word Recognition through Biasing Intermediate Predictions
[NLP-52] 互偏:通过偏置中间预测来提高不可见的单词识别

链接: https://arxiv.org/abs/2406.14890
作者: Yu Nakagome,Michael Hentschel
关键词: training data vocabulary, speech recognition methods, data vocabulary, resulting in inaccurate, proper nouns
中文关键词: 训练数据词汇、语音识别方法、数据词汇,导致不准确的专有名词
类目: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: Accepted to Interspeech 2024

点击查看摘要

Abstract:Despite recent advances in end-to-end speech recognition methods, their output is biased to the training data’s vocabulary, resulting in inaccurate recognition of unknown terms or proper nouns. To improve the recognition accuracy for a given set of such terms, we propose an adaptation parameter-free approach based on Self-conditioned CTC. Our method improves the recognition accuracy of misrecognized target keywords by substituting their intermediate CTC predictions with corrected labels, which are then passed on to the subsequent layers. First, we create pairs of correct labels and recognition error instances for a keyword list using Text-to-Speech and a recognition model. We use these pairs to replace intermediate prediction errors by the labels. Conditioning the subsequent layers of the encoder on the labels, it is possible to acoustically evaluate the target keywords. Experiments conducted in Japanese demonstrated that our method successfully improved the F1 score for unknown words.
摘要:尽管端到端语音识别方法最近取得了一些进展,但它们的输出偏向于训练数据的词汇,导致对未知术语或专有名词的识别不准确。为了提高对给定术语集的识别精度,我们提出了一种基于自条件CTC的无参数自适应识别方法。我们的方法通过将中间CTC预测替换为校正后的标签,然后将其传递到后续层,从而提高了误识别目标关键词的识别准确率。首先,我们使用文语转换和识别模型为关键字列表创建正确的标签和识别错误实例对。我们使用这些对来替换标签的中间预测误差。在标签上调节编码器的后续层,可以对目标关键字进行声学评估。用日语进行的实验表明,我们的方法成功地提高了未登录词的F1分数。
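
The substitution step in the abstract, replacing known error instances in an intermediate prediction with the corrected labels before later layers are conditioned on it, can be sketched at the token level. This is a simplified illustration: it works on an already collapsed token sequence rather than frame-level CTC outputs, and the function name and the example pair are hypothetical.

```python
def substitute_intermediate(tokens, bias_pairs):
    """Replace known error token sequences with their corrected label sequences."""
    out = []
    i = 0
    while i < len(tokens):
        for wrong, right in bias_pairs:
            # Match the error instance starting at position i.
            if wrong and tokens[i:i + len(wrong)] == wrong:
                out.extend(right)
                i += len(wrong)
                break
        else:
            # No error instance starts here, so keep the original token.
            out.append(tokens[i])
            i += 1
    return out


# Hypothetical (error instance, correct label) pair for one target keyword.
bias_pairs = [(["inter", "by", "a", "sing"], ["inter", "bias", "ing"])]
intermediate = ["we", "call", "it", "inter", "by", "a", "sing", "today"]
conditioned = substitute_intermediate(intermediate, bias_pairs)
```

In the method described above, such pairs come from running a recognition model on synthesized speech of the keyword list; here the pair is simply written by hand.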

[NLP-53] InternLM-Law: An Open Source Chinese Legal Large Language Model
[NLP-53] InternLM-Law:开源中文法律大型语言模型

链接: https://arxiv.org/abs/2406.14887
作者: Zhiwei Fei,Songyang Zhang,Xiaoyu Shen,Dawei Zhu,Xiao Wang,Maosong Cao,Fengzhe Zhou,Yining Li,Wenwei Zhang,Dahua Lin,Kai Chen,Jidong Ge
关键词: showcased impressive capabilities, specialized expertise required, large language models, legal queries due, impressive capabilities
中文关键词: 展示了令人印象深刻的能力、所需的专业知识、大型语言模型、应有的法律查询、令人印象深刻的能力
类目: Computation and Language (cs.CL)
备注: Our dataset, code and models will be released at this https URL

点击查看摘要

Abstract:While large language models (LLMs) have showcased impressive capabilities, they struggle with addressing legal queries due to the intricate complexities and specialized expertise required in the legal field. In this paper, we introduce InternLM-Law, a specialized LLM tailored for addressing diverse legal queries related to Chinese laws, spanning from responding to standard legal questions (e.g., legal exercises in textbooks) to analyzing complex real-world legal situations. We meticulously construct a dataset in the Chinese legal domain, encompassing over 1 million queries, and implement a data filtering and processing pipeline to ensure its diversity and quality. Our training approach involves a novel two-stage process: initially fine-tuning LLMs on both legal-specific and general-purpose content to equip the models with broad knowledge, followed by exclusive fine-tuning on high-quality legal data to enhance structured output generation. InternLM-Law achieves the highest average performance on LawBench, outperforming state-of-the-art models, including GPT-4, on 13 out of 20 subtasks. We make InternLM-Law and our dataset publicly available to facilitate future research in applying LLMs within the legal domain.
摘要:虽然大型语言模型(LLM)展示了令人印象深刻的能力,但由于法律领域错综复杂且需要专业知识,它们在解决法律问题方面遇到了困难。在本文中,我们介绍了InternLM-Law,这是一个专门用于解决与中国法律相关的各种法律查询的专业LLM,从回答标准法律问题(例如教科书中的法律练习)到分析复杂的现实法律情况。我们精心构建了一个涵盖100多万个查询的中文法律领域数据集,并实施了数据过滤和处理流水线,以确保其多样性和质量。我们的训练方法涉及一个新颖的两阶段过程:首先在法律特定内容和通用内容上对LLM进行微调,以使模型具备广泛的知识,然后专门在高质量的法律数据上进行微调,以增强结构化输出生成。InternLM-Law在LawBench上实现了最高的平均性能,在20个子任务中的13个上超过了包括GPT-4在内的最先进的模型。我们将InternLM-Law和我们的数据集公开,以促进未来在法律领域应用LLM的研究。

[NLP-54] FlowBench: Revisiting and Benchmarking Workflow-Guided Planning for LLM-based Agents
[NLP-54] FlowBench:重新审视和基准测试基于LLM的代理的工作流引导规划

链接: https://arxiv.org/abs/2406.14884
作者: Ruixuan Xiao,Wentao Ma,Ke Wang,Yuchuan Wu,Junbo Zhao,Haobo Wang,Fei Huang,Yongbin Li
关键词: fulfill complex tasks, promising tools, emerged as promising, crafted to fulfill, fulfill complex
中文关键词: 完成复杂的任务,有前途的工具,出现的有前途的,精心设计的,完成复杂的
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:LLM-based agents have emerged as promising tools, which are crafted to fulfill complex tasks by iterative planning and action. However, these agents are susceptible to undesired planning hallucinations when lacking specific knowledge for expertise-intensive tasks. To address this, preliminary attempts are made to enhance planning reliability by incorporating external workflow-related knowledge. Despite the promise, such infused knowledge is mostly disorganized and diverse in formats, lacking rigorous formalization and comprehensive comparisons. Motivated by this, we formalize different formats of workflow knowledge and present FlowBench, the first benchmark for workflow-guided planning. FlowBench covers 51 different scenarios from 6 domains, with knowledge presented in diverse formats. To assess different LLMs on FlowBench, we design a multi-tiered evaluation framework. We evaluate the efficacy of workflow knowledge across multiple formats, and the results indicate that current LLM agents need considerable improvements for satisfactory planning. We hope that our challenging benchmark can pave the way for future agent planning research.
摘要:基于LLM的智能体已经成为一种很有前途的工具,它们通过迭代的规划和行动来完成复杂的任务。然而,当缺乏专业知识密集型任务所需的具体知识时,这些智能体容易产生不希望出现的规划幻觉。为了解决这一问题,已有初步尝试通过引入外部的工作流相关知识来提高规划的可靠性。尽管前景可期,但这类注入的知识大多杂乱无章、格式多样,缺乏严格的形式化和全面的比较。受此启发,我们将不同格式的工作流知识形式化,并提出了FlowBench,这是第一个面向工作流引导规划的基准。FlowBench涵盖了6个领域的51个不同场景,知识以多种格式呈现。为了在FlowBench上评估不同的LLM,我们设计了一个多层次的评估框架。我们评估了多种格式的工作流知识的有效性,结果表明,当前的LLM智能体需要相当大的改进才能实现令人满意的规划。我们希望我们具有挑战性的基准能够为未来的智能体规划研究铺平道路。

[NLP-55] OATH-Frames: Characterizing Online Attitudes Towards Homelessness with LLM Assistants
[NLP-55] OATH-Frames:借助LLM助手刻画网络上对无家可归问题的态度

链接: https://arxiv.org/abs/2406.14883
作者: Jaspreet Ranjit,Brihi Joshi,Rebecca Dorn,Laura Petry,Olga Koumoundouros,Jayne Bottarini,Peichen Liu,Eric Rice,Swabha Swayamdipta
关键词: Warning, Contents, attitudes, cs.CL, homelessness
中文关键词: 警告、内容、态度、cs.CL、无家可归
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: Project website: this https URL

点击查看摘要

Abstract:Warning: Contents of this paper may be upsetting. Public attitudes towards key societal issues, expressed on online media, are of immense value in policy and reform efforts, yet challenging to understand at scale. We study one such social issue: homelessness in the U.S., by leveraging the remarkable capabilities of large language models to assist social work experts in analyzing millions of posts from Twitter. We introduce a framing typology: Online Attitudes Towards Homelessness (OATH) Frames: nine hierarchical frames capturing critiques, responses and perceptions. We release annotations with varying degrees of assistance from language models, with immense benefits in scaling: 6.5x speedup in annotation time while only incurring a 3 point F1 reduction in performance with respect to the domain experts. Our experiments demonstrate the value of modeling OATH-Frames over existing sentiment and toxicity classifiers. Our large-scale analysis with predicted OATH-Frames on 2.4M posts on homelessness reveal key trends in attitudes across states, time periods and vulnerable populations, enabling new insights on the issue. Our work provides a general framework to understand nuanced public attitudes at scale, on issues beyond homelessness.
摘要:警告:本文的内容可能会令人不安。在线媒体上表达的公众对关键社会问题的态度在政策和改革努力中具有巨大价值,但难以大规模地理解。我们研究了一个这样的社会问题:美国的无家可归问题,利用大型语言模型的非凡能力来帮助社会工作专家分析来自Twitter的数百万条帖子。我们提出了一种框架类型学:在线对无家可归的态度(OATH)框架(OATH-Frames),即九个层次化框架,用于捕捉批评、回应和看法。我们发布了在语言模型不同程度协助下完成的标注,这在规模化方面有巨大的好处:标注时间加速6.5倍,而与领域专家相比,F1性能只降低了3个点。我们的实验证明了与现有的情感和毒性分类器相比,对OATH-Frames进行建模的价值。我们利用预测的OATH-Frames对240万条关于无家可归问题的帖子进行了大规模分析,揭示了不同州、不同时期和弱势群体之间态度的关键趋势,使我们能够对这个问题有新的见解。我们的工作提供了一个通用框架,用于大规模地理解公众在无家可归以外的问题上的微妙态度。

[NLP-56] 70B-parameter large language models in Japanese medical question-answering
[NLP-56] 日本医学问答中的70 B参数大语言模型

链接: https://arxiv.org/abs/2406.14882
作者: Issey Sukeda,Risa Kishikawa,Satoshi Kodera
关键词: rise of large, hot topics, Japanese LLMs, Japanese medical, English medical dataset
中文关键词: 大型的兴起、热门话题、日语LLM、日本医学、英语医学数据集
类目: Computation and Language (cs.CL)
备注: 7 pages, 2 figures, 4 Tables

点击查看摘要

Abstract:Since the rise of large language models (LLMs), the domain adaptation has been one of the hot topics in various domains. Many medical LLMs trained with English medical dataset have made public recently. However, Japanese LLMs in medical domain still lack its research. Here we utilize multiple 70B-parameter LLMs for the first time and show that instruction tuning using Japanese medical question-answering dataset significantly improves the ability of Japanese LLMs to solve Japanese medical license exams, surpassing 50% in accuracy. In particular, the Japanese-centric models exhibit a more significant leap in improvement through instruction tuning compared to their English-centric counterparts. This underscores the importance of continual pretraining and the adjustment of the tokenizer in our local language. We also examine two slightly different prompt formats, resulting in non-negligible performance improvement.
摘要:自大型语言模型(LLM)兴起以来,领域适应一直是各个领域的热门话题之一。许多使用英语医学数据集训练的医学LLM最近已公开。然而,医学领域的日语LLM仍然缺乏研究。在这里,我们首次利用多个70B参数LLM,并表明使用日语医学问答数据集进行指令微调可以显著提高日语LLM解答日本医师执照考试的能力,准确率超过50%。特别是,与以英语为中心的模型相比,以日语为中心的模型通过指令微调表现出更显著的提升。这强调了持续预训练以及针对本地语言调整分词器的重要性。我们还研究了两种略有不同的提示格式,从而带来了不可忽视的性能改进。

[NLP-57] Sports Intelligence: Assessing the Sports Understanding Capabilities of Language Models through Question Answering from Text to Video
[NLP-57] 运动智力:通过从文本到视频的提问来评估语言模型的运动理解能力

链接: https://arxiv.org/abs/2406.14877
作者: Zhengbang Yang,Haotian Xia,Jingxi Li,Zezhi Chen,Zhuangdi Zhu,Weining Shen
关键词: Natural Language Processing, advancement of Natural, Language Processing, Natural Language, dynamic nature
中文关键词: 自然语言处理,自然的进步,语言处理,自然语言,动态自然
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Understanding sports is crucial for the advancement of Natural Language Processing (NLP) due to its intricate and dynamic nature. Reasoning over complex sports scenarios has posed significant challenges to current NLP technologies which require advanced cognitive capabilities. Toward addressing the limitations of existing benchmarks on sports understanding in the NLP field, we extensively evaluated mainstream large language models for various sports tasks. Our evaluation spans from simple queries on basic rules and historical facts to complex, context-specific reasoning, leveraging strategies from zero-shot to few-shot learning, and chain-of-thought techniques. In addition to unimodal analysis, we further assessed the sports reasoning capabilities of mainstream video language models to bridge the gap in multimodal sports understanding benchmarking. Our findings highlighted the critical challenges of sports understanding for NLP. We proposed a new benchmark based on a comprehensive overview of existing sports datasets and provided extensive error analysis which we hope can help identify future research priorities in this field.
摘要:由于运动的复杂性和动态性,理解运动对于自然语言处理(NLP)的发展至关重要。对复杂运动场景的推理需要高级认知能力,这对当前的NLP技术提出了重大挑战。为了解决NLP领域现有体育理解基准的局限性,我们在各种体育任务上广泛评估了主流大型语言模型。我们的评估范围从对基本规则和历史事实的简单查询到复杂的、特定于上下文的推理,采用了从零样本到少样本学习的策略以及思维链技术。除了单模态分析外,我们还进一步评估了主流视频语言模型的运动推理能力,以弥补多模态运动理解基准测试的空白。我们的发现突出了体育理解对NLP的关键挑战。我们在全面概述现有体育数据集的基础上提出了一个新的基准,并提供了广泛的错误分析,我们希望这可以帮助确定该领域未来的研究重点。

[NLP-58] Direct Multi-Turn Preference Optimization for Language Agents
[NLP-58] 语言代理的直接多轮偏好优化

链接: https://arxiv.org/abs/2406.14868
作者: Wentao Shi,Mengqi Yuan,Junkang Wu,Qifan Wang,Fuli Feng
关键词: Adapting Large Language, Large Language Models, Adapting Large, developing language agents, Large Language
中文关键词: 适应大型语言,大型语言模型,适应大型,开发语言代理,大型语言
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Adapting Large Language Models (LLMs) for agent tasks is critical in developing language agents. Direct Preference Optimization (DPO) is a promising technique for this adaptation with the alleviation of compounding errors, offering a means to directly optimize Reinforcement Learning (RL) objectives. However, applying DPO to multi-turn tasks presents challenges due to the inability to cancel the partition function. Overcoming this obstacle involves making the partition function independent of the current state and addressing length disparities between preferred and dis-preferred trajectories. In this light, we replace the policy constraint with the state-action occupancy measure constraint in the RL objective and add length normalization to the Bradley-Terry model, yielding a novel loss function named DMPO for multi-turn agent tasks with theoretical explanations. Extensive experiments on three multi-turn agent task datasets confirm the effectiveness and superiority of the DMPO loss.
摘要:使大语言模型(LLM)适应智能体任务,是开发语言智能体的关键。直接偏好优化(DPO)是一种很有前途的适应技术,它可以缓解累积误差,并提供一种直接优化强化学习(RL)目标的方法。然而,由于无法消去配分函数,将DPO应用于多轮任务会带来挑战。克服这一障碍需要使配分函数独立于当前状态,并解决偏好轨迹与非偏好轨迹之间的长度差异。为此,我们在RL目标中用状态-动作占用度量约束来代替策略约束,并在Bradley-Terry模型中加入长度归一化,得到了一种用于多轮智能体任务的新损失函数DMPO,并给出了理论解释。在三个多轮智能体任务数据集上的大量实验证实了DMPO损失的有效性和优越性。
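
To see why length normalization matters in a Bradley-Terry style preference loss over multi-turn trajectories, consider the simplified sketch below. It is not the DMPO loss from the paper: it assumes each trajectory is summarized by per-turn log-ratios between the policy and a reference model, and it normalizes by plain averaging over turns, which is an illustrative choice. The function names and `beta` value are hypothetical.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def trajectory_score(turn_log_ratios, normalize):
    """Sum of per-turn log(pi_theta / pi_ref), optionally divided by the number of turns."""
    total = sum(turn_log_ratios)
    return total / len(turn_log_ratios) if normalize else total


def preference_loss(win_log_ratios, lose_log_ratios, beta=0.1, normalize=True):
    """Bradley-Terry negative log-likelihood that the preferred trajectory wins."""
    margin = (trajectory_score(win_log_ratios, normalize)
              - trajectory_score(lose_log_ratios, normalize))
    return -log_sigmoid(beta * margin)


# Same per-turn quality, different lengths: 2 turns (preferred) vs. 6 turns.
win = [0.5, 0.5]
lose = [0.5] * 6
loss_raw = preference_loss(win, lose, normalize=False)
loss_norm = preference_loss(win, lose, normalize=True)
```

Without normalization the longer, dis-preferred trajectory accumulates a larger summed score purely because it has more turns, so the loss is above log 2; with averaging the two trajectories tie and the loss is exactly log 2. This is the length disparity between preferred and dis-preferred trajectories that the abstract refers to.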

[NLP-59] DistiLRR: Transferring Code Repair for Low-Resource Programming Languages
[NLP-59] DistiLRR:低资源编程语言的转移代码修复

链接: https://arxiv.org/abs/2406.14867
作者: Kyle Wong,Alfonso Amayuelas,Liangming Pan,William Yang Wang
关键词: Large language models, shown remarkable performance, Large language, code generation tasks, code repair
中文关键词: 大型语言模型,表现出出色的性能,大型语言,代码生成任务,代码修复
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have shown remarkable performance on code generation tasks. A recent application of LLMs for code generation is iterative code repair, where a model fixes an incorrect program by rationalizing about errors and generating a new program. However, code repair is primarily studied on high-resource languages like Python, and the framework’s efficacy is under-explored on low-resource languages. To apply code repair for low-resource languages, we propose Distilling Low-Resource Repairs (DistiLRR), an approach that transfers the reasoning and code generation ability from a teacher model to a student model. Our results show that DistiLRR consistently outperforms baselines on low-resource languages, but has similar performance on high-resource languages. To investigate this behavior, we perform a further analysis and find that the correlation between rationale quality and code correctness is weaker than previously perceived. We hypothesize this weakness is magnified in low-resource settings where base models lack deep knowledge of a programming language, leading to wavering benefits of code repair between high-resource and low-resource languages.
摘要:大型语言模型(LLM)在代码生成任务中表现出了显著的性能。LLM在代码生成方面的一个最新应用是迭代式代码修复,其中模型通过对错误进行推理并生成新程序来修复不正确的程序。然而,代码修复主要是在高资源语言(如Python)上研究的,而该框架在低资源语言上的有效性还没有得到充分的探索。为了将代码修复应用于低资源语言,我们提出了Distilling Low-Resource Repairs(DistiLRR),一种将推理和代码生成能力从教师模型迁移到学生模型的方法。我们的结果表明,DistiLRR在低资源语言上始终优于基线,但在高资源语言上的性能与基线相近。为了探究这一现象,我们进行了进一步的分析,发现推理依据(rationale)的质量与代码正确性之间的相关性比之前认为的要弱。我们推测,这一弱点在低资源环境中被放大,因为基础模型缺乏对该编程语言的深入知识,从而导致代码修复在高资源语言和低资源语言之间的收益不一致。

[NLP-60] LatentExplainer: Explaining Latent Representations in Deep Generative Models with Multi-modal Foundation Models
[NLP-60] LatentExplainer:用多模式基础模型解释深度生成模型中的潜在表示

链接: https://arxiv.org/abs/2406.14862
作者: Mengdan Zhu,Raasikh Kanjiani,Jiahui Lu,Andrew Choi,Qirui Ye,Liang Zhao
关键词: Deep generative models, latent variables, leveraging latent variables, generate high-quality samples, generative models
中文关键词: 深度生成模型,潜在变量,利用潜在变量,生成高质量样本,生成模型
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Deep generative models like VAEs and diffusion models have advanced various generation tasks by leveraging latent variables to learn data distributions and generate high-quality samples. Despite the field of explainable AI making strides in interpreting machine learning models, understanding latent variables in generative models remains challenging. This paper introduces LatentExplainer, a framework for automatically generating semantically meaningful explanations of latent variables in deep generative models. LatentExplainer tackles three main challenges: inferring the meaning of latent variables, aligning explanations with inductive biases, and handling varying degrees of explainability. By perturbing latent variables and interpreting changes in generated data, the framework provides a systematic approach to understanding and controlling the data generation process, enhancing the transparency and interpretability of deep generative models. We evaluate our proposed method on several real-world and synthetic datasets, and the results demonstrate superior performance in generating high-quality explanations of latent variables.
摘要:深度生成模型,如VAE和扩散模型,通过利用潜在变量来学习数据分布并生成高质量的样本,从而推进了各种生成任务。尽管可解释人工智能领域在解释机器学习模型方面取得了进展,但理解生成模型中的潜在变量仍然具有挑战性。本文介绍了LatentExplainer框架,该框架用于自动生成深层生成模型中潜在变量的语义有意义的解释。LatentExplainer解决了三个主要挑战:推断潜在变量的含义,使解释与归纳偏差保持一致,以及处理不同程度的可解释性。通过扰动潜在变量和解释生成数据的变化,该框架提供了一种系统的方法来理解和控制数据生成过程,提高了深度生成模型的透明度和可解释性。我们在几个真实世界和合成数据集上对我们提出的方法进行了评估,结果表明我们的方法在生成对潜在变量的高质量解释方面表现出了优越的性能。

[NLP-61] From LLMs to MLLMs: Exploring the Landscape of Multimodal Jailbreaking
[NLP-61] 从LLM到MLLM:探索多模式越狱的格局

链接: https://arxiv.org/abs/2406.14859
作者: Siyuan Wang,Zhuohan Long,Zhihao Fan,Zhongyu Wei
关键词: Large Language Models, Language Models, Multimodal Large Language, Large Language, Models
中文关键词: 大型语言模型,语言模型,多模式大型语言,大型语言,模型
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The rapid development of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) has exposed vulnerabilities to various adversarial attacks. This paper provides a comprehensive overview of jailbreaking research targeting both LLMs and MLLMs, highlighting recent advancements in evaluation benchmarks, attack techniques and defense strategies. Compared to the more advanced state of unimodal jailbreaking, multimodal domain remains underexplored. We summarize the limitations and potential research directions of multimodal jailbreaking, aiming to inspire future research and further enhance the robustness and security of MLLMs.
摘要:大型语言模型(LLM)和多模式大型语言模型(MLLM)的快速发展暴露了各种对抗攻击的脆弱性。本文全面概述了针对LLM和MLLM的越狱研究,重点介绍了评估基准、攻击技术和防御策略方面的最新进展。与更先进的单模式越狱相比,多模式领域仍然被探索不足。我们总结了多模式越狱的局限性和潜在研究方向,旨在启发未来的研究并进一步增强MLLM的稳健性和安全性。

[NLP-62] Leveraging Passage Embeddings for Efficient Listwise Reranking with Large Language Models
[NLP-62] 利用段落嵌入以高效的列表方式重新排序大型语言模型

链接: https://arxiv.org/abs/2406.14848
作者: Qi Liu,Bo Wang,Nan Wang,Jiaxin Mao
关键词: large language language, language language models, Recent studies, language language, large language
中文关键词: 大语言语言,语言语言模型,最近的研究,语言语言,大语言
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Recent studies have demonstrated the effectiveness of using large language models (LLMs) in passage ranking. The listwise approaches, such as RankGPT, have become new state-of-the-art in this task. However, the efficiency of RankGPT models is limited by the maximum context length and relatively high latency of LLM inference. To address these issues, in this paper, we propose PE-Rank, leveraging the single passage embedding as a good context compression for efficient listwise passage reranking. By treating each passage as a special token, we can directly input passage embeddings into LLMs, thereby reducing input length. Additionally, we introduce an inference method that dynamically constrains the decoding space to these special tokens, accelerating the decoding process. For adapting the model to reranking, we employ listwise learning to rank loss for training. Evaluation results on multiple benchmarks demonstrate that PE-Rank significantly improves efficiency in both prefilling and decoding, while maintaining competitive ranking effectiveness. The code is available at this https URL.
摘要:最近的研究表明,使用大型语言模型(LLM)进行段落排序是有效的。以RankGPT为代表的列表式方法已成为这项任务上新的最先进方法。然而,RankGPT模型的效率受到最大上下文长度和LLM推理相对较高延迟的限制。为了解决这些问题,本文提出了PE-Rank,利用单个段落嵌入作为一种良好的上下文压缩,实现高效的列表式段落重排序。通过将每个段落视为一个特殊词元(token),我们可以直接将段落嵌入输入LLM,从而缩短输入长度。此外,我们还引入了一种推理方法,动态地将解码空间限制在这些特殊词元上,从而加快解码过程。为了使模型适应重排序,我们采用列表式排序学习(learning-to-rank)损失进行训练。在多个基准上的评估结果表明,PE-Rank在保持有竞争力的排序效果的同时,显著提高了预填充和解码的效率。代码可在此HTTPS URL上找到。
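
The constrained decoding idea in the PE-Rank abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: `score_fn`, the token names, and the choice to drop already-ranked passages from the candidate set are all assumptions made for illustration. The only point taken from the abstract is that decoding is restricted to the special passage tokens, so ordinary vocabulary tokens can never be emitted.

```python
import math


def constrained_rank(score_fn, passage_tokens):
    """Greedy listwise decoding restricted to passage tokens.

    score_fn(prefix) is a hypothetical stand-in for the LLM: it returns a
    dict mapping every vocabulary token to a logit, given the ranking so far.
    """
    remaining = set(passage_tokens)
    ranking = []
    while remaining:
        logits = score_fn(ranking)
        # Only passage tokens that are still unranked are candidates;
        # all other vocabulary entries are ignored (masked out).
        best = max(remaining, key=lambda tok: logits.get(tok, -math.inf))
        ranking.append(best)
        remaining.remove(best)
    return ranking


def toy_score_fn(prefix):
    # Assumed toy logits; the ordinary word "the" has the highest score
    # but must never be selected under the constraint.
    return {"the": 9.0, "p1": 0.2, "p2": 1.5, "p3": 0.7}


ranking = constrained_rank(toy_score_fn, ["p1", "p2", "p3"])
print(ranking)
```

Each step here makes one choice per passage over a candidate set of at most the number of passages, which is where the decoding speed-up would plausibly come from in such a scheme.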

[NLP-63] ToVo: Toxicity Taxonomy via Voting
[NLP-63] ToVo:通过投票进行毒性分类

链接: https://arxiv.org/abs/2406.14835
作者: Tinh Son Luong,Thanh-Thien Le,Thang Viet Doan,Linh Ngo Van,Thien Huu Nguyen,Diep Thi-Ngoc Nguyen
关键词: face significant limitations, models face significant, significant limitations, face significant, detection models face
中文关键词: 面临重大限制,模型面临重大、重大限制,面临重大、检测模型面临
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

Abstract:Existing toxic detection models face significant limitations, such as lack of transparency, customization, and reproducibility. These challenges stem from the closed-source nature of their training data and the paucity of explanations for their evaluation mechanism. To address these issues, we propose a dataset creation mechanism that integrates voting and chain-of-thought processes, producing a high-quality open-source dataset for toxic content detection. Our methodology ensures diverse classification metrics for each sample and includes both classification scores and explanatory reasoning for the classifications. We utilize the dataset created through our proposed mechanism to train our model, which is then compared against existing widely-used detectors. Our approach not only enhances transparency and customizability but also facilitates better fine-tuning for specific use cases. This work contributes a robust framework for developing toxic content detection models, emphasizing openness and adaptability, thus paving the way for more effective and user-specific content moderation solutions.
摘要:现有的有毒内容检测模型面临着明显的局限性,如缺乏透明度、可定制性和可复现性。这些挑战源于其训练数据的闭源性质,以及对其评估机制缺乏解释。为了解决这些问题,我们提出了一种整合投票和思维链过程的数据集创建机制,为有毒内容检测生成高质量的开源数据集。我们的方法确保每个样本具有多样的分类度量,并同时包含分类分数和分类的解释性推理。我们利用通过该机制创建的数据集来训练我们的模型,然后将其与现有的广泛使用的检测器进行比较。我们的方法不仅提高了透明度和可定制性,还有助于针对特定用例进行更好的微调。这项工作为开发有毒内容检测模型提供了一个稳健的框架,强调开放性和适应性,从而为更有效、更贴合用户需求的内容审核解决方案铺平了道路。

[NLP-64] Efficient Continual Pre-training by Mitigating the Stability Gap
[NLP-64] 通过缓解稳定性差距进行高效的连续预训练

链接: https://arxiv.org/abs/2406.14833
作者: Yiduo Guo,Jie Fu,Huishuai Zhang,Dongyan Zhao,Yikang Shen
关键词: adapting Large Language, Large Language Models, Large Language, Language Models, Continual pre-training
中文关键词: 适应大型语言、大型语言模型、大型语言、语言模型、连续预训练
类目: Computation and Language (cs.CL)
备注:

Abstract:Continual pre-training has increasingly become the predominant approach for adapting Large Language Models (LLMs) to new domains. This process involves updating the pre-trained LLM with a corpus from a new domain, resulting in a shift in the training distribution. To study the behavior of LLMs during this shift, we measured the model’s performance throughout the continual pre-training process. We observed a temporary performance drop at the beginning, followed by a recovery phase, a phenomenon known as the “stability gap,” previously noted in vision models classifying new classes. To address this issue and enhance LLM performance within a fixed compute budget, we propose three effective strategies: (1) Continually pre-training the LLM on a subset with a proper size for multiple epochs, resulting in faster performance recovery than pre-training the LLM on a large corpus in a single epoch; (2) Pre-training the LLM only on a high-quality sub-corpus, which rapidly boosts domain performance; and (3) Using a data mixture similar to the pre-training data to reduce the distribution gap. We conduct various experiments on Llama-family models to validate the effectiveness of our strategies in both medical continual pre-training and instruction tuning. For example, our strategies improve the average medical task performance of the OpenLlama-3B model from 36.2% to 40.7% with only 40% of the original training budget and enhance the average general task performance without causing forgetting. Furthermore, we apply our strategies to the Llama-3-8B model. The resulting model, Llama-3-Physician, achieves the best medical performance among current open-source models, and performs comparably to or even better than GPT-4 on several medical benchmarks. We release our models at this https URL.
摘要:持续预训练已日益成为使大型语言模型(LLM)适应新领域的主要方法。这个过程用来自新领域的语料库更新预训练的LLM,导致训练分布发生转变。为了研究LLM在这种转变过程中的行为,我们测量了模型在整个持续预训练过程中的表现。我们观察到,性能在开始时会暂时下降,随后进入恢复阶段,这一现象被称为“稳定性差距”,此前已在对新类别进行分类的视觉模型中被注意到。为了解决这个问题并在固定的计算预算内提高LLM的性能,我们提出了三种有效的策略:(1)在适当大小的子集上对LLM进行多个轮次(epoch)的持续预训练,其性能恢复比在大型语料库上进行单个轮次的预训练更快;(2)仅在高质量的子语料库上预训练LLM,从而快速提高领域性能;以及(3)使用与预训练数据相似的数据混合来缩小分布差距。我们在Llama系列模型上进行了各种实验,以验证我们的策略在医学持续预训练和指令微调中的有效性。例如,我们的策略将OpenLlama-3B模型的平均医疗任务性能从36.2%提高到40.7%,而只使用原始训练预算的40%,并在不导致遗忘的情况下提高了平均通用任务性能。此外,我们将我们的策略应用于Llama-3-8B模型。由此产生的模型Llama-3-Physician在当前开源模型中实现了最好的医疗性能,在几个医疗基准上的性能与GPT-4相当,甚至更好。我们在此HTTPS URL上发布我们的模型。

[NLP-65] Is this a bad table? A Closer Look at the Evaluation of Table Generation from Text
[NLP-65] 这是一张糟糕的表格吗?仔细审视文本生成表格的评估

链接: https://arxiv.org/abs/2406.14829
作者: Pritika Ramu,Aparna Garimella,Sambaran Bandyopadhyay
关键词: creating or editing, editing documents, documents using automatic, Understanding, automatic methods
中文关键词: 创建或编辑、编辑文档、使用自动、理解、自动方法的文档
类目: Computation and Language (cs.CL)
备注:

Abstract:Understanding whether a generated table is of good quality is important to be able to use it in creating or editing documents using automatic methods. In this work, we underline that existing measures for table quality evaluation fail to capture the overall semantics of the tables, and sometimes unfairly penalize good tables and reward bad ones. We propose TabEval, a novel table evaluation strategy that captures table semantics by first breaking down a table into a list of natural language atomic statements and then comparing them with ground truth statements using entailment-based measures. To validate our approach, we curate a dataset comprising text descriptions for 1,250 diverse Wikipedia tables, covering a range of topics and structures, in contrast to the limited scope of existing datasets. We compare TabEval with existing metrics using unsupervised and supervised text-to-table generation methods, demonstrating its stronger correlation with human judgments of table quality across four datasets.
摘要:了解生成的表格是否具有良好的质量,对于能够使用它使用自动方法创建或编辑文档非常重要。在这项工作中,我们强调了现有的表质量评估方法不能捕获表的整体语义,有时不公平地惩罚好的表而奖励坏的表。我们提出了一种新的表评估策略TabEval,该策略首先将表分解为一组自然语言原子语句,然后使用基于蕴涵的度量将其与基本真值语句进行比较,从而捕获表的语义。为了验证我们的方法,我们整理了一个包含1250个不同维基百科表格的文本描述的数据集,涵盖了一系列主题和结构,与现有数据集的有限范围形成了对比。我们使用无监督和有监督的文本到表格生成方法将TabEval与现有指标进行了比较,证明了它与人类对四个数据集的表格质量的判断具有更强的相关性。

[NLP-66] Word Matters: What Influences Domain Adaptation in Summarization?
[NLP-66] 词很重要:是什么影响摘要任务中的领域适应?

链接: https://arxiv.org/abs/2406.14828
作者: Yinghao Li,Siyu Miao,Heyan Huang,Yang Gao
关键词: enable Large Language, Large Language Models, Large Language, Domain adaptation aims, enable Large
中文关键词: 启用大型语言、大型语言模型、大型语言、领域适应目标、启用大型
类目: Computation and Language (cs.CL)
备注:

Abstract:Domain adaptation aims to enable Large Language Models (LLMs) to generalize effectively to domain datasets unseen during the training phase. However, factors such as the size of the model parameters and the scale of training data are general influencers and do not reflect the nuances of domain adaptation performance. This paper investigates the fine-grained factors affecting domain adaptation performance, analyzing the specific impact of “words” in training data on summarization tasks. We propose quantifying dataset learning difficulty as the learning difficulty of generative summarization, which is determined by two indicators: word-based compression rate and abstraction level. Our experiments conclude that, when considering dataset learning difficulty, the cross-domain overlap and the performance gain in summarization tasks exhibit an approximate linear relationship, which is not directly related to the number of words. Based on this finding, predicting a model’s performance on unknown domain datasets is possible without undergoing training.
摘要:领域自适应旨在使大型语言模型(LLM)能够有效地泛化到训练阶段未见过的领域数据集。然而,模型参数的大小和训练数据的规模等因素是一般性的影响因素,并不反映领域适应性能的细微差别。本文考察了影响领域适应性能的细粒度因素,分析了训练数据中的“词”对摘要任务的具体影响。我们提出将数据集的学习难度量化为生成式摘要的学习难度,它由两个指标决定:基于词的压缩率和抽象程度。实验结果表明,在考虑数据集学习难度的情况下,摘要任务的跨域重叠度与性能增益呈现近似的线性关系,而与词数没有直接关系。基于这一发现,无需经过训练即可预测模型在未知领域数据集上的性能。

[NLP-67] TemPrompt: Multi-Task Prompt Learning for Temporal Relation Extraction in RAG-based Crowdsourcing Systems
[NLP-67] TemPrompt:基于RAG的众包系统中用于时间关系提取的多任务提示学习

链接: https://arxiv.org/abs/2406.14825
作者: Jing Yang,Yu Zhao,Yang Linyao,Xiao Wang,Fei-Yue Wang
关键词: helping understand task, understand task requests, task requests initiated, Temporal relation extraction, relation extraction
中文关键词: 帮助理解任务、理解任务请求、启动任务请求、时间关系提取、关系提取
类目: Computation and Language (cs.CL)
备注: 12 pages, 9 figures

Abstract:Temporal relation extraction (TRE) aims to grasp the evolution of events or actions, and thus shape the workflow of associated tasks, so it holds promise in helping understand task requests initiated by requesters in crowdsourcing systems. However, existing methods still struggle with limited and unevenly distributed annotated data. Therefore, inspired by the abundant global knowledge stored within pre-trained language models (PLMs), we propose a multi-task prompt learning framework for TRE (TemPrompt), incorporating prompt tuning and contrastive learning to tackle these issues. To elicit more effective prompts for PLMs, we introduce a task-oriented prompt construction approach that thoroughly takes the myriad factors of TRE into consideration for automatic prompt generation. In addition, we present temporal event reasoning as a supplement to bolster the model’s focus on events and temporal cues. The experimental results demonstrate that TemPrompt outperforms all compared baselines across the majority of metrics under both standard and few-shot settings. A case study is provided to validate its effectiveness in crowdsourcing scenarios.
摘要:时间关系提取(TRE)旨在掌握事件或动作的演化,从而塑造相关任务的工作流,因此它有望帮助理解众包系统中请求者发起的任务请求。然而,现有的方法仍然难以应对数量有限且分布不均的标注数据。因此,受预训练语言模型(PLM)中存储的丰富全局知识的启发,我们提出了一种用于TRE的多任务提示学习框架(TemPrompt),该框架结合提示调优和对比学习来解决这些问题。为了给PLM构造更有效的提示,我们引入了一种面向任务的提示构建方法,该方法充分考虑了TRE的各种因素,以实现提示的自动生成。此外,我们还提出了时间事件推理作为补充,以加强模型对事件和时间线索的关注。实验结果表明,在标准设置和少样本设置下,TemPrompt在大多数度量指标上都优于所有对比基线。我们还提供了一个案例研究,以验证其在众包场景中的有效性。

[NLP-68] How Well Do LLMs Represent Values Across Cultures? Empirical Analysis of LLM Responses Based on Hofstede Cultural Dimensions
[NLP-68] LLM能在多大程度上代表跨文化的价值观?基于霍夫斯特德文化维度的LLM回应实证分析

链接: https://arxiv.org/abs/2406.14805
作者: Julia Kharchenko,Tanya Roosta,Aman Chadha,Chirag Shah
关键词: Large Language Models, imitate human behavior, Large Language, Language Models, attempt to imitate
中文关键词: 大型语言模型,模仿人类行为,大型语言,语言模型,尝试模仿
类目: Computation and Language (cs.CL)
备注:

Abstract:Large Language Models (LLMs) attempt to imitate human behavior by responding to humans in a way that pleases them, including by adhering to their values. However, humans come from diverse cultures with different values. It is critical to understand whether LLMs showcase different values to the user based on the stereotypical values of a user’s known country. We prompt different LLMs with a series of advice requests based on 5 Hofstede Cultural Dimensions – a quantifiable way of representing the values of a country. Throughout each prompt, we incorporate personas representing 36 different countries and, separately, languages predominantly tied to each country to analyze the consistency in the LLMs’ cultural understanding. Through our analysis of the responses, we found that LLMs can differentiate between one side of a value and another, as well as understand that countries have differing values, but will not always uphold the values when giving advice, and fail to understand the need to answer differently based on different cultural values. Rooted in these findings, we present recommendations for training value-aligned and culturally sensitive LLMs. More importantly, the methodology and the framework developed here can help further understand and mitigate culture and language alignment issues with LLMs.
摘要:大型语言模型(LLM)试图模仿人类的行为,以取悦人类的方式回应人类,包括遵循人类的价值观。然而,人类来自不同的文化,有着不同的价值观。关键是要了解LLM是否会基于用户已知所在国家的刻板价值观向用户展示不同的价值观。我们根据5个霍夫斯特德文化维度(一种可量化地表示一个国家价值观的方式)提出一系列建议请求,以提示不同的LLM。在每个提示中,我们纳入了代表36个不同国家的人物角色,并分别使用主要与每个国家相关联的语言,以分析LLM在文化理解方面的一致性。通过对回答的分析,我们发现LLM能够区分价值观的一个方面和另一个方面,并理解各国有不同的价值观,但在提供建议时并不总是坚持这些价值观,也不理解需要根据不同的文化价值观做出不同的回答。基于这些发现,我们提出了训练价值对齐且具有文化敏感性的LLM的建议。更重要的是,这里开发的方法和框架有助于进一步理解和缓解LLM的文化和语言对齐问题。

[NLP-69] Understanding Finetuning for Factual Knowledge Extraction
[NLP-69] 了解微调以提取事实知识

链接: https://arxiv.org/abs/2406.14785
作者: Gaurav Ghosal,Tatsunori Hashimoto,Aditi Raghunathan
关键词: study the impact, downstream factuality, lesser-known facts, facts, fine-tuning
中文关键词: 研究影响、下游事实、鲜为人知的事实、事实、微调
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: To appear in ICML 2024

Abstract:In this work, we study the impact of QA fine-tuning data on downstream factuality. We show that fine-tuning on lesser-known facts that are poorly stored during pretraining yields significantly worse factuality than fine-tuning on well-known facts, even when all facts are seen during pretraining. We prove this phenomenon theoretically, showing that training on lesser-known facts can lead the model to ignore subject entity names and instead output a generic plausible response even when the relevant factual knowledge is encoded in the model. On three question answering benchmarks (PopQA, Entity Questions, and MMLU) and two language models (Llama-2-7B and Mistral-7B), we find that (i) finetuning on a completely factual but lesser-known subset of the data deteriorates downstream factuality (5-10%) and (ii) finetuning on a subset of better-known examples matches or outperforms finetuning on the entire dataset. Ultimately, our results shed light on the interaction between pretrained knowledge and finetuning data and demonstrate the importance of taking into account how facts are stored in the pretrained model when fine-tuning for knowledge-intensive tasks.
摘要:在这项工作中,我们研究了问答(QA)微调数据对下游事实性的影响。我们表明,即使所有事实都在预训练期间出现过,对预训练期间存储得较差的鲜为人知的事实进行微调,其事实性也明显差于对众所周知的事实进行微调。我们从理论上证明了这一现象,表明对鲜为人知的事实进行训练会导致模型忽略主体实体名称,即使相关的事实知识已编码在模型中,也会转而输出通用的看似合理的回答。在三个问答基准(PopQA、Entity Questions和MMLU)和两个语言模型(Llama-2-7B和Mistral-7B)上,我们发现:(i)在完全符合事实但鲜为人知的数据子集上进行微调会降低下游事实性(5-10%);(ii)在较为人熟知的示例子集上进行微调,其效果与在整个数据集上进行微调相当甚至更优。最终,我们的结果揭示了预训练知识和微调数据之间的相互作用,并证明了在为知识密集型任务进行微调时,考虑事实在预训练模型中如何存储的重要性。

[NLP-70] Evaluating RAG-Fusion with RAGElo: an Automated Elo-based Framework
[NLP-70] 使用RAGElo评估RAG-Fusion:一个基于Elo的自动化框架

链接: https://arxiv.org/abs/2406.14783
作者: Zackary Rackauckas,Arthur Câmara,Jakub Zavrel
关键词: systems include hallucination, gold standard benchmarks, company internal tasks, include hallucination problems, Retrieval-Augmented Generation
中文关键词: 系统包括幻觉、金标准基准、公司内部任务、包括幻觉问题、检索增强一代
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注: Accepted to LLM4Eval @ SIGIR24

Abstract:Challenges in the automated evaluation of Retrieval-Augmented Generation (RAG) Question-Answering (QA) systems include hallucination problems in domain-specific knowledge and the lack of gold standard benchmarks for company internal tasks. This results in difficulties in evaluating RAG variations, like RAG-Fusion (RAGF), in the context of a product QA task at Infineon Technologies. To solve these problems, we propose a comprehensive evaluation framework, which leverages Large Language Models (LLMs) to generate large datasets of synthetic queries based on real user queries and in-domain documents, uses LLM-as-a-judge to rate retrieved documents and answers, evaluates the quality of answers, and ranks different variants of Retrieval-Augmented Generation (RAG) agents with RAGElo’s automated Elo-based competition. LLM-as-a-judge rating of a random sample of synthetic queries shows a moderate, positive correlation with domain expert scoring in relevance, accuracy, completeness, and precision. While RAGF outperformed RAG in Elo score, a significance analysis against expert annotations also shows that RAGF significantly outperforms RAG in completeness, but underperforms in precision. In addition, Infineon’s RAGF assistant demonstrated slightly higher performance in document relevance based on MRR@5 scores. We find that RAGElo positively aligns with the preferences of human annotators, though due caution is still required. Finally, RAGF’s approach leads to more complete answers based on expert annotations and better answers overall based on RAGElo’s evaluation criteria.
摘要:检索增强生成(RAG)问答(QA)系统的自动评估面临的挑战包括特定领域知识中的幻觉问题,以及公司内部任务缺乏黄金标准基准。这导致在英飞凌科技公司(Infineon Technologies)的产品问答任务中,难以评估RAG-Fusion(RAGF)等RAG变体。为了解决这些问题,我们提出了一个综合的评估框架,该框架利用大型语言模型(LLM)基于真实用户查询和领域内文档生成大型合成查询数据集,使用LLM作为评判(LLM-as-a-judge)对检索到的文档和答案进行评级,评估答案的质量,并通过RAGElo基于Elo的自动化竞赛对检索增强生成(RAG)代理的不同变体进行排名。对合成查询的随机样本进行的LLM评判评分显示,在相关性、准确性、完整性和精确度方面,与领域专家评分呈中等正相关。虽然RAGF在Elo得分上优于RAG,但针对专家标注的显著性分析也表明,RAGF在完整性方面显著优于RAG,但在精确度方面逊于RAG。此外,根据MRR@5分数,英飞凌的RAGF助手在文档相关性方面表现略好。我们发现RAGElo与人类标注者的偏好正向一致,但仍需保持应有的谨慎。最后,根据专家标注,RAGF的方法可产生更完整的答案;根据RAGElo的评估标准,其答案总体上也更好。
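
For readers unfamiliar with Elo-based ranking as mentioned in the RAGElo abstract above, here is a minimal sketch of the standard Elo update applied to pairwise comparisons between agents. It is not RAGElo's code: the K-factor of 32, the initial rating of 1000, the agent names, and the game outcomes are assumptions chosen for illustration, and in the paper's setting the outcome of each game would come from an LLM judge rather than a hard-coded list.

```python
def expected_score(r_a, r_b):
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_update(r_a, r_b, score_a, k=32.0):
    """Return updated ratings; score_a is 1.0 (A wins), 0.5 (tie) or 0.0."""
    e_a = expected_score(r_a, r_b)
    new_a = r_a + k * (score_a - e_a)
    # B's actual and expected scores are the complements of A's.
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b


def run_tournament(games, agents, initial=1000.0, k=32.0):
    """Apply Elo updates over a sequence of (agent_a, agent_b, score_a)."""
    ratings = {name: initial for name in agents}
    for a, b, score_a in games:
        ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], score_a, k)
    return ratings


# Hypothetical judged comparisons between two pipeline variants.
games = [("RAGF", "RAG", 1.0), ("RAGF", "RAG", 0.5), ("RAG", "RAGF", 0.0)]
ratings = run_tournament(games, ["RAG", "RAGF"])
print(ratings)
```

Because each update adds to one rating exactly what it subtracts from the other, the total rating mass stays constant, so only the relative ordering of agents carries information.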

[NLP-71] Evaluating Numerical Reasoning in Text-to-Image Models
[NLP-71] 评估文本到图像模型中的数值推理

链接: https://arxiv.org/abs/2406.14774
作者: Ivana Kajić,Olivia Wiles,Isabela Albuquerque,Matthias Bauer,Su Wang,Jordi Pont-Tuset,Aida Nematzadeh
关键词: faithfully depict concepts, producing high-quality images, natural language, capable of producing, producing high-quality
中文关键词: 忠实地描绘概念,产生高质量的图像,自然语言,能够产生,产生高质量的
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

Abstract:Text-to-image generative models are capable of producing high-quality images that often faithfully depict concepts described using natural language. In this work, we comprehensively evaluate a range of text-to-image models on numerical reasoning tasks of varying difficulty, and show that even the most advanced models have only rudimentary numerical skills. Specifically, their ability to correctly generate an exact number of objects in an image is limited to small numbers, it is highly dependent on the context the number term appears in, and it deteriorates quickly with each successive number. We also demonstrate that models have poor understanding of linguistic quantifiers (such as “a few” or “as many as”), the concept of zero, and struggle with more advanced concepts such as partial quantities and fractional representations. We bundle prompts, generated images and human annotations into GeckoNum, a novel benchmark for evaluation of numerical reasoning.
摘要:文本到图像生成模型能够生成高质量的图像,这些图像通常忠实地描绘用自然语言描述的概念。在这项工作中,我们在不同难度的数值推理任务上全面评估了一系列文本到图像模型,并表明即使是最先进的模型也只具有基本的数值能力。具体来说,它们在图像中正确生成确切数量对象的能力仅限于较小的数字,高度依赖于数字词出现的上下文,并且随着数字逐个增大而迅速下降。我们还证明,模型对语言量词(例如“a few”或“as many as”)和零的概念理解较差,并且难以处理更高级的概念,例如部分量和分数表示。我们将提示、生成的图像和人类标注打包成GeckoNum,这是一个用于评估数值推理的新基准。

[NLP-72] ChatGPT as Research Scientist: Probing GPT's Capabilities as a Research Librarian, Research Ethicist, Data Generator and Data Predictor
[NLP-72] ChatGPT作为研究科学家:探索GPT作为研究图书馆员、研究伦理学家、数据生成器和数据预测器的能力

链接: https://arxiv.org/abs/2406.14765
作者: Steven A. Lehr,Aylin Caliskan,Suneragiri Liyanage,Mahzarin R. Banaji
关键词: Research Ethicist, Research Librarian, research, Study, Data Generator
中文关键词: 研究伦理学家、研究图书馆员、研究、研究、数据生成器
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: Main article is 14 pages, 1 table. Includes SI Appendix: 26 pages, 12 tables, 2 figures. Total: 40 pages, 13 tables, 2 figures. Under revised review at PNAS

Abstract:How good a research scientist is ChatGPT? We systematically probed the capabilities of GPT-3.5 and GPT-4 across four central components of the scientific process: as a Research Librarian, Research Ethicist, Data Generator, and Novel Data Predictor, using psychological science as a testing field. In Study 1 (Research Librarian), unlike human researchers, GPT-3.5 and GPT-4 hallucinated, authoritatively generating fictional references 36.0% and 5.4% of the time, respectively, although GPT-4 exhibited an evolving capacity to acknowledge its fictions. In Study 2 (Research Ethicist), GPT-4 (though not GPT-3.5) proved capable of detecting violations like p-hacking in fictional research protocols, correcting 88.6% of blatantly presented issues, and 72.6% of subtly presented issues. In Study 3 (Data Generator), both models consistently replicated patterns of cultural bias previously discovered in large language corpora, indicating that ChatGPT can simulate known results, an antecedent to usefulness for both data generation and skills like hypothesis generation. Contrastingly, in Study 4 (Novel Data Predictor), neither model was successful at predicting new results absent in their training data, and neither appeared to leverage substantially new information when predicting more versus less novel outcomes. Together, these results suggest that GPT is a flawed but rapidly improving librarian, a decent research ethicist already, capable of data generation in simple domains with known characteristics but poor at predicting novel patterns of empirical data to aid future experimentation.
摘要:ChatGPT是一名多好的研究科学家?我们系统地探讨了GPT-3.5和GPT-4在科学过程的四个核心组成部分的能力:作为研究图书馆员、研究伦理学家、数据生成器和新颖的数据预测者,以心理科学为测试领域。在研究1(研究图书馆员)中,与人类研究人员不同,GPT-3.5和GPT-4产生幻觉,分别在36.0%和5.4%的时间内权威性地产生虚构的参考文献,尽管GPT-4表现出不断进化的承认其虚构的能力。在研究2(研究伦理学家)中,GPT-4(尽管不是GPT-3.5)被证明能够检测到虚构研究协议中的p-hack等违规行为,纠正了88.6%的公然提出的问题和72.6%的微妙提出的问题。在研究3(数据生成器)中,两个模型都一致地复制了以前在大型语言语料库中发现的文化偏见模式,表明ChatGPT可以模拟已知结果,这是数据生成和假设生成等技能有用的先决条件。相比之下,在研究4(新的数据预测器)中,两个模型都没有成功地预测其训练数据中没有的新结果,而且在预测更多或更不新颖的结果时,两个模型似乎都没有充分利用新信息。总而言之,这些结果表明,GPT是一个有缺陷但正在迅速进步的图书馆员,已经是一个体面的研究伦理学家,有能力在具有已知特征的简单领域中生成数据,但在预测新的经验数据模式以帮助未来实验方面做得很差。

[NLP-73] RE-AdaptIR: Improving Information Retrieval through Reverse Engineered Adaptation
[NLP-73] RE-AdaptIR:通过反向工程适应改进信息检索

链接: https://arxiv.org/abs/2406.14764
作者: William Fleshman,Benjamin Van Durme
关键词: Large language models, Large language, fine-tuned for text-retrieval, text-retrieval have demonstrated, information retrieval
中文关键词: 大型语言模型,大型语言,针对文本检索进行微调,文本检索已演示,信息检索
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

Abstract:Large language models (LLMs) fine-tuned for text-retrieval have demonstrated state-of-the-art results across several information retrieval (IR) benchmarks. However, supervised training for improving these models requires numerous labeled examples, which are generally unavailable or expensive to acquire. In this work, we explore the effectiveness of extending reverse engineered adaptation to the context of information retrieval (RE-AdaptIR). We use RE-AdaptIR to improve LLM-based IR models using only unlabeled data. We demonstrate improved performance both in training domains as well as zero-shot in domains where the models have seen no queries. We analyze performance changes in various fine-tuning scenarios and offer findings of immediate use to practitioners.
摘要:针对文本检索进行微调的大型语言模型(LLM)已在多个信息检索(IR)基准测试中展示了最先进的结果。然而,用于改进这些模型的监督训练需要大量标注示例,而这些示例通常无法获得或获取成本高昂。在这项工作中,我们探索了将反向工程适应扩展到信息检索场景(RE-AdaptIR)的有效性。我们使用RE-AdaptIR,仅利用未标注数据来改进基于LLM的IR模型。我们展示了模型在训练领域以及在未见过任何查询的领域中的零样本性能均有所提升。我们分析了各种微调场景中的性能变化,并为从业者提供可立即使用的发现。

[NLP-74] A Learn-Then-Reason Model Towards Generalization in Knowledge Base Question Answering
[NLP-74] 知识库问题解答中的一个“先学后理”模型

链接: https://arxiv.org/abs/2406.14763
作者: Lingxi Zhang,Jing Zhang,Yanling Wang,Cuiping Li,Hong Chen
关键词: Wikidata house millions, Freebase and Wikidata, Large-scale knowledge bases, Base Question Answering, Wikidata house
中文关键词: 维基数据库数百万人、Freebase和维基数据、大规模知识库、基础问题解答、维基数据库
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

Abstract:Large-scale knowledge bases (KBs) like Freebase and Wikidata house millions of structured knowledge. Knowledge Base Question Answering (KBQA) provides a user-friendly way to access these valuable KBs via asking natural language questions. In order to improve the generalization capabilities of KBQA models, extensive research has embraced a retrieve-then-reason framework to retrieve relevant evidence for logical expression generation. These multi-stage efforts prioritize acquiring external sources but overlook the incorporation of new knowledge into their model parameters. In effect, even advanced language models and retrievers have knowledge boundaries, thereby limiting the generalization capabilities of previous KBQA models. Therefore, this paper develops KBLLaMA, which follows a learn-then-reason framework to inject new KB knowledge into a large language model for flexible end-to-end KBQA. At the core of KBLLaMA, we study (1) how to organize new knowledge about KBQA and (2) how to facilitate the learning of the organized knowledge. Extensive experiments on various KBQA generalization tasks showcase the state-of-the-art performance of KBLLaMA. Especially on the general benchmark GrailQA and domain-specific benchmark Bio-chemical, KBLLaMA respectively derives a performance gain of up to 3.8% and 9.8% compared to the baselines.
摘要:像Freebase和Wikidata这样的大型知识库包含了数以百万计的结构化知识。知识库问答(KBQA)提供了一种用户友好的方式,通过询问自然语言问题来访问这些有价值的知识库。为了提高KBQA模型的泛化能力,广泛的研究采用了先检索后推理的框架来检索用于逻辑表达式生成的相关证据。这些多阶段的努力优先考虑获取外部来源,但忽略了将新知识纳入其模型参数。实际上,即使是先进的语言模型和检索器也有知识边界,从而限制了以前的KBQA模型的泛化能力。因此,本文开发了KBLLaMA,它遵循先学习后推理的框架,将新的知识库知识注入到大型语言模型中,以实现灵活的端到端KBQA。在KBLLaMA的核心部分,我们研究了(1)如何组织关于KBQA的新知识和(2)如何促进已组织知识的学习。在各种KBQA泛化任务上的大量实验展示了KBLLaMA最先进的性能。尤其是在通用基准GrailQA和领域特定基准Bio-chemical上,与基线相比,KBLLaMA分别获得了高达3.8%和9.8%的性能提升。

[NLP-75] An LLM Feature-based Framework for Dialogue Constructiveness Assessment
[NLP-75] 基于LLM特征的对话建设性评估框架

链接: https://arxiv.org/abs/2406.14760
作者: Lexin Zhou,Youmna Farag,Andreas Vlachos
关键词: analysing conversational factors, constructiveness assessment focuses, LLM feature-based models, predicting constructive outcomes, LLM feature-based
中文关键词: 分析对话因素、建设性评估重点、LLM基于特征的模型、预测建设性结果、LLM基于特征
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

Abstract:Research on dialogue constructiveness assessment focuses on (i) analysing conversational factors that influence individuals to take specific actions, win debates, change their perspectives or broaden their open-mindedness and (ii) predicting constructive outcomes following dialogues for such use cases. These objectives can be achieved by training either interpretable feature-based models (which often involve costly human annotations) or neural models such as pre-trained language models (which have empirically shown higher task accuracy but lack interpretability). We propose a novel LLM feature-based framework that combines the strengths of feature-based and neural approaches while mitigating their downsides, in assessing dialogue constructiveness. The framework first defines a set of dataset-independent and interpretable linguistic features, which can be extracted by both prompting an LLM and simple heuristics. Such features are then used to train LLM feature-based models. We apply this framework to three datasets of dialogue constructiveness and find that our LLM feature-based models significantly outperform standard feature-based models and neural models, and tend to learn more robust prediction rules instead of relying on superficial shortcuts (as seen with neural models). Further, we demonstrate that interpreting these LLM feature-based models can yield valuable insights into what makes a dialogue constructive.
摘要:对话建构性评价的研究主要集中在:(1)分析影响个体采取具体行动、赢得辩论、改变观点或扩大开放的对话因素;(2)预测对话后对此类用例的建设性结果。这些目标可以通过训练可解释的基于特征的模型(通常涉及昂贵的人工注释)或神经模型(如预先训练的语言模型)来实现(经验表明,这些模型的任务精度较高,但缺乏可解释性)。我们提出了一种新的基于特征的LLM框架,它结合了基于特征的方法和神经方法的优点,同时减轻了它们的缺点,在评估对话建设性方面。该框架首先定义了一组独立于数据集的、可解释的语言特征,这些特征可以通过提示LLM和简单启发式两种方法来提取。这些特征然后被用来训练基于LLM特征的模型。我们将该框架应用于三个对话建设性的数据集,发现我们的LLM基于特征的模型显著优于标准的基于特征的模型和神经模型,并且倾向于学习更健壮的预测规则,而不是依赖于表面的快捷方式(与神经模型一样)。此外,我们还证明,解释这些基于特征的LLM模型可以产生有价值的见解,了解是什么使对话具有建设性。

[NLP-76] An Adapter-Based Unified Model for Multiple Spoken Language Processing Tasks
[NLP-76] 基于适配器的多口语处理任务统一模型

链接: https://arxiv.org/abs/2406.14747
作者: Varsha Suresh,Salah Aït-Mokhtar,Caroline Brun,Ioan Calapodescu
关键词: Self-supervised learning models, Self-supervised learning, revolutionized the field, Spoken Emotion Recognition, Automatic Speech Recognition
中文关键词: 自我监督学习模型,自我监督学习,彻底改变了该领域,言语情感识别,自动语音识别
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: ICASSP 2024

Abstract:Self-supervised learning models have revolutionized the field of speech processing. However, the process of fine-tuning these models on downstream tasks requires substantial computational resources, particularly when dealing with multiple speech-processing tasks. In this paper, we explore the potential of adapter-based fine-tuning in developing a unified model capable of effectively handling multiple spoken language processing tasks. The tasks we investigate are Automatic Speech Recognition, Phoneme Recognition, Intent Classification, Slot Filling, and Spoken Emotion Recognition. We validate our approach through a series of experiments on the SUPERB benchmark, and our results indicate that adapter-based fine-tuning enables a single encoder-decoder model to perform multiple speech processing tasks with an average improvement of 18.4% across the five target tasks while staying efficient in terms of parameter updates.
摘要:自监督学习模型彻底改变了语音处理领域。然而,在下游任务上微调这些模型的过程需要大量的计算资源,特别是在处理多个语音处理任务时。在本文中,我们探索了基于适配器的微调在开发能够有效处理多个口语处理任务的统一模型方面的潜力。我们研究的任务是自动语音识别、音素识别、意图分类、槽位填充和口语情感识别。我们通过在SUPERB基准测试上进行的一系列实验验证了我们的方法,结果表明,基于适配器的微调使单个编码器-解码器模型能够执行多个语音处理任务,五个目标任务的平均改进为18.4%,同时在参数更新方面保持高效。

[NLP-77] Relation Extraction with Fine-Tuned Large Language Models in Retrieval Augmented Generation Frameworks
[NLP-77] 检索增强生成框架中使用微调大语言模型的关系提取

链接: https://arxiv.org/abs/2406.14745
作者: Sefika Efeoglu,Adrian Paschke
关键词: Knowledge Graphs, Information Extraction, converting unstructured data, formats like Knowledge, crucial for converting
中文关键词: 知识图、信息提取、转换非结构化数据、知识等格式,对于转换至关重要
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: preprint

Abstract:Information Extraction (IE) is crucial for converting unstructured data into structured formats like Knowledge Graphs (KGs). A key task within IE is Relation Extraction (RE), which identifies relationships between entities in text. Various RE methods exist, including supervised, unsupervised, weakly supervised, and rule-based approaches. Recent studies leveraging pre-trained language models (PLMs) have shown significant success in this area. In the current era dominated by Large Language Models (LLMs), fine-tuning these models can overcome limitations associated with zero-shot LLM prompting-based RE methods, especially regarding domain adaptation challenges and identifying implicit relations between entities in sentences. These implicit relations, which cannot be easily extracted from a sentence’s dependency tree, require logical inference for accurate identification. This work explores the performance of fine-tuned LLMs and their integration into the Retrieval Augmented-based (RAG) RE approach to address the challenges of identifying implicit relations at the sentence level, particularly when LLMs act as generators within the RAG framework. Empirical evaluations on the TACRED, TACRED-Revisited (TACREV), Re-TACRED, and SemEVAL datasets show significant performance improvements with fine-tuned LLMs, including Llama2-7B, Mistral-7B, and T5 (Large). Notably, our approach achieves substantial gains on SemEVAL, where implicit relations are common, surpassing previous results on this dataset. Additionally, our method outperforms previous works on TACRED, TACREV, and Re-TACRED, demonstrating exceptional performance across diverse evaluation scenarios.
摘要:信息抽取(IE)对于将非结构化数据转换为知识图谱(KG)等结构化格式至关重要。IE中的一项关键任务是关系抽取(RE),它识别文本中实体之间的关系。现有多种RE方法,包括有监督、无监督、弱监督和基于规则的方法。最近利用预训练语言模型(PLM)的研究在这一领域取得了显著成功。在当前以大语言模型(LLM)为主导的时代,对这些模型进行微调可以克服基于零样本LLM提示的RE方法的局限性,特别是在领域适应挑战和识别句子中实体之间的隐含关系方面。这些隐含关系不容易从句子的依存关系树中提取出来,需要逻辑推理才能准确识别。这项工作探索了微调的LLM的性能及其与基于检索增强生成(RAG)的RE方法的集成,以解决在句子层面识别隐含关系的挑战,特别是当LLM在RAG框架中充当生成器时。在TACRED、TACRED-Revisited(TACREV)、Re-TACRED和SemEVAL数据集上的实证评估表明,使用微调的LLM(包括Llama2-7B、Mistral-7B和T5(Large)),性能有显著改善。值得注意的是,我们的方法在隐含关系很常见的SemEVAL上取得了实质性的提升,超过了之前在这个数据集上的结果。此外,我们的方法在TACRED、TACREV和Re-TACRED上优于以前的工作,在不同的评估场景中表现出了出色的性能。

[NLP-78] Learning to Retrieve Iteratively for In-Context Learning
[NLP-78] 学习为上下文学习进行迭代检索

链接: https://arxiv.org/abs/2406.14739
作者: Yunmo Chen,Tongfei Chen,Harsh Jhamtani,Patrick Xia,Richard Shin,Jason Eisner,Benjamin Van Durme
关键词: introduce iterative retrieval, make iterative decisions, framework that empowers, decisions through policy, policy optimization
中文关键词: 引入迭代检索、做出迭代决策、赋能的框架、通过策略做出决策、策略优化
类目: Computation and Language (cs.CL)
备注:

Abstract:We introduce iterative retrieval, a novel framework that empowers retrievers to make iterative decisions through policy optimization. Finding an optimal portfolio of retrieved items is a combinatorial optimization problem, generally considered NP-hard. This approach provides a learned approximation to such a solution, meeting specific task requirements under a given family of large language models (LLMs). We propose a training procedure based on reinforcement learning, incorporating feedback from LLMs. We instantiate an iterative retriever for composing in-context learning (ICL) exemplars and apply it to various semantic parsing tasks that demand synthesized programs as outputs. By adding only 4M additional parameters for state encoding, we convert an off-the-shelf dense retriever into a stateful iterative retriever, outperforming previous methods in selecting ICL exemplars on semantic parsing datasets such as CalFlow, TreeDST, and MTOP. Additionally, the trained iterative retriever generalizes across different inference LLMs beyond the one used during training.
摘要:我们提出了迭代检索,这是一个新的框架,使检索器能够通过策略优化做出迭代决策。寻找最优的检索项目组合是一个组合优化问题,通常被认为是NP难的。这种方法为此类问题的解提供了一种学习得到的近似,在给定的大型语言模型(LLM)家族下满足特定任务要求。我们提出了一种基于强化学习的训练过程,融合了来自LLM的反馈。我们实例化了一个用于组合上下文学习(ICL)示例的迭代检索器,并将其应用于需要以合成程序作为输出的各种语义解析任务。通过仅添加4M个用于状态编码的额外参数,我们将现成的稠密检索器转换为有状态的迭代检索器,在CalFlow、TreeDST和MTOP等语义解析数据集上选择ICL示例方面优于以往的方法。此外,训练得到的迭代检索器可以泛化到训练时所用LLM之外的其他推理LLM。
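
The idea of choosing exemplars one at a time, with each choice conditioned on what has already been picked, can be illustrated with a small sketch. This is not the authors' method: the paper learns the selection policy with reinforcement learning from LLM feedback, whereas the sketch below swaps in a hand-written greedy rule (relevance minus a redundancy penalty, similar in spirit to maximal marginal relevance). The function name `iterative_retrieve`, the `redundancy_penalty` parameter, and the toy vectors are all assumptions made for illustration.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def iterative_retrieve(
    query: Sequence[float],
    candidates: List[Sequence[float]],
    k: int,
    redundancy_penalty: float = 0.5,  # hypothetical knob, not from the paper
) -> List[int]:
    """Greedily pick up to k candidate indices, re-scoring after every pick.

    The "state" is simply the list of indices selected so far; a learned
    policy would replace the hand-written score below.
    """
    selected: List[int] = []
    for _ in range(min(k, len(candidates))):
        best_idx, best_score = -1, float("-inf")
        for i, cand in enumerate(candidates):
            if i in selected:
                continue
            relevance = dot(query, cand)
            # Highest similarity to anything already selected (0 if none yet).
            overlap = max((dot(cand, candidates[j]) for j in selected), default=0.0)
            score = relevance - redundancy_penalty * overlap
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
    return selected


# Toy data (hypothetical): candidates 0 and 1 are near-duplicates.
query = [0.8, 0.6]
candidates = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(iterative_retrieve(query, candidates, k=2))
print(iterative_retrieve(query, candidates, k=2, redundancy_penalty=0.0))
```

With the penalty switched off the function reduces to ordinary one-shot top-k ranking, which is the stateless baseline that an iterative retriever is meant to improve on.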

[NLP-79] Dissecting the Ullman Variations with a SCALPEL: Why do LLMs fail at Trivial Alterations to the False Belief Task?
[NLP-79] 用SCALPEL(手术刀)剖析乌尔曼变体:为什么LLM在错误信念任务的微小改动上失败?

链接: https://arxiv.org/abs/2406.14737
作者: Zhiqiang Pi,Annapurna Vadaparty,Benjamin K. Bergen,Cameron R. Jones
关键词: Large Language Models, Language Models, Theory of Mind, Large Language, Recent empirical results
中文关键词: 大型语言模型、语言模型、心理理论、大型语言、最近的实证结果
类目: Computation and Language (cs.CL)
备注:

Abstract:Recent empirical results have sparked a debate about whether or not Large Language Models (LLMs) are capable of Theory of Mind (ToM). While some have found LLMs to be successful on ToM evaluations such as the False Belief task (Kosinski, 2023), others have argued that LLMs solve these tasks by exploiting spurious correlations – not representing beliefs – since they fail on trivial alterations to these tasks (Ullman, 2023). In this paper, we introduce SCALPEL: a technique to generate targeted modifications for False Belief tasks to test different specific hypotheses about why LLMs fail. We find that modifications which make explicit common inferences – such as that looking at a transparent object implies recognizing its contents – preserve LLMs’ performance. This suggests that LLMs’ failures on modified ToM tasks could result from a lack of more general commonsense reasoning, rather than a failure to represent mental states. We argue that SCALPEL could be helpful for explaining LLM successes and failures in other cases.
摘要:最近的实证研究结果引发了一场关于大型语言模型(LLM)是否具有心理理论(ToM)能力的争论。虽然有些人发现LLM在错误信念任务等ToM评估中是成功的(Kosinski,2023),但另一些人认为LLM是通过利用虚假关联而不是表征信念来解决这些任务的,因为它们在这些任务的微小改动上失败(Ullman,2023)。在本文中,我们引入了SCALPEL:一种为错误信念任务生成有针对性的修改的技术,以检验关于LLM失败原因的不同具体假设。我们发现,将常见推论明确化的修改(例如,查看透明物体意味着识别其内容)能够保持LLM的性能。这表明,LLM在修改后的ToM任务上的失败可能是因为缺乏更一般的常识推理,而不是未能表征心理状态。我们认为,SCALPEL可以帮助解释LLM在其他情况下的成功和失败。

[NLP-80] TTQA-RS: A break-down prompting approach for Multi-hop Table-Text Question Answering with Reasoning and Summarization
[NLP-80] TTQA-RS:一种带推理和摘要的多跳表格-文本问答分解提示方法

链接: https://arxiv.org/abs/2406.14732
作者: Jayetri Bardhan,Bushi Xiao,Daisy Zhe Wang
关键词: Question answering, Table-Text Question Answering, Multi-hop Table-Text Question, gained much popularity, table-text
中文关键词: 问答、表格文本问题解答、多跳表格文本问题、广受欢迎、表格文本
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:

Abstract:Question answering (QA) over tables and text has gained much popularity over the years. Multi-hop table-text QA requires multiple hops between the table and text, making it a challenging QA task. Although several works have attempted to solve the table-text QA task, most involve training the models and requiring labeled data. In this paper, we have proposed a model - TTQA-RS: A break-down prompting approach for Multi-hop Table-Text Question Answering with Reasoning and Summarization. Our model uses augmented knowledge including table-text summary with decomposed sub-question with answer for a reasoning-based table-text QA. Using open-source language models our model outperformed all existing prompting methods for table-text QA tasks on existing table-text QA datasets like HybridQA and OTT-QA’s development set. Our results are comparable with the training-based state-of-the-art models, demonstrating the potential of prompt-based approaches using open-source LLMs. Additionally, by using GPT-4 with LLaMA3-70B, our model achieved state-of-the-art performance for prompting-based methods on multi-hop table-text QA.
摘要:近年来,基于表格和文本的问答(QA)越来越受到人们的欢迎。多跳表格-文本QA需要在表格和文本之间进行多跳,这使得它成为一项具有挑战性的QA任务。虽然有几项工作试图解决表格-文本QA任务,但大多数需要训练模型并依赖标注数据。在本文中,我们提出了一个模型TTQA-RS:一种带推理和摘要的多跳表格-文本问答分解提示方法。对于基于推理的表格-文本问答,我们的模型使用了包括表格-文本摘要、分解后的子问题及其答案在内的增强知识。使用开源语言模型,我们的模型在HybridQA和OTT-QA开发集等现有表格-文本QA数据集上的性能优于所有现有的提示方法。我们的结果与基于训练的最先进模型相当,表明了使用开源LLM的基于提示的方法的潜力。此外,通过使用GPT-4和LLaMA3-70B,我们的模型在基于提示的多跳表格-文本问答方法中取得了最先进的性能。

[NLP-81] 1+1>2: Can Large Language Models Serve as Cross-Lingual Knowledge Aggregators?
[NLP-81] 1+1>2:大型语言模型可以充当跨语言知识聚合器吗?

链接: https://arxiv.org/abs/2406.14721
作者: Yue Huang,Chenrui Fan,Yuan Li,Siyuan Wu,Tianyi Zhou,Xiangliang Zhang,Lichao Sun
关键词: Large Language Models, garnered significant attention, significant attention due, Large Language, Language Models
中文关键词: 大型语言模型,引起了极大的关注,引起了极大的关注,大型语言,语言模型
类目: Computation and Language (cs.CL)
备注:

Abstract:Large Language Models (LLMs) have garnered significant attention due to their remarkable ability to process information across various languages. Despite their capabilities, they exhibit inconsistencies in handling identical queries in different languages, presenting challenges for further advancement. This paper introduces a method to enhance the multilingual performance of LLMs by aggregating knowledge from diverse languages. This approach incorporates a low-resource knowledge detector specific to a language, a language selection process, and mechanisms for answer replacement and integration. Our experiments demonstrate notable performance improvements, particularly in reducing language performance disparity. An ablation study confirms that each component of our method significantly contributes to these enhancements. This research highlights the inherent potential of LLMs to harmonize multilingual capabilities and offers valuable insights for further exploration.
摘要:大型语言模型(LLM)因其处理各种语言信息的出色能力而受到了广泛关注。尽管能力出众,它们在处理不同语言的相同查询时表现出不一致,这给进一步发展带来了挑战。本文介绍了一种通过聚合不同语言的知识来提高LLM多语言性能的方法。这种方法结合了特定于语言的低资源知识检测器、语言选择过程以及答案替换和集成机制。我们的实验证明了显著的性能改进,特别是在减少语言间性能差异方面。一项消融研究证实,我们方法的每个组成部分都对这些改进做出了显著贡献。这项研究强调了LLM协调多语言能力的固有潜力,并为进一步探索提供了宝贵的见解。

[NLP-82] MultiAgent Collaboration Attack: Investigating Adversarial Attacks in Large Language Model Collaborations via Debate
[NLP-82] 多Agent协作攻击:通过辩论调查大型语言模型协作中的对抗性攻击

链接: https://arxiv.org/abs/2406.14711
作者: Alfonso Amayuelas,Xianjun Yang,Antonis Antoniades,Wenyue Hua,Liangming Pan,William Wang
关键词: shown exceptional results, Large Language Models, Large Language, working individually, shown exceptional
中文关键词: 显示出色的结果,大型语言模型,大型语言,单独工作,显示出色
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:

Abstract:Large Language Models (LLMs) have shown exceptional results on current benchmarks when working individually. The advancement in their capabilities, along with a reduction in parameter size and inference times, has facilitated the use of these models as agents, enabling interactions among multiple models to execute complex tasks. Such collaborations offer several advantages, including the use of specialized models (e.g. coding), improved confidence through multiple computations, and enhanced divergent thinking, leading to more diverse outputs. Thus, the collaborative use of language models is expected to grow significantly in the coming years. In this work, we evaluate the behavior of a network of models collaborating through debate under the influence of an adversary. We introduce pertinent metrics to assess the adversary’s effectiveness, focusing on system accuracy and model agreement. Our findings highlight the importance of a model’s persuasive ability in influencing others. Additionally, we explore inference-time methods to generate more compelling arguments and evaluate the potential of prompt-based mitigation as a defensive strategy.
摘要:大型语言模型(LLM)在单独工作时,在当前的基准测试中显示出了出色的结果。它们能力的进步,加上参数规模和推理时间的减少,促进了这些模型作为智能体的使用,使多个模型之间能够相互交互,以执行复杂的任务。这种协作提供了几个优势,包括使用专门的模型(例如编码)、通过多次计算提高置信度以及增强发散思维,从而产生更多样化的输出。因此,语言模型的协作使用预计在未来几年将显著增长。在这项工作中,我们评估了一个模型网络在对手的影响下通过辩论进行协作的行为。我们引入了相关的指标来评估对手的有效性,重点是系统的准确性和模型的一致性。我们的发现突显了模型的说服能力在影响其他模型方面的重要性。此外,我们探索推理时方法来生成更令人信服的论点,并评估基于提示的缓解措施作为一种防御策略的潜力。

[NLP-83] Factual Dialogue Summarization via Learning from Large Language Models
[NLP-83] 通过从大型语言模型中学习进行事实对话总结

链接: https://arxiv.org/abs/2406.14709
作者: Rongxin Zhu,Jey Han Lau,Jianzhong Qi
关键词: important quality, Factual consistency, dialogue summarization, summarization, summarization models
中文关键词: 重要质量、事实一致性、对话总结、总结、总结模型
类目: Computation and Language (cs.CL)
备注:

Abstract:Factual consistency is an important quality in dialogue summarization. Large language model (LLM)-based automatic text summarization models generate more factually consistent summaries compared to those by smaller pretrained language models, but they face deployment challenges in real-world applications due to privacy or resource constraints. In this paper, we investigate the use of symbolic knowledge distillation to improve the factual consistency of smaller pretrained models for dialogue summarization. We employ zero-shot learning to extract symbolic knowledge from LLMs, generating both factually consistent (positive) and inconsistent (negative) summaries. We then apply two contrastive learning objectives on these summaries to enhance smaller summarization models. Experiments with BART, PEGASUS, and Flan-T5 indicate that our approach surpasses strong baselines that rely on complex data augmentation strategies. Our approach achieves better factual consistency while maintaining coherence, fluency, and relevance, as confirmed by various automatic evaluation metrics. We also provide access to the data and code to facilitate future research.
摘要:事实一致性是对话摘要的一项重要品质。与较小的预训练语言模型相比,基于大型语言模型(LLM)的自动文本摘要模型可以生成事实上更一致的摘要,但由于隐私或资源的限制,它们在实际应用中面临部署挑战。在本文中,我们研究了使用符号知识蒸馏来提高用于对话摘要的较小预训练模型的事实一致性。我们使用零样本学习从LLM中提取符号知识,生成事实一致(正例)和不一致(负例)的摘要。然后,我们将两个对比学习目标应用于这些摘要,以增强较小的摘要模型。用BART、PEGASUS和Flan-T5进行的实验表明,我们的方法超过了依赖复杂数据增强策略的强基线。我们的方法在保持连贯性、流畅性和相关性的同时实现了更好的事实一致性,这一点得到了各种自动评估指标的证实。我们还提供对数据和代码的访问,以促进未来的研究。

[NLP-84] Do LLMs Have Distinct and Consistent Personality? TRAIT: Personality Testset designed for LLMs with Psychometrics
[NLP-84] LLM是否具有独特且一致的人格?TRAIT:基于心理测量学、为LLM设计的人格测试集

链接: https://arxiv.org/abs/2406.14703
作者: Seungbeen Lee,Seungwon Lim,Seungju Han,Giyeong Oh,Hyungjoo Chae,Jiwan Chung,Minju Kim,Beong-woo Kwak,Yeonsoo Lee,Dongha Lee,Jinyoung Yeo,Youngjae Yu
关键词: Large Language Models, Language Models, Large Language, extended to Large, observable behavior
中文关键词: 大型语言模型,语言模型,大型语言,扩展到大型可观察行为
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Preprint; Under review

Abstract:The idea of personality in descriptive psychology, traditionally defined through observable behavior, has now been extended to Large Language Models (LLMs) to better understand their behavior. This raises a question: do LLMs exhibit distinct and consistent personality traits, similar to humans? Existing self-assessment personality tests, while applicable, lack the necessary validity and reliability for precise personality measurements. To address this, we introduce TRAIT, a new tool consisting of 8K multi-choice questions designed to assess the personality of LLMs with validity and reliability. TRAIT is built on the psychometrically validated human questionnaire, Big Five Inventory (BFI) and Short Dark Triad (SD-3), enhanced with the ATOMIC10X knowledge graph for testing personality in a variety of real scenarios. TRAIT overcomes the reliability and validity issues when measuring personality of LLM with self-assessment, showing the highest scores across three metrics: refusal rate, prompt sensitivity, and option order sensitivity. It reveals notable insights into personality of LLM: 1) LLMs exhibit distinct and consistent personality, which is highly influenced by their training data (i.e., data used for alignment tuning), and 2) current prompting techniques have limited effectiveness in eliciting certain traits, such as high psychopathy or low conscientiousness, suggesting the need for further research in this direction.
摘要:描述心理学中的人格概念,传统上是通过可观察的行为来定义的,现在已经扩展到大型语言模型(LLM),以更好地理解它们的行为。这就提出了一个问题:LLM是否表现出与人类相似的独特且一致的人格特质?现有的自我评估人格测试虽然适用,但缺乏精确测量人格所必需的效度和信度。为了解决这一问题,我们引入了TRAIT,这是一个由8K道多项选择题组成的新工具,旨在以有效且可靠的方式评估LLM的人格。TRAIT建立在经过心理测量学验证的人类问卷,即大五人格量表(BFI)和简版黑暗三联征(SD-3)的基础上,并通过ATOMIC10X知识图谱加以增强,以便在各种真实场景中测试人格。TRAIT克服了用自我评估测量LLM人格时的信度和效度问题,在拒绝率、提示敏感度和选项顺序敏感度三个指标上表现出最高分。它揭示了关于LLM人格的重要见解:1)LLM表现出独特且一致的人格,这受其训练数据(即用于对齐调整的数据)的高度影响;2)目前的提示技术在诱发某些特质(如高精神病态或低尽责性)方面的有效性有限,这表明在这方面需要进一步的研究。

[NLP-85] Speech Prefix-Tuning with RNNT Loss for Improving LLM Predictions
[NLP-85] 利用RNNT损失进行语音前缀微调以改进LLM预测

链接: https://arxiv.org/abs/2406.14701
作者: Murali Karthick Baskar,Andrew Rosenberg,Bhuvana Ramabhadran,Neeraj Gaur,Zhong Meng
关键词: focus on addressing, addressing the constraints, constraints faced, ASR, LLMs
中文关键词: 专注于解决、解决限制因素、面临的限制因素、ASR、LLM
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

Abstract:In this paper, we focus on addressing the constraints faced when applying LLMs to ASR. Recent works utilize prefixLM-type models, which directly apply speech as a prefix to LLMs for ASR. We have found that optimizing speech prefixes leads to better ASR performance and propose applying RNNT loss to perform speech prefix-tuning. This is a simple approach and does not increase the model complexity or alter the inference pipeline. We also propose language-based soft prompting to further improve with frozen LLMs. Empirical analysis on a realtime test set covering 10 Indic languages demonstrates that our proposed speech prefix-tuning yields improvements with both frozen and fine-tuned LLMs. Our recognition results, averaged over the 10 Indic languages, show that the proposed prefix-tuning with RNNT loss results in a 12% relative improvement in WER over the baseline with a fine-tuned LLM. Our proposed approaches with the frozen LLM lead to a 31% relative improvement over basic soft-prompting prefixLM.
摘要:在本文中,我们重点解决将LLM应用于ASR时面临的限制。最近的工作利用了prefixLM类型的模型,该类模型直接将语音作为前缀输入LLM以完成ASR。我们发现优化语音前缀可以带来更好的ASR性能,并建议应用RNNT损失来执行语音前缀微调。这是一种简单的方法,不会增加模型复杂性或改变推理流程。我们还提出基于语言的软提示,以便在冻结LLM的情况下进一步改进。对10种印度语言的实时测试集的实证分析表明,我们提出的语音前缀微调在冻结和微调的LLM上都能带来改进。我们在10种印度语言上的平均识别结果表明,所提出的带有RNNT损失的前缀微调相对于使用微调LLM的基线,WER相对改善了12%。我们提出的使用冻结LLM的方法比基本的软提示prefixLM相对改善了31%。
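
For readers less familiar with the "relative improvement" figures quoted in this abstract, the arithmetic can be sketched as below. The WER values are hypothetical placeholders, chosen only so that the result comes out at 12%; they are not numbers reported in the paper, and `relative_wer_improvement` is an illustrative helper name.

```python
def relative_wer_improvement(baseline_wer: float, new_wer: float) -> float:
    """Return the WER reduction as a fraction of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical values, NOT taken from the paper.
baseline = 25.0  # WER (%) of the baseline system
tuned = 22.0     # WER (%) after the change being evaluated
print(f"{relative_wer_improvement(baseline, tuned):.0%}")
```

A relative figure is always read against the baseline: the same 3-point absolute drop would be a smaller relative improvement on a weaker baseline and a larger one on a stronger baseline.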

[NLP-86] Depth F_1: Improving Evaluation of Cross-Domain Text Classification by Measuring Semantic Generalizability
[NLP-86] 深度F_1:通过测量语义可概括性改进跨领域文本分类的评估

链接: https://arxiv.org/abs/2406.14695
作者: Parker Seegmiller,Joseph Gatto,Sarah Masud Preum
关键词: cross-domain text classification, cross-domain text, text classification, source domain, obtain domain-invariant performance
中文关键词: 跨域文本分类,跨域文本,文本分类,源域,获得域不变性能
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

Abstract:Recent evaluations of cross-domain text classification models aim to measure the ability of a model to obtain domain-invariant performance in a target domain given labeled samples in a source domain. The primary strategy for this evaluation relies on assumed differences between source domain samples and target domain samples in benchmark datasets. This evaluation strategy fails to account for the similarity between source and target domains, and may mask when models fail to transfer learning to specific target samples which are highly dissimilar from the source domain. We introduce Depth F_1, a novel cross-domain text classification performance metric. Designed to be complementary to existing classification metrics such as F_1, Depth F_1 measures how well a model performs on target samples which are dissimilar from the source domain. We motivate this metric using standard cross-domain text classification datasets and benchmark several recent cross-domain text classification models, with the goal of enabling in-depth evaluation of the semantic generalizability of cross-domain text classification models.
摘要:最近对跨域文本分类模型的评估旨在衡量模型在源域中给定标记样本的情况下在目标域中获得域不变性能的能力。这一评估的主要战略依赖于基准数据集中源域样本和目标域样本之间的假设差异。这种评估策略没有考虑到源域和目标域之间的相似性,并且可能会掩盖模型无法将学习迁移到与源域高度不相似的特定目标样本的情况。我们引入了深度F1,一种新的跨域文本分类性能度量。深度F_1是对F_1等现有分类指标的补充,它衡量的是模型对与源域不同的目标样本的处理效果。我们使用标准的跨域文本分类数据集来激励这一度量,并对最近的几个跨域文本分类模型进行基准测试,目的是能够深入评估跨域文本分类模型的语义泛化能力。

[NLP-87] A Contrastive Learning Approach to Mitigate Bias in Speech Models
[NLP-87] 减轻语音模型中偏见的对比学习方法

链接: https://arxiv.org/abs/2406.14686
作者: Alkis Koudounas,Flavio Giobergia,Eliana Pastor,Elena Baralis
关键词: raising concerns, concerns about fair, fair treatment, mitigate speech model, speech model bias
中文关键词: 提出担忧,对公平、公平待遇的担忧,减轻言语模型,言语模型偏见
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
备注: Accepted at Interspeech 2024

Abstract:Speech models may be affected by performance imbalance in different population subgroups, raising concerns about fair treatment across these groups. Prior attempts to mitigate unfairness either focus on user-defined subgroups, potentially overlooking other affected subgroups, or do not explicitly improve the internal representation at the subgroup level. This paper proposes the first adoption of contrastive learning to mitigate speech model bias in underperforming subgroups. We employ a three-level learning technique that guides the model in focusing on different scopes for the contrastive loss, i.e., task, subgroup, and the errors within subgroups. The experiments on two spoken language understanding datasets and two languages demonstrate that our approach improves internal subgroup representations, thus reducing model bias and enhancing performance.
摘要:言语模型可能会受到不同人群亚组表现不平衡的影响,从而引发人们对这些群体公平待遇的担忧。之前减轻不公平性的尝试要么关注用户定义的子组,可能会忽视其他受影响的子组,要么没有明确改善子组级别的内部代表性。本文提出首次采用对比学习来减轻表现不佳的亚组中的语音模型偏差。我们采用三层学习技术,指导模型专注于对比损失的不同范围,即任务、子组以及子组内的错误。对两个口语理解数据集和两种语言的实验表明,我们的方法改进了内部子组表示,从而减少了模型偏差并提高了性能。

[NLP-88] TAGLAS: An atlas of text-attributed graph datasets in the era of large graph and language models
[NLP-88] TAGLAS:大型图和语言模型时代的文本属性图数据集图集

链接: https://arxiv.org/abs/2406.14683
作者: Jiarui Feng,Hao Liu,Lecheng Kong,Yixin Chen,Muhan Zhang
关键词: atlas of text-attributed, present TAGLAS, datasets, TAG datasets, TAGLAS
中文关键词: 文本属性地图集、当前TAGLAS、数据集、TAG数据集、TAGLAS
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: Preprint

Abstract:In this report, we present TAGLAS, an atlas of text-attributed graph (TAG) datasets and benchmarks. TAGs are graphs with node and edge features represented in text, which have recently gained wide applicability in training graph-language or graph foundation models. In TAGLAS, we collect and integrate more than 23 TAG datasets with domains ranging from citation graphs to molecule graphs and tasks from node classification to graph question-answering. Unlike previous graph datasets and benchmarks, all datasets in TAGLAS have a unified node and edge text feature format, which allows a graph model to be simultaneously trained and evaluated on multiple datasets from various domains. Further, we provide a standardized, efficient, and simplified way to load all datasets and tasks. We also provide useful utils like text-to-embedding conversion, and graph-to-text conversion, which can facilitate different evaluation scenarios. Finally, we also provide standard and easy-to-use evaluation utils. The project is open-sourced at this https URL and is still under construction. Please expect more datasets/features in the future.
摘要:在这份报告中,我们介绍了TAGLAS,一个文本属性图(TAG)数据集和基准的图集。TAG是以文本表示节点和边特征的图,最近在训练图-语言模型或图基础模型方面获得了广泛的适用性。在TAGLAS中,我们收集和整合了超过23个TAG数据集,涉及的领域从引文图到分子图,任务从节点分类到图问答。与以往的图数据集和基准不同,TAGLAS中的所有数据集都具有统一的节点和边文本特征格式,允许在来自不同领域的多个数据集上同时训练和评估一个图模型。此外,我们还提供了一种标准化、高效和简化的方法来加载所有数据集和任务。我们还提供了有用的实用工具,如文本到嵌入的转换和图到文本的转换,这可以支持不同的评估场景。最后,我们还提供了标准的、易于使用的评估工具。该项目在此https URL开源,目前仍在建设中。未来还会有更多的数据集和功能。

[NLP-89] Dravidian language family through Universal Dependencies lens
[NLP-89] 从通用依存(Universal Dependencies)视角看达罗毗荼语系

链接: https://arxiv.org/abs/2406.14680
作者: Taraka Rama,Sowmya Vajjala
关键词: facilitate multilingual NLP, Universal Dependencies, cross-linguistically consistent dependency, consistent dependency annotation, multilingual NLP
中文关键词: 促进多语言NLP、通用从属关系、跨语言一致的依赖关系、一致的依赖关系注释、多语言NLP
类目: Computation and Language (cs.CL)
备注: unpublished report from 2021

Abstract:The Universal Dependencies (UD) project aims to create a cross-linguistically consistent dependency annotation for multiple languages, to facilitate multilingual NLP. It currently supports 114 languages. Dravidian languages are spoken by over 200 million people across the world, and yet there are only two languages from this family in UD. This paper examines some of the morphological and syntactic features of Dravidian languages and explores how they can be annotated in the UD framework.
摘要:通用依存(UD)项目旨在为多种语言创建跨语言一致的依存标注,以促进多语言NLP。目前支持114种语言。全世界有超过2亿人使用达罗毗荼语系语言,但UD中只有两种属于该语系的语言。本文考察了达罗毗荼语系语言的一些形态和句法特征,并探讨了如何在UD框架中对其进行标注。

[NLP-90] Bidirectional Transformer Representations of (Spanish) Ambiguous Words in Context: A New Lexical Resource and Empirical Analysis
[NLP-90] 上下文中(西班牙语)歧义词的双向Transformer表示:一种新的词汇资源和实证分析

链接: https://arxiv.org/abs/2406.14678
作者: Pamela D. Rivière(1),Anne L. Beatty-Martínez(1),Sean Trott(1 and 2) ((1) Department of Cognitive Science UC San Diego, (2) Computational Social Science UC San Diego)
关键词: Lexical ambiguity, large language models’, form distinct, context-dependent meanings, single wordform
中文关键词: 词汇歧义,大型语言模型,形成独特的、依赖上下文的含义,单一的词形
类目: Computation and Language (cs.CL)
备注: 16 pages, 12 figures, submitted to conference (EMNLP 2024)

Abstract:Lexical ambiguity – where a single wordform takes on distinct, context-dependent meanings – serves as a useful tool to compare across different large language models’ (LLMs’) ability to form distinct, contextualized representations of the same stimulus. Few studies have systematically compared LLMs’ contextualized word embeddings for languages beyond English. Here, we evaluate multiple bidirectional transformers’ (BERTs’) semantic representations of Spanish ambiguous nouns in context. We develop a novel dataset of minimal-pair sentences evoking the same or different sense for a target ambiguous noun. In a pre-registered study, we collect contextualized human relatedness judgments for each sentence pair. We find that various BERT-based LLMs’ contextualized semantic representations capture some variance in human judgments but fall short of the human benchmark, and for Spanish – unlike English – model scale is uncorrelated with performance. We also identify stereotyped trajectories of target noun disambiguation as a proportion of traversal through a given LLM family’s architecture, which we partially replicate in English. We contribute (1) a dataset of controlled, Spanish sentence stimuli with human relatedness norms, and (2) to our evolving understanding of the impact that LLM specification (architectures, training protocols) exerts on contextualized embeddings.
摘要:词汇歧义是一种有用的工具,可以用来比较不同大语言模型(LLMS)对同一刺激的不同语境表征能力。很少有研究系统地比较LLMS在英语以外的语言中的语境化单词嵌入。在这里,我们评估了多个双向转换器(BERTS)在上下文中对西班牙语歧义名词的语义表示。我们开发了一个新的数据集,其中最小对句子对目标歧义名词具有相同或不同的意义。在一项预先登记的研究中,我们收集了每对句子的语境化人类关联性判断。我们发现,各种基于BERT的LLMS的上下文语义表示捕捉到了人类判断中的一些差异,但没有达到人类的基准,并且对于西班牙语–与英语不同–模型规模与性能无关。我们还确定了目标名词歧义消除的刻板印象轨迹,作为对给定LLM家族架构的遍历的一部分,我们在英语中部分复制了这一点。我们贡献了(1)一个具有人类相关性规范的受控西班牙语句子刺激的数据集,以及(2)我们对LLM规范(体系结构、训练协议)对上下文嵌入的影响的演变理解。

[NLP-91] Insights into LLM Long-Context Failures: When Transformers Know but Don't Tell
[NLP-91] 洞察LLM长上下文失败:当Transformer知道却不说

链接: https://arxiv.org/abs/2406.14673
作者: Taiming Lu,Muhan Gao,Kuai Yu,Adam Byerly,Daniel Khashabi
关键词: Large Language Models, exhibit positional bias, Large Language, Language Models, exhibit positional
中文关键词: 大型语言模型,表现出位置偏见,大型语言,语言模型,表现出位置
类目: Computation and Language (cs.CL)
备注:

Abstract:Large Language Models (LLMs) exhibit positional bias, struggling to utilize information from the middle or end of long contexts. Our study explores LLMs’ long-context reasoning by probing their hidden representations. We find that while LLMs encode the position of target information, they often fail to leverage this in generating accurate responses. This reveals a disconnect between information retrieval and utilization, a “know but don’t tell” phenomenon. We further analyze the relationship between extraction time and final accuracy, offering insights into the underlying mechanics of transformer models.
摘要:大型语言模型(LLM)表现出位置偏差,难以利用来自长上下文中间或结尾的信息。我们的研究通过探索LLM隐藏的表示来探索LLM的长上下文推理。我们发现,虽然LLM对目标信息的位置进行编码,但它们通常无法利用这一点来生成准确的响应。这揭示了信息检索和利用之间的脱节,这是一种“知道但不说”的现象。我们进一步分析了提取时间和最终准确性之间的关系,从而深入了解Transformer模型的基本机制。

[NLP-92] Exploring Design Choices for Building Language-Specific LLMs
[NLP-92] 探索构建特定语言LLM的设计选择

链接: https://arxiv.org/abs/2406.14670
作者: Atula Tejaswi,Nilesh Gupta,Eunsol Choi
关键词: languages remain unsatisfactory, remain unsatisfactory, rapid progress, progress in large, vast majority
中文关键词: 语言仍然不令人满意,仍然不令人满意,快速进步,在很大程度上进步,绝大多数
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 15 pages, 6 figures, 11 tables

Abstract:Despite rapid progress in large language models (LLMs), their performance on a vast majority of languages remains unsatisfactory. In this paper, we study building language-specific LLMs by adapting monolingual and multilingual LLMs. We conduct systematic experiments on how design choices (base model selection, vocabulary extension, and continued fine-tuning) impact the adapted LLM, both in terms of efficiency (how many tokens are needed to encode the same amount of information) and end task performance. We find that (1) the initial performance before the adaptation is not always indicative of the final performance, (2) efficiency can easily be improved with simple vocabulary extension and continued fine-tuning in most LLMs we study, and (3) the optimal adaptation method is highly language-dependent, and the simplest approach works well across various experimental settings. Adapting English-centric models can yield better results than adapting multilingual models despite their worse initial performance on low-resource languages. Together, our work lays foundations on efficiently building language-specific LLMs by adapting existing LLMs.
摘要:尽管大型语言模型(LLM)进展迅速,但它们在绝大多数语言上的表现仍然不尽如人意。在本文中,我们研究通过适配单语言和多语言LLM来构建特定语言的LLM。我们就设计选择(基础模型选择、词表扩展和持续微调)如何影响适配后的LLM进行了系统的实验,考察效率(编码相同数量的信息需要多少词元)和最终任务性能两个方面。我们发现:(1)适配前的初始性能并不总能代表最终性能;(2)在我们研究的大多数LLM中,通过简单的词表扩展和持续微调可以很容易地提高效率;(3)最佳适配方法高度依赖于语言,而最简单的方法在各种实验设置下都表现良好。适配以英语为中心的模型可以比适配多语言模型产生更好的结果,尽管前者在低资源语言上的初始性能较差。总之,我们的工作为通过适配现有LLM来高效构建特定语言的LLM奠定了基础。

[NLP-93] Co-training for Low Resource Scientific Natural Language Inference
[NLP-93] 面向低资源科学自然语言推理的协同训练

链接: https://arxiv.org/abs/2406.14666
作者: Mobashir Sadat,Cornelia Caragea
关键词: Scientific Natural Language, Natural Language Inference, Scientific Natural, Language Inference, Natural Language
中文关键词: 科学自然语言,自然语言推理,科学自然,语言推理,自然语言
类目: Computation and Language (cs.CL)
备注: Accepted in ACL 2024 (main conference)

Abstract:Scientific Natural Language Inference (NLI) is the task of predicting the semantic relation between a pair of sentences extracted from research articles. The automatic annotation method based on distant supervision for the training set of SciNLI (Sadat and Caragea, 2022b), the first and most popular dataset for this task, results in label noise which inevitably degenerates the performance of classifiers. In this paper, we propose a novel co-training method that assigns weights based on the training dynamics of the classifiers to the distantly supervised labels, reflective of the manner they are used in the subsequent training epochs. That is, unlike the existing semi-supervised learning (SSL) approaches, we consider the historical behavior of the classifiers to evaluate the quality of the automatically annotated labels. Furthermore, by assigning importance weights instead of filtering out examples based on an arbitrary threshold on the predicted confidence, we maximize the usage of automatically labeled data, while ensuring that the noisy labels have a minimal impact on model training. The proposed method obtains an improvement of 1.5% in Macro F1 over the distant supervision baseline, and substantial improvements over several other strong SSL baselines. We make our code and data available on Github.
摘要:科学自然语言推理是对从科研论文中提取的句子之间的语义关系进行预测的任务。基于远程监督的SciNLI(Sadat和Caragea,2022b)训练集的自动标注方法,是该任务中第一个也是最受欢迎的数据集,它会导致标签噪声,从而不可避免地降低分类器的性能。在本文中,我们提出了一种新的联合训练方法,该方法根据分类器的训练动态为远程监督标签分配权重,反映了它们在后续训练时段中的使用方式。也就是说,与现有的半监督学习方法不同,我们考虑了分类器的历史行为来评估自动标注标签的质量。此外,通过分配重要性权重而不是基于预测置信度的任意阈值来过滤样本,我们最大化地使用自动标记的数据,同时确保噪声标记对模型训练的影响最小。与远距离监督基线相比,该方法在Macro F1中获得了1.5%的改进,与其他几个强的SSL基线相比也有显著的改善。我们在Github上提供我们的代码和数据。
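
The contrast drawn in this abstract, between filtering noisy examples with a hard confidence threshold and keeping every example with an importance weight, can be sketched as follows. This is only an illustration under stated assumptions: the paper derives its weights from the classifiers' training dynamics, which is not reproduced here, so the weights below are hypothetical numbers supplied by hand. The helper names `filtered_loss` and `weighted_loss` are likewise invented for this sketch.

```python
import math
from typing import List


def filtered_loss(p_label: List[float], confidence: List[float], threshold: float) -> float:
    """Mean negative log-likelihood over examples that pass a hard threshold.

    Examples whose confidence falls below the threshold are discarded entirely.
    """
    kept = [p for p, c in zip(p_label, confidence) if c >= threshold]
    if not kept:
        return 0.0
    return sum(-math.log(p) for p in kept) / len(kept)


def weighted_loss(p_label: List[float], weights: List[float]) -> float:
    """Weighted mean negative log-likelihood.

    Every example contributes, scaled by its (hypothetical) importance weight.
    """
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(-w * math.log(p) for p, w in zip(p_label, weights)) / total


# Hypothetical toy batch: model probability assigned to each (possibly noisy) label.
p_label = [0.9, 0.6, 0.2]
# Hypothetical per-example scores, used as confidences or as weights.
scores = [0.95, 0.55, 0.10]

print(filtered_loss(p_label, scores, threshold=0.9))  # only the first example survives
print(weighted_loss(p_label, scores))                 # all three examples contribute
```

With filtering, two of the three examples are thrown away; with weighting, they still shape the loss, but the likely-noisy ones count for less.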

[NLP-94] OpenDebateEvidence: A Massive-Scale Argument Mining and Summarization Dataset
[NLP-94] OpenDebateEvidence:大规模论点挖掘和摘要数据集

链接: https://arxiv.org/abs/2406.14657
作者: Allen Roush,Yusuf Shabazz,Arvind Balaji,Peter Zhang,Stefano Mezza,Markus Zhang,Sanjay Basu,Sriram Vishwanath,Mehdi Fatemi,Ravid Schwartz-Ziv
关键词: American Competitive Debate, Competitive Debate community, American Competitive, Competitive Debate, Debate community
中文关键词: 美国竞争辩论,竞争辩论社区,美国竞争,竞争辩论,辩论社区
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted for Publication to ARGMIN 2024 at ACL2024

Abstract:We introduce OpenDebateEvidence, a comprehensive dataset for argument mining and summarization sourced from the American Competitive Debate community. This dataset includes over 3.5 million documents with rich metadata, making it one of the most extensive collections of debate evidence. OpenDebateEvidence captures the complexity of arguments in high school and college debates, providing valuable resources for training and evaluation. Our extensive experiments demonstrate the efficacy of fine-tuning state-of-the-art large language models for argumentative abstractive summarization across various methods, models, and datasets. By providing this comprehensive resource, we aim to advance computational argumentation and support practical applications for debaters, educators, and researchers. OpenDebateEvidence is publicly available to support further research and innovation in computational argumentation. Access it here: this https URL

[NLP-95] Major Entity Identification: A Generalizable Alternative to Coreference Resolution

Link: https://arxiv.org/abs/2406.14654
Authors: Kawshik Manikantan (1), Shubham Toshniwal (2), Makarand Tapaswi (1), Vineet Gandhi (1) ((1) CVIT, IIIT Hyderabad, (2) NVIDIA)
Keywords: task broad application, Major Entity Identification, coreference resolution, broad application, major bottleneck
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 16 pages, 6 figures

Abstract:The limited generalization of coreference resolution (CR) models has been a major bottleneck in the task’s broad application. Prior work has identified annotation differences, especially for mention detection, as one of the main reasons for the generalization gap and proposed using additional annotated target domain data. Rather than relying on this additional annotation, we propose an alternative formulation of the CR task, Major Entity Identification (MEI), where we: (a) assume the target entities to be specified in the input, and (b) limit the task to only the frequent entities. Through extensive experiments, we demonstrate that MEI models generalize well across domains on multiple datasets with supervised models and LLM-based few-shot prompting. Additionally, the MEI task fits the classification framework, which enables the use of classification-based metrics that are more robust than the current CR metrics. Finally, MEI is also of practical use as it allows a user to search for all mentions of a particular entity or a group of entities of interest.

[NLP-96] Unveiling the Spectrum of Data Contamination in Language Models: A Survey from Detection to Remediation

Link: https://arxiv.org/abs/2406.14644
Authors: Chunyuan Deng, Yilun Zhao, Yuzhao Heng, Yitong Li, Jiannan Cao, Xiangru Tang, Arman Cohan
Keywords: large language models, garnered increased attention, internet-derived training corpora, extensive internet-derived training, language models
Categories: Computation and Language (cs.CL)
Comments: ACL 2024 Camera-Ready Version

Abstract:Data contamination has garnered increased attention in the era of large language models (LLMs) due to the reliance on extensive internet-derived training corpora. The issue of training corpus overlap with evaluation benchmarks–referred to as contamination–has been the focus of significant recent research. This body of work aims to identify contamination, understand its impacts, and explore mitigation strategies from diverse perspectives. However, comprehensive studies that provide a clear pathway from foundational concepts to advanced insights are lacking in this nascent field. Therefore, we present a comprehensive survey in the field of data contamination, laying out the key issues, methodologies, and findings to date, and highlighting areas in need of further research and development. In particular, we begin by examining the effects of data contamination across various stages and forms. We then provide a detailed analysis of current contamination detection methods, categorizing them to highlight their focus, assumptions, strengths, and limitations. We also discuss mitigation strategies, offering a clear guide for future research. This survey serves as a succinct overview of the most recent advancements in data contamination research, providing a straightforward guide for the benefit of future research endeavors.

[NLP-97] Holistic Evaluation for Interleaved Text-and-Image Generation

Link: https://arxiv.org/abs/2406.14643
Authors: Minqian Liu, Zhiyang Xu, Zihao Lin, Trevor Ashby, Joy Rimchala, Jiaxin Zhang, Lifu Huang
Keywords: intriguing research direction, arbitrary order, required to generate, Interleaved, interleaved generation
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Work in progress. 13 pages, 5 figures, 6 tables

Abstract:Interleaved text-and-image generation has been an intriguing research direction, where the models are required to generate both images and text pieces in an arbitrary order. Despite the emerging advancements in interleaved generation, the progress in its evaluation still significantly lags behind. Existing evaluation benchmarks do not support arbitrarily interleaved images and text for both inputs and outputs, and they only cover a limited number of domains and use cases. Also, current works predominantly use similarity-based metrics which fall short in assessing the quality in open-ended scenarios. To this end, we introduce InterleavedBench, the first benchmark carefully curated for the evaluation of interleaved text-and-image generation. InterleavedBench features a rich array of tasks to cover diverse real-world use cases. In addition, we present InterleavedEval, a strong reference-free metric powered by GPT-4o to deliver accurate and explainable evaluation. We carefully define five essential evaluation aspects for InterleavedEval, including text quality, perceptual quality, image coherence, text-image coherence, and helpfulness, to ensure a comprehensive and fine-grained assessment. Through extensive experiments and rigorous human evaluation, we show that our benchmark and metric can effectively evaluate the existing models with a strong correlation with human judgments surpassing previous reference-based metrics. We also provide substantial findings and insights to foster future research in interleaved generation and its evaluation.

[NLP-98] Can LLMs Learn by Teaching? A Preliminary Study

Link: https://arxiv.org/abs/2406.14629
Authors: Xuefei Ning, Zifu Wang, Shiyao Li, Zinan Lin, Peiran Yao, Tianyu Fu, Matthew B. Blaschko, Guohao Dai, Huazhong Yang, Yu Wang
Keywords: extensively studied methodology, knowledge distillation, extensively studied, studied methodology, Teaching
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Under review

Abstract:Teaching to improve student models (e.g., knowledge distillation) is an extensively studied methodology in LLMs. However, for humans, teaching not only improves students but also improves teachers. We ask: Can LLMs also learn by teaching (LbT)? If yes, we can potentially unlock the possibility of continuously advancing the models without solely relying on human-produced data or stronger models. In this paper, we provide a preliminary exploration of this ambitious agenda. We show that LbT ideas can be incorporated into existing LLM training/prompting pipelines and provide noticeable improvements. Specifically, we design three methods, each mimicking one of the three levels of LbT in humans: observing students’ feedback, learning from the feedback, and learning iteratively, with the goals of improving answer accuracy without training and improving models’ inherent capability with fine-tuning. The findings are encouraging. For example, similar to LbT in human, we see that: (1) LbT can induce weak-to-strong generalization: strong models can improve themselves by teaching other weak models; (2) Diversity in students might help: teaching multiple students could be better than teaching one student or the teacher itself. We hope that this early promise can inspire future research on LbT and more broadly adopting the advanced techniques in education to improve LLMs. The code is available at this https URL.

[NLP-99] Safety Verification of Wait-Only Non-Blocking Broadcast Protocols

Link: https://arxiv.org/abs/2403.18591
Authors: Lucie Guillou, Arnaud Sangnier, Nathalie Sznajder
Keywords: study networks, communicate synchronously, process, process can broadcast, finite protocol
Categories: Logic in Computer Science (cs.LO); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments: Long version of a paper accepted to PetriNets 2024

Abstract:We study networks of processes that all execute the same finite protocol and communicate synchronously in two different ways: a process can broadcast one message to all other processes or send it to at most one other process. In both cases, if no process can receive the message, it will still be sent. We establish a precise complexity class for two coverability problems with a parameterised number of processes: the state coverability problem and the configuration coverability problem. It is already known that these problems are Ackermann-hard (but decidable) in the general case. We show that when the protocol is Wait-Only, i.e., it has no state from which a process can send and receive messages, the complexity drops to P and PSPACE, respectively.
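
The communication model in the abstract can be made concrete with a small sketch of one non-blocking broadcast step and a Wait-Only check. The encoding is an assumption made for illustration and does not come from the paper: a configuration is a list of process states, `send` and `receive` are dictionaries keyed by `(state, message)`, each key has at most one target state, and every process that can receive a broadcast does so.

```python
def broadcast_step(config, sender, msg, send, receive):
    """Apply one non-blocking broadcast of `msg` by the process at index `sender`.

    The send always happens. Processes with a matching receive transition
    move to its target; all other processes stay where they are.
    """
    src = config[sender]
    if (src, msg) not in send:
        raise ValueError(f"state {src!r} cannot broadcast {msg!r}")
    successor = []
    for i, state in enumerate(config):
        if i == sender:
            successor.append(send[(src, msg)])
        else:
            successor.append(receive.get((state, msg), state))
    return successor

def is_wait_only(send, receive):
    """True if no state has both a sending and a receiving transition."""
    sending_states = {state for state, _ in send}
    receiving_states = {state for state, _ in receive}
    return sending_states.isdisjoint(receiving_states)

# Hypothetical protocol: state "a" broadcasts "m", state "w" waits for "m".
send = {("a", "m"): "b"}
receive = {("w", "m"): "r"}

print(broadcast_step(["a", "w", "x"], 0, "m", send, receive))
print(broadcast_step(["a", "x"], 0, "m", send, receive))  # no receiver, still sent
print(is_wait_only(send, receive))
```

The second call shows the non-blocking behaviour: the sender still moves even though no process is able to receive the message.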

Computer Vision

[CV-0] NAVSIM: Data-Driven Non-Reactive Autonomous Vehicle Simulation and Benchmarking

Link: https://arxiv.org/abs/2406.15349
Authors: Daniel Dauner, Marcel Hallgarten, Tianyu Li, Xinshuo Weng, Zhiyu Huang, Zetong Yang, Hongyang Li, Igor Gilitschenski, Boris Ivanovic, Marco Pavone, Andreas Geiger, Kashyap Chitta
Keywords: vision-based driving policies, Benchmarking vision-based driving, Benchmarking vision-based, vision-based driving, driving policies
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)

Abstract:Benchmarking vision-based driving policies is challenging. On one hand, open-loop evaluation with real data is easy, but these results do not reflect closed-loop performance. On the other, closed-loop evaluation is possible in simulation, but is hard to scale due to its significant computational demands. Further, the simulators available today exhibit a large domain gap to real data. This has resulted in an inability to draw clear conclusions from the rapidly growing body of research on end-to-end autonomous driving. In this paper, we present NAVSIM, a middle ground between these evaluation paradigms, where we use large datasets in combination with a non-reactive simulator to enable large-scale real-world benchmarking. Specifically, we gather simulation-based metrics, such as progress and time to collision, by unrolling bird’s eye view abstractions of the test scenes for a short simulation horizon. Our simulation is non-reactive, i.e., the evaluated policy and environment do not influence each other. As we demonstrate empirically, this decoupling allows open-loop metric computation while being better aligned with closed-loop evaluations than traditional displacement errors. NAVSIM enabled a new competition held at CVPR 2024, where 143 teams submitted 463 entries, resulting in several new insights. On a large set of challenging scenarios, we observe that simple methods with moderate compute requirements such as TransFuser can match recent large-scale end-to-end driving architectures such as UniAD. Our modular framework can potentially be extended with new datasets, data curation strategies, and metrics, and will be continually maintained to host future challenges. Our code is available at this https URL.

[CV-1] Image Conductor: Precision Control for Interactive Video Synthesis

Link: https://arxiv.org/abs/2406.15339
Authors: Yaowei Li, Xintao Wang, Zhaoyang Zhang, Zhouxia Wang, Ziyang Yuan, Liangbin Xie, Yuexian Zou, Ying Shan
Keywords: typically involving labor-intensive, labor-intensive real-world capturing, involving labor-intensive real-world, Filmmaking and animation, require sophisticated techniques
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments: Project webpage available at this https URL

Abstract:Filmmaking and animation production often require sophisticated techniques for coordinating camera transitions and object movements, typically involving labor-intensive real-world capturing. Despite advancements in generative AI for video creation, achieving precise control over motion for interactive video asset generation remains challenging. To this end, we propose Image Conductor, a method for precise control of camera transitions and object movements to generate video assets from a single image. A well-cultivated training strategy is proposed to separate camera motion from object motion using camera LoRA weights and object LoRA weights. To further address cinematographic variations from ill-posed trajectories, we introduce a camera-free guidance technique during inference, enhancing object movements while eliminating camera transitions. Additionally, we develop a trajectory-oriented video motion data curation pipeline for training. Quantitative and qualitative experiments demonstrate our method’s precision and fine-grained control in generating motion-controllable videos from images, advancing the practical application of interactive video synthesis. Project webpage available at this https URL

[CV-2] Keystroke Dynamics Against Academic Dishonesty in the Age of LLMs

Link: https://arxiv.org/abs/2406.15335
Authors: Debnath Kundu, Atharva Mehta, Rajesh Kumar, Naman Lal, Avinash Anand, Apoorv Singh, Rajiv Ratn Shah
Keywords: raises significant concerns, assignments raises significant, transition to online, online examinations, examinations and assignments
Categories: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
Comments: Accepted for publication at The IEEE International Joint Conference on Biometrics (IJCB2024), contains 9 pages, 3 figures, 3 tables

Abstract:The transition to online examinations and assignments raises significant concerns about academic integrity. Traditional plagiarism detection systems often struggle to identify instances of intelligent cheating, particularly when students utilize advanced generative AI tools to craft their responses. This study proposes a keystroke dynamics-based method to differentiate between bona fide and assisted writing within academic contexts. To facilitate this, a dataset was developed to capture the keystroke patterns of individuals engaged in writing tasks, both with and without the assistance of generative AI. The detector, trained using a modified TypeNet architecture, achieved accuracies ranging from 74.98% to 85.72% in condition-specific scenarios and from 52.24% to 80.54% in condition-agnostic scenarios. The findings highlight significant differences in keystroke dynamics between genuine and assisted writing. The outcomes of this study enhance our understanding of how users interact with generative AI and have implications for improving the reliability of digital educational platforms.

[CV-3] Multimodal Task Vectors Enable Many-Shot Multimodal In-Context Learning

Link: https://arxiv.org/abs/2406.15334
Authors: Brandon Huang, Chancharik Mitra, Assaf Arbelle, Leonid Karlinsky, Trevor Darrell, Roei Herzig
Keywords: interleaved Large Multimodal, Large Multimodal Models, interleaved Large, Large Multimodal, multimodal ICL setting
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:The recent success of interleaved Large Multimodal Models (LMMs) in few-shot learning suggests that in-context learning (ICL) with many examples can be promising for learning new tasks. However, this many-shot multimodal ICL setting has one crucial problem: it is fundamentally limited by the model’s context length set at pretraining. The problem is especially prominent in the multimodal domain, which processes both text and images, requiring additional tokens. This motivates the need for a multimodal method to compress many shots into fewer tokens without finetuning. In this work, we enable LMMs to perform multimodal, many-shot in-context learning by leveraging Multimodal Task Vectors (MTV)–compact implicit representations of in-context examples compressed in the model’s attention heads. Specifically, we first demonstrate the existence of such MTV in LMMs and then leverage these extracted MTV to enable many-shot in-context learning for various vision-and-language tasks. Our experiments suggest that MTV can scale in performance with the number of compressed shots and generalize to similar out-of-domain tasks without additional context length for inference.

[CV-4] GeoLRM: Geometry-Aware Large Reconstruction Model for High-Quality 3D Gaussian Generation

Link: https://arxiv.org/abs/2406.15333
Authors: Chubin Zhang, Hongliang Song, Yi Wei, Yu Chen, Jiwen Lu, Yansong Tang
Keywords: Geometry-Aware Large Reconstruction, predict high-quality assets, GPU memory, Geometry-Aware Large, Large Reconstruction Model
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: The code is available at this https URL

Abstract:In this work, we introduce the Geometry-Aware Large Reconstruction Model (GeoLRM), an approach which can predict high-quality assets with 512k Gaussians and 21 input images in only 11 GB GPU memory. Previous works neglect the inherent sparsity of 3D structure and do not utilize explicit geometric relationships between 3D and 2D images. This limits these methods to a low-resolution representation and makes it difficult to scale up to the dense views for better quality. GeoLRM tackles these issues by incorporating a novel 3D-aware transformer structure that directly processes 3D points and uses deformable cross-attention mechanisms to effectively integrate image features into 3D representations. We implement this solution through a two-stage pipeline: initially, a lightweight proposal network generates a sparse set of 3D anchor points from the posed image inputs; subsequently, a specialized reconstruction transformer refines the geometry and retrieves textural details. Extensive experimental results demonstrate that GeoLRM significantly outperforms existing models, especially for dense view inputs. We also demonstrate the practical applicability of our model with 3D generation tasks, showcasing its versatility and potential for broader adoption in real-world applications.

[CV-5] Masked Extended Attention for Zero-Shot Virtual Try-On In The Wild

Link: https://arxiv.org/abs/2406.15331
Authors: Nadav Orzech, Yotam Nitzan, Ulysse Mizrahi, Dov Danon, Amit H. Bermano
Keywords: highly active line, Virtual Try-On, line of research, increasing demand, highly active
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
Comments: Project page available at this https URL

Abstract:Virtual Try-On (VTON) is a highly active line of research, with increasing demand. It aims to replace a piece of garment in an image with one from another, while preserving person and garment characteristics as well as image fidelity. Current literature takes a supervised approach for the task, impairing generalization and imposing heavy computation. In this paper, we present a novel zero-shot training-free method for inpainting a clothing garment by reference. Our approach employs the prior of a diffusion model with no additional training, fully leveraging its native generalization capabilities. The method employs extended attention to transfer image information from reference to target images, overcoming two significant challenges. We first warp the reference garment over the target human using deep features, alleviating “texture sticking”. We then leverage the extended attention mechanism with careful masking, eliminating leakage of reference background and unwanted influence. Through a user study and qualitative and quantitative comparisons to state-of-the-art approaches, we demonstrate superior image quality and garment preservation on unseen clothing pieces or human figures.

[CV-6] An End-to-End Segmentation-Free Arabic Handwritten Recognition Model on KHATT

Link: https://arxiv.org/abs/2406.15329
Authors: Sondos Aabed, Ahmad Khairaldin
Keywords: Connectionist Temporal Classification, Long-Short Term Memory, Bidirectional Long-Short Term, alongside Bidirectional Long-Short, deep learning model
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:An end-to-end, segmentation-free deep learning model trained from scratch is proposed, leveraging a DCNN for feature extraction alongside Bidirectional Long-Short Term Memory (BLSTM) for sequence recognition and a Connectionist Temporal Classification (CTC) loss function, on the KHATT database. Training yields a recognition rate of 84% on the test dataset at the character level and 71% at the word level, establishing an image-based sequence recognition framework that requires segmentation only at the line level. The analysis and preprocessing of the KFUPM Handwritten Arabic TexT (KHATT) database are also presented. Finally, advanced image processing techniques, including filtering, transformation, and line segmentation, are implemented. The importance of this work is highlighted by its wide-ranging applications, including digitizing, documentation, archiving, and text translation in fields such as banking. Moreover, Arabic handwritten recognition (AHR) serves as a pivotal tool for making images searchable, enhancing information retrieval capabilities, and enabling effortless editing. This functionality significantly reduces the time and effort required for tasks such as Arabic data organization and manipulation.

[CV-7] Rethinking Remote Sensing Change Detection With A Mask View

Link: https://arxiv.org/abs/2406.15320
Authors: Xiaowen Ma, Zhenkai Wu, Rongrong Lian, Wei Zhang, Siyang Song
Keywords: Remote sensing change, Remote sensing, qualitatively assess changes, sensing change detection, change detection aims
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under review

Abstract:Remote sensing change detection aims to compare two or more images recorded for the same area but taken at different time stamps to quantitatively and qualitatively assess changes in geographical entities and environmental factors. Mainstream models are usually built on pixel-by-pixel change detection paradigms, which cannot tolerate the diversity of changes due to complex scenes and variation in imaging conditions. To address this shortcoming, this paper rethinks change detection with the mask view, and further proposes the corresponding: 1) meta-architecture CDMask and 2) instance network CDMaskFormer. Components of CDMask include a Siamese backbone, change extractor, pixel decoder, transformer decoder and normalized detector, which together ensure the proper functioning of the mask detection paradigm. Since the change query can be adaptively updated based on the bi-temporal feature content, the proposed CDMask can adapt to different latent data distributions, thus accurately identifying changed regions of interest in complex scenarios. Consequently, we further propose the instance network CDMaskFormer customized for the change detection task, which includes: (i) a spatial-temporal convolutional attention-based instantiated change extractor to capture spatio-temporal context simultaneously with lightweight operations; and (ii) a scene-guided axial attention-instantiated transformer decoder to extract more spatial details. State-of-the-art performance of CDMaskFormer is achieved on five benchmark datasets with a satisfactory efficiency-accuracy trade-off. Code is available at this https URL.

[CV-8] Advanced Multimodal Deep Learning Architecture for Image-Text Matching

Link: https://arxiv.org/abs/2406.15306
Authors: Jinyin Wang, Haijing Zhang, Yihao Zhong, Yingbin Liang, Rongwei Ji, Yiru Cang
Keywords: key multimodal task, multimodal deep learning, image-text matching models, text, key multimodal
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: arXiv admin note: text overlap with arXiv:2405.17460 by other authors

Abstract:Image-text matching is a key multimodal task that aims to model the semantic association between images and text as a matching relationship. With the advent of the multimedia information age, image, and text data show explosive growth, and how to accurately realize the efficient and accurate semantic correspondence between them has become the core issue of common concern in academia and industry. In this study, we delve into the limitations of current multimodal deep learning models in processing image-text pairing tasks. Therefore, we innovatively design an advanced multimodal deep learning architecture, which combines the high-level abstract representation ability of deep neural networks for visual information with the advantages of natural language processing models for text semantic understanding. By introducing a novel cross-modal attention mechanism and hierarchical feature fusion strategy, the model achieves deep fusion and two-way interaction between image and text feature space. In addition, we also optimize the training objectives and loss functions to ensure that the model can better map the potential association structure between images and text during the learning process. Experiments show that compared with existing image-text matching models, the optimized new model has significantly improved performance on a series of benchmark data sets. In addition, the new model also shows excellent generalization and robustness on large and diverse open scenario datasets and can maintain high matching performance even in the face of previously unseen complex situations.

[CV-9] ADR: Attention Diversification Regularization for Mitigating Overfitting in Multiple Instance Learning based Whole Slide Image Classification

Link: https://arxiv.org/abs/2406.15303
Authors: Yunlong Zhang, Zhongyi Shui, Yunxuan Sun, Honglin Li, Jingxiong Li, Chenglu Zhu, Sunyi Zheng, Lin Yang
Keywords: Multiple Instance Learning, Multiple Instance, Instance Learning, encounters overfitting challenges, slide images
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Multiple Instance Learning (MIL) has demonstrated effectiveness in analyzing whole slide images (WSIs), yet it often encounters overfitting challenges in real-world applications. This paper reveals the correlation between MIL’s performance and the entropy of attention values. Based on this observation, we propose Attention Diversity Regularization (ADR), a simple but effective technique aimed at promoting high entropy in attention values. Specifically, ADR introduces a negative Shannon entropy loss for attention values into the regular MIL framework. Compared to existing methods aimed at alleviating overfitting, which often necessitate additional modules or processing steps, our ADR approach requires no such extras, demonstrating simplicity and efficiency. We evaluate our ADR on three WSI classification tasks. ADR achieves superior performance over the state-of-the-art on most of them. We also show that ADR can enhance heatmaps, aligning them better with pathologists’ diagnostic criteria. The source code is available at this https URL.
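
The regularizer described in the abstract, a negative Shannon entropy term on the attention values added to the usual MIL loss, can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' implementation: the attention values are taken to be a softmax over per-instance scores, and the weight `lam`, the `eps` guard, and all function names are hypothetical.

```python
import math

def softmax(scores):
    """Turn raw per-instance scores into attention values that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def negative_entropy(attention, eps=1e-12):
    """Negative Shannon entropy, sum(a * log a); lowest for uniform attention."""
    return sum(a * math.log(a + eps) for a in attention)

def regularized_loss(task_loss, attention, lam=0.1):
    """Task loss plus the entropy regularizer (lam is an assumed weight)."""
    return task_loss + lam * negative_entropy(attention)

uniform = softmax([0.0, 0.0, 0.0, 0.0])  # attention spread over all instances
peaked = softmax([8.0, 0.0, 0.0, 0.0])   # attention concentrated on one instance

print(regularized_loss(1.0, uniform))
print(regularized_loss(1.0, peaked))  # larger: concentrated attention is penalized
```

Minimizing the added term pushes the attention distribution toward higher entropy, which is the diversification effect the abstract describes.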

[CV-10] Learning Spatio-Temporal Patterns of Polar Ice Layers With Physics-Informed Graph Neural Network

Link: https://arxiv.org/abs/2406.15299
Authors: Zesheng Liu, Maryam Rahnemoonfar
Keywords: ice dynamic processes, ice sheet balance, evaluating ice dynamic, dynamic processes, crucial for monitoring
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Learning spatio-temporal patterns of polar ice layers is crucial for monitoring the change in ice sheet balance and evaluating ice dynamic processes. While a few researchers focus on learning ice layer patterns from echogram images captured by airborne snow radar sensors via different convolutional neural networks, the noise in the echogram images proves to be a major obstacle. Instead, we focus on geometric deep learning based on graph neural networks to learn the spatio-temporal patterns from thickness information of shallow ice layers and make predictions for deep layers. In this paper, we propose a physics-informed hybrid graph neural network that combines the GraphSAGE framework for graph feature learning with the long short-term memory (LSTM) structure for learning temporal changes, and introduce measurements of physical ice properties from Model Atmospheric Regional (MAR) weather model as physical node features. We found that our proposed network can consistently outperform the current non-inductive or non-physical model in predicting deep ice layer thickness.

[CV-11] You Only Acquire Sparse-channel (YOAS): A Unified Framework for Dense-channel EEG Generation

Link: https://arxiv.org/abs/2406.15269
Authors: Hongyu Chen, Weiming Zeng, Luhui Cai, Yueyang Li, Lei Wang, Jia Lu, Hongjie Yan, Wai Ting Siok, Nizhuan Wang
Keywords: EEG, EEG signal generation, EEG generation, Synthetic EEG Generation, High-precision acquisition
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:High-precision acquisition of dense-channel electroencephalogram (EEG) signals is often impeded by the costliness and lack of portability of equipment. In contrast, generating dense-channel EEG signals effectively from sparse channels shows promise and economic viability. However, sparse-channel EEG poses challenges such as reduced spatial resolution, information loss, signal mixing, and heightened susceptibility to noise and interference. To address these challenges, we first theoretically formulate the dense-channel EEG generation problem as the optimization of a set of cross-channel EEG signal generation problems. Then, we propose the YOAS framework for generating dense-channel data from sparse-channel EEG signals. YOAS consists of four sequential stages: Data Preparation, Data Preprocessing, Biased-EEG Generation, and Synthetic EEG Generation. Data Preparation and Preprocessing carefully consider the distribution of EEG electrodes and the low signal-to-noise ratio of EEG signals. Biased-EEG Generation includes the sub-modules BiasEEGGanFormer and BiasEEGDiffFormer, which facilitate long-term feature extraction with attention and generate signals by combining electrode position alignment with a diffusion model, respectively. Synthetic EEG Generation synthesizes the final signals, employing a deduction paradigm for multi-channel EEG generation. Extensive experiments confirmed YOAS’s feasibility, efficiency, and theoretical validity, even remarkably enhancing data discernibility. This breakthrough in dense-channel EEG signal generation from sparse-channel data opens new avenues for exploration in EEG signal processing and application.

[CV-12] Fingerprint Membership and Identity Inference Against Generative Adversarial Networks

Link: https://arxiv.org/abs/2406.15253
Authors: Saverio Cavasin, Daniele Mari, Simone Milani, Mauro Conti
Keywords: gaining significant attention, industrial revolution, gaining significant, significant attention, attention as potential
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Paper submitted at “Pattern Recognition Letters”, 9 pages, 6 images

Abstract:Generative models are gaining significant attention as potential catalysts for a novel industrial revolution. Since automated sample generation can help solve the privacy and data scarcity issues that usually affect learned biometric models, such technologies have become widespread in this field. In this paper, we assess the vulnerabilities of generative machine learning models concerning identity protection by designing and testing an identity inference attack on fingerprint datasets created by means of a generative adversarial network. Experimental results show that the proposed solution is effective under different configurations and easily extendable to other biometric measurements.

[CV-13] VideoScore: Building Automatic Metrics to Simulate Fine-grained Human Feedback for Video Generation

Link: https://arxiv.org/abs/2406.15252
Authors: Xuan He, Dongfu Jiang, Ge Zhang, Max Ku, Achint Soni, Sherman Siu, Haonan Chen, Abhranil Chandra, Ziyan Jiang, Aaran Arulraj, Kai Wang, Quy Duc Do, Yuansheng Ni, Bohan Lyu, Yaswanth Narsupalli, Rongqi Fan, Zhiheng Lyu, Yuchen Lin, Wenhu Chen
Keywords: witnessed great advances, recent years, years have witnessed, video, human
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Recent years have witnessed great advances in video generation. However, the development of automatic video metrics is lagging significantly behind. None of the existing metrics is able to provide reliable scores for generated videos. The main barrier is the lack of a large-scale human-annotated dataset. In this paper, we release VideoFeedback, the first large-scale dataset containing human-provided multi-aspect scores for 37.6K synthesized videos from 11 existing video generative models. We train VideoScore (initialized from Mantis) on VideoFeedback to enable automatic video quality assessment. Experiments show that the Spearman correlation between VideoScore and humans can reach 77.1 on VideoFeedback-test, beating the prior best metrics by about 50 points. Further results on the held-out EvalCrafter, GenAI-Bench, and VBench benchmarks show that VideoScore has a consistently much higher correlation with human judges than other metrics. Given these results, we believe VideoScore can serve as a great proxy for human raters to (1) rate different video models to track progress and (2) simulate fine-grained human feedback in Reinforcement Learning with Human Feedback (RLHF) to improve current video generation models.
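
The Spearman correlation cited in this abstract measures how well a metric's ranking of videos agrees with the human ranking. The sketch below is only an illustration of that statistic in plain Python, not the authors' evaluation code. The score lists are hypothetical, and the function names are made up for this example. The abstract's figure of 77.1 is presumably the coefficient scaled by 100.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical scores for five videos (not data from the paper).
human_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
metric_scores = [0.10, 0.35, 0.30, 0.80, 0.90]
rho = spearman(human_scores, metric_scores)
print(f"Spearman correlation: {rho:.2f}")
```

Because only ranks matter, the metric does not need to be on the same scale as the human ratings. It only needs to order the videos similarly.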

[CV-14] Landscape More Secure Than Portrait? Zooming Into the Directionality of Digital Images With Security Implications

Link: https://arxiv.org/abs/2406.15206
Authors: Benedikt Lorch, Rainer Böhme
Keywords: captured can affect, affect the resulting, resulting security, downstream applications, media security assume
Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)

Abstract:The orientation in which a source image is captured can affect the resulting security in downstream applications. One reason for this is that many state-of-the-art methods in media security assume that image statistics are similar in the horizontal and vertical directions, allowing them to reduce the number of features (or trainable weights) by merging coefficients. We show that this artificial symmetrization tends to suppress important properties of natural images and common processing operations, causing a loss of performance. We also observe the opposite problem, where unaddressed directionality causes learning-based methods to overfit to a single orientation. These are vulnerable to manipulation if an adversary chooses inputs with the less common orientation. This paper takes a comprehensive approach, identifies and systematizes causes of directionality at several stages of a typical acquisition pipeline, measures their effect, and demonstrates for three selected security applications (steganalysis, forensic source identification, and the detection of synthetic images) how the performance of state-of-the-art methods can be improved by properly accounting for directionality.

[CV-15] DiffExplainer: Unveiling Black Box Models Via Counterfactual Generation

Link: https://arxiv.org/abs/2406.15182
Authors: Yingying Fang, Shuang Wu, Zihao Jin, Caiwen Xu, Shiyi Wang, Simon Walsh, Guang Yang
Keywords: early disease detection, understanding the reasoning, related to early, early disease, disease detection
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: MICCAI 2024

Abstract:In the field of medical imaging, particularly in tasks related to early disease detection and prognosis, understanding the reasoning behind AI model predictions is imperative for assessing their reliability. Conventional explanation methods encounter challenges in identifying decisive features in medical image classifications, especially when discriminative features are subtle or not immediately evident. To address this limitation, we propose an agent model capable of generating counterfactual images that prompt different decisions when plugged into a black box model. By employing this agent model, we can uncover influential image patterns that impact the black box model’s final predictions. Through our methodology, we efficiently identify features that influence the decisions of the deep black box. We validated our approach in the rigorous domain of medical prognosis tasks, showcasing its efficacy and potential to enhance the reliability of deep learning models in medical image classification compared to existing interpretation methods. The code will be publicly available at this https URL.

[CV-16] Stochastic Optimisation Framework using the Core Imaging Library and Synergistic Image Reconstruction Framework for PET Reconstruction

Link: https://arxiv.org/abs/2406.15159
Authors: Evangelos Papoutsellis, Casper da Costa-Luis, Daniel Deidda, Claire Delplancke, Margaret Duff, Gemma Fardell, Ashley Gillman, Jakob S. Jørgensen, Zeljko Kereta, Evgueni Ovtchinnikov, Edoardo Pasca, Georg Schramm, Kris Thielemans
Keywords: Core Imaging Library, source Core Imaging, Imaging Library, Core Imaging, enables easy development
Categories: Numerical Analysis (math.NA); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Optimization and Control (math.OC)

Abstract:We introduce a stochastic framework into the open-source Core Imaging Library (CIL) which enables easy development of stochastic algorithms. Five such algorithms from the literature are developed: Stochastic Gradient Descent, Stochastic Average Gradient (-Amélioré), and (Loopless) Stochastic Variance Reduced Gradient. We showcase the functionality of the framework with a comparative study against a deterministic algorithm on a simulated 2D PET dataset, with the use of the open-source Synergistic Image Reconstruction Framework. We observe that stochastic optimisation methods can converge in fewer passes of the data than a standard deterministic algorithm.
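
To make the idea of a stochastic algorithm concrete, here is a minimal sketch of plain Stochastic Gradient Descent on a toy problem. It does not use the CIL or SIRF APIs, and it is not a PET reconstruction. The objective is assumed to be a small least-squares sum, and the data, step size, and function names are hypothetical. Each update uses the gradient of a single randomly chosen term instead of the full sum.

```python
import random

# Toy consistent least-squares problem: minimise sum_i 0.5 * (a_i . x - b_i)^2.
# ROWS and X_TRUE are hypothetical illustration data.
ROWS = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, -1.0)]
X_TRUE = (2.0, -1.0)
TARGETS = [a[0] * X_TRUE[0] + a[1] * X_TRUE[1] for a in ROWS]


def objective(x):
    """Full objective: the sum of all per-row squared residual terms."""
    return sum(
        0.5 * (a[0] * x[0] + a[1] * x[1] - b) ** 2 for a, b in zip(ROWS, TARGETS)
    )


def sgd(x0, step, num_updates, seed=0):
    """Run SGD, sampling one term of the sum uniformly at random per update."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(num_updates):
        i = rng.randrange(len(ROWS))
        a, b = ROWS[i], TARGETS[i]
        residual = a[0] * x[0] + a[1] * x[1] - b
        # Gradient of the i-th term 0.5 * residual^2 is residual * a_i.
        x[0] -= step * residual * a[0]
        x[1] -= step * residual * a[1]
    return x


x_start = [0.0, 0.0]
# 200 single-term updates correspond to 50 passes over the four data rows.
x_sgd = sgd(x_start, step=0.5, num_updates=200)
print("objective before:", objective(x_start))
print("objective after: ", objective(x_sgd))
```

One update here touches a single row, so four updates cost about as much as one full-gradient step. That is the sense in which progress is counted in passes over the data.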

[CV-17] Gaussian Splatting to Real World Flight Navigation Transfer with Liquid Networks

Link: https://arxiv.org/abs/2406.15149
Authors: Alex Quach, Makram Chahine, Alexander Amini, Ramin Hasani, Daniela Rus
Keywords: scalable data generation, offer scalable data, autonomous robot learning, flexible design, optimization of trajectories
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Simulators are powerful tools for autonomous robot learning as they offer scalable data generation, flexible design, and optimization of trajectories. However, transferring behavior learned from simulation data into the real world proves to be difficult, usually mitigated with compute-heavy domain randomization methods or further model fine-tuning. We present a method to improve generalization and robustness to distribution shifts in sim-to-real visual quadrotor navigation tasks. To this end, we first build a simulator by integrating Gaussian Splatting with quadrotor flight dynamics, and then, train robust navigation policies using Liquid neural networks. In this way, we obtain a full-stack imitation learning protocol that combines advances in 3D Gaussian splatting radiance field rendering, crafty programming of expert demonstration training data, and the task understanding capabilities of Liquid networks. Through a series of quantitative flight tests, we demonstrate the robust transfer of navigation skills learned in a single simulation scene directly to the real world. We further show the ability to maintain performance beyond the training environment under drastic distribution and physical environment changes. Our learned Liquid policies, trained on single target manoeuvres curated from a photorealistic simulated indoor flight only, generalize to multi-step hikes onboard a real hardware platform outdoors.

[CV-18] High Resolution Surface Reconstruction of Cultural Heritage Objects Using Shape from Polarization Method

Link: https://arxiv.org/abs/2406.15121
Authors: F. S. Mortazavi, M. Saadatseresht
Keywords: cultural heritage objects, computer vision, computer graphics, cultural heritage, heritage objects
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Abstract:Nowadays, three-dimensional reconstruction is used in various fields such as computer vision, computer graphics, mixed reality, and digital twins. The three-dimensional reconstruction of cultural heritage objects is one of the most important applications in this area and is usually accomplished by close-range photogrammetry. The problem here is that the images are often noisy, and in practice the dense image matching method has significant limitations in reconstructing the geometric details of cultural heritage objects. Therefore, displaying high-level details in three-dimensional models, especially for cultural heritage objects, is a severe challenge in this field. In this paper, the shape from polarization method is investigated; it is a passive method without the drawbacks of active methods. In this method, the resolution of the depth maps can be dramatically increased using the information obtained from polarized light by rotating a linear polarizing filter in front of a digital camera. Through these polarized images, the surface details of the object can be reconstructed locally with high accuracy. The fusion of polarization and photogrammetric methods is an appropriate solution for achieving high-resolution three-dimensional reconstruction. The surface reconstruction was assessed visually and quantitatively. The evaluations showed that the proposed method reconstructs the surface details in the three-dimensional model significantly better than the photogrammetric method, with 10 times higher depth resolution.
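
As a rough illustration of what rotating a linear polarizing filter yields, the sketch below recovers the degree and angle of linear polarization for one pixel from intensities measured at four filter angles. This is the standard linear Stokes-parameter relation, not the authors' pipeline. The choice of 0°, 45°, 90°, and 135° is an assumption made for this example, and the function names and pixel values are hypothetical.

```python
import math


def polarization_from_four_angles(i0, i45, i90, i135):
    """Estimate degree and angle of linear polarization for a single pixel.

    Inputs are intensities behind a linear polarizer at 0, 45, 90, 135 degrees.
    """
    s0 = i0 + i90    # total intensity
    s1 = i0 - i90    # horizontal versus vertical component
    s2 = i45 - i135  # diagonal components
    dolp = math.hypot(s1, s2) / s0
    aolp = 0.5 * math.atan2(s2, s1)  # radians, in (-pi/2, pi/2]
    return dolp, aolp


def intensity(theta, s0, dolp, aolp):
    """Intensity behind an ideal linear polarizer at angle theta (radians)."""
    return 0.5 * s0 * (1.0 + dolp * math.cos(2.0 * (theta - aolp)))


# Hypothetical pixel: 40% linearly polarized light at an angle of 0.3 rad.
angles = [0.0, math.pi / 4, math.pi / 2, 3 * math.pi / 4]
measurements = [intensity(t, s0=1.0, dolp=0.4, aolp=0.3) for t in angles]
dolp_est, aolp_est = polarization_from_four_angles(*measurements)
print(f"degree of polarization: {dolp_est:.3f}, angle: {aolp_est:.3f} rad")
```

Shape-from-polarization methods then relate these two per-pixel quantities to the orientation of the surface normal, which is where the local surface detail comes from.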

[CV-19] Surface Normal Reconstruction Using Polarization-Unet

Link: https://arxiv.org/abs/2406.15118
Authors: F. S. Mortazavi, S. Dajkhosh, M. Saadatseresht
Keywords: three-dimensional reconstruction, resolution three-dimensional reconstruction, high-resolution three-dimensional reconstruction, three-dimensional reconstruction methods, active three-dimensional reconstruction
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Today, three-dimensional reconstruction of objects has many applications in various fields. Choosing a suitable method for high-resolution three-dimensional reconstruction is therefore an important issue, and displaying high-level details in three-dimensional models is a serious challenge in this field. Until now, active methods have been used for high-resolution three-dimensional reconstruction. The problem with active three-dimensional reconstruction methods is that they require a light source close to the object. Shape from polarization (SfP) is one of the best solutions for high-resolution three-dimensional reconstruction of objects; it is a passive method and does not have the drawbacks of active methods. The changes in polarization of the light reflected from an object can be analyzed by using a polarization camera, or by placing a polarizing filter in front of a digital camera and rotating the filter. Using this information, the surface normal can be reconstructed with high accuracy, which leads to local reconstruction of the surface details. In this paper, an end-to-end deep learning approach is presented to produce the surface normals of objects. A benchmark dataset is used to train the neural network and evaluate the results. The results are evaluated quantitatively and qualitatively against other methods and under different lighting conditions, using the mean angular error (MAE). The evaluations showed that the proposed method accurately reconstructs the surface normals of objects with the lowest MAE, 18.06 degrees on the whole dataset, compared with 41.44 to 49.03 degrees for previous physics-based methods.
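
The mean angular error used in this abstract is the average angle between predicted and reference surface normals. The following is a minimal sketch of that metric in plain Python, assuming normals are given as 3-tuples. The function names and sample normals are hypothetical, and the authors' evaluation may differ in details such as masking.

```python
import math


def angular_error_deg(n_pred, n_true):
    """Angle in degrees between two 3D vectors (they need not be unit length)."""
    dot = sum(a * b for a, b in zip(n_pred, n_true))
    norm_pred = math.sqrt(sum(a * a for a in n_pred))
    norm_true = math.sqrt(sum(b * b for b in n_true))
    # Clamp to [-1, 1] so rounding noise cannot push acos out of its domain.
    cosine = max(-1.0, min(1.0, dot / (norm_pred * norm_true)))
    return math.degrees(math.acos(cosine))


def mean_angular_error_deg(pred_normals, true_normals):
    """Average per-pixel angular error over paired lists of normals."""
    errors = [angular_error_deg(p, t) for p, t in zip(pred_normals, true_normals)]
    return sum(errors) / len(errors)


# Hypothetical two-pixel example: one perfect normal, one off by 90 degrees.
predicted = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 1.0), (0.0, 0.0, 1.0)]
mae = mean_angular_error_deg(predicted, reference)
print(f"mean angular error: {mae:.1f} degrees")
```

On this scale, lower is better, which is how the 18.06 degree figure compares favourably with the 41.44 to 49.03 degree range quoted for the physics-based methods.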

[CV-20] Investigating the impact of 2D gesture representation on co-speech gesture generation

Link: https://arxiv.org/abs/2406.15111
Authors: Teo Guichoux, Laure Soulier, Nicolas Obin, Catherine Pelachaud
Keywords: embodied conversational agents, Co-speech gestures play, natural co-speech gestures, Co-speech gestures, co-speech gestures synchronized
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages. Paper accepted at WACAI 2024

Abstract:Co-speech gestures play a crucial role in the interactions between humans and embodied conversational agents (ECA). Recent deep learning methods enable the generation of realistic, natural co-speech gestures synchronized with speech, but such approaches require large amounts of training data. “In-the-wild” datasets, which compile videos from sources such as YouTube through human pose detection models, offer a solution by providing 2D skeleton sequences that are paired with speech. Concurrently, innovative lifting models have emerged, capable of transforming these 2D pose sequences into their 3D counterparts, leading to large and diverse datasets of 3D gestures. However, the derived 3D pose estimation is essentially a pseudo-ground truth, with the actual ground truth being the 2D motion data. This distinction raises questions about the impact of gesture representation dimensionality on the quality of generated motions, a topic that, to our knowledge, remains largely unexplored. In this work, we evaluate the impact of the dimensionality of the training data, 2D or 3D joint coordinates, on the performance of a multimodal speech-to-gesture deep generative model. We use a lifting model to convert 2D-generated sequences of body pose to 3D. Then, we compare the sequence of gestures generated directly in 3D to the gestures generated in 2D and lifted to 3D as post-processing.

[CV-21] Deciphering the Definition of Adversarial Robustness for post-hoc OOD Detectors

Link: https://arxiv.org/abs/2406.15104
Authors: Peter Lorenz, Mario Fernandez, Jens Müller, Ullrich Köthe
Keywords: safely deploying deep, deploying deep learning, deep learning models, inputs is critical, critical for safely
Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Detecting out-of-distribution (OOD) inputs is critical for safely deploying deep learning models in real-world scenarios. In recent years, many OOD detectors have been developed, and even the benchmarking has been standardized, e.g. OpenOOD. The number of post-hoc detectors is growing fast; they offer an option to protect a pre-trained classifier against natural distribution shifts and claim to be ready for real-world scenarios. However, their efficacy in handling adversarial examples has been neglected in the majority of studies. This paper investigates the adversarial robustness of 16 post-hoc detectors against several evasion attacks and discusses a roadmap towards adversarial defense in OOD detectors.

[CV-22] HLQ: Fast and Efficient Backpropagation via Hadamard Low-rank Quantization

Link: https://arxiv.org/abs/2406.15102
Authors: Seonggon Kim, Eunhyeok Park
Keywords: rapid increase, increase in model, model size, growing importance, Hadamard Low-rank Quantization
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:With the rapid increase in model size and the growing importance of various fine-tuning applications, lightweight training has become crucial. Since the backward pass is twice as expensive as the forward pass, optimizing backpropagation is particularly important. However, modifications to this process can lead to suboptimal convergence, so training optimization should minimize perturbations, which is a highly challenging task. In this study, we introduce a novel optimization strategy called Hadamard Low-rank Quantization (HLQ), focusing on reducing the cost of backpropagation in convolutional and linear layers. We first analyze the sensitivity of gradient computation with respect to activation and weight, and judiciously design the HLQ pipeline to apply 4-bit Hadamard quantization to the activation gradient and Hadamard low-rank approximation to the weight gradient. This combination was found to be the best for maximizing benefits, and our extensive experiments demonstrate the outstanding performance of HLQ in both training from scratch and fine-tuning, achieving significant memory savings and acceleration on real GPUs with negligible quality degradation.
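
To give a feel for what 4-bit Hadamard quantization of a gradient could look like, here is a small sketch in plain Python. It is not the HLQ pipeline from the paper. It only shows the generic idea of applying an orthonormal Walsh-Hadamard transform, quantizing the coefficients to signed 4-bit codes, and inverting the transform. The gradient values, the symmetric per-vector scaling scheme, and all function names are assumptions for illustration.

```python
import math


def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two).

    With the 1/sqrt(n) scaling the transform is its own inverse.
    """
    out = list(vec)
    n = len(out)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in out]


def quantize_int4(vec):
    """Symmetric quantization to signed 4-bit codes in the range [-8, 7]."""
    max_abs = max(abs(v) for v in vec)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(v / scale))) for v in vec]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to floating-point values."""
    return [c * scale for c in codes]


# Hypothetical activation gradient with one large outlier entry.
grad = [0.02, -0.01, 0.03, 4.0, -0.02, 0.01, 0.00, -0.03]
codes, scale = quantize_int4(fwht(grad))
restored = fwht(dequantize(codes, scale))
max_error = max(abs(r - g) for r, g in zip(restored, grad))
print("4-bit codes:", codes)
print(f"largest reconstruction error: {max_error:.4f}")
```

In this toy input the transform spreads the single large entry across all coefficients, so the 4-bit grid is shared more evenly than it would be on the raw vector. Whether that helps in practice depends on the gradient statistics, which the paper analyzes and this sketch does not.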

[CV-23] ECLIPSE: Expunging Clean-label Indiscriminate Poisons via Sparse Diffusion Purification

Link: https://arxiv.org/abs/2406.15093
Authors: Xianlong Wang, Shengshan Hu, Yechao Zhang, Ziqi Zhou, Leo Yu Zhang, Peng Xu, Wei Wan, Hai Jin
Keywords: Clean-label indiscriminate poisoning, add invisible perturbations, Clean-label indiscriminate, correctly labeled training, indiscriminate poisoning attacks
Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: Accepted by ESORICS 2024

Abstract:Clean-label indiscriminate poisoning attacks add invisible perturbations to correctly labeled training images, thus dramatically reducing the generalization capability of the victim models. Recently, some defense mechanisms have been proposed such as adversarial training, image transformation techniques, and image purification. However, these schemes are either susceptible to adaptive attacks, built on unrealistic assumptions, or only effective against specific poison types, limiting their universal applicability. In this research, we propose a more universally effective, practical, and robust defense scheme called ECLIPSE. We first investigate the impact of Gaussian noise on the poisons and theoretically prove that any kind of poison will be largely assimilated when imposing sufficient random noise. In light of this, we assume the victim has access to an extremely limited number of clean images (a more practical scene) and subsequently enlarge this sparse set for training a denoising probabilistic model (a universal denoising tool). We then begin by introducing Gaussian noise to absorb the poisons and then apply the model for denoising, resulting in a roughly purified dataset. Finally, to address the trade-off of the inconsistency in the assimilation sensitivity of different poisons by Gaussian noise, we propose a lightweight corruption compensation module to effectively eliminate residual poisons, providing a more universal defense approach. Extensive experiments demonstrate that our defense approach outperforms 10 state-of-the-art defenses. We also propose an adaptive attack against ECLIPSE and verify the robustness of our defense scheme. Our code is available at this https URL.

[CV-24] Tri-VQA: Triangular Reasoning Medical Visual Question Answering for Multi-Attribute Analysis

Link: https://arxiv.org/abs/2406.15050
Authors: Lin Fan, Xun Gong, Cenyang Zheng, Yafei Ou
Keywords: Visual Question Answering, challenging research topic, advantages including patient, including patient engagement, clinical expert involvement
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Medical Visual Question Answering (Med-VQA) is a challenging research topic with advantages including patient engagement and clinical expert involvement for second opinions. However, existing Med-VQA methods based on joint embedding fail to explain whether their results are based on correct reasoning or coincidental answers, which undermines the credibility of VQA answers. In this paper, we investigate the construction of a more cohesive and stable Med-VQA structure. Motivated by causal effect, we propose a novel Triangular Reasoning VQA (Tri-VQA) framework, which constructs reverse causal questions from the perspective of “Why this answer?” to elucidate the source of the answer and stimulate more reasonable forward reasoning processes. We evaluate our method on the Endoscopic Ultrasound (EUS) multi-attribute annotated dataset from five centers, and test it on medical VQA datasets. Experimental results demonstrate the superiority of our approach over existing methods. Our code and pre-trained models are available at https://anonymous.4open.science/r/Tri_VQA.

[CV-25] Improving Interpretability and Robustness for the Detection of AI-Generated Images

Link: https://arxiv.org/abs/2406.15035
Authors: Tatiana Gaintseva, Laida Kushnareva, German Magai, Irina Piontkovskaya, Sergey Nikolenko, Martin Benning, Serguei Barannikov, Gregory Slabaugh
Keywords: artificial content detection, generative models, artificial content, difficult task, growing abilities
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:With growing abilities of generative models, artificial content detection becomes an increasingly important and difficult task. However, all popular approaches to this problem suffer from poor generalization across domains and generative models. In this work, we focus on the robustness of AI-generated image (AIGI) detectors. We analyze existing state-of-the-art AIGI detection methods based on frozen CLIP embeddings and show how to interpret them, shedding light on how images produced by various AI generators differ from real ones. Next we propose two ways to improve robustness: based on removing harmful components of the embedding vector and based on selecting the best performing attention heads in the image encoder model. Our methods increase the mean out-of-distribution (OOD) classification score by up to 6% for cross-model transfer. We also propose a new dataset for AIGI detection and use it in our evaluation; we believe this dataset will help boost further research. The dataset and code are provided as a supplement.

[CV-26] SVFormer: A Direct Training Spiking Transformer for Efficient Video Action Recognition

Link: https://arxiv.org/abs/2406.15034
Authors: Liutao Yu, Liwei Huang, Chenlin Zhou, Han Zhang, Zhengyu Ma, Huihui Zhou, Yonghong Tian
Keywords: plays crucial roles, Video action recognition, action recognition, plays crucial, industrial automation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by IJCAI 2024 workshop - Human Brain and Artificial Intelligence

Abstract:Video action recognition (VAR) plays a crucial role in various domains such as surveillance, healthcare, and industrial automation, making it highly significant for society. Consequently, it has long been a research hotspot in the computer vision field. As artificial neural networks (ANNs) flourish, convolutional neural networks (CNNs), including 2D-CNNs and 3D-CNNs, as well as variants of the vision transformer (ViT), have shown impressive performance on VAR. However, they usually incur a huge computational cost due to the large data volume and heavy information redundancy introduced by the temporal dimension. To address this challenge, some researchers have turned to brain-inspired spiking neural networks (SNNs), such as recurrent SNNs and ANN-converted SNNs, leveraging their inherent temporal dynamics and energy efficiency. Yet, current SNNs for VAR also encounter limitations, such as nontrivial input preprocessing, intricate network construction/training, and the need for repetitive processing of the same video clip, hindering their practical deployment. In this study, we propose the directly trained SVFormer (Spiking Video transFormer) for VAR. SVFormer integrates local feature extraction, global self-attention, and the intrinsic dynamics, sparsity, and spike-driven nature of SNNs, to efficiently and effectively extract spatio-temporal features. We evaluate SVFormer on two RGB datasets (UCF101, NTU-RGBD60) and one neuromorphic dataset (DVS128-Gesture), demonstrating performance comparable to mainstream models in a more efficient way. Notably, SVFormer achieves a top-1 accuracy of 84.03% with ultra-low power consumption (21 mJ/video) on UCF101, which is state-of-the-art among directly trained deep SNNs, showcasing significant advantages over prior models.

[CV-27] A3D: Does Diffusion Dream about 3D Alignment?

Link: https://arxiv.org/abs/2406.15020
Authors: Savva Ignatyev, Nina Konovalova, Daniil Selikhanovych, Nikolay Patakin, Oleg Voynov, Dmitry Senushkin, Alexander Filippov, Anton Konushin, Peter Wonka, Evgeny Burnaev
Keywords: problem of text-driven, tackle the problem, geometry alignment perspective, objects, Score Distillation
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:We tackle the problem of text-driven 3D generation from a geometry alignment perspective. We aim at the generation of multiple objects which are consistent in terms of semantics and geometry. Recent methods based on Score Distillation have succeeded in distilling the knowledge from 2D diffusion models to high-quality objects represented by 3D neural radiance fields. These methods handle multiple text queries separately, and therefore, the resulting objects have a high variability in object pose and structure. However, in some applications such as geometry editing, it is desirable to obtain aligned objects. In order to achieve alignment, we propose to optimize the continuous trajectories between the aligned objects, by modeling a space of linear pairwise interpolations of the textual embeddings with a single NeRF representation. We demonstrate that similar objects, consisting of semantically corresponding parts, can be well aligned in 3D space without costly modifications to the generation process. We provide several practical scenarios including mesh editing and object hybridization that benefit from geometry alignment and experimentally demonstrate the efficiency of our method. this https URL

[CV-28] Real-Time Hand Gesture Recognition: Integrating Skeleton-Based Data Fusion and Multi-Stream CNN

Link: https://arxiv.org/abs/2406.15003
Authors: Oluwaleke Yusuf, Maki Habib, Mohamed Moustafa
Keywords: Hand Gesture Recognition, Gesture Recognition, real-world contexts, Ensemble Tuner Multi-stream, Tuner Multi-stream CNN
Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments: 13 pages. 7 figures. Code available at this https URL

Abstract:This study focuses on Hand Gesture Recognition (HGR), which is vital for perceptual computing across various real-world contexts. The primary challenge in the HGR domain lies in dealing with the individual variations inherent in human hand morphology. To tackle this challenge, we introduce an innovative HGR framework that combines data-level fusion and an Ensemble Tuner Multi-stream CNN architecture. This approach effectively encodes spatiotemporal gesture information from the skeleton modality into RGB images, thereby minimizing noise while improving semantic gesture comprehension. Our framework operates in real-time, significantly reducing hardware requirements and computational complexity while maintaining competitive performance on benchmark datasets such as SHREC2017, DHG1428, FPHA, LMDHG and CNR. This improvement in HGR demonstrates robustness and paves the way for practical, real-time applications that leverage resource-limited devices for human-machine interaction and ambient intelligence.

[CV-29] Disability Representations: Finding Biases in Automatic Image Generation

Link: https://arxiv.org/abs/2406.14993
Authors: Yannis Tevissen
Keywords: enabled widespread access, Recent advancements, AI-generated imagery, visual content, technology have enabled
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Presented at AVA Workshop of CVPR 2024

Abstract:Recent advancements in image generation technology have enabled widespread access to AI-generated imagery, prominently used in advertising, entertainment, and progressively in every form of visual content. However, these technologies often perpetuate societal biases. This study investigates the representation biases in popular image generation models towards people with disabilities (PWD). Through a comprehensive experiment involving several popular text-to-image models, we analyzed the depiction of disability. The results indicate a significant bias, with most generated images portraying disabled individuals as old, sad, and predominantly using manual wheelchairs. These findings highlight the urgent need for more inclusive AI development, ensuring diverse and accurate representation of PWD in generated images. This research underscores the importance of addressing and mitigating biases in AI models to foster equitable and realistic representations.

[CV-30] E2GS: Event Enhanced Gaussian Splatting

Link: https://arxiv.org/abs/2406.14978
Authors: Hiroyuki Deguchi, Mana Masuda, Takuya Nakabayashi, Hideo Saito
Keywords: low energy usage, high dynamic range, Neural Radiance Field, Enhanced Gaussian Splatting, absence of motion
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages

Abstract:Event cameras, known for their high dynamic range, absence of motion blur, and low energy usage, have recently found a wide range of applications thanks to these attributes. In the past few years, the field of event-based 3D reconstruction saw remarkable progress, with the Neural Radiance Field (NeRF) based approach demonstrating photorealistic view synthesis results. However, the volume rendering paradigm of NeRF necessitates extensive training and rendering times. In this paper, we introduce Event Enhanced Gaussian Splatting (E2GS), a novel method that incorporates event data into Gaussian Splatting, which has recently made significant advances in the field of novel view synthesis. Our E2GS effectively utilizes both blurry images and event data, significantly improving image deblurring and producing high-quality novel view synthesis. Our comprehensive experiments on both synthetic and real-world datasets demonstrate our E2GS can generate visually appealing renderings while offering faster training and rendering speed (140 FPS). Our code is available at this https URL.

[CV-31] LU2Net: A Lightweight Network for Real-time Underwater Image Enhancement

Link: https://arxiv.org/abs/2406.14973
Authors: Haodong Yang, Jisheng Xu, Zhiliang Lin, Jianping He
Keywords: Computer vision techniques, including object tracking, Computer vision, underwater, underwater image enhancement
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Abstract:Computer vision techniques have empowered underwater robots to effectively undertake a multitude of tasks, including object tracking and path planning. However, underwater optical factors like light refraction and absorption present challenges to underwater vision, which cause degradation of underwater images. A variety of underwater image enhancement methods have been proposed to improve the effectiveness of underwater vision perception. Nevertheless, for real-time vision tasks on underwater robots, it is necessary to overcome the challenges associated with algorithmic efficiency and real-time capabilities. In this paper, we introduce Lightweight Underwater Unet (LU2Net), a novel U-shape network designed specifically for real-time enhancement of underwater images. The proposed model incorporates axial depthwise convolution and the channel attention module, enabling it to significantly reduce computational demands and model parameters, thereby improving processing speed. The extensive experiments conducted on the dataset and real-world underwater robots demonstrate the exceptional performance and speed of proposed model. It is capable of providing well-enhanced underwater images at a speed 8 times faster than the current state-of-the-art underwater image enhancement method. Moreover, LU2Net is able to handle real-time underwater video enhancement.

[CV-32] VividDreamer: Towards High-Fidelity and Efficient Text-to-3D Generation

Link: https://arxiv.org/abs/2406.14964
Authors: Zixuan Chen, Ruijie Su, Jiahao Zhu, Lingxiao Yang, Jian-Huang Lai, Xiaohua Xie
Keywords: Score Distillation Sampling, Consistency Distillation Sampling, Pose-dependent Consistency Distillation, generation, Distillation Sampling
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Text-to-3D generation aims to create 3D assets from text-to-image diffusion models. However, existing methods face an inherent bottleneck in generation quality because the widely-used objectives such as Score Distillation Sampling (SDS) inappropriately omit U-Net Jacobians for swift generation, leading to significant bias compared to the “true” gradient obtained by full denoising sampling. This bias brings inconsistent updating direction, resulting in implausible 3D generation (e.g., color deviation, the Janus problem, and semantically inconsistent details). In this work, we propose Pose-dependent Consistency Distillation Sampling (PCDS), a novel yet efficient objective for diffusion-based 3D generation tasks. Specifically, PCDS builds the pose-dependent consistency function within diffusion trajectories, allowing true gradients to be approximated through minimal sampling steps (1-3). Compared to SDS, PCDS can acquire a more accurate updating direction with the same sampling time (1 sampling step), while enabling few-step (2-3) sampling to trade compute for higher generation quality. For efficient generation, we propose a coarse-to-fine optimization strategy, which first utilizes 1-step PCDS to create the basic structure of 3D objects, and then gradually increases PCDS steps to generate fine-grained details. Extensive experiments demonstrate that our approach outperforms the state-of-the-art in generation quality and training efficiency, conspicuously alleviating the implausible 3D generation issues caused by the deviated updating direction. Moreover, it can be simply applied to many 3D generative applications to yield impressive 3D assets. Please see our project page: this https URL.

[CV-33] Contextual Interaction via Primitive-based Adversarial Training For Compositional Zero-shot Learning

Link: https://arxiv.org/abs/2406.14962
Authors: Suyi Li, Chenyi Jiang, Shidong Wang, Yang Long, Zheng Zhang, Haofeng Zhang
Keywords: Compositional Zero-shot Learning, Zero-shot Learning, CZSL tasks lies, Compositional Zero-shot, aims to identify
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Compositional Zero-shot Learning (CZSL) aims to identify novel compositions via known attribute-object pairs. The primary challenge in CZSL tasks lies in the significant discrepancies introduced by the complex interaction between the visual primitives of attribute and object, consequently decreasing the classification performance towards novel compositions. Previous remarkable works primarily addressed this issue by focusing on disentangling strategy or utilizing object-based conditional probabilities to constrain the selection space of attributes. Unfortunately, few studies have explored the problem from the perspective of modeling the mechanism of visual primitive interactions. Inspired by the success of vanilla adversarial learning in Cross-Domain Few-Shot Learning, we take a step further and devise a model-agnostic and Primitive-Based Adversarial training (PBadv) method to deal with this problem. Besides, the latest studies highlight the weakness of the perception of hard compositions even under data-balanced conditions. To this end, we propose a novel over-sampling strategy with object-similarity guidance to augment target compositional training data. We performed detailed quantitative analysis and retrieval experiments on well-established datasets, such as UT-Zappos50K, MIT-States, and C-GQA, to validate the effectiveness of our proposed method, and the state-of-the-art (SOTA) performance demonstrates the superiority of our approach. The code is available at this https URL.

[CV-34] Skip and Skip: Segmenting Medical Images with Prompts

Link: https://arxiv.org/abs/2406.14958
Authors: Jiawei Chen, Dingkang Yang, Yuxuan Lei, Lihua Zhang
Keywords: hand-crafted accurate annotations, segmentation methods rely, rely on hand-crafted, hand-crafted accurate, medical image lesion
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Work in progress

Abstract:Most medical image lesion segmentation methods rely on hand-crafted accurate annotations of the original image for supervised learning. Recently, a series of weakly supervised or unsupervised methods have been proposed to reduce the dependence on pixel-level annotations. However, these methods are essentially based on pixel-level annotation, ignoring the image-level diagnostic results of the current massive medical images. In this paper, we propose a dual U-shaped two-stage framework that utilizes image-level labels to prompt the segmentation. In the first stage, we pre-train a classification network with image-level labels, which is used to obtain the hierarchical pyramid features and guide the learning of downstream branches. In the second stage, we feed the hierarchical features obtained from the classification branch into the downstream branch through short-skip and long-skip and get the lesion masks under the supervised learning of pixel-level labels. Experiments show that our framework achieves better results than networks simply using pixel-level annotations.

[CV-35] Deep Imbalanced Regression to Estimate Vascular Age from PPG Data: a Novel Digital Biomarker for Cardiovascular Health

Link: https://arxiv.org/abs/2406.14953
Authors: Guangkun Nie, Qinghao Zhao, Gongzheng Tang, Jun Li, Shenda Hong
Keywords: monitoring human hemodynamics, recent studies highlighting, assessing vascular aging, human hemodynamics, Dist Loss
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Signal Processing (eess.SP)

Abstract:Photoplethysmography (PPG) is emerging as a crucial tool for monitoring human hemodynamics, with recent studies highlighting its potential in assessing vascular aging through deep learning. However, real-world age distributions are often imbalanced, posing significant challenges for deep learning models. In this paper, we introduce a novel, simple, and effective loss function named the Dist Loss to address deep imbalanced regression tasks. We trained a one-dimensional convolutional neural network (Net1D) incorporating the Dist Loss on the extensive UK Biobank dataset (n=502,389) to estimate vascular age from PPG signals and validate its efficacy in characterizing cardiovascular health. The model’s performance was validated on a 40% held-out test set, achieving state-of-the-art results, especially in regions with small sample sizes. Furthermore, we divided the population into three subgroups based on the difference between predicted vascular age and chronological age: less than -10 years, between -10 and 10 years, and greater than 10 years. We analyzed the relationship between predicted vascular age and several cardiovascular events over a follow-up period of up to 10 years, including death, coronary heart disease, and heart failure. Our results indicate that the predicted vascular age has significant potential to reflect an individual’s cardiovascular health status. Our code will be available at this https URL.
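
The three-way subgrouping described in this abstract can be sketched in a few lines. This is only an illustration, not the authors' code: the function name `age_gap_group`, the group labels, and the decision to count a gap of exactly ±10 years as the middle group are all assumptions, since the abstract does not specify boundary handling.

```python
def age_gap_group(predicted_age: float, chronological_age: float,
                  threshold: float = 10.0) -> str:
    """Assign a subject to one of three subgroups by vascular age gap.

    The gap is predicted vascular age minus chronological age, in years.
    Labels and boundary handling are hypothetical choices for this sketch.
    """
    gap = predicted_age - chronological_age
    if gap < -threshold:
        return "younger"   # predicted age more than 10 years below actual age
    if gap > threshold:
        return "older"     # predicted age more than 10 years above actual age
    return "aligned"       # gap within [-10, 10] years (inclusive by assumption)


if __name__ == "__main__":
    for predicted, actual in [(45.0, 60.0), (62.0, 60.0), (72.5, 60.0)]:
        print(predicted, actual, age_gap_group(predicted, actual))
```

Each subgroup could then be followed up separately for outcomes such as coronary heart disease or heart failure, which is the kind of analysis the abstract reports.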

[CV-36] Brightearth roads: Towards fully automatic road network extraction from satellite imagery

Link: https://arxiv.org/abs/2406.14941
Authors: Liuyun Duan (LCT), Willard Mapurisa (LCT), Maxime Leras (LCT), Leigh Lotter (LCT), Yuliya Tarabalka (LCT)
Keywords: comprises intricately designed, intricately designed structures, topology comprises intricately, automatically reconstructing road, network topology comprises
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:The modern road network topology comprises intricately designed structures that introduce complexity when automatically reconstructing road networks. While open resources like OpenStreetMap (OSM) offer road networks with well-defined topology, they may not always be up to date worldwide. In this paper, we propose a fully automated pipeline for extracting road networks from very-high-resolution (VHR) satellite imagery. Our approach directly generates road line-strings that are seamlessly connected and precisely positioned. The process involves three key modules: a CNN-based neural network for road segmentation, a graph optimization algorithm to convert road predictions into vector line-strings, and a machine learning model for classifying road materials. Compared to OSM data, our results demonstrate significant potential for providing the latest road layouts and precise positions of road segments.

[CV-37] Gaussian-Informed Continuum for Physical Property Identification and Simulation

Link: https://arxiv.org/abs/2406.14927
Authors: Junhao Cai, Yuji Yang, Weihao Yuan, Yisheng He, Zilong Dong, Liefeng Bo, Hui Cheng, Qifeng Chen
Keywords: estimating physical properties, physical property estimation, system identification, visual observations, paper studies
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 19 pages, 8 figures

Abstract:This paper studies the problem of estimating physical properties (system identification) through visual observations. To facilitate geometry-aware guidance in physical property estimation, we introduce a novel hybrid framework that leverages 3D Gaussian representation to not only capture explicit shapes but also enable the simulated continuum to deduce implicit shapes during training. We propose a new dynamic 3D Gaussian framework based on motion factorization to recover the object as 3D Gaussian point sets across different time states. Furthermore, we develop a coarse-to-fine filling strategy to generate the density fields of the object from the Gaussian reconstruction, allowing for the extraction of object continuums along with their surfaces and the integration of Gaussian attributes into these continuums. In addition to the extracted object surfaces, the Gaussian-informed continuum also enables the rendering of object masks during simulations, serving as implicit shape guidance for physical property estimation. Extensive experimental evaluations demonstrate that our pipeline achieves state-of-the-art performance across multiple benchmarks and metrics. Additionally, we illustrate the effectiveness of the proposed method through real-world demonstrations, showcasing its practical utility. Our project page is at this https URL.

[CV-38] DiPEx: Dispersing Prompt Expansion for Class-Agnostic Object Detection

Link: https://arxiv.org/abs/2406.14924
Authors: Jia Syuen Lim, Zhuoxiao Chen, Mahsa Baktashmotlagh, Zhi Chen, Xin Yu, Zi Huang, Yadan Luo
Keywords: downstream vision tasks, object detection, Class-agnostic object detection, Dispersing Prompt Expansion, prompts
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages

Abstract:Class-agnostic object detection (OD) can be a cornerstone or a bottleneck for many downstream vision tasks. Despite considerable advancements in bottom-up and multi-object discovery methods that leverage basic visual cues to identify salient objects, consistently achieving a high recall rate remains difficult due to the diversity of object types and their contextual complexity. In this work, we investigate using vision-language models (VLMs) to enhance object detection via a self-supervised prompt learning strategy. Our initial findings indicate that manually crafted text queries often result in undetected objects, primarily because detection confidence diminishes when the query words exhibit semantic overlap. To address this, we propose a Dispersing Prompt Expansion (DiPEx) approach. DiPEx progressively learns to expand a set of distinct, non-overlapping hyperspherical prompts to enhance recall rates, thereby improving performance in downstream tasks such as out-of-distribution OD. Specifically, DiPEx initiates the process by self-training generic parent prompts and selecting the one with the highest semantic uncertainty for further expansion. The resulting child prompts are expected to inherit semantics from their parent prompts while capturing more fine-grained semantics. We apply dispersion losses to ensure high inter-class discrepancy among child prompts while preserving semantic consistency between parent-child prompt pairs. To prevent excessive growth of the prompt sets, we utilize the maximum angular coverage (MAC) of the semantic space as a criterion for early termination. We demonstrate the effectiveness of DiPEx through extensive class-agnostic OD and OOD-OD experiments on MS-COCO and LVIS, surpassing other prompting methods by up to 20.1% in AR and achieving a 21.3% AP improvement over SAM. The code is available at this https URL.

[CV-39] LLM2FEA: Discover Novel Designs with Generative Evolutionary Multitasking

Link: https://arxiv.org/abs/2406.14917
Authors: Melvin Wong, Jiao Liu, Thiago Rios, Stefan Menzel, Yew Soon Ong
Keywords: generative artificial intelligence, high-quality images, rapid research, research and development, artificial intelligence
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

Abstract:The rapid research and development of generative artificial intelligence has enabled the generation of high-quality images, text, and 3D models from text prompts. This advancement impels an inquiry into whether these models can be leveraged to create digital artifacts for both creative and engineering applications. Drawing on innovative designs from other domains may be one answer to this question, much like the historical practice of “bionics”, where humans have sought inspiration from nature’s exemplary designs. This raises the intriguing possibility of using generative models to simultaneously tackle design tasks across multiple domains, facilitating cross-domain learning and resulting in a series of innovative design solutions. In this paper, we propose LLM2FEA as the first attempt to discover novel designs in generative models by transferring knowledge across multiple domains. By utilizing a multi-factorial evolutionary algorithm (MFEA) to drive a large language model, LLM2FEA integrates knowledge from various fields to generate prompts that guide the generative model in discovering novel and practical objects. Experimental results in the context of 3D aerodynamic design verify the discovery capabilities of the proposed LLM2FEA. The designs generated by LLM2FEA not only satisfy practicality requirements to a certain degree but also feature novel and aesthetically pleasing shapes, demonstrating the potential applications of LLM2FEA in discovery tasks.

[CV-40] Demonstrating the Efficacy of Kolmogorov-Arnold Networks in Vision Tasks

Link: https://arxiv.org/abs/2406.14916
Authors: Minjong Cheon
Keywords: Kolmogorov-Arnold Network, deep learning, multilayer projections, realm of deep, vision tasks
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:In the realm of deep learning, the Kolmogorov-Arnold Network (KAN) has emerged as a potential alternative to multilayer perceptrons (MLPs). However, its applicability to vision tasks has not been extensively validated. In our study, we demonstrated the effectiveness of KAN for vision tasks through multiple trials on the MNIST, CIFAR10, and CIFAR100 datasets, using a training batch size of 32. Our results showed that while KAN outperformed the original MLP-Mixer on CIFAR10 and CIFAR100, it performed slightly worse than the state-of-the-art ResNet-18. These findings suggest that KAN holds significant promise for vision tasks, and further modifications could enhance its performance in future evaluations. Our contributions are threefold: first, we showcase the efficiency of KAN-based algorithms for visual tasks; second, we provide extensive empirical assessments across various vision benchmarks, comparing KAN’s performance with MLP-Mixer, CNNs, and Vision Transformers (ViT); and third, we pioneer the use of natural KAN layers in visual tasks, addressing a gap in previous research. This paper lays the foundation for future studies on KANs, highlighting their potential as a reliable alternative for image classification tasks.

[CV-41] FC3DNet: A Fully Connected Encoder-Decoder for Efficient Demoireing

Link: https://arxiv.org/abs/2406.14912
Authors: Zhibo Du, Long Peng, Yang Wang, Yang Cao, Zheng-Jun Zha
Keywords: taking photos, screens, photos of screens
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICIP 2024

Abstract:Moiré patterns are commonly seen when taking photos of screens. Camera devices usually have limited hardware performance but take high-resolution photos. However, users are sensitive to the photo processing time, which presents a rarely considered efficiency challenge for demoiréing methods. To balance the network speed and quality of results, we propose a Fully Connected enCoder-deCoder based Demoiréing Network (FC3DNet). FC3DNet utilizes features with multiple scales in each stage of the decoder for comprehensive information, which contains long-range patterns as well as various local moiré styles, both of which are crucial aspects in demoiréing. Besides, to make full use of multiple features, we design a Multi-Feature Multi-Attention Fusion (MFMAF) module to weigh the importance of each feature and compress them for efficiency. These designs enable our network to achieve performance comparable to state-of-the-art (SOTA) methods in real-world datasets while utilizing only a fraction of parameters, FLOPs, and runtime.

[CV-42] MOS: Model Synergy for Test-Time Adaptation on LiDAR-Based 3D Object Detection

Link: https://arxiv.org/abs/2406.14878
Authors: Zhuoxiao Chen, Junjie Meng, Mahsa Baktashmotlagh, Zi Huang, Yadan Luo
Keywords: point clouds originating, unseen test point, test point clouds, object detection, detection systems
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)

Abstract:LiDAR-based 3D object detection is pivotal across many applications, yet the performance of such detection systems often degrades after deployment, especially when faced with unseen test point clouds originating from diverse locations or subjected to corruption. In this work, we introduce a new online adaptation framework for detectors named Model Synergy (MOS). Specifically, MOS dynamically assembles best-fit supermodels for each test batch from a bank of historical checkpoints, leveraging long-term knowledge to guide model updates without forgetting. The model assembly is directed by the proposed synergy weights (SW), employed for weighted averaging of the selected checkpoints to minimize redundancy in the composite supermodel. These weights are calculated by evaluating the similarity of predicted bounding boxes on test data and the feature independence among model pairs in the bank. To maintain an informative yet compact model bank, we pop out checkpoints with the lowest average SW scores and insert newly updated model weights. Our method was rigorously tested against prior test-time domain adaptation strategies on three datasets and under eight types of corruptions, demonstrating its superior adaptability to changing scenes and conditions. Remarkably, our approach achieved a 67.3% increase in performance in a complex “cross-corruption” scenario, which involves cross-dataset inconsistencies and real-world scene corruptions, providing a more realistic testbed of adaptation capabilities. The code is available at this https URL.
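
The core assembly step in this abstract, a weighted average of selected checkpoints, can be illustrated with a minimal sketch. It assumes the synergy weights are already given; how MOS derives them from box similarity and feature independence is not reproduced here. The function name `average_checkpoints` and the representation of a checkpoint as a dict of flat float lists are hypothetical simplifications.

```python
from typing import Dict, List

Checkpoint = Dict[str, List[float]]  # parameter name -> flattened values


def average_checkpoints(checkpoints: List[Checkpoint],
                        weights: List[float]) -> Checkpoint:
    """Merge checkpoints into one model by normalized weighted averaging.

    `weights` stands in for the synergy weights; they are treated as an
    input here and are normalized so that they sum to one.
    """
    if not checkpoints or len(checkpoints) != len(weights):
        raise ValueError("need one weight per checkpoint")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    norm = [w / total for w in weights]

    merged: Checkpoint = {}
    for name, reference in checkpoints[0].items():
        merged[name] = [
            sum(w * ckpt[name][i] for w, ckpt in zip(norm, checkpoints))
            for i in range(len(reference))
        ]
    return merged


if __name__ == "__main__":
    bank = [{"w": [1.0, 2.0]}, {"w": [3.0, 6.0]}]
    print(average_checkpoints(bank, [1.0, 3.0]))
```

In a real detector the values would be tensors rather than Python lists, but the averaging rule is the same per parameter.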

[CV-43] TraceNet: Segment one thing efficiently

Link: https://arxiv.org/abs/2406.14874
Authors: Mingyuan Wu, Zichuan Liu, Haozhen Zheng, Hongpeng Guo, Bo Chen, Xin Lu, Klara Nahrstedt
Keywords: mobile imaging applications, single instance segmentation, single instance, imaging applications, instance segmentation
Subjects: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Efficient single instance segmentation is essential for unlocking features in mobile imaging applications, such as capture or editing. Existing on-the-fly mobile imaging applications scope the segmentation task to portraits or the salient subject due to computational constraints. Instance segmentation, despite its recent developments towards efficient networks, is still heavy due to the cost of computation on the entire image to identify all instances. To address this, we propose and formulate a one-tap-driven single instance segmentation task that segments a single instance selected by a user via a positive tap. This task, in contrast to the broader task of segmenting anything as suggested in the Segment Anything Model (SAM), focuses on efficient segmentation of a single instance specified by the user. To solve this problem, we present TraceNet, which explicitly locates the selected instance by way of receptive field tracing. TraceNet identifies image regions that are related to the user tap, and heavy computations are only performed on selected regions of the image. Therefore, overall computation cost and memory consumption are reduced during inference. We evaluate the performance of TraceNet on instance IoU averaged over taps and the proportion of the region that a user tap can fall into for a high-quality single-instance mask. Experimental results on MS-COCO and LVIS demonstrate the effectiveness and efficiency of the proposed approach. TraceNet can jointly achieve efficiency and interactivity, filling the gap between the need for efficient mobile inference and the recent research trend towards multimodal and interactive segmentation models.

[CV-44] LatentExplainer: Explaining Latent Representations in Deep Generative Models with Multi-modal Foundation Models

Link: https://arxiv.org/abs/2406.14862
Authors: Mengdan Zhu, Raasikh Kanjiani, Jiahui Lu, Andrew Choi, Qirui Ye, Liang Zhao
Keywords: Deep generative models, latent variables, leveraging latent variables, generate high-quality samples, generative models
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Deep generative models like VAEs and diffusion models have advanced various generation tasks by leveraging latent variables to learn data distributions and generate high-quality samples. Despite the field of explainable AI making strides in interpreting machine learning models, understanding latent variables in generative models remains challenging. This paper introduces LatentExplainer, a framework for automatically generating semantically meaningful explanations of latent variables in deep generative models. LatentExplainer tackles three main challenges: inferring the meaning of latent variables, aligning explanations with inductive biases, and handling varying degrees of explainability. By perturbing latent variables and interpreting changes in generated data, the framework provides a systematic approach to understanding and controlling the data generation process, enhancing the transparency and interpretability of deep generative models. We evaluate our proposed method on several real-world and synthetic datasets, and the results demonstrate superior performance in generating high-quality explanations of latent variables.

[CV-45] Accessible At-Home Detection of Parkinson's Disease via Multi-task Video Analysis

Link: https://arxiv.org/abs/2406.14856
Authors: Md Saiful Islam, Tariq Adnan, Jan Freyberg, Sangwu Lee, Abdelrahman Abdelkader, Meghan Pawlik, Cathe Schwartz, Karen Jaffe, Ruth B. Schneider, E Ray Dorsey, Ehsan Hoque
Keywords: neurological care leads, detect Parkinson disease, Parkinson disease, Monte Carlo Dropout, unidentified and untreated
Subjects: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)

Abstract:Limited access to neurological care leads to missed diagnoses of Parkinson’s disease (PD), leaving many individuals unidentified and untreated. We trained a novel neural network-based fusion architecture to detect Parkinson’s disease (PD) by analyzing features extracted from webcam recordings of three tasks: finger tapping, facial expression (smiling), and speech (uttering a sentence containing all letters of the alphabet). Additionally, the model incorporated Monte Carlo Dropout to improve prediction accuracy by considering uncertainties. The study participants (n = 845, 272 with PD) were randomly split into three sets: 60% for training, 20% for model selection (hyper-parameter tuning), and 20% for final performance evaluation. The dataset consists of 1102 sessions, each session containing videos of all three tasks. Our proposed model achieved significantly better accuracy, area under the ROC curve (AUROC), and sensitivity at non-inferior specificity compared to any single-task model. Withholding uncertain predictions further boosted the performance, achieving 88.0% (95% CI: 87.7% - 88.4%) accuracy, 93.0% (92.8% - 93.2%) AUROC, 79.3% (78.4% - 80.2%) sensitivity, and 92.6% (92.3% - 92.8%) specificity, at the expense of not being able to predict for 2.3% (2.0% - 2.6%) data. Further analysis suggests that the trained model does not exhibit any detectable bias across sex and ethnic subgroups and is most effective for individuals aged between 50 and 80. This accessible, low-cost approach requiring only an internet-enabled device with a webcam and microphone paves the way for convenient PD screening at home, particularly in regions with limited access to clinical specialists.
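
The idea of withholding uncertain predictions can be sketched as follows. This is a hedged illustration rather than the paper's procedure: it assumes the model has already been run several times with dropout active, uses the standard deviation of the sampled probabilities as the uncertainty measure, and the name `predict_or_withhold` and both thresholds are hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Optional


def predict_or_withhold(sampled_probs: List[float],
                        max_std: float = 0.1,
                        decision_threshold: float = 0.5) -> Optional[int]:
    """Return 1 (positive), 0 (negative), or None when too uncertain.

    `sampled_probs` holds positive-class probabilities from repeated
    stochastic forward passes (Monte Carlo Dropout). The spread of these
    samples is used as the uncertainty estimate; this choice and the
    threshold values are assumptions made for this sketch.
    """
    if not sampled_probs:
        raise ValueError("need at least one sampled probability")
    if pstdev(sampled_probs) > max_std:
        return None  # withhold: the passes disagree too much
    return int(mean(sampled_probs) >= decision_threshold)


if __name__ == "__main__":
    print(predict_or_withhold([0.90, 0.88, 0.92]))  # confident positive
    print(predict_or_withhold([0.10, 0.90, 0.50]))  # withheld
```

Raising `max_std` withholds fewer cases at the cost of accepting less certain predictions, which mirrors the trade-off the abstract reports between accuracy and the share of data left unpredicted.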

[CV-46] Six-CD: Benchmarking Concept Removals for Benign Text-to-image Diffusion Models

Link: https://arxiv.org/abs/2406.14855
Authors: Jie Ren, Kangrui Chen, Yingqian Cui, Shenglai Zeng, Hui Liu, Yue Xing, Jiliang Tang, Lingjuan Lyu
Keywords: shown exceptional capabilities, diffusion models, generating images, shown exceptional, exceptional capabilities
Subjects: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)

Abstract:Text-to-image (T2I) diffusion models have shown exceptional capabilities in generating images that closely correspond to textual prompts. However, the advancement of T2I diffusion models presents significant risks, as the models could be exploited for malicious purposes, such as generating images with violence or nudity, or creating unauthorized portraits of public figures in inappropriate contexts. To mitigate these risks, concept removal methods have been proposed. These methods aim to modify diffusion models to prevent the generation of malicious and unwanted concepts. Despite these efforts, existing research faces several challenges: (1) a lack of consistent comparisons on a comprehensive dataset, (2) ineffective prompts in harmful and nudity concepts, (3) overlooked evaluation of the ability to generate the benign part within prompts containing malicious concepts. To address these gaps, we propose to benchmark the concept removal methods by introducing a new dataset, Six-CD, along with a novel evaluation metric. In this benchmark, we conduct a thorough evaluation of concept removals, with the experimental observations and discussions offering valuable insights in the field.

[CV-47] PEANO-ViT: Power-Efficient Approximations of Non-Linearities in Vision Transformers

Link: https://arxiv.org/abs/2406.14854
Authors: Mohammad Erfan Sadeghi, Arash Fayyazi, Seyedarmin Azizi, Massoud Pedram
Keywords: Field-Programmable Gate Arrays, Error Linear Unit, Gaussian Error Linear, specially Field-Programmable Gate, Gate Arrays
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)

Abstract:The deployment of Vision Transformers (ViTs) on hardware platforms, especially Field-Programmable Gate Arrays (FPGAs), presents many challenges, which are mainly due to the substantial computational and power requirements of their non-linear functions, notably layer normalization, softmax, and Gaussian Error Linear Unit (GELU). These critical functions pose significant obstacles to efficient hardware implementation due to their complex mathematical operations and the inherent resource count and architectural limitations of FPGAs. PEANO-ViT offers a novel approach to streamlining the implementation of the layer normalization layer by introducing a division-free technique that simultaneously approximates the division and square root function. Additionally, PEANO-ViT provides a multi-scale division strategy to eliminate division operations in the softmax layer, aided by a Padé-based approximation for the exponential function. Finally, PEANO-ViT introduces a piece-wise linear approximation for the GELU function, carefully designed to bypass the computationally intensive operations associated with GELU. In our comprehensive evaluations, PEANO-ViT exhibits minimal accuracy degradation (≤ 0.5% for DeiT-B) while significantly enhancing power efficiency, achieving improvements of 1.91x, 1.39x, 8.01x for layer normalization, softmax, and GELU, respectively. This improvement is achieved through substantial reductions in DSP, LUT, and register counts for these non-linear operations. Consequently, PEANO-ViT enables efficient deployment of Vision Transformers on resource- and power-constrained FPGA platforms.
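
To make the piece-wise linear idea concrete, here is a generic sketch of approximating GELU with straight-line segments between evenly spaced knots. It is not PEANO-ViT's actual segmentation: the range of -4 to 4, the step of 0.5, and the names `gelu_pwl`, `KNOTS`, `VALUES`, and `SLOPES` are all assumptions chosen for illustration. Slopes are precomputed so that evaluation needs only a lookup, one multiply, and one add.

```python
import math
from bisect import bisect_right


def gelu(x: float) -> float:
    """Exact GELU, x * Phi(x), using the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


# Hypothetical table: evenly spaced knots on [-4, 4] with step 0.5.
LOW, HIGH, STEP = -4.0, 4.0, 0.5
KNOTS = [LOW + i * STEP for i in range(int((HIGH - LOW) / STEP) + 1)]
VALUES = [gelu(k) for k in KNOTS]
# Slopes are computed once, so no division happens at evaluation time.
SLOPES = [(VALUES[i + 1] - VALUES[i]) / STEP for i in range(len(KNOTS) - 1)]


def gelu_pwl(x: float) -> float:
    """Piece-wise linear GELU: table lookup plus one multiply-add."""
    if x <= LOW:
        return 0.0          # GELU is close to 0 for very negative inputs
    if x >= HIGH:
        return x            # GELU is close to the identity for large inputs
    i = bisect_right(KNOTS, x) - 1   # index of the segment containing x
    return VALUES[i] + SLOPES[i] * (x - KNOTS[i])


if __name__ == "__main__":
    grid = [i / 100.0 for i in range(-600, 601)]
    worst = max(abs(gelu(x) - gelu_pwl(x)) for x in grid)
    print(f"max abs error on [-6, 6]: {worst:.4f}")
```

A finer step or non-uniform knots placed where the curvature is highest would reduce the error further, at the cost of a larger table.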

[CV-48] Is A Picture Worth A Thousand Words? Delving Into Spatial Reasoning for Vision Language Models

Link: https://arxiv.org/abs/2406.14852
Authors: Jiayu Wang, Yifei Ming, Zhenmei Shi, Vibhav Vineet, Xin Wang, Neel Joshi
Keywords: Large language models, demonstrated remarkable performance, Large language, tasks and domains, demonstrated remarkable
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Large language models (LLMs) and vision-language models (VLMs) have demonstrated remarkable performance across a wide range of tasks and domains. Despite this promise, spatial understanding and reasoning – a fundamental component of human cognition – remains under-explored. We develop novel benchmarks that cover diverse aspects of spatial reasoning such as relationship understanding, navigation, and counting. We conduct a comprehensive evaluation of competitive language and vision-language models. Our findings reveal several counter-intuitive insights that have been overlooked in the literature: (1) Spatial reasoning poses significant challenges where competitive models can fall behind random guessing; (2) Despite additional visual input, VLMs often under-perform compared to their LLM counterparts; (3) When both textual and visual information is available, multi-modal language models become less reliant on visual information if sufficient textual clues are provided. Additionally, we demonstrate that leveraging redundancy between vision and text can significantly enhance model performance. We hope our study will inform the development of multimodal models to improve spatial intelligence and further close the gap with human intelligence.

[CV-49] Fair Text to Medical Image Diffusion Model with Subgroup Distribution Aligned Tuning

Link: https://arxiv.org/abs/2406.14847
Authors: Xu Han, Fangfang Fan, Jingzhao Rong, Xiaofeng Liu
Keywords: patient status description, specific patient status, latent diffusion model, medical imaging data, underlying appearance distribution
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:The text to medical image (T2MedI) with latent diffusion model has great potential to alleviate the scarcity of medical imaging data and explore the underlying appearance distribution of lesions in a specific patient status description. However, as with text-to-natural-image models, we show that the T2MedI model can also be biased toward some subgroups and overlook the minority ones in the training set. In this work, we first build a T2MedI model based on the pre-trained Imagen model, which has the fixed contrastive language-image pre-training (CLIP) text encoder, while its decoder has been fine-tuned on medical images from the Radiology Objects in COntext (ROCO) dataset. Its gender bias is analyzed qualitatively and quantitatively. To address this issue, we propose to fine-tune the T2MedI toward the target application dataset to align their sensitive subgroups distribution probability. Specifically, the alignment loss for fine-tuning is guided by an off-the-shelf sensitivity-subgroup classifier to match the classification probability between the generated images and the expected target dataset. In addition, the image quality is maintained by a CLIP-consistency regularization term following a knowledge distillation scheme. For evaluation, we set the target dataset to be enhanced as the BraTS18 dataset, and trained a brain magnetic resonance (MR) slice-based gender classifier from it. With our method, the generated MR image can markedly reduce the inconsistency with the gender proportion in the BraTS18 dataset.

[CV-50] CLIP-Decoder : ZeroShot Multilabel Classification using Multimodal CLIP Aligned Representation

Link: https://arxiv.org/abs/2406.14830
Authors: Muhammad Ali, Salman Khan
Keywords: essential task utilized, real-world applications, wide variety, variety of real-world, Multi-label classification
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ICCVW-VLAR

Abstract:Multi-label classification is an essential task utilized in a wide variety of real-world applications. Multi-label zero-shot learning is a method for classifying images into multiple unseen categories for which no training data is available, while in general zero-shot situations, the test set may include observed classes. The CLIP-Decoder is a novel method based on the state-of-the-art ML-Decoder attention-based head. We introduce multi-modal representation learning in CLIP-Decoder, utilizing the text encoder to extract text features and the image encoder for image feature extraction. Furthermore, we minimize semantic mismatch by aligning image and word embeddings in the same dimension and comparing their respective representations using a combined loss, which comprises classification loss and CLIP loss. This strategy outperforms other methods and we achieve cutting-edge results on zero-shot multilabel classification tasks using CLIP-Decoder. Our method achieves an absolute increase of 3.9% in performance compared to existing methods for zero-shot learning multi-label classification tasks. Additionally, in the generalized zero-shot learning multi-label classification task, our method shows an impressive increase of almost 2.3%.

[CV-51] SAM-EG: Segment Anything Model with Egde Guidance framework for efficient Polyp Segmentation

Link: https://arxiv.org/abs/2406.14819
Authors: Quoc-Huy Trinh, Hai-Dang Nguyen, Bao-Tram Nguyen Ngoc, Debesh Jha, Ulas Bagci, Minh-Triet Tran
Keywords: prompted numerous proposed, segmented masks, critical concern, prompted numerous, aimed at enhancing
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Polyp segmentation, a critical concern in medical imaging, has prompted numerous proposed methods aimed at enhancing the quality of segmented masks. While current state-of-the-art techniques produce impressive results, the size and computational cost of these models pose challenges for practical industry applications. Recently, the Segment Anything Model (SAM) has been proposed as a robust foundation model, showing promise for adaptation to medical image segmentation. Inspired by this concept, we propose SAM-EG, a framework that guides small segmentation models for polyp segmentation to address the computation cost challenge. Additionally, in this study, we introduce the Edge Guiding module, which integrates edge information into image features to help the segmentation model address the boundary issues that current segmentation models face in this task. Through extensive experiments, our small models showcase their efficacy by achieving competitive results with state-of-the-art methods, offering a promising approach to developing compact models with high accuracy for polyp segmentation and in the broader field of medical imaging.

[CV-52] Latent diffusion models for parameterization and data assimilation of facies-based geomodels

Link: https://arxiv.org/abs/2406.14815
Authors: Guido Di Federico, Louis J. Durlofsky
Keywords: Geological parameterization entails, latent diffusion model, porosity and permeability, Diffusion models, entails the representation
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Geophysics (physics.geo-ph)

Abstract:Geological parameterization entails the representation of a geomodel using a small set of latent variables and a mapping from these variables to grid-block properties such as porosity and permeability. Parameterization is useful for data assimilation (history matching), as it maintains geological realism while reducing the number of variables to be determined. Diffusion models are a new class of generative deep-learning procedures that have been shown to outperform previous methods, such as generative adversarial networks, for image generation tasks. Diffusion models are trained to “denoise”, which enables them to generate new geological realizations from input fields characterized by random noise. Latent diffusion models, which are the specific variant considered in this study, provide dimension reduction through use of a low-dimensional latent variable. The model developed in this work includes a variational autoencoder for dimension reduction and a U-net for the denoising process. Our application involves conditional 2D three-facies (channel-levee-mud) systems. The latent diffusion model is shown to provide realizations that are visually consistent with samples from geomodeling software. Quantitative metrics involving spatial and flow-response statistics are evaluated, and general agreement between the diffusion-generated models and reference realizations is observed. Stability tests are performed to assess the smoothness of the parameterization method. The latent diffusion model is then used for ensemble-based data assimilation. Two synthetic “true” models are considered. Significant uncertainty reduction, posterior P10-P90 forecasts that generally bracket observed data, and consistent posterior geomodels, are achieved in both cases.

[CV-53] Relighting Scenes with Object Insertions in Neural Radiance Fields

Link: https://arxiv.org/abs/2406.14806
Authors: Xuening Zhu, Renjiao Yi, Xin Wen, Chenyang Zhu, Kai Xu
Keywords: commonly utilized applications, augmented reality, commonly utilized, utilized applications, inserting virtual objects
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: 14 pages

Abstract:The insertion of objects into a scene and relighting are commonly utilized applications in augmented reality (AR). Previous methods focused on inserting virtual objects using CAD models or real objects from single-view images, resulting in highly limited AR application scenarios. We propose a novel NeRF-based pipeline for inserting object NeRFs into scene NeRFs, enabling novel view synthesis and realistic relighting, supporting physical interactions like casting shadows onto each other, from two sets of images depicting the object and scene. The lighting environment is in a hybrid representation of Spherical Harmonics and Spherical Gaussians, representing both high- and low-frequency lighting components very well, and supporting non-Lambertian surfaces. Specifically, we leverage the benefits of volume rendering and introduce an innovative approach for efficient shadow rendering by comparing the depth maps between the camera view and the light source view and generating vivid soft shadows. The proposed method achieves realistic relighting effects in extensive experimental evaluations.
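
The depth-comparison idea behind the shadow rendering step can be sketched in a few lines. This is only an illustration under stated assumptions, not the paper's pipeline: the depth map is a plain 2D list rendered from the light's viewpoint, and the soft edge comes from averaging the comparison over a small window (percentage-closer filtering, a standard graphics technique swapped in here). The function name, `bias`, and `radius` are hypothetical.

```python
from typing import List


def shadow_factor(
    light_depth_map: List[List[float]],
    row: int,
    col: int,
    point_depth: float,
    bias: float = 1e-3,
    radius: int = 1,
) -> float:
    """Fraction of nearby light-view texels that occlude the point.

    0.0 means fully lit, 1.0 means fully shadowed. `point_depth` is the
    distance of the shaded point from the light; a texel occludes the point
    when the light-view depth stored there is smaller (closer to the light).
    """
    rows, cols = len(light_depth_map), len(light_depth_map[0])
    occluded = 0
    total = 0
    for r in range(max(0, row - radius), min(rows, row + radius + 1)):
        for c in range(max(0, col - radius), min(cols, col + radius + 1)):
            total += 1
            # The small bias avoids self-shadowing from depth quantization.
            if point_depth > light_depth_map[r][c] + bias:
                occluded += 1
    return occluded / total


# Toy light-view depth map: an occluder (depth 1.0) covers the left column.
depth_map = [
    [1.0, 5.0, 5.0],
    [1.0, 5.0, 5.0],
    [1.0, 5.0, 5.0],
]
soft = shadow_factor(depth_map, 1, 1, point_depth=3.0)            # partial shadow
hard_lit = shadow_factor(depth_map, 1, 1, point_depth=3.0, radius=0)
hard_dark = shadow_factor(depth_map, 1, 0, point_depth=3.0, radius=0)
```

With `radius=0` this reduces to a hard shadow test; a larger window trades sharpness for smoother penumbrae.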

[CV-54] Camera-Invariant Meta-Learning Network for Single-Camera-Training Person Re-identification

Link: https://arxiv.org/abs/2406.14797
Authors: Jiangbo Pei, Zhuqing Jiang, Aidong Men, Haiying Wang, Haiyong Luo, Shiping Wen
Keywords: SCT re-ID, SCT, SCT datasets, aims to train, re-ID
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Single-camera-training person re-identification (SCT re-ID) aims to train a re-ID model using SCT datasets where each person appears in only one camera. The main challenge of SCT re-ID is to learn camera-invariant feature representations without cross-camera same-person (CCSP) data as supervision. Previous methods address it by assuming that the most similar person should be found in another camera. However, this assumption is not guaranteed to be correct. In this paper, we propose a Camera-Invariant Meta-Learning Network (CIMN) for SCT re-ID. CIMN assumes that the camera-invariant feature representations should be robust to camera changes. To this end, we split the training data into a meta-train set and a meta-test set based on camera IDs and perform a cross-camera simulation via a meta-learning strategy, aiming to enforce the representations learned from the meta-train set to be robust to the meta-test set. With the cross-camera simulation, CIMN can learn camera-invariant and identity-discriminative representations even when there are no CCSP data. However, this simulation also causes the separation of the meta-train set and the meta-test set, which ignores some beneficial relations between them. Thus, we introduce three losses: meta triplet loss, meta classification loss, and meta camera alignment loss, to leverage the ignored relations. The experiment results demonstrate that our method achieves comparable performance with and without CCSP data, and outperforms the state-of-the-art methods on SCT re-ID benchmarks. In addition, it is also effective in improving the domain generalization ability of the model.
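
The camera-based split described in this abstract can be pictured with a minimal sketch. The sample layout `(image_id, person_id, camera_id)` and the function name are assumptions made for illustration; the abstract does not say how cameras are assigned to each set or how often the split is re-drawn.

```python
from typing import List, Set, Tuple

# Hypothetical sample layout: (image_id, person_id, camera_id).
Sample = Tuple[str, int, int]


def split_by_camera(
    samples: List[Sample], meta_test_cameras: Set[int]
) -> Tuple[List[Sample], List[Sample]]:
    """Split samples so that meta-train and meta-test share no camera."""
    meta_train = [s for s in samples if s[2] not in meta_test_cameras]
    meta_test = [s for s in samples if s[2] in meta_test_cameras]
    return meta_train, meta_test


samples = [
    ("img_a", 0, 1),
    ("img_b", 1, 1),
    ("img_c", 2, 2),
    ("img_d", 3, 3),
]
meta_train, meta_test = split_by_camera(samples, meta_test_cameras={3})
```

Because each person appears under only one camera in an SCT dataset, splitting by camera also separates identities, which is what makes the held-out set behave like an unseen camera.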

[CV-55] Evaluating Numerical Reasoning in Text-to-Image Models

Link: https://arxiv.org/abs/2406.14774
Authors: Ivana Kajić, Olivia Wiles, Isabela Albuquerque, Matthias Bauer, Su Wang, Jordi Pont-Tuset, Aida Nematzadeh
Keywords: faithfully depict concepts, producing high-quality images, natural language, capable of producing, producing high-quality
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Text-to-image generative models are capable of producing high-quality images that often faithfully depict concepts described using natural language. In this work, we comprehensively evaluate a range of text-to-image models on numerical reasoning tasks of varying difficulty, and show that even the most advanced models have only rudimentary numerical skills. Specifically, their ability to correctly generate an exact number of objects in an image is limited to small numbers, it is highly dependent on the context the number term appears in, and it deteriorates quickly with each successive number. We also demonstrate that models have poor understanding of linguistic quantifiers (such as “a few” or “as many as”), the concept of zero, and struggle with more advanced concepts such as partial quantities and fractional representations. We bundle prompts, generated images and human annotations into GeckoNum, a novel benchmark for evaluation of numerical reasoning.

[CV-56] Regularized Distribution Matching Distillation for One-step Unpaired Image-to-Image Translation

Link: https://arxiv.org/abs/2406.14762
Authors: Denis Rakitin, Ivan Shchekotov, Dmitry Vetrov
Keywords: Distribution Matching Distillation, distillation methods aim, Regularized Distribution Matching, efficient one-step generators, Distribution Matching
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:Diffusion distillation methods aim to compress diffusion models into efficient one-step generators while trying to preserve quality. Among them, Distribution Matching Distillation (DMD) offers a suitable framework for training general-form one-step generators, applicable beyond unconditional generation. In this work, we introduce its modification, called Regularized Distribution Matching Distillation, applicable to unpaired image-to-image (I2I) problems. We demonstrate its empirical performance in application to several translation tasks, including 2D examples and I2I between different image datasets, where it performs on par with or better than multi-step diffusion baselines.

[CV-57] This Looks Better than That: Better Interpretable Models with ProtoPNeXt

Link: https://arxiv.org/abs/2406.14675
Authors: Frank Willard, Luke Moffett, Emmanuel Mokel, Jon Donnelly, Stark Guo, Julia Yang, Giyoung Kim, Alina Jade Barnett, Cynthia Rudin
Keywords: popular interpretable alternative, black-box deep learning, deep learning models, computer vision, popular interpretable
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Prototypical-part models are a popular interpretable alternative to black-box deep learning models for computer vision. However, they are difficult to train, with high sensitivity to hyperparameter tuning, inhibiting their application to new datasets and our understanding of which methods truly improve their performance. To facilitate the careful study of prototypical-part networks (ProtoPNets), we create a new framework for integrating components of prototypical-part models – ProtoPNeXt. Using ProtoPNeXt, we show that applying Bayesian hyperparameter tuning and an angular prototype similarity metric to the original ProtoPNet is sufficient to produce new state-of-the-art accuracy for prototypical-part models on CUB-200 across multiple backbones. We further deploy this framework to jointly optimize for accuracy and prototype interpretability as measured by metrics included in ProtoPNeXt. Using the same resources, this produces models with substantially superior semantics and changes in accuracy between +1.3% and -1.5%. The code and trained models will be made publicly available upon publication.

[CV-58] Holistic Evaluation for Interleaved Text-and-Image Generation

Link: https://arxiv.org/abs/2406.14643
Authors: Minqian Liu, Zhiyang Xu, Zihao Lin, Trevor Ashby, Joy Rimchala, Jiaxin Zhang, Lifu Huang
Keywords: intriguing research direction, arbitrary order, required to generate, Interleaved, interleaved generation
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Work in progress. 13 pages, 5 figures, 6 tables

Abstract:Interleaved text-and-image generation has been an intriguing research direction, where the models are required to generate both images and text pieces in an arbitrary order. Despite the emerging advancements in interleaved generation, the progress in its evaluation still significantly lags behind. Existing evaluation benchmarks do not support arbitrarily interleaved images and text for both inputs and outputs, and they only cover a limited number of domains and use cases. Also, current works predominantly use similarity-based metrics which fall short in assessing the quality in open-ended scenarios. To this end, we introduce InterleavedBench, the first benchmark carefully curated for the evaluation of interleaved text-and-image generation. InterleavedBench features a rich array of tasks to cover diverse real-world use cases. In addition, we present InterleavedEval, a strong reference-free metric powered by GPT-4o to deliver accurate and explainable evaluation. We carefully define five essential evaluation aspects for InterleavedEval, including text quality, perceptual quality, image coherence, text-image coherence, and helpfulness, to ensure a comprehensive and fine-grained assessment. Through extensive experiments and rigorous human evaluation, we show that our benchmark and metric can effectively evaluate the existing models with a strong correlation with human judgments surpassing previous reference-based metrics. We also provide substantial findings and insights to foster future research in interleaved generation and its evaluation.

[CV-59] Stylebreeder: Exploring and Democratizing Artistic Styles through Text-to-Image Models

Link: https://arxiv.org/abs/2406.14599
Authors: Matthew Zheng, Enis Simsar, Hidir Yesiltepe, Federico Tombari, Joel Simon, Pinar Yanardag
Keywords: enabling highly detailed, digital art creation, visual content generation, increasingly popular, revolutionizing the landscape
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Text-to-image models are becoming increasingly popular, revolutionizing the landscape of digital art creation by enabling highly detailed and creative visual content generation. These models have been widely employed across various domains, particularly in art generation, where they facilitate a broad spectrum of creative expression and democratize access to artistic creation. In this paper, we introduce STYLEBREEDER, a comprehensive dataset of 6.8M images and 1.8M prompts generated by 95K users on Artbreeder, a platform that has emerged as a significant hub for creative exploration with over 13M users. We introduce a series of tasks with this dataset aimed at identifying diverse artistic styles, generating personalized content, and recommending styles based on user interests. By documenting unique, user-generated styles that transcend conventional categories like ‘cyberpunk’ or ‘Picasso,’ we explore the potential for unique, crowd-sourced styles that could provide deep insights into the collective creative psyche of users worldwide. We also evaluate different personalization methods to enhance artistic expression and introduce a style atlas, making these models available in LoRA format for public use. Our research demonstrates the potential of text-to-image diffusion models to uncover and promote unique artistic expressions, further democratizing AI in art and fostering a more diverse and inclusive artistic community. The dataset, code and models are available at this https URL under a Public Domain (CC0) license.

[CV-60] ICAL: Continual Learning of Multimodal Agents by Transforming Trajectories into Actionable Insights

Link: https://arxiv.org/abs/2406.14596
Authors: Gabriel Sarch, Lawrence Jang, Michael J. Tarr, William W. Cohen, Kenneth Marino, Katerina Fragkiadaki
Keywords: Large-scale generative language, Large-scale generative, generative language, language and vision-language, decision making
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Project website: this http URL

Abstract:Large-scale generative language and vision-language models (LLMs and VLMs) excel in few-shot in-context learning for decision making and instruction following. However, they require high-quality exemplar demonstrations to be included in their context window. In this work, we ask: Can LLMs and VLMs generate their own prompt examples from generic, sub-optimal demonstrations? We propose In-Context Abstraction Learning (ICAL), a method that builds a memory of multimodal experience insights from sub-optimal demonstrations and human feedback. Given a noisy demonstration in a new domain, VLMs abstract the trajectory into a general program by fixing inefficient actions and annotating cognitive abstractions: task relationships, object state changes, temporal subgoals, and task construals. These abstractions are refined and adapted interactively through human feedback while the agent attempts to execute the trajectory in a similar environment. The resulting abstractions, when used as exemplars in the prompt, significantly improve decision-making in retrieval-augmented LLM and VLM agents. Our ICAL agent surpasses the state-of-the-art in dialogue-based instruction following in TEACh, multimodal web agents in VisualWebArena, and action anticipation in Ego4D. In TEACh, we achieve a 12.6% improvement in goal-condition success. In VisualWebArena, our task success rate improves over the SOTA from 14.3% to 22.7%. In Ego4D action forecasting, we improve over few-shot GPT-4V and remain competitive with supervised models. We show finetuning our retrieval-augmented in-context agent yields additional improvements. Our approach significantly reduces reliance on expert-crafted examples and consistently outperforms in-context learning from action plans that lack such insights.

[CV-61] Modeling & Evaluating the Performance of Convolutional Neural Networks for Classifying Steel Surface Defects

Link: https://arxiv.org/abs/2406.14583
Authors: Nadeem Jabbar Chaudhry, M. Bilal Khan, M. Javaid Iqbal, Siddiqui Muhammad Yasir
Keywords: convolutional neural networks, CNN models, outstanding identification rates, outstanding identification, neural networks
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Recently, convolutional neural networks (CNNs) have achieved outstanding identification rates in image classification tasks. To use such capabilities, selected CNNs are trained on a dataset of well-known images of metal surface defects captured with an RGB camera. Defects must be detected early to take timely corrective action due to production concerns. Until now, image classification has relied on a model-based method that compares the predicted reflection characteristics of surface defects with those of flaw-free surfaces. The problem of detecting steel surface defects has grown in importance as a result of the vast range of steel applications in end-product sectors such as automobiles, households, construction, etc. The manual processes for detection are time-consuming, labor-intensive, and expensive. Different strategies have been used to automate manual processes, but CNN models have proven to be more effective than image processing and machine learning techniques. By fine-tuning different CNN models, one can easily compare their performance and select the best-performing model for the same kinds of tasks. However, it is important to note that fine-tuning different CNN models can be computationally expensive and time-consuming. Therefore, our study helps future researchers choose a CNN without having to consider the issues of model complexity, performance, and computational resources. In this article, the performance of various CNN models with transfer learning techniques is evaluated. These models were chosen based on their popularity and impact in the field of computer vision research, as well as their performance on benchmark datasets. According to the outcomes, DenseNet201 outperformed the other CNN models and had the highest detection rate on the NEU dataset, at 98.37 percent.

[CV-62] Faster Metallic Surface Defect Detection Using Deep Learning with Channel Shuffling

Link: https://arxiv.org/abs/2406.14582
Authors: Siddiqui Muhammad Yasir, Hyunsik Ahn
Keywords: Deep learning, defect detection model, defect detection, constantly improving, improving in recent
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Deep learning has been constantly improving in recent years, and a significant number of researchers have devoted themselves to the research of defect detection algorithms. Detection and recognition of small and complex targets is still a problem that needs to be solved. The authors of this research present an improved defect detection model for detecting small and complex defect targets in steel surfaces. During steel strip production, mechanical forces and environmental factors cause surface defects of the steel strip. Therefore, the detection of such defects is key to the production of high-quality products. Moreover, surface defects of the steel strip cause great economic losses to the high-tech industry. So far, few studies have explored methods of identifying the defects, and most of the currently available algorithms are not sufficiently effective. Therefore, this study presents an improved real-time metallic surface defect detection model based on You Only Look Once (YOLOv5), specially designed for small networks. For the smaller features of the target, the conventional part is replaced with a depth-wise convolution and channel shuffle mechanism. Then, assigning weights to Feature Pyramid Networks (FPN) output features and fusing them increases feature propagation and the network's characterization ability. The experimental results reveal that the improved model outperforms other comparable models in terms of accuracy and detection time. The proposed model achieves an mAP of 77.5% on the Northeastern University dataset NEU-DET and 70.18% on the GC10-DET dataset.
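
The channel shuffle mechanism mentioned in this abstract is a simple index permutation, sketched below on a flat list of channels. This follows the general channel-shuffle recipe (view the channels as a `groups x channels_per_group` grid, transpose, flatten) and is not taken from the paper's code; how the authors wire it into YOLOv5 is not described in the abstract.

```python
from typing import List, TypeVar

T = TypeVar("T")


def channel_shuffle(channels: List[T], groups: int) -> List[T]:
    """Interleave channels across groups.

    Equivalent to reshaping to (groups, per_group), transposing to
    (per_group, groups), and flattening again.
    """
    if groups <= 0 or len(channels) % groups != 0:
        raise ValueError("number of channels must be divisible by groups")
    per_group = len(channels) // groups
    return [
        channels[g * per_group + i]
        for i in range(per_group)
        for g in range(groups)
    ]


# Six channels in two groups: [0, 1, 2] and [3, 4, 5].
shuffled = channel_shuffle(list(range(6)), groups=2)
```

After a grouped or depth-wise convolution, each output channel has only seen its own group; the shuffle lets the next layer mix information across groups at essentially no compute cost.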

[CV-63] 3D Instance Segmentation Using Deep Learning on RGB-D Indoor Data

Link: https://arxiv.org/abs/2406.14581
Authors: Siddiqui Muhammad Yasir, Amin Muhammad Sadiq, Hyunsik Ahn
Keywords: home indoor environments, indoor environments, industrial and home, home indoor, robot systems
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:3D object recognition is a challenging task for intelligent and robot systems in industrial and home indoor environments. It is critical for such systems to recognize and segment the 3D object instances that they encounter on a frequent basis. The computer vision, graphics, and machine learning fields have all given it a lot of attention. Traditionally, 3D segmentation was done with hand-crafted features and designed approaches that did not achieve acceptable performance and could not be generalized to large-scale data. Deep learning approaches have lately become the preferred method for 3D segmentation challenges owing to their great success in 2D computer vision. However, the task of instance segmentation is currently less explored. In this paper, we propose a novel approach for efficient 3D instance segmentation using red green blue and depth (RGB-D) data based on deep learning. The 2D region based convolutional neural networks (Mask R-CNN) deep learning model with a point-based rendering module is adapted to integrate with depth information to recognize and segment 3D instances of objects. In order to generate 3D point cloud coordinates (x, y, z), segmented 2D pixels (u, v) of recognized object regions in the RGB image are merged into (u, v) points of the depth image. Moreover, we conducted an experiment and analysis to compare our proposed method from various points of view and distances. The experimentation shows the proposed 3D object recognition and instance segmentation are sufficiently beneficial to support object handling in robotic and intelligent systems.
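
Turning segmented pixels (u, v) plus depth into 3D points (x, y, z) is commonly done with the pinhole camera model, sketched below. The abstract does not state which camera model or calibration the authors use, so the intrinsics `fx`, `fy`, `cx`, `cy` and the function names are assumptions for illustration only.

```python
from typing import Dict, List, Tuple

Pixel = Tuple[int, int]                 # (u, v) image coordinates
Point3D = Tuple[float, float, float]    # (x, y, z) in the camera frame


def backproject(
    u: float, v: float, depth: float, fx: float, fy: float, cx: float, cy: float
) -> Point3D:
    """Pinhole back-projection of one pixel with known depth."""
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return (x, y, z)


def mask_to_points(
    mask_pixels: List[Pixel],
    depth_image: Dict[Pixel, float],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Point3D]:
    """Back-project every segmented pixel that has a valid (positive) depth."""
    points = []
    for (u, v) in mask_pixels:
        d = depth_image.get((u, v), 0.0)
        if d > 0.0:  # skip missing depth readings
            points.append(backproject(u, v, d, fx, fy, cx, cy))
    return points


# Hypothetical intrinsics and a two-pixel instance mask.
intrinsics = dict(fx=500.0, fy=500.0, cx=320.0, cy=240.0)
depth = {(320, 240): 2.0, (420, 240): 2.0, (10, 10): 0.0}
cloud = mask_to_points([(320, 240), (420, 240), (10, 10)], depth, **intrinsics)
```

The pixel at the principal point maps onto the optical axis, and pixels with no depth reading are dropped rather than projected to the origin.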

[CV-64] DragPoser: Motion Reconstruction from Variable Sparse Tracking Signals via Latent Space Optimization

Link: https://arxiv.org/abs/2406.14567
Authors: Jose Luis Ponton, Eduard Pujol, Andreas Aristidou, Carlos Andujar, Nuria Pelechano
Keywords: High-quality motion reconstruction, high-end mocap systems, High-quality motion, user movements, high-end mocap
Categories: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Abstract:High-quality motion reconstruction that follows the user’s movements can be achieved by high-end mocap systems with many sensors. However, obtaining such animation quality with fewer input devices is gaining popularity as it brings mocap closer to the general public. The main challenges include the loss of end-effector accuracy in learning-based approaches, or the lack of naturalness and smoothness in IK-based solutions. In addition, such systems are often finely tuned to a specific number of trackers and are highly sensitive to missing data e.g., in scenarios where a sensor is occluded or malfunctions. In response to these challenges, we introduce DragPoser, a novel deep-learning-based motion reconstruction system that accurately represents hard and dynamic on-the-fly constraints, attaining real-time high end-effectors position accuracy. This is achieved through a pose optimization process within a structured latent space. Our system requires only one-time training on a large human motion dataset, and then constraints can be dynamically defined as losses, while the pose is iteratively refined by computing the gradients of these losses within the latent space. To further enhance our approach, we incorporate a Temporal Predictor network, which employs a Transformer architecture to directly encode temporality within the latent space. This network ensures the pose optimization is confined to the manifold of valid poses and also leverages past pose data to predict temporally coherent poses. Results demonstrate that DragPoser surpasses both IK-based and the latest data-driven methods in achieving precise end-effector positioning, while it produces natural poses and temporally coherent motion. In addition, our system showcases robustness against on-the-fly constraint modifications, and exhibits exceptional adaptability to various input configurations and changes.
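
The core loop described here, refining a pose by following the gradient of a constraint loss inside the latent space, can be shown on a toy problem. Everything below is a stand-in: the frozen linear `decode` replaces the trained network, the single scalar constraint replaces real end-effector targets, and the Temporal Predictor is omitted. It is a sketch of latent-space optimization in general, not of DragPoser itself.

```python
from typing import List

# Hypothetical frozen linear "decoder": pose = W @ z (stands in for a trained network).
W = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
]


def decode(z: List[float]) -> List[float]:
    """Map a 2-D latent vector to a 3-D toy 'pose'."""
    return [sum(w * zi for w, zi in zip(row, z)) for row in W]


def optimize_latent(
    z0: List[float],
    target_index: int,
    target_value: float,
    lr: float = 0.1,
    steps: int = 200,
) -> List[float]:
    """Gradient descent on z for the loss (decode(z)[target_index] - target_value)**2."""
    z = list(z0)
    for _ in range(steps):
        residual = decode(z)[target_index] - target_value
        # Analytic gradient of the squared residual with respect to z.
        grad = [2.0 * residual * w for w in W[target_index]]
        z = [zi - lr * g for zi, g in zip(z, grad)]
    return z


# Constraint: the third pose coordinate should reach 1.0.
z_opt = optimize_latent([0.0, 0.0], target_index=2, target_value=1.0)
pose = decode(z_opt)
```

Because only the latent vector is updated, the decoder weights stay fixed, and new constraints can be added at run time by summing more loss terms before taking the gradient.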

[CV-65] LM-IGTD: a 2D image generator for low-dimensional and mixed-type tabular data to leverage the potential of convolutional neural networks

Link: https://arxiv.org/abs/2406.14566
Authors: Vanesa Gómez-Martínez, Francisco J. Lara-Abelenda, Pablo Peiro-Corbacho, David Chushig-Muzo, Conceicao Granja, Cristina Soguero-Ruiz
Keywords: Tabular data, transforming tabular data, data, knowledge domains, Tabular
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Tabular data have been extensively used in different knowledge domains. Convolutional neural networks (CNNs) have been successfully used in many applications where important information about data is embedded in the order of features (images), outperforming predictive results of traditional models. Recently, several researchers have proposed transforming tabular data into images to leverage the potential of CNNs and obtain high results in predictive tasks such as classification and regression. In this paper, we present a novel and effective approach for transforming tabular data into images, addressing the inherent limitations associated with low-dimensional and mixed-type datasets. Our method, named Low Mixed-Image Generator for Tabular Data (LM-IGTD), integrates a stochastic feature generation process and a modified version of the IGTD. We introduce an automatic and interpretable end-to-end pipeline, enabling the creation of images from tabular data. A mapping between original features and the generated images is established, and post hoc interpretability methods are employed to identify crucial areas of these images, enhancing interpretability for predictive tasks. An extensive evaluation of the proposed tabular-to-image generation approach is conducted on 12 low-dimensional and mixed-type datasets, including binary and multi-class classification scenarios. In particular, our method outperformed all traditional ML models trained on tabular data in five out of twelve datasets when using images generated with LM-IGTD and CNN. In the remaining datasets, LM-IGTD images and CNN consistently surpassed three out of four traditional ML models, achieving similar results to the fourth model.

[CV-66] ReflectanceFusion: Diffusion-based text to SVBRDF Generation

Link: https://arxiv.org/abs/2406.14565
Authors: Bowen Xue,Giuseppe Claudio Guarnera,Shuang Zhao,Zahra Montazeri
Keywords: introduce Reflectance Diffusion, high-fidelity SVBRDF maps, SVBRDF maps, generating high-fidelity SVBRDF, introduce Reflectance
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)

Abstract:We introduce Reflectance Diffusion, a new neural text-to-texture model capable of generating high-fidelity SVBRDF maps from textual descriptions. Our method leverages a tandem neural approach, consisting of two modules, to accurately model the distribution of spatially varying reflectance as described by text prompts. Initially, we employ a pre-trained stable diffusion 2 model to generate a latent representation that informs the overall shape of the material and serves as our backbone model. Then, our ReflectanceUNet enables fine-tuning control over the material’s physical appearance and generates SVBRDF maps. ReflectanceUNet module is trained on an extensive dataset comprising approximately 200,000 synthetic spatially varying materials. Our generative SVBRDF diffusion model allows for the synthesis of multiple SVBRDF estimates from a single textual input, offering users the possibility to choose the output that best aligns with their requirements. We illustrate our method’s versatility by generating SVBRDF maps from a range of textual descriptions, both specific and broad. Our ReflectanceUNet model can integrate optional physical parameters, such as roughness and specularity, enhancing customization. When the backbone module is fixed, the ReflectanceUNet module refines the material, allowing direct edits to its physical attributes. Comparative evaluations demonstrate that ReflectanceFusion achieves better accuracy than existing text-to-material models, such as Text2Mat, while also providing the benefits of editable and relightable SVBRDF maps.

[CV-67] Full-Scale Indexing and Semantic Annotation of CT Imaging: Boosting FAIRness

Link: https://arxiv.org/abs/2406.15340
Authors: Hannes Ulrich,Robin Hendel,Santiago Pazmino,Björn Bergh,Björn Schreiweis
Keywords: significant advances, treatment planning, artificial intelligence, intelligence into medicine, medicine has led
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Background: The integration of artificial intelligence into medicine has led to significant advances, particularly in diagnostics and treatment planning. However, the reliability of AI models is highly dependent on the quality of the training data, especially in medical imaging, where varying patient data and evolving medical knowledge pose a challenge to the accuracy and generalizability of given datasets. Results: The proposed approach focuses on the integration and enhancement of clinical computed tomography (CT) image series for better findability, accessibility, interoperability, and reusability. Through an automated indexing process, CT image series are semantically enhanced using the TotalSegmentator framework for segmentation and resulting SNOMED CT annotations. The metadata is standardized with HL7 FHIR resources to enable efficient data recognition and data exchange between research projects. Conclusions: The study successfully integrates a robust process within the UKSH MeDIC, leading to the semantic enrichment of over 230,000 CT image series and over 8 million SNOMED CT annotations. The standardized representation using HL7 FHIR resources improves discoverability and facilitates interoperability, providing a foundation for the FAIRness of medical imaging data. However, developing automated annotation methods that can keep pace with growing clinical datasets remains a challenge to ensure continued progress in large-scale integration and indexing of medical imaging for advanced healthcare AI applications.

[CV-68] Rapid and Accurate Diagnosis of Acute Aortic Syndrome using Non-contrast CT: A Large-scale Retrospective Multi-center and AI-based Study

Link: https://arxiv.org/abs/2406.15222
Authors: Yujian Hu,Yilang Xiang,Yan-Jie Zhou,Yangyan He,Shifeng Yang,Xiaolong Du,Chunlan Den,Youyao Xu,Gaofeng Wang,Zhengyao Ding,Jingyong Huang,Wenjun Zhao,Xuejun Wu,Donglin Li,Qianqian Zhu,Zhenjiang Li,Chenyang Qiu,Ziheng Wu,Yunjun He,Chen Tian,Yihui Qiu,Zuodong Lin,Xiaolong Zhang,Yuan He,Zhenpeng Yuan,Xiaoxiang Zhou,Rong Fan,Ruihan Chen,Wenchao Guo,Jianpeng Zhang,Tony C. W. Mok,Zi Li,Le Lu,Dehai Lang,Xiaoqiang Li,Guofu Wang,Wei Lu,Zhengxing Huang,Minfeng Xu,Hongkun Zhang
Keywords: Chest pain symptoms, acute aortic syndrome, acute chest pain, catastrophic cardiovascular emergency, chest pain conditions
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Notes: submit to Nature Medicine

Abstract:Chest pain symptoms are highly prevalent in emergency departments (EDs), where acute aortic syndrome (AAS) is a catastrophic cardiovascular emergency with a high fatality rate, especially when timely and accurate treatment is not administered. However, current triage practices in the ED can cause up to approximately half of patients with AAS to have an initially missed diagnosis or be misdiagnosed as having other acute chest pain conditions. Subsequently, these AAS patients will undergo clinically inaccurate or suboptimal differential diagnosis. Fortunately, even under these suboptimal protocols, nearly all these patients underwent non-contrast CT covering the aorta anatomy at the early stage of differential diagnosis. In this study, we developed an artificial intelligence model (DeepAAS) using non-contrast CT, which is highly accurate for identifying AAS and provides interpretable results to assist in clinical decision-making. Performance was assessed in two major phases: a multi-center retrospective study (n = 20,750) and an exploration in real-world emergency scenarios (n = 137,525). In the multi-center cohort, DeepAAS achieved a mean area under the receiver operating characteristic curve of 0.958 (95% CI 0.950-0.967). In the real-world cohort, DeepAAS detected 109 AAS patients with misguided initial suspicion, achieving 92.6% (95% CI 76.2%-97.5%) in mean sensitivity and 99.2% (95% CI 99.1%-99.3%) in mean specificity. Our AI model performed well on non-contrast CT at all applicable early stages of differential diagnosis workflows, effectively reduced the overall missed diagnosis and misdiagnosis rate from 48.8% to 4.8% and shortened the diagnosis time for patients with misguided initial suspicion from an average of 681.8 (74-11,820) mins to 68.5 (23-195) mins. DeepAAS could effectively fill the gap in the current clinical workflow without requiring additional tests.
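The sensitivity and specificity figures quoted in this abstract are standard confusion-matrix ratios. A minimal sketch of how they are computed follows; the function name is hypothetical and the counts are made-up illustrative numbers, not data from the study.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from confusion-matrix counts.

    sensitivity = TP / (TP + FN): share of true cases that were detected.
    specificity = TN / (TN + FP): share of non-cases correctly ruled out.
    """
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative case")
    return tp / (tp + fn), tn / (tn + fp)


# Illustrative counts only (not taken from the paper).
sens, spec = sensitivity_specificity(tp=90, fn=10, tn=950, fp=50)
print(f"sensitivity={sens:.3f}, specificity={spec:.3f}")
```

With rare conditions, a high specificity matters as much as sensitivity, because even a small false-positive rate applies to a very large number of negative cases.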

[CV-69] Multimodal Deformable Image Registration for Long-COVID Analysis Based on Progressive Alignment and Multi-perspective Loss

Link: https://arxiv.org/abs/2406.15172
Authors: Jiahua Li,James T. Grist,Fergus V. Gleeson,Bartłomiej W. Papież
Keywords: necessitates advanced imaging, Long COVID, persistent symptoms, pulmonary impairment, accurate diagnosis
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Long COVID is characterized by persistent symptoms, particularly pulmonary impairment, which necessitates advanced imaging for accurate diagnosis. Hyperpolarised Xenon-129 MRI (XeMRI) offers a promising avenue by visualising lung ventilation, perfusion, as well as gas transfer. Integrating functional data from XeMRI with structural data from Computed Tomography (CT) is crucial for comprehensive analysis and effective treatment strategies in long COVID, requiring precise data alignment from those complementary imaging modalities. To this end, CT-MRI registration is an essential intermediate step, given the significant challenges posed by the direct alignment of CT and Xe-MRI. Therefore, we proposed an end-to-end multimodal deformable image registration method that achieves superior performance for aligning long-COVID lung CT and proton density MRI (pMRI) data. Moreover, our method incorporates a novel Multi-perspective Loss (MPL) function, enhancing state-of-the-art deep learning methods for monomodal registration by making them adaptable for multimodal tasks. The registration results achieve a Dice coefficient score of 0.913, indicating a substantial improvement over the state-of-the-art multimodal image registration techniques. Since the XeMRI and pMRI images are acquired in the same sessions and can be roughly aligned, our results facilitate subsequent registration between XeMRI and CT, thereby potentially enhancing clinical decision-making for long COVID management.
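The Dice coefficient reported in this abstract measures the overlap between two segmentation masks. As a minimal sketch, assuming masks are flattened to lists of 0/1 values (the function name and the toy masks are illustrative, not from the paper):

```python
def dice_coefficient(mask_a, mask_b):
    """Dice = 2 * |A ∩ B| / (|A| + |B|) for two equally sized binary masks."""
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must have the same number of elements")
    intersection = sum(a * b for a, b in zip(mask_a, mask_b))
    total = sum(mask_a) + sum(mask_b)
    if total == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return 2.0 * intersection / total


# Toy masks: 3 foreground pixels each, 2 of them shared.
fixed = [1, 1, 0, 0, 1]
warped = [1, 0, 0, 1, 1]
print(f"Dice = {dice_coefficient(fixed, warped):.3f}")
```

A score of 1 means the warped mask coincides exactly with the fixed one, and 0 means no overlap at all.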

[CV-70] A Wavelet Guided Attention Module for Skin Cancer Classification with Gradient-based Feature Fusion

Link: https://arxiv.org/abs/2406.15128
Authors: Ayush Roy,Sujan Sarkar,Sohom Ghosal,Dmitrii Kaplun,Asya Lyanova,Ram Sarkar
Keywords: highly dangerous type, diagnose skin cancer, Skin cancer, physicians diagnose skin, dangerous type
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Skin cancer is a highly dangerous type of cancer that requires an accurate diagnosis from experienced physicians. To help physicians diagnose skin cancer more efficiently, a computer-aided diagnosis (CAD) system can be very helpful. In this paper, we propose a novel model, which uses a novel attention mechanism to pinpoint the differences in features across the spatial dimensions and symmetry of the lesion, thereby focusing on the dissimilarities of various classes based on symmetry, uniformity in texture and color, etc. Additionally, to take into account the variations in the boundaries of the lesions for different classes, we employ a gradient-based fusion of wavelet and soft attention-aided features to extract boundary information of skin lesions. We have tested our model on the multi-class and highly class-imbalanced dataset, called HAM10000, and achieved promising results, with a 91.17% F1-score and 90.75% accuracy. The code is made available at: this https URL.

[CV-71] FA-Net: A Fuzzy Attention-aided Deep Neural Network for Pneumonia Detection in Chest X-Rays

Link: https://arxiv.org/abs/2406.15117
Authors: Ayush Roy,Anurag Bhattacharjee,Diego Oliva,Oscar Ramos-Soto,Francisco J. Alvarez-Padilla,Ram Sarkar
Keywords: respiratory infection caused, caused by bacteria, Chest X-ray, infection caused, Pneumonia
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Pneumonia is a respiratory infection caused by bacteria, fungi, or viruses. It affects many people, particularly those in developing or underdeveloped nations with high pollution levels, unhygienic living conditions, overcrowding, and insufficient medical infrastructure. Pneumonia can cause pleural effusion, where fluids fill the lungs, leading to respiratory difficulty. Early diagnosis is crucial to ensure effective treatment and increase survival rates. Chest X-ray imaging is the most commonly used method for diagnosing pneumonia. However, visual examination of chest X-rays can be difficult and subjective. In this study, we have developed a computer-aided diagnosis system for automatic pneumonia detection using chest X-ray images. We have used DenseNet-121 and ResNet50 as the backbone for the binary class (pneumonia and normal) and multi-class (bacterial pneumonia, viral pneumonia, and normal) classification tasks, respectively. We have also implemented a channel-specific spatial attention mechanism, called Fuzzy Channel Selective Spatial Attention Module (FCSSAM), to highlight the specific spatial regions of relevant channels while removing the irrelevant channels of the extracted features by the backbone. We evaluated the proposed approach on a publicly available chest X-ray dataset, using binary and multi-class classification setups. Our proposed method achieves accuracy rates of 97.15% and 79.79% for the binary and multi-class classification setups, respectively. The results of our proposed method are superior to state-of-the-art (SOTA) methods. The code of the proposed model will be available at: this https URL.

[CV-72] A Dual Attention-aided DenseNet-121 for Classification of Glaucoma from Fundus Images

Link: https://arxiv.org/abs/2406.15113
Authors: Soham Chakraborty,Ayush Roy,Payel Pramanik,Daria Valenkova,Ram Sarkar
Keywords: Deep learning, computer vision methods, field of ophthalmology, learning and computer, computer vision
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Deep learning and computer vision methods are nowadays predominantly used in the field of ophthalmology. In this paper, we present an attention-aided DenseNet-121 for classifying normal and glaucomatous eyes from fundus images. It involves the convolutional block attention module to highlight relevant spatial and channel features extracted by DenseNet-121. The channel recalibration module further enriches the features by utilizing edge information along with the statistical features of the spatial dimension. For the experiments, two standard datasets, namely RIM-ONE and ACRIMA, have been used. Our method has shown superior results than state-of-the-art models. An ablation study has also been conducted to show the effectiveness of each of the components. The code of the proposed work is available at: this https URL.

[CV-73] Benchmarking Retinal Blood Vessel Segmentation Models for Cross-Dataset and Cross-Disease Generalization

Link: https://arxiv.org/abs/2406.14994
Authors: Jeremiah Fadugba,Patrick Köhler,Lisa Koch,Petru Manescu,Philipp Berens
Keywords: Retinal blood vessel, extract clinically relevant, clinically relevant information, Convolution Neural Networks, Retinal blood
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Notes: 12 pages, 4 figures

Abstract:Retinal blood vessel segmentation can extract clinically relevant information from fundus images. As manual tracing is cumbersome, algorithms based on Convolutional Neural Networks have been developed. Such studies have used small publicly available datasets for training and measuring performance, running the risk of overfitting. Here, we provide a rigorous benchmark for various architectural and training choices commonly used in the literature on the largest dataset published to date. We train and evaluate five published models on the publicly available FIVES fundus image dataset, which exceeds previous ones in size and quality and which also contains images from common ophthalmological conditions (diabetic retinopathy, age-related macular degeneration, glaucoma). We compare the performance of different model architectures across different loss functions, levels of image quality and ophthalmological conditions and assess their ability to perform well in the face of disease-induced domain shifts. Given sufficient training data, basic architectures such as U-Net perform just as well as more advanced ones, and transfer across disease-induced domain shifts typically works well for most architectures. However, we find that image quality is a key factor determining segmentation outcomes. When optimizing for segmentation performance, investing in a well-curated dataset to train a standard architecture yields better results than tuning a sophisticated architecture on a smaller dataset or one with lower image quality. We distilled the utility of architectural advances in terms of their clinical relevance, thereby providing practical guidance for model choices depending on the circumstances of the clinical setting.

[CV-74] CoCPF: Coordinate-based Continuous Projection Field for Ill-Posed Inverse Problem in Imaging

Link: https://arxiv.org/abs/2406.14976
Authors: Zixuan Chen,Lingxiao Yang,Jian-Huang Lai,Xiaohua Xie
Keywords: Sparse-view computed tomography, Sparse-view computed, computed tomography, based on sparsely-sampled, Sparse-view
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Sparse-view computed tomography (SVCT) reconstruction aims to acquire CT images based on sparsely-sampled measurements. It exposes subjects to less ionizing radiation, reducing the lifetime risk of developing cancers. Recent research employs implicit neural representation (INR) techniques to reconstruct CT images from a single sparse-view (SV) sinogram. However, due to ill-posedness, these INR-based methods may leave considerable “holes” (i.e., unmodeled spaces) in their fields, leading to sub-optimal results. In this paper, we propose the Coordinate-based Continuous Projection Field (CoCPF), which aims to build hole-free representation fields for SVCT reconstruction, achieving better reconstruction quality. Specifically, to fill the holes, CoCPF first employs the stripe-based volume sampling module to broaden the sampling regions of Radon transformation from rays (1D space) to stripes (2D space), which can well cover the internal regions between SV projections. Then, by feeding the sampling regions into the proposed differentiable rendering modules, the holes can be jointly optimized during training, reducing the degree of ill-posedness. As a result, CoCPF can accurately estimate the internal measurements between SV projections (i.e., dense-view (DV) sinograms), producing high-quality CT images after re-projection. Extensive experiments on simulated and real projection datasets demonstrate that CoCPF outperforms state-of-the-art methods for 2D and 3D SVCT reconstructions under various projection numbers and geometries, yielding fine-grained details and fewer artifacts. Our code will be publicly available.

[CV-75] A Unified Framework for Synthesizing Multisequence Brain MRI via Hybrid Fusion

Link: https://arxiv.org/abs/2406.14954
Authors: Jihoon Cho,Jonghye Woo,Jinah Park
Keywords: Magnetic Resonance Imaging, Multisequence Magnetic Resonance, Resonance Imaging, Magnetic Resonance, Multisequence Magnetic
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Notes: 11 pages, 7 figures

Abstract:Multisequence Magnetic Resonance Imaging (MRI) provides a reliable diagnosis in clinical applications through complementary information within sequences. However, in practice, the absence of certain MR sequences is a common problem that can lead to inconsistent analysis results. In this work, we propose a novel unified framework for synthesizing multisequence MR images, called Hybrid Fusion GAN (HF-GAN). We introduce a hybrid fusion encoder designed to ensure the disentangled extraction of complementary and modality-specific information, along with a channel attention-based feature fusion module that integrates the features into a common latent space handling the complexity from combinations of accessible MR sequences. Common feature representations are transformed into a target latent space via the modality infuser to synthesize missing MR sequences. We have performed experiments on multisequence brain MRI datasets from healthy individuals and patients diagnosed with brain tumors. Experimental results show that our method outperforms state-of-the-art methods in both quantitative and qualitative comparisons. In addition, a detailed analysis of our framework demonstrates the superiority of our designed modules and their effectiveness for use in data imputation tasks.

[CV-76] SelfReg-UNet: Self-Regularized UNet for Medical Image Segmentation

Link: https://arxiv.org/abs/2406.14896
Authors: Wenhui Zhu,Xiwen Chen,Peijie Qiu,Mohammad Farazi,Aristeidis Sotiras,Abolfazl Razi,Yalin Wang
Keywords: medical image segmentation, image segmentation tasks, medical image, image segmentation, leading a variety
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Notes: Accepted as a conference paper to 2024 MICCAI

Abstract:Since its introduction, UNet has been leading a variety of medical image segmentation tasks. Although numerous follow-up studies have also been dedicated to improving the performance of standard UNet, few have conducted in-depth analyses of the underlying interest pattern of UNet in medical image segmentation. In this paper, we explore the patterns learned in a UNet and observe two important factors that potentially affect its performance: (i) irrelevant features learned due to asymmetric supervision; (ii) feature redundancy in the feature map. To this end, we propose to balance the supervision between encoder and decoder and reduce the redundant information in the UNet. Specifically, we use the feature map that contains the most semantic information (i.e., the last layer of the decoder) to provide additional supervision to other blocks and reduce feature redundancy by leveraging feature distillation. The proposed method can be easily integrated into existing UNet architecture in a plug-and-play fashion with negligible computational cost. The experimental results suggest that the proposed method consistently improves the performance of standard UNets on four medical image segmentation datasets. The code is available at this https URL.

[CV-77] ImageFlowNet: Forecasting Multiscale Trajectories of Disease Progression with Irregularly-Sampled Longitudinal Medical Images

Link: https://arxiv.org/abs/2406.14794
Authors: Chen Liu,Ke Xu,Liangbo L. Shen,Guillaume Huguet,Zilong Wang,Alexander Tong,Danilo Bzdok,Jay Stewart,Jay C. Wang,Lucian V. Del Priore,Smita Krishnaswamy
Keywords: clinical decision making, decision making, holy grail, grail for clinical, clinical decision
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:The forecasting of disease progression from images is a holy grail for clinical decision making. However, this task is complicated by the inherent high dimensionality, temporal sparsity and sampling irregularity in longitudinal image acquisitions. Existing methods often rely on extracting hand-crafted features and performing time-series analysis in this vector space, leading to a loss of rich spatial information within the images. To overcome these challenges, we introduce ImageFlowNet, a novel framework that learns latent-space flow fields that evolve multiscale representations in joint embedding spaces using neural ODEs and SDEs to model disease progression in the image domain. Notably, ImageFlowNet learns multiscale joint representation spaces by combining cohorts of patients together so that information can be transferred between the patient samples. The dynamics then provide plausible trajectories of progression, with the SDE providing alternative trajectories from the same starting point. We provide theoretical insights that support our formulation of ODEs, and motivate our regularizations involving high-level visual features, latent space organization, and trajectory smoothness. We then demonstrate ImageFlowNet’s effectiveness through empirical evaluations on three longitudinal medical image datasets depicting progression in retinal geographic atrophy, multiple sclerosis, and glioblastoma.

[CV-78] Policy Gradient-Driven Noise Mask

Link: https://arxiv.org/abs/2406.14568
Authors: Mehmet Can Yavuz,Yang Yang
Keywords: face significant challenges, Deep learning classifiers, classifiers face significant, Deep learning, face significant
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Notes: 11 pages; 8 figures; 3 tables

Abstract:Deep learning classifiers face significant challenges when dealing with heterogeneous multi-modal and multi-organ biomedical datasets. The low-level feature distinguishability limited to imaging-modality hinders the classifiers’ ability to learn high-level semantic relationships, resulting in sub-optimal performance. To address this issue, image augmentation strategies are employed as regularization techniques. While additive noise input during network training is a well-established augmentation as regularization method, modern pipelines often favor more robust techniques such as dropout and weight decay. This preference stems from the observation that combining these established techniques with noise input can adversely affect model performance. In this study, we propose a novel pretraining pipeline that learns to generate conditional noise mask specifically tailored to improve performance on multi-modal and multi-organ datasets. As a reinforcement learning algorithm, our approach employs a dual-component system comprising a very light-weight policy network that learns to sample conditional noise using a differentiable beta distribution and a classifier network. The policy network is trained using the reinforce algorithm to generate image-specific noise masks that regularize the classifier during pretraining. A key aspect is that the policy network’s role is limited to obtaining an intermediate (or heated) model before fine-tuning. During inference, the policy network is omitted, allowing direct comparison between the baseline and noise-regularized models. We conducted experiments and related analyses on RadImageNet datasets. Results demonstrate that fine-tuning the intermediate models consistently outperforms conventional training algorithms on both classification and generalization to unseen concept tasks. 

Machine Learning

[LG-0] NAVSIM: Data-Driven Non-Reactive Autonomous Vehicle Simulation and Benchmarking

Link: https://arxiv.org/abs/2406.15349
Authors: Daniel Dauner,Marcel Hallgarten,Tianyu Li,Xinshuo Weng,Zhiyu Huang,Zetong Yang,Hongyang Li,Igor Gilitschenski,Boris Ivanovic,Marco Pavone,Andreas Geiger,Kashyap Chitta
Keywords: vision-based driving policies, Benchmarking vision-based driving, Benchmarking vision-based, vision-based driving, driving policies
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)

Abstract:Benchmarking vision-based driving policies is challenging. On one hand, open-loop evaluation with real data is easy, but these results do not reflect closed-loop performance. On the other, closed-loop evaluation is possible in simulation, but is hard to scale due to its significant computational demands. Further, the simulators available today exhibit a large domain gap to real data. This has resulted in an inability to draw clear conclusions from the rapidly growing body of research on end-to-end autonomous driving. In this paper, we present NAVSIM, a middle ground between these evaluation paradigms, where we use large datasets in combination with a non-reactive simulator to enable large-scale real-world benchmarking. Specifically, we gather simulation-based metrics, such as progress and time to collision, by unrolling bird’s eye view abstractions of the test scenes for a short simulation horizon. Our simulation is non-reactive, i.e., the evaluated policy and environment do not influence each other. As we demonstrate empirically, this decoupling allows open-loop metric computation while being better aligned with closed-loop evaluations than traditional displacement errors. NAVSIM enabled a new competition held at CVPR 2024, where 143 teams submitted 463 entries, resulting in several new insights. On a large set of challenging scenarios, we observe that simple methods with moderate compute requirements such as TransFuser can match recent large-scale end-to-end driving architectures such as UniAD. Our modular framework can potentially be extended with new datasets, data curation strategies, and metrics, and will be continually maintained to host future challenges. Our code is available at this https URL.

[LG-1] Privacy Preserved Blood Glucose Level Cross-Prediction: An Asynchronous Decentralized Federated Learning Approach

Link: https://arxiv.org/abs/2406.15346
Authors: Chengzhe Piao,Taiyu Zhu,Yu Wang,Stephanie E Baldeweg,Paul Taylor,Pantelis Georgiou,Jiahao Sun,Jun Wang,Kezhi Li
Keywords: Newly diagnosed Type, Continuous Glucose Monitoring, effective Blood Glucose, obtain effective Blood, Glucose Monitoring
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Newly diagnosed Type 1 Diabetes (T1D) patients often struggle to obtain effective Blood Glucose (BG) prediction models due to the lack of sufficient BG data from Continuous Glucose Monitoring (CGM), presenting a significant “cold start” problem in patient care. Utilizing population models to address this challenge is a potential solution, but collecting patient data for training population models in a privacy-conscious manner is challenging, especially given that such data is often stored on personal devices. Considering the privacy protection and addressing the “cold start” problem in diabetes care, we propose “GluADFL”, blood Glucose prediction by Asynchronous Decentralized Federated Learning. We compared GluADFL with eight baseline methods using four distinct T1D datasets, comprising 298 participants, which demonstrated its superior performance in accurately predicting BG levels for cross-patient analysis. Furthermore, patients’ data might be stored and shared across various communication networks in GluADFL, ranging from highly interconnected (e.g., random, performs the best among others) to more structured topologies (e.g., cluster and ring), suitable for various social networks. The asynchronous training framework supports flexible participation. By adjusting the ratios of inactive participants, we found it remains stable if less than 70% are inactive. Our results confirm that GluADFL offers a practical, privacy-preserving solution for BG prediction in T1D, significantly enhancing the quality of diabetes management.
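To make the idea of decentralized training with partly inactive participants more concrete, here is a deliberately simplified sketch. It is not the GluADFL algorithm: it assumes a ring topology, replaces each participant's model with a single scalar, skips local training entirely, and uses hypothetical function names. It only illustrates how neighbour averaging can proceed when a fraction of nodes sits out each round.

```python
import random


def ring_neighbors(i, n):
    """Indices of the two neighbours of node i on a ring of n nodes."""
    return [(i - 1) % n, (i + 1) % n]


def averaging_round(params, inactive_ratio, rng):
    """One round: each active node averages with its active ring neighbours.

    params holds one scalar per participant, a stand-in for model weights.
    Inactive nodes neither share nor update their value in this round.
    """
    n = len(params)
    active = [rng.random() >= inactive_ratio for _ in range(n)]
    updated = list(params)
    for i in range(n):
        if not active[i]:
            continue
        group = [params[i]] + [params[j] for j in ring_neighbors(i, n) if active[j]]
        updated[i] = sum(group) / len(group)
    return updated


rng = random.Random(0)  # fixed seed so the run is repeatable
initial = [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]  # hypothetical local parameters
params = list(initial)
for _ in range(50):
    params = averaging_round(params, inactive_ratio=0.3, rng=rng)

spread_before = max(initial) - min(initial)
spread_after = max(params) - min(params)
print(f"spread before: {spread_before:.3f}, after: {spread_after:.3f}")
```

Because every update is an average of existing values, the spread between participants can never grow; nodes that sit out a round only slow down the mixing.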

[LG-2] GenoTEX: A Benchmark for Evaluating LLM-Based Exploration of Gene Expression Data in Alignment with Bioinformaticians

Link: https://arxiv.org/abs/2406.15341
Authors: Haoyang Liu,Haohan Wang
Keywords: Recent advancements, advancements in machine, machine learning, learning have significantly, significantly improved
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Genomics (q-bio.GN)
Notes: 25 pages, 3 figures

Abstract:Recent advancements in machine learning have significantly improved the identification of disease-associated genes from gene expression datasets. However, these processes often require extensive expertise and manual effort, limiting their scalability. Large Language Model (LLM)-based agents have shown promise in automating these tasks due to their increasing problem-solving abilities. To support the evaluation and development of such methods, we introduce GenoTEX, a benchmark dataset for the automatic exploration of gene expression data, involving the tasks of dataset selection, preprocessing, and statistical analysis. GenoTEX provides annotated code and results for solving a wide range of gene identification problems, in a full analysis pipeline that follows the standard of computational genomics. These annotations are curated by human bioinformaticians who carefully analyze the datasets to ensure accuracy and reliability. To provide baselines for these tasks, we present GenoAgents, a team of LLM-based agents designed with context-aware planning, iterative correction, and domain expert consultation to collaboratively explore gene datasets. Our experiments with GenoAgents demonstrate the potential of LLM-based approaches in genomics data analysis, while error analysis highlights the challenges and areas for future improvement. We propose GenoTEX as a promising resource for benchmarking and enhancing AI-driven methods for genomics data analysis. We make our benchmark publicly available at this https URL.

[LG-3] Multimodal Task Vectors Enable Many-Shot Multimodal In-Context Learning

Link: https://arxiv.org/abs/2406.15334
Authors: Brandon Huang,Chancharik Mitra,Assaf Arbelle,Leonid Karlinsky,Trevor Darrell,Roei Herzig
Keywords: interleaved Large Multimodal, Large Multimodal Models, interleaved Large, Large Multimodal, multimodal ICL setting
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:The recent success of interleaved Large Multimodal Models (LMMs) in few-shot learning suggests that in-context learning (ICL) with many examples can be promising for learning new tasks. However, this many-shot multimodal ICL setting has one crucial problem: it is fundamentally limited by the model’s context length set at pretraining. The problem is especially prominent in the multimodal domain, which processes both text and images, requiring additional tokens. This motivates the need for a multimodal method to compress many shots into fewer tokens without finetuning. In this work, we enable LMMs to perform multimodal, many-shot in-context learning by leveraging Multimodal Task Vectors (MTV)–compact implicit representations of in-context examples compressed in the model’s attention heads. Specifically, we first demonstrate the existence of such MTV in LMMs and then leverage these extracted MTV to enable many-shot in-context learning for various vision-and-language tasks. Our experiments suggest that MTV can scale in performance with the number of compressed shots and generalize to similar out-of-domain tasks without additional context length for inference.

[LG-4] Masked Extended Attention for Zero-Shot Virtual Try-On In The Wild

Link: https://arxiv.org/abs/2406.15331
Authors: Nadav Orzech,Yotam Nitzan,Ulysse Mizrahi,Dov Danon,Amit H. Bermano
Keywords: highly active line, Virtual Try-On, line of research, increasing demand, highly active
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
Comments: Project page available at this https URL

Abstract:Virtual Try-On (VTON) is a highly active line of research, with increasing demand. It aims to replace a piece of garment in an image with one from another, while preserving person and garment characteristics as well as image fidelity. Current literature takes a supervised approach for the task, impairing generalization and imposing heavy computation. In this paper, we present a novel zero-shot training-free method for inpainting a clothing garment by reference. Our approach employs the prior of a diffusion model with no additional training, fully leveraging its native generalization capabilities. The method employs extended attention to transfer image information from reference to target images, overcoming two significant challenges. We first warp the reference garment over the target human using deep features, alleviating “texture sticking”. We then leverage the extended attention mechanism with careful masking, eliminating leakage of reference background and unwanted influence. Through a user study and qualitative and quantitative comparisons to state-of-the-art approaches, we demonstrate superior image quality and garment preservation for unseen clothing pieces or human figures.

[LG-5] Fine-grained Attention in Hierarchical Transformers for Tabular Time-series

Link: https://arxiv.org/abs/2406.15327
Authors: Raphael Azorin,Zied Ben Houidi,Massimo Gallo,Alessandro Finamore,Pietro Michiardi
Keywords: real-life systems, Tabular, tabular time-series, Tabular data, data
Categories: Machine Learning (cs.LG)
Comments: 9 pages

Abstract:Tabular data is ubiquitous in many real-life systems. In particular, time-dependent tabular data, where rows are chronologically related, is typically used for recording historical events, e.g., financial transactions, healthcare records, or stock history. Recently, hierarchical variants of the attention mechanism of transformer architectures have been used to model tabular time-series data. At first, rows (or columns) are encoded separately by computing attention between their fields. Subsequently, encoded rows (or columns) are attended to one another to model the entire tabular time-series. While efficient, this approach constrains the attention granularity and limits its ability to learn patterns at the field-level across separate rows, or columns. We take a first step to address this gap by proposing Fieldy, a fine-grained hierarchical model that contextualizes fields at both the row and column levels. We compare our proposal against state of the art models on regression and classification tasks using public tabular time-series datasets. Our results show that combining row-wise and column-wise attention improves performance without increasing model size. Code and data are available at this https URL.

[LG-6] Advanced Multimodal Deep Learning Architecture for Image-Text Matching

Link: https://arxiv.org/abs/2406.15306
Authors: Jinyin Wang,Haijing Zhang,Yihao Zhong,Yingbin Liang,Rongwei Ji,Yiru Cang
Keywords: key multimodal task, multimodal deep learning, image-text matching models, text, key multimodal
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: arXiv admin note: text overlap with arXiv:2405.17460 by other authors

Abstract:Image-text matching is a key multimodal task that aims to model the semantic association between images and text as a matching relationship. With the advent of the multimedia information age, image and text data show explosive growth, and how to realize efficient and accurate semantic correspondence between them has become a core issue of common concern in academia and industry. In this study, we delve into the limitations of current multimodal deep learning models in processing image-text pairing tasks. To address them, we design an advanced multimodal deep learning architecture, which combines the high-level abstract representation ability of deep neural networks for visual information with the advantages of natural language processing models for text semantic understanding. By introducing a novel cross-modal attention mechanism and hierarchical feature fusion strategy, the model achieves deep fusion and two-way interaction between image and text feature space. In addition, we optimize the training objectives and loss functions to ensure that the model can better map the potential association structure between images and text during the learning process. Experiments show that compared with existing image-text matching models, the optimized new model has significantly improved performance on a series of benchmark data sets. The new model also shows excellent generalization and robustness on large and diverse open scenario datasets and can maintain high matching performance even in the face of previously unseen complex situations.

[LG-7] Learning Spatio-Temporal Patterns of Polar Ice Layers With Physics-Informed Graph Neural Network

Link: https://arxiv.org/abs/2406.15299
Authors: Zesheng Liu,Maryam Rahnemoonfar
Keywords: ice dynamic processes, ice sheet balance, evaluating ice dynamic, dynamic processes, crucial for monitoring
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Learning spatio-temporal patterns of polar ice layers is crucial for monitoring the change in ice sheet balance and evaluating ice dynamic processes. While a few researchers focus on learning ice layer patterns from echogram images captured by airborne snow radar sensors via different convolutional neural networks, the noise in the echogram images proves to be a major obstacle. Instead, we focus on geometric deep learning based on graph neural networks to learn the spatio-temporal patterns from thickness information of shallow ice layers and make predictions for deep layers. In this paper, we propose a physics-informed hybrid graph neural network that combines the GraphSAGE framework for graph feature learning with the long short-term memory (LSTM) structure for learning temporal changes, and introduce measurements of physical ice properties from Model Atmospheric Regional (MAR) weather model as physical node features. We found that our proposed network can consistently outperform the current non-inductive or non-physical model in predicting deep ice layer thickness.

[LG-8] Pessimistic asynchronous sampling in high-cost Bayesian optimization

Link: https://arxiv.org/abs/2406.15291
Authors: Amanda A. Volk,Kristofer G. Reyes,Jeffrey G. Ethier,Luke A. Baldwin
Keywords: recently implemented technique, Bayesian optimization, disjointed workflows, Asynchronous Bayesian optimization, recently implemented
Categories: Machine Learning (cs.LG)

Abstract:Asynchronous Bayesian optimization is a recently implemented technique that allows for parallel operation of experimental systems and disjointed workflows. Contrasting with serial Bayesian optimization which individually selects experiments one at a time after conducting a measurement for each experiment, asynchronous policies sequentially assign multiple experiments before measurements can be taken and evaluate new measurements continuously as they are made available. This technique allows for faster data generation and therefore faster optimization of an experimental space. This work extends the capabilities of asynchronous optimization methods beyond prior studies by evaluating four additional policies that incorporate pessimistic predictions in the training data set. Combined with a conventional greedy policy, the five total policies were evaluated in a simulated environment and benchmarked with serial sampling. Under some conditions and parameter space dimensionalities, the pessimistic asynchronous policy reached optimum experimental conditions in significantly fewer experiments than equivalent serial policies and proved to be less susceptible to convergence onto local optima at higher dimensions. Without accounting for the faster sampling rate, the pessimistic asynchronous algorithm presented in this work could result in more efficient algorithm driven optimization of high-cost experimental spaces. Accounting for sampling rate, the presented asynchronous algorithm could allow for faster optimization in experimental spaces where multiple experiments can be run before results are collected.

[LG-9] FT-AED: Benchmark Dataset for Early Freeway Traffic Anomalous Event Detection

Link: https://arxiv.org/abs/2406.15283
Authors: Austin Coursey,Junyi Ji,Marcos Quinones-Grueiro,William Barbour,Yuhang Zhang,Tyler Derr,Gautam Biswas,Daniel B. Work
Keywords: improve emergency response, Early and accurate, response and clearance, improve emergency, emergency response
Categories: Machine Learning (cs.LG)

Abstract:Early and accurate detection of anomalous events on the freeway, such as accidents, can improve emergency response and clearance. However, existing delays and errors in event identification and reporting make it a difficult problem to solve. Current large-scale freeway traffic datasets are not designed for anomaly detection and ignore these challenges. In this paper, we introduce the first large-scale lane-level freeway traffic dataset for anomaly detection. Our dataset consists of a month of weekday radar detection sensor data collected in 4 lanes along an 18-mile stretch of Interstate 24 heading toward Nashville, TN, comprising over 3.7 million sensor measurements. We also collect official crash reports from the Nashville Traffic Management Center and manually label all other potential anomalies in the dataset. To show the potential for our dataset to be used in future machine learning and traffic research, we benchmark numerous deep learning anomaly detection models on our dataset. We find that unsupervised graph neural network autoencoders are a promising solution for this problem and that ignoring spatial relationships leads to decreased performance. We demonstrate that our methods can reduce reporting delays by over 10 minutes on average while detecting 75% of crashes. Our dataset and all preprocessing code needed to get started are publicly released at this https URL to facilitate future research.

[LG-10] Open Problem: Order Optimal Regret Bounds for Kernel-Based Reinforcement Learning

Link: https://arxiv.org/abs/2406.15250
Authors: Sattar Vakili
Keywords: Reinforcement Learning, shown great empirical, great empirical success, Markov Decision Process, Decision Process structures
Categories: Machine Learning (cs.LG)
Comments: Open problem track. Conference on Learning Theory (COLT 2024)

Abstract:Reinforcement Learning (RL) has shown great empirical success in various application domains. The theoretical aspects of the problem have been extensively studied over past decades, particularly under tabular and linear Markov Decision Process structures. Recently, non-linear function approximation using kernel-based prediction has gained traction. This approach is particularly interesting as it naturally extends the linear structure, and helps explain the behavior of neural-network-based models at their infinite width limit. The analytical results however do not adequately address the performance guarantees for this case. We will highlight this open problem, overview existing partial results, and discuss related challenges.

[LG-11] Machine Learning Techniques in Automatic Music Transcription: A Systematic Survey

Link: https://arxiv.org/abs/2406.15249
Authors: Fatemeh Jamshidi,Gary Pike,Amit Das,Richard Chapman
Keywords: Music Information Retrieval, Information Retrieval, Music Information, music signal analysis, Automatic Music Transcription
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)

Abstract:In the domain of Music Information Retrieval (MIR), Automatic Music Transcription (AMT) emerges as a central challenge, aiming to convert audio signals into symbolic notations like musical notes or sheet music. This systematic review accentuates the pivotal role of AMT in music signal analysis, emphasizing its importance due to the intricate and overlapping spectral structure of musical harmonies. Through a thorough examination of existing machine learning techniques utilized in AMT, we explore the progress and constraints of current models and methodologies. Despite notable advancements, AMT systems have yet to match the accuracy of human experts, largely due to the complexities of musical harmonies and the need for nuanced interpretation. This review critically evaluates both fully automatic and semi-automatic AMT systems, emphasizing the importance of minimal user intervention and examining various methodologies proposed to date. By addressing the limitations of prior techniques and suggesting avenues for improvement, our objective is to steer future research towards fully automated AMT systems capable of accurately and efficiently translating intricate audio signals into precise symbolic representations. This study not only synthesizes the latest advancements but also lays out a road-map for overcoming existing challenges in AMT, providing valuable insights for researchers aiming to narrow the gap between current systems and human-level transcription accuracy.

[LG-12] Unsupervised Morphological Tree Tokenizer

Link: https://arxiv.org/abs/2406.15245
Authors: Qingyang Zhu,Xiang Hu,Pengyu Ji,Wei Wu,Kewei Tu
Keywords: pre-defined atomic units, involves segmenting text, segmenting text inputs, tokenization involves segmenting, atomic units
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:As a cornerstone in language modeling, tokenization involves segmenting text inputs into pre-defined atomic units. Conventional statistical tokenizers often disrupt constituent boundaries within words, thereby corrupting semantic information. To address this drawback, we introduce morphological structure guidance to tokenization and propose a deep model to induce character-level structures of words. Specifically, the deep model jointly encodes internal structures and representations of words with a mechanism named MorphOverriding to ensure the indecomposability of morphemes. By training the model with self-supervised objectives, our method is capable of inducing character-level structures that align with morphological rules without annotated training data. Based on the induced structures, our algorithm tokenizes words through vocabulary matching in a top-down manner. Empirical results indicate that the proposed method effectively retains complete morphemes and outperforms widely adopted methods such as BPE and WordPiece on both morphological segmentation tasks and language modeling tasks. The code will be released later.
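
The top-down vocabulary matching step mentioned in this abstract can be illustrated with a small sketch. It assumes the character-level structure is already available as a binary tree; the `Node` class, the helper names, the hand-built tree for "unhappily", and the toy vocabulary are all hypothetical, and the authors' released code may work differently.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Node:
    """A node of a binary character-level tree; leaves hold one character."""
    text: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def merge(left: Node, right: Node) -> Node:
    """Join two subtrees into a parent covering both spans."""
    return Node(left.text + right.text, left, right)


def chain(s: str) -> Node:
    """Left-branching tree over the characters of s (illustrative only)."""
    node = Node(s[0])
    for ch in s[1:]:
        node = merge(node, Node(ch))
    return node


def tokenize(node: Node, vocab: Set[str]) -> List[str]:
    """Emit the largest subtree spans found in vocab, scanning top-down."""
    if node.text in vocab or node.left is None or node.right is None:
        # Either the span matched, or we hit a single-character fallback.
        return [node.text]
    return tokenize(node.left, vocab) + tokenize(node.right, vocab)


# Hypothetical induced structure for "unhappily": ((un)(happi))(ly).
tree = merge(merge(chain("un"), chain("happi")), chain("ly"))
tokens = tokenize(tree, {"un", "happi", "ly"})
print(tokens)
```

Because matching starts at the root and only descends when a span is missing from the vocabulary, a span that is present is never split further; unmatched spans fall back to single characters in this sketch.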

[LG-13] Large Batch Analysis for Adagrad Under Anisotropic Smoothness

Link: https://arxiv.org/abs/2406.15244
Authors: Yuxing Liu,Rui Pan,Tong Zhang
Keywords: deep neural networks, training large-scale deep, large-scale deep neural, large foundation models, neural networks
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)

Abstract:Adaptive gradient algorithms have been widely adopted in training large-scale deep neural networks, especially large foundation models. Despite their huge success in practice, their theoretical advantages over stochastic gradient descent (SGD) have not been fully understood, especially in the large batch-size setting commonly used in practice. This is because the only theoretical result that can demonstrate the benefit of Adagrad over SGD was obtained in the original paper of Adagrad for nonsmooth objective functions. However, for nonsmooth objective functions, there can be a linear slowdown of convergence when batch size increases, and thus a convergence analysis based on nonsmooth assumption cannot be used for large batch algorithms. In this work, we resolve this gap between theory and practice by providing a new analysis of Adagrad on both convex and nonconvex smooth objectives suitable for the large batch setting. It is shown that under the anisotropic smoothness and noise conditions, increased batch size does not slow down convergence for Adagrad, and thus it can still achieve a faster convergence guarantee over SGD even in the large batch setting. We present detailed comparisons between SGD and Adagrad to provide a better understanding of the benefits of adaptive gradient methods. Experiments in logistic regression and instruction following fine-tuning tasks provide strong evidence to support our theoretical analysis.
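
For readers unfamiliar with the optimizer analyzed here, the following is a minimal sketch of the textbook diagonal Adagrad update, not the paper's code. The toy quadratic, the step size, and the function names are assumptions chosen for illustration, and full-batch gradients stand in for the large-batch stochastic gradients the paper studies.

```python
import math
from typing import Callable, List


def adagrad(grad: Callable[[List[float]], List[float]], x0: List[float],
            lr: float = 0.5, steps: int = 200, eps: float = 1e-8) -> List[float]:
    """Diagonal Adagrad: each coordinate gets its own adaptive step size."""
    x = list(x0)
    g2_sum = [0.0] * len(x)  # running sum of squared gradients, per coordinate
    for _ in range(steps):
        g = grad(x)
        for i, gi in enumerate(g):
            g2_sum[i] += gi * gi
            x[i] -= lr * gi / (math.sqrt(g2_sum[i]) + eps)
    return x


# Toy anisotropic quadratic f(x) = 0.5 * (100 * x0^2 + x1^2); illustrative only.
def quad_grad(x: List[float]) -> List[float]:
    return [100.0 * x[0], 1.0 * x[1]]


def quad_loss(x: List[float]) -> float:
    return 0.5 * (100.0 * x[0] ** 2 + x[1] ** 2)


start = [1.0, 1.0]
end = adagrad(quad_grad, start)
print(quad_loss(start), quad_loss(end))
```

On this separable toy problem the per-coordinate normalization makes both coordinates follow nearly the same trajectory (up to the small `eps` term) even though their curvatures differ by a factor of 100, which is the kind of anisotropy the abstract refers to.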

[LG-14] Detecting Synthetic Lyrics with Few-Shot Inference

Link: https://arxiv.org/abs/2406.15231
Authors: Yanis Labrak,Gabriel Meseguer-Brocal,Elena V. Epure
Keywords: gained significant popularity, produce human-like lyrics, large language models, recent years, significant popularity
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Under review

Abstract:In recent years, generated content in music has gained significant popularity, with large language models being effectively utilized to produce human-like lyrics in various styles, themes, and linguistic structures. This technological advancement supports artists in their creative processes but also raises issues of authorship infringement, consumer satisfaction and content spamming. To address these challenges, methods for detecting generated lyrics are necessary. However, existing works have not yet focused on this specific modality or on creative text in general regarding machine-generated content detection methods and datasets. In response, we have curated the first dataset of high-quality synthetic lyrics and conducted a comprehensive quantitative evaluation of various few-shot content detection approaches, testing their generalization capabilities and complementing this with a human evaluation. Our best few-shot detector, based on LLM2Vec, surpasses stylistic and statistical methods, which are shown competitive in other domains at distinguishing human-written from machine-generated content. It also shows good generalization capabilities to new artists and models, and effectively detects post-generation paraphrasing. This study emphasizes the need for further research on creative content detection, particularly in terms of generalization and scalability with larger song catalogs. All datasets, pre-processing scripts, and code are available publicly on GitHub and Hugging Face under the Apache 2.0 license.

[LG-15] ExDAG: Exact learning of DAGs

Link: https://arxiv.org/abs/2406.15229
Authors: Pavel Rytíř,Aleš Wodecki,Jakub Mareček
Keywords: recent years, growing interest, including Bayesian networks, including Bayesian, Bayesian networks
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 13 pages

Abstract:There has been a growing interest in causal learning in recent years. Commonly used representations of causal structures, including Bayesian networks and structural equation models (SEM), take the form of directed acyclic graphs (DAGs). We provide a novel mixed-integer quadratic programming formulation and associated algorithm that identifies DAGs on up to 50 vertices, where these are identifiable. We call this method ExDAG, which stands for Exact learning of DAGs. Although there is a superexponential number of constraints that prevent the formation of cycles, the algorithm adds constraints violated by solutions found, rather than imposing all constraints in each continuous-valued relaxation. Our empirical results show that ExDAG outperforms local state-of-the-art solvers in terms of precision and outperforms state-of-the-art global solvers with respect to scaling, when considering Gaussian noise. We also provide validation with respect to other noise distributions.

[LG-16] Injecting Bias in Text-To-Image Models via Composite-Trigger Backdoors

Link: https://arxiv.org/abs/2406.15213
Authors: Ali Naseh,Jaechul Roh,Eugene Bagdasaryan,Amir Houmansadr
Keywords: Stable Diffusion, text-conditional image generative, large text-conditional image, Recent advances, image generative models
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)

Abstract:Recent advances in large text-conditional image generative models such as Stable Diffusion, Midjourney, and DALL-E 3 have revolutionized the field of image generation, allowing users to produce high-quality, realistic images from textual prompts. While these developments have enhanced artistic creation and visual communication, they also present an underexplored attack opportunity: the possibility of inducing biases by an adversary into the generated images for malicious intentions, e.g., to influence society and spread propaganda. In this paper, we demonstrate the possibility of such a bias injection threat by an adversary who backdoors such models with a small number of malicious data samples; the implemented backdoor is activated when special triggers exist in the input prompt of the backdoored models. On the other hand, the model’s utility is preserved in the absence of the triggers, making the attack highly undetectable. We present a novel framework that enables efficient generation of poisoning samples with composite (multi-word) triggers for such an attack. Our extensive experiments using over 1 million generated images and against hundreds of fine-tuned models demonstrate the feasibility of the presented backdoor attack. We illustrate how these biases can bypass conventional detection mechanisms, highlighting the challenges in proving the existence of biases within operational constraints. Our cost analysis confirms the low financial barrier to executing such attacks, underscoring the need for robust defensive strategies against such vulnerabilities in text-to-image generation models.

[LG-17] Causal Learning in Biomedical Applications

Link: https://arxiv.org/abs/2406.15189
Authors: Petr Ryšavý,Xiaoyu He,Jakub Mareček
Keywords: present a benchmark, benchmark for methods, causal learning, Abstract, Krebs cycle
Categories: Machine Learning (cs.LG)

Abstract:We present a benchmark for methods in causal learning. Specifically, we consider training a rich class of causal models from time-series data, and we suggest the use of the Krebs cycle and models of metabolism more broadly.

[LG-18] Perks and Pitfalls of Faithfulness in Regular Self-Explainable and Domain Invariant GNNs

Link: https://arxiv.org/abs/2406.15156
Authors: Steve Azzolin,Antonio Longa,Stefano Teso,Andrea Passerini
Keywords: Graph Neural Networks, Neural Networks, Graph Neural, build robust tools, paramount to build
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:As Graph Neural Networks (GNNs) become more pervasive, it becomes paramount to build robust tools for computing explanations of their predictions. A key desideratum is that these explanations are faithful, i.e., that they portray an accurate picture of the GNN’s reasoning process. A number of different faithfulness metrics exist, begging the question of what faithfulness is exactly, and what its properties are. We begin by showing that existing metrics are not interchangeable – i.e., explanations attaining high faithfulness according to one metric may be unfaithful according to others – and can be systematically insensitive to important properties of the explanation, and suggest how to address these issues. We proceed to show that, surprisingly, optimizing for faithfulness is not always a sensible design goal. Specifically, we show that for injective regular GNN architectures, perfectly faithful explanations are completely uninformative. The situation is different for modular GNNs, such as self-explainable and domain-invariant architectures, where optimizing faithfulness does not compromise informativeness, and is also unexpectedly tied to out-of-distribution generalization.

[LG-19] Generative Topological Networks

Link: https://arxiv.org/abs/2406.15152
Authors: Alona Levy-Jurgenson,Zohar Yakhini
Keywords: Generative Topological Networks, Generative models, recent years, introduce Generative Topological, significant advancements
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

Abstract:Generative models have seen significant advancements in recent years, yet often remain challenging and costly to train and use. We introduce Generative Topological Networks (GTNs) – a new class of generative models that addresses these shortcomings. GTNs are trained deterministically using a simple supervised learning approach grounded in topology theory. GTNs are fast to train, and require only a single forward pass in a standard feedforward neural network to generate samples. We demonstrate the strengths of GTNs in several datasets, including MNIST, celebA and the Hands and Palm Images dataset. Finally, the theory behind GTNs offers insights into how to train generative models for improved performance.

[LG-20] Younger: The First Dataset for Artificial Intelligence-Generated Neural Network Architecture

Link: https://arxiv.org/abs/2406.15132
Authors: Zhengxin Yang,Wanling Gao,Luzhou Peng,Yunyou Huang,Fei Tang,Jianfeng Zhan
Keywords: requires extensive expertise, typically requires extensive, Designing and optimizing, architectures typically requires, optimizing neural network
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 31 pages, 29 figures, 11 tables

Abstract:Designing and optimizing neural network architectures typically requires extensive expertise, starting with handcrafted designs and then manual or automated refinement. This dependency presents a significant barrier to rapid innovation. Recognizing the complexity of automatically generating neural network architecture from scratch, we introduce Younger, a pioneering dataset to advance this ambitious goal. Derived from over 174K real-world models across more than 30 tasks from various public model hubs, Younger includes 7,629 unique architectures, and each is represented as a directed acyclic graph with detailed operator-level information. The dataset facilitates two primary design paradigms: global, for creating complete architectures from scratch, and local, for detailed architecture component refinement. By establishing these capabilities, Younger contributes to a new frontier, Artificial Intelligence-Generated Neural Network Architecture (AIGNNA). Our experiments explore the potential and effectiveness of Younger for automated architecture generation and, as a secondary benefit, demonstrate that Younger can serve as a benchmark dataset, advancing the development of graph neural networks. We release the dataset and code publicly to lower the entry barriers and encourage further research in this challenging area.

[LG-21] KalMamba: Towards Efficient Probabilistic State Space Models for RL under Uncertainty

Link: https://arxiv.org/abs/2406.15131
Authors: Philipp Becker,Niklas Freymuth,Gerhard Neumann
Keywords: State Space Models, Reinforcement Learning, Probabilistic State Space, essential for Reinforcement, provide concise representations
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Probabilistic State Space Models (SSMs) are essential for Reinforcement Learning (RL) from high-dimensional, partial information as they provide concise representations for control. Yet, they lack the computational efficiency of their recent deterministic counterparts such as S4 or Mamba. We propose KalMamba, an efficient architecture to learn representations for RL that combines the strengths of probabilistic SSMs with the scalability of deterministic SSMs. KalMamba leverages Mamba to learn the dynamics parameters of a linear Gaussian SSM in a latent space. Inference in this latent space amounts to standard Kalman filtering and smoothing. We realize these operations using parallel associative scanning, similar to Mamba, to obtain a principled, highly efficient, and scalable probabilistic SSM. Our experiments show that KalMamba competes with state-of-the-art SSM approaches in RL while significantly improving computational efficiency, especially on longer interaction sequences.
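
The "standard Kalman filtering" that this abstract refers to can be sketched for a scalar linear Gaussian state space model as below. This is the textbook sequential recursion with fixed, hand-picked parameters; the function name and the toy inputs are assumptions, and the paper instead learns the dynamics parameters with Mamba and evaluates the recursion with a parallel associative scan.

```python
from typing import Iterable, List, Tuple


def kalman_filter(ys: Iterable[float], a: float, h: float, q: float, r: float,
                  m0: float = 0.0, p0: float = 1.0) -> List[Tuple[float, float]]:
    """Return filtered (mean, variance) pairs for a scalar linear Gaussian SSM.

    Model: z_t = a * z_{t-1} + N(0, q),  y_t = h * z_t + N(0, r).
    """
    m, p = m0, p0
    out = []
    for y in ys:
        # Predict step: push the current belief through the linear dynamics.
        m_pred = a * m
        p_pred = a * a * p + q
        # Update step: correct the prediction with the new observation.
        k = p_pred * h / (h * h * p_pred + r)  # Kalman gain
        m = m_pred + k * (y - h * m_pred)
        p = (1.0 - k * h) * p_pred
        out.append((m, p))
    return out


# Toy run: a static latent state observed three times with unit noise.
beliefs = kalman_filter([1.0, 1.0, 1.0], a=1.0, h=1.0, q=0.0, r=1.0)
print(beliefs)
```

In the toy run the posterior variance shrinks with every observation while the mean moves from the prior toward the observed value.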

[LG-22] Embracing Federated Learning: Enabling Weak Client Participation via Partial Model Training

Link: https://arxiv.org/abs/2406.15125
Authors: Sunwoo Lee,Tuo Zhang,Saurav Prakash,Yue Niu,Salman Avestimehr
Keywords: Federated Learning, Federated, distributed learning method, clients, weak devices
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)

Abstract:In Federated Learning (FL), clients may have weak devices that cannot train the full model or even hold it in their memory space. Thus, to implement large-scale FL applications, it is crucial to develop a distributed learning method that enables the participation of such weak clients. We propose EmbracingFL, a general FL framework that allows all available clients to join the distributed training regardless of their system resource capacity. The framework is built upon a novel form of partial model training method in which each client trains as many consecutive output-side layers as its system resources allow. Our study demonstrates that EmbracingFL encourages each layer to have similar data representations across clients, improving FL efficiency. The proposed partial model training method guarantees convergence to a neighborhood of stationary points for non-convex and smooth problems. We evaluate the efficacy of EmbracingFL under a variety of settings with a mixed number of strong, moderate (~40% memory), and weak (~15% memory) clients, datasets (CIFAR-10, FEMNIST, and IMDB), and models (ResNet20, CNN, and LSTM). Our empirical study shows that EmbracingFL consistently achieves high accuracy, as if all clients were strong, outperforming the state-of-the-art width reduction methods (i.e., HeteroFL and FjORD).
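
One way to picture the rule "each client trains as many consecutive output-side layers as its system resources allow" is the greedy selection below. The per-layer memory costs, the budgets, and the function name are hypothetical; the abstract does not specify how EmbracingFL measures resources or how the untrained input-side layers are handled on weak clients.

```python
from typing import List


def trainable_layers(layer_costs: List[float], budget: float) -> List[int]:
    """Indices of the consecutive output-side layers that fit in `budget`.

    layer_costs[i] is a hypothetical training-memory cost for layer i,
    ordered from the input side (index 0) to the output side (last index).
    """
    chosen = []
    used = 0.0
    for i in range(len(layer_costs) - 1, -1, -1):  # walk from output to input
        if used + layer_costs[i] > budget:
            break  # stop at the first layer that does not fit, keeping the block consecutive
        used += layer_costs[i]
        chosen.append(i)
    return sorted(chosen)


costs = [4.0, 3.0, 2.0, 1.0]  # made-up costs for a four-layer model
strong = trainable_layers(costs, budget=10.0)
moderate = trainable_layers(costs, budget=4.0)
weak = trainable_layers(costs, budget=1.5)
print(strong, moderate, weak)
```

With these made-up numbers the strong client trains every layer, the moderate client the last two, and the weak client only the output layer.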

[LG-23] A Provably Efficient Option-Based Algorithm for both High-Level and Low-Level Learning

Link: https://arxiv.org/abs/2406.15124
Authors: Gianluca Drappo, Alberto Maria Metelli, Marcello Restelli
Keywords: Hierarchical Reinforcement Learning, Reinforcement Learning, shown successful results, Hierarchical Reinforcement, approaches have shown
Categories: Machine Learning (cs.LG)

Abstract:Hierarchical Reinforcement Learning (HRL) approaches have shown successful results in solving a large variety of complex, structured, long-horizon problems. Nevertheless, a full theoretical understanding of this empirical evidence is currently missing. In the context of the option framework, prior research has devised efficient algorithms for scenarios where options are fixed, and the high-level policy selecting among options only has to be learned. However, the fully realistic scenario in which both the high-level and the low-level policies are learned is surprisingly disregarded from a theoretical perspective. This work makes a step towards the understanding of this latter scenario. Focusing on the finite-horizon problem, we present a meta-algorithm alternating between regret minimization algorithms instanced at different (high and low) temporal abstractions. At the higher level, we treat the problem as a Semi-Markov Decision Process (SMDP), with fixed low-level policies, while at a lower level, inner option policies are learned with a fixed high-level policy. The bounds derived are compared with the lower bound for non-hierarchical finite-horizon problems, allowing us to characterize when a hierarchical approach is provably preferable, even without pre-trained options.

[LG-24] Brain-Like Language Processing via a Shallow Untrained Multihead Attention Network

Link: https://arxiv.org/abs/2406.15109
Authors: Badr AlKhamissi, Greta Tuckute, Antoine Bosselut, Martin Schrimpf
Keywords: Large Language Models, Large Language, brain, alignment, predicting most explainable
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Notes: Preprint

Abstract:Large Language Models (LLMs) have been shown to be effective models of the human language system, with some models predicting most explainable variance of brain activity in current datasets. Even in untrained models, the representations induced by architectural priors can exhibit reasonable alignment to brain data. In this work, we investigate the key architectural components driving the surprising alignment of untrained models. To estimate LLM-to-brain similarity, we first select language-selective units within an LLM, similar to how neuroscientists identify the language network in the human brain. We then benchmark the brain alignment of these LLM units across five different brain recording datasets. By isolating critical components of the Transformer architecture, we identify tokenization strategy and multihead attention as the two major components driving brain alignment. A simple form of recurrence further improves alignment. We further demonstrate this quantitative brain alignment of our model by reproducing landmark studies in the language neuroscience field, showing that localized model units – just like language voxels measured empirically in the human brain – discriminate more reliably between lexical than syntactic differences, and exhibit similar response profiles under the same experimental conditions. Finally, we demonstrate the utility of our model’s representations for language modeling, achieving improved sample and parameter efficiency over comparable architectures. Our model’s estimates of surprisal set a new state-of-the-art in the behavioral alignment to human reading times. Taken together, we propose a highly brain- and behaviorally-aligned model that conceptualizes the human language system as an untrained shallow feature encoder, with structural priors, combined with a trained decoder to achieve efficient and performant language processing.

[LG-25] HLQ: Fast and Efficient Backpropagation via Hadamard Low-rank Quantization

Link: https://arxiv.org/abs/2406.15102
Authors: Seonggon Kim, Eunhyeok Park
Keywords: rapid increase, increase in model, model size, growing importance, Hadamard Low-rank Quantization
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:With the rapid increase in model size and the growing importance of various fine-tuning applications, lightweight training has become crucial. Since the backward pass is twice as expensive as the forward pass, optimizing backpropagation is particularly important. However, modifications to this process can lead to suboptimal convergence, so training optimization should minimize perturbations, which is a highly challenging task. In this study, we introduce a novel optimization strategy called Hadamard Low-rank Quantization (HLQ), focusing on reducing the cost of backpropagation in convolutional and linear layers. We first analyze the sensitivity of gradient computation with respect to activation and weight, and judiciously design the HLQ pipeline to apply 4-bit Hadamard quantization to the activation gradient and Hadamard low-rank approximation to the weight gradient. This combination was found to be the best for maximizing benefits, and our extensive experiments demonstrate the outstanding performance of HLQ in both training from scratch and fine-tuning, achieving significant memory savings and acceleration on real GPUs with negligible quality degradation.
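
To give a feel for quantizing in the Hadamard domain, here is a generic sketch: it applies an orthonormal Walsh-Hadamard transform, rounds the coefficients to a symmetric 4-bit range, and transforms back. It is not the HLQ pipeline; the per-vector scaling rule and the symmetric range of -7 to 7 are assumptions, and the low-rank approximation of the weight gradient is omitted.

```python
import math
from typing import List, Tuple


def fwht(x: List[float]) -> List[float]:
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two).

    With the 1/sqrt(n) scaling the transform is its own inverse.
    """
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    y = list(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in y]


def quantize_int4(v: List[float]) -> Tuple[List[int], float]:
    """Symmetric quantization to integers in [-7, 7] with one scale per vector."""
    peak = max(abs(t) for t in v)
    scale = peak / 7.0 if peak > 0 else 1.0
    q = [max(-7, min(7, round(t / scale))) for t in v]
    return q, scale


def hadamard_quant_roundtrip(x: List[float]) -> List[float]:
    """Transform, quantize to 4 bits, dequantize, and transform back."""
    q, scale = quantize_int4(fwht(x))
    return fwht([t * scale for t in q])
```

The transform spreads an outlier across all coefficients, which is the usual motivation for quantizing after a Hadamard rotation rather than directly.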

[LG-26] How Intermodal Interaction Affects the Performance of Deep Multimodal Fusion for Mixed-Type Time Series

Link: https://arxiv.org/abs/2406.15098
Authors: Simon Dietz, Thomas Altstidl, Dario Zanca, Björn Eskofier, An Nguyen
Keywords: Mixed-type time series, Mixed-type time, environmental monitoring, time series, social media
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Mixed-type time series (MTTS) is a bimodal data type that is common in many domains, such as healthcare, finance, environmental monitoring, and social media. It consists of regularly sampled continuous time series and irregularly sampled categorical event sequences. The integration of both modalities through multimodal fusion is a promising approach for processing MTTS. However, the question of how to effectively fuse both modalities remains open. In this paper, we present a comprehensive evaluation of several deep multimodal fusion approaches for MTTS forecasting. Our comparison includes three fusion types (early, intermediate, and late) and five fusion methods (concatenation, weighted mean, weighted mean with correlation, gating, and feature sharing). We evaluate these fusion approaches on three distinct datasets, one of which was generated using a novel framework. This framework allows for the control of key data properties, such as the strength and direction of intermodal interactions, modality imbalance, and the degree of randomness in each modality, providing a more controlled environment for testing fusion approaches. Our findings show that the performance of different fusion approaches can be substantially influenced by the direction and strength of intermodal interactions. The study reveals that early and intermediate fusion approaches excel at capturing fine-grained and coarse-grained cross-modal features, respectively. These findings underscore the crucial role of intermodal interactions in determining the most effective fusion strategy for MTTS forecasting.

[LG-27] Towards General Negotiation Strategies with End-to-End Reinforcement Learning

Link: https://arxiv.org/abs/2406.15096
Authors: Bram M. Renting, Thomas M. Moerland, Holger H. Hoos, Catholijn M. Jonker
Keywords: research field, field of automated, long history, history of designing, negotiation
Categories: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
Notes: Accepted at the Reinforcement Learning Conference (RLC) 2024

Abstract:The research field of automated negotiation has a long history of designing agents that can negotiate with other agents. Such negotiation strategies are traditionally based on manual design and heuristics. More recently, reinforcement learning approaches have also been used to train agents to negotiate. However, negotiation problems are diverse, causing observation and action dimensions to change, which cannot be handled by default linear policy networks. Previous work on this topic has circumvented this issue either by fixing the negotiation problem, causing policies to be non-transferable between negotiation problems, or by abstracting the observations and actions into fixed-size representations, causing loss of information and expressiveness due to feature design. We developed an end-to-end reinforcement learning method for diverse negotiation problems by representing observations and actions as a graph and applying graph neural networks in the policy. With empirical evaluations, we show that our method is effective and that we can learn to negotiate with other agents on never-before-seen negotiation problems. Our result opens up new opportunities for reinforcement learning in negotiation agents.

[LG-28] GOAL: A Generalist Combinatorial Optimization Agent Learner

Link: https://arxiv.org/abs/2406.15079
Authors: Darko Drakulic, Sofia Michel, Jean-Marc Andreoli
Keywords: Machine Learning-based heuristics, Machine Learning-based, recently shown impressive, shown impressive performance, Learning-based heuristics
Categories: Machine Learning (cs.LG)

Abstract:Machine Learning-based heuristics have recently shown impressive performance in solving a variety of hard combinatorial optimization problems (COPs). However, they generally rely on a separate neural model, specialized and trained for each single problem. Any variation of a problem requires adjustment of its model and re-training from scratch. In this paper, we propose GOAL (for Generalist combinatorial Optimization Agent Learning), a generalist model capable of efficiently solving multiple COPs and which can be fine-tuned to solve new COPs. GOAL consists of a single backbone plus light-weight problem-specific adapters, mostly for input and output processing. The backbone is based on a new form of mixed-attention blocks which allow handling problems defined on graphs with arbitrary combinations of node, edge and instance-level features. Additionally, problems which involve heterogeneous nodes or edges, such as in multi-partite graphs, are handled through a novel multi-type transformer architecture, where the attention blocks are duplicated to attend only to the relevant combination of types while relying on the same shared parameters. We train GOAL on a set of routing, scheduling and classic graph problems and show that it is only slightly inferior to the specialized baselines while being the first multi-task model that solves a variety of COPs. Finally, we showcase the strong transfer learning capacity of GOAL by fine-tuning or learning the adapters for new problems, with only a few shots and little data.

[LG-29] Neural Incremental Data Assimilation

Link: https://arxiv.org/abs/2406.15076
Authors: Matthieu Blanke, Ronan Fablet, Marc Lelarge
Keywords: geophysical applications, weather forecasting, central problem, Data assimilation, data assimilation methods
Categories: Machine Learning (cs.LG)

Abstract:Data assimilation is a central problem in many geophysical applications, such as weather forecasting. It aims to estimate the state of a potentially large system, such as the atmosphere, from sparse observations, supplemented by prior physical knowledge. The size of the systems involved and the complexity of the underlying physical equations make it a challenging task from a computational point of view. Neural networks represent a promising method of emulating the physics at low cost, and therefore have the potential to considerably improve and accelerate data assimilation. In this work, we introduce a deep learning approach where the physical system is modeled as a sequence of coarse-to-fine Gaussian prior distributions parametrized by a neural network. This allows us to define an assimilation operator, which is trained in an end-to-end fashion to minimize the reconstruction error on a dataset with different observation processes. We illustrate our approach on chaotic dynamical physical systems with sparse observations, and compare it to traditional variational data assimilation methods.

[LG-30] Tempora-Fusion: Time-Lock Puzzle with Efficient Verifiable Homomorphic Linear Combination

Link: https://arxiv.org/abs/2406.15070
Authors: Aydin Abadi
Keywords: securely transmit sensitive, transmit sensitive information, securely transmit, transmit sensitive, sensitive information
Categories: Cryptography and Security (cs.CR); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)

Abstract:To securely transmit sensitive information into the future, Time-Lock Puzzles (TLPs) have been developed. Their applications include scheduled payments, timed commitments, e-voting, and sealed-bid auctions. Homomorphic TLP is a key variant of TLP that enables computation on puzzles from different clients. This allows a solver/server to tackle only a single puzzle encoding the computation’s result. However, existing homomorphic TLPs lack support for verifying the correctness of the computation results. We address this limitation by introducing Tempora-Fusion, a TLP that allows a server to perform homomorphic linear combinations of puzzles from different clients while ensuring verification of computation correctness. This scheme avoids asymmetric-key cryptography for verification, thus paving the way for efficient implementations. We discuss our scheme’s application in various domains, such as federated learning, scheduled payments in online banking, and e-voting.

[LG-31] Latent Space Translation via Inverse Relative Projection

Link: https://arxiv.org/abs/2406.15057
Authors: Valentino Maiorca, Luca Moschella, Marco Fumero, Francesco Locatello, Emanuele Rodolà
Keywords: representation learning community, Latent space, latent space translation, Latent space communication, sparked significant interest
Categories: Machine Learning (cs.LG)
Notes: arXiv admin note: text overlap with arXiv:2311.00664, arXiv:2406.11014

Abstract:The emergence of similar representations between independently trained neural models has sparked significant interest in the representation learning community, leading to the development of various methods to obtain communication between latent spaces. “Latent space communication” can be achieved in two ways: i) by independently mapping the original spaces to a shared or relative one; ii) by directly estimating a transformation from a source latent space to a target one. In this work, we combine the two into a novel method to obtain latent space translation through the relative space. By formalizing the invertibility of angle-preserving relative representations and assuming the scale invariance of decoder modules in neural models, we can effectively use the relative space as an intermediary, independently projecting onto and from other semantically similar spaces. Extensive experiments over various architectures and datasets validate our scale invariance assumption and demonstrate the high accuracy of our method in latent space translation. We also apply our method to zero-shot stitching between arbitrary pre-trained text and image encoders and their classifiers, even across modalities. Our method has significant potential for facilitating the reuse of models in a practical manner via compositionality.

[LG-32] Tri-VQA: Triangular Reasoning Medical Visual Question Answering for Multi-Attribute Analysis

Link: https://arxiv.org/abs/2406.15050
Authors: Lin Fan, Xun Gong, Cenyang Zheng, Yafei Ou
Keywords: Visual Question Answering, challenging research topic, advantages including patient, including patient engagement, clinical expert involvement
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Medical Visual Question Answering (Med-VQA) is a challenging research topic with advantages including patient engagement and clinical expert involvement for second opinions. However, existing Med-VQA methods based on joint embedding fail to explain whether their provided results are based on correct reasoning or coincidental answers, which undermines the credibility of VQA answers. In this paper, we investigate the construction of a more cohesive and stable Med-VQA structure. Motivated by causal effect, we propose a novel Triangular Reasoning VQA (Tri-VQA) framework, which constructs reverse causal questions from the perspective of “Why this answer?” to elucidate the source of the answer and stimulate more reasonable forward reasoning processes. We evaluate our method on the Endoscopic Ultrasound (EUS) multi-attribute annotated dataset from five centers, and test it on medical VQA datasets. Experimental results demonstrate the superiority of our approach over existing methods. Our code and pre-trained models are available at https://anonymous.4open.science/r/Tri_VQA.

[LG-33] From Overfitting to Robustness: Quantity, Quality, and Variety Oriented Negative Sample Selection in Graph Contrastive Learning

Link: https://arxiv.org/abs/2406.15044
Authors: Adnan Ali, Jinlong Li, Huanhuan Chen, Ali Kashif Bashir
Keywords: contrast positive-negative counterparts, negative samples, negative sample pools, graph data augmentation, negative
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Graph contrastive learning (GCL) aims to contrast positive-negative counterparts to learn the node embeddings, whereas graph data augmentation methods are employed to generate these positive-negative samples. The variation, quantity, and quality of negative samples compared to positive samples play crucial roles in learning meaningful embeddings for node classification downstream tasks. Less variation, excessive quantity, and low-quality negative samples cause the model to be overfitted for particular nodes, resulting in less robust models. To solve the overfitting problem in the GCL paradigm, this study proposes a novel Cumulative Sample Selection (CSS) algorithm by comprehensively considering negative samples’ quality, variations, and quantity. Initially, three negative sample pools are constructed: easy, medium, and hard negative samples, which contain 25%, 50%, and 25% of the total available negative samples, respectively. Then, 10% of the negative samples are selected from each of these three negative sample pools for training the model. After that, a decision agent module evaluates model training results and decides whether to explore more negative samples from the three negative sample pools by increasing the ratio or to keep exploiting the current sampling ratio. The proposed algorithm is integrated into a proposed graph contrastive learning framework named NegAmplify. NegAmplify is compared with the SOTA methods on nine graph node classification datasets, achieving better node classification accuracy on seven of them, with up to 2.86% improvement.
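
The pool construction and initial sampling described in this abstract can be sketched roughly as below. The abstract does not say how difficulty is measured, so the `scores` argument (higher meaning harder, for example similarity to the anchor node) is an assumption, as are the function names; the decision agent that later adjusts the ratio is not modeled.

```python
import random
from typing import Dict, List, Sequence


def build_pools(scores: Sequence[float]) -> Dict[str, List[int]]:
    """Split negative-sample indices into easy/medium/hard pools (25/50/25)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])  # easiest first
    n = len(order)
    q1, q3 = n // 4, n - n // 4
    return {"easy": order[:q1], "medium": order[q1:q3], "hard": order[q3:]}


def sample_from_pools(
    pools: Dict[str, List[int]], ratio: float, rng: random.Random
) -> List[int]:
    """Draw the same fraction of samples from every pool, without replacement."""
    picked: List[int] = []
    for name in ("easy", "medium", "hard"):
        pool = pools[name]
        k = max(1, round(ratio * len(pool))) if pool else 0
        picked.extend(rng.sample(pool, k))
    return picked


if __name__ == "__main__":
    rng = random.Random(0)
    scores = [rng.random() for _ in range(40)]  # hypothetical difficulty scores
    pools = build_pools(scores)
    print({name: len(p) for name, p in pools.items()})
    print(sample_from_pools(pools, ratio=0.10, rng=rng))
```

Raising `ratio` on a later round would correspond to the "explore more negative samples" branch mentioned in the abstract.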

[LG-34] Discovering Common Information in Multi-view Data

Link: https://arxiv.org/abs/2406.15043
Authors: Qi Zhang, Mingfei Lu, Shujian Yu, Jingmin Xin, Badong Chen
Keywords: mathematically rigorous definition, computing common information, Gács-Körner common information, drawing inspiration, common information
Categories: Machine Learning (cs.LG)
Notes: Manuscript accepted by Information Fusion (this https URL). We have updated a few descriptions for clarity. Code is available at this https URL

Abstract:We introduce an innovative and mathematically rigorous definition for computing common information from multi-view data, drawing inspiration from Gács-Körner common information in information theory. Leveraging this definition, we develop a novel supervised multi-view learning framework to capture both common and unique information. By explicitly minimizing a total correlation term, the extracted common information and the unique information from each view are forced to be independent of each other, which, in turn, theoretically guarantees the effectiveness of our framework. To estimate information-theoretic quantities, our framework employs matrix-based Rényi’s $\alpha$-order entropy functional, which forgoes the need for variational approximation and distributional estimation in high-dimensional space. Theoretical proof is provided that our framework can faithfully discover both common and unique information from multi-view data. Experiments on synthetic and seven benchmark real-world datasets demonstrate the superior performance of our proposed framework over state-of-the-art approaches.

[LG-35] Behaviour Distillation

Link: https://arxiv.org/abs/2406.15042
Authors: Andrei Lupu, Chris Lu, Jarek Liesen, Robert Tjarko Lange, Jakob Foerster
Keywords: small number, drop-in replacements, condense large datasets, distillation, datasets
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes: Published as a conference paper at ICLR 2024

Abstract:Dataset distillation aims to condense large datasets into a small number of synthetic examples that can be used as drop-in replacements when training new models. It has applications to interpretability, neural architecture search, privacy, and continual learning. Despite strong successes in supervised domains, such methods have not yet been extended to reinforcement learning, where the lack of a fixed dataset renders most distillation methods unusable. Filling the gap, we formalize behaviour distillation, a setting that aims to discover and then condense the information required for training an expert policy into a synthetic dataset of state-action pairs, without access to expert data. We then introduce Hallucinating Datasets with Evolution Strategies (HaDES), a method for behaviour distillation that can discover datasets of just four state-action pairs which, under supervised learning, train agents to competitive performance levels in continuous control tasks. We show that these datasets generalize out of distribution to training policies with a wide range of architectures and hyperparameters. We also demonstrate application to a downstream task, namely training multi-task agents in a zero-shot fashion. Beyond behaviour distillation, HaDES provides significant improvements in neuroevolution for RL over previous approaches and achieves SoTA results on one standard supervised dataset distillation task. Finally, we show that visualizing the synthetic datasets can provide human-interpretable task insights.

[LG-36] Online detection and infographic explanation of spam reviews with data drift adaptation

Link: https://arxiv.org/abs/2406.15038
Authors: Francisco de Arriba-Pérez, Silvia García-Méndez, Fátima Leal, Benedita Malheiro, J. C. Burguillo
Keywords: online platforms due, impact on reputation, platforms due, significant impact, Spam
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Social and Information Networks (cs.SI)

Abstract:Spam reviews are a pervasive problem on online platforms due to their significant impact on reputation. However, research into spam detection in data streams is scarce. Another concern lies in their need for transparency. Consequently, this paper addresses those problems by proposing an online solution for identifying and explaining spam reviews, incorporating data drift adaptation. It integrates (i) incremental profiling, (ii) data drift detection and adaptation, and (iii) identification of spam reviews employing Machine Learning. The explainable mechanism displays a visual and textual prediction explanation in a dashboard. The best results obtained reached up to 87% spam F-measure.

[LG-37] Using Neural Networks for Data Cleaning in Weather Datasets

Link: https://arxiv.org/abs/2406.15027
Authors: Jack R. P. Hanslope, Laurence Aitchison
Keywords: climate science, IBTrACS, Abstract, storm location, network
Categories: Machine Learning (cs.LG)
Notes: 6 pages, 2 figures, ICML 2024 Workshop on Machine Learning for Earth System Modeling

Abstract:In climate science, we often want to compare across different datasets. Difficulties can arise in doing this due to inevitable mismatches that arise between observational and reanalysis data, or even between different reanalyses. This misalignment can raise problems for any work that seeks to make inferences about one dataset from another. We considered tropical cyclone location as an example task with one dataset providing atmospheric conditions (ERA5) and another providing storm tracks (IBTrACS). We found that while the examples often aligned well, there were a considerable proportion (around 25%) which were not well aligned. We trained a neural network to map from the wind field to the storm location; in this setting misalignment in the datasets appears as “label noise” (i.e. the labelled storm location does not correspond to the underlying wind field). We found that this neural network trained only on the often noisy labels from IBTrACS had a denoising effect, and performed better than the IBTrACS labels themselves, as measured by human preferences. Remarkably, this even held true for training points, on which we might have expected the network to overfit to the IBTrACS predictions.

[LG-38] SiT: Symmetry-Invariant Transformers for Generalisation in Reinforcement Learning

Link: https://arxiv.org/abs/2406.15025
Authors: Matthias Weissenbacher, Rishabh Agarwal, Yoshinobu Kawahara
Keywords: reinforcement learning, semantically-similar environments, open challenge, challenge in reinforcement, effective deployment
Categories: Machine Learning (cs.LG)
Notes: 9 main pages, accepted to ICML2024

Abstract:An open challenge in reinforcement learning (RL) is the effective deployment of a trained policy to new or slightly different situations as well as semantically-similar environments. We introduce Symmetry-Invariant Transformer (SiT), a scalable vision transformer (ViT) that leverages both local and global data patterns in a self-supervised manner to improve generalisation. Central to our approach is Graph Symmetric Attention, which refines the traditional self-attention mechanism to preserve graph symmetries, resulting in invariant and equivariant latent representations. We showcase SiT’s superior generalization over ViTs on MiniGrid and Procgen RL benchmarks, and its sample efficiency on Atari 100k and CIFAR10.

[LG-39] Probabilistic and Differentiable Wireless Simulation with Geometric Transformers

Link: https://arxiv.org/abs/2406.14995
Authors: Thomas Hehn, Markus Peschl, Tribhuvanesh Orekondy, Arash Behboodi, Johann Brehmer
Keywords: modern communication systems, designing modern communication, communication systems, critical for designing, designing modern
Categories: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP); Machine Learning (stat.ML)

Abstract:Modelling the propagation of electromagnetic signals is critical for designing modern communication systems. While there are precise simulators based on ray tracing, they do not lend themselves to solving inverse problems or the integration in an automated design loop. We propose to address these challenges through differentiable neural surrogates that exploit the geometric aspects of the problem. We first introduce the Wireless Geometric Algebra Transformer (Wi-GATr), a generic backbone architecture for simulating wireless propagation in a 3D environment. It uses versatile representations based on geometric algebra and is equivariant with respect to E(3), the symmetry group of the underlying physics. Second, we study two algorithmic approaches to signal prediction and inverse problems based on differentiable predictive modelling and diffusion models. We show how these let us predict received power, localize receivers, and reconstruct the 3D environment from the received signal. Finally, we introduce two large, geometry-focused datasets of wireless signal propagation in indoor scenes. In experiments, we show that our geometry-forward approach achieves higher-fidelity predictions with less data than various baselines.

[LG-40] Learning Variable Compliance Control From a Few Demonstrations for Bimanual Robot with Haptic Feedback Teleoperation System

Link: https://arxiv.org/abs/2406.14990
Authors: Tatsuya Kamijo, Cristian C. Beltran-Hernandez, Masashi Hamaya
Keywords: Automating dexterous, challenge in robotics, significant challenge, rigid robots, compliance control
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Automating dexterous, contact-rich manipulation tasks using rigid robots is a significant challenge in robotics. Rigid robots, defined by their actuation through position commands, face issues of excessive contact forces due to their inability to adapt to contact with the environment, potentially causing damage. While compliance control schemes have been introduced to mitigate these issues by controlling forces via external sensors, they are hampered by the need for fine-tuning task-specific controller parameters. Learning from Demonstrations (LfD) offers an intuitive alternative, allowing robots to learn manipulations through observed actions. In this work, we introduce a novel system to enhance the teaching of dexterous, contact-rich manipulations to rigid robots. Our system is twofold: firstly, it incorporates a teleoperation interface utilizing Virtual Reality (VR) controllers, designed to provide an intuitive and cost-effective method for task demonstration with haptic feedback. Secondly, we present Comp-ACT (Compliance Control via Action Chunking with Transformers), a method that leverages the demonstrations to learn variable compliance control from a few demonstrations. Our methods have been validated across various complex contact-rich manipulation tasks using single-arm and bimanual robot setups in simulated and real-world environments, demonstrating the effectiveness of our system in teaching robots dexterous manipulations with enhanced adaptability and safety.

[LG-41] Hierarchical thematic classification of major conference proceedings

Link: https://arxiv.org/abs/2406.14983
Authors: Arsentii Kuzmin, Alexander Aduenko, Vadim Strijov
Keywords: decision support system, hierarchical similarity function, hierarchical, develop a decision, decision support
Categories: Machine Learning (cs.LG); Information Retrieval (cs.IR); Machine Learning (stat.ML)

Abstract:In this paper, we develop a decision support system for hierarchical text classification. We consider text collections with a fixed hierarchical structure of topics given by experts in the form of a tree. The system sorts the topics by relevance to a given document. The experts choose one of the most relevant topics to finish the classification. We propose a weighted hierarchical similarity function to calculate topic relevance. The function calculates the similarity of a document and a tree branch. The weights in this function determine word importance. We use the entropy of words to estimate the weights. The proposed hierarchical similarity function formulates a joint hierarchical thematic classification probability model of the document topics, parameters, and hyperparameters. The variational Bayesian inference gives a closed-form EM algorithm. The EM algorithm estimates the parameters and calculates the probability of a topic for a given document. Compared to hierarchical multiclass SVM, hierarchical PLSA with adaptive regularization, and hierarchical naive Bayes, the weighted hierarchical similarity function achieves better ranking accuracy on an abstract collection from the major conference EURO and on a website collection of industrial companies.

[LG-42] Domain Adaptation of Llama3-70B-Instruct through Continual Pre-Training and Model Merging: A Comprehensive Evaluation

Link: https://arxiv.org/abs/2406.14971
Authors: Shamane Siriwardhana,Mark McQuade,Thomas Gauthier,Lucas Atkins,Fernando Fernandes Neto,Luke Meyers,Anneketh Vij,Tyler Odenthal,Charles Goddard,Mary MacCarthy,Jacob Solawetz
Keywords: conducted extensive experiments, SEC data, exploring its performance, conducted extensive, extensive experiments
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 8 pages, 6 figures

Abstract:We conducted extensive experiments on domain adaptation of the Meta-Llama-3-70B-Instruct model on SEC data, exploring its performance on both general and domain-specific benchmarks. Our focus included continual pre-training (CPT) and model merging, aiming to enhance the model’s domain-specific capabilities while mitigating catastrophic forgetting. Through this study, we evaluated the impact of integrating financial regulatory data into a robust language model and examined the effectiveness of our model merging techniques in preserving and improving the model’s instructive abilities. The model is accessible on Hugging Face as arcee-ai/Llama-3-SEC-Base. This is an intermediate checkpoint of our final model, which has seen 20B tokens so far. The full model is still in the process of training. This is a preprint technical report with thorough evaluations to understand the entire process.

[LG-43] Uni-Mol2: Exploring Molecular Pretraining Model at Scale

Link: https://arxiv.org/abs/2406.14969
Authors: Xiaohong Ji,Wang Zhen,Zhifeng Gao,Hang Zheng,Linfeng Zhang,Guolin Ke,Weinan E
Keywords: natural language processing, made significant advancements, molecular pretraining models, molecular pretraining, significant advancements
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:In recent years, pretraining models have made significant advancements in the fields of natural language processing (NLP), computer vision (CV), and life sciences. The significant advancements in NLP and CV are predominantly driven by the expansion of model parameters and data size, a phenomenon now recognized as the scaling laws. However, scaling laws in molecular pretraining models remain unexplored. In this work, we present Uni-Mol2, an innovative molecular pretraining model that leverages a two-track transformer to effectively integrate features at the atomic level, graph level, and geometry structure level. Along with this, we systematically investigate the scaling law within molecular pretraining models, characterizing the power-law correlations between validation loss and model size, dataset size, and computational resources. Consequently, we successfully scale Uni-Mol2 to 1.1 billion parameters through pretraining on 800 million conformations, making it the largest molecular pretraining model to date. Extensive experiments show consistent improvement in the downstream tasks as the model size grows. The Uni-Mol2 with 1.1B parameters also outperforms existing methods, achieving an average 27% improvement on the QM9 and 14% on COMPAS-1D dataset.
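
The power-law relationship between validation loss and model size mentioned in this abstract can be illustrated with a small sketch. It assumes the simple form loss = a * size^(-b) and fits it by linear regression in log-log space. The function name `fit_power_law` and the data points are made up for illustration and are not taken from the paper.

```python
import math

def fit_power_law(sizes, losses):
    """Least-squares fit of log(loss) = log(a) - b * log(size); returns (a, b)."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

# Synthetic (made-up) points generated from loss = 5.0 * size ** -0.3.
sizes = [1e6, 1e7, 1e8, 1e9]
losses = [5.0 * s ** -0.3 for s in sizes]
a, b = fit_power_law(sizes, losses)
```

On noise-free synthetic data the fit recovers the generating coefficient and exponent; real loss curves would need more care (noise, irreducible-loss offsets, and so on).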

[LG-44] Optimised Grouped-Query Attention Mechanism for Transformers

Link: https://arxiv.org/abs/2406.14963
Authors: Yuang Chen,Cheng Zhang,Xitong Gao,Robert D. Mullins,George A. Constantinides,Yiren Zhao
Keywords: Grouped-query attention, multi-head attention, MHA, widely adopted, adopted in LLMs
Categories: Machine Learning (cs.LG)
Comments: Accepted at ICML2024 ES-FoMo-II Workshop

Abstract:Grouped-query attention (GQA) has been widely adopted in LLMs to mitigate the complexity of multi-head attention (MHA). To transform an MHA to a GQA, neighbour queries in MHA are evenly split into groups where each group shares the value and key layers. In this work, we propose AsymGQA, an activation-informed approach to asymmetrically grouping an MHA to a GQA for better model performance. Our AsymGQA outperforms the GQA within the same model size budget. For example, AsymGQA LLaMA-2-7B has an accuracy increase of 7.5% on MMLU compared to neighbour grouping. Our approach addresses the GQA’s trade-off problem between model performance and hardware efficiency.
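
The neighbour grouping that this abstract uses as its baseline can be sketched as follows. Adjacent heads are split evenly into groups, and each group's key (or value) weights are merged into one shared set. Mean-pooling is assumed here as the merge rule, and the toy weight vectors are invented. The activation-informed asymmetric grouping of AsymGQA is not reproduced.

```python
def neighbour_groups(num_heads, num_groups):
    """Split head indices 0..num_heads-1 into equal-sized runs of adjacent heads."""
    if num_heads % num_groups != 0:
        raise ValueError("num_heads must be divisible by num_groups")
    size = num_heads // num_groups
    return [list(range(g * size, (g + 1) * size)) for g in range(num_groups)]

def mean_pool_heads(head_weights, groups):
    """Average the per-head weight vectors inside each group into one shared vector."""
    pooled = []
    for group in groups:
        dim = len(head_weights[group[0]])
        pooled.append([sum(head_weights[h][i] for h in group) / len(group)
                       for i in range(dim)])
    return pooled

groups = neighbour_groups(num_heads=8, num_groups=2)
# Toy key weights (an assumption): head h is represented by the vector [h, 2h].
key_heads = [[float(h), 2.0 * h] for h in range(8)]
shared_keys = mean_pool_heads(key_heads, groups)
```

With 8 heads and 2 groups, heads 0 to 3 share one key vector and heads 4 to 7 share another.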

[LG-45] Unlocking the Global Synergies in Low-Rank Adapters

Link: https://arxiv.org/abs/2406.14956
Authors: Zixi Zhang,Cheng Zhang,Xitong Gao,Robert D. Mullins,George A. Constantinides,Yiren Zhao
Keywords: Low-rank Adaption, de-facto parameter-efficient fine-tuning, parameter-efficient fine-tuning technique, large language models, de-facto parameter-efficient
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Accepted at ICML2024 ES-FoMo-II Workshop

Abstract:Low-rank Adaption (LoRA) has been the de-facto parameter-efficient fine-tuning technique for large language models. We present HeteroLoRA, a light-weight search algorithm that leverages zero-cost proxies to allocate the limited LoRA trainable parameters across the model for better fine-tuned performance. In addition to the allocation for the standard LoRA-adapted models, we also demonstrate the efficacy of HeteroLoRA by performing the allocation in a more challenging search space that includes LoRA modules and LoRA-adapted shortcut connections. Experiments show that HeteroLoRA enables improvements in model performance given the same parameter budget. For example, on MRPC, we see an improvement of 1.6% in accuracy with similar training parameter budget. We will open-source our algorithm once the paper is accepted.
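
To make "the same parameter budget" concrete, the sketch below counts LoRA trainable parameters for a uniform and a non-uniform rank allocation. It relies only on the standard LoRA parameter count, rank * (d_in + d_out) per adapted layer. The layer shapes and both allocations are hypothetical, and the search procedure of HeteroLoRA is not shown.

```python
def lora_params(d_in, d_out, rank):
    """Trainable parameters of one LoRA adapter: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)

# Hypothetical layer shapes (d_in, d_out); not taken from the paper.
layers = {"q_proj": (1024, 1024), "v_proj": (1024, 1024), "up_proj": (1024, 4096)}

# Uniform allocation: the same rank everywhere.
uniform = {name: 8 for name in layers}
# A made-up non-uniform allocation that moves rank from up_proj to the attention projections.
skewed = {"q_proj": 13, "v_proj": 13, "up_proj": 4}

def total_params(ranks):
    """Sum the adapter parameters over all layers for a given rank assignment."""
    return sum(lora_params(*layers[name], ranks[name]) for name in layers)
```

Both allocations use the same number of trainable parameters, so any difference in fine-tuned quality would come from where the rank is placed rather than from how much is spent.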

[LG-46] Deep Imbalanced Regression to Estimate Vascular Age from PPG Data: a Novel Digital Biomarker for Cardiovascular Health

Link: https://arxiv.org/abs/2406.14953
Authors: Guangkun Nie,Qinghao Zhao,Gongzheng Tang,Jun Li,Shenda Hong
Keywords: monitoring human hemodynamics, recent studies highlighting, assessing vascular aging, human hemodynamics, Dist Loss
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Signal Processing (eess.SP)

Abstract:Photoplethysmography (PPG) is emerging as a crucial tool for monitoring human hemodynamics, with recent studies highlighting its potential in assessing vascular aging through deep learning. However, real-world age distributions are often imbalanced, posing significant challenges for deep learning models. In this paper, we introduce a novel, simple, and effective loss function named the Dist Loss to address deep imbalanced regression tasks. We trained a one-dimensional convolutional neural network (Net1D) incorporating the Dist Loss on the extensive UK Biobank dataset (n=502,389) to estimate vascular age from PPG signals and validate its efficacy in characterizing cardiovascular health. The model’s performance was validated on a 40% held-out test set, achieving state-of-the-art results, especially in regions with small sample sizes. Furthermore, we divided the population into three subgroups based on the difference between predicted vascular age and chronological age: less than -10 years, between -10 and 10 years, and greater than 10 years. We analyzed the relationship between predicted vascular age and several cardiovascular events over a follow-up period of up to 10 years, including death, coronary heart disease, and heart failure. Our results indicate that the predicted vascular age has significant potential to reflect an individual’s cardiovascular health status. Our code will be available at this https URL.

[LG-47] An Idiosyncrasy of Time-discretization in Reinforcement Learning

Link: https://arxiv.org/abs/2406.14951
Authors: Kris De Asis,Richard S. Sutton
Keywords: discrete time steps, agent interacts, discrete time, time steps, reinforcement learning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: RLC 2024

Abstract:Many reinforcement learning algorithms are built on an assumption that an agent interacts with an environment over fixed-duration, discrete time steps. However, physical systems are continuous in time, requiring a choice of time-discretization granularity when digitally controlling them. Furthermore, such systems do not wait for decisions to be made before advancing the environment state, necessitating the study of how the choice of discretization may affect a reinforcement learning algorithm. In this work, we consider the relationship between the definitions of the continuous-time and discrete-time returns. Specifically, we acknowledge an idiosyncrasy with naively applying a discrete-time algorithm to a discretized continuous-time environment, and note how a simple modification can better align the return definitions. This observation is of practical consideration when dealing with environments where time-discretization granularity is a choice, or situations where such granularity is inherently stochastic.
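
The mismatch between continuous-time and discrete-time returns can be seen in a small numerical sketch. It assumes a constant reward rate and compares the integral of gamma^t * r(t) with two discretized sums, one that adds raw per-step rewards and one that scales each reward by the step size (the textbook Riemann-sum correction). This scaling is an illustrative assumption and may differ from the modification the paper proposes.

```python
import math

def discrete_return(reward_rate, gamma, dt, horizon, scale_by_dt):
    """Sum discounted rewards sampled every dt time units up to `horizon`."""
    steps = int(round(horizon / dt))
    total = 0.0
    for k in range(steps):
        # Optionally weight each sampled reward by the step duration.
        reward = reward_rate * (dt if scale_by_dt else 1.0)
        total += (gamma ** (k * dt)) * reward
    return total

def continuous_return(reward_rate, gamma, horizon):
    """Closed form of the integral of gamma**t * reward_rate over [0, horizon]."""
    return reward_rate * (gamma ** horizon - 1.0) / math.log(gamma)

exact = continuous_return(1.0, 0.9, 50.0)
naive = discrete_return(1.0, 0.9, 0.01, 50.0, scale_by_dt=False)
scaled = discrete_return(1.0, 0.9, 0.01, 50.0, scale_by_dt=True)
```

The unscaled sum grows roughly in proportion to 1/dt as the step shrinks, while the scaled sum stays close to the integral.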

[LG-48] On the growth of the parameters of approximating ReLU neural networks

Link: https://arxiv.org/abs/2406.14936
Authors: Erion Morina,Martin Holler
Keywords: fully connected feed, connected feed forward, feed forward ReLU, smooth function, forward ReLU neural
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA)

Abstract:This work focuses on the analysis of fully connected feed forward ReLU neural networks as they approximate a given, smooth function. In contrast to conventionally studied universal approximation properties under increasing architectures, e.g., in terms of width or depth of the networks, we are concerned with the asymptotic growth of the parameters of approximating networks. Such results are of interest, e.g., for error analysis or consistency results for neural network training. The main result of our work is that, for a ReLU architecture with state of the art approximation error, the realizing parameters grow at most polynomially. The obtained rate with respect to a normalized network size is compared to existing results and is shown to be superior in most cases, in particular for high dimensional input.

[LG-49] Efficient Graph Similarity Computation with Alignment Regularization

Link: https://arxiv.org/abs/2406.14929
Authors: Wei Zhuo,Guang Tan
Keywords: graph edit distance, Graph Neural Networks, graph similarity computation, edit distance, methods treat GSC
Categories: Machine Learning (cs.LG)
Comments: NeurIPS 2022

Abstract:We consider the graph similarity computation (GSC) task based on graph edit distance (GED) estimation. State-of-the-art methods treat GSC as a learning-based prediction task using Graph Neural Networks (GNNs). To capture fine-grained interactions between pair-wise graphs, these methods mostly contain a node-level matching module in the end-to-end learning pipeline, which causes high computational costs in both the training and inference stages. We show that the expensive node-to-node matching module is not necessary for GSC, and high-quality learning can be attained with a simple yet powerful regularization technique, which we call the Alignment Regularization (AReg). In the training stage, the AReg term imposes a node-graph correspondence constraint on the GNN encoder. In the inference stage, the graph-level representations learned by the GNN encoder are directly used to compute the similarity score without using AReg again to speed up inference. We further propose a multi-scale GED discriminator to enhance the expressive ability of the learned representations. Extensive experiments on real-world datasets demonstrate the effectiveness, efficiency and transferability of our approach.

[LG-50] LLM2FEA: Discover Novel Designs with Generative Evolutionary Multitasking

Link: https://arxiv.org/abs/2406.14917
Authors: Melvin Wong,Jiao Liu,Thiago Rios,Stefan Menzel,Yew Soon Ong
Keywords: generative artificial intelligence, high-quality images, rapid research, research and development, artificial intelligence
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

Abstract:The rapid research and development of generative artificial intelligence has enabled the generation of high-quality images, text, and 3D models from text prompts. This advancement impels an inquiry into whether these models can be leveraged to create digital artifacts for both creative and engineering applications. Drawing on innovative designs from other domains may be one answer to this question, much like the historical practice of “bionics”, where humans have sought inspiration from nature’s exemplary designs. This raises the intriguing possibility of using generative models to simultaneously tackle design tasks across multiple domains, facilitating cross-domain learning and resulting in a series of innovative design solutions. In this paper, we propose LLM2FEA as the first attempt to discover novel designs in generative models by transferring knowledge across multiple domains. By utilizing a multi-factorial evolutionary algorithm (MFEA) to drive a large language model, LLM2FEA integrates knowledge from various fields to generate prompts that guide the generative model in discovering novel and practical objects. Experimental results in the context of 3D aerodynamic design verify the discovery capabilities of the proposed LLM2FEA. The designs generated by LLM2FEA not only satisfy practicality requirements to a certain degree but also feature novel and aesthetically pleasing shapes, demonstrating the potential applications of LLM2FEA in discovery tasks.

[LG-51] Demonstrating the Efficacy of Kolmogorov-Arnold Networks in Vision Tasks

Link: https://arxiv.org/abs/2406.14916
Authors: Minjong Cheon
Keywords: Kolmogorov-Arnold Network, deep learning, multilayer projections, realm of deep, vision tasks
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:In the realm of deep learning, the Kolmogorov-Arnold Network (KAN) has emerged as a potential alternative to multilayer projections (MLPs). However, its applicability to vision tasks has not been extensively validated. In our study, we demonstrated the effectiveness of KAN for vision tasks through multiple trials on the MNIST, CIFAR10, and CIFAR100 datasets, using a training batch size of 32. Our results showed that while KAN outperformed the original MLP-Mixer on CIFAR10 and CIFAR100, it performed slightly worse than the state-of-the-art ResNet-18. These findings suggest that KAN holds significant promise for vision tasks, and further modifications could enhance its performance in future evaluations. Our contributions are threefold: first, we showcase the efficiency of KAN-based algorithms for visual tasks; second, we provide extensive empirical assessments across various vision benchmarks, comparing KAN’s performance with MLP-Mixer, CNNs, and Vision Transformers (ViT); and third, we pioneer the use of natural KAN layers in visual tasks, addressing a gap in previous research. This paper lays the foundation for future studies on KANs, highlighting their potential as a reliable alternative for image classification tasks.

[LG-52] Towards Dynamic Resource Allocation and Client Scheduling in Hierarchical Federated Learning: A Two-Phase Deep Reinforcement Learning Approach

Link: https://arxiv.org/abs/2406.14910
Authors: Xiaojing Chen,Zhenyuan Li,Wei Ni,Xin Wang,Shunqing Zhang,Yanzan Sun,Shugong Xu,Qingqi Pei
Keywords: shared machine learning, machine learning model, Federated learning, sharing data, viable technique
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Optimization and Control (math.OC)

Abstract:Federated learning (FL) is a viable technique to train a shared machine learning model without sharing data. Hierarchical FL (HFL) system has yet to be studied regarding its multiple levels of energy, computation, communication, and client scheduling, especially when it comes to clients relying on energy harvesting to power their operations. This paper presents a new two-phase deep deterministic policy gradient (DDPG) framework, referred to as “TP-DDPG”, to balance online the learning delay and model accuracy of an FL process in an energy harvesting-powered HFL system. The key idea is that we divide optimization decisions into two groups, and employ DDPG to learn one group in the first phase, while interpreting the other group as part of the environment to provide rewards for training the DDPG in the second phase. Specifically, the DDPG learns the selection of participating clients, and their CPU configurations and the transmission powers. A new straggler-aware client association and bandwidth allocation (SCABA) algorithm efficiently optimizes the other decisions and evaluates the reward for the DDPG. Experiments demonstrate that with substantially reduced number of learnable parameters, the TP-DDPG can quickly converge to effective policies that can shorten the training time of HFL by 39.4% compared to its benchmarks, when the required test accuracy of HFL is 0.9.

[LG-53] MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression

Link: https://arxiv.org/abs/2406.14909
Authors: Tianyu Fu,Haofeng Huang,Xuefei Ning,Genghan Zhang,Boju Chen,Tianqi Wu,Hongyi Wang,Zixiao Huang,Shiyao Li,Shengen Yan,Guohao Dai,Huazhong Yang,Yu Wang
Keywords: Large Language Models, Large Language, demands of Large, Language Models, attention
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 10 pages

Abstract:Sparse attention can effectively mitigate the significant memory and throughput demands of Large Language Models (LLMs) in long contexts. Existing methods typically employ a uniform sparse attention mask, applying the same sparse pattern across different attention heads and input lengths. However, this uniform approach fails to capture the diverse attention patterns inherent in LLMs, ignoring their distinct accuracy-latency trade-offs. To address this challenge, we propose the Mixture of Attention (MoA), which automatically tailors distinct sparse attention configurations to different heads and layers. MoA constructs and navigates a search space of various attention patterns and their scaling rules relative to input sequence lengths. It profiles the model, evaluates potential configurations, and pinpoints the optimal sparse attention compression plan. MoA adapts to varying input sizes, revealing that some attention heads expand their focus to accommodate longer sequences, while other heads consistently concentrate on fixed-length local contexts. Experiments show that MoA increases the effective context length by 3.9× with the same average attention span, boosting retrieval accuracy by 1.5-7.1× over the uniform-attention baseline across Vicuna-7B, Vicuna-13B, and Llama3-8B models. Moreover, MoA narrows the capability gaps between sparse and dense models, reducing the maximum relative performance drop from 9%-36% to within 5% across two long-context understanding benchmarks. MoA achieves a 1.2-1.4× GPU memory reduction and boosts decode throughput by 5.5-6.7× for 7B and 13B dense models on a single GPU, with minimal impact on performance.

[LG-54] Pathformer: Recursive Path Query Encoding for Complex Logical Query Answering

Link: https://arxiv.org/abs/2406.14880
Authors: Chongzhi Zhang,Zhiping Peng,Junhao Zheng,Linghao Wang,Ruifeng Shi,Qianli Ma
Keywords: Logical Query Answering, Query Answering, Query, incomplete knowledge graphs, Complex Logical Query
Categories: Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
Comments: This work has been submitted to the IEEE

Abstract:Complex Logical Query Answering (CLQA) over incomplete knowledge graphs is a challenging task. Recently, Query Embedding (QE) methods are proposed to solve CLQA by performing multi-hop logical reasoning. However, most of them only consider historical query context information while ignoring future information, which leads to their failure to capture the complex dependencies behind the elements of a query. In recent years, the transformer architecture has shown a strong ability to model long-range dependencies between words. The bidirectional attention mechanism proposed by the transformer can solve the limitation of these QE methods regarding query context. Still, as a sequence model, it is difficult for the transformer to model complex logical queries with branch structure computation graphs directly. To this end, we propose a neural one-point embedding method called Pathformer based on the tree-like computation graph, i.e., query computation tree. Specifically, Pathformer decomposes the query computation tree into path query sequences by branches and then uses the transformer encoder to recursively encode these path query sequences to obtain the final query embedding. This allows Pathformer to fully utilize future context information to explicitly model the complex interactions between various parts of the path query. Experimental results show that Pathformer outperforms existing competitive neural QE methods, and we found that Pathformer has the potential to be applied to non-one-point embedding space.

[LG-55] MOS: Model Synergy for Test-Time Adaptation on LiDAR-Based 3D Object Detection

Link: https://arxiv.org/abs/2406.14878
Authors: Zhuoxiao Chen,Junjie Meng,Mahsa Baktashmotlagh,Zi Huang,Yadan Luo
Keywords: point clouds originating, unseen test point, test point clouds, object detection, detection systems
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)

Abstract:LiDAR-based 3D object detection is pivotal across many applications, yet the performance of such detection systems often degrades after deployment, especially when faced with unseen test point clouds originating from diverse locations or subjected to corruption. In this work, we introduce a new online adaptation framework for detectors named Model Synergy (MOS). Specifically, MOS dynamically assembles best-fit supermodels for each test batch from a bank of historical checkpoints, leveraging long-term knowledge to guide model updates without forgetting. The model assembly is directed by the proposed synergy weights (SW), employed for weighted averaging of the selected checkpoints to minimize redundancy in the composite supermodel. These weights are calculated by evaluating the similarity of predicted bounding boxes on test data and the feature independence among model pairs in the bank. To maintain an informative yet compact model bank, we pop out checkpoints with the lowest average SW scores and insert newly updated model weights. Our method was rigorously tested against prior test-time domain adaptation strategies on three datasets and under eight types of corruptions, demonstrating its superior adaptability to changing scenes and conditions. Remarkably, our approach achieved a 67.3% increase in performance in a complex “cross-corruption” scenario, which involves cross-dataset inconsistencies and real-world scene corruptions, providing a more realistic testbed of adaptation capabilities. The code is available at this https URL.

[LG-56] Training Greedy Policy for Proposal Batch Selection in Expensive Multi-Objective Combinatorial Optimization

Link: https://arxiv.org/abs/2406.14876
Authors: Deokjae Lee,Hyun Oh Song,Kyunghyun Cho
Keywords: subset selection problem, batch acquisition score, batch acquisition, challenging subset selection, Active learning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: ICML 2024; Codes at this https URL

Abstract:Active learning is increasingly adopted for expensive multi-objective combinatorial optimization problems, but it involves a challenging subset selection problem, optimizing the batch acquisition score that quantifies the goodness of a batch for evaluation. Due to the excessively large search space of the subset selection problem, prior methods optimize the batch acquisition on the latent space, which has discrepancies with the actual space, or optimize individual acquisition scores without considering the dependencies among candidates in a batch instead of directly optimizing the batch acquisition. To manage the vast search space, a simple and effective approach is the greedy method, which decomposes the problem into smaller subproblems, yet it has difficulty in parallelization since each subproblem depends on the outcome from the previous ones. To this end, we introduce a novel greedy-style subset selection algorithm that optimizes batch acquisition directly on the combinatorial space by sequential greedy sampling from the greedy policy, specifically trained to address all greedy subproblems concurrently. Notably, our experiments on the red fluorescent proteins design task show that our proposed method achieves the baseline performance in 1.69x fewer queries, demonstrating its efficiency.
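
The plain greedy method that this abstract describes, which breaks batch selection into sequential subproblems, can be sketched as follows. Each step adds the candidate with the largest marginal gain in a batch-level score, which is also why the steps are hard to parallelize. The coverage score and the candidate data are toy assumptions, and the trained greedy policy proposed in the paper is not reproduced.

```python
def greedy_batch(candidates, batch_score, batch_size):
    """Build a batch one item at a time, adding the candidate with the largest marginal gain."""
    batch = []
    remaining = list(candidates)
    for _ in range(min(batch_size, len(remaining))):
        current = batch_score(batch)
        # Each step depends on the batch chosen so far, so steps run sequentially.
        best = max(remaining, key=lambda c: batch_score(batch + [c]) - current)
        batch.append(best)
        remaining.remove(best)
    return batch

# Toy batch score (an assumption): number of distinct features covered by the batch.
features = {"a": {1, 2, 3}, "b": {3, 4}, "c": {1, 2}, "d": {5, 6, 7, 8}}

def coverage(batch):
    """Count the distinct features covered by all candidates in the batch."""
    covered = set()
    for name in batch:
        covered |= features[name]
    return len(covered)

selected = greedy_batch(features, coverage, batch_size=2)
```

Because the score is evaluated on the whole batch, a candidate that overlaps with items already selected contributes less than its individual score would suggest.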

[LG-57] I don’t trust you (anymore)! – The effect of students’ LLM use on Lecturer-Student-Trust in Higher Education

Link: https://arxiv.org/abs/2406.14871
Authors: Simon Kloker,Matthew Bazanya,Twaha Kateete
Keywords: Large Language Models, Team Trust, encompassing teaching, plays a pivotal, pivotal role
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: Working Paper

Abstract:Trust plays a pivotal role in Lecturer-Student-Collaboration, encompassing teaching and research aspects. The advent of Large Language Models (LLMs) in platforms like OpenAI’s ChatGPT, coupled with their cost-effectiveness and high-quality results, has led to their rapid adoption among university students. However, discerning genuine student input from LLM-generated output poses a challenge for lecturers. This dilemma jeopardizes the trust relationship between lecturers and students, potentially impacting university downstream activities, particularly collaborative research initiatives. Despite attempts to establish guidelines for student LLM use, a clear framework mutually beneficial for lecturers and students in higher education remains elusive. This study addresses the research question: How does the use of LLMs by students impact Informational and Procedural Justice, influencing Team Trust and Expected Team Performance? Methodically, we applied a quantitative construct-based survey, evaluated using techniques of Structural Equation Modelling (PLS-SEM) to examine potential relationships among these constructs. Our findings based on 23 valid respondents from Ndejje University indicate that lecturers are less concerned about the fairness of LLM use per se but are more focused on the transparency of student utilization, which significantly influences Team Trust positively. This research contributes to the global discourse on integrating and regulating LLMs and subsequent models in education. We propose that guidelines should support LLM use while enforcing transparency in Lecturer-Student-Collaboration to foster Team Trust and Performance. The study contributes valuable insights for shaping policies enabling ethical and transparent LLMs usage in education to ensure effectiveness of collaborative learning environments.

[LG-58] Direct Multi-Turn Preference Optimization for Language Agents

Link: https://arxiv.org/abs/2406.14868
Authors: Wentao Shi,Mengqi Yuan,Junkang Wu,Qifan Wang,Fuli Feng
Keywords: Adapting Large Language, Large Language Models, Adapting Large, developing language agents, Large Language
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:Adapting Large Language Models (LLMs) for agent tasks is critical in developing language agents. Direct Preference Optimization (DPO) is a promising technique for this adaptation with the alleviation of compounding errors, offering a means to directly optimize Reinforcement Learning (RL) objectives. However, applying DPO to multi-turn tasks presents challenges due to the inability to cancel the partition function. Overcoming this obstacle involves making the partition function independent of the current state and addressing length disparities between preferred and dis-preferred trajectories. In this light, we replace the policy constraint with the state-action occupancy measure constraint in the RL objective and add length normalization to the Bradley-Terry model, yielding a novel loss function named DMPO for multi-turn agent tasks with theoretical explanations. Extensive experiments on three multi-turn agent task datasets confirm the effectiveness and superiority of the DMPO loss.
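
For context, the standard single-turn DPO loss that this work starts from can be written in a few lines. The sketch takes sequence log-probabilities under the policy and a reference model and returns the Bradley-Terry negative log-likelihood for one preference pair. The argument names and the default `beta` are illustrative assumptions. DMPO's occupancy-measure constraint and length normalization are not reproduced here.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard single-turn DPO loss for one preference pair, from sequence log-probabilities."""
    # Implicit reward margin: how much more the policy favours the chosen response
    # over the rejected one, relative to the reference model.
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# With no preference margin the loss equals log(2); it falls as the chosen response gains.
loss_neutral = dpo_loss(-5.0, -5.0, -5.0, -5.0)
loss_better = dpo_loss(-4.0, -6.0, -5.0, -5.0)
```

In the single-turn setting the partition function cancels between the chosen and rejected terms, which is what lets the loss depend only on these log-probability differences. The abstract notes that this cancellation does not carry over directly to multi-turn tasks.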

[LG-59] DistiLRR: Transferring Code Repair for Low-Resource Programming Languages

Link: https://arxiv.org/abs/2406.14867
Authors: Kyle Wong,Alfonso Amayuelas,Liangming Pan,William Yang Wang
Keywords: Large language models, shown remarkable performance, Large language, code generation tasks, code repair
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:Large language models (LLMs) have shown remarkable performance on code generation tasks. A recent application of LLMs for code generation is iterative code repair, where a model fixes an incorrect program by rationalizing about errors and generating a new program. However, code repair is primarily studied on high-resource languages like Python, and the framework’s efficacy is under-explored on low-resource languages. To apply code repair for low-resource languages, we propose Distilling Low-Resource Repairs (DistiLRR), an approach that transfers the reasoning and code generation ability from a teacher model to a student model. Our results show that DistiLRR consistently outperforms baselines on low-resource languages, but has similar performance on high-resource languages. To investigate this behavior, we perform a further analysis and find that the correlation between rationale quality and code correctness is weaker than previously perceived. We hypothesize this weakness is magnified in low-resource settings where base models lack deep knowledge of a programming language, leading to wavering benefits of code repair between high-resource and low-resource languages.

[LG-60] A review of feature selection strategies utilizing graph data structures and knowledge graphs

Link: https://arxiv.org/abs/2406.14864
Authors: Sisi Shao,Pedro Henrique Ribeiro,Christina Ramirez,Jason H. Moore
Keywords: Natural Language Processing, Natural Language, Language Processing, personalized recommendation systems, including biomedical research
Categories: Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)

Abstract:Feature selection in Knowledge Graphs (KGs) is increasingly utilized in diverse domains, including biomedical research, Natural Language Processing (NLP), and personalized recommendation systems. This paper delves into the methodologies for feature selection within KGs, emphasizing their roles in enhancing machine learning (ML) model efficacy, hypothesis generation, and interpretability. Through this comprehensive review, we aim to catalyze further innovation in feature selection for KGs, paving the way for more insightful, efficient, and interpretable analytical models across various domains. Our exploration reveals the critical importance of scalability, accuracy, and interpretability in feature selection techniques, advocating for the integration of domain knowledge to refine the selection process. We highlight the burgeoning potential of multi-objective optimization and interdisciplinary collaboration in advancing KG feature selection, underscoring the transformative impact of such methodologies on precision medicine, among other fields. The paper concludes by charting future directions, including the development of scalable, dynamic feature selection algorithms and the integration of explainable AI principles to foster transparency and trust in KG-driven models.

[LG-61] LatentExplainer: Explaining Latent Representations in Deep Generative Models with Multi-modal Foundation Models

Link: https://arxiv.org/abs/2406.14862
Authors: Mengdan Zhu,Raasikh Kanjiani,Jiahui Lu,Andrew Choi,Qirui Ye,Liang Zhao
Keywords: Deep generative models, latent variables, leveraging latent variables, generate high-quality samples, generative models
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Deep generative models like VAEs and diffusion models have advanced various generation tasks by leveraging latent variables to learn data distributions and generate high-quality samples. Despite the field of explainable AI making strides in interpreting machine learning models, understanding latent variables in generative models remains challenging. This paper introduces LatentExplainer, a framework for automatically generating semantically meaningful explanations of latent variables in deep generative models. LatentExplainer tackles three main challenges: inferring the meaning of latent variables, aligning explanations with inductive biases, and handling varying degrees of explainability. By perturbing latent variables and interpreting changes in generated data, the framework provides a systematic approach to understanding and controlling the data generation process, enhancing the transparency and interpretability of deep generative models. We evaluate our proposed method on several real-world and synthetic datasets, and the results demonstrate superior performance in generating high-quality explanations of latent variables.

[LG-62] Accessible At-Home Detection of Parkinson's Disease via Multi-task Video Analysis

Link: https://arxiv.org/abs/2406.14856
Authors: Md Saiful Islam,Tariq Adnan,Jan Freyberg,Sangwu Lee,Abdelrahman Abdelkader,Meghan Pawlik,Cathe Schwartz,Karen Jaffe,Ruth B. Schneider,E Ray Dorsey,Ehsan Hoque
Keywords: neurological care leads, detect Parkinson disease, Parkinson disease, Monte Carlo Dropout, unidentified and untreated
Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)

Abstract:Limited access to neurological care leads to missed diagnoses of Parkinson’s disease (PD), leaving many individuals unidentified and untreated. We trained a novel neural network-based fusion architecture to detect Parkinson’s disease (PD) by analyzing features extracted from webcam recordings of three tasks: finger tapping, facial expression (smiling), and speech (uttering a sentence containing all letters of the alphabet). Additionally, the model incorporated Monte Carlo Dropout to improve prediction accuracy by considering uncertainties. The study participants (n = 845, 272 with PD) were randomly split into three sets: 60% for training, 20% for model selection (hyper-parameter tuning), and 20% for final performance evaluation. The dataset consists of 1102 sessions, each session containing videos of all three tasks. Our proposed model achieved significantly better accuracy, area under the ROC curve (AUROC), and sensitivity at non-inferior specificity compared to any single-task model. Withholding uncertain predictions further boosted the performance, achieving 88.0% (95% CI: 87.7% - 88.4%) accuracy, 93.0% (92.8% - 93.2%) AUROC, 79.3% (78.4% - 80.2%) sensitivity, and 92.6% (92.3% - 92.8%) specificity, at the expense of not being able to predict for 2.3% (2.0% - 2.6%) data. Further analysis suggests that the trained model does not exhibit any detectable bias across sex and ethnic subgroups and is most effective for individuals aged between 50 and 80. This accessible, low-cost approach requiring only an internet-enabled device with a webcam and microphone paves the way for convenient PD screening at home, particularly in regions with limited access to clinical specialists.
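The 60/20/20 split described in the abstract above is made over participants, not over sessions. A minimal sketch of such a participant-level split follows. It is not the authors' code: the function name, the seed, and the use of integer IDs are assumptions made for illustration. Splitting by participant presumably keeps every session of one person inside a single set, which matters because there are more sessions (1102) than participants (845).

```python
import random

def split_participants(participant_ids, seed=0, train_frac=0.6, val_frac=0.2):
    """Shuffle participant IDs and cut them into train/validation/test lists.

    Hypothetical helper: the seed and the fractions are illustrative defaults.
    """
    ids = list(participant_ids)
    random.Random(seed).shuffle(ids)          # reproducible shuffle
    n_train = round(train_frac * len(ids))    # 60% for training
    n_val = round(val_frac * len(ids))        # 20% for model selection
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]              # the remainder for final evaluation
    return train, val, test

# 845 participants, as reported in the abstract; integer IDs are an assumption.
train, val, test = split_participants(range(845))
print(len(train), len(val), len(test))
```

Sessions would then be assigned to a set by looking up the participant they belong to.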

[LG-63] Graph Edge Representation via Tensor Product Graph Convolutional Representation

Link: https://arxiv.org/abs/2406.14846
Authors: Bo Jiang,Sheng Ge,Ziyan Zhang,Beibei Wang,Jin Tang,Bin Luo
Keywords: Graph Convolutional Networks, Convolutional Networks, Graph Convolutional, tensor product graph, Product Graph Convolution
Categories: Machine Learning (cs.LG)

Abstract:Graph Convolutional Networks (GCNs) have been widely studied. The core of GCNs is the definition of convolution operators on graphs. However, existing Graph Convolution (GC) operators are mainly defined on the adjacency matrix and node features and generally focus on obtaining effective node embeddings, which cannot be used to handle graphs with (high-dimensional) edge features. To address this problem, by leveraging tensor contraction representation and tensor product graph diffusion theories, this paper analogously defines an effective convolution operator on graphs with edge features, named Tensor Product Graph Convolution (TPGC). The proposed TPGC aims to obtain effective edge embeddings. It provides a complementary model to traditional graph convolutions (GCs) to address the more general graph data analysis with both node and edge features. Experimental results on several graph learning tasks demonstrate the effectiveness of the proposed TPGC.

[LG-64] DN-CL: Deep Symbolic Regression against Noise via Contrastive Learning

Link: https://arxiv.org/abs/2406.14844
Authors: Jingyi Liu,Yanjie Li,Lina Yu,Min Wu,Weijun Li,Wenqiang Li,Meilan Hao,Yusong Deng,Shu Wei
Keywords: factors including physical, numerous factors including, Noise ubiquitously exists, including physical, environmental effects
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Noise ubiquitously exists in signals due to numerous factors including physical, electronic, and environmental effects. Traditional methods of symbolic regression, such as genetic programming or deep learning models, aim to find the most fitting expressions for these signals. However, these methods often overlook the noise present in real-world data, leading to reduced fitting accuracy. To tackle this issue, we propose Deep Symbolic Regression against Noise via Contrastive Learning (DN-CL). DN-CL employs two parameter-sharing encoders to embed data points from various data transformations into feature shields against noise. This model treats noisy data and clean data as different views of the ground-truth mathematical expressions. Distances between these features are minimized, utilizing contrastive learning to distinguish between ‘positive’ noise-corrected pairs and ‘negative’ contrasting pairs. Our experiments indicate that DN-CL demonstrates superior performance in handling both noisy and clean data, presenting a promising method of symbolic regression.

[LG-65] TabularMark: Watermarking Tabular Datasets for Machine Learning

Link: https://arxiv.org/abs/2406.14841
Authors: Yihao Zheng,Haocheng Xia,Junyuan Pang,Jinfei Liu,Kui Ren,Lingyang Chu,Yang Cao,Li Xiong
Keywords: protect ownership, ownership of shared, data utility, data, shared data
Categories: Cryptography and Security (cs.CR); Databases (cs.DB); Machine Learning (cs.LG)

Abstract:Watermarking is broadly utilized to protect ownership of shared data while preserving data utility. However, existing watermarking methods for tabular datasets fall short on the desired properties (detectability, non-intrusiveness, and robustness) and only preserve data utility from the perspective of data statistics, ignoring the performance of downstream ML models trained on the datasets. Can we watermark tabular datasets without significantly compromising their utility for training ML models while preventing attackers from training usable ML models on attacked datasets? In this paper, we propose a hypothesis testing-based watermarking scheme, TabularMark. Data noise partitioning is utilized for data perturbation during embedding, which is adaptable for numerical and categorical attributes while preserving the data utility. For detection, a custom-threshold one proportion z-test is employed, which can reliably determine the presence of the watermark. Experiments on real-world and synthetic datasets demonstrate the superiority of TabularMark in detectability, non-intrusiveness, and robustness.
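The detection step named in the abstract above is a one-proportion z-test with a custom threshold. The sketch below shows only the generic textbook test, not TabularMark itself. What counts as a "hit" (for example, a checked cell that carries the expected perturbation), the null proportion `p0`, and the threshold value are all assumptions chosen for illustration.

```python
import math

def one_proportion_z(hits, n, p0):
    """z-statistic for the null hypothesis that the true hit rate equals p0."""
    p_hat = hits / n                                  # observed proportion
    standard_error = math.sqrt(p0 * (1 - p0) / n)     # standard error under the null
    return (p_hat - p0) / standard_error

def watermark_detected(hits, n, p0=0.5, z_threshold=1.96):
    """One-sided decision: flag a watermark when z exceeds the chosen threshold.

    p0 and z_threshold are hypothetical defaults, not values from the paper.
    """
    return one_proportion_z(hits, n, p0) > z_threshold

# 180 hits out of 200 checked cells is far above chance; 105 out of 200 is not.
print(watermark_detected(180, 200))
print(watermark_detected(105, 200))
```

Raising `z_threshold` trades fewer false detections for lower sensitivity, which is presumably why the scheme lets the data owner set it.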

[LG-66] ToVo: Toxicity Taxonomy via Voting

Link: https://arxiv.org/abs/2406.14835
Authors: Tinh Son Luong,Thanh-Thien Le,Thang Viet Doan,Linh Ngo Van,Thien Huu Nguyen,Diep Thi-Ngoc Nguyen
Keywords: face significant limitations, models face significant, significant limitations, face significant, detection models face
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:Existing toxic detection models face significant limitations, such as lack of transparency, customization, and reproducibility. These challenges stem from the closed-source nature of their training data and the paucity of explanations for their evaluation mechanism. To address these issues, we propose a dataset creation mechanism that integrates voting and chain-of-thought processes, producing a high-quality open-source dataset for toxic content detection. Our methodology ensures diverse classification metrics for each sample and includes both classification scores and explanatory reasoning for the classifications. We utilize the dataset created through our proposed mechanism to train our model, which is then compared against existing widely-used detectors. Our approach not only enhances transparency and customizability but also facilitates better fine-tuning for specific use cases. This work contributes a robust framework for developing toxic content detection models, emphasizing openness and adaptability, thus paving the way for more effective and user-specific content moderation solutions.

[LG-67] Latent diffusion models for parameterization and data assimilation of facies-based geomodels

Link: https://arxiv.org/abs/2406.14815
Authors: Guido Di Federico,Louis J. Durlofsky
Keywords: Geological parameterization entails, latent diffusion model, porosity and permeability, Diffusion models, entails the representation
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Geophysics (physics.geo-ph)

Abstract:Geological parameterization entails the representation of a geomodel using a small set of latent variables and a mapping from these variables to grid-block properties such as porosity and permeability. Parameterization is useful for data assimilation (history matching), as it maintains geological realism while reducing the number of variables to be determined. Diffusion models are a new class of generative deep-learning procedures that have been shown to outperform previous methods, such as generative adversarial networks, for image generation tasks. Diffusion models are trained to “denoise”, which enables them to generate new geological realizations from input fields characterized by random noise. Latent diffusion models, which are the specific variant considered in this study, provide dimension reduction through use of a low-dimensional latent variable. The model developed in this work includes a variational autoencoder for dimension reduction and a U-net for the denoising process. Our application involves conditional 2D three-facies (channel-levee-mud) systems. The latent diffusion model is shown to provide realizations that are visually consistent with samples from geomodeling software. Quantitative metrics involving spatial and flow-response statistics are evaluated, and general agreement between the diffusion-generated models and reference realizations is observed. Stability tests are performed to assess the smoothness of the parameterization method. The latent diffusion model is then used for ensemble-based data assimilation. Two synthetic “true” models are considered. Significant uncertainty reduction, posterior P10-P90 forecasts that generally bracket observed data, and consistent posterior geomodels are achieved in both cases.

[LG-68] Probabilistic Emulation of a Global Climate Model with Spherical DYffusion

Link: https://arxiv.org/abs/2406.14798
Authors: Salva Rühling Cachay,Brian Henn,Oliver Watt-Meyer,Christopher S. Bretherton,Rose Yu
Keywords: Data-driven deep learning, global weather forecasting, transforming global weather, deep learning models, weather forecasting
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (stat.ML)

Abstract:Data-driven deep learning models are on the verge of transforming global weather forecasting. It is an open question if this success can extend to climate modeling, where long inference rollouts and data complexity pose significant challenges. Here, we present the first conditional generative model able to produce global climate ensemble simulations that are accurate and physically consistent. Our model runs at 6-hourly time steps and is shown to be stable for 10-year-long simulations. Our approach beats relevant baselines and nearly reaches a gold standard for successful climate model emulation. We discuss the key design choices behind our dynamics-informed diffusion model-based approach which enables this significant step towards efficient, data-driven climate simulations that can help us better understand the Earth and adapt to a changing climate.

[LG-69] MU-Bench: A Multitask Multimodal Benchmark for Machine Unlearning

Link: https://arxiv.org/abs/2406.14796
Authors: Jiali Cheng,Hadi Amiri
Keywords: Recent advancements, Machine Unlearning, trained models, sensitive information, introduced solutions
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Recent advancements in Machine Unlearning (MU) have introduced solutions to selectively remove certain training samples, such as those with outdated or sensitive information, from trained models. Despite these advancements, evaluation of MU methods has been inconsistent, employing different trained models, architectures, and sample removal strategies, which hampers accurate comparison. In addition, prior MU approaches have mainly focused on singular tasks or modalities, which is not comprehensive. To address these limitations, we develop MU-Bench, the first comprehensive benchmark for MU that (i) unifies the sets of deleted samples and trained models, and (ii) provides broad coverage of tasks and data modalities, including previously unexplored domains such as speech and video classification. Our evaluation shows that RandLabel and SalUn are the most effective general MU approaches on MU-Bench, and BadT and SCRUB are capable of achieving random performance on the deletion set. We analyze several under-investigated aspects of unlearning, including scalability, the impacts of parameter-efficient fine-tuning and curriculum learning, and susceptibility to dataset biases. MU-Bench provides an easy-to-use package that includes dataset splits, models, and implementations, together with a leaderboard to enable unified and scalable MU research.

[LG-70] Graph Structure Learning with Interpretable Bayesian Neural Networks

Link: https://arxiv.org/abs/2406.14786
Authors: Max Wasserman,Gonzalo Mateos
Keywords: underlying relational structure, serve as generic, generic tools, tools to encode, encode the underlying
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Graphs serve as generic tools to encode the underlying relational structure of data. Often this graph is not given, and so the task of inferring it from nodal observations becomes important. Traditional approaches formulate a convex inverse problem with a smoothness promoting objective and rely on iterative methods to obtain a solution. In supervised settings where graph labels are available, one can unroll and truncate these iterations into a deep network that is trained end-to-end. Such a network is parameter efficient and inherits inductive bias from the optimization formulation, an appealing aspect for data constrained settings in, e.g., medicine, finance, and the natural sciences. But typically such settings care equally about uncertainty over edge predictions, not just point estimates. Here we introduce novel iterations with independently interpretable parameters, i.e., parameters whose values - independent of other parameters’ settings - proportionally influence characteristics of the estimated graph, such as edge sparsity. After unrolling these iterations, prior knowledge over such graph characteristics shapes prior distributions over these independently interpretable network parameters to yield a Bayesian neural network (BNN) capable of graph structure learning (GSL) from smooth signal observations. Fast execution and parameter efficiency allow for high-fidelity posterior approximation via Markov Chain Monte Carlo (MCMC) and thus uncertainty quantification on edge predictions. Synthetic and real data experiments corroborate this model’s ability to provide well-calibrated estimates of uncertainty, in test cases that include unveiling economic sector modular structure from S&P 500 data and recovering pairwise digit similarities from MNIST images. Overall, this framework enables GSL in modest-scale applications where uncertainty on the data structure is paramount.

[LG-71] Understanding Finetuning for Factual Knowledge Extraction

Link: https://arxiv.org/abs/2406.14785
Authors: Gaurav Ghosal,Tatsunori Hashimoto,Aditi Raghunathan
Keywords: study the impact, downstream factuality, lesser-known facts, facts, fine-tuning
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: To appear in ICML 2024

Abstract:In this work, we study the impact of QA fine-tuning data on downstream factuality. We show that fine-tuning on lesser-known facts that are poorly stored during pretraining yields significantly worse factuality than fine-tuning on well-known facts, even when all facts are seen during pretraining. We prove this phenomenon theoretically, showing that training on lesser-known facts can lead the model to ignore subject entity names and instead output a generic plausible response even when the relevant factual knowledge is encoded in the model. On three question answering benchmarks (PopQA, Entity Questions, and MMLU) and two language models (Llama-2-7B and Mistral-7B), we find that (i) finetuning on a completely factual but lesser-known subset of the data deteriorates downstream factuality (5-10%) and (ii) finetuning on a subset of better-known examples matches or outperforms finetuning on the entire dataset. Ultimately, our results shed light on the interaction between pretrained knowledge and finetuning data and demonstrate the importance of taking into account how facts are stored in the pretrained model when fine-tuning for knowledge-intensive tasks.

[LG-72] Active Learning for Fair and Stable Online Allocations

Link: https://arxiv.org/abs/2406.14784
Authors: Riddhiman Bhattacharya,Thanh Nguyen,Will Wei Sun,Mohit Tawarmalani
Keywords: active learning approach, dynamic fair resource, fair resource allocation, resource allocation problems, resource allocation
Categories: Machine Learning (cs.LG); Other Statistics (stat.OT)

Abstract:We explore an active learning approach for dynamic fair resource allocation problems. Unlike previous work that assumes full feedback from all agents on their allocations, we consider feedback from a select subset of agents at each epoch of the online resource allocation process. Despite this restriction, our proposed algorithms provide regret bounds that are sub-linear in the number of time periods for various measures that include fairness metrics commonly used in resource allocation problems and stability considerations in matching mechanisms. The key insight of our algorithms lies in adaptively identifying the most informative feedback using dueling upper and lower confidence bounds. With this strategy, we show that efficient decision-making does not require extensive feedback and produces efficient outcomes for a variety of problem classes.

[LG-73] Learning to Cover: Online Learning and Optimization with Irreversible Decisions

Link: https://arxiv.org/abs/2406.14777
Authors: Alexandre Jacquillat,Michael Lingzhi Li
Keywords: irreversible decisions contributing, problem with irreversible, irreversible decisions, decisions contributing, guide future decisions
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)

Abstract:We define an online learning and optimization problem with irreversible decisions contributing toward a coverage target. At each period, a decision-maker selects facilities to open, receives information on the success of each one, and updates a machine learning model to guide future decisions. The goal is to minimize costs across a finite horizon under a chance constraint reflecting the coverage target. We derive an optimal algorithm and a tight lower bound in an asymptotic regime characterized by a large target number of facilities $m\to\infty$ but a finite horizon $T\in\mathbb{Z}_+$. We find that the regret grows sub-linearly at a rate $\Theta\left(m^{\frac{1}{2}\cdot\frac{1}{1-2^{-T}}}\right)$, thus converging exponentially fast to $\Theta(\sqrt{m})$. We establish the robustness of this result to the learning environment; we also extend it to a more complicated facility location setting in a bipartite facility-customer graph with a target on customer coverage. Throughout, constructive proofs identify a policy featuring limited exploration initially for learning purposes, and fast exploitation later on for optimization purposes once uncertainty gets mitigated. These findings underscore the benefits of limited online learning and optimization, in that even a few rounds can provide significant benefits as compared to a no-learning baseline.
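To see why the abstract above speaks of exponentially fast convergence, it helps to tabulate the exponent of m for a few horizons. The sketch below reads the stated rate as m raised to the power (1/2) * 1/(1 - 2^(-T)). That reading is the only one consistent with the claimed Θ(√m) limit, and the helper name is hypothetical.

```python
from fractions import Fraction

def regret_exponent(T):
    """Exponent of m in Theta(m^((1/2) * 1/(1 - 2^-T))) for a horizon T >= 1."""
    return Fraction(1, 2) / (1 - Fraction(1, 2 ** T))

for T in (1, 2, 3, 5, 10):
    exponent = regret_exponent(T)
    gap = exponent - Fraction(1, 2)   # distance from the sqrt(m) exponent
    print(f"T={T:2d}  exponent={exponent}  gap to 1/2 = {gap}")
```

With a single round (T = 1) the exponent is 1, so regret is linear in m. The gap to 1/2 equals 1/(2(2^T - 1)), which roughly halves with each extra round. This matches the abstract's remark that even a few rounds of learning help considerably.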

[LG-74] Evaluating Numerical Reasoning in Text-to-Image Models

Link: https://arxiv.org/abs/2406.14774
Authors: Ivana Kajić,Olivia Wiles,Isabela Albuquerque,Matthias Bauer,Su Wang,Jordi Pont-Tuset,Aida Nematzadeh
Keywords: faithfully depict concepts, producing high-quality images, natural language, capable of producing, producing high-quality
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Text-to-image generative models are capable of producing high-quality images that often faithfully depict concepts described using natural language. In this work, we comprehensively evaluate a range of text-to-image models on numerical reasoning tasks of varying difficulty, and show that even the most advanced models have only rudimentary numerical skills. Specifically, their ability to correctly generate an exact number of objects in an image is limited to small numbers, it is highly dependent on the context the number term appears in, and it deteriorates quickly with each successive number. We also demonstrate that models have poor understanding of linguistic quantifiers (such as “a few” or “as many as”), the concept of zero, and struggle with more advanced concepts such as partial quantities and fractional representations. We bundle prompts, generated images and human annotations into GeckoNum, a novel benchmark for evaluation of numerical reasoning.

[LG-75] ChatGPT as Research Scientist: Probing GPT's Capabilities as a Research Librarian, Research Ethicist, Data Generator, and Data Predictor

Link: https://arxiv.org/abs/2406.14765
Authors: Steven A. Lehr,Aylin Caliskan,Suneragiri Liyanage,Mahzarin R. Banaji
Keywords: Research Ethicist, Research Librarian, research, Study, Data Generator
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Main article is 14 pages, 1 table. Includes SI Appendix: 26 pages, 12 tables, 2 figures. Total: 40 pages, 13 tables, 2 figures. Under revised review at PNAS

Abstract:How good a research scientist is ChatGPT? We systematically probed the capabilities of GPT-3.5 and GPT-4 across four central components of the scientific process: as a Research Librarian, Research Ethicist, Data Generator, and Novel Data Predictor, using psychological science as a testing field. In Study 1 (Research Librarian), unlike human researchers, GPT-3.5 and GPT-4 hallucinated, authoritatively generating fictional references 36.0% and 5.4% of the time, respectively, although GPT-4 exhibited an evolving capacity to acknowledge its fictions. In Study 2 (Research Ethicist), GPT-4 (though not GPT-3.5) proved capable of detecting violations like p-hacking in fictional research protocols, correcting 88.6% of blatantly presented issues, and 72.6% of subtly presented issues. In Study 3 (Data Generator), both models consistently replicated patterns of cultural bias previously discovered in large language corpora, indicating that ChatGPT can simulate known results, an antecedent to usefulness for both data generation and skills like hypothesis generation. Contrastingly, in Study 4 (Novel Data Predictor), neither model was successful at predicting new results absent in their training data, and neither appeared to leverage substantially new information when predicting more versus less novel outcomes. Together, these results suggest that GPT is a flawed but rapidly improving librarian, a decent research ethicist already, capable of data generation in simple domains with known characteristics but poor at predicting novel patterns of empirical data to aid future experimentation.

[LG-76] RE-AdaptIR: Improving Information Retrieval through Reverse Engineered Adaptation

Link: https://arxiv.org/abs/2406.14764
Authors: William Fleshman,Benjamin Van Durme
Keywords: Large language models, Large language, fine-tuned for text-retrieval, text-retrieval have demonstrated, information retrieval
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:Large language models (LLMs) fine-tuned for text-retrieval have demonstrated state-of-the-art results across several information retrieval (IR) benchmarks. However, supervised training for improving these models requires numerous labeled examples, which are generally unavailable or expensive to acquire. In this work, we explore the effectiveness of extending reverse engineered adaptation to the context of information retrieval (RE-AdaptIR). We use RE-AdaptIR to improve LLM-based IR models using only unlabeled data. We demonstrate improved performance both in training domains as well as zero-shot in domains where the models have seen no queries. We analyze performance changes in various fine-tuning scenarios and offer findings of immediate use to practitioners.

[LG-77] Regularized Distribution Matching Distillation for One-step Unpaired Image-to-Image Translation

Link: https://arxiv.org/abs/2406.14762
Authors: Denis Rakitin,Ivan Shchekotov,Dmitry Vetrov
Keywords: Distribution Matching Distillation, distillation methods aim, Regularized Distribution Matching, efficient one-step generators, Distribution Matching
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:Diffusion distillation methods aim to compress diffusion models into efficient one-step generators while trying to preserve quality. Among them, Distribution Matching Distillation (DMD) offers a suitable framework for training general-form one-step generators, applicable beyond unconditional generation. In this work, we introduce its modification, called Regularized Distribution Matching Distillation, applicable to unpaired image-to-image (I2I) problems. We demonstrate its empirical performance in application to several translation tasks, including 2D examples and I2I between different image datasets, where it performs on par with or better than multi-step diffusion baselines.

[LG-78] An LLM Feature-based Framework for Dialogue Constructiveness Assessment

Link: https://arxiv.org/abs/2406.14760
Authors: Lexin Zhou,Youmna Farag,Andreas Vlachos
Keywords: analysing conversational factors, constructiveness assessment focuses, LLM feature-based models, predicting constructive outcomes, LLM feature-based
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Research on dialogue constructiveness assessment focuses on (i) analysing conversational factors that influence individuals to take specific actions, win debates, change their perspectives or broaden their open-mindedness and (ii) predicting constructive outcomes following dialogues for such use cases. These objectives can be achieved by training either interpretable feature-based models (which often involve costly human annotations) or neural models such as pre-trained language models (which have empirically shown higher task accuracy but lack interpretability). We propose a novel LLM feature-based framework that combines the strengths of feature-based and neural approaches while mitigating their downsides, in assessing dialogue constructiveness. The framework first defines a set of dataset-independent and interpretable linguistic features, which can be extracted by both prompting an LLM and simple heuristics. Such features are then used to train LLM feature-based models. We apply this framework to three datasets of dialogue constructiveness and find that our LLM feature-based models significantly outperform standard feature-based models and neural models, and tend to learn more robust prediction rules instead of relying on superficial shortcuts (as seen with neural models). Further, we demonstrate that interpreting these LLM feature-based models can yield valuable insights into what makes a dialogue constructive.

[LG-79] A General Control-Theoretic Approach for Reinforcement Learning: Theory and Algorithms

Link: https://arxiv.org/abs/2406.14753
Authors: Weiqin Chen,Mark S. Squillante,Chai Wah Wu,Santiago Paternain
Keywords: control-theoretic reinforcement learning, support direct learning, reinforcement learning approach, optimal policy, devise a control-theoretic
Categories: Machine Learning (cs.LG); Methodology (stat.ME)

Abstract:We devise a control-theoretic reinforcement learning approach to support direct learning of the optimal policy. We establish theoretical properties of our approach and derive an algorithm based on a specific instance of this approach. Our empirical results demonstrate the significant benefits of our approach.

[LG-80] Relational Reasoning On Graphs Using Opinion Dynamics

Link: https://arxiv.org/abs/2406.14746
Authors: Yulong Yang,Bowen Feng,Keqin Wang,Naomi Leonard,Adji Bousso Dieng,Christine Allen-Blanchette
Keywords: Kuramoto oscillators, pedestrians to Kuramoto, dynamical systems evolve, multitude of dynamical, evolve in space
Categories: Machine Learning (cs.LG); Robotics (cs.RO)
Comments: 14 pages, 7 figures

Abstract:From pedestrians to Kuramoto oscillators, interactions between agents govern how a multitude of dynamical systems evolve in space and time. Discovering how these agents relate to each other can improve our understanding of the often complex dynamics that underlie these systems. Recent works learn to categorize relationships between agents based on observations of their physical behavior. These approaches are limited in that the relationship categories are modelled as independent and mutually exclusive, when in real-world systems categories are often interacting. In this work, we introduce a level of abstraction between the physical behavior of agents and the categories that define their behavior. To do this, we learn a mapping from the agents’ states to their affinities for each category in a graph neural network. We integrate the physical proximity of agents and their affinities in a nonlinear opinion dynamics model which provides a mechanism to identify mutually exclusive categories, predict an agent’s evolution in time, and control an agent’s behavior. We demonstrate the utility of our model for learning interpretable categories for mechanical systems, and demonstrate its efficacy on several long-horizon trajectory prediction benchmarks where we consistently outperform existing methods.

[LG-81] A General Online Algorithm for Optimizing Complex Performance Metrics

Link: https://arxiv.org/abs/2406.14743
Authors: Wojciech Kotłowski, Marek Wydmuch, Erik Schultheis, Rohit Babbar, Krzysztof Dembczyński
Keywords: sequential maximization, confusion matrix, F-measure, G-mean, general functions
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: This is the authors’ version of the work accepted to ICML 2024

Abstract:We consider sequential maximization of performance metrics that are general functions of a confusion matrix of a classifier (such as precision, F-measure, or G-mean). Such metrics are, in general, non-decomposable over individual instances, making their optimization very challenging. While they have been extensively studied under different frameworks in the batch setting, their analysis in the online learning regime is very limited, with only a few distinguished exceptions. In this paper, we introduce and analyze a general online algorithm that can be used in a straightforward way with a variety of complex performance metrics in binary, multi-class, and multi-label classification problems. The algorithm’s update and prediction rules are appealingly simple and computationally efficient without the need to store any past data. We show the algorithm attains $\mathcal{O}(\frac{\ln n}{n})$ regret for concave and smooth metrics and verify the efficiency of the proposed algorithm in empirical studies.
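
To make the phrase "non-decomposable over individual instances" concrete, here is a minimal Python sketch. It is not the paper's online algorithm. It only computes two of the metrics named in the abstract (F-measure and G-mean) from binary confusion-matrix counts, and shows that the F-measure of a whole dataset generally differs from the average of the F-measures of its parts. The function names and the toy labels are assumptions made for illustration.

```python
import math

def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels in {0, 1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def f_measure(tp, fp, fn, tn):
    """F1 score: harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def g_mean(tp, fp, fn, tn):
    """Geometric mean of the true positive and true negative rates."""
    tpr = tp / (tp + fn) if tp + fn else 0.0
    tnr = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(tpr * tnr)

# Two hypothetical chunks of (true label, predicted label) data.
chunk_a = ([1, 1], [1, 1])
chunk_b = ([1, 0, 0, 0], [0, 1, 0, 0])

f_a = f_measure(*confusion_counts(*chunk_a))
f_b = f_measure(*confusion_counts(*chunk_b))
mean_of_parts = (f_a + f_b) / 2

# The same data scored as a single dataset.
whole_true = chunk_a[0] + chunk_b[0]
whole_pred = chunk_a[1] + chunk_b[1]
whole_counts = confusion_counts(whole_true, whole_pred)
f_whole = f_measure(*whole_counts)
g_whole = g_mean(*whole_counts)

print(f"mean of per-chunk F1: {mean_of_parts:.3f}, F1 on all data: {f_whole:.3f}")
```

Because the metric depends on the aggregated counts rather than on a sum of per-example losses, the two printed numbers disagree, which is why standard per-instance online updates do not apply directly.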

[LG-82] Latent Variable Sequence Identification for Cognitive Models with Neural Bayes Estimation

Link: https://arxiv.org/abs/2406.14742
Authors: Ti-Fen Pan, Jing-Jing Li, Bill Thompson, Anne Collins
Keywords: Extracting time-varying latent, Extracting time-varying, cognitive models, key step, aims to understand
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Extracting time-varying latent variables from computational cognitive models is a key step in model-based neural analysis, which aims to understand the neural correlates of cognitive processes. However, existing methods only allow researchers to infer latent variables that explain subjects’ behavior in a relatively small class of cognitive models. For example, a broad class of relevant cognitive models with analytically intractable likelihood is currently out of reach from standard techniques, based on Maximum a Posteriori parameter estimation. Here, we present an approach that extends neural Bayes estimation to learn a direct mapping between experimental data and the targeted latent variable space using recurrent neural networks and simulated datasets. We show that our approach achieves competitive performance in inferring latent variable sequences in both tractable and intractable models. Furthermore, the approach is generalizable across different computational models and is adaptable for both continuous and discrete latent spaces. We then demonstrate its applicability in real world datasets. Our work underscores that combining recurrent neural networks and simulation-based inference to identify latent variable sequences can enable researchers to access a wider class of cognitive models for model-based neural analyses, and thus test a broader set of theories.

[LG-83] An Advanced Physics-Informed Neural Operator for Comprehensive Design Optimization of Highly-Nonlinear Systems: An Aerospace Composites Processing Case Study

Link: https://arxiv.org/abs/2406.14715
Authors: Milad Ramezankhani, Anirudh Deodhar, Rishi Yash Parekh, Dagnachew Birru
Keywords: Deep Operator Networks, traditional neural networks, Deep Operator, Operator Networks, partial differential equations
Categories: Machine Learning (cs.LG)
Comments: Accepted at the ICML 2024 Workshop on AI for Science: Scaling in AI for Scientific Discovery

Abstract:Deep Operator Networks (DeepONets) and their physics-informed variants have shown significant promise in learning mappings between function spaces of partial differential equations, enhancing the generalization of traditional neural networks. However, for highly nonlinear real-world applications like aerospace composites processing, existing models often fail to capture underlying solutions accurately and are typically limited to single input functions, constraining rapid process design development. This paper introduces an advanced physics-informed DeepONet tailored for such complex systems with multiple input functions. Equipped with architectural enhancements like nonlinear decoders and effective training strategies such as curriculum learning and domain decomposition, the proposed model handles high-dimensional design spaces with significantly improved accuracy, outperforming the vanilla physics-informed DeepONet by two orders of magnitude. Its zero-shot prediction capability across a broad design space makes it a powerful tool for accelerating composites process design and optimization, with potential applications in other engineering fields characterized by strong nonlinearity.

[LG-84] Preferential Multi-Objective Bayesian Optimization

Link: https://arxiv.org/abs/2406.14699
Authors: Raul Astudillo, Kejun Li, Maegan Tucker, Chu Xin Cheng, Aaron D. Ames, Yisong Yue
Keywords: Preferential Bayesian optimization, Preferential Bayesian, Bayesian optimization, decision-maker latent preferences, PBO
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)

Abstract:Preferential Bayesian optimization (PBO) is a framework for optimizing a decision-maker’s latent preferences over available design choices. While preferences often involve multiple conflicting objectives, existing work in PBO assumes that preferences can be encoded by a single objective function. For example, in robotic assistive devices, technicians often attempt to maximize user comfort while simultaneously minimizing mechanical energy consumption for longer battery life. Similarly, in autonomous driving policy design, decision-makers wish to understand the trade-offs between multiple safety and performance attributes before committing to a policy. To address this gap, we propose the first framework for PBO with multiple objectives. Within this framework, we present dueling scalarized Thompson sampling (DSTS), a multi-objective generalization of the popular dueling Thompson algorithm, which may be of interest beyond the PBO setting. We evaluate DSTS across four synthetic test functions and two simulated exoskeleton personalization and driving policy design tasks, showing that it outperforms several benchmarks. Finally, we prove that DSTS is asymptotically consistent. As a direct consequence, this result provides, to our knowledge, the first convergence guarantee for dueling Thompson sampling in the PBO setting.

[LG-85] A Benchmark Study of Deep-RL Methods for Maximum Coverage Problems over Graphs

Link: https://arxiv.org/abs/2406.14697
Authors: Zhicheng Liang, Yu Yang, Xiangyu Ke, Xiaokui Xiao, Yunjun Gao
Keywords: Deep-RL methods, years have witnessed, witnessed a growing, growing trend, trend toward employing
Categories: Machine Learning (cs.LG)

Abstract:Recent years have witnessed a growing trend toward employing deep reinforcement learning (Deep-RL) to derive heuristics for combinatorial optimization (CO) problems on graphs. Maximum Coverage Problem (MCP) and its probabilistic variant on social networks, Influence Maximization (IM), have been particularly prominent in this line of research. In this paper, we present a comprehensive benchmark study that thoroughly investigates the effectiveness and efficiency of five recent Deep-RL methods for MCP and IM. These methods, namely S2V-DQN, Geometric-QN, GCOMB, RL4IM, and LeNSE, were published in top data science venues. Our findings reveal that, across various scenarios, the Lazy Greedy algorithm consistently outperforms all Deep-RL methods for MCP. In the case of IM, theoretically sound algorithms like IMM and OPIM demonstrate superior performance compared to Deep-RL methods in most scenarios. Notably, we observe an abnormal phenomenon in the IM problem where Deep-RL methods slightly outperform IMM and OPIM when the influence spread barely increases as the budget increases. Furthermore, our experimental results highlight common issues when applying Deep-RL methods to MCP and IM in practical settings. Finally, we discuss potential avenues for improving Deep-RL methods. Our benchmark study sheds light on potential challenges in current deep reinforcement learning research for solving combinatorial optimization problems.
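
For readers unfamiliar with the Lazy Greedy baseline mentioned in this abstract, the following Python sketch shows the textbook lazy-evaluation greedy for maximum coverage. It is an illustration under assumptions, not the benchmark's implementation: the function name, the input format (a list of Python sets), and the toy instance are all hypothetical. The idea is that marginal gains can only shrink as more elements are covered, so a stale gain stored in a priority queue is an upper bound and most candidates never need to be re-evaluated.

```python
import heapq

def lazy_greedy_max_coverage(sets, budget):
    """Select up to `budget` sets, greedily maximizing the number of covered elements.

    Gains are kept in a max-heap (negated for heapq). A popped candidate is
    re-scored; it is selected only if its fresh gain still beats the best
    stale upper bound left in the heap, otherwise it is pushed back.
    """
    covered = set()
    chosen = []
    heap = [(-len(s), i) for i, s in enumerate(sets)]
    heapq.heapify(heap)

    while heap and len(chosen) < budget:
        _, i = heapq.heappop(heap)
        fresh_gain = len(sets[i] - covered)
        if fresh_gain == 0:
            # Gains never increase, so this set can be discarded for good.
            continue
        if not heap or fresh_gain >= -heap[0][0]:
            # No other candidate can do better: take this set.
            chosen.append(i)
            covered |= sets[i]
        else:
            # Stale bound was too optimistic: reinsert with the updated gain.
            heapq.heappush(heap, (-fresh_gain, i))
    return chosen, covered

# Hypothetical toy instance.
candidate_sets = [{1, 2, 3, 4}, {3, 4, 5}, {5, 6}, {1, 6}]
chosen, covered = lazy_greedy_max_coverage(candidate_sets, budget=2)
print(chosen, sorted(covered))
```

On this toy instance the second set is re-scored once and pushed back, while the remaining candidates are evaluated at most once each, which is the source of the speedup over plain greedy on large inputs.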

[LG-86] A Contrastive Learning Approach to Mitigate Bias in Speech Models

Link: https://arxiv.org/abs/2406.14686
Authors: Alkis Koudounas, Flavio Giobergia, Eliana Pastor, Elena Baralis
Keywords: raising concerns, concerns about fair, fair treatment, mitigate speech model, speech model bias
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: Accepted at Interspeech 2024

Abstract:Speech models may be affected by performance imbalance in different population subgroups, raising concerns about fair treatment across these groups. Prior attempts to mitigate unfairness either focus on user-defined subgroups, potentially overlooking other affected subgroups, or do not explicitly improve the internal representation at the subgroup level. This paper proposes the first adoption of contrastive learning to mitigate speech model bias in underperforming subgroups. We employ a three-level learning technique that guides the model in focusing on different scopes for the contrastive loss, i.e., task, subgroup, and the errors within subgroups. The experiments on two spoken language understanding datasets and two languages demonstrate that our approach improves internal subgroup representations, thus reducing model bias and enhancing performance.

[LG-87] TAGLAS: An atlas of text-attributed graph datasets in the era of large graph and language models

Link: https://arxiv.org/abs/2406.14683
Authors: Jiarui Feng, Hao Liu, Lecheng Kong, Yixin Chen, Muhan Zhang
Keywords: atlas of text-attributed, present TAGLAS, datasets, TAG datasets, TAGLAS
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Preprint

Abstract:In this report, we present TAGLAS, an atlas of text-attributed graph (TAG) datasets and benchmarks. TAGs are graphs with node and edge features represented in text, which have recently gained wide applicability in training graph-language or graph foundation models. In TAGLAS, we collect and integrate more than 23 TAG datasets with domains ranging from citation graphs to molecule graphs and tasks from node classification to graph question-answering. Unlike previous graph datasets and benchmarks, all datasets in TAGLAS have a unified node and edge text feature format, which allows a graph model to be simultaneously trained and evaluated on multiple datasets from various domains. Further, we provide a standardized, efficient, and simplified way to load all datasets and tasks. We also provide useful utils like text-to-embedding conversion, and graph-to-text conversion, which can facilitate different evaluation scenarios. Finally, we also provide standard and easy-to-use evaluation utils. The project is open-sourced at this https URL and is still under construction. Please expect more datasets/features in the future.

[LG-88] This Looks Better than That: Better Interpretable Models with ProtoPNeXt

Link: https://arxiv.org/abs/2406.14675
Authors: Frank Willard, Luke Moffett, Emmanuel Mokel, Jon Donnelly, Stark Guo, Julia Yang, Giyoung Kim, Alina Jade Barnett, Cynthia Rudin
Keywords: popular interpretable alternative, black-box deep learning, deep learning models, computer vision, popular interpretable
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Prototypical-part models are a popular interpretable alternative to black-box deep learning models for computer vision. However, they are difficult to train, with high sensitivity to hyperparameter tuning, inhibiting their application to new datasets and our understanding of which methods truly improve their performance. To facilitate the careful study of prototypical-part networks (ProtoPNets), we create a new framework for integrating components of prototypical-part models – ProtoPNeXt. Using ProtoPNeXt, we show that applying Bayesian hyperparameter tuning and an angular prototype similarity metric to the original ProtoPNet is sufficient to produce new state-of-the-art accuracy for prototypical-part models on CUB-200 across multiple backbones. We further deploy this framework to jointly optimize for accuracy and prototype interpretability as measured by metrics included in ProtoPNeXt. Using the same resources, this produces models with substantially superior semantics and changes in accuracy between +1.3% and -1.5%. The code and trained models will be made publicly available upon publication.

[LG-89] Exploring Design Choices for Building Language-Specific LLMs

Link: https://arxiv.org/abs/2406.14670
Authors: Atula Tejaswi, Nilesh Gupta, Eunsol Choi
Keywords: languages remain unsatisfactory, remain unsatisfactory, rapid progress, progress in large, vast majority
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 15 pages, 6 figures, 11 tables

Abstract:Despite rapid progress in large language models (LLMs), their performance on a vast majority of languages remains unsatisfactory. In this paper, we study building language-specific LLMs by adapting monolingual and multilingual LLMs. We conduct systematic experiments on how design choices (base model selection, vocabulary extension, and continued fine-tuning) impact the adapted LLM, both in terms of efficiency (how many tokens are needed to encode the same amount of information) and end task performance. We find that (1) the initial performance before the adaptation is not always indicative of the final performance, (2) efficiency can easily be improved with simple vocabulary extension and continued fine-tuning in most LLMs we study, and (3) the optimal adaptation method is highly language-dependent, and the simplest approach works well across various experimental settings. Adapting English-centric models can yield better results than adapting multilingual models despite their worse initial performance on low-resource languages. Together, our work lays foundations on efficiently building language-specific LLMs by adapting existing LLMs.

[LG-90] Advantage Alignment Algorithms

Link: https://arxiv.org/abs/2406.14662
Authors: Juan Agustin Duque, Milad Aghajohari, Tim Cooijmans, Tianyu Zhang, Aaron Courville
Keywords: optimizing individual interests, artificially intelligent agents, agent optimizing individual, LLM assistants, Reinforcement Learning agents
Categories: Machine Learning (cs.LG)
Comments: 20 Pages, 6 figures

Abstract:The growing presence of artificially intelligent agents in everyday decision-making, from LLM assistants to autonomous vehicles, hints at a future in which conflicts may arise from each agent optimizing individual interests. In general-sum games these conflicts are apparent, where naive Reinforcement Learning agents get stuck in Pareto-suboptimal Nash equilibria. Consequently, opponent shaping has been introduced as a method with success at finding socially beneficial equilibria in social dilemmas. In this work, we introduce Advantage Alignment, a family of algorithms derived from first principles that perform opponent shaping efficiently and intuitively. This is achieved by aligning the advantages of conflicting agents in a given game by increasing the probability of mutually-benefiting actions. We prove that existing opponent shaping methods, including LOLA and LOQA, implicitly perform Advantage Alignment. Compared to these works, Advantage Alignment mathematically simplifies the formulation of opponent shaping and seamlessly works for continuous action domains. We also demonstrate the effectiveness of our algorithm in a wide range of social dilemmas, achieving state of the art results in each case, including a social dilemma version of the Negotiation Game.

[LG-91] OpenDebateEvidence: A Massive-Scale Argument Mining and Summarization Dataset

Link: https://arxiv.org/abs/2406.14657
Authors: Allen Roush, Yusuf Shabazz, Arvind Balaji, Peter Zhang, Stefano Mezza, Markus Zhang, Sanjay Basu, Sriram Vishwanath, Mehdi Fatemi, Ravid Schwartz-Ziv
Keywords: American Competitive Debate, Competitive Debate community, American Competitive, Competitive Debate, Debate community
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted for Publication to ARGMIN 2024 at ACL2024

Abstract:We introduce OpenDebateEvidence, a comprehensive dataset for argument mining and summarization sourced from the American Competitive Debate community. This dataset includes over 3.5 million documents with rich metadata, making it one of the most extensive collections of debate evidence. OpenDebateEvidence captures the complexity of arguments in high school and college debates, providing valuable resources for training and evaluation. Our extensive experiments demonstrate the efficacy of fine-tuning state-of-the-art large language models for argumentative abstractive summarization across various methods, models, and datasets. By providing this comprehensive resource, we aim to advance computational argumentation and support practical applications for debaters, educators, and researchers. OpenDebateEvidence is publicly available to support further research and innovation in computational argumentation. Access it here: this https URL

[LG-92] HYPERmotion: Learning Hybrid Behavior Planning for Autonomous Loco-manipulation

Link: https://arxiv.org/abs/2406.14655
Authors: Jin Wang, Rui Dai, Weijie Wang, Luca Rossini, Francesco Ruscelli, Nikos Tsagarakis
Keywords: autonomously perform hybrid, Enabling robots, perform hybrid motions, household chores, material handling
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Project page: this https URL

Abstract:Enabling robots to autonomously perform hybrid motions in diverse environments can be beneficial for long-horizon tasks such as material handling, household chores, and work assistance. This requires extensive exploitation of intrinsic motion capabilities, extraction of affordances from rich environmental information, and planning of physical interaction behaviors. Although recent progress has demonstrated impressive humanoid whole-body control abilities, existing approaches struggle to achieve versatility and adaptability for new tasks. In this work, we propose HYPERmotion, a framework that learns, selects and plans behaviors based on tasks in different scenarios. We combine reinforcement learning with whole-body optimization to generate motion for 38 actuated joints and create a motion library to store the learned skills. We apply the planning and reasoning features of the large language models (LLMs) to complex loco-manipulation tasks, constructing a hierarchical task graph that comprises a series of primitive behaviors to bridge lower-level execution with higher-level planning. We leverage the interaction of distilled spatial geometry and 2D observation with a visual language model (VLM) to ground knowledge into a robotic morphology selector that chooses appropriate actions in single- or dual-arm, legged or wheeled locomotion. Experiments in simulation and the real world show that learned motions can efficiently adapt to new tasks, demonstrating high autonomy from free-text commands in unstructured scenes. Videos and website: this http URL

[LG-93] Major Entity Identification: A Generalizable Alternative to Coreference Resolution

Link: https://arxiv.org/abs/2406.14654
Authors: Kawshik Manikantan (1), Shubham Toshniwal (2), Makarand Tapaswi (1), Vineet Gandhi (1) ((1) CVIT, IIIT Hyderabad, (2) NVIDIA)
Keywords: task broad application, Major Entity Identification, coreference resolution, broad application, major bottleneck
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 16 pages, 6 figures

Abstract:The limited generalization of coreference resolution (CR) models has been a major bottleneck in the task’s broad application. Prior work has identified annotation differences, especially for mention detection, as one of the main reasons for the generalization gap and proposed using additional annotated target domain data. Rather than relying on this additional annotation, we propose an alternative formulation of the CR task, Major Entity Identification (MEI), where we: (a) assume the target entities to be specified in the input, and (b) limit the task to only the frequent entities. Through extensive experiments, we demonstrate that MEI models generalize well across domains on multiple datasets with supervised models and LLM-based few-shot prompting. Additionally, the MEI task fits the classification framework, which enables the use of classification-based metrics that are more robust than the current CR metrics. Finally, MEI is also of practical use as it allows a user to search for all mentions of a particular entity or a group of entities of interest.

[LG-94] Harvesting Efficient On-Demand Order Pooling from Skilled Couriers: Enhancing Graph Representation Learning for Refining Real-time Many-to-One Assignments

Link: https://arxiv.org/abs/2406.14635
Authors: Yile Liang, Jiuxia Zhao, Donghui Li, Jie Feng, Chen Zhang, Xuetao Ding, Jinghua Hao, Renqing He
Keywords: offering delivery fulfillment, on-demand food delivery, recent past, past has witnessed, witnessed a notable
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted in KDD 2024 ADS Track

Abstract:The recent past has witnessed a notable surge in on-demand food delivery (OFD) services, offering delivery fulfillment within dozens of minutes after an order is placed. In OFD, pooling multiple orders for simultaneous delivery in real-time order assignment is a pivotal efficiency source, which may in turn extend delivery time. Constructing high-quality order pooling to harmonize platform efficiency with the experiences of consumers and couriers is crucial to OFD platforms. However, the complexity and real-time nature of order assignment, which make extensive calculations impractical, significantly limit the potential for order consolidation. Moreover, the offline environment is frequently riddled with unknown factors, posing challenges for the platform’s perceptibility and pooling decisions. Nevertheless, the delivery behaviors of skilled couriers (SCs), who know the environment well, can improve system awareness and effectively inform decisions. Hence an SC delivery network (SCDN) is constructed, based on an enhanced attributed heterogeneous network embedding approach tailored for OFD. It aims to extract features from rich temporal and spatial information, and uncover the latent potential for order combinations embedded within SC trajectories. Accordingly, the vast search space of order assignment can be effectively pruned through scalable similarity calculations of low-dimensional vectors, making comprehensive and high-quality pooling outcomes more easily identified in real time. SCDN has now been deployed in the Meituan dispatch system. Online tests reveal that with SCDN, the pooling quality and extent have been greatly improved. Our system can boost couriers’ efficiency by 45-55% during noon peak hours, while upholding the timely delivery commitment.

[LG-95] ICAL: Continual Learning of Multimodal Agents by Transforming Trajectories into Actionable Insights

Link: https://arxiv.org/abs/2406.14596
Authors: Gabriel Sarch, Lawrence Jang, Michael J. Tarr, William W. Cohen, Kenneth Marino, Katerina Fragkiadaki
Keywords: Large-scale generative language, Large-scale generative, generative language, language and vision-language, decision making
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Project website: this http URL

Abstract:Large-scale generative language and vision-language models (LLMs and VLMs) excel in few-shot in-context learning for decision making and instruction following. However, they require high-quality exemplar demonstrations to be included in their context window. In this work, we ask: Can LLMs and VLMs generate their own prompt examples from generic, sub-optimal demonstrations? We propose In-Context Abstraction Learning (ICAL), a method that builds a memory of multimodal experience insights from sub-optimal demonstrations and human feedback. Given a noisy demonstration in a new domain, VLMs abstract the trajectory into a general program by fixing inefficient actions and annotating cognitive abstractions: task relationships, object state changes, temporal subgoals, and task construals. These abstractions are refined and adapted interactively through human feedback while the agent attempts to execute the trajectory in a similar environment. The resulting abstractions, when used as exemplars in the prompt, significantly improve decision-making in retrieval-augmented LLM and VLM agents. Our ICAL agent surpasses the state-of-the-art in dialogue-based instruction following in TEACh, multimodal web agents in VisualWebArena, and action anticipation in Ego4D. In TEACh, we achieve a 12.6% improvement in goal-condition success. In VisualWebArena, our task success rate improves over the SOTA from 14.3% to 22.7%. In Ego4D action forecasting, we improve over few-shot GPT-4V and remain competitive with supervised models. We show finetuning our retrieval-augmented in-context agent yields additional improvements. Our approach significantly reduces reliance on expert-crafted examples and consistently outperforms in-context learning from action plans that lack such insights.

[LG-96] Adversaries Can Misuse Combinations of Safe Models

Link: https://arxiv.org/abs/2406.14595
Authors: Erik Jones, Anca Dragan, Jacob Steinhardt
Keywords: user manipulation, model, Developers, model enables cyberoffense, adversaries
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Developers try to evaluate whether an AI system can be misused by adversaries before releasing it; for example, they might test whether a model enables cyberoffense, user manipulation, or bioterrorism. In this work, we show that individually testing models for misuse is inadequate; adversaries can misuse combinations of models even when each individual model is safe. The adversary accomplishes this by first decomposing tasks into subtasks, then solving each subtask with the best-suited model. For example, an adversary might solve challenging-but-benign subtasks with an aligned frontier model, and easy-but-malicious subtasks with a weaker misaligned model. We study two decomposition methods: manual decomposition where a human identifies a natural decomposition of a task, and automated decomposition where a weak model generates benign tasks for a frontier model to solve, then uses the solutions in-context to solve the original task. Using these decompositions, we empirically show that adversaries can create vulnerable code, explicit images, python scripts for hacking, and manipulative tweets at much higher rates with combinations of models than either individual model. Our work suggests that even perfectly-aligned frontier systems can enable misuse without ever producing malicious outputs, and that red-teaming efforts should extend beyond single models in isolation.

[LG-97] Enhancing Dropout-based Bayesian Neural Networks with Multi-Exit on FPGA

Link: https://arxiv.org/abs/2406.14593
Authors: Hao Mark Chen, Liam Castelli, Martin Ferianc, Hongyu Zhou, Shuanglong Liu, Wayne Luk, Hongxiang Fan
Keywords: Reliable uncertainty estimation, uncertainty estimation plays, Reliable uncertainty, uncertainty estimation, autonomous driving
Categories: Machine Learning (cs.LG)
Comments: arXiv admin note: text overlap with arXiv:2308.06849

Abstract:Reliable uncertainty estimation plays a crucial role in various safety-critical applications such as medical diagnosis and autonomous driving. In recent years, Bayesian neural networks (BayesNNs) have gained substantial research and industrial interests due to their capability to make accurate predictions with reliable uncertainty estimation. However, the algorithmic complexity and the resulting hardware performance of BayesNNs hinder their adoption in real-life applications. To bridge this gap, this paper proposes an algorithm and hardware co-design framework that can generate field-programmable gate array (FPGA)-based accelerators for efficient BayesNNs. At the algorithm level, we propose novel multi-exit dropout-based BayesNNs with reduced computational and memory overheads while achieving high accuracy and quality of uncertainty estimation. At the hardware level, this paper introduces a transformation framework that can generate FPGA-based accelerators for the proposed efficient multi-exit BayesNNs. Several optimization techniques such as the mix of spatial and temporal mappings are introduced to reduce resource consumption and improve the overall hardware performance. Comprehensive experiments demonstrate that our approach can achieve higher energy efficiency compared to CPU, GPU, and other state-of-the-art hardware implementations. To support the future development of this research, we have open-sourced our code at: this https URL

[LG-98] Physics-informed neural networks for parameter learning of wildfire spreading

Link: https://arxiv.org/abs/2406.14591
Authors: Konstantinos Vogiatzoglou, Costas Papadimitriou, Vasilis Bontozoglou, Konstantinos Ampountolas
Keywords: Wildland fires pose, terrifying natural hazards, pose terrifying natural, wildfire spreading model, fires pose terrifying
Categories: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
Comments: 26 pages, 10 figures, 1 Table, Under review in Computer Methods in Applied Mechanics and Engineering

Abstract:Wildland fires pose terrifying natural hazards, underscoring the urgent need to develop data-driven and physics-informed digital twins for wildfire prevention, monitoring, intervention, and response. In this direction of research, this work introduces a physics-informed neural network (PiNN) to learn the unknown parameters of an interpretable wildfire spreading model. The considered wildfire spreading model integrates fundamental physical laws articulated by key model parameters, essential for capturing the complex behavior of wildfires. The proposed machine learning approach leverages the theory of artificial neural networks with the physical constraints governing wildfire dynamics, such as the first principles of mass and energy conservation. Training of the PiNN for physics-informed parameter identification is realized using data of the temporal evolution of one- and two-dimensional (plane surface) fire fronts that have been obtained from a high-fidelity simulator of the wildfire spreading model under consideration. The parameter learning results demonstrate the remarkable predictive ability of the proposed PiNN in uncovering the unknown coefficients in both the one- and two-dimensional fire spreading scenarios. Additionally, this methodology exhibits robustness by identifying the same parameters in the presence of noisy data. The proposed framework is envisioned to be incorporated in a physics-informed digital twin for intelligent wildfire management and risk assessment.

[LG-99] PreSto: An In-Storage Data Preprocessing System for Training Recommendation Models

Link: https://arxiv.org/abs/2406.14571
Authors: Yunjae Lee, Hyeseong Kim, Minsoo Rhu
Keywords: Training recommendation systems, faces several challenges, stage to preprocess, seamless manner, raw data
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Training recommendation systems (RecSys) faces several challenges as it requires the “data preprocessing” stage to preprocess an ample amount of raw data and feed them to the GPU for training in a seamless manner. To sustain high training throughput, state-of-the-art solutions reserve a large fleet of CPU servers for preprocessing which incurs substantial deployment cost and power consumption. Our characterization reveals that prior CPU-centric preprocessing is bottlenecked on feature generation and feature normalization operations as it fails to reap out the abundant inter-/intra-feature parallelism in RecSys preprocessing. PreSto is a storage-centric preprocessing system leveraging In-Storage Processing (ISP), which offloads the bottlenecked preprocessing operations to our ISP units. We show that PreSto outperforms the baseline CPU-centric system with a $9.6\times$ speedup in end-to-end preprocessing time, $4.3\times$ enhancement in cost-efficiency, and $11.3\times$ improvement in energy efficiency on average for production-scale RecSys preprocessing.

[LG-100] Tackling GenAI Copyright Issues: Originality Estimation and Genericization

Link: https://arxiv.org/abs/2406.03341
Authors: Hiroaki Chiba-Okabe, Weijie J. Su
Keywords: numerous lawsuits filed, significant copyright concerns, sparked significant copyright, leading to numerous, generative model
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Applications (stat.AP); Methodology (stat.ME); Machine Learning (stat.ML)
Comments: 15 pages, 6 figures

Abstract:The rapid progress of generative AI technology has sparked significant copyright concerns, leading to numerous lawsuits filed against AI developers. While some studies explore methods to mitigate copyright risks by steering the outputs of generative models away from those resembling copyrighted data, little attention has been paid to the question of how much of a resemblance is undesirable; more original or unique data are afforded stronger protection, and the threshold level of resemblance for constituting infringement correspondingly lower. Here, leveraging this principle, we propose a genericization method that modifies the outputs of a generative model to make them more generic and less likely to infringe copyright. To achieve this, we introduce a metric for quantifying the level of originality of data in a manner that is consistent with the legal framework. This metric can be practically estimated by drawing samples from a generative model, which is then used for the genericization process. Experiments demonstrate that our genericization method successfully modifies the output of a text-to-image generative model so that it produces more generic, copyright-compliant images.

[LG-101] Straight-Through meets Sparse Recovery: the Support Exploration Algorithm

Link: https://arxiv.org/abs/2301.13584
Authors: Mimoun Mohamed (QARMA, I2M), François Malgouyres (IMT), Valentin Emiya (QARMA), Caroline Chaux (IPAL)
Keywords: quantized neural networks, optimize quantized neural, sparse support recovery
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Statistics Theory (math.ST)

Abstract: The straight-through estimator (STE) is commonly used to optimize quantized neural networks, yet its contexts of effective performance are still unclear despite empirical successes. To make a step forward in this comprehension, we apply STE to a well-understood problem: sparse support recovery. We introduce the Support Exploration Algorithm (SEA), a novel algorithm promoting sparsity, and we analyze its performance in support recovery (a.k.a. model selection) problems. SEA explores more supports than the state-of-the-art, leading to superior performance in experiments, especially when the columns of A are strongly coherent. The theoretical analysis considers recovery guarantees when the linear measurements matrix A satisfies the Restricted Isometry Property (RIP). The sufficient conditions of recovery are comparable but more stringent than those of the state-of-the-art in sparse support recovery. Their significance lies mainly in their applicability to an instance of the STE.

[LG-102] Dislocation cartography: Representations and unsupervised classification of dislocation networks with unique fingerprints

Link: https://arxiv.org/abs/2406.15004
Authors: Benjamin Udofia, Tushar Jogi, Markus Stricker
Keywords: Detecting structure, step to arrive, arrive at meaningful, meaningful representations, Detecting
Categories: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
Comments: 26 pages, 7 figures

Abstract: Detecting structure in data is the first step to arrive at meaningful representations for systems. This is particularly challenging for dislocation networks evolving as a consequence of plastic deformation of crystalline systems. Our study employs Isomap, a manifold learning technique, to unveil the intrinsic structure of high-dimensional density field data of dislocation structures from different compression axes. The resulting maps provide a systematic framework for quantitatively comparing dislocation structures, offering unique fingerprints based on density fields. Our novel, unbiased approach contributes to the quantitative classification of dislocation structures, which can be systematically extended.

[LG-103] Enhancing reliability in prediction intervals using point forecasters: Heteroscedastic Quantile Regression and Width-Adaptive Conformal Inference

Link: https://arxiv.org/abs/2406.14904
Authors: Carlos Sebastián, Carlos E. González-Guillén, Jesús Juan
Keywords: Building prediction intervals, forecasting problems presents, series forecasting problems, Building prediction, time series forecasting
Categories: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Building prediction intervals for time series forecasting problems presents a complex challenge, particularly when relying solely on point predictors, a common scenario for practitioners in the industry. While research has primarily focused on achieving increasingly efficient valid intervals, we argue that, when evaluating a set of intervals, traditional measures alone are insufficient. There are additional crucial characteristics: the intervals must vary in length, with this variation directly linked to the difficulty of the prediction, and the coverage of the interval must remain independent of the difficulty of the prediction for practical utility. We propose the Heteroscedastic Quantile Regression (HQR) model and the Width-Adaptive Conformal Inference (WACI) method, providing theoretical coverage guarantees, to overcome those issues, respectively. The methodologies are evaluated in the context of Electricity Price Forecasting and Wind Power Forecasting, representing complex scenarios in time series forecasting. The results demonstrate that HQR and WACI not only improve or achieve typical measures of validity and efficiency but also successfully fulfil the commonly ignored mentioned characteristics.
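
As a point of reference for the setting described above (intervals built on top of a point forecaster), here is a minimal sketch of plain split conformal prediction. It is not HQR or WACI; the function name `split_conformal_interval` and the toy numbers are assumptions made for illustration only. Note that it produces the same width for every prediction, which is the kind of non-adaptive behavior the abstract argues is insufficient.

```python
import math

def split_conformal_interval(cal_true, cal_pred, new_pred, alpha=0.1):
    """Return a (lower, upper) interval around new_pred.

    Plain split conformal prediction: the half-width is a finite-sample
    corrected quantile of absolute residuals on a held-out calibration set.
    """
    residuals = sorted(abs(y - p) for y, p in zip(cal_true, cal_pred))
    n = len(residuals)
    # Rank of the conformal quantile: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: the interval is unbounded.
        return (-math.inf, math.inf)
    half_width = residuals[rank - 1]
    return (new_pred - half_width, new_pred + half_width)

# Toy calibration data (hypothetical values, not from the paper).
cal_true = [10.0, 12.0, 9.0, 11.0, 13.0, 10.5, 9.5, 12.5, 11.5]
cal_pred = [10.5, 11.0, 9.2, 11.4, 12.0, 10.0, 9.9, 12.6, 11.0]
interval = split_conformal_interval(cal_true, cal_pred, new_pred=11.0, alpha=0.2)
print(interval)
```

An adaptive method would replace the single `half_width` with a quantity that depends on how hard each prediction is; how HQR and WACI do that is described in the paper, not here.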

[LG-104] Bayesian neural networks for predicting uncertainty in full-field material response

Link: https://arxiv.org/abs/2406.14838
Authors: George D. Pasparakis, Lori Graham-Brady, Michael D. Shields
Keywords: Hamiltonian Monte Carlo, uncertainty estimates, Monte Carlo, Monte Carlo Dropout, important tasks
Categories: Machine Learning (stat.ML); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Applications (stat.AP)

Abstract:Stress and material deformation field predictions are among the most important tasks in computational mechanics. These predictions are typically made by solving the governing equations of continuum mechanics using finite element analysis, which can become computationally prohibitive considering complex microstructures and material behaviors. Machine learning (ML) methods offer potentially cost effective surrogates for these applications. However, existing ML surrogates are either limited to low-dimensional problems and/or do not provide uncertainty estimates in the predictions. This work proposes an ML surrogate framework for stress field prediction and uncertainty quantification for diverse materials microstructures. A modified Bayesian U-net architecture is employed to provide a data-driven image-to-image mapping from initial microstructure to stress field with prediction (epistemic) uncertainty estimates. The Bayesian posterior distributions for the U-net parameters are estimated using three state-of-the-art inference algorithms: the posterior sampling-based Hamiltonian Monte Carlo method and two variational approaches, the Monte-Carlo Dropout method and the Bayes by Backprop algorithm. A systematic comparison of the predictive accuracy and uncertainty estimates for these methods is performed for a fiber reinforced composite material and polycrystalline microstructure application. It is shown that the proposed methods yield predictions of high accuracy compared to the FEA solution, while uncertainty estimates depend on the inference approach. Generally, the Hamiltonian Monte Carlo and Bayes by Backprop methods provide consistent uncertainty estimates. Uncertainty estimates from Monte Carlo Dropout, on the other hand, are more difficult to interpret and depend strongly on the method’s design.

[LG-105] On the estimation rate of Bayesian PINN for inverse problems

Link: https://arxiv.org/abs/2406.14808
Authors: Yi Sun, Debarghya Mukherjee, Yves Atchade
Keywords: Physics-informed neural networks, machine learning community, rapidly growing approach, Solving partial differential, problems using Physics-informed
Categories: Statistics Theory (math.ST); Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
Comments: 35 pages, 3 figures, and 2 tables

Abstract: Solving partial differential equations (PDEs) and their inverse problems using Physics-informed neural networks (PINNs) is a rapidly growing approach in the physics and machine learning community. Although several architectures exist for PINNs that work remarkably well in practice, our theoretical understanding of their performances is somewhat limited. In this work, we study the behavior of a Bayesian PINN estimator of the solution of a PDE from n independent noisy measurements of the solution. We focus on a class of equations that are linear in their parameters (with unknown coefficients $\theta_\star$). We show that when the partial differential equation admits a classical solution (say $u_\star$), differentiable to order $\beta$, the mean square error of the Bayesian posterior mean is at least of order $n^{-2\beta/(2\beta + d)}$. Furthermore, we establish a convergence rate of the linear coefficients of $\theta_\star$ depending on the order of the underlying differential operator. Last but not least, our theoretical results are validated through extensive simulations.
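
To make the stated rate concrete, here is a worked instance. The values of the smoothness order, the dimension, and the sample size below are chosen purely for illustration and do not come from the paper; d is assumed here to denote the dimension of the input domain, which the abstract does not spell out.

```latex
\[
n^{-2\beta/(2\beta+d)}\Big|_{\beta=2,\;d=1} = n^{-4/5},
\qquad
n = 10^{5} \;\Rightarrow\; n^{-4/5} = 10^{-4}.
\]
```

As the smoothness order grows, the exponent $2\beta/(2\beta+d)$ approaches 1; as the dimension grows with smoothness held fixed, the exponent shrinks toward 0 and the rate slows.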

[LG-106] ImageFlowNet: Forecasting Multiscale Trajectories of Disease Progression with Irregularly-Sampled Longitudinal Medical Images

Link: https://arxiv.org/abs/2406.14794
Authors: Chen Liu, Ke Xu, Liangbo L. Shen, Guillaume Huguet, Zilong Wang, Alexander Tong, Danilo Bzdok, Jay Stewart, Jay C. Wang, Lucian V. Del Priore, Smita Krishnaswamy
Keywords: clinical decision making, decision making, holy grail, grail for clinical, clinical decision
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Abstract:The forecasting of disease progression from images is a holy grail for clinical decision making. However, this task is complicated by the inherent high dimensionality, temporal sparsity and sampling irregularity in longitudinal image acquisitions. Existing methods often rely on extracting hand-crafted features and performing time-series analysis in this vector space, leading to a loss of rich spatial information within the images. To overcome these challenges, we introduce ImageFlowNet, a novel framework that learns latent-space flow fields that evolve multiscale representations in joint embedding spaces using neural ODEs and SDEs to model disease progression in the image domain. Notably, ImageFlowNet learns multiscale joint representation spaces by combining cohorts of patients together so that information can be transferred between the patient samples. The dynamics then provide plausible trajectories of progression, with the SDE providing alternative trajectories from the same starting point. We provide theoretical insights that support our formulation of ODEs, and motivate our regularizations involving high-level visual features, latent space organization, and trajectory smoothness. We then demonstrate ImageFlowNet’s effectiveness through empirical evaluations on three longitudinal medical image datasets depicting progression in retinal geographic atrophy, multiple sclerosis, and glioblastoma.

[LG-107] Machine Learning Global Simulation of Nonlocal Gravity Wave Propagation

Link: https://arxiv.org/abs/2406.14775
Authors: Aman Gupta, Aditi Sheshadri, Sujit Roy, Vishal Gaur, Manil Maskey, Rahul Ramachandran
Keywords: models typically operate, resolve atmospheric mesoscale, gravity waves, atmospheric mesoscale processes, typically operate
Categories: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn); Geophysics (physics.geo-ph)
Comments: 9 pages, 7 figures, no tables

Abstract:Global climate models typically operate at a grid resolution of hundreds of kilometers and fail to resolve atmospheric mesoscale processes, e.g., clouds, precipitation, and gravity waves (GWs). Model representation of these processes and their sources is essential to the global circulation and planetary energy budget, but subgrid scale contributions from these processes are often only approximately represented in models using parameterizations. These parameterizations are subject to approximations and idealizations, which limit their capability and accuracy. The most drastic of these approximations is the “single-column approximation” which completely neglects the horizontal evolution of these processes, resulting in key biases in current climate models. With a focus on atmospheric GWs, we present the first-ever global simulation of atmospheric GW fluxes using machine learning (ML) models trained on the WINDSET dataset to emulate global GW emulation in the atmosphere, as an alternative to traditional single-column parameterizations. Using an Attention U-Net-based architecture trained on globally resolved GW momentum fluxes, we illustrate the importance and effectiveness of global nonlocality, when simulating GWs using data-driven schemes.

[LG-108] Pathological Regularization Regimes in Classification Tasks

Link: https://arxiv.org/abs/2406.14731
Authors: Maximilian Wiesmann, Paul Larsen
Keywords: binary classification tasks, classification score obtained, trend reversal, binary classification, classification tasks
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)

Abstract:In this paper we demonstrate the possibility of a trend reversal in binary classification tasks between the dataset and a classification score obtained from a trained model. This trend reversal occurs for certain choices of the regularization parameter for model training, namely, if the parameter is contained in what we call the pathological regularization regime. For ridge regression, we give necessary and sufficient algebraic conditions on the dataset for the existence of a pathological regularization regime. Moreover, our results provide a data science practitioner with a hands-on tool to avoid hyperparameter choices suffering from trend reversal. We furthermore present numerical results on pathological regularization regimes for logistic regression. Finally, we draw connections to datasets exhibiting Simpson’s paradox, providing a natural source of pathological datasets.

[LG-109] Voice Disorder Analysis: a Transformer-based Approach

Link: https://arxiv.org/abs/2406.14693
Authors: Alkis Koudounas, Gabriele Ciravegna, Marco Fantini, Giovanni Succo, Erika Crosetti, Tania Cerquitelli, Elena Baralis
Keywords: significantly affecting patient, affecting patient quality, pathologies significantly affecting, quality of life, significantly affecting
Categories: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG)
Comments: Accepted at Interspeech 2024

Abstract:Voice disorders are pathologies significantly affecting patient quality of life. However, non-invasive automated diagnosis of these pathologies is still under-explored, due to both a shortage of pathological voice data, and diversity of the recording types used for the diagnosis. This paper proposes a novel solution that adopts transformers directly working on raw voice signals and addresses data shortage through synthetic data generation and data augmentation. Further, we consider many recording types at the same time, such as sentence reading and sustained vowel emission, by employing a Mixture of Expert ensemble to align the predictions on different data types. The experimental results, obtained on both public and private datasets, show the effectiveness of our solution in the disorder detection and classification tasks and largely improve over existing approaches.

[LG-110] Uniform Convergence of Adversarially Robust Classifiers

Link: https://arxiv.org/abs/2406.14682
Authors: Rachel Morris, Ryan Murray
Keywords: recent years, significant interest, data classification problems, classification problems, adversarially-perturbed classification problems
Categories: Analysis of PDEs (math.AP); Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: 36 pages, 2 figures

Abstract: In recent years there has been significant interest in the effect of different types of adversarial perturbations in data classification problems. Many of these models incorporate the adversarial power, which is an important parameter with an associated trade-off between accuracy and robustness. This work considers a general framework for adversarially-perturbed classification problems, in a large data or population-level limit. In such a regime, we demonstrate that as adversarial strength goes to zero, optimal classifiers converge to the Bayes classifier in the Hausdorff distance. This significantly strengthens previous results, which generally focus on $L^1$-type convergence. The main argument relies upon direct geometric comparisons and is inspired by techniques from geometric measure theory.

[LG-111] Deep-learning-assisted reconfigurable metasurface antenna for real-time holographic beam steering

Link: https://arxiv.org/abs/2406.14585
Authors: Hyunjun Ma, Jin-soo Kim, Jong-Ho Choe, Q-Han Park
Keywords: holographic beam steering, beam steering, metasurface antenna capable, time holographic beam, field pattern
Categories: Computational Physics (physics.comp-ph); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an); Optics (physics.optics)

Abstract: We propose a metasurface antenna capable of real-time holographic beam steering. An array of reconfigurable dipoles can generate on-demand far-field patterns of radiation through the specific encoding of meta-atomic states, i.e., the configuration of each dipole. Suitable states for the generation of the desired patterns can be identified using iteration, but this is very slow and needs to be done for each far-field pattern. Here, we present a deep-learning-based method for the control of a metasurface antenna with point dipole elements that vary in their state using dipole polarizability. Instead of iteration, we adopt a deep learning algorithm that combines an autoencoder with an electromagnetic scattering equation to determine the states required for a target far-field pattern in real time. The scattering equation from the Born approximation is used as the decoder in training the neural network, and analytic Green’s function calculation is used to check the validity of the Born approximation. Our learning-based algorithm requires a computing time of within 200 microseconds to determine the meta-atomic states, thus enabling the real-time operation of a holographic antenna.

Information Retrieval

[IR-0] STARD: A Chinese Statute Retrieval Dataset with Real Queries Issued by Non-professionals

Link: https://arxiv.org/abs/2406.15313
Authors: Weihang Su, Yiran Hu, Anzhe Xie, Qingyao Ai, Zibing Que, Ning Zheng, Yun Liu, Weixing Shen, Yiqun Liu
Keywords: find relevant statutory, Statute retrieval aims, Statute retrieval, Existing statute retrieval, aims to find
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Abstract: Statute retrieval aims to find relevant statutory articles for specific queries. This process is the basis of a wide range of legal applications such as legal advice, automated judicial decisions, legal document drafting, etc. Existing statute retrieval benchmarks focus on formal and professional queries from sources like bar exams and legal case documents, thereby neglecting non-professional queries from the general public, which often lack precise legal terminology and references. To address this gap, we introduce the STAtute Retrieval Dataset (STARD), a Chinese dataset comprising 1,543 query cases collected from real-world legal consultations and 55,348 candidate statutory articles. Unlike existing statute retrieval datasets, which primarily focus on professional legal queries, STARD captures the complexity and diversity of real queries from the general public. Through a comprehensive evaluation of various retrieval baselines, we reveal that existing retrieval approaches all fall short of these real queries issued by non-professional users. The best method only achieves a Recall@100 of 0.907, suggesting the necessity for further exploration and additional research in this area. All the codes and datasets are available at: this https URL

[IR-1] Towards Fine-Grained Citation Evaluation in Generated Text: A Comparative Analysis of Faithfulness Metrics

Link: https://arxiv.org/abs/2406.15264
Authors: Weijia Zhang, Mohammad Aliannejadi, Yifei Yuan, Jiahuan Pei, Jia-Hong Huang, Evangelos Kanoulas
Keywords: Large language models, Large language, language models, unverifiable information, produce unsupported
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: 12 pages, 3 figures

Abstract: Large language models (LLMs) often produce unsupported or unverifiable information, known as “hallucinations.” To mitigate this, retrieval-augmented LLMs incorporate citations, grounding the content in verifiable sources. Despite such developments, manually assessing how well a citation supports the associated statement remains a major challenge. Previous studies use faithfulness metrics to estimate citation support automatically but are limited to binary classification, overlooking fine-grained citation support in practical scenarios. To investigate the effectiveness of faithfulness metrics in fine-grained scenarios, we propose a comparative evaluation framework that assesses the metric effectiveness in distinguishing citations between three-category support levels: full, partial, and no support. Our framework employs correlation analysis, classification evaluation, and retrieval evaluation to measure the alignment between metric scores and human judgments comprehensively. Our results show no single metric consistently excels across all evaluations, revealing the complexity of assessing fine-grained support. Based on the findings, we provide practical recommendations for developing more effective metrics.

[IR-2] Retrieval Augmented Zero-Shot Text Classification

Link: https://arxiv.org/abs/2406.15241
Authors: Tassallah Abdullahi, Ritambhara Singh, Carsten Eickhoff
Keywords: unseen classes efficiently, handle unseen classes, task-specific training data, Zero-shot text learning, handle unseen
Categories: Information Retrieval (cs.IR)
Comments: Proceedings of the 2024 ACM SIGIR International Conference on the Theory of Information Retrieval (ICTIR '24), July 13, 2024, Washington DC, DC, USA

Abstract:Zero-shot text learning enables text classifiers to handle unseen classes efficiently, alleviating the need for task-specific training data. A simple approach often relies on comparing embeddings of query (text) to those of potential classes. However, the embeddings of a simple query sometimes lack rich contextual information, which hinders the classification performance. Traditionally, this has been addressed by improving the embedding model with expensive training. We introduce QZero, a novel training-free knowledge augmentation approach that reformulates queries by retrieving supporting categories from Wikipedia to improve zero-shot text classification performance. Our experiments across six diverse datasets demonstrate that QZero enhances performance for state-of-the-art static and contextual embedding models without the need for retraining. Notably, in News and medical topic classification tasks, QZero improves the performance of even the largest OpenAI embedding model by at least 5% and 3%, respectively. Acting as a knowledge amplifier, QZero enables small word embedding models to achieve performance levels comparable to those of larger contextual models, offering the potential for significant computational savings. Additionally, QZero offers meaningful insights that illuminate query context and verify topic relevance, aiding in understanding model predictions. Overall, QZero improves embedding-based zero-shot classifiers while maintaining their simplicity. This makes it particularly valuable for resource-constrained environments and domains with constantly evolving information.
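
The embedding-comparison idea in the abstract can be sketched with a toy example. Everything below is an assumption made for illustration: the bag-of-words `embed` function stands in for a real embedding model, the class descriptions are invented, and `retrieved_categories` is hard-coded where a real system would retrieve categories from a corpus such as Wikipedia. It is not the QZero implementation.

```python
import math
from collections import Counter

def embed(text):
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def classify(query, class_descriptions):
    """Pick the class whose description is most similar to the query."""
    q = embed(query)
    scores = {label: cosine(q, embed(desc)) for label, desc in class_descriptions.items()}
    return max(scores, key=scores.get), scores

classes = {
    "sports": "sports football tennis athletes",
    "health": "health medicine drugs disease",
}
query = "ibuprofen dosage"
# Hypothetical retrieval output; a real system would look these up.
retrieved_categories = ["analgesic drugs", "medicine"]

# The raw query shares no words with either class, so both scores are zero.
plain_label, plain_scores = classify(query, classes)
# Reformulated query: here simply the concatenated category names (an assumption).
augmented_label, augmented_scores = classify(" ".join(retrieved_categories), classes)
print(plain_scores, augmented_label)
```

With the raw query the classifier has nothing to go on, while the reformulated query overlaps with the "health" description. The classifier itself is unchanged, which mirrors the training-free claim in the abstract.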

[IR-3] UDA: A Benchmark Suite for Retrieval Augmented Generation in Real-world Document Analysis

Link: https://arxiv.org/abs/2406.15187
Authors: Yulong Hui, Yao Lu, Huanchen Zhang
Keywords: Large Language Models, improved Large Language, Language Models, Large Language, significant challenges exist
Categories: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Abstract:The use of Retrieval-Augmented Generation (RAG) has improved Large Language Models (LLMs) in collaborating with external data, yet significant challenges exist in real-world scenarios. In areas such as academic literature and finance question answering, data are often found in raw text and tables in HTML or PDF formats, which can be lengthy and highly unstructured. In this paper, we introduce a benchmark suite, namely Unstructured Document Analysis (UDA), that involves 2,965 real-world documents and 29,590 expert-annotated QA pairs. We revisit popular LLM- and RAG-based solutions for document analysis and evaluate the design choices and answer qualities across multiple document domains and diverse query types. Our evaluation yields interesting findings and highlights the importance of data parsing and retrieval. We hope our benchmark can shed light and better serve real-world document analysis applications. The benchmark suite and code can be found at this https URL.

[IR-4] Évaluation des capacités de réponse de larges modèles de langage (LLM) pour des questions d'historiens

Link: https://arxiv.org/abs/2406.15173
Authors: Mathieu Chartier, Nabil Dakkoune, Guillaume Bourgeois, Stéphane Jean
Keywords: Large Language Models, revolutionized information retrieval, ChatGPT or Bard, Bard have revolutionized, generate custom responses
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: In French

Abstract:Large Language Models (LLMs) like ChatGPT or Bard have revolutionized information retrieval and captivated the audience with their ability to generate custom responses in record time, regardless of the topic. In this article, we assess the capabilities of various LLMs in producing reliable, comprehensive, and sufficiently relevant responses about historical facts in French. To achieve this, we constructed a testbed comprising numerous history-related questions of varying types, themes, and levels of difficulty. Our evaluation of responses from ten selected LLMs reveals numerous shortcomings in both substance and form. Beyond an overall insufficient accuracy rate, we highlight uneven treatment of the French language, as well as issues related to verbosity and inconsistency in the responses provided by LLMs.

[IR-5] Hierarchical thematic classification of major conference proceedings

Link: https://arxiv.org/abs/2406.14983
Authors: Arsentii Kuzmin, Alexander Aduenko, Vadim Strijov
Keywords: decision support system, hierarchical similarity function, hierarchical, develop a decision, decision support
Categories: Machine Learning (cs.LG); Information Retrieval (cs.IR); Machine Learning (stat.ML)

Abstract: In this paper, we develop a decision support system for the hierarchical text classification. We consider text collections with a fixed hierarchical structure of topics given by experts in the form of a tree. The system sorts the topics by relevance to a given document. The experts choose one of the most relevant topics to finish the classification. We propose a weighted hierarchical similarity function to calculate topic relevance. The function calculates the similarity of a document and a tree branch. The weights in this function determine word importance. We use the entropy of words to estimate the weights. The proposed hierarchical similarity function formulates a joint hierarchical thematic classification probability model of the document topics, parameters, and hyperparameters. The variational Bayesian inference gives a closed-form EM algorithm. The EM algorithm estimates the parameters and calculates the probability of a topic for a given document. Compared to hierarchical multiclass SVM, hierarchical PLSA with adaptive regularization, and hierarchical naive Bayes, the weighted hierarchical similarity function has better improvement in ranking accuracy in an abstract collection of a major conference EURO and a website collection of industrial companies.

[IR-6] A Tale of Trust and Accuracy: Base vs. Instruct LLMs in RAG Systems

Link: https://arxiv.org/abs/2406.14972
Authors: Florin Cuconasu, Giovanni Trappolini, Nicola Tonellotto, Fabrizio Silvestri
Keywords: Retrieval Augmented Generation, Augmented Generation, artificial intelligence combining, Retrieval Augmented, large language models
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Abstract:Retrieval Augmented Generation (RAG) represents a significant advancement in artificial intelligence combining a retrieval phase with a generative phase, with the latter typically being powered by large language models (LLMs). The current common practices in RAG involve using “instructed” LLMs, which are fine-tuned with supervised training to enhance their ability to follow instructions and are aligned with human preferences using state-of-the-art techniques. Contrary to popular belief, our study demonstrates that base models outperform their instructed counterparts in RAG tasks by 20% on average under our experimental settings. This finding challenges the prevailing assumptions about the superiority of instructed LLMs in RAG applications. Further investigations reveal a more nuanced situation, questioning fundamental aspects of RAG and suggesting the need for broader discussions on the topic; or, as Fromm would have it, “Seldom is a glance at the statistics enough to understand the meaning of the figures”.

[IR-7] IDentity with Locality: An ideal hash for gene sequence search

Link: https://arxiv.org/abs/2406.14901
Authors: Aditya Desai, Gaurav Gupta, Tianyi Zhang, Anshumali Shrivastava
Keywords: Merged Bloom filters, Bloom Filters, IDL functions, Gene sequence search, gene search
Categories: Information Retrieval (cs.IR)
Comments: 13 pages

Abstract:Gene sequence search is a fundamental operation in computational genomics. Due to the petabyte scale of genome archives, most gene search systems now use hashing-based data structures such as Bloom Filters (BF). The state-of-the-art systems such as Compact bit-slicing signature index (COBS) and Repeated And Merged Bloom filters (RAMBO) use BF with Random Hash (RH) functions for gene representation and identification. The standard recipe is to cast the gene search problem as a sequence of membership problems testing if each subsequent gene substring (called kmer) of Q is present in the set of kmers of the entire gene database D. We observe that RH functions, which are crucial to the memory and the computational advantage of BF, are also detrimental to the system performance of gene-search systems. While subsequent kmers being queried are likely very similar, RH, oblivious to any similarity, uniformly distributes the kmers to different parts of potentially large BF, thus triggering excessive cache misses and causing system slowdown. We propose a novel hash function called the Identity with Locality (IDL) hash family, which co-locates the keys close in input space without causing collisions. This approach ensures both cache locality and key preservation. IDL functions can be a drop-in replacement for RH functions and help improve the performance of information retrieval systems. We give a simple but practical construction of IDL function families and show that replacing the RH with IDL functions reduces cache misses by a factor of 5x, thus improving query and indexing times of SOTA methods such as COBS and RAMBO by factors up to 2x without compromising their quality. We also provide a theoretical analysis of the false positive rate of BF with IDL functions. Our hash function is the first study that bridges Locality Sensitive Hash (LSH) and RH to obtain cache efficiency.
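
The "standard recipe" described above (testing every kmer of a query against a Bloom filter built from the database) can be sketched as follows. This is a minimal illustration under stated assumptions: the class `BloomFilter`, the helper names, the filter size, and the use of salted SHA-256 as a stand-in for random hash (RH) functions are all choices made here, and the sketch does not implement the IDL hash family proposed in the paper.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        # Salted SHA-256 stands in for independent random hash functions.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def __contains__(self, key):
        return all(self.bits[pos] for pos in self._positions(key))

def kmers(sequence, k):
    """All length-k substrings of sequence, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

def query_present(bf, query, k):
    """A query matches if every one of its kmers passes the membership test."""
    return all(kmer in bf for kmer in kmers(query, k))

database = "ACGTACGTTAGC"  # toy sequence, not real genomic data
k = 4
bf = BloomFilter()
for kmer in kmers(database, k):
    bf.add(kmer)

# "CGTACG" is a substring of the database, and Bloom filters have no
# false negatives, so this prints True.
print(query_present(bf, "CGTACG", k))
```

Consecutive kmers such as "CGTA" and "GTAC" overlap in all but one character, yet the salted hashes send them to unrelated positions in the bit array. That scattering is the cache-locality problem the abstract attributes to RH functions.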

[IR-8] Decoding Matters: Addressing Amplification Bias and Homogeneity Issue for LLM-based Recommendation

Link: https://arxiv.org/abs/2406.14900
Authors: Keqin Bao, Jizhi Zhang, Yang Zhang, Xinyue Huo, Chong Chen, Fuli Feng
Keywords: Adapting Large Language, Large Language Models, Adapting Large, Large Language, requires careful consideration
Categories: Information Retrieval (cs.IR)

Abstract:Adapting Large Language Models (LLMs) for recommendation requires careful consideration of the decoding process, given the inherent differences between generating items and natural language. Existing approaches often directly apply LLMs’ original decoding methods. However, we find these methods encounter significant challenges: 1) amplification bias – where standard length normalization inflates scores for items containing tokens with generation probabilities close to 1 (termed ghost tokens), and 2) homogeneity issue – generating multiple similar or repetitive items for a user. To tackle these challenges, we introduce a new decoding approach named Debiasing-Diversifying Decoding (D3). D3 disables length normalization for ghost tokens to alleviate amplification bias, and it incorporates a text-free assistant model to encourage tokens less frequently generated by LLMs for counteracting recommendation homogeneity. Extensive experiments on real-world datasets demonstrate the method’s effectiveness in enhancing accuracy and diversity.
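
The amplification bias described in this abstract can be illustrated with a little arithmetic. The sketch below assumes that "length normalization" means averaging token log-probabilities, which is one common choice and not necessarily the exact form used in the paper; the token probabilities are made up for illustration.

```python
import math


def sequence_log_prob(token_probs):
    """Sum of token log-probabilities (no length normalization)."""
    return sum(math.log(p) for p in token_probs)


def length_normalized_score(token_probs):
    """Average log-probability per token (assumed form of normalization)."""
    return sequence_log_prob(token_probs) / len(token_probs)


# Hypothetical items: item_b equals item_a plus two "ghost tokens"
# whose generation probability is close to 1.
item_a = [0.5, 0.5]
item_b = [0.5, 0.5, 0.999, 0.999]

raw_a = sequence_log_prob(item_a)
raw_b = sequence_log_prob(item_b)
norm_a = length_normalized_score(item_a)
norm_b = length_normalized_score(item_b)
```

Without normalization, item_a scores slightly higher than item_b. With normalization, the two ghost tokens add almost nothing to the log-probability sum but double the divisor, so item_b ends up with the clearly better score.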

[IR-9] Talking the Talk Does Not Entail Walking the Walk: On the Limits of Large Language Models in Lexical Entailment Recognition

Link: https://arxiv.org/abs/2406.14894
Authors: Candida M. Greco, Lucio La Cava, Andrea Tagarelli
Keywords: providing the structure, form the backbone, Large Language Models, lexical entailment, Verbs form
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Information Retrieval (cs.IR); Physics and Society (physics.soc-ph)

Abstract:Verbs form the backbone of language, providing the structure and meaning to sentences. Yet, their intricate semantic nuances pose a longstanding challenge. Understanding verb relations through the concept of lexical entailment is crucial for comprehending sentence meanings and grasping verb dynamics. This work investigates the capabilities of eight Large Language Models in recognizing lexical entailment relations among verbs through differently devised prompting strategies and zero-/few-shot settings over verb pairs from two lexical databases, namely WordNet and HyperLex. Our findings unveil that the models can tackle the lexical entailment recognition task with moderately good performance, although with varying degrees of effectiveness and under different conditions. Also, utilizing few-shot prompting can enhance the models’ performance. However, perfectly solving the task remains an unmet challenge for all examined LLMs, which calls for further research on this topic.

[IR-10] Generate-then-Ground in Retrieval-Augmented Generation for Multi-hop Question Answering

Link: https://arxiv.org/abs/2406.14891
Authors: Zhengliang Shi, Shuo Zhang, Weiwei Sun, Shen Gao, Pengjie Ren, Zhumin Chen, Zhaochun Ren
Keywords: Multi-Hop Question Answering, intensive knowledge required, large language models, Question Answering, tasks present
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: ACL 2024 (main conference)

Abstract:Multi-Hop Question Answering (MHQA) tasks present a significant challenge for large language models (LLMs) due to the intensive knowledge required. Current solutions, like Retrieval-Augmented Generation, typically retrieve potential documents from an external corpus to read an answer. However, the performance of this retrieve-then-read paradigm is constrained by the retriever and the inevitable noise in the retrieved documents. To mitigate these challenges, we introduce a novel generate-then-ground (GenGround) framework, synergizing the parametric knowledge of LLMs and external documents to solve a multi-hop question. GenGround empowers LLMs to alternate two phases until the final answer is derived: (1) formulate a simpler, single-hop question and directly generate the answer; (2) ground the question-answer pair in retrieved documents, amending any wrong predictions in the answer. We also propose an instructional grounding distillation method to generalize our method into smaller models. Extensive experiments conducted on four datasets illustrate the superiority of our method.

[IR-11] Leveraging Passage Embeddings for Efficient Listwise Reranking with Large Language Models

Link: https://arxiv.org/abs/2406.14848
Authors: Qi Liu, Bo Wang, Nan Wang, Jiaxin Mao
Keywords: large language language, language language models, Recent studies, language language, large language
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Abstract:Recent studies have demonstrated the effectiveness of using large language models (LLMs) in passage ranking. Listwise approaches, such as RankGPT, have become the new state of the art in this task. However, the efficiency of RankGPT models is limited by the maximum context length and relatively high latency of LLM inference. To address these issues, in this paper, we propose PE-Rank, leveraging the single passage embedding as a good context compression for efficient listwise passage reranking. By treating each passage as a special token, we can directly input passage embeddings into LLMs, thereby reducing input length. Additionally, we introduce an inference method that dynamically constrains the decoding space to these special tokens, accelerating the decoding process. For adapting the model to reranking, we employ a listwise learning-to-rank loss for training. Evaluation results on multiple benchmarks demonstrate that PE-Rank significantly improves efficiency in both prefilling and decoding, while maintaining competitive ranking effectiveness. The code is available at this https URL.

[IR-12] Evaluating RAG-Fusion with RAGElo: an Automated Elo-based Framework

Link: https://arxiv.org/abs/2406.14783
Authors: Zackary Rackauckas, Arthur Câmara, Jakub Zavrel
Keywords: systems include hallucination, gold standard benchmarks, company internal tasks, include hallucination problems, Retrieval-Augmented Generation
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: Accepted to LLM4Eval @ SIGIR24

Abstract:Challenges in the automated evaluation of Retrieval-Augmented Generation (RAG) Question-Answering (QA) systems include hallucination problems in domain-specific knowledge and the lack of gold standard benchmarks for company internal tasks. This results in difficulties in evaluating RAG variations, like RAG-Fusion (RAGF), in the context of a product QA task at Infineon Technologies. To solve these problems, we propose a comprehensive evaluation framework, which leverages Large Language Models (LLMs) to generate large datasets of synthetic queries based on real user queries and in-domain documents, uses LLM-as-a-judge to rate retrieved documents and answers, evaluates the quality of answers, and ranks different variants of Retrieval-Augmented Generation (RAG) agents with RAGElo’s automated Elo-based competition. LLM-as-a-judge rating of a random sample of synthetic queries shows a moderate, positive correlation with domain expert scoring in relevance, accuracy, completeness, and precision. While RAGF outperformed RAG in Elo score, a significance analysis against expert annotations also shows that RAGF significantly outperforms RAG in completeness, but underperforms in precision. In addition, Infineon’s RAGF assistant demonstrated slightly higher performance in document relevance based on MRR@5 scores. We find that RAGElo positively aligns with the preferences of human annotators, though due caution is still required. Finally, RAGF’s approach leads to more complete answers based on expert annotations and better answers overall based on RAGElo’s evaluation criteria.
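
The Elo-based competition mentioned in this abstract rests on the standard Elo rating update, sketched below. This is the textbook formula, not RAGElo's actual implementation; the function names, the K-factor of 32, the 400-point scale, and the starting ratings are assumptions made for this sketch.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one pairwise comparison.

    score_a is 1.0 if A's answer was preferred, 0.0 if B's was,
    and 0.5 for a tie.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Hypothetical example: both agents start at 1000 and the judge prefers agent A.
rating_a, rating_b = elo_update(1000.0, 1000.0, score_a=1.0)
```

Each update moves the two ratings by equal and opposite amounts, so repeated pairwise judgments (here, from an LLM judge) accumulate into a ranking of the competing agents.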

[IR-13] ChatGPT as Research Scientist: Probing GPT's Capabilities as a Research Librarian, Research Ethicist, Data Generator and Data Predictor

Link: https://arxiv.org/abs/2406.14765
Authors: Steven A. Lehr, Aylin Caliskan, Suneragiri Liyanage, Mahzarin R. Banaji
Keywords: Research Ethicist, Research Librarian, research, Study, Data Generator
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Main article is 14 pages, 1 table. Includes SI Appendix: 26 pages, 12 tables, 2 figures. Total: 40 pages, 13 tables, 2 figures. Under revised review at PNAS

Abstract:How good a research scientist is ChatGPT? We systematically probed the capabilities of GPT-3.5 and GPT-4 across four central components of the scientific process: as a Research Librarian, Research Ethicist, Data Generator, and Novel Data Predictor, using psychological science as a testing field. In Study 1 (Research Librarian), unlike human researchers, GPT-3.5 and GPT-4 hallucinated, authoritatively generating fictional references 36.0% and 5.4% of the time, respectively, although GPT-4 exhibited an evolving capacity to acknowledge its fictions. In Study 2 (Research Ethicist), GPT-4 (though not GPT-3.5) proved capable of detecting violations like p-hacking in fictional research protocols, correcting 88.6% of blatantly presented issues, and 72.6% of subtly presented issues. In Study 3 (Data Generator), both models consistently replicated patterns of cultural bias previously discovered in large language corpora, indicating that ChatGPT can simulate known results, an antecedent to usefulness for both data generation and skills like hypothesis generation. Contrastingly, in Study 4 (Novel Data Predictor), neither model was successful at predicting new results absent in their training data, and neither appeared to leverage substantially new information when predicting more versus less novel outcomes. Together, these results suggest that GPT is a flawed but rapidly improving librarian, a decent research ethicist already, capable of data generation in simple domains with known characteristics but poor at predicting novel patterns of empirical data to aid future experimentation.

[IR-14] RE-AdaptIR: Improving Information Retrieval through Reverse Engineered Adaptation

Link: https://arxiv.org/abs/2406.14764
Authors: William Fleshman, Benjamin Van Durme
Keywords: Large language models, Large language, fine-tuned for text-retrieval, text-retrieval have demonstrated, information retrieval
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:Large language models (LLMs) fine-tuned for text-retrieval have demonstrated state-of-the-art results across several information retrieval (IR) benchmarks. However, supervised training for improving these models requires numerous labeled examples, which are generally unavailable or expensive to acquire. In this work, we explore the effectiveness of extending reverse engineered adaptation to the context of information retrieval (RE-AdaptIR). We use RE-AdaptIR to improve LLM-based IR models using only unlabeled data. We demonstrate improved performance both in training domains as well as zero-shot in domains where the models have seen no queries. We analyze performance changes in various fine-tuning scenarios and offer findings of immediate use to practitioners.

[IR-15] TTQA-RS: A break-down prompting approach for Multi-hop Table-Text Question Answering with Reasoning and Summarization

Link: https://arxiv.org/abs/2406.14732
Authors: Jayetri Bardhan, Bushi Xiao, Daisy Zhe Wang
Keywords: Question answering, Table-Text Question Answering, Multi-hop Table-Text Question, gained much popularity, table-text
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)

Abstract:Question answering (QA) over tables and text has gained much popularity over the years. Multi-hop table-text QA requires multiple hops between the table and text, making it a challenging QA task. Although several works have attempted to solve the table-text QA task, most involve training the models and requiring labeled data. In this paper, we have proposed a model - TTQA-RS: A break-down prompting approach for Multi-hop Table-Text Question Answering with Reasoning and Summarization. Our model uses augmented knowledge including table-text summary with decomposed sub-question with answer for a reasoning-based table-text QA. Using open-source language models our model outperformed all existing prompting methods for table-text QA tasks on existing table-text QA datasets like HybridQA and OTT-QA’s development set. Our results are comparable with the training-based state-of-the-art models, demonstrating the potential of prompt-based approaches using open-source LLMs. Additionally, by using GPT-4 with LLaMA3-70B, our model achieved state-of-the-art performance for prompting-based methods on multi-hop table-text QA.

[IR-16] Bioptic – A Target-Agnostic Efficacy-Based Small Molecules Search Engine

Link: https://arxiv.org/abs/2406.14572
Authors: Vlad Vinogradov, Ivan Izmailov, Simon Steshin, Kong T. Nguyen
Keywords: Recent successes, extensive chemical libraries, successes in virtual, virtual screening, extensive chemical
Categories: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Abstract:Recent successes in virtual screening have been made possible by large models and extensive chemical libraries. However, combining these elements is challenging: the larger the model, the more expensive it is to run, making ultra-large libraries infeasible. To address this, we developed a target-agnostic, efficacy-based molecule search model, which allows us to find structurally dissimilar molecules with similar biological activities. We followed best practices to design a fast retrieval system, based on processor-optimized SIMD instructions, enabling us to screen the ultra-large 40B Enamine REAL library with a 100% recall rate. We extensively benchmarked our model and several state-of-the-art models for both speed performance and retrieval quality of novel molecules.

Artificial Intelligence

[AI-0] NAVSIM: Data-Driven Non-Reactive Autonomous Vehicle Simulation and Benchmarking

Link: https://arxiv.org/abs/2406.15349
Authors: Daniel Dauner, Marcel Hallgarten, Tianyu Li, Xinshuo Weng, Zhiyu Huang, Zetong Yang, Hongyang Li, Igor Gilitschenski, Boris Ivanovic, Marco Pavone, Andreas Geiger, Kashyap Chitta
Keywords: vision-based driving policies, Benchmarking vision-based driving, Benchmarking vision-based, vision-based driving, driving policies
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)

Abstract:Benchmarking vision-based driving policies is challenging. On one hand, open-loop evaluation with real data is easy, but these results do not reflect closed-loop performance. On the other, closed-loop evaluation is possible in simulation, but is hard to scale due to its significant computational demands. Further, the simulators available today exhibit a large domain gap to real data. This has resulted in an inability to draw clear conclusions from the rapidly growing body of research on end-to-end autonomous driving. In this paper, we present NAVSIM, a middle ground between these evaluation paradigms, where we use large datasets in combination with a non-reactive simulator to enable large-scale real-world benchmarking. Specifically, we gather simulation-based metrics, such as progress and time to collision, by unrolling bird’s eye view abstractions of the test scenes for a short simulation horizon. Our simulation is non-reactive, i.e., the evaluated policy and environment do not influence each other. As we demonstrate empirically, this decoupling allows open-loop metric computation while being better aligned with closed-loop evaluations than traditional displacement errors. NAVSIM enabled a new competition held at CVPR 2024, where 143 teams submitted 463 entries, resulting in several new insights. On a large set of challenging scenarios, we observe that simple methods with moderate compute requirements such as TransFuser can match recent large-scale end-to-end driving architectures such as UniAD. Our modular framework can potentially be extended with new datasets, data curation strategies, and metrics, and will be continually maintained to host future challenges. Our code is available at this https URL.

[AI-1] Privacy Preserved Blood Glucose Level Cross-Prediction: An Asynchronous Decentralized Federated Learning Approach

Link: https://arxiv.org/abs/2406.15346
Authors: Chengzhe Piao, Taiyu Zhu, Yu Wang, Stephanie E Baldeweg, Paul Taylor, Pantelis Georgiou, Jiahao Sun, Jun Wang, Kezhi Li
Keywords: Newly diagnosed Type, Continuous Glucose Monitoring, effective Blood Glucose, obtain effective Blood, Glucose Monitoring
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Newly diagnosed Type 1 Diabetes (T1D) patients often struggle to obtain effective Blood Glucose (BG) prediction models due to the lack of sufficient BG data from Continuous Glucose Monitoring (CGM), presenting a significant “cold start” problem in patient care. Utilizing population models to address this challenge is a potential solution, but collecting patient data for training population models in a privacy-conscious manner is challenging, especially given that such data is often stored on personal devices. Considering the privacy protection and addressing the “cold start” problem in diabetes care, we propose “GluADFL”, blood Glucose prediction by Asynchronous Decentralized Federated Learning. We compared GluADFL with eight baseline methods using four distinct T1D datasets, comprising 298 participants, which demonstrated its superior performance in accurately predicting BG levels for cross-patient analysis. Furthermore, patients’ data might be stored and shared across various communication networks in GluADFL, ranging from highly interconnected (e.g., random, performs the best among others) to more structured topologies (e.g., cluster and ring), suitable for various social networks. The asynchronous training framework supports flexible participation. By adjusting the ratios of inactive participants, we found it remains stable if less than 70% are inactive. Our results confirm that GluADFL offers a practical, privacy-preserving solution for BG prediction in T1D, significantly enhancing the quality of diabetes management.

[AI-2] GenoTEX: A Benchmark for Evaluating LLM-Based Exploration of Gene Expression Data in Alignment with Bioinformaticians

Link: https://arxiv.org/abs/2406.15341
Authors: Haoyang Liu, Haohan Wang
Keywords: Recent advancements, advancements in machine, machine learning, learning have significantly, significantly improved
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Genomics (q-bio.GN)
Comments: 25 pages, 3 figures

Abstract:Recent advancements in machine learning have significantly improved the identification of disease-associated genes from gene expression datasets. However, these processes often require extensive expertise and manual effort, limiting their scalability. Large Language Model (LLM)-based agents have shown promise in automating these tasks due to their increasing problem-solving abilities. To support the evaluation and development of such methods, we introduce GenoTEX, a benchmark dataset for the automatic exploration of gene expression data, involving the tasks of dataset selection, preprocessing, and statistical analysis. GenoTEX provides annotated code and results for solving a wide range of gene identification problems, in a full analysis pipeline that follows the standard of computational genomics. These annotations are curated by human bioinformaticians who carefully analyze the datasets to ensure accuracy and reliability. To provide baselines for these tasks, we present GenoAgents, a team of LLM-based agents designed with context-aware planning, iterative correction, and domain expert consultation to collaboratively explore gene datasets. Our experiments with GenoAgents demonstrate the potential of LLM-based approaches in genomics data analysis, while error analysis highlights the challenges and areas for future improvement. We propose GenoTEX as a promising resource for benchmarking and enhancing AI-driven methods for genomics data analysis. We make our benchmark publicly available at this https URL.

[AI-3] Image Conductor: Precision Control for Interactive Video Synthesis

Link: https://arxiv.org/abs/2406.15339
Authors: Yaowei Li, Xintao Wang, Zhaoyang Zhang, Zhouxia Wang, Ziyang Yuan, Liangbin Xie, Yuexian Zou, Ying Shan
Keywords: typically involving labor-intensive, labor-intensive real-world capturing, involving labor-intensive real-world, Filmmaking and animation, require sophisticated techniques
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments: Project webpage available at this https URL

Abstract:Filmmaking and animation production often require sophisticated techniques for coordinating camera transitions and object movements, typically involving labor-intensive real-world capturing. Despite advancements in generative AI for video creation, achieving precise control over motion for interactive video asset generation remains challenging. To this end, we propose Image Conductor, a method for precise control of camera transitions and object movements to generate video assets from a single image. A well-cultivated training strategy is proposed to separate distinct camera and object motion using camera LoRA weights and object LoRA weights. To further address cinematographic variations from ill-posed trajectories, we introduce a camera-free guidance technique during inference, enhancing object movements while eliminating camera transitions. Additionally, we develop a trajectory-oriented video motion data curation pipeline for training. Quantitative and qualitative experiments demonstrate our method’s precision and fine-grained control in generating motion-controllable videos from images, advancing the practical application of interactive video synthesis. Project webpage available at this https URL.

[AI-4] Multimodal Task Vectors Enable Many-Shot Multimodal In-Context Learning

Link: https://arxiv.org/abs/2406.15334
Authors: Brandon Huang, Chancharik Mitra, Assaf Arbelle, Leonid Karlinsky, Trevor Darrell, Roei Herzig
Keywords: interleaved Large Multimodal, Large Multimodal Models, interleaved Large, Large Multimodal, multimodal ICL setting
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:The recent success of interleaved Large Multimodal Models (LMMs) in few-shot learning suggests that in-context learning (ICL) with many examples can be promising for learning new tasks. However, this many-shot multimodal ICL setting has one crucial problem: it is fundamentally limited by the model’s context length set at pretraining. The problem is especially prominent in the multimodal domain, which processes both text and images, requiring additional tokens. This motivates the need for a multimodal method to compress many shots into fewer tokens without finetuning. In this work, we enable LMMs to perform multimodal, many-shot in-context learning by leveraging Multimodal Task Vectors (MTV)–compact implicit representations of in-context examples compressed in the model’s attention heads. Specifically, we first demonstrate the existence of such MTV in LMMs and then leverage these extracted MTV to enable many-shot in-context learning for various vision-and-language tasks. Our experiments suggest that MTV can scale in performance with the number of compressed shots and generalize to similar out-of-domain tasks without additional context length for inference.

[AI-5] Gradient-Mask Tuning Elevates the Upper Limits of LLM Performance

Link: https://arxiv.org/abs/2406.15330
Authors: Haoling Li, Xin Zhang, Xiao Liu, Yeyun Gong, Yifan Wang, Yujiu Yang, Qi Chen, Peng Cheng
Keywords: Large language models, Large language, language models, revolutionized lots, lots of fields
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:Large language models (LLMs) have revolutionized lots of fields of research. Although it is well-known that fine-tuning is essential for enhancing the capabilities of LLMs, existing research suggests that there is potential redundancy in the fine-tuning process and therefore proposes to update only a subset of parameters. However, these methods fail to leverage the task-specific information to identify important parameters during training. Based on the insight that gradients inherently contain information on task-specific data, we propose Gradient-Mask Tuning (GMT), a method that selectively updates parameters during training based on their gradient information. Specifically, we compute the absolute values of the gradients and apply masking to those with relatively smaller magnitudes. Our empirical results across various tasks demonstrate that GMT not only outperforms traditional fine-tuning methods but also elevates the upper limits of LLM performance. Further analysis indicates that GMT exhibits insensitivity to mask ratio and possesses computational efficiency comparable to vanilla SFT.
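
The masking step described in this abstract (take absolute gradient values and mask the relatively small ones) can be sketched as follows. This is a minimal pure-Python illustration, not the authors' implementation; the function names, the flat list of gradients, the single global ranking, and the plain SGD step are all assumptions made for this sketch.

```python
def gradient_mask(grads, mask_ratio):
    """Zero out the mask_ratio fraction of gradients with the smallest |g|."""
    num_masked = int(len(grads) * mask_ratio)
    # Indices sorted by ascending absolute gradient value.
    order = sorted(range(len(grads)), key=lambda i: abs(grads[i]))
    masked = set(order[:num_masked])
    return [0.0 if i in masked else g for i, g in enumerate(grads)]


def sgd_step(params, grads, lr, mask_ratio):
    """Apply one SGD update using only the unmasked gradients."""
    kept = gradient_mask(grads, mask_ratio)
    return [p - lr * g for p, g in zip(params, kept)]


# Hypothetical gradients for four parameters.
grads = [0.1, -2.0, 0.05, 1.5]
masked_grads = gradient_mask(grads, mask_ratio=0.5)
new_params = sgd_step([1.0, 1.0, 1.0, 1.0], grads, lr=0.1, mask_ratio=0.5)
```

With a mask ratio of 0.5, the two smallest-magnitude gradients are zeroed, so only the second and fourth parameters move in this step.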

[AI-6] An End-to-End Segmentation-Free Arabic Handwritten Recognition Model on KHATT

Link: https://arxiv.org/abs/2406.15329
Authors: Sondos Aabed, Ahmad Khairaldin
Keywords: Connectionist Temporal Classification, Long-Short Term Memory, Bidirectional Long-Short Term, alongside Bidirectional Long-Short, deep learning model
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:An end-to-end, segmentation-free, deep learning model trained from scratch is proposed, leveraging a DCNN for feature extraction, alongside Bidirectional Long-Short Term Memory (BLSTM) for sequence recognition and a Connectionist Temporal Classification (CTC) loss function, on the KHATT database. The training phase yields an 84% recognition rate on the test dataset at the character level and 71% at the word level, establishing an image-based sequence recognition framework that operates at the line level without further segmentation. The analysis and preprocessing of the KFUPM Handwritten Arabic TexT (KHATT) database are also presented. Finally, advanced image processing techniques, including filtering, transformation, and line segmentation, are implemented. The importance of this work is highlighted by its wide-ranging applications, including digitizing, documentation, archiving, and text translation in fields such as banking. Moreover, Arabic Handwritten Recognition (AHR) serves as a pivotal tool for making images searchable, enhancing information retrieval capabilities, and enabling effortless editing. This functionality significantly reduces the time and effort required for tasks such as Arabic data organization and manipulation.

[AI-7] Bug In the Code Stack: Can LLMs Find Bugs in Large Python Code Stacks

Link: https://arxiv.org/abs/2406.15325
Authors: Hokyung Lee, Sumanyu Sharma, Bing Hu
Keywords: retrieving contextual information, Large Language Models, Recent research, large text documents, Language Models
Categories: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: 8 pages

Abstract:Recent research in Needle-in-a-Haystack (NIAH) benchmarks has explored the capabilities of Large Language Models (LLMs) in retrieving contextual information from large text documents. However, as LLMs become increasingly integrated into software development processes, it is crucial to evaluate their performance in code-based environments. As LLMs are further developed for program synthesis, we need to ensure that LLMs can understand syntax and write syntactically correct code. As a step in ensuring LLMs understand syntax, LLMs can be evaluated in their ability to find and detect syntax bugs. Our benchmark, Bug In The Code Stack (BICS), is designed to assess the ability of LLMs to identify simple syntax bugs within large source code. Our findings reveal three key insights: (1) code-based environments pose significantly more challenge compared to text-based environments for retrieval tasks, (2) there is a substantial performance disparity among different models, and (3) there is a notable correlation between longer context lengths and performance degradation, though the extent of this degradation varies between models.

[AI-8] LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs

Link: https://arxiv.org/abs/2406.15319
Authors: Ziyan Jiang, Xueguang Ma, Wenhu Chen
Keywords: traditional RAG framework, basic retrieval units, traditional RAG, units, DPR normally work
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Technical Report

Abstract:In the traditional RAG framework, the basic retrieval units are normally short. Common retrievers like DPR normally work with 100-word Wikipedia paragraphs. Such a design forces the retriever to search over a large corpus to find the 'needle' unit. In contrast, the readers only need to extract answers from the short retrieved units. Such an imbalanced 'heavy' retriever and 'light' reader design can lead to sub-optimal performance. In order to alleviate the imbalance, we propose a new framework LongRAG, consisting of a 'long retriever' and a 'long reader'. LongRAG processes the entire Wikipedia into 4K-token units, which is 30x longer than before. By increasing the unit size, we significantly reduce the total units from 22M to 700K. This significantly lowers the burden on the retriever, which leads to a remarkable retrieval score: answer recall@1=71% on NQ (previously 52%) and answer recall@2=72% (previously 47%) on HotpotQA (full-wiki). Then we feed the top-k retrieved units (≈ 30K tokens) to an existing long-context LLM to perform zero-shot answer extraction. Without requiring any training, LongRAG achieves an EM of 62.7% on NQ, which is the best known result. LongRAG also achieves 64.3% on HotpotQA (full-wiki), which is on par with the SoTA model. Our study offers insights into the future roadmap for combining RAG with long-context LLMs.

[AI-9] PID: Prompt-Independent Data Protection Against Latent Diffusion Models

Link: https://arxiv.org/abs/2406.15305
Authors: Ang Li, Yichuan Mo, Mingjie Li, Yisen Wang
Keywords: Latent Diffusion Models, Diffusion Models, Latent Diffusion, grasp new concepts, limited number
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 27 pages, ICML 2024 poster

Abstract:The few-shot fine-tuning of Latent Diffusion Models (LDMs) has enabled them to grasp new concepts from a limited number of images. However, given the vast amount of personal images accessible online, this capability raises critical concerns about civil privacy. While several previous defense methods have been developed to prevent such misuse of LDMs, they typically assume that the textual prompts used by data protectors exactly match those employed by data exploiters. In this paper, we first empirically demonstrate that breaking this assumption, i.e., in cases where discrepancies exist between the textual conditions used by protectors and exploiters, could substantially reduce the effectiveness of these defenses. Furthermore, considering the visual encoder’s independence from textual prompts, we delve into the visual encoder and thoroughly investigate how manipulating the visual encoder affects the few-shot fine-tuning process of LDMs. Drawing on these insights, we propose a simple yet effective method called Prompt-Independent Defense (PID) to safeguard privacy against LDMs. We show that PID can act as a strong privacy shield on its own while requiring significantly less computational power. We believe our studies, along with the comprehensive understanding and new defense method, provide a notable advance toward reliable data protection against LDMs.

[AI-10] Grants4Companies: Applying Declarative Methods for Recommending and Reasoning About Business Grants in the Austrian Public Administration (System Description)

Link: https://arxiv.org/abs/2406.15293
Authors: Björn Lellmann, Philipp Marek, Markus Triska
Keywords: methods and technologies, technologies underlying, underlying the application, business grants suitable, Business Service Portal
Categories: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)

Abstract:We describe the methods and technologies underlying the application Grants4Companies. The application uses a logic-based expert system to display a list of business grants suitable for the logged-in business. To evaluate suitability of the grants, formal representations of their conditions are evaluated against properties of the business, taken from the registers of the Austrian public administration. The logical language for the representations of the grant conditions is based on S-expressions. We further describe a Proof of Concept implementation of reasoning over the formalised grant conditions. The proof of concept is implemented in Common Lisp and interfaces with a reasoning engine implemented in Scryer Prolog. The application has recently gone live and is provided as part of the Business Service Portal by the Austrian Federal Ministry of Finance.
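
To make the idea of evaluating S-expression grant conditions against business properties concrete, here is a small sketch. It is written in Python purely for illustration (the abstract says the actual proof of concept uses Common Lisp and Scryer Prolog), and the operator set, the property names, and the sample condition are hypothetical, not the application's real condition language.

```python
def tokenize(text):
    """Split an S-expression string into parentheses and atoms."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens):
    """Parse a token list into nested Python lists; integers become int."""
    token = tokens.pop(0)
    if token == "(":
        items = []
        while tokens[0] != ")":
            items.append(parse(tokens))
        tokens.pop(0)  # drop the closing ")"
        return items
    try:
        return int(token)
    except ValueError:
        return token  # a symbol such as a property name or a constant


def holds(expr, business):
    """Evaluate a condition against a dict of business properties."""
    op, *args = expr
    if op == "and":
        return all(holds(arg, business) for arg in args)
    if op == "or":
        return any(holds(arg, business) for arg in args)
    if op == "not":
        return not holds(args[0], business)
    # Comparisons: first argument names a property, second is a literal.
    prop, literal = args
    value = business[prop]
    if op == "=":
        return value == literal
    if op == "<":
        return value < literal
    if op == ">":
        return value > literal
    raise ValueError(f"unknown operator: {op}")


# Hypothetical grant condition and business record.
condition = parse(tokenize("(and (= legal-form gmbh) (< employees 250))"))
business = {"legal-form": "gmbh", "employees": 40}
is_suitable = holds(condition, business)
```

A grant would be listed for the logged-in business when its condition evaluates to true; a real system also has to handle unknown or missing register data, which this sketch ignores.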

[AI-11] Cross-Modality Safety Alignment

Link: https://arxiv.org/abs/2406.15279
Authors: Siyin Wang, Xingsong Ye, Qinyuan Cheng, Junwen Duan, Shimin Li, Jinlan Fu, Xipeng Qiu, Xuanjing Huang
Keywords: Artificial General Intelligence, General Intelligence, Artificial General, human life, systems is paramount
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:As Artificial General Intelligence (AGI) becomes increasingly integrated into various facets of human life, ensuring the safety and ethical alignment of such systems is paramount. Previous studies primarily focus on single-modality threats, which may not suffice given the integrated and complex nature of cross-modality interactions. We introduce a novel safety alignment challenge called Safe Inputs but Unsafe Output (SIUO) to evaluate cross-modality safety alignment. Specifically, it considers cases where single modalities are safe independently but could potentially lead to unsafe or unethical outputs when combined. To empirically investigate this problem, we developed the SIUO, a cross-modality benchmark encompassing 9 critical safety domains, such as self-harm, illegal activities, and privacy violations. Our findings reveal substantial safety vulnerabilities in both closed- and open-source LVLMs, such as GPT-4V and LLaVA, underscoring the inadequacy of current models to reliably interpret and respond to complex, real-world scenarios.

[AI-12] Towards Robust Training Datasets for Machine Learning with Ontologies: A Case Study for Emergency Road Vehicle Detection

Link: https://arxiv.org/abs/2406.15268
Authors: Lynn Vonderhaar, Timothy Elvira, Tyler Procko, Omar Ochoa
Keywords: Machine Learning, Countless domains rely, rely on Machine, Countless domains, including safety-critical domains
Categories: Artificial Intelligence (cs.AI)

Abstract:Countless domains rely on Machine Learning (ML) models, including safety-critical domains, such as autonomous driving, which this paper focuses on. While the black box nature of ML is simply a nuisance in some domains, in safety-critical domains, this makes ML models difficult to trust. To fully utilize ML models in safety-critical domains, it would be beneficial to have a method to improve trust in model robustness and accuracy without human experts checking each decision. This research proposes a method to increase trust in ML models used in safety-critical domains by ensuring the robustness and completeness of the model’s training dataset. Because ML models embody what they are trained with, ensuring the completeness of training datasets can help to increase the trust in the training of ML models. To this end, this paper proposes the use of a domain ontology and an image quality characteristic ontology to validate the domain completeness and image quality robustness of a training dataset. This research also presents an experiment as a proof of concept for this method, where ontologies are built for the emergency road vehicle domain.

[AI-13] V-RECS, a Low-Cost LLM4VIS Recommender with Explanations, Captioning and Suggestions

Link: https://arxiv.org/abs/2406.15259
Authors: Luca Podo, Marco Angelini, Paola Velardi
Keywords: interpreting natural language, natural language queries, involves interpreting natural, recent research area, natural language
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)

Abstract:NL2VIS (natural language to visualization) is a promising and recent research area that involves interpreting natural language queries and translating them into visualizations that accurately represent the underlying data. As we navigate the era of big data, NL2VIS holds considerable application potential since it greatly facilitates data exploration by non-expert users. Following the increasingly widespread usage of generative AI in NL2VIS applications, in this paper we present V-RECS, the first LLM-based Visual Recommender augmented with explanations (E), captioning (C), and suggestions (S) for further data exploration. V-RECS' visualization narratives facilitate both response verification and data exploration by non-expert users. Furthermore, our proposed solution mitigates computational, controllability, and cost issues associated with using powerful LLMs by leveraging a methodology to effectively fine-tune small models. To generate insightful visualization narratives, we use Chain-of-Thoughts (CoT), a prompt engineering technique to help an LLM identify and generate the logical steps to produce a correct answer. Since CoT is reported to perform poorly with small LLMs, we adopted a strategy in which a large LLM (GPT-4), acting as a Teacher, generates CoT-based instructions to fine-tune a small model, Llama-2-7B, which plays the role of a Student. Extensive experiments (based on a framework for the quantitative evaluation of AI-based visualizations and on manual assessment by a group of participants) show that V-RECS achieves performance scores comparable to GPT-4, at a much lower cost. The efficacy of the V-RECS teacher-student paradigm is also demonstrated by the fact that the un-tuned Llama fails to perform the task in the vast majority of test cases. We release V-RECS for the visualization community to assist visualization designers throughout the entire visualization generation process.

[AI-14] VideoScore: Building Automatic Metrics to Simulate Fine-grained Human Feedback for Video Generation

Link: https://arxiv.org/abs/2406.15252
Authors: Xuan He, Dongfu Jiang, Ge Zhang, Max Ku, Achint Soni, Sherman Siu, Haonan Chen, Abhranil Chandra, Ziyan Jiang, Aaran Arulraj, Kai Wang, Quy Duc Do, Yuansheng Ni, Bohan Lyu, Yaswanth Narsupalli, Rongqi Fan, Zhiheng Lyu, Yuchen Lin, Wenhu Chen
Keywords: witnessed great advances, recent years, years have witnessed, video, human
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Recent years have witnessed great advances in video generation. However, the development of automatic video metrics is lagging significantly behind. None of the existing metrics is able to provide reliable scores over generated videos. The main barrier is the lack of a large-scale human-annotated dataset. In this paper, we release VideoFeedback, the first large-scale dataset containing human-provided multi-aspect scores over 37.6K synthesized videos from 11 existing video generative models. We train VideoScore (initialized from Mantis) based on VideoFeedback to enable automatic video quality assessment. Experiments show that the Spearman correlation between VideoScore and humans can reach 77.1 on VideoFeedback-test, beating the prior best metrics by about 50 points. Further results on the held-out EvalCrafter, GenAI-Bench, and VBench benchmarks show that VideoScore has consistently much higher correlation with human judges than other metrics. Due to these results, we believe VideoScore can serve as a great proxy for human raters to (1) rate different video models to track progress and (2) simulate fine-grained human feedback in Reinforcement Learning with Human Feedback (RLHF) to improve current video generation models.
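
The headline number in this abstract is a Spearman correlation between metric scores and human scores. As a reminder of what that measures, the sketch below computes Spearman's coefficient in plain Python as the Pearson correlation of average ranks. The function names and the sample scores are hypothetical; the value 77.1 in the abstract is presumably the coefficient scaled by 100, which is an assumption on my part.

```python
def average_ranks(values):
    """Assign 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # average position of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    std_x = sum((a - mean_x) ** 2 for a in rx) ** 0.5
    std_y = sum((b - mean_y) ** 2 for b in ry) ** 0.5
    return cov / (std_x * std_y)


# Hypothetical scores for five videos: automatic metric vs. human rating.
metric_scores = [0.2, 0.5, 0.4, 0.9, 0.7]
human_scores = [1, 3, 2, 4, 4]
rho = spearman(metric_scores, human_scores)
```

Because only the rank order matters, a metric can correlate well with humans even if its raw scale differs from the human rating scale.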

[AI-15] Machine Learning Techniques in Automatic Music Transcription: A Systematic Survey

Link: https://arxiv.org/abs/2406.15249
Authors: Fatemeh Jamshidi, Gary Pike, Amit Das, Richard Chapman
Keywords: Music Information Retrieval, Information Retrieval, Music Information, music signal analysis, Automatic Music Transcription
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)

Abstract:In the domain of Music Information Retrieval (MIR), Automatic Music Transcription (AMT) emerges as a central challenge, aiming to convert audio signals into symbolic notations like musical notes or sheet music. This systematic review accentuates the pivotal role of AMT in music signal analysis, emphasizing its importance due to the intricate and overlapping spectral structure of musical harmonies. Through a thorough examination of existing machine learning techniques utilized in AMT, we explore the progress and constraints of current models and methodologies. Despite notable advancements, AMT systems have yet to match the accuracy of human experts, largely due to the complexities of musical harmonies and the need for nuanced interpretation. This review critically evaluates both fully automatic and semi-automatic AMT systems, emphasizing the importance of minimal user intervention and examining various methodologies proposed to date. By addressing the limitations of prior techniques and suggesting avenues for improvement, our objective is to steer future research towards fully automated AMT systems capable of accurately and efficiently translating intricate audio signals into precise symbolic representations. This study not only synthesizes the latest advancements but also lays out a road-map for overcoming existing challenges in AMT, providing valuable insights for researchers aiming to narrow the gap between current systems and human-level transcription accuracy.

[AI-16] Detecting Synthetic Lyrics with Few-Shot Inference

Link: https://arxiv.org/abs/2406.15231
Authors: Yanis Labrak, Gabriel Meseguer-Brocal, Elena V. Epure
Keywords: gained significant popularity, produce human-like lyrics, large language models, recent years, significant popularity
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Under review

Abstract:In recent years, generated content in music has gained significant popularity, with large language models being effectively utilized to produce human-like lyrics in various styles, themes, and linguistic structures. This technological advancement supports artists in their creative processes but also raises issues of authorship infringement, consumer satisfaction and content spamming. To address these challenges, methods for detecting generated lyrics are necessary. However, existing work on machine-generated content detection methods and datasets has not yet focused on this specific modality or on creative text in general. In response, we have curated the first dataset of high-quality synthetic lyrics and conducted a comprehensive quantitative evaluation of various few-shot content detection approaches, testing their generalization capabilities and complementing this with a human evaluation. Our best few-shot detector, based on LLM2Vec, surpasses stylistic and statistical methods, which have been shown to be competitive in other domains at distinguishing human-written from machine-generated content. It also shows good generalization capabilities to new artists and models, and effectively detects post-generation paraphrasing. This study emphasizes the need for further research on creative content detection, particularly in terms of generalization and scalability with larger song catalogs. All datasets, pre-processing scripts, and code are available publicly on GitHub and Hugging Face under the Apache 2.0 license.

[AI-17] Deep UAV Path Planning with Assured Connectivity in Dense Urban Setting

Link: https://arxiv.org/abs/2406.15225
Authors: Jiyong Oh, Syed M. Raza, Lusungu J. Mwasinga, Moonseong Kim, Hyunseung Choo
Keywords: Unmanned Aerial Vehicle, Unmanned Aerial, Aerial Vehicle, UAV, numerous applications
Categories: Artificial Intelligence (cs.AI); Robotics (cs.RO); Signal Processing (eess.SP)
Comments: 5 pages, 4 figures, Published in the 2024 IEEE Network Operations and Management Symposium (NOMS 2024)

Abstract:Unmanned Aerial Vehicle (UAV) services with 5G connectivity are an emerging field with numerous applications. Operator-controlled UAV flights and manual static flight configurations are major limitations for the wide adoption and scalability of UAV services. Several services depend on excellent UAV connectivity with a cellular network, and maintaining it is challenging in predetermined flight paths. This paper addresses these limitations by proposing a Deep Reinforcement Learning (DRL) framework for UAV path planning with assured connectivity (DUPAC). During UAV flight, DUPAC determines the best route from a defined source to the destination in terms of distance and signal quality. The viability and performance of DUPAC are evaluated under simulated real-world urban scenarios using the Unity framework. The results confirm that DUPAC achieves an autonomous UAV flight path similar to the base method, with only a 2% increment, while maintaining an average 9% better connection quality throughout the flight.

[AI-18] Injecting Bias in Text-To-Image Models via Composite-Trigger Backdoors

Link: https://arxiv.org/abs/2406.15213
Authors: Ali Naseh, Jaechul Roh, Eugene Bagdasaryan, Amir Houmansadr
Keywords: Stable Diffusion, text-conditional image generative, large text-conditional image, Recent advances, image generative models
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)

Abstract:Recent advances in large text-conditional image generative models such as Stable Diffusion, Midjourney, and DALL-E 3 have revolutionized the field of image generation, allowing users to produce high-quality, realistic images from textual prompts. While these developments have enhanced artistic creation and visual communication, they also present an underexplored attack opportunity: the possibility of inducing biases by an adversary into the generated images for malicious intentions, e.g., to influence society and spread propaganda. In this paper, we demonstrate the possibility of such a bias injection threat by an adversary who backdoors such models with a small number of malicious data samples; the implemented backdoor is activated when special triggers exist in the input prompt of the backdoored models. On the other hand, the model’s utility is preserved in the absence of the triggers, making the attack highly undetectable. We present a novel framework that enables efficient generation of poisoning samples with composite (multi-word) triggers for such an attack. Our extensive experiments using over 1 million generated images and against hundreds of fine-tuned models demonstrate the feasibility of the presented backdoor attack. We illustrate how these biases can bypass conventional detection mechanisms, highlighting the challenges in proving the existence of biases within operational constraints. Our cost analysis confirms the low financial barrier to executing such attacks, underscoring the need for robust defensive strategies against such vulnerabilities in text-to-image generation models.

[AI-19] How Effective is GPT-4 Turbo in Generating School-Level Questions from Textbooks Based on Bloom's Revised Taxonomy?

Link: https://arxiv.org/abs/2406.15211
Authors: Subhankar Maity, Aniket Deroy, Sudeshna Sarkar
Keywords: Bloom Revised Taxonomy, NCERT textbooks, Bloom Revised, Revised Taxonomy, zero-shot mode
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted at Learnersourcing: Student-Generated Content @ Scale 2024

Abstract:We evaluate the effectiveness of GPT-4 Turbo in generating educational questions from NCERT textbooks in zero-shot mode. Our study highlights GPT-4 Turbo’s ability to generate questions that require higher-order thinking skills, especially at the “understanding” level according to Bloom’s Revised Taxonomy. While we find a notable consistency between questions generated by GPT-4 Turbo and those assessed by humans in terms of complexity, there are occasional differences. Our evaluation also uncovers variations in how humans and machines evaluate question quality, with a trend inversely related to Bloom’s Revised Taxonomy levels. These findings suggest that while GPT-4 Turbo is a promising tool for educational question generation, its efficacy varies across different cognitive levels, indicating a need for further refinement to fully meet educational standards.

[AI-20] Exploring the Efficacy of Robotic Assistants with ChatGPT and Claude in Enhancing ADHD Therapy: Innovating Treatment Paradigms

Link: https://arxiv.org/abs/2406.15198
Authors: Santiago Berrezueta-Guzman, Mohanad Kandil, María-Luisa Martín-Ruiz, Iván Pau-de-la-Cruz, Stephan Krusche
Keywords: Attention Deficit Hyperactivity, Deficit Hyperactivity Disorder, Attention Deficit, Hyperactivity Disorder, Deficit Hyperactivity
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Software Engineering (cs.SE)
Comments: Paper accepted at the 20th International Conference on Intelligent Environments

Abstract:Attention Deficit Hyperactivity Disorder (ADHD) is a neurodevelopmental condition characterized by inattention, hyperactivity, and impulsivity, which can significantly impact an individual’s daily functioning and quality of life. Occupational therapy plays a crucial role in managing ADHD by fostering the development of skills needed for daily living and enhancing an individual’s ability to participate fully in school, home, and social situations. Recent studies highlight the potential of integrating Large Language Models (LLMs) like ChatGPT and Socially Assistive Robots (SAR) to improve psychological treatments. This integration aims to overcome existing limitations in mental health therapy by providing tailored support and adapting to the unique needs of this sensitive group. However, there remains a significant gap in research exploring the combined use of these advanced technologies in ADHD therapy, suggesting an opportunity for novel therapeutic approaches. Thus, we integrated two advanced language models, ChatGPT-4 Turbo and Claude-3 Opus, into a robotic assistant to explore how well each model performs in robot-assisted interactions. Additionally, we have compared their performance in a simulated therapy scenario to gauge their effectiveness against a clinically validated customized model. The results of this study show that ChatGPT-4 Turbo excelled in performance and responsiveness, making it suitable for time-sensitive applications. Claude-3 Opus, on the other hand, showed strengths in understanding, coherence, and ethical considerations, prioritizing safe and engaging interactions. Both models demonstrated innovation and adaptability, but ChatGPT-4 Turbo offered greater ease of integration and broader language support. The selection between them hinges on the specific demands of ADHD therapy. 

[AI-21] UDA: A Benchmark Suite for Retrieval Augmented Generation in Real-world Document Analysis

Link: https://arxiv.org/abs/2406.15187
Authors: Yulong Hui, Yao Lu, Huanchen Zhang
Keywords: Large Language Models, improved Large Language, Language Models, Large Language, significant challenges exist
Categories: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Abstract:The use of Retrieval-Augmented Generation (RAG) has improved Large Language Models (LLMs) in collaborating with external data, yet significant challenges exist in real-world scenarios. In areas such as academic literature and finance question answering, data are often found in raw text and tables in HTML or PDF formats, which can be lengthy and highly unstructured. In this paper, we introduce a benchmark suite, namely Unstructured Document Analysis (UDA), that involves 2,965 real-world documents and 29,590 expert-annotated QA pairs. We revisit popular LLM- and RAG-based solutions for document analysis and evaluate the design choices and answer qualities across multiple document domains and diverse query types. Our evaluation yields interesting findings and highlights the importance of data parsing and retrieval. We hope our benchmark can shed light and better serve real-world document analysis applications. The benchmark suite and code can be found at this https URL.

[AI-22] Enhancing Idiomatic Representation in Multiple Languages via an Adaptive Contrastive Triplet Loss

Link: https://arxiv.org/abs/2406.15175
Authors: Wei He, Marco Idiart, Carolina Scarton, Aline Villavicencio
Keywords: Natural Language Processing, Accurately modeling idiomatic, Accurately modeling, Language Processing, Natural Language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Accurately modeling idiomatic or non-compositional language has been a longstanding challenge in Natural Language Processing (NLP). This is partly because these expressions do not derive their meanings solely from their constituent words, but also because of the scarcity of relevant data resources and the impact of these expressions on the performance of downstream tasks such as machine translation and simplification. In this paper we propose an approach to model idiomaticity effectively using a triplet loss that incorporates the asymmetric contribution of component words to an idiomatic meaning. Language models are trained with adaptive contrastive learning and resampling miners to build an idiomatic-aware learning objective. Our proposed method is evaluated on a SemEval challenge and outperforms previous alternatives significantly in many metrics.
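
For readers unfamiliar with the triplet loss this abstract builds on, the sketch below shows the plain, fixed-margin version: it pushes an anchor embedding closer to a positive example than to a negative one by at least a margin. This is not the paper's adaptive contrastive variant (the adaptive margin and the resampling miners are not reproduced here), and the vectors and the margin value are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss: max(0, d(a, p) - d(a, n) + margin).

    The loss is zero once the negative is at least `margin` farther from
    the anchor than the positive is.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Hypothetical 2-D embeddings: an idiom in context (anchor), a paraphrase of its
# idiomatic meaning (positive), and a literal reading of its words (negative).
anchor = [0.0, 0.0]
positive = [0.0, 1.0]
far_negative = [3.0, 4.0]
near_negative = [0.0, 1.5]

loss_satisfied = triplet_loss(anchor, positive, far_negative)
loss_violated = triplet_loss(anchor, positive, near_negative)
```

In the toy case, the far negative already satisfies the margin and contributes no loss, while the near negative still produces a positive loss and would drive a gradient update.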

[AI-23] Évaluation des capacités de réponse de larges modèles de langage (LLM) pour des questions d'historiens

Link: https://arxiv.org/abs/2406.15173
Authors: Mathieu Chartier, Nabil Dakkoune, Guillaume Bourgeois, Stéphane Jean
Keywords: Large Language Models, revolutionized information retrieval, ChatGPT or Bard, Bard have revolutionized, generate custom responses
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: In French

Abstract:Large Language Models (LLMs) like ChatGPT or Bard have revolutionized information retrieval and captivated the audience with their ability to generate custom responses in record time, regardless of the topic. In this article, we assess the capabilities of various LLMs in producing reliable, comprehensive, and sufficiently relevant responses about historical facts in French. To achieve this, we constructed a testbed comprising numerous history-related questions of varying types, themes, and levels of difficulty. Our evaluation of responses from ten selected LLMs reveals numerous shortcomings in both substance and form. Beyond an overall insufficient accuracy rate, we highlight uneven treatment of the French language, as well as issues related to verbosity and inconsistency in the responses provided by LLMs.

[AI-24] This actually looks like that: Proto-BagNets for local and global interpretability-by-design

Link: https://arxiv.org/abs/2406.15168
Authors: Kerol Djoumessi, Bubacarr Bah, Laura Kühlewein, Philipp Berens, Lisa Koch
Keywords: including medical diagnosis, high-stakes applications, including medical, medical diagnosis, key requirement
Categories: Artificial Intelligence (cs.AI)

Abstract:Interpretability is a key requirement for the use of machine learning models in high-stakes applications, including medical diagnosis. Explaining black-box models mostly relies on post-hoc methods that do not faithfully reflect the model's behavior. As a remedy, prototype-based networks have been proposed, but their interpretability is limited as they have been shown to provide coarse, unreliable, and imprecise explanations. In this work, we introduce Proto-BagNets, an interpretable-by-design prototype-based model that combines the advantages of bag-of-local-feature models and prototype learning to provide meaningful, coherent, and relevant prototypical parts needed for accurate and interpretable image classification tasks. We evaluated the Proto-BagNet for drusen detection on publicly available retinal OCT data. The Proto-BagNet performed comparably to the state-of-the-art interpretable and non-interpretable models while providing faithful, accurate, and clinically meaningful local and global explanations. The code is available at this https URL.

[AI-25] Perks and Pitfalls of Faithfulness in Regular, Self-Explainable and Domain Invariant GNNs

Link: https://arxiv.org/abs/2406.15156
Authors: Steve Azzolin, Antonio Longa, Stefano Teso, Andrea Passerini
Keywords: Graph Neural Networks, Neural Networks, Graph Neural, build robust tools, paramount to build
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:As Graph Neural Networks (GNNs) become more pervasive, it becomes paramount to build robust tools for computing explanations of their predictions. A key desideratum is that these explanations are faithful, i.e., that they portray an accurate picture of the GNN’s reasoning process. A number of different faithfulness metrics exist, begging the question of what faithfulness is exactly, and what its properties are. We begin by showing that existing metrics are not interchangeable – i.e., explanations attaining high faithfulness according to one metric may be unfaithful according to others – and can be systematically insensitive to important properties of the explanation, and suggest how to address these issues. We proceed to show that, surprisingly, optimizing for faithfulness is not always a sensible design goal. Specifically, we show that for injective regular GNN architectures, perfectly faithful explanations are completely uninformative. The situation is different for modular GNNs, such as self-explainable and domain-invariant architectures, where optimizing faithfulness does not compromise informativeness, and is also unexpectedly tied to out-of-distribution generalization.

[AI-26] Generative Topological Networks

Link: https://arxiv.org/abs/2406.15152
Authors: Alona Levy-Jurgenson, Zohar Yakhini
Keywords: Generative Topological Networks, Generative models, recent years, introduce Generative Topological, significant advancements
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

Abstract:Generative models have seen significant advancements in recent years, yet often remain challenging and costly to train and use. We introduce Generative Topological Networks (GTNs), a new class of generative models that addresses these shortcomings. GTNs are trained deterministically using a simple supervised learning approach grounded in topology theory. GTNs are fast to train, and require only a single forward pass in a standard feedforward neural network to generate samples. We demonstrate the strengths of GTNs on several datasets, including MNIST, CelebA and the Hands and Palm Images dataset. Finally, the theory behind GTNs offers insights into how to train generative models for improved performance.

[AI-27] Gaussian Splatting to Real World Flight Navigation Transfer with Liquid Networks

Link: https://arxiv.org/abs/2406.15149
Authors: Alex Quach, Makram Chahine, Alexander Amini, Ramin Hasani, Daniela Rus
Keywords: scalable data generation, offer scalable data, autonomous robot learning, flexible design, optimization of trajectories
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Simulators are powerful tools for autonomous robot learning as they offer scalable data generation, flexible design, and optimization of trajectories. However, transferring behavior learned from simulation data into the real world proves to be difficult, usually mitigated with compute-heavy domain randomization methods or further model fine-tuning. We present a method to improve generalization and robustness to distribution shifts in sim-to-real visual quadrotor navigation tasks. To this end, we first build a simulator by integrating Gaussian Splatting with quadrotor flight dynamics, and then, train robust navigation policies using Liquid neural networks. In this way, we obtain a full-stack imitation learning protocol that combines advances in 3D Gaussian splatting radiance field rendering, crafty programming of expert demonstration training data, and the task understanding capabilities of Liquid networks. Through a series of quantitative flight tests, we demonstrate the robust transfer of navigation skills learned in a single simulation scene directly to the real world. We further show the ability to maintain performance beyond the training environment under drastic distribution and physical environment changes. Our learned Liquid policies, trained on single target manoeuvres curated from a photorealistic simulated indoor flight only, generalize to multi-step hikes onboard a real hardware platform outdoors.

[AI-28] Younger: The First Dataset for Artificial Intelligence-Generated Neural Network Architecture

Link: https://arxiv.org/abs/2406.15132
Authors: Zhengxin Yang, Wanling Gao, Luzhou Peng, Yunyou Huang, Fei Tang, Jianfeng Zhan
Keywords: requires extensive expertise, typically requires extensive, Designing and optimizing, architectures typically requires, optimizing neural network
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 31 pages, 29 figures, 11 tables

Abstract:Designing and optimizing neural network architectures typically requires extensive expertise, starting with handcrafted designs and then manual or automated refinement. This dependency presents a significant barrier to rapid innovation. Recognizing the complexity of automatically generating neural network architecture from scratch, we introduce Younger, a pioneering dataset to advance this ambitious goal. Derived from over 174K real-world models across more than 30 tasks from various public model hubs, Younger includes 7,629 unique architectures, and each is represented as a directed acyclic graph with detailed operator-level information. The dataset facilitates two primary design paradigms: global, for creating complete architectures from scratch, and local, for detailed architecture component refinement. By establishing these capabilities, Younger contributes to a new frontier, Artificial Intelligence-Generated Neural Network Architecture (AIGNNA). Our experiments explore the potential and effectiveness of Younger for automated architecture generation and, as a secondary benefit, demonstrate that Younger can serve as a benchmark dataset, advancing the development of graph neural networks. We release the dataset and code publicly to lower the entry barriers and encourage further research in this challenging area.
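
The abstract describes each architecture as a directed acyclic graph with operator-level information. As a rough picture of what such a record might look like, the sketch below stores a tiny network as a mapping from node to predecessors plus an operator type per node, and derives a valid execution order with the standard-library `graphlib` module (Python 3.9+). The node names, operator labels, and dictionary layout are hypothetical and do not reflect the Younger dataset's actual schema.

```python
from graphlib import TopologicalSorter

# Hypothetical operator type for each node of a small network with a skip connection.
operators = {
    "input": "Input",
    "conv": "Conv",
    "relu": "Relu",
    "skip_add": "Add",
    "output": "Output",
}

# Edges stored as node -> list of predecessor nodes (the inputs of that operator).
predecessors = {
    "input": [],
    "conv": ["input"],
    "relu": ["conv"],
    "skip_add": ["relu", "input"],
    "output": ["skip_add"],
}


def execution_order(graph):
    """Return the nodes in an order where every operator follows its inputs.

    graphlib raises CycleError if the graph is not acyclic.
    """
    return list(TopologicalSorter(graph).static_order())


order = execution_order(predecessors)
operator_sequence = [operators[node] for node in order]
```

A graph-level representation like this is what makes both whole-architecture generation and local component refinement expressible as graph operations.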

[AI-29] Assessing Good, Bad and Ugly Arguments Generated by ChatGPT: a New Dataset, its Methodology and Associated Tasks

Link: https://arxiv.org/abs/2406.15130
Authors: Victor Hugo Nascimento Rocha, Igor Cataneo Silveira, Paulo Pirozelli, Denis Deratani Mauá, Fabio Gagliardi Cozman
Keywords: Large Language Models, Large Language, Language Models, success of Large, spread misinformation
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:The recent success of Large Language Models (LLMs) has sparked concerns about their potential to spread misinformation. As a result, there is a pressing need for tools to identify “fake arguments” generated by such models. To create these tools, examples of texts generated by LLMs are needed. This paper introduces a methodology to obtain good, bad and ugly arguments from argumentative essays produced by ChatGPT, OpenAI’s LLM. We then describe a novel dataset containing a set of diverse arguments, ArGPT. We assess the effectiveness of our dataset and establish baselines for several argumentation-related tasks. Finally, we show that the artificially generated data relates well to human argumentation and thus is useful as a tool to train and test systems for the defined tasks.

[AI-30] Speech Emotion Recognition under Resource Constraints with Data Distillation

Link: https://arxiv.org/abs/2406.15119
Authors: Yi Chang,Zhao Ren,Zhonghao Zhao,Thanh Tam Nguyen,Kun Qian,Tanja Schultz,Björn W. Schuller
Keywords: Speech emotion recognition, emotion recognition, plays a crucial, human-computer interaction, SER models
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)

Abstract:Speech emotion recognition (SER) plays a crucial role in human-computer interaction. The emergence of edge devices in the Internet of Things (IoT) presents challenges in constructing intricate deep learning models due to constraints in memory and computational resources. Moreover, emotional speech data often contains private information, raising concerns about privacy leakage during the deployment of SER models. To address these challenges, we propose a data distillation framework to facilitate efficient development of SER models in IoT applications using a synthesised, smaller, and distilled dataset. Our experiments demonstrate that the distilled dataset can be effectively utilised to train SER models with fixed initialisation, achieving performances comparable to those developed using the original full emotional speech dataset.

[AI-31] Investigating the impact of 2D gesture representation on co-speech gesture generation

Link: https://arxiv.org/abs/2406.15111
Authors: Teo Guichoux,Laure Soulier,Nicolas Obin,Catherine Pelachaud
Keywords: embodied conversational agents, Co-speech gestures play, natural co-speech gestures, Co-speech gestures, co-speech gestures synchronized
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages. Paper accepted at WACAI 2024

Abstract:Co-speech gestures play a crucial role in the interactions between humans and embodied conversational agents (ECA). Recent deep learning methods enable the generation of realistic, natural co-speech gestures synchronized with speech, but such approaches require large amounts of training data. “In-the-wild” datasets, which compile videos from sources such as YouTube through human pose detection models, offer a solution by providing 2D skeleton sequences that are paired with speech. Concurrently, innovative lifting models have emerged, capable of transforming these 2D pose sequences into their 3D counterparts, leading to large and diverse datasets of 3D gestures. However, the derived 3D pose estimation is essentially a pseudo-ground truth, with the actual ground truth being the 2D motion data. This distinction raises questions about the impact of gesture representation dimensionality on the quality of generated motions, a topic that, to our knowledge, remains largely unexplored. In this work, we evaluate the impact of the dimensionality of the training data, 2D or 3D joint coordinates, on the performance of a multimodal speech-to-gesture deep generative model. We use a lifting model to convert 2D-generated sequences of body pose to 3D. Then, we compare the sequence of gestures generated directly in 3D to the gestures generated in 2D and lifted to 3D as post-processing.

[AI-32] How Intermodal Interaction Affects the Performance of Deep Multimodal Fusion for Mixed-Type Time Series

Link: https://arxiv.org/abs/2406.15098
Authors: Simon Dietz,Thomas Altstidl,Dario Zanca,Björn Eskofier,An Nguyen
Keywords: Mixed-type time series, Mixed-type time, environmental monitoring, time series, social media
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Mixed-type time series (MTTS) is a bimodal data type that is common in many domains, such as healthcare, finance, environmental monitoring, and social media. It consists of regularly sampled continuous time series and irregularly sampled categorical event sequences. The integration of both modalities through multimodal fusion is a promising approach for processing MTTS. However, the question of how to effectively fuse both modalities remains open. In this paper, we present a comprehensive evaluation of several deep multimodal fusion approaches for MTTS forecasting. Our comparison includes three fusion types (early, intermediate, and late) and five fusion methods (concatenation, weighted mean, weighted mean with correlation, gating, and feature sharing). We evaluate these fusion approaches on three distinct datasets, one of which was generated using a novel framework. This framework allows for the control of key data properties, such as the strength and direction of intermodal interactions, modality imbalance, and the degree of randomness in each modality, providing a more controlled environment for testing fusion approaches. Our findings show that the performance of different fusion approaches can be substantially influenced by the direction and strength of intermodal interactions. The study reveals that early and intermediate fusion approaches excel at capturing fine-grained and coarse-grained cross-modal features, respectively. These findings underscore the crucial role of intermodal interactions in determining the most effective fusion strategy for MTTS forecasting.
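
Three of the fusion methods named in the abstract (concatenation, weighted mean, gating) can be illustrated on two already-encoded feature vectors. This is a minimal sketch under stated assumptions, not the paper's implementation: the function names, the fixed weight `w`, and the sigmoid-gate formulation are common textbook choices picked for illustration, and the paper's exact definitions may differ.

```python
import math


def concat_fusion(a, b):
    """Concatenation: stack both modality vectors into one longer vector."""
    return a + b


def weighted_mean_fusion(a, b, w=0.5):
    """Weighted mean: element-wise convex combination (w is assumed fixed here)."""
    return [w * x + (1 - w) * y for x, y in zip(a, b)]


def gated_fusion(a, b, gate_weights, gate_bias=0.0):
    """Gating: a scalar sigmoid gate, computed from both inputs, mixes them.

    gate_weights is a hypothetical parameter vector of length len(a) + len(b);
    in a trained model it would be learned.
    """
    z = sum(wi * xi for wi, xi in zip(gate_weights, a + b)) + gate_bias
    g = 1.0 / (1.0 + math.exp(-z))
    return [g * x + (1 - g) * y for x, y in zip(a, b)]


series_features = [1.0, 2.0]   # toy encoding of the continuous time series
event_features = [3.0, 4.0]    # toy encoding of the categorical event sequence
print(concat_fusion(series_features, event_features))
print(weighted_mean_fusion(series_features, event_features))
print(gated_fusion(series_features, event_features, [0.0, 0.0, 0.0, 0.0]))
```

Whether such an operator is applied to raw inputs, hidden representations, or per-modality predictions corresponds to the early, intermediate, and late fusion types the abstract distinguishes.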

[AI-33] KnobTree: Intelligent Database Parameter Configuration via Explainable Reinforcement Learning

Link: https://arxiv.org/abs/2406.15073
Authors: Jiahan Chen,Shuhan Qi,Yifan Li,Zeyu Dong,Mingfeng Ding,Yulin Wu,Xuan Wang
Keywords: contemporary information systems, traditional rule-based configuration, information systems, rule-based configuration methods, configuration methods struggle
Subjects: Artificial Intelligence (cs.AI); Databases (cs.DB)

Abstract:Databases are fundamental to contemporary information systems, yet traditional rule-based configuration methods struggle to manage the complexity of real-world applications with hundreds of tunable parameters. Deep reinforcement learning (DRL), which combines perception and decision-making, presents a potential solution for intelligent database configuration tuning. However, due to the black-box nature of RL-based methods, the generated database tuning strategies still lack explainability. In addition, the redundant parameters in large-scale databases often make strategy learning unstable. This paper proposes KnobTree, an interpretable framework designed for the optimization of database parameter configuration. In this framework, an interpretable database tuning algorithm based on an RL-based differential tree is proposed, which builds a transparent tree-based model to generate explainable database tuning strategies. To address the problem of large-scale parameters, we also introduce an explainable method for parameter importance assessment that uses Shapley values to identify parameters that have significant impacts on database performance. Experiments conducted on MySQL and Gbase8s databases have verified the exceptional transparency and interpretability of the KnobTree model. As a result, the generated strategies can offer practical guidance to algorithm designers and database administrators. Moreover, our approach also slightly outperforms the existing RL-based tuning algorithms in aspects such as throughput, latency, and processing time.

[AI-34] Tri-VQA: Triangular Reasoning Medical Visual Question Answering for Multi-Attribute Analysis

Link: https://arxiv.org/abs/2406.15050
Authors: Lin Fan,Xun Gong,Cenyang Zheng,Yafei Ou
Keywords: Visual Question Answering, challenging research topic, advantages including patient, including patient engagement, clinical expert involvement
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Abstract:The intersection of medical Visual Question Answering (Med-VQA) is a challenging research topic with advantages including patient engagement and clinical expert involvement for second opinions. However, existing Med-VQA methods based on joint embedding fail to explain whether their provided results are based on correct reasoning or coincidental answers, which undermines the credibility of VQA answers. In this paper, we investigate the construction of a more cohesive and stable Med-VQA structure. Motivated by causal effect, we propose a novel Triangular Reasoning VQA (Tri-VQA) framework, which constructs reverse causal questions from the perspective of “Why this answer?” to elucidate the source of the answer and stimulate more reasonable forward reasoning processes. We evaluate our method on the Endoscopic Ultrasound (EUS) multi-attribute annotated dataset from five centers, and test it on medical VQA datasets. Experimental results demonstrate the superiority of our approach over existing methods. Our codes and pre-trained models are available at https://anonymous.4open.science/r/Tri_VQA.

[AI-35] From Overfitting to Robustness: Quantity, Quality, and Variety Oriented Negative Sample Selection in Graph Contrastive Learning

Link: https://arxiv.org/abs/2406.15044
Authors: Adnan Ali,Jinlong Li,Huanhuan Chen,Ali Kashif Bashir
Keywords: contrast positive-negative counterparts, negative samples, negative sample pools, graph data augmentation, negative
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Graph contrastive learning (GCL) aims to contrast positive-negative counterparts to learn the node embeddings, whereas graph data augmentation methods are employed to generate these positive-negative samples. The variation, quantity, and quality of negative samples compared to positive samples play crucial roles in learning meaningful embeddings for node classification downstream tasks. Less variation, excessive quantity, and low-quality negative samples cause the model to be overfitted for particular nodes, resulting in less robust models. To solve the overfitting problem in the GCL paradigm, this study proposes a novel Cumulative Sample Selection (CSS) algorithm by comprehensively considering negative samples’ quality, variations, and quantity. Initially, three negative sample pools are constructed: easy, medium, and hard negative samples, which contain 25%, 50%, and 25% of the total available negative samples, respectively. Then, 10% negative samples are selected from each of these three negative sample pools for training the model. After that, a decision agent module evaluates model training results and decides whether to explore more negative samples from three negative sample pools by increasing the ratio or keep exploiting the current sampling ratio. The proposed algorithm is integrated into a proposed graph contrastive learning framework named NegAmplify. NegAmplify is compared with the SOTA methods on nine graph node classification datasets, with seven achieving better node classification accuracy with up to 2.86% improvement.
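
The pool construction and initial sampling step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' CSS implementation: the per-sample `hardness` score, the function names, and the fixed random seed are assumptions made for the example. Only the 25%/50%/25% pool split and the 10% starting ratio come from the abstract, and the decision agent that later adjusts the ratio is not modelled.

```python
import random


def build_pools(negatives, hardness):
    """Split negatives into easy/medium/hard pools holding 25%/50%/25%.

    hardness is a hypothetical score per sample (higher = harder); the
    abstract does not say how difficulty is measured.
    """
    ranked = sorted(negatives, key=lambda s: hardness[s])
    n = len(ranked)
    q1, q3 = n // 4, n - n // 4
    return {"easy": ranked[:q1], "medium": ranked[q1:q3], "hard": ranked[q3:]}


def sample_from_pools(pools, ratio=0.10, seed=0):
    """Draw the given ratio of each pool, without replacement."""
    rng = random.Random(seed)
    chosen = []
    for pool in pools.values():
        k = max(1, round(len(pool) * ratio)) if pool else 0
        chosen.extend(rng.sample(pool, k))
    return chosen


negatives = list(range(200))                 # toy negative-sample ids
hardness = {s: s / 200 for s in negatives}   # toy difficulty scores
pools = build_pools(negatives, hardness)
selected = sample_from_pools(pools)
print({name: len(pool) for name, pool in pools.items()}, len(selected))
```

Raising `ratio` on a later call would correspond to the "explore more negative samples" branch of the decision step, while keeping it fixed corresponds to continued exploitation.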

[AI-36] Behaviour Distillation

Link: https://arxiv.org/abs/2406.15042
Authors: Andrei Lupu,Chris Lu,Jarek Liesen,Robert Tjarko Lange,Jakob Foerster
Keywords: small number, drop-in replacements, condense large datasets, distillation, datasets
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Published as a conference paper at ICLR 2024

Abstract:Dataset distillation aims to condense large datasets into a small number of synthetic examples that can be used as drop-in replacements when training new models. It has applications to interpretability, neural architecture search, privacy, and continual learning. Despite strong successes in supervised domains, such methods have not yet been extended to reinforcement learning, where the lack of a fixed dataset renders most distillation methods unusable. Filling the gap, we formalize behaviour distillation, a setting that aims to discover and then condense the information required for training an expert policy into a synthetic dataset of state-action pairs, without access to expert data. We then introduce Hallucinating Datasets with Evolution Strategies (HaDES), a method for behaviour distillation that can discover datasets of just four state-action pairs which, under supervised learning, train agents to competitive performance levels in continuous control tasks. We show that these datasets generalize out of distribution to training policies with a wide range of architectures and hyperparameters. We also demonstrate application to a downstream task, namely training multi-task agents in a zero-shot fashion. Beyond behaviour distillation, HaDES provides significant improvements in neuroevolution for RL over previous approaches and achieves SoTA results on one standard supervised dataset distillation task. Finally, we show that visualizing the synthetic datasets can provide human-interpretable task insights.

[AI-37] Online detection and infographic explanation of spam reviews with data drift adaptation

Link: https://arxiv.org/abs/2406.15038
Authors: Francisco de Arriba-Pérez,Silvia García-Méndez,Fátima Leal,Benedita Malheiro,J. C. Burguillo
Keywords: online platforms due, impact on reputation, platforms due, significant impact, Spam
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Social and Information Networks (cs.SI)

Abstract:Spam reviews are a pervasive problem on online platforms due to their significant impact on reputation. However, research into spam detection in data streams is scarce. Another concern lies in their need for transparency. Consequently, this paper addresses those problems by proposing an online solution for identifying and explaining spam reviews, incorporating data drift adaptation. It integrates (i) incremental profiling, (ii) data drift detection and adaptation, and (iii) identification of spam reviews employing Machine Learning. The explainable mechanism displays a visual and textual prediction explanation in a dashboard. The best results obtained reached up to 87% spam F-measure.

[AI-38] GiusBERTo: A Legal Language Model for Personal Data De-identification in Italian Court of Auditors Decisions

Link: https://arxiv.org/abs/2406.15032
Authors: Giulio Salierno,Rosamaria Bertè,Luca Attias,Carla Morrone,Dario Pettazzoni,Daniela Battisti
Keywords: Natural Language Processing, Natural Language, Language Processing, Recent advances, pretrained language models
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 14 pages, 4 figures, 6 tables

Abstract:Recent advances in Natural Language Processing have demonstrated the effectiveness of pretrained language models like BERT for a variety of downstream tasks. We present GiusBERTo, the first BERT-based model specialized for anonymizing personal data in Italian legal documents. GiusBERTo is trained on a large dataset of Court of Auditors decisions to recognize entities to anonymize, including names, dates, locations, while retaining contextual relevance. We evaluate GiusBERTo on a held-out test set and achieve 97% token-level accuracy. GiusBERTo provides the Italian legal community with an accurate and tailored BERT model for de-identification, balancing privacy and data protection.

[AI-39] Evolution of Rewards for Food and Motor Action by Simulating Birth and Death

Link: https://arxiv.org/abs/2406.15016
Authors: Yuji Kanagawa,Kenji Doya
Keywords: survival and reproduction, fundamental drivers, drivers of animal, animal behaviors, critical for survival
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)

Abstract:The reward system is one of the fundamental drivers of animal behaviors and is critical for survival and reproduction. Despite its importance, the problem of how the reward system has evolved is underexplored. In this paper, we try to replicate the evolution of biologically plausible reward functions and investigate how environmental conditions affect evolved rewards’ shape. For this purpose, we developed a population-based decentralized evolutionary simulation framework, where agents maintain their energy level to live longer and produce more children. Each agent inherits its reward function from its parent subject to mutation and learns to get rewards via reinforcement learning throughout its lifetime. Our results show that biologically reasonable positive rewards for food acquisition and negative rewards for motor action can evolve from randomly initialized ones. However, we also find that the rewards for motor action diverge into two modes: largely positive and slightly negative. The emergence of positive motor action rewards is surprising because it can make agents too active and inefficient in foraging. In environments with poor and poisonous foods, the evolution of rewards for less important foods tends to be unstable, while rewards for normal foods are still stable. These results demonstrate the usefulness of our simulation environment and energy-dependent birth and death model for further studies of the origin of reward systems.

[AI-40] GraLMatch: Matching Groups of Entities with Graphs and Language Models

Link: https://arxiv.org/abs/2406.15015
Authors: Fernando De Meer Pardo,Claude Lehmann,Dennis Gehrig,Andrea Nagy,Stefano Nicoli,Branka Hadji Misheva,Martin Braschler,Kurt Stockinger
Keywords: entity group matching, records, call entity group, multiple data sources, Matching
Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 12 pages, 4 figures, accepted as research paper at EDBT 2025

Abstract:In this paper, we present an end-to-end multi-source Entity Matching problem, which we call entity group matching, where the goal is to assign to the same group, records originating from multiple data sources but representing the same real-world entity. We focus on the effects of transitively matched records, i.e. the records connected by paths in the graph G = (V,E) whose nodes and edges represent the records and whether they are a match or not. We present a real-world instance of this problem, where the challenge is to match records of companies and financial securities originating from different data providers. We also introduce two new multi-source benchmark datasets that present similar matching challenges as real-world records. A distinctive characteristic of these records is that they are regularly updated following real-world events, but updates are not applied uniformly across data sources. This phenomenon makes the matching of certain groups of records only possible through the use of transitive information. In our experiments, we illustrate how considering transitively matched records is challenging since a limited amount of false positive pairwise match predictions can throw off the group assignment of large quantities of records. Thus, we propose GraLMatch, a method that can partially detect and remove false positive pairwise predictions through graph-based properties. Finally, we showcase how fine-tuning a Transformer-based model (DistilBERT) on a reduced number of labeled samples yields a better final entity group matching than training on more samples and/or incorporating fine-tuning optimizations, illustrating how precision becomes the deciding factor in the entity group matching of large volumes of records. 
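
The effect of transitive matches on group assignment can be seen with a small connected-components example. This is a sketch under stated assumptions, not GraLMatch itself: groups are taken to be the connected components of the pairwise match graph G = (V,E), computed here with a simple union-find, and the record names and function name are made up for illustration. The graph-based clean-up step that the paper proposes is not shown.

```python
def group_records(records, matches):
    """Assign records to groups as connected components of the match graph."""
    parent = {r: r for r in records}

    def find(x):
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matches:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for r in records:
        groups.setdefault(find(r), []).append(r)
    return sorted(sorted(g) for g in groups.values())


records = ["a1", "a2", "a3", "b1", "b2", "b3"]
true_matches = [("a1", "a2"), ("a2", "a3"), ("b1", "b2"), ("b2", "b3")]

# a1 and a3 end up in the same group only through the transitive path via a2.
print(group_records(records, true_matches))

# A single false positive pairwise prediction merges the two groups into one.
print(group_records(records, true_matches + [("a3", "b1")]))
```

The second call shows why precision matters so much in this setting: one wrong edge changes the group assignment of every record in both components.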

[AI-41] Fair, Manipulation-Robust, and Transparent Sortition

Link: https://arxiv.org/abs/2406.15009
Authors: Carmel Baharav,Bailey Flanigan
Keywords: Citizens’ Assemblies, processes like Citizens’, political representatives, world to choose, choose participants
Subjects: Artificial Intelligence (cs.AI)

Abstract:Sortition, the random selection of political representatives, is increasingly being used around the world to choose participants of deliberative processes like Citizens’ Assemblies. Motivated by sortition’s practical importance, there has been a recent flurry of research on sortition algorithms, whose task it is to select a panel from among a pool of volunteers. This panel must satisfy quotas enforcing representation of key population subgroups. Past work has contributed an algorithmic approach for fulfilling this task while ensuring that volunteers’ chances of selection are maximally equal, as measured by any convex equality objective. The question, then, is: which equality objective is the right one? Past work has mainly studied the objectives Minimax and Leximin, which respectively minimize the maximum and maximize the minimum chance of selection given to any volunteer. Recent work showed that both of these objectives have key weaknesses: Minimax is highly robust to manipulation but is arbitrarily unfair; oppositely, Leximin is highly fair but arbitrarily manipulable. In light of this gap, we propose a new equality objective, Goldilocks, that aims to achieve these ideals simultaneously by ensuring that no volunteer receives too little or too much chance of selection. We theoretically bound the extent to which Goldilocks achieves these ideals, finding that in an important sense, Goldilocks recovers among the best available solutions in a given instance. We then extend our bounds to the case where the output of Goldilocks is transformed to achieve a third goal, Transparency. Our empirical analysis of Goldilocks in real data is even more promising: we find that this objective achieves nearly instance-optimal minimum and maximum selection probabilities simultaneously in most real instances – an outcome not even guaranteed to be possible for any algorithm. 
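
The two quantities that Minimax and Leximin act on, the maximum and minimum chance of selection given to any volunteer, can be computed from a lottery over panels. This is a minimal sketch with a toy lottery: the volunteer names, panels, and probabilities are invented for illustration, and the optimisation over quota-satisfying lotteries that an actual sortition algorithm performs is not shown.

```python
from fractions import Fraction


def selection_probabilities(volunteers, lottery):
    """Marginal selection probability of each volunteer.

    lottery maps each panel (a frozenset of volunteers) to the probability
    that this panel is the one drawn.
    """
    probs = {v: Fraction(0) for v in volunteers}
    for panel, p in lottery.items():
        for v in panel:
            probs[v] += p
    return probs


volunteers = ["A", "B", "C", "D"]
lottery = {
    frozenset({"A", "B"}): Fraction(1, 2),
    frozenset({"A", "C"}): Fraction(1, 4),
    frozenset({"C", "D"}): Fraction(1, 4),
}
probs = selection_probabilities(volunteers, lottery)

# Leximin tries to raise the first number; Minimax tries to lower the second.
print(min(probs.values()), max(probs.values()))
```

In this toy lottery volunteer D has the smallest chance and volunteer A the largest, so the two objectives would push the lottery in different directions, which is the tension the abstract describes.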

[AI-42] RouteFinder: Towards Foundation Models for Vehicle Routing Problems

Link: https://arxiv.org/abs/2406.15007
Authors: Federico Berto,Chuanbo Hua,Nayeli Gast Zepeda,André Hottung,Niels Wouda,Leon Lan,Kevin Tierney,Jinkyoo Park
Keywords: Vehicle Routing Problems, Vehicle Routing, supply chain management, significant real-world implications, Routing Problems
Subjects: Artificial Intelligence (cs.AI)

Abstract:Vehicle Routing Problems (VRPs) are optimization problems with significant real-world implications in logistics, transportation, and supply chain management. Despite the recent progress made in learning to solve individual VRP variants, there is a lack of a unified approach that can effectively tackle a wide range of tasks, which is crucial for real-world impact. This paper introduces RouteFinder, a framework for developing foundation models for VRPs. Our key idea is that a foundation model for VRPs should be able to model variants by treating each variant as a subset of a larger VRP problem, equipped with different attributes. We introduce a parallelized environment that can handle any combination of attributes at the same time in a batched manner, and an efficient sampling procedure to train on a mix of problems at each optimization step that can greatly improve convergence robustness. We also introduce novel Global Feature Embeddings that project instance-wise attributes efficiently onto the latent space and help the model understand different VRP variants. Finally, we introduce Efficient Adapter Layers, a simple yet effective technique to finetune pre-trained RouteFinder models to solve novel variants with previously unseen attributes outside of the original feature space. We validate our approach through extensive experiments on 24 VRP variants, demonstrating competitive results over recent multi-task learning models. We make our code openly available at this https URL.

[AI-43] Unveiling the Impact of Multi-Modal Interactions on User Engagement: A Comprehensive Evaluation in AI-driven Conversations

Link: https://arxiv.org/abs/2406.15000
Authors: Lichao Zhang,Jia Yu,Shuai Zhang,Long Li,Yangyang Zhong,Guanbao Liang,Yuming Yan,Qing Ma,Fangsheng Weng,Fayu Pan,Jing Li,Renjun Xu,Zhenzhong Lan
Keywords: Large Language Models, Large Language, Language Models, advanced user-bot interactions, significantly advanced user-bot
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Large Language Models (LLMs) have significantly advanced user-bot interactions, enabling more complex and coherent dialogues. However, the prevalent text-only modality might not fully exploit the potential for effective user engagement. This paper explores the impact of multi-modal interactions, which incorporate images and audio alongside text, on user engagement in chatbot conversations. We conduct a comprehensive analysis using a diverse set of chatbots and real-user interaction data, employing metrics such as retention rate and conversation length to evaluate user engagement. Our findings reveal a significant enhancement in user engagement with multi-modal interactions compared to text-only dialogues. Notably, the incorporation of a third modality significantly amplifies engagement beyond the benefits observed with just two modalities. These results suggest that multi-modal interactions optimize cognitive processing and facilitate richer information comprehension. This study underscores the importance of multi-modality in chatbot design, offering valuable insights for creating more engaging and immersive AI communication experiences and informing the broader AI community about the benefits of multi-modal interactions in enhancing user engagement.

[AI-44] Learning Variable Compliance Control From a Few Demonstrations for Bimanual Robot with Haptic Feedback Teleoperation System

Link: https://arxiv.org/abs/2406.14990
Authors: Tatsuya Kamijo,Cristian C. Beltran-Hernandez,Masashi Hamaya
Keywords: Automating dexterous, challenge in robotics, significant challenge, rigid robots, compliance control
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Automating dexterous, contact-rich manipulation tasks using rigid robots is a significant challenge in robotics. Rigid robots, defined by their actuation through position commands, face issues of excessive contact forces due to their inability to adapt to contact with the environment, potentially causing damage. While compliance control schemes have been introduced to mitigate these issues by controlling forces via external sensors, they are hampered by the need for fine-tuning task-specific controller parameters. Learning from Demonstrations (LfD) offers an intuitive alternative, allowing robots to learn manipulations through observed actions. In this work, we introduce a novel system to enhance the teaching of dexterous, contact-rich manipulations to rigid robots. Our system is twofold: firstly, it incorporates a teleoperation interface utilizing Virtual Reality (VR) controllers, designed to provide an intuitive and cost-effective method for task demonstration with haptic feedback. Secondly, we present Comp-ACT (Compliance Control via Action Chunking with Transformers), a method that leverages the demonstrations to learn variable compliance control from a few demonstrations. Our methods have been validated across various complex contact-rich manipulation tasks using single-arm and bimanual robot setups in simulated and real-world environments, demonstrating the effectiveness of our system in teaching robots dexterous manipulations with enhanced adaptability and safety.

[AI-45] Do Large Language Models Exhibit Cognitive Dissonance? Studying the Difference Between Revealed Beliefs and Stated Answers

Link: https://arxiv.org/abs/2406.14986
Authors: Manuel Mondal,Ljiljana Dolamic,Gérôme Bovet,Philippe Cudré-Mauroux
Keywords: Large Language Models, Multiple Choices Questions, Choices Questions, Language Models, Large Language
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:Prompting and Multiple Choices Questions (MCQ) have become the preferred approach to assess the capabilities of Large Language Models (LLMs), due to their ease of manipulation and evaluation. Such experimental appraisals have pointed toward the LLMs’ apparent ability to perform causal reasoning or to grasp uncertainty. In this paper, we investigate whether these abilities are measurable outside of tailored prompting and MCQ by reformulating these issues as direct text completion - the foundation of LLMs. To achieve this goal, we define scenarios with multiple possible outcomes and we compare the prediction made by the LLM through prompting (their Stated Answer) to the probability distributions they compute over these outcomes during next token prediction (their Revealed Belief). Our findings suggest that the Revealed Belief of LLMs significantly differs from their Stated Answer and hint at multiple biases and misrepresentations that their beliefs may yield in many scenarios and outcomes. As text completion is at the core of LLMs, these results suggest that common evaluation methods may only provide a partial picture and that more research is needed to assess the extent and nature of their capabilities.

[AI-46] Human-AI collectives produce the most accurate differential diagnoses

Link: https://arxiv.org/abs/2406.14981
Authors: N. Zöller,J. Berger,I. Lin,N. Fu,J. Komarneni,G. Barabucci,K. Laskowski,V. Shia,B. Harack,E. A. Chu,V. Trianni,R. H.J.M. Kurvers,S. M. Herzog
Keywords: large language models, Artificial intelligence systems, Artificial intelligence, large language, language models
Subjects: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)

Abstract:Artificial intelligence systems, particularly large language models (LLMs), are increasingly being employed in high-stakes decisions that impact both individuals and society at large, often without adequate safeguards to ensure safety, quality, and equity. Yet LLMs hallucinate, lack common sense, and are biased - shortcomings that may reflect LLMs’ inherent limitations and thus may not be remedied by more sophisticated architectures, more data, or more human feedback. Relying solely on LLMs for complex, high-stakes decisions is therefore problematic. Here we present a hybrid collective intelligence system that mitigates these risks by leveraging the complementary strengths of human experience and the vast information processed by LLMs. We apply our method to open-ended medical diagnostics, combining 40,762 differential diagnoses made by physicians with the diagnoses of five state-of-the-art LLMs across 2,133 medical cases. We show that hybrid collectives of physicians and LLMs outperform both single physicians and physician collectives, as well as single LLMs and LLM ensembles. This result holds across a range of medical specialties and professional experience, and can be attributed to humans’ and LLMs’ complementary contributions that lead to different kinds of errors. Our approach highlights the potential for collective human and machine intelligence to improve accuracy in complex, open-ended domains like medical diagnostics.

[AI-47] Trustworthy Enhanced Multi-view Multi-modal Alzheimer’s Disease Prediction with Brain-wide Imaging Transcriptomics Data

Link: https://arxiv.org/abs/2406.14977
Authors: Shan Cong,Zhoujie Fan,Hongwei Liu,Yinghan Zhang,Xin Wang,Haoran Luo,Xiaohui Yao
Keywords: functions and processes, molecular mechanisms, coordinates its functions, brain coordinates, predicting Alzheimer disease
Categories: Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)

Abstract:Brain transcriptomics provides insights into the molecular mechanisms by which the brain coordinates its functions and processes. However, existing multimodal methods for predicting Alzheimer’s disease (AD) primarily rely on imaging and sometimes genetic data, often neglecting the transcriptomic basis of the brain. Furthermore, while striving to integrate complementary information between modalities, most studies overlook the informativeness disparities between modalities. Here, we propose TMM, a trusted multiview multimodal graph attention framework for AD diagnosis, using extensive brain-wide transcriptomics and imaging data. First, we construct view-specific brain regional co-function networks (RRIs) from transcriptomics and multimodal radiomics data to incorporate interaction information from both biomolecular and imaging perspectives. Next, we apply graph attention (GAT) processing to each RRI network to produce graph embeddings and employ cross-modal attention to fuse the transcriptomics-derived embedding with each imaging-derived embedding. Finally, a novel true-false-harmonized class probability (TFCP) strategy is designed to assess and adaptively adjust the prediction confidence of each modality for AD diagnosis. We evaluate TMM using the AHBA database with brain-wide transcriptomics data and the ADNI database with three imaging modalities (AV45-PET, FDG-PET, and VBM-MRI). The results demonstrate the superiority of our method in identifying AD, EMCI, and LMCI compared to state-of-the-art methods. Code and data are available at this https URL.

[AI-48] Domain Adaptation of Llama3-70B-Instruct through Continual Pre-Training and Model Merging: A Comprehensive Evaluation

Link: https://arxiv.org/abs/2406.14971
Authors: Shamane Siriwardhana,Mark McQuade,Thomas Gauthier,Lucas Atkins,Fernando Fernandes Neto,Luke Meyers,Anneketh Vij,Tyler Odenthal,Charles Goddard,Mary MacCarthy,Jacob Solawetz
Keywords: conducted extensive experiments, SEC data, exploring its performance, conducted extensive, extensive experiments
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 8 pages, 6 figures

Abstract:We conducted extensive experiments on domain adaptation of the Meta-Llama-3-70B-Instruct model on SEC data, exploring its performance on both general and domain-specific benchmarks. Our focus included continual pre-training (CPT) and model merging, aiming to enhance the model’s domain-specific capabilities while mitigating catastrophic forgetting. Through this study, we evaluated the impact of integrating financial regulatory data into a robust language model and examined the effectiveness of our model merging techniques in preserving and improving the model’s instructive abilities. The model is accessible on Hugging Face: this https URL, arcee-ai/Llama-3-SEC-Base. This is an intermediate checkpoint of our final model, which has seen 20B tokens so far. The full model is still in the process of training. This is a preprint technical report with thorough evaluations to understand the entire process.

[AI-49] Uni-Mol2: Exploring Molecular Pretraining Model at Scale

Link: https://arxiv.org/abs/2406.14969
Authors: Xiaohong Ji,Wang Zhen,Zhifeng Gao,Hang Zheng,Linfeng Zhang,Guolin Ke,Weinan E
Keywords: natural language processing, made significant advancements, molecular pretraining models, molecular pretraining, significant advancements
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:In recent years, pretraining models have made significant advancements in the fields of natural language processing (NLP), computer vision (CV), and life sciences. The significant advancements in NLP and CV are predominantly driven by the expansion of model parameters and data size, a phenomenon now recognized as the scaling laws. However, scaling laws in molecular pretraining models remain unexplored. In this work, we present Uni-Mol2, an innovative molecular pretraining model that leverages a two-track transformer to effectively integrate features at the atomic level, graph level, and geometry structure level. Along with this, we systematically investigate the scaling law within molecular pretraining models, characterizing the power-law correlations between validation loss and model size, dataset size, and computational resources. Consequently, we successfully scale Uni-Mol2 to 1.1 billion parameters through pretraining on 800 million conformations, making it the largest molecular pretraining model to date. Extensive experiments show consistent improvement in the downstream tasks as the model size grows. The Uni-Mol2 with 1.1B parameters also outperforms existing methods, achieving an average 27% improvement on the QM9 and 14% on COMPAS-1D dataset.
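
The power-law relation between validation loss and model size mentioned above can be illustrated with a small fit. This is a sketch under stated assumptions: it fits the plain form `loss = a * size**(-b)` by least squares in log-log space, the data points are synthetic and made up for the example, and the paper's actual functional form and fitted values may differ.

```python
import math


def fit_power_law(sizes, losses):
    """Fit loss = a * size**(-b) by ordinary least squares in log-log space.

    Returns (a, b). Taking logs turns the power law into a straight line:
    log(loss) = log(a) - b * log(size).
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(v) for v in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    a = math.exp(mean_y - slope * mean_x)
    return a, -slope


# Synthetic, made-up points generated from loss = 5.0 * N**(-0.2).
model_sizes = [1e7, 1e8, 1e9]
val_losses = [5.0 * n ** -0.2 for n in model_sizes]

a, b = fit_power_law(model_sizes, val_losses)
print(f"a = {a:.3f}, b = {b:.3f}")
```

The same fit can be repeated with dataset size or compute on the x-axis; with real measurements the points will not lie exactly on a line, so the residuals are worth inspecting before extrapolating.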

[AI-50] Deep Imbalanced Regression to Estimate Vascular Age from PPG Data: a Novel Digital Biomarker for Cardiovascular Health

Link: https://arxiv.org/abs/2406.14953
Authors: Guangkun Nie,Qinghao Zhao,Gongzheng Tang,Jun Li,Shenda Hong
Keywords: monitoring human hemodynamics, recent studies highlighting, assessing vascular aging, human hemodynamics, Dist Loss
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Signal Processing (eess.SP)

Abstract:Photoplethysmography (PPG) is emerging as a crucial tool for monitoring human hemodynamics, with recent studies highlighting its potential in assessing vascular aging through deep learning. However, real-world age distributions are often imbalanced, posing significant challenges for deep learning models. In this paper, we introduce a novel, simple, and effective loss function named the Dist Loss to address deep imbalanced regression tasks. We trained a one-dimensional convolutional neural network (Net1D) incorporating the Dist Loss on the extensive UK Biobank dataset (n=502,389) to estimate vascular age from PPG signals and validate its efficacy in characterizing cardiovascular health. The model’s performance was validated on a 40% held-out test set, achieving state-of-the-art results, especially in regions with small sample sizes. Furthermore, we divided the population into three subgroups based on the difference between predicted vascular age and chronological age: less than -10 years, between -10 and 10 years, and greater than 10 years. We analyzed the relationship between predicted vascular age and several cardiovascular events over a follow-up period of up to 10 years, including death, coronary heart disease, and heart failure. Our results indicate that the predicted vascular age has significant potential to reflect an individual’s cardiovascular health status. Our code will be available at this https URL.
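
The three subgroups described above follow directly from the gap between predicted vascular age and chronological age. A minimal sketch, assuming the boundary values of exactly -10 and 10 years fall in the middle group (the abstract does not say) and using hypothetical group labels:

```python
def age_gap_group(predicted_age, chronological_age, threshold=10.0):
    """Assign a subgroup from the gap (predicted minus chronological age).

    Labels are hypothetical. Gaps of exactly -threshold or +threshold are
    placed in the middle group, which is an assumption of this sketch.
    """
    gap = predicted_age - chronological_age
    if gap < -threshold:
        return "younger"   # vascular age more than 10 years below actual age
    if gap > threshold:
        return "older"     # vascular age more than 10 years above actual age
    return "aligned"       # within 10 years either way


# Made-up (predicted, chronological) pairs for illustration.
examples = [(45, 60), (60, 55), (75, 60)]
for predicted, actual in examples:
    print(predicted, actual, age_gap_group(predicted, actual))
```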

[AI-51] An Idiosyncrasy of Time-discretization in Reinforcement Learning

Link: https://arxiv.org/abs/2406.14951
Authors: Kris De Asis,Richard S. Sutton
Keywords: discrete time steps, agent interacts, discrete time, time steps, reinforcement learning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: RLC 2024

Abstract:Many reinforcement learning algorithms are built on an assumption that an agent interacts with an environment over fixed-duration, discrete time steps. However, physical systems are continuous in time, requiring a choice of time-discretization granularity when digitally controlling them. Furthermore, such systems do not wait for decisions to be made before advancing the environment state, necessitating the study of how the choice of discretization may affect a reinforcement learning algorithm. In this work, we consider the relationship between the definitions of the continuous-time and discrete-time returns. Specifically, we acknowledge an idiosyncrasy with naively applying a discrete-time algorithm to a discretized continuous-time environment, and note how a simple modification can better align the return definitions. This observation is of practical consideration when dealing with environments where time-discretization granularity is a choice, or situations where such granularity is inherently stochastic.
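
A small numerical experiment shows why the two return definitions can disagree. With a constant reward rate of 1 and discount `gamma` per unit time, the continuous-time return over a horizon `T` is the integral of `gamma**t`, while a naive discrete-time return simply sums the discounted per-step rewards and therefore grows as the step `dt` shrinks. Scaling the sum by `dt` (a Riemann sum) is one way to bring the two into line; it is used here as an assumption for illustration, since the abstract does not state which modification the authors propose.

```python
import math


def discrete_return(rewards, gamma, dt, scale_by_dt):
    """Discounted sum over steps of length dt.

    The per-step discount is gamma**dt so that discounting per unit of
    time stays the same for any dt. With scale_by_dt=True the sum is
    multiplied by dt, turning it into a Riemann sum of the integral.
    """
    step_discount = gamma ** dt
    total = sum((step_discount ** k) * r for k, r in enumerate(rewards))
    return total * dt if scale_by_dt else total


gamma, horizon, dt = 0.9, 10.0, 0.01
steps = round(horizon / dt)
rewards = [1.0] * steps  # reward rate of 1 sampled at every step

# Closed form of the integral of gamma**t from 0 to horizon.
exact = (gamma ** horizon - 1.0) / math.log(gamma)
naive = discrete_return(rewards, gamma, dt, scale_by_dt=False)
scaled = discrete_return(rewards, gamma, dt, scale_by_dt=True)

print(f"continuous: {exact:.4f}  naive: {naive:.2f}  scaled: {scaled:.4f}")
```

Halving `dt` roughly doubles the naive value while the scaled value barely moves, which is the kind of granularity dependence the abstract warns about.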

[AI-52] CEASEFIRE: An AI-powered system for combatting illicit firearms trafficking

Link: https://arxiv.org/abs/2406.14949
Authors: Ioannis Mademlis,Jorgen Cani,Marina Mancuso,Caterina Paternoster,Emmanouil Adamakis,George Margetis,Sylvie Chambon,Alain Crouzil,Loubna Lechelek,Georgia Dede,Spyridon Evangelatos,George Lalas,Franck Mignet,Pantelis Linardatos,Konstantinos Kentrotis,Henryk Gierszal,Piotr Tyczka,Sophia Karagiorgou,George Pantelis,Georgios Stavropoulos,Konstantinos Votis,Georgios Th. Papadopoulos
Keywords: led illicit firearms, illicit firearms trafficking, Modern technologies, merge with cybercrime, technologies have led
Categories: Artificial Intelligence (cs.AI)

Abstract:Modern technologies have led illicit firearms trafficking to partially merge with cybercrime, while simultaneously permitting its off-line aspects to become more sophisticated. Law enforcement officers face difficult challenges that require hi-tech solutions. This article presents a real-world system, powered by advanced Artificial Intelligence, for facilitating them in their everyday work.

[AI-53] Towards Retrieval Augmented Generation over Large Video Libraries

Link: https://arxiv.org/abs/2406.14938
Authors: Yannis Tevissen,Khalil Guetari,Frédéric Petitpont
Keywords: requires complex manual, Library Question Answering, automated searches, creators need efficient, efficient tools
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted in IEEE HSI 2024

Abstract:Video content creators need efficient tools to repurpose content, a task that often requires complex manual or automated searches. Crafting a new video from large video libraries remains a challenge. In this paper we introduce the task of Video Library Question Answering (VLQA) through an interoperable architecture that applies Retrieval Augmented Generation (RAG) to video libraries. We propose a system that uses large language models (LLMs) to generate search queries, retrieving relevant video moments indexed by speech and visual metadata. An answer generation module then integrates user queries with this metadata to produce responses with specific video timestamps. This approach shows promise in multimedia content retrieval, and AI-assisted video content creation.

[AI-54] Autonomous Agents for Collaborative Task under Information Asymmetry

Link: https://arxiv.org/abs/2406.14928
Authors: Wei Liu,Chenxi Wang,Yifei Wang,Zihao Xie,Rennai Qiu,Yufan Dang,Zhuoyun Du,Weize Chen,Cheng Yang,Chen Qian
Keywords: Large Language Model, Language Model Multi-Agent, Large Language, Language Model, Model Multi-Agent Systems
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA); Social and Information Networks (cs.SI)
Comments: 16 pages, 8 figures, 5 tables, Work in progress

Abstract:Large Language Model Multi-Agent Systems (LLM-MAS) have achieved great progress in solving complex tasks. It performs communication among agents within the system to collaboratively solve tasks, under the premise of shared information. However, when agents’ communication is leveraged to enhance human cooperation, a new challenge arises due to information asymmetry, since each agent can only access the information of its human user. Previous MAS struggle to complete tasks under this condition. To address this, we propose a new MAS paradigm termed iAgents, which denotes Informative Multi-Agent Systems. In iAgents, the human social network is mirrored in the agent network, where agents proactively exchange human information necessary for task resolution, thereby overcoming information asymmetry. iAgents employs a novel agent reasoning mechanism, InfoNav, to navigate agents’ communication towards effective information exchange. Together with InfoNav, iAgents organizes human information in a mixed memory to provide agents with accurate and comprehensive information for exchange. Additionally, we introduce InformativeBench, the first benchmark tailored for evaluating LLM agents’ task-solving ability under information asymmetry. Experimental results show that iAgents can collaborate within a social network of 140 individuals and 588 relationships, autonomously communicate over 30 turns, and retrieve information from nearly 70,000 messages to complete tasks within 3 minutes.

[AI-55] LLM2FEA: Discover Novel Designs with Generative Evolutionary Multitasking

Link: https://arxiv.org/abs/2406.14917
Authors: Melvin Wong,Jiao Liu,Thiago Rios,Stefan Menzel,Yew Soon Ong
Keywords: generative artificial intelligence, high-quality images, rapid research, research and development, artificial intelligence
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

Abstract:The rapid research and development of generative artificial intelligence has enabled the generation of high-quality images, text, and 3D models from text prompts. This advancement impels an inquiry into whether these models can be leveraged to create digital artifacts for both creative and engineering applications. Drawing on innovative designs from other domains may be one answer to this question, much like the historical practice of “bionics”, where humans have sought inspiration from nature’s exemplary designs. This raises the intriguing possibility of using generative models to simultaneously tackle design tasks across multiple domains, facilitating cross-domain learning and resulting in a series of innovative design solutions. In this paper, we propose LLM2FEA as the first attempt to discover novel designs in generative models by transferring knowledge across multiple domains. By utilizing a multi-factorial evolutionary algorithm (MFEA) to drive a large language model, LLM2FEA integrates knowledge from various fields to generate prompts that guide the generative model in discovering novel and practical objects. Experimental results in the context of 3D aerodynamic design verify the discovery capabilities of the proposed LLM2FEA. The designs generated by LLM2FEA not only satisfy practicality requirements to a certain degree but also feature novel and aesthetically pleasing shapes, demonstrating the potential applications of LLM2FEA in discovery tasks.

[AI-56] Demonstrating the Efficacy of Kolmogorov-Arnold Networks in Vision Tasks

Link: https://arxiv.org/abs/2406.14916
Authors: Minjong Cheon
Keywords: Kolmogorov-Arnold Network, deep learning, multilayer projections, realm of deep, vision tasks
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:In the realm of deep learning, the Kolmogorov-Arnold Network (KAN) has emerged as a potential alternative to multilayer projections (MLPs). However, its applicability to vision tasks has not been extensively validated. In our study, we demonstrated the effectiveness of KAN for vision tasks through multiple trials on the MNIST, CIFAR10, and CIFAR100 datasets, using a training batch size of 32. Our results showed that while KAN outperformed the original MLP-Mixer on CIFAR10 and CIFAR100, it performed slightly worse than the state-of-the-art ResNet-18. These findings suggest that KAN holds significant promise for vision tasks, and further modifications could enhance its performance in future evaluations. Our contributions are threefold: first, we showcase the efficiency of KAN-based algorithms for visual tasks; second, we provide extensive empirical assessments across various vision benchmarks, comparing KAN’s performance with MLP-Mixer, CNNs, and Vision Transformers (ViT); and third, we pioneer the use of natural KAN layers in visual tasks, addressing a gap in previous research. This paper lays the foundation for future studies on KANs, highlighting their potential as a reliable alternative for image classification tasks.

[AI-57] MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression

Link: https://arxiv.org/abs/2406.14909
Authors: Tianyu Fu,Haofeng Huang,Xuefei Ning,Genghan Zhang,Boju Chen,Tianqi Wu,Hongyi Wang,Zixiao Huang,Shiyao Li,Shengen Yan,Guohao Dai,Huazhong Yang,Yu Wang
Keywords: Large Language Models, Large Language, demands of Large, Language Models, attention
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 10 pages

Abstract:Sparse attention can effectively mitigate the significant memory and throughput demands of Large Language Models (LLMs) in long contexts. Existing methods typically employ a uniform sparse attention mask, applying the same sparse pattern across different attention heads and input lengths. However, this uniform approach fails to capture the diverse attention patterns inherent in LLMs, ignoring their distinct accuracy-latency trade-offs. To address this challenge, we propose the Mixture of Attention (MoA), which automatically tailors distinct sparse attention configurations to different heads and layers. MoA constructs and navigates a search space of various attention patterns and their scaling rules relative to input sequence lengths. It profiles the model, evaluates potential configurations, and pinpoints the optimal sparse attention compression plan. MoA adapts to varying input sizes, revealing that some attention heads expand their focus to accommodate longer sequences, while other heads consistently concentrate on fixed-length local contexts. Experiments show that MoA increases the effective context length by 3.9× with the same average attention span, boosting retrieval accuracy by 1.5-7.1× over the uniform-attention baseline across Vicuna-7B, Vicuna-13B, and Llama3-8B models. Moreover, MoA narrows the capability gaps between sparse and dense models, reducing the maximum relative performance drop from 9%-36% to within 5% across two long-context understanding benchmarks. MoA achieves a 1.2-1.4× GPU memory reduction and boosts decode throughput by 5.5-6.7× for 7B and 13B dense models on a single GPU, with minimal impact on performance.
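
The contrast between heads with a fixed local window and heads whose window grows with the input can be sketched with per-head causal sliding-window masks. Everything below is illustrative: the linear rule `span = alpha + beta * seq_len`, the head names, and the coefficient values are assumptions of this sketch, not configurations reported in the abstract, and a real implementation would build the masks as tensors on the GPU.

```python
def head_span(alpha, beta, seq_len):
    """Attention span of one head under an assumed linear scaling rule."""
    span = int(alpha + beta * seq_len)
    return max(1, min(seq_len, span))  # clamp to a valid window size


def causal_window_mask(seq_len, span):
    """mask[i][j] is True when query i may attend to key j.

    Causal (j <= i) and limited to the most recent `span` positions.
    """
    return [[(j <= i) and (i - j < span) for j in range(seq_len)]
            for i in range(seq_len)]


# Hypothetical per-head rules as (alpha, beta): one head keeps a fixed
# local window, the other widens its window as the input gets longer.
head_rules = {"local_head": (4, 0.0), "growing_head": (2, 0.5)}

for seq_len in (8, 16):
    for name, (alpha, beta) in head_rules.items():
        span = head_span(alpha, beta, seq_len)
        mask = causal_window_mask(seq_len, span)
        kept = sum(sum(row) for row in mask)  # number of attended pairs
        print(seq_len, name, span, kept)
```

A uniform baseline corresponds to giving every head the same `(alpha, beta)`; a search over per-head values under a fixed average span is the kind of budget allocation the abstract describes.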

[AI-58] GIEBench: Towards Holistic Evaluation of Group Identity-based Empathy for Large Language Models

Link: https://arxiv.org/abs/2406.14903
Authors: Leyan Wang,Yonggang Jin,Tianhao Shen,Tianyu Zheng,Xinrun Du,Chenchen Zhang,Wenhao Huang,Jiaheng Liu,Shi Wang,Ge Zhang,Liuyu Xiang,Zhaofeng He
Keywords: large language models, group identities, gain widespread application, language models, continue to develop
Categories: Artificial Intelligence (cs.AI)

Abstract:As large language models (LLMs) continue to develop and gain widespread application, the ability of LLMs to exhibit empathy towards diverse group identities and understand their perspectives is increasingly recognized as critical. Most existing benchmarks for empathy evaluation of LLMs focus primarily on universal human emotions, such as sadness and pain, often overlooking the context of individuals’ group identities. To address this gap, we introduce GIEBench, a comprehensive benchmark that includes 11 identity dimensions, covering 97 group identities with a total of 999 single-choice questions related to specific group identities. GIEBench is designed to evaluate the empathy of LLMs when presented with specific group identities such as gender, age, occupation, and race, emphasizing their ability to respond from the standpoint of the identified group. This supports the ongoing development of empathetic LLM applications tailored to users with different identities. Our evaluation of 23 LLMs revealed that while these LLMs understand different identity standpoints, they fail to consistently exhibit equal empathy across these identities without explicit instructions to adopt those perspectives. This highlights the need for improved alignment of LLMs with diverse values to better accommodate the multifaceted nature of human identities. Our datasets are available at this https URL.

[AI-59] Talking the Talk Does Not Entail Walking the Walk: On the Limits of Large Language Models in Lexical Entailment Recognition

Link: https://arxiv.org/abs/2406.14894
Authors: Candida M. Greco,Lucio La Cava,Andrea Tagarelli
Keywords: providing the structure, form the backbone, Large Language Models, lexical entailment, Verbs form
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Information Retrieval (cs.IR); Physics and Society (physics.soc-ph)

Abstract:Verbs form the backbone of language, providing the structure and meaning to sentences. Yet, their intricate semantic nuances pose a longstanding challenge. Understanding verb relations through the concept of lexical entailment is crucial for comprehending sentence meanings and grasping verb dynamics. This work investigates the capabilities of eight Large Language Models in recognizing lexical entailment relations among verbs through differently devised prompting strategies and zero-/few-shot settings over verb pairs from two lexical databases, namely WordNet and HyperLex. Our findings unveil that the models can tackle the lexical entailment recognition task with moderately good performance, although at varying degree of effectiveness and under different conditions. Also, utilizing few-shot prompting can enhance the models’ performance. However, perfectly solving the task arises as an unmet challenge for all examined LLMs, which raises an emergence for further research developments on this topic.

[AI-60] Training Greedy Policy for Proposal Batch Selection in Expensive Multi-Objective Combinatorial Optimization

Link: https://arxiv.org/abs/2406.14876
Authors: Deokjae Lee,Hyun Oh Song,Kyunghyun Cho
Keywords: subset selection problem, batch acquisition score, batch acquisition, challenging subset selection, Active learning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: ICML 2024; Codes at this https URL

Abstract:Active learning is increasingly adopted for expensive multi-objective combinatorial optimization problems, but it involves a challenging subset selection problem, optimizing the batch acquisition score that quantifies the goodness of a batch for evaluation. Due to the excessively large search space of the subset selection problem, prior methods optimize the batch acquisition on the latent space, which has discrepancies with the actual space, or optimize individual acquisition scores without considering the dependencies among candidates in a batch instead of directly optimizing the batch acquisition. To manage the vast search space, a simple and effective approach is the greedy method, which decomposes the problem into smaller subproblems, yet it has difficulty in parallelization since each subproblem depends on the outcome from the previous ones. To this end, we introduce a novel greedy-style subset selection algorithm that optimizes batch acquisition directly on the combinatorial space by sequential greedy sampling from the greedy policy, specifically trained to address all greedy subproblems concurrently. Notably, our experiments on the red fluorescent proteins design task show that our proposed method achieves the baseline performance in 1.69x fewer queries, demonstrating its efficiency.
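
The plain greedy method that the abstract takes as its starting point can be sketched as follows. The function `greedy_batch`, the toy coverage-based score, and the candidate names are all hypothetical; the paper's contribution is a trained policy that handles these greedy subproblems concurrently, which this sketch does not attempt to reproduce.

```python
def greedy_batch(candidates, batch_score, batch_size):
    """Build a batch one item at a time.

    At every step the candidate that gives the highest score for the
    whole batch is added, so dependencies among batch members are taken
    into account. Each step depends on the previous one, which is why
    this loop is hard to parallelize.
    """
    batch = []
    remaining = list(candidates)
    for _ in range(min(batch_size, len(remaining))):
        best = max(remaining, key=lambda c: batch_score(batch + [c]))
        batch.append(best)
        remaining.remove(best)
    return batch


# Toy stand-in for a batch acquisition score: how many distinct regions
# of the search space the batch covers (made-up data).
coverage = {"a": {1, 2, 3}, "b": {3, 4}, "c": {1, 2}, "d": {5}}


def coverage_score(batch):
    covered = set()
    for name in batch:
        covered |= coverage[name]
    return len(covered)


selected = greedy_batch(["a", "b", "c", "d"], coverage_score, batch_size=2)
print(selected, coverage_score(selected))
```

Scoring candidates individually would rank "c" level with "b", even though "c" adds nothing once "a" is in the batch; evaluating the batch as a whole avoids that redundancy.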

[AI-61] I don’t trust you (anymore)! – The effect of students’ LLM use on Lecturer-Student-Trust in Higher Education

Link: https://arxiv.org/abs/2406.14871
Authors: Simon Kloker,Matthew Bazanya,Twaha Kateete
Keywords: Large Language Models, Team Trust, encompassing teaching, plays a pivotal, pivotal role
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: Working Paper

Abstract:Trust plays a pivotal role in Lecturer-Student-Collaboration, encompassing teaching and research aspects. The advent of Large Language Models (LLMs) in platforms like Open AI’s ChatGPT, coupled with their cost-effectiveness and high-quality results, has led to their rapid adoption among university students. However, discerning genuine student input from LLM-generated output poses a challenge for lecturers. This dilemma jeopardizes the trust relationship between lecturers and students, potentially impacting university downstream activities, particularly collaborative research initiatives. Despite attempts to establish guidelines for student LLM use, a clear framework mutually beneficial for lecturers and students in higher education remains elusive. This study addresses the research question: How does the use of LLMs by students impact Informational and Procedural Justice, influencing Team Trust and Expected Team Performance? Methodically, we applied a quantitative construct-based survey, evaluated using techniques of Structural Equation Modelling (PLS-SEM) to examine potential relationships among these constructs. Our findings based on 23 valid respondents from Ndejje University indicate that lecturers are less concerned about the fairness of LLM use per se but are more focused on the transparency of student utilization, which significantly influences Team Trust positively. This research contributes to the global discourse on integrating and regulating LLMs and subsequent models in education. We propose that guidelines should support LLM use while enforcing transparency in Lecturer-Student-Collaboration to foster Team Trust and Performance. The study contributes valuable insights for shaping policies enabling ethical and transparent LLMs usage in education to ensure effectiveness of collaborative learning environments.

[AI-62] DistiLRR: Transferring Code Repair for Low-Resource Programming Languages

Link: https://arxiv.org/abs/2406.14867
Authors: Kyle Wong,Alfonso Amayuelas,Liangming Pan,William Yang Wang
Keywords: Large language models, shown remarkable performance, Large language, code generation tasks, code repair
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:Large language models (LLMs) have shown remarkable performance on code generation tasks. A recent application of LLMs for code generation is iterative code repair, where a model fixes an incorrect program by rationalizing about errors and generating a new program. However, code repair is primarily studied on high-resource languages like Python, and the framework’s efficacy is under-explored on low-resource languages. To apply code repair for low-resource languages, we propose Distilling Low-Resource Repairs (DistiLRR), an approach that transfers the reasoning and code generation ability from a teacher model to a student model. Our results show that DistiLRR consistently outperforms baselines on low-resource languages, but has similar performance on high-resource languages. To investigate this behavior, we perform a further analysis and find that the correlation between rationale quality and code correctness is weaker than previously perceived. We hypothesize this weakness is magnified in low-resource settings where base models lack deep knowledge of a programming language, leading to wavering benefits of code repair between high-resource and low-resource languages.

[AI-63] AI-based Anomaly Detection for Clinical-Grade Histopathological Diagnostics

Link: https://arxiv.org/abs/2406.14866
Authors: Jonas Dippel,Niklas Prenißl,Julius Hense,Philipp Liznerski,Tobias Winterhoff,Simon Schallenberg,Marius Kloft,Oliver Buchstab,David Horst,Maximilian Alber,Lukas Ruff,Klaus-Robert Müller,Frederick Klauschen
Keywords: previous studies, studies have demonstrated, demonstrated the potential, diseases, clinical implementation
Categories: Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)

Abstract:While previous studies have demonstrated the potential of AI to diagnose diseases in imaging data, clinical implementation is still lagging behind. This is partly because AI models require training with large numbers of examples only available for common diseases. In clinical reality, however, only few diseases are common, whereas the majority of diseases are less frequent (long-tail distribution). Current AI models overlook or misclassify these diseases. We propose a deep anomaly detection approach that only requires training data from common diseases to detect also all less frequent diseases. We collected two large real-world datasets of gastrointestinal biopsies, which are prototypical of the problem. Herein, the ten most common findings account for approximately 90% of cases, whereas the remaining 10% contained 56 disease entities, including many cancers. 17 million histological images from 5,423 cases were used for training and evaluation. Without any specific training for the diseases, our best-performing model reliably detected a broad spectrum of infrequent (“anomalous”) pathologies with 95.0% (stomach) and 91.0% (colon) AUROC and generalized across scanners and hospitals. By design, the proposed anomaly detection can be expected to detect any pathological alteration in the diagnostic tail of gastrointestinal biopsies, including rare primary or metastatic cancers. This study establishes the first effective clinical application of AI-based anomaly detection in histopathology that can flag anomalous cases, facilitate case prioritization, reduce missed diagnoses and enhance the general safety of AI models, thereby driving AI adoption and automation in routine diagnostics and beyond.

[AI-64] From LLMs to MLLMs: Exploring the Landscape of Multimodal Jailbreaking

Link: https://arxiv.org/abs/2406.14859
Authors: Siyuan Wang,Zhuohan Long,Zhihao Fan,Zhongyu Wei
Keywords: Large Language Models, Language Models, Multimodal Large Language, Large Language, Models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:The rapid development of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) has exposed vulnerabilities to various adversarial attacks. This paper provides a comprehensive overview of jailbreaking research targeting both LLMs and MLLMs, highlighting recent advancements in evaluation benchmarks, attack techniques and defense strategies. Compared to the more advanced state of unimodal jailbreaking, multimodal domain remains underexplored. We summarize the limitations and potential research directions of multimodal jailbreaking, aiming to inspire future research and further enhance the robustness and security of MLLMs.

[AI-65] PEANO-ViT: Power-Efficient Approximations of Non-Linearities in Vision Transformers

Link: https://arxiv.org/abs/2406.14854
Authors: Mohammad Erfan Sadeghi,Arash Fayyazi,Seyedarmin Azizi,Massoud Pedram
Keywords: Field-Programmable Gate Arrays, Error Linear Unit, Gaussian Error Linear, specially Field-Programmable Gate, Gate Arrays
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)

Abstract:The deployment of Vision Transformers (ViTs) on hardware platforms, especially Field-Programmable Gate Arrays (FPGAs), presents many challenges, which are mainly due to the substantial computational and power requirements of their non-linear functions, notably layer normalization, softmax, and Gaussian Error Linear Unit (GELU). These critical functions pose significant obstacles to efficient hardware implementation due to their complex mathematical operations and the inherent resource count and architectural limitations of FPGAs. PEANO-ViT offers a novel approach to streamlining the implementation of the layer normalization layer by introducing a division-free technique that simultaneously approximates the division and square root function. Additionally, PEANO-ViT provides a multi-scale division strategy to eliminate division operations in the softmax layer, aided by a Pade-based approximation for the exponential function. Finally, PEANO-ViT introduces a piece-wise linear approximation for the GELU function, carefully designed to bypass the computationally intensive operations associated with GELU. In our comprehensive evaluations, PEANO-ViT exhibits minimal accuracy degradation (≤ 0.5% for DeiT-B) while significantly enhancing power efficiency, achieving improvements of 1.91x, 1.39x, 8.01x for layer normalization, softmax, and GELU, respectively. This improvement is achieved through substantial reductions in DSP, LUT, and register counts for these non-linear operations. Consequently, PEANO-ViT enables efficient deployment of Vision Transformers on resource- and power-constrained FPGA platforms.
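
To give a feel for what a piece-wise linear GELU approximation looks like, here is a minimal floating-point sketch. The knot range of -4 to 4, the 0.5 spacing, and the tail handling are assumptions chosen for the example; the paper's actual segments, and its fixed-point FPGA implementation, are not described in the abstract.

```python
import math


def gelu(x):
    """Exact GELU: x * Phi(x), written with the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


# Hypothetical knot placement: evenly spaced on [-4, 4] with step 0.5.
LOW, HIGH, STEP = -4.0, 4.0, 0.5
KNOTS = [LOW + i * STEP for i in range(int((HIGH - LOW) / STEP) + 1)]
VALUES = [gelu(k) for k in KNOTS]  # precomputed table, as a LUT would hold


def gelu_pwl(x):
    """Piece-wise linear GELU: table lookup plus one multiply-add."""
    if x <= LOW:
        return 0.0   # GELU is close to 0 for very negative inputs
    if x >= HIGH:
        return x     # GELU is close to the identity for large inputs
    # Index of the segment containing x, clamped for safety at the edge.
    i = min(int((x - LOW) / STEP), len(KNOTS) - 2)
    t = (x - KNOTS[i]) / STEP
    return VALUES[i] + t * (VALUES[i + 1] - VALUES[i])


grid = [i / 100.0 for i in range(-600, 601)]
max_error = max(abs(gelu(x) - gelu_pwl(x)) for x in grid)
print(f"max abs error on [-6, 6]: {max_error:.4f}")
```

The largest error sits near zero, where GELU curves the most, so non-uniform knots that are denser around the origin would tighten the approximation without enlarging the table by much.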

[AI-66] Is A Picture Worth A Thousand Words? Delving Into Spatial Reasoning for Vision Language Models

Link: https://arxiv.org/abs/2406.14852
Authors: Jiayu Wang, Yifei Ming, Zhenmei Shi, Vibhav Vineet, Xin Wang, Neel Joshi
Keywords: Large language models, demonstrated remarkable performance, Large language, tasks and domains, demonstrated remarkable
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Large language models (LLMs) and vision-language models (VLMs) have demonstrated remarkable performance across a wide range of tasks and domains. Despite this promise, spatial understanding and reasoning – a fundamental component of human cognition – remains under-explored. We develop novel benchmarks that cover diverse aspects of spatial reasoning such as relationship understanding, navigation, and counting. We conduct a comprehensive evaluation of competitive language and vision-language models. Our findings reveal several counter-intuitive insights that have been overlooked in the literature: (1) Spatial reasoning poses significant challenges where competitive models can fall behind random guessing; (2) Despite additional visual input, VLMs often under-perform compared to their LLM counterparts; (3) When both textual and visual information is available, multi-modal language models become less reliant on visual information if sufficient textual clues are provided. Additionally, we demonstrate that leveraging redundancy between vision and text can significantly enhance model performance. We hope our study will inform the development of multimodal models to improve spatial intelligence and further close the gap with human intelligence.

[AI-67] DN-CL: Deep Symbolic Regression against Noise via Contrastive Learning

Link: https://arxiv.org/abs/2406.14844
Authors: Jingyi Liu, Yanjie Li, Lina Yu, Min Wu, Weijun Li, Wenqiang Li, Meilan Hao, Yusong Deng, Shu Wei
Keywords: factors including physical, numerous factors including, Noise ubiquitously exists, including physical, environmental effects
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Noise ubiquitously exists in signals due to numerous factors including physical, electronic, and environmental effects. Traditional methods of symbolic regression, such as genetic programming or deep learning models, aim to find the most fitting expressions for these signals. However, these methods often overlook the noise present in real-world data, leading to reduced fitting accuracy. To tackle this issue, we propose Deep Symbolic Regression against Noise via Contrastive Learning (DN-CL). DN-CL employs two parameter-sharing encoders to embed data points from various data transformations into feature shields against noise. This model treats noisy data and clean data as different views of the ground-truth mathematical expressions. Distances between these features are minimized, utilizing contrastive learning to distinguish between ‘positive’ noise-corrected pairs and ‘negative’ contrasting pairs. Our experiments indicate that DN-CL demonstrates superior performance in handling both noisy and clean data, presenting a promising method of symbolic regression.
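
The idea of pulling a noisy view and a clean view of the same expression together while pushing other expressions away can be illustrated with a generic InfoNCE-style loss. This is only a sketch of the general contrastive objective, not DN-CL's actual loss: the abstract does not give the formula, and the function names, the cosine similarity, and the temperature value are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss for one anchor (assumed form, not from the paper).

    anchor:    features of one view (e.g. the clean data)
    positive:  features of the other view of the same expression (noisy data)
    negatives: features of views that come from different expressions
    """
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


# Toy 2-D features: the noisy view stays close to the clean view.
clean = [1.0, 0.0]
noisy = [0.9, 0.1]
others = [[0.0, 1.0], [-1.0, 0.0]]
loss = info_nce(clean, noisy, others)
```

The loss is small when the positive pair is more similar than every negative pair, and it grows when an unrelated expression lands closer to the anchor than its own noisy view.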

[AI-68] Automated architectural space layout planning using a physics-inspired generative design framework

Link: https://arxiv.org/abs/2406.14840
Authors: Zhipeng Li, Sichao Li, Geoff Hinchcliffe, Noam Maitless, Nick Birbilis
Keywords: space layout planning, space layout, primary activities, layout planning, schematic design stage
Categories: Artificial Intelligence (cs.AI)

Abstract:The determination of space layout is one of the primary activities in the schematic design stage of an architectural project. The initial layout planning defines the shape, dimension, and circulation pattern of internal spaces, which can also affect the performance and cost of the construction. When carried out manually, space layout planning can be complicated, repetitive and time-consuming. In this work, a generative design framework for the automatic generation of spatial architectural layout has been developed. The proposed approach integrates a novel physics-inspired parametric model for space layout planning and an evolutionary optimisation metaheuristic. Results revealed that such a generative design framework can generate a wide variety of design suggestions at the schematic design stage, applicable to complex design problems.

[AI-69] Latent diffusion models for parameterization and data assimilation of facies-based geomodels

Link: https://arxiv.org/abs/2406.14815
Authors: Guido Di Federico, Louis J. Durlofsky
Keywords: Geological parameterization entails, latent diffusion model, porosity and permeability, Diffusion models, entails the representation
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Geophysics (physics.geo-ph)

Abstract:Geological parameterization entails the representation of a geomodel using a small set of latent variables and a mapping from these variables to grid-block properties such as porosity and permeability. Parameterization is useful for data assimilation (history matching), as it maintains geological realism while reducing the number of variables to be determined. Diffusion models are a new class of generative deep-learning procedures that have been shown to outperform previous methods, such as generative adversarial networks, for image generation tasks. Diffusion models are trained to “denoise”, which enables them to generate new geological realizations from input fields characterized by random noise. Latent diffusion models, which are the specific variant considered in this study, provide dimension reduction through use of a low-dimensional latent variable. The model developed in this work includes a variational autoencoder for dimension reduction and a U-net for the denoising process. Our application involves conditional 2D three-facies (channel-levee-mud) systems. The latent diffusion model is shown to provide realizations that are visually consistent with samples from geomodeling software. Quantitative metrics involving spatial and flow-response statistics are evaluated, and general agreement between the diffusion-generated models and reference realizations is observed. Stability tests are performed to assess the smoothness of the parameterization method. The latent diffusion model is then used for ensemble-based data assimilation. Two synthetic “true” models are considered. Significant uncertainty reduction, posterior P10-P90 forecasts that generally bracket observed data, and consistent posterior geomodels, are achieved in both cases.

[AI-70] Securing the Future: Proactive Threat Hunting for Sustainable IoT Ecosystems

Link: https://arxiv.org/abs/2406.14804
Authors: Saeid Ghasemshirazi, Ghazaleh Shirvani
Keywords: rapidly evolving landscape, proactive threat hunting, threat hunting, proactive threat, paramount concern
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

Abstract:In the rapidly evolving landscape of the IoT, the security of connected devices has become a paramount concern. This paper explores the concept of proactive threat hunting as a pivotal strategy for enhancing the security and sustainability of IoT systems. Proactive threat hunting is an alternative to traditional reactive security measures that analyses IoT networks continuously and in advance to find and eliminate threats before they occur. By improving the security posture of IoT devices, this approach significantly contributes to extending IoT operational lifespan and reduces environmental impact. By integrating security metrics similar to the Common Vulnerability Scoring System (CVSS) into consumer platforms, this paper argues that proactive threat hunting can elevate user awareness about the security of IoT devices. This has the potential to impact consumer choices and encourage a security-conscious mindset in both the manufacturing and user communities. Through a comprehensive analysis, this study demonstrates how proactive threat hunting can contribute to the development of a more secure, sustainable, and user-aware IoT ecosystem.

[AI-71] Probabilistic Emulation of a Global Climate Model with Spherical DYffusion

Link: https://arxiv.org/abs/2406.14798
Authors: Salva Rühling Cachay, Brian Henn, Oliver Watt-Meyer, Christopher S. Bretherton, Rose Yu
Keywords: Data-driven deep learning, global weather forecasting, transforming global weather, deep learning models, weather forecasting
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (stat.ML)

Abstract:Data-driven deep learning models are on the verge of transforming global weather forecasting. It is an open question if this success can extend to climate modeling, where long inference rollouts and data complexity pose significant challenges. Here, we present the first conditional generative model able to produce global climate ensemble simulations that are accurate and physically consistent. Our model runs at 6-hourly time steps and is shown to be stable for 10-year-long simulations. Our approach beats relevant baselines and nearly reaches a gold standard for successful climate model emulation. We discuss the key design choices behind our dynamics-informed diffusion model-based approach which enables this significant step towards efficient, data-driven climate simulations that can help us better understand the Earth and adapt to a changing climate.

[AI-72] Camera-Invariant Meta-Learning Network for Single-Camera-Training Person Re-identification

Link: https://arxiv.org/abs/2406.14797
Authors: Jiangbo Pei, Zhuqing Jiang, Aidong Men, Haiying Wang, Haiyong Luo, Shiping Wen
Keywords: SCT re-ID, SCT, SCT datasets, aims to train, re-ID
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Single-camera-training person re-identification (SCT re-ID) aims to train a re-ID model using SCT datasets where each person appears in only one camera. The main challenge of SCT re-ID is to learn camera-invariant feature representations without cross-camera same-person (CCSP) data as supervision. Previous methods address it by assuming that the most similar person should be found in another camera. However, this assumption is not guaranteed to be correct. In this paper, we propose a Camera-Invariant Meta-Learning Network (CIMN) for SCT re-ID. CIMN assumes that the camera-invariant feature representations should be robust to camera changes. To this end, we split the training data into a meta-train set and a meta-test set based on camera IDs and perform a cross-camera simulation via a meta-learning strategy, aiming to enforce the representations learned from the meta-train set to be robust to the meta-test set. With the cross-camera simulation, CIMN can learn camera-invariant and identity-discriminative representations even when there are no CCSP data. However, this simulation also causes the separation of the meta-train set and the meta-test set, which ignores some beneficial relations between them. Thus, we introduce three losses: meta triplet loss, meta classification loss, and meta camera alignment loss, to leverage the ignored relations. The experiment results demonstrate that our method achieves comparable performance with and without CCSP data, and outperforms the state-of-the-art methods on SCT re-ID benchmarks. In addition, it is also effective in improving the domain generalization ability of the model.
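
The camera-based split described in the abstract above can be sketched in a few lines of Python. This is a hypothetical illustration, not CIMN's implementation: the `(image_id, person_id, camera_id)` tuple layout, the function name `split_by_camera`, and the choice of how many cameras go to the meta-test side are all assumptions, and the paper may well resample the split at every training iteration.

```python
import random
from collections import defaultdict


def split_by_camera(samples, num_meta_test_cams=1, seed=0):
    """Split samples into meta-train / meta-test sets with disjoint cameras.

    samples: list of (image_id, person_id, camera_id) tuples (assumed layout).
    """
    by_cam = defaultdict(list)
    for sample in samples:
        by_cam[sample[2]].append(sample)

    cams = sorted(by_cam)
    if not 0 < num_meta_test_cams < len(cams):
        raise ValueError("need at least one camera on each side of the split")

    # Hold out a random subset of cameras to simulate a camera change.
    rng = random.Random(seed)
    test_cams = set(rng.sample(cams, num_meta_test_cams))

    meta_train = [s for c in cams if c not in test_cams for s in by_cam[c]]
    meta_test = [s for c in cams if c in test_cams for s in by_cam[c]]
    return meta_train, meta_test


# Toy SCT-style data: each person appears under exactly one camera.
data = [("img0", "p0", 0), ("img1", "p1", 0), ("img2", "p2", 1), ("img3", "p3", 2)]
train, test = split_by_camera(data)
```

Because no camera appears on both sides, a model that fits the meta-train set and is then evaluated on the meta-test set experiences a simulated cross-camera shift, even though no person is ever seen under two cameras.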

[AI-73] MU-Bench: A Multitask Multimodal Benchmark for Machine Unlearning

Link: https://arxiv.org/abs/2406.14796
Authors: Jiali Cheng, Hadi Amiri
Keywords: Recent advancements, Machine Unlearning, trained models, sensitive information, introduced solutions
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Recent advancements in Machine Unlearning (MU) have introduced solutions to selectively remove certain training samples, such as those with outdated or sensitive information, from trained models. Despite these advancements, evaluation of MU methods has been inconsistent, employing different trained models and architectures, and sample removal strategies, which hampers accurate comparison. In addition, prior MU approaches have mainly focused on singular tasks or modalities, which is not comprehensive. To address these limitations, we develop MU-Bench, the first comprehensive benchmark for MU that (i) unifies the sets of deleted samples and trained models, and (ii) provides broad coverage of tasks and data modalities, including previously unexplored domains such as speech and video classification. Our evaluation shows that RandLabel and SalUn are the most effective general MU approaches on MU-Bench, and BadT and SCRUB are capable of achieving random performance on the deletion set. We analyze several under-investigated aspects of unlearning, including scalability, the impacts of parameter-efficient fine-tuning and curriculum learning, and susceptibility to dataset biases. MU-Bench provides an easy-to-use package that includes dataset splits, models, and implementations, together with a leaderboard to enable unified and scalable MU research.

[AI-74] ACR: A Benchmark for Automatic Cohort Retrieval

Link: https://arxiv.org/abs/2406.14780
Authors: Dung Ngoc Thai, Victor Ardulov, Jose Ulises Mena, Simran Tiwari, Gleb Erofeev, Ramy Eskander, Karim Tarabishy, Ravi B Parikh, Wael Salloum
Keywords: including clinical trial, clinical trial recruitment, Identifying patient cohorts, numerous healthcare tasks, including clinical
Categories: Artificial Intelligence (cs.AI)

Abstract:Identifying patient cohorts is fundamental to numerous healthcare tasks, including clinical trial recruitment and retrospective studies. Current cohort retrieval methods in healthcare organizations rely on automated queries of structured data combined with manual curation, which are time-consuming, labor-intensive, and often yield low-quality results. Recent advancements in large language models (LLMs) and information retrieval (IR) offer promising avenues to revolutionize these systems. Major challenges include managing extensive eligibility criteria and handling the longitudinal nature of unstructured Electronic Medical Records (EMRs) while ensuring that the solution remains cost-effective for real-world application. This paper introduces a new task, Automatic Cohort Retrieval (ACR), and evaluates the performance of LLMs and commercial, domain-specific neuro-symbolic approaches. We provide a benchmark task, a query dataset, an EMR dataset, and an evaluation framework. Our findings underscore the necessity for efficient, high-quality ACR systems capable of longitudinal reasoning across extensive patient databases.

[AI-75] Learning to Select Goals in Automated Planning with Deep-Q Learning

Link: https://arxiv.org/abs/2406.14779
Authors: Carlos Núñez-Molina, Juan Fernández-Olivares, Raúl Pérez
Keywords: acting architecture endowed, standard Deep Q-Learning, Deep Q-Learning, work we propose, propose a planning
Categories: Artificial Intelligence (cs.AI)
Comments: 25 pages, 4 figures

Abstract:In this work we propose a planning and acting architecture endowed with a module which learns to select subgoals with Deep Q-Learning. This allows us to decrease the load of a planner when faced with scenarios with real-time restrictions. We have trained this architecture on a video game environment used as a standard test-bed for intelligent systems applications, testing it on different levels of the same game to evaluate its generalization abilities. We have measured the performance of our approach as more training data is made available, as well as compared it with both a state-of-the-art, classical planner and the standard Deep Q-Learning algorithm. The results obtained show our model performs better than the alternative methods considered, when both plan quality (plan length) and time requirements are taken into account. On the one hand, it is more sample-efficient than standard Deep Q-Learning, and it is able to generalize better across levels. On the other hand, it reduces problem-solving time when compared with a state-of-the-art automated planner, at the expense of obtaining plans with only 9% more actions.
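
As a rough illustration of what "learning to select subgoals" means, the sketch below uses a tabular Q-learning update as a stand-in for the deep Q-network. It is an assumption-laden toy, not the authors' architecture: the state and subgoal names, the reward, the learning rate `alpha`, the discount `gamma`, and both function names are hypothetical, and the paper's planner (which would compute the plan to the chosen subgoal) is not modelled.

```python
from collections import defaultdict


def select_subgoal(q, state, subgoals):
    """Greedy choice: the subgoal with the highest learned Q-value."""
    return max(subgoals, key=lambda g: q[(state, g)])


def q_update(q, state, subgoal, reward, next_state, next_subgoals,
             alpha=0.5, gamma=0.9):
    """One Q-learning step: move Q(state, subgoal) toward the TD target."""
    best_next = max((q[(next_state, g)] for g in next_subgoals), default=0.0)
    target = reward + gamma * best_next
    q[(state, subgoal)] += alpha * (target - q[(state, subgoal)])
    return q[(state, subgoal)]


# Toy usage: the (hypothetical) subgoal "key" pays off, so it becomes preferred.
q = defaultdict(float)
q_update(q, "s0", "key", reward=1.0, next_state="s1", next_subgoals=["door"])
choice = select_subgoal(q, "s0", ["door", "key"])
```

In a setup like the one the abstract describes, a classical planner would then only need to solve the short problem of reaching the chosen subgoal, which is where the reduction in planning load would come from.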

[AI-76] How critically can an AI think? A framework for evaluating the quality of thinking of generative artificial intelligence

Link: https://arxiv.org/abs/2406.14769
Authors: Luke Zaphir, Jason M. Lodge, Jacinta Lisec, Dom McGrath, Hassan Khosravi
Keywords: large language models, assessment design practices, large language, language models, models have created
Categories: Artificial Intelligence (cs.AI)

Abstract:Generative AI tools, such as those built on large language models, have created opportunities for innovative assessment design practices. Due to recent technological developments, there is a need to know the limits and capabilities of generative AI in terms of simulating cognitive skills. Assessing student critical thinking skills has been a feature of assessment since time immemorial, but the demands of digital assessment create unique challenges for equity, academic integrity and assessment authorship. Educators need a framework for determining their assessments' vulnerability to generative AI to inform assessment design practices. This paper presents a framework that explores the capabilities of the LLM ChatGPT4 application, which is the current industry benchmark. This paper presents the Mapping of questions, AI vulnerability testing, Grading, Evaluation (MAGE) framework, which educators can use to methodically critique their assessments within their own disciplinary contexts. This critique will provide specific and targeted indications of their questions' vulnerabilities in terms of the critical thinking skills. This can go on to form the basis of assessment design for their tasks.

[AI-77] ChatGPT as Research Scientist: Probing GPT's Capabilities as a Research Librarian, Research Ethicist, Data Generator, and Data Predictor

Link: https://arxiv.org/abs/2406.14765
Authors: Steven A. Lehr, Aylin Caliskan, Suneragiri Liyanage, Mahzarin R. Banaji
Keywords: Research Ethicist, Research Librarian, research, Study, Data Generator
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Main article is 14 pages, 1 table. Includes SI Appendix: 26 pages, 12 tables, 2 figures. Total: 40 pages, 13 tables, 2 figures. Under revised review at PNAS

Abstract:How good a research scientist is ChatGPT? We systematically probed the capabilities of GPT-3.5 and GPT-4 across four central components of the scientific process: as a Research Librarian, Research Ethicist, Data Generator, and Novel Data Predictor, using psychological science as a testing field. In Study 1 (Research Librarian), unlike human researchers, GPT-3.5 and GPT-4 hallucinated, authoritatively generating fictional references 36.0% and 5.4% of the time, respectively, although GPT-4 exhibited an evolving capacity to acknowledge its fictions. In Study 2 (Research Ethicist), GPT-4 (though not GPT-3.5) proved capable of detecting violations like p-hacking in fictional research protocols, correcting 88.6% of blatantly presented issues, and 72.6% of subtly presented issues. In Study 3 (Data Generator), both models consistently replicated patterns of cultural bias previously discovered in large language corpora, indicating that ChatGPT can simulate known results, an antecedent to usefulness for both data generation and skills like hypothesis generation. Contrastingly, in Study 4 (Novel Data Predictor), neither model was successful at predicting new results absent in their training data, and neither appeared to leverage substantially new information when predicting more versus less novel outcomes. Together, these results suggest that GPT is a flawed but rapidly improving librarian, a decent research ethicist already, capable of data generation in simple domains with known characteristics but poor at predicting novel patterns of empirical data to aid future experimentation.

[AI-78] RE-AdaptIR: Improving Information Retrieval through Reverse Engineered Adaptation

Link: https://arxiv.org/abs/2406.14764
Authors: William Fleshman, Benjamin Van Durme
Keywords: Large language models, Large language, fine-tuned for text-retrieval, text-retrieval have demonstrated, information retrieval
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:Large language models (LLMs) fine-tuned for text-retrieval have demonstrated state-of-the-art results across several information retrieval (IR) benchmarks. However, supervised training for improving these models requires numerous labeled examples, which are generally unavailable or expensive to acquire. In this work, we explore the effectiveness of extending reverse engineered adaptation to the context of information retrieval (RE-AdaptIR). We use RE-AdaptIR to improve LLM-based IR models using only unlabeled data. We demonstrate improved performance both in training domains as well as zero-shot in domains where the models have seen no queries. We analyze performance changes in various fine-tuning scenarios and offer findings of immediate use to practitioners.

[AI-79] A Learn-Then-Reason Model Towards Generalization in Knowledge Base Question Answering

Link: https://arxiv.org/abs/2406.14763
Authors: Lingxi Zhang, Jing Zhang, Yanling Wang, Cuiping Li, Hong Chen
Keywords: Wikidata house millions, Freebase and Wikidata, Large-scale knowledge bases, Base Question Answering, Wikidata house
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Large-scale knowledge bases (KBs) like Freebase and Wikidata house millions of pieces of structured knowledge. Knowledge Base Question Answering (KBQA) provides a user-friendly way to access these valuable KBs by asking natural language questions. In order to improve the generalization capabilities of KBQA models, extensive research has embraced a retrieve-then-reason framework to retrieve relevant evidence for logical expression generation. These multi-stage efforts prioritize acquiring external sources but overlook the incorporation of new knowledge into their model parameters. In effect, even advanced language models and retrievers have knowledge boundaries, thereby limiting the generalization capabilities of previous KBQA models. Therefore, this paper develops KBLLaMA, which follows a learn-then-reason framework to inject new KB knowledge into a large language model for flexible end-to-end KBQA. At the core of KBLLaMA, we study (1) how to organize new knowledge about KBQA and (2) how to facilitate the learning of the organized knowledge. Extensive experiments on various KBQA generalization tasks showcase the state-of-the-art performance of KBLLaMA. In particular, on the general benchmark GrailQA and the domain-specific benchmark Bio-chemical, KBLLaMA achieves performance gains of up to 3.8% and 9.8%, respectively, compared to the baselines.

[AI-80] Diffusion-Based Failure Sampling for Cyber-Physical Systems

Link: https://arxiv.org/abs/2406.14761
Authors: Harrison Delecki, Marc R. Schlichting, Mansur Arief, Anthony Corso, Marcell Vazquez-Chanlatte, Mykel J. Kochenderfer
Keywords: Validating safety-critical autonomous, safety-critical autonomous systems, Validating safety-critical, Markov chain Monte, chain Monte Carlo
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: Under review at RA-L

Abstract:Validating safety-critical autonomous systems in high-dimensional domains such as robotics presents a significant challenge. Existing black-box approaches based on Markov chain Monte Carlo may require an enormous number of samples, while methods based on importance sampling often rely on simple parametric families that may struggle to represent the distribution over failures. We propose to sample the distribution over failures using a conditional denoising diffusion model, which has shown success in complex high-dimensional problems such as robotic task planning. We iteratively train a diffusion model to produce state trajectories closer to failure. We demonstrate the effectiveness of our approach on high-dimensional robotic validation tasks, improving sample efficiency and mode coverage compared to existing black-box techniques.

[AI-81] An LLM Feature-based Framework for Dialogue Constructiveness Assessment

Link: https://arxiv.org/abs/2406.14760
Authors: Lexin Zhou, Youmna Farag, Andreas Vlachos
Keywords: analysing conversational factors, constructiveness assessment focuses, LLM feature-based models, predicting constructive outcomes, LLM feature-based
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Research on dialogue constructiveness assessment focuses on (i) analysing conversational factors that influence individuals to take specific actions, win debates, change their perspectives or broaden their open-mindedness and (ii) predicting constructive outcomes following dialogues for such use cases. These objectives can be achieved by training either interpretable feature-based models (which often involve costly human annotations) or neural models such as pre-trained language models (which have empirically shown higher task accuracy but lack interpretability). We propose a novel LLM feature-based framework that combines the strengths of feature-based and neural approaches while mitigating their downsides, in assessing dialogue constructiveness. The framework first defines a set of dataset-independent and interpretable linguistic features, which can be extracted by both prompting an LLM and simple heuristics. Such features are then used to train LLM feature-based models. We apply this framework to three datasets of dialogue constructiveness and find that our LLM feature-based models significantly outperform standard feature-based models and neural models, and tend to learn more robust prediction rules instead of relying on superficial shortcuts (as seen with neural models). Further, we demonstrate that interpreting these LLM feature-based models can yield valuable insights into what makes a dialogue constructive.

[AI-82] Compliance Cards: Computational Artifacts for Automated AI Regulation Compliance

Link: https://arxiv.org/abs/2406.14758
Authors: Bill Marino, Preslav Aleksandrov, Carwyn Rahman, Yulu Pi, Bill Shen, Rui-jie Yew, Nicholas D. Lane
Keywords: supply chain grows, incorporate externally-sourced ingredients, artificial intelligence, supply chain, grows more complex
Categories: Artificial Intelligence (cs.AI)

Abstract:As the artificial intelligence (AI) supply chain grows more complex, AI systems and models are increasingly likely to incorporate externally-sourced ingredients such as datasets and other models. In such cases, determining whether or not an AI system or model complies with the EU AI Act will require gathering compliance-related metadata about both the AI system or model at-large as well as those externally-supplied ingredients. There must then be an analysis that looks across all of this metadata to render a prediction about the compliance of the overall AI system or model. Up until now, this process has not been automated. Thus, it has not been possible to make real-time compliance determinations in scenarios where doing so would be advantageous, such as the iterative workflows of today’s AI developers, search and acquisition of AI ingredients on communities like Hugging Face, federated and continuous learning, and more. To address this shortcoming, we introduce a highly automated system for AI Act compliance analysis. This system has two key elements. First is an interlocking set of computational artifacts that capture compliance-related metadata about both: (1) the AI system or model at-large; (2) any constituent ingredients such as datasets and models. Second is an automated analysis algorithm that operates across those computational artifacts to render a run-time prediction about whether or not the overall AI system or model complies with the AI Act. Working together, these elements promise to enhance and accelerate AI Act compliance assessments.

[AI-83] A Large Language Model Outperforms Other Computational Approaches to the High-Throughput Phenotyping of Physician Notes

Link: https://arxiv.org/abs/2406.14757
Authors: Syed I. Munzir, Daniel B. Hier, Chelsea Oommen, Michael D. Carrithers
Keywords: standardized ontology concepts, electronic health records, Large Language Model, Large Language, Natural Language Processing
Categories: Artificial Intelligence (cs.AI)
Comments: Submitted to AMIA Annual Symposium 2024, San Francisco CA

Abstract:High-throughput phenotyping, the automated mapping of patient signs and symptoms to standardized ontology concepts, is essential to gaining value from electronic health records (EHR) in the support of precision medicine. Despite technological advances, high-throughput phenotyping remains a challenge. This study compares three computational approaches to high-throughput phenotyping: a Large Language Model (LLM) incorporating generative AI, a Natural Language Processing (NLP) approach utilizing deep learning for span categorization, and a hybrid approach combining word vectors with machine learning. The approach that implemented GPT-4 (a Large Language Model) demonstrated superior performance, suggesting that Large Language Models are poised to be the preferred method for high-throughput phenotyping of physician notes.

[AI-84] SciDMT: A Large-Scale Corpus for Detecting Scientific Mentions

Link: https://arxiv.org/abs/2406.14756
Authors: Huitong Pan, Qi Zhang, Cornelia Caragea, Eduard Dragut, Longin Jan Latecki
Keywords: existing related resources, scientific mention detection, offering a significant, related resources, mention detection
Categories: Artificial Intelligence (cs.AI)
Comments: LREC/COLING 2024

Abstract:We present SciDMT, an enhanced and expanded corpus for scientific mention detection, offering a significant advancement over existing related resources. SciDMT contains annotated scientific documents for datasets (D), methods (M), and tasks (T). The corpus consists of two components: 1) the SciDMT main corpus, which includes 48 thousand scientific articles with over 1.8 million weakly annotated mentions in the form of in-text spans, and 2) an evaluation set, which comprises 100 scientific articles manually annotated for evaluation purposes. To the best of our knowledge, SciDMT is the largest corpus for scientific entity mention detection. The corpus’s scale and diversity are instrumental in developing and refining models for tasks such as indexing scientific papers, enhancing information retrieval, and improving the accessibility of scientific knowledge. We demonstrate the corpus’s utility through experiments with advanced deep learning architectures like SciBERT and GPT-3.5. Our findings establish performance baselines and highlight unresolved challenges in scientific mention detection. SciDMT serves as a robust benchmark for the research community, encouraging the development of innovative models to further the field of scientific information extraction.

[AI-85] An Adapter-Based Unified Model for Multiple Spoken Language Processing Tasks

Link: https://arxiv.org/abs/2406.14747
Authors: Varsha Suresh, Salah Aït-Mokhtar, Caroline Brun, Ioan Calapodescu
Keywords: Self-supervised learning models, Self-supervised learning, revolutionized the field, Spoken Emotion Recognition, Automatic Speech Recognition
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: ICASSP 2024

Abstract:Self-supervised learning models have revolutionized the field of speech processing. However, the process of fine-tuning these models on downstream tasks requires substantial computational resources, particularly when dealing with multiple speech-processing tasks. In this paper, we explore the potential of adapter-based fine-tuning in developing a unified model capable of effectively handling multiple spoken language processing tasks. The tasks we investigate are Automatic Speech Recognition, Phoneme Recognition, Intent Classification, Slot Filling, and Spoken Emotion Recognition. We validate our approach through a series of experiments on the SUPERB benchmark, and our results indicate that adapter-based fine-tuning enables a single encoder-decoder model to perform multiple speech processing tasks with an average improvement of 18.4% across the five target tasks while staying efficient in terms of parameter updates.

[AI-86] Relation Extraction with Fine-Tuned Large Language Models in Retrieval Augmented Generation Frameworks

Link: https://arxiv.org/abs/2406.14745
Authors: Sefika Efeoglu,Adrian Paschke
Keywords: Knowledge Graphs, Information Extraction, converting unstructured data, formats like Knowledge, crucial for converting
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: preprint

Abstract:Information Extraction (IE) is crucial for converting unstructured data into structured formats like Knowledge Graphs (KGs). A key task within IE is Relation Extraction (RE), which identifies relationships between entities in text. Various RE methods exist, including supervised, unsupervised, weakly supervised, and rule-based approaches. Recent studies leveraging pre-trained language models (PLMs) have shown significant success in this area. In the current era dominated by Large Language Models (LLMs), fine-tuning these models can overcome limitations associated with zero-shot LLM prompting-based RE methods, especially regarding domain adaptation challenges and identifying implicit relations between entities in sentences. These implicit relations, which cannot be easily extracted from a sentence’s dependency tree, require logical inference for accurate identification. This work explores the performance of fine-tuned LLMs and their integration into the Retrieval Augmented-based (RAG) RE approach to address the challenges of identifying implicit relations at the sentence level, particularly when LLMs act as generators within the RAG framework. Empirical evaluations on the TACRED, TACRED-Revisited (TACREV), Re-TACRED, and SemEVAL datasets show significant performance improvements with fine-tuned LLMs, including Llama2-7B, Mistral-7B, and T5 (Large). Notably, our approach achieves substantial gains on SemEVAL, where implicit relations are common, surpassing previous results on this dataset. Additionally, our method outperforms previous works on TACRED, TACREV, and Re-TACRED, demonstrating exceptional performance across diverse evaluation scenarios.

[AI-87] Training Next Generation AI Users and Developers at NCSA

Link: https://arxiv.org/abs/2406.14744
Authors: Daniel S. Katz,Volodymyr Kindratenko,Olena Kindratenko,Priyam Mazumdar
Keywords: Supercomputing Applications, National Center, Center for Supercomputing, University of Illinois, training work carried
Subjects: Artificial Intelligence (cs.AI)

Abstract:This article focuses on training work carried out in artificial intelligence (AI) at the National Center for Supercomputing Applications (NCSA) at the University of Illinois Urbana-Champaign via a research experience for undergraduates (REU) program named FoDOMMaT. It also describes why we are interested in AI, and concludes by discussing what we’ve learned from running this program and its predecessor over six years.

[AI-88] Does GPT Really Get It? A Hierarchical Scale to Quantify Human vs AIs Understanding of Algorithms

Link: https://arxiv.org/abs/2406.14722
Authors: Mirabel Reid,Santosh S. Vempala
Keywords: Large Language Models, complex cognitive tasks, Language Models, Large Language, natural question
Subjects: Artificial Intelligence (cs.AI)
Comments: 13 pages, 8 figures

Abstract:As Large Language Models (LLMs) perform (and sometimes excel at) more and more complex cognitive tasks, a natural question is whether AI really understands. The study of understanding in LLMs is in its infancy, and the community has yet to incorporate well-trodden research in philosophy, psychology, and education. We initiate this, specifically focusing on understanding algorithms, and propose a hierarchy of levels of understanding. We use the hierarchy to design and conduct a study with human subjects (undergraduate and graduate students) as well as large language models (generations of GPT), revealing interesting similarities and differences. We expect that our rigorous criteria will be useful to keep track of AI’s progress in such cognitive domains.

[AI-89] MultiAgent Collaboration Attack: Investigating Adversarial Attacks in Large Language Model Collaborations via Debate

Link: https://arxiv.org/abs/2406.14711
Authors: Alfonso Amayuelas,Xianjun Yang,Antonis Antoniades,Wenyue Hua,Liangming Pan,William Wang
Keywords: shown exceptional results, Large Language Models, Large Language, working individually, shown exceptional
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)

Abstract:Large Language Models (LLMs) have shown exceptional results on current benchmarks when working individually. The advancement in their capabilities, along with a reduction in parameter size and inference times, has facilitated the use of these models as agents, enabling interactions among multiple models to execute complex tasks. Such collaborations offer several advantages, including the use of specialized models (e.g. coding), improved confidence through multiple computations, and enhanced divergent thinking, leading to more diverse outputs. Thus, the collaborative use of language models is expected to grow significantly in the coming years. In this work, we evaluate the behavior of a network of models collaborating through debate under the influence of an adversary. We introduce pertinent metrics to assess the adversary’s effectiveness, focusing on system accuracy and model agreement. Our findings highlight the importance of a model’s persuasive ability in influencing others. Additionally, we explore inference-time methods to generate more compelling arguments and evaluate the potential of prompt-based mitigation as a defensive strategy.

[AI-90] Do LLMs Have Distinct and Consistent Personality? TRAIT: Personality Testset designed for LLMs with Psychometrics

Link: https://arxiv.org/abs/2406.14703
Authors: Seungbeen Lee,Seungwon Lim,Seungju Han,Giyeong Oh,Hyungjoo Chae,Jiwan Chung,Minju Kim,Beong-woo Kwak,Yeonsoo Lee,Dongha Lee,Jinyoung Yeo,Youngjae Yu
Keywords: Large Language Models, Language Models, Large Language, extended to Large, observable behavior
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Preprint; Under review

Abstract:The idea of personality in descriptive psychology, traditionally defined through observable behavior, has now been extended to Large Language Models (LLMs) to better understand their behavior. This raises a question: do LLMs exhibit distinct and consistent personality traits, similar to humans? Existing self-assessment personality tests, while applicable, lack the necessary validity and reliability for precise personality measurements. To address this, we introduce TRAIT, a new tool consisting of 8K multi-choice questions designed to assess the personality of LLMs with validity and reliability. TRAIT is built on the psychometrically validated human questionnaire, Big Five Inventory (BFI) and Short Dark Triad (SD-3), enhanced with the ATOMIC10X knowledge graph for testing personality in a variety of real scenarios. TRAIT overcomes the reliability and validity issues when measuring personality of LLM with self-assessment, showing the highest scores across three metrics: refusal rate, prompt sensitivity, and option order sensitivity. It reveals notable insights into personality of LLM: 1) LLMs exhibit distinct and consistent personality, which is highly influenced by their training data (i.e., data used for alignment tuning), and 2) current prompting techniques have limited effectiveness in eliciting certain traits, such as high psychopathy or low conscientiousness, suggesting the need for further research in this direction.

[AI-91] Speech Prefix-Tuning with RNNT Loss for Improving LLM Predictions

Link: https://arxiv.org/abs/2406.14701
Authors: Murali Karthick Baskar,Andrew Rosenberg,Bhuvana Ramabhadran,Neeraj Gaur,Zhong Meng
Keywords: focus on addressing, addressing the constraints, constraints faced, ASR, LLMs
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Abstract:In this paper, we focus on addressing the constraints faced when applying LLMs to ASR. Recent works utilize prefixLM-type models, which directly apply speech as a prefix to LLMs for ASR. We have found that optimizing speech prefixes leads to better ASR performance and propose applying RNNT loss to perform speech prefix-tuning. This is a simple approach and does not increase the model complexity or alter the inference pipeline. We also propose language-based soft prompting to further improve with frozen LLMs. Empirical analysis on a realtime testset from 10 Indic languages demonstrates that our proposed speech prefix-tuning yields improvements with both frozen and fine-tuned LLMs. Our recognition results, averaged over the 10 Indic languages, show that the proposed prefix-tuning with RNNT loss results in a 12% relative improvement in WER over the baseline with a fine-tuned LLM. Our proposed approaches with the frozen LLM lead to a 31% relative improvement over basic soft-prompting prefixLM.
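
The relative improvements quoted in this abstract are ratios against a baseline word error rate, not absolute point differences. A minimal sketch of that arithmetic follows; the function name `relative_wer_improvement` and the WER values are hypothetical and chosen only for illustration, since the abstract reports just the relative figures.

```python
def relative_wer_improvement(baseline_wer: float, new_wer: float) -> float:
    """Return the WER reduction as a fraction of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers, not taken from the paper.
baseline = 25.0  # assumed baseline WER, in percent
improved = 22.0  # assumed WER after prefix-tuning, in percent
print(f"{relative_wer_improvement(baseline, improved):.0%}")  # prints 12%
```

With these assumed values, a drop of 3 absolute WER points corresponds to a 12% relative improvement.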

[AI-92] Physically Analyzable AI-Based Nonlinear Platoon Dynamics Modeling During Traffic Oscillation: A Koopman Approach

Link: https://arxiv.org/abs/2406.14696
Authors: Kexin Tian,Haotian Shi,Yang Zhou,Sixu Li
Keywords: concurrently achieving physical, achieving physical analyzability, complexity and nonlinearity, exists a critical, concurrently achieving
Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)

Abstract:Given the complexity and nonlinearity inherent in traffic dynamics within vehicular platoons, there exists a critical need for a modeling methodology that achieves high accuracy while retaining physical analyzability. Currently, there are two predominant approaches: the physics model-based approach and the Artificial Intelligence (AI)-based approach. Because physics-based models usually lack sufficient modeling accuracy and suffer from potential function mismatches, while purely AI-based methods lack analyzability, this paper proposes an AI-based Koopman approach that models the unknown nonlinear platoon dynamics by harnessing the power of AI while maintaining physical analyzability, with a particular focus on periods of traffic oscillation. Specifically, this research first employs a deep learning framework to generate the embedding function that lifts the original space into the embedding space. Given the descriptiveness of the embedding space, the platoon dynamics can be expressed as a linear dynamical system grounded in Koopman theory. Based on that, the routine of linear dynamical system analysis can be conducted on the learned linear traffic dynamics in the embedding space. In this way, the physical interpretability and analyzability of model-based methods can be combined with the heightened precision inherent in data-driven approaches. Comparative experiments with existing modeling approaches suggest our method’s superiority in accuracy. Additionally, a phase plane analysis is performed, further evidencing our approach’s effectiveness in replicating the complex dynamic patterns. Moreover, the proposed methodology is shown to be capable of analyzing stability, attesting to its physical analyzability.
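
The lifting idea in this abstract can be written compactly in generic Koopman notation. The symbols below are assumptions introduced here for illustration and are not taken from the paper: $x_k$ is the platoon state at step $k$, $g$ is the learned embedding function, $z_k$ is the lifted state, and $K$ is the linear operator acting in the embedding space. The paper may well use a continuous-time or otherwise different formulation.

```latex
z_k = g(x_k), \qquad z_{k+1} \approx K z_k, \qquad
\rho(K) = \max_i |\lambda_i(K)| < 1 \;\Rightarrow\; z_k \to 0 \text{ as } k \to \infty
```

The last condition is the standard asymptotic stability test for a discrete-time linear system, and it is the kind of analysis that becomes available once the dynamics are linear in the lifted coordinates.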

[AI-93] Depth F_1: Improving Evaluation of Cross-Domain Text Classification by Measuring Semantic Generalizability

Link: https://arxiv.org/abs/2406.14695
Authors: Parker Seegmiller,Joseph Gatto,Sarah Masud Preum
Keywords: cross-domain text classification, cross-domain text, text classification, source domain, obtain domain-invariant performance
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Recent evaluations of cross-domain text classification models aim to measure the ability of a model to obtain domain-invariant performance in a target domain given labeled samples in a source domain. The primary strategy for this evaluation relies on assumed differences between source domain samples and target domain samples in benchmark datasets. This evaluation strategy fails to account for the similarity between source and target domains, and may mask when models fail to transfer learning to specific target samples which are highly dissimilar from the source domain. We introduce Depth F_1 , a novel cross-domain text classification performance metric. Designed to be complementary to existing classification metrics such as F_1 , Depth F_1 measures how well a model performs on target samples which are dissimilar from the source domain. We motivate this metric using standard cross-domain text classification datasets and benchmark several recent cross-domain text classification models, with the goal of enabling in-depth evaluation of the semantic generalizability of cross-domain text classification models.

[AI-94] This Looks Better than That: Better Interpretable Models with ProtoPNeXt

Link: https://arxiv.org/abs/2406.14675
Authors: Frank Willard,Luke Moffett,Emmanuel Mokel,Jon Donnelly,Stark Guo,Julia Yang,Giyoung Kim,Alina Jade Barnett,Cynthia Rudin
Keywords: popular interpretable alternative, black-box deep learning, deep learning models, computer vision, popular interpretable
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Prototypical-part models are a popular interpretable alternative to black-box deep learning models for computer vision. However, they are difficult to train, with high sensitivity to hyperparameter tuning, inhibiting their application to new datasets and our understanding of which methods truly improve their performance. To facilitate the careful study of prototypical-part networks (ProtoPNets), we create a new framework for integrating components of prototypical-part models – ProtoPNeXt. Using ProtoPNeXt, we show that applying Bayesian hyperparameter tuning and an angular prototype similarity metric to the original ProtoPNet is sufficient to produce new state-of-the-art accuracy for prototypical-part models on CUB-200 across multiple backbones. We further deploy this framework to jointly optimize for accuracy and prototype interpretability as measured by metrics included in ProtoPNeXt. Using the same resources, this produces models with substantially superior semantics and changes in accuracy between +1.3% and -1.5%. The code and trained models will be made publicly available upon publication.

[AI-95] Exploring Design Choices for Building Language-Specific LLMs

Link: https://arxiv.org/abs/2406.14670
Authors: Atula Tejaswi,Nilesh Gupta,Eunsol Choi
Keywords: languages remain unsatisfactory, remain unsatisfactory, rapid progress, progress in large, vast majority
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 15 pages, 6 figures, 11 tables

Abstract:Despite rapid progress in large language models (LLMs), their performance on a vast majority of languages remains unsatisfactory. In this paper, we study building language-specific LLMs by adapting monolingual and multilingual LLMs. We conduct systematic experiments on how design choices (base model selection, vocabulary extension, and continued fine-tuning) impact the adapted LLM, both in terms of efficiency (how many tokens are needed to encode the same amount of information) and end task performance. We find that (1) the initial performance before the adaptation is not always indicative of the final performance; (2) efficiency can easily be improved with simple vocabulary extension and continued fine-tuning in most LLMs we study; and (3) the optimal adaptation method is highly language-dependent, and the simplest approach works well across various experimental settings. Adapting English-centric models can yield better results than adapting multilingual models despite their worse initial performance on low-resource languages. Together, our work lays foundations on efficiently building language-specific LLMs by adapting existing LLMs.

[AI-96] OpenDebateEvidence: A Massive-Scale Argument Mining and Summarization Dataset

Link: https://arxiv.org/abs/2406.14657
Authors: Allen Roush,Yusuf Shabazz,Arvind Balaji,Peter Zhang,Stefano Mezza,Markus Zhang,Sanjay Basu,Sriram Vishwanath,Mehdi Fatemi,Ravid Schwartz-Ziv
Keywords: American Competitive Debate, Competitive Debate community, American Competitive, Competitive Debate, Debate community
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted for Publication to ARGMIN 2024 at ACL2024

Abstract:We introduce OpenDebateEvidence, a comprehensive dataset for argument mining and summarization sourced from the American Competitive Debate community. This dataset includes over 3.5 million documents with rich metadata, making it one of the most extensive collections of debate evidence. OpenDebateEvidence captures the complexity of arguments in high school and college debates, providing valuable resources for training and evaluation. Our extensive experiments demonstrate the efficacy of fine-tuning state-of-the-art large language models for argumentative abstractive summarization across various methods, models, and datasets. By providing this comprehensive resource, we aim to advance computational argumentation and support practical applications for debaters, educators, and researchers. OpenDebateEvidence is publicly available to support further research and innovation in computational argumentation. Access it here: this https URL

[AI-97] HYPERmotion: Learning Hybrid Behavior Planning for Autonomous Loco-manipulation

Link: https://arxiv.org/abs/2406.14655
Authors: Jin Wang,Rui Dai,Weijie Wang,Luca Rossini,Francesco Ruscelli,Nikos Tsagarakis
Keywords: autonomously perform hybrid, Enabling robots, perform hybrid motions, household chores, material handling
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Project page: this https URL

Abstract:Enabling robots to autonomously perform hybrid motions in diverse environments can be beneficial for long-horizon tasks such as material handling, household chores, and work assistance. This requires extensive exploitation of intrinsic motion capabilities, extraction of affordances from rich environmental information, and planning of physical interaction behaviors. Although recent progress has demonstrated impressive humanoid whole-body control abilities, current systems still struggle to achieve versatility and adaptability for new tasks. In this work, we propose HYPERmotion, a framework that learns, selects and plans behaviors based on tasks in different scenarios. We combine reinforcement learning with whole-body optimization to generate motion for 38 actuated joints and create a motion library to store the learned skills. We apply the planning and reasoning features of the large language models (LLMs) to complex loco-manipulation tasks, constructing a hierarchical task graph that comprises a series of primitive behaviors to bridge lower-level execution with higher-level planning. We leverage the interaction of distilled spatial geometry and 2D observation with a visual language model (VLM) to ground knowledge into a robotic morphology selector that chooses appropriate actions in single- or dual-arm, legged or wheeled locomotion. Experiments in simulation and the real world show that learned motions can efficiently adapt to new tasks, demonstrating high autonomy from free-text commands in unstructured scenes. Videos and website: this http URL

[AI-98] Major Entity Identification: A Generalizable Alternative to Coreference Resolution

Link: https://arxiv.org/abs/2406.14654
Authors: Kawshik Manikantan(1),Shubham Toshniwal(2),Makarand Tapaswi(1),Vineet Gandhi(1) ((1) CVIT, IIIT Hyderabad, (2) NVIDIA)
Keywords: task broad application, Major Entity Identification, coreference resolution, broad application, major bottleneck
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 16 pages, 6 figures

Abstract:The limited generalization of coreference resolution (CR) models has been a major bottleneck in the task’s broad application. Prior work has identified annotation differences, especially for mention detection, as one of the main reasons for the generalization gap and proposed using additional annotated target domain data. Rather than relying on this additional annotation, we propose an alternative formulation of the CR task, Major Entity Identification (MEI), where we: (a) assume the target entities to be specified in the input, and (b) limit the task to only the frequent entities. Through extensive experiments, we demonstrate that MEI models generalize well across domains on multiple datasets with supervised models and LLM-based few-shot prompting. Additionally, the MEI task fits the classification framework, which enables the use of classification-based metrics that are more robust than the current CR metrics. Finally, MEI is also of practical use as it allows a user to search for all mentions of a particular entity or a group of entities of interest.

[AI-99] LLM Granularity for On-the-Fly Robot Control

Link: https://arxiv.org/abs/2406.14653
Authors: Peng Wang,Mattia Robbiani,Zhihao Guo
Keywords: attracted significant attention, significant attention due, Assistive robots, Assistive, attracted significant
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)

Abstract:Assistive robots have attracted significant attention due to their potential to enhance the quality of life for vulnerable individuals like the elderly. The convergence of computer vision, large language models, and robotics has introduced the ‘visuolinguomotor’ mode for assistive robots, where visuals and linguistics are incorporated into assistive robots to enable proactive and interactive assistance. This raises the question: In circumstances where visuals become unreliable or unavailable, can we rely solely on language to control robots, i.e., the viability of the ‘linguomotor’ mode for assistive robots? This work takes the initial steps to answer this question by: 1) evaluating the responses of assistive robots to language prompts of varying granularities; and 2) exploring the necessity and feasibility of controlling the robot on-the-fly. We have designed and conducted experiments on a Sawyer cobot to support our arguments. A Turtlebot robot case is designed to demonstrate the adaptation of the solution to scenarios where assistive robots need to maneuver to assist. Codes will be released on GitHub soon to benefit the community.

[AI-100] Holistic Evaluation for Interleaved Text-and-Image Generation

Link: https://arxiv.org/abs/2406.14643
Authors: Minqian Liu,Zhiyang Xu,Zihao Lin,Trevor Ashby,Joy Rimchala,Jiaxin Zhang,Lifu Huang
Keywords: intriguing research direction, arbitrary order, required to generate, Interleaved, interleaved generation
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Work in progress. 13 pages, 5 figures, 6 tables

Abstract:Interleaved text-and-image generation has been an intriguing research direction, where the models are required to generate both images and text pieces in an arbitrary order. Despite the emerging advancements in interleaved generation, the progress in its evaluation still significantly lags behind. Existing evaluation benchmarks do not support arbitrarily interleaved images and text for both inputs and outputs, and they only cover a limited number of domains and use cases. Also, current works predominantly use similarity-based metrics which fall short in assessing the quality in open-ended scenarios. To this end, we introduce InterleavedBench, the first benchmark carefully curated for the evaluation of interleaved text-and-image generation. InterleavedBench features a rich array of tasks to cover diverse real-world use cases. In addition, we present InterleavedEval, a strong reference-free metric powered by GPT-4o to deliver accurate and explainable evaluation. We carefully define five essential evaluation aspects for InterleavedEval, including text quality, perceptual quality, image coherence, text-image coherence, and helpfulness, to ensure a comprehensive and fine-grained assessment. Through extensive experiments and rigorous human evaluation, we show that our benchmark and metric can effectively evaluate the existing models with a strong correlation with human judgments surpassing previous reference-based metrics. We also provide substantial findings and insights to foster future research in interleaved generation and its evaluation.

[AI-101] Harvesting Efficient On-Demand Order Pooling from Skilled Couriers: Enhancing Graph Representation Learning for Refining Real-time Many-to-One Assignments

Link: https://arxiv.org/abs/2406.14635
Authors: Yile Liang,Jiuxia Zhao,Donghui Li,Jie Feng,Chen Zhang,Xuetao Ding,Jinghua Hao,Renqing He
Keywords: offering delivery fulfillment, on-demand food delivery, recent past, past has witnessed, witnessed a notable
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted in KDD 2024 ADS Track

Abstract:The recent past has witnessed a notable surge in on-demand food delivery (OFD) services, offering delivery fulfillment within dozens of minutes after an order is placed. In OFD, pooling multiple orders for simultaneous delivery in real-time order assignment is a pivotal efficiency source, which may in turn extend delivery time. Constructing high-quality order pooling to harmonize platform efficiency with the experiences of consumers and couriers is crucial to OFD platforms. However, the complexity and real-time nature of order assignment, which make extensive calculations impractical, significantly limit the potential for order consolidation. Moreover, the offline environment is frequently riddled with unknown factors, posing challenges for the platform’s perceptibility and pooling decisions. Nevertheless, the delivery behaviors of skilled couriers (SCs), who know the environment well, can improve system awareness and effectively inform decisions. Hence an SC delivery network (SCDN) is constructed, based on an enhanced attributed heterogeneous network embedding approach tailored for OFD. It aims to extract features from rich temporal and spatial information, and uncover the latent potential for order combinations embedded within SC trajectories. Accordingly, the vast search space of order assignment can be effectively pruned through scalable similarity calculations of low-dimensional vectors, making comprehensive and high-quality pooling outcomes more easily identified in real time. SCDN has now been deployed in the Meituan dispatch system. Online tests reveal that with SCDN, the pooling quality and extent have been greatly improved. Our system can boost couriers’ efficiency by 45-55% during noon peak hours, while upholding the timely delivery commitment.

[AI-102] Adaptive Manipulation using Behavior Trees

Link: https://arxiv.org/abs/2406.14634
Authors: Jacques Cloete,Wolfgang Merkt,Ioannis Havoutis
Keywords: tightening or loosening, manipulation, loosening a valve, common motions, twisting motion
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 12 pages, including 7 figures. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

Abstract:Many manipulation tasks use instances of a set of common motions, such as a twisting motion for tightening or loosening a valve. However, different instances of the same motion often require different environmental parameters (e.g. force/torque level), and thus different manipulation strategies to successfully complete; for example, grasping a valve handle from the side rather than head-on to increase applied torque. Humans can intuitively adapt their manipulation strategy to best suit such problems, but representing and implementing such behaviors for robots remains an open question. We present a behavior tree-based approach for adaptive manipulation, wherein the robot can reactively select from and switch between a discrete set of manipulation strategies during task execution. Furthermore, our approach allows the robot to learn from past attempts to optimize performance, for example learning the optimal strategy for different task instances. Our approach also allows the robot to preempt task failure and either change to a more feasible strategy or safely exit the task before catastrophic failure occurs. We propose a simple behavior tree design for general adaptive robot behavior and apply it in the context of industrial manipulation. The adaptive behavior outperformed all baseline behaviors that only used a single manipulation strategy, markedly reducing the number of attempts and overall time taken to complete the example tasks. Our results demonstrate potential for improved robustness and efficiency in task completion, reducing dependency on human supervision and intervention.
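
The idea of switching between a discrete set of manipulation strategies can be illustrated with a standard behavior-tree fallback (selector) node, which ticks its children in order until one succeeds. The sketch below is a generic illustration under stated assumptions, not the authors' tree design: the `Fallback` class, the `make_strategy` helper, the strategy names, and the torque numbers are all hypothetical, and it omits the running state, learning from past attempts, and failure preemption that the abstract describes.

```python
from enum import Enum
from typing import Callable, List


class Status(Enum):
    SUCCESS = "success"
    FAILURE = "failure"


class Fallback:
    """Tick children in order; succeed as soon as one child succeeds."""

    def __init__(self, children: List[Callable[[], Status]]):
        self.children = children

    def tick(self) -> Status:
        for child in self.children:
            if child() is Status.SUCCESS:
                return Status.SUCCESS
        return Status.FAILURE


def make_strategy(name: str, max_torque: float, required_torque: float,
                  log: List[str]) -> Callable[[], Status]:
    """Hypothetical grasp strategy that succeeds only with enough torque."""
    def run() -> Status:
        log.append(name)  # record which strategy was attempted
        if max_torque >= required_torque:
            return Status.SUCCESS
        return Status.FAILURE
    return run


attempts: List[str] = []
required = 8.0  # assumed torque needed to turn the valve (arbitrary units)
tree = Fallback([
    make_strategy("head_on_grasp", 5.0, required, attempts),
    make_strategy("side_grasp", 10.0, required, attempts),
])
result = tree.tick()
print(result.name, attempts)  # prints: SUCCESS ['head_on_grasp', 'side_grasp']
```

Here the head-on grasp fails for lack of torque and the tree falls through to the side grasp, mirroring the valve example in the abstract.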

[AI-103] Can LLMs Learn by Teaching? A Preliminary Study

Link: https://arxiv.org/abs/2406.14629
Authors: Xuefei Ning,Zifu Wang,Shiyao Li,Zinan Lin,Peiran Yao,Tianyu Fu,Matthew B. Blaschko,Guohao Dai,Huazhong Yang,Yu Wang
Keywords: extensively studied methodology, knowledge distillation, extensively studied, studied methodology, Teaching
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Under review

Abstract:Teaching to improve student models (e.g., knowledge distillation) is an extensively studied methodology in LLMs. However, for humans, teaching not only improves students but also improves teachers. We ask: Can LLMs also learn by teaching (LbT)? If yes, we can potentially unlock the possibility of continuously advancing the models without solely relying on human-produced data or stronger models. In this paper, we provide a preliminary exploration of this ambitious agenda. We show that LbT ideas can be incorporated into existing LLM training/prompting pipelines and provide noticeable improvements. Specifically, we design three methods, each mimicking one of the three levels of LbT in humans: observing students’ feedback, learning from the feedback, and learning iteratively, with the goals of improving answer accuracy without training and improving models’ inherent capability with fine-tuning. The findings are encouraging. For example, similar to LbT in human, we see that: (1) LbT can induce weak-to-strong generalization: strong models can improve themselves by teaching other weak models; (2) Diversity in students might help: teaching multiple students could be better than teaching one student or the teacher itself. We hope that this early promise can inspire future research on LbT and more broadly adopting the advanced techniques in education to improve LLMs. The code is available at this https URL.

[AI-104] SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal Behaviors

Link: https://arxiv.org/abs/2406.14598
Authors: Tinghao Xie,Xiangyu Qi,Yi Zeng,Yangsibo Huang,Udari Madhushani Sehwag,Kaixuan Huang,Luxi He,Boyi Wei,Dacheng Li,Ying Sheng,Ruoxi Jia,Bo Li,Kai Li,Danqi Chen,Peter Henderson,Prateek Mittal
Keywords: Evaluating aligned large, Evaluating aligned, reject unsafe user, unsafe user requests, policy-compliant deployments
Subjects: Artificial Intelligence (cs.AI)

Abstract:Evaluating aligned large language models’ (LLMs) ability to recognize and reject unsafe user requests is crucial for safe, policy-compliant deployments. Existing evaluation efforts, however, face three limitations that we address with SORRY-Bench, our proposed benchmark. First, existing methods often use coarse-grained taxonomies of unsafe topics, and are over-representing some fine-grained topics. For example, among the ten existing datasets that we evaluated, tests for refusals of self-harm instructions are over 3x less represented than tests for fraudulent activities. SORRY-Bench improves on this by using a fine-grained taxonomy of 45 potentially unsafe topics, and 450 class-balanced unsafe instructions, compiled through human-in-the-loop methods. Second, linguistic characteristics and formatting of prompts are often overlooked, like different languages, dialects, and more – which are only implicitly considered in many evaluations. We supplement SORRY-Bench with 20 diverse linguistic augmentations to systematically examine these effects. Third, existing evaluations rely on large LLMs (e.g., GPT-4) for evaluation, which can be computationally expensive. We investigate design choices for creating a fast, accurate automated safety evaluator. By collecting 7K+ human annotations and conducting a meta-evaluation of diverse LLM-as-a-judge designs, we show that fine-tuned 7B LLMs can achieve accuracy comparable to GPT-4 scale LLMs, with lower computational cost. Putting these together, we evaluate over 40 proprietary and open-source LLMs on SORRY-Bench, analyzing their distinctive refusal behaviors. We hope our effort provides a building block for systematic evaluations of LLMs’ safety refusal capabilities, in a balanced, granular, and efficient manner.

[AI-105] ICAL: Continual Learning of Multimodal Agents by Transforming Trajectories into Actionable Insights

Link: https://arxiv.org/abs/2406.14596
Authors: Gabriel Sarch, Lawrence Jang, Michael J. Tarr, William W. Cohen, Kenneth Marino, Katerina Fragkiadaki
Keywords: Large-scale generative language, Large-scale generative, generative language, language and vision-language, decision making
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Project website: this http URL

Abstract:Large-scale generative language and vision-language models (LLMs and VLMs) excel in few-shot in-context learning for decision making and instruction following. However, they require high-quality exemplar demonstrations to be included in their context window. In this work, we ask: Can LLMs and VLMs generate their own prompt examples from generic, sub-optimal demonstrations? We propose In-Context Abstraction Learning (ICAL), a method that builds a memory of multimodal experience insights from sub-optimal demonstrations and human feedback. Given a noisy demonstration in a new domain, VLMs abstract the trajectory into a general program by fixing inefficient actions and annotating cognitive abstractions: task relationships, object state changes, temporal subgoals, and task construals. These abstractions are refined and adapted interactively through human feedback while the agent attempts to execute the trajectory in a similar environment. The resulting abstractions, when used as exemplars in the prompt, significantly improve decision-making in retrieval-augmented LLM and VLM agents. Our ICAL agent surpasses the state-of-the-art in dialogue-based instruction following in TEACh, multimodal web agents in VisualWebArena, and action anticipation in Ego4D. In TEACh, we achieve a 12.6% improvement in goal-condition success. In VisualWebArena, our task success rate improves over the SOTA from 14.3% to 22.7%. In Ego4D action forecasting, we improve over few-shot GPT-4V and remain competitive with supervised models. We show finetuning our retrieval-augmented in-context agent yields additional improvements. Our approach significantly reduces reliance on expert-crafted examples and consistently outperforms in-context learning from action plans that lack such insights.

[AI-106] Adversaries Can Misuse Combinations of Safe Models

Link: https://arxiv.org/abs/2406.14595
Authors: Erik Jones, Anca Dragan, Jacob Steinhardt
Keywords: user manipulation, model, Developers, model enables cyberoffense, adversaries
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Developers try to evaluate whether an AI system can be misused by adversaries before releasing it; for example, they might test whether a model enables cyberoffense, user manipulation, or bioterrorism. In this work, we show that individually testing models for misuse is inadequate; adversaries can misuse combinations of models even when each individual model is safe. The adversary accomplishes this by first decomposing tasks into subtasks, then solving each subtask with the best-suited model. For example, an adversary might solve challenging-but-benign subtasks with an aligned frontier model, and easy-but-malicious subtasks with a weaker misaligned model. We study two decomposition methods: manual decomposition where a human identifies a natural decomposition of a task, and automated decomposition where a weak model generates benign tasks for a frontier model to solve, then uses the solutions in-context to solve the original task. Using these decompositions, we empirically show that adversaries can create vulnerable code, explicit images, Python scripts for hacking, and manipulative tweets at much higher rates with combinations of models than with either individual model. Our work suggests that even perfectly-aligned frontier systems can enable misuse without ever producing malicious outputs, and that red-teaming efforts should extend beyond single models in isolation.

[AI-107] PreSto: An In-Storage Data Preprocessing System for Training Recommendation Models

Link: https://arxiv.org/abs/2406.14571
Authors: Yunjae Lee, Hyeseong Kim, Minsoo Rhu
Keywords: Training recommendation systems, faces several challenges, stage to preprocess, seamless manner, raw data
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Training recommendation systems (RecSys) faces several challenges as it requires the “data preprocessing” stage to preprocess an ample amount of raw data and feed them to the GPU for training in a seamless manner. To sustain high training throughput, state-of-the-art solutions reserve a large fleet of CPU servers for preprocessing, which incurs substantial deployment cost and power consumption. Our characterization reveals that prior CPU-centric preprocessing is bottlenecked on feature generation and feature normalization operations as it fails to exploit the abundant inter-/intra-feature parallelism in RecSys preprocessing. PreSto is a storage-centric preprocessing system leveraging In-Storage Processing (ISP), which offloads the bottlenecked preprocessing operations to our ISP units. We show that PreSto outperforms the baseline CPU-centric system with a 9.6× speedup in end-to-end preprocessing time, 4.3× enhancement in cost-efficiency, and 11.3× improvement in energy efficiency on average for production-scale RecSys preprocessing.
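
To make "feature generation" and "feature normalization" concrete, here is a minimal sketch of two operations of that kind. The abstract does not name specific operators, so bucketization and a log(1 + x) transform are assumptions chosen for illustration, as are the function names and bucket boundaries.

```python
import bisect
import math

def bucketize(value, boundaries):
    """Feature generation (assumed example): map a dense value to its bucket index."""
    return bisect.bisect_right(boundaries, value)

def log_normalize(value):
    """Feature normalization (assumed example): compress a non-negative value with log(1 + x)."""
    return math.log1p(max(value, 0.0))

def preprocess(values, boundaries):
    # Each value is processed independently of the others,
    # so this loop is straightforward to parallelize.
    return [(bucketize(x, boundaries), log_normalize(x)) for x in values]

out = preprocess([0.0, 3.0, 120.0], boundaries=[1.0, 10.0, 100.0])
```

Both functions are stateless per element, which is the property that makes this kind of work a candidate for offloading to many small processing units.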

[AI-108] DragPoser: Motion Reconstruction from Variable Sparse Tracking Signals via Latent Space Optimization

Link: https://arxiv.org/abs/2406.14567
Authors: Jose Luis Ponton, Eduard Pujol, Andreas Aristidou, Carlos Andujar, Nuria Pelechano
Keywords: High-quality motion reconstruction, high-end mocap systems, High-quality motion, user movements, high-end mocap
Categories: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Abstract:High-quality motion reconstruction that follows the user’s movements can be achieved by high-end mocap systems with many sensors. However, obtaining such animation quality with fewer input devices is gaining popularity as it brings mocap closer to the general public. The main challenges include the loss of end-effector accuracy in learning-based approaches, or the lack of naturalness and smoothness in IK-based solutions. In addition, such systems are often finely tuned to a specific number of trackers and are highly sensitive to missing data e.g., in scenarios where a sensor is occluded or malfunctions. In response to these challenges, we introduce DragPoser, a novel deep-learning-based motion reconstruction system that accurately represents hard and dynamic on-the-fly constraints, attaining real-time high end-effectors position accuracy. This is achieved through a pose optimization process within a structured latent space. Our system requires only one-time training on a large human motion dataset, and then constraints can be dynamically defined as losses, while the pose is iteratively refined by computing the gradients of these losses within the latent space. To further enhance our approach, we incorporate a Temporal Predictor network, which employs a Transformer architecture to directly encode temporality within the latent space. This network ensures the pose optimization is confined to the manifold of valid poses and also leverages past pose data to predict temporally coherent poses. Results demonstrate that DragPoser surpasses both IK-based and the latest data-driven methods in achieving precise end-effector positioning, while it produces natural poses and temporally coherent motion. In addition, our system showcases robustness against on-the-fly constraint modifications, and exhibits exceptional adaptability to various input configurations and changes.
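
The idea of defining constraints as losses and refining the pose by gradient steps in latent space can be sketched with a toy model. Everything here is an assumption for illustration: the "decoder" is a fixed linear map rather than a trained network, and the single constraint is an end-effector target position.

```python
def decode(z, W, b):
    """Toy linear 'decoder': latent vector z -> end-effector position."""
    return [sum(W[i][j] * z[j] for j in range(len(z))) + b[i] for i in range(len(b))]

def refine_latent(z, target, W, b, lr=0.1, steps=200):
    """Gradient descent on the latent code to minimise ||decode(z) - target||^2."""
    z = list(z)
    for _ in range(steps):
        err = [p - t for p, t in zip(decode(z, W, b), target)]
        # d/dz ||W z + b - target||^2 = 2 W^T err
        grad = [2.0 * sum(W[i][j] * err[i] for i in range(len(err))) for j in range(len(z))]
        z = [zj - lr * gj for zj, gj in zip(z, grad)]
    return z

W = [[1.0, 0.0], [0.0, 2.0]]
b = [0.0, 0.5]
z_opt = refine_latent([0.0, 0.0], target=[1.0, 1.5], W=W, b=b)
pose = decode(z_opt, W, b)  # approaches the target position
```

Changing the constraint only changes the loss, not the decoder, which is why one-time training can serve different tracker configurations.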

[AI-109] LM-IGTD: a 2D image generator for low-dimensional and mixed-type tabular data to leverage the potential of convolutional neural networks

Link: https://arxiv.org/abs/2406.14566
Authors: Vanesa Gómez-Martínez, Francisco J. Lara-Abelenda, Pablo Peiro-Corbacho, David Chushig-Muzo, Conceicao Granja, Cristina Soguero-Ruiz
Keywords: Tabular data, transforming tabular data, data, knowledge domains, Tabular
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Tabular data have been extensively used in different knowledge domains. Convolutional neural networks (CNNs) have been successfully used in many applications where important information about data is embedded in the order of features (images), outperforming predictive results of traditional models. Recently, several researchers have proposed transforming tabular data into images to leverage the potential of CNNs and obtain high results in predictive tasks such as classification and regression. In this paper, we present a novel and effective approach for transforming tabular data into images, addressing the inherent limitations associated with low-dimensional and mixed-type datasets. Our method, named Low Mixed-Image Generator for Tabular Data (LM-IGTD), integrates a stochastic feature generation process and a modified version of the IGTD. We introduce an automatic and interpretable end-to-end pipeline, enabling the creation of images from tabular data. A mapping between original features and the generated images is established, and post hoc interpretability methods are employed to identify crucial areas of these images, enhancing interpretability for predictive tasks. We conducted an extensive evaluation of the proposed tabular-to-image generation approach on 12 low-dimensional and mixed-type datasets, including binary and multi-class classification scenarios. In particular, our method outperformed all traditional ML models trained on tabular data in five out of twelve datasets when using images generated with LM-IGTD and CNN. In the remaining datasets, LM-IGTD images and CNN consistently surpassed three out of four traditional ML models, achieving similar results to the fourth model.

[AI-110] Tackling GenAI Copyright Issues: Originality Estimation and Genericization

Link: https://arxiv.org/abs/2406.03341
Authors: Hiroaki Chiba-Okabe, Weijie J. Su
Keywords: numerous lawsuits filed, significant copyright concerns, sparked significant copyright, leading to numerous, generative model
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Applications (stat.AP); Methodology (stat.ME); Machine Learning (stat.ML)
Comments: 15 pages, 6 figures

Abstract:The rapid progress of generative AI technology has sparked significant copyright concerns, leading to numerous lawsuits filed against AI developers. While some studies explore methods to mitigate copyright risks by steering the outputs of generative models away from those resembling copyrighted data, little attention has been paid to the question of how much of a resemblance is undesirable; more original or unique data are afforded stronger protection, and the threshold level of resemblance for constituting infringement correspondingly lower. Here, leveraging this principle, we propose a genericization method that modifies the outputs of a generative model to make them more generic and less likely to infringe copyright. To achieve this, we introduce a metric for quantifying the level of originality of data in a manner that is consistent with the legal framework. This metric can be practically estimated by drawing samples from a generative model, which is then used for the genericization process. Experiments demonstrate that our genericization method successfully modifies the output of a text-to-image generative model so that it produces more generic, copyright-compliant images.

[AI-111] Straight-Through meets Sparse Recovery: the Support Exploration Algorithm

Link: https://arxiv.org/abs/2301.13584
Authors: Mimoun Mohamed (QARMA, I2M), François Malgouyres (IMT), Valentin Emiya (QARMA), Caroline Chaux (IPAL)
Keywords: quantized neural networks, optimize quantized neural, sparse support recovery
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Statistics Theory (math.ST)

Abstract:The straight-through estimator (STE) is commonly used to optimize quantized neural networks, yet its contexts of effective performance are still unclear despite empirical successes. To make a step forward in this comprehension, we apply STE to a well-understood problem: sparse support recovery. We introduce the Support Exploration Algorithm (SEA), a novel algorithm promoting sparsity, and we analyze its performance in support recovery (a.k.a. model selection) problems. SEA explores more supports than the state-of-the-art, leading to superior performance in experiments, especially when the columns of A are strongly coherent. The theoretical analysis considers recovery guarantees when the linear measurements matrix A satisfies the Restricted Isometry Property (RIP). The sufficient conditions of recovery are comparable but more stringent than those of the state-of-the-art in sparse support recovery. Their significance lies mainly in their applicability to an instance of the STE.
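
For readers unfamiliar with the setting, the sparse recovery problem and the Restricted Isometry Property are usually written as below. These are the standard textbook formulations, not formulas quoted from the paper. The observation vector $y$, the sparsity level $k$, and the constant $\delta_k \in (0, 1)$ are notation assumed here; only the matrix $A$ appears in the abstract.

```latex
\min_{x \in \mathbb{R}^n} \; \tfrac{1}{2}\,\|Ax - y\|_2^2
\quad \text{subject to} \quad \|x\|_0 \le k,
\qquad\qquad
(1 - \delta_k)\,\|x\|_2^2 \;\le\; \|Ax\|_2^2 \;\le\; (1 + \delta_k)\,\|x\|_2^2
\quad \text{for all } k\text{-sparse } x .
```

Strongly coherent columns of $A$ generally correspond to a large $\delta_k$, which is the regime where the abstract reports the largest experimental gains.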

[AI-112] Rapid and Accurate Diagnosis of Acute Aortic Syndrome using Non-contrast CT: A Large-scale Retrospective Multi-center and AI-based Study

Link: https://arxiv.org/abs/2406.15222
Authors: Yujian Hu, Yilang Xiang, Yan-Jie Zhou, Yangyan He, Shifeng Yang, Xiaolong Du, Chunlan Den, Youyao Xu, Gaofeng Wang, Zhengyao Ding, Jingyong Huang, Wenjun Zhao, Xuejun Wu, Donglin Li, Qianqian Zhu, Zhenjiang Li, Chenyang Qiu, Ziheng Wu, Yunjun He, Chen Tian, Yihui Qiu, Zuodong Lin, Xiaolong Zhang, Yuan He, Zhenpeng Yuan, Xiaoxiang Zhou, Rong Fan, Ruihan Chen, Wenchao Guo, Jianpeng Zhang, Tony C. W. Mok, Zi Li, Le Lu, Dehai Lang, Xiaoqiang Li, Guofu Wang, Wei Lu, Zhengxing Huang, Minfeng Xu, Hongkun Zhang
Keywords: Chest pain symptoms, acute aortic syndrome, acute chest pain, catastrophic cardiovascular emergency, chest pain conditions
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: submit to Nature Medicine

Abstract:Chest pain symptoms are highly prevalent in emergency departments (EDs), where acute aortic syndrome (AAS) is a catastrophic cardiovascular emergency with a high fatality rate, especially when timely and accurate treatment is not administered. However, current triage practices in the ED can cause up to approximately half of patients with AAS to have an initially missed diagnosis or be misdiagnosed as having other acute chest pain conditions. Subsequently, these AAS patients will undergo clinically inaccurate or suboptimal differential diagnosis. Fortunately, even under these suboptimal protocols, nearly all these patients underwent non-contrast CT covering the aorta anatomy at the early stage of differential diagnosis. In this study, we developed an artificial intelligence model (DeepAAS) using non-contrast CT, which is highly accurate for identifying AAS and provides interpretable results to assist in clinical decision-making. Performance was assessed in two major phases: a multi-center retrospective study (n = 20,750) and an exploration in real-world emergency scenarios (n = 137,525). In the multi-center cohort, DeepAAS achieved a mean area under the receiver operating characteristic curve of 0.958 (95% CI 0.950-0.967). In the real-world cohort, DeepAAS detected 109 AAS patients with misguided initial suspicion, achieving 92.6% (95% CI 76.2%-97.5%) in mean sensitivity and 99.2% (95% CI 99.1%-99.3%) in mean specificity. Our AI model performed well on non-contrast CT at all applicable early stages of differential diagnosis workflows, effectively reduced the overall missed diagnosis and misdiagnosis rate from 48.8% to 4.8% and shortened the diagnosis time for patients with misguided initial suspicion from an average of 681.8 (74-11,820) mins to 68.5 (23-195) mins. DeepAAS could effectively fill the gap in the current clinical workflow without requiring additional tests.
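
The sensitivity and specificity figures above follow the usual confusion-matrix definitions, sketched here. The counts passed to the functions are hypothetical and not taken from the paper; only the 48.8% and 4.8% rates come from the abstract.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of actual positives that the model flags."""
    return true_pos / (true_pos + false_neg)

def specificity(true_neg, false_pos):
    """Fraction of actual negatives that the model clears."""
    return true_neg / (true_neg + false_pos)

# Hypothetical confusion-matrix counts, not taken from the paper.
sens = sensitivity(true_pos=90, false_neg=10)
spec = specificity(true_neg=990, false_pos=10)

# Rates quoted in the abstract: missed diagnosis and misdiagnosis, before and after.
absolute_drop = 48.8 - 4.8            # percentage points
relative_drop = absolute_drop / 48.8  # fraction of the original rate
```

The drop from 48.8% to 4.8% is 44 percentage points, or roughly a 90% relative reduction.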

[AI-113] A Wavelet Guided Attention Module for Skin Cancer Classification with Gradient-based Feature Fusion

Link: https://arxiv.org/abs/2406.15128
Authors: Ayush Roy, Sujan Sarkar, Sohom Ghosal, Dmitrii Kaplun, Asya Lyanova, Ram Sarkar
Keywords: highly dangerous type, diagnose skin cancer, Skin cancer, physicians diagnose skin, dangerous type
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Skin cancer is a highly dangerous type of cancer that requires an accurate diagnosis from experienced physicians. To help physicians diagnose skin cancer more efficiently, a computer-aided diagnosis (CAD) system can be very helpful. In this paper, we propose a novel model, which uses a novel attention mechanism to pinpoint the differences in features across the spatial dimensions and symmetry of the lesion, thereby focusing on the dissimilarities of various classes based on symmetry, uniformity in texture and color, etc. Additionally, to take into account the variations in the boundaries of the lesions for different classes, we employ a gradient-based fusion of wavelet and soft attention-aided features to extract boundary information of skin lesions. We have tested our model on the multi-class and highly class-imbalanced dataset, called HAM10000, and achieved promising results, with a 91.17% F1-score and 90.75% accuracy. The code is made available at: this https URL.

[AI-114] FA-Net: A Fuzzy Attention-aided Deep Neural Network for Pneumonia Detection in Chest X-Rays

Link: https://arxiv.org/abs/2406.15117
Authors: Ayush Roy, Anurag Bhattacharjee, Diego Oliva, Oscar Ramos-Soto, Francisco J. Alvarez-Padilla, Ram Sarkar
Keywords: respiratory infection caused, caused by bacteria, Chest X-ray, infection caused, Pneumonia
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Pneumonia is a respiratory infection caused by bacteria, fungi, or viruses. It affects many people, particularly those in developing or underdeveloped nations with high pollution levels, unhygienic living conditions, overcrowding, and insufficient medical infrastructure. Pneumonia can cause pleural effusion, where fluids fill the lungs, leading to respiratory difficulty. Early diagnosis is crucial to ensure effective treatment and increase survival rates. Chest X-ray imaging is the most commonly used method for diagnosing pneumonia. However, visual examination of chest X-rays can be difficult and subjective. In this study, we have developed a computer-aided diagnosis system for automatic pneumonia detection using chest X-ray images. We have used DenseNet-121 and ResNet50 as the backbone for the binary class (pneumonia and normal) and multi-class (bacterial pneumonia, viral pneumonia, and normal) classification tasks, respectively. We have also implemented a channel-specific spatial attention mechanism, called Fuzzy Channel Selective Spatial Attention Module (FCSSAM), to highlight the specific spatial regions of relevant channels while removing the irrelevant channels of the extracted features by the backbone. We evaluated the proposed approach on a publicly available chest X-ray dataset, using binary and multi-class classification setups. Our proposed method achieves accuracy rates of 97.15% and 79.79% for the binary and multi-class classification setups, respectively. The results of our proposed method are superior to state-of-the-art (SOTA) methods. The code of the proposed model will be available at: this https URL.

[AI-115] A Dual Attention-aided DenseNet-121 for Classification of Glaucoma from Fundus Images

Link: https://arxiv.org/abs/2406.15113
Authors: Soham Chakraborty, Ayush Roy, Payel Pramanik, Daria Valenkova, Ram Sarkar
Keywords: Deep learning, computer vision methods, field of ophthalmology, learning and computer, computer vision
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Deep learning and computer vision methods are nowadays predominantly used in the field of ophthalmology. In this paper, we present an attention-aided DenseNet-121 for classifying normal and glaucomatous eyes from fundus images. It involves the convolutional block attention module to highlight relevant spatial and channel features extracted by DenseNet-121. The channel recalibration module further enriches the features by utilizing edge information along with the statistical features of the spatial dimension. For the experiments, two standard datasets, namely RIM-ONE and ACRIMA, have been used. Our method outperforms state-of-the-art models. An ablation study has also been conducted to show the effectiveness of each of the components. The code of the proposed work is available at: this https URL.

[AI-116] Introducing the Biomechanics-Function Relationship in Glaucoma: Improved Visual Field Loss Predictions from intraocular pressure-induced Neural Tissue Strains

Link: https://arxiv.org/abs/2406.14988
Authors: Thanadet Chuangsuwanich, Monisha E. Nongpiur, Fabian A. Braeu, Tin A. Tun, Alexandre Thiery, Shamira Perera, Ching Lin Ho, Martin Buist, George Barbastathis, Tin Aung, Michaël J.A. Girard
Keywords: Objective, neural tissue, neural tissue strains, Setting and Participants, tissue
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI)
Comments: 19 pages, 2 figures

Abstract:Objective. (1) To assess whether neural tissue structure and biomechanics could predict functional loss in glaucoma; (2) To evaluate the importance of biomechanics in making such predictions. Design, Setting and Participants. We recruited 238 glaucoma subjects. For one eye of each subject, we imaged the optic nerve head (ONH) using spectral-domain OCT under the following conditions: (1) primary gaze and (2) primary gaze with acute IOP elevation. Main Outcomes: We utilized automatic segmentation of optic nerve head (ONH) tissues and digital volume correlation (DVC) analysis to compute intraocular pressure (IOP)-induced neural tissue strains. A robust geometric deep learning approach, known as Point-Net, was employed to predict the full Humphrey 24-2 pattern standard deviation (PSD) maps from ONH structural and biomechanical information. For each point in each PSD map, we predicted whether it exhibited no defect or a PSD value of less than 5%. Predictive performance was evaluated using 5-fold cross-validation and the F1-score. We compared the model’s performance with and without the inclusion of IOP-induced strains to assess the impact of biomechanics on prediction accuracy. Results: Integrating biomechanical (IOP-induced neural tissue strains) and structural (tissue morphology and neural tissues thickness) information yielded a significantly better predictive model (F1-score: 0.76±0.02) across validation subjects, as opposed to relying only on structural information, which resulted in a significantly lower F1-score of 0.71±0.02 (p < 0.05). Conclusion: Our study has shown that the integration of biomechanical data can significantly improve the accuracy of visual field loss predictions. This highlights the importance of the biomechanics-function relationship in glaucoma, and suggests that biomechanics may serve as a crucial indicator for the development and progression of glaucoma.
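
The evaluation protocol named above, 5-fold cross-validation scored with F1, can be sketched as follows. This is a generic illustration: the contiguous fold assignment and the helper names are assumptions, and the paper may split subjects differently.

```python
def k_fold_indices(n, k=5):
    """Split range(n) into k contiguous folds; returns a list of (train, val) index lists."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        folds.append((train, val))
        start += size
    return folds

def f1_score(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

folds = k_fold_indices(238)  # 238 subjects, as in the abstract
example_f1 = f1_score([1, 1, 0, 0], [1, 0, 1, 0])
```

Each subject lands in exactly one validation fold, so the reported F1 is averaged over predictions made on held-out subjects only.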

[AI-117] Extraction of 3D trajectories of mandibular condyles from 2D real-time MRI

Link: https://arxiv.org/abs/2406.14925
Authors: Karyna Isaieva (IADI), Justine Leclère (IADI), Guillaume Paillart (IADI), Guillaume Drouot (CIC-IT), Jacques Felblinger (IADI, CIC-IT), Xavier Dubernard (CHU Reims), Pierre-André Vuissoz (IADI)
Keywords: mandibular condyles directly, real-time MRI, underwent real-time MRI, comprehensive examination, kinematic details
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI)

Abstract:Computing the trajectories of mandibular condyles directly from MRI could provide a comprehensive examination, allowing for the extraction of both anatomical and kinematic details. This study aimed to investigate the feasibility of extracting 3D condylar trajectories from 2D real-time MRI and to assess their precision. Twenty healthy subjects underwent real-time MRI while opening and closing their jaws. One axial and two sagittal slices were segmented using a U-Net-based algorithm. The centers of mass of the resulting masks were projected onto the coordinate system based on anatomical markers and temporally adjusted using a common projection. The quality of the computed trajectories was evaluated using metrics designed to estimate movement reproducibility, head motion, and slice placement symmetry. The segmentation of the axial slices demonstrated good-to-excellent quality; however, the segmentation of the sagittal slices required some fine-tuning. The movement reproducibility was acceptable for most cases; nevertheless, head motion displaced the trajectories by 1 mm on average. The difference in the superior-inferior coordinate of the condyles in the closed jaw position was 1.7 mm on average. Despite limitations in precision, real-time MRI enables the extraction of condylar trajectories with sufficient accuracy for evaluating clinically relevant parameters such as condyle displacement, trajectories aspect, and symmetry.
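
The step of reducing each segmentation mask to its center of mass can be illustrated with a small sketch. It assumes a binary 2D mask stored as nested lists and returns pixel coordinates. The projection onto an anatomical coordinate system described in the abstract is not reproduced here.

```python
def center_of_mass(mask):
    """mask: 2D list of 0/1 values. Returns the (row, col) centroid of the foreground pixels."""
    total = row_sum = col_sum = 0
    for r, row in enumerate(mask):
        for c, value in enumerate(row):
            if value:
                total += 1
                row_sum += r
                col_sum += c
    if total == 0:
        raise ValueError("mask has no foreground pixels")
    return row_sum / total, col_sum / total

# Made-up 3x4 mask with a 2x2 foreground block.
mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
]
centroid = center_of_mass(mask)
```

Applying this to the mask of every frame yields one point per frame, and the sequence of points forms the trajectory.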

[AI-118] Self-supervised Brain Lesion Generation for Effective Data Augmentation of Medical Images

Link: https://arxiv.org/abs/2406.14826
Authors: Jiayu Huo, Sebastien Ourselin, Rachel Sparks
Keywords: planning neurosurgical treatment, Accurate brain lesion, Accurate brain, brain lesion segmentation, neurosurgical treatment
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI)
Comments: 10 pages, 6 figures, 8 tables

Abstract:Accurate brain lesion delineation is important for planning neurosurgical treatment. Automatic brain lesion segmentation methods based on convolutional neural networks have demonstrated remarkable performance. However, neural network performance is constrained by the lack of large-scale well-annotated training datasets. In this manuscript, we propose a comprehensive framework to efficiently generate new, realistic samples for training a brain lesion segmentation model. We first train a lesion generator, based on an adversarial autoencoder, in a self-supervised manner. Next, we utilize a novel image composition algorithm, Soft Poisson Blending, to seamlessly combine synthetic lesions and brain images to obtain training samples. Finally, to effectively train the brain lesion segmentation model with augmented images we introduce a new prototype consistence regularization to align real and synthetic features. Our framework is validated by extensive experiments on two public brain lesion segmentation datasets: ATLAS v2.0 and Shift MS. Our method outperforms existing brain image data augmentation schemes. For instance, our method improves the Dice from 50.36% to 60.23% compared to the U-Net with conventional data augmentation techniques for the ATLAS v2.0 dataset.
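
The Dice score quoted above measures overlap between a predicted mask and the ground truth. A minimal sketch of the standard definition follows, with masks flattened to 0/1 sequences. The example masks are made up.

```python
def dice(pred, truth):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for two binary masks given as flat 0/1 sequences."""
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    size_sum = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Two empty masks agree perfectly by convention.
    return 1.0 if size_sum == 0 else 2.0 * intersection / size_sum

score = dice([1, 1, 0, 0], [1, 0, 1, 0])
```

On this scale, the reported change from 50.36% to 60.23% is a gain of 9.87 percentage points.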

[AI-119] An updated overview of radiomics-based artificial intelligence (AI) methods in breast cancer screening and diagnosis

Link: https://arxiv.org/abs/2406.14735
Authors: Reza Elahi, Mahdis Nazari
Keywords: positive predictive power, modest positive predictive, predictive power, modest positive, positive predictive
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI)

Abstract:Current imaging methods for diagnosing breast cancer (BC) are associated with limited sensitivity and specificity and modest positive predictive power. The recent progress in image analysis using artificial intelligence (AI) has created great promise to improve BC diagnosis and subtype differentiation. In this case, novel quantitative computational methods, such as radiomics, have been developed to improve the sensitivity and specificity of early BC diagnosis and classification. The potential of radiomics in improving the diagnostic efficacy of imaging studies has been shown in several studies. In this review article, we discuss the radiomics workflow and current hand-crafted radiomics methods in the diagnosis and classification of BC based on most recent studies on different imaging modalities, e.g. MRI, mammography, contrast-enhanced spectral mammography (CESM), ultrasound imaging, and digital breast tomosynthesis (DBT). We also discuss current challenges and potential strategies to improve the specificity and sensitivity of radiomics in breast cancer to help achieve a higher level of BC classification and diagnosis in the clinical setting. The growing field of AI incorporation with imaging information has opened a great opportunity to provide a higher level of care for BC patients.

[AI-120] Qiskit HumanEval: An Evaluation Benchmark For Quantum Code Generative Models

Link: https://arxiv.org/abs/2406.14712
Authors: Sanjay Vishwakarma, Francis Harkins, Siddharth Golecha, Vishal Sharathchandra Bajpe, Nicolas Dupuis, Luca Buratti, David Kremer, Ismael Faro, Ruchir Puri, Juan Cruz-Benito
Keywords: Software Development Kits, quantum Software Development, Development Kits, Software Development, Generative Artificial intelligence
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)

Abstract:Quantum programs are typically developed using quantum Software Development Kits (SDKs). The rapid advancement of quantum computing necessitates new tools to streamline this development process, and one such tool could be Generative Artificial Intelligence (GenAI). In this study, we introduce and use the Qiskit HumanEval dataset, a hand-curated collection of tasks designed to benchmark the ability of Large Language Models (LLMs) to produce quantum code using Qiskit - a quantum SDK. This dataset consists of more than 100 quantum computing tasks, each accompanied by a prompt, a canonical solution, a comprehensive test case, and a difficulty scale to evaluate the correctness of the generated solutions. We systematically assess the performance of a set of LLMs against the Qiskit HumanEval dataset’s tasks and focus on the models’ ability to produce executable quantum code. Our findings not only demonstrate the feasibility of using LLMs for generating quantum code but also establish a new benchmark for ongoing advancements in the field and encourage further exploration and development of GenAI-driven tools for quantum code generation.
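
A benchmark built from a prompt, a canonical solution, and a test case is typically scored by executing each generated solution against the task's test. The sketch below shows that pattern in plain Python, without Qiskit. The task dictionary, its field names, and the `check` convention are assumptions modeled on HumanEval-style datasets, not the actual Qiskit HumanEval schema.

```python
def passes(candidate_source, test_source, entry_point):
    """Run generated code, then run the task's unit test against it."""
    namespace = {}
    try:
        exec(candidate_source, namespace)           # defines the candidate function
        exec(test_source, namespace)                # defines check(candidate)
        namespace["check"](namespace[entry_point])  # raises if the test fails
        return True
    except Exception:
        return False

# Hypothetical toy task; a real task would ask for quantum code.
task = {
    "entry_point": "add",
    "test": "def check(candidate):\n    assert candidate(2, 3) == 5\n",
}
good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b):\n    return a - b\n"

results = [passes(src, task["test"], task["entry_point"]) for src in (good, bad)]
pass_rate = sum(results) / len(results)
```

Calling `exec` on model output is unsafe outside a sandbox, so a real harness would isolate each run in a separate process with time and resource limits.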

[AI-121] Attention Networks for Personalized Mealtime Insulin Dosing in People with Type 1 Diabetes

Link: https://arxiv.org/abs/2406.14579
Authors: Anas El Fathi, Elliott Pryor, Marc D. Breton
Keywords: Calculating mealtime insulin, Calculating mealtime, mealtime insulin doses, insulin doses poses, individuals with Type
Categories: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI)
Comments: 6 pages, 4 figures, Biological and Medical Systems - 12th BMS 2024 - IFAC

Abstract:Calculating mealtime insulin doses poses a significant challenge for individuals with Type 1 Diabetes (T1D). Doses should perfectly compensate for expected post-meal glucose excursions, requiring a profound understanding of the individual’s insulin sensitivity and the meal’s macronutrients. Usually, people rely on intuition and experience to develop this understanding. In this work, we demonstrate how a reinforcement learning agent, employing a self-attention encoder network, can effectively mimic and enhance this intuitive process. Trained on 80 virtual subjects from the FDA-approved UVA/Padova T1D adult cohort and tested on twenty, self-attention demonstrates superior performance compared to other network architectures. Results reveal a significant reduction in glycemic risk, from 16.5 to 9.6 in scenarios using sensor-augmented pump and from 9.1 to 6.7 in scenarios using automated insulin delivery. This new paradigm bypasses conventional therapy parameters, offering the potential to simplify treatment and promising improved quality of life and glycemic outcomes for people with T1D.

[AI-122] Bioptic – A Target-Agnostic Efficacy-Based Small Molecules Search Engine

Link: https://arxiv.org/abs/2406.14572
Authors: Vlad Vinogradov, Ivan Izmailov, Simon Steshin, Kong T. Nguyen
Keywords: Recent successes, extensive chemical libraries, successes in virtual, virtual screening, extensive chemical
Subjects: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)


Abstract: Recent successes in virtual screening have been made possible by large models and extensive chemical libraries. However, combining these elements is challenging: the larger the model, the more expensive it is to run, making ultra-large libraries infeasible. To address this, we developed a target-agnostic, efficacy-based molecule search model, which allows us to find structurally dissimilar molecules with similar biological activities. We used best practices to design a fast retrieval system, based on processor-optimized SIMD instructions, enabling us to screen the ultra-large 40B Enamine REAL library with a 100% recall rate. We extensively benchmarked our model and several state-of-the-art models for both speed performance and retrieval quality of novel molecules.

[AI-123] Deep-Learning Approach for Tissue Classification using Acoustic Waves during Ablation with an Er:YAG Laser (Updated)

Link: https://arxiv.org/abs/2406.14570
Authors: Carlo Seppi, Philippe C. Cattin
Keywords: Today mechanical tools, Today mechanical, Neural Network, Convolutional Neural Network, Fully-connected Neural Network
Subjects: Medical Physics (physics.med-ph); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV); Tissues and Organs (q-bio.TO)
Comments: This paper is an updated version of "Deep-Learning Approach for Tissue Classification using Acoustic Waves during Ablation with an Er:YAG Laser", originally published in DOI: https://doi.org/10.1109/ACCESS.2021.3113055. This update addresses several issues and incorporates corrections as outlined in DOI: https://doi.org/10.1109/ACCESS.2024.3395071. We provide here a detailed description of our experiments and the new models we used.


Abstract: Today’s mechanical tools for bone cutting (osteotomy) cause mechanical trauma that prolongs the healing process. Medical device manufacturers aim to minimize this trauma, with minimally invasive surgery using laser cutting as one innovation. This method ablates tissue using laser light instead of mechanical tools, reducing post-surgery healing time. A reliable feedback system is crucial during laser surgery to prevent damage to surrounding tissues. We propose a tissue classification method analyzing acoustic waves generated during laser ablation, demonstrating its applicability in an ex-vivo experiment. The ablation process with a microsecond pulsed Er:YAG laser produces acoustic waves, acquired with an air-coupled transducer. These waves were used to classify five porcine tissue types: hard bone, soft bone, muscle, fat, and skin. For automated tissue classification, we compared five Neural Network (NN) approaches: a one-dimensional Convolutional Neural Network (CNN) with time-dependent input, a Fully-connected Neural Network (FcNN) with either the frequency spectrum or principal components of the frequency spectrum as input, and a combination of a CNN and an FcNN with time-dependent data and its frequency spectrum as input. Consecutive acoustic waves were used to improve classification accuracy. Grad-CAM identified the activation map of the frequencies, showing low frequencies as the most important for this task. Our results indicated that combining time-dependent data with its frequency spectrum achieved the highest classification accuracy (65.5%-75.5%). We also found that using the frequency spectrum alone was sufficient, with no additional benefit from applying Principal Component Analysis (PCA).

Attachment Download

Click to download today's full paper list