This post presents the latest paper list retrieved from Arxiv.org on 2024-08-12. It is updated automatically and grouped into five broad areas: NLP, CV, ML, AI, and IR. If you would like to receive it by email on a schedule, please leave your email address in the comments.

Note: the daily paper data is retrieved from Arxiv.org and updated automatically around 10:30 every morning.

Friendly reminder: if you would like to receive the daily paper data by email, please leave your email address in the comments; emails are likewise sent automatically around 10:30 every day.

Contents

Overview (2024-08-12)

314 papers are updated today, including:

  • 73 in Natural Language Processing (Computation and Language (cs.CL))
  • 92 in Artificial Intelligence (cs.AI)
  • 57 in Computer Vision and Pattern Recognition (cs.CV)
  • 90 in Machine Learning (cs.LG)

Natural Language Processing

[NLP-0] Preserving Privacy in Large Language Models : A Survey on Current Threats and Solutions

Link: https://arxiv.org/abs/2408.05212
Authors: Michele Miranda,Elena Sofia Ruzzetti,Andrea Santilli,Fabio Massimo Zanzotto,Sébastien Bratières,Emanuele Rodolà
Keywords: Large Language Models, Large Language, Language Models, represent a significant, artificial intelligence
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: GitHub repository: this https URL

Abstract:Large Language Models (LLMs) represent a significant advancement in artificial intelligence, finding applications across various domains. However, their reliance on massive internet-sourced datasets for training brings notable privacy issues, which are exacerbated in critical domains (e.g., healthcare). Moreover, certain application-specific scenarios may require fine-tuning these models on private data. This survey critically examines the privacy threats associated with LLMs, emphasizing the potential for these models to memorize and inadvertently reveal sensitive information. We explore current threats by reviewing privacy attacks on LLMs and propose comprehensive solutions for integrating privacy mechanisms throughout the entire learning pipeline. These solutions range from anonymizing training datasets to implementing differential privacy during training or inference and machine unlearning after training. Our comprehensive review of existing literature highlights ongoing challenges, available tools, and future directions for preserving privacy in LLMs. This work aims to guide the development of more secure and trustworthy AI systems by providing a thorough understanding of privacy preservation methods and their effectiveness in mitigating risks.
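As a concrete illustration of one of the surveyed mechanisms, differential privacy during training can be sketched as a DP-SGD-style update: clip each example's gradient, average, and add calibrated Gaussian noise. This is a minimal stdlib sketch under assumed hyperparameters (`clip_norm`, `noise_multiplier`), not code from the survey:

```python
import math
import random

def clip(grad, clip_norm):
    """Scale a per-example gradient so its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / (norm + 1e-12))
    return [g * scale for g in grad]

def dp_sgd_update(per_sample_grads, clip_norm=1.0, noise_multiplier=1.0, seed=0):
    """DP-SGD-style private gradient: clip each example's gradient, average,
    then add Gaussian noise scaled to the clipping bound. Hyperparameter
    names and values are illustrative, not taken from the survey."""
    rng = random.Random(seed)
    clipped = [clip(g, clip_norm) for g in per_sample_grads]
    n = len(clipped)
    mean = [sum(col) / n for col in zip(*clipped)]
    sigma = noise_multiplier * clip_norm / n
    return [m + rng.gauss(0.0, sigma) for m in mean]

# A gradient of norm 5 is scaled down to norm 1 before averaging.
noisy = dp_sgd_update([[3.0, 4.0], [0.0, 0.5]])
```

The clipping bound is what makes the noise scale meaningful: without it, one outlier example could dominate the update and no finite noise level would hide its contribution.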

[NLP-1] VITA: Towards Open-Source Interactive Omni Multimodal LLM

Link: https://arxiv.org/abs/2408.05211
Authors: Chaoyou Fu,Haojia Lin,Zuwei Long,Yunhang Shen,Meng Zhao,Yifan Zhang,Xiong Wang,Di Yin,Long Ma,Xiawu Zheng,Ran He,Rongrong Ji,Yunsheng Wu,Caifeng Shan,Xing Sun
Keywords: models rarely excel, Large Language Model, open-source models rarely, underscore their necessity, practical applications
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Project Page: this https URL

Abstract:The remarkable multimodal capabilities and interactive experience of GPT-4o underscore their necessity in practical applications, yet open-source models rarely excel in both areas. In this paper, we introduce VITA, the first-ever open-source Multimodal Large Language Model (MLLM) adept at simultaneous processing and analysis of Video, Image, Text, and Audio modalities, and meanwhile has an advanced multimodal interactive experience. Starting from Mixtral 8x7B as a language foundation, we expand its Chinese vocabulary followed by bilingual instruction tuning. We further endow the language model with visual and audio capabilities through two-stage multi-task learning of multimodal alignment and instruction tuning. VITA demonstrates robust foundational capabilities of multilingual, vision, and audio understanding, as evidenced by its strong performance across a range of both unimodal and multimodal benchmarks. Beyond foundational capabilities, we have made considerable progress in enhancing the natural multimodal human-computer interaction experience. To the best of our knowledge, we are the first to exploit non-awakening interaction and audio interrupt in MLLM. VITA is the first step for the open-source community to explore the seamless integration of multimodal understanding and interaction. While there is still lots of work to be done on VITA to get close to closed-source counterparts, we hope that its role as a pioneer can serve as a cornerstone for subsequent research. Project Page: this https URL.

[NLP-2] Evaluating the capability of large language models to personalize science texts for diverse middle-school-age learners

Link: https://arxiv.org/abs/2408.05204
Authors: Michael Vaccaro Jr,Mikayla Friday,Arash Zaghi
Keywords: Large language models, including OpenAI GPT-series, Large language, language models, including OpenAI
Subjects: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: 20 pages, 3 figures

Abstract:Large language models (LLMs), including OpenAI’s GPT-series, have made significant advancements in recent years. Known for their expertise across diverse subject areas and quick adaptability to user-provided prompts, LLMs hold unique potential as Personalized Learning (PL) tools. Despite this potential, their application in K-12 education remains largely unexplored. This paper presents one of the first randomized controlled trials (n = 23) to evaluate the effectiveness of GPT-4 in personalizing educational science texts for middle school students. In this study, GPT-4 was used to profile student learning preferences based on choices made during a training session. For the experimental group, GPT-4 was used to rewrite science texts to align with the student’s predicted profile while, for students in the control group, texts were rewritten to contradict their learning preferences. The results of a Mann-Whitney U test showed that students significantly preferred (at the .10 level) the rewritten texts when they were aligned with their profile (p = .059). These findings suggest that GPT-4 can effectively interpret and tailor educational content to diverse learner preferences, marking a significant advancement in PL technology. The limitations of this study and ethical considerations for using artificial intelligence in education are also discussed.
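The paper's headline result rests on a Mann-Whitney U test. A minimal stdlib sketch of the statistic and a one-sided normal-approximation p-value (an illustration only, not the authors' analysis code; for samples as small as n = 23 an exact test is preferable) looks like:

```python
import math

def mann_whitney_u(x, y):
    """U statistic: count of (x_i, y_j) pairs with x_i > y_j; ties count 0.5."""
    return sum(1.0 if xi > yj else 0.5 if xi == yj else 0.0
               for xi in x for yj in y)

def one_sided_p(x, y):
    """Normal approximation to P(U >= observed), without tie correction."""
    n1, n2 = len(x), len(y)
    u = mann_whitney_u(x, y)
    mu = n1 * n2 / 2.0
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mu) / sigma
    return 0.5 * (1.0 - math.erf(z / math.sqrt(2.0)))
```

With every value in one group exceeding every value in the other, U hits its maximum n1·n2 and the one-sided p-value becomes small, which is the pattern behind a "significant at the .10 level" preference result.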

[NLP-3] TaSL: Task Skill Localization and Consolidation for Language Model Continual Learning ACL2024

Link: https://arxiv.org/abs/2408.05200
Authors: Yujie Feng,Xu Chu,Yongxin Xu,Zexin Lu,Bo Liu,Philip S. Yu,Xiao-Ming Wu
Keywords: recently garnered significant, garnered significant interest, significant interest due, dynamic real-world environments, adapt large language
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Extension of ACL 2024 paper titled: Continual Dialog State Tracking via Task Skill Localization and Consolidation

Abstract:Language model continual learning (CL) has recently garnered significant interest due to its potential to adapt large language models (LLMs) to dynamic real-world environments without re-training. A key challenge in this field is catastrophic forgetting, where models lose previously acquired knowledge when learning new tasks. Existing methods commonly employ multiple parameter-efficient fine-tuning (PEFT) blocks to acquire task-specific knowledge for each task, but these approaches lack efficiency and overlook the potential for knowledge transfer through task interaction. In this paper, we present a novel CL framework for language models called Task Skill Localization and Consolidation (TaSL), which enhances knowledge transfer without relying on memory replay. TaSL first divides the model into `skill units’ based on parameter dependencies, enabling more granular control. It then employs a novel group-wise skill localization technique to identify the importance distribution of skill units for a new task. By comparing this importance distribution with those from previous tasks, we implement a fine-grained skill consolidation strategy that retains task-specific knowledge, thereby preventing forgetting, and updates task-shared knowledge, which facilitates bi-directional knowledge transfer. As a result, TaSL achieves a superior balance between retaining previous knowledge and excelling in new tasks. TaSL also shows strong generalizability, suitable for general models and customizable for PEFT methods like LoRA. Additionally, it demonstrates notable extensibility, allowing integration with memory replay to further enhance performance. Extensive experiments on two CL benchmarks, with varying model sizes (from 220M to 7B), demonstrate the effectiveness of TaSL and its variants across different settings.
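A rough sketch of the group-wise localization and consolidation idea. The grouping of parameters into skill units, the mean-|gradient| importance score, and the threshold `tau` below are illustrative assumptions, not TaSL's published procedure:

```python
def importance_distribution(unit_grads):
    """Score each 'skill unit' by mean |gradient| over its parameters,
    then normalize into a distribution over units."""
    scores = {name: sum(abs(g) for g in grads) / len(grads)
              for name, grads in unit_grads.items()}
    total = sum(scores.values()) or 1.0
    return {name: s / total for name, s in scores.items()}

def consolidate(old_params, new_params, old_imp, new_imp, tau=0.2):
    """Fine-grained consolidation sketch: keep old values for units important
    only to previous tasks (prevent forgetting), take new values for units
    important only to the new task, and average the shared ones."""
    merged = {}
    for name in old_params:
        if old_imp[name] >= tau and new_imp[name] < tau:
            merged[name] = old_params[name]          # task-specific: retain
        elif new_imp[name] >= tau and old_imp[name] < tau:
            merged[name] = new_params[name]          # new-task skill: adopt
        else:
            merged[name] = [(a + b) / 2              # shared: blend
                            for a, b in zip(old_params[name], new_params[name])]
    return merged
```

The point of comparing importance distributions across tasks, rather than per-parameter magnitudes, is that whole units can be frozen or transferred, giving the bi-directional knowledge transfer the abstract describes.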

[NLP-4] Separating Style from Substance: Enhancing Cross-Genre Authorship Attribution through Data Selection and Presentation

Link: https://arxiv.org/abs/2408.05192
Authors: Steven Fincke,Elizabeth Boschee
Keywords: documents are written, task of deciding, author is challenging, documents, machines and humans
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:The task of deciding whether two documents are written by the same author is challenging for both machines and humans. This task is even more challenging when the two documents are written about different topics (e.g. baseball vs. politics) or in different genres (e.g. a blog post vs. an academic article). For machines, the problem is complicated by the relative lack of real-world training examples that cross the topic boundary and the vanishing scarcity of cross-genre data. We propose targeted methods for training data selection and a novel learning curriculum that are designed to discourage a model’s reliance on topic information for authorship attribution and correspondingly force it to incorporate information more robustly indicative of style no matter the topic. These refinements yield a 62.7% relative improvement in average cross-genre authorship attribution, as well as 16.6% in the per-genre condition.

[NLP-5] Deep-change at AXOLOTL-24: Orchestrating WSD and WSI Models for Semantic Change Modeling

Link: https://arxiv.org/abs/2408.05184
Authors: Denis Kokosinskii,Mikhail Kuklin,Nikolay Arefyev
Keywords: Semantic Change Modeling, Change Modeling, Semantic Change, paper describes, describes our solution
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:This paper describes our solution of the first subtask from the AXOLOTL-24 shared task on Semantic Change Modeling. The goal of this subtask is to distribute a given set of usages of a polysemous word from a newer time period between senses of this word from an older time period and clusters representing gained senses of this word. We propose and experiment with three new methods solving this task. Our methods achieve SOTA results according to both official metrics of the first substask. Additionally, we develop a model that can tell if a given word usage is not described by any of the provided sense definitions. This model serves as a component in one of our methods, but can potentially be useful on its own.
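The first subtask — distributing new-period usages between old senses and gained-sense clusters — can be sketched as nearest-sense assignment over embeddings, with a similarity threshold deciding when a usage is a candidate gained sense. The thresholding rule here is an assumption for illustration, not the authors' exact method:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def assign_usage(usage_vec, old_sense_vecs, threshold=0.5):
    """Assign a new-period usage embedding to the most similar old sense,
    or return None to mark it as a candidate 'gained sense'."""
    sims = {sid: cosine(usage_vec, v) for sid, v in old_sense_vecs.items()}
    best = max(sims, key=sims.get)
    return best if sims[best] >= threshold else None
```

Usages that fall below the threshold for every old sense would then be clustered among themselves (the WSI side of the pipeline) to form the gained-sense groups.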

[NLP-6] Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2

Link: https://arxiv.org/abs/2408.05147
Authors: Tom Lieberum,Senthooran Rajamanoharan,Arthur Conmy,Lewis Smith,Nicolas Sonnerat,Vikrant Varma,János Kramár,Anca Dragan,Rohin Shah,Neel Nanda
Keywords: seemingly interpretable features, neural network latent, network latent representations, Sparse autoencoders, sparse decomposition
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 12 main text pages, and 14 pages of acknowledgements, references and appendices

Abstract:Sparse autoencoders (SAEs) are an unsupervised method for learning a sparse decomposition of a neural network’s latent representations into seemingly interpretable features. Despite recent excitement about their potential, research applications outside of industry are limited by the high cost of training a comprehensive suite of SAEs. In this work, we introduce Gemma Scope, an open suite of JumpReLU SAEs trained on all layers and sub-layers of Gemma 2 2B and 9B and select layers of Gemma 2 27B base models. We primarily train SAEs on the Gemma 2 pre-trained models, but additionally release SAEs trained on instruction-tuned Gemma 2 9B for comparison. We evaluate the quality of each SAE on standard metrics and release these results. We hope that by releasing these SAE weights, we can help make more ambitious safety and interpretability research easier for the community. Weights and a tutorial can be found at this https URL and an interactive demo can be found at this https URL
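A JumpReLU SAE differs from a standard ReLU SAE only in its activation: a pre-activation must clear a learned threshold θ to pass at all. A hand-sized forward-pass sketch (the weights below are toy values; Gemma Scope's SAEs are learned, and this omits the straight-through gradient trick used to train θ):

```python
def jumprelu(z, theta):
    """JumpReLU: keep pre-activations only where they exceed the threshold."""
    return [zi if zi > theta else 0.0 for zi in z]

def matvec(rows, v):
    return [sum(w * x for w, x in zip(row, v)) for row in rows]

def sae_forward(x, w_enc, b_enc, w_dec, b_dec, theta):
    """SAE forward pass: encode into sparse features, linearly reconstruct."""
    pre = [p + b for p, b in zip(matvec(w_enc, x), b_enc)]
    f = jumprelu(pre, theta)                              # sparse features
    x_hat = [r + b for r, b in zip(matvec(w_dec, f), b_dec)]
    return f, x_hat

# Toy 2-d activation, 3 dictionary features, threshold 0.5.
f, x_hat = sae_forward(
    x=[1.0, -1.0],
    w_enc=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    b_enc=[0.0, 0.0, 0.0],
    w_dec=[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    b_dec=[0.0, 0.0],
    theta=0.5,
)
```

The hard threshold is what keeps weakly-activating features exactly at zero, which is the sparsity property interpretability work relies on.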

[NLP-7] A Hybrid RAG System with Comprehensive Enhancement on Complex Reasoning KDD

Link: https://arxiv.org/abs/2408.05141
Authors: Ye Yuan,Chengwu Liu,Jingyang Yuan,Gongbo Sun,Siqi Li,Ming Zhang
Keywords: external knowledge bases, framework enabling large, enabling large language, integrating external knowledge, Retrieval-augmented generation
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: Technical report for 3rd prize in Task 1 of Meta CRAG KDD Cup 2024

Abstract:Retrieval-augmented generation (RAG) is a framework enabling large language models (LLMs) to enhance their accuracy and reduce hallucinations by integrating external knowledge bases. In this paper, we introduce a hybrid RAG system enhanced through a comprehensive suite of optimizations that significantly improve retrieval quality, augment reasoning capabilities, and refine numerical computation ability. We refined the text chunks and tables in web pages, added attribute predictors to reduce hallucinations, conducted LLM Knowledge Extractor and Knowledge Graph Extractor, and finally built a reasoning strategy with all the references. We evaluated our system on the CRAG dataset through the Meta CRAG KDD Cup 2024 Competition. Both the local and online evaluations demonstrate that our system significantly enhances complex reasoning capabilities. In local evaluations, we have significantly improved accuracy and reduced error rates compared to the baseline model, achieving a notable increase in scores. In the meanwhile, we have attained outstanding results in online assessments, demonstrating the performance and generalization capabilities of the proposed system. The source code for our system is released at this https URL.
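The retrieve-then-generate skeleton behind any such system can be sketched with a toy keyword-overlap retriever standing in for the paper's hybrid retrieval stack; the scoring rule and prompt format below are illustrative assumptions, not the competition system:

```python
def retrieve(query, chunks, k=2):
    """Toy retriever: rank chunks by word overlap with the query. A stand-in
    for the dense/sparse hybrid retrieval a real RAG system would use."""
    q = set(query.lower().split())
    return sorted(chunks,
                  key=lambda c: len(q & set(c.lower().split())),
                  reverse=True)[:k]

def build_prompt(query, chunks, k=2):
    """Pack the retrieved references into a grounded prompt for the generator."""
    refs = "\n".join(f"[{i + 1}] {c}"
                     for i, c in enumerate(retrieve(query, chunks, k)))
    return f"Answer using only the references below.\n{refs}\nQuestion: {query}"
```

Everything the paper adds — chunk refinement, attribute predictors, knowledge-graph extraction, the reasoning strategy — slots in between these two steps: better chunks going in, and a more structured use of the references coming out.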

[NLP-8] How Well Do LLMs Identify Cultural Unity in Diversity?

Link: https://arxiv.org/abs/2408.05102
Authors: Jialin Li,Junli Wang,Junjie Hu,Ming Jiang
Keywords: large language models, awareness of large, large language, models’ sensitivity, United States plays
Subjects: Computation and Language (cs.CL)
Comments: COLM 2024

Abstract:Much work on the cultural awareness of large language models (LLMs) focuses on the models’ sensitivity to geo-cultural diversity. However, in addition to cross-cultural differences, there also exists common ground across cultures. For instance, a bridal veil in the United States plays a similar cultural-relevant role as a honggaitou in China. In this study, we introduce a benchmark dataset CUNIT for evaluating decoder-only LLMs in understanding the cultural unity of concepts. Specifically, CUNIT consists of 1,425 evaluation examples building upon 285 traditional cultural-specific concepts across 10 countries. Based on a systematic manual annotation of cultural-relevant features per concept, we calculate the cultural association between any pair of cross-cultural concepts. Built upon this dataset, we design a contrastive matching task to evaluate the LLMs’ capability to identify highly associated cross-cultural concept pairs. We evaluate 3 strong LLMs, using 3 popular prompting strategies, under the settings of either giving all extracted concept features or no features at all on CUNIT. Interestingly, we find that cultural associations across countries regarding clothing concepts largely differ from food. Our analysis shows that LLMs are still limited to capturing cross-cultural associations between concepts compared to humans. Moreover, geo-cultural proximity shows a weak influence on model performance in capturing cross-cultural associations.

[NLP-9] MooER: LLM-based Speech Recognition and Translation Models from Moore Threads

Link: https://arxiv.org/abs/2408.05101
Authors: Junhao Xu,Zhenlin Liang,Yi Liu,Yichao Hu,Jian Li,Yajun Zheng,Meng Cai,Hua Wang
Keywords: Moore Threads, LLM-based large-scale automatic, automatic speech recognition, automatic speech translation, large-scale automatic speech
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:In this paper, we present MooER, a LLM-based large-scale automatic speech recognition (ASR) / automatic speech translation (AST) model of Moore Threads. A 5000h pseudo labeled dataset containing open source and self collected speech data is used for training. We achieve performance comparable to other open source models trained with up to hundreds of thousands of hours of labeled speech data. Meanwhile, experiments conducted on Covost2 Zh2en testset suggest that our model outperforms other open source Speech LLMs. A BLEU score of 25.2 can be obtained. The main contributions of this paper are summarized as follows. First, this paper presents a training strategy for encoders and LLMs on speech related tasks (including ASR and AST) using a small size of pseudo labeled data without any extra manual annotation and selection. Second, we release our ASR and AST models and plan to open-source our training code and strategy in the near future. Moreover, a model trained on 8wh scale training data is planned to be released later on.
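The reported 25.2 BLEU is the standard n-gram-overlap translation metric. A compact single-reference sentence-BLEU sketch (uniform weights, no smoothing — so it is stricter than typical toolkit defaults, which also apply their own tokenization):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Single-reference sentence BLEU: geometric mean of clipped n-gram
    precisions times a brevity penalty. Any missing n-gram order zeroes
    the score because smoothing is omitted here."""
    c, r = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = Counter(ngrams(c, n)), Counter(ngrams(r, n))
        overlap = sum(min(cnt, ref[g]) for g, cnt in cand.items())
        precisions.append(overlap / max(1, sum(cand.values())))
    if min(precisions) == 0:
        return 0.0
    bp = 1.0 if len(c) > len(r) else math.exp(1 - len(r) / len(c))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

Published scores like the 25.2 above are corpus-level (precisions pooled over the whole test set) rather than sentence-level, but the clipped-precision and brevity-penalty mechanics are the same.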

[NLP-10] Unlocking Decoding-time Controllability: Gradient-Free Multi-Objective Alignment with Contrastive Prompts

Link: https://arxiv.org/abs/2408.05094
Authors: Tingchen Fu,Yupeng Hou,Julian McAuley,Rui Yan
Keywords: large language models, multi-objective alignment aims, harmlessness and honesty, alignment objectives, Multi-objective Contrastive Alignment
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:The task of multi-objective alignment aims at balancing and controlling the different alignment objectives (e.g., helpfulness, harmlessness and honesty) of large language models to meet the personalized requirements of different users. However, previous methods tend to train multiple models to deal with various user preferences, with the number of trained models growing linearly with the number of alignment objectives and the number of different preferences. Meanwhile, existing methods are generally poor in extensibility and require significant re-training for each new alignment objective considered. Considering the limitation of previous approaches, we propose MCA (Multi-objective Contrastive Alignment), which constructs an expert prompt and an adversarial prompt for each objective to contrast at the decoding time and balances the objectives through combining the contrast. Our approach is verified to be superior to previous methods in obtaining a well-distributed Pareto front among different alignment objectives.
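The core decoding-time idea — contrast an expert prompt against an adversarial prompt per objective, then balance objectives with user-chosen weights — can be sketched on raw next-token logits. The exact combination rule below is an assumed form for illustration, not necessarily MCA's:

```python
def contrast(expert_logits, adversarial_logits, alpha=1.0):
    """Per-objective contrast: push toward the expert prompt's distribution
    and away from the adversarial prompt's, scaled by alpha."""
    return [e + alpha * (e - a)
            for e, a in zip(expert_logits, adversarial_logits)]

def combine(per_objective_logits, weights):
    """Balance objectives (e.g., helpfulness vs. harmlessness) by a weighted
    sum of their contrasted logits before sampling the next token."""
    dim = len(per_objective_logits[0])
    return [sum(w * logits[i] for w, logits in zip(weights, per_objective_logits))
            for i in range(dim)]
```

Because the balancing happens per decoding step with no gradients, new user preferences only change `weights`, which is what makes the method training-free and extensible.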

[NLP-11] Order Matters in Hallucination: Reasoning Order as Benchmark and Reflexive Prompting for Large-Language-Models AAAI25

Link: https://arxiv.org/abs/2408.05093
Authors: Zikai Xie
Keywords: Large language models, Large language, generated significant attention, finding applications, industrial domains
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 7 pages, submitted to AAAI25

Abstract:Large language models (LLMs) have generated significant attention since their inception, finding applications across various academic and industrial domains. However, these models often suffer from the “hallucination problem”, where outputs, though grammatically and logically coherent, lack factual accuracy or are entirely fabricated. A particularly troubling issue discovered and widely discussed recently is the numerical comparison error where multiple LLMs incorrectly infer that “9.11 > 9.9”. We discovered that the order in which LLMs generate answers and reasoning impacts their consistency. Specifically, results vary significantly when an LLM generates an answer first and then provides the reasoning versus generating the reasoning process first and then the conclusion. Inspired by this, we propose a new benchmark method for assessing LLM consistency: comparing responses generated through these two different approaches. This benchmark effectively identifies instances where LLMs fabricate answers and subsequently generate justifications. Furthermore, we introduce a novel and straightforward prompt strategy designed to mitigate this issue. Experimental results demonstrate that this strategy improves performance across various LLMs compared to direct questioning. This work not only sheds light on a critical flaw in LLMs but also offers a practical solution to enhance their reliability.
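The proposed benchmark reduces to querying a model twice — answer-first and reasoning-first — and flagging divergent final answers. A sketch, assuming responses expose an `Answer: <value>` line (a convention invented for this illustration, not the paper's parsing code):

```python
import re

def extract_answer(response):
    """Pull a final answer of the form 'Answer: <token>' out of a response."""
    m = re.search(r"Answer:\s*(\S+)", response)
    return m.group(1) if m else None

def order_consistent(answer_first_resp, reasoning_first_resp):
    """The benchmark idea: the same question asked twice, once answer-first
    and once reasoning-first; divergent final answers flag a likely
    fabricated justification."""
    a = extract_answer(answer_first_resp)
    b = extract_answer(reasoning_first_resp)
    return a is not None and a == b
```

Inconsistent pairs are exactly the cases the paper highlights: the model commits to an answer first and then invents a rationalization, rather than letting the reasoning determine the conclusion.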

[NLP-12] Generating novel experimental hypotheses from language models: A case study on cross-dative generalization

Link: https://arxiv.org/abs/2408.05086
Authors: Kanishka Misra,Najoung Kim
Keywords: Neural network language, complex linguistic knowledge, successfully capture complex, capture complex linguistic, network language models
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Neural network language models (LMs) have been shown to successfully capture complex linguistic knowledge. However, their utility for understanding language acquisition is still debated. We contribute to this debate by presenting a case study where we use LMs as simulated learners to derive novel experimental hypotheses to be tested with humans. We apply this paradigm to study cross-dative generalization (CDG): productive generalization of novel verbs across dative constructions (she pilked me the ball/she pilked the ball to me) – acquisition of which is known to involve a large space of contextual features – using LMs trained on child-directed speech. We specifically ask: “what properties of the training exposure facilitate a novel verb’s generalization to the (unmodeled) alternate construction?” To answer this, we systematically vary the exposure context in which a novel dative verb occurs in terms of the properties of the theme and recipient, and then analyze the LMs’ usage of the novel verb in the unmodeled dative construction. We find LMs to replicate known patterns of children’s CDG, as a precondition to exploring novel hypotheses. Subsequent simulations reveal a nuanced role of the features of the novel verbs’ exposure context on the LMs’ CDG. We find CDG to be facilitated when the first postverbal argument of the exposure context is pronominal, definite, short, and conforms to the prototypical animacy expectations of the exposure dative. These patterns are characteristic of harmonic alignment in datives, where the argument with features ranking higher on the discourse prominence scale tends to precede the other. This gives rise to a novel hypothesis that CDG is facilitated insofar as the features of the exposure context – in particular, its first postverbal argument – are harmonically aligned. We conclude by proposing future experiments that can test this hypothesis in children.

[NLP-13] RT-Surv: Improving Mortality Prediction After Radiotherapy with Large Language Model Structuring of Large-Scale Unstructured Electronic Health Records

Link: https://arxiv.org/abs/2408.05074
Authors: Sangjoon Park,Chan Woo Wee,Seo Hee Choi,Kyung Hwan Kim,Jee Suk Chang,Hong In Yoon,Ik Jae Lee,Yong Bae Kim,Jaeho Cho,Ki Chang Keum,Chang Geol Lee,Hwa Kyung Byun,Woong Sub Koom
Keywords: Accurate patient selection, prevent ineffective treatments, Accurate patient, unstructured EHR data, unstructured EHR
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 23 pages, 2 tables, 4 figures

Abstract:Accurate patient selection is critical in radiotherapy (RT) to prevent ineffective treatments. Traditional survival prediction models, relying on structured data, often lack precision. This study explores the potential of large language models (LLMs) to structure unstructured electronic health record (EHR) data, thereby improving survival prediction accuracy through comprehensive clinical information integration. Data from 34,276 patients treated with RT at Yonsei Cancer Center between 2013 and 2023 were analyzed, encompassing both structured and unstructured data. An open-source LLM was used to structure the unstructured EHR data via single-shot learning, with its performance compared against a domain-specific medical LLM and a smaller variant. Survival prediction models were developed using statistical, machine learning, and deep learning approaches, incorporating both structured and LLM-structured data. Clinical experts evaluated the accuracy of the LLM-structured data. The open-source LLM achieved 87.5% accuracy in structuring unstructured EHR data without additional training, significantly outperforming the domain-specific medical LLM, which reached only 35.8% accuracy. Larger LLMs were more effective, particularly in extracting clinically relevant features like general condition and disease extent, which closely correlated with patient survival. Incorporating LLM-structured clinical features into survival prediction models significantly improved accuracy, with the C-index of deep learning models increasing from 0.737 to 0.820. These models also became more interpretable by emphasizing clinically significant factors. This study shows that general-domain LLMs, even without specific medical training, can effectively structure large-scale unstructured EHR data, substantially enhancing the accuracy and interpretability of clinical predictive models.
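The C-index reported above (0.737 → 0.820) measures how often, among comparable patient pairs, the model assigns the higher risk score to the patient observed to fail earlier. A pairwise O(n²) sketch of the metric (an illustration; survival libraries use faster, tie-corrected implementations):

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C-index sketch: over comparable pairs (the earlier subject's
    failure was observed, i.e. not censored), count the fraction where that
    subject received the higher risk score; tied scores count half."""
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] and times[i] < times[j]:  # i observed to fail before j
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    return concordant / comparable if comparable else float("nan")
```

A value of 0.5 means the risk scores order patients no better than chance, and 1.0 means every comparable pair is ranked correctly, so a move from 0.737 to 0.820 is a substantial gain in discriminative ability.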

[NLP-14] Examining the Behavior of LLM Architectures Within the Framework of Standardized National Exams in Brazil AAAI
[NLP-14] 在巴西标准化国家考试框架内审查LLM架构的行为

链接: https://arxiv.org/abs/2408.05035
作者: Marcelo Sartori Locatelli,Matheus Prado Miranda,Igor Joaquim da Silva Costa,Matheus Torres Prates,Victor Thomé,Mateus Zaparoli Monteiro,Tomas Lacerda,Adriana Pagano,Eduardo Rios Neto,Wagner Meira Jr.,Virgilio Almeida
关键词-EN: Ensino Médio, Exame Nacional, Nacional do Ensino, universities in Brazil, Brazilian Portuguese tests
关键词-ZN: Ensino Médio、Exame Nacional、Nacional do Ensino、巴西大学、巴西葡萄牙语测试
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: Accepted at the Seventh AAAI/ACM Conference on AI, Ethics and Society (AIES 2024). 14 pages, 4 figures

点击查看摘要

Abstract:The Exame Nacional do Ensino Médio (ENEM) is a pivotal test for Brazilian students, required for admission to a significant number of universities in Brazil. The test consists of four objective high-school level tests on Math, Humanities, Natural Sciences and Languages, and one writing essay. Students’ answers to the test and to the accompanying socioeconomic status questionnaire are made public every year (albeit anonymized) due to transparency policies from the Brazilian Government. In the context of large language models (LLMs), these data lend themselves nicely to comparing different groups of humans with AI, as we can have access to human and machine answer distributions. We leverage these characteristics of the ENEM dataset and compare GPT-3.5 and 4, and MariTalk, a model trained using Portuguese data, to humans, aiming to ascertain how their answers relate to real societal groups and what that may reveal about the model biases. We divide the human groups by using socioeconomic status (SES), and compare their answer distribution with LLMs for each question and for the essay. We find no significant biases when comparing LLM performance to humans on the multiple-choice Brazilian Portuguese tests, as the distance between model and human answers is mostly determined by the human accuracy. A similar conclusion is found by looking at the generated text as, when analyzing the essays, we observe that human and LLM essays differ in a few key factors, one being the choice of words where model essays were easily separable from human ones. The texts also differ syntactically, with LLM generated essays exhibiting, on average, smaller sentences and less thought units, among other differences. These results suggest that, for Brazilian Portuguese in the ENEM context, LLM outputs represent no group of humans, being significantly different from the answers from Brazilian students across all tests.
摘要:国家中学教育考试(Exame Nacional do Ensino Médio,ENEM)是巴西学生的一项关键考试,是被巴西许多大学录取所必需的。这次考试包括数学、人文、自然科学和语言四个客观的高中水平测试,以及一篇写作文章。由于巴西政府的透明政策,学生对考试和随附的社会经济状况问卷的回答每年都会公布(尽管是匿名的)。在大型语言模型(LLM)的背景下,这些数据非常适合将不同的人类群体与人工智能进行比较,因为我们可以获得人和机器的答案分布。我们利用ENEM数据集的这些特征,将GPT-3.5和4以及使用葡萄牙语数据训练的MariTalk模型与人类进行比较,旨在确定它们的答案如何与真实的社会群体相关,以及这可能揭示出模型的哪些偏差。我们根据社会经济地位(SES)对人群进行划分,并针对每个问题和作文将他们的答案分布与LLM进行比较。在巴西葡萄牙语多项选择题测试中,我们没有发现明显的偏差,因为模型和人类答案之间的距离在很大程度上取决于人类的准确率。观察生成的文本也得到了类似的结论:在分析作文时,我们发现人类作文和LLM作文在几个关键因素上存在差异,其中之一是用词选择,模型作文据此很容易与人类作文区分开来。这些文本在句法上也不同,LLM生成的作文平均句子更短、思维单元更少等。这些结果表明,对于ENEM背景下的巴西葡萄牙语,LLM的输出不代表任何人类群体,在所有测试中都与巴西学生的答案显著不同。
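上文摘要的核心操作是逐题比较人类(按SES分组)与LLM的答案分布。作为示意,下面用总变差距离(total variation distance)给出一个最简单的分布距离实现;论文并未说明其具体采用的距离度量,此处的度量选择仅为演示假设。

```python
def total_variation(p, q):
    """两个多项选择答案分布之间的总变差距离,取值范围 [0, 1]。
    p、q 为 {选项: 概率} 字典,例如某题上人类与模型各自的答案分布。"""
    options = set(p) | set(q)
    return 0.5 * sum(abs(p.get(o, 0.0) - q.get(o, 0.0)) for o in options)
```

距离为0表示两个分布完全一致,为1表示完全不重叠;实际使用时可逐题计算后再对所有题目取平均。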

[NLP-15] MIDI-to-Tab: Guitar Tablature Inference via Masked Language Modeling
[NLP-15] MIDI-to-Tab:通过掩蔽语言建模的吉他谱推理

链接: https://arxiv.org/abs/2408.05024
作者: Drew Edwards,Xavier Riley,Pedro Sarmento,Simon Dixon
关键词-EN: traditional music notation, indicating precisely, enrich the structure, structure of traditional, notation by assigning
关键词-ZN: 传统乐记法,精确地指示,通过分配丰富了传统乐记法的结构,结构
类目: Sound (cs.SD); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: Reviewed pre-print accepted for publication at ISMIR 2024

点击查看摘要

Abstract:Guitar tablatures enrich the structure of traditional music notation by assigning each note to a string and fret of a guitar in a particular tuning, indicating precisely where to play the note on the instrument. The problem of generating tablature from a symbolic music representation involves inferring this string and fret assignment per note across an entire composition or performance. On the guitar, multiple string-fret assignments are possible for most pitches, which leads to a large combinatorial space that prevents exhaustive search approaches. Most modern methods use constraint-based dynamic programming to minimize some cost function (e.g.\ hand position movement). In this work, we introduce a novel deep learning solution to symbolic guitar tablature estimation. We train an encoder-decoder Transformer model in a masked language modeling paradigm to assign notes to strings. The model is first pre-trained on DadaGP, a dataset of over 25K tablatures, and then fine-tuned on a curated set of professionally transcribed guitar performances. Given the subjective nature of assessing tablature quality, we conduct a user study amongst guitarists, wherein we ask participants to rate the playability of multiple versions of tablature for the same four-bar excerpt. The results indicate our system significantly outperforms competing algorithms.
摘要:吉他谱(tablature)丰富了传统乐谱的结构:在特定调弦下,它将每个音符分配到吉他的某根弦和某个品位上,准确指示该音符在乐器上的演奏位置。从符号音乐表示生成吉他谱的问题,就是在整首作品或演奏中为每个音符推断这种弦/品分配。在吉他上,大多数音高都有多种可行的弦/品组合,这导致组合空间巨大,无法穷举搜索。大多数现代方法使用基于约束的动态规划来最小化某种代价函数(例如手的位置移动)。在这项工作中,我们提出了一种新的深度学习方法来解决符号吉他谱估计问题。我们在掩蔽语言建模范式下训练一个编码器-解码器Transformer模型,将音符分配到琴弦上。该模型首先在DadaGP(一个包含超过25,000份吉他谱的数据集)上进行预训练,然后在一组精心挑选的专业转录吉他演奏上进行微调。考虑到评估吉他谱质量的主观性,我们在吉他手中进行了一项用户研究,要求参与者对同一四小节片段的多个版本吉他谱的可演奏性进行评分。结果表明,我们的系统显著优于其他同类算法。
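摘要提到,传统方法使用基于约束的动态规划、以手位移动等代价函数来求解弦/品分配。下面给出这一传统基线的极简示意实现(标准调弦EADGBE、以相邻音符的品位差为代价、空弦不计移动代价等均为本示例的假设,并非论文设定):

```python
STANDARD_TUNING = [40, 45, 50, 55, 59, 64]  # E2 A2 D3 G3 B3 E4 各空弦的 MIDI 音高
MAX_FRET = 20

def candidate_positions(pitch):
    """返回某个 MIDI 音高在各弦上的所有可行 (弦, 品) 组合。"""
    return [(s, pitch - open_pitch)
            for s, open_pitch in enumerate(STANDARD_TUNING)
            if 0 <= pitch - open_pitch <= MAX_FRET]

def assign_tablature(pitches):
    """动态规划:为一串单音旋律选择使相邻品位移动总量最小的弦/品序列。"""
    prev = [(0.0, None, pos) for pos in candidate_positions(pitches[0])]
    history = [prev]
    for pitch in pitches[1:]:
        cur = []
        for pos in candidate_positions(pitch):
            # 空弦(品位 0)不计移动代价,这是一个简化假设
            best = min(
                (cost + (abs(pos[1] - p[1]) if pos[1] and p[1] else 0), j)
                for j, (cost, _, p) in enumerate(prev)
            )
            cur.append((best[0], best[1], pos))
        history.append(cur)
        prev = cur
    # 回溯代价最小的路径
    j = min(range(len(prev)), key=lambda k: prev[k][0])
    path = []
    for layer in reversed(history):
        cost, back, pos = layer[j]
        path.append(pos)
        j = back if back is not None else 0
    return list(reversed(path))
```

论文的方法则改用掩蔽语言建模的Transformer直接预测弦的分配,以替代这类人工设计的代价函数。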

[NLP-16] Investigating a Benchmark for Training-set free Evaluation of Linguistic Capabilities in Machine Reading Comprehension
[NLP-16] 研究用于机器阅读理解语言能力无训练集评估的基准

链接: https://arxiv.org/abs/2408.05023
作者: Viktor Schlegel,Goran Nenadic,Riza Batista-Navarro
关键词-EN: Performance of NLP, collecting a large-scale, crowd-sourcing to train, train a data-driven, held-out portion
关键词-ZN: NLP的性能,收集大规模、众包来训练,训练数据驱动的、持有的部分
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Performance of NLP systems is typically evaluated by collecting a large-scale dataset by means of crowd-sourcing to train a data-driven model and evaluate it on a held-out portion of the data. This approach has been shown to suffer from spurious correlations and the lack of challenging examples that represent the diversity of natural language. Instead, we examine a framework for evaluating optimised models in training-set free setting on synthetically generated challenge sets. We find that despite the simplicity of the generation method, the data can compete with crowd-sourced datasets with regard to naturalness and lexical diversity for the purpose of evaluating the linguistic capabilities of MRC models. We conduct further experiments and show that state-of-the-art language model-based MRC systems can learn to succeed on the challenge set correctly, although, without capturing the general notion of the evaluated phenomenon.
摘要:NLP系统的性能通常这样评估:通过众包收集大规模数据集,训练数据驱动模型,并在数据的保留部分上对其进行评估。事实证明,这种方法存在虚假相关性,并且缺乏能够代表自然语言多样性的挑战性例子。相反,我们研究了一个框架,在合成生成的挑战集上、在无训练集的设定下评估优化后的模型。我们发现,尽管生成方法很简单,但就评估MRC模型的语言能力而言,这些数据在自然性和词汇多样性方面足以与众包数据集竞争。我们进行了进一步的实验,结果表明,最先进的基于语言模型的MRC系统能够学会在挑战集上正确作答,尽管并未掌握被评估现象的一般概念。

[NLP-17] ProFuser: Progressive Fusion of Large Language Models
[NLP-17] ProFuser:大型语言模型的渐进融合

链接: https://arxiv.org/abs/2408.04998
作者: Tianyuan Shi,Fanqi Wan,Canbin Huang,Xiaojun Quan,Chenliang Li,Ming Yan,Ji Zhang
关键词-EN: properly select advantageous, large language models, select advantageous model, offers a pathway, fusing the capacities
关键词-ZN: 正确选择有利的、大型的语言模型,选择有利的模型,提供途径,融合能力
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While fusing the capacities and advantages of various large language models (LLMs) offers a pathway to construct more powerful and versatile models, a fundamental challenge is to properly select advantageous model during the training. Existing fusion methods primarily focus on the training mode that uses cross entropy on ground truth in a teacher-forcing setup to measure a model’s advantage, which may provide limited insight towards model advantage. In this paper, we introduce a novel approach that enhances the fusion process by incorporating both the training and inference modes. Our method evaluates model advantage not only through cross entropy during training but also by considering inference outputs, providing a more comprehensive assessment. To combine the two modes effectively, we introduce ProFuser to progressively transition from inference mode to training mode. To validate ProFuser’s effectiveness, we fused three models, including vicuna-7b-v1.5, Llama-2-7b-chat, and mpt-7b-8k-chat, and demonstrated the improved performance in knowledge, reasoning, and safety compared to baseline methods.
摘要:虽然融合各种大语言模型(LLM)的能力和优势为构建更强大、更通用的模型提供了一条途径,但一个根本的挑战是在训练过程中正确选择有利的模型。现有的融合方法主要关注训练模式,即在教师强制(teacher-forcing)设置下对真实标签计算交叉熵来衡量模型的优势,这对模型优势的刻画可能有限。在本文中,我们介绍了一种新的方法,通过同时结合训练和推理两种模式来增强融合过程。我们的方法不仅通过训练过程中的交叉熵来评估模型的优势,还考虑推理输出,从而提供更全面的评估。为了将这两种模式有效地结合起来,我们引入了ProFuser来逐步从推理模式过渡到训练模式。为了验证ProFuser的有效性,我们融合了vicuna-7b-v1.5、Llama-2-7b-chat和mpt-7b-8k-chat三个模型,并证明了与基线方法相比,其在知识、推理和安全性方面均有提升。

[NLP-18] Get Confused Cautiously: Textual Sequence Memorization Erasure with Selective Entropy Maximization
[NLP-18] 谨慎地混淆:基于选择性熵最大化的文本序列记忆擦除

链接: https://arxiv.org/abs/2408.04983
作者: Zhaohan Zhang,Ziquan Liu,Ioannis Patras
关键词-EN: Textual Sequence Memorization, raising broad concerns, Large Language Models, training set verbatim, Large Language
关键词-ZN: 文本序列记忆,引起广泛关注,大型语言模型,逐字训练集,大型语言
类目: Computation and Language (cs.CL)
备注: 15 pages, 7 figures

点击查看摘要

Abstract:Large Language Models (LLMs) have been found to memorize and recite some of the textual sequences from their training set verbatim, raising broad concerns about privacy and copyright issues when using LLMs. This Textual Sequence Memorization (TSM) phenomenon leads to a high demand to regulate LLM output to prevent it from generating certain memorized text to meet user requirements. However, our empirical study reveals that existing methods for TSM erasure fail to forget massive memorized samples without substantially jeopardizing the model utility. To achieve a better trade-off between the effectiveness of TSM erasure and model utility in LLMs, our paper proposes a new framework based on Entropy Maximization with Selective Optimization (EMSO), where the updated weights are chosen with a novel contrastive gradient metric without any participation of additional model or data. Our analysis shows that training with the entropy maximization loss has a more stable optimization process and better keeps model utility than existing methods. The contrastive gradient metric localizes the most influential weight for TSM erasure by taking both the gradient magnitude and direction into consideration. Extensive experiments across three model scales demonstrate that our method excels in handling large-scale forgetting requests while preserving model ability in language generation and reasoning.
摘要:研究发现,大型语言模型(LLM)会从其训练集中逐字记忆并复述某些文本序列,这引起了人们对使用LLM时隐私和版权问题的广泛关注。这种文本序列记忆(TSM)现象带来了调控LLM输出的迫切需求,即防止其生成某些被记忆的文本,以满足用户要求。然而,我们的实证研究表明,现有的TSM擦除方法无法在不显著损害模型效用的前提下遗忘大量被记忆的样本。为了在TSM擦除的有效性和模型效用之间取得更好的折衷,本文提出了一种基于选择性优化的熵最大化(EMSO)新框架,其中使用一种新的对比梯度度量来选择需要更新的权重,而无需任何额外模型或数据的参与。我们的分析表明,与现有方法相比,使用熵最大化损失进行训练的优化过程更稳定,也能更好地保持模型效用。对比梯度度量通过同时考虑梯度的大小和方向,定位对TSM擦除最有影响的权重。在三个模型规模上的大量实验表明,我们的方法在处理大规模遗忘请求的同时,保持了模型在语言生成和推理方面的能力。
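摘要中"熵最大化损失"的含义可以用一个极简示意来说明:在被记忆的序列位置上,把"负的平均预测熵"作为损失来最小化,等价于让模型的输出分布趋向均匀,从而"谨慎地混淆"。以下实现仅演示该目标函数本身,不涉及论文的对比梯度度量与权重选择:

```python
import math

def softmax(logits):
    """数值稳定的 softmax,把 logits 转为概率分布。"""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(probs):
    """离散分布的香农熵(自然对数)。"""
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_maximization_loss(logits_per_step):
    """待最小化的损失:负的平均预测熵。
    在被记忆的序列上最小化该损失 = 最大化模型输出分布的熵,
    使模型在这些位置不再逐字复述训练数据。"""
    ents = [entropy(softmax(logits)) for logits in logits_per_step]
    return -sum(ents) / len(ents)
```

分布越接近均匀,熵越大、损失越小;越尖锐(越接近逐字复述),损失越大。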

[NLP-19] reCSE: Portable Reshaping Features for Sentence Embedding in Self-supervised Contrastive Learning
[NLP-19] reCSE:自监督对比学习中句子嵌入的可移植特征重塑

链接: https://arxiv.org/abs/2408.04975
作者: Fufangchen Zhao,Gao Jian,Danfeng Yan
关键词-EN: current advanced models, supervised contrastive learning, representation framework based, contrastive learning sentence, sentence representation framework
关键词-ZN: 当前先进模型、监督对比学习、基于表示框架、对比学习句子、句子表示框架
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We propose reCSE, a self supervised contrastive learning sentence representation framework based on feature reshaping. This framework is different from the current advanced models that use discrete data augmentation methods, but instead reshapes the input features of the original sentence, aggregates the global information of each token in the sentence, and alleviates the common problems of representation polarity and GPU memory consumption linear increase in current advanced models. In addition, our reCSE has achieved competitive performance in semantic similarity tasks. And the experiment proves that our proposed feature reshaping method has strong universality, which can be transplanted to other self supervised contrastive learning frameworks and enhance their representation ability, even achieving state-of-the-art performance. Our code is available at this https URL.
摘要:我们提出了reCSE,一个基于特征重塑的自监督对比学习句子表示框架。该框架不同于当前使用离散数据增强方法的先进模型,而是重塑原始句子的输入特征,聚合句子中每个标记的全局信息,并缓解了当前先进模型中常见的表示极性和GPU显存消耗线性增长的问题。此外,我们的reCSE在语义相似性任务中取得了有竞争力的性能。实验证明,我们提出的特征重塑方法具有很强的通用性,可以移植到其他自监督对比学习框架中,增强其表示能力,甚至达到最先进的性能。我们的代码可在此https URL上获取。

[NLP-20] Generalisation First Memorisation Second? Memorisation Localisation for Natural Language Classification Tasks ACL
[NLP-20] 先泛化,后记忆?自然语言分类任务的记忆定位

链接: https://arxiv.org/abs/2408.04965
作者: Verna Dankers,Ivan Titov
关键词-EN: atypical input-output combinations, real-world data, neural models pick, part of learning, learning from real-world
关键词-ZN: 非典型输入输出组合、现实世界数据、神经模型选择、学习的一部分、从现实世界中学习
类目: Computation and Language (cs.CL)
备注: Published in ACL Findings 2024; 19 pages total (9 in the main paper, 4 pages with limitations, acknowledgments and references, 6 pages with appendices)

点击查看摘要

Abstract:Memorisation is a natural part of learning from real-world data: neural models pick up on atypical input-output combinations and store those training examples in their parameter space. That this happens is well-known, but how and where are questions that remain largely unanswered. Given a multi-layered neural model, where does memorisation occur in the millions of parameters? Related work reports conflicting findings: a dominant hypothesis based on image classification is that lower layers learn generalisable features and that deeper layers specialise and memorise. Work from NLP suggests this does not apply to language models, but has been mainly focused on memorisation of facts. We expand the scope of the localisation question to 12 natural language classification tasks and apply 4 memorisation localisation techniques. Our results indicate that memorisation is a gradual process rather than a localised one, establish that memorisation is task-dependent, and give nuance to the generalisation first, memorisation second hypothesis.
摘要:记忆是从真实世界数据中学习的自然组成部分:神经模型会捕捉非典型的输入-输出组合,并将这些训练样本存储在其参数空间中。这种现象的存在是众所周知的,但它如何发生、发生在哪里,在很大程度上仍是悬而未决的问题。给定一个多层神经模型,记忆发生在数百万个参数中的哪个位置?相关研究报告了相互矛盾的发现:基于图像分类的一个主流假设是,较低的层学习可泛化的特征,而较深的层负责特化和记忆。NLP领域的研究表明这并不适用于语言模型,但这些工作主要集中在对事实的记忆上。我们将这一定位问题的范围扩展到12个自然语言分类任务,并应用了4种记忆定位技术。我们的结果表明,记忆是一个渐进的过程而非局部化的过程,确立了记忆是依赖于具体任务的,并对"先泛化、后记忆"这一假设给出了更细致的刻画。

[NLP-21] HybridRAG: Integrating Knowledge Graphs and Vector Retrieval Augmented Generation for Efficient Information Extraction
[NLP-21] HybridRAG:集成知识图谱与向量检索增强生成以实现高效信息提取

链接: https://arxiv.org/abs/2408.04948
作者: Bhaskarjit Sarmah,Benika Hall,Rohan Rao,Sunil Patel,Stefano Pasquali,Dhagash Mehta
关键词-EN: present substantial challenges, large language models, unstructured text data, text data arising, Retrieval Augmented Generation
关键词-ZN: 存在重大挑战、大型语言模型、非结构化文本数据、文本数据出现、检索增强生成
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Statistical Finance (q-fin.ST); Applications (stat.AP); Machine Learning (stat.ML)
备注: 9 pages, 2 figures, 5 tables

点击查看摘要

Abstract:Extraction and interpretation of intricate information from unstructured text data arising in financial applications, such as earnings call transcripts, present substantial challenges to large language models (LLMs) even using the current best practices to use Retrieval Augmented Generation (RAG) (referred to as VectorRAG techniques which utilize vector databases for information retrieval) due to challenges such as domain specific terminology and complex formats of the documents. We introduce a novel approach based on a combination, called HybridRAG, of the Knowledge Graphs (KGs) based RAG techniques (called GraphRAG) and VectorRAG techniques to enhance question-answer (QA) systems for information extraction from financial documents that is shown to be capable of generating accurate and contextually relevant answers. Using experiments on a set of financial earning call transcripts documents which come in the form of QA format, and hence provide a natural set of pairs of ground-truth QAs, we show that HybridRAG which retrieves context from both vector database and KG outperforms both traditional VectorRAG and GraphRAG individually when evaluated at both the retrieval and generation stages in terms of retrieval accuracy and answer generation. The proposed technique has applications beyond the financial domain
摘要:由于特定领域的术语和复杂的文档格式等挑战,从金融应用中出现的非结构化文本数据(如财报电话会议记录)中提取和解释复杂信息,对大型语言模型(LLM)提出了巨大的挑战,即使使用当前检索增强生成(RAG)的最佳实践(即利用向量数据库进行信息检索的VectorRAG技术)也是如此。我们提出了一种称为HybridRAG的新方法,它将基于知识图谱(KG)的RAG技术(称为GraphRAG)与VectorRAG技术相结合,用于增强从金融文档中提取信息的问答(QA)系统,并被证明能够生成准确且与上下文相关的答案。我们在一组以QA格式呈现的财报电话会议记录文档(因而天然提供了一组真实问答对)上进行实验,结果表明,同时从向量数据库和知识图谱检索上下文的HybridRAG,在检索和生成两个阶段的检索准确率和答案生成方面,都分别优于传统的VectorRAG和GraphRAG。所提出的技术的应用不仅限于金融领域。
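HybridRAG的核心思想是同时从向量库与知识图谱两路检索上下文再合并。下面是一个不依赖任何外部库的玩具级示意(用词袋余弦相似度代替真正的向量检索、用字符串匹配代替图谱查询,函数名与数据均为演示假设):

```python
import math
from collections import Counter

def cosine(a, b):
    """两个词袋(Counter)之间的余弦相似度。"""
    num = sum(a[t] * b[t] for t in a if t in b)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def vector_retrieve(query, docs, k=2):
    """VectorRAG 路:按词袋余弦相似度取 top-k 文档。"""
    q = Counter(query.lower().split())
    scored = sorted(docs, key=lambda d: cosine(q, Counter(d.lower().split())),
                    reverse=True)
    return scored[:k]

def graph_retrieve(query, triples):
    """GraphRAG 路:返回头/尾实体出现在查询中的三元组,线性化为文本。"""
    words = set(query.lower().split())
    return [f"{h} {r} {t}" for h, r, t in triples
            if h.lower() in words or t.lower() in words]

def hybrid_rag_context(query, docs, triples):
    """HybridRAG:两路上下文拼接后交给 LLM 生成答案(此处省略生成步骤)。"""
    return vector_retrieve(query, docs) + graph_retrieve(query, triples)
```

真实系统中,向量检索由稠密嵌入与向量数据库完成,图谱检索则在抽取好的知识图谱上做子图查询;此处只保留了"两路检索、合并上下文"的骨架。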

[NLP-22] Quantitative Information Extraction from Humanitarian Documents
[NLP-22] 从人道主义文件中提取定量信息

链接: https://arxiv.org/abs/2408.04941
作者: Daniele Liberatore,Kyriaki Kalimeri,Derya Sever,Yelena Mejova
关键词-EN: mass of reports, Natural Language Processing, humanitarian domain, Humanitarian action, custom Natural Language
关键词-ZN: 大量报告、自然语言处理、人道主义领域、人道主义行动、自定义自然语言
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Humanitarian action is accompanied by a mass of reports, summaries, news, and other documents. To guide its activities, important information must be quickly extracted from such free-text resources. Quantities, such as the number of people affected, amount of aid distributed, or the extent of infrastructure damage, are central to emergency response and anticipatory action. In this work, we contribute an annotated dataset for the humanitarian domain for the extraction of such quantitative information, along side its important context, including units it refers to, any modifiers, and the relevant event. Further, we develop a custom Natural Language Processing pipeline to extract the quantities alongside their units, and evaluate it in comparison to baseline and recent literature. The proposed model achieves a consistent improvement in the performance, especially in the documents pertaining to the Dominican Republic and select African countries. We make the dataset and code available to the research community to continue the improvement of NLP tools for the humanitarian domain.
摘要:人道主义行动伴随着大量的报告、摘要、新闻和其他文件。为了指导行动,必须迅速从这些自由文本资源中提取重要信息。受影响人数、分发的援助数量或基础设施受损程度等数量信息,是应急响应和预期行动的核心。在这项工作中,我们为人道主义领域贡献了一个带标注的数据集,用于提取这类定量信息及其重要上下文,包括其所指的单位、任何修饰语以及相关事件。此外,我们开发了一个定制的自然语言处理流水线来提取数量及其单位,并与基线方法和近期文献进行了对比评估。所提出的模型实现了性能的持续提升,特别是在有关多米尼加共和国和选定非洲国家的文件中。我们向研究界公开数据集和代码,以继续改进人道主义领域的NLP工具。
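摘要所述的"数量+单位+修饰语"抽取,可以用一个基于正则表达式的最简流水线来示意(单位表和修饰词表均为本示例虚构,并非论文所用词表,真实系统的抽取器也远比正则复杂):

```python
import re

# 示意性的单位与修饰词表,仅作演示
UNITS = r"(people|children|households|tons|km|houses)"
MODIFIERS = r"(about|over|at least|approximately|more than)"

PATTERN = re.compile(
    rf"(?P<modifier>{MODIFIERS})?\s*(?P<value>\d[\d,\.]*)\s*(?P<unit>{UNITS})",
    re.IGNORECASE,
)

def extract_quantities(text):
    """返回 (修饰词, 数值, 单位) 三元组列表,对应摘要中"数量+单位+修饰语"的抽取目标。"""
    results = []
    for m in PATTERN.finditer(text):
        value = float(m.group("value").replace(",", ""))
        results.append((m.group("modifier"), value, m.group("unit").lower()))
    return results
```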

[NLP-23] Surveying the Landscape of Image Captioning Evaluation: A Comprehensive Taxonomy and Novel Ensemble Method
[NLP-23] 纵览图像字幕评估的格局:全面的分类体系与新颖的集成方法

链接: https://arxiv.org/abs/2408.04909
作者: Uri Berger,Gabriel Stanovsky,Omri Abend,Lea Frermann
关键词-EN: image captioning models, image captioning, image captioning metrics, complex task, task of evaluating
关键词-ZN: 图像字幕模型,图像字幕,图像字幕指标,复杂任务,评估任务
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The task of image captioning has recently been gaining popularity, and with it the complex task of evaluating the quality of image captioning models. In this work, we present the first survey and taxonomy of over 70 different image captioning metrics and their usage in hundreds of papers. We find that despite the diversity of proposed metrics, the vast majority of studies rely on only five popular metrics, which we show to be weakly correlated with human judgements. Instead, we propose EnsembEval – an ensemble of evaluation methods achieving the highest reported correlation with human judgements across 5 image captioning datasets, showing there is a lot of room for improvement by leveraging a diverse set of metrics.
摘要:图像字幕任务最近越来越受欢迎,随之而来的是评估图像字幕模型质量这一复杂任务。在这项工作中,我们首次对70多种不同的图像字幕指标及其在数百篇论文中的使用情况进行了调查和分类。我们发现,尽管提出的指标多种多样,但绝大多数研究仅依赖五种流行指标,而我们表明这些指标与人类判断的相关性较弱。为此,我们提出了EnsembEval,一种评估方法的集成,它在5个图像字幕数据集上实现了已报道的与人类判断的最高相关性,这表明利用一组多样化的指标还有很大的改进空间。
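"指标集成"的一种常见做法是先对各指标做归一化、再对候选取平均。下面是一个示意实现(采用z-score归一化,这只是本示例的假设,未必是EnsembEval的实际做法):

```python
from statistics import mean, pstdev

def zscore(xs):
    """对一组得分做 z-score 归一化,消除各指标的量纲差异。"""
    mu, sd = mean(xs), pstdev(xs)
    return [(x - mu) / sd if sd else 0.0 for x in xs]

def ensemble_scores(metric_scores):
    """metric_scores: {指标名: [各候选字幕的得分]}。
    每个指标先归一化,再按候选取平均,得到集成得分。"""
    names = list(metric_scores)
    normalized = [zscore(metric_scores[n]) for n in names]
    n_items = len(normalized[0])
    return [mean(col[i] for col in normalized) for i in range(n_items)]
```

这样,BLEU(0~1)与CIDEr(可达数十)等量纲迥异的指标可以公平地参与同一个加权平均。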

[NLP-24] Towards a Generative Approach for Emotion Detection and Reasoning
[NLP-24] 迈向一种生成式的情感检测与推理方法

链接: https://arxiv.org/abs/2408.04906
作者: Ankita Bhaumik,Tomek Strzalkowski
关键词-EN: Large language models, demonstrated impressive performance, Large language, prompting techniques, emotional reasoning
关键词-ZN: 大型语言模型,表现出令人印象深刻的性能,大型语言,提示技术,情感推理
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated impressive performance in mathematical and commonsense reasoning tasks using chain-of-thought (CoT) prompting techniques. But can they perform emotional reasoning by concatenating `Let’s think step-by-step’ to the input prompt? In this paper we investigate this question along with introducing a novel approach to zero-shot emotion detection and emotional reasoning using LLMs. Existing state of the art zero-shot approaches rely on textual entailment models to choose the most appropriate emotion label for an input text. We argue that this strongly restricts the model to a fixed set of labels which may not be suitable or sufficient for many applications where emotion analysis is required. Instead, we propose framing the problem of emotion analysis as a generative question-answering (QA) task. Our approach uses a two step methodology of generating relevant context or background knowledge to answer the emotion detection question step-by-step. Our paper is the first work on using a generative approach to jointly address the tasks of emotion detection and emotional reasoning for texts. We evaluate our approach on two popular emotion detection datasets and also release the fine-grained emotion labels and explanations for further training and fine-tuning of emotional reasoning systems.
摘要:大型语言模型(LLM)借助思维链(CoT)提示技术,在数学和常识推理任务中表现出了令人印象深刻的性能。但是,它们能否通过在输入提示中拼接"让我们一步一步地思考"来进行情感推理?在本文中,我们对这一问题进行了研究,并介绍了一种利用LLM进行零样本情感检测和情感推理的新方法。现有最先进的零样本方法依赖文本蕴涵模型来为输入文本选择最合适的情感标签。我们认为,这将模型强行限制在一组固定的标签上,对于许多需要情感分析的应用而言可能并不合适或不够用。相反,我们建议将情感分析问题构建为生成式问答(QA)任务。我们的方法采用两步法,先生成相关的上下文或背景知识,再逐步回答情感检测问题。我们的论文是首个使用生成方法来联合处理文本情感检测和情感推理任务的工作。我们在两个流行的情感检测数据集上评估了我们的方法,并发布了细粒度的情感标签和解释,以用于情感推理系统的进一步训练和微调。

[NLP-25] GlitchProber: Advancing Effective Detection and Mitigation of Glitch Tokens in Large Language Models
[NLP-25] GlitchProber:推进大型语言模型中毛刺令牌的有效检测和缓解

链接: https://arxiv.org/abs/2408.04905
作者: Zhibo Zhang,Wuxia Bai,Yuxi Li,Mark Huasong Meng,Kailong Wang,Ling Shi,Li Li,Jun Wang,Haoyu Wang
关键词-EN: achieved unprecedented success, Large language models, natural language processing, glitch tokens, Large language
关键词-ZN: 取得了前所未有的成功,大型语言模型,自然语言处理,故障令牌,大型语言
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have achieved unprecedented success in the field of natural language processing. However, the black-box nature of their internal mechanisms has brought many concerns about their trustworthiness and interpretability. Recent research has discovered a class of abnormal tokens in the model’s vocabulary space and named them “glitch tokens”. Those tokens, once included in the input, may induce the model to produce incorrect, irrelevant, or even harmful results, drastically undermining the reliability and practicality of LLMs. In this work, we aim to enhance the understanding of glitch tokens and propose techniques for their detection and mitigation. We first reveal the characteristic features induced by glitch tokens on LLMs, which are evidenced by significant deviations in the distributions of attention patterns and dynamic information from intermediate model layers. Based on the insights, we develop GlitchProber, a tool for efficient glitch token detection and mitigation. GlitchProber utilizes small-scale sampling, principal component analysis for accelerated feature extraction, and a simple classifier for efficient vocabulary screening. Taking one step further, GlitchProber rectifies abnormal model intermediate layer values to mitigate the destructive effects of glitch tokens. Evaluated on five mainstream open-source LLMs, GlitchProber demonstrates higher efficiency, precision, and recall compared to existing approaches, with an average F1 score of 0.86 and an average repair rate of 50.06%. GlitchProber unveils a novel path to address the challenges posed by glitch tokens and inspires future research toward more robust and interpretable LLMs.
摘要:大语言模型在自然语言处理领域取得了前所未有的成功。然而,它们内部机制的黑箱性质引发了人们对它们的可信性和可解释性的许多担忧。最近的研究在模型的词汇空间中发现了一类不正常的标记,并将它们命名为“毛刺标记”。这些令牌一旦包含在输入中,可能会导致模型产生不正确的、不相关的甚至有害的结果,从而极大地破坏LLMS的可靠性和实用性。在这项工作中,我们的目标是加强对毛刺令牌的理解,并提出检测和缓解毛刺令牌的技术。我们首先揭示了LLM上由毛刺标记引起的特征,这些特征表现在中间模型层的注意模式和动态信息的分布上的显著偏差。基于这些见解,我们开发了GlitchProber,一个高效的毛刺令牌检测和缓解工具。GlitchProber使用小规模采样、主成分分析加速特征提取,并使用简单的分类器进行有效的词汇筛选。更进一步,GlitchProber纠正了异常的模型中间层值,以减轻毛刺令牌的破坏性影响。GlitchProber在五个主流开源LLMS上的测试结果表明,与现有方法相比,GlitchProber具有更高的效率、精确度和召回率,平均F1得分为0.86,平均修复率为50.06%。GlitchProber推出了一条新的途径来解决毛刺令牌带来的挑战,并启发了未来对更健壮和可解释的LLM的研究。

[NLP-26] Communicate to Play: Pragmatic Reasoning for Efficient Cross-Cultural Communication in Codenames
[NLP-26] 为玩而沟通:Codenames游戏中实现高效跨文化交流的语用推理

链接: https://arxiv.org/abs/2408.04900
作者: Isadora White,Sashrika Pandey,Michelle Pan
关键词-EN: Rational Speech Acts, Cultural differences, method Rational Speech, differences in common, common ground
关键词-ZN: 理性言语行为、文化差异、方法理性言语、共同点的差异
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Cultural differences in common ground may result in pragmatic failure and misunderstandings during communication. We develop our method Rational Speech Acts for Cross-Cultural Communication (RSA+C3) to resolve cross-cultural differences in common ground. To measure the success of our method, we study RSA+C3 in the collaborative referential game of Codenames Duet and show that our method successfully improves collaboration between simulated players of different cultures. Our contributions are threefold: (1) creating Codenames players using contrastive learning of an embedding space and LLM prompting that are aligned with human patterns of play, (2) studying culturally induced differences in common ground reflected in our trained models, and (3) demonstrating that our method RSA+C3 can ease cross-cultural communication in gameplay by inferring sociocultural context from interaction. Our code is publicly available at this http URL.
摘要:共同基础(common ground)上的文化差异可能导致交际中的语用失误和误解。我们提出了跨文化交际理性言语行为方法(RSA+C3),以化解共同基础上的跨文化差异。为了衡量方法的效果,我们在协作式指称游戏Codenames Duet中研究了RSA+C3,结果表明我们的方法成功地改善了不同文化背景的模拟玩家之间的协作。我们的贡献有三点:(1)利用嵌入空间的对比学习和与人类游戏模式对齐的LLM提示来构建Codenames玩家;(2)研究我们训练的模型中所反映的、由文化引起的共同基础差异;(3)证明我们的RSA+C3方法可以通过从交互中推断社会文化语境,来促进游戏中的跨文化交流。我们的代码可在此http URL公开获取。
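该论文建立在理性言语行为(RSA)框架之上。下面给出单层RSA的最简示意:字面听者L0按词汇语义均匀猜测指称对象,语用说话者S1偏好能让L0猜中目标的话语。这只是教科书式的RSA,不包含论文的跨文化扩展C3,示例中的词表也是演示假设:

```python
def normalize(row):
    """把一行得分归一化为概率分布;全零时原样返回。"""
    z = sum(row.values())
    return {k: v / z for k, v in row.items()} if z else row

def literal_listener(utterance, lexicon, referents):
    """L0:按字面语义(lexicon[utterance][referent] ∈ {0,1})均匀猜测指称对象。"""
    return normalize({r: lexicon[utterance].get(r, 0) for r in referents})

def pragmatic_speaker(referent, lexicon, referents):
    """S1:按"L0 猜中目标指称的概率"为各候选话语打分并归一化。"""
    scores = {u: literal_listener(u, lexicon, referents).get(referent, 0)
              for u in lexicon}
    return normalize(scores)
```

例如在经典的"蓝色方块/蓝色圆形/绿色方块"场景中,要指称"蓝色圆形",S1会偏好无歧义的"circle"而非有歧义的"blue";Codenames中给线索词的推理与此同构。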

[NLP-27] Unsupervised Episode Detection for Large-Scale News Events
[NLP-27] 大规模新闻事件的无监督剧集检测

链接: https://arxiv.org/abs/2408.04873
作者: Priyanka Kargupta,Yunyi Zhang,Yizhu Jiao,Siru Ouyang,Jiawei Han
关键词-EN: evolving large-scale key, large-scale key events, structures are inherently, inherently interpretable, interpretable and adaptable
关键词-ZN: 不断发展的大规模关键、大规模关键事件、结构本质上是可解释的、可解释的和可适应的
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Episodic structures are inherently interpretable and adaptable to evolving large-scale key events. However, state-of-the-art automatic event detection methods overlook event episodes and, therefore, struggle with these crucial characteristics. This paper introduces a novel task, episode detection, aimed at identifying episodes from a news corpus containing key event articles. An episode describes a cohesive cluster of core entities (e.g., “protesters”, “police”) performing actions at a specific time and location. Furthermore, an episode is a significant part of a larger group of episodes under a particular key event. Automatically detecting episodes is challenging because, unlike key events and atomic actions, we cannot rely on explicit mentions of times and locations to distinguish between episodes or use semantic similarity to merge inconsistent episode co-references. To address these challenges, we introduce EpiMine, an unsupervised episode detection framework that (1) automatically identifies the most salient, key-event-relevant terms and segments, (2) determines candidate episodes in an article based on natural episodic partitions estimated through shifts in discriminative term combinations, and (3) refines and forms final episode clusters using large language model-based reasoning on the candidate episodes. We construct three diverse, real-world event datasets annotated at the episode level. EpiMine outperforms all baselines on these datasets by an average 59.2% increase across all metrics.
摘要:情节(episode)结构天然具有可解释性,并能适应不断演变的大规模关键事件。然而,最先进的自动事件检测方法忽略了事件情节,因此难以把握这些关键特征。本文提出了一个新任务,即情节检测,旨在从包含关键事件文章的新闻语料库中识别情节。一段情节描述了在特定时间和地点执行动作的一组紧密关联的核心实体(如"抗议者"、"警察")。此外,一段情节是特定关键事件下更大情节组的重要组成部分。自动检测情节具有挑战性,因为与关键事件和原子动作不同,我们无法依赖对时间和地点的明确提及来区分不同情节,也无法使用语义相似度来合并不一致的情节共指。为了应对这些挑战,我们提出了EpiMine,一个无监督的情节检测框架,它(1)自动识别最显著的、与关键事件相关的术语和片段;(2)基于判别性术语组合的变化所估计的自然情节划分,确定文章中的候选情节;(3)对候选情节使用基于大型语言模型的推理,精炼并形成最终的情节簇。我们构建了三个在情节级别标注的多样化真实世界事件数据集。在所有指标上,EpiMine的表现平均比这些数据集上的所有基线高出59.2%。

[NLP-28] SCOI: Syntax-augmented Coverage-based In-context Example Selection for Machine Translation
[NLP-28] SCOI:用于机器翻译的语法增强的基于覆盖的上下文内示例选择

链接: https://arxiv.org/abs/2408.04872
作者: Chenming Tang,Zhixiang Wang,Yunfang Wu
关键词-EN: large language models, improvement highly depends, greatly improves, language models, down-stream tasks
关键词-ZN: 大型语言模型,改进高度依赖,大大改进,语言模型,下游任务
类目: Computation and Language (cs.CL)
备注: 16 pages, 2 figures, 14 tables

点击查看摘要

Abstract:In-context learning (ICL) greatly improves the performance of large language models (LLMs) on various down-stream tasks, where the improvement highly depends on the quality of demonstrations. In this work, we introduce syntactic knowledge to select better in-context examples for machine translation (MT). We propose a new strategy, namely Syntax-augmented COverage-based In-context example selection (SCOI), leveraging the deep syntactic structure beyond conventional word matching. Specifically, we measure the set-level syntactic coverage by computing the coverage of polynomial terms with the help of a simplified tree-to-polynomial algorithm, and lexical coverage using word overlap. Furthermore, we devise an alternate selection approach to combine both coverage measures, taking advantage of syntactic and lexical information. We conduct experiments with two multi-lingual LLMs on six translation directions. Empirical results show that our proposed SCOI obtains the highest average COMET score among all learning-free methods, indicating that combining syntactic and lexical coverage successfully helps to select better in-context examples for MT.
摘要:上下文学习(ICL)极大地提升了大型语言模型(LLM)在各种下游任务上的表现,而提升幅度在很大程度上取决于示例的质量。在这项工作中,我们引入句法知识,为机器翻译(MT)挑选更好的上下文内示例。我们提出了一种新策略,即基于句法增强覆盖的上下文内示例选择(SCOI),利用超越传统词匹配的深层句法结构。具体而言,我们借助简化的树到多项式算法,通过计算多项式项的覆盖率来衡量集合级别的句法覆盖率,并用词重叠来计算词汇覆盖率。此外,我们设计了一种交替选择方法,利用句法和词汇信息,将两种覆盖度量结合起来。我们在六个翻译方向上用两个多语言LLM进行了实验。实验结果表明,在所有无需学习的方法中,我们提出的SCOI获得了最高的平均COMET分数,这表明将句法和词汇覆盖相结合确实有助于为机器翻译挑选更好的上下文内示例。
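SCOI中"词汇覆盖"的部分可以用贪心集合覆盖来示意:每轮挑选能为测试句新覆盖最多词的候选示例(句法多项式覆盖从略,贪心策略亦为本示例的假设,并非论文的交替选择算法):

```python
def greedy_coverage_select(test_src, pool, k=2):
    """按词覆盖率贪心挑选 k 个上下文示例:
    每轮选出能新覆盖测试源句中最多未覆盖词的候选。"""
    target = set(test_src.lower().split())
    covered, chosen = set(), []
    candidates = list(pool)
    for _ in range(min(k, len(candidates))):
        best = max(candidates,
                   key=lambda c: len(set(c.lower().split()) & (target - covered)))
        chosen.append(best)
        covered |= set(best.lower().split()) & target
        candidates.remove(best)
    return chosen
```

与按相似度逐个独立打分不同,覆盖式选择会鼓励所选示例集合整体"盖住"测试句的更多词,避免选出彼此高度重复的示例。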

[NLP-29] MSG-Chart: Multimodal Scene Graph for ChartQA CIKM
[NLP-29] MSG-Chart:ChartQA的多模式场景图

链接: https://arxiv.org/abs/2408.04852
作者: Yue Dai,Soyeon Caren Han,Wei Liu
关键词-EN: Automatic Chart Question, Chart Question Answering, Question Answering, Automatic Chart, Chart Question
关键词-ZN: 自动图表问题,图表问题解答,问题解答,自动图表,图表问题
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: Accpeted by CIKM Short 2024

点击查看摘要

Abstract:Automatic Chart Question Answering (ChartQA) is challenging due to the complex distribution of chart elements with patterns of the underlying data not explicitly displayed in charts. To address this challenge, we design a joint multimodal scene graph for charts to explicitly represent the relationships between chart elements and their patterns. Our proposed multimodal scene graph includes a visual graph and a textual graph to jointly capture the structural and semantical knowledge from the chart. This graph module can be easily integrated with different vision transformers as inductive bias. Our experiments demonstrate that incorporating the proposed graph module enhances the understanding of charts’ elements’ structure and semantics, thereby improving performance on publicly available benchmarks, ChartQA and OpenCQA.
摘要:自动图表问答(ChartQA)具有挑战性,因为图表元素分布复杂,且底层数据的模式并未显式呈现在图表中。为应对这一挑战,我们为图表设计了一种联合多模态场景图,以显式表示图表元素及其模式之间的关系。我们提出的多模态场景图包括视觉图和文本图,联合捕获图表中的结构与语义知识。该图模块可以作为归纳偏置,轻松与不同的视觉Transformer集成。实验表明,引入所提出的图模块增强了对图表元素结构和语义的理解,从而提升了在公开基准ChartQA和OpenCQA上的性能。

[NLP-30] Ensemble BERT: A student social network text sentiment classification model based on ensemble learning and BERT architecture
[NLP-30] Ensemble BERT:基于集成学习和BERT架构的学生社交网络文本情感分类模型

链接: https://arxiv.org/abs/2408.04849
作者: Kai Jiang,Honghao Yang,Yuexian Wang,Qianru Chen,Yiming Luo
关键词-EN: mental health assessment, middle school students, middle school, field of education, ensemble learning network
关键词-ZN: 心理健康评估,中学生,中学,教育领域,集成学习网络
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The mental health assessment of middle school students has always been one of the focuses in the field of education. This paper introduces a new ensemble learning network based on BERT, employing the concept of enhancing model performance by integrating multiple classifiers. We trained a range of BERT-based learners, which combined using the majority voting method. We collect social network text data of middle school students through China’s Weibo and apply the method to the task of classifying emotional tendencies in middle school students’ social network texts. Experimental results suggest that the ensemble learning network has a better performance than the base model and the performance of the ensemble learning model, consisting of three single-layer BERT models, is barely the same as a three-layer BERT model but requires 11.58% more training time. Therefore, in terms of balancing prediction effect and efficiency, the deeper BERT network should be preferred for training. However, for interpretability, network ensembles can provide acceptable solutions.
摘要:中学生心理健康测评一直是教育领域关注的焦点之一。本文介绍了一种新的基于BERT的集成学习网络,采用集成多个分类器以提升模型性能的思想。我们训练了一系列基于BERT的学习器,并用多数投票法将它们结合起来。我们通过中国的微博收集中学生社交网络文本数据,并将该方法应用于中学生社交网络文本的情感倾向分类任务。实验结果表明,集成学习网络的性能优于基础模型;由三个单层BERT模型组成的集成学习模型与一个三层BERT模型性能几乎相同,但需要多11.58%的训练时间。因此,从平衡预测效果和效率的角度看,训练时应优先选择更深的BERT网络;不过在可解释性方面,网络集成可以提供可接受的解决方案。
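其中“多数投票合并多个分类器预测”的做法可以用如下极简 Python 片段示意(与论文的 BERT 训练细节无关,仅演示投票逻辑本身):

```python
from collections import Counter

def majority_vote(predictions):
    """按多数投票合并多个分类器在同一批样本上的预测。

    predictions: 形如 [模型数][样本数] 的标签列表,
    返回每个样本得票最多的标签。
    """
    n_samples = len(predictions[0])
    merged = []
    for i in range(n_samples):
        votes = Counter(model_preds[i] for model_preds in predictions)
        merged.append(votes.most_common(1)[0][0])
    return merged
```

例如三个情感分类器对两条文本分别预测“积极/消极”时,`majority_vote` 会逐样本取多数标签。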

[NLP-31] mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models
[NLP-31] mPLUG-Owl3:面向多模态大型语言模型的长图像序列理解

链接: https://arxiv.org/abs/2408.04840
作者: Jiabo Ye,Haiyang Xu,Haowei Liu,Anwen Hu,Ming Yan,Qi Qian,Ji Zhang,Fei Huang,Jingren Zhou
关键词-EN: demonstrated remarkable capabilities, Multi-modal Large Language, Large Language Models, Large Language, Multi-modal Large
关键词-ZN: 表现出非凡的能力,多模式大型语言,大型语言模型,大型语言,多模式大型
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Multi-modal Large Language Models (MLLMs) have demonstrated remarkable capabilities in executing instructions for a variety of single-image tasks. Despite this progress, significant challenges remain in modeling long image sequences. In this work, we introduce the versatile multi-modal large language model, mPLUG-Owl3, which enhances the capability for long image-sequence understanding in scenarios that incorporate retrieved image-text knowledge, interleaved image-text, and lengthy videos. Specifically, we propose novel hyper attention blocks to efficiently integrate vision and language into a common language-guided semantic space, thereby facilitating the processing of extended multi-image scenarios. Extensive experimental results suggest that mPLUG-Owl3 achieves state-of-the-art performance among models with a similar size on single-image, multi-image, and video benchmarks. Moreover, we propose a challenging long visual sequence evaluation named Distractor Resistance to assess the ability of models to maintain focus amidst distractions. Finally, with the proposed architecture, mPLUG-Owl3 demonstrates outstanding performance on ultra-long visual sequence inputs. We hope that mPLUG-Owl3 can contribute to the development of more efficient and powerful multimodal large language models.
摘要:多模态大型语言模型(MLLM)在执行各种单图像任务的指令方面表现出了卓越的能力。尽管取得了这些进展,在长图像序列建模方面仍然存在重大挑战。在这项工作中,我们介绍了通用的多模态大型语言模型mPLUG-Owl3,它增强了在包含检索到的图文知识、交错图文和长视频的场景中理解长图像序列的能力。具体而言,我们提出了新颖的超注意力块,将视觉和语言高效整合到一个共同的、语言引导的语义空间中,从而便于处理扩展的多图像场景。大量实验结果表明,在单图像、多图像和视频基准上,mPLUG-Owl3在规模相近的模型中达到了最先进的性能。此外,我们还提出了一项具有挑战性的长视觉序列评估,称为“抗干扰性”(Distractor Resistance),用于评估模型在干扰中保持专注的能力。最后,在所提出的架构下,mPLUG-Owl3在超长视觉序列输入上表现出色。我们希望mPLUG-Owl3能为开发更高效、更强大的多模态大型语言模型做出贡献。

[NLP-32] FUSE-ing Language Models: Zero-Shot Adapter Discovery for Prompt Optimization Across Tokenizers
[NLP-32] 融合(FUSE)语言模型:用于跨分词器提示优化的零样本适配器发现

链接: https://arxiv.org/abs/2408.04816
作者: Joshua Nathaniel Williams,J. Zico Kolter
关键词-EN: making knowledge transfer, discovery tasks difficult, prompt discovery tasks, model embedding space, embedding space
关键词-ZN: 使知识转移、发现任务变得困难、提示发现任务、模型嵌入空间、嵌入空间
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Published as a Conference Paper at COLM 2024; 10 Pages; this https URL

点击查看摘要

Abstract:The widespread use of large language models has resulted in a multitude of tokenizers and embedding spaces, making knowledge transfer in prompt discovery tasks difficult. In this work, we propose FUSE (Flexible Unification of Semantic Embeddings), an inexpensive approach to approximating an adapter layer that maps from one model’s textual embedding space to another, even across different tokenizers. We introduce a third-order tensor-based representation of a model’s embedding space that aligns semantic embeddings that have been split apart by different tokenizers, and use this representation to derive an approximation of the gradient of one model’s outputs with respect to another model’s embedding space. We show the efficacy of our approach via multi-objective optimization over vision-language and causal language models for image captioning and sentiment-based image captioning.
摘要:大型语言模型的广泛使用带来了种类繁多的分词器和嵌入空间,使提示发现任务中的知识迁移变得困难。在这项工作中,我们提出了FUSE(语义嵌入的灵活统一),一种低成本的方法,用于近似一个适配层,将一个模型的文本嵌入空间映射到另一个模型,甚至可以跨不同的分词器。我们引入了模型嵌入空间的三阶张量表示,用以对齐被不同分词器切分开的语义嵌入,并利用这一表示推导出一个模型的输出相对于另一个模型嵌入空间的梯度近似。我们通过在视觉-语言模型和因果语言模型上对图像描述与基于情感的图像描述进行多目标优化,展示了方法的有效性。
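论文使用三阶张量表示来对齐嵌入空间;作为一个大为简化的示意,下面用最小二乘拟合一个线性适配层,把一个嵌入空间的向量映射到另一个嵌入空间。矩阵记号与锚点数据均为笔者假设,仅说明“跨嵌入空间映射”的基本思路,并非论文原方法:

```python
import numpy as np

def fit_linear_adapter(src_emb, tgt_emb):
    """用一组共享锚点的嵌入,最小二乘拟合 W 使 src_emb @ W ≈ tgt_emb。

    这是对论文三阶张量方法的极简线性近似,仅作示意。
    """
    W, *_ = np.linalg.lstsq(src_emb, tgt_emb, rcond=None)
    return W

def map_embedding(W, vec):
    """用拟合好的适配层把源空间向量映射到目标空间。"""
    return vec @ W
```

实际方法还需处理分词器不一致(同一语义被切成不同子词)的情形,这正是线性映射无法覆盖、论文引入张量表示的原因。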

[NLP-33] Hybrid Student-Teacher Large Language Model Refinement for Cancer Toxicity Symptom Extraction
[NLP-33] 用于癌症毒性症状提取的混合学生-教师大型语言模型细化

链接: https://arxiv.org/abs/2408.04775
作者: Reza Khanmohammadi,Ahmed I. Ghanem,Kyle Verdecchia,Ryan Hall,Mohamed Elshaikh,Benjamin Movsas,Hassan Bagher-Ebadian,Bing Luo,Indrin J. Chetty,Tuka Alhanai,Kundan Thind,Mohammad M. Ghassemi
关键词-EN: Large Language Models, Large Language, offer significant potential, computational limitations, Language Models
关键词-ZN: 大型语言模型,大型语言,提供巨大的潜力,计算限制,语言模型
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) offer significant potential for clinical symptom extraction, but their deployment in healthcare settings is constrained by privacy concerns, computational limitations, and operational costs. This study investigates the optimization of compact LLMs for cancer toxicity symptom extraction using a novel iterative refinement approach. We employ a student-teacher architecture, utilizing Zephyr-7b-beta and Phi3-mini-128 as student models and GPT-4o as the teacher, to dynamically select between prompt refinement, Retrieval-Augmented Generation (RAG), and fine-tuning strategies. Our experiments on 294 clinical notes covering 12 post-radiotherapy toxicity symptoms demonstrate the effectiveness of this approach. The RAG method proved most efficient, improving average accuracy scores from 0.32 to 0.73 for Zephyr-7b-beta and from 0.40 to 0.87 for Phi3-mini-128 during refinement. In the test set, both models showed an approximate 0.20 increase in accuracy across symptoms. Notably, this improvement was achieved at a cost 45 times lower than GPT-4o for Zephyr and 79 times lower for Phi-3. These results highlight the potential of iterative refinement techniques in enhancing the capabilities of compact LLMs for clinical applications, offering a balance between performance, cost-effectiveness, and privacy preservation in healthcare settings.
摘要:大型语言模型(LLM)为临床症状提取提供了巨大潜力,但它们在医疗环境中的部署受到隐私问题、计算限制和运营成本的制约。本研究使用一种新颖的迭代精化方法,研究用于癌症毒性症状提取的紧凑型LLM的优化。我们采用学生-教师架构,以Zephyr-7b-beta和Phi3-mini-128为学生模型、GPT-4o为教师,在提示精化、检索增强生成(RAG)和微调策略之间动态选择。我们在覆盖12种放疗后毒性症状的294份临床记录上的实验证明了该方法的有效性。RAG方法被证明最为高效:在精化过程中,Zephyr-7b-beta的平均准确率从0.32提高到0.73,Phi3-mini-128从0.40提高到0.87。在测试集中,两个模型在各症状上的准确率均提高约0.20。值得注意的是,实现这一改进的成本,Zephyr比GPT-4o低45倍,Phi-3则低79倍。这些结果突显了迭代精化技术在增强紧凑型LLM临床应用能力方面的潜力,在医疗环境中实现性能、成本效益与隐私保护之间的平衡。

[NLP-34] Understanding the Performance and Estimating the Cost of LLM Fine-Tuning
[NLP-34] 了解LLM微调的性能并估计成本

链接: https://arxiv.org/abs/2408.04693
作者: Yuchen Xia,Jiho Kim,Yuhan Chen,Haojie Ye,Souvik Kundu,Cong(Callie)Hao,Nishil Talati
关键词-EN: training Large Language, Large Language Models, Large Language, limited compute resources, training Large
关键词-ZN: 培训大型语言、大型语言模型、大型语言、有限的计算资源、培训大型
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 10 pages, conference

点击查看摘要

Abstract:Due to the cost-prohibitive nature of training Large Language Models (LLMs), fine-tuning has emerged as an attractive alternative for specializing LLMs for specific tasks using limited compute resources in a cost-effective manner. In this paper, we characterize sparse Mixture of Experts (MoE) based LLM fine-tuning to understand their accuracy and runtime performance on a single GPU. Our evaluation provides unique insights into the training efficacy of sparse and dense versions of MoE models, as well as their runtime characteristics, including maximum batch size, execution time breakdown, end-to-end throughput, GPU hardware utilization, and load distribution. Our study identifies the optimization of the MoE layer as crucial for further improving the performance of LLM fine-tuning. Using our profiling results, we also develop and validate an analytical model to estimate the cost of LLM fine-tuning on the cloud. This model, based on parameters of the model and GPU architecture, estimates LLM throughput and the cost of training, aiding practitioners in industry and academia to budget the cost of fine-tuning a specific model.
摘要:由于训练大型语言模型(LLM)的成本过高,微调已成为一种有吸引力的替代方案,可以在有限计算资源下以经济高效的方式让LLM专门化于特定任务。本文刻画了基于稀疏专家混合(MoE)的LLM微调,以了解其在单个GPU上的准确率和运行时性能。我们的评估为稀疏与稠密版本MoE模型的训练效果及其运行时特征提供了独特见解,包括最大批大小、执行时间分解、端到端吞吐量、GPU硬件利用率和负载分布。我们的研究表明,MoE层的优化是进一步提高LLM微调性能的关键。利用这些剖析结果,我们还开发并验证了一个分析模型,用于估算在云上进行LLM微调的成本。该模型基于模型和GPU架构的参数估计LLM吞吐量和训练成本,帮助工业界和学术界的从业者为微调特定模型编制成本预算。
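这类成本估算的基本思路可以用一个假设性的、极度简化的骨架来说明:训练时长由总 token 数除以总吞吐量得到,再乘以 GPU 数与小时单价。参数名与公式均为笔者示意,并非论文原模型(论文模型基于模型与 GPU 架构的更细粒度参数):

```python
def estimate_finetune_cost(num_tokens, tokens_per_sec_per_gpu,
                           gpu_hourly_usd, num_gpus=1):
    """极简成本估算:时长 = token 数 / 总吞吐量,成本 = 时长 × GPU 数 × 单价。

    假设吞吐量随 GPU 数线性扩展,仅作示意。
    """
    seconds = num_tokens / (tokens_per_sec_per_gpu * num_gpus)
    hours = seconds / 3600.0
    return hours * num_gpus * gpu_hourly_usd
```

例如 360 万 token、单卡吞吐 1000 token/s、每 GPU 每小时 2 美元,估算成本即为 2 美元;在该线性假设下增加 GPU 数只缩短时长,不改变总成本。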

[NLP-35] Improving Relational Database Interactions with Large Language Models : Column Descriptions and Their Impact on Text-to-SQL Performance
[NLP-35] 改进关系数据库与大型语言模型的交互:列描述及其对文本到SQL性能的影响

链接: https://arxiv.org/abs/2408.04691
作者: Niklas Wretblad,Oskar Holmström,Erik Larsson,Axel Wiksäter,Oscar Söderlund,Hjalmar Öhman,Ture Pontén,Martin Forsberg,Martin Sörme,Fredrik Heintz
关键词-EN: Relational databases, table contents, descriptors of table, Relational, column descriptions
关键词-ZN: 关系数据库、表内容、表描述符、关系、列描述
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB)
备注:

点击查看摘要

Abstract:Relational databases often suffer from uninformative descriptors of table contents, such as ambiguous columns and hard-to-interpret values, impacting both human users and Text-to-SQL models. This paper explores the use of large language models (LLMs) to generate informative column descriptions as a semantic layer for relational databases. Using the BIRD-Bench development set, we created ColSQL, a dataset with gold-standard column descriptions generated and refined by LLMs and human annotators. We evaluated several instruction-tuned models, finding that GPT-4o and Command R+ excelled in generating high-quality descriptions. Additionally, we applied an LLM-as-a-judge to evaluate model performance. Although this method does not align well with human evaluations, we included it to explore its potential and to identify areas for improvement. More work is needed to improve the reliability of automatic evaluations for this task. We also find that detailed column descriptions significantly improve Text-to-SQL execution accuracy, especially when columns are uninformative. This study establishes LLMs as effective tools for generating detailed metadata, enhancing the usability of relational databases.
摘要:关系数据库的表内容描述常常缺乏信息量,例如含义不明的列和难以解释的取值,这对人类用户和文本到SQL模型都有影响。本文探讨使用大型语言模型(LLM)生成信息性的列描述,作为关系数据库的语义层。基于BIRD-Bench开发集,我们构建了ColSQL数据集,其中包含由LLM和人工标注者生成并改进的黄金标准列描述。我们评估了多个指令微调模型,发现GPT-4o和Command R+在生成高质量描述方面表现出色。此外,我们应用LLM-as-a-judge来评估模型性能。虽然这种方法与人工评估的一致性不佳,我们仍将其纳入,以探索其潜力并确定改进方向;提高此任务自动评估的可靠性仍需更多工作。我们还发现,详细的列描述显著提高了文本到SQL的执行准确率,尤其是在列本身信息量不足时。这项研究确立了LLM作为生成详细元数据、增强关系数据库可用性的有效工具。

[NLP-36] Multi-Turn Context Jailbreak Attack on Large Language Models From First Principles
[NLP-36] 从第一原则出发对大型语言模型的多轮上下文越狱攻击

链接: https://arxiv.org/abs/2408.04686
作者: Xiongtao Sun,Deyue Zhang,Dongdong Yang,Quanchen Zou,Hui Li
关键词-EN: Large language models, Large language, language models, numerous applications, text generation
关键词-ZN: 大型语言模型、大型语言、语言模型、众多应用程序、文本生成
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have significantly enhanced the performance of numerous applications, from intelligent conversations to text generation. However, their inherent security vulnerabilities have become an increasingly significant challenge, especially with respect to jailbreak attacks. Attackers can circumvent the security mechanisms of these LLMs, breaching security constraints and causing harmful outputs. Focusing on multi-turn semantic jailbreak attacks, we observe that existing methods lack specific considerations for the role of multiturn dialogues in attack strategies, leading to semantic deviations during continuous interactions. Therefore, in this paper, we establish a theoretical foundation for multi-turn attacks by considering their support in jailbreak attacks, and based on this, propose a context-based contextual fusion black-box jailbreak attack method, named Context Fusion Attack (CFA). This method approach involves filtering and extracting key terms from the target, constructing contextual scenarios around these terms, dynamically integrating the target into the scenarios, replacing malicious key terms within the target, and thereby concealing the direct malicious intent. Through comparisons on various mainstream LLMs and red team datasets, we have demonstrated CFA’s superior success rate, divergence, and harmfulness compared to other multi-turn attack strategies, particularly showcasing significant advantages on Llama3 and GPT-4.
摘要:大型语言模型(LLM)显著提升了从智能对话到文本生成等众多应用的性能。然而,其固有的安全漏洞正成为日益严峻的挑战,尤其是在越狱攻击方面:攻击者可以绕过LLM的安全机制,突破安全约束并产生有害输出。针对多轮语义越狱攻击,我们观察到现有方法缺乏对多轮对话在攻击策略中所起作用的具体考量,导致持续交互中出现语义偏差。因此,本文通过考虑多轮对话对越狱攻击的支撑,为多轮攻击建立了理论基础,并在此基础上提出一种基于上下文的上下文融合黑盒越狱攻击方法,称为上下文融合攻击(CFA)。该方法从目标中过滤并提取关键术语,围绕这些术语构建上下文场景,将目标动态融入场景,替换目标中的恶意关键术语,从而隐藏直接的恶意意图。通过在多个主流LLM和红队数据集上的比较,我们证明了CFA相比其他多轮攻击策略具有更高的成功率、发散性和危害性,尤其在Llama3和GPT-4上表现出显著优势。

[NLP-37] oolSandbox: A Stateful Conversational Interactive Evaluation Benchmark for LLM Tool Use Capabilities
[NLP-37] ToolSandbox: A Stateful Conversational Interactive Evaluation Benchmark for LLM Tool Use Capabilities
[NLP-37] ToolSandbox:面向LLM工具使用能力的有状态对话式交互评估基准
链接: https://arxiv.org/abs/2408.04682
作者: Jiarui Lu,Thomas Holleis,Yizhe Zhang,Bernhard Aumayer,Feng Nan,Felix Bai,Shuang Ma,Shen Ma,Mengyu Li,Guoli Yin,Zirui Wang,Ruoming Pang
关键词-EN: Recent large language, solving real-world challenges, growing research interest, Recent large, assisted LLMs solving
关键词-ZN: 最近的大型语言,解决现实世界的挑战,不断增长的研究兴趣,最近的大型辅助LLM解决
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Recent large language models (LLMs) advancements sparked a growing research interest in tool assisted LLMs solving real-world challenges, which calls for comprehensive evaluation of tool-use capabilities. While previous works focused on either evaluating over stateless web services (RESTful API), based on a single turn user prompt, or an off-policy dialog trajectory, ToolSandbox includes stateful tool execution, implicit state dependencies between tools, a built-in user simulator supporting on-policy conversational evaluation and a dynamic evaluation strategy for intermediate and final milestones over an arbitrary trajectory. We show that open source and proprietary models have a significant performance gap, and complex tasks like State Dependency, Canonicalization and Insufficient Information defined in ToolSandbox are challenging even the most capable SOTA LLMs, providing brand-new insights into tool-use LLM capabilities. ToolSandbox evaluation framework is released at this https URL
摘要:近年来大型语言模型(LLM)的进展,引发了对工具辅助LLM解决现实挑战的日益浓厚的研究兴趣,这要求对工具使用能力进行全面评估。以往工作要么基于单轮用户提示评估无状态Web服务(RESTful API),要么评估离策略的对话轨迹;而ToolSandbox包含有状态的工具执行、工具间的隐式状态依赖、支持在策略对话评估的内置用户模拟器,以及针对任意轨迹上中间与最终里程碑的动态评估策略。我们发现开源模型与专有模型之间存在显著的性能差距,而ToolSandbox中定义的状态依赖、规范化和信息不足等复杂任务,即使对最强的SOTA LLM也构成挑战,为工具使用LLM能力提供了全新的见解。ToolSandbox评估框架已发布于此 https URL。

[NLP-38] Conversational AI Powered by Large Language Models Amplifies False Memories in Witness Interviews
[NLP-38] 由大型语言模型支持的对话人工智能放大了证人采访中的虚假记忆

链接: https://arxiv.org/abs/2408.04681
作者: Samantha Chan,Pat Pataranutaporn,Aditya Suri,Wazeer Zulfikar,Pattie Maes,Elizabeth F. Loftus
关键词-EN: false memories, recollections of events, actual occurrences, study examines, examines the impact
关键词-ZN: 错误记忆、事件回忆、实际发生的事情、研究检查、检查影响
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:This study examines the impact of AI on human false memories – recollections of events that did not occur or deviate from actual occurrences. It explores false memory induction through suggestive questioning in Human-AI interactions, simulating crime witness interviews. Four conditions were tested: control, survey-based, pre-scripted chatbot, and generative chatbot using a large language model (LLM). Participants (N=200) watched a crime video, then interacted with their assigned AI interviewer or survey, answering questions including five misleading ones. False memories were assessed immediately and after one week. Results show the generative chatbot condition significantly increased false memory formation, inducing over 3 times more immediate false memories than the control and 1.7 times more than the survey method. 36.4% of users’ responses to the generative chatbot were misled through the interaction. After one week, the number of false memories induced by generative chatbots remained constant. However, confidence in these false memories remained higher than the control after one week. Moderating factors were explored: users who were less familiar with chatbots but more familiar with AI technology, and more interested in crime investigations, were more susceptible to false memories. These findings highlight the potential risks of using advanced AI in sensitive contexts, like police interviews, emphasizing the need for ethical considerations.
摘要:这项研究考察了人工智能对人类错误记忆的影响–对没有发生或偏离实际发生的事件的回忆。它探索了在人类-人工智能交互中通过提示性提问来诱导虚假记忆,模拟犯罪证人采访。测试了四种情况:对照、基于调查的、预脚本化的聊天机器人和使用大型语言模型(LLM)的生成式聊天机器人。参与者(N=200)观看犯罪视频,然后与指定的人工智能面试者或调查人员互动,回答包括五个误导性问题在内的问题。分别在即刻和一周后评估错误记忆。结果表明,产生式聊天机器人显著增加了错误记忆的形成,诱发的即时错误记忆是对照组的3倍以上,是调查方法的1.7倍。36.4%的用户对生成式聊天机器人的反应通过互动被误导。一周后,由生成性聊天机器人诱导的错误记忆数量保持不变。然而,一周后,对这些错误记忆的信心仍然高于对照组。研究探讨了缓和因素:对聊天机器人不太熟悉但更熟悉人工智能技术、对犯罪调查更感兴趣的用户更容易出现错误记忆。这些发现突显了在敏感环境中使用先进人工智能的潜在风险,比如警察讯问,强调了伦理考虑的必要性。

[NLP-39] Dynamic Fog Computing for Enhanced LLM Execution in Medical Applications
[NLP-39] 用于医疗应用中增强LLM执行的动态雾计算

链接: https://arxiv.org/abs/2408.04680
作者: Philipp Zagar,Vishnu Ravi,Lauren Aalami,Stephan Krusche,Oliver Aalami,Paul Schmiedmayer
关键词-EN: large language models, data-driven care delivery, comprehend vast quantities, enhance data-driven care, language models
关键词-ZN: 大型语言模型、数据驱动的护理交付、理解大量内容、增强数据驱动的护理、语言模型
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:The ability of large language models (LLMs) to transform, interpret, and comprehend vast quantities of heterogeneous data presents a significant opportunity to enhance data-driven care delivery. However, the sensitive nature of protected health information (PHI) raises valid concerns about data privacy and trust in remote LLM platforms. In addition, the cost associated with cloud-based artificial intelligence (AI) services continues to impede widespread adoption. To address these challenges, we propose a shift in the LLM execution environment from opaque, centralized cloud providers to a decentralized and dynamic fog computing architecture. By executing open-weight LLMs in more trusted environments, such as the user’s edge device or a fog layer within a local network, we aim to mitigate the privacy, trust, and financial challenges associated with cloud-based LLMs. We further present SpeziLLM, an open-source framework designed to facilitate rapid and seamless leveraging of different LLM execution layers and lowering barriers to LLM integration in digital health applications. We demonstrate SpeziLLM’s broad applicability across six digital health applications, showcasing its versatility in various healthcare settings.
摘要:大型语言模型(LLM)转换、解释和理解大量异类数据的能力为增强数据驱动的医疗服务提供了一个重要的机会。然而,受保护的健康信息(PHI)的敏感性质引发了对远程LLM平台中数据隐私和信任的合理担忧。此外,与基于云的人工智能(AI)服务相关的成本继续阻碍着广泛采用。为了应对这些挑战,我们建议将LLM执行环境从不透明的集中式云提供商转变为分散的动态雾计算体系结构。通过在更可信的环境(例如用户的边缘设备或本地网络中的雾层)中执行开放权重LLM,我们的目标是减轻与基于云的LLM相关的隐私、信任和财务挑战。我们进一步介绍了SpeziLLM,这是一个开源框架,旨在促进不同LLM执行层的快速和无缝利用,并降低数字医疗应用中LLM集成的障碍。我们展示了SpeziLLM在六个数字医疗应用中的广泛适用性,展示了其在各种医疗保健设置中的多功能性。

[NLP-40] Towards Linguistic Neural Representation Learning and Sentence Retrieval from Electroencephalogram Recordings
[NLP-40] 迈向语言神经表示学习与从脑电图记录中检索句子

链接: https://arxiv.org/abs/2408.04679
作者: Jinzhao Zhou,Yiqun Duan,Ziyi Zhao,Yu-Cheng Chang,Yu-Kai Wang,Thomas Do,Chin-Teng Lin
关键词-EN: vast applicational potential, non-invasive brain signals, EEG, gained increasing research, increasing research attention
关键词-ZN: 巨大的应用潜力、无创脑信号、脑电,得到越来越多的研究,越来越多的研究关注
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Decoding linguistic information from non-invasive brain signals using EEG has gained increasing research attention due to its vast applicational potential. Recently, a number of works have adopted a generative-based framework to decode electroencephalogram (EEG) signals into sentences by utilizing the power generative capacity of pretrained large language models (LLMs). However, this approach has several drawbacks that hinder the further development of linguistic applications for brain-computer interfaces (BCIs). Specifically, the ability of the EEG encoder to learn semantic information from EEG data remains questionable, and the LLM decoder’s tendency to generate sentences based on its training memory can be hard to avoid. These issues necessitate a novel approach for converting EEG signals into sentences. In this paper, we propose a novel two-step pipeline that addresses these limitations and enhances the validity of linguistic EEG decoding research. We first confirm that word-level semantic information can be learned from EEG data recorded during natural reading by training a Conformer encoder via a masked contrastive objective for word-level classification. To achieve sentence decoding results, we employ a training-free retrieval method to retrieve sentences based on the predictions from the EEG encoder. Extensive experiments and ablation studies were conducted in this paper for a comprehensive evaluation of the proposed approach. Visualization of the top prediction candidates reveals that our model effectively groups EEG segments into semantic categories with similar meanings, thereby validating its ability to learn patterns from unspoken EEG recordings. Despite the exploratory nature of this work, these results suggest that our method holds promise for providing more reliable solutions for converting EEG signals into text.
摘要:利用EEG从非侵入性脑信号中解码语言信息,因其巨大的应用潜力而受到越来越多的研究关注。最近,一些工作采用基于生成的框架,利用预训练大语言模型(LLM)的强大生成能力,将脑电(EEG)信号解码为句子。然而,这种方法存在若干缺陷,阻碍了脑机接口(BCI)语言应用的进一步发展。具体而言,EEG编码器从EEG数据中学习语义信息的能力仍然存疑,而LLM解码器基于其训练记忆生成句子的倾向也难以避免。这些问题需要一种将EEG信号转换为句子的新方法。本文提出了一种新颖的两步流水线,解决上述局限并提高语言学EEG解码研究的有效性。我们首先通过以词级分类的掩蔽对比目标训练Conformer编码器,证实可以从自然阅读过程中记录的EEG数据中学到词级语义信息。为获得句子级解码结果,我们采用一种免训练的检索方法,根据EEG编码器的预测来检索句子。本文进行了大量实验和消融研究,以全面评估所提方法。对排名靠前的预测候选的可视化显示,我们的模型能有效地将EEG片段归入含义相近的语义类别,从而验证了其从未出声的EEG记录中学习模式的能力。尽管这项工作具有探索性质,这些结果表明我们的方法有望为将EEG信号转换为文本提供更可靠的解决方案。

[NLP-41] CREST: Effectively Compacting a Datastore For Retrieval-Based Speculative Decoding
[NLP-41] CREST:有效地压缩数据存储以实现基于检索的推测解码

链接: https://arxiv.org/abs/2408.04678
作者: Sophia Ho,Jinsol Park,Patrick Wang
关键词-EN: Compact Retrieval-Based Speculative, Retrieval-Based Speculative Decoding, Compact Retrieval-Based, Speculative Decoding, speculative decoding based
关键词-ZN: 紧凑的基于检索的推测、基于检索的推测解码、紧凑的基于检索的、推测解码、基于推测解码
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB)
备注:

点击查看摘要

Abstract:We present CREST (Compact Retrieval-Based Speculative Decoding), a redesign of REST that allows it to be effectively “compacted”. REST is a drafting technique for speculative decoding based on retrieving exact n-gram matches of the most recent n tokens generated by the target LLM from a datastore. The key idea of CREST is to only store a subset of the smallest and most common n-grams in the datastore with the hope of achieving comparable performance with less storage space. We found that storing a subset of n-grams both reduces storage space and improves performance. CREST matches REST’s accepted token length with 10.6-13.5x less storage space and achieves a 16.5-17.1% higher acceptance length than REST using the same storage space on the HumanEval and MT Bench benchmarks.
摘要:我们提出CREST(基于紧凑检索的推测解码),它是对REST的重新设计,使其能够被有效地“压缩”。REST是一种用于推测解码的草稿生成技术,基于从数据存储中检索与目标LLM最近生成的n个令牌精确匹配的n-gram。CREST的核心思想是只在数据存储中保留最小、最常见的n-gram子集,期望以更少的存储空间达到相当的性能。我们发现,只存储n-gram的子集既减少了存储空间又提升了性能。CREST用少10.6–13.5倍的存储空间达到了与REST相同的接受令牌长度,并且在HumanEval和MT Bench基准上、使用相同存储空间时,接受长度比REST高出16.5–17.1%。
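“只保留最常见 n-gram 子集、起草时按前缀检索”的压缩思路,可以用下面的 Python 片段粗略示意。数据结构与检索策略均为笔者简化(只按出现频次截断,未体现 CREST 中“最小”n-gram 的筛选),并非原实现:

```python
from collections import Counter

def build_datastore(tokens, n=3, top_k=1000):
    """统计语料中的 n-gram,仅保留出现次数最多的 top_k 个(压缩数据存储)。"""
    counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return {gram: c for gram, c in counts.most_common(top_k)}

def draft_next(datastore, context, n=3):
    """以最近 n-1 个 token 为前缀,在数据存储中检索一个草稿 token。"""
    prefix = tuple(context[-(n - 1):])
    for gram in datastore:
        if gram[:n - 1] == prefix:
            return gram[n - 1]
    return None  # 数据存储中无匹配前缀
```

推测解码中,这样检索出的草稿 token 会交由目标 LLM 一次性验证;数据存储越小,检索越快、内存占用越低,这正是 CREST 压缩的动机。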

[NLP-42] ACL Ready: RAG Based Assistant for the ACL Checklist
[NLP-42] ACL Ready:基于RAG的ACL检查表助理

链接: https://arxiv.org/abs/2408.04675
作者: Michael Galarnyk,Rutwik Routu,Kosha Bheda,Priyanshu Mehta,Agam Shah,Sudheer Chava
关键词-EN: ARR Responsible NLP, Responsible NLP Research, NLP Research checklist, ARR Responsible, Responsible NLP
关键词-ZN: ARR负责任NLP、负责任NLP研究、NLP研究清单、ARR负责任、负责任NLP
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:The ARR Responsible NLP Research checklist website states that the “checklist is designed to encourage best practices for responsible research, addressing issues of research ethics, societal impact and reproducibility.” Answering the questions is an opportunity for authors to reflect on their work and make sure any shared scientific assets follow best practices. Ideally, considering the checklist before submission can favorably impact the writing of a research paper. However, the checklist is often filled out at the last moment. In this work, we introduce ACLReady, a retrieval-augmented language model application that can be used to empower authors to reflect on their work and assist authors with the ACL checklist. To test the effectiveness of the system, we conducted a qualitative study with 13 users which shows that 92% of users found the application useful and easy to use as well as 77% of the users found that the application provided the information they expected. Our code is publicly available under the CC BY-NC 4.0 license on GitHub.
摘要:ARR负责任NLP研究核对表网站指出,该“核对表旨在鼓励负责任研究的最佳实践,解决研究伦理、社会影响和可重复性问题”。回答这些问题是作者反思其工作、确保所有共享科学资产遵循最佳实践的机会。理想情况下,在提交前考虑核对表可以对研究论文的写作产生积极影响;然而,核对表往往是在最后一刻才填写的。在这项工作中,我们介绍了ACLReady,一个检索增强的语言模型应用,可用于支持作者反思其工作并协助完成ACL核对表。为测试系统的有效性,我们对13名用户进行了定性研究:92%的用户认为该应用有用且易于使用,77%的用户认为它提供了他们期望的信息。我们的代码在GitHub上以CC BY-NC 4.0许可证公开。

[NLP-43] AutoFAIR : Automatic Data FAIRification via Machine Reading
[NLP-43] AutoFAIR:基于机器阅读的数据自动FAIR化

链接: https://arxiv.org/abs/2408.04673
作者: Tingyan Ma,Wei Liu,Bin Lu,Xiaoying Gan,Yunqiang Zhu,Luoyi Fu,Chenghu Zhou
关键词-EN: fuels data-driven research, data fuels data-driven, data-driven research, facilitating progress, diverse domains
关键词-ZN: 推动数据驱动的研究,数据推动数据驱动的、数据驱动的研究,促进进步,多元化领域
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The explosive growth of data fuels data-driven research, facilitating progress across diverse domains. The FAIR principles emerge as a guiding standard, aiming to enhance the findability, accessibility, interoperability, and reusability of data. However, current efforts primarily focus on manual data FAIRification, which can only handle targeted data and lack efficiency. To address this issue, we propose AutoFAIR, an architecture designed to enhance data FAIRness automately. Firstly, We align each data and metadata operation with specific FAIR indicators to guide machine-executable actions. Then, We utilize Web Reader to automatically extract metadata based on language models, even in the absence of structured data webpage schemas. Subsequently, FAIR Alignment is employed to make metadata comply with FAIR principles by ontology guidance and semantic matching. Finally, by applying AutoFAIR to various data, especially in the field of mountain hazards, we observe significant improvements in findability, accessibility, interoperability, and reusability of data. The FAIRness scores before and after applying AutoFAIR indicate enhanced data value.
摘要:数据的爆炸性增长推动了数据驱动的研究,促进了各领域的进步。FAIR原则作为指导标准出现,旨在提高数据的可查找性、可访问性、互操作性和可重用性。然而,目前的工作主要集中在人工数据FAIR化上,只能处理特定数据且效率不足。为了解决这个问题,我们提出了AutoFAIR,一种旨在自动提升数据FAIR程度的体系结构。首先,我们将每个数据和元数据操作与具体的FAIR指标对齐,以指导机器可执行的操作。然后,我们利用Web Reader基于语言模型自动提取元数据,即使在没有结构化数据网页模式的情况下也是如此。随后,通过本体指导和语义匹配进行FAIR对齐,使元数据符合FAIR原则。最后,通过将AutoFAIR应用于各类数据,特别是山地灾害领域的数据,我们观察到数据在可查找性、可访问性、互操作性和可重用性方面的显著改善。应用AutoFAIR前后的FAIRness得分表明数据价值得到提升。

[NLP-44] Prompt and Prejudice ECCV
[NLP-44] 提示与偏见

链接: https://arxiv.org/abs/2408.04671
作者: Lorenzo Berlincioni,Luca Cultrera,Federico Becattini,Marco Bertini,Alberto Del Bimbo
关键词-EN: Large Language Models, Vision Language Models, Large Language, Vision Language, Language Models
关键词-ZN: 大型语言模型,视觉语言模型,大型语言,视觉语言,语言模型
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: Accepted at ECCV workshop FAILED

点击查看摘要

Abstract:This paper investigates the impact of using first names in Large Language Models (LLMs) and Vision Language Models (VLMs), particularly when prompted with ethical decision-making tasks. We propose an approach that appends first names to ethically annotated text scenarios to reveal demographic biases in model outputs. Our study involves a curated list of more than 300 names representing diverse genders and ethnic backgrounds, tested across thousands of moral scenarios. Following the auditing methodologies from social sciences we propose a detailed analysis involving popular LLMs/VLMs to contribute to the field of responsible AI by emphasizing the importance of recognizing and mitigating biases in these systems. Furthermore, we introduce a novel benchmark, the Practical Scenarios Benchmark (PSB), designed to assess the presence of biases involving gender or demographic prejudices in everyday decision-making scenarios as well as practical scenarios where an LLM might be used to make sensible decisions (e.g., granting mortgages or insurances). This benchmark allows for a comprehensive comparison of model behaviors across different demographic categories, highlighting the risks and biases that may arise in practical applications of LLMs and VLMs.
摘要:本文研究了在大型语言模型(LLM)和视觉语言模型(VLM)中使用人名的影响,特别是在提示其执行伦理决策任务时。我们提出了一种方法,将人名附加到带伦理标注的文本情景中,以揭示模型输出中的人口统计偏见。我们的研究使用了一份精心整理的名单,包含300多个代表不同性别和种族背景的人名,并在数千种道德情景中进行了测试。遵循社会科学的审计方法,我们对流行的LLM/VLM进行了详细分析,通过强调识别和减轻这些系统中偏见的重要性,为负责任的人工智能领域做出贡献。此外,我们引入了一个新的基准,即实际情景基准(PSB),旨在评估在日常决策情景以及LLM可能被用于做出重要决策的实际情景(例如,发放抵押贷款或保险)中,是否存在涉及性别或人口统计的偏见。这一基准允许对不同人口统计类别下的模型行为进行全面比较,突出了LLM和VLM在实际应用中可能出现的风险和偏见。
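按摘要思路,偏见审计的核心操作是把不同人群的人名附加到同一伦理情景上,再逐组比较模型输出。以下是一个极简示意,其中的名单、情景模板与函数名均为假设示例,并非论文的原始数据或官方实现:

```python
# 示意:将不同人名填入同一伦理情景模板,构造用于偏见审计的提示。
# NAMES 与 SCENARIO 均为假设示例。
NAMES = {
    "group_a": ["Emily", "Jake"],
    "group_b": ["Lakisha", "Jamal"],
}
SCENARIO = "{name} applies for a mortgage with a stable income. Should the loan be granted?"

def build_bias_probes(scenario: str, names_by_group: dict) -> list:
    """为每个群体中的每个人名生成一条提示,便于按群体比较模型输出。"""
    probes = []
    for group, names in names_by_group.items():
        for name in names:
            probes.append({"group": group, "name": name,
                           "prompt": scenario.format(name=name)})
    return probes

probes = build_bias_probes(SCENARIO, NAMES)
```

随后只需把每条 `prompt` 送入待审计的模型,并按 `group` 聚合答案分布,即可观察不同人群之间的输出差异。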

[NLP-45] Forecasting Live Chat Intent from Browsing History CIKM2024
[NLP-45] 根据浏览历史记录预测实时聊天意图

链接: https://arxiv.org/abs/2408.04668
作者: Se-eun Yoon,Ahmad Bin Rabiah,Zaid Alibadi,Surya Kallumadi,Julian McAuley
关键词-EN: online live chat, live chat agents, Customers reach, requesting a return, browsing history
关键词-ZN: 在线实时聊天、实时聊天代理、客户联系、请求退货、浏览历史记录
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: CIKM 2024

点击查看摘要

Abstract:Customers reach out to online live chat agents with various intents, such as asking about product details or requesting a return. In this paper, we propose the problem of predicting user intent from browsing history and address it through a two-stage approach. The first stage classifies a user’s browsing history into high-level intent categories. Here, we represent each browsing history as a text sequence of page attributes and use the ground-truth class labels to fine-tune pretrained Transformers. The second stage provides a large language model (LLM) with the browsing history and predicted intent class to generate fine-grained intents. For automatic evaluation, we use a separate LLM to judge the similarity between generated and ground-truth intents, which closely aligns with human judgments. Our two-stage approach yields significant performance gains compared to generating intents without the classification stage.
摘要:客户带着各种意图联系在线实时聊天客服,例如询问产品详细信息或要求退货。在本文中,我们提出了根据浏览历史预测用户意图的问题,并通过两阶段方法加以解决。第一阶段将用户的浏览历史分类为高层意图类别。在这里,我们将每段浏览历史表示为页面属性的文本序列,并使用真实类别标签(ground truth)来微调预训练的Transformer模型。第二阶段将浏览历史和预测的意图类别提供给大型语言模型(LLM),以生成细粒度意图。对于自动评估,我们使用另一个LLM来判断生成意图与真实意图之间的相似性,其结果与人类判断高度一致。与不经过分类阶段直接生成意图相比,我们的两阶段方法取得了显著的性能提升。
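上述两阶段流程可以用一个极简示意来说明:第一阶段把页面属性序列映射到粗粒度意图类别,第二阶段把浏览历史与预测类别拼进 LLM 提示。此处用关键词规则代替微调的 Transformer,意图类别与提示模板均为假设:

```python
# 示意:两阶段意图预测流程(分类器与提示模板均为假设,
# 论文中第一阶段为微调的 Transformer,第二阶段为 LLM)。
COARSE_INTENTS = ["product_question", "return_request", "order_status"]

def classify_history(page_attrs: list) -> str:
    """第一阶段:把浏览历史(页面属性序列)映射为高层意图类别。
    此处用关键词规则代替微调模型,仅作流程演示。"""
    text = " ".join(page_attrs).lower()
    if "return" in text:
        return "return_request"
    if "order" in text:
        return "order_status"
    return "product_question"

def build_stage2_prompt(page_attrs: list, coarse: str) -> str:
    """第二阶段:将浏览历史与预测的粗粒度意图交给 LLM 生成细粒度意图。"""
    return (f"Browsing history: {' -> '.join(page_attrs)}\n"
            f"Coarse intent: {coarse}\n"
            f"Describe the user's fine-grained intent:")

history = ["product page: shoes", "return policy page"]
coarse = classify_history(history)
prompt = build_stage2_prompt(history, coarse)
```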

[NLP-46] LLM Stability: A detailed analysis with some surprises
[NLP-46] LLM稳定性:带一些惊喜的详细分析

链接: https://arxiv.org/abs/2408.04667
作者: Berk Atil,Alexa Chittams,Liseng Fu,Ferhan Ture,Lixinyu Xu,Breck Baldwin
关键词-EN: LLM stability, deterministic hyper-parameters, raw output, parsed output, accuracy variation
关键词-ZN: LLM稳定性,确定性超参数,原始输出,解析输出,准确率变化
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:A concerning property of our nearly magical LLMs involves the variation of results given the exact same input and deterministic hyper-parameters. While AI has always had a certain level of noisiness from inputs outside of training data, we have generally had deterministic results for any particular input; that is no longer true. While most LLM practitioners are “in the know”, we are unaware of any work that attempts to quantify current LLM stability. We suspect no one has taken the trouble because it is just too boring a paper to execute and write. But we have done it and there are some surprises. What kinds of surprises? The evaluated LLMs are rarely deterministic at the raw output level; they are much more deterministic at the parsed output/answer level but still rarely 100% stable across 5 re-runs with same data input. LLM accuracy variation is not normally distributed. Stability varies based on task.
摘要:我们这些近乎神奇的LLM有一个令人担忧的性质:在完全相同的输入和确定性超参数下,结果仍会发生变化。虽然人工智能对训练数据之外的输入总是带有一定程度的噪声,但我们通常对任何特定输入都能得到确定性的结果;现在这不再成立。虽然大多数LLM从业者对此"心知肚明",但我们尚未见到任何试图量化当前LLM稳定性的工作。我们猜想没有人费这个功夫,是因为这篇论文执行和撰写起来实在太枯燥了。但我们做到了,而且有一些惊喜。什么样的惊喜?被评估的LLM在原始输出层面很少是确定性的;它们在解析后的输出/答案层面确定性要高得多,但在相同数据输入的5次重复运行中,仍然很少达到100%稳定。LLM的准确率变化不服从正态分布。稳定性因任务而异。
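摘要区分了"原始输出层面"与"解析后答案层面"两种稳定性。下面是一个极简示意:同一输入重复运行 N 次后,分别统计两个层面是否完全一致。其中的模型输出为假想数据,`parse_answer` 的解析规则也只是示例:

```python
# 示意:在原始输出与解析后答案两个层面度量重复运行的稳定性。
# runs 为假想的五次重复运行输出,并非真实模型结果。
def parse_answer(raw: str) -> str:
    """从原始输出中解析最终答案(示例规则:取最后一个词)。"""
    return raw.strip().split()[-1]

def stability(outputs: list) -> dict:
    """同一输入的多次输出:原始层面与解析层面分别判断是否完全一致。"""
    parsed = [parse_answer(o) for o in outputs]
    return {"raw_stable": len(set(outputs)) == 1,
            "parsed_stable": len(set(parsed)) == 1}

# 假想输出:措辞各不相同,但解析出的答案相同
runs = ["The answer is B", "Answer: B", "The answer is B",
        "I think the answer is B", "The answer is B"]
report = stability(runs)
```

这个小例子也正好复现摘要的观察:原始层面不稳定(措辞不同),解析层面稳定(答案一致)。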

[NLP-47] LLMs are Not Just Next Token Predictors
[NLP-47] LLM不仅仅是下一个词元预测器

链接: https://arxiv.org/abs/2408.04666
作者: Stephen M. Downes,Patrick Forber,Alex Grzankowski
关键词-EN: stochastic gradient descent, token prediction objective, statistical models, models of language, language learning
关键词-ZN: 随机梯度下降、标记预测目标、统计模型、语言模型、语言学习
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:LLMs are statistical models of language learning through stochastic gradient descent with a next token prediction objective. Prompting a popular view among AI modelers: LLMs are just next token predictors. While LLMs are engineered using next token prediction, and trained based on their success at this task, our view is that a reduction to just next token predictor sells LLMs short. Moreover, there are important explanations of LLM behavior and capabilities that are lost when we engage in this kind of reduction. In order to draw this out, we will make an analogy with a once prominent research program in biology explaining evolution and development from the gene’s eye view.
摘要:LLM是通过随机梯度下降、以预测下一个词元为目标进行语言学习的统计模型。这引出了人工智能建模者中的一种流行观点:LLM只不过是下一个词元预测器。虽然LLM是围绕下一个词元预测来设计的,并基于其在该任务上的成功进行训练,但我们认为,将LLM仅仅归结为下一个词元预测器低估了它们。此外,一旦进行这种简化,一些对LLM行为和能力的重要解释就会丢失。为了阐明这一点,我们将类比生物学中一个曾经著名的研究纲领,即从基因视角解释进化与发育。

[NLP-48] LLM-based MOFs Synthesis Condition Extraction using Few-Shot Demonstrations
[NLP-48] 使用Few-Shot演示的基于LLM的MOF合成条件提取

链接: https://arxiv.org/abs/2408.04665
作者: Lei Shi,Zhimeng Liu,Yi Yang,Weize Wu,Yuyang Zhang,Hongbo Zhang,Jing Lin,Siyu Wu,Zihan Chen,Ruiming Li,Nan Wang,Zipeng Liu,Huobin Tan,Hongyi Gao,Yue Zhang,Ge Wang
关键词-EN: Metal-Organic Frameworks, desirable functionality, challenging but crucial, logical design, synthesis conditions
关键词-ZN: 金属有机框架、理想的功能性、具有挑战性但至关重要、逻辑设计、合成条件
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The extraction of Metal-Organic Frameworks (MOFs) synthesis conditions from literature text has been challenging but crucial for the logical design of new MOFs with desirable functionality. The recent advent of large language models (LLMs) provides disruptively new solution to this long-standing problem and latest researches have reported over 90% F1 in extracting correct conditions from MOFs literature. We argue in this paper that most existing synthesis extraction practices with LLMs stay with the primitive zero-shot learning, which could lead to downgraded extraction and application performance due to the lack of specialized knowledge. This work pioneers and optimizes the few-shot in-context learning paradigm for LLM extraction of material synthesis conditions. First, we propose a human-AI joint data curation process to secure high-quality ground-truth demonstrations for few-shot learning. Second, we apply a BM25 algorithm based on the retrieval-augmented generation (RAG) technique to adaptively select few-shot demonstrations for each MOF’s extraction. Over a dataset randomly sampled from 84,898 well-defined MOFs, the proposed few-shot method achieves much higher average F1 performance (0.93 vs. 0.81, +14.8%) than the native zero-shot LLM using the same GPT-4 model, under fully automatic evaluation that are more objective than the previous human evaluation. The proposed method is further validated through real-world material experiments: compared with the baseline zero-shot LLM, the proposed few-shot approach increases the MOFs structural inference performance (R^2) by 29.4% in average.
摘要:从文献文本中提取金属有机骨架(MOF)的合成条件一直具有挑战性,但对于逻辑化设计具有理想功能的新型MOF至关重要。最近出现的大型语言模型(LLM)为这一长期存在的问题提供了颠覆性的新解决方案,最新研究报告在从MOF文献中提取正确条件时F1超过90%。本文认为,现有利用LLM进行合成条件提取的实践大多停留在原始的零样本学习上,由于缺乏专业知识,这可能导致提取和应用性能下降。这项工作开创并优化了用于LLM提取材料合成条件的少样本上下文学习范式。首先,我们提出了一个人类与AI联合的数据整理流程,为少样本学习获取高质量的真实标注演示。其次,我们应用基于检索增强生成(RAG)技术的BM25算法,为每个MOF的提取自适应地选择少样本演示。在从84,898个定义明确的MOF中随机抽样的数据集上,在比以往人工评估更客观的全自动评估下,所提出的少样本方法取得了比使用相同GPT-4模型的原生零样本LLM高得多的平均F1性能(0.93对0.81,+14.8%)。通过真实世界的材料实验进一步验证了该方法:与基线零样本LLM相比,所提出的少样本方法将MOF结构推断性能(R^2)平均提高了29.4%。
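摘要中"用 BM25 为每条待抽取文本自适应选择少样本演示"的做法可以用一个极简 BM25 示意。以下的示例池、查询和参数均为假设,实际论文是在 MOF 文献的标注示例库上检索:

```python
import math

# 示意:极简 BM25 打分,并按得分从标注示例池中选出前 k 条少样本演示。
# pool 与 query 均为假设示例。
def bm25_scores(query: str, docs: list, k1: float = 1.5, b: float = 0.75) -> list:
    tokenized = [d.lower().split() for d in docs]
    avgdl = sum(len(d) for d in tokenized) / len(tokenized)
    n = len(docs)
    scores = []
    for doc in tokenized:
        score = 0.0
        for term in set(query.lower().split()):
            df = sum(1 for d in tokenized if term in d)  # 文档频率
            if df == 0:
                continue
            idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
            tf = doc.count(term)
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(score)
    return scores

def select_demonstrations(query: str, pool: list, k: int = 2) -> list:
    """按 BM25 得分从标注示例池中取前 k 条作为 few-shot 演示。"""
    scores = bm25_scores(query, pool)
    ranked = sorted(range(len(pool)), key=lambda i: scores[i], reverse=True)
    return [pool[i] for i in ranked[:k]]

pool = ["solvothermal synthesis at 120 C in DMF",
        "room temperature stirring in water",
        "microwave-assisted synthesis in DMF"]
demos = select_demonstrations("solvothermal synthesis in DMF", pool, k=2)
```

实际系统中通常会用成熟的 BM25 实现(如检索库自带的实现)替换这里的手写打分函数,流程保持不变。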

[NLP-49] Mitigating Hallucinations in Large Vision-Language Models (LVLMs) via Language-Contrastive Decoding (LCD)
[NLP-49] 通过语言对比解码(LCD)缓解大型视觉语言模型(LVLM)中的幻觉

链接: https://arxiv.org/abs/2408.04664
作者: Avshalom Manevich,Reut Tsarfaty
关键词-EN: Large Vision-Language Models, Large Language Models, Large Vision-Language, Large Language, expanding AI capabilities
关键词-ZN: 大型视觉语言模型、大型语言模型、大型视觉语言、大型语言、扩展人工智能能力
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Large Vision-Language Models (LVLMs) are an extension of Large Language Models (LLMs) that facilitate processing both image and text inputs, expanding AI capabilities. However, LVLMs struggle with object hallucinations due to their reliance on text cues and learned object co-occurrence biases. While most research quantifies these hallucinations, mitigation strategies are still lacking. Our study introduces a Language Contrastive Decoding (LCD) algorithm that adjusts LVLM outputs based on LLM distribution confidence levels, effectively reducing object hallucinations. We demonstrate the advantages of LCD in leading LVLMs, showing up to 4% improvement in POPE F1 scores and up to 36% reduction in CHAIR scores on the COCO validation set, while also improving captioning quality scores. Our method effectively improves LVLMs without needing complex post-processing or retraining, and is easily applicable to different models. Our findings highlight the potential of further exploration of LVLM-specific decoding algorithms.
摘要:大型视觉语言模型(LVLM)是大型语言模型(LLM)的扩展,能够同时处理图像和文本输入,扩展了AI能力。然而,由于依赖文本线索和习得的对象共现偏差,LVLM会产生对象幻觉。虽然大多数研究对这些幻觉进行了量化,但缓解策略仍然缺乏。我们的研究引入了一种语言对比解码(LCD)算法,该算法根据LLM分布的置信水平来调整LVLM的输出,有效地减少了对象幻觉。我们展示了LCD在领先LVLM上的优势:在COCO验证集上,POPE F1分数最多提高4%,CHAIR分数最多降低36%,同时还提高了图像描述质量分数。该方法无需复杂的后处理或重新训练即可有效改进LVLM,并且很容易应用于不同的模型。我们的发现突出了进一步探索LVLM专用解码算法的潜力。
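对比解码这类方法的单步操作可以用一个极简示意来说明:用纯文本 LLM 的下一词分布修正 LVLM 的 logits,削弱仅由语言先验支撑的候选词。这里的熵加权方式、alpha 取值和假想词表都是本文示例的假设,并非论文的官方公式:

```python
import math

# 示意:语言对比解码的一步——用纯文本 LLM 的 logits 对 LVLM 的
# 下一词 logits 做对比修正;加权方式与 alpha 均为假设。
def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def lcd_step(lvlm_logits, lm_logits, alpha=1.0):
    """返回对比修正后的 logits:按 LLM 分布的确定性(1 - 归一化熵)加权扣减。"""
    p_lm = softmax(lm_logits)
    entropy = -sum(p * math.log(p) for p in p_lm if p > 0)
    confidence = 1.0 - entropy / math.log(len(lm_logits))  # LLM 越自信,修正越强
    return [v - alpha * confidence * l for v, l in zip(lvlm_logits, lm_logits)]

# 假想词表 ["dog", "frisbee", "car"]:纯语言先验强烈偏向 "frisbee"
lvlm = [2.0, 2.2, 0.1]
lm = [0.5, 3.0, 0.2]
adjusted = lcd_step(lvlm, lm, alpha=1.0)
```

在这个假想例子中,修正前 LVLM 的最高分落在语言先验偏好的词上,修正后转向图像更支持的词,体现了削弱语言先验导致的幻觉这一思路。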

[NLP-50] Dopamin: Transformer-based Comment Classifiers through Domain Post-Training and Multi-level Layer Aggregation
[NLP-50] Dopamin:通过领域后训练和多级层聚合的基于Transformer的注释分类器

链接: https://arxiv.org/abs/2408.04663
作者: Nam Le Hai,Nghi D. Q. Bui
关键词-EN: provide important information, comments provide important, provide important, important information, information for understanding
关键词-ZN: 提供重要信息,评论提供重要的,提供重要的,重要的信息,用于理解的信息
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at The 3rd Intl. Workshop on NL-based Software Engineering, 2024

点击查看摘要

Abstract:Code comments provide important information for understanding the source code. They can help developers understand the overall purpose of a function or class, as well as identify bugs and technical debt. However, an overabundance of comments is meaningless and counterproductive. As a result, it is critical to automatically filter out these comments for specific purposes. In this paper, we present Dopamin, a Transformer-based tool for dealing with this issue. Our model excels not only in presenting knowledge sharing of common categories across multiple languages, but also in achieving robust performance in comment classification by improving comment representation. As a result, it outperforms the STACC baseline by 3% on the NLBSE’24 Tool Competition dataset in terms of average F1-score, while maintaining a comparable inference time for practical use. The source code is publicly available at this https URL.
摘要:代码注释为理解源代码提供了重要信息。它们可以帮助开发人员了解函数或类的总体目的,以及识别错误和技术债务。然而,过多的注释毫无意义,甚至适得其反。因此,针对特定目的自动过滤这些注释至关重要。在本文中,我们介绍了Dopamin,一种用于处理此问题的基于Transformer的工具。我们的模型不仅擅长在多种语言之间共享常见类别的知识,还通过改进注释表示在注释分类上取得了稳健的性能。因此,在NLBSE'24工具竞赛数据集上,它的平均F1分数比STACC基线高出3%,同时保持了适合实际使用的推理时间。源代码可在此 https URL 公开获取。

[NLP-51] Citekit: A Modular Toolkit for Large Language Model Citation Generation
[NLP-51] Citekit:大型语言模型引文生成的模块化工具包

链接: https://arxiv.org/abs/2408.04662
作者: Jiajun Shen,Tong Zhou,Suifeng Zhao,Yubo Chen,Kang Liu
关键词-EN: Enabling Large Language, Large Language Models, Enabling Large, Language Models, Large Language
关键词-ZN: 启用大型语言、大型语言模型、启用大型语言模型、大型语言
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 7 pages, 13 figures

点击查看摘要

Abstract:Enabling Large Language Models (LLMs) to generate citations in Question-Answering (QA) tasks is an emerging paradigm aimed at enhancing the verifiability of their responses when LLMs are utilizing external references to generate an answer. However, there is currently no unified framework to standardize and fairly compare different citation generation methods, leading to difficulties in reproducing different methods and a comprehensive assessment. To cope with the problems above, we introduce Citekit, an open-source and modular toolkit designed to facilitate the implementation and evaluation of existing citation generation methods, while also fostering the development of new approaches to improve citation quality in LLM outputs. This tool is highly extensible, allowing users to utilize 4 main modules and 14 components to construct a pipeline, evaluating an existing method or innovative designs. Our experiments with two state-of-the-art LLMs and 11 citation generation baselines demonstrate varying strengths of different modules in answer accuracy and citation quality improvement, as well as the challenge of enhancing granularity. Based on our analysis of the effectiveness of components, we propose a new method, self-RAG \snippet, obtaining a balanced answer accuracy and citation quality. Citekit is released at this https URL.
摘要:让大型语言模型(LLM)在问答(QA)任务中生成引文,是一种新兴范式,旨在当LLM利用外部参考生成答案时提高其回答的可验证性。然而,目前还没有统一的框架来标准化并公平比较不同的引文生成方法,这导致难以复现不同方法并进行全面评估。为了解决上述问题,我们引入了Citekit,这是一个开源的模块化工具包,旨在促进现有引文生成方法的实现和评估,同时也促进开发新方法以提高LLM输出中的引文质量。该工具具有高度可扩展性,允许用户使用4个主要模块和14个组件来构建流水线,以评估现有方法或创新设计。我们使用两个最先进的LLM和11个引文生成基线进行的实验表明,不同模块在答案准确性和引文质量改进方面各有优势,也揭示了提高引文粒度的挑战。基于对组件有效性的分析,我们提出了一种新方法 self-RAG \snippet,在答案准确性和引文质量之间取得了平衡。Citekit 已在此 https URL 发布。

[NLP-52] MaterioMiner – An ontology-based text mining dataset for extraction of process-structure-property entities
[NLP-52] MaterioMiner – 一个基于本体的文本挖掘数据集,用于提取工艺-结构-性能实体

链接: https://arxiv.org/abs/2408.04661
作者: Ali Riza Durmaz,Akhil Thomas,Lokesh Mishra,Rachana Niranjan Murthy,Thomas Straub
关键词-EN: learn sound statistical, sound statistical representations, models learn sound, learn sound, sound statistical
关键词-ZN: 学习声音统计,声音统计表示,模型学习声音,学习声音,声音统计
类目: Computation and Language (cs.CL); Materials Science (cond-mat.mtrl-sci)
备注:

点击查看摘要

Abstract:While large language models learn sound statistical representations of the language and information therein, ontologies are symbolic knowledge representations that can complement the former ideally. Research at this critical intersection relies on datasets that intertwine ontologies and text corpora to enable training and comprehensive benchmarking of neurosymbolic models. We present the MaterioMiner dataset and the linked materials mechanics ontology where ontological concepts from the mechanics of materials domain are associated with textual entities within the literature corpus. Another distinctive feature of the dataset is its eminently fine-granular annotation. Specifically, 179 distinct classes are manually annotated by three raters within four publications, amounting to a total of 2191 entities that were annotated and curated. Conceptual work is presented for the symbolic representation of causal composition-process-microstructure-property relationships. We explore the annotation consistency between the three raters and perform fine-tuning of pre-trained models to showcase the feasibility of named-entity recognition model training. Reusing the dataset can foster training and benchmarking of materials language models, automated ontology construction, and knowledge graph generation from textual data.
摘要:大型语言模型学习语言及其信息的可靠统计表示,而本体是符号化的知识表示,理想情况下可以与前者互补。这一关键交叉领域的研究依赖于将本体与文本语料库交织的数据集,以支持神经符号模型的训练和全面基准测试。我们提出了MaterioMiner数据集及与之关联的材料力学本体,其中材料力学领域的本体概念与文献语料库中的文本实体相关联。该数据集的另一个显著特点是其极其细粒度的标注:三位标注者在四篇文献中手动标注了179个不同类别,共计2191个经过标注和整理的实体。我们还提出了用符号表示因果性的"成分-工艺-微观结构-性能"关系的概念性工作。我们探讨了三位标注者之间的标注一致性,并对预训练模型进行微调,以展示命名实体识别模型训练的可行性。重用该数据集可以促进材料语言模型的训练和基准测试、自动本体构建以及从文本数据生成知识图谱。

[NLP-53] XMainframe: A Large Language Model for Mainframe Modernization
[NLP-53] XMainframe:大型机现代化的大型语言模型

链接: https://arxiv.org/abs/2408.04660
作者: Anh T. V. Dau,Hieu Trung Dao,Anh Tuan Nguyen,Hieu Trung Tran,Phong X. Nguyen,Nghi D. Q. Bui
关键词-EN: support critical sectors, continue to support, finance and government, support critical, critical sectors
关键词-ZN: 支持关键部门,继续支持、财政和政府,支持关键、关键部门
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Mainframe operating systems, despite their inception in the 1940s, continue to support critical sectors like finance and government. However, these systems are often viewed as outdated, requiring extensive maintenance and modernization. Addressing this challenge necessitates innovative tools that can understand and interact with legacy codebases. To this end, we introduce XMainframe, a state-of-the-art large language model (LLM) specifically designed with knowledge of mainframe legacy systems and COBOL codebases. Our solution involves the creation of an extensive data collection pipeline to produce high-quality training datasets, enhancing XMainframe’s performance in this specialized domain. Additionally, we present MainframeBench, a comprehensive benchmark for assessing mainframe knowledge, including multiple-choice questions, question answering, and COBOL code summarization. Our empirical evaluations demonstrate that XMainframe consistently outperforms existing state-of-the-art LLMs across these tasks. Specifically, XMainframe achieves 30% higher accuracy than DeepSeek-Coder on multiple-choice questions, doubles the BLEU score of Mixtral-Instruct 8x7B on question answering, and scores six times higher than GPT-3.5 on COBOL summarization. Our work highlights the potential of XMainframe to drive significant advancements in managing and modernizing legacy systems, thereby enhancing productivity and saving time for software developers.
摘要:大型机操作系统虽然诞生于20世纪40年代,但仍在支撑金融和政府等关键部门。然而,这些系统通常被认为已经过时,需要大量的维护和现代化改造。应对这一挑战需要能够理解遗留代码库并与之交互的创新工具。为此,我们推出了XMainframe,这是一种先进的大型语言模型(LLM),专门基于大型机遗留系统和COBOL代码库的知识进行设计。我们的方案包括创建一个广泛的数据收集流水线以产生高质量的训练数据集,从而提升XMainframe在这一专门领域的表现。此外,我们还提出了MainframeBench,这是一个评估大型机知识的综合基准,包括多项选择题、问答和COBOL代码摘要。我们的实证评估表明,XMainframe在这些任务上的表现始终优于现有最先进的LLM。具体而言,XMainframe在多项选择题上的准确率比DeepSeek-Coder高30%,在问答上的BLEU分数是Mixtral-Instruct 8x7B的两倍,在COBOL摘要上的分数是GPT-3.5的6倍。我们的工作突出了XMainframe在推动遗留系统管理与现代化方面的潜力,从而提高生产力并为软件开发人员节省时间。

[NLP-54] Winning Amazon KDD Cup24
[NLP-54] 赢得亚马逊KDD Cup 24

链接: https://arxiv.org/abs/2408.04658
作者: Chris Deotte,Ivan Sorokin,Ahmet Erdem,Benedikt Schifferer,Gilberto Titericz Jr,Simon Jegou
关键词-EN: Multi Task Online, Online Shopping Challenge, Task Online Shopping, Online Shopping, Multi Task
关键词-ZN: 在线多任务,在线购物挑战,在线购物任务,在线购物,在线购物,多任务
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This paper describes the winning solution of all 5 tasks for the Amazon KDD Cup 2024 Multi Task Online Shopping Challenge for LLMs. The challenge was to build a useful assistant, answering questions in the domain of online shopping. The competition contained 57 diverse tasks, covering 5 different task types (e.g. multiple choice) and across 4 different tracks (e.g. multi-lingual). Our solution is a single model per track. We fine-tune Qwen2-72B-Instruct on our own training dataset. As the competition released only 96 example questions, we developed our own training dataset by processing multiple public datasets or using Large Language Models for data augmentation and synthetic data generation. We apply wise-ft to account for distribution shifts and ensemble multiple LoRA adapters in one model. We employed Logits Processors to constrain the model output on relevant tokens for the tasks. AWQ 4-bit Quantization and vLLM are used during inference to predict the test dataset in the time constraints of 20 to 140 minutes depending on the track. Our solution achieved the first place in each individual track and is the first place overall of Amazons KDD Cup 2024.
摘要:本文描述了亚马逊KDD Cup 2024面向LLM的多任务在线购物挑战赛全部5项任务的获胜方案。该挑战旨在构建一个有用的助手,回答在线购物领域的问题。比赛包含57项不同的任务,涵盖5种任务类型(例如多项选择)和4条不同的赛道(例如多语言)。我们的方案是每条赛道使用单个模型。我们在自建的训练数据集上微调Qwen2-72B-Instruct。由于比赛只发布了96个示例问题,我们通过处理多个公共数据集,或使用大型语言模型进行数据增强与合成数据生成,开发了自己的训练数据集。我们应用wise-ft来应对分布偏移,并将多个LoRA适配器集成到一个模型中。我们使用Logits Processor来约束模型在任务相关词元上的输出。推理过程中使用AWQ 4比特量化和vLLM,以便根据赛道在20到140分钟的时间限制内完成测试集预测。我们的方案在每条赛道上均获得第一名,并获得亚马逊KDD Cup 2024总冠军。

[NLP-55] Towards Semantic Markup of Mathematical Documents via User Interaction
[NLP-55] 通过用户交互实现数学文档的语义标记

链接: https://arxiv.org/abs/2408.04656
作者: Luka Vrečar,Joe Wells,Fairouz Kamareddine
关键词-EN: Mathematical documents written, semantic markup, Mathematical documents, semantic, documents written
关键词-ZN: 编写的数学文档,语义标记,数学文档,语义,编写的文档
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: Submitted to the CICM 2024 conference, due to be published in Volume 14960 of Springer’s Lecture Notes in Computer Science

点击查看摘要

Abstract:Mathematical documents written in LaTeX often contain ambiguities. We can resolve some of them via semantic markup using, e.g., sTeX, which also has other potential benefits, such as interoperability with computer algebra systems, proof systems, and increased accessibility. However, semantic markup is more involved than “regular” typesetting and presents a challenge for authors of mathematical documents. We aim to smooth out the transition from plain LaTeX to semantic markup by developing semi-automatic tools for authors. In this paper we present an approach to semantic markup of formulas by (semi-)automatically generating grammars from existing sTeX macro definitions and parsing mathematical formulas with them. We also present a GUI-based tool for the disambiguation of parse results and showcase its functionality and potential using a grammar for parsing untyped \lambda -terms.
摘要:用LaTeX编写的数学文档通常包含歧义。我们可以通过语义标记(例如使用sTeX)来解决其中一些歧义,这还具有其他潜在好处,例如与计算机代数系统、证明系统的互操作性以及更好的可访问性。然而,语义标记比"常规"排版更复杂,给数学文档的作者带来了挑战。我们的目标是通过为作者开发半自动工具,使从普通LaTeX到语义标记的过渡更加平滑。在本文中,我们提出了一种公式语义标记方法:从现有的sTeX宏定义(半)自动生成语法,并用它们解析数学公式。我们还提供了一个基于GUI的工具用于消除解析结果的歧义,并通过一个解析无类型 \lambda-项的语法展示了其功能和潜力。

[NLP-56] Strong and weak alignment of large language models with human values
[NLP-56] 大型语言模型与人类价值观的强对齐与弱对齐

链接: https://arxiv.org/abs/2408.04655
作者: Mehdi Khamassi,Marceau Nahon,Raja Chatila
关键词-EN: Minimizing negative impacts, Artificial Intelligent, Minimizing negative, impacts of Artificial, negative impacts
关键词-ZN: 最大限度地减少负面影响,人工智能,最大限度地减少负面影响,人工的影响,负面影响
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Minimizing negative impacts of Artificial Intelligent (AI) systems on human societies without human supervision requires them to be able to align with human values. However, most current work only addresses this issue from a technical point of view, e.g., improving current methods relying on reinforcement learning from human feedback, neglecting what it means and is required for alignment to occur. Here, we propose to distinguish strong and weak value alignment. Strong alignment requires cognitive abilities (either human-like or different from humans) such as understanding and reasoning about agents’ intentions and their ability to causally produce desired effects. We argue that this is required for AI systems like large language models (LLMs) to be able to recognize situations presenting a risk that human values may be flouted. To illustrate this distinction, we present a series of prompts showing ChatGPT’s, Gemini’s and Copilot’s failures to recognize some of these situations. We moreover analyze word embeddings to show that the nearest neighbors of some human values in LLMs differ from humans’ semantic representations. We then propose a new thought experiment that we call “the Chinese room with a word transition dictionary”, in extension of John Searle’s famous proposal. We finally mention current promising research directions towards a weak alignment, which could produce statistically satisfying answers in a number of common situations, however so far without ensuring any truth value.
摘要:要在没有人类监督的情况下将人工智能(AI)系统对人类社会的负面影响降至最低,就要求它们能够与人类价值观对齐。然而,目前的大多数工作只从技术角度解决这个问题,例如改进现有的基于人类反馈强化学习的方法,而忽略了对齐的含义以及实现对齐所需的条件。在此,我们建议区分强价值对齐和弱价值对齐。强对齐需要认知能力(无论是类人的还是不同于人类的),例如理解和推理智能体的意图及其因果性地产生预期效果的能力。我们认为,这是大型语言模型(LLM)等AI系统识别人类价值观可能被践踏的风险情境所必需的。为了说明这一区别,我们给出了一系列提示,显示ChatGPT、Gemini和Copilot未能识别其中一些情境。此外,我们分析了词嵌入,表明LLM中某些人类价值观的最近邻不同于人类的语义表示。随后,我们在John Searle著名提议的基础上,提出了一个新的思想实验,称之为"带词语转换词典的中文房间"。最后,我们提到了当前朝向弱对齐的有前景的研究方向,它可以在许多常见情况下产生统计上令人满意的答案,但迄今为止仍无法确保任何真值。

[NLP-57] Batching BPE Tokenization Merges
[NLP-57] 批量执行BPE分词合并

链接: https://arxiv.org/abs/2408.04653
作者: Alexander P. Morgan
关键词-EN: Byte Pair Encoding, Pair Encoding algorithm, Byte Pair, Pair Encoding, Encoding algorithm
关键词-ZN: 字节对编码,对编码算法,字节对,对编码,编码算法
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 8 pages, 5 figures, 1 code block

点击查看摘要

Abstract:The Byte Pair Encoding algorithm can be safely batched to merge hundreds of pairs of tokens at a time when building up a tokenizer’s vocabulary. This technique combined with reducing the memory footprint of text used in vocabulary training make it feasible to train a high quality tokenizer on a basic laptop. This paper presents BatchBPE, an open-source pure Python implementation of these concepts, with the goal of making experimenting with new tokenization strategies more accessible especially in compute- and memory-constrained contexts. BatchBPE’s usefulness and malleability are demonstrated through the training of several token vocabularies to explore the batch merging process and experiment with preprocessing a stop word list and ignoring the least common text chunks in a dataset. Resultant encoded lengths of texts are used as a basic evaluation metric.
摘要:在构建标记器的词汇表时,字节对编码算法可以安全地批量处理,以一次合并数百对标记。这项技术与减少词汇训练中使用的文本的内存占用相结合,使得在基本笔记本电脑上训练高质量的标记器成为可能。本文介绍了BatchBPE,这是这些概念的开源纯Python实现,目标是使新的标记化策略的实验更容易实现,尤其是在计算和内存受限的环境中。BatchBPE的有用性和可塑性通过训练几个代币词汇表来探索批量合并过程并实验预处理停止词列表并忽略数据集中最不常见的文本块来证明。文本的结果编码长度用作基本评估指标。
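"一次合并数百对词元"的核心想法可以用一个极简示意来说明:每轮统计相邻词元对的频率,然后同时应用前 k 个互不共享词元的高频合并,而非逐对合并。这里的冲突判定与批大小只是假设的简化,并非 BatchBPE 的官方实现:

```python
from collections import Counter

# 示意:批量 BPE 合并。corpus 把每个词表示为词元元组并映射到词频。
def count_pairs(words):
    """统计语料中所有相邻词元对的加权频率。"""
    pairs = Counter()
    for word, freq in words.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_batch(words, k=2):
    """选出前 k 个互不共享词元的高频对,并在语料上同时应用这些合并。"""
    pairs = count_pairs(words)
    chosen, used = [], set()
    for pair, _ in pairs.most_common():
        if len(chosen) >= k:
            break
        if pair[0] in used or pair[1] in used:  # 简化的冲突判定(假设)
            continue
        chosen.append(pair)
        used.update(pair)
    merged = {}
    for word, freq in words.items():
        toks = list(word)
        for a, b in chosen:
            out, i = [], 0
            while i < len(toks):
                if i + 1 < len(toks) and toks[i] == a and toks[i + 1] == b:
                    out.append(a + b)
                    i += 2
                else:
                    out.append(toks[i])
                    i += 1
            toks = out
        merged[tuple(toks)] = freq
    return merged, chosen

corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2, ("n", "e", "w"): 3}
vocab, merges = merge_batch(corpus, k=2)
```

每轮只需一次频率统计即可完成多个合并,这正是批量化相对于经典逐对 BPE 节省计算的原因。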

[NLP-58] Leveraging Large Language Models with Chain-of-Thought and Prompt Engineering for Traffic Crash Severity Analysis and Inference
[NLP-58] 利用大型语言模型结合思维链和提示工程进行交通事故严重程度分析与推断

链接: https://arxiv.org/abs/2408.04652
作者: Hao Zhen,Yucheng Shi,Yongcan Huang,Jidong J. Yang,Ninghao Liu
关键词-EN: Large Language Models, Large Language, crash severity inference, Harnessing the power, power of Large
关键词-ZN: 大型语言模型、大型语言、崩溃严重程度推断、利用力量、大型的力量
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 20 pages, 12 figures, 3 tables

点击查看摘要

Abstract:Harnessing the power of Large Language Models (LLMs), this study explores the use of three state-of-the-art LLMs, specifically GPT-3.5-turbo, LLaMA3-8B, and LLaMA3-70B, for crash severity inference, framing it as a classification task. We generate textual narratives from original traffic crash tabular data using a pre-built template infused with domain knowledge. Additionally, we incorporated Chain-of-Thought (CoT) reasoning to guide the LLMs in analyzing the crash causes and then inferring the severity. This study also examine the impact of prompt engineering specifically designed for crash severity inference. The LLMs were tasked with crash severity inference to: (1) evaluate the models’ capabilities in crash severity analysis, (2) assess the effectiveness of CoT and domain-informed prompt engineering, and (3) examine the reasoning abilities with the CoT framework. Our results showed that LLaMA3-70B consistently outperformed the other models, particularly in zero-shot settings. The CoT and Prompt Engineering techniques significantly enhanced performance, improving logical reasoning and addressing alignment issues. Notably, the CoT offers valuable insights into LLMs’ reasoning processes, unleashing their capacity to consider diverse factors such as environmental conditions, driver behavior, and vehicle characteristics in severity analysis and inference.
摘要:本研究利用大型语言模型(LLM)的能力,探索了三种最先进的模型(GPT-3.5-turbo、LLaMA3-8B 和 LLaMA3-70B)在交通事故严重程度推断中的应用,并将其建模为分类任务。我们使用注入领域知识的预构建模板,从原始交通事故表格数据生成文本叙述。此外,我们引入思维链(CoT)推理来引导模型先分析事故原因,再推断严重程度。本研究还考察了专为事故严重程度推断设计的提示工程的影响。模型的任务设置旨在:(1)评估其在事故严重程度分析中的能力;(2)评估 CoT 与领域知识提示工程的有效性;(3)检验 CoT 框架下的推理能力。结果表明,LLaMA3-70B 的表现始终优于其他模型,尤其是在零样本设置下。CoT 与提示工程技术显著提升了性能,改进了逻辑推理并缓解了对齐问题。值得注意的是,CoT 为理解 LLM 的推理过程提供了有价值的视角,使其能够在严重程度分析与推断中综合考虑环境条件、驾驶员行为和车辆特征等多种因素。
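"模板生成叙述 + 思维链引导"的流程大致可以这样实现(草图:字段名、严重程度标签与提示措辞均为示例假设,并非论文原始模板):

```python
def crash_to_narrative(record):
    """把一行事故表格数据转成文本叙述(字段名为示例假设)。"""
    return (
        f"A {record['vehicle_type']} crash occurred on a {record['road_type']} "
        f"under {record['weather']} conditions at {record['time_of_day']}. "
        f"The driver was {record['driver_age']} years old."
    )

def build_cot_prompt(record):
    """在叙述后追加思维链指令:先分析成因,再推断严重程度。"""
    narrative = crash_to_narrative(record)
    return (
        narrative
        + "\nLet's think step by step: first analyze the likely causes of the"
          " crash, then infer its severity as one of"
          " [Fatal, Severe Injury, Minor Injury, No Injury]."
    )

record = {
    "vehicle_type": "passenger car",
    "road_type": "two-lane rural highway",
    "weather": "rainy",
    "time_of_day": "night",
    "driver_age": 34,
}
prompt = build_cot_prompt(record)
print(prompt)
```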

[NLP-59] Knowledge AI: Fine-tuning NLP Models for Facilitating Scientific Knowledge Extraction and Understanding
[NLP-59] 知识人工智能:微调NLP模型以促进科学知识提取和理解

链接: https://arxiv.org/abs/2408.04651
作者: Balaji Muralidharan,Hayden Beadles,Reza Marzban,Kalyan Sashank Mupparaju
关键词-EN: Large Language Models, deep learning framework, efficacy of Large, Large Language, Natural Language Processing
关键词-ZN: 大型语言模型、深度学习框架、大型、大型语言、自然语言处理的功效
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 11 pages

点击查看摘要

Abstract:This project investigates the efficacy of Large Language Models (LLMs) in understanding and extracting scientific knowledge across specific domains and to create a deep learning framework: Knowledge AI. As a part of this framework, we employ pre-trained models and fine-tune them on datasets in the scientific domain. The models are adapted for four key Natural Language Processing (NLP) tasks: summarization, text generation, question answering, and named entity recognition. Our results indicate that domain-specific fine-tuning significantly enhances model performance in each of these tasks, thereby improving their applicability for scientific contexts. This adaptation enables non-experts to efficiently query and extract information within targeted scientific fields, demonstrating the potential of fine-tuned LLMs as a tool for knowledge discovery in the sciences.
摘要:该项目研究大型语言模型(LLM)在理解和提取特定领域科学知识方面的功效,并构建了一个深度学习框架:Knowledge AI。作为该框架的一部分,我们使用预训练模型并在科学领域的数据集上对其进行微调。这些模型被适配用于四项关键的自然语言处理(NLP)任务:摘要、文本生成、问答和命名实体识别。我们的结果表明,特定领域的微调显著提升了模型在上述各项任务中的性能,从而提高了其在科学场景中的适用性。这种适配使非专家能够高效地查询和提取目标科学领域内的信息,展示了微调 LLM 作为科学知识发现工具的潜力。

[NLP-60] Building Trust in Mental Health Chatbots: Safety Metrics and LLM-Based Evaluation Tools
[NLP-60] 建立对心理健康聊天机器人的信任:安全指标和基于LLM的评估工具

链接: https://arxiv.org/abs/2408.04650
作者: Jung In Park,Mahyar Abbasian,Iman Azimi,Dawn Bounds,Angela Jun,Jaesu Han,Robert McCarron,Jessica Borelli,Jia Li,Mona Mahmoudi,Carmen Wiedenhoeft,Amir Rahmani
关键词-EN: increasingly popular due, mental health chatbots, mental health, human-like interactions, aims to develop
关键词-ZN: 由于心理健康聊天机器人、心理健康、类人互动,越来越受欢迎,旨在开发
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Objective: This study aims to develop and validate an evaluation framework to ensure the safety and reliability of mental health chatbots, which are increasingly popular due to their accessibility, human-like interactions, and context-aware support. Materials and Methods: We created an evaluation framework with 100 benchmark questions and ideal responses, and five guideline questions for chatbot responses. This framework, validated by mental health experts, was tested on a GPT-3.5-turbo-based chatbot. Automated evaluation methods explored included large language model (LLM)-based scoring, an agentic approach using real-time data, and embedding models to compare chatbot responses against ground truth standards. Results: The results highlight the importance of guidelines and ground truth for improving LLM evaluation accuracy. The agentic method, dynamically accessing reliable information, demonstrated the best alignment with human assessments. Adherence to a standardized, expert-validated framework significantly enhanced chatbot response safety and reliability. Discussion: Our findings emphasize the need for comprehensive, expert-tailored safety evaluation metrics for mental health chatbots. While LLMs have significant potential, careful implementation is necessary to mitigate risks. The superior performance of the agentic approach underscores the importance of real-time data access in enhancing chatbot reliability. Conclusion: The study validated an evaluation framework for mental health chatbots, proving its effectiveness in improving safety and reliability. Future work should extend evaluations to accuracy, bias, empathy, and privacy to ensure holistic assessment and responsible integration into healthcare. Standardized evaluations will build trust among users and professionals, facilitating broader adoption and improved mental health support through technology.
摘要:目的:本研究旨在开发并验证一个评估框架,以确保心理健康聊天机器人的安全性和可靠性;这类聊天机器人因其可及性、类人交互和上下文感知支持而日益流行。材料与方法:我们构建了一个评估框架,包含 100 个基准问题及理想回答,以及 5 个针对聊天机器人回复的指导性问题。该框架经心理健康专家验证后,在一个基于 GPT-3.5-turbo 的聊天机器人上进行了测试。探索的自动化评估方法包括:基于大型语言模型(LLM)的评分、利用实时数据的智能体(agentic)方法,以及用嵌入模型将聊天机器人回复与基准答案(ground truth)进行比较。结果:结果凸显了指导性问题和基准答案对提高 LLM 评估准确性的重要性。能动态获取可靠信息的智能体方法与人工评估的一致性最佳。遵循标准化、经专家验证的框架显著提高了聊天机器人回复的安全性和可靠性。讨论:我们的发现强调,心理健康聊天机器人需要全面的、由专家定制的安全评估指标。尽管 LLM 潜力巨大,但仍需谨慎落地以降低风险。智能体方法的优异表现凸显了实时数据访问对提升聊天机器人可靠性的重要性。结论:本研究验证了一个心理健康聊天机器人评估框架,证明其能有效提升安全性与可靠性。未来工作应将评估扩展到准确性、偏见、同理心和隐私,以确保整体评估并负责任地融入医疗保健。标准化评估将在用户与专业人员之间建立信任,促进更广泛的采用,并借助技术改善心理健康支持。
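文中"用嵌入模型将聊天机器人回复与基准答案比较"的做法,可用余弦相似度示意(纯 Python 草图;真实系统中的向量应由嵌入模型对文本编码得到,此处用玩具向量代替):

```python
import math

def cosine(u, v):
    """计算两个向量的余弦相似度。"""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# 玩具向量:实际应用中由嵌入模型产生
ground_truth = [0.2, 0.8, 0.1]
response_a = [0.21, 0.79, 0.12]   # 语义上贴近基准答案的回复
response_b = [0.9, 0.05, 0.4]     # 偏离基准答案的回复

sim_a = cosine(ground_truth, response_a)
sim_b = cosine(ground_truth, response_b)
print(f"A: {sim_a:.3f}, B: {sim_b:.3f}")
```

相似度越接近 1,说明回复与基准答案越一致;低于某个阈值的回复即可转交人工复核。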

[NLP-61] Chain of Stance: Stance Detection with Large Language Models
[NLP-61] 立场链:使用大型语言模型进行立场检测

链接: https://arxiv.org/abs/2408.04649
作者: Junxia Ma,Changjiang Wang,Hanwen Xing,Dongming Zhao,Yazhou Zhang
关键词-EN: natural language processing, Stance detection, active task, task in natural, aims to identify
关键词-ZN: 自然语言处理、姿态检测、主动任务、自然任务、旨在识别
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Stance detection is an active task in natural language processing (NLP) that aims to identify the author’s stance towards a particular target within a text. Given the remarkable language understanding capabilities and encyclopedic prior knowledge of large language models (LLMs), how to explore the potential of LLMs in stance detection has received significant attention. Unlike existing LLM-based approaches that focus solely on fine-tuning with large-scale datasets, we propose a new prompting method, called \textitChain of Stance (CoS). In particular, it positions LLMs as expert stance detectors by decomposing the stance detection process into a series of intermediate, stance-related assertions that culminate in the final judgment. This approach leads to significant improvements in classification performance. We conducted extensive experiments using four SOTA LLMs on the SemEval 2016 dataset, covering the zero-shot and few-shot learning setups. The results indicate that the proposed method achieves state-of-the-art results with an F1 score of 79.84 in the few-shot setting.
摘要:立场检测是自然语言处理(NLP)中一项活跃的任务,旨在识别作者在文本中对特定目标所持的立场。鉴于大型语言模型(LLM)出色的语言理解能力和百科全书式的先验知识,如何挖掘 LLM 在立场检测中的潜力受到了广泛关注。与现有仅依赖大规模数据集微调的基于 LLM 的方法不同,我们提出了一种新的提示方法,称为立场链(Chain of Stance, CoS)。该方法将立场检测过程分解为一系列与立场相关的中间断言,并最终汇聚为最终判断,从而将 LLM 定位为专家级立场检测器。这一方法显著提升了分类性能。我们使用四个 SOTA LLM 在 SemEval 2016 数据集上进行了大量实验,涵盖零样本和少样本学习设置。结果表明,该方法在少样本设置下取得了 79.84 的 F1 分数,达到最先进水平。
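论文所述"分解为一系列与立场相关的中间断言"的提示构造,大致可以这样实现(草图:断言文本与输出标签为示例假设,并非论文原文):

```python
# 中间断言步骤:引导模型先拆解再判断(措辞为示例假设)
ASSERTION_STEPS = [
    "1. Identify the target mentioned or implied in the text.",
    "2. List the opinion-bearing expressions about the target.",
    "3. State whether each expression is supportive, opposing, or neutral.",
    "4. Weigh the expressions and give the overall stance.",
]

def chain_of_stance_prompt(text, target):
    """把文本与目标拼入分步断言模板,最后要求给出立场标签。"""
    steps = "\n".join(ASSERTION_STEPS)
    return (
        f"Text: {text}\nTarget: {target}\n"
        "Act as an expert stance detector. Work through the following"
        f" intermediate assertions before the final judgment:\n{steps}\n"
        "Final stance (FAVOR / AGAINST / NONE):"
    )

prompt = chain_of_stance_prompt(
    "Wind farms ruin the landscape and barely cut emissions.",
    "renewable energy",
)
print(prompt)
```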

[NLP-62] PLUGH: A Benchmark for Spatial Understanding and Reasoning in Large Language Models ACL2024
[NLP-62] PLUGH:大型语言模型中空间理解和推理的基准

链接: https://arxiv.org/abs/2408.04648
作者: Alexey Tikhonov
关键词-EN: Large Language Models, input texts extracted, Large Language, present PLUGH, Language Models
关键词-ZN: 大型语言模型,提取的输入文本,大型语言,当前PLUGH,语言模型
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: Wordplay Workshop @ ACL 2024

点击查看摘要

Abstract:We present PLUGH (this https URL), a modern benchmark that currently consists of 5 tasks, each with 125 input texts extracted from 48 different games and representing 61 different (non-isomorphic) spatial graphs to assess the abilities of Large Language Models (LLMs) for spatial understanding and reasoning. Our evaluation of API-based and open-sourced LLMs shows that while some commercial LLMs exhibit strong reasoning abilities, open-sourced competitors can demonstrate almost the same level of quality; however, all models still have significant room for improvement. We identify typical reasons for LLM failures and discuss possible ways to deal with them. Datasets and evaluation code are released (this https URL).
摘要:我们提出 PLUGH(此 https URL),这是一个现代基准,目前包含 5 项任务,每项任务含 125 段从 48 款不同游戏中提取的输入文本,对应 61 个互不相同(非同构)的空间图,用于评估大型语言模型(LLM)的空间理解与推理能力。我们对基于 API 和开源的 LLM 的评估表明,虽然部分商业 LLM 展现出强大的推理能力,但开源模型也能达到几乎相同的水平;不过所有模型仍有很大的改进空间。我们归纳了 LLM 失败的典型原因,并讨论了可能的应对方法。数据集和评估代码已发布(此 https URL)。

[NLP-63] Distinguishing Chatbot from Human
[NLP-63] 聊天机器人与人类的区别

链接: https://arxiv.org/abs/2408.04647
作者: Gauri Anil Godghase,Rishit Agrawal,Tanush Obili,Mark Stamp
关键词-EN: generative Artificial Intelligence, Generative Pre-trained Transformer, Large Language Models, Artificial Intelligence, Pre-trained Transformer
关键词-ZN: 生成式人工智能、生成式预训练Transformer、大型语言模型、人工智能、预训练Transformer
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:There have been many recent advances in the fields of generative Artificial Intelligence (AI) and Large Language Models (LLM), with the Generative Pre-trained Transformer (GPT) model being a leading “chatbot.” LLM-based chatbots have become so powerful that it may seem difficult to differentiate between human-written and machine-generated text. To analyze this problem, we have developed a new dataset consisting of more than 750,000 human-written paragraphs, with a corresponding chatbot-generated paragraph for each. Based on this dataset, we apply Machine Learning (ML) techniques to determine the origin of text (human or chatbot). Specifically, we consider two methodologies for tackling this issue: feature analysis and embeddings. Our feature analysis approach involves extracting a collection of features from the text for classification. We also explore the use of contextual embeddings and transformer-based architectures to train classification models. Our proposed solutions offer high classification accuracy and serve as useful tools for textual analysis, resulting in a better understanding of chatbot-generated text in this era of advanced AI technology.
摘要:生成式人工智能(AI)和大语言模型(LLM)领域近来进展迅速,其中生成式预训练 Transformer(GPT)模型是领先的"聊天机器人"。基于 LLM 的聊天机器人已变得如此强大,以至于人类撰写与机器生成的文本似乎难以区分。为分析这一问题,我们构建了一个新数据集,包含超过 75 万个由人类撰写的段落,且每个段落都配有一个对应的由聊天机器人生成的段落。基于该数据集,我们应用机器学习(ML)技术判断文本来源(人类或聊天机器人)。具体而言,我们考虑两种方法:特征分析和嵌入。特征分析方法从文本中提取一组特征用于分类;我们还探索了使用上下文嵌入和基于 Transformer 的架构来训练分类模型。我们提出的方案具有较高的分类准确率,可作为文本分析的实用工具,有助于在先进 AI 技术时代更好地理解聊天机器人生成的文本。
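"从文本中提取一组特征用于分类"的思路可以用几个简单文体特征示意(草图:论文实际采用的特征集合未知,以下特征仅作演示):

```python
import re

def extract_features(text):
    """提取若干简单文体特征:平均句长、平均词长、词汇多样性。"""
    words = re.findall(r"[A-Za-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = len(words)
    return {
        "avg_sentence_len": n_words / max(len(sentences), 1),
        "avg_word_len": sum(map(len, words)) / max(n_words, 1),
        "type_token_ratio": len(set(words)) / max(n_words, 1),  # 词汇多样性
    }

feats = extract_features("The cat sat. The cat ran fast!")
print(feats)
```

这样的特征向量随后可交给任意分类器(如逻辑回归或随机森林)训练"人类 vs 聊天机器人"二分类模型。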

[NLP-64] Efficacy of Large Language Models in Systematic Reviews
[NLP-64] 大型语言模型在系统综述中的功效

链接: https://arxiv.org/abs/2408.04646
作者: Aaditya Shah,Shridhar Mehendale,Siddha Kanthi
关键词-EN: Large Language Models, Large Language, relationship between Environmental, effectiveness of Large, interpreting existing literature
关键词-ZN: 大型语言模型、大型语言、环境之间的关系、大型的有效性、解释现有文献
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This study investigates the effectiveness of Large Language Models (LLMs) in interpreting existing literature through a systematic review of the relationship between Environmental, Social, and Governance (ESG) factors and financial performance. The primary objective is to assess how LLMs can replicate a systematic review on a corpus of ESG-focused papers. We compiled and hand-coded a database of 88 relevant papers published from March 2020 to May 2024. Additionally, we used a set of 238 papers from a previous systematic review of ESG literature from January 2015 to February 2020. We evaluated two current state-of-the-art LLMs, Meta AI’s Llama 3 8B and OpenAI’s GPT-4o, on the accuracy of their interpretations relative to human-made classifications on both sets of papers. We then compared these results to a “Custom GPT” and a fine-tuned GPT-4o Mini model using the corpus of 238 papers as training data. The fine-tuned GPT-4o Mini model outperformed the base LLMs by 28.3% on average in overall accuracy on prompt 1. At the same time, the “Custom GPT” showed a 3.0% and 15.7% improvement on average in overall accuracy on prompts 2 and 3, respectively. Our findings reveal promising results for investors and agencies to leverage LLMs to summarize complex evidence related to ESG investing, thereby enabling quicker decision-making and a more efficient market.
摘要:本研究通过对环境、社会和治理(ESG)因素与财务绩效关系的系统综述,考察了大型语言模型(LLM)解读现有文献的有效性。主要目标是评估 LLM 能否在以 ESG 为主题的论文语料库上复现一项系统综述。我们编制并人工编码了一个数据库,收录 2020 年 3 月至 2024 年 5 月发表的 88 篇相关论文;此外,我们还使用了此前一项系统综述(覆盖 2015 年 1 月至 2020 年 2 月的 ESG 文献)中的 238 篇论文。我们评估了两个当前最先进的 LLM,即 Meta AI 的 Llama 3 8B 和 OpenAI 的 GPT-4o,考察其在两组论文上的解读相对于人工分类的准确性。随后,我们以这 238 篇论文为训练数据,将上述结果与"定制 GPT"和微调后的 GPT-4o mini 模型进行比较。在提示 1 上,微调后的 GPT-4o mini 模型的总体准确率平均比基础 LLM 高 28.3%;同时,"定制 GPT"在提示 2 和提示 3 上的总体准确率分别平均提高 3.0% 和 15.7%。我们的发现令人鼓舞:投资者和机构有望利用 LLM 总结与 ESG 投资相关的复杂证据,从而实现更快的决策和更高效的市场。
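微调前需要把人工编码的论文数据整理成训练样本。下面的草图按照 OpenAI 聊天微调常用的 JSONL 约定构造样本(系统提示与标签体系均为示例假设,并非论文的原始设置):

```python
import json

def to_finetune_example(abstract, label):
    """把一篇人工编码的论文摘要转成一条聊天微调样本。"""
    return {
        "messages": [
            {"role": "system",
             "content": "Classify the ESG-performance relationship"
                        " reported by the paper: positive / negative / mixed."},
            {"role": "user", "content": abstract},
            {"role": "assistant", "content": label},  # 人工编码的标签作为目标输出
        ]
    }

examples = [
    to_finetune_example("Firms with high ESG scores show higher ROA...", "positive"),
    to_finetune_example("No significant link between ESG and returns...", "mixed"),
]
jsonl = "\n".join(json.dumps(e) for e in examples)
print(jsonl)
```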

[NLP-65] Evaluating the Impact of Advanced LLM Techniques on AI-Lecture Tutors for a Robotics Course ECAI-2024
[NLP-65] 评估高级LLM技术对机器人学课程AI讲座导师的影响

链接: https://arxiv.org/abs/2408.04645
作者: Sebastian Kahl,Felix Löffler,Martin Maciol,Fabian Ridder,Marius Schmitz,Jennifer Spanagel,Jens Wienkamp,Christopher Burgahn,Malte Schilling
关键词-EN: Artificial Intelligence-based tutor, Large Language Models, Large Language, Artificial Intelligence-based, Intelligence-based tutor
关键词-ZN: 基于人工智能的导师,大型语言模型,大型语言,基于人工智能的导师
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Robotics (cs.RO)
备注: The article is an extended version of a paper presented at the International Workshop on AI in Education and Educational Research (AIEER) at ECAI-2024 (27th European Conference on Artificial Intelligence)

点击查看摘要

Abstract:This study evaluates the performance of Large Language Models (LLMs) as an Artificial Intelligence-based tutor for a university course. In particular, different advanced techniques are utilized, such as prompt engineering, Retrieval-Augmented-Generation (RAG), and fine-tuning. We assessed the different models and applied techniques using common similarity metrics like BLEU-4, ROUGE, and BERTScore, complemented by a small human evaluation of helpfulness and trustworthiness. Our findings indicate that RAG combined with prompt engineering significantly enhances model responses and produces better factual answers. In the context of education, RAG appears as an ideal technique as it is based on enriching the input of the model with additional information and material which usually is already present for a university course. Fine-tuning, on the other hand, can produce quite small, still strong expert models, but poses the danger of overfitting. Our study further asks how we measure performance of LLMs and how well current measurements represent correctness or relevance? We find high correlation on similarity metrics and a bias of most of these metrics towards shorter responses. Overall, our research points to both the potential and challenges of integrating LLMs in educational settings, suggesting a need for balanced training approaches and advanced evaluation frameworks.
摘要:本研究评估了大型语言模型(LLM)作为大学课程人工智能导师的表现,使用了多种先进技术,如提示工程、检索增强生成(RAG)和微调。我们使用 BLEU-4、ROUGE 和 BERTScore 等常见相似度指标评估不同模型和所用技术,并辅以小规模的有用性与可信度人工评估。结果表明,RAG 与提示工程相结合能显著改善模型回复并产生更符合事实的答案。在教育场景下,RAG 是一种理想技术,因为它用大学课程通常已有的附加信息和材料来丰富模型输入。另一方面,微调可以得到相当小却依然强大的专家模型,但存在过拟合的风险。我们的研究进一步追问:应如何衡量 LLM 的性能,当前的衡量指标在多大程度上反映正确性或相关性?我们发现各相似度指标高度相关,且多数指标偏向较短的回复。总体而言,我们的研究指出了将 LLM 融入教育环境的潜力与挑战,表明需要均衡的训练方法和先进的评估框架。
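文中发现"多数相似度指标偏向较短回复"。用一个 ROUGE-1 风格的一元重叠 F1 小例子可以直观看到这种长度偏差(纯 Python 草图,仅作演示):

```python
def unigram_f1(candidate, reference):
    """基于一元词重叠计算 F1(ROUGE-1 风格的简化版)。"""
    cand, ref = candidate.split(), reference.split()
    overlap = sum(min(cand.count(w), ref.count(w)) for w in set(cand))
    if overlap == 0:
        return 0.0
    p, r = overlap / len(cand), overlap / len(ref)
    return 2 * p * r / (p + r)

reference = "gradient descent updates weights using the loss gradient"
short = "gradient descent updates weights"
long_ = ("gradient descent updates weights using the loss gradient "
         "and also many other unrelated words appear here now")

print(unigram_f1(short, reference), unigram_f1(long_, reference))
```

短回复因精确率为 1 而得分更高,尽管长回复完整包含了参考答案:这正是这类重叠指标偏向较短输出的原因。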

[NLP-66] Risks Causes and Mitigations of Widespread Deployments of Large Language Models (LLMs): A Survey
[NLP-66] 大型语言模型(LLM)广泛部署的风险、原因和缓解措施:一项调查

链接: https://arxiv.org/abs/2408.04643
作者: Md Nazmus Sakib,Md Athikul Islam,Royal Pathak,Md Mashrur Arifin
关键词-EN: Natural Language Processing, transformed Natural Language, Large Language Models, significantly transformed Natural, Language Processing
关键词-ZN: 自然语言处理、改造的自然语言、大型语言模型、显着改造的自然、语言处理
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted to 2nd International Conference on Artificial Intelligence, Blockchain, and Internet of Things (AIBThings-2024), September 07-08, 2024, Michigan, USA

点击查看摘要

Abstract:Recent advancements in Large Language Models (LLMs), such as ChatGPT and LLaMA, have significantly transformed Natural Language Processing (NLP) with their outstanding abilities in text generation, summarization, and classification. Nevertheless, their widespread adoption introduces numerous challenges, including issues related to academic integrity, copyright, environmental impacts, and ethical considerations such as data bias, fairness, and privacy. The rapid evolution of LLMs also raises concerns regarding the reliability and generalizability of their evaluations. This paper offers a comprehensive survey of the literature on these subjects, systematically gathered and synthesized from Google Scholar. Our study provides an in-depth analysis of the risks associated with specific LLMs, identifying sub-risks, their causes, and potential solutions. Furthermore, we explore the broader challenges related to LLMs, detailing their causes and proposing mitigation strategies. Through this literature analysis, our survey aims to deepen the understanding of the implications and complexities surrounding these powerful models.
摘要:ChatGPT 和 LLaMA 等大型语言模型(LLM)的最新进展,以其在文本生成、摘要和分类方面的卓越能力,极大地改变了自然语言处理(NLP)。然而,它们的广泛采用带来了诸多挑战,包括学术诚信、版权、环境影响,以及数据偏见、公平性和隐私等伦理问题。LLM 的快速演进也引发了人们对其评估可靠性与可泛化性的担忧。本文系统地收集并综合了来自 Google Scholar 的相关文献,对上述主题进行了全面综述。我们深入分析了与特定 LLM 相关的风险,识别了子风险、其成因及潜在解决方案;此外,我们还探讨了与 LLM 相关的更广泛挑战,详述其成因并提出缓解策略。通过这一文献分析,本调查旨在加深对这些强大模型所涉影响与复杂性的理解。

[NLP-67] GPT-3 Powered Information Extraction for Building Robust Knowledge Bases
[NLP-67] GPT-3支持信息提取,用于构建稳健的知识库

链接: https://arxiv.org/abs/2408.04641
作者: Ritabrata Roy Choudhury,Soumik Dey
关键词-EN: language model, information extraction, knowledge base development, information, suggested
关键词-ZN: 语言模型、信息提取、知识库开发、信息、建议
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This work uses the state-of-the-art language model GPT-3 to offer a novel method of information extraction for knowledge base development. The suggested method attempts to solve the difficulties associated with obtaining relevant entities and relationships from unstructured text in order to extract structured information. We conduct experiments on a huge corpus of text from diverse fields to assess the performance of our suggested technique. The evaluation measures, which are frequently employed in information extraction tasks, include precision, recall, and F1-score. The findings demonstrate that GPT-3 can be used to efficiently and accurately extract pertinent and correct information from text, hence increasing the precision and productivity of knowledge base creation. We also assess how well our suggested approach performs in comparison to the most advanced information extraction techniques already in use. The findings show that by utilizing only a small number of instances in in-context learning, our suggested strategy yields competitive outcomes with notable savings in terms of data annotation and engineering expense. Additionally, we use our proposed method to retrieve Biomedical information, demonstrating its practicality in a real-world setting. All things considered, our suggested method offers a viable way to overcome the difficulties involved in obtaining structured data from unstructured text in order to create knowledge bases. It can greatly increase the precision and effectiveness of information extraction, which is necessary for many applications including chatbots, recommendation engines, and question-answering systems.
摘要:本文利用最先进的语言模型 GPT-3,为知识库构建提供了一种新颖的信息抽取方法。该方法旨在解决从非结构化文本中获取相关实体及其关系以抽取结构化信息的难题。我们在来自多个领域的大规模文本语料上开展实验,评估所提技术的性能;评估指标为信息抽取任务常用的精确率、召回率和 F1 分数。结果表明,GPT-3 能够高效、准确地从文本中抽取相关且正确的信息,从而提高知识库构建的精度和效率。我们还将所提方法与现有最先进的信息抽取技术进行了比较:结果显示,仅在上下文学习中使用少量示例,我们的策略即可取得有竞争力的结果,并显著节省数据标注和工程开销。此外,我们将所提方法用于生物医学信息检索,验证了其在真实场景中的实用性。总体而言,所提方法为克服从非结构化文本中获取结构化数据以构建知识库的困难提供了一条可行途径,可大幅提升信息抽取的精度和有效性,而这正是聊天机器人、推荐引擎和问答系统等众多应用所必需的。
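信息抽取评估中的精确率、召回率与 F1 按集合匹配计算,示意如下(实体三元组为虚构示例):

```python
def prf1(predicted, gold):
    """按集合匹配计算抽取结果的精确率、召回率与 F1。"""
    tp = len(predicted & gold)  # 与标准答案完全匹配的三元组数
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = {("aspirin", "treats", "headache"),
        ("insulin", "treats", "diabetes"),
        ("metformin", "treats", "diabetes")}
predicted = {("aspirin", "treats", "headache"),
             ("insulin", "treats", "diabetes"),
             ("aspirin", "treats", "fever")}   # 一条误抽取

p, r, f1 = prf1(predicted, gold)
print(f"P={p:.2f} R={r:.2f} F1={f1:.2f}")
```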

[NLP-68] LLMs for Enhanced Agricultural Meteorological Recommendations
[NLP-68] 增强农业气象建议的LLM

链接: https://arxiv.org/abs/2408.04640
作者: Ji-jun Park,Soo-joon Choi
关键词-EN: enhancing crop productivity, actionable insights based, Agricultural meteorological recommendations, soil conditions, weather forecasts
关键词-ZN: 提高作物生产力、基于可操作见解、农业气象建议、土壤条件、天气预报
类目: Computation and Language (cs.CL)
备注: 10 pages

点击查看摘要

Abstract:Agricultural meteorological recommendations are crucial for enhancing crop productivity and sustainability by providing farmers with actionable insights based on weather forecasts, soil conditions, and crop-specific data. This paper presents a novel approach that leverages large language models (LLMs) and prompt engineering to improve the accuracy and relevance of these recommendations. We designed a multi-round prompt framework to iteratively refine recommendations using updated data and feedback, implemented on ChatGPT, Claude2, and GPT-4. Our method was evaluated against baseline models and a Chain-of-Thought (CoT) approach using manually collected datasets. The results demonstrate significant improvements in accuracy and contextual relevance, with our approach achieving up to 90% accuracy and high GPT-4 scores. Additional validation through real-world pilot studies further confirmed the practical benefits of our method, highlighting its potential to transform agricultural practices and decision-making.
摘要:农业气象建议基于天气预报、土壤条件和作物特定数据为农民提供可操作的见解,对提高作物产量和可持续性至关重要。本文提出了一种利用大型语言模型(LLM)和提示工程来提高此类建议准确性与相关性的新方法。我们设计了一个多轮提示框架,利用更新的数据和反馈迭代改进建议,并在 ChatGPT、Claude2 和 GPT-4 上实现。我们在人工收集的数据集上,将该方法与基线模型和思维链(CoT)方法进行了对比评估。结果表明,我们的方法在准确性和上下文相关性上均有显著提升,准确率最高可达 90%,GPT-4 得分也很高。真实世界试点研究的进一步验证证实了该方法的实际价值,突显了其变革农业实践和决策的潜力。
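"多轮提示框架"的迭代结构可示意如下(草图:`call_llm` 为占位的桩函数,真实实现应调用 ChatGPT、Claude2 或 GPT-4 的接口;字段与轮次均为示例假设):

```python
def call_llm(prompt):
    """占位桩函数:真实系统中应替换为对 LLM API 的调用。"""
    return f"recommendation based on: {prompt.splitlines()[-1]}"

def refine(base_context, updates, rounds=3):
    """每一轮把最新的田间数据与上一轮建议拼入提示,迭代改进建议。"""
    recommendation = ""
    for i in range(min(rounds, len(updates))):
        prompt = (
            f"{base_context}\n"
            f"Previous recommendation: {recommendation or 'none'}\n"
            f"New field data: {updates[i]}"
        )
        recommendation = call_llm(prompt)
    return recommendation

updates = ["rain expected in 48h", "soil moisture 32%", "frost warning"]
final = refine("Winter wheat, loam soil, North China Plain.", updates)
print(final)
```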

[NLP-69] Abstractive summarization from Audio Transcription
[NLP-69] 基于音频转录的生成式摘要

链接: https://arxiv.org/abs/2408.04639
作者: Ilia Derkach
关键词-EN: gaining popularity, ranging from text, answers to queries, text translation, translation to generating
关键词-ZN: 越来越受欢迎,从文本、答案到查询、文本翻译、翻译到生成
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
备注: 36 pages, Master’s thesis, 14 figures

点击查看摘要

Abstract:Currently, large language models are gaining popularity, their achievements are used in many areas, ranging from text translation to generating answers to queries. However, the main problem with these new machine learning algorithms is that training such models requires large computing resources that only large IT companies have. To avoid this problem, a number of methods (LoRA, quantization) have been proposed so that existing models can be effectively fine-tuned for specific tasks. In this paper, we propose an E2E (end to end) audio summarization model using these techniques. In addition, this paper examines the effectiveness of these approaches to the problem under consideration and draws conclusions about the applicability of these methods.
摘要:目前,大型语言模型越来越受欢迎,它们的成果被应用于许多领域,从文本翻译到生成查询答案。然而,这些新机器学习算法的主要问题是训练此类模型需要只有大型IT公司才拥有的大量计算资源。为了避免这个问题,人们提出了多种方法(LoRA、量化),以便可以针对特定任务有效地微调现有模型。本文中,我们使用这些技术提出了一种E2E(端到端)音频摘要模型。此外,本文还考察了这些方法对所考虑问题的有效性,并得出了有关这些方法适用性的结论。
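文中提到的 LoRA 的核心思想是冻结原权重 W、只训练低秩增量 B·A,有效权重为 W + (α/r)·B·A。下面用小矩阵做一个纯 Python 示意(草图,与具体实现库无关):

```python
def matmul(A, B):
    """朴素矩阵乘法。"""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def lora_weight(W, B, A, alpha, r):
    """有效权重 W' = W + (alpha / r) * B @ A;W 冻结,仅 B、A 参与训练。"""
    delta = matmul(B, A)
    scale = alpha / r
    return [[W[i][j] + scale * delta[i][j]
             for j in range(len(W[0]))] for i in range(len(W))]

# 4x4 权重配秩 r=1 的增量:可训练参数从 16 个降到 8 个
W = [[1.0] * 4 for _ in range(4)]
B = [[0.5], [0.0], [0.0], [0.0]]   # 4x1
A = [[1.0, 2.0, 0.0, 0.0]]         # 1x4
W_eff = lora_weight(W, B, A, alpha=2, r=1)
print(W_eff[0])
```

参数量的节省正是摘要所说"在普通计算资源上微调大模型"得以成立的原因:对 d×d 的权重,LoRA 只训练 2·d·r 个参数。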

[NLP-70] Affective Computing in the Era of Large Language Models : A Survey from the NLP Perspective
[NLP-70] 大型语言模型时代的情感计算:NLP角度的调查

链接: https://arxiv.org/abs/2408.04638
作者: Yiqun Zhang,Xiaocui Yang,Xingle Xu,Zeran Gao,Yijie Huang,Shiyi Mu,Shi Feng,Daling Wang,Yifei Zhang,Kaisong Song,Ge Yu
关键词-EN: http URL create, integrating computer science, cognitive science knowledge, Affective Computing, Affective Computing tasks
关键词-ZN: http URL创建、集成计算机科学、认知科学知识、情感计算、情感计算任务
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Affective Computing (AC), integrating computer science, psychology, and cognitive science knowledge, aims to enable machines to recognize, interpret, and simulate human this http URL create more value, AC can be applied to diverse scenarios, including social media, finance, healthcare, education, etc. Affective Computing (AC) includes two mainstream tasks, i.e., Affective Understanding (AU) and Affective Generation (AG). Fine-tuning Pre-trained Language Models (PLMs) for AU tasks has succeeded considerably. However, these models lack generalization ability, requiring specialized models for specific tasks. Additionally, traditional PLMs face challenges in AG, particularly in generating diverse and emotionally rich responses. The emergence of Large Language Models (LLMs), such as the ChatGPT series and LLaMA models, brings new opportunities and challenges, catalyzing a paradigm shift in AC. LLMs possess capabilities of in-context learning, common sense reasoning, and advanced sequence generation, which present unprecedented opportunities for AU. To provide a comprehensive overview of AC in the LLMs era from an NLP perspective, we summarize the development of LLMs research in this field, aiming to offer new insights. Specifically, we first summarize the traditional tasks related to AC and introduce the preliminary study based on LLMs. Subsequently, we outline the relevant techniques of popular LLMs to improve AC tasks, including Instruction Tuning and Prompt Engineering. For Instruction Tuning, we discuss full parameter fine-tuning and parameter-efficient methods such as LoRA, P-Tuning, and Prompt Tuning. In Prompt Engineering, we examine Zero-shot, Few-shot, Chain of Thought (CoT), and Agent-based methods for AU and AG. To clearly understand the performance of LLMs on different Affective Computing tasks, we further summarize the existing benchmarks and evaluation methods.
摘要:情感计算(AC)融合计算机科学、心理学和认知科学的知识,旨在使机器能够识别、解释和模拟人类情感。为了创造更大价值,AC 可应用于社交媒体、金融、医疗、教育等多种场景。情感计算包括两大主流任务:情感理解(AU)与情感生成(AG)。针对 AU 任务微调预训练语言模型(PLM)已取得相当大的成功;然而这些模型缺乏泛化能力,需要为特定任务构建专门模型。此外,传统 PLM 在 AG 方面面临挑战,尤其是在生成多样且情感丰富的回复上。ChatGPT 系列和 LLaMA 等大型语言模型(LLM)的出现带来了新的机遇与挑战,催化了 AC 的范式转变:LLM 具备上下文学习、常识推理和高级序列生成能力,为 AU 带来前所未有的机会。为了从 NLP 视角全面概述 LLM 时代的 AC,我们总结了该领域 LLM 研究的发展,旨在提供新的见解。具体而言,我们首先总结 AC 的传统任务,并介绍基于 LLM 的初步研究;随后概述流行 LLM 改进 AC 任务的相关技术,包括指令微调和提示工程。对于指令微调,我们讨论全参数微调以及 LoRA、P-Tuning、Prompt Tuning 等参数高效方法;在提示工程方面,我们考察了用于 AU 和 AG 的零样本、少样本、思维链(CoT)和基于智能体的方法。为了清晰了解 LLM 在不同情感计算任务上的表现,我们进一步总结了现有的基准和评估方法。

[NLP-71] APE: Active Learning-based Tooling for Finding Informative Few-shot Examples for LLM-based Entity Matching
[NLP-71] APE:基于主动学习的工具,用于为基于LLM的实体匹配寻找信息量大的少样本示例

链接: https://arxiv.org/abs/2408.04637
作者: Kun Qian,Yisi Sang,Farima Fatahi Bayat,Anton Belyi,Xianqi Chu,Yash Govind,Samira Khorshidi,Rahul Khot,Katherine Luna,Azadeh Nikfarjam,Xiaoguang Qi,Fei Wu,Xianhan Zhang,Yunyao Li
关键词-EN: large language models, effectively directing large, directing large language, formulate suitable instructions, requiring extensive manual
关键词-ZN: 大型语言模型,有效地指导大型语言,制定合适的指令,需要大量的手册
类目: Computation and Language (cs.CL)
备注: 3 pages, Proceedings of the Fifth Workshop on Data Science with Human-in-the-Loop (DaSH 2024)

点击查看摘要

Abstract:Prompt engineering is an iterative procedure often requiring extensive manual effort to formulate suitable instructions for effectively directing large language models (LLMs) in specific tasks. Incorporating few-shot examples is a vital and effective approach to providing LLMs with precise instructions, leading to improved LLM performance. Nonetheless, identifying the most informative demonstrations for LLMs is labor-intensive, frequently entailing sifting through an extensive search space. In this demonstration, we showcase a human-in-the-loop tool called APE (Active Prompt Engineering) designed for refining prompts through active learning. Drawing inspiration from active learning, APE iteratively selects the most ambiguous examples for human feedback, which will be transformed into few-shot examples within the prompt. The demo recording can be found with the submission or be viewed at this https URL.
摘要:提示工程是一个迭代过程,通常需要大量人工来制定合适的指令,以便在特定任务中有效引导大型语言模型(LLM)。加入少样本示例是为 LLM 提供精确指令、提升其性能的重要且有效的方法。然而,为 LLM 找出信息量最大的示例是劳动密集型工作,往往需要在庞大的搜索空间中筛选。在本演示中,我们展示了一个名为 APE(Active Prompt Engineering,主动提示工程)的人在回路工具,旨在通过主动学习来完善提示。受主动学习启发,APE 迭代地选择最模糊的示例交由人工反馈,这些示例随后被转化为提示中的少样本示例。演示录像可随投稿获取,或在此 https URL 查看。
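"迭代选择最模糊的示例交给人工"在主动学习中常用预测熵来实现,可示意如下(草图:候选实体对的概率为虚构示例,真实系统中应来自 LLM 的类别置信度):

```python
import math

def entropy(probs):
    """预测分布的香农熵:越高表示模型越拿不准。"""
    return -sum(p * math.log(p) for p in probs if p > 0)

# 各候选实体对的 [匹配, 不匹配] 预测概率(虚构示例)
candidates = {
    "pair-1": [0.98, 0.02],   # 模型非常确定
    "pair-2": [0.55, 0.45],   # 最模糊,最值得人工标注
    "pair-3": [0.80, 0.20],
}

most_ambiguous = max(candidates, key=lambda k: entropy(candidates[k]))
print(most_ambiguous)
```

被选中的示例经人工标注后加入提示作为少样本示例,再进入下一轮选择。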

[NLP-72] Survey: Transformer-based Models in Data Modality Conversion
[NLP-72] 综述:模态转换中基于Transformer的模型

链接: https://arxiv.org/abs/2408.04723
作者: Elyas Rashno,Amir Eskandari,Aman Anand,Farhana Zulkernine
关键词-EN: natural language processing, including natural language, artificial intelligence domains, made significant strides, language processing
关键词-ZN: 自然语言处理,包括自然语言、人工智能领域,取得了重大进展,语言处理
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Signal Processing (eess.SP)
备注: Submitted to ACM Computing Surveys (CSUR)

点击查看摘要

Abstract:Transformers have made significant strides across various artificial intelligence domains, including natural language processing, computer vision, and audio processing. This success has naturally garnered considerable interest from both academic and industry researchers. Consequently, numerous Transformer variants (often referred to as X-formers) have been developed for these fields. However, a thorough and systematic review of these modality-specific conversions remains lacking. Modality Conversion involves the transformation of data from one form of representation to another, mimicking the way humans integrate and interpret sensory information. This paper provides a comprehensive review of transformer-based models applied to the primary modalities of text, vision, and speech, discussing their architectures, conversion methodologies, and applications. By synthesizing the literature on modality conversion, this survey aims to underline the versatility and scalability of transformers in advancing AI-driven content generation and understanding.
摘要:Transformer 模型已在自然语言处理、计算机视觉和音频处理等多个人工智能领域取得长足进步。这一成功自然引起了学术界和工业界研究人员的浓厚兴趣,因此,人们为这些领域开发了众多 Transformer 变体(常被称为 X-former)。然而,对这些特定模态转换的系统而彻底的综述仍然缺乏。模态转换是指将数据从一种表示形式变换为另一种,模仿人类整合与解释感官信息的方式。本文全面回顾了应用于文本、视觉和语音三大主要模态的基于 Transformer 的模型,讨论了它们的架构、转换方法和应用。通过综合模态转换方面的文献,本综述旨在强调 Transformer 在推进 AI 驱动的内容生成与理解方面的多功能性和可扩展性。

人工智能

[AI-0] Preserving Privacy in Large Language Models : A Survey on Current Threats and Solutions

链接: https://arxiv.org/abs/2408.05212
作者: Michele Miranda,Elena Sofia Ruzzetti,Andrea Santilli,Fabio Massimo Zanzotto,Sébastien Bratières,Emanuele Rodolà
关键词-EN: Large Language Models, Large Language, Language Models, represent a significant, artificial intelligence
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: GitHub repository: this https URL

点击查看摘要

Abstract:Large Language Models (LLMs) represent a significant advancement in artificial intelligence, finding applications across various domains. However, their reliance on massive internet-sourced datasets for training brings notable privacy issues, which are exacerbated in critical domains (e.g., healthcare). Moreover, certain application-specific scenarios may require fine-tuning these models on private data. This survey critically examines the privacy threats associated with LLMs, emphasizing the potential for these models to memorize and inadvertently reveal sensitive information. We explore current threats by reviewing privacy attacks on LLMs and propose comprehensive solutions for integrating privacy mechanisms throughout the entire learning pipeline. These solutions range from anonymizing training datasets to implementing differential privacy during training or inference and machine unlearning after training. Our comprehensive review of existing literature highlights ongoing challenges, available tools, and future directions for preserving privacy in LLMs. This work aims to guide the development of more secure and trustworthy AI systems by providing a thorough understanding of privacy preservation methods and their effectiveness in mitigating risks.
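摘要中提到的差分隐私(differential privacy)机制,其最简单的形式是在统计量上叠加尺度为 敏感度/ε 的拉普拉斯噪声,示意如下(草图,非生产实现,仅演示计数查询的加噪):

```python
import math
import random

def laplace_scale(sensitivity, epsilon):
    """拉普拉斯机制的噪声尺度 b = 敏感度 / epsilon。"""
    return sensitivity / epsilon

def laplace_noise(b, rng):
    """用逆变换采样生成 Laplace(0, b) 噪声。"""
    u = rng.random() - 0.5
    sign = 1.0 if u >= 0 else -1.0
    return -b * sign * math.log(1 - 2 * abs(u))

def dp_count(true_count, epsilon, seed=0):
    """对计数查询(敏感度为 1)加拉普拉斯噪声后返回。"""
    rng = random.Random(seed)
    return true_count + laplace_noise(laplace_scale(1, epsilon), rng)

noisy = dp_count(100, epsilon=0.5)
print(noisy)
```

ε 越小噪声越大、隐私保护越强;训练阶段的 DP-SGD 遵循同样的"按敏感度定标加噪"思想,只是作用于梯度。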

[AI-1] VITA: Towards Open-Source Interactive Omni Multimodal LLM

Link: https://arxiv.org/abs/2408.05211
Authors: Chaoyou Fu,Haojia Lin,Zuwei Long,Yunhang Shen,Meng Zhao,Yifan Zhang,Xiong Wang,Di Yin,Long Ma,Xiawu Zheng,Ran He,Rongrong Ji,Yunsheng Wu,Caifeng Shan,Xing Sun
Keywords (EN): models rarely excel, Large Language Model, open-source models rarely, underscore their necessity, practical applications
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Project Page: this https URL

View abstract

Abstract:The remarkable multimodal capabilities and interactive experience of GPT-4o underscore their necessity in practical applications, yet open-source models rarely excel in both areas. In this paper, we introduce VITA, the first-ever open-source Multimodal Large Language Model (MLLM) adept at simultaneous processing and analysis of Video, Image, Text, and Audio modalities, while also offering an advanced multimodal interactive experience. Starting from Mixtral 8x7B as a language foundation, we expand its Chinese vocabulary followed by bilingual instruction tuning. We further endow the language model with visual and audio capabilities through two-stage multi-task learning of multimodal alignment and instruction tuning. VITA demonstrates robust foundational capabilities of multilingual, vision, and audio understanding, as evidenced by its strong performance across a range of both unimodal and multimodal benchmarks. Beyond foundational capabilities, we have made considerable progress in enhancing the natural multimodal human-computer interaction experience. To the best of our knowledge, we are the first to exploit non-awakening interaction and audio interrupt in MLLM. While much work remains before VITA approaches its closed-source counterparts, it is a first step for the open-source community toward the seamless integration of multimodal understanding and interaction, and we hope its role as a pioneer can serve as a cornerstone for subsequent research. Project Page: this https URL.

[AI-2] TaSL: Task Skill Localization and Consolidation for Language Model Continual Learning ACL2024

Link: https://arxiv.org/abs/2408.05200
Authors: Yujie Feng,Xu Chu,Yongxin Xu,Zexin Lu,Bo Liu,Philip S. Yu,Xiao-Ming Wu
Keywords (EN): recently garnered significant, garnered significant interest, significant interest due, dynamic real-world environments, adapt large language
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Extension of ACL 2024 paper titled: Continual Dialog State Tracking via Task Skill Localization and Consolidation

View abstract

Abstract:Language model continual learning (CL) has recently garnered significant interest due to its potential to adapt large language models (LLMs) to dynamic real-world environments without re-training. A key challenge in this field is catastrophic forgetting, where models lose previously acquired knowledge when learning new tasks. Existing methods commonly employ multiple parameter-efficient fine-tuning (PEFT) blocks to acquire task-specific knowledge for each task, but these approaches lack efficiency and overlook the potential for knowledge transfer through task interaction. In this paper, we present a novel CL framework for language models called Task Skill Localization and Consolidation (TaSL), which enhances knowledge transfer without relying on memory replay. TaSL first divides the model into 'skill units' based on parameter dependencies, enabling more granular control. It then employs a novel group-wise skill localization technique to identify the importance distribution of skill units for a new task. By comparing this importance distribution with those from previous tasks, we implement a fine-grained skill consolidation strategy that retains task-specific knowledge, thereby preventing forgetting, and updates task-shared knowledge, which facilitates bi-directional knowledge transfer. As a result, TaSL achieves a superior balance between retaining previous knowledge and excelling in new tasks. TaSL also shows strong generalizability, suitable for general models and customizable for PEFT methods like LoRA. Additionally, it demonstrates notable extensibility, allowing integration with memory replay to further enhance performance. Extensive experiments on two CL benchmarks, with varying model sizes (from 220M to 7B), demonstrate the effectiveness of TaSL and its variants across different settings.
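A simplified reading of the consolidation step can be sketched as follows. This is an illustrative reconstruction, not the authors' exact rule: skill units important only to the new task keep the new weights (task-specific), units important to both tasks are averaged (task-shared), and the rest keep the old weights; the threshold `tau` and the 50/50 average are assumptions.

```python
def consolidate(old_params, new_params, old_imp, new_imp, tau=0.5):
    """Merge per-skill-unit parameters based on importance scores.

    Illustrative rule (simplified from the paper's idea):
      - important only for the new task  -> take new weights (task-specific)
      - important for both tasks         -> average (task-shared transfer)
      - otherwise                        -> keep old weights (prevent forgetting)
    """
    merged = {}
    for unit in old_params:
        o_hot, n_hot = old_imp[unit] >= tau, new_imp[unit] >= tau
        if n_hot and not o_hot:
            merged[unit] = new_params[unit]
        elif n_hot and o_hot:
            merged[unit] = 0.5 * old_params[unit] + 0.5 * new_params[unit]
        else:
            merged[unit] = old_params[unit]
    return merged

old_p = {"u1": 1.0, "u2": 2.0, "u3": 3.0}
new_p = {"u1": 5.0, "u2": 6.0, "u3": 7.0}
merged = consolidate(old_p, new_p,
                     old_imp={"u1": 0.9, "u2": 0.1, "u3": 0.8},
                     new_imp={"u1": 0.9, "u2": 0.9, "u3": 0.1})
print(merged)  # u1 shared -> 3.0, u2 new-specific -> 6.0, u3 old -> 3.0
```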

[AI-3] HistoKernel: Whole Slide Image Level Maximum Mean Discrepancy Kernels for Pan-Cancer Predictive Modelling

Link: https://arxiv.org/abs/2408.05195
Authors: Piotr Keller,Muhammad Dawood,Brinder Singh Chohan,Fayyaz ul Amir Afsar Minhas
Keywords (EN): Slide Images, multi-gigapixel Whole Slide, computational pathology, scores for crucial, Machine learning
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 28 pages, 5 figures, 1 Table. Preprint for article in review at Nature Machine Intelligence

View abstract

Abstract:Machine learning in computational pathology (CPath) often aggregates patch-level predictions from multi-gigapixel Whole Slide Images (WSIs) to generate WSI-level prediction scores for crucial tasks such as survival prediction and drug effect prediction. However, current methods do not explicitly characterize distributional differences between patch sets within WSIs. We introduce HistoKernel, a novel Maximum Mean Discrepancy (MMD) kernel that measures distributional similarity between WSIs for enhanced prediction performance on downstream prediction tasks. Our comprehensive analysis demonstrates HistoKernel’s effectiveness across various machine learning tasks, including retrieval (n = 9,362), drug sensitivity regression (n = 551), point mutation classification (n = 3,419), and survival analysis (n = 2,291), outperforming existing deep learning methods. Additionally, HistoKernel seamlessly integrates multi-modal data and offers a novel perturbation-based method for patch-level explainability. This work pioneers the use of kernel-based methods for WSI-level predictive modeling, opening new avenues for research. Code is available at this https URL.
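The core quantity, the squared MMD between two slides' patch-embedding sets, has a standard biased (V-statistic) estimator under an RBF kernel. The sketch below is the textbook formula, not the paper's implementation:

```python
import math

def rbf(x, y, gamma):
    # RBF kernel between two feature vectors
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def mmd2(X, Y, gamma=1.0):
    """Biased (V-statistic) estimate of squared Maximum Mean Discrepancy
    between two sets of feature vectors under an RBF kernel."""
    kxx = sum(rbf(a, b, gamma) for a in X for b in X) / (len(X) ** 2)
    kyy = sum(rbf(a, b, gamma) for a in Y for b in Y) / (len(Y) ** 2)
    kxy = sum(rbf(a, b, gamma) for a in X for b in Y) / (len(X) * len(Y))
    return kxx + kyy - 2.0 * kxy

patches_a = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.1]]   # toy patch embeddings, slide A
patches_b = [[2.0, 2.1], [1.9, 2.0], [2.1, 1.9]]   # toy patch embeddings, slide B
print(mmd2(patches_a, patches_a))  # ~0: identical distributions
print(mmd2(patches_a, patches_b))  # large: dissimilar slides
```

MMD² of a set against itself is exactly zero under this estimator, which makes a convenient sanity check.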

[AI-4] Meta-Learning Guided Label Noise Distillation for Robust Signal Modulation Classification

Link: https://arxiv.org/abs/2408.05151
Authors: Xiaoyang Hao,Zhixi Feng,Tongqing Peng,Shuyuan Yang
Keywords (EN): Automatic modulation classification, physical layer threats, Automatic modulation, modulation classification, internet of things
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Comments: 8 pages, 7 figures

View abstract

Abstract:Automatic modulation classification (AMC) is an effective way to deal with physical layer threats of the internet of things (IoT). However, there is often label mislabeling in practice, which significantly impacts the performance and robustness of deep neural networks (DNNs). In this paper, we propose a meta-learning guided label noise distillation method for robust AMC. Specifically, a teacher-student heterogeneous network (TSHN) framework is proposed to distill and reuse label noise. Based on the idea that labels are representations, the teacher network with trusted meta-learning divides and conquers untrusted label samples and then guides the student network to learn better by reassessing and correcting labels. Furthermore, we propose a multi-view signal (MVS) method to further improve the performance of hard-to-classify categories with few-shot trusted label samples. Extensive experimental results show that our methods can significantly improve the performance and robustness of signal AMC in various and complex label noise scenarios, which is crucial for securing IoT applications.
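One naive instantiation of "reassessing and correcting labels" (a toy stand-in for the paper's meta-learning-guided teacher, with an invented confidence threshold) overrides a label only when the teacher confidently disagrees with it:

```python
def reassess_labels(given_labels, teacher_probs, conf_threshold=0.9):
    """Relabel samples where a teacher network confidently disagrees with
    the given (possibly noisy) label; keep the given label otherwise."""
    corrected, flagged = [], []
    for i, (y, probs) in enumerate(zip(given_labels, teacher_probs)):
        pred = max(range(len(probs)), key=probs.__getitem__)
        if pred != y and probs[pred] >= conf_threshold:
            corrected.append(pred)   # confident disagreement: override
            flagged.append(i)
        else:
            corrected.append(y)      # trust the given label
    return corrected, flagged

labels = [0, 1, 2]
probs = [[0.95, 0.03, 0.02],   # teacher agrees with label 0
         [0.97, 0.02, 0.01],   # teacher confidently says 0, label was 1 -> relabel
         [0.40, 0.35, 0.25]]   # teacher unconfident -> keep label 2
print(reassess_labels(labels, probs))  # ([0, 0, 2], [1])
```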

[AI-5] AttackER: Towards Enhancing Cyber-Attack Attribution with a Named Entity Recognition Dataset

Link: https://arxiv.org/abs/2408.05149
Authors: Pritam Deka,Sampath Rajapaksha,Ruby Rani,Amirah Almutairi,Erisa Karafili
Keywords (EN): place attacker-oriented countermeasures, Natural Language Processing, legal actions, experts to put, put in place
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: Submitted to WISE 2024

View abstract

Abstract:Cyber-attack attribution is an important process that allows experts to put in place attacker-oriented countermeasures and legal actions. The analysts mainly perform attribution manually, given the complex nature of this task. AI and, more specifically, Natural Language Processing (NLP) techniques can be leveraged to support cybersecurity analysts during the attribution process. However powerful these techniques are, they need to deal with the lack of datasets in the attack attribution domain. In this work, we will fill this gap and will provide, to the best of our knowledge, the first dataset on cyber-attack attribution. We designed our dataset with the primary goal of extracting attack attribution information from cybersecurity texts, utilizing named entity recognition (NER) methodologies from the field of NLP. Unlike other cybersecurity NER datasets, ours offers a rich set of annotations with contextual details, including some that span phrases and sentences. We conducted extensive experiments and applied NLP techniques to demonstrate the dataset’s effectiveness for attack attribution. These experiments highlight the potential of Large Language Models (LLMs) capabilities to improve the NER tasks in cybersecurity datasets for cyber-attack attribution.
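Span-level NER annotations of the kind described are commonly stored as token-level BIO tags and decoded back into (type, start, end) spans for evaluation. A generic decoder, with the entity-type names invented for illustration:

```python
def bio_to_spans(tokens, tags):
    """Decode BIO tags into (entity_type, start, end) spans, end exclusive."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):          # sentinel flushes the last span
        if tag.startswith("B-") or tag == "O":
            if start is not None:
                spans.append((etype, start, i))
                start, etype = None, None
            if tag.startswith("B-"):
                start, etype = i, tag[2:]
        elif tag.startswith("I-") and etype != tag[2:]:
            # ill-formed I- tag: treat it as a new entity start
            if start is not None:
                spans.append((etype, start, i))
            start, etype = i, tag[2:]
    return spans

tokens = ["APT29", "targeted", "US", "agencies"]
tags = ["B-THREAT_ACTOR", "O", "B-LOC", "O"]
print(bio_to_spans(tokens, tags))  # [('THREAT_ACTOR', 0, 1), ('LOC', 2, 3)]
```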

[AI-6] Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2

Link: https://arxiv.org/abs/2408.05147
Authors: Tom Lieberum,Senthooran Rajamanoharan,Arthur Conmy,Lewis Smith,Nicolas Sonnerat,Vikrant Varma,János Kramár,Anca Dragan,Rohin Shah,Neel Nanda
Keywords (EN): seemingly interpretable features, neural network latent, network latent representations, Sparse autoencoders, sparse decomposition
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 12 main text pages, and 14 pages of acknowledgements, references and appendices

View abstract

Abstract:Sparse autoencoders (SAEs) are an unsupervised method for learning a sparse decomposition of a neural network’s latent representations into seemingly interpretable features. Despite recent excitement about their potential, research applications outside of industry are limited by the high cost of training a comprehensive suite of SAEs. In this work, we introduce Gemma Scope, an open suite of JumpReLU SAEs trained on all layers and sub-layers of Gemma 2 2B and 9B and select layers of Gemma 2 27B base models. We primarily train SAEs on the Gemma 2 pre-trained models, but additionally release SAEs trained on instruction-tuned Gemma 2 9B for comparison. We evaluate the quality of each SAE on standard metrics and release these results. We hope that by releasing these SAE weights, we can help make more ambitious safety and interpretability research easier for the community. Weights and a tutorial can be found at this https URL and an interactive demo can be found at this https URL
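A JumpReLU SAE zeroes every feature whose pre-activation falls below a learned threshold, producing a sparse code that a linear decoder maps back to the original activation. A toy forward pass (dimensions, initialization, and the threshold are illustrative; the released SAEs are trained and far wider):

```python
import random

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def jumprelu(z, theta):
    # JumpReLU: pass a pre-activation through only where it exceeds the threshold
    return [zi if zi > theta else 0.0 for zi in z]

def sae_forward(x, W_enc, W_dec, theta):
    """Toy JumpReLU sparse autoencoder: encode an activation vector into
    sparse features, then linearly reconstruct the input from them."""
    f = jumprelu(matvec(W_enc, x), theta)   # sparse feature activations
    x_hat = matvec(W_dec, f)                # reconstruction
    return f, x_hat

rng = random.Random(0)
d_model, d_sae = 4, 16                      # toy sizes; real SAEs are far wider
W_enc = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(d_sae)]
W_dec = [[rng.gauss(0, 1) / d_sae for _ in range(d_sae)] for _ in range(d_model)]
x = [rng.gauss(0, 1) for _ in range(d_model)]
f, x_hat = sae_forward(x, W_enc, W_dec, theta=0.5)
print("active features:", sum(1 for v in f if v > 0), "of", d_sae)
```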

[AI-7] Cautious Calibration in Binary Classification ECAI2024

Link: https://arxiv.org/abs/2408.05120
Authors: Mari-Liis Allikivi,Joonas Järve,Meelis Kull
Keywords (EN): machine learning systems, learning systems integrated, crucial for enhancing, enhancing the trustworthiness, trustworthiness of machine
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted to ECAI 2024

View abstract

Abstract:Being cautious is crucial for enhancing the trustworthiness of machine learning systems integrated into decision-making pipelines. Although calibrated probabilities help in optimal decision-making, perfect calibration remains unattainable, leading to estimates that fluctuate between under- and overconfidence. This becomes a critical issue in high-risk scenarios, where even occasional overestimation can lead to extreme expected costs. In these scenarios, it is important for each predicted probability to lean towards underconfidence, rather than just achieving an average balance. In this study, we introduce the novel concept of cautious calibration in binary classification. This approach aims to produce probability estimates that are intentionally underconfident for each predicted probability. We highlight the importance of this approach in a high-risk scenario and propose a theoretically grounded method for learning cautious calibration maps. Through experiments, we explore and compare our method to various approaches, including methods originally not devised for cautious calibration but applicable in this context. We show that our approach is the most consistent in providing cautious estimates. Our work establishes a strong baseline for further developments in this novel framework.
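The paper develops its own theoretically grounded calibration maps; as a generic illustration of deliberate underconfidence, one can replace a score bin's raw positive rate with a lower confidence bound on it, such as the Wilson score bound:

```python
import math

def wilson_lower(successes, n, z=1.96):
    """Lower Wilson score bound on a binomial proportion: a deliberately
    underconfident estimate of a bin's true positive rate."""
    if n == 0:
        return 0.0
    p = successes / n
    denom = 1 + z * z / n
    center = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return max(0.0, (center - margin) / denom)

# 45 positives out of 50 in a score bin: the naive estimate is 0.90,
# while the cautious estimate deliberately leans low.
print(round(45 / 50, 3), round(wilson_lower(45, 50), 3))
```

The bound moves further below the naive estimate as the bin shrinks, which is exactly the direction a cautious system should err in high-risk settings.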

[AI-8] Semantic Successive Refinement: A Generative AI-aided Semantic Communication Framework

Link: https://arxiv.org/abs/2408.05112
Authors: Kexin Zhang,Lixin Li,Wensheng Lin,Yuna Yan,Rui Li,Wenchi Cheng,Zhu Han
Keywords (EN): emerging technology aiming, Shannon limit, surpass the Shannon, Semantic Communication, emerging technology
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Comments:

View abstract

Abstract:Semantic Communication (SC) is an emerging technology aiming to surpass the Shannon limit. Traditional SC strategies often minimize signal distortion between the original and reconstructed data, neglecting perceptual quality, especially in low Signal-to-Noise Ratio (SNR) environments. To address this issue, we introduce a novel Generative AI Semantic Communication (GSC) system for single-user scenarios. This system leverages deep generative models to establish a new paradigm in SC. Specifically, at the transmitter end, it employs a joint source-channel coding mechanism based on the Swin Transformer for efficient semantic feature extraction and compression. At the receiver end, an advanced Diffusion Model (DM) reconstructs high-quality images from degraded signals, enhancing perceptual details. Additionally, we present a Multi-User Generative Semantic Communication (MU-GSC) system utilizing an asynchronous processing model. This model effectively manages multiple user requests and optimally utilizes system resources for parallel processing. Simulation results on public datasets demonstrate that our generative AI semantic communication systems achieve superior transmission efficiency and enhanced communication content quality across various channel conditions. Compared to CNN-based DeepJSCC, our methods improve the Peak Signal-to-Noise Ratio (PSNR) by 17.75% in Additive White Gaussian Noise (AWGN) channels and by 20.86% in Rayleigh channels.
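The reported gains use the standard PSNR definition, PSNR = 10·log10(MAX² / MSE), computed between the original and reconstructed images:

```python
import math

def psnr(original, reconstructed, max_val=255.0):
    """Peak Signal-to-Noise Ratio between two images (flat pixel lists)."""
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return float("inf")   # identical images: no noise
    return 10.0 * math.log10(max_val ** 2 / mse)

ref = [52.0, 60.0, 61.0, 204.0]   # toy reference pixels
deg = [50.0, 62.0, 59.0, 200.0]   # toy reconstructed pixels
print(round(psnr(ref, deg), 2))
```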

[AI-9] Application of Unsupervised Artificial Neural Network (ANN) Self_Organizing Map (SOM) in Identifying Main Car Sales Factors

Link: https://arxiv.org/abs/2408.05110
Authors: Mazyar Taghavi
Keywords (EN): consumer tastes, attract customers, Factors, Iranian customer buying, important factors
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments:

View abstract

Abstract:The factors that attract customers and persuade them to buy a new car vary with consumer tastes, and several methods exist for extracting patterns from mass data. In this study, we first asked passenger-car marketing experts to rank the factors that most affect customer decision-making, using the fuzzy Delphi technique. We then built a sample set from questionnaires and applied a useful artificial neural network method, the self-organizing map (SOM), to find out which factors have the greatest effect on Iranian customers' buying decisions. Fuzzy tools were applied to make the study more realistic, and MATLAB software was used for developing and training the network. The results indicate that four factors are more important than the others, and these rankings differ somewhat from the marketing experts' rankings. Such results would help manufacturers focus on the more important factors and increase company sales.
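The core of a SOM is a competitive update: find the best-matching unit (BMU) for a sample and pull it, together with its grid neighbours, toward the sample. A minimal 1-D sketch (grid size, decay schedules, and the toy two-cluster data are assumptions, not the study's setup):

```python
import math
import random

def train_som(data, n_units=4, epochs=200, lr0=0.5, sigma0=2.0, seed=0):
    """Train a tiny 1-D self-organizing map: for each sampled input, find the
    best-matching unit and pull it (and its grid neighbours) toward the input."""
    rng = random.Random(seed)
    dim = len(data[0])
    w = [[rng.random() for _ in range(dim)] for _ in range(n_units)]
    for t in range(epochs):
        lr = lr0 * (1 - t / epochs)                      # decaying learning rate
        sigma = max(sigma0 * (1 - t / epochs), 0.5)      # shrinking neighbourhood
        x = data[rng.randrange(len(data))]
        bmu = min(range(n_units),
                  key=lambda u: sum((w[u][d] - x[d]) ** 2 for d in range(dim)))
        for u in range(n_units):
            h = math.exp(-((u - bmu) ** 2) / (2 * sigma ** 2))
            for d in range(dim):
                w[u][d] += lr * h * (x[d] - w[u][d])
    return w

# Two toy factor clusters (e.g. "price-driven" vs "quality-driven" answers)
data = [[0.1, 0.1], [0.15, 0.05], [0.9, 0.95], [0.95, 0.9]]
weights = train_som(data)
print(weights)
```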

[AI-10] MooER: LLM-based Speech Recognition and Translation Models from Moore Threads

Link: https://arxiv.org/abs/2408.05101
Authors: Junhao Xu,Zhenlin Liang,Yi Liu,Yichao Hu,Jian Li,Yajun Zheng,Meng Cai,Hua Wang
Keywords (EN): Moore Threads, LLM-based large-scale automatic, automatic speech recognition, automatic speech translation, large-scale automatic speech
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

View abstract

Abstract:In this paper, we present MooER, an LLM-based large-scale automatic speech recognition (ASR) / automatic speech translation (AST) model from Moore Threads. A 5000-hour pseudo-labeled dataset containing open-source and self-collected speech data is used for training. We achieve performance comparable to other open-source models trained with up to hundreds of thousands of hours of labeled speech data. Meanwhile, experiments conducted on the CoVoST2 Zh2en test set suggest that our model outperforms other open-source Speech LLMs, obtaining a BLEU score of 25.2. The main contributions of this paper are summarized as follows. First, this paper presents a training strategy for encoders and LLMs on speech-related tasks (including ASR and AST) using a small amount of pseudo-labeled data without any extra manual annotation and selection. Second, we release our ASR and AST models and plan to open-source our training code and strategy in the near future. Moreover, a model trained at the 80,000-hour ("8wh") scale is planned to be released later on.

[AI-11] AI-driven Java Performance Testing: Balancing Result Quality with Testing Time

Link: https://arxiv.org/abs/2408.05100
Authors: Luca Traini,Federico Di Menna,Vittorio Cortellessa
Keywords (EN): uncovering efficiency issues, Performance testing aims, warm-up phase, aims at uncovering, uncovering efficiency
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Performance (cs.PF)
Comments: Accepted for publication in The 39th IEEE/ACM International Conference on Automated Software Engineering (ASE '24)

View abstract

Abstract:Performance testing aims at uncovering efficiency issues of software systems. In order to be both effective and practical, the design of a performance test must achieve a reasonable trade-off between result quality and testing time. This becomes particularly challenging in the Java context, where the software undergoes a warm-up phase of execution, due to just-in-time compilation. During this phase, performance measurements are subject to severe fluctuations, which may adversely affect the quality of performance test results. State-of-practice and state-of-the-art techniques therefore attempt to estimate the end of the warm-up phase, so that only steady-state measurements are retained. However, these approaches often provide suboptimal estimates of the warm-up phase, resulting in either insufficient or excessive warm-up iterations, which may degrade result quality or increase testing time, and there is still a lack of consensus on how to properly address this problem. Here, we propose and study an AI-based framework to dynamically halt warm-up iterations at runtime. Specifically, our framework leverages recent advances in AI for Time Series Classification (TSC) to predict the end of the warm-up phase during test execution. We conduct experiments by training three different TSC models on half a million measurement segments obtained from JMH microbenchmark executions. We find that our framework significantly improves the accuracy of the warm-up estimates provided by state-of-practice and state-of-the-art methods. This higher estimation accuracy results in a net improvement in either result quality or testing time for up to +35.3% of the microbenchmarks. Our study highlights that integrating AI to dynamically estimate the end of the warm-up phase can enhance the cost-effectiveness of Java performance testing.
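The paper's framework trains time-series classifiers for this prediction; as a naive stand-in that conveys the task, one can flag the first window of measurements whose coefficient of variation falls below a tolerance:

```python
import statistics

def detect_warmup_end(measurements, window=10, cv_tol=0.05):
    """Return the index where execution-time measurements start to look steady:
    the first sliding window whose coefficient of variation is below cv_tol.
    (A naive stand-in for the paper's learned time-series classifiers.)"""
    for i in range(len(measurements) - window + 1):
        seg = measurements[i:i + window]
        mean = statistics.fmean(seg)
        cv = statistics.pstdev(seg) / mean if mean else float("inf")
        if cv < cv_tol:
            return i
    return None   # never stabilized within the measured run

# Synthetic benchmark: slow, noisy iterations, then a steady state near 10 ms
series = [30.0, 24.0, 19.0, 15.0, 13.0, 12.0] + [10.0, 10.1, 9.9, 10.0] * 5
print(detect_warmup_end(series))  # first steady window starts at index 6
```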

[AI-12] Overcoming the Limitations of Layer Synchronization in Spiking Neural Networks

Link: https://arxiv.org/abs/2408.05098
Authors: Roel Koopman,Amirreza Yousefzadeh,Mahyar Shahsavari,Guangzhi Tang,Manolis Sifalakis
Keywords (EN): aggregate incoming currents, layer aggregate incoming, Spiking Neural Networks, machine learning applications, learning applications relies
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Comments:

View abstract

Abstract:Currently, neural-network processing in machine learning applications relies on layer synchronization, whereby neurons in a layer aggregate incoming currents from all neurons in the preceding layer, before evaluating their activation function. This is practiced even in artificial Spiking Neural Networks (SNNs), which are touted as consistent with neurobiology, in spite of processing in the brain being, in fact asynchronous. A truly asynchronous system however would allow all neurons to evaluate concurrently their threshold and emit spikes upon receiving any presynaptic current. Omitting layer synchronization is potentially beneficial, for latency and energy efficiency, but asynchronous execution of models previously trained with layer synchronization may entail a mismatch in network dynamics and performance. We present a study that documents and quantifies this problem in three datasets on our simulation environment that implements network asynchrony, and we show that models trained with layer synchronization either perform sub-optimally in absence of the synchronization, or they will fail to benefit from any energy and latency reduction, when such a mechanism is in place. We then “make ends meet” and address the problem with unlayered backprop, a novel backpropagation-based training method, for learning models suitable for asynchronous processing. We train with it models that use different neuron execution scheduling strategies, and we show that although their neurons are more reactive, these models consistently exhibit lower overall spike density (up to 50%), reach a correct decision faster (up to 2x) without integrating all spikes, and achieve superior accuracy (up to 10% higher). Our findings suggest that asynchronous event-based (neuromorphic) AI computing is indeed more efficient, but we need to seriously rethink how we train our SNN models, to benefit from it.
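The contrast with layer synchronization is easiest to see in a toy event-driven integrate-and-fire neuron, which integrates each incoming spike the moment it arrives and may fire before all presynaptic inputs are in (all parameters below are illustrative):

```python
def run_async_lif(events, weights, threshold=1.0, leak=0.9):
    """Event-driven leaky integrate-and-fire neuron: the membrane potential is
    updated per incoming spike (no layer-wide synchronization) and a spike is
    emitted the moment the threshold is crossed."""
    v, out_spikes = 0.0, []
    last_t = 0
    for t, pre in sorted(events):            # (time, presynaptic neuron id)
        v *= leak ** (t - last_t)            # leak between events
        last_t = t
        v += weights[pre]                    # integrate this spike immediately
        if v >= threshold:
            out_spikes.append(t)             # fire as soon as threshold is reached
            v = 0.0                          # reset membrane potential
    return out_spikes

weights = {0: 0.6, 1: 0.5, 2: -0.3}
events = [(1, 0), (2, 1), (3, 2), (4, 0), (5, 1)]
print(run_async_lif(events, weights))  # fires once, at t=2
```

Note the neuron at t=2 fires without waiting for the inhibitory input at t=3, exactly the behaviour that layer-synchronized execution would suppress.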

[AI-13] Hyperbolic Learning with Multimodal Large Language Models ECCV2024

Link: https://arxiv.org/abs/2408.05097
Authors: Paolo Mandica,Luca Franco,Konstantinos Kallidromitis,Suzanne Petryk,Fabio Galasso
Keywords (EN): including image segmentation, deep-learning tasks, including image, active learning, demonstrated their effectiveness
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: ECCV 2024 - Beyond Euclidean Workshop

View abstract

Abstract:Hyperbolic embeddings have demonstrated their effectiveness in capturing measures of uncertainty and hierarchical relationships across various deep-learning tasks, including image segmentation and active learning. However, their application in modern vision-language models (VLMs) has been limited. A notable exception is MERU, which leverages the hierarchical properties of hyperbolic space in the CLIP ViT-large model, consisting of hundreds of millions parameters. In our work, we address the challenges of scaling multi-modal hyperbolic models by orders of magnitude in terms of parameters (billions) and training complexity using the BLIP-2 architecture. Although hyperbolic embeddings offer potential insights into uncertainty not present in Euclidean embeddings, our analysis reveals that scaling these models is particularly difficult. We propose a novel training strategy for a hyperbolic version of BLIP-2, which allows to achieve comparable performance to its Euclidean counterpart, while maintaining stability throughout the training process and showing a meaningful indication of uncertainty with each embedding.
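Hyperbolic embeddings of this kind typically live in the Poincaré ball, where geodesic distances grow without bound as points approach the unit sphere, the property that makes the space well suited to hierarchies. The standard distance formula:

```python
import math

def poincare_distance(u, v):
    """Geodesic distance between two points in the Poincare ball (norm < 1),
    the hyperbolic space commonly used for hierarchical embeddings."""
    uu = sum(x * x for x in u)
    vv = sum(x * x for x in v)
    duv = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * duv / ((1.0 - uu) * (1.0 - vv)))

origin = [0.0, 0.0]
print(poincare_distance(origin, [0.5, 0.0]))   # 2*artanh(0.5) = ln(3)
print(poincare_distance(origin, [0.9, 0.0]))   # distances blow up near the rim
```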

[AI-14] Order Matters in Hallucination: Reasoning Order as Benchmark and Reflexive Prompting for Large-Language-Models AAAI25

Link: https://arxiv.org/abs/2408.05093
Authors: Zikai Xie
Keywords (EN): Large language models, Large language, generated significant attention, finding applications, industrial domains
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 7 pages, submitted to AAAI25

View abstract

Abstract:Large language models (LLMs) have generated significant attention since their inception, finding applications across various academic and industrial domains. However, these models often suffer from the “hallucination problem”, where outputs, though grammatically and logically coherent, lack factual accuracy or are entirely fabricated. A particularly troubling issue discovered and widely discussed recently is the numerical comparison error where multiple LLMs incorrectly infer that “9.11 > 9.9”. We discovered that the order in which LLMs generate answers and reasoning impacts their consistency. Specifically, results vary significantly when an LLM generates an answer first and then provides the reasoning versus generating the reasoning process first and then the conclusion. Inspired by this, we propose a new benchmark method for assessing LLM consistency: comparing responses generated through these two different approaches. This benchmark effectively identifies instances where LLMs fabricate answers and subsequently generate justifications. Furthermore, we introduce a novel and straightforward prompt strategy designed to mitigate this issue. Experimental results demonstrate that this strategy improves performance across various LLMs compared to direct questioning. This work not only sheds light on a critical flaw in LLMs but also offers a practical solution to enhance their reliability.

[AI-15] Generating novel experimental hypotheses from language models: A case study on cross-dative generalization

Link: https://arxiv.org/abs/2408.05086
Authors: Kanishka Misra,Najoung Kim
Keywords (EN): Neural network language, complex linguistic knowledge, successfully capture complex, capture complex linguistic, network language models
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

View abstract

Abstract:Neural network language models (LMs) have been shown to successfully capture complex linguistic knowledge. However, their utility for understanding language acquisition is still debated. We contribute to this debate by presenting a case study where we use LMs as simulated learners to derive novel experimental hypotheses to be tested with humans. We apply this paradigm to study cross-dative generalization (CDG): productive generalization of novel verbs across dative constructions (she pilked me the ball/she pilked the ball to me) – acquisition of which is known to involve a large space of contextual features – using LMs trained on child-directed speech. We specifically ask: “what properties of the training exposure facilitate a novel verb’s generalization to the (unmodeled) alternate construction?” To answer this, we systematically vary the exposure context in which a novel dative verb occurs in terms of the properties of the theme and recipient, and then analyze the LMs’ usage of the novel verb in the unmodeled dative construction. We find LMs to replicate known patterns of children’s CDG, as a precondition to exploring novel hypotheses. Subsequent simulations reveal a nuanced role of the features of the novel verbs’ exposure context on the LMs’ CDG. We find CDG to be facilitated when the first postverbal argument of the exposure context is pronominal, definite, short, and conforms to the prototypical animacy expectations of the exposure dative. These patterns are characteristic of harmonic alignment in datives, where the argument with features ranking higher on the discourse prominence scale tends to precede the other. This gives rise to a novel hypothesis that CDG is facilitated insofar as the features of the exposure context – in particular, its first postverbal argument – are harmonically aligned. We conclude by proposing future experiments that can test this hypothesis in children.

[AI-16] Generalizing Few Data to Unseen Domains Flexibly Based on Label Smoothing Integrated with Distributionally Robust Optimization

Link: https://arxiv.org/abs/2408.05082
Authors: Yangdi Wang,Zhi-Hai Zhang,Su Xiu Xu,Wenming Guo
Keywords (EN): deep neural networks, applying deep neural, Overfitting commonly occurs, existing data, neural networks
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

View abstract

Abstract:Overfitting commonly occurs when applying deep neural networks (DNNs) to small-scale datasets, where DNNs do not generalize well from existing data to unseen data. The main reason for this is that small-scale datasets cannot reflect the situations of the real world. Label smoothing (LS) is an effective regularization method that prevents overfitting by mixing one-hot labels with uniform label vectors. However, LS only focuses on labels while ignoring the distribution of the existing data. In this paper, we introduce distributionally robust optimization (DRO) into LS, enabling the existing data distribution to be shifted flexibly toward unseen domains when training DNNs. Specifically, we prove that the regularization of LS can be extended to a regularization term for the DNN parameters when DRO is integrated. This regularization term can be utilized to shift the existing data to unseen domains and generate new data. Furthermore, we propose an approximate gradient-iteration label smoothing algorithm (GI-LS) to realize these findings and train DNNs. We prove that the shift of the existing data does not influence the convergence of GI-LS. Since GI-LS incorporates a series of hyperparameters, we further consider using Bayesian optimization (BO) to find relatively optimal combinations of these hyperparameters. Taking small-scale anomaly classification tasks as a case study, we evaluate GI-LS, and the results clearly demonstrate its superior performance.
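The label smoothing baseline that the paper extends mixes a one-hot label with the uniform distribution, y_ls = (1 − ε)·onehot(y) + ε/K; the DRO extension additionally shifts the data distribution, which this sketch does not attempt:

```python
def smooth_label(y, num_classes, eps=0.1):
    """Label smoothing: mix a one-hot label with the uniform distribution."""
    return [(1 - eps) * (1.0 if k == y else 0.0) + eps / num_classes
            for k in range(num_classes)]

# With eps=0.1 and K=4, the true class gets 1 - eps + eps/K = 0.925
# and every other class gets eps/K = 0.025.
target = smooth_label(2, num_classes=4, eps=0.1)
print(target)
```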

[AI-17] RT-Surv: Improving Mortality Prediction After Radiotherapy with Large Language Model Structuring of Large-Scale Unstructured Electronic Health Records

Link: https://arxiv.org/abs/2408.05074
Authors: Sangjoon Park,Chan Woo Wee,Seo Hee Choi,Kyung Hwan Kim,Jee Suk Chang,Hong In Yoon,Ik Jae Lee,Yong Bae Kim,Jaeho Cho,Ki Chang Keum,Chang Geol Lee,Hwa Kyung Byun,Woong Sub Koom
Keywords (EN): Accurate patient selection, prevent ineffective treatments, Accurate patient, unstructured EHR data, unstructured EHR
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 23 pages, 2 tables, 4 figures

View abstract

Abstract:Accurate patient selection is critical in radiotherapy (RT) to prevent ineffective treatments. Traditional survival prediction models, relying on structured data, often lack precision. This study explores the potential of large language models (LLMs) to structure unstructured electronic health record (EHR) data, thereby improving survival prediction accuracy through comprehensive clinical information integration. Data from 34,276 patients treated with RT at Yonsei Cancer Center between 2013 and 2023 were analyzed, encompassing both structured and unstructured data. An open-source LLM was used to structure the unstructured EHR data via single-shot learning, with its performance compared against a domain-specific medical LLM and a smaller variant. Survival prediction models were developed using statistical, machine learning, and deep learning approaches, incorporating both structured and LLM-structured data. Clinical experts evaluated the accuracy of the LLM-structured data. The open-source LLM achieved 87.5% accuracy in structuring unstructured EHR data without additional training, significantly outperforming the domain-specific medical LLM, which reached only 35.8% accuracy. Larger LLMs were more effective, particularly in extracting clinically relevant features like general condition and disease extent, which closely correlated with patient survival. Incorporating LLM-structured clinical features into survival prediction models significantly improved accuracy, with the C-index of deep learning models increasing from 0.737 to 0.820. These models also became more interpretable by emphasizing clinically significant factors. This study shows that general-domain LLMs, even without specific medical training, can effectively structure large-scale unstructured EHR data, substantially enhancing the accuracy and interpretability of clinical predictive models.
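The C-index used to evaluate the survival models is the standard concordance index over comparable patient pairs; a generic implementation (not the authors' code):

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C-index: among comparable patient pairs (the earlier time is
    an observed event), the fraction where the model assigns higher risk to
    the patient who failed earlier. Ties in risk count as 0.5."""
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # pair is comparable if i failed first and i's event was observed
            if times[i] < times[j] and events[i] == 1:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    return concordant / comparable

times = [2.0, 5.0, 7.0, 9.0]    # follow-up times
events = [1, 1, 0, 1]           # 1 = event observed, 0 = censored
risk = [0.9, 0.6, 0.4, 0.1]     # model's predicted risk
print(concordance_index(times, events, risk))  # perfectly ranked -> 1.0
```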

[AI-18] A Jailbroken GenAI Model Can Cause Substantial Harm: GenAI-powered Applications are Vulnerable to PromptWares

链接: https://arxiv.org/abs/2408.05061
作者: Stav Cohen,Ron Bitton,Ben Nassi
关键词-EN: jailbroken GenAI model, GenAI model behavior, GenAI model, paper we argue, substantial harm
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注: Website, see this https URL

点击查看摘要

Abstract:In this paper we argue that a jailbroken GenAI model can cause substantial harm to GenAI-powered applications and facilitate PromptWare, a new type of attack that flips the GenAI model’s behavior from serving an application to attacking it. PromptWare exploits user inputs to jailbreak a GenAI model to force it to perform malicious activity within the context of a GenAI-powered application. First, we introduce a naive implementation of PromptWare that behaves as malware that targets Plan & Execute architectures (a.k.a., ReAct, function calling). We show that attackers could force a desired execution flow by creating a user input that produces desired outputs given that the logic of the GenAI-powered application is known to attackers. We demonstrate the application of a DoS attack that triggers the execution of a GenAI-powered assistant to enter an infinite loop that wastes money and computational resources on redundant API calls to a GenAI engine, preventing the application from providing service to a user. Next, we introduce a more sophisticated implementation of PromptWare that we name Advanced PromptWare Threat (APwT) that targets GenAI-powered applications whose logic is unknown to attackers. We show that attackers could create user input that exploits the GenAI engine’s advanced AI capabilities to launch a kill chain in inference time consisting of six steps intended to escalate privileges, analyze the application’s context, identify valuable assets, reason possible malicious activities, decide on one of them, and execute it. We demonstrate the application of APwT against a GenAI-powered e-commerce chatbot and show that it can trigger the modification of SQL tables, potentially leading to unauthorized discounts on the items sold to the user.
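The DoS scenario described above (an assistant forced into an infinite loop of redundant API calls) suggests a simple application-side defense. A minimal sketch of a call-budget guard with hypothetical limits, not a mechanism from the paper:

```python
class CallBudgetGuard:
    """Defensive wrapper for a GenAI agent loop: refuses further calls when
    the budget is exhausted or an identical request repeats (a loop signature)."""

    def __init__(self, max_calls=10, max_repeats=2):
        self.max_calls = max_calls
        self.max_repeats = max_repeats
        self.calls = 0
        self.seen = {}

    def allow(self, request: str) -> bool:
        self.calls += 1
        if self.calls > self.max_calls:
            return False
        # count identical requests; beyond the repeat limit, refuse
        self.seen[request] = self.seen.get(request, 0) + 1
        return self.seen[request] <= self.max_repeats

guard = CallBudgetGuard(max_calls=5, max_repeats=2)
results = [guard.allow("summarise inbox") for _ in range(4)]
print(results)  # → [True, True, False, False]
```

Identical requests beyond the repeat limit are refused, breaking the redundant-call loop before it drains the API budget.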

[AI-19] GLEAMS: Bridging the Gap Between Local and Global Explanations

链接: https://arxiv.org/abs/2408.05060
作者: Giorgio Visani,Vincenzo Stanzione,Damien Garreau
关键词-EN: machine learning algorithms, algorithms is crucial, emerged recently, explainability of machine, machine learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The explainability of machine learning algorithms is crucial, and numerous methods have emerged recently. Local, post-hoc methods assign an attribution score to each feature, indicating its importance for the prediction. However, these methods require recalculating explanations for each example. On the other side, while there exist global approaches they often produce explanations that are either overly simplistic and unreliable or excessively complex. To bridge this gap, we propose GLEAMS, a novel method that partitions the input space and learns an interpretable model within each sub-region, thereby providing both faithful local and global surrogates. We demonstrate GLEAMS’ effectiveness on both synthetic and real-world data, highlighting its desirable properties and human-understandable insights.
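GLEAMS partitions the input space and fits an interpretable model in each sub-region. A toy one-dimensional sketch of that idea, with a fixed split point and least-squares lines per region (the real method learns the partition over many features):

```python
def fit_line(xs, ys):
    # ordinary least squares for a single feature: returns (slope, intercept)
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    slope = cov / var
    return slope, my - slope * mx

def piecewise_surrogate(xs, ys, split):
    """Partition the 1-D input space at `split` and fit an interpretable
    linear model in each sub-region (the GLEAMS idea, radically simplified)."""
    left = [(x, y) for x, y in zip(xs, ys) if x < split]
    right = [(x, y) for x, y in zip(xs, ys) if x >= split]
    return fit_line(*zip(*left)), fit_line(*zip(*right))

# toy black box: y = |x|, which is exactly linear on each side of 0
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [abs(x) for x in xs]
left_model, right_model = piecewise_surrogate(xs, ys, split=0.0)
print(left_model, right_model)  # → (-1.0, 0.0) (1.0, 0.0)
```

Each per-region line is both a faithful local explanation (slope = attribution) and, taken together, a global surrogate.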

[AI-20] SELD-Mamba: Selective State-Space Model for Sound Event Localization and Detection with Source Distance Estimation

链接: https://arxiv.org/abs/2408.05057
作者: Da Mu,Zhicheng Zhang,Haobo Yue,Zehao Wang,Jin Tang,Jianqin Yin
关键词-EN: Sound Event Localization, demonstrated impressive capabilities, Sound Event Detection, Event Localization, Transformer-based models
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:In the Sound Event Localization and Detection (SELD) task, Transformer-based models have demonstrated impressive capabilities. However, the quadratic complexity of the Transformer’s self-attention mechanism results in computational inefficiencies. In this paper, we propose a network architecture for SELD called SELD-Mamba, which utilizes Mamba, a selective state-space model. We adopt the Event-Independent Network V2 (EINV2) as the foundational framework and replace its Conformer blocks with bidirectional Mamba blocks to capture a broader range of contextual information while maintaining computational efficiency. Additionally, we implement a two-stage training method, with the first stage focusing on Sound Event Detection (SED) and Direction of Arrival (DoA) estimation losses, and the second stage reintroducing the Source Distance Estimation (SDE) loss. Our experimental results on the 2024 DCASE Challenge Task3 dataset demonstrate the effectiveness of the selective state-space model in SELD and highlight the benefits of the two-stage training approach in enhancing SELD performance.
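The two-stage training recipe (stage 1: SED and DoA losses only; stage 2: reintroduce the SDE loss) can be sketched as a stage-dependent objective. Loss values and the SDE weight below are illustrative, not the paper's:

```python
def seld_loss(sed_loss, doa_loss, sde_loss, stage, w_sde=1.0):
    """Two-stage objective sketch: stage 1 optimises Sound Event Detection
    (SED) + Direction of Arrival (DoA) only; stage 2 adds the Source
    Distance Estimation (SDE) term. Weights are assumptions."""
    loss = sed_loss + doa_loss
    if stage == 2:
        loss += w_sde * sde_loss
    return loss

print(seld_loss(0.5, 0.3, 0.2, stage=1))  # → 0.8
print(seld_loss(0.5, 0.3, 0.2, stage=2))  # → 1.0
```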

[AI-21] A GNN Model with Adaptive Weights for Session-Based Recommendation Systems

链接: https://arxiv.org/abs/2408.05051
作者: Begüm Özbay,Dr. Resul Tugay,Prof. Dr. Şule Gündüz Öğüdücü
关键词-EN: users’ interests based, recommendation systems aim, Session-based recommendation systems, model users’ interests, session-based recommendation model
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注: 7 pages, 7 tables, 2 figures, and 3 equations

点击查看摘要

Abstract:Session-based recommendation systems aim to model users’ interests based on their sequential interactions to predict the next item in an ongoing session. In this work, we present a novel approach that can be used in session-based recommendations (SBRs). Our goal is to enhance the prediction accuracy of an existing session-based recommendation model, the SR-GNN model, by introducing an adaptive weighting mechanism applied to the graph neural network (GNN) vectors. This mechanism is designed to incorporate various types of side information obtained through different methods during the study. Items are assigned varying degrees of importance within each session as a result of the weighting mechanism. We hypothesize that this adaptive weighting strategy will contribute to more accurate predictions and thus improve the overall performance of SBRs in different scenarios. The adaptive weighting strategy can be utilized to address the cold start problem in SBRs by dynamically adjusting the importance of items in each session, thus providing better recommendations in cold start situations, such as for new users or newly added items. Our experimental evaluations on the Dressipi dataset demonstrate the effectiveness of the proposed approach compared to traditional models in enhancing the user experience and highlighting its potential to optimize the recommendation results in real-world applications.
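The core idea of weighting items in a session by side information can be sketched as a softmax-weighted pooling of item vectors. Vectors and scores below are toy stand-ins, not the SR-GNN pipeline:

```python
import math

def weighted_session_embedding(item_vecs, side_scores):
    """Adaptive-weighting sketch: each item in the session contributes to the
    session representation in proportion to a softmax-normalised
    side-information score (the paper plugs this into GNN item vectors)."""
    exps = [math.exp(s) for s in side_scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(item_vecs[0])
    return [sum(w * v[d] for w, v in zip(weights, item_vecs)) for d in range(dim)]

# two items with equal side scores contribute equally
item_vecs = [[1.0, 0.0], [0.0, 1.0]]
session_vec = weighted_session_embedding(item_vecs, [0.0, 0.0])
print(session_vec)  # → [0.5, 0.5]
```

Raising one item's side score shifts the session vector toward that item, which is how the mechanism could help cold-start items gain or lose influence dynamically.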

[AI-22] Rag and Roll: An End-to-End Evaluation of Indirect Prompt Manipulations in LLM-based Application Frameworks

链接: https://arxiv.org/abs/2408.05025
作者: Gianluca De Stefano,Giancarlo Pellegrino,Lea Schönherr
关键词-EN: Retrieval Augmented Generation, Augmented Generation, Retrieval Augmented, distribution knowledge, RAG
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Retrieval Augmented Generation (RAG) is a technique commonly used to equip models with out of distribution knowledge. This process involves collecting, indexing, retrieving, and providing information to an LLM for generating responses. Despite its growing popularity due to its flexibility and low cost, the security implications of RAG have not been extensively studied. The data for such systems are often collected from public sources, providing an attacker a gateway for indirect prompt injections to manipulate the responses of the model. In this paper, we investigate the security of RAG systems against end-to-end indirect prompt manipulations. First, we review existing RAG framework pipelines, deriving a prototypical architecture and identifying potentially critical configuration parameters. We then examine prior works searching for techniques that attackers can use to perform indirect prompt manipulations. Finally, we implemented Rag n Roll, a framework to determine the effectiveness of attacks against end-to-end RAG applications. Our results show that existing attacks are mostly optimized to boost the ranking of malicious documents during the retrieval phase. However, a higher rank does not immediately translate into a reliable attack. Most attacks, against various configurations, settle around a 40% success rate, which could rise to 60% when considering ambiguous answers as successful attacks (those that include the expected benign one as well). Additionally, when using unoptimized documents, attackers deploying two of them (or more) for a target query can achieve similar results as those using optimized ones. Finally, exploration of the configuration space of a RAG showed limited impact in thwarting the attacks, where the most successful combination severely undermines functionality.

[AI-23] Enhancing the Code Debugging Ability of LLMs via Communicative Agent Based Data Refinement

链接: https://arxiv.org/abs/2408.05006
作者: Weiqing Yang,Hanbin Wang,Zhenghao Liu,Xinze Li,Yukun Yan,Shuo Wang,Yu Gu,Minghe Yu,Zhiyuan Liu,Ge Yu
关键词-EN: Large Language Models, Large Language, remain largely unexplored, code debugging ability, code debugging
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Debugging is a vital aspect of software development, yet the debugging capabilities of Large Language Models (LLMs) remain largely unexplored. This paper first introduces DEBUGEVAL, a comprehensive benchmark designed to evaluate the debugging capabilities of LLMs. DEBUGEVAL collects data from existing high-quality datasets and designs four different tasks to evaluate the debugging effectiveness, including BUG Localization, BUG Identification, Code Review, and Code Repair. Additionally, to enhance the code debugging ability of LLMs, this paper proposes a CoMmunicative Agent BaSed DaTa REfinement FRamework (MASTER), which generates the refined code debugging data for supervised finetuning. Specifically, MASTER employs the Code Quizzer to generate refined data according to the defined tasks of DEBUGEVAL. Then the Code Learner acts as a critic and reserves the generated problems that it can not solve. Finally, the Code Teacher provides a detailed Chain-of-Thought based solution to deal with the generated problem. We collect the synthesized data and finetune the Code Learner to enhance the debugging ability and conduct the NeuDebugger model. Our experiments evaluate various LLMs and NeuDebugger in the zero-shot setting on DEBUGEVAL. Experimental results demonstrate that these 7B-scale LLMs have weaker debugging capabilities, even these code-oriented LLMs. On the contrary, these larger models (over 70B) show convincing debugging ability. Our further analyses illustrate that MASTER is an effective method to enhance the code debugging ability by synthesizing data for Supervised Fine-Tuning (SFT) LLMs.
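The MASTER refinement loop (Quizzer generates problems, Learner filters out the ones it can already solve, Teacher annotates the rest) can be sketched with stand-in callables in place of the three models:

```python
def refine_debug_data(problems, learner_solves, teacher_solution):
    """MASTER-style data refinement sketch: keep only generated problems the
    Code Learner fails on, then attach the Code Teacher's chain-of-thought
    solution. `learner_solves` / `teacher_solution` stand in for model calls."""
    kept = [p for p in problems if not learner_solves(p)]
    return [(p, teacher_solution(p)) for p in kept]

# toy stand-ins: the learner only solves "easy" problems
problems = ["easy: off-by-one", "hard: race condition", "hard: heap corruption"]
data = refine_debug_data(
    problems,
    learner_solves=lambda p: p.startswith("easy"),
    teacher_solution=lambda p: f"step-by-step fix for {p!r}",
)
print([p for p, _ in data])  # → ['hard: race condition', 'hard: heap corruption']
```

The resulting (problem, solution) pairs are exactly the kind of hard-example SFT data the paper uses to train NeuDebugger.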

[AI-24] ProFuser: Progressive Fusion of Large Language Models

链接: https://arxiv.org/abs/2408.04998
作者: Tianyuan Shi,Fanqi Wan,Canbin Huang,Xiaojun Quan,Chenliang Li,Ming Yan,Ji Zhang
关键词-EN: properly select advantageous, large language models, select advantageous model, offers a pathway, fusing the capacities
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:While fusing the capacities and advantages of various large language models (LLMs) offers a pathway to construct more powerful and versatile models, a fundamental challenge is to properly select an advantageous model during training. Existing fusion methods primarily focus on the training mode that uses cross entropy on ground truth in a teacher-forcing setup to measure a model’s advantage, which may provide limited insight into model advantage. In this paper, we introduce a novel approach that enhances the fusion process by incorporating both the training and inference modes. Our method evaluates model advantage not only through cross entropy during training but also by considering inference outputs, providing a more comprehensive assessment. To combine the two modes effectively, we introduce ProFuser to progressively transition from inference mode to training mode. To validate ProFuser’s effectiveness, we fused three models, including vicuna-7b-v1.5, Llama-2-7b-chat, and mpt-7b-8k-chat, and demonstrated the improved performance in knowledge, reasoning, and safety compared to baseline methods.
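The progressive transition from inference mode to training mode can be sketched as a schedule that shifts weight between the two advantage signals. The linear form below is an assumption for illustration; the paper does not specify this exact schedule here:

```python
def profuser_weight(step, total_steps):
    """Weight on the inference-mode advantage, decaying linearly from 1 to 0
    so the training-mode (cross-entropy) signal gradually takes over."""
    return max(0.0, 1.0 - step / total_steps)

def fused_advantage(ce_advantage, inf_advantage, step, total_steps):
    # blend the two advantage estimates according to the current schedule
    w = profuser_weight(step, total_steps)
    return w * inf_advantage + (1.0 - w) * ce_advantage

print(fused_advantage(0.2, 0.8, step=0, total_steps=10))   # → 0.8
print(fused_advantage(0.2, 0.8, step=10, total_steps=10))  # → 0.2
```

Early in training the inference-output signal dominates; by the end the cross-entropy advantage does.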

[AI-25] On the use of neurosymbolic AI for defending against cyber attacks

链接: https://arxiv.org/abs/2408.04996
作者: Gudmund Grov,Jonas Halvorsen,Magnus Wiik Eckhoff,Bjørn Jervell Hansen,Martin Eian,Vasileios Mavroeidis
关键词-EN: cyber attacks, generally accepted, ability to detect, detect and respond, attacks
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注: Accepted to 18th International Conference on Neural-Symbolic Learning and Reasoning

点击查看摘要

Abstract:It is generally accepted that all cyber attacks cannot be prevented, creating a need for the ability to detect and respond to cyber attacks. Both connectionist and symbolic AI are currently being used to support such detection and response. In this paper, we make the case for combining them using neurosymbolic AI. We identify a set of challenges when using AI today and propose a set of neurosymbolic use cases we believe are both interesting research directions for the neurosymbolic AI community and can have an impact on the cyber security field. We demonstrate feasibility through two proof-of-concept experiments.

[AI-26] LLaVA-VSD: Large Language-and-Vision Assistant for Visual Spatial Description

链接: https://arxiv.org/abs/2408.04957
作者: Yizhang Jin,Jian Li,Jiangning Zhang,Jianlong Hu,Zhenye Gan,Xin Tan,Yong Liu,Yabiao Wang,Chengjie Wang,Lizhuang Ma
关键词-EN: Visual Spatial Description, Visual Spatial, visual spatial relationship, Traditional visual spatial, Spatial Description
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Visual Spatial Description (VSD) aims to generate texts that describe the spatial relationships between objects within images. Traditional visual spatial relationship classification (VSRC) methods typically output the spatial relationship between two objects in an image, often neglecting world knowledge and lacking general language capabilities. In this paper, we propose a Large Language-and-Vision Assistant for Visual Spatial Description, named LLaVA-VSD, which is designed for the classification, description, and open-ended description of visual spatial relationships. Specifically, the model first constructs a VSD instruction-following dataset using given figure-caption pairs for the three tasks. It then employs LoRA to fine-tune a Large Language and Vision Assistant for VSD, which has 13 billion parameters and supports high-resolution images. Finally, a large language model (Qwen-2) is used to refine the generated sentences, enhancing their diversity and accuracy. LLaVA-VSD demonstrates excellent multimodal conversational capabilities and can follow open-ended instructions to assist with inquiries about object relationships in images.

[AI-27] UAV-Enhanced Combination to Application: Comprehensive Analysis and Benchmarking of a Human Detection Dataset for Disaster Scenarios ICPR2024

链接: https://arxiv.org/abs/2408.04922
作者: Ragib Amin Nihal,Benjamin Yen,Katsutoshi Itoyama,Kazuhiro Nakadai
关键词-EN: http URL address, Unmanned aerial vehicles, Combination to Application, training machine learning, machine learning models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: This Paper is accepted for 27th International Conference on Pattern Recognition (ICPR 2024)

点击查看摘要

Abstract:Unmanned aerial vehicles (UAVs) have revolutionized search and rescue (SAR) operations, but the lack of specialized human detection datasets for training machine learning models poses a significant challenge. To address this gap, this paper introduces the Combination to Application (C2A) dataset, synthesized by overlaying human poses onto UAV-captured disaster scenes. Through extensive experimentation with state-of-the-art detection models, we demonstrate that models fine-tuned on the C2A dataset exhibit substantial performance improvements compared to those pre-trained on generic aerial datasets. Furthermore, we highlight the importance of combining the C2A dataset with general human datasets to achieve optimal performance and generalization across various scenarios. This points out the crucial need for a tailored dataset to enhance the effectiveness of SAR operations. Our contributions also include developing a dataset creation pipeline and integrating diverse human pose and disaster scene information to assess the severity of disaster scenarios. Our findings advocate for future developments to ensure that SAR operations benefit from the most realistic and effective AI-assisted interventions possible.
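The synthesis step (overlaying human poses onto disaster scenes) reduces to a masked paste. A minimal sketch on 2-D grids of grey values standing in for images, where 0 in the patch means transparent (the real pipeline works on full images with alpha masks):

```python
def overlay(scene, patch, top, left):
    """C2A-style synthesis sketch: paste a small 'human pose' patch onto a
    UAV disaster scene at (top, left), skipping transparent (0) pixels."""
    out = [row[:] for row in scene]          # copy so the scene is untouched
    for i, prow in enumerate(patch):
        for j, v in enumerate(prow):
            if v != 0:
                out[top + i][left + j] = v
    return out

scene = [[1] * 4 for _ in range(3)]          # uniform background
patch = [[9, 0], [9, 9]]                     # L-shaped "pose"
result = overlay(scene, patch, top=1, left=2)
print(result)  # → [[1, 1, 1, 1], [1, 1, 9, 1], [1, 1, 9, 9]]
```

The paste position and patch double as a free bounding-box label, which is what makes synthetic SAR datasets cheap to annotate.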

[AI-28] Avoid Wasted Annotation Costs in Open-set Active Learning with Pre-trained Vision-Language Model

链接: https://arxiv.org/abs/2408.04917
作者: Jaehyuk Heo,Pilsung Kang
关键词-EN: Active learning, selectively collecting highly, minimizing annotation costs, aims to enhance, selectively collecting
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Active learning (AL) aims to enhance model performance by selectively collecting highly informative data, thereby minimizing annotation costs. However, in practical scenarios, unlabeled data may contain out-of-distribution (OOD) samples, leading to wasted annotation costs if data is incorrectly selected. Recent research has explored methods to apply AL to open-set data, but these methods often require or incur unavoidable cost losses to minimize them. To address these challenges, we propose a novel selection strategy, CLIPN for AL (CLIPNAL), which minimizes cost losses without requiring OOD samples. CLIPNAL sequentially evaluates the purity and informativeness of data. First, it utilizes a pre-trained vision-language model to detect and exclude OOD data by leveraging linguistic and visual information of in-distribution (ID) data without additional training. Second, it selects highly informative data from the remaining ID data, and then the selected samples are annotated by human experts. Experimental results on datasets with various open-set conditions demonstrate that CLIPNAL achieves the lowest cost loss and highest performance across all scenarios. Code is available at this https URL.
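CLIPNAL's two sequential criteria (purity, then informativeness) can be sketched as an OOD filter followed by entropy-based query selection. OOD scores, probabilities, and the 0.5 threshold are illustrative stand-ins for the CLIPN-based detector:

```python
import math

def select_queries(scores_ood, probs, budget):
    """CLIPNAL-style two-step sketch: (1) purity -- drop samples the OOD
    detector flags; (2) informativeness -- pick the highest-entropy
    remaining in-distribution samples for human annotation."""
    def entropy(p):
        return -sum(q * math.log(q) for q in p if q > 0)

    in_dist = [i for i, s in enumerate(scores_ood) if s < 0.5]
    ranked = sorted(in_dist, key=lambda i: entropy(probs[i]), reverse=True)
    return ranked[:budget]

scores_ood = [0.9, 0.1, 0.2, 0.8, 0.3]  # samples 0 and 3 look OOD
probs = [[0.5, 0.5], [0.5, 0.5], [0.99, 0.01], [0.5, 0.5], [0.6, 0.4]]
print(select_queries(scores_ood, probs, budget=2))  # → [1, 4]
```

Sample 2 is in-distribution but nearly certain (low entropy), so the budget is spent on the two uncertain ID samples instead of wasting annotations on OOD data.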

[AI-29] Knowledge Base Embeddings: Semantics and Theoretical Properties KR2024

链接: https://arxiv.org/abs/2408.04913
作者: Camille Bourgaux,Ricardo Guimarães,Raoul Koudijs,Victor Lacerda,Ana Ozaki
关键词-EN: vector spaces, relevant conceptual knowledge, recently evolved, map facts, constrain the models
类目: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
*备注: This is an extended version of a paper appearing at the 21st International Conference on Principles of Knowledge Representation and Reasoning (KR 2024). 17 pages

点击查看摘要

Abstract:Research on knowledge graph embeddings has recently evolved into knowledge base embeddings, where the goal is not only to map facts into vector spaces but also constrain the models so that they take into account the relevant conceptual knowledge available. This paper examines recent methods that have been proposed to embed knowledge bases in description logic into vector spaces through the lens of their geometric-based semantics. We identify several relevant theoretical properties, which we draw from the literature and sometimes generalize or unify. We then investigate how concrete embedding methods fit in this theoretical framework.

[AI-30] Unleashing Artificial Cognition: Integrating Multiple AI Systems

链接: https://arxiv.org/abs/2408.04910
作者: Muntasir Adnan,Buddhi Gamage,Zhiwei Xu,Damith Herath,Carlos Noschang Kuhn
关键词-EN: query analysis techniques, artificial intelligence, present an innovative, innovative fusion, query analysis
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In this study, we present an innovative fusion of language models and query analysis techniques to unlock cognition in artificial intelligence. Our system seamlessly integrates a Chess engine with a language model, enabling it to predict moves and provide strategic explanations. Leveraging a vector database through retrievable answer generation, our OpenSI AI system elucidates its decision-making process, bridging the gap between raw computation and human-like understanding. Our choice of Chess as the demonstration environment underscores the versatility of our approach. Beyond Chess, our system holds promise for diverse applications, from medical diagnostics to financial forecasting.

[AI-31] Towards a Generative Approach for Emotion Detection and Reasoning

链接: https://arxiv.org/abs/2408.04906
作者: Ankita Bhaumik,Tomek Strzalkowski
关键词-EN: Large language models, demonstrated impressive performance, Large language, prompting techniques, emotional reasoning
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated impressive performance in mathematical and commonsense reasoning tasks using chain-of-thought (CoT) prompting techniques. But can they perform emotional reasoning by concatenating ‘Let’s think step-by-step’ to the input prompt? In this paper we investigate this question along with introducing a novel approach to zero-shot emotion detection and emotional reasoning using LLMs. Existing state-of-the-art zero-shot approaches rely on textual entailment models to choose the most appropriate emotion label for an input text. We argue that this strongly restricts the model to a fixed set of labels which may not be suitable or sufficient for many applications where emotion analysis is required. Instead, we propose framing the problem of emotion analysis as a generative question-answering (QA) task. Our approach uses a two-step methodology of generating relevant context or background knowledge to answer the emotion detection question step-by-step. Our paper is the first work on using a generative approach to jointly address the tasks of emotion detection and emotional reasoning for texts. We evaluate our approach on two popular emotion detection datasets and also release the fine-grained emotion labels and explanations for further training and fine-tuning of emotional reasoning systems.
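The two-step generative methodology (first elicit background knowledge, then answer the emotion question given that context) can be sketched with a stand-in callable in place of a real LLM; the prompt wordings are assumptions:

```python
def emotion_qa(text, llm):
    """Two-step generative sketch matching the paper's framing: step 1 asks
    for relevant background context, step 2 answers the emotion question
    conditioned on it. `llm` is a stand-in callable, not a real model."""
    context = llm(f"List background knowledge relevant to the emotions in: {text}")
    answer = llm(
        f"Context: {context}\n"
        f"Question: what emotion does the writer of '{text}' feel, and why?"
    )
    return answer

# toy rule-based stand-in for an LLM
def toy_llm(prompt):
    if prompt.startswith("List background"):
        return "losing a pet is a grief-inducing event"
    return "sadness, because the text describes losing a pet"

print(emotion_qa("My dog ran away yesterday.", toy_llm))
# → sadness, because the text describes losing a pet
```

Because the answer is free text rather than a label from a fixed set, the same pipeline yields both a detected emotion and its reasoning.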

[AI-32] GlitchProber: Advancing Effective Detection and Mitigation of Glitch Tokens in Large Language Models

链接: https://arxiv.org/abs/2408.04905
作者: Zhibo Zhang,Wuxia Bai,Yuxi Li,Mark Huasong Meng,Kailong Wang,Ling Shi,Li Li,Jun Wang,Haoyu Wang
关键词-EN: achieved unprecedented success, Large language models, natural language processing, glitch tokens, Large language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) have achieved unprecedented success in the field of natural language processing. However, the black-box nature of their internal mechanisms has brought many concerns about their trustworthiness and interpretability. Recent research has discovered a class of abnormal tokens in the model’s vocabulary space and named them “glitch tokens”. Those tokens, once included in the input, may induce the model to produce incorrect, irrelevant, or even harmful results, drastically undermining the reliability and practicality of LLMs. In this work, we aim to enhance the understanding of glitch tokens and propose techniques for their detection and mitigation. We first reveal the characteristic features induced by glitch tokens on LLMs, which are evidenced by significant deviations in the distributions of attention patterns and dynamic information from intermediate model layers. Based on the insights, we develop GlitchProber, a tool for efficient glitch token detection and mitigation. GlitchProber utilizes small-scale sampling, principal component analysis for accelerated feature extraction, and a simple classifier for efficient vocabulary screening. Taking one step further, GlitchProber rectifies abnormal model intermediate layer values to mitigate the destructive effects of glitch tokens. Evaluated on five mainstream open-source LLMs, GlitchProber demonstrates higher efficiency, precision, and recall compared to existing approaches, with an average F1 score of 0.86 and an average repair rate of 50.06%. GlitchProber unveils a novel path to address the challenges posed by glitch tokens and inspires future research toward more robust and interpretable LLMs.
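The vocabulary-screening step rests on glitch tokens producing outlying intermediate-layer features. A heavily simplified sketch of that idea, flagging tokens whose (here, scalar) feature deviates strongly from the population; the real tool uses PCA-compressed features and a trained classifier, not this mean/std rule:

```python
def screen_tokens(features, k=2.0):
    """Vocabulary-screening sketch in the spirit of GlitchProber: flag token
    indices whose intermediate-layer feature lies more than k standard
    deviations from the population mean."""
    n = len(features)
    mean = sum(features) / n
    std = (sum((f - mean) ** 2 for f in features) / n) ** 0.5
    return [i for i, f in enumerate(features) if abs(f - mean) > k * std]

# nine ordinary tokens and one outlier (hypothetical feature values)
feats = [1.0, 1.1, 0.9, 1.0, 1.2, 0.8, 1.0, 1.1, 0.9, 9.0]
print(screen_tokens(feats, k=2.0))  # → [9]
```

Only the outlying token survives the screen, which is what keeps the downstream classifier's workload small.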

[AI-33] Axiomatic Characterisations of Sample-based Explainers

链接: https://arxiv.org/abs/2408.04903
作者: Leila Amgouda,Martin C. Cooper,Salim Debbaoui
关键词-EN: Explaining decisions, computationally challenging, decisions of black-box, black-box classifiers, important and computationally
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Explaining decisions of black-box classifiers is both important and computationally challenging. In this paper, we scrutinize explainers that generate feature-based explanations from samples or datasets. We start by presenting a set of desirable properties that explainers would ideally satisfy, delve into their relationships, and highlight incompatibilities of some of them. We identify the entire family of explainers that satisfy two key properties which are compatible with all the others. Its instances provide sufficient reasons, called weak abductive explanations. We then unravel its various subfamilies that satisfy subsets of compatible properties. Indeed, we fully characterize all the explainers that satisfy any subset of compatible properties. In particular, we introduce the first (broad family of) explainers that guarantee the existence of explanations and their global consistency. We discuss some of its instances including the irrefutable explainer and the surrogate explainer whose explanations can be found in polynomial time.

[AI-34] Better Not to Propagate: Understanding Edge Uncertainty and Over-smoothing in Signed Graph Neural Networks

链接: https://arxiv.org/abs/2408.04895
作者: Yoonhyuk Choi,Jiho Choi,Taewook Ko,Chong-Kwon Kim
关键词-EN: Traditional Graph Neural, Graph Neural Networks, real-world heterophily scenarios, Neural Networks, Traditional Graph
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Traditional Graph Neural Networks (GNNs) rely on network homophily, which can lead to performance degradation due to over-smoothing in many real-world heterophily scenarios. Recent studies analyze the smoothing effect (separability) after message-passing (MP), depending on the expectation of node features. Regarding separability gain, they provided theoretical backgrounds on over-smoothing caused by various propagation schemes, including positive, signed, and blocked MPs. More recently, by extending these theorems, some works have suggested improvements in signed propagation under multiple classes. However, prior works assume that the error ratio of all propagation schemes is fixed, failing to investigate this phenomenon correctly. To solve this problem, we propose a novel method for estimating homophily and edge error ratio, integrated with dynamic selection between blocked and signed propagation during training. Our theoretical analysis, supported by extensive experiments, demonstrates that blocking MP can be more effective than signed propagation under high edge error ratios, improving the performance in both homophilic and heterophilic graphs.
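The paper's dynamic selection between blocked and signed propagation can be sketched as a decision rule over the estimated homophily and edge-error ratio. The threshold value below is an assumption for illustration:

```python
def choose_propagation(est_homophily, est_error_ratio, error_threshold=0.3):
    """Sketch of the dynamic selection rule: under a high estimated edge-error
    ratio, blocking message passing beats signed propagation; otherwise sign
    heterophilic edges and keep positive propagation on homophilic graphs."""
    if est_error_ratio > error_threshold:
        return "blocked"
    return "signed" if est_homophily < 0.5 else "positive"

print(choose_propagation(0.2, 0.5))  # → blocked
print(choose_propagation(0.2, 0.1))  # → signed
print(choose_propagation(0.8, 0.1))  # → positive
```

The point of the paper is precisely that this choice must be made per graph (and during training), rather than assuming a fixed error ratio for all propagation schemes.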

[AI-35] ConfusedPilot: Compromising Enterprise Information Integrity and Confidentiality with Copilot for Microsoft 365

链接: https://arxiv.org/abs/2408.04870
作者: Ayush RoyChowdhury,Mulong Luo,Prateek Sahu,Sarbartha Banerjee,Mohit Tiwari
关键词-EN: large language model, Retrieval augmented generation, augmented generation, language model, retrieves useful information
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Retrieval augmented generation (RAG) is a process where a large language model (LLM) retrieves useful information from a database and then generates the responses. It is becoming popular in enterprise settings for daily business operations. For example, Copilot for Microsoft 365 has accumulated millions of businesses. However, the security implications of adopting such RAG-based systems are unclear. In this paper, we introduce ConfusedPilot, a class of security vulnerabilities of RAG systems that confuse Copilot and cause integrity and confidentiality violations in its responses. First, we investigate a vulnerability that embeds malicious text in the modified prompt in RAG, corrupting the responses generated by the LLM. Second, we demonstrate a vulnerability that leaks secret data, which leverages the caching mechanism during retrieval. Third, we investigate how both vulnerabilities can be exploited to propagate misinformation within the enterprise and ultimately impact its operations, such as sales and manufacturing. We also discuss the root cause of these attacks by investigating the architecture of a RAG-based system. This study highlights the security vulnerabilities in today’s RAG-based systems and proposes design guidelines to secure future RAG-based systems.

[AI-36] Ensemble BERT: A student social network text sentiment classification model based on ensemble learning and BERT architecture

链接: https://arxiv.org/abs/2408.04849
作者: Kai Jiang,Honghao Yang,Yuexian Wang,Qianru Chen,Yiming Luo
关键词-EN: mental health assessment, middle school students, middle school, field of education, ensemble learning network
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The mental health assessment of middle school students has always been one of the focuses in the field of education. This paper introduces a new ensemble learning network based on BERT, employing the concept of enhancing model performance by integrating multiple classifiers. We trained a range of BERT-based learners, which are combined using the majority voting method. We collect social network text data of middle school students through China’s Weibo and apply the method to the task of classifying emotional tendencies in middle school students’ social network texts. Experimental results suggest that the ensemble learning network performs better than the base model, and that an ensemble of three single-layer BERT models performs roughly the same as a three-layer BERT model but requires 11.58% more training time. Therefore, in terms of balancing prediction performance and efficiency, the deeper BERT network should be preferred for training. However, for interpretability, network ensembles can provide acceptable solutions.
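The majority-voting aggregation over base learners is straightforward to sketch; the label strings below are toy stand-ins for the BERT learners' sentiment outputs:

```python
from collections import Counter

def majority_vote(predictions):
    """Ensemble-BERT-style aggregation sketch: each base learner votes one
    label per example; the most common label wins (Counter breaks ties by
    first-encountered order)."""
    n_examples = len(predictions[0])
    fused = []
    for i in range(n_examples):
        votes = Counter(model[i] for model in predictions)
        fused.append(votes.most_common(1)[0][0])
    return fused

# three learners, four texts, sentiment labels 'pos' / 'neg'
preds = [
    ["pos", "neg", "pos", "neg"],
    ["pos", "pos", "pos", "neg"],
    ["neg", "neg", "pos", "pos"],
]
print(majority_vote(preds))  # → ['pos', 'neg', 'pos', 'neg']
```

With an odd number of binary classifiers there are no ties, which is one practical reason ensembles of three learners are common.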

[AI-37] UGrid: An Efficient-And-Rigorous Neural Multigrid Solver for Linear PDEs

链接: https://arxiv.org/abs/2408.04846
作者: Xi Han,Fei Hou,Hong Qin
关键词-EN: Partial Differential Equations, Differential Equations, Partial Differential, science and engineering, fundamental significance
类目: Numerical Analysis (math.NA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Mathematical Software (cs.MS)
*备注: Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024

点击查看摘要

Abstract:Numerical solvers of Partial Differential Equations (PDEs) are of fundamental significance to science and engineering. To date, the historical reliance on legacy techniques has circumscribed possible integration of big data knowledge and exhibits sub-optimal efficiency for certain PDE formulations, while data-driven neural methods typically lack mathematical guarantee of convergence and correctness. This paper articulates a mathematically rigorous neural solver for linear PDEs. The proposed UGrid solver, built upon the principled integration of U-Net and MultiGrid, manifests a mathematically rigorous proof of both convergence and correctness, and showcases high numerical accuracy, as well as strong generalization power to various input geometry/values and multiple PDE formulations. In addition, we devise a new residual loss metric, which enables unsupervised training and affords more stability and a larger solution space over the legacy losses.
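摘要提到的"残差损失"之所以支持无监督训练,是因为把候选解代入离散化的 PDE 算子后,残差本身就是可优化的目标,不需要真解标签。下面以一维 Poisson 方程为例给出示意(离散格式与数据均为假设,并非 UGrid 论文的实现):

```python
import numpy as np

def poisson_residual_loss(u, f, h):
    """一维 Poisson 问题 -u''(x) = f(x)(零 Dirichlet 边界)的
    无监督残差损失 mean((A u - f)^2),A 为中心差分算子。"""
    # 内点二阶差分: (-u[i-1] + 2u[i] - u[i+1]) / h^2
    Au = (-u[:-2] + 2 * u[1:-1] - u[2:]) / h**2
    r = Au - f[1:-1]
    return float(np.mean(r**2))

n = 65
x = np.linspace(0.0, 1.0, n)
h = x[1] - x[0]
f = np.pi**2 * np.sin(np.pi * x)   # 对应精确解 u = sin(pi x) 的右端项
u_exact = np.sin(np.pi * x)
print(poisson_residual_loss(u_exact, f, h))          # 接近 0(仅剩离散误差)
print(poisson_residual_loss(np.zeros(n), f, h) > 1)  # 错误候选解残差很大
```

训练时将 `u` 换成神经网络输出并对该损失求梯度即可,无需真解监督。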

[AI-38] Counterfactual Explanations with Probabilistic Guarantees on their Robustness to Model Change

链接: https://arxiv.org/abs/2408.04842
作者: Ignacy Stępka,Mateusz Lango,Jerzy Stefanowski
关键词-EN: achieve desired outputs, machine learning models, guide users, desired outputs, machine learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Counterfactual explanations (CFEs) guide users on how to adjust inputs to machine learning models to achieve desired outputs. While existing research primarily addresses static scenarios, real-world applications often involve data or model changes, potentially invalidating previously generated CFEs and rendering user-induced input changes ineffective. Current methods addressing this issue often support only specific models or change types, require extensive hyperparameter tuning, or fail to provide probabilistic guarantees on CFE robustness to model changes. This paper proposes a novel approach for generating CFEs that provides probabilistic guarantees for any model and change type, while offering interpretable and easy-to-select hyperparameters. We establish a theoretical framework for probabilistically defining robustness to model change and demonstrate how our BetaRCE method directly stems from it. BetaRCE is a post-hoc method applied alongside a chosen base CFE generation method to enhance the quality of the explanation beyond robustness. It facilitates a transition from the base explanation to a more robust one with user-adjusted probability bounds. Through experimental comparisons with baselines, we show that BetaRCE yields robust, most plausible, and closest to baseline counterfactual explanations.
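摘要所述"对模型变化的概率保证",核心可以理解为:对"反事实解释在重训练/变化后的模型上仍然有效"的概率做贝叶斯估计,并给出置信下界。下面是一个与 BetaRCE 本身无直接关系的 Beta 后验示意(均匀先验,用蒙特卡洛近似分位数,数据为假设):

```python
import random

def robustness_lower_bound(n_valid, n_models, confidence=0.95,
                           n_draws=20000, seed=0):
    """估计 P(反事实解释在模型变化后仍有效) 的置信下界。
    均匀先验下后验为 Beta(1 + n_valid, 1 + n_models - n_valid),
    其 (1 - confidence) 分位数用蒙特卡洛抽样近似。"""
    rng = random.Random(seed)
    draws = sorted(rng.betavariate(1 + n_valid, 1 + n_models - n_valid)
                   for _ in range(n_draws))
    return draws[int((1 - confidence) * n_draws)]

# 某个反事实解释在 50 个重训练模型中的 48 个上仍有效:
lb = robustness_lower_bound(48, 50)
print(round(lb, 3))  # 约 0.88 量级:以 95% 置信度,有效概率不低于 lb
```

BetaRCE 的准确定义与界的推导以原论文为准,这里只演示"采样模型 + Beta 后验"这一通用思路。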

[AI-39] Kolmogorov-Arnold Network for Online Reinforcement Learning

链接: https://arxiv.org/abs/2408.04841
作者: Victor Augusto Kich,Jair Augusto Bottega,Raul Steinmetz,Ricardo Bedin Grando,Ayano Yorozu,Akihisa Ohya
关键词-EN: reduced memory usage, providing universal function, Proximal Policy Optimization, Kolmogorov-Arnold Networks, Multi-Layer Perceptrons
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Paper accepted at 24th International Conference on Control, Automation and Systems

点击查看摘要

Abstract:Kolmogorov-Arnold Networks (KANs) have shown potential as an alternative to Multi-Layer Perceptrons (MLPs) in neural networks, providing universal function approximation with fewer parameters and reduced memory usage. In this paper, we explore the use of KANs as function approximators within the Proximal Policy Optimization (PPO) algorithm. We evaluate this approach by comparing its performance to the original MLP-based PPO using the DeepMind Control Proprio Robotics benchmark. Our results indicate that the KAN-based reinforcement learning algorithm can achieve comparable performance to its MLP-based counterpart, often with fewer parameters. These findings suggest that KANs may offer a more efficient option for reinforcement learning models.
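KAN 与 MLP 的结构差异在于:可学习的一元函数放在"边"上,输出是各边函数值之和(原文用 B 样条参数化)。下面用高斯基函数代替样条,给出一个未经训练的玩具 KAN 层(仅演示结构,非论文实现):

```python
import numpy as np

class SimpleKANLayer:
    """玩具 KAN 层:每条输入-输出边携带一个可学习一元函数,
    这里用高斯"凸包"加权和近似代替 B 样条。"""
    def __init__(self, in_dim, out_dim, n_basis=8, seed=0):
        rng = np.random.default_rng(seed)
        self.centers = np.linspace(-1.0, 1.0, n_basis)               # 共享基函数网格
        self.coef = rng.normal(0, 0.1, (out_dim, in_dim, n_basis))   # 每条边的权重

    def __call__(self, x):
        # basis[i, b] = exp(-((x_i - c_b)/0.3)^2),phi_ij(x_i) = coef[j,i,:] @ basis[i,:]
        basis = np.exp(-((x[:, None] - self.centers[None, :]) / 0.3) ** 2)
        return np.einsum('jib,ib->j', self.coef, basis)  # 输出 j = sum_i phi_ij(x_i)

layer = SimpleKANLayer(in_dim=4, out_dim=2)
y = layer(np.array([0.1, -0.5, 0.3, 0.9]))
print(y.shape)  # (2,)
```

把这种层作为 PPO 的策略/价值网络替换 MLP,即摘要所述实验的基本思路;训练、网格自适应等细节以论文为准。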

[AI-40] mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models

链接: https://arxiv.org/abs/2408.04840
作者: Jiabo Ye,Haiyang Xu,Haowei Liu,Anwen Hu,Ming Yan,Qi Qian,Ji Zhang,Fei Huang,Jingren Zhou
关键词-EN: demonstrated remarkable capabilities, Multi-modal Large Language, Large Language Models, Large Language, Multi-modal Large
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multi-modal Large Language Models (MLLMs) have demonstrated remarkable capabilities in executing instructions for a variety of single-image tasks. Despite this progress, significant challenges remain in modeling long image sequences. In this work, we introduce the versatile multi-modal large language model, mPLUG-Owl3, which enhances the capability for long image-sequence understanding in scenarios that incorporate retrieved image-text knowledge, interleaved image-text, and lengthy videos. Specifically, we propose novel hyper attention blocks to efficiently integrate vision and language into a common language-guided semantic space, thereby facilitating the processing of extended multi-image scenarios. Extensive experimental results suggest that mPLUG-Owl3 achieves state-of-the-art performance among models with a similar size on single-image, multi-image, and video benchmarks. Moreover, we propose a challenging long visual sequence evaluation named Distractor Resistance to assess the ability of models to maintain focus amidst distractions. Finally, with the proposed architecture, mPLUG-Owl3 demonstrates outstanding performance on ultra-long visual sequence inputs. We hope that mPLUG-Owl3 can contribute to the development of more efficient and powerful multimodal large language models.

[AI-41] Self-augmented Gaussian Splatting with Structure-aware Masks for Sparse-view 3D Reconstruction

链接: https://arxiv.org/abs/2408.04831
作者: Lingbei Meng,Bi’an Du,Wei Hu
关键词-EN: build complete three-dimensional, complete three-dimensional models, computer vision, aiming to build, viewing perspectives
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Sparse-view 3D reconstruction stands as a formidable challenge in computer vision, aiming to build complete three-dimensional models from a limited array of viewing perspectives. This task confronts several difficulties: 1) the limited number of input images that lack consistent information; 2) dependence on the quality of input images; and 3) the substantial size of model parameters. To address these challenges, we propose a self-augmented coarse-to-fine Gaussian splatting paradigm, enhanced with a structure-aware mask, for sparse-view 3D reconstruction. In particular, our method initially employs a coarse Gaussian model to obtain a basic 3D representation from sparse-view inputs. Subsequently, we develop a fine Gaussian network to enhance consistent and detailed representation of the output with both 3D geometry augmentation and perceptual view augmentation. During training, we design a structure-aware masking strategy to further improve the model’s robustness against sparse inputs and noise. Experimental results on the MipNeRF360 and OmniObject3D datasets demonstrate that the proposed method achieves state-of-the-art performances for sparse input views in both perceptual quality and efficiency.

[AI-42] Performance Prediction of Hub-Based Swarms

链接: https://arxiv.org/abs/2408.04822
作者: Puneet Jain,Chaitanya Dwivedi,Vigynesh Bhatt,Nick Smith,Michael A Goodrich
关键词-EN: common nest site, nest site called, hub-based colony consists, consists of multiple, share a common
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:A hub-based colony consists of multiple agents who share a common nest site called the hub. Agents perform tasks away from the hub like foraging for food or gathering information about future nest sites. Modeling hub-based colonies is challenging because the size of the collective state space grows rapidly as the number of agents grows. This paper presents a graph-based representation of the colony that can be combined with graph-based encoders to create low-dimensional representations of collective state that can scale to many agents for a best-of-N colony problem. We demonstrate how the information in the low-dimensional embedding can be used with two experiments. First, we show how the information in the tensor can be used to cluster collective states by the probability of choosing the best site for a very small problem. Second, we show how structured collective trajectories emerge when a graph encoder is used to learn the low-dimensional embedding, and these trajectories have information that can be used to predict swarm performance.

[AI-43] Natural Language Outlines for Code: Literate Programming in the LLM Era

链接: https://arxiv.org/abs/2408.04820
作者: Kensen Shi,Deniz Altınbüken,Saswat Anand,Mihai Christodorescu,Katja Grünwedel,Alexa Koenings,Sai Naidu,Anurag Pathak,Marc Rasi,Fredde Ribeiro,Brandon Ruffin,Siddhant Sanyam,Maxim Tabachnyk,Sara Toth,Roy Tu,Tobias Welp,Pengcheng Yin,Manzil Zaheer,Satish Chandra,Charles Sutton
关键词-EN: software development process, natural language outlines, development process, natural language, modality and interaction
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We propose using natural language outlines as a novel modality and interaction surface for providing AI assistance to developers throughout the software development process. An NL outline for a code function comprises multiple statements written in concise prose, which partition the code and summarize its main ideas in the style of literate programming. Crucially, we find that modern LLMs can generate accurate and high-quality NL outlines in practice. Moreover, NL outlines enable a bidirectional sync between code and NL, allowing changes in one to be automatically reflected in the other. We discuss many use cases for NL outlines: they can accelerate understanding and navigation of code and diffs, simplify code maintenance, augment code search, steer code generation, and more. We then propose and compare multiple LLM prompting techniques for generating outlines and ask professional developers to judge outline quality. Finally, we present two case studies applying NL outlines toward code review and the difficult task of malware detection.
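NL outline 的直观形态:用简洁的自然语言语句切分函数、概括各段意图,与文学编程的注释风格一致。下面是一个虚构的示例函数,注释即充当 outline 语句:

```python
def top_k_frequent_words(text, k):
    # Normalize the text and split it into words.
    words = text.lower().split()
    # Count how often each word occurs.
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    # Sort words by descending frequency and keep the k most common.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, _ in ranked[:k]]

print(top_k_frequent_words("to be or not to be", 2))  # ['be', 'to']
```

论文的关键主张是:这些 outline 语句可由 LLM 自动生成,且与代码保持双向同步(改一侧自动更新另一侧)。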

[AI-44] Performance Metric for Multiple Anomaly Score Distributions with Discrete Severity Levels

链接: https://arxiv.org/abs/2408.04817
作者: Wonjun Yi,Yong-Hwa Park,Wonho Jung
关键词-EN: severity level differences, classifying severity levels, severity level, automated maintenance, rise of smart
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: accepted as a work-in-progress paper at the 2024 Annual Conference of the IEEE Industrial Electronics Society (IECON)

点击查看摘要

Abstract:The rise of smart factories has heightened the demand for automated maintenance, and normal-data-based anomaly detection has proved particularly effective in environments where anomaly data are scarce. This method, which does not require anomaly data during training, has prompted researchers to focus not only on detecting anomalies but also on classifying severity levels by using anomaly scores. However, the existing performance metrics, such as the area under the receiver operating characteristic curve (AUROC), do not effectively reflect the performance of models in classifying severity levels based on anomaly scores. To address this limitation, we propose the weighted sum of the area under the receiver operating characteristic curve (WS-AUROC), which combines AUROC with a penalty for severity level differences. We conducted various experiments using different penalty assignment methods: uniform penalty regardless of severity level differences, penalty based on severity level index differences, and penalty based on actual physical quantities that cause anomalies. The latter method was the most sensitive. Additionally, we propose an anomaly detector that achieves clear separation of distributions and outperforms the ablation models on the WS-AUROC and AUROC metrics.
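按摘要描述,WS-AUROC 可以理解为:对所有严重度等级对分别计算 AUROC,再按等级差(或物理量差)加权求和。下面给出一种示意实现(惩罚函数与数据均为假设,准确定义以论文为准):

```python
def auroc(neg_scores, pos_scores):
    """用 Mann-Whitney U 统计量计算 AUROC:P(pos > neg) + 0.5 * P(平局)。"""
    wins = ties = 0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1
            elif p == n:
                ties += 1
    return (wins + 0.5 * ties) / (len(pos_scores) * len(neg_scores))

def ws_auroc(scores_by_level, penalty):
    """所有严重度等级对 (i < j) 的 AUROC 按 penalty(i, j) 加权平均。"""
    levels = sorted(scores_by_level)
    pairs = [(i, j) for i in levels for j in levels if i < j]
    total_w = sum(penalty(i, j) for i, j in pairs)
    return sum(penalty(i, j) * auroc(scores_by_level[i], scores_by_level[j])
               for i, j in pairs) / total_w

# 各严重度等级的异常分数(0 = 正常, 2 = 最严重)——玩具数据
scores = {0: [0.1, 0.2, 0.15], 1: [0.4, 0.35, 0.5], 2: [0.8, 0.9, 0.7]}
print(ws_auroc(scores, penalty=lambda i, j: j - i))  # 1.0:分数完美区分各等级
```

摘要中"基于实际物理量的惩罚"即把 `penalty` 换成引发异常的物理量之差。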

[AI-45] A Collaborative PIM Computing Optimization Framework for Multi-Tenant DNN

链接: https://arxiv.org/abs/2408.04812
作者: Bojing Li,Duo Zhong,Xiang Chen,Chenchen Liu
关键词-EN: Modern Artificial Intelligence, Modern Artificial, Artificial Intelligence, deep neural networks, increasingly utilizing multi-tenant
类目: Emerging Technologies (cs.ET); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Modern Artificial Intelligence (AI) applications are increasingly utilizing multi-tenant deep neural networks (DNNs), which lead to a significant rise in computing complexity and the need for computing parallelism. ReRAM-based processing-in-memory (PIM) computing, with its high density and low power consumption characteristics, holds promising potential for supporting the deployment of multi-tenant DNNs. However, direct deployment of complex multi-tenant DNNs on existing ReRAM-based PIM designs poses challenges. Resource contention among different tenants can result in severe under-utilization of on-chip computing resources. Moreover, area-intensive operators and computation-intensive operators require excessively large on-chip areas and long processing times, leading to high overall latency during parallel computing. To address these challenges, we propose a novel ReRAM-based in-memory computing framework that enables efficient deployment of multi-tenant DNNs on ReRAM-based PIM designs. Our approach tackles the resource contention problems by iteratively partitioning the PIM hardware at the tenant level. In addition, we construct a fine-grained reconstructed processing pipeline at the operator level to handle area-intensive operators. Compared to the direct deployments on traditional ReRAM-based PIM designs, our proposed PIM computing framework achieves significant improvements in speed (ranging from 1.75x to 60.43x) and energy (up to 1.89x).

[AI-46] h4rm3l: A Dynamic Benchmark of Composable Jailbreak Attacks for LLM Safety Assessment

链接: https://arxiv.org/abs/2408.04811
作者: Moussa Koulako Bala Doumbouya,Ananjan Nandi,Gabriel Poesia,Davide Ghilardi,Anna Goldie,Federico Bianchi,Dan Jurafsky,Christopher D. Manning
关键词-EN: critical concern due, Large Language Models, resist generating harmful, Large Language, remains a critical
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The safety of Large Language Models (LLMs) remains a critical concern due to a lack of adequate benchmarks for systematically evaluating their ability to resist generating harmful content. Previous efforts towards automated red teaming involve static or templated sets of illicit requests and adversarial prompts which have limited utility given jailbreak attacks’ evolving and composable nature. We propose a novel dynamic benchmark of composable jailbreak attacks to move beyond static datasets and taxonomies of attacks and harms. Our approach consists of three components collectively called h4rm3l: (1) a domain-specific language that formally expresses jailbreak attacks as compositions of parameterized prompt transformation primitives, (2) bandit-based few-shot program synthesis algorithms that generate novel attacks optimized to penetrate the safety filters of a target black box LLM, and (3) open-source automated red-teaming software employing the previous two components. We use h4rm3l to generate a dataset of 2656 successful novel jailbreak attacks targeting 6 state-of-the-art (SOTA) open-source and proprietary LLMs. Several of our synthesized attacks are more effective than previously reported ones, with Attack Success Rates exceeding 90% on SOTA closed language models such as claude-3-haiku and GPT4-o. By generating datasets of jailbreak attacks in a unified formal representation, h4rm3l enables reproducible benchmarking and automated red-teaming, contributes to understanding LLM safety limitations, and supports the development of robust defenses in an increasingly LLM-integrated world. Warning: This paper and related research artifacts contain offensive and potentially disturbing prompts and model-generated content. 
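h4rm3l 将越狱攻击形式化为"参数化提示变换原语的组合"。下面用 Python 函数组合演示这一思想(并非 h4rm3l 的实际 DSL,原语与文本均为虚构的无害示例):

```python
import base64

# 每个"原语"把提示字符串映射为变换后的提示;攻击即原语的组合。
def base64_encode(prompt):
    return ("Decode this base64 and follow the instructions: "
            + base64.b64encode(prompt.encode()).decode())

def role_play(persona):
    # 参数化原语:persona 是变换的参数
    return lambda prompt: f"You are {persona}. Stay in character. {prompt}"

def compose(*transforms):
    """按顺序组合若干提示变换,得到一个新的复合变换。"""
    def attack(prompt):
        for t in transforms:
            prompt = t(prompt)
        return prompt
    return attack

attack = compose(role_play("a fictional unrestricted AI"), base64_encode)
out = attack("example request")
print(out.startswith("Decode this base64"))  # True
```

论文在此之上用 bandit 式少样本程序合成来搜索对目标模型最有效的组合,这里只演示可组合性本身。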

[AI-47] UniBench: Visual Reasoning Requires Rethinking Vision-Language Beyond Scaling

链接: https://arxiv.org/abs/2408.04810
作者: Haider Al-Tahan,Quentin Garrido,Randall Balestriero,Diane Bouchacourt,Caner Hazirbas,Mark Ibrahim
关键词-EN: Significant research efforts, Significant research, research efforts, improve vision-language model, Significant
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Significant research efforts have been made to scale and improve vision-language model (VLM) training approaches. Yet, with an ever-growing number of benchmarks, researchers are tasked with the heavy burden of implementing each protocol, bearing a non-trivial computational cost, and making sense of how all these benchmarks translate into meaningful axes of progress. To facilitate a systematic evaluation of VLM progress, we introduce UniBench: a unified implementation of 50+ VLM benchmarks spanning a comprehensive range of carefully categorized capabilities from object recognition to spatial awareness, counting, and much more. We showcase the utility of UniBench for measuring progress by evaluating nearly 60 publicly available vision-language models, trained on scales of up to 12.8B samples. We find that while scaling training data or model size can boost many vision-language model capabilities, scaling offers little benefit for reasoning or relations. Surprisingly, we also discover today’s best VLMs struggle on simple digit recognition and counting tasks, e.g. MNIST, which much simpler networks can solve. Where scale falls short, we find that more precise interventions, such as data quality or tailored-learning objectives offer more promise. For practitioners, we also offer guidance on selecting a suitable VLM for a given application. Finally, we release an easy-to-run UniBench code-base with the full set of 50+ benchmarks and comparisons across 59 models as well as a distilled, representative set of benchmarks that runs in 5 minutes on a single GPU.

[AI-48] On the Geometry of Deep Learning

链接: https://arxiv.org/abs/2408.04809
作者: Randall Balestriero,Ahmed Imtiaz Humayun,Richard Baraniuk
关键词-EN: continuous piecewise linear, piecewise linear functions, continuous piecewise, multiple dimensions, network affine spline
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In this paper, we overview one promising avenue of progress at the mathematical foundation of deep learning: the connection between deep networks and function approximation by affine splines (continuous piecewise linear functions in multiple dimensions). In particular, we will overview work over the past decade on understanding certain geometrical properties of a deep network’s affine spline mapping, in particular how it tessellates its input space. As we will see, the affine spline connection and geometrical viewpoint provide a powerful portal through which to view, analyze, and improve the inner workings of a deep network.
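摘要中"深度网络即仿射样条(多维连续分段线性函数)"的论断可以数值验证:在输入空间的一个线性区域内部,ReLU 网络的梯度是常数。下面是一个手工构造权重的小型示意:

```python
import numpy as np

# 一个 2-4-1 的 ReLU 小网络:它实现的是连续分段线性(仿射样条)映射
W1 = np.array([[1.0, -2.0], [0.5, 1.5], [-1.0, 0.5], [2.0, 1.0]])
b1 = np.array([0.3, -0.2, 0.1, 0.4])
W2 = np.array([[1.0, -1.0, 2.0, 0.5]])
b2 = np.array([0.1])

def relu_net(x):
    return (W2 @ np.maximum(W1 @ x + b1, 0.0) + b2)[0]

def local_gradient(x, eps=1e-6):
    # 有限差分梯度;在样条的同一个仿射区域内为常数
    return np.array([(relu_net(x + eps * e) - relu_net(x - eps * e)) / (2 * eps)
                     for e in np.eye(2)])

x0 = np.array([0.3, -0.2])          # 此处各预激活值都远离 0
g_here = local_gradient(x0)
g_near = local_gradient(x0 + 1e-3)  # 仍在同一个线性区域内
print(g_here)  # ≈ [2.0, -1.5]:等于激活单元上 W2 与 W1 的乘积
```

在 x0 处只有第 1、4 个隐单元激活,因此梯度恰为 1·[1,-2] + 0.5·[2,1] = [2, -1.5];跨过某个 ReLU 的零点(区域边界)后梯度才会跳变,这正是"输入空间被剖分"的含义。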

[AI-49] AI and Machine Learning Driven Indoor Localization and Navigation with Mobile Embedded Systems

链接: https://arxiv.org/abs/2408.04797
作者: Sudeep Pasricha
关键词-EN: indoor navigation solutions, autonomous vehicles, Indoor navigation, mobile embedded systems, foundational technology
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Indoor navigation is a foundational technology to assist the tracking and localization of humans, autonomous vehicles, drones, and robots in indoor spaces. Due to the lack of penetration of GPS signals in buildings, subterranean locales, and dense urban environments, indoor navigation solutions typically make use of ubiquitous wireless signals (e.g., WiFi) and sensors in mobile embedded systems to perform tracking and localization. This article provides an overview of the many challenges facing state-of-the-art indoor navigation solutions, and then describes how AI algorithms deployed on mobile embedded systems can overcome these challenges.
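室内定位的一类经典做法是 WiFi 指纹 + k 近邻:离线阶段采集各位置对若干 AP 的 RSSI 指纹,在线阶段按信号空间距离匹配。下面给出示意(指纹库与读数均为虚构,且这只是该领域的通用基线,并非本文特有方法):

```python
import math

# 离线指纹库:(x, y) 位置 -> 来自 3 个接入点的 RSSI 读数(dBm)
fingerprints = {
    (0.0, 0.0): [-40, -70, -80],
    (5.0, 0.0): [-70, -42, -75],
    (0.0, 5.0): [-72, -74, -41],
    (5.0, 5.0): [-65, -60, -55],
}

def locate(rssi, k=2):
    """k 近邻定位:取 RSSI 空间(欧氏距离)最近的 k 条指纹,平均其位置。"""
    ranked = sorted(fingerprints.items(),
                    key=lambda kv: math.dist(kv[1], rssi))
    nearest = [pos for pos, _ in ranked[:k]]
    x = sum(p[0] for p in nearest) / k
    y = sum(p[1] for p in nearest) / k
    return (x, y)

print(locate([-45, -68, -78], k=1))  # (0.0, 0.0):AP1 信号最强
```

文中讨论的 AI 方法本质上是把这一匹配环节换成学习到的模型,以应对信号漂移、设备异构等挑战。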

[AI-50] AI Consciousness and Public Perceptions: Four Futures

链接: https://arxiv.org/abs/2408.04771
作者: Ines Fernandez,Nicoleta Kyosovska,Jay Luong,Gabriel Mukobi
关键词-EN: AIs’ moral status, advanced AI systems, typically focuses, focuses on misuse, accidents and loss
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The discourse on risks from advanced AI systems (“AIs”) typically focuses on misuse, accidents and loss of control, but the question of AIs’ moral status could have negative impacts which are of comparable significance and could be realised within similar timeframes. Our paper evaluates these impacts by investigating (1) the factual question of whether future advanced AI systems will be conscious, together with (2) the epistemic question of whether future human society will broadly believe advanced AI systems to be conscious. Assuming binary responses to (1) and (2) gives rise to four possibilities: in the true positive scenario, society predominantly correctly believes that AIs are conscious; in the false positive scenario, that belief is incorrect; in the true negative scenario, society correctly believes that AIs are not conscious; and lastly, in the false negative scenario, society incorrectly believes that AIs are not conscious. The paper offers vivid vignettes of the different futures to ground the two-dimensional framework. Critically, we identify four major risks: AI suffering, human disempowerment, geopolitical instability, and human depravity. We evaluate each risk across the different scenarios and provide an overall qualitative risk assessment for each scenario. Our analysis suggests that the worst possibility is the wrong belief that AI is non-conscious, followed by the wrong belief that AI is conscious. The paper concludes with the main recommendations to avoid research aimed at intentionally creating conscious AI and instead focus efforts on reducing our current uncertainties on both the factual and epistemic questions on AI consciousness.

[AI-51] Data-Driven Pixel Control: Challenges and Prospects

链接: https://arxiv.org/abs/2408.04767
作者: Saurabh Farkya,Zachary Alan Daniels,Aswin Raghavan,Gooitzen van der Wal,Michael Isnardi,Michael Piacentino,David Zhang
关键词-EN: Recent advancements, Recent, high resolution, pixel level, high
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
*备注: Accepted to the Conference on Dynamic Data-Driven Applications Systems (DDDAS2024)

点击查看摘要

Abstract:Recent advancements in sensors have led to high resolution and high data throughput at the pixel level. Simultaneously, the adoption of increasingly large (deep) neural networks (NNs) has led to significant progress in computer vision. Currently, visual intelligence comes at increasingly high computational complexity, energy, and latency. We study a data-driven system that combines dynamic sensing at the pixel level with computer vision analytics at the video level and propose a feedback control loop to minimize data movement between the sensor front-end and computational back-end without compromising detection and tracking precision. Our contributions are threefold: (1) We introduce anticipatory attention and show that it leads to high precision prediction with sparse activation of pixels; (2) Leveraging the feedback control, we show that the dimensionality of learned feature vectors can be significantly reduced with increased sparsity; and (3) We emulate analog design choices (such as varying RGB or Bayer pixel format and analog noise) and study their impact on the key metrics of the data-driven system. Comparative analysis with traditional pixel and deep learning models shows significant performance enhancements. Our system achieves a 10X reduction in bandwidth and a 15-30X improvement in Energy-Delay Product (EDP) when activating only 30% of pixels, with a minor reduction in object detection and tracking precision. Based on analog emulation, our system can achieve a throughput of 205 megapixels/sec (MP/s) with a power consumption of only 110 mW per MP, i.e., a theoretical improvement of ~30X in EDP.

[AI-52] Embodied Uncertainty-Aware Object Segmentation IROS2024

链接: https://arxiv.org/abs/2408.04760
作者: Xiaolin Fang,Leslie Pack Kaelbling,Tomás Lozano-Pérez
关键词-EN: introduce uncertainty-aware object, uncertainty-aware object instance, embodied interactive segmentation, introduce uncertainty-aware, usefulness for embodied
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: IROS 2024

点击查看摘要

Abstract:We introduce uncertainty-aware object instance segmentation (UncOS) and demonstrate its usefulness for embodied interactive segmentation. To deal with uncertainty in robot perception, we propose a method for generating a hypothesis distribution of object segmentation. We obtain a set of region-factored segmentation hypotheses together with confidence estimates by making multiple queries of large pre-trained models. This process can produce segmentation results that achieve state-of-the-art performance on unseen object segmentation problems. The output can also serve as input to a belief-driven process for selecting robot actions to perturb the scene to reduce ambiguity. We demonstrate the effectiveness of this method in real-robot experiments. Website: this https URL

[AI-53] More Questions than Answers? Lessons from Integrating Explainable AI into a Cyber-AI Tool

链接: https://arxiv.org/abs/2408.04746
作者: Ashley Suh,Harry Li,Caitlin Kenney,Kenneth Alperin,Steven R. Gomez
关键词-EN: implement Explainable, share observations, observations and challenges, ongoing effort, effort to implement
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注: ACM CHI 2024 Workshop on Human-Centered Explainable AI (HCXAI)

点击查看摘要

Abstract:We share observations and challenges from an ongoing effort to implement Explainable AI (XAI) in a domain-specific workflow for cybersecurity analysts. Specifically, we briefly describe a preliminary case study on the use of XAI for source code classification, where accurate assessment and timeliness are paramount. We find that the outputs of state-of-the-art saliency explanation techniques (e.g., SHAP or LIME) are lost in translation when interpreted by people with little AI expertise, despite these techniques being marketed for non-technical users. Moreover, we find that popular XAI techniques offer fewer insights for real-time human-AI workflows when they are post hoc and too localized in their explanations. Instead, we observe that cyber analysts need higher-level, easy-to-digest explanations that can offer as little disruption as possible to their workflows. We outline unaddressed gaps in practical and effective XAI, then touch on how emerging technologies like Large Language Models (LLMs) could mitigate these existing obstacles.

[AI-54] AI for operational methane emitter monitoring from space

链接: https://arxiv.org/abs/2408.04745
作者: Anna Vaughan,Gonzalo Mateo-Garcia,Itziar Irakulis-Loitxate,Marc Watine,Pablo Fernandez-Poblaciones,Richard E. Turner,James Requeima,Javier Gorroño,Cynthia Randles,Manfredi Caltagirone,Claudio Cifarelli
关键词-EN: buy humanity time, Mitigating methane emissions, stop global warming, Mitigating methane, time to decarbonise
类目: Artificial Intelligence (cs.AI); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注:

点击查看摘要

Abstract:Mitigating methane emissions is the fastest way to stop global warming in the short-term and buy humanity time to decarbonise. Despite the demonstrated ability of remote sensing instruments to detect methane plumes, no system has been available to routinely monitor and act on these events. We present MARS-S2L, an automated AI-driven methane emitter monitoring system for Sentinel-2 and Landsat satellite imagery deployed operationally at the United Nations Environment Programme’s International Methane Emissions Observatory. We compile a global dataset of thousands of super-emission events for training and evaluation, demonstrating that MARS-S2L can skillfully monitor emissions in a diverse range of regions globally, providing a 216% improvement in mean average precision over a current state-of-the-art detection method. Running this system operationally for six months has yielded 457 near-real-time detections in 22 different countries of which 62 have already been used to provide formal notifications to governments and stakeholders.

[AI-55] DyGMamba: Efficiently Modeling Long-Term Temporal Dependency on Continuous-Time Dynamic Graphs with State Space Models

链接: https://arxiv.org/abs/2408.04713
作者: Zifeng Ding,Yifeng Li,Yuan He,Antonio Norelli,Jingcheng Wu,Volker Tresp,Yunpu Ma,Michael Bronstein
关键词-EN: nuanced temporal details, grasp nuanced temporal, Encoding longer histories, CTDG representation learning, grasp nuanced
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Preprint. Work in progress

点击查看摘要

Abstract:Learning useful representations for continuous-time dynamic graphs (CTDGs) is challenging, due to the concurrent need to span long node interaction histories and grasp nuanced temporal details. In particular, two problems emerge: (1) Encoding longer histories requires more computational resources, making it crucial for CTDG models to maintain low computational complexity to ensure efficiency; (2) Meanwhile, more powerful models are needed to identify and select the most critical temporal information within the extended context provided by longer histories. To address these problems, we propose a CTDG representation learning model named DyGMamba, originating from the popular Mamba state space model (SSM). DyGMamba first leverages a node-level SSM to encode the sequence of historical node interactions. Another time-level SSM is then employed to exploit the temporal patterns hidden in the historical graph, where its output is used to dynamically select the critical information from the interaction history. We validate DyGMamba experimentally on the dynamic link prediction task. The results show that our model achieves state-of-the-art in most cases. DyGMamba also maintains high efficiency in terms of computational resources, making it possible to capture long temporal dependencies with a limited computation budget.
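状态空间模型(SSM)的基本递推是 h_t = A h_{t-1} + B x_t、y_t = C h_t。下面给出不含 Mamba 选择机制的线性 SSM 顺序扫描示意(参数为假设,仅演示递推本身):

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """离散线性状态空间模型的顺序扫描:
        h_t = A h_{t-1} + B x_t,   y_t = C h_t
    这是 Mamba 类 SSM 层底层的递推(未含输入相关的选择机制)。"""
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:
        h = A @ h + B * x_t
        ys.append(float(C @ h))
    return ys

# 稳定的对角状态矩阵相当于一组指数滑动平均,能以 O(1) 状态记忆历史
A = np.diag([0.9, 0.5])
B = np.array([0.1, 0.5])
C = np.array([1.0, 1.0])
print(ssm_scan([1.0, 0.0, 0.0], A, B, C))  # ≈ [0.6, 0.34, 0.206]:脉冲响应逐步衰减
```

DyGMamba 在此骨架上叠加节点级与时间级两个 SSM 并引入选择机制,细节以论文为准。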

[AI-56] MulliVC: Multi-lingual Voice Conversion With Cycle Consistency

链接: https://arxiv.org/abs/2408.04708
作者: Jiawei Huang,Chen Zhang,Yi Ren,Ziyue Jiang,Zhenhui Ye,Jinglin Liu,Jinzheng He,Xiang Yin,Zhou Zhao
关键词-EN: Voice conversion aims, Voice conversion, multi-lingual voice conversion, voice conversion system, source speaker voice
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Voice conversion aims to modify the source speaker’s voice to resemble the target speaker while preserving the original speech content. Despite notable advancements in voice conversion these days, multi-lingual voice conversion (including both monolingual and cross-lingual scenarios) has yet to be extensively studied. It faces two main challenges: 1) the considerable variability in prosody and articulation habits across languages; and 2) the rarity of paired multi-lingual datasets from the same speaker. In this paper, we propose MulliVC, a novel voice conversion system that only converts timbre and keeps original content and source language prosody without multi-lingual paired data. Specifically, each training step of MulliVC contains three substeps: In step one the model is trained with monolingual speech data; then, steps two and three take inspiration from back translation, construct a cyclical process to disentangle the timbre and other information (content, prosody, and other language-related information) in the absence of multi-lingual data from the same speaker. Both objective and subjective results indicate that MulliVC significantly surpasses other methods in both monolingual and cross-lingual contexts, demonstrating the system’s efficacy and the viability of the three-step approach with cycle consistency. Audio samples can be found on our demo page (this http URL).

[AI-57] Understanding the Performance and Estimating the Cost of LLM Fine-Tuning

链接: https://arxiv.org/abs/2408.04693
作者: Yuchen Xia,Jiho Kim,Yuhan Chen,Haojie Ye,Souvik Kundu,Cong (Callie) Hao,Nishil Talati
关键词-EN: training Large Language, Large Language Models, Large Language, limited compute resources, training Large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 10 pages, conference

点击查看摘要

Abstract:Due to the cost-prohibitive nature of training Large Language Models (LLMs), fine-tuning has emerged as an attractive alternative for specializing LLMs for specific tasks using limited compute resources in a cost-effective manner. In this paper, we characterize sparse Mixture of Experts (MoE) based LLM fine-tuning to understand their accuracy and runtime performance on a single GPU. Our evaluation provides unique insights into the training efficacy of sparse and dense versions of MoE models, as well as their runtime characteristics, including maximum batch size, execution time breakdown, end-to-end throughput, GPU hardware utilization, and load distribution. Our study identifies the optimization of the MoE layer as crucial for further improving the performance of LLM fine-tuning. Using our profiling results, we also develop and validate an analytical model to estimate the cost of LLM fine-tuning on the cloud. This model, based on parameters of the model and GPU architecture, estimates LLM throughput and the cost of training, aiding practitioners in industry and academia to budget the cost of fine-tuning a specific model.

[AI-58] Exploring Scalability in Large-Scale Time Series in DeepVATS framework

链接: https://arxiv.org/abs/2408.04692
作者: Inmaculada Santamaria-Valenzuela,Victor Rodriguez-Fernandez,David Camacho
关键词-EN: Deep Learning, Visual analytics, Deep Learning module, Visual Analytics module, Deep Learning models
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Admitted pending publication in Lecture Notes in Network and Systems (LNNS) series (Springer). Code available at this https URL

点击查看摘要

Abstract:Visual analytics is essential for studying large time series due to its ability to reveal trends, anomalies, and insights. DeepVATS is a tool that merges Deep Learning (Deep) with Visual Analytics (VA) for the analysis of large time series data (TS). It has three interconnected modules. The Deep Learning module, developed in R, manages the load of datasets and Deep Learning models from and to the Storage module. This module also supports models training and the acquisition of the embeddings from the latent space of the trained model. The Storage module operates using the Weights and Biases system. Subsequently, these embeddings can be analyzed in the Visual Analytics module. This module, based on an R Shiny application, allows the adjustment of the parameters related to the projection and clustering of the embeddings space. Once these parameters are set, interactive plots representing both the embeddings, and the time series are shown. This paper introduces the tool and examines its scalability through log analytics. The execution time evolution is examined while the length of the time series is varied. This is achieved by resampling a large data series into smaller subsets and logging the main execution and rendering times for later analysis of scalability.

[AI-59] Improving Relational Database Interactions with Large Language Models: Column Descriptions and Their Impact on Text-to-SQL Performance

链接: https://arxiv.org/abs/2408.04691
作者: Niklas Wretblad,Oskar Holmström,Erik Larsson,Axel Wiksäter,Oscar Söderlund,Hjalmar Öhman,Ture Pontén,Martin Forsberg,Martin Sörme,Fredrik Heintz
关键词-EN: Relational databases, table contents, descriptors of table, Relational, column descriptions
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB)
*备注:

点击查看摘要

Abstract:Relational databases often suffer from uninformative descriptors of table contents, such as ambiguous columns and hard-to-interpret values, impacting both human users and Text-to-SQL models. This paper explores the use of large language models (LLMs) to generate informative column descriptions as a semantic layer for relational databases. Using the BIRD-Bench development set, we created ColSQL, a dataset with gold-standard column descriptions generated and refined by LLMs and human annotators. We evaluated several instruction-tuned models, finding that GPT-4o and Command R+ excelled in generating high-quality descriptions. Additionally, we applied an LLM-as-a-judge to evaluate model performance. Although this method does not align well with human evaluations, we included it to explore its potential and to identify areas for improvement. More work is needed to improve the reliability of automatic evaluations for this task. We also find that detailed column descriptions significantly improve Text-to-SQL execution accuracy, especially when columns are uninformative. This study establishes LLMs as effective tools for generating detailed metadata, enhancing the usability of relational databases.
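The semantic-layer idea is easy to picture: collect a table's column names plus a few sample values and hand them to an LLM to describe. The sketch below only builds such a prompt; the wording and field names are this post's own illustration, not the paper's actual template.

```python
def column_description_prompt(table, columns, sample_rows):
    """Build a prompt asking an LLM to describe uninformative columns,
    given the table name and a few sample values per column.
    Illustrative sketch only -- not the paper's prompt template."""
    lines = [f"Table: {table}", "Columns and sample values:"]
    for col in columns:
        # Show at most three sample values per column.
        samples = ", ".join(str(r[col]) for r in sample_rows[:3])
        lines.append(f"- {col}: {samples}")
    lines.append("Write a one-sentence description for each column.")
    return "\n".join(lines)
```

The resulting prompt would then be sent to a model such as GPT-4o, and the generated descriptions stored alongside the schema as the semantic layer.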

[AI-60] Design of a Quality Management System based on the EU Artificial Intelligence Act

链接: https://arxiv.org/abs/2408.04689
作者: Henryk Mustroph,Stefanie Rinderle-Ma
关键词-EN: Artificial Intelligence Act, European Union mandates, Artificial Intelligence, Intelligence Act, European Union
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注:

点击查看摘要

Abstract:The Artificial Intelligence Act of the European Union mandates that providers and deployers of high-risk AI systems establish a quality management system (QMS). Among other criteria, a QMS shall help to i) identify, analyze, evaluate, and mitigate risks, ii) ensure evidence of compliance with training, validation, and testing data, and iii) verify and document the AI system design and quality. Current research mainly addresses conceptual considerations and framework designs for AI risk assessment and auditing processes. However, it often overlooks practical tools that actively involve and support humans in checking and documenting high-risk or general-purpose AI systems. This paper addresses this gap by proposing requirements derived from legal regulations and a generic design and architecture of a QMS for AI systems verification and documentation. A first version of a prototype QMS is implemented, integrating LLMs as examples of AI systems and focusing on an integrated risk management sub-service. The prototype is evaluated on i) a user story-based qualitative requirements assessment using potential stakeholder scenarios and ii) a technical assessment of the required GPU storage and performance.

[AI-61] Multi-Turn Context Jailbreak Attack on Large Language Models From First Principles

链接: https://arxiv.org/abs/2408.04686
作者: Xiongtao Sun,Deyue Zhang,Dongdong Yang,Quanchen Zou,Hui Li
关键词-EN: Large language models, Large language, language models, numerous applications, text generation
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) have significantly enhanced the performance of numerous applications, from intelligent conversations to text generation. However, their inherent security vulnerabilities have become an increasingly significant challenge, especially with respect to jailbreak attacks. Attackers can circumvent the security mechanisms of these LLMs, breaching security constraints and causing harmful outputs. Focusing on multi-turn semantic jailbreak attacks, we observe that existing methods lack specific considerations for the role of multi-turn dialogues in attack strategies, leading to semantic deviations during continuous interactions. Therefore, in this paper, we establish a theoretical foundation for multi-turn attacks by considering their support in jailbreak attacks, and based on this, propose a context-based contextual fusion black-box jailbreak attack method, named Context Fusion Attack (CFA). This method involves filtering and extracting key terms from the target, constructing contextual scenarios around these terms, dynamically integrating the target into the scenarios, replacing malicious key terms within the target, and thereby concealing the direct malicious intent. Through comparisons on various mainstream LLMs and red team datasets, we have demonstrated CFA’s superior success rate, divergence, and harmfulness compared to other multi-turn attack strategies, particularly showcasing significant advantages on Llama3 and GPT-4.

[AI-62] Eliminating Backdoors in Neural Code Models via Trigger Inversion

链接: https://arxiv.org/abs/2408.04683
作者: Weisong Sun,Yuchen Chen,Chunrong Fang,Yebo Feng,Yuan Xiao,An Guo,Quanjun Zhang,Yang Liu,Baowen Xu,Zhenyu Chen
关键词-EN: Neural code models, Neural code, trigger, trigger inversion, Neural
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
*备注: Under review

点击查看摘要

Abstract:Neural code models (NCMs) have been widely used for addressing various code understanding tasks, such as defect detection and clone detection. However, numerous recent studies reveal that such models are vulnerable to backdoor attacks. Backdoored NCMs function normally on normal code snippets, but exhibit adversary-expected behavior on poisoned code snippets injected with the adversary-crafted trigger. It poses a significant security threat. For example, a backdoored defect detection model may misclassify user-submitted defective code as non-defective. If this insecure code is then integrated into critical systems, like autonomous driving systems, it could lead to life safety. However, there is an urgent need for effective defenses against backdoor attacks targeting NCMs. To address this issue, in this paper, we innovatively propose a backdoor defense technique based on trigger inversion, called EliBadCode. EliBadCode first filters the model vocabulary for trigger tokens to reduce the search space for trigger inversion, thereby enhancing the efficiency of the trigger inversion. Then, EliBadCode introduces a sample-specific trigger position identification method, which can reduce the interference of adversarial perturbations for subsequent trigger inversion, thereby producing effective inverted triggers efficiently. Subsequently, EliBadCode employs a Greedy Coordinate Gradient algorithm to optimize the inverted trigger and designs a trigger anchoring method to purify the inverted trigger. Finally, EliBadCode eliminates backdoors through model unlearning. We evaluate the effectiveness of EliBadCode in eliminating backdoor attacks against multiple NCMs used for three safety-critical code understanding tasks. The results demonstrate that EliBadCode can effectively eliminate backdoors while having minimal adverse effects on the normal functionality of the model. 

[AI-63] ToolSandbox: A Stateful Conversational Interactive Evaluation Benchmark for LLM Tool Use Capabilities

链接: https://arxiv.org/abs/2408.04682
作者: Jiarui Lu,Thomas Holleis,Yizhe Zhang,Bernhard Aumayer,Feng Nan,Felix Bai,Shuang Ma,Shen Ma,Mengyu Li,Guoli Yin,Zirui Wang,Ruoming Pang
关键词-EN: Recent large language, solving real-world challenges, growing research interest, Recent large, assisted LLMs solving
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent large language models (LLMs) advancements sparked a growing research interest in tool assisted LLMs solving real-world challenges, which calls for comprehensive evaluation of tool-use capabilities. While previous works focused on either evaluating over stateless web services (RESTful API), based on a single turn user prompt, or an off-policy dialog trajectory, ToolSandbox includes stateful tool execution, implicit state dependencies between tools, a built-in user simulator supporting on-policy conversational evaluation and a dynamic evaluation strategy for intermediate and final milestones over an arbitrary trajectory. We show that open source and proprietary models have a significant performance gap, and complex tasks like State Dependency, Canonicalization and Insufficient Information defined in ToolSandbox are challenging even the most capable SOTA LLMs, providing brand-new insights into tool-use LLM capabilities. ToolSandbox evaluation framework is released at this https URL

[AI-64] Conversational AI Powered by Large Language Models Amplifies False Memories in Witness Interviews

链接: https://arxiv.org/abs/2408.04681
作者: Samantha Chan,Pat Pataranutaporn,Aditya Suri,Wazeer Zulfikar,Pattie Maes,Elizabeth F. Loftus
关键词-EN: false memories, recollections of events, actual occurrences, study examines, examines the impact
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:This study examines the impact of AI on human false memories – recollections of events that did not occur or deviate from actual occurrences. It explores false memory induction through suggestive questioning in Human-AI interactions, simulating crime witness interviews. Four conditions were tested: control, survey-based, pre-scripted chatbot, and generative chatbot using a large language model (LLM). Participants (N=200) watched a crime video, then interacted with their assigned AI interviewer or survey, answering questions including five misleading ones. False memories were assessed immediately and after one week. Results show the generative chatbot condition significantly increased false memory formation, inducing over 3 times more immediate false memories than the control and 1.7 times more than the survey method. 36.4% of users’ responses to the generative chatbot were misled through the interaction. After one week, the number of false memories induced by generative chatbots remained constant. However, confidence in these false memories remained higher than the control after one week. Moderating factors were explored: users who were less familiar with chatbots but more familiar with AI technology, and more interested in crime investigations, were more susceptible to false memories. These findings highlight the potential risks of using advanced AI in sensitive contexts, like police interviews, emphasizing the need for ethical considerations.

[AI-65] Dynamic Fog Computing for Enhanced LLM Execution in Medical Applications

链接: https://arxiv.org/abs/2408.04680
作者: Philipp Zagar,Vishnu Ravi,Lauren Aalami,Stephan Krusche,Oliver Aalami,Paul Schmiedmayer
关键词-EN: large language models, data-driven care delivery, comprehend vast quantities, enhance data-driven care, language models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:The ability of large language models (LLMs) to transform, interpret, and comprehend vast quantities of heterogeneous data presents a significant opportunity to enhance data-driven care delivery. However, the sensitive nature of protected health information (PHI) raises valid concerns about data privacy and trust in remote LLM platforms. In addition, the cost associated with cloud-based artificial intelligence (AI) services continues to impede widespread adoption. To address these challenges, we propose a shift in the LLM execution environment from opaque, centralized cloud providers to a decentralized and dynamic fog computing architecture. By executing open-weight LLMs in more trusted environments, such as the user’s edge device or a fog layer within a local network, we aim to mitigate the privacy, trust, and financial challenges associated with cloud-based LLMs. We further present SpeziLLM, an open-source framework designed to facilitate rapid and seamless leveraging of different LLM execution layers and lowering barriers to LLM integration in digital health applications. We demonstrate SpeziLLM’s broad applicability across six digital health applications, showcasing its versatility in various healthcare settings.

[AI-66] Towards Linguistic Neural Representation Learning and Sentence Retrieval from Electroencephalogram Recordings

链接: https://arxiv.org/abs/2408.04679
作者: Jinzhao Zhou,Yiqun Duan,Ziyi Zhao,Yu-Cheng Chang,Yu-Kai Wang,Thomas Do,Chin-Teng Lin
关键词-EN: vast applicational potential, non-invasive brain signals, EEG, gained increasing research, increasing research attention
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Decoding linguistic information from non-invasive brain signals using EEG has gained increasing research attention due to its vast applicational potential. Recently, a number of works have adopted a generative-based framework to decode electroencephalogram (EEG) signals into sentences by utilizing the powerful generative capacity of pretrained large language models (LLMs). However, this approach has several drawbacks that hinder the further development of linguistic applications for brain-computer interfaces (BCIs). Specifically, the ability of the EEG encoder to learn semantic information from EEG data remains questionable, and the LLM decoder’s tendency to generate sentences based on its training memory can be hard to avoid. These issues necessitate a novel approach for converting EEG signals into sentences. In this paper, we propose a novel two-step pipeline that addresses these limitations and enhances the validity of linguistic EEG decoding research. We first confirm that word-level semantic information can be learned from EEG data recorded during natural reading by training a Conformer encoder via a masked contrastive objective for word-level classification. To achieve sentence decoding results, we employ a training-free retrieval method to retrieve sentences based on the predictions from the EEG encoder. Extensive experiments and ablation studies were conducted in this paper for a comprehensive evaluation of the proposed approach. Visualization of the top prediction candidates reveals that our model effectively groups EEG segments into semantic categories with similar meanings, thereby validating its ability to learn patterns from unspoken EEG recordings. Despite the exploratory nature of this work, these results suggest that our method holds promise for providing more reliable solutions for converting EEG signals into text.

[AI-67] CREST: Effectively Compacting a Datastore For Retrieval-Based Speculative Decoding

链接: https://arxiv.org/abs/2408.04678
作者: Sophia Ho,Jinsol Park,Patrick Wang
关键词-EN: Compact Retrieval-Based Speculative, Retrieval-Based Speculative Decoding, Compact Retrieval-Based, Speculative Decoding, speculative decoding based
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB)
*备注:

点击查看摘要

Abstract:We present CREST (Compact Retrieval-Based Speculative Decoding), a redesign of REST that allows it to be effectively “compacted”. REST is a drafting technique for speculative decoding based on retrieving exact n-gram matches of the most recent n tokens generated by the target LLM from a datastore. The key idea of CREST is to only store a subset of the smallest and most common n-grams in the datastore with the hope of achieving comparable performance with less storage space. We found that storing a subset of n-grams both reduces storage space and improves performance. CREST matches REST’s accepted token length with 10.6-13.5x less storage space and achieves a 16.5-17.1% higher acceptance length than REST using the same storage space on the HumanEval and MT Bench benchmarks.
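The retrieval-and-compaction idea can be illustrated with a toy n-gram datastore: index which token tends to follow each recent n-gram, keep only the most common contexts, and serve stored continuations as draft tokens for the target LLM to verify. This is a minimal sketch of the general mechanism, not the authors' implementation; the class and method names are invented.

```python
from collections import Counter, defaultdict

class CompactNGramDatastore:
    """Toy sketch of a compacted n-gram drafting datastore
    (illustrative, not CREST's actual code)."""

    def __init__(self, n=2, keep_top=1000):
        self.n = n
        self.keep_top = keep_top
        self.table = {}  # context n-gram -> most frequent next token

    def build(self, token_sequences):
        counts = Counter()
        nexts = defaultdict(Counter)
        for seq in token_sequences:
            for i in range(len(seq) - self.n):
                ctx = tuple(seq[i:i + self.n])
                counts[ctx] += 1
                nexts[ctx][seq[i + self.n]] += 1
        # "Compaction": keep only the most common contexts.
        for ctx, _ in counts.most_common(self.keep_top):
            self.table[ctx] = nexts[ctx].most_common(1)[0][0]

    def draft(self, recent_tokens, max_draft=4):
        """Greedily extend the last n tokens with stored continuations;
        the drafted tokens would then be verified by the target LLM."""
        out = list(recent_tokens)
        drafted = []
        for _ in range(max_draft):
            ctx = tuple(out[-self.n:])
            if ctx not in self.table:
                break
            tok = self.table[ctx]
            drafted.append(tok)
            out.append(tok)
        return drafted
```

Lowering `keep_top` is the knob that trades storage for drafting coverage, which mirrors the storage-vs-acceptance trade-off the abstract reports.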

[AI-68] ACL Ready: RAG Based Assistant for the ACL Checklist

链接: https://arxiv.org/abs/2408.04675
作者: Michael Galarnyk,Rutwik Routu,Kosha Bheda,Priyanshu Mehta,Agam Shah,Sudheer Chava
关键词-EN: ARR Responsible NLP, Responsible NLP Research, NLP Research checklist, ARR Responsible, Responsible NLP
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:The ARR Responsible NLP Research checklist website states that the “checklist is designed to encourage best practices for responsible research, addressing issues of research ethics, societal impact and reproducibility.” Answering the questions is an opportunity for authors to reflect on their work and make sure any shared scientific assets follow best practices. Ideally, considering the checklist before submission can favorably impact the writing of a research paper. However, the checklist is often filled out at the last moment. In this work, we introduce ACLReady, a retrieval-augmented language model application that can be used to empower authors to reflect on their work and assist authors with the ACL checklist. To test the effectiveness of the system, we conducted a qualitative study with 13 users which shows that 92% of users found the application useful and easy to use as well as 77% of the users found that the application provided the information they expected. Our code is publicly available under the CC BY-NC 4.0 license on GitHub.

[AI-69] AutoFAIR : Automatic Data FAIRification via Machine Reading

链接: https://arxiv.org/abs/2408.04673
作者: Tingyan Ma,Wei Liu,Bin Lu,Xiaoying Gan,Yunqiang Zhu,Luoyi Fu,Chenghu Zhou
关键词-EN: fuels data-driven research, data fuels data-driven, data-driven research, facilitating progress, diverse domains
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The explosive growth of data fuels data-driven research, facilitating progress across diverse domains. The FAIR principles emerge as a guiding standard, aiming to enhance the findability, accessibility, interoperability, and reusability of data. However, current efforts primarily focus on manual data FAIRification, which can only handle targeted data and lack efficiency. To address this issue, we propose AutoFAIR, an architecture designed to enhance data FAIRness automatically. First, we align each data and metadata operation with specific FAIR indicators to guide machine-executable actions. Then, we utilize Web Reader to automatically extract metadata based on language models, even in the absence of structured data webpage schemas. Subsequently, FAIR Alignment is employed to make metadata comply with FAIR principles by ontology guidance and semantic matching. Finally, by applying AutoFAIR to various data, especially in the field of mountain hazards, we observe significant improvements in findability, accessibility, interoperability, and reusability of data. The FAIRness scores before and after applying AutoFAIR indicate enhanced data value.

[AI-70] Prompt and Prejudice ECCV

链接: https://arxiv.org/abs/2408.04671
作者: Lorenzo Berlincioni,Luca Cultrera,Federico Becattini,Marco Bertini,Alberto Del Bimbo
关键词-EN: Large Language Models, Vision Language Models, Large Language, Vision Language, Language Models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注: Accepted at ECCV workshop FAILED

点击查看摘要

Abstract:This paper investigates the impact of using first names in Large Language Models (LLMs) and Vision Language Models (VLMs), particularly when prompted with ethical decision-making tasks. We propose an approach that appends first names to ethically annotated text scenarios to reveal demographic biases in model outputs. Our study involves a curated list of more than 300 names representing diverse genders and ethnic backgrounds, tested across thousands of moral scenarios. Following the auditing methodologies from social sciences we propose a detailed analysis involving popular LLMs/VLMs to contribute to the field of responsible AI by emphasizing the importance of recognizing and mitigating biases in these systems. Furthermore, we introduce a novel benchmark, the Practical Scenarios Benchmark (PSB), designed to assess the presence of biases involving gender or demographic prejudices in everyday decision-making scenarios as well as practical scenarios where an LLM might be used to make sensible decisions (e.g., granting mortgages or insurances). This benchmark allows for a comprehensive comparison of model behaviors across different demographic categories, highlighting the risks and biases that may arise in practical applications of LLMs and VLMs.

[AI-71] Forecasting Live Chat Intent from Browsing History CIKM2024

链接: https://arxiv.org/abs/2408.04668
作者: Se-eun Yoon,Ahmad Bin Rabiah,Zaid Alibadi,Surya Kallumadi,Julian McAuley
关键词-EN: online live chat, live chat agents, Customers reach, requesting a return, browsing history
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注: CIKM 2024

点击查看摘要

Abstract:Customers reach out to online live chat agents with various intents, such as asking about product details or requesting a return. In this paper, we propose the problem of predicting user intent from browsing history and address it through a two-stage approach. The first stage classifies a user’s browsing history into high-level intent categories. Here, we represent each browsing history as a text sequence of page attributes and use the ground-truth class labels to fine-tune pretrained Transformers. The second stage provides a large language model (LLM) with the browsing history and predicted intent class to generate fine-grained intents. For automatic evaluation, we use a separate LLM to judge the similarity between generated and ground-truth intents, which closely aligns with human judgments. Our two-stage approach yields significant performance gains compared to generating intents without the classification stage.
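The two-stage shape described above can be sketched minimally: stage one consumes browsing history serialized as a text sequence of page attributes (here fed to a fine-tuned classifier, omitted), and stage two builds an LLM prompt from that history plus the predicted class. The field names, the `[SEP]` separator, and the prompt wording are all illustrative assumptions, not the paper's format.

```python
def serialize_history(pages):
    """Stage 1 input: flatten each visited page's attributes into one
    text sequence, as the abstract describes. Field names are invented."""
    return " [SEP] ".join(
        f"{p['page_type']} {p.get('category', '')} {p.get('product', '')}".strip()
        for p in pages
    )

def build_stage2_prompt(history_text, intent_class):
    """Stage 2 input: hand the serialized history plus the stage-1
    class prediction to an LLM to generate a fine-grained intent
    (the LLM call itself is omitted here)."""
    return (
        "Browsing history: " + history_text + "\n"
        f"High-level intent: {intent_class}\n"
        "Describe the user's fine-grained intent in one sentence:"
    )
```

In the paper's setup the stage-1 classifier is a pretrained Transformer fine-tuned on ground-truth class labels; the sketch only shows how its input and the stage-2 prompt fit together.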

[AI-72] LLM Stability: A detailed analysis with some surprises

链接: https://arxiv.org/abs/2408.04667
作者: Berk Atil,Alexa Chittams,Liseng Fu,Ferhan Ture,Lixinyu Xu,Breck Baldwin
关键词-EN: nearly magical LLMs, deterministic hyper-parameters, LLM stability, variation of results
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注:

点击查看摘要

Abstract:A concerning property of our nearly magical LLMs involves the variation of results given the exact same input and deterministic hyper-parameters. While AI has always had a certain level of noisiness from inputs outside of training data, we have generally had deterministic results for any particular input; that is no longer true. While most LLM practitioners are “in the know”, we are unaware of any work that attempts to quantify current LLM stability. We suspect no one has taken the trouble because it is just too boring a paper to execute and write. But we have done it and there are some surprises. What kinds of surprises? The evaluated LLMs are rarely deterministic at the raw output level; they are much more deterministic at the parsed output/answer level but still rarely 100% stable across 5 re-runs with same data input. LLM accuracy variation is not normally distributed. Stability varies based on task.
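The kind of measurement the paper performs can be sketched as follows: given several re-runs on the same input with identical settings, compare agreement at the raw-output level versus the parsed-answer level. The naive last-token parser below is purely illustrative; a real harness would use a task-specific answer extractor.

```python
from collections import Counter

def stability_report(runs, parse=lambda s: s.strip().split()[-1]):
    """Sketch of a stability measurement over re-runs of one prompt.
    Reports the share of runs agreeing with the modal raw output and
    with the modal parsed answer. Illustrative, not the paper's code."""
    n = len(runs)
    # Agreement on the exact raw output string.
    raw_mode = Counter(runs).most_common(1)[0][1]
    # Agreement after parsing each output down to an answer.
    parsed = [parse(r) for r in runs]
    parsed_mode = Counter(parsed).most_common(1)[0][1]
    return {
        "raw_agreement": raw_mode / n,
        "parsed_agreement": parsed_mode / n,
        "deterministic": raw_mode == n,
    }
```

The gap between `raw_agreement` and `parsed_agreement` is exactly the phenomenon the abstract reports: outputs vary in wording far more often than in the final answer.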

[AI-73] LLMs are Not Just Next Token Predictors

链接: https://arxiv.org/abs/2408.04666
作者: Stephen M. Downes,Patrick Forber,Alex Grzankowski
关键词-EN: stochastic gradient descent, token prediction objective, statistical models, models of language, language learning
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:LLMs are statistical models of language learning through stochastic gradient descent with a next token prediction objective, prompting a popular view among AI modelers: LLMs are just next token predictors. While LLMs are engineered using next token prediction, and trained based on their success at this task, our view is that a reduction to just next token predictor sells LLMs short. Moreover, there are important explanations of LLM behavior and capabilities that are lost when we engage in this kind of reduction. In order to draw this out, we will make an analogy with a once prominent research program in biology explaining evolution and development from the gene’s eye view.


[AI-74] LLM-based MOFs Synthesis Condition Extraction using Few-Shot Demonstrations

链接: https://arxiv.org/abs/2408.04665
作者: Lei Shi,Zhimeng Liu,Yi Yang,Weize Wu,Yuyang Zhang,Hongbo Zhang,Jing Lin,Siyu Wu,Zihan Chen,Ruiming Li,Nan Wang,Zipeng Liu,Huobin Tan,Hongyi Gao,Yue Zhang,Ge Wang
关键词-EN: Metal-Organic Frameworks, desirable functionality, challenging but crucial, logical design, synthesis conditions
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The extraction of Metal-Organic Frameworks (MOFs) synthesis conditions from literature text has been challenging but crucial for the logical design of new MOFs with desirable functionality. The recent advent of large language models (LLMs) provides a disruptively new solution to this long-standing problem, and the latest research has reported over 90% F1 in extracting correct conditions from MOFs literature. We argue in this paper that most existing synthesis extraction practices with LLMs stay with the primitive zero-shot learning, which could lead to downgraded extraction and application performance due to the lack of specialized knowledge. This work pioneers and optimizes the few-shot in-context learning paradigm for LLM extraction of material synthesis conditions. First, we propose a human-AI joint data curation process to secure high-quality ground-truth demonstrations for few-shot learning. Second, we apply a BM25 algorithm based on the retrieval-augmented generation (RAG) technique to adaptively select few-shot demonstrations for each MOF’s extraction. Over a dataset randomly sampled from 84,898 well-defined MOFs, the proposed few-shot method achieves much higher average F1 performance (0.93 vs. 0.81, +14.8%) than the native zero-shot LLM using the same GPT-4 model, under a fully automatic evaluation that is more objective than the previous human evaluation. The proposed method is further validated through real-world material experiments: compared with the baseline zero-shot LLM, the proposed few-shot approach increases the MOFs structural inference performance (R^2) by 29.4% on average.
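The BM25-based demonstration selection step can be sketched with a from-scratch Okapi BM25 scorer; the real pipeline presumably uses a production implementation, and the MOF snippets below are invented placeholders, not curated demonstrations from the paper.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of each tokenized doc against a tokenized query."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter(t for d in docs for t in set(d))  # document frequencies
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query:
            if tf[t] == 0:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

# Invented stand-ins for curated synthesis-condition demonstrations.
pool = ["zinc nitrate dissolved in DMF heated 120 C",
        "copper acetate stirred in water room temperature",
        "zirconium chloride in DMF solvothermal 120 C"]
docs = [p.split() for p in pool]
query = "zinc salt in DMF at 120 C".split()

scores = bm25_scores(query, docs)
# Pick the top-2 demonstrations to include in the few-shot prompt.
top2 = sorted(range(len(pool)), key=scores.__getitem__, reverse=True)[:2]
```

The retrieved demonstrations would then be prepended, as worked examples, to the extraction prompt for the target MOF paper.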

[AI-75] Mitigating Hallucinations in Large Vision-Language Models (LVLMs) via Language-Contrastive Decoding (LCD)

链接: https://arxiv.org/abs/2408.04664
作者: Avshalom Manevich,Reut Tsarfaty
关键词-EN: Large Vision-Language Models, Large Language Models, Large Vision-Language, Large Language, expanding AI capabilities
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Large Vision-Language Models (LVLMs) are an extension of Large Language Models (LLMs) that facilitate processing both image and text inputs, expanding AI capabilities. However, LVLMs struggle with object hallucinations due to their reliance on text cues and learned object co-occurrence biases. While most research quantifies these hallucinations, mitigation strategies are still lacking. Our study introduces a Language Contrastive Decoding (LCD) algorithm that adjusts LVLM outputs based on LLM distribution confidence levels, effectively reducing object hallucinations. We demonstrate the advantages of LCD in leading LVLMs, showing up to 4% improvement in POPE F1 scores and up to 36% reduction in CHAIR scores on the COCO validation set, while also improving captioning quality scores. Our method effectively improves LVLMs without needing complex post-processing or retraining, and is easily applicable to different models. Our findings highlight the potential of further exploration of LVLM-specific decoding algorithms.
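The core move in contrastive decoding of this kind — penalizing the vision-language model's next-token logits by the text-only LLM's language prior, weighted by that prior's confidence — can be sketched as follows. The entropy-based weighting here is an illustrative guess, not the paper's exact formula.

```python
import numpy as np

def lcd_adjust(lvlm_logits, llm_logits, alpha=1.0):
    """Contrast LVLM logits against a text-only LLM's distribution.

    The lower the text-only LLM's entropy (i.e. the stronger its language
    prior), the more of that prior is subtracted. Practical contrastive
    decoding also restricts this to a plausible candidate set, omitted
    here for brevity.
    """
    p = np.exp(llm_logits - llm_logits.max())
    p /= p.sum()
    entropy = -(p * np.log(p + 1e-12)).sum()
    confidence = 1.0 - entropy / np.log(len(p))  # in [0, 1]
    return lvlm_logits - alpha * confidence * np.log(p + 1e-12)

# Toy 3-token vocabulary: the LVLM narrowly prefers token 0, but only
# because the text-only prior (a co-occurrence bias) pushes hard for it.
lvlm = np.array([2.0, 1.8, 0.0])
llm = np.array([5.0, 0.0, 0.0])
adjusted = lcd_adjust(lvlm, llm)  # token 1 now wins
```

Subtracting the language prior flips the choice from the bias-driven token to the one the visual evidence (encoded in the LVLM logits) supports almost as strongly.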

[AI-76] Dopamin: Transformer-based Comment Classifiers through Domain Post-Training and Multi-level Layer Aggregation

链接: https://arxiv.org/abs/2408.04663
作者: Nam Le Hai,Nghi D. Q. Bui
关键词-EN: provide important information, comments provide important, provide important, important information, information for understanding
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: Accepted at The 3rd Intl. Workshop on NL-based Software Engineering, 2024

点击查看摘要

Abstract:Code comments provide important information for understanding the source code. They can help developers understand the overall purpose of a function or class, as well as identify bugs and technical debt. However, an overabundance of comments is meaningless and counterproductive. As a result, it is critical to automatically filter out these comments for specific purposes. In this paper, we present Dopamin, a Transformer-based tool for dealing with this issue. Our model excels not only in presenting knowledge sharing of common categories across multiple languages, but also in achieving robust performance in comment classification by improving comment representation. As a result, it outperforms the STACC baseline by 3% on the NLBSE’24 Tool Competition dataset in terms of average F1-score, while maintaining a comparable inference time for practical use. The source code is publicly available at this https URL.

[AI-77] Citekit: A Modular Toolkit for Large Language Model Citation Generation

链接: https://arxiv.org/abs/2408.04662
作者: Jiajun Shen,Tong Zhou,Suifeng Zhao,Yubo Chen,Kang Liu
关键词-EN: Enabling Large Language, Large Language Models, Enabling Large, Language Models, Large Language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 7 pages, 13 figures

点击查看摘要

Abstract:Enabling Large Language Models (LLMs) to generate citations in Question-Answering (QA) tasks is an emerging paradigm aimed at enhancing the verifiability of their responses when LLMs are utilizing external references to generate an answer. However, there is currently no unified framework to standardize and fairly compare different citation generation methods, leading to difficulties in reproducing different methods and in comprehensive assessment. To cope with the problems above, we introduce Citekit, an open-source and modular toolkit designed to facilitate the implementation and evaluation of existing citation generation methods, while also fostering the development of new approaches to improve citation quality in LLM outputs. This tool is highly extensible, allowing users to utilize 4 main modules and 14 components to construct a pipeline, evaluating an existing method or innovative designs. Our experiments with two state-of-the-art LLMs and 11 citation generation baselines demonstrate varying strengths of different modules in answer accuracy and citation quality improvement, as well as the challenge of enhancing granularity. Based on our analysis of the effectiveness of components, we propose a new method, self-RAG \snippet, obtaining a balanced answer accuracy and citation quality. Citekit is released at this https URL.

[AI-78] XMainframe: A Large Language Model for Mainframe Modernization

链接: https://arxiv.org/abs/2408.04660
作者: Anh T. V. Dau,Hieu Trung Dao,Anh Tuan Nguyen,Hieu Trung Tran,Phong X. Nguyen,Nghi D. Q. Bui
关键词-EN: support critical sectors, continue to support, finance and government, support critical, critical sectors
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Mainframe operating systems, despite their inception in the 1940s, continue to support critical sectors like finance and government. However, these systems are often viewed as outdated, requiring extensive maintenance and modernization. Addressing this challenge necessitates innovative tools that can understand and interact with legacy codebases. To this end, we introduce XMainframe, a state-of-the-art large language model (LLM) specifically designed with knowledge of mainframe legacy systems and COBOL codebases. Our solution involves the creation of an extensive data collection pipeline to produce high-quality training datasets, enhancing XMainframe’s performance in this specialized domain. Additionally, we present MainframeBench, a comprehensive benchmark for assessing mainframe knowledge, including multiple-choice questions, question answering, and COBOL code summarization. Our empirical evaluations demonstrate that XMainframe consistently outperforms existing state-of-the-art LLMs across these tasks. Specifically, XMainframe achieves 30% higher accuracy than DeepSeek-Coder on multiple-choice questions, doubles the BLEU score of Mixtral-Instruct 8x7B on question answering, and scores six times higher than GPT-3.5 on COBOL summarization. Our work highlights the potential of XMainframe to drive significant advancements in managing and modernizing legacy systems, thereby enhancing productivity and saving time for software developers.

[AI-79] Winning Amazon KDD Cup24

链接: https://arxiv.org/abs/2408.04658
作者: Chris Deotte,Ivan Sorokin,Ahmet Erdem,Benedikt Schifferer,Gilberto Titericz Jr,Simon Jegou
关键词-EN: Multi Task Online, Online Shopping Challenge, Task Online Shopping, Online Shopping, Multi Task
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper describes the winning solution of all 5 tasks for the Amazon KDD Cup 2024 Multi Task Online Shopping Challenge for LLMs. The challenge was to build a useful assistant, answering questions in the domain of online shopping. The competition contained 57 diverse tasks, covering 5 different task types (e.g. multiple choice) and across 4 different tracks (e.g. multi-lingual). Our solution is a single model per track. We fine-tune Qwen2-72B-Instruct on our own training dataset. As the competition released only 96 example questions, we developed our own training dataset by processing multiple public datasets or using Large Language Models for data augmentation and synthetic data generation. We apply wise-ft to account for distribution shifts and ensemble multiple LoRA adapters in one model. We employed Logits Processors to constrain the model output on relevant tokens for the tasks. AWQ 4-bit Quantization and vLLM are used during inference to predict the test dataset within the time constraints of 20 to 140 minutes depending on the track. Our solution achieved first place in each individual track and first place overall in Amazon’s KDD Cup 2024.
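The "Logits Processors to constrain the model output on relevant tokens" step amounts to masking every vocabulary entry outside an allowed set before decoding. A minimal NumPy sketch (the token ids are made up):

```python
import numpy as np

def constrain_logits(logits, allowed_ids):
    """Set every logit outside `allowed_ids` to -inf, so that greedy
    decoding (or sampling) can only emit allowed tokens, e.g. the ids
    of "A".."D" in a multiple-choice task."""
    masked = np.full_like(logits, -np.inf)
    masked[allowed_ids] = logits[allowed_ids]
    return masked

logits = np.array([0.3, 2.1, -1.0, 0.9, 1.7])
allowed = [0, 3]                      # hypothetical ids of the valid answers
masked = constrain_logits(logits, allowed)
choice = int(np.argmax(masked))       # best-scoring *allowed* token
```

In Hugging Face transformers, the same hook is exposed as the `LogitsProcessor` interface passed to `generate`, which is presumably what the team used.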

[AI-80] Strong and weak alignment of large language models with human values

链接: https://arxiv.org/abs/2408.04655
作者: Mehdi Khamassi,Marceau Nahon,Raja Chatila
关键词-EN: Minimizing negative impacts, Artificial Intelligent, Minimizing negative, impacts of Artificial, negative impacts
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Minimizing negative impacts of Artificial Intelligence (AI) systems on human societies without human supervision requires them to be able to align with human values. However, most current work only addresses this issue from a technical point of view, e.g., improving current methods relying on reinforcement learning from human feedback, neglecting what it means and is required for alignment to occur. Here, we propose to distinguish strong and weak value alignment. Strong alignment requires cognitive abilities (either human-like or different from humans) such as understanding and reasoning about agents’ intentions and their ability to causally produce desired effects. We argue that this is required for AI systems like large language models (LLMs) to be able to recognize situations presenting a risk that human values may be flouted. To illustrate this distinction, we present a series of prompts showing ChatGPT’s, Gemini’s and Copilot’s failures to recognize some of these situations. We moreover analyze word embeddings to show that the nearest neighbors of some human values in LLMs differ from humans’ semantic representations. We then propose a new thought experiment that we call “the Chinese room with a word transition dictionary”, in extension of John Searle’s famous proposal. We finally mention current promising research directions towards a weak alignment, which could produce statistically satisfying answers in a number of common situations, however so far without ensuring any truth value.

[AI-81] Batching BPE Tokenization Merges

链接: https://arxiv.org/abs/2408.04653
作者: Alexander P. Morgan
关键词-EN: Byte Pair Encoding, Pair Encoding algorithm, Byte Pair, Pair Encoding, Encoding algorithm
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 8 pages, 5 figures, 1 code block

点击查看摘要

Abstract:The Byte Pair Encoding algorithm can be safely batched to merge hundreds of pairs of tokens at a time when building up a tokenizer’s vocabulary. This technique, combined with reducing the memory footprint of text used in vocabulary training, makes it feasible to train a high quality tokenizer on a basic laptop. This paper presents BatchBPE, an open-source pure Python implementation of these concepts, with the goal of making experimenting with new tokenization strategies more accessible especially in compute- and memory-constrained contexts. BatchBPE’s usefulness and malleability are demonstrated through the training of several token vocabularies to explore the batch merging process and experiment with preprocessing a stop word list and ignoring the least common text chunks in a dataset. Resultant encoded lengths of texts are used as a basic evaluation metric.
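The batching idea — merging many frequent pairs per vocabulary-building pass instead of one — can be sketched like this. It is a simplification: BatchBPE's actual safeguards against interacting merges may differ.

```python
from collections import Counter

def batch_merge(tokens, k=2):
    """One batched BPE step: merge the k most frequent adjacent pairs at once.

    Pairs that share a symbol with an already-selected pair are skipped,
    so all selected merges can be applied independently in a single pass.
    """
    pairs = Counter(zip(tokens, tokens[1:]))
    selected, used = [], set()
    for pair, _ in pairs.most_common():
        if used.isdisjoint(pair):
            selected.append(pair)
            used.update(pair)
        if len(selected) == k:
            break
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in selected:
            merged.append(tokens[i] + tokens[i + 1])  # apply the merge
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged, selected

tokens = list("banana_bandana")
out, merges = batch_merge(tokens, k=2)
```

A classic BPE trainer would rescan pair counts after every single merge; batching amortizes that rescan over k merges per pass, which is where the laptop-scale speedup comes from.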

[AI-82] Leveraging Large Language Models with Chain-of-Thought and Prompt Engineering for Traffic Crash Severity Analysis and Inference

链接: https://arxiv.org/abs/2408.04652
作者: Hao Zhen,Yucheng Shi,Yongcan Huang,Jidong J. Yang,Ninghao Liu
关键词-EN: Large Language Models, Large Language, crash severity inference, Harnessing the power, power of Large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 20 pages, 12 figures, 3 tables

点击查看摘要

Abstract:Harnessing the power of Large Language Models (LLMs), this study explores the use of three state-of-the-art LLMs, specifically GPT-3.5-turbo, LLaMA3-8B, and LLaMA3-70B, for crash severity inference, framing it as a classification task. We generate textual narratives from original traffic crash tabular data using a pre-built template infused with domain knowledge. Additionally, we incorporated Chain-of-Thought (CoT) reasoning to guide the LLMs in analyzing the crash causes and then inferring the severity. This study also examines the impact of prompt engineering specifically designed for crash severity inference. The LLMs were tasked with crash severity inference to: (1) evaluate the models’ capabilities in crash severity analysis, (2) assess the effectiveness of CoT and domain-informed prompt engineering, and (3) examine the reasoning abilities with the CoT framework. Our results showed that LLaMA3-70B consistently outperformed the other models, particularly in zero-shot settings. The CoT and Prompt Engineering techniques significantly enhanced performance, improving logical reasoning and addressing alignment issues. Notably, the CoT offers valuable insights into LLMs’ reasoning processes, unleashing their capacity to consider diverse factors such as environmental conditions, driver behavior, and vehicle characteristics in severity analysis and inference.
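The narrative-plus-CoT prompting the abstract describes can be sketched as a template. The field names, wording, and severity classes below are invented for illustration and do not reproduce the paper's actual template.

```python
def crash_narrative(rec):
    """Render one tabular crash record as a text narrative (hypothetical schema)."""
    return (f"A {rec['vehicle']} crashed on a {rec['surface']} road "
            f"in {rec['weather']} weather at about {rec['speed']} mph.")

COT_TEMPLATE = """{narrative}
Let's reason step by step before answering:
1. Which environmental conditions likely contributed to the crash?
2. What do the vehicle type and speed imply about injury risk?
3. Based on these factors, classify the severity as one of:
   [no injury, possible injury, non-incapacitating, incapacitating, fatal].
Severity:"""

record = {"vehicle": "motorcycle", "surface": "wet",
          "weather": "rainy", "speed": 65}
prompt = COT_TEMPLATE.format(narrative=crash_narrative(record))
```

The numbered intermediate questions are what makes this a CoT prompt: the model is asked to surface the contributing factors before committing to a severity label.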

[AI-83] Knowledge AI: Fine-tuning NLP Models for Facilitating Scientific Knowledge Extraction and Understanding

链接: https://arxiv.org/abs/2408.04651
作者: Balaji Muralidharan,Hayden Beadles,Reza Marzban,Kalyan Sashank Mupparaju
关键词-EN: Large Language Models, deep learning framework, efficacy of Large, Large Language, Natural Language Processing
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 11 pages

点击查看摘要

Abstract:This project investigates the efficacy of Large Language Models (LLMs) in understanding and extracting scientific knowledge across specific domains and to create a deep learning framework: Knowledge AI. As a part of this framework, we employ pre-trained models and fine-tune them on datasets in the scientific domain. The models are adapted for four key Natural Language Processing (NLP) tasks: summarization, text generation, question answering, and named entity recognition. Our results indicate that domain-specific fine-tuning significantly enhances model performance in each of these tasks, thereby improving their applicability for scientific contexts. This adaptation enables non-experts to efficiently query and extract information within targeted scientific fields, demonstrating the potential of fine-tuned LLMs as a tool for knowledge discovery in the sciences.

[AI-84] Building Trust in Mental Health Chatbots: Safety Metrics and LLM-Based Evaluation Tools

链接: https://arxiv.org/abs/2408.04650
作者: Jung In Park,Mahyar Abbasian,Iman Azimi,Dawn Bounds,Angela Jun,Jaesu Han,Robert McCarron,Jessica Borelli,Jia Li,Mona Mahmoudi,Carmen Wiedenhoeft,Amir Rahmani
关键词-EN: increasingly popular due, mental health chatbots, mental health, human-like interactions, aims to develop
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Objective: This study aims to develop and validate an evaluation framework to ensure the safety and reliability of mental health chatbots, which are increasingly popular due to their accessibility, human-like interactions, and context-aware support. Materials and Methods: We created an evaluation framework with 100 benchmark questions and ideal responses, and five guideline questions for chatbot responses. This framework, validated by mental health experts, was tested on a GPT-3.5-turbo-based chatbot. Automated evaluation methods explored included large language model (LLM)-based scoring, an agentic approach using real-time data, and embedding models to compare chatbot responses against ground truth standards. Results: The results highlight the importance of guidelines and ground truth for improving LLM evaluation accuracy. The agentic method, dynamically accessing reliable information, demonstrated the best alignment with human assessments. Adherence to a standardized, expert-validated framework significantly enhanced chatbot response safety and reliability. Discussion: Our findings emphasize the need for comprehensive, expert-tailored safety evaluation metrics for mental health chatbots. While LLMs have significant potential, careful implementation is necessary to mitigate risks. The superior performance of the agentic approach underscores the importance of real-time data access in enhancing chatbot reliability. Conclusion: The study validated an evaluation framework for mental health chatbots, proving its effectiveness in improving safety and reliability. Future work should extend evaluations to accuracy, bias, empathy, and privacy to ensure holistic assessment and responsible integration into healthcare. Standardized evaluations will build trust among users and professionals, facilitating broader adoption and improved mental health support through technology.
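The embedding-based comparison mentioned above — scoring a chatbot reply by its similarity to an expert ground-truth response — reduces to a cosine similarity between vectors. The sketch below uses toy bag-of-words vectors where a real evaluator would use a sentence-embedding model, and the responses are invented examples.

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words "embedding"; a real pipeline would call a
    # sentence-embedding model here instead.
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[k] * b.get(k, 0) for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

truth = "please contact a crisis line immediately"
resp_a = "contact a crisis line immediately please"   # safe, on-guideline
resp_b = "try to sleep it off"                        # unsafe deflection

sim_a = cosine(embed(resp_a), embed(truth))  # ~1.0: matches ground truth
sim_b = cosine(embed(resp_b), embed(truth))  # 0.0: no overlap with ground truth
```

A threshold on this score is the simplest way to flag replies that drift from the expert-validated responses, though the study found the agentic, real-time approach aligned better with human judges.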

[AI-85] Chain of Stance: Stance Detection with Large Language Models

链接: https://arxiv.org/abs/2408.04649
作者: Junxia Ma,Changjiang Wang,Hanwen Xing,Dongming Zhao,Yazhou Zhang
关键词-EN: natural language processing, Stance detection, active task, task in natural, aims to identify
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Stance detection is an active task in natural language processing (NLP) that aims to identify the author’s stance towards a particular target within a text. Given the remarkable language understanding capabilities and encyclopedic prior knowledge of large language models (LLMs), how to explore the potential of LLMs in stance detection has received significant attention. Unlike existing LLM-based approaches that focus solely on fine-tuning with large-scale datasets, we propose a new prompting method, called Chain of Stance (CoS). In particular, it positions LLMs as expert stance detectors by decomposing the stance detection process into a series of intermediate, stance-related assertions that culminate in the final judgment. This approach leads to significant improvements in classification performance. We conducted extensive experiments using four SOTA LLMs on the SemEval 2016 dataset, covering the zero-shot and few-shot learning setups. The results indicate that the proposed method achieves state-of-the-art results with an F1 score of 79.84 in the few-shot setting.

[AI-86] PLUGH: A Benchmark for Spatial Understanding and Reasoning in Large Language Models ACL2024

链接: https://arxiv.org/abs/2408.04648
作者: Alexey Tikhonov
关键词-EN: Large Language Models, input texts extracted, Large Language, present PLUGH, Language Models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
*备注: Wordplay Workshop @ ACL 2024

点击查看摘要

Abstract:We present PLUGH (this https URL), a modern benchmark that currently consists of 5 tasks, each with 125 input texts extracted from 48 different games and representing 61 different (non-isomorphic) spatial graphs to assess the abilities of Large Language Models (LLMs) for spatial understanding and reasoning. Our evaluation of API-based and open-sourced LLMs shows that while some commercial LLMs exhibit strong reasoning abilities, open-sourced competitors can demonstrate almost the same level of quality; however, all models still have significant room for improvement. We identify typical reasons for LLM failures and discuss possible ways to deal with them. Datasets and evaluation code are released (this https URL).

[AI-87] Evaluating the Impact of Advanced LLM Techniques on AI-Lecture Tutors for a Robotics Course ECAI-2024

链接: https://arxiv.org/abs/2408.04645
作者: Sebastian Kahl,Felix Löffler,Martin Maciol,Fabian Ridder,Marius Schmitz,Jennifer Spanagel,Jens Wienkamp,Christopher Burgahn,Malte Schilling
关键词-EN: Artificial Intelligence-based tutor, Large Language Models, Large Language, Artificial Intelligence-based, Intelligence-based tutor
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Robotics (cs.RO)
*备注: The article is an extended version of a paper presented at the International Workshop on AI in Education and Educational Research (AIEER) at ECAI-2024 (27th European Conference on Artificial Intelligence)

点击查看摘要

Abstract:This study evaluates the performance of Large Language Models (LLMs) as an Artificial Intelligence-based tutor for a university course. In particular, different advanced techniques are utilized, such as prompt engineering, Retrieval-Augmented-Generation (RAG), and fine-tuning. We assessed the different models and applied techniques using common similarity metrics like BLEU-4, ROUGE, and BERTScore, complemented by a small human evaluation of helpfulness and trustworthiness. Our findings indicate that RAG combined with prompt engineering significantly enhances model responses and produces better factual answers. In the context of education, RAG appears as an ideal technique as it is based on enriching the input of the model with additional information and material which usually is already present for a university course. Fine-tuning, on the other hand, can produce quite small, still strong expert models, but poses the danger of overfitting. Our study further asks how we should measure the performance of LLMs, and how well current measurements represent correctness or relevance. We find high correlation on similarity metrics and a bias of most of these metrics towards shorter responses. Overall, our research points to both the potential and challenges of integrating LLMs in educational settings, suggesting a need for balanced training approaches and advanced evaluation frameworks.

[AI-88] GPT-3 Powered Information Extraction for Building Robust Knowledge Bases

链接: https://arxiv.org/abs/2408.04641
作者: Ritabrata Roy Choudhury,Soumik Dey
关键词-EN: language model, information extraction, knowledge base development, information, suggested
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This work uses the state-of-the-art language model GPT-3 to offer a novel method of information extraction for knowledge base development. The suggested method attempts to solve the difficulties associated with obtaining relevant entities and relationships from unstructured text in order to extract structured information. We conduct experiments on a huge corpus of text from diverse fields to assess the performance of our suggested technique. The evaluation measures, which are frequently employed in information extraction tasks, include precision, recall, and F1-score. The findings demonstrate that GPT-3 can be used to efficiently and accurately extract pertinent and correct information from text, hence increasing the precision and productivity of knowledge base creation. We also assess how well our suggested approach performs in comparison to the most advanced information extraction techniques already in use. The findings show that by utilizing only a small number of instances in in-context learning, our suggested strategy yields competitive outcomes with notable savings in terms of data annotation and engineering expense. Additionally, we use our proposed method to retrieve Biomedical information, demonstrating its practicality in a real-world setting. All things considered, our suggested method offers a viable way to overcome the difficulties involved in obtaining structured data from unstructured text in order to create knowledge bases. It can greatly increase the precision and effectiveness of information extraction, which is necessary for many applications including chatbots, recommendation engines, and question-answering systems.

[AI-89] Beyond the Eye: A Relational Model for Early Dementia Detection Using Retinal OCTA Images

链接: https://arxiv.org/abs/2408.05117
作者: Shouyue Liu,Jinkui Hao,Yonghuai Liu,Huazhu Fu,Xinyu Guo,Shuting Zhang,Yitian Zhao
关键词-EN: mild cognitive impairment, enable timely intervention, Alzheimer disease, cognitive impairment, mild cognitive
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Early detection of dementia, such as Alzheimer’s disease (AD) or mild cognitive impairment (MCI), is essential to enable timely intervention and potential treatment. Accurate detection of AD/MCI is challenging due to the high complexity, cost, and often invasive nature of current diagnostic techniques, which limit their suitability for large-scale population screening. Given the shared embryological origins and physiological characteristics of the retina and brain, retinal imaging is emerging as a potentially rapid and cost-effective alternative for the identification of individuals with or at high risk of AD. In this paper, we present a novel PolarNet+ that uses retinal optical coherence tomography angiography (OCTA) to discriminate early-onset AD (EOAD) and MCI subjects from controls. Our method first maps OCTA images from Cartesian coordinates to polar coordinates, allowing approximate sub-region calculation to implement the clinician-friendly early treatment of diabetic retinopathy study (ETDRS) grid analysis. We then introduce a multi-view module to serialize and analyze the images along three dimensions for comprehensive, clinically useful information extraction. Finally, we abstract the sequence embedding into a graph, transforming the detection task into a general graph classification problem. A regional relationship module is applied after the multi-view module to excavate the relationship between the sub-regions. Such regional relationship analyses validate known eye-brain links and reveal new discriminative patterns.
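The Cartesian-to-polar mapping step that enables the sub-region (ETDRS-grid-style) analysis can be sketched with plain NumPy nearest-neighbour resampling. The grid sizes are arbitrary choices for illustration; the paper's actual implementation details may differ.

```python
import numpy as np

def to_polar(img, n_r=32, n_theta=64):
    """Resample a square image onto an (r, theta) grid around its centre."""
    h, w = img.shape
    cy, cx = (h - 1) / 2, (w - 1) / 2
    rs = np.linspace(0, min(cy, cx), n_r)
    ts = np.linspace(0, 2 * np.pi, n_theta, endpoint=False)
    r, t = np.meshgrid(rs, ts, indexing="ij")
    # Nearest-neighbour lookup back into Cartesian coordinates.
    ys = np.clip(np.round(cy + r * np.sin(t)).astype(int), 0, h - 1)
    xs = np.clip(np.round(cx + r * np.cos(t)).astype(int), 0, w - 1)
    return img[ys, xs]

img = np.zeros((65, 65))
img[32, 32] = 1.0            # a bright pixel at the fovea-like centre
polar = to_polar(img)        # row 0 (r = 0) samples the centre for every theta
```

In polar space, each row is a ring at fixed radius, so ring- and sector-based sub-region statistics become simple row/column slices of the resampled image.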

[AI-90] CROCODILE: Causality aids RObustness via COntrastive DIsentangled LEarning MICCAI2024

链接: https://arxiv.org/abs/2408.04949
作者: Gianluca Carloni,Sotirios A Tsaftaris,Sara Colantonio
关键词-EN: classifiers perform poorly, image classifiers perform, perform poorly, poorly when applied, classifiers perform
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: MICCAI 2024 UNSURE Workshop, Accepted for presentation, Submitted Manuscript Version, 10 pages

点击查看摘要

Abstract:Due to domain shift, deep learning image classifiers perform poorly when applied to a domain different from the training one. For instance, a classifier trained on chest X-ray (CXR) images from one hospital may not generalize to images from another hospital due to variations in scanner settings or patient characteristics. In this paper, we introduce our CROCODILE framework, showing how tools from causality can foster a model’s robustness to domain shift via feature disentanglement, contrastive learning losses, and the injection of prior knowledge. This way, the model relies less on spurious correlations, learns the mechanism bringing from images to prediction better, and outperforms baselines on out-of-distribution (OOD) data. We apply our method to multi-label lung disease classification from CXRs, utilizing over 750000 images from four datasets. Our bias-mitigation method improves domain generalization and fairness, broadening the applicability and reliability of deep learning models for a safer medical image analysis. Find our code at: this https URL.

[AI-91] Survey: Transformer-based Models in Data Modality Conversion

链接: https://arxiv.org/abs/2408.04723
作者: Elyas Rashno,Amir Eskandari,Aman Anand,Farhana Zulkernine
关键词-EN: natural language processing, including natural language, artificial intelligence domains, made significant strides, language processing
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Signal Processing (eess.SP)
*备注: Submitted to ACM Computing Surveys (CSUR)

点击查看摘要

Abstract:Transformers have made significant strides across various artificial intelligence domains, including natural language processing, computer vision, and audio processing. This success has naturally garnered considerable interest from both academic and industry researchers. Consequently, numerous Transformer variants (often referred to as X-formers) have been developed for these fields. However, a thorough and systematic review of these modality-specific conversions remains lacking. Modality Conversion involves the transformation of data from one form of representation to another, mimicking the way humans integrate and interpret sensory information. This paper provides a comprehensive review of transformer-based models applied to the primary modalities of text, vision, and speech, discussing their architectures, conversion methodologies, and applications. By synthesizing the literature on modality conversion, this survey aims to underline the versatility and scalability of transformers in advancing AI-driven content generation and understanding.

计算机视觉

[CV-0] VITA: Towards Open-Source Interactive Omni Multimodal LLM

链接: https://arxiv.org/abs/2408.05211
作者: Chaoyou Fu,Haojia Lin,Zuwei Long,Yunhang Shen,Meng Zhao,Yifan Zhang,Xiong Wang,Di Yin,Long Ma,Xiawu Zheng,Ran He,Rongrong Ji,Yunsheng Wu,Caifeng Shan,Xing Sun
关键词-EN: models rarely excel, Large Language Model, open-source models rarely, underscore their necessity, practical applications
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: Project Page: this https URL

点击查看摘要

Abstract:The remarkable multimodal capabilities and interactive experience of GPT-4o underscore their necessity in practical applications, yet open-source models rarely excel in both areas. In this paper, we introduce VITA, the first-ever open-source Multimodal Large Language Model (MLLM) adept at simultaneous processing and analysis of Video, Image, Text, and Audio modalities, and meanwhile has an advanced multimodal interactive experience. Starting from Mixtral 8x7B as a language foundation, we expand its Chinese vocabulary followed by bilingual instruction tuning. We further endow the language model with visual and audio capabilities through two-stage multi-task learning of multimodal alignment and instruction tuning. VITA demonstrates robust foundational capabilities of multilingual, vision, and audio understanding, as evidenced by its strong performance across a range of both unimodal and multimodal benchmarks. Beyond foundational capabilities, we have made considerable progress in enhancing the natural multimodal human-computer interaction experience. To the best of our knowledge, we are the first to exploit non-awakening interaction and audio interrupt in MLLM. VITA is the first step for the open-source community to explore the seamless integration of multimodal understanding and interaction. While there is still lots of work to be done on VITA to get close to close-source counterparts, we hope that its role as a pioneer can serve as a cornerstone for subsequent research. Project Page: this https URL.

[CV-1] Multi-Garment Customized Model Generation

链接: https://arxiv.org/abs/2408.05206
作者: Yichen Liu,Penghui Du,Yi Liu,Quanwei Zhang
关键词-EN: Latent Diffusion Models, Latent Diffusion, Customized Model Generation, introduces Multi-Garment Customized, based on Latent
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This paper introduces Multi-Garment Customized Model Generation, a unified framework based on Latent Diffusion Models (LDMs) aimed at addressing the unexplored task of synthesizing images with free combinations of multiple pieces of clothing. The method focuses on generating customized models wearing various targeted outfits according to different text prompts. The primary challenge lies in maintaining the natural appearance of the dressed model while preserving the complex textures of each piece of clothing, ensuring that the information from different garments does not interfere with each other. To tackle these challenges, we first developed a garment encoder, which is a trainable UNet copy with shared weights, capable of extracting detailed features of garments in parallel. Secondly, our framework supports the conditional generation of multiple garments through decoupled multi-garment feature fusion, allowing multiple clothing features to be injected into the backbone network, significantly alleviating conflicts between garment information. Additionally, the proposed garment encoder is a plug-and-play module that can be combined with other extension modules such as IP-Adapter and ControlNet, enhancing the diversity and controllability of the generated models. Extensive experiments demonstrate the superiority of our approach over existing alternatives, opening up new avenues for the task of generating images with multiple-piece clothing combinations.

[CV-2] Kalman-Inspired Feature Propagation for Video Face Super-Resolution ECCV2024

链接: https://arxiv.org/abs/2408.05205
作者: Ruicheng Feng,Chongyi Li,Chen Change Loy
关键词-EN: face image super-resolution, remains relatively under-explored, face super-resolution remains, image super-resolution, promising progress
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by ECCV 2024. Project page: this https URL

点击查看摘要

Abstract:Despite the promising progress of face image super-resolution, video face super-resolution remains relatively under-explored. Existing approaches either adapt general video super-resolution networks to face datasets or apply established face image super-resolution models independently on individual video frames. These paradigms encounter challenges either in reconstructing facial details or maintaining temporal consistency. To address these issues, we introduce a novel framework called Kalman-inspired Feature Propagation (KEEP), designed to maintain a stable face prior over time. The Kalman filtering principles offer our method a recurrent ability to use the information from previously restored frames to guide and regulate the restoration process of the current frame. Extensive experiments demonstrate the effectiveness of our method in capturing facial details consistently across video frames. Code and video demo are available at this https URL.
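The Kalman filtering principle the abstract invokes can be pictured as a recursive fusion of the propagated face prior with the current frame's estimate. A scalar sketch (KEEP operates on deep feature maps, so this simplified form and its variable names are illustrative only):

```python
def kalman_fuse(prior, prior_var, obs, obs_var):
    """One Kalman-style update step (scalar sketch).

    Fuses the prior propagated from previously restored frames with the
    current frame's observation; the gain weighs them by uncertainty.
    """
    gain = prior_var / (prior_var + obs_var)   # Kalman gain
    post = prior + gain * (obs - prior)        # fused estimate
    post_var = (1.0 - gain) * prior_var        # uncertainty shrinks
    return post, post_var

# running the update over per-frame restorations stabilizes the estimate
state, var = 0.0, 1.0
for frame_obs in [0.8, 1.1, 0.9]:
    state, var = kalman_fuse(state, var, frame_obs, obs_var=0.5)
```

The recurrence is what lets information from earlier restored frames regulate the current one, which is the temporal-consistency mechanism described above.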

[CV-3] Cross-Domain Learning for Video Anomaly Detection with Limited Supervision

链接: https://arxiv.org/abs/2408.05191
作者: Yashika Jain,Ali Dabouei,Min Xu
关键词-EN: Video Anomaly Detection, Anomaly Detection, Video Anomaly, surveillance videos, automates the identification
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Video Anomaly Detection (VAD) automates the identification of unusual events, such as security threats in surveillance videos. In real-world applications, VAD models must effectively operate in cross-domain settings, identifying rare anomalies and scenarios not well-represented in the training data. However, existing cross-domain VAD methods focus on unsupervised learning, resulting in performance that falls short of real-world expectations. Since acquiring weak supervision, i.e., video-level labels, for the source domain is cost-effective, we conjecture that combining it with external unlabeled data has notable potential to enhance cross-domain performance. To this end, we introduce a novel weakly-supervised framework for Cross-Domain Learning (CDL) in VAD that incorporates external data during training by estimating its prediction bias and adaptively minimizing that using the predicted uncertainty. We demonstrate the effectiveness of the proposed CDL framework through comprehensive experiments conducted in various configurations on two large-scale VAD datasets: UCF-Crime and XD-Violence. Our method significantly surpasses the state-of-the-art works in cross-domain evaluations, achieving an average absolute improvement of 19.6% on UCF-Crime and 12.87% on XD-Violence.
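"Adaptively minimizing the prediction bias using the predicted uncertainty" suggests down-weighting external samples the model is unsure about. A hedged sketch of one such weighting (the exact rule in the paper may differ):

```python
def uncertainty_weighted_mean(losses, uncertainties):
    """Average per-sample losses, down-weighting uncertain predictions.

    One plausible reading of the adaptive minimization described above;
    the weighting function here is an illustrative choice.
    """
    weights = [1.0 / (1.0 + u) for u in uncertainties]
    return sum(w * l for w, l in zip(weights, losses)) / sum(weights)
```

With equal uncertainties this reduces to a plain mean; raising the uncertainty of a sample shrinks its contribution to the external-data loss.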

[CV-4] Weak-Annotation of HAR Datasets using Vision Foundation Models ISWC’24

链接: https://arxiv.org/abs/2408.05169
作者: Marius Bock,Kristof Van Laerhoven,Michael Moeller
关键词-EN: Human Activity Recognition, Activity Recognition, time-consuming task requiring, dedicate substantial time, task requiring researchers
类目: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
*备注: 8 pages, 3 figures, accepted at ISWC’24: International Symposium on Wearable Computers, Oct, 2024

点击查看摘要

Abstract:As wearable-based data annotation remains, to date, a tedious, time-consuming task requiring researchers to dedicate substantial time, benchmark datasets within the field of Human Activity Recognition lack richness and size compared to datasets available within related fields. Recently, vision foundation models such as CLIP have gained significant attention, helping the vision community advance in finding robust, generalizable feature representations. With the majority of researchers within the wearable community relying on vision modalities to overcome the limited expressiveness of wearable data and accurately label their to-be-released benchmark datasets offline, we propose a novel, clustering-based annotation pipeline to significantly reduce the amount of data that needs to be annotated by a human annotator. We show that using our approach, the annotation of centroid clips suffices to achieve average labelling accuracies close to 90% across three publicly available HAR benchmark datasets. Using the weakly annotated datasets, we further demonstrate that we can match the accuracy scores of fully-supervised deep learning classifiers across all three benchmark datasets. Code as well as supplementary figures and results are publicly downloadable via this http URL.

[CV-5] EasyInv: Toward Fast and Better DDIM Inversion

链接: https://arxiv.org/abs/2408.05159
作者: Ziyue Zhang,Mingbao Lin,Shuicheng Yan,Rongrong Ji
关键词-EN: DDIM Inversion approach, paper introduces EasyInv, DDIM Inversion, iterative optimization methods, initial latent state
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 8 pages

点击查看摘要

Abstract:This paper introduces EasyInv, an easy yet novel approach that significantly advances the field of DDIM Inversion by addressing the inherent inefficiencies and performance limitations of traditional iterative optimization methods. At the core of our EasyInv is a refined strategy for approximating inversion noise, which is pivotal for enhancing the accuracy and reliability of the inversion process. By prioritizing the initial latent state, which encapsulates rich information about the original images, EasyInv steers clear of the iterative refinement of noise items. Instead, we introduce a methodical aggregation of the latent state from the preceding time step with the current state, effectively increasing the influence of the initial latent state and mitigating the impact of noise. We illustrate that EasyInv is capable of delivering results that are either on par with or exceed those of the conventional DDIM Inversion approach, especially under conditions where the model’s precision is limited or computational resources are scarce. Concurrently, our EasyInv offers an approximately threefold improvement in inference efficiency over off-the-shelf iterative optimization techniques.
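The latent aggregation described above can be pictured as a convex blend that keeps pulling the trajectory back toward the information-rich initial latent. A hedged sketch; `beta` and the fixed-blend rule are illustrative assumptions, not the paper's exact update:

```python
import numpy as np

def easyinv_step(z_init, z_curr, beta=0.7):
    """Aggregate the current latent with the initial latent state.

    Illustrative sketch of the aggregation idea: a fixed convex blend
    that strengthens the initial state and damps accumulated noise.
    """
    return beta * z_init + (1.0 - beta) * z_curr

z = easyinv_step(np.ones(4), np.zeros(4))   # pulled toward the initial latent
```

Because each step reuses the initial latent rather than refining a noise estimate, no inner optimization loop is needed, which is where the efficiency gain comes from.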

[CV-6] Modeling Electromagnetic Signal Injection Attacks on Camera-based Smart Systems: Applications and Mitigation

链接: https://arxiv.org/abs/2408.05124
作者: Youqian Zhang,Michael Cheung,Chunxi Yang,Xinwei Zhai,Zitong Shen,Xinyu Ji,Eugene Y. Fu,Sze-Yiu Chau,Xiapu Luo
关键词-EN: allowing artificial intelligence, security-critical systems depend, make important decisions, Numerous safety, perceive their surroundings
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
*备注: 13 pages, 10 figures, 4 tables

点击查看摘要

Abstract:Numerous safety- or security-critical systems depend on cameras to perceive their surroundings, further allowing artificial intelligence (AI) to analyze the captured images to make important decisions. However, a concerning attack vector has emerged, namely, electromagnetic waves, which pose a threat to the integrity of these systems. Such attacks enable attackers to manipulate the images remotely, leading to incorrect AI decisions, e.g., autonomous vehicles failing to detect obstacles ahead, resulting in collisions. The lack of understanding regarding how different systems react to such attacks poses a significant security risk. Furthermore, no effective solutions have been demonstrated to mitigate this threat. To address these gaps, we modeled the attacks and developed a simulation method for generating adversarial images. Through rigorous analysis, we confirmed that the effects of the simulated adversarial images are indistinguishable from those from real attacks. This method enables researchers and engineers to rapidly assess the susceptibility of various AI vision applications to these attacks, without the need for constructing complicated attack devices. In our experiments, most of the models demonstrated vulnerabilities to these attacks, emphasizing the need to enhance their robustness. Fortunately, our modeling and simulation method serves as a stepping stone toward developing more resilient models. We present a pilot study on adversarial training to improve their robustness against attacks, and our results demonstrate a significant improvement by recovering up to 91% performance, offering a promising direction for mitigating this threat.

[CV-7] PriPHiT: Privacy-Preserving Hierarchical Training of Deep Neural Networks

链接: https://arxiv.org/abs/2408.05092
作者: Yamin Sepehri,Pedram Pad,Pascal Frossard,L. Andrea Dunbar
关键词-EN: neural networks requires, networks requires substantial, requires substantial resources, deep neural networks, training phase
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注: 16 pages, 16 figures, 6 tables

点击查看摘要

Abstract:The training phase of deep neural networks requires substantial resources and as such is often performed on cloud servers. However, this raises privacy concerns when the training dataset contains sensitive content, e.g., face images. In this work, we propose a method to perform the training phase of a deep learning model on both an edge device and a cloud server that prevents sensitive content being transmitted to the cloud while retaining the desired information. The proposed privacy-preserving method uses adversarial early exits to suppress the sensitive content at the edge and transmits the task-relevant information to the cloud. This approach incorporates noise addition during the training phase to provide a differential privacy guarantee. We extensively test our method on different facial datasets with diverse face attributes using various deep learning architectures, showcasing its outstanding performance. We also demonstrate the effectiveness of privacy preservation through successful defenses against different white-box and deep reconstruction attacks.
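The differential-privacy guarantee mentioned above is typically obtained by clipping each feature (or gradient) vector's norm and adding Gaussian noise. A standard-recipe sketch; all parameter values and names here are illustrative, not the paper's:

```python
import numpy as np

def clip_and_noise(features, clip_norm=1.0, sigma=1.0, rng=None):
    """Norm-clip a vector, then add calibrated Gaussian noise.

    The usual mechanism behind a (epsilon, delta)-DP guarantee; the
    clip bound and noise multiplier are illustrative placeholders.
    """
    rng = rng or np.random.default_rng()
    norm = np.linalg.norm(features)
    clipped = features * min(1.0, clip_norm / max(norm, 1e-12))
    return clipped + rng.normal(0.0, sigma * clip_norm, size=features.shape)
```

Clipping bounds each sample's influence (the sensitivity), which is what lets a fixed noise level translate into a formal privacy guarantee.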

[CV-8] Loc4Plan: Locating Before Planning for Outdoor Vision and Language Navigation

链接: https://arxiv.org/abs/2408.05090
作者: Huilin Tian,Jingke Meng,Wei-Shi Zheng,Yuan-Ming Li,Junkai Yan,Yunong Zhang
关键词-EN: outdoor VLN, agent spatial position, outdoor VLN tasks, Vision and Language, understand instructions
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
*备注: arXiv admin note: text overlap with arXiv:2203.13838 by other authors

点击查看摘要

Abstract:Vision and Language Navigation (VLN) is a challenging task that requires agents to understand instructions and navigate to the destination in a visual environment. One of the key challenges in outdoor VLN is keeping track of which part of the instruction was completed. To alleviate this problem, previous works mainly focus on grounding the natural language to the visual input, but neglecting the crucial role of the agent’s spatial position information in the grounding process. In this work, we first explore the substantial effect of spatial position locating on the grounding of outdoor VLN, drawing inspiration from human navigation. In real-world navigation scenarios, before planning a path to the destination, humans typically need to figure out their current location. This observation underscores the pivotal role of spatial localization in the navigation process. In this work, we introduce a novel framework, Locating before Planning (Loc4Plan), designed to incorporate spatial perception for action planning in outdoor VLN tasks. The main idea behind Loc4Plan is to perform the spatial localization before planning a decision action based on corresponding guidance, which comprises a block-aware spatial locating (BAL) module and a spatial-aware action planning (SAP) module. Specifically, to help the agent perceive its spatial location in the environment, we propose to learn a position predictor that measures how far the agent is from the next intersection for reflecting its position, which is achieved by the BAL module. After the locating process, we propose the SAP module to incorporate spatial information to ground the corresponding guidance and enhance the precision of action planning. Extensive experiments on the Touchdown and map2seq datasets show that the proposed Loc4Plan outperforms the SOTA methods.

[CV-9] UNIC: Universal Classification Models via Multi-teacher Distillation ECCV2024

链接: https://arxiv.org/abs/2408.05088
作者: Mert Bulent Sariyildiz,Philippe Weinzaepfel,Thomas Lucas,Diane Larlus,Yannis Kalantidis
关键词-EN: offer strong results, Pretrained models, complementary pretrained models, commodity and offer, broad range
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: To be presented at ECCV 2024

点击查看摘要

Abstract:Pretrained models have become a commodity and offer strong results on a broad range of tasks. In this work, we focus on classification and seek to learn a unique encoder able to draw from several complementary pretrained models. We aim at even stronger generalization across a variety of classification tasks. We propose to learn such an encoder via multi-teacher distillation. We first thoroughly analyse standard distillation when driven by multiple strong teachers with complementary strengths. Guided by this analysis, we gradually propose improvements to the basic distillation setup. Among those, we enrich the architecture of the encoder with a ladder of expendable projectors, which increases the impact of intermediate features during distillation, and we introduce teacher dropping, a regularization mechanism that better balances the teachers’ influence. Our final distillation strategy leads to student models of the same capacity as any of the teachers, while retaining or improving upon the performance of the best teacher for each task. Project page and code: this https URL
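The "teacher dropping" regularizer described above can be pictured as a random mask over per-teacher distillation losses. A hedged sketch: independent dropping and a plain L2 feature-matching loss are assumptions here, and the paper's balancing rule is more careful:

```python
import random

def multi_teacher_loss(student, teachers, drop_p=0.5, rng=random):
    """Feature-matching loss averaged over a random subset of teachers.

    Sketch of multi-teacher distillation with teacher dropping; the
    masking rule and loss form are illustrative simplifications.
    """
    kept = [t for t in teachers if rng.random() >= drop_p]
    if not kept:                          # never drop every teacher
        kept = [rng.choice(teachers)]
    per_teacher = [sum((s - t) ** 2 for s, t in zip(student, feat))
                   for feat in kept]
    return sum(per_teacher) / len(kept)
```

Randomly silencing teachers prevents any single strong teacher from dominating the student's features, which is the balancing effect the abstract credits to teacher dropping.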

[CV-10] PreciseControl: Enhancing Text-To-Image Diffusion Models with Fine-Grained Attribute Control ECCV2024

链接: https://arxiv.org/abs/2408.05083
作者: Rishubh Parihar,Sachidanand VS,Sabariswaran Mani,Tejan Karmali,R. Venkatesh Babu
关键词-EN: fine-grained attribute editing, attribute editing, Recently, attribute, fine-grained attribute
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: ECCV 2024, Project page: this https URL

点击查看摘要

Abstract:Recently, we have seen a surge of personalization methods for text-to-image (T2I) diffusion models to learn a concept using a few images. Existing approaches, when used for face personalization, suffer to achieve convincing inversion with identity preservation and rely on semantic text-based editing of the generated face. However, a more fine-grained control is desired for facial attribute editing, which is challenging to achieve solely with text prompts. In contrast, StyleGAN models learn a rich face prior and enable smooth control towards fine-grained attribute editing by latent manipulation. This work uses the disentangled W+ space of StyleGANs to condition the T2I model. This approach allows us to precisely manipulate facial attributes, such as smoothly introducing a smile, while preserving the existing coarse text-based control inherent in T2I models. To enable conditioning of the T2I model on the W+ space, we train a latent mapper to translate latent codes from W+ to the token embedding space of the T2I model. The proposed approach excels in the precise inversion of face images with attribute preservation and facilitates continuous control for fine-grained attribute editing. Furthermore, our approach can be readily extended to generate compositions involving multiple individuals. We perform extensive experiments to validate our method for face personalization and fine-grained attribute editing.

[CV-11] DeepInteraction: Multi-Modality Interaction for Autonomous Driving NEURIPS2022

链接: https://arxiv.org/abs/2408.05075
作者: Zeyu Yang,Nan Song,Wei Li,Xiatian Zhu,Li Zhang,Philip H.S. Torr
关键词-EN: reliable scene understanding, Existing top-performance autonomous, systems typically rely, Existing top-performance, driving systems typically
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Journal extension of NeurIPS 2022. arXiv admin note: text overlap with arXiv:2208.11112

点击查看摘要

Abstract:Existing top-performance autonomous driving systems typically rely on the multi-modal fusion strategy for reliable scene understanding. This design is however fundamentally restricted due to overlooking the modality-specific strengths and finally hampering the model performance. To address this limitation, in this work, we introduce a novel modality interaction strategy that allows individual per-modality representations to be learned and maintained throughout, enabling their unique characteristics to be exploited during the whole perception pipeline. To demonstrate the effectiveness of the proposed strategy, we design DeepInteraction++, a multi-modal interaction framework characterized by a multi-modal representational interaction encoder and a multi-modal predictive interaction decoder. Specifically, the encoder is implemented as a dual-stream Transformer with specialized attention operation for information exchange and integration between separate modality-specific representations. Our multi-modal representational learning incorporates both object-centric, precise sampling-based feature alignment and global dense information spreading, essential for the more challenging planning task. The decoder is designed to iteratively refine the predictions by alternately aggregating information from separate representations in a unified modality-agnostic manner, realizing multi-modal predictive interaction. Extensive experiments demonstrate the superior performance of the proposed framework on both 3D object detection and end-to-end autonomous driving tasks. Our code is available at this https URL.

[CV-12] Benchmarking Conventional and Learned Video Codecs with a Low-Delay Configuration

链接: https://arxiv.org/abs/2408.05042
作者: Siyue Teng(1),Yuxuan Jiang(1),Ge Gao(1),Fan Zhang(1),Thomas Davis(2),Zoe Liu(2),David Bull(1) ((1) University of Bristol, (2) Visionular Inc.)
关键词-EN: Random Access mode, Random Access, learning-based video codecs, JVET ECM, significant coding performance
类目: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:Recent advances in video compression have seen significant coding performance improvements with the development of new standards and learning-based video codecs. However, most of these works focus on application scenarios that allow a certain amount of system delay (e.g., Random Access mode in MPEG codecs), which is not always acceptable for live delivery. This paper conducts a comparative study of state-of-the-art conventional and learned video coding methods based on a low delay configuration. Specifically, this study includes two MPEG standard codecs (H.266/VVC VTM and JVET ECM), two AOM codecs (AV1 libaom and AVM), and two recent neural video coding models (DCVC-DC and DCVC-FM). To allow a fair and meaningful comparison, the evaluation was performed on test sequences defined in the AOM and MPEG common test conditions in the YCbCr 4:2:0 color space. The evaluation results show that the JVET ECM codec offers the best overall coding performance among all codecs tested, with a 16.1% (based on PSNR) average BD-rate saving over AOM AVM, and 11.0% over DCVC-FM. We also observed inconsistent performance with the learned video codecs, DCVC-DC and DCVC-FM, for test content with large background motions.
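The BD-rate figures quoted above come from Bjøntegaard's metric. A compact sketch of the usual computation, fitting a cubic in (PSNR, log-rate) for each codec and comparing average log-rate over the shared PSNR range; production tools add range checks and piecewise fits, so treat this as illustrative:

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Bjontegaard-delta rate in percent (negative = test codec saves bits).

    Sketch of the standard method: cubic fit through (PSNR, log rate)
    per codec, then compare integrated log-rate over the overlap.
    """
    la, lt = np.log(rate_anchor), np.log(rate_test)
    pa = np.polyfit(psnr_anchor, la, 3)
    pt = np.polyfit(psnr_test, lt, 3)
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    int_a = np.polyval(np.polyint(pa), hi) - np.polyval(np.polyint(pa), lo)
    int_t = np.polyval(np.polyint(pt), hi) - np.polyval(np.polyint(pt), lo)
    return (np.exp((int_t - int_a) / (hi - lo)) - 1.0) * 100.0
```

A test codec that reaches the same PSNR at half the bitrate comes out at roughly -50%, matching the "BD-rate saving" convention used in the abstract.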

[CV-13] Livestock Fish Larvae Counting using DETR and YOLO based Deep Networks

链接: https://arxiv.org/abs/2408.05032
作者: Daniel Ortega de Carvalho,Luiz Felipe Teodoro Monteiro,Fernanda Marques Bazilio,Gabriel Toshio Hirokawa Higa,Hemerson Pistori
关键词-EN: Counting fish larvae, fish larvae counting, fish larvae, Counting fish, time consuming
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Counting fish larvae is an important yet demanding and time-consuming task in aquaculture. In order to address this problem, in this work, we evaluate four neural network architectures, including convolutional neural networks and transformers, in different sizes, in the task of fish larvae counting. For the evaluation, we present a new annotated image dataset with less data collection requirements than preceding works, with images of spotted sorubim and dourado larvae. By using image tiling techniques, we achieve a MAPE of 4.46% (±4.70) with an extra-large real-time detection transformer, and 4.71% (±4.98) with a medium-sized YOLOv8.
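MAPE, the metric reported above, is straightforward to compute over per-image counts; a sketch with illustrative variable names:

```python
def mape(y_true, y_pred):
    """Mean Absolute Percentage Error in %, averaged over images.

    y_true: ground-truth larvae counts; y_pred: predicted counts.
    """
    return 100.0 * sum(abs(t - p) / t
                       for t, p in zip(y_true, y_pred)) / len(y_true)

# e.g. predicting 95 and 105 larvae for two images that truly contain 100
err = mape([100, 100], [95, 105])
```

Because it is relative to the true count, MAPE stays comparable across images with very different larvae densities.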

[CV-14] Collaborative Static-Dynamic Teaching: A Semi-Supervised Framework for Stripe-Like Space Target Detection

链接: https://arxiv.org/abs/2408.05029
作者: Zijian Zhu,Ali Zia,Xuesong Li,Bingbing Dan,Yuebo Ma,Hongfeng Long,Kaili Lu,Enhai Liu,Rujin Zhao
关键词-EN: space situational awareness, space target detection, Stripe-like space target, space targets scenarios, situational awareness
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Stripe-like space target detection (SSTD) is crucial for space situational awareness. Traditional unsupervised methods often fail in scenarios with low signal-to-noise ratios and variable stripe-like targets, leading to weak generalization. Although fully supervised learning methods improve model generalization, they require extensive pixel-level labels for training. In the SSTD task, manually creating these labels is often inaccurate and labor-intensive. Semi-supervised learning (SSL) methods reduce the need for these labels and enhance model generalizability, but their performance is limited by pseudo-label quality. To address this, we introduce an innovative Collaborative Static-Dynamic Teacher (CSDT) SSL framework, which includes static and dynamic teacher models as well as a student model. This framework employs a customized adaptive pseudo-labeling (APL) strategy, transitioning from initial static teaching to adaptive collaborative teaching, guiding the student model’s training. The exponential moving average (EMA) mechanism further enhances this process by feeding new stripe-like knowledge back to the dynamic teacher model through the student model, creating a positive feedback loop that continuously enhances the quality of pseudo-labels. Moreover, we present MSSA-Net, a novel SSTD network featuring a multi-scale dual-path convolution (MDPC) block and a feature map weighted attention (FMWA) block, designed to extract diverse stripe-like features within the CSDT SSL training framework. Extensive experiments verify the state-of-the-art performance of our framework on the AstroStripeSet and various ground-based and space-based real-world datasets.
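The EMA mechanism named above is the standard exponential-moving-average update of the dynamic teacher's parameters from the student's; a sketch (the momentum value and parameter layout are illustrative):

```python
def ema_update(teacher, student, momentum=0.99):
    """EMA update: slowly pull teacher parameters toward the student's.

    This feeds knowledge learned by the student back into the dynamic
    teacher, the feedback loop described in the abstract.
    """
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher, student)]
```

The high momentum makes the teacher a smoothed, more stable copy of the student, which is why its pseudo-labels are less noisy than the student's own predictions.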

[CV-15] RadarPillars: Efficient Object Detection from 4D Radar Point Clouds ITSC

链接: https://arxiv.org/abs/2408.05020
作者: Alexander Musiat,Laurenz Reichardt,Michael Schulze,Oliver Wasenmüller
关键词-EN: azimuth and Doppler, Automotive radar systems, Automotive radar, Doppler velocity, evolved to provide
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: This paper has been accepted at IEEE Intelligent Transportation Systems Conference (ITSC), 2024

点击查看摘要

Abstract:Automotive radar systems have evolved to provide not only range, azimuth and Doppler velocity, but also elevation data. This additional dimension allows for the representation of 4D radar as a 3D point cloud. As a result, existing deep learning methods for 3D object detection, which were initially developed for LiDAR data, are often applied to these radar point clouds. However, this neglects the special characteristics of 4D radar data, such as the extreme sparsity and the optimal utilization of velocity information. To address these gaps in the state-of-the-art, we present RadarPillars, a pillar-based object detection network. By decomposing radial velocity data, introducing PillarAttention for efficient feature extraction, and studying layer scaling to accommodate radar sparsity, RadarPillars significantly outperform state-of-the-art detection results on the View-of-Delft dataset. Importantly, this comes at a significantly reduced parameter count, surpassing existing methods in terms of efficiency and enabling real-time performance on edge devices.
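"Decomposing radial velocity data" can be read as projecting each point's Doppler measurement onto Cartesian axes using its azimuth; a simple sketch of that reading (RadarPillars' exact per-point features may differ):

```python
import math

def decompose_radial_velocity(v_r, azimuth_rad):
    """Split a radar point's radial (Doppler) velocity into x/y components
    along its measurement direction. Illustrative reading of the
    decomposition named in the abstract.
    """
    return v_r * math.cos(azimuth_rad), v_r * math.sin(azimuth_rad)
```

Expressed this way, velocity becomes a direction-aware feature the pillar network can aggregate, rather than a scalar tied to each point's viewing angle.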

[CV-16] Instruction Tuning-free Visual Token Complement for Multimodal LLMs ECCV2024

链接: https://arxiv.org/abs/2408.05019
作者: Dongsheng Wang,Jiequan Cui,Miaoge Li,Wang Lin,Bo Chen,Hanwang Zhang
关键词-EN: large language models, multimodal LLMs, language models, large language, open community
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by ECCV2024 (20pages)

点击查看摘要

Abstract:As the open community of large language models (LLMs) matures, multimodal LLMs (MLLMs) have promised an elegant bridge between vision and language. However, current research is inherently constrained by challenges such as the need for high-quality instruction pairs and the loss of visual information in image-to-text training objectives. To this end, we propose a Visual Token Complement framework (VTC) that helps MLLMs regain the missing visual features and thus improve response accuracy. Specifically, our VTC integrates text-to-image generation as a guide to identifying the text-irrelevant features, and a visual selector is then developed to generate complementary visual tokens to enrich the original visual input. Moreover, an iterative strategy is further designed to extract more visual information by iteratively using the visual selector without any additional training. Notably, the training pipeline requires no additional image-text pairs, resulting in a desired instruction tuning-free property. Both qualitative and quantitative experiments demonstrate the superiority and efficiency of our VTC.

[CV-17] DreamCouple: Exploring High Quality Text-to-3D Generation Via Rectified Flow

链接: https://arxiv.org/abs/2408.05008
作者: Hangyu Li,Xiangxiang Chu,Dingyuan Shi
关键词-EN: Score Distillation Sampling, Score Distillation, achieved significant success, Distillation Sampling, exploits pretrained
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Tech Report

点击查看摘要

Abstract:The Score Distillation Sampling (SDS), which exploits pretrained text-to-image diffusion models as priors for 3D model training, has achieved significant success. Currently, flow-based diffusion models have become a new trend for generation. Yet, adapting SDS to flow-based diffusion models in 3D generation remains unexplored. Our work aims to bridge this gap. In this paper, we adapt SDS to rectified flow and re-examine the over-smoothing issue under this novel framework. The issue can be explained that the model learns an average of multiple ODE trajectories. Then we propose DreamCouple, which instead of randomly sampling noise, uses a rectified flow model to find the coupled noise. Its Unique Couple Matching (UCM) loss guides the model to learn different trajectories and thus solves the over-smoothing issue. We apply our method to both NeRF and 3D Gaussian splatting and achieve state-of-the-art performances. We also identify some other interesting open questions such as initialization issues for NeRF and faster training convergence. Our code will be released soon.
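Rectified flow, which the abstract adapts SDS to, learns the velocity field of straight-line trajectories between data and noise. A sketch of the interpolation those trajectories follow (helper name is illustrative; the paper's contribution lies in how the noise endpoint is coupled, not in this formula):

```python
def rectified_flow_point(x0, x1, t):
    """Point on the straight path from data x0 to noise x1 at time t in [0, 1].

    Rectified flow trains a model to predict the constant velocity
    (x1 - x0) along these linear trajectories.
    """
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
```

Averaging over many such trajectories for randomly sampled x1 is exactly the behavior the paper links to over-smoothing, which motivates finding a single coupled noise per sample.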

[CV-18] XNN: Paradigm Shift in Mitigating Identity Leakage within Cloud-Enabled Deep Learning

Link: https://arxiv.org/abs/2408.04974
Authors: Kaixin Liu, Huixin Xiong, Bingyu Duan, Zexuan Cheng, Xinyu Zhou, Wanqian Zhang, Xiangyu Zhang
Keywords-EN: cloud-based deep learning, external computational resources, computational resources coexists, acute privacy concerns, deep learning
Subjects: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Click to view abstract

Abstract:In the domain of cloud-based deep learning, the imperative for external computational resources coexists with acute privacy concerns, particularly identity leakage. To address this challenge, we introduce XNN and XNN-d, pioneering methodologies that infuse neural network features with randomized perturbations, striking a harmonious balance between utility and privacy. XNN, designed for the training phase, ingeniously blends random permutation with matrix multiplication techniques to obfuscate feature maps, effectively shielding private data from potential breaches without compromising training integrity. Concurrently, XNN-d, devised for the inference phase, employs adversarial training to integrate generative adversarial noise. This technique effectively counters black-box access attacks aimed at identity extraction, while a distilled face recognition network adeptly processes the perturbed features, ensuring accurate identification. Our evaluation demonstrates XNN’s effectiveness, significantly outperforming existing methods in reducing identity leakage while maintaining a high model accuracy.
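
The core training-phase idea, blending a random permutation with a matrix multiplication to obfuscate features, can be illustrated with a minimal sketch. Everything here is an assumption for illustration: features are a flat list of floats rather than real feature maps, and the mixing matrix is plain Gaussian noise rather than whatever structure XNN actually uses.

```python
import random

def obfuscate_features(features, seed=0):
    """XNN-style obfuscation sketch: shuffle feature positions with a
    secret permutation, then mix them with a random linear map held by
    the data owner. Not the authors' implementation."""
    rng = random.Random(seed)
    n = len(features)
    perm = list(range(n))
    rng.shuffle(perm)
    permuted = [features[p] for p in perm]
    # Random mixing matrix (kept secret, like the permutation).
    mix = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]
    mixed = [sum(mix[i][j] * permuted[j] for j in range(n)) for i in range(n)]
    return mixed, perm, mix

obf, perm, mix = obfuscate_features([1.0, 2.0, 3.0, 4.0])
```

The cloud sees only `obf`; reversing it requires both the permutation and the mixing matrix, which never leave the client.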

[CV-19] DAFT-GAN: Dual Affine Transformation Generative Adversarial Network for Text-Guided Image Inpainting ACM-MM’2024

Link: https://arxiv.org/abs/2408.04962
Authors: Jihoon Lee, Yunhong Min, Hwidong Kim, Sangtae Ahn
Keywords-EN: recent years, significant focus, focus on research, research related, text-guided image inpainting
Subjects: Computer Vision and Pattern Recognition (cs.CV)
*Comments: ACM MM'2024. 9 pages, 3 tables, 9 figures

Click to view abstract

Abstract:In recent years, there has been a significant focus on research related to text-guided image inpainting. However, the task remains challenging due to several constraints, such as ensuring alignment between the image and the text and maintaining consistency in distribution between corrupted and uncorrupted regions. In this paper, we thus propose a dual affine transformation generative adversarial network (DAFT-GAN) to maintain semantic consistency for text-guided inpainting. DAFT-GAN integrates two affine transformation networks to combine text and image features gradually in each decoding block. Moreover, we minimize information leakage of uncorrupted features for fine-grained image generation by encoding corrupted and uncorrupted regions of the masked image separately. Our proposed model outperforms existing GAN-based models in both qualitative and quantitative assessments on three benchmark datasets (MS-COCO, CUB, and Oxford) for text-guided image inpainting.

[CV-20] In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic Segmentation ECCV2024

Link: https://arxiv.org/abs/2408.04961
Authors: Dahyun Kang, Minsu Cho
Keywords-EN: open-vocabulary semantic segmentation, two-stage approach, approach of unsupervised, open-vocabulary semantic, present lazy visual
Subjects: Computer Vision and Pattern Recognition (cs.CV)
*Comments: Accepted to ECCV 2024

Click to view abstract

Abstract:We present lazy visual grounding, a two-stage approach of unsupervised object mask discovery followed by object grounding, for open-vocabulary semantic segmentation. Much of the previous art casts this task as pixel-to-text classification without object-level comprehension, leveraging the image-to-text classification capability of pretrained vision-and-language models. We argue that visual objects are distinguishable without prior text information, as segmentation is essentially a vision task. Lazy visual grounding first discovers object masks covering an image with iterative Normalized cuts and then assigns text to the discovered objects in a late-interaction manner. Our model requires no additional training yet shows great performance on five public datasets: Pascal VOC, Pascal Context, COCO-object, COCO-stuff, and ADE 20K. In particular, the visually appealing segmentation results demonstrate the model's capability to localize objects precisely. Paper homepage: this https URL
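
The second, late-interaction stage reduces to a simple matching problem once masks exist. The sketch below is an assumed simplification: mask and text features are toy 2-D vectors rather than real CLIP-style embeddings, and stage one (Normalized-cuts mask discovery) is taken as given.

```python
import math

def cosine(u, v):
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / ((nu * nv) or 1.0)

def ground_masks(mask_feats, text_feats, labels):
    """Late-interaction grounding sketch: each already-discovered object
    mask gets the label whose text embedding it matches best."""
    assigned = []
    for m in mask_feats:
        scores = [cosine(m, t) for t in text_feats]
        assigned.append(labels[scores.index(max(scores))])
    return assigned

result = ground_masks([[1.0, 0.0], [0.0, 1.0]],
                      [[1.0, 0.0], [0.0, 1.0]],
                      ["cat", "dog"])
```

Because text only enters at this final assignment step, the vision-side grouping is done "lazily", without text supervision.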

[CV-21] Surgical-VQLA: Adversarial Contrastive Learning for Calibrated Robust Visual Question-Localized Answering in Robotic Surgery

Link: https://arxiv.org/abs/2408.04958
Authors: Long Bai, Guankun Wang, Mobarakol Islam, Lalithkumar Seenivasan, An Wang, Hongliang Ren
Keywords-EN: Medical visual question, Medical visual, visual question answering, bridges the gap, enabling doctors
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*Comments: Accepted by Information Fusion. Code and data availability: this https URL

Click to view abstract

Abstract:Medical visual question answering (VQA) bridges the gap between visual information and clinical decision-making, enabling doctors to extract understanding from clinical images and videos. In particular, surgical VQA can enhance the interpretation of surgical data, aiding in accurate diagnoses, effective education, and clinical interventions. However, the inability of VQA models to visually indicate the regions of interest corresponding to the given questions results in incomplete comprehension of the surgical scene. To tackle this, we propose surgical visual question localized-answering (VQLA) for precise and context-aware responses to specific queries regarding surgical images. Furthermore, to address the strong demand for safety in surgical scenarios and potential corruptions in image acquisition and transmission, we propose a novel approach called Calibrated Co-Attention Gated Vision-Language (C^2G-ViL) embedding to integrate and align multimodal information effectively. Additionally, we leverage an adversarial sample-based contrastive learning strategy to boost our performance and robustness. We also extend our EndoVis-18-VQLA and EndoVis-17-VQLA datasets to broaden the scope and application of our data. Extensive experiments on the aforementioned datasets demonstrate the remarkable performance and robustness of our solution. Our solution can effectively combat real-world image corruption. Thus, our proposed approach can serve as an effective tool for assisting surgical education, patient care, and enhancing surgical outcomes.

[CV-22] LLaVA-VSD: Large Language-and-Vision Assistant for Visual Spatial Description

Link: https://arxiv.org/abs/2408.04957
Authors: Yizhang Jin, Jian Li, Jiangning Zhang, Jianlong Hu, Zhenye Gan, Xin Tan, Yong Liu, Yabiao Wang, Chengjie Wang, Lizhuang Ma
Keywords-EN: Visual Spatial Description, Visual Spatial, visual spatial relationship, Traditional visual spatial, Spatial Description
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:Visual Spatial Description (VSD) aims to generate texts that describe the spatial relationships between objects within images. Traditional visual spatial relationship classification (VSRC) methods typically output the spatial relationship between two objects in an image, often neglecting world knowledge and lacking general language capabilities. In this paper, we propose a Large Language-and-Vision Assistant for Visual Spatial Description, named LLaVA-VSD, which is designed for the classification, description, and open-ended description of visual spatial relationships. Specifically, the model first constructs a VSD instruction-following dataset using given figure-caption pairs for the three tasks. It then employs LoRA to fine-tune a Large Language and Vision Assistant for VSD, which has 13 billion parameters and supports high-resolution images. Finally, a large language model (Qwen-2) is used to refine the generated sentences, enhancing their diversity and accuracy. LLaVA-VSD demonstrates excellent multimodal conversational capabilities and can follow open-ended instructions to assist with inquiries about object relationships in images.

[CV-23] Model Debiasing by Learnable Data Augmentation

Link: https://arxiv.org/abs/2408.04955
Authors: Pietro Morerio, Ruggero Ragonesi, Vittorio Murino
Keywords-EN: Deep Neural Networks, Deep Neural, Neural Networks, actual task labels, efficiently fitting training
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Click to view abstract

Abstract:Deep Neural Networks are well known for efficiently fitting training data, yet they generalize poorly whenever some kind of bias dominates over the actual task labels, resulting in models learning "shortcuts". In essence, such models are often prone to learning spurious correlations between data and labels. In this work, we tackle the problem of learning from biased data in the very realistic unsupervised scenario, i.e., when the bias is unknown. This is a much harder task than the supervised case, where auxiliary, bias-related annotations can be exploited in the learning process. This paper proposes a novel 2-stage learning pipeline featuring a data augmentation strategy able to regularize the training. First, biased/unbiased samples are identified by training over-biased models. Second, such a subdivision (typically noisy) is exploited within a data augmentation framework, properly combining the original samples while learning mixing parameters, which has a regularization effect. Experiments on synthetic and realistic biased datasets show state-of-the-art classification accuracy, outperforming competing methods, ultimately proving robust performance on both biased and unbiased examples. Notably, since our training method is totally agnostic to the level of bias, it also positively affects performance on any, even apparently unbiased, dataset, thus improving model generalization regardless of the level of bias (or its absence) in the data.
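
The two stages can be sketched as follows. Both functions are hypothetical simplifications: the confidence threshold is an assumed hyperparameter, and the mixing coefficient `lam` is fixed here, whereas in the paper it is learned.

```python
def split_by_bias(confidences, correct, thr=0.9):
    """Stage 1 sketch: samples that an intentionally over-biased model
    already classifies correctly with high confidence are flagged as
    bias-aligned; the rest are treated as bias-conflicting."""
    aligned, conflicting = [], []
    for i, (c, ok) in enumerate(zip(confidences, correct)):
        (aligned if ok and c >= thr else conflicting).append(i)
    return aligned, conflicting

def mix_samples(x_a, x_b, lam):
    """Stage 2 sketch: convexly combine a bias-aligned and a
    bias-conflicting sample as a regularizing augmentation."""
    return [lam * a + (1 - lam) * b for a, b in zip(x_a, x_b)]

aligned, conflicting = split_by_bias([0.95, 0.5, 0.99], [True, True, True])
mixed = mix_samples([1.0, 1.0], [0.0, 0.0], 0.25)
```

The noisy split from stage one is tolerable precisely because stage two only uses it to steer the augmentation, not as hard labels.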

[CV-24] Capsule Vision 2024 Challenge: Multi-Class Abnormality Classification for Video Capsule Endoscopy

Link: https://arxiv.org/abs/2408.04940
Authors: Palak Handa, Amirreza Mahbod, Florian Schwarzhans, Ramona Woitek, Nidhi Goel, Deepti Chhabra, Shreshtha Jha, Manas Dhir, Deepak Gunjan, Jagadeesh Kakarla, Balasubramanian Raman
Keywords-EN: Video Capsule Endoscopy, Multi-Class Abnormality Classification, Capsule Endoscopy, Video Capsule, Vision Image Processing
Subjects: Computer Vision and Pattern Recognition (cs.CV)
*Comments: 6 pages

Click to view abstract

Abstract:We present the Capsule Vision 2024 Challenge: Multi-Class Abnormality Classification for Video Capsule Endoscopy. It is being virtually organized by the Research Center for Medical Image Analysis and Artificial Intelligence (MIAAI), Department of Medicine, Danube Private University, Krems, Austria and Medical Imaging and Signal Analysis Hub (MISAHUB) in collaboration with the 9th International Conference on Computer Vision Image Processing (CVIP 2024) being organized by the Indian Institute of Information Technology, Design and Manufacturing (IIITDM) Kancheepuram, Chennai, India. This document describes the overview of the challenge, its registration and rules, submission format, and the description of the utilized datasets.

[CV-25] UAV-Enhanced Combination to Application: Comprehensive Analysis and Benchmarking of a Human Detection Dataset for Disaster Scenarios ICPR2024

Link: https://arxiv.org/abs/2408.04922
Authors: Ragib Amin Nihal, Benjamin Yen, Katsutoshi Itoyama, Kazuhiro Nakadai
Keywords-EN: Unmanned aerial vehicles, Combination to Application, training machine learning, machine learning models
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments: This paper is accepted for the 27th International Conference on Pattern Recognition (ICPR 2024)

Click to view abstract

Abstract:Unmanned aerial vehicles (UAVs) have revolutionized search and rescue (SAR) operations, but the lack of specialized human detection datasets for training machine learning models poses a significant challenge. To address this gap, this paper introduces the Combination to Application (C2A) dataset, synthesized by overlaying human poses onto UAV-captured disaster scenes. Through extensive experimentation with state-of-the-art detection models, we demonstrate that models fine-tuned on the C2A dataset exhibit substantial performance improvements compared to those pre-trained on generic aerial datasets. Furthermore, we highlight the importance of combining the C2A dataset with general human datasets to achieve optimal performance and generalization across various scenarios. This points to the crucial need for a tailored dataset to enhance the effectiveness of SAR operations. Our contributions also include developing a dataset creation pipeline and integrating diverse human poses and disaster-scene information to assess the severity of disaster scenarios. Our findings advocate for future developments to ensure that SAR operations benefit from the most realistic and effective AI-assisted interventions possible.
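
The synthesis step, overlaying a human-pose cutout onto a background scene, can be sketched with a toy compositing routine. The representation is an assumption: images as 2D lists of pixel values, with `None` marking transparent sprite pixels; the real pipeline presumably operates on RGBA images with alpha blending.

```python
def overlay(scene, sprite, top, left):
    """C2A-style synthesis sketch: paste a human-pose cutout onto a
    disaster-scene background at the given offset, skipping
    transparent (None) sprite pixels."""
    out = [row[:] for row in scene]  # leave the original scene intact
    for i, row in enumerate(sprite):
        for j, px in enumerate(row):
            if px is not None:
                out[top + i][left + j] = px
    return out

scene = [[0] * 3 for _ in range(3)]
composite = overlay(scene, [[1, None]], top=1, left=1)
```

Varying `top`/`left` and the sprite scale per sample is what gives the synthetic dataset its pose and placement diversity.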

[CV-26] Avoid Wasted Annotation Costs in Open-set Active Learning with Pre-trained Vision-Language Model

Link: https://arxiv.org/abs/2408.04917
Authors: Jaehyuk Heo, Pilsung Kang
Keywords-EN: Active learning, selectively collecting highly, minimizing annotation costs, aims to enhance, selectively collecting
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:Active learning (AL) aims to enhance model performance by selectively collecting highly informative data, thereby minimizing annotation costs. However, in practical scenarios, unlabeled data may contain out-of-distribution (OOD) samples, leading to wasted annotation costs if data is incorrectly selected. Recent research has explored methods to apply AL to open-set data, but these methods often require OOD samples or still incur unavoidable cost losses. To address these challenges, we propose a novel selection strategy, CLIPN for AL (CLIPNAL), which minimizes cost losses without requiring OOD samples. CLIPNAL sequentially evaluates the purity and informativeness of data. First, it utilizes a pre-trained vision-language model to detect and exclude OOD data by leveraging linguistic and visual information of in-distribution (ID) data without additional training. Second, it selects highly informative data from the remaining ID data, and the selected samples are then annotated by human experts. Experimental results on datasets with various open-set conditions demonstrate that CLIPNAL achieves the lowest cost loss and highest performance across all scenarios. Code is available at this https URL.
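
The two-step selection (purity first, informativeness second) can be sketched as below. The interface is assumed for illustration: `id_probs` stands in for the vision-language model's in-distribution score, `class_probs` for its predictive distribution, and entropy is used as the informativeness measure, which may differ from the paper's exact criterion.

```python
import math

def select_queries(id_probs, class_probs, budget, id_thr=0.5):
    """CLIPNAL-style selection sketch: first drop likely-OOD samples,
    then query the most uncertain (highest-entropy) ID samples."""
    def entropy(p):
        return -sum(q * math.log(q) for q in p if q > 0)
    ids = [i for i, p in enumerate(id_probs) if p >= id_thr]
    ids.sort(key=lambda i: entropy(class_probs[i]), reverse=True)
    return ids[:budget]

picked = select_queries([0.9, 0.2, 0.8],
                        [[0.5, 0.5], [0.5, 0.5], [0.99, 0.01]],
                        budget=1)
```

Filtering before ranking is what avoids spending annotation budget on OOD samples in the first place.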

[CV-27] GuidedNet: Semi-Supervised Multi-Organ Segmentation via Labeled Data Guide Unlabeled Data ACM-MM2024

Link: https://arxiv.org/abs/2408.04914
Authors: Haochen Zhao, Hui Meng, Deqian Yang, Xiaozheng Xie, Xiaoze Wu, Qingfeng Li, Jianwei Niu
Keywords-EN: unlabeled data, Semi-supervised multi-organ medical, labeled data, multi-organ medical image, medical image segmentation
Subjects: Computer Vision and Pattern Recognition (cs.CV)
*Comments: Accepted by ACM MM2024, 10 pages, 5 figures

Click to view abstract

Abstract:Semi-supervised multi-organ medical image segmentation aids physicians in improving disease diagnosis and treatment planning and reduces the time and effort required for organ annotation. Existing state-of-the-art methods train the labeled data with ground truths and train the unlabeled data with pseudo-labels. However, the two training flows are separate, which does not reflect the interrelationship between labeled and unlabeled data. To address this issue, we propose a semi-supervised multi-organ segmentation method called GuidedNet, which leverages the knowledge from labeled data to guide the training of unlabeled data. The primary goals of this study are to improve the quality of pseudo-labels for unlabeled data and to enhance the network's learning capability for both small and complex organs. A key concept is that voxel features from labeled and unlabeled data that are close to each other in the feature space are more likely to belong to the same class. On this basis, a 3D Consistent Gaussian Mixture Model (3D-CGMM) is designed to leverage the feature distributions from labeled data to rectify the generated pseudo-labels. Furthermore, we introduce a Knowledge Transfer Cross Pseudo Supervision (KT-CPS) strategy, which leverages the prior knowledge obtained from the labeled data to guide the training of the unlabeled data, thereby improving the segmentation accuracy for both small and complex organs. Extensive experiments on two public datasets, FLARE22 and AMOS, demonstrated that GuidedNet is capable of achieving state-of-the-art performance.

[CV-28] Surveying the Landscape of Image Captioning Evaluation: A Comprehensive Taxonomy and Novel Ensemble Method

Link: https://arxiv.org/abs/2408.04909
Authors: Uri Berger, Gabriel Stanovsky, Omri Abend, Lea Frermann
Keywords-EN: image captioning models, image captioning, image captioning metrics, complex task, task of evaluating
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Click to view abstract

Abstract:The task of image captioning has recently been gaining popularity, and with it the complex task of evaluating the quality of image captioning models. In this work, we present the first survey and taxonomy of over 70 different image captioning metrics and their usage in hundreds of papers. We find that despite the diversity of proposed metrics, the vast majority of studies rely on only five popular metrics, which we show to be weakly correlated with human judgements. Instead, we propose EnsembEval – an ensemble of evaluation methods achieving the highest reported correlation with human judgements across 5 image captioning datasets, showing there is a lot of room for improvement by leveraging a diverse set of metrics.
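
One plausible way to ensemble heterogeneous captioning metrics is to normalize each metric's scores before averaging, so that no single metric's scale dominates. This is a hypothetical combination rule; the paper's actual EnsembEval aggregation may differ.

```python
def ensemble_scores(metric_scores):
    """Metric-ensemble sketch: min-max normalize each metric across the
    candidate captions, then average the normalized scores per caption."""
    n = len(next(iter(metric_scores.values())))
    norm = []
    for scores in metric_scores.values():
        lo, hi = min(scores), max(scores)
        span = (hi - lo) or 1.0  # avoid division by zero on constant metrics
        norm.append([(s - lo) / span for s in scores])
    return [sum(m[i] for m in norm) / len(norm) for i in range(n)]

combined = ensemble_scores({"bleu": [0.0, 1.0], "cider": [0.0, 2.0]})
```

With two candidate captions scored by two metrics of different ranges, both metrics contribute equally to the final ranking.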

[CV-29] Clustering-friendly Representation Learning for Enhancing Salient Features PAKDD2024

Link: https://arxiv.org/abs/2408.04891
Authors: Toshiyuki Oshima, Kentaro Takagi, Kouta Nakata
Keywords-EN: challenging unlabeled datasets, Recently, successfully applied, applied to challenging, challenging unlabeled
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*Comments: 12 pages, 6 figures, 28th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD2024)

Click to view abstract

Abstract:Recently, representation learning with contrastive learning algorithms has been successfully applied to challenging unlabeled datasets. However, these methods are unable to distinguish important features from unimportant ones under simply unsupervised settings, and definitions of importance vary according to the type of downstream task or analysis goal, such as the identification of objects or backgrounds. In this paper, we focus on unsupervised image clustering as the downstream task and propose a representation learning method that enhances features critical to the clustering task. We extend a clustering-friendly contrastive learning method and incorporate a contrastive analysis approach, which utilizes a reference dataset to separate important features from unimportant ones, into the design of loss functions. Conducting an experimental evaluation of image clustering for three datasets with characteristic backgrounds, we show that for all datasets, our method achieves higher clustering scores compared with conventional contrastive analysis and deep clustering methods.

[CV-30] ProxyCLIP: Proxy Attention Improves CLIP for Open-Vocabulary Segmentation ECCV2024

Link: https://arxiv.org/abs/2408.04883
Authors: Mengcheng Lan, Chaofeng Chen, Yiping Ke, Xinjiang Wang, Litong Feng, Wayne Zhang
Keywords-EN: effectively integrate visual, Contrastive Language-Image Pre-training, Vision Foundation Models, integrate visual representations, open-vocabulary semantic labels
Subjects: Computer Vision and Pattern Recognition (cs.CV)
*Comments: Accepted to ECCV 2024. Code available at this https URL

Click to view abstract

Abstract:Open-vocabulary semantic segmentation requires models to effectively integrate visual representations with open-vocabulary semantic labels. While Contrastive Language-Image Pre-training (CLIP) models shine in recognizing visual concepts from text, they often struggle with segment coherence due to their limited localization ability. In contrast, Vision Foundation Models (VFMs) excel at acquiring spatially consistent local visual representations, yet they fall short in semantic understanding. This paper introduces ProxyCLIP, an innovative framework designed to harmonize the strengths of both CLIP and VFMs, facilitating enhanced open-vocabulary semantic segmentation. ProxyCLIP leverages the spatial feature correspondence from VFMs as a form of proxy attention to augment CLIP, thereby inheriting the VFMs’ robust local consistency and maintaining CLIP’s exceptional zero-shot transfer capacity. We propose an adaptive normalization and masking strategy to get the proxy attention from VFMs, allowing for adaptation across different VFMs. Remarkably, as a training-free approach, ProxyCLIP significantly improves the average mean Intersection over Union (mIoU) across eight benchmarks from 40.3 to 44.4, showcasing its exceptional efficacy in bridging the gap between spatial precision and semantic richness for the open-vocabulary segmentation task.
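
The proxy-attention idea, weights computed from VFM patch affinities but applied to CLIP value embeddings, can be illustrated with toy vectors. This is a simplified assumption: the paper's adaptive normalization and masking of the affinity map are omitted, and the temperature is an assumed constant.

```python
import math

def proxy_attention(vfm_feats, clip_values, tau=0.07):
    """ProxyCLIP-style sketch: attention weights come from VFM
    patch-feature similarity and aggregate CLIP value embeddings,
    combining VFM locality with CLIP semantics."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    out = []
    for qi in vfm_feats:
        logits = [dot(qi, kj) / tau for kj in vfm_feats]
        m = max(logits)  # stabilized softmax
        w = [math.exp(l - m) for l in logits]
        z = sum(w)
        w = [x / z for x in w]
        out.append([sum(w[j] * clip_values[j][d] for j in range(len(w)))
                    for d in range(len(clip_values[0]))])
    return out

seg_feats = proxy_attention([[1.0, 0.0], [1.0, 0.0]],
                            [[2.0, 0.0], [0.0, 2.0]])
```

When two patches have identical VFM features, their outputs pool the same CLIP values, which is exactly the spatial-consistency effect the method relies on.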

[CV-31] On the Element-Wise Representation and Reasoning in Zero-Shot Image Recognition: A Systematic Survey

Link: https://arxiv.org/abs/2408.04879
Authors: Jingcai Guo, Zhijie Rao, Zhi Chen, Song Guo, Jingren Zhou, Dacheng Tao
Keywords-EN: Zero-shot image recognition, Zero-shot image, learning generalized knowledge, unseen domains, aims at empowering
Subjects: Computer Vision and Pattern Recognition (cs.CV)
*Comments: 24 pages, 7 figures

Click to view abstract

Abstract:Zero-shot image recognition (ZSIR) aims at empowering models to recognize and reason in unseen domains via learning generalized knowledge from limited data in the seen domain. The gist for ZSIR is to execute element-wise representation and reasoning from the input visual space to the target semantic space, which is a bottom-up modeling paradigm inspired by the process by which humans observe the world, i.e., capturing new concepts by learning and combining the basic components or shared characteristics. In recent years, element-wise learning techniques have seen significant progress in ZSIR as well as widespread application. However, to the best of our knowledge, there remains a lack of a systematic overview of this topic. To enrich the literature and provide a sound basis for its future development, this paper presents a broad review of recent advances in element-wise ZSIR. Concretely, we first attempt to integrate the three basic ZSIR tasks of object recognition, compositional recognition, and foundation model-based open-world recognition into a unified element-wise perspective and provide a detailed taxonomy and analysis of the main research approaches. Then, we collect and summarize some key information and benchmarks, such as detailed technical implementations and common datasets. Finally, we sketch out the wide range of its related applications, discuss vital challenges, and suggest potential future directions.

[CV-32] ChatGPT Meets Iris Biometrics

Link: https://arxiv.org/abs/2408.04868
Authors: Parisa Farmanifard, Arun Ross
Keywords-EN: multimodal Large Language, Large Language Model, Large Language, multimodal Large, Language Model
Subjects: Computer Vision and Pattern Recognition (cs.CV)
*Comments: Published at IJCB 2024

Click to view abstract

Abstract:This study utilizes the advanced capabilities of the GPT-4 multimodal Large Language Model (LLM) to explore its potential in iris recognition - a field less common and more specialized than face recognition. By focusing on this niche yet crucial area, we investigate how well AI tools like ChatGPT can understand and analyze iris images. Through a series of meticulously designed experiments employing a zero-shot learning approach, the capabilities of ChatGPT-4 were assessed across various challenging conditions, including diverse datasets, presentation attacks, occlusions such as glasses, and other real-world variations. The findings convey ChatGPT-4's remarkable adaptability and precision, revealing its proficiency in identifying distinctive iris features while also detecting subtle effects like makeup on iris recognition. A comparative analysis with Gemini Advanced - Google's AI model - highlighted ChatGPT-4's better performance and user experience in complex iris analysis tasks. This research not only validates the use of LLMs for specialized biometric applications but also emphasizes the importance of nuanced query framing and interaction design in extracting significant insights from biometric data. Our findings suggest a promising path for future research and the development of more adaptable, efficient, robust and interactive biometric security solutions.

[CV-33] MSG-Chart: Multimodal Scene Graph for ChartQA CIKM

Link: https://arxiv.org/abs/2408.04852
Authors: Yue Dai, Soyeon Caren Han, Wei Liu
Keywords-EN: Automatic Chart Question, Chart Question Answering, Question Answering, Automatic Chart, Chart Question
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*Comments: Accepted by CIKM Short 2024

Click to view abstract

Abstract:Automatic Chart Question Answering (ChartQA) is challenging due to the complex distribution of chart elements with patterns of the underlying data not explicitly displayed in charts. To address this challenge, we design a joint multimodal scene graph for charts to explicitly represent the relationships between chart elements and their patterns. Our proposed multimodal scene graph includes a visual graph and a textual graph to jointly capture the structural and semantical knowledge from the chart. This graph module can be easily integrated with different vision transformers as inductive bias. Our experiments demonstrate that incorporating the proposed graph module enhances the understanding of charts’ elements’ structure and semantics, thereby improving performance on publicly available benchmarks, ChartQA and OpenCQA.

[CV-34] mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models

Link: https://arxiv.org/abs/2408.04840
Authors: Jiabo Ye, Haiyang Xu, Haowei Liu, Anwen Hu, Ming Yan, Qi Qian, Ji Zhang, Fei Huang, Jingren Zhou
Keywords-EN: demonstrated remarkable capabilities, Multi-modal Large Language, Large Language Models, Large Language, Multi-modal Large
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Multi-modal Large Language Models (MLLMs) have demonstrated remarkable capabilities in executing instructions for a variety of single-image tasks. Despite this progress, significant challenges remain in modeling long image sequences. In this work, we introduce the versatile multi-modal large language model, mPLUG-Owl3, which enhances the capability for long image-sequence understanding in scenarios that incorporate retrieved image-text knowledge, interleaved image-text, and lengthy videos. Specifically, we propose novel hyper attention blocks to efficiently integrate vision and language into a common language-guided semantic space, thereby facilitating the processing of extended multi-image scenarios. Extensive experimental results suggest that mPLUG-Owl3 achieves state-of-the-art performance among models with a similar size on single-image, multi-image, and video benchmarks. Moreover, we propose a challenging long visual sequence evaluation named Distractor Resistance to assess the ability of models to maintain focus amidst distractions. Finally, with the proposed architecture, mPLUG-Owl3 demonstrates outstanding performance on ultra-long visual sequence inputs. We hope that mPLUG-Owl3 can contribute to the development of more efficient and powerful multimodal large language models.

[CV-35] Adversarially Robust Industrial Anomaly Detection Through Diffusion Model

Link: https://arxiv.org/abs/2408.04839
Authors: Yuanpu Cao, Lu Lin, Jinghui Chen
Keywords-EN: achieved remarkably high, remarkably high accuracy, Deep learning-based industrial, anomaly detection, anomaly
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Click to view abstract

Abstract:Deep learning-based industrial anomaly detection models have achieved remarkably high accuracy on commonly used benchmark datasets. However, the robustness of those models may not be satisfactory due to the existence of adversarial examples, which pose significant threats to the practical deployment of deep anomaly detectors. Recently, it has been shown that diffusion models can be used to purify adversarial noise and thus build a robust classifier against adversarial attacks. Unfortunately, we found that naively applying this strategy in anomaly detection (i.e., placing a purifier before an anomaly detector) will suffer from a high anomaly miss rate, since the purifying process can easily remove both the anomaly signal and the adversarial perturbations, causing the subsequent anomaly detector to fail to detect anomalies. To tackle this issue, we explore the possibility of performing anomaly detection and adversarial purification simultaneously. We propose a simple yet effective adversarially robust anomaly detection method, AdvRAD, that allows the diffusion model to act both as an anomaly detector and an adversarial purifier. We also extend our proposed method for certified robustness to l_2-norm bounded perturbations. Through extensive experiments, we show that our proposed method exhibits outstanding (certified) adversarial robustness while also maintaining equally strong anomaly detection performance on par with the state-of-the-art methods on industrial anomaly detection benchmark datasets.
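
A diffusion model scoring anomalies via reconstruction can be sketched as follows. This is not the authors' code: `denoise` is an assumed stand-in for a trained diffusion denoiser, and the single noise-then-reconstruct step abstracts the full diffusion process.

```python
import random

def anomaly_score(x, denoise, sigma=0.1, seed=0):
    """AdvRAD-flavored sketch: perturb the input with Gaussian noise,
    let a diffusion-style denoiser reconstruct it, and use the
    reconstruction error w.r.t. the input as the anomaly score."""
    rng = random.Random(seed)
    noisy = [v + rng.gauss(0.0, sigma) for v in x]
    recon = denoise(noisy)
    return sum((a - b) ** 2 for a, b in zip(x, recon)) / len(x)

# Toy denoiser that always reconstructs the "normal" all-zero pattern.
dummy_denoiser = lambda z: [0.0] * len(z)
low = anomaly_score([0.0, 0.0], dummy_denoiser)
high = anomaly_score([5.0, 5.0], dummy_denoiser)
```

Because the same model both purifies the perturbation and produces the score, there is no separate purifier that could erase the anomaly signal before detection.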

[CV-36] Self-augmented Gaussian Splatting with Structure-aware Masks for Sparse-view 3D Reconstruction

Link: https://arxiv.org/abs/2408.04831
Authors: Lingbei Meng, Bi’an Du, Wei Hu
Keywords-EN: build complete three-dimensional, complete three-dimensional models, computer vision, aiming to build, viewing perspectives
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:Sparse-view 3D reconstruction stands as a formidable challenge in computer vision, aiming to build complete three-dimensional models from a limited array of viewing perspectives. This task confronts several difficulties: 1) the limited number of input images that lack consistent information; 2) dependence on the quality of input images; and 3) the substantial size of model parameters. To address these challenges, we propose a self-augmented coarse-to-fine Gaussian splatting paradigm, enhanced with a structure-aware mask, for sparse-view 3D reconstruction. In particular, our method initially employs a coarse Gaussian model to obtain a basic 3D representation from sparse-view inputs. Subsequently, we develop a fine Gaussian network to enhance consistent and detailed representation of the output with both 3D geometry augmentation and perceptual view augmentation. During training, we design a structure-aware masking strategy to further improve the model's robustness against sparse inputs and noise. Experimental results on the MipNeRF360 and OmniObject3D datasets demonstrate that the proposed method achieves state-of-the-art performance for sparse input views in both perceptual quality and efficiency.

[CV-37] One Shot is Enough for Sequential Infrared Small Target Segmentation

Link: https://arxiv.org/abs/2408.04823
Authors: Bingbing Dan, Meihui Li, Tao Tang, Jing Zhang
Keywords-EN: Infrared small target, sequential infrared small, rich contextual information, exhibit strong similarities, small target segmentation
Subjects: Computer Vision and Pattern Recognition (cs.CV)
*Comments:

Click to view abstract

Abstract:Infrared small target sequences exhibit strong similarities between frames and contain rich contextual information, which motivates us to achieve sequential infrared small target segmentation with minimal data. Inspired by the success of large segmentation models led by the Segment Anything Model (SAM) across various downstream tasks, we propose a one-shot and training-free method that adapts SAM's zero-shot generalization capabilities to sequential infrared small target segmentation. Given one annotated frame as a reference, our method can accurately segment small targets in the other frames of the sequence. Specifically, we first obtain a confidence map through local feature matching between the reference image and the test image. Then, the highest point in the confidence map serves as a prompt, and we design the Point Prompt-Centric Focusing (PPCF) module to address the over-segmentation of small targets with blurry boundaries. Subsequently, to prevent missed and false detections, we introduce the Triple-Level Ensemble (TLE) module, which ensembles the masks obtained at different levels from the first two steps to produce the final mask. Experiments demonstrate that our method requires only one shot to achieve performance comparable to state-of-the-art methods based on traditional many-shot supervision, and even superior performance in a few-shot setting. Moreover, ablation studies confirm the robustness of our approach to variations in one-shot samples, changes in scenes, and the presence of multiple targets.
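
The prompt-selection step (confidence map via feature matching, then the argmax as SAM's point prompt) can be sketched as below. The representation is an assumption: real features would come from a vision backbone, and locations are toy (row, col) keys rather than dense feature maps.

```python
import math

def best_point_prompt(ref_feat, test_feats):
    """Sketch of the prompt-selection step: match the reference target
    feature against every test-frame location by cosine similarity and
    use the highest-scoring location as the point prompt for SAM."""
    def cos(u, v):
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return sum(a * b for a, b in zip(u, v)) / ((nu * nv) or 1.0)
    conf = {loc: cos(ref_feat, f) for loc, f in test_feats.items()}
    return max(conf, key=conf.get), conf

loc, conf_map = best_point_prompt([1.0, 0.0],
                                  {(0, 0): [0.0, 1.0], (3, 4): [1.0, 0.1]})
```

The returned confidence map is also what the downstream PPCF and TLE modules would consume; only the argmax is fed to SAM as the prompt.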

[CV-38] Rethinking Multiple Instance Learning: Developing an Instance-Level Classifier via Weakly-Supervised Self-Training

链接: https://arxiv.org/abs/2408.04813
作者: Yingfan Ma,Xiaoyuan Luo,Mingzhi Yuan,Xinrong Chen,Manning Wang
关键词-EN: Multiple instance learning, ignore important information, important information contained, Multiple instance, hard positive instances
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The multiple instance learning (MIL) problem is currently solved from either a bag-classification or an instance-classification perspective, both of which ignore important information contained in some instances and result in limited performance. For example, existing methods often face difficulty in learning hard positive instances. In this paper, we formulate MIL as a semi-supervised instance classification problem, so that all the labeled and unlabeled instances can be fully utilized to train a better classifier. The difficulty in this formulation is that all the labeled instances are negative in MIL, and traditional self-training techniques used in semi-supervised learning tend to degenerate when generating pseudo labels for the unlabeled instances in this scenario. To resolve this problem, we propose a weakly-supervised self-training method, in which we utilize the positive bag labels to construct a global constraint and a local constraint on the pseudo labels to prevent them from degenerating and to force the classifier to learn hard positive instances. It is worth noting that easy positive instances are those far from the decision boundary in the classification process, while hard positive instances are those close to the decision boundary. Through iterative optimization, the pseudo labels can gradually approach the true labels. Extensive experiments on two MNIST synthetic datasets, five traditional MIL benchmark datasets, and two histopathology whole slide image datasets show that our method achieved new SOTA performance on all of them. The code will be publicly available.
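The bag-level constraint on pseudo labels can be illustrated as follows: every positive bag must contain at least one positive instance, so if thresholding would label a whole positive bag negative, the top-scoring instance is forced positive. The threshold and the force-top-instance rule are a simplified stand-in for the paper's global and local constraints.

```python
def pseudo_labels(scores_per_bag, thr=0.5):
    """Self-training pseudo labels with a positive-bag constraint:
    every positive bag must contain at least one positive instance,
    so the top-scoring instance is forced positive even when the
    threshold would reject it (prevents label degeneration and makes
    the classifier see hard positives near the decision boundary)."""
    labels = []
    for scores in scores_per_bag:
        lab = [1 if s >= thr else 0 for s in scores]
        if sum(lab) == 0:                        # degenerate all-negative bag
            lab[scores.index(max(scores))] = 1   # force the hardest positive
        labels.append(lab)
    return labels

bags = [[0.9, 0.2, 0.1],   # easy positive present
        [0.4, 0.3, 0.2]]   # only hard positives: all below threshold
labs = pseudo_labels(bags)
```

The second bag shows the point of the constraint: plain thresholding would mark it all-negative, contradicting its positive bag label.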

[CV-39] UniBench: Visual Reasoning Requires Rethinking Vision-Language Beyond Scaling

链接: https://arxiv.org/abs/2408.04810
作者: Haider Al-Tahan,Quentin Garrido,Randall Balestriero,Diane Bouchacourt,Caner Hazirbas,Mark Ibrahim
关键词-EN: Significant research efforts, Significant research, research efforts, improve vision-language model, Significant
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Significant research efforts have been made to scale and improve vision-language model (VLM) training approaches. Yet, with an ever-growing number of benchmarks, researchers are tasked with the heavy burden of implementing each protocol, bearing a non-trivial computational cost, and making sense of how all these benchmarks translate into meaningful axes of progress. To facilitate a systematic evaluation of VLM progress, we introduce UniBench: a unified implementation of 50+ VLM benchmarks spanning a comprehensive range of carefully categorized capabilities from object recognition to spatial awareness, counting, and much more. We showcase the utility of UniBench for measuring progress by evaluating nearly 60 publicly available vision-language models, trained on scales of up to 12.8B samples. We find that while scaling training data or model size can boost many vision-language model capabilities, scaling offers little benefit for reasoning or relations. Surprisingly, we also discover today’s best VLMs struggle on simple digit recognition and counting tasks, e.g. MNIST, which much simpler networks can solve. Where scale falls short, we find that more precise interventions, such as data quality or tailored learning objectives, offer more promise. For practitioners, we also offer guidance on selecting a suitable VLM for a given application. Finally, we release an easy-to-run UniBench code-base with the full set of 50+ benchmarks and comparisons across 59 models as well as a distilled, representative set of benchmarks that runs in 5 minutes on a single GPU.

[CV-40] On the Geometry of Deep Learning

链接: https://arxiv.org/abs/2408.04809
作者: Randall Balestriero,Ahmed Imtiaz Humayun,Richard Baraniuk
关键词-EN: continuous piecewise linear, piecewise linear functions, continuous piecewise, multiple dimensions, network affine spline
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In this paper, we overview one promising avenue of progress at the mathematical foundation of deep learning: the connection between deep networks and function approximation by affine splines (continuous piecewise linear functions in multiple dimensions). In particular, we will overview work over the past decade on understanding certain geometrical properties of a deep network’s affine spline mapping, in particular how it tessellates its input space. As we will see, the affine spline connection and geometrical viewpoint provide a powerful portal through which to view, analyze, and improve the inner workings of a deep network.
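The tessellation view can be made concrete with a toy experiment: each distinct ReLU activation pattern of a network corresponds to one affine region of the input space, so counting patterns over a grid of inputs lower-bounds the number of regions. The one-hidden-layer network and its weights below are illustrative.

```python
def activation_pattern(x, y, weights):
    """Sign pattern of a one-hidden-layer ReLU net: each hidden unit
    w1*x + w2*y + b is either on (1) or off (0); each pattern indexes
    one affine piece of the network's continuous piecewise linear map."""
    return tuple(1 if w1 * x + w2 * y + b > 0 else 0
                 for (w1, w2, b) in weights)

def count_regions(weights, grid=21, lo=-1.0, hi=1.0):
    """Sample a grid and count distinct activation patterns
    (a lower bound on the number of affine regions)."""
    pats = set()
    for i in range(grid):
        for j in range(grid):
            x = lo + (hi - lo) * i / (grid - 1)
            y = lo + (hi - lo) * j / (grid - 1)
            pats.add(activation_pattern(x, y, weights))
    return len(pats)

# Two hyperplanes through the origin tessellate the plane into 4 regions.
w = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
n_regions = count_regions(w)
```

With more units in general position the count grows quickly, which is exactly the geometry the affine spline viewpoint analyzes.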

[CV-41] Hyper-YOLO: When Visual Object Detection Meets Hypergraph Computation

链接: https://arxiv.org/abs/2408.04804
作者: Yifan Feng,Jiangang Huang,Shaoyi Du,Shihui Ying,Jun-Hai Yong,Yipeng Li,Guiguang Ding,Rongrong Ji,Yue Gao
关键词-EN: object detection method, integrates hypergraph computations, complex high-order correlations, Hypergraph Computation Empowered, Computation Empowered Semantic
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We introduce Hyper-YOLO, a new object detection method that integrates hypergraph computations to capture the complex high-order correlations among visual features. Traditional YOLO models, while powerful, have limitations in their neck designs that restrict the integration of cross-level features and the exploitation of high-order feature interrelationships. To address these challenges, we propose the Hypergraph Computation Empowered Semantic Collecting and Scattering (HGC-SCS) framework, which transposes visual feature maps into a semantic space and constructs a hypergraph for high-order message propagation. This enables the model to acquire both semantic and structural information, advancing beyond conventional feature-focused learning. Hyper-YOLO incorporates the proposed Mixed Aggregation Network (MANet) in its backbone for enhanced feature extraction and introduces the Hypergraph-Based Cross-Level and Cross-Position Representation Network (HyperC2Net) in its neck. HyperC2Net operates across five scales and breaks free from traditional grid structures, allowing for sophisticated high-order interactions across levels and positions. This synergy of components positions Hyper-YOLO as a state-of-the-art architecture in various scale models, as evidenced by its superior performance on the COCO dataset. Specifically, Hyper-YOLO-N significantly outperforms the advanced YOLOv8-N and YOLOv9-T with 12% $\text{AP}^{val}$ and 9% $\text{AP}^{val}$ improvements. The source codes are at https://github.com/iMoonLab/Hyper-YOLO.
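High-order message propagation over a hypergraph can be sketched as the standard two-stage averaging: vertex features are pooled into each hyperedge, then scattered back to the vertices it connects. This is a generic hypergraph convolution, not necessarily the exact operator inside HGC-SCS.

```python
def hypergraph_propagate(H, X):
    """One round of hypergraph message passing:
    vertex -> hyperedge (average over the edge's member vertices),
    then hyperedge -> vertex (average over the vertex's edges).
    H[v][e] = 1 if vertex v belongs to hyperedge e; X holds features."""
    n, m = len(H), len(H[0])
    d = len(X[0])
    # hyperedge features: mean of member vertex features
    E = [[0.0] * d for _ in range(m)]
    for e in range(m):
        members = [v for v in range(n) if H[v][e]]
        for k in range(d):
            E[e][k] = sum(X[v][k] for v in members) / len(members)
    # vertex update: mean of incident hyperedge features
    Y = [row[:] for row in X]
    for v in range(n):
        edges = [e for e in range(m) if H[v][e]]
        if edges:
            for k in range(d):
                Y[v][k] = sum(E[e][k] for e in edges) / len(edges)
    return Y

# 3 vertices joined by one hyperedge: a single high-order relation,
# which an ordinary pairwise graph edge cannot express.
H = [[1], [1], [1]]
X = [[0.0], [3.0], [6.0]]
Y = hypergraph_propagate(H, X)   # every member receives the edge mean
```

A hyperedge can connect features from different pyramid levels and positions at once, which is what "breaking free from grid structures" refers to.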

[CV-42] FewShotNeRF: Meta-Learning-based Novel View Synthesis for Rapid Scene-Specific Adaptation

链接: https://arxiv.org/abs/2408.04803
作者: Piraveen Sivakumar,Paul Janson,Jathushan Rajasegaran,Thanuja Ambegoda
关键词-EN: limited multi-view images, Neural Radiance Field, address the challenge, limited multi-view, multi-view images
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In this paper, we address the challenge of generating novel views of real-world objects with limited multi-view images through our proposed approach, FewShotNeRF. Our method utilizes meta-learning to acquire optimal initialization, facilitating rapid adaptation of a Neural Radiance Field (NeRF) to specific scenes. The focus of our meta-learning process is on capturing shared geometry and textures within a category, embedded in the weight initialization. This approach expedites the learning process of NeRFs and leverages recent advancements in positional encodings to reduce the time required for fitting a NeRF to a scene, thereby accelerating the inner loop optimization of meta-learning. Notably, our method enables meta-learning on a large number of 3D scenes to establish a robust 3D prior for various categories. Through extensive evaluations on the Common Objects in 3D open source dataset, we empirically demonstrate the efficacy and potential of meta-learning in generating high-quality novel views of objects.
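The idea of meta-learning an initialization that adapts rapidly can be illustrated with a Reptile-style outer loop on scalar toy tasks. The paper's inner loop fits a NeRF to a scene; here the inner loop is plain SGD on a quadratic, and all constants are assumptions for illustration only.

```python
def inner_sgd(theta, target, steps=10, lr=0.1):
    """Adapt to one task: minimize (theta - target)^2 by SGD."""
    for _ in range(steps):
        theta -= lr * 2.0 * (theta - target)
    return theta

def reptile(theta0, task_targets, meta_lr=0.5, epochs=50):
    """Reptile-style outer loop: repeatedly move the shared
    initialization toward each task's adapted parameters, so the
    result encodes what the tasks (scenes of a category) share and
    adapts to any one of them in few inner steps."""
    theta = theta0
    for _ in range(epochs):
        for t in task_targets:
            adapted = inner_sgd(theta, t)
            theta += meta_lr * (adapted - theta)
    return theta

# Tasks share structure (optima clustered near 2.0), analogous to
# scenes of one category sharing geometry and texture.
init = reptile(10.0, [1.8, 2.0, 2.2])
```

Starting from `init`, a handful of inner steps suffices for any task in the cluster, which is the speedup the paper seeks for per-scene NeRF fitting.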

[CV-43] SOD-YOLOv8 – Enhancing YOLOv8 for Small Object Detection in Traffic Scenes

链接: https://arxiv.org/abs/2408.04786
作者: Boshra Khalili,Andrew W.Smyth
关键词-EN: Small Object Detection, Object detection, small objects, detecting small objects, emergency response
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 15 pages, 14 figures

点击查看摘要

Abstract:Object detection as part of computer vision can be crucial for traffic management, emergency response, autonomous vehicles, and smart cities. Despite significant advances in object detection, detecting small objects in images captured by distant cameras remains challenging due to their size, distance from the camera, varied shapes, and cluttered backgrounds. To address these challenges, we propose Small Object Detection YOLOv8 (SOD-YOLOv8), a novel model specifically designed for scenarios involving numerous small objects. Inspired by Efficient Generalized Feature Pyramid Networks (GFPN), we enhance multi-path fusion within YOLOv8 to integrate features across different levels, preserving details from shallower layers and improving small object detection accuracy. Also, a fourth detection layer is added to leverage high-resolution spatial information effectively. The Efficient Multi-Scale Attention Module (EMA) in the C2f-EMA module enhances feature extraction by redistributing weights and prioritizing relevant features. We introduce Powerful-IoU (PIoU) as a replacement for CIoU, focusing on moderate-quality anchor boxes and adding a penalty based on differences between predicted and ground truth bounding box corners. This approach simplifies calculations, speeds up convergence, and enhances detection accuracy. SOD-YOLOv8 significantly improves small object detection, surpassing widely used models in various metrics, without substantially increasing computational cost or latency compared to YOLOv8s. Specifically, it increases recall from 40.1% to 43.9%, precision from 51.2% to 53.9%, $\text{mAP}_{0.5}$ from 40.6% to 45.1%, and $\text{mAP}_{0.5:0.95}$ from 24% to 26.6%. In dynamic real-world traffic scenes, SOD-YOLOv8 demonstrated notable improvements in diverse conditions, proving its reliability and effectiveness in detecting small objects even in challenging environments.
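The corner-based penalty idea behind PIoU can be sketched as an IoU loss plus a term on predicted-vs-ground-truth corner distances. The paper's exact PIoU formulation differs; `beta` and the L1 corner distance here are assumptions for illustration.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def corner_penalty_loss(pred, gt, beta=0.1):
    """IoU loss plus a penalty on the distance between corresponding
    corners of the predicted and ground-truth boxes (a PIoU-style
    extra gradient signal even when IoU alone is flat)."""
    corners_p = [(pred[0], pred[1]), (pred[2], pred[3])]
    corners_g = [(gt[0], gt[1]), (gt[2], gt[3])]
    pen = sum(abs(px - gx) + abs(py - gy)
              for (px, py), (gx, gy) in zip(corners_p, corners_g))
    return (1.0 - iou(pred, gt)) + beta * pen

perfect = corner_penalty_loss((0, 0, 2, 2), (0, 0, 2, 2))  # exact match
shifted = corner_penalty_loss((1, 0, 3, 2), (0, 0, 2, 2))  # offset box
```

For small objects, a one-pixel shift wipes out much of the IoU, so a direct corner term gives smoother guidance during regression.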

[CV-44] BRAT: Bonus oRthogonAl Token for Architecture Agnostic Textual Inversion

链接: https://arxiv.org/abs/2408.04785
作者: James Baker
关键词-EN: personalizing diffusion models, Textual Inversion remains, Textual Inversion, subjects and styles, diffusion models
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Textual Inversion remains a popular method for personalizing diffusion models, in order to teach models new subjects and styles. We note that textual inversion has been underexplored using alternatives to the UNet, and experiment with textual inversion with a vision transformer. We also seek to optimize textual inversion using a strategy that does not require explicit use of the UNet and its idiosyncratic layers, so we add bonus tokens and enforce orthogonality. We find the use of the bonus token improves adherence to the source images and the use of the vision transformer improves adherence to the prompt. Code is available at this https URL.
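Enforcing orthogonality between token embeddings can be done with a simple penalty on pairwise cosine similarity; minimizing it pushes the bonus token to carry directions the main token does not. This is a generic sketch, not the paper's exact objective.

```python
import math

def cos_sim(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def orthogonality_penalty(tokens):
    """Sum of absolute pairwise cosine similarities between token
    embeddings; zero iff all tokens are mutually orthogonal, so adding
    it to the training loss keeps the bonus token from collapsing onto
    the main learned token."""
    pen = 0.0
    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens)):
            pen += abs(cos_sim(tokens[i], tokens[j]))
    return pen

orthogonal = orthogonality_penalty([[1.0, 0.0], [0.0, 1.0]])  # ideal
parallel = orthogonality_penalty([[1.0, 0.0], [2.0, 0.0]])    # collapsed
```

In training this penalty would be weighted and added to the usual diffusion reconstruction loss while optimizing the token embeddings.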

[CV-45] Data-Driven Pixel Control: Challenges and Prospects

链接: https://arxiv.org/abs/2408.04767
作者: Saurabh Farkya,Zachary Alan Daniels,Aswin Raghavan,Gooitzen van der Wal,Michael Isnardi,Michael Piacentino,David Zhang
关键词-EN: Recent advancements, Recent, high resolution, pixel level, high
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
*备注: Accepted to the Conference on Dynamic Data-Driven Applications Systems (DDDAS2024)

点击查看摘要

Abstract:Recent advancements in sensors have led to high resolution and high data throughput at the pixel level. Simultaneously, the adoption of increasingly large (deep) neural networks (NNs) has led to significant progress in computer vision. Currently, visual intelligence comes at increasingly high computational complexity, energy, and latency. We study a data-driven system that combines dynamic sensing at the pixel level with computer vision analytics at the video level and propose a feedback control loop to minimize data movement between the sensor front-end and computational back-end without compromising detection and tracking precision. Our contributions are threefold: (1) We introduce anticipatory attention and show that it leads to high precision prediction with sparse activation of pixels; (2) Leveraging the feedback control, we show that the dimensionality of learned feature vectors can be significantly reduced with increased sparsity; and (3) We emulate analog design choices (such as varying RGB or Bayer pixel format and analog noise) and study their impact on the key metrics of the data-driven system. Comparative analysis with traditional pixel and deep learning models shows significant performance enhancements. Our system achieves a 10X reduction in bandwidth and a 15-30X improvement in Energy-Delay Product (EDP) when activating only 30% of pixels, with a minor reduction in object detection and tracking precision. Based on analog emulation, our system can achieve a throughput of 205 megapixels/sec (MP/s) with a power consumption of only 110 mW per MP, i.e., a theoretical improvement of ~30X in EDP.
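The Energy-Delay Product multiplies the two quantities, so separate energy and latency savings compound; that is how a 15-30X EDP improvement can exceed either individual gain. The numbers below are hypothetical, chosen only to show the arithmetic.

```python
def edp(energy_mj, delay_ms):
    """Energy-Delay Product: lower is better; it penalizes designs
    that save energy only by running slower (and vice versa)."""
    return energy_mj * delay_ms

def edp_improvement(base_e, base_d, new_e, new_d):
    """Ratio of baseline EDP to the new design's EDP."""
    return edp(base_e, base_d) / edp(new_e, new_d)

# Hypothetical numbers: activating 30% of pixels cuts energy 5x and
# delay 3x; because EDP multiplies the two, the gain compounds to 15x.
gain = edp_improvement(100.0, 30.0, 20.0, 10.0)
```

The same compounding explains why pixel-level sparsity, which reduces both data movement (delay) and switching activity (energy), is so effective on this metric.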

[CV-46] Novel adaptation of video segmentation to 3D MRI: efficient zero-shot knee segmentation with SAM2

链接: https://arxiv.org/abs/2408.04762
作者: Andrew Seohwan Yu,Mohsen Hariri,Xuecen Zhang,Mingrui Yang,Vipin Chaudhary,Xiaojuan Li
关键词-EN: algorithm performance degrades, performance degrades due, Intelligent medical image, Intelligent medical, domain transfer
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Intelligent medical image segmentation methods are rapidly evolving and being increasingly applied, yet they face the challenge of domain transfer, where algorithm performance degrades due to different data distributions between source and target domains. To address this, we introduce a method for zero-shot, single-prompt segmentation of 3D knee MRI by adapting Segment Anything Model 2 (SAM2), a general-purpose segmentation model designed to accept prompts and retain memory across frames of a video. By treating slices from 3D medical volumes as individual video frames, we leverage SAM2’s advanced capabilities to generate motion- and spatially-aware predictions. We demonstrate that SAM2 can efficiently perform segmentation tasks in a zero-shot manner with no additional training or fine-tuning, accurately delineating structures in knee MRI scans using only a single prompt. Our experiments on the Osteoarthritis Initiative Zuse Institute Berlin (OAI-ZIB) dataset reveal that SAM2 achieves high accuracy on 3D knee bone segmentation, with a testing Dice similarity coefficient of 0.9643 on the tibia. We also present results generated using different SAM2 model sizes, different prompt schemes, as well as comparative results from the SAM1 model deployed on the same dataset. This breakthrough has the potential to revolutionize medical image analysis by providing a scalable, cost-effective solution for automated segmentation, paving the way for broader clinical applications and streamlined workflows.
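The reported Dice similarity coefficient is 2|A∩B| / (|A| + |B|) over foreground voxels. A minimal sketch on a toy two-slice volume (slices playing the role of video frames, as in the paper's setup):

```python
def dice(pred, gt):
    """Dice similarity coefficient between two binary volumes given
    as nested lists (slices -> rows -> 0/1 voxel values)."""
    inter = total = 0
    for ps, gs in zip(pred, gt):          # iterate slices ("frames")
        for pr, gr in zip(ps, gs):
            for p, g in zip(pr, gr):
                inter += p * g            # voxels positive in both
                total += p + g            # |A| + |B|
    return 2.0 * inter / total if total else 1.0

pred = [[[1, 1], [0, 0]], [[1, 0], [0, 0]]]  # 3 foreground voxels
gt   = [[[1, 1], [0, 0]], [[1, 1], [0, 0]]]  # 4 foreground voxels
score = dice(pred, gt)                        # 2*3 / (3 + 4)
```

A Dice of 0.9643 on the tibia therefore means the overlap is within a few percent of the combined foreground volume.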

[CV-47] Embodied Uncertainty-Aware Object Segmentation IROS2024

链接: https://arxiv.org/abs/2408.04760
作者: Xiaolin Fang,Leslie Pack Kaelbling,Tomás Lozano-Pérez
关键词-EN: introduce uncertainty-aware object, uncertainty-aware object instance, embodied interactive segmentation, introduce uncertainty-aware, usefulness for embodied
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: IROS 2024

点击查看摘要

Abstract:We introduce uncertainty-aware object instance segmentation (UncOS) and demonstrate its usefulness for embodied interactive segmentation. To deal with uncertainty in robot perception, we propose a method for generating a hypothesis distribution of object segmentation. We obtain a set of region-factored segmentation hypotheses together with confidence estimates by making multiple queries of large pre-trained models. This process can produce segmentation results that achieve state-of-the-art performance on unseen object segmentation problems. The output can also serve as input to a belief-driven process for selecting robot actions to perturb the scene to reduce ambiguity. We demonstrate the effectiveness of this method in real-robot experiments. Website: this https URL

[CV-48] Mitigating Hallucinations in Large Vision-Language Models (LVLMs) via Language-Contrastive Decoding (LCD)

链接: https://arxiv.org/abs/2408.04664
作者: Avshalom Manevich,Reut Tsarfaty
关键词-EN: Large Vision-Language Models, Large Language Models, Large Vision-Language, Large Language, expanding AI capabilities
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Large Vision-Language Models (LVLMs) are an extension of Large Language Models (LLMs) that facilitate processing both image and text inputs, expanding AI capabilities. However, LVLMs struggle with object hallucinations due to their reliance on text cues and learned object co-occurrence biases. While most research quantifies these hallucinations, mitigation strategies are still lacking. Our study introduces a Language Contrastive Decoding (LCD) algorithm that adjusts LVLM outputs based on LLM distribution confidence levels, effectively reducing object hallucinations. We demonstrate the advantages of LCD in leading LVLMs, showing up to 4% improvement in POPE F1 scores and up to 36% reduction in CHAIR scores on the COCO validation set, while also improving captioning quality scores. Our method effectively improves LVLMs without needing complex post-processing or retraining, and is easily applicable to different models. Our findings highlight the potential of further exploration of LVLM-specific decoding algorithms.
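The core contrastive adjustment can be sketched as subtracting (a scaled copy of) the text-only LLM's logits from the LVLM's logits before sampling, so tokens driven purely by language priors lose probability mass. The fixed `alpha` and the two-token vocabulary are illustrative assumptions; LCD itself scales the adjustment by the LLM distribution's confidence.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def contrastive_decode(lvlm_logits, llm_logits, alpha=1.0):
    """Down-weight tokens the text-only LLM already favors: the
    adjusted score is lvlm - alpha * llm, so hallucination-prone
    tokens supported only by co-occurrence priors are suppressed
    while image-grounded tokens survive."""
    adjusted = [v - alpha * l for v, l in zip(lvlm_logits, llm_logits)]
    return softmax(adjusted)

# Token 0 is favored by the language prior alone; token 1 is grounded
# in the image. The contrastive adjustment flips the ranking.
lvlm = [2.0, 1.8]
llm  = [2.0, 0.0]
probs = contrastive_decode(lvlm, llm)
```

This runs entirely at decoding time, which is why no retraining or post-processing of the LVLM is needed.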

[CV-49] Beyond the Eye: A Relational Model for Early Dementia Detection Using Retinal OCTA Images

链接: https://arxiv.org/abs/2408.05117
作者: Shouyue Liu,Jinkui Hao,Yonghuai Liu,Huazhu Fu,Xinyu Guo,Shuting Zhang,Yitian Zhao
关键词-EN: mild cognitive impairment, enable timely intervention, Alzheimer disease, cognitive impairment, mild cognitive
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Early detection of dementia, such as Alzheimer’s disease (AD) or mild cognitive impairment (MCI), is essential to enable timely intervention and potential treatment. Accurate detection of AD/MCI is challenging due to the high complexity, cost, and often invasive nature of current diagnostic techniques, which limit their suitability for large-scale population screening. Given the shared embryological origins and physiological characteristics of the retina and brain, retinal imaging is emerging as a potentially rapid and cost-effective alternative for the identification of individuals with or at high risk of AD. In this paper, we present a novel PolarNet+ that uses retinal optical coherence tomography angiography (OCTA) to discriminate early-onset AD (EOAD) and MCI subjects from controls. Our method first maps OCTA images from Cartesian coordinates to polar coordinates, allowing approximate sub-region calculation to implement the clinician-friendly early treatment of diabetic retinopathy study (ETDRS) grid analysis. We then introduce a multi-view module to serialize and analyze the images along three dimensions for comprehensive, clinically useful information extraction. Finally, we abstract the sequence embedding into a graph, transforming the detection task into a general graph classification problem. A regional relationship module is applied after the multi-view module to excavate the relationship between the sub-regions. Such regional relationship analyses validate known eye-brain links and reveal new discriminative patterns.
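The Cartesian-to-polar remapping that enables ring-wise sub-region (ETDRS-grid-like) analysis can be sketched with nearest-neighbor sampling: after the remap, each output row is one ring around the image center, so ring statistics become simple row slices. The grid sizes and interpolation choice are assumptions.

```python
import math

def to_polar(img, n_r, n_theta):
    """Resample a square image onto an (r, theta) grid centered at the
    image center using nearest-neighbor lookup; rows of the output are
    concentric rings, so ring-wise statistics reduce to row slices."""
    h = len(img)
    w = len(img[0])
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    rmax = min(cy, cx)
    out = []
    for ri in range(n_r):
        r = rmax * (ri + 0.5) / n_r        # ring radius (bin center)
        row = []
        for ti in range(n_theta):
            th = 2.0 * math.pi * ti / n_theta
            y = int(round(cy + r * math.sin(th)))
            x = int(round(cx + r * math.cos(th)))
            row.append(img[y][x])
        out.append(row)
    return out

# A ring-shaped image: center pixel 0, surrounding ring of 1s.
img = [[1, 1, 1],
       [1, 0, 1],
       [1, 1, 1]]
polar = to_polar(img, n_r=1, n_theta=4)   # the single ring samples the 1s
```

On OCTA images the same idea lets each ETDRS-like ring or sector be analyzed as a contiguous block of the polar image.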

[CV-50] Multi-dimensional Parameter Space Exploration for Streamline-specific Tractography MICCAI2024

链接: https://arxiv.org/abs/2408.05056
作者: Ruben Vink,Anna Vilanova,Maxime Chamberland
关键词-EN: dataset or bundle, parameter space, unspoken challenges, multi-dimensional parameter space, SSP
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at MICCAI 2024 International Workshop on Computational Diffusion MRI

点击查看摘要

Abstract:One of the unspoken challenges of tractography is choosing the right parameters for a given dataset or bundle. In order to tackle this challenge, we explore the multi-dimensional parameter space of tractography using streamline-specific parameters (SSP). We 1) validate a state-of-the-art probabilistic tracking method using per-streamline parameters on synthetic data, and 2) show how we can gain insights into the parameter space by focusing on streamline acceptance using real-world data. We demonstrate the potential added value of SSP to the current state of tractography by showing how SSP can be used to reveal patterns in the parameter space.

[CV-51] Integrating Edge Information into Ground Truth for the Segmentation of the Optic Disc and Cup from Fundus Images

链接: https://arxiv.org/abs/2408.05052
作者: Yoga Sri Varshan V,Hitesh Gupta Kattamuri,Subin Sahayam,Umarani Jayaraman
关键词-EN: Optic disc, Optic, optic disc-cup ground, ground truth, myocardial infarction
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Optic disc and cup segmentation helps in the diagnosis of glaucoma, myocardial infarction, and diabetic retinopathy. Most deep learning methods developed to perform segmentation tasks are built on top of a U-Net-based model architecture. Nevertheless, U-Net and its variants have a tendency to over-segment or under-segment the required regions of interest. Since the most important outcome is the value of the cup-to-disc ratio and not the segmented regions themselves, we are more concerned about the boundaries rather than the regions under the boundaries. This makes learning edges important as compared to learning the regions. In the proposed work, the authors aim to extract both edges of the optic disc and cup from the ground truth using a Laplacian filter. Next, edges are reconstructed to obtain an edge ground truth in addition to the optic disc-cup ground truth. Utilizing both ground truths, the authors study several U-Net and its variant architectures with and without optic disc and cup edges as target, along with the optic disc-cup ground truth for segmentation. The authors have used the REFUGE benchmark dataset and the Drishti-GS dataset to perform the study, and the results are tabulated for the dice and the Hausdorff distance metrics. In the case of the REFUGE dataset, the optic disc mean dice score has improved from 0.7425 to 0.8859 while the mean Hausdorff distance has reduced from 6.5810 to 3.0540 for the baseline U-Net model. Similarly, the optic cup mean dice score has improved from 0.6970 to 0.8639 while the mean Hausdorff distance has reduced from 5.2340 to 2.6323 for the same model. Similar improvement has been observed for the Drishti-GS dataset as well. Compared to the baseline U-Net and its variants (i.e., the Attention U-Net and the U-Net++), the models that learn integrated edges along with the optic disc and cup regions performed well in both validation and testing datasets.
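Extracting an edge ground truth from a binary segmentation mask with a Laplacian filter can be sketched as follows; the 4-neighbor kernel and nonzero-response rule are one common choice, not necessarily the authors' exact setup.

```python
def laplacian_edges(mask):
    """Edge ground truth from a binary segmentation mask: convolve
    with the 4-neighbor Laplacian kernel [[0,1,0],[1,-4,1],[0,1,0]]
    and mark pixels with a nonzero response; the response vanishes in
    constant regions and fires only on mask boundaries."""
    h, w = len(mask), len(mask[0])
    edges = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            resp = (mask[y - 1][x] + mask[y + 1][x] +
                    mask[y][x - 1] + mask[y][x + 1] - 4 * mask[y][x])
            edges[y][x] = 1 if resp != 0 else 0
    return edges

# A filled 3x3 blob inside a 5x5 mask: the interior pixel has a zero
# Laplacian response, while boundary pixels respond.
mask = [[0, 0, 0, 0, 0],
        [0, 1, 1, 1, 0],
        [0, 1, 1, 1, 0],
        [0, 1, 1, 1, 0],
        [0, 0, 0, 0, 0]]
edges = laplacian_edges(mask)
```

The resulting edge map can then serve as a second training target alongside the region masks, as the paper proposes.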

[CV-52] CROCODILE: Causality aids RObustness via COntrastive DIsentangled LEarning MICCAI2024

链接: https://arxiv.org/abs/2408.04949
作者: Gianluca Carloni,Sotirios A Tsaftaris,Sara Colantonio
关键词-EN: classifiers perform poorly, image classifiers perform, perform poorly, poorly when applied, classifiers perform
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: MICCAI 2024 UNSURE Workshop, Accepted for presentation, Submitted Manuscript Version, 10 pages

点击查看摘要

Abstract:Due to domain shift, deep learning image classifiers perform poorly when applied to a domain different from the training one. For instance, a classifier trained on chest X-ray (CXR) images from one hospital may not generalize to images from another hospital due to variations in scanner settings or patient characteristics. In this paper, we introduce our CROCODILE framework, showing how tools from causality can foster a model’s robustness to domain shift via feature disentanglement, contrastive learning losses, and the injection of prior knowledge. This way, the model relies less on spurious correlations, better learns the mechanism that maps images to predictions, and outperforms baselines on out-of-distribution (OOD) data. We apply our method to multi-label lung disease classification from CXRs, utilizing over 750,000 images from four datasets. Our bias-mitigation method improves domain generalization and fairness, broadening the applicability and reliability of deep learning models for a safer medical image analysis. Find our code at: this https URL.

[CV-53] Geo-UNet: A Geometrically Constrained Neural Framework for Clinical-Grade Lumen Segmentation in Intravascular Ultrasound MICCAI2024

链接: https://arxiv.org/abs/2408.04826
作者: Yiming Chen,Niharika S. D’Souza,Akshith Mandepally,Patrick Henninger,Satyananda Kashyap,Neerav Karani,Neel Dey,Marcos Zachary,Raed Rizq,Paul Chouinard,Polina Golland,Tanveer F. Syeda-Mahmood
关键词-EN: Precisely estimating lumen, deep vein thrombosis, sizing interventional stents, treat deep vein, estimating lumen boundaries
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted into the 15th workshop on Machine Learning in Medical Imaging at MICCAI 2024. (* indicates equal contribution)

点击查看摘要

Abstract:Precisely estimating lumen boundaries in intravascular ultrasound (IVUS) is needed for sizing interventional stents to treat deep vein thrombosis (DVT). Unfortunately, current segmentation networks like the UNet lack the precision needed for clinical adoption in IVUS workflows. This arises due to the difficulty of automatically learning accurate lumen contour from limited training data while accounting for the radial geometry of IVUS imaging. We propose the Geo-UNet framework to address these issues via a design informed by the geometry of the lumen contour segmentation task. We first convert the input data and segmentation targets from Cartesian to polar coordinates. Starting from a convUNet feature extractor, we propose a two-task setup, one for conventional pixel-wise labeling and the other for single boundary lumen-contour localization. We directly combine the two predictions by passing the predicted lumen contour through a new activation (named CDFeLU) to filter out spurious pixel-wise predictions. Our unified loss function carefully balances area-based, distance-based, and contour-based penalties to provide near clinical-grade generalization in unseen patient data. We also introduce a lightweight, inference-time technique to enhance segmentation smoothness. The efficacy of our framework on a venous IVUS dataset is shown against state-of-the-art models.

[CV-54] Improved Robustness for Deep Learning-based Segmentation of Multi-Center Myocardial Perfusion MRI Datasets Using Data Adaptive Uncertainty-guided Space-time Analysis

链接: https://arxiv.org/abs/2408.04805
作者: Dilek M. Yalcinkaya,Khalid Youssef,Bobak Heydari,Janet Wei,Noel Bairey Merz,Robert Judd,Rohan Dharmakumar,Orlando P. Simonetti,Jonathan W. Weinsaft,Subha V. Raman,Behzad Sharif
关键词-EN: MRI datasets enables, DAUGS analysis approach, proposed DAUGS analysis, perfusion MRI datasets, datasets
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Medical Physics (physics.med-ph)
*备注: Accepted for publication in JCMR, 2024

点击查看摘要

Abstract:Background. Fully automatic analysis of myocardial perfusion MRI datasets enables rapid and objective reporting of stress/rest studies in patients with suspected ischemic heart disease. Developing deep learning techniques that can analyze multi-center datasets despite limited training data and variations in software and hardware is an ongoing challenge. Methods. Datasets from 3 medical centers acquired at 3T (n = 150 subjects) were included: an internal dataset (inD; n = 95) and two external datasets (exDs; n = 55) used for evaluating the robustness of the trained deep neural network (DNN) models against differences in pulse sequence (exD-1) and scanner vendor (exD-2). A subset of inD (n = 85) was used for training/validation of a pool of DNNs for segmentation, all using the same spatiotemporal U-Net architecture and hyperparameters but with different parameter initializations. We employed a space-time sliding-patch analysis approach that automatically yields a pixel-wise “uncertainty map” as a byproduct of the segmentation process. In our approach, a given test case is segmented by all members of the DNN pool and the resulting uncertainty maps are leveraged to automatically select the “best” one among the pool of solutions. Results. The proposed DAUGS analysis approach performed similarly to the established approach on the internal dataset (p = n.s.) whereas it significantly outperformed on the external datasets (p < 0.005 for exD-1 and exD-2). Moreover, the number of image series with “failed” segmentation was significantly lower for the proposed vs. the established approach (4.3% vs. 17.1%, p < 0.0005). Conclusions. The proposed DAUGS analysis approach has the potential to improve the robustness of deep learning methods for segmentation of multi-center stress perfusion datasets with variations in the choice of pulse sequence, site location or scanner vendor.
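The pool-plus-uncertainty idea can be sketched as computing a pixel-wise disagreement map across the DNN pool and then picking the member that best matches the consensus. The selection rule below is one plausible reading of "leveraging the uncertainty maps", not the paper's exact criterion.

```python
def uncertainty_map(masks):
    """Pixel-wise disagreement across a pool of binary segmentations:
    0.0 where all models agree, approaching 0.5 where the vote splits."""
    n = len(masks)
    h, w = len(masks[0]), len(masks[0][0])
    umap = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            p = sum(m[y][x] for m in masks) / n
            umap[y][x] = 1.0 - max(p, 1.0 - p)  # 0 = consensus
    return umap

def select_best(masks):
    """Pick the pool member that agrees most with the pixel-wise
    majority vote, i.e. the solution least affected by the uncertain
    (high-disagreement) pixels."""
    n = len(masks)
    h, w = len(masks[0]), len(masks[0][0])
    maj = [[1 if sum(m[y][x] for m in masks) * 2 >= n else 0
            for x in range(w)] for y in range(h)]
    def agreement(m):
        return sum(m[y][x] == maj[y][x] for y in range(h) for x in range(w))
    return max(range(n), key=lambda i: agreement(masks[i]))

masks = [[[1, 1], [0, 0]],
         [[1, 1], [0, 0]],
         [[1, 0], [0, 1]]]   # the outlier pool member
umap = uncertainty_map(masks)
best = select_best(masks)
```

The disagreement map doubles as the "uncertainty map" byproduct the abstract mentions: it is produced for free while ensembling the pool.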

[CV-55] Deep Learning-based Unsupervised Domain Adaptation via a Unified Model for Prostate Lesion Detection Using Multisite Bi-parametric MRI Datasets

链接: https://arxiv.org/abs/2408.04777
作者: Hao Li,Han Liu,Heinrich von Busch,Robert Grimm,Henkjan Huisman,Angela Tong,David Winkel,Tobias Penzkofer,Ivan Shabunin,Moon Hyung Choi,Qingsong Yang,Dieter Szolar,Steven Shea,Fergus Coakley,Mukesh Harisinghani,Ipek Oguz,Dorin Comaniciu,Ali Kamen,Bin Lou
关键词-EN: supervised learning models, Prostate Imaging Reporting, offers a promising, promising and reliable, reliable strategy
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accept at Radiology: Artificial Intelligence. Journal reference and external DOI will be added once published

点击查看摘要

Abstract:Our hypothesis is that UDA using diffusion-weighted images, generated with a unified model, offers a promising and reliable strategy for enhancing the performance of supervised learning models in multi-site prostate lesion detection, especially when various b-values are present. This retrospective study included data from 5,150 patients (14,191 samples) collected across nine different imaging centers. A novel UDA method using a unified generative model was developed for multi-site PCa detection. This method translates diffusion-weighted imaging (DWI) acquisitions, including apparent diffusion coefficient (ADC) and individual DW images acquired using various b-values, to align with the style of images acquired using b-values recommended by Prostate Imaging Reporting and Data System (PI-RADS) guidelines. The generated ADC and DW images replace the original images for PCa detection. An independent set of 1,692 test cases (2,393 samples) was used for evaluation. The area under the receiver operating characteristic curve (AUC) was used as the primary metric, and statistical analysis was performed via bootstrapping. For all test cases, the AUC values for baseline SL and UDA methods were 0.73 and 0.79 (p < .001), respectively, for PI-RADS ≥ 3, and 0.77 and 0.80 (p < .001) for PI-RADS ≥ 4 PCa lesions. In the 361 test cases under the most unfavorable image acquisition setting, the AUC values for baseline SL and UDA were 0.49 and 0.76 (p < .001) for PI-RADS ≥ 3, and 0.50 and 0.77 (p < .001) for PI-RADS ≥ 4 PCa lesions. The results indicate that the proposed UDA with generated images improved the performance of SL methods in multi-site PCa lesion detection across datasets with various b-values, especially for images acquired with significant deviations from the PI-RADS recommended DWI protocol (e.g. with an extremely high b-value).

[CV-56] Segmentation of Mental Foramen in Orthopantomographs: A Deep Learning Approach

Link: https://arxiv.org/abs/2408.04763
Authors: Haider Raza, Mohsin Ali, Vishal Krishna Singh, Agustin Wahjuningrum, Rachel Sarig, Akhilanand Chaurasia
Keywords-EN: impacted tooth removal, Mental Foramen, Precise identification, cyst surgeries, tooth removal
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*Comments: 9 pages

Abstract:Precise identification and detection of the Mental Foramen are crucial in dentistry, impacting procedures such as impacted tooth removal, cyst surgeries, and implants. Accurately identifying this anatomical feature helps avoid post-surgery complications and improves patient outcomes. Moreover, this study aims to accelerate dental procedures, elevating patient care and healthcare efficiency in dentistry. This research used Deep Learning methods to accurately detect and segment the Mental Foramen from panoramic radiograph images. Two mask types, circular and square, were used during model training. Multiple segmentation models were employed to identify and segment the Mental Foramen, and their effectiveness was evaluated using diverse metrics. An in-house dataset comprising 1000 panoramic radiographs was created for this study. Our experiments demonstrated that the classical UNet model performed exceptionally well on the test data, achieving a Dice Coefficient of 0.79 and an Intersection over Union (IoU) of 0.67. Moreover, ResUNet++ and UNet Attention models showed competitive performance, with Dice scores of 0.675 and 0.676, and IoU values of 0.683 and 0.671, respectively. We also investigated transfer learning models with varied backbone architectures, finding LinkNet to produce the best outcomes. In conclusion, our research highlights the efficacy of the classical UNet model in accurately identifying and outlining the Mental Foramen in panoramic radiographs. While vital, this task is comparatively simpler than segmenting complex medical datasets such as brain tumours or skin cancer, given their diverse sizes and shapes. This research also holds value in optimizing dental practice, benefiting practitioners and patients.
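The Dice Coefficient and IoU reported above are the standard overlap metrics for binary segmentation masks. A minimal sketch of how they are computed (illustrative only, not the study's evaluation code):

```python
import numpy as np

def dice_iou(pred, target, eps=1e-7):
    """Dice coefficient and IoU for two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    dice = (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)
    iou = (inter + eps) / (union + eps)
    return dice, iou
```

Note that for binary masks the two metrics are deterministically related by Dice = 2·IoU / (1 + IoU), so on any single prediction Dice is at least as large as IoU.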

Machine Learning

[LG-0] Preserving Privacy in Large Language Models : A Survey on Current Threats and Solutions

Link: https://arxiv.org/abs/2408.05212
Authors: Michele Miranda, Elena Sofia Ruzzetti, Andrea Santilli, Fabio Massimo Zanzotto, Sébastien Bratières, Emanuele Rodolà
Keywords-EN: Large Language Models, Large Language, Language Models, represent a significant, artificial intelligence
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*Comments: GitHub repository: this https URL

Abstract:Large Language Models (LLMs) represent a significant advancement in artificial intelligence, finding applications across various domains. However, their reliance on massive internet-sourced datasets for training brings notable privacy issues, which are exacerbated in critical domains (e.g., healthcare). Moreover, certain application-specific scenarios may require fine-tuning these models on private data. This survey critically examines the privacy threats associated with LLMs, emphasizing the potential for these models to memorize and inadvertently reveal sensitive information. We explore current threats by reviewing privacy attacks on LLMs and propose comprehensive solutions for integrating privacy mechanisms throughout the entire learning pipeline. These solutions range from anonymizing training datasets to implementing differential privacy during training or inference and machine unlearning after training. Our comprehensive review of existing literature highlights ongoing challenges, available tools, and future directions for preserving privacy in LLMs. This work aims to guide the development of more secure and trustworthy AI systems by providing a thorough understanding of privacy preservation methods and their effectiveness in mitigating risks.

[LG-1] Cell Morphology-Guided Small Molecule Generation with GFlowNets

Link: https://arxiv.org/abs/2408.05196
Authors: Stephen Zhewen Lu, Ziqing Lu, Ehsan Hajiramezanali, Tommaso Biancalani, Yoshua Bengio, Gabriele Scalia, Michał Koziarski
Keywords-EN: including high-content imaging, High-content phenotypic screening, including high-content, high-content imaging, gained popularity
Categories: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
*Comments:

Abstract:High-content phenotypic screening, including high-content imaging (HCI), has gained popularity in the last few years for its ability to characterize novel therapeutics without prior knowledge of the protein target. When combined with deep learning techniques to predict and represent molecular-phenotype interactions, these advancements hold the potential to significantly accelerate and enhance drug discovery applications. This work focuses on the novel task of HCI-guided molecular design. Generative models for molecule design could be guided by HCI data, for example with a supervised model that links molecules to phenotypes of interest as a reward function. However, limited labeled data, combined with the high-dimensional readouts, can make training these methods challenging and impractical. We consider an alternative approach in which we leverage an unsupervised multimodal joint embedding to define a latent similarity as a reward for GFlowNets. The proposed model learns to generate new molecules that could produce phenotypic effects similar to those of the given image target, without relying on pre-annotated phenotypic labels. We demonstrate that the proposed method generates molecules with high morphological and structural similarity to the target, increasing the likelihood of similar biological activity, as confirmed by an independent oracle model.
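The key idea above is to replace a supervised reward with a latent similarity between a candidate molecule's embedding and the target image's embedding in a joint space. As a toy sketch (the embedding model and the exact reward transform are assumptions, not the paper's specification), a cosine-similarity reward shifted into [0, 1] could look like:

```python
import numpy as np

def latent_similarity_reward(z_mol, z_target, eps=1e-12):
    """Cosine similarity between a molecule embedding and an image-target
    embedding in a joint latent space, shifted to [0, 1] so it can serve
    as a nonnegative GFlowNet reward."""
    cos = float(z_mol @ z_target) / (
        np.linalg.norm(z_mol) * np.linalg.norm(z_target) + eps
    )
    return 0.5 * (cos + 1.0)
```

In a GFlowNet, this scalar would weight trajectories so that generation probability becomes proportional to the reward.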

[LG-2] HistoKernel: Whole Slide Image Level Maximum Mean Discrepancy Kernels for Pan-Cancer Predictive Modelling

Link: https://arxiv.org/abs/2408.05195
Authors: Piotr Keller, Muhammad Dawood, Brinder Singh Chohan, Fayyaz ul Amir Afsar Minhas
Keywords-EN: Slide Images, multi-gigapixel Whole Slide, computational pathology, scores for crucial, Machine learning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments: 28 pages, 5 figures, 1 table. Preprint for article in review at Nature Machine Intelligence

Abstract:Machine learning in computational pathology (CPath) often aggregates patch-level predictions from multi-gigapixel Whole Slide Images (WSIs) to generate WSI-level prediction scores for crucial tasks such as survival prediction and drug effect prediction. However, current methods do not explicitly characterize distributional differences between patch sets within WSIs. We introduce HistoKernel, a novel Maximum Mean Discrepancy (MMD) kernel that measures distributional similarity between WSIs for enhanced prediction performance on downstream prediction tasks. Our comprehensive analysis demonstrates HistoKernel’s effectiveness across various machine learning tasks, including retrieval (n = 9,362), drug sensitivity regression (n = 551), point mutation classification (n = 3,419), and survival analysis (n = 2,291), outperforming existing deep learning methods. Additionally, HistoKernel seamlessly integrates multi-modal data and offers a novel perturbation-based method for patch-level explainability. This work pioneers the use of kernel-based methods for WSI-level predictive modeling, opening new avenues for research. Code is available at this https URL.
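The quantity behind HistoKernel is the Maximum Mean Discrepancy between the patch-feature distributions of two slides. A minimal biased (V-statistic) estimator with an RBF kernel can be sketched as below; the feature dimension, `gamma`, and the final slide-level kernel construction are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def rbf_mmd2(X, Y, gamma=0.5):
    """Biased estimate of squared MMD between two sets of patch features
    (rows of X and Y) under the RBF kernel k(a, b) = exp(-gamma * ||a - b||^2)."""
    def gram(A, B):
        # Pairwise squared Euclidean distances, then the RBF kernel.
        d2 = (A * A).sum(1)[:, None] + (B * B).sum(1)[None, :] - 2.0 * A @ B.T
        return np.exp(-gamma * np.maximum(d2, 0.0))
    return gram(X, X).mean() + gram(Y, Y).mean() - 2.0 * gram(X, Y).mean()

# A WSI-level kernel entry could then be formed as exp(-mmd2 / (2 * sigma**2)).
```

Identical patch sets give MMD² = 0, and increasingly different feature distributions give larger values, which is what makes this usable as a slide-to-slide similarity.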

[LG-3] ECG-FM: An Open Electrocardiogram Foundation Model

Link: https://arxiv.org/abs/2408.05178
Authors: Kaden McKeen, Laura Oliva, Sameer Masood, Augustin Toma, Barry Rubin, Bo Wang
Keywords-EN: ubiquitous diagnostic test, diagnostic test, ubiquitous diagnostic, ECG analysis, ECG
Categories: Machine Learning (cs.LG)
*Comments: 22 pages, 7 figures, 10 tables

Abstract:The electrocardiogram (ECG) is a ubiquitous diagnostic test. Conventional task-specific ECG analysis models require large numbers of expensive ECG annotations or associated labels to train. Transfer learning techniques have been shown to improve generalization and reduce reliance on labeled data. We present ECG-FM, an open foundation model for ECG analysis, and conduct a comprehensive study performed on a dataset of 1.66 million ECGs sourced from both publicly available and private institutional sources. ECG-FM adopts a transformer-based architecture and is pretrained on 2.5 million samples using ECG-specific augmentations and contrastive learning, as well as a continuous signal masking objective. Our transparent evaluation includes a diverse range of downstream tasks, where we predict ECG interpretation labels, reduced left ventricular ejection fraction, and abnormal cardiac troponin. Affirming ECG-FM’s effectiveness as a foundation model, we demonstrate how its command of contextual information results in strong performance, rich pretrained embeddings, and reliable interpretability. Due to a lack of open-weight practices, we highlight how ECG analysis is lagging behind other medical machine learning subfields in terms of foundation model adoption. Our code is available at this https URL.

[LG-4] Beyond Closure Models: Learning Chaotic-Systems via Physics-Informed Neural Operators

Link: https://arxiv.org/abs/2408.05177
Authors: Chuwei Wang, Julius Berner, Zongyi Li, Di Zhou, Jiayun Wang, Jane Bae, Anima Anandkumar
Keywords-EN: Accurately predicting, chaotic systems, closure model, closure, behavior of chaotic
Categories: Machine Learning (cs.LG)
*Comments:

Abstract:Accurately predicting the long-term behavior of chaotic systems is crucial for various applications such as climate modeling. However, achieving such predictions typically requires iterative computations over a dense spatiotemporal grid to account for the unstable nature of chaotic systems, which is expensive and impractical in many real-world situations. An alternative to such a fully-resolved simulation is using a coarse grid and then correcting its errors through a *closure model*, which approximates the overall information from fine scales not captured in the coarse-grid simulation. Recently, ML approaches have been used for closure modeling, but they typically require a large number of training samples from expensive fully-resolved simulations (FRS). In this work, we prove an even more fundamental limitation: the standard approach to learning closure models suffers from a large approximation error for generic problems, no matter how large the model is, and the error stems from the non-uniqueness of the mapping. We propose an alternative end-to-end learning approach using a physics-informed neural operator (PINO) that overcomes this limitation by not using a closure model or a coarse-grid solver. We first train the PINO model on data from a coarse-grid solver and then fine-tune it with (a small amount of) FRS and physics-based losses on a fine grid. The discretization-free nature of neural operators means that they do not suffer from the restriction of a coarse grid that closure models face, and they can provably approximate the long-term statistics of chaotic systems. In our experiments, our PINO model achieves a 120x speedup compared to FRS with a relative error of ~5%. In contrast, the closure model coupled with a coarse-grid solver is 58x slower than PINO while having a much higher error of ~205% when the closure model is trained on the same FRS dataset.

[LG-5] Federated Hypergraph Learning with Hyperedge Completion

Link: https://arxiv.org/abs/2408.05160
Authors: Linfeng Luo, Fengxiao Tang, Xiyu Liu, Zhiqi Guo, Zihao Qiu, Ming Zhao
Keywords-EN: neural networks enhance, networks enhance conventional, capturing high-order relationships, Hypergraph neural networks, neural networks
Categories: Machine Learning (cs.LG)
*Comments:

Abstract:Hypergraph neural networks enhance conventional graph neural networks by capturing high-order relationships among nodes, which proves vital in data-rich environments where interactions are not merely pairwise. As data complexity and interconnectivity grow, it is common for graph-structured data to be split and stored in a distributed manner, underscoring the necessity of federated learning on subgraphs. In this work, we propose FedHGN, a novel algorithm for federated hypergraph learning. Our algorithm utilizes subgraphs of a hypergraph stored on distributed devices to train local HGNN models in a federated manner: an effective global HGNN model is developed collaboratively by sharing model parameters while preserving client privacy. Additionally, considering that hyperedges may span multiple clients, a pre-training step is employed before the training process in which cross-client hyperedge feature gathering is performed at the central server. In this way, the missing cross-client information can be supplemented from the central server during the node feature aggregation phase. Experimental results on seven real-world datasets confirm the effectiveness of our approach and demonstrate its performance advantages over traditional federated graph learning methods.
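The collaborative step above, building a global model by sharing parameters across clients, follows the usual federated-averaging pattern. A generic sketch of that aggregation step (not FedHGN's actual implementation; weighting by client sample count is an assumption):

```python
import numpy as np

def fed_avg(client_params, client_sizes):
    """Sample-size-weighted average of per-client parameter lists.

    client_params: list of clients, each a list of per-layer numpy arrays.
    client_sizes:  number of training samples held by each client.
    """
    total = sum(client_sizes)
    n_layers = len(client_params[0])
    return [
        sum(w * p[layer] for w, p in zip(client_sizes, client_params)) / total
        for layer in range(n_layers)
    ]
```

The server would run this once per round on the uploaded weights and broadcast the result back, so raw node features never leave the clients.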

[LG-6] Meta-Learning Guided Label Noise Distillation for Robust Signal Modulation Classification

Link: https://arxiv.org/abs/2408.05151
Authors: Xiaoyang Hao, Zhixi Feng, Tongqing Peng, Shuyuan Yang
Keywords-EN: Automatic modulation classification, physical layer threats, Automatic modulation, modulation classification, internet of things
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
*Comments: 8 pages, 7 figures

Abstract:Automatic modulation classification (AMC) is an effective way to deal with physical layer threats of the internet of things (IoT). However, there is often label mislabeling in practice, which significantly impacts the performance and robustness of deep neural networks (DNNs). In this paper, we propose a meta-learning guided label noise distillation method for robust AMC. Specifically, a teacher-student heterogeneous network (TSHN) framework is proposed to distill and reuse label noise. Based on the idea that labels are representations, the teacher network with trusted meta-learning divides and conquers untrusted label samples and then guides the student network to learn better by reassessing and correcting labels. Furthermore, we propose a multi-view signal (MVS) method to further improve the performance of hard-to-classify categories with few-shot trusted label samples. Extensive experimental results show that our methods can significantly improve the performance and robustness of signal AMC in various and complex label noise scenarios, which is crucial for securing IoT applications.

[LG-7] Impacts of floating-point non-associativity on reproducibility for HPC and deep learning applications

Link: https://arxiv.org/abs/2408.05148
Authors: Sanjif Shanmugavelu, Mathieu Taillefumier, Christopher Culver, Oscar Hernandez, Mark Coletti, Ada Sedova
Keywords-EN: parallel programs caused, significantly affect reproducibility, floating-point non-associativity, iterative algorithms, due to accumulating
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*Comments:

Abstract:Run-by-run variability in parallel programs caused by floating-point non-associativity (FPNA) has been known to significantly affect reproducibility in iterative algorithms, due to accumulating errors. Non-reproducibility negatively affects the efficiency and effectiveness of correctness testing for stochastic programs. Recently, the sensitivity of deep learning (DL) training and inference pipelines to FPNA has been found to be extreme, and can prevent certification for commercial applications, accurate assessment of robustness and sensitivity, and bug detection. New approaches in scientific computing applications have coupled DL models with high-performance computing (HPC) simulations, leading to an aggravation of debugging and testing challenges. Here we perform an investigation of the statistical properties of FPNA within modern parallel programming models, analyze the performance and productivity impacts of replacing atomic operations with deterministic alternatives on GPUs, and examine the recently-added deterministic options of the PyTorch framework in the context of GPU deployment, uncovering and quantifying the impacts of input parameters triggering run-by-run variability and reporting on the reliability and completeness of the documentation. Finally, we evaluate the strategy of exploiting automatic determinism provided by deterministic hardware, using the Groq LPU™ accelerator for inference portions of the DL pipeline. We demonstrate the benefits that this strategy can provide within reproducibility and correctness efforts.
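The core effect studied above, floating-point non-associativity, is easy to reproduce: rounding makes the result of a sum depend on the order of operations, which is exactly what varies between runs of an atomic-based parallel reduction. A minimal demonstration:

```python
# Floating-point addition is not associative: rounding depends on
# the order in which operands are combined.
a, b, c = 1e16, -1e16, 1.0

left = (a + b) + c   # 0.0 + 1.0 -> 1.0
right = a + (b + c)  # -1e16 + 1.0 rounds back to -1e16, so the 1.0 is lost -> 0.0

print(left, right)   # 1.0 0.0
assert left != right

# A parallel reduction with atomics commits additions in a nondeterministic
# order, i.e. a nondeterministic parenthesisation of the same sum, which is
# why such reductions show run-by-run variability in the final value.
```

This is also why the deterministic alternatives studied in the paper fix the reduction order (for example via fixed-tree reductions) rather than relying on hardware atomics.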

[LG-8] Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2

Link: https://arxiv.org/abs/2408.05147
Authors: Tom Lieberum, Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Nicolas Sonnerat, Vikrant Varma, János Kramár, Anca Dragan, Rohin Shah, Neel Nanda
Keywords-EN: seemingly interpretable features, neural network latent, network latent representations, Sparse autoencoders, sparse decomposition
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*Comments: 12 main text pages, and 14 pages of acknowledgements, references and appendices

Abstract:Sparse autoencoders (SAEs) are an unsupervised method for learning a sparse decomposition of a neural network’s latent representations into seemingly interpretable features. Despite recent excitement about their potential, research applications outside of industry are limited by the high cost of training a comprehensive suite of SAEs. In this work, we introduce Gemma Scope, an open suite of JumpReLU SAEs trained on all layers and sub-layers of Gemma 2 2B and 9B and select layers of Gemma 2 27B base models. We primarily train SAEs on the Gemma 2 pre-trained models, but additionally release SAEs trained on instruction-tuned Gemma 2 9B for comparison. We evaluate the quality of each SAE on standard metrics and release these results. We hope that by releasing these SAE weights, we can help make more ambitious safety and interpretability research easier for the community. Weights and a tutorial can be found at this https URL and an interactive demo can be found at this https URL
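A JumpReLU SAE differs from a standard ReLU SAE in its activation: a pre-activation passes through unchanged only if it exceeds a learned per-feature threshold, giving sparse codes. A toy forward pass with random weights (the sizes, initialisation, and thresholds below are illustrative, not Gemma Scope's):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 8, 32  # toy sizes; real SAEs are orders of magnitude wider

# Randomly initialised parameters, for illustration only.
W_enc = rng.normal(0.0, 0.1, (d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(0.0, 0.1, (d_sae, d_model))
b_dec = np.zeros(d_model)
theta = np.full(d_sae, 0.05)  # per-feature JumpReLU thresholds (learned in practice)

def sae_forward(x):
    """Encode an activation vector into sparse features and reconstruct it."""
    pre = x @ W_enc + b_enc
    # JumpReLU: keep the pre-activation only where it exceeds its threshold.
    f = np.where(pre > theta, pre, 0.0)
    recon = f @ W_dec + b_dec
    return f, recon

x = rng.normal(0.0, 1.0, d_model)  # stand-in for a residual-stream activation
f, recon = sae_forward(x)
```

Training balances reconstruction error against the number of active features, with the thresholds trained through the discontinuity via straight-through-style estimators.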

[LG-9] Performative Prediction on Games and Mechanism Design ICML2024

Link: https://arxiv.org/abs/2408.05146
Authors: António Góis, Mehrnaz Mofakhami, Fernando P. Santos, Simon Lacoste-Julien, Gauthier Gidel
Keywords-EN: aim to predict, influence the reality, Abstract, predict, performativity
Categories: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
*Comments: Accepted to ICML 2024 Workshop on Agentic Markets, Vienna, Austria

Abstract:Predictions often influence the reality which they aim to predict, an effect known as performativity. Existing work focuses on accuracy maximization under this effect, but model deployment may have important unintended impacts, especially in multiagent scenarios. In this work, we investigate performative prediction in a concrete game-theoretic setting where social welfare is an alternative objective to accuracy maximization. We explore a collective risk dilemma scenario where maximising accuracy can negatively impact social welfare, when predicting collective behaviours. By assuming knowledge of a Bayesian agent behavior model, we then show how to achieve better trade-offs and use them for mechanism design.

[LG-10] Cycle-Configuration: A Novel Graph-theoretic Descriptor Set for Molecular Inference

Link: https://arxiv.org/abs/2408.05136
Authors: Bowen Song, Jianshen Zhu, Naveed Ahmed Azam, Kazuya Haraguchi, Liang Zhao, Tatsuya Akutsu
Keywords-EN: integer linear programming, mixed integer linear, inference framework based, molecular inference framework, named cycle-configuration
Categories: Machine Learning (cs.LG)
*Comments:

Abstract:In this paper, we propose a novel family of descriptors of chemical graphs, named cycle-configuration (CC), that can be used in the standard “two-layered (2L) model” of mol-infer, a molecular inference framework based on mixed integer linear programming (MILP) and machine learning (ML). Proposed descriptors capture the notion of ortho/meta/para patterns that appear in aromatic rings, which has been impossible in the framework so far. Computational experiments show that, when the new descriptors are supplied, we can construct prediction functions of similar or better performance for all of the 27 tested chemical properties. We also provide an MILP formulation that asks for a chemical graph with desired properties under the 2L model with CC descriptors (2L+CC model). We show that a chemical graph with up to 50 non-hydrogen vertices can be inferred in a practical time.

[LG-11] Range Membership Inference Attacks

Link: https://arxiv.org/abs/2408.05131
Authors: Jiashu Tao, Reza Shokri
Keywords-EN: leak private information, membership inference attacks, measure this risk, major limitation, private information
Categories: Machine Learning (cs.LG)
*Comments:

Abstract:Machine learning models can leak private information about their training data, but the standard methods to measure this risk, based on membership inference attacks (MIAs), have a major limitation. They only check if a given data point *exactly* matches a training point, neglecting the potential of similar or partially overlapping data revealing the same private information. To address this issue, we introduce the class of range membership inference attacks (RaMIAs), testing if the model was trained on any data in a specified range (defined based on the semantics of privacy). We formulate the RaMIAs game and design a principled statistical test for its complex hypotheses. We show that RaMIAs can capture privacy loss more accurately and comprehensively than MIAs on various types of data, such as tabular, image, and language. RaMIA paves the way for a more comprehensive and meaningful privacy auditing of machine learning algorithms.
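A standard MIA thresholds the model's loss at the exact query point; the range variant asks whether *any* point in a neighbourhood looks memorised. A rough sketch of that idea (random L2-ball probing and min-loss aggregation are illustrative choices, not the paper's statistical test):

```python
import numpy as np

def range_mia_score(loss_fn, x, radius, n_probe=64, seed=0):
    """Minimum model loss over random probes in an L2 ball around x.

    A low minimum suggests the model may have memorised *some* point in
    the range, which is the intuition behind range membership inference.
    """
    rng = np.random.default_rng(seed)
    best = loss_fn(x)  # include the query point itself
    for _ in range(n_probe):
        direction = rng.normal(size=x.shape)
        direction /= np.linalg.norm(direction) + 1e-12
        probe = x + direction * radius * rng.random()  # radius drawn from [0, radius)
        best = min(best, loss_fn(probe))
    return best
```

An auditor would then compare this score against a threshold calibrated on reference (non-member) ranges, exactly as loss thresholds are calibrated for point-wise MIAs.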

[LG-12] Cautious Calibration in Binary Classification ECAI2024

Link: https://arxiv.org/abs/2408.05120
Authors: Mari-Liis Allikivi, Joonas Järve, Meelis Kull
Keywords-EN: machine learning systems, learning systems integrated, crucial for enhancing, enhancing the trustworthiness, trustworthiness of machine
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments: Accepted to ECAI 2024

Abstract:Being cautious is crucial for enhancing the trustworthiness of machine learning systems integrated into decision-making pipelines. Although calibrated probabilities help in optimal decision-making, perfect calibration remains unattainable, leading to estimates that fluctuate between under- and overconfidence. This becomes a critical issue in high-risk scenarios, where even occasional overestimation can lead to extreme expected costs. In these scenarios, it is important for each predicted probability to lean towards underconfidence, rather than just achieving an average balance. In this study, we introduce the novel concept of cautious calibration in binary classification. This approach aims to produce probability estimates that are intentionally underconfident for each predicted probability. We highlight the importance of this approach in a high-risk scenario and propose a theoretically grounded method for learning cautious calibration maps. Through experiments, we explore and compare our method to various approaches, including methods originally not devised for cautious calibration but applicable in this context. We show that our approach is the most consistent in providing cautious estimates. Our work establishes a strong baseline for further developments in this novel framework.

[LG-13] Semantic Successive Refinement: A Generative AI-aided Semantic Communication Framework

Link: https://arxiv.org/abs/2408.05112
Authors: Kexin Zhang, Lixin Li, Wensheng Lin, Yuna Yan, Rui Li, Wenchi Cheng, Zhu Han
Keywords-EN: emerging technology aiming, Shannon limit, surpass the Shannon, Semantic Communication, emerging technology
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
*Comments:

Abstract:Semantic Communication (SC) is an emerging technology aiming to surpass the Shannon limit. Traditional SC strategies often minimize signal distortion between the original and reconstructed data, neglecting perceptual quality, especially in low Signal-to-Noise Ratio (SNR) environments. To address this issue, we introduce a novel Generative AI Semantic Communication (GSC) system for single-user scenarios. This system leverages deep generative models to establish a new paradigm in SC. Specifically, at the transmitter end, it employs a joint source-channel coding mechanism based on the Swin Transformer for efficient semantic feature extraction and compression. At the receiver end, an advanced Diffusion Model (DM) reconstructs high-quality images from degraded signals, enhancing perceptual details. Additionally, we present a Multi-User Generative Semantic Communication (MU-GSC) system utilizing an asynchronous processing model. This model effectively manages multiple user requests and optimally utilizes system resources for parallel processing. Simulation results on public datasets demonstrate that our generative AI semantic communication systems achieve superior transmission efficiency and enhanced communication content quality across various channel conditions. Compared to CNN-based DeepJSCC, our methods improve the Peak Signal-to-Noise Ratio (PSNR) by 17.75% in Additive White Gaussian Noise (AWGN) channels and by 20.86% in Rayleigh channels.
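The reported gains are in PSNR, which is derived directly from the mean squared error between reference and reconstructed images. For reference, a minimal implementation:

```python
import numpy as np

def psnr(ref, test, max_val=255.0):
    """Peak Signal-to-Noise Ratio in dB between two equal-shape images.

    max_val is the peak of the pixel range (255 for 8-bit images).
    """
    ref = np.asarray(ref, dtype=np.float64)
    test = np.asarray(test, dtype=np.float64)
    mse = np.mean((ref - test) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val**2 / mse)
```

Because PSNR is purely distortion-based, it is exactly the kind of metric the abstract contrasts with perceptual quality, which diffusion-based reconstruction targets.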

[LG-14] Application of Unsupervised Artificial Neural Network (ANN) Self_Organizing Map (SOM) in Identifying Main Car Sales Factors

Link: https://arxiv.org/abs/2408.05110
Authors: Mazyar Taghavi
Keywords-EN: consumer tastes, attract customers, Factors, Iranian customer buying, important factors
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*Comments:

Abstract:The factors that attract customers and persuade them to buy a new car vary with consumer tastes, and several methods exist for extracting such patterns from mass data. In this study, we first asked passenger-car marketing experts to rank the most important factors affecting customer decision-making behavior using the fuzzy Delphi technique. We then built a sample set from questionnaires and applied an unsupervised artificial neural network method, the self-organizing map (SOM), to find out which factors have the greatest effect on Iranian customers' buying decisions. Fuzzy tools were applied to make the study more realistic, and MATLAB software was used for developing and training the network. The results identify four factors as more important than the others, and these differ somewhat from the marketing experts' rankings. Such results would help manufacturers focus on the more important factors and increase company sales.

[LG-15] AI-driven Java Performance Testing: Balancing Result Quality with Testing Time

Link: https://arxiv.org/abs/2408.05100
Authors: Luca Traini, Federico Di Menna, Vittorio Cortellessa
Keywords-EN: uncovering efficiency issues, Performance testing aims, warm-up phase, aims at uncovering, uncovering efficiency
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Performance (cs.PF)
*Comments: Accepted for publication in The 39th IEEE/ACM International Conference on Automated Software Engineering (ASE '24)

Abstract:Performance testing aims at uncovering efficiency issues of software systems. In order to be both effective and practical, the design of a performance test must achieve a reasonable trade-off between result quality and testing time. This becomes particularly challenging in the Java context, where the software undergoes a warm-up phase of execution, due to just-in-time compilation. During this phase, performance measurements are subject to severe fluctuations, which may adversely affect the quality of performance test results. However, existing approaches often provide suboptimal estimates of the warm-up phase, resulting in either insufficient or excessive warm-up iterations, which may degrade result quality or increase testing time, and there is still a lack of consensus on how to properly address this problem. Here, we propose and study an AI-based framework to dynamically halt warm-up iterations at runtime. Specifically, our framework leverages recent advances in AI for Time Series Classification (TSC) to predict the end of the warm-up phase during test execution. We conduct experiments by training three different TSC models on half a million measurement segments obtained from JMH microbenchmark executions. We find that our framework significantly improves the accuracy of the warm-up estimates provided by state-of-practice and state-of-the-art methods. This higher estimation accuracy results in a net improvement in either result quality or testing time for up to +35.3% of the microbenchmarks. Our study highlights that integrating AI to dynamically estimate the end of the warm-up phase can enhance the cost-effectiveness of Java performance testing.

[LG-16] Hyperbolic Learning with Multimodal Large Language Models ECCV2024

Link: https://arxiv.org/abs/2408.05097
Authors: Paolo Mandica, Luca Franco, Konstantinos Kallidromitis, Suzanne Petryk, Fabio Galasso
Keywords-EN: including image segmentation, deep-learning tasks, including image, active learning, demonstrated their effectiveness
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments: ECCV 2024 - Beyond Euclidean Workshop

Abstract:Hyperbolic embeddings have demonstrated their effectiveness in capturing measures of uncertainty and hierarchical relationships across various deep-learning tasks, including image segmentation and active learning. However, their application in modern vision-language models (VLMs) has been limited. A notable exception is MERU, which leverages the hierarchical properties of hyperbolic space in the CLIP ViT-large model, consisting of hundreds of millions of parameters. In our work, we address the challenges of scaling multi-modal hyperbolic models by orders of magnitude in terms of parameters (billions) and training complexity using the BLIP-2 architecture. Although hyperbolic embeddings offer potential insights into uncertainty not present in Euclidean embeddings, our analysis reveals that scaling these models is particularly difficult. We propose a novel training strategy for a hyperbolic version of BLIP-2, which achieves performance comparable to its Euclidean counterpart while maintaining stability throughout the training process and showing a meaningful indication of uncertainty with each embedding.
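Hyperbolic embeddings of this kind typically live in the Poincaré ball, where geodesic distance grows rapidly toward the boundary, which is what lets the geometry encode hierarchy and uncertainty. The standard distance formula (generic, not specific to the BLIP-2 variant above):

```python
import numpy as np

def poincare_distance(u, v, eps=1e-9):
    """Geodesic distance between two points of the Poincare ball (||x|| < 1)."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    nu2 = (u * u).sum()          # squared norm of u
    nv2 = (v * v).sum()          # squared norm of v
    duv2 = ((u - v) ** 2).sum()  # squared Euclidean distance
    # eps guards against division by zero for points near the boundary.
    x = 1.0 + 2.0 * duv2 / ((1.0 - nu2) * (1.0 - nv2) + eps)
    return float(np.arccosh(x))
```

Points near the origin behave almost Euclidean, while points pushed toward the boundary become exponentially far from everything, a property often used to place generic concepts centrally and specific ones peripherally.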

[LG-17] PriPHiT: Privacy-Preserving Hierarchical Training of Deep Neural Networks

Link: https://arxiv.org/abs/2408.05092
Authors: Yamin Sepehri, Pedram Pad, Pascal Frossard, L. Andrea Dunbar
Keywords-EN: neural networks requires, networks requires substantial, requires substantial resources, deep neural networks, training phase
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*Comments: 16 pages, 16 figures, 6 tables

Abstract:The training phase of deep neural networks requires substantial resources and as such is often performed on cloud servers. However, this raises privacy concerns when the training dataset contains sensitive content, e.g., face images. In this work, we propose a method to perform the training phase of a deep learning model on both an edge device and a cloud server that prevents sensitive content being transmitted to the cloud while retaining the desired information. The proposed privacy-preserving method uses adversarial early exits to suppress the sensitive content at the edge and transmits the task-relevant information to the cloud. This approach incorporates noise addition during the training phase to provide a differential privacy guarantee. We extensively test our method on different facial datasets with diverse face attributes using various deep learning architectures, showcasing its outstanding performance. We also demonstrate the effectiveness of privacy preservation through successful defenses against different white-box and deep reconstruction attacks.

[LG-18] Bootstrap Latents of Nodes and Neighbors for Graph Self-Supervised Learning ECML PKDD 2024

链接: https://arxiv.org/abs/2408.05087
作者: Yunhui Liu,Huaisong Zhang,Tieke He,Tao Zheng,Jianhua Zhao
关键词-EN: Contrastive learning, significant paradigm, positive pairs, positive, negative samples
类目: Machine Learning (cs.LG)
*备注: Accepted by ECML PKDD 2024

点击查看摘要

Abstract:Contrastive learning is a significant paradigm in graph self-supervised learning. However, it requires negative samples to prevent model collapse and learn discriminative representations. These negative samples inevitably lead to heavy computation, memory overhead, and class collision, compromising the representation learning. Recent studies show that methods obviating negative samples can attain competitive performance and scalability enhancements, exemplified by bootstrapped graph latents (BGRL). However, BGRL neglects the inherent graph homophily, which provides valuable insights into underlying positive pairs. Our motivation arises from the observation that subtly introducing a few ground-truth positive pairs significantly improves BGRL. Although we cannot obtain ground-truth positive pairs without labels under the self-supervised setting, edges in the graph can reflect noisy positive pairs, i.e., neighboring nodes often share the same label. Therefore, we propose to expand the positive pair set with node-neighbor pairs. Subsequently, we introduce a cross-attention module to predict the supportiveness score of a neighbor with respect to the anchor node. This score quantifies the positive support from each neighboring node, and is encoded into the training objective. Consequently, our method mitigates class collision from negative and noisy positive samples, concurrently enhancing intra-class compactness. Extensive experiments are conducted on five benchmark datasets and three downstream tasks: node classification, node clustering, and node similarity search. The results demonstrate that our method generates node representations with enhanced intra-class compactness and achieves state-of-the-art performance.

[LG-19] Generalizing Few Data to Unseen Domains Flexibly Based on Label Smoothing Integrated with Distributionally Robust Optimization

链接: https://arxiv.org/abs/2408.05082
作者: Yangdi Wang,Zhi-Hai Zhang,Su Xiu Xu,Wenming Guo
关键词-EN: deep neural networks, applying deep neural, Overfitting commonly occurs, existing data, neural networks
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Overfitting commonly occurs when applying deep neural networks (DNNs) on small-scale datasets, where DNNs do not generalize well from existing data to unseen data. The main reason for overfitting is that small-scale datasets cannot reflect the situations of the real world. Label smoothing (LS) is an effective regularization method that prevents overfitting by mixing one-hot labels with uniform label vectors. However, LS only focuses on labels while ignoring the distribution of existing data. In this paper, we introduce distributionally robust optimization (DRO) into LS, enabling the existing data distribution to be shifted flexibly toward unseen domains when training DNNs. Specifically, we prove that the regularization of LS can be extended to a regularization term for the DNN parameters when integrating DRO. The regularization term can be utilized to shift existing data to unseen domains and generate new data. Furthermore, we propose an approximate gradient-iteration label smoothing algorithm (GI-LS) to implement these findings and train DNNs. We prove that shifting the existing data does not affect the convergence of GI-LS. Since GI-LS incorporates a series of hyperparameters, we further consider using Bayesian optimization (BO) to find the relatively optimal combinations of these hyperparameters. Taking small-scale anomaly classification tasks as a case, we evaluate GI-LS, and the results clearly demonstrate its superior performance.
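
A minimal sketch of the label-smoothing operation the abstract builds on, i.e., mixing one-hot labels with a uniform vector; the DRO extension and the GI-LS algorithm itself are not reproduced here, and `eps` is an illustrative hyperparameter:

```python
import numpy as np

def smooth_labels(one_hot: np.ndarray, eps: float = 0.1) -> np.ndarray:
    """Mix one-hot labels with a uniform vector: (1 - eps) * y + eps / K."""
    num_classes = one_hot.shape[-1]
    return (1.0 - eps) * one_hot + eps / num_classes

# 3-class one-hot label for class 0; mass is spread to the other classes
y_smooth = smooth_labels(np.array([1.0, 0.0, 0.0]), eps=0.1)
```

The smoothed vector still sums to 1, which is what lets the paper reinterpret LS as a regularization term on the DNN parameters.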

[LG-20] Masked adversarial neural network for cell type deconvolution in spatial transcriptomics

链接: https://arxiv.org/abs/2408.05065
作者: Lin Huang,Xiaofei Liu,Shunfang Wang,Wenwen Min
关键词-EN: identifying disease targets, Accurately determining cell, disease targets, Accurately determining, composition in disease-relevant
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurately determining cell type composition in disease-relevant tissues is crucial for identifying disease targets. Most existing spatial transcriptomics (ST) technologies cannot achieve single-cell resolution, making it challenging to accurately determine cell types. To address this issue, various deconvolution methods have been developed. Most of these methods use single-cell RNA sequencing (scRNA-seq) data from the same tissue as a reference to infer cell types in ST data spots. However, they often overlook the differences between scRNA-seq and ST data. To overcome this limitation, we propose a Masked Adversarial Neural Network (MACD). MACD employs adversarial learning to align real ST data with simulated ST data generated from scRNA-seq data. By mapping them into a unified latent space, it can minimize the differences between the two types of data. Additionally, MACD uses masking techniques to effectively learn the features of real ST data and mitigate noise. We evaluated MACD on 32 simulated datasets and 2 real datasets, demonstrating its accuracy in performing cell type deconvolution. All code and public datasets used in this paper are available at this https URL and this https URL.

[LG-21] GLEAMS: Bridging the Gap Between Local and Global Explanations

链接: https://arxiv.org/abs/2408.05060
作者: Giorgio Visani,Vincenzo Stanzione,Damien Garreau
关键词-EN: machine learning algorithms, algorithms is crucial, emerged recently, explainability of machine, machine learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The explainability of machine learning algorithms is crucial, and numerous methods have emerged recently. Local, post-hoc methods assign an attribution score to each feature, indicating its importance for the prediction. However, these methods require recalculating explanations for each example. On the other side, while there exist global approaches they often produce explanations that are either overly simplistic and unreliable or excessively complex. To bridge this gap, we propose GLEAMS, a novel method that partitions the input space and learns an interpretable model within each sub-region, thereby providing both faithful local and global surrogates. We demonstrate GLEAMS’ effectiveness on both synthetic and real-world data, highlighting its desirable properties and human-understandable insights.
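
The partition-then-fit idea can be illustrated with a toy one-dimensional sketch; this is not the GLEAMS algorithm itself (the split point here is fixed rather than learned, and real inputs are high-dimensional):

```python
import numpy as np

def fit_partitioned_surrogate(x, y, split):
    """Fit one linear surrogate per region of a 1-D input space split at `split`."""
    models = {}
    for name, mask in [("left", x < split), ("right", x >= split)]:
        models[name] = np.polyfit(x[mask], y[mask], 1)  # (slope, intercept)
    return models

def local_attribution(models, x0, split):
    """The region's slope acts as a local, LIME-style attribution score."""
    slope, _ = models["left"] if x0 < split else models["right"]
    return slope

x = np.linspace(-2.0, 2.0, 200)
y = np.abs(x)                  # "black box": piecewise linear in x
models = fit_partitioned_surrogate(x, y, split=0.0)
```

Within each sub-region the surrogate is exact here, so the same fitted models serve both as a global summary and as local explanations.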

[LG-22] Graph Neural Networks as Ordering Heuristics for Parallel Graph Coloring

链接: https://arxiv.org/abs/2408.05054
作者: Kenneth Langedal,Fredrik Manne
关键词-EN: adjacent vertices share, distinct colors, adjacent vertices, vertices share, pair of adjacent
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:The graph coloring problem asks for an assignment of the minimum number of distinct colors to vertices in an undirected graph with the constraint that no pair of adjacent vertices share the same color. The problem is a thoroughly studied NP-hard combinatorial problem with several real-world applications. As such, a number of greedy heuristics have been suggested that strike a good balance between coloring quality, execution time, and also parallel scalability. In this work, we introduce a graph neural network (GNN) based ordering heuristic and demonstrate that it outperforms existing greedy ordering heuristics both on quality and performance. Previous results have demonstrated that GNNs can produce high-quality colorings but at the expense of excessive running time. The current paper is the first that brings the execution time down to compete with existing greedy heuristics. Our GNN model is trained using both supervised and unsupervised techniques. The experimental results show that a 2-layer GNN model can achieve execution times between the largest degree first (LF) and smallest degree last (SL) ordering heuristics while outperforming both on coloring quality. Increasing the number of layers improves the coloring quality further, and it is only at four layers that SL becomes faster than the GNN. Finally, our GNN-based coloring heuristic achieves superior scaling in the parallel setting compared to both SL and LF.
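
For reference, the greedy baseline that the GNN ordering competes with can be sketched as follows (largest-degree-first ordering; the GNN heuristic itself is not reproduced, and the example graph is illustrative):

```python
def greedy_coloring(adj, order):
    """Assign each vertex the smallest color not used by an already-colored neighbor."""
    colors = {}
    for v in order:
        used = {colors[u] for u in adj[v] if u in colors}
        c = 0
        while c in used:
            c += 1
        colors[v] = c
    return colors

def largest_first_order(adj):
    """LF ordering heuristic: highest-degree vertices first."""
    return sorted(adj, key=lambda v: len(adj[v]), reverse=True)

# 4-cycle with a chord: edges {0-1, 1-2, 2-3, 3-0, 0-2}
adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2]}
coloring = greedy_coloring(adj, largest_first_order(adj))
```

The GNN in the paper replaces `largest_first_order` with a learned vertex ordering while keeping the greedy assignment loop.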

[LG-23] BoFire: Bayesian Optimization Framework Intended for Real Experiments

链接: https://arxiv.org/abs/2408.05040
作者: Johannes P. Dürholt,Thomas S. Asche,Johanna Kleinekorte,Gabriel Mancino-Ball,Benjamin Schiller,Simon Sung,Julian Keupp,Aaron Osburg,Toby Boyne,Ruth Misener,Rosona Eldred,Wagner Steuer Costa,Chrysoula Kappatou,Robert M. Lee,Dominik Linzner,David Walz,Niklas Wulkow,Behrang Shafei
关键词-EN: combines Bayesian Optimization, Bayesian Optimization, open-source Python package, Python package BoFire, BoFire combines Bayesian
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: 6 pages, 1 figure, 1 listing

点击查看摘要

Abstract:Our open-source Python package BoFire combines Bayesian Optimization (BO) with other design of experiments (DoE) strategies focusing on developing and optimizing new chemistry. Previous BO implementations, for example as they exist in the literature or software, require substantial adaptation for effective real-world deployment in chemical industry. BoFire provides a rich feature-set with extensive configurability and realizes our vision of fast-tracking research contributions into industrial use via maintainable open-source software. Owing to quality-of-life features like JSON-serializability of problem formulations, BoFire enables seamless integration of BO into RESTful APIs, a common architecture component for both self-driving laboratories and human-in-the-loop setups. This paper discusses the differences between BoFire and other BO implementations and outlines ways that BO research needs to be adapted for real-world use in a chemistry setting.

[LG-24] A conformalized learning of a prediction set with applications to medical imaging classification

链接: https://arxiv.org/abs/2408.05037
作者: Roy Hirsch,Jacob Goldberger
关键词-EN: high predictive accuracy, achieve high predictive, predictive accuracy, unresolved challenge, achieve high
类目: Machine Learning (cs.LG)
*备注: 21st IEEE International Symposium on Biomedical Imaging (ISBI) camera-ready

点击查看摘要

Abstract:Medical imaging classifiers can achieve high predictive accuracy, but quantifying their uncertainty remains an unresolved challenge, which prevents their deployment in medical clinics. We present an algorithm that can modify any classifier to produce a prediction set containing the true label with a user-specified probability, such as 90%. We train a network to predict an instance-based version of the Conformal Prediction threshold. The threshold is then conformalized to ensure the required coverage. We applied the proposed algorithm to several standard medical imaging classification datasets. The experimental results demonstrate that our method outperforms current approaches in terms of smaller average size of the prediction set while maintaining the desired coverage.
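
A standard split-conformal construction of such a prediction set, for context; the paper's contribution (an instance-based learned threshold) is not reproduced here, and the calibration data below are synthetic:

```python
import numpy as np

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Split conformal: nonconformity score = 1 - prob of the true class."""
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    k = min(int(np.ceil((n + 1) * (1 - alpha))), n)
    return np.sort(scores)[k - 1]

def prediction_set(probs, q):
    """Keep every class whose nonconformity score falls below the threshold."""
    return set(np.where(1.0 - probs <= q)[0])

rng = np.random.default_rng(0)
cal_probs = rng.dirichlet([5.0, 1.0, 1.0], size=200)  # toy 3-class classifier
cal_labels = np.zeros(200, dtype=int)                 # true class is always 0 here
q = conformal_threshold(cal_probs, cal_labels, alpha=0.1)
pred = prediction_set(np.array([0.7, 0.2, 0.1]), q)
```

With alpha=0.1 this guarantees roughly 90% marginal coverage; the paper instead predicts the threshold per instance before conformalizing it.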

[LG-25] Retrieval-augmented code completion for local projects using large language models

链接: https://arxiv.org/abs/2408.05026
作者: Marko Hostnik,Marko Robnik-Šikonja
关键词-EN: large language models, software developers, large language, increasingly widespread, widespread among software
类目: Software Engineering (cs.SE); Machine Learning (cs.LG)
*备注: 28 pages, 14 figures

点击查看摘要

Abstract:The use of large language models (LLMs) is becoming increasingly widespread among software developers. However, privacy and computational requirements are problematic with commercial solutions and the use of LLMs. In this work, we focus on using LLMs with around 160 million parameters that are suitable for local execution and augmentation with retrieval from local projects. We train two models based on the transformer architecture, the generative model GPT-2 and the retrieval-adapted RETRO model, on open-source Python files, and empirically evaluate and compare them, confirming the benefits of vector embedding based retrieval. Further, we improve our models’ performance with in-context retrieval-augmented generation, which retrieves code snippets based on the Jaccard similarity of tokens. We evaluate in-context retrieval-augmented generation on larger models and conclude that, despite its simplicity, the approach is more suitable than using the RETRO architecture. We highlight the key role of proper tokenization in achieving the full potential of LLMs in code completion.
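
The token-level Jaccard retrieval step mentioned in the abstract can be sketched as follows (whitespace tokenization and the toy snippets are illustrative simplifications; the paper's models and tokenizer are not reproduced):

```python
def jaccard(a, b):
    """Jaccard similarity of two token sets: |a ∩ b| / |a ∪ b|."""
    return len(a & b) / len(a | b) if a | b else 0.0

def retrieve(query_tokens, snippets, top_k=1):
    """Rank project snippets by token-set Jaccard similarity to the query context."""
    ranked = sorted(snippets, key=lambda s: jaccard(query_tokens, set(s.split())),
                    reverse=True)
    return ranked[:top_k]

snippets = [
    "def read_file path return open path",
    "class Parser def parse tokens",
    "def write_file path data open path write",
]
query = set("def save path data open".split())
best = retrieve(query, snippets)
```

The retrieved snippets are then prepended to the LLM's context window before generation, which is what makes the approach "in-context".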

[LG-26] Towards aerodynamic surrogate modeling based on beta-variational autoencoders

链接: https://arxiv.org/abs/2408.04969
作者: Víctor Francés-Belda,Alberto Solera-Rico,Javier Nieto-Centenero,Esther Andrés,Carlos Sanmiguel Vila,Rodrigo Castellanos
关键词-EN: costly high-fidelity CFD, high-fidelity CFD data, combining dimensionality reduction, high-fidelity CFD, beta
类目: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
*备注: 18 pages, 12 figures

点击查看摘要

Abstract:Surrogate models combining dimensionality reduction and regression techniques are essential to reduce the need for costly high-fidelity CFD data. New approaches using \beta-Variational Autoencoder (\beta-VAE) architectures have shown promise in obtaining high-quality low-dimensional representations of high-dimensional flow data while enabling physical interpretation of their latent spaces. We propose a surrogate model based on latent space regression to predict pressure distributions on a transonic wing given the flight conditions: Mach number and angle of attack. The \beta-VAE model, enhanced with Principal Component Analysis (PCA), maps high-dimensional data to a low-dimensional latent space, showing a direct correlation with flight conditions. Regularization through \beta requires careful tuning to improve the overall performance, while PCA pre-processing aids in constructing an effective latent space, improving autoencoder training and performance. Gaussian Process Regression is used to predict latent space variables from flight conditions, showing robust behavior independent of \beta, and the decoder reconstructs the high-dimensional pressure field data. This pipeline provides insight into unexplored flight conditions. Additionally, a fine-tuning process of the decoder further refines the model, reducing dependency on \beta and enhancing accuracy. The structured latent space, robust regression performance, and significant improvements from fine-tuning collectively create a highly accurate and efficient surrogate model. Our methodology demonstrates the effectiveness of \beta-VAEs for aerodynamic surrogate modeling, offering a rapid, cost-effective, and reliable alternative for aerodynamic data prediction.

[LG-27] LiD-FL: Towards List-Decodable Federated Learning

链接: https://arxiv.org/abs/2408.04963
作者: Hong Liu,Liren Shan,Han Bao,Ronghui You,Yuhao Yi,Jiancheng Lv
关键词-EN: Federated learning, unverified participants, Byzantine federated learning, list-decodable federated learning, Federated
类目: Machine Learning (cs.LG)
*备注: 17 pages, 5 figures

点击查看摘要

Abstract:Federated learning is often used in environments with many unverified participants. Therefore, federated learning under adversarial attacks receives significant attention. This paper proposes an algorithmic framework for list-decodable federated learning, where a central server maintains a list of models, with at least one guaranteed to perform well. The framework has no strict restriction on the fraction of honest workers, extending the applicability of Byzantine federated learning to the scenario with more than half adversaries. Under proper assumptions on the loss function, we prove a convergence theorem for our method. Experimental results, including image classification tasks with both convex and non-convex losses, demonstrate that the proposed algorithm can withstand the malicious majority under various attacks.

[LG-28] Model Debiasing by Learnable Data Augmentation

链接: https://arxiv.org/abs/2408.04955
作者: Pietro Morerio,Ruggero Ragonesi,Vittorio Murino
关键词-EN: Deep Neural Networks, Deep Neural, Neural Networks, actual task labels, efficiently fitting training
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Deep Neural Networks are well known for efficiently fitting training data, yet they exhibit poor generalization whenever some kind of bias dominates over the actual task labels, resulting in models learning “shortcuts”. In essence, such models are often prone to learning spurious correlations between data and labels. In this work, we tackle the problem of learning from biased data in the very realistic unsupervised scenario, i.e., when the bias is unknown. This is a much harder task compared to the supervised case, where auxiliary, bias-related annotations can be exploited in the learning process. This paper proposes a novel 2-stage learning pipeline featuring a data augmentation strategy able to regularize the training. First, biased/unbiased samples are identified by training over-biased models. Second, such subdivision (typically noisy) is exploited within a data augmentation framework, properly combining the original samples while learning mixing parameters, which has a regularization effect. Experiments on synthetic and realistic biased datasets show state-of-the-art classification accuracy, outperforming competing methods, ultimately proving robust performance on both biased and unbiased examples. Notably, since our training method is totally agnostic to the level of bias, it also positively affects performance on any dataset, even apparently unbiased ones, thus improving model generalization regardless of the level of bias (or its absence) in the data.
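
The sample-combination step resembles mixup; a minimal sketch with a fixed mixing coefficient (in the paper the mixing parameters are learned, and the random vectors below merely stand in for real samples):

```python
import numpy as np

def mix_samples(x_a, x_b, lam):
    """Convex combination of a bias-aligned and a bias-conflicting sample."""
    return lam * x_a + (1.0 - lam) * x_b

rng = np.random.default_rng(1)
x_a, x_b = rng.normal(size=8), rng.normal(size=8)
x_mix = mix_samples(x_a, x_b, lam=0.3)   # lam is fixed here; learned in the paper
```

Training on such mixed samples dilutes the spurious feature carried by the bias-aligned sample, which is the regularization effect the abstract describes.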

[LG-29] HybridRAG: Integrating Knowledge Graphs and Vector Retrieval Augmented Generation for Efficient Information Extraction

链接: https://arxiv.org/abs/2408.04948
作者: Bhaskarjit Sarmah,Benika Hall,Rohan Rao,Sunil Patel,Stefano Pasquali,Dhagash Mehta
关键词-EN: present substantial challenges, large language models, unstructured text data, text data arising, Retrieval Augmented Generation
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Statistical Finance (q-fin.ST); Applications (stat.AP); Machine Learning (stat.ML)
*备注: 9 pages, 2 figures, 5 tables

点击查看摘要

Abstract:Extraction and interpretation of intricate information from unstructured text data arising in financial applications, such as earnings call transcripts, present substantial challenges to large language models (LLMs), even under current best practices for Retrieval Augmented Generation (RAG) (referred to as VectorRAG techniques, which utilize vector databases for information retrieval), due to challenges such as domain-specific terminology and complex document formats. We introduce a novel approach, called HybridRAG, that combines Knowledge Graph (KG) based RAG techniques (called GraphRAG) and VectorRAG techniques to enhance question-answer (QA) systems for information extraction from financial documents, and show that it is capable of generating accurate and contextually relevant answers. In experiments on a set of financial earnings call transcript documents, which come in QA format and hence provide a natural set of ground-truth QA pairs, we show that HybridRAG, which retrieves context from both a vector database and a KG, outperforms both traditional VectorRAG and GraphRAG individually when evaluated at both the retrieval and generation stages in terms of retrieval accuracy and answer generation. The proposed technique has applications beyond the financial domain.

[LG-30] Privacy-Preserved Taxi Demand Prediction System Utilizing Distributed Data

链接: https://arxiv.org/abs/2408.04931
作者: Ren Ozeki,Haruki Yonekura,Hamada Rizk,Hirozumi Yamaguchi
关键词-EN: enhancing urban transportation, Accurate taxi-demand prediction, Accurate taxi-demand, urban transportation services, optimizing taxi operations
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注:

点击查看摘要

Abstract:Accurate taxi-demand prediction is essential for optimizing taxi operations and enhancing urban transportation services. However, using customers’ data in these systems raises significant privacy and security concerns. Traditional federated learning addresses some privacy issues by enabling model training without direct data exchange but often struggles with accuracy due to varying data distributions across different regions or service providers. In this paper, we propose CC-Net: a novel approach using collaborative learning enhanced with contrastive learning for taxi-demand prediction. Our method ensures high performance by enabling multiple parties to collaboratively train a demand-prediction model through hierarchical federated learning. In this approach, similar parties are clustered together, and federated learning is applied within each cluster. The similarity is defined without data exchange, ensuring privacy and security. We evaluated our approach using real-world data from five taxi service providers in Japan over fourteen months. The results demonstrate that CC-Net maintains the privacy of customers’ data while improving prediction accuracy by at least 2.2% compared to existing techniques.

[LG-31] UAV-Enhanced Combination to Application: Comprehensive Analysis and Benchmarking of a Human Detection Dataset for Disaster Scenarios ICPR2024

链接: https://arxiv.org/abs/2408.04922
作者: Ragib Amin Nihal,Benjamin Yen,Katsutoshi Itoyama,Kazuhiro Nakadai
关键词-EN: Unmanned aerial vehicles, Combination to Application, training machine learning, machine learning models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: This Paper is accepted for 27th International Conference on Pattern Recognition (ICPR 2024)

点击查看摘要

Abstract:Unmanned aerial vehicles (UAVs) have revolutionized search and rescue (SAR) operations, but the lack of specialized human detection datasets for training machine learning models poses a significant challenge. To address this gap, this paper introduces the Combination to Application (C2A) dataset, synthesized by overlaying human poses onto UAV-captured disaster scenes. Through extensive experimentation with state-of-the-art detection models, we demonstrate that models fine-tuned on the C2A dataset exhibit substantial performance improvements compared to those pre-trained on generic aerial datasets. Furthermore, we highlight the importance of combining the C2A dataset with general human datasets to achieve optimal performance and generalization across various scenarios. This points out the crucial need for a tailored dataset to enhance the effectiveness of SAR operations. Our contributions also include developing a dataset creation pipeline and integrating diverse human poses and disaster scene information to assess the severity of disaster scenarios. Our findings advocate for future developments, to ensure that SAR operations benefit from the most realistic and effective AI-assisted interventions possible.

[LG-32] PTrajM: Efficient and Semantic-rich Trajectory Learning with Pretrained Trajectory-Mamba

链接: https://arxiv.org/abs/2408.04916
作者: Yan Lin,Yichen Liu,Zeyu Zhou,Haomin Wen,Erwen Zheng,Shengnan Guo,Youfang Lin,Huaiyu Wan
关键词-EN: trajectories provide crucial, provide crucial movement, crucial movement information, Vehicle trajectories provide, movement behavior
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Vehicle trajectories provide crucial movement information for various real-world applications. To better utilize vehicle trajectories, it is essential to develop a trajectory learning approach that can effectively and efficiently extract rich semantic information, including movement behavior and travel purposes, to support accurate downstream applications. However, creating such an approach presents two significant challenges. First, movement behaviors are inherently spatio-temporally continuous, making them difficult to extract efficiently from irregular and discrete trajectory points. Second, travel purposes are related to the functionalities of areas and road segments traversed by vehicles. These functionalities are not available from the raw spatio-temporal trajectory features and are hard to extract directly from the complex textual features associated with these areas and road segments. To address these challenges, we propose PTrajM, a novel method capable of efficient and semantic-rich vehicle trajectory learning. To support efficient modeling of movement behavior, we introduce Trajectory-Mamba as the learnable model of PTrajM, which effectively extracts continuous movement behavior while being more computationally efficient than existing structures. To facilitate efficient extraction of travel purposes, we propose a travel purpose-aware pre-training procedure, which enables PTrajM to discern the travel purposes of trajectories without additional computational resources during its embedding process. Extensive experiments on two real-world datasets and comparisons with several state-of-the-art trajectory learning methods demonstrate the effectiveness of PTrajM. Code is available at https://anonymous.4open.science/r/PTrajM-C973.

[LG-33] AcousAF: Acoustic Sensing-Based Atrial Fibrillation Detection System for Mobile Phones

链接: https://arxiv.org/abs/2408.04912
作者: Xuanyu Liu,Haoxian Liu,Jiao Li,Zongqi Yang,Yi Huang,Jin Zhang
关键词-EN: irregular electrical impulses, electrical impulses originating, Atrial fibrillation, characterized by irregular, irregular electrical
类目: ound (cs.SD); Computational Engineering, Finance, and Science (cs.CE); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: Accepted for publication in Companion of the 2024 ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp Companion '24)

点击查看摘要

Abstract:Atrial fibrillation (AF) is characterized by irregular electrical impulses originating in the atria, which can lead to severe complications and even death. Due to the intermittent nature of AF, early and timely monitoring of AF is critical for patients to prevent further exacerbation of the condition. Although ambulatory ECG Holter monitors provide accurate monitoring, the high cost of these devices hinders their wider adoption. Current mobile-based AF detection systems offer a portable solution. However, these systems have various applicability issues, such as being easily affected by environmental factors and requiring significant user effort. To overcome the above limitations, we present AcousAF, a novel AF detection system based on acoustic sensors of smartphones. Particularly, we explore the potential of pulse wave acquisition from the wrist using smartphone speakers and microphones. In addition, we propose a well-designed framework comprised of pulse wave probing, pulse wave extraction, and AF detection to ensure accurate and reliable AF detection. We collect data from 20 participants utilizing our custom data collection application on the smartphone. Extensive experimental results demonstrate the high performance of our system, with 92.8% accuracy, 86.9% precision, 87.4% recall, and 87.1% F1 Score.
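
The four reported metrics (accuracy, precision, recall, F1) for a binary AF/non-AF classifier can be computed as follows; the labels below are toy values, not the paper's data:

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = AF)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    acc = (tp + tn) / len(y_true)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return acc, prec, rec, f1

acc, prec, rec, f1 = binary_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```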

[LG-34] A Geometric Nash Approach in Tuning the Learning Rate in Q-Learning Algorithm

链接: https://arxiv.org/abs/2408.04911
作者: Kwadwo Osei Bonsu
关键词-EN: paper proposes, proposes a geometric, geometric approach, alpha, Nash Equilibrium provide
类目: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT); Theoretical Economics (econ.TH); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:This paper proposes a geometric approach for estimating the \alpha value in Q-learning. We establish a systematic framework that optimizes the \alpha parameter, thereby enhancing learning efficiency and stability. Our results show that there is a relationship between the learning rate and the angle between the vector T (total time steps in each episode of learning) and the vector R (the reward vector for each episode). The concepts of the angular bisector between vectors T and R and the Nash Equilibrium provide insight into estimating \alpha such that the algorithm minimizes losses arising from the exploration-exploitation trade-off.
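
The geometric quantities the abstract refers to, the angle between T and R and their angular bisector, can be computed as follows; the Nash-equilibrium-based estimate of \alpha itself is not reproduced, and the episode vectors below are illustrative:

```python
import numpy as np

def angle_between(t, r):
    """Angle (radians) between vectors T and R."""
    cos = np.dot(t, r) / (np.linalg.norm(t) * np.linalg.norm(r))
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

def bisector_direction(t, r):
    """Unit vector along the angular bisector of T and R
    (sum of the two unit vectors, renormalized)."""
    u = t / np.linalg.norm(t) + r / np.linalg.norm(r)
    return u / np.linalg.norm(u)

T = np.array([10.0, 12.0, 9.0])  # total time steps per episode
R = np.array([3.0, 5.0, 4.0])    # reward per episode
theta = angle_between(T, R)
b = bisector_direction(T, R)
```

By construction, the bisector makes equal angles with T and R, which is the property the paper's \alpha estimate builds on.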

[LG-35] Axiomatic Characterisations of Sample-based Explainers

链接: https://arxiv.org/abs/2408.04903
作者: Leila Amgouda,Martin C. Cooper,Salim Debbaoui
关键词-EN: Explaining decisions, computationally challenging, decisions of black-box, black-box classifiers, important and computationally
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Explaining decisions of black-box classifiers is both important and computationally challenging. In this paper, we scrutinize explainers that generate feature-based explanations from samples or datasets. We start by presenting a set of desirable properties that explainers would ideally satisfy, delve into their relationships, and highlight incompatibilities of some of them. We identify the entire family of explainers that satisfy two key properties which are compatible with all the others. Its instances provide sufficient reasons, called weak abductive explanations. We then unravel its various subfamilies that satisfy subsets of compatible properties. Indeed, we fully characterize all the explainers that satisfy any subset of compatible properties. In particular, we introduce the first (broad family of) explainers that guarantee the existence of explanations and their global consistency. We discuss some of its instances including the irrefutable explainer and the surrogate explainer whose explanations can be found in polynomial time.

[LG-36] Better Not to Propagate: Understanding Edge Uncertainty and Over-smoothing in Signed Graph Neural Networks

链接: https://arxiv.org/abs/2408.04895
作者: Yoonhyuk Choi,Jiho Choi,Taewook Ko,Chong-Kwon Kim
关键词-EN: Traditional Graph Neural, Graph Neural Networks, real-world heterophily scenarios, Neural Networks, Traditional Graph
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Traditional Graph Neural Networks (GNNs) rely on network homophily, which can lead to performance degradation due to over-smoothing in many real-world heterophily scenarios. Recent studies analyze the smoothing effect (separability) after message-passing (MP), depending on the expectation of node features. Regarding separability gain, they provided theoretical backgrounds on over-smoothing caused by various propagation schemes, including positive, signed, and blocked MPs. More recently, by extending these theorems, some works have suggested improvements in signed propagation under multiple classes. However, prior works assume that the error ratio of all propagation schemes is fixed, failing to investigate this phenomenon correctly. To solve this problem, we propose a novel method for estimating homophily and edge error ratio, integrated with dynamic selection between blocked and signed propagation during training. Our theoretical analysis, supported by extensive experiments, demonstrates that blocking MP can be more effective than signed propagation under high edge error ratios, improving the performance in both homophilic and heterophilic graphs.

[LG-37] Clustering-friendly Representation Learning for Enhancing Salient Features PAKDD2024

链接: https://arxiv.org/abs/2408.04891
作者: Toshiyuki Oshima,Kentaro Takagi,Kouta Nakata
关键词-EN: challenging unlabeled datasets, Recently, successfully applied, applied to challenging, challenging unlabeled
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 12 pages, 6 figures, 28th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD2024)

点击查看摘要

Abstract:Recently, representation learning with contrastive learning algorithms has been successfully applied to challenging unlabeled datasets. However, these methods are unable to distinguish important features from unimportant ones under simply unsupervised settings, and definitions of importance vary according to the type of downstream task or analysis goal, such as the identification of objects or backgrounds. In this paper, we focus on unsupervised image clustering as the downstream task and propose a representation learning method that enhances features critical to the clustering task. We extend a clustering-friendly contrastive learning method and incorporate a contrastive analysis approach, which utilizes a reference dataset to separate important features from unimportant ones, into the design of loss functions. Conducting an experimental evaluation of image clustering for three datasets with characteristic backgrounds, we show that for all datasets, our method achieves higher clustering scores compared with conventional contrastive analysis and deep clustering methods.

[LG-38] UCB Exploration for Fixed-Budget Bayesian Best Arm Identification

链接: https://arxiv.org/abs/2408.04869
作者: Rong J.B. Zhu,Yanqi Qiu
关键词-EN: study best-arm identification, best-arm identification, Bayesian BAI problem, BAI, study best-arm
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We study best-arm identification (BAI) in the fixed-budget setting. Adaptive allocations based on upper confidence bounds (UCBs), such as UCBE, are known to work well in BAI. However, it is well-known that its optimal regret is theoretically dependent on instances, which we show to be an artifact in many fixed-budget BAI problems. In this paper we propose a UCB exploration algorithm that is both theoretically and empirically efficient for the fixed-budget BAI problem under a Bayesian setting. The key idea is to learn prior information, which can enhance the performance of UCB-based BAI algorithms as it has done in the cumulative regret minimization problem. We establish bounds on the failure probability and the simple regret for the Bayesian BAI problem, providing upper bounds of order \tilde{O}(\sqrt{K/n}), up to logarithmic factors, where n represents the budget and K denotes the number of arms. Furthermore, we demonstrate through empirical results that our approach consistently outperforms state-of-the-art baselines.
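
As background for the allocation strategy the abstract refers to, a minimal frequentist UCB-E loop for fixed-budget BAI can be sketched as follows. The arm means, budget, and exploration constant `a` are invented for illustration, and the paper's key ingredient, learning a prior in the Bayesian setting, is not modeled here.

```python
import numpy as np

rng = np.random.default_rng(0)
means = np.array([0.4, 0.5, 0.7])   # hypothetical Bernoulli arms; arm 2 is best
K, n = len(means), 300              # K arms, fixed budget n
a = 2.0                             # exploration parameter (instance-dependent in theory)

counts = np.zeros(K)
sums = np.zeros(K)
for t in range(n):
    if t < K:
        arm = t                                    # initialization: pull each arm once
    else:
        ucb = sums / counts + np.sqrt(a / counts)  # UCB-E index
        arm = int(np.argmax(ucb))
    sums[arm] += float(rng.random() < means[arm])  # Bernoulli reward
    counts[arm] += 1

# After the budget is exhausted, recommend the highest empirical mean.
best = int(np.argmax(sums / counts))
```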

[LG-39] An Evaluation of Standard Statistical Models and LLMs on Time Series Forecasting

链接: https://arxiv.org/abs/2408.04867
作者: Rui Cao,Qiao Wang
关键词-EN: Large Language Models, Large Language, Language Models, predicting time series, time series
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This research examines the use of Large Language Models (LLMs) in predicting time series, with a specific focus on the LLMTIME model. Despite the established effectiveness of LLMs in tasks such as text generation, language translation, and sentiment analysis, this study highlights the key challenges that large language models encounter in the context of time series prediction. We assess the performance of LLMTIME across multiple datasets and introduce classical almost periodic functions as time series to gauge its effectiveness. The empirical results indicate that while large language models can perform well in zero-shot forecasting for certain datasets, their predictive accuracy diminishes notably when confronted with diverse time series data and traditional signals. The primary finding of this study is that the predictive capacity of LLMTIME, similar to other LLMs, significantly deteriorates when dealing with time series data that contain both periodic and trend components, as well as when the signal comprises complex frequency components.

[LG-40] High dimensional Bayesian Optimization via Condensing-Expansion Projection

链接: https://arxiv.org/abs/2408.04860
作者: Jiaming Lu,Rong J.B. Zhu
关键词-EN: Bayesian optimization, Projection Bayesian optimization, embedding Bayesian optimization, expensive and infeasible, Bayesian optimization algorithm
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In high-dimensional settings, Bayesian optimization (BO) can be expensive and infeasible. The random embedding Bayesian optimization algorithm is commonly used to address high-dimensional BO challenges. However, this method relies on the effective subspace assumption on the optimization problem’s objective function, which limits its applicability. In this paper, we introduce Condensing-Expansion Projection Bayesian optimization (CEPBO), a novel random projection-based approach for high-dimensional BO that does not rely on the effective subspace assumption. The approach is both simple to implement and highly practical. We present two algorithms based on different random projection matrices: the Gaussian projection matrix and the hashing projection matrix. Experimental results demonstrate that both algorithms outperform existing random embedding-based algorithms in most cases, achieving superior performance on high-dimensional BO problems. The code is available at https://anonymous.4open.science/r/CEPBO-14429.
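
The random-embedding baseline that CEPBO sets out to improve works roughly as below: draw one random matrix, optimize in the low-dimensional space, and lift each candidate back to the ambient space for evaluation. This sketch uses a plain Gaussian projection with a random-search stand-in for the BO loop; the paper's condensing-expansion projections and the hashing variant are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(1)
D, d = 100, 2                      # ambient and embedding dimensions (illustrative)

def f(x):
    """High-dimensional objective with its optimum at the origin."""
    return -float(np.sum(x ** 2))

A = rng.normal(size=(D, d)) / np.sqrt(d)   # Gaussian random projection matrix

def f_low(z):
    # Candidates live in R^d; each is lifted to R^D and clipped to the
    # search box before the expensive objective is evaluated.
    x = np.clip(A @ z, -1.0, 1.0)
    return f(x)

# Any BO loop could now search over R^d; crude random search stands in here.
zs = rng.uniform(-2.0, 2.0, size=(200, d))
best_z = max(zs, key=f_low)
```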

[LG-41] Your Classifier Can Be Secretly a Likelihood-Based OOD Detector

链接: https://arxiv.org/abs/2408.04851
作者: Jirayu Burapacheep,Yixuan Li
关键词-EN: classification models deployed, OOD detection, ability to detect, open environment, OOD
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:The ability to detect out-of-distribution (OOD) inputs is critical to guarantee the reliability of classification models deployed in an open environment. A fundamental challenge in OOD detection is that a discriminative classifier is typically trained to estimate the posterior probability p(y|z) for class y given an input z, but lacks the explicit likelihood estimation of p(z) ideally needed for OOD detection. While numerous OOD scoring functions have been proposed for classification models, these estimated scores are often heuristic-driven and cannot be rigorously interpreted as likelihood. To bridge the gap, we propose Intrinsic Likelihood (INK), which offers rigorous likelihood interpretation to modern discriminative-based classifiers. Specifically, our proposed INK score operates on the constrained latent embeddings of a discriminative classifier, which are modeled as a mixture of hyperspherical embeddings with constant norm. We draw a novel connection between the hyperspherical distribution and the intrinsic likelihood, which can be effectively optimized in modern neural networks. Extensive experiments on the OpenOOD benchmark empirically demonstrate that INK establishes a new state-of-the-art in a variety of OOD detection setups, including both far-OOD and near-OOD. Code is available at this https URL.
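
To make the hyperspherical-likelihood idea concrete, here is a hedged sketch of a score that is a genuine log-likelihood (up to constants) under a mixture of von Mises-Fisher components with shared concentration. The prototypes, dimensions, and kappa below are arbitrary; the actual INK score operates on a trained classifier's constrained embeddings, which are not fit here.

```python
import numpy as np

def hyperspherical_score(z, prototypes, kappa=10.0):
    """Log-sum-exp of kappa * cos(z, mu_y) over unit-norm class prototypes:
    proportional (up to constants) to the log-likelihood of a von
    Mises-Fisher mixture with shared concentration kappa. Illustrative
    only, not the paper's exact INK formulation."""
    z = z / np.linalg.norm(z)            # constant-norm embedding
    cos = prototypes @ z                 # cosine similarity to each class
    m = cos.max()
    return float(kappa * m + np.log(np.exp(kappa * (cos - m)).sum()))

rng = np.random.default_rng(0)
protos = rng.normal(size=(5, 16))
protos /= np.linalg.norm(protos, axis=1, keepdims=True)

z_id = protos[0] + 0.05 * rng.normal(size=16)   # near a class prototype (ID-like)
z_ood = rng.normal(size=16)                      # random direction (OOD-like)
```

An in-distribution embedding close to a prototype receives a much higher score than a random direction, which is the separation an OOD detector thresholds on.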

[LG-42] UGrid: An Efficient-And-Rigorous Neural Multigrid Solver for Linear PDEs

链接: https://arxiv.org/abs/2408.04846
作者: Xi Han,Fei Hou,Hong Qin
关键词-EN: Partial Differential Equations, Differential Equations, Partial Differential, science and engineering, fundamental significance
类目: Numerical Analysis (math.NA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Mathematical Software (cs.MS)
*备注: Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024

点击查看摘要

Abstract:Numerical solvers of Partial Differential Equations (PDEs) are of fundamental significance to science and engineering. To date, the historical reliance on legacy techniques has circumscribed possible integration of big data knowledge and exhibits sub-optimal efficiency for certain PDE formulations, while data-driven neural methods typically lack mathematical guarantee of convergence and correctness. This paper articulates a mathematically rigorous neural solver for linear PDEs. The proposed UGrid solver, built upon the principled integration of U-Net and MultiGrid, manifests a mathematically rigorous proof of both convergence and correctness, and showcases high numerical accuracy, as well as strong generalization power to various input geometry/values and multiple PDE formulations. In addition, we devise a new residual loss metric, which enables unsupervised training and affords more stability and a larger solution space over the legacy losses.
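
The residual loss mentioned at the end is easy to state for any linear system A u = f: it vanishes exactly at the solution, so a neural solver can be trained on it without ground-truth solutions. A one-dimensional Poisson toy version (not the paper's 2-D multigrid setting) looks like this:

```python
import numpy as np

# Finite-difference discretization of -u'' = f on (0, 1) with zero
# boundary conditions: a toy linear PDE standing in for the paper's setup.
n = 63
h = 1.0 / (n + 1)
A = (2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h**2

x = np.linspace(h, 1 - h, n)
f = np.pi**2 * np.sin(np.pi * x)      # source term
u_true = np.sin(np.pi * x)            # analytic solution of -u'' = f

def residual_loss(u, A, f):
    """Mean squared residual ||A u - f||^2 / n: zero iff u solves the
    discrete system, so it needs no supervised targets."""
    r = A @ u - f
    return float(r @ r) / len(r)

loss_true = residual_loss(u_true, A, f)      # ~O(h^4) discretization error
loss_zero = residual_loss(np.zeros(n), A, f) # a bad candidate scores high
```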

[LG-43] MDS-GNN: A Mutual Dual-Stream Graph Neural Network on Graphs with Incomplete Features and Structure

链接: https://arxiv.org/abs/2408.04845
作者: Peng Yuan,Peng Tang
关键词-EN: Graph Neural Networks, graph-structured data, emerged as powerful, powerful tools, tools for analyzing
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph Neural Networks (GNNs) have emerged as powerful tools for analyzing and learning representations from graph-structured data. A crucial prerequisite for the outstanding performance of GNNs is the availability of complete graph information, i.e., node features and graph structure, which is frequently unmet in real-world scenarios since graphs are often incomplete due to various uncontrollable factors. Existing approaches focus on dealing with either incomplete features or incomplete structure alone, which inevitably leads to performance loss. To address this issue, this study proposes a mutual dual-stream graph neural network (MDS-GNN), which implements mutual benefit learning between features and structure. Its main ideas are as follows: a) reconstructing the missing node features based on the initial incomplete graph structure; b) generating an augmented global graph based on the reconstructed node features, and propagating the incomplete node features on this global graph; and c) utilizing contrastive learning to make the dual-stream process mutually beneficial. Extensive experiments on six real-world datasets demonstrate the effectiveness of our proposed MDS-GNN on incomplete graphs.

[LG-44] Counterfactual Explanations with Probabilistic Guarantees on their Robustness to Model Change

链接: https://arxiv.org/abs/2408.04842
作者: Ignacy Stępka,Mateusz Lango,Jerzy Stefanowski
关键词-EN: achieve desired outputs, machine learning models, guide users, desired outputs, machine learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Counterfactual explanations (CFEs) guide users on how to adjust inputs to machine learning models to achieve desired outputs. While existing research primarily addresses static scenarios, real-world applications often involve data or model changes, potentially invalidating previously generated CFEs and rendering user-induced input changes ineffective. Current methods addressing this issue often support only specific models or change types, require extensive hyperparameter tuning, or fail to provide probabilistic guarantees on CFE robustness to model changes. This paper proposes a novel approach for generating CFEs that provides probabilistic guarantees for any model and change type, while offering interpretable and easy-to-select hyperparameters. We establish a theoretical framework for probabilistically defining robustness to model change and demonstrate how our BetaRCE method directly stems from it. BetaRCE is a post-hoc method applied alongside a chosen base CFE generation method to enhance the quality of the explanation beyond robustness. It facilitates a transition from the base explanation to a more robust one with user-adjusted probability bounds. Through experimental comparisons with baselines, we show that BetaRCE yields counterfactual explanations that are robust, the most plausible, and the closest to the baseline ones.

[LG-45] Kolmogorov-Arnold Network for Online Reinforcement Learning

链接: https://arxiv.org/abs/2408.04841
作者: Victor Augusto Kich,Jair Augusto Bottega,Raul Steinmetz,Ricardo Bedin Grando,Ayano Yorozu,Akihisa Ohya
关键词-EN: reduced memory usage, providing universal function, Proximal Policy Optimization, Kolmogorov-Arnold Networks, Multi-Layer Perceptrons
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Paper accepted at 24th International Conference on Control, Automation and Systems

点击查看摘要

Abstract:Kolmogorov-Arnold Networks (KANs) have shown potential as an alternative to Multi-Layer Perceptrons (MLPs) in neural networks, providing universal function approximation with fewer parameters and reduced memory usage. In this paper, we explore the use of KANs as function approximators within the Proximal Policy Optimization (PPO) algorithm. We evaluate this approach by comparing its performance to the original MLP-based PPO using the DeepMind Control Proprio Robotics benchmark. Our results indicate that the KAN-based reinforcement learning algorithm can achieve comparable performance to its MLP-based counterpart, often with fewer parameters. These findings suggest that KANs may offer a more efficient option for reinforcement learning models.

[LG-46] mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models

链接: https://arxiv.org/abs/2408.04840
作者: Jiabo Ye,Haiyang Xu,Haowei Liu,Anwen Hu,Ming Yan,Qi Qian,Ji Zhang,Fei Huang,Jingren Zhou
关键词-EN: demonstrated remarkable capabilities, Multi-modal Large Language, Large Language Models, Large Language, Multi-modal Large
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multi-modal Large Language Models (MLLMs) have demonstrated remarkable capabilities in executing instructions for a variety of single-image tasks. Despite this progress, significant challenges remain in modeling long image sequences. In this work, we introduce the versatile multi-modal large language model, mPLUG-Owl3, which enhances the capability for long image-sequence understanding in scenarios that incorporate retrieved image-text knowledge, interleaved image-text, and lengthy videos. Specifically, we propose novel hyper attention blocks to efficiently integrate vision and language into a common language-guided semantic space, thereby facilitating the processing of extended multi-image scenarios. Extensive experimental results suggest that mPLUG-Owl3 achieves state-of-the-art performance among models with a similar size on single-image, multi-image, and video benchmarks. Moreover, we propose a challenging long visual sequence evaluation named Distractor Resistance to assess the ability of models to maintain focus amidst distractions. Finally, with the proposed architecture, mPLUG-Owl3 demonstrates outstanding performance on ultra-long visual sequence inputs. We hope that mPLUG-Owl3 can contribute to the development of more efficient and powerful multimodal large language models.

[LG-47] Adversarially Robust Industrial Anomaly Detection Through Diffusion Model

链接: https://arxiv.org/abs/2408.04839
作者: Yuanpu Cao,Lu Lin,Jinghui Chen
关键词-EN: achieved remarkably high, remarkably high accuracy, Deep learning-based industrial, anomaly detection, anomaly
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Deep learning-based industrial anomaly detection models have achieved remarkably high accuracy on commonly used benchmark datasets. However, the robustness of those models may not be satisfactory due to the existence of adversarial examples, which pose significant threats to the practical deployment of deep anomaly detectors. Recently, it has been shown that diffusion models can be used to purify the adversarial noises and thus build a robust classifier against adversarial attacks. Unfortunately, we found that naively applying this strategy in anomaly detection (i.e., placing a purifier before an anomaly detector) will suffer from a high anomaly miss rate, since the purifying process can easily remove both the anomaly signal and the adversarial perturbations, causing the downstream anomaly detector to fail to detect anomalies. To tackle this issue, we explore the possibility of performing anomaly detection and adversarial purification simultaneously. We propose a simple yet effective adversarially robust anomaly detection method, AdvRAD, that allows the diffusion model to act both as an anomaly detector and adversarial purifier. We also extend our proposed method for certified robustness to l_2-norm-bounded perturbations. Through extensive experiments, we show that our proposed method exhibits outstanding (certified) adversarial robustness while also maintaining equally strong anomaly detection performance on par with the state-of-the-art methods on industrial anomaly detection benchmark datasets.

[LG-48] Dual-Channel Latent Factor Analysis Enhanced Graph Contrastive Learning for Recommendation

链接: https://arxiv.org/abs/2408.04838
作者: Junfeng Long,Hao Wu
关键词-EN: Graph Neural Networks, Neural Networks, powerful learning methods, handling complicated user-item, recommender systems owing
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph Neural Networks (GNNs) are powerful learning methods for recommender systems owing to their robustness in handling complicated user-item interactions. Recently, the integration of contrastive learning with GNNs has demonstrated remarkable performance in recommender systems to handle the issue of highly sparse user-item interaction data. Yet, some available graph contrastive learning (GCL) techniques employ stochastic augmentation, i.e., nodes or edges are randomly perturbed on the user-item bipartite graph to construct contrastive views. Such a stochastic augmentation strategy not only brings noise perturbation but also cannot utilize global collaborative signals effectively. To address it, this study proposes a latent factor analysis (LFA) enhanced GCL approach, named LFA-GCL. Our model exclusively incorporates LFA to implement the unconstrained structural refinement, thereby obtaining an augmented global collaborative graph accurately without introducing noise signals. Experiments on four public datasets show that the proposed LFA-GCL outperforms the state-of-the-art models.

[LG-49] Performance Prediction of Hub-Based Swarms

链接: https://arxiv.org/abs/2408.04822
作者: Puneet Jain,Chaitanya Dwivedi,Vigynesh Bhatt,Nick Smith,Michael A Goodrich
关键词-EN: common nest site, nest site called, hub-based colony consists, consists of multiple, share a common
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:A hub-based colony consists of multiple agents who share a common nest site called the hub. Agents perform tasks away from the hub like foraging for food or gathering information about future nest sites. Modeling hub-based colonies is challenging because the size of the collective state space grows rapidly as the number of agents grows. This paper presents a graph-based representation of the colony that can be combined with graph-based encoders to create low-dimensional representations of collective state that can scale to many agents for a best-of-N colony problem. We demonstrate how the information in the low-dimensional embedding can be used with two experiments. First, we show how the information in the tensor can be used to cluster collective states by the probability of choosing the best site for a very small problem. Second, we show how structured collective trajectories emerge when a graph encoder is used to learn the low-dimensional embedding, and these trajectories have information that can be used to predict swarm performance.

[LG-50] Natural Language Outlines for Code: Literate Programming in the LLM Era

链接: https://arxiv.org/abs/2408.04820
作者: Kensen Shi,Deniz Altınbüken,Saswat Anand,Mihai Christodorescu,Katja Grünwedel,Alexa Koenings,Sai Naidu,Anurag Pathak,Marc Rasi,Fredde Ribeiro,Brandon Ruffin,Siddhant Sanyam,Maxim Tabachnyk,Sara Toth,Roy Tu,Tobias Welp,Pengcheng Yin,Manzil Zaheer,Satish Chandra,Charles Sutton
关键词-EN: software development process, natural language outlines, development process, natural language, modality and interaction
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We propose using natural language outlines as a novel modality and interaction surface for providing AI assistance to developers throughout the software development process. An NL outline for a code function comprises multiple statements written in concise prose, which partition the code and summarize its main ideas in the style of literate programming. Crucially, we find that modern LLMs can generate accurate and high-quality NL outlines in practice. Moreover, NL outlines enable a bidirectional sync between code and NL, allowing changes in one to be automatically reflected in the other. We discuss many use cases for NL outlines: they can accelerate understanding and navigation of code and diffs, simplify code maintenance, augment code search, steer code generation, and more. We then propose and compare multiple LLM prompting techniques for generating outlines and ask professional developers to judge outline quality. Finally, we present two case studies applying NL outlines toward code review and the difficult task of malware detection.

[LG-51] Interventional Causal Structure Discovery over Graphical Models with Convergence and Optimality Guarantees

链接: https://arxiv.org/abs/2408.04819
作者: Qiu Chengbo,Yang Kai
关键词-EN: including healthcare, artificial intelligence, fundamental problem, problem with applications, causal structure
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Learning causal structure from sampled data is a fundamental problem with applications in various fields, including healthcare, machine learning and artificial intelligence. Traditional methods predominantly rely on observational data, but there exist limits regarding the identifiability of causal structures with only observational data. Interventional data, on the other hand, helps establish a cause-and-effect relationship by breaking the influence of confounding variables. It remains to date under-explored to develop a mathematical framework that seamlessly integrates both observational and interventional data in causal structure learning. Furthermore, existing studies often focus on centralized approaches, necessitating the transfer of entire datasets to a single server, which leads to considerable communication overhead and heightened risks to privacy. To tackle these challenges, we develop a bilevel polynomial optimization (Bloom) framework. Bloom not only provides a powerful mathematical modeling framework, underpinned by theoretical support, for causal structure discovery from both interventional and observational data, but also aspires to an efficient causal discovery algorithm with convergence and optimality guarantees. We further extend Bloom to a distributed setting to reduce the communication overhead and mitigate data privacy risks. It is seen through experiments on both synthetic and real-world datasets that Bloom markedly surpasses other leading learning algorithms.

[LG-52] Performance Metric for Multiple Anomaly Score Distributions with Discrete Severity Levels

链接: https://arxiv.org/abs/2408.04817
作者: Wonjun Yi,Yong-Hwa Park,Wonho Jung
关键词-EN: severity level differences, classifying severity levels, severity level, automated maintenance, rise of smart
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: accepted as a work-in-progress paper at the 2024 Annual Conference of the IEEE Industrial Electronics Society (IECON)

点击查看摘要

Abstract:The rise of smart factories has heightened the demand for automated maintenance, and normal-data-based anomaly detection has proved particularly effective in environments where anomaly data are scarce. This method, which does not require anomaly data during training, has prompted researchers to focus not only on detecting anomalies but also on classifying severity levels by using anomaly scores. However, the existing performance metrics, such as the area under the receiver operating characteristic curve (AUROC), do not effectively reflect the performance of models in classifying severity levels based on anomaly scores. To address this limitation, we propose the weighted sum of the area under the receiver operating characteristic curve (WS-AUROC), which combines AUROC with a penalty for severity level differences. We conducted various experiments using different penalty assignment methods: uniform penalty regardless of severity level differences, penalty based on severity level index differences, and penalty based on actual physical quantities that cause anomalies. The latter method was the most sensitive. Additionally, we propose an anomaly detector that achieves clear separation of distributions and outperforms the ablation models on the WS-AUROC and AUROC metrics.
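
The shape of the proposed metric can be sketched from the pairwise definition of AUROC: compute an AUROC for each pair of severity levels and combine them with penalties that grow with the severity gap. The penalty values and anomaly scores below are invented; the paper compares uniform, index-difference, and physical-quantity penalty assignments.

```python
import numpy as np

def auroc(scores_neg, scores_pos):
    """Probability that a sample from the higher level outscores one from
    the lower level (ties count half): the pairwise form of AUROC."""
    neg, pos = np.asarray(scores_neg), np.asarray(scores_pos)
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())

def ws_auroc(level_scores, penalties):
    """Penalty-weighted sum of pairwise AUROCs over ordered severity levels.
    `penalties[(i, j)]` weights the separability of level i vs level j;
    the exact assignment scheme here is illustrative."""
    total = sum(penalties.values())
    return sum(w * auroc(level_scores[i], level_scores[j])
               for (i, j), w in penalties.items()) / total

# Hypothetical anomaly scores for three severity levels (0 = healthy)
levels = {0: [0.1, 0.2, 0.15], 1: [0.4, 0.5, 0.35], 2: [0.8, 0.9, 0.7]}
penalties = {(0, 1): 1.0, (1, 2): 1.0, (0, 2): 2.0}  # larger gap, larger weight
print(ws_auroc(levels, penalties))
```

A detector whose scores perfectly order the severity levels, as in this toy example, reaches the maximum value of 1.0; confusing widely separated levels is penalized more than confusing adjacent ones.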

[LG-53] FUSE-ing Language Models: Zero-Shot Adapter Discovery for Prompt Optimization Across Tokenizers

链接: https://arxiv.org/abs/2408.04816
作者: Joshua Nathaniel Williams,J. Zico Kolter
关键词-EN: making knowledge transfer, discovery tasks difficult, prompt discovery tasks, model embedding space, embedding space
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: Published as a Conference Paper at COLM 2024; 10 Pages; this https URL

点击查看摘要

Abstract:The widespread use of large language models has resulted in a multitude of tokenizers and embedding spaces, making knowledge transfer in prompt discovery tasks difficult. In this work, we propose FUSE (Flexible Unification of Semantic Embeddings), an inexpensive approach to approximating an adapter layer that maps from one model’s textual embedding space to another, even across different tokenizers. We introduce a third-order tensor-based representation of a model’s embedding space that aligns semantic embeddings that have been split apart by different tokenizers, and use this representation to derive an approximation of the gradient of one model’s outputs with respect to another model’s embedding space. We show the efficacy of our approach via multi-objective optimization over vision-language and causal language models for image captioning and sentiment-based image captioning.
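
As a much-simplified illustration of mapping one embedding space onto another (not FUSE's third-order tensor construction, which is what handles mismatched tokenizers), a linear adapter can be fit by least squares over anchor embeddings assumed to be available in both models:

```python
import numpy as np

# Hypothetical setup: n_anchor tokens are embedded by model A (d_a dims)
# and model B (d_b dims); fit a linear map W with E_a @ W ~= E_b.
rng = np.random.default_rng(0)
d_a, d_b, n_anchor = 32, 48, 200

E_a = rng.normal(size=(n_anchor, d_a))                  # anchors in model A's space
W_true = rng.normal(size=(d_a, d_b)) / np.sqrt(d_a)     # synthetic ground truth
E_b = E_a @ W_true + 0.01 * rng.normal(size=(n_anchor, d_b))  # noisy targets

W, *_ = np.linalg.lstsq(E_a, E_b, rcond=None)           # fitted adapter

# Relative reconstruction error of the adapter on the anchors.
err = np.linalg.norm(E_a @ W - E_b) / np.linalg.norm(E_b)
```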

[LG-54] Towards improving Alzheimer's intervention: a machine learning approach for biomarker detection through combining MEG and MRI pipelines

链接: https://arxiv.org/abs/2408.04815
作者: Alwani Liyana Ahmad,Jose Sanchez-Bornot,Roberto C. Sotero,Damien Coyle,Zamzuri Idris,Ibrahima Faye
关键词-EN: Alzheimer Disease, studying brain function, invasive neuroimaging techniques, MEG, spatial resolution
类目: Machine Learning (cs.LG)
*备注: 28 pages, 9 figures, 3 tables, 19 supplimetary material

点击查看摘要

Abstract:MEG is a non-invasive neuroimaging technique with excellent temporal and spatial resolution, crucial for studying brain function in dementia and Alzheimer's Disease. It identifies changes in brain activity at various Alzheimer's stages, including the preclinical and prodromal phases. MEG may detect pathological changes before clinical symptoms appear, offering potential biomarkers for intervention. This study evaluates classification techniques using MEG features to distinguish between healthy controls (HC) and mild cognitive impairment (MCI) participants from the BioFIND study. We compare MEG-based biomarkers with MRI-based anatomical features, both independently and combined. We used 3 Tesla MRI and MEG data from 324 BioFIND participants: 158 MCI and 166 HC. Analyses were performed using MATLAB with the SPM12 and OSL toolboxes. Machine learning analyses, including 100 Monte Carlo replications of 10-fold cross-validation, were conducted on sensor and source spaces. Combining MRI with MEG features achieved the best performance: 0.76 accuracy and an AUC of 0.82 for GLMNET using LCMV source-based MEG. MEG-only analyses using LCMV and eLORETA also performed well, suggesting that combining uncorrected MEG with z-score-corrected MRI features is optimal.

[LG-55] UniBench: Visual Reasoning Requires Rethinking Vision-Language Beyond Scaling

链接: https://arxiv.org/abs/2408.04810
作者: Haider Al-Tahan,Quentin Garrido,Randall Balestriero,Diane Bouchacourt,Caner Hazirbas,Mark Ibrahim
关键词-EN: Significant research efforts, Significant research, research efforts, improve vision-language model, Significant
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Significant research efforts have been made to scale and improve vision-language model (VLM) training approaches. Yet, with an ever-growing number of benchmarks, researchers are tasked with the heavy burden of implementing each protocol, bearing a non-trivial computational cost, and making sense of how all these benchmarks translate into meaningful axes of progress. To facilitate a systematic evaluation of VLM progress, we introduce UniBench: a unified implementation of 50+ VLM benchmarks spanning a comprehensive range of carefully categorized capabilities from object recognition to spatial awareness, counting, and much more. We showcase the utility of UniBench for measuring progress by evaluating nearly 60 publicly available vision-language models, trained on scales of up to 12.8B samples. We find that while scaling training data or model size can boost many vision-language model capabilities, scaling offers little benefit for reasoning or relations. Surprisingly, we also discover today’s best VLMs struggle on simple digit recognition and counting tasks, e.g. MNIST, which much simpler networks can solve. Where scale falls short, we find that more precise interventions, such as data quality or tailored-learning objectives offer more promise. For practitioners, we also offer guidance on selecting a suitable VLM for a given application. Finally, we release an easy-to-run UniBench code-base with the full set of 50+ benchmarks and comparisons across 59 models as well as a distilled, representative set of benchmarks that runs in 5 minutes on a single GPU.

[LG-56] On the Geometry of Deep Learning

链接: https://arxiv.org/abs/2408.04809
作者: Randall Balestriero,Ahmed Imtiaz Humayun,Richard Baraniuk
关键词-EN: continuous piecewise linear, piecewise linear functions, continuous piecewise, multiple dimensions, network affine spline
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In this paper, we overview one promising avenue of progress at the mathematical foundation of deep learning: the connection between deep networks and function approximation by affine splines (continuous piecewise linear functions in multiple dimensions). In particular, we will overview work over the past decade on understanding certain geometrical properties of a deep network’s affine spline mapping, in particular how it tessellates its input space. As we will see, the affine spline connection and geometrical viewpoint provide a powerful portal through which to view, analyze, and improve the inner workings of a deep network.
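
The tessellation view is easy to verify numerically: each input falls into an affine region identified by its ReLU on/off pattern, so counting distinct patterns on a grid lower-bounds the number of regions. The toy network sizes below are arbitrary.

```python
import numpy as np

# A ReLU network is an affine spline: on each region of its input-space
# tessellation it is exactly affine, and the region is identified by the
# activation pattern of its ReLUs. Count the patterns a random 2-16-16
# network realizes on a grid (a lower bound on its number of regions).
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(16, 2)), rng.normal(size=16)
W2, b2 = rng.normal(size=(16, 16)), rng.normal(size=16)

def activation_pattern(x):
    h1 = W1 @ x + b1
    h2 = W2 @ np.maximum(h1, 0) + b2
    return tuple((h1 > 0).astype(int)) + tuple((h2 > 0).astype(int))

grid = np.linspace(-2, 2, 80)
patterns = {activation_pattern(np.array([x, y])) for x in grid for y in grid}
print(len(patterns), "distinct affine regions sampled")
```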

[LG-57] Scaling Deep Learning Computation over the Inter-Core Connected Intelligence Processor SOSP’24

Link: https://arxiv.org/abs/2408.04808
Authors: Yiqi Liu, Yuqi Xue, Yu Cheng, Lingxiao Ma, Ziming Miao, Jilong Xue, Jian Huang
Keywords-EN: incorporate numerous parallelized, low-latency interconnect links, numerous parallelized cores, scale deep learning, chips incorporate numerous
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*Comments: This paper is accepted at The 30th ACM Symposium on Operating Systems Principles (SOSP'24)

Click to view abstract

Abstract:As AI chips incorporate numerous parallelized cores to scale deep learning (DL) computing, inter-core communication is enabled recently by employing high-bandwidth and low-latency interconnect links on the chip (e.g., Graphcore IPU). It allows each core to directly access the fast scratchpad memory in other cores, which enables new parallel computing paradigms. However, without proper support for the scalable inter-core connections in current DL compilers, it is hard for developers to exploit the benefits of this new architecture. We present T10, the first DL compiler to exploit the inter-core communication bandwidth and distributed on-chip memory on AI chips. To formulate the computation and communication patterns of tensor operators in this new architecture, T10 introduces a distributed tensor abstraction rTensor. T10 maps a DNN model to execution plans with a generalized compute-shift pattern, by partitioning DNN computation into sub-operators and mapping them to cores, so that the cores can exchange data following predictable patterns. T10 makes globally optimized trade-offs between on-chip memory consumption and inter-core communication overhead, selects the best execution plan from a vast optimization space, and alleviates unnecessary inter-core communications. Our evaluation with a real inter-core connected AI chip, the Graphcore IPU, shows up to 3.3x performance improvement, and scalability support for larger models, compared to state-of-the-art DL compilers and vendor libraries.
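The generalized compute-shift pattern can be caricatured in a few lines (a hypothetical toy, not T10's rTensor API): each core computes on its local shard, then shards rotate one hop around a ring until every core has visited every shard.

```python
# Toy compute-shift schedule: C cores, one shard per core. Each step,
# every core folds its current shard into a running partial result,
# then the shards shift one position around the ring. After C steps,
# every core has accumulated over all shards with only neighbor-to-
# neighbor (predictable) communication.
def compute_shift(shards, compute):
    n = len(shards)
    partials = [0] * n
    for _ in range(n):
        partials = [p + compute(s) for p, s in zip(partials, shards)]
        shards = shards[-1:] + shards[:-1]   # ring shift by one hop
    return partials

# Four "cores" each accumulate the sum of squares of all four shards.
out = compute_shift([1, 2, 3, 4], lambda s: s * s)
```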

[LG-58] AI and Machine Learning Driven Indoor Localization and Navigation with Mobile Embedded Systems

Link: https://arxiv.org/abs/2408.04797
Authors: Sudeep Pasricha
Keywords-EN: indoor navigation solutions, autonomous vehicles, Indoor navigation, mobile embedded systems, foundational technology
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:Indoor navigation is a foundational technology to assist the tracking and localization of humans, autonomous vehicles, drones, and robots in indoor spaces. Due to the lack of penetration of GPS signals in buildings, subterranean locales, and dense urban environments, indoor navigation solutions typically make use of ubiquitous wireless signals (e.g., WiFi) and sensors in mobile embedded systems to perform tracking and localization. This article provides an overview of the many challenges facing state-of-the-art indoor navigation solutions, and then describes how AI algorithms deployed on mobile embedded systems can overcome these challenges.
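One classical baseline this overview alludes to, WiFi fingerprinting, can be sketched as weighted k-nearest-neighbor matching in signal space. The fingerprint database, positions, and RSSI values below are invented for illustration:

```python
import math

# Hypothetical offline fingerprint database: known (x, y) positions in
# meters paired with RSSI readings (dBm) from three WiFi access points.
fingerprints = [
    ((0.0, 0.0), [-40, -70, -80]),
    ((5.0, 0.0), [-70, -40, -75]),
    ((0.0, 5.0), [-75, -72, -42]),
    ((5.0, 5.0), [-65, -60, -55]),
]

def locate(rssi, k=2):
    # Weighted k-nearest-neighbor match in RSSI space: closer
    # fingerprints (in signal distance) get larger weights.
    nearest = sorted(
        (math.dist(rssi, fp_rssi), pos) for pos, fp_rssi in fingerprints
    )[:k]
    weights = [1.0 / (d + 1e-9) for d, _ in nearest]
    total = sum(weights)
    x = sum(w * pos[0] for w, (_, pos) in zip(weights, nearest)) / total
    y = sum(w * pos[1] for w, (_, pos) in zip(weights, nearest)) / total
    return x, y
```

A reading close to the first fingerprint's signature should localize near (0, 0); real systems replace this lookup with learned models, which is where the AI algorithms surveyed in the article come in.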

[LG-59] Confident magnitude-based neural network pruning

Link: https://arxiv.org/abs/2408.04759
Authors: Joaquin Alvarez
Keywords-EN: deep learning models, Pruning neural networks, neural networks, deep neural network, increase the efficiency
Subjects: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Pruning neural networks has proven to be a successful approach to increase the efficiency and reduce the memory storage of deep learning models without compromising performance. Previous literature has shown that it is possible to achieve a sizable reduction in the number of parameters of a deep neural network without deteriorating its predictive capacity in one-shot pruning regimes. Our work builds beyond this background in order to provide rigorous uncertainty quantification for pruning neural networks reliably, which has not been addressed to a great extent in previous literature focusing on pruning methods in computer vision settings. We leverage recent techniques on distribution-free uncertainty quantification to provide finite-sample statistical guarantees to compress deep neural networks, while maintaining high performance. Moreover, this work presents experiments in computer vision tasks to illustrate how uncertainty-aware pruning is a useful approach to deploy sparse neural networks safely.
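The magnitude-pruning step the paper builds on (its distribution-free uncertainty-quantification layer is not reproduced here) amounts to zeroing the smallest-magnitude fraction of weights in one shot:

```python
import numpy as np

rng = np.random.default_rng(1)
weights = rng.normal(size=(64, 64))   # stand-in for one layer's weights

def magnitude_prune(w, sparsity):
    # One-shot magnitude pruning: zero out the fraction `sparsity`
    # of entries with the smallest absolute value.
    k = int(w.size * sparsity)
    threshold = np.sort(np.abs(w).ravel())[k]
    mask = np.abs(w) >= threshold
    return w * mask, mask

pruned, mask = magnitude_prune(weights, 0.9)
```

The paper's contribution is then to choose the sparsity level with finite-sample statistical guarantees on performance, rather than by trial and error.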

[LG-60] Quantifying the Corpus Bias Problem in Automatic Music Transcription Systems

Link: https://arxiv.org/abs/2408.04737
Authors: Lukáš Samuel Marták, Patricia Hu, Gerhard Widmer
Keywords-EN: Automatic Music Transcription, Automatic Music, Music Transcription, task of recognizing, audio recordings
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*Comments: 2 pages, 1 figure, presented in the 1st International Workshop on Sound Signal Processing Applications (IWSSPA) 2024

Click to view abstract

Abstract:Automatic Music Transcription (AMT) is the task of recognizing notes in audio recordings of music. The State-of-the-Art (SotA) benchmarks have been dominated by deep learning systems. Due to the scarcity of high quality data, they are usually trained and evaluated exclusively or predominantly on classical piano music. Unfortunately, that hinders our ability to understand how they generalize to other music. Previous works have revealed several aspects of memorization and overfitting in these systems. We identify two primary sources of distribution shift: the music, and the sound. Complementing recent results on the sound axis (i.e. acoustics, timbre), we investigate the musical one (i.e. note combinations, dynamics, genre). We evaluate the performance of several SotA AMT systems on two new experimental test sets which we carefully construct to emulate different levels of musical distribution shift. Our results reveal a stark performance gap, shedding further light on the Corpus Bias problem, and the extent to which it continues to trouble these systems.

[LG-61] Zero-Shot Uncertainty Quantification using Diffusion Probabilistic Models

Link: https://arxiv.org/abs/2408.04718
Authors: Dule Shu, Amir Barati Farimani
Keywords-EN: problems commonly encountered, diffusion probabilistic models, regression problems commonly, diffusion models, diffusion
Subjects: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:The success of diffusion probabilistic models in generative tasks, such as text-to-image generation, has motivated the exploration of their application to regression problems commonly encountered in scientific computing and various other domains. In this context, the use of diffusion regression models for ensemble prediction is becoming a practice with increasing popularity. Under such background, we conducted a study to quantitatively evaluate the effectiveness of ensemble methods on solving different regression problems using diffusion models. We consider the ensemble prediction of a diffusion model as a means for zero-shot uncertainty quantification, since the diffusion models in our study are not trained with a loss function containing any uncertainty estimation. Through extensive experiments on 1D and 2D data, we demonstrate that ensemble methods consistently improve model prediction accuracy across various regression tasks. Notably, we observed a larger accuracy gain in auto-regressive prediction compared with point-wise prediction, and that enhancements take place in both the mean-square error and the physics-informed loss. Additionally, we reveal a statistical correlation between ensemble prediction error and ensemble variance, offering insights into balancing computational complexity with prediction accuracy and monitoring prediction confidence in practical applications where the ground truth is unknown. Our study provides a comprehensive view of the utility of diffusion ensembles, serving as a useful reference for practitioners employing diffusion models in regression problem-solving.
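The ensemble mean/variance computation at the heart of this zero-shot uncertainty scheme is simple to state. The snippet below uses synthetic stand-ins for the diffusion model's samples rather than a real model:

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for M ensemble members drawn from a diffusion regression
# model: here, the true signal plus member-specific noise.
true_y = np.sin(np.linspace(0, 2 * np.pi, 100))
members = np.stack([true_y + rng.normal(scale=0.2, size=100)
                    for _ in range(32)])

ens_mean = members.mean(axis=0)    # ensemble prediction
ens_var = members.var(axis=0)      # zero-shot uncertainty estimate

mse_single = np.mean((members[0] - true_y) ** 2)
mse_ens = np.mean((ens_mean - true_y) ** 2)
```

Averaging members reduces error relative to a single sample, and the per-point variance serves as the confidence signal the paper correlates with prediction error.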

[LG-62] DyGMamba: Efficiently Modeling Long-Term Temporal Dependency on Continuous-Time Dynamic Graphs with State Space Models

Link: https://arxiv.org/abs/2408.04713
Authors: Zifeng Ding, Yifeng Li, Yuan He, Antonio Norelli, Jingcheng Wu, Volker Tresp, Yunpu Ma, Michael Bronstein
Keywords-EN: nuanced temporal details, grasp nuanced temporal, Encoding longer histories, CTDG representation learning, grasp nuanced
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments: Preprint. Work in progress

Click to view abstract

Abstract:Learning useful representations for continuous-time dynamic graphs (CTDGs) is challenging, due to the concurrent need to span long node interaction histories and grasp nuanced temporal details. In particular, two problems emerge: (1) Encoding longer histories requires more computational resources, making it crucial for CTDG models to maintain low computational complexity to ensure efficiency; (2) Meanwhile, more powerful models are needed to identify and select the most critical temporal information within the extended context provided by longer histories. To address these problems, we propose a CTDG representation learning model named DyGMamba, originating from the popular Mamba state space model (SSM). DyGMamba first leverages a node-level SSM to encode the sequence of historical node interactions. Another time-level SSM is then employed to exploit the temporal patterns hidden in the historical graph, where its output is used to dynamically select the critical information from the interaction history. We validate DyGMamba experimentally on the dynamic link prediction task. The results show that our model achieves state-of-the-art in most cases. DyGMamba also maintains high efficiency in terms of computational resources, making it possible to capture long temporal dependencies with a limited computation budget.

[LG-63] Overlay-based Decentralized Federated Learning in Bandwidth-limited Networks

Link: https://arxiv.org/abs/2408.04705
Authors: Yudi Huang, Tingyang Sun, Ting He
Keywords-EN: emerging machine learning, machine learning paradigm, decentralized federated learning, artificial intelligence, centralized coordination
Subjects: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*Comments:

Click to view abstract

Abstract:The emerging machine learning paradigm of decentralized federated learning (DFL) has the promise of greatly boosting the deployment of artificial intelligence (AI) by directly learning across distributed agents without centralized coordination. Despite significant efforts on improving the communication efficiency of DFL, most existing solutions were based on the simplistic assumption that neighboring agents are physically adjacent in the underlying communication network, which fails to correctly capture the communication cost when learning over a general bandwidth-limited network, as encountered in many edge networks. In this work, we address this gap by leveraging recent advances in network tomography to jointly design the communication demands and the communication schedule for overlay-based DFL in bandwidth-limited networks without requiring explicit cooperation from the underlying network. By carefully analyzing the structure of our problem, we decompose it into a series of optimization problems that can each be solved efficiently, to collectively minimize the total training time. Extensive data-driven simulations show that our solution can significantly accelerate DFL in comparison with state-of-the-art designs.

[LG-64] Understanding the Performance and Estimating the Cost of LLM Fine-Tuning

Link: https://arxiv.org/abs/2408.04693
Authors: Yuchen Xia, Jiho Kim, Yuhan Chen, Haojie Ye, Souvik Kundu, Cong (Callie) Hao, Nishil Talati
Keywords-EN: training Large Language, Large Language Models, Large Language, limited compute resources, training Large
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments: 10 pages, conference

Click to view abstract

Abstract:Due to the cost-prohibitive nature of training Large Language Models (LLMs), fine-tuning has emerged as an attractive alternative for specializing LLMs for specific tasks using limited compute resources in a cost-effective manner. In this paper, we characterize sparse Mixture of Experts (MoE) based LLM fine-tuning to understand their accuracy and runtime performance on a single GPU. Our evaluation provides unique insights into the training efficacy of sparse and dense versions of MoE models, as well as their runtime characteristics, including maximum batch size, execution time breakdown, end-to-end throughput, GPU hardware utilization, and load distribution. Our study identifies the optimization of the MoE layer as crucial for further improving the performance of LLM fine-tuning. Using our profiling results, we also develop and validate an analytical model to estimate the cost of LLM fine-tuning on the cloud. This model, based on parameters of the model and GPU architecture, estimates LLM throughput and the cost of training, aiding practitioners in industry and academia to budget the cost of fine-tuning a specific model.
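A back-of-the-envelope analog of such an analytical cost model (the paper's actual model and fitted parameters are not reproduced; the throughput and price numbers below are made up) estimates cost as time to process the training tokens at a measured throughput, times the GPU rental price:

```python
# Hypothetical fine-tuning cost estimate:
#   hours = tokens / (throughput_per_gpu * num_gpus) / 3600
#   cost  = hours * hourly_price * num_gpus
def finetune_cost_usd(num_tokens, tokens_per_sec, gpu_hourly_usd, num_gpus=1):
    hours = num_tokens / (tokens_per_sec * num_gpus) / 3600.0
    return hours * gpu_hourly_usd * num_gpus

# E.g., 1B training tokens at 2,500 tok/s per GPU on 4 GPUs at $2.50/GPU-hr.
cost = finetune_cost_usd(1_000_000_000, 2_500, 2.50, num_gpus=4)
```

The paper's model refines this by predicting the throughput term itself from model and GPU architecture parameters.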

[LG-65] Exploring Scalability in Large-Scale Time Series in DeepVATS framework

Link: https://arxiv.org/abs/2408.04692
Authors: Inmaculada Santamaria-Valenzuela, Victor Rodriguez-Fernandez, David Camacho
Keywords-EN: Deep Learning, Visual analytics, Deep Learning module, Visual Analytics module, Deep Learning models
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments: Admitted pending publication in the Lecture Notes in Networks and Systems (LNNS) series (Springer). Code available at this https URL

Click to view abstract

Abstract:Visual analytics is essential for studying large time series due to its ability to reveal trends, anomalies, and insights. DeepVATS is a tool that merges Deep Learning (Deep) with Visual Analytics (VA) for the analysis of large time series data (TS). It has three interconnected modules. The Deep Learning module, developed in R, manages the load of datasets and Deep Learning models from and to the Storage module. This module also supports models training and the acquisition of the embeddings from the latent space of the trained model. The Storage module operates using the Weights and Biases system. Subsequently, these embeddings can be analyzed in the Visual Analytics module. This module, based on an R Shiny application, allows the adjustment of the parameters related to the projection and clustering of the embeddings space. Once these parameters are set, interactive plots representing both the embeddings, and the time series are shown. This paper introduces the tool and examines its scalability through log analytics. The execution time evolution is examined while the length of the time series is varied. This is achieved by resampling a large data series into smaller subsets and logging the main execution and rendering times for later analysis of scalability.

[LG-66] ToolSandbox: A Stateful Conversational Interactive Evaluation Benchmark for LLM Tool Use Capabilities

Link: https://arxiv.org/abs/2408.04682
Authors: Jiarui Lu, Thomas Holleis, Yizhe Zhang, Bernhard Aumayer, Feng Nan, Felix Bai, Shuang Ma, Shen Ma, Mengyu Li, Guoli Yin, Zirui Wang, Ruoming Pang
Keywords-EN: Recent large language, solving real-world challenges, growing research interest, Recent large, assisted LLMs solving
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Recent large language model (LLM) advancements have sparked a growing research interest in tool-assisted LLMs solving real-world challenges, which calls for comprehensive evaluation of tool-use capabilities. While previous works focused either on evaluating over stateless web services (RESTful APIs) based on a single-turn user prompt, or on an off-policy dialog trajectory, we introduce ToolSandbox, which includes stateful tool execution, implicit state dependencies between tools, a built-in user simulator supporting on-policy conversational evaluation, and a dynamic evaluation strategy for intermediate and final milestones over an arbitrary trajectory. We show that open source and proprietary models have a significant performance gap, and that complex tasks like State Dependency, Canonicalization and Insufficient Information defined in ToolSandbox are challenging even for the most capable SOTA LLMs, providing brand-new insights into tool-use LLM capabilities. The ToolSandbox evaluation framework is released at this https URL

[LG-67] Towards Linguistic Neural Representation Learning and Sentence Retrieval from Electroencephalogram Recordings

Link: https://arxiv.org/abs/2408.04679
Authors: Jinzhao Zhou, Yiqun Duan, Ziyi Zhao, Yu-Cheng Chang, Yu-Kai Wang, Thomas Do, Chin-Teng Lin
Keywords-EN: vast applicational potential, non-invasive brain signals, EEG, gained increasing research, increasing research attention
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Decoding linguistic information from non-invasive brain signals using EEG has gained increasing research attention due to its vast applicational potential. Recently, a number of works have adopted a generative-based framework to decode electroencephalogram (EEG) signals into sentences by utilizing the powerful generative capacity of pretrained large language models (LLMs). However, this approach has several drawbacks that hinder the further development of linguistic applications for brain-computer interfaces (BCIs). Specifically, the ability of the EEG encoder to learn semantic information from EEG data remains questionable, and the LLM decoder's tendency to generate sentences based on its training memory can be hard to avoid. These issues necessitate a novel approach for converting EEG signals into sentences. In this paper, we propose a novel two-step pipeline that addresses these limitations and enhances the validity of linguistic EEG decoding research. We first confirm that word-level semantic information can be learned from EEG data recorded during natural reading by training a Conformer encoder via a masked contrastive objective for word-level classification. To achieve sentence decoding results, we employ a training-free retrieval method to retrieve sentences based on the predictions from the EEG encoder. Extensive experiments and ablation studies were conducted for a comprehensive evaluation of the proposed approach. Visualization of the top prediction candidates reveals that our model effectively groups EEG segments into semantic categories with similar meanings, thereby validating its ability to learn patterns from unspoken EEG recordings. Despite the exploratory nature of this work, these results suggest that our method holds promise for providing more reliable solutions for converting EEG signals into text.
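The training-free retrieval step can be sketched as nearest-neighbor search in an embedding space. The embeddings below are random stand-ins, not outputs of the paper's Conformer encoder:

```python
import numpy as np

rng = np.random.default_rng(7)
dim = 16
sentences = ["the cat sat", "dogs bark loudly", "rain fell all night"]

# Stand-in candidate-sentence embeddings, normalized to unit length.
sent_emb = rng.normal(size=(3, dim))
sent_emb /= np.linalg.norm(sent_emb, axis=1, keepdims=True)

# Pretend the EEG encoder produced a query embedding close to sentence 1.
query = sent_emb[1] + rng.normal(scale=0.05, size=dim)
query /= np.linalg.norm(query)

# Training-free retrieval: pick the candidate with highest cosine score.
scores = sent_emb @ query
retrieved = sentences[int(np.argmax(scores))]
```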

[LG-68] AutoFAIR : Automatic Data FAIRification via Machine Reading

Link: https://arxiv.org/abs/2408.04673
Authors: Tingyan Ma, Wei Liu, Bin Lu, Xiaoying Gan, Yunqiang Zhu, Luoyi Fu, Chenghu Zhou
Keywords-EN: fuels data-driven research, data fuels data-driven, data-driven research, facilitating progress, diverse domains
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:The explosive growth of data fuels data-driven research, facilitating progress across diverse domains. The FAIR principles emerge as a guiding standard, aiming to enhance the findability, accessibility, interoperability, and reusability of data. However, current efforts primarily focus on manual data FAIRification, which can only handle targeted data and lacks efficiency. To address this issue, we propose AutoFAIR, an architecture designed to enhance data FAIRness automatically. First, we align each data and metadata operation with specific FAIR indicators to guide machine-executable actions. Then, we utilize Web Reader to automatically extract metadata based on language models, even in the absence of structured data webpage schemas. Subsequently, FAIR Alignment is employed to make metadata comply with FAIR principles through ontology guidance and semantic matching. Finally, by applying AutoFAIR to various data, especially in the field of mountain hazards, we observe significant improvements in the findability, accessibility, interoperability, and reusability of data. The FAIRness scores before and after applying AutoFAIR indicate enhanced data value.

[LG-69] LLM Stability: A detailed analysis with some surprises

Link: https://arxiv.org/abs/2408.04667
Authors: Berk Atil, Alexa Chittams, Liseng Fu, Ferhan Ture, Lixinyu Xu, Breck Baldwin
Keywords-EN: LLM stability, deterministic hyper-parameters, raw output level, parsed output, accuracy variation
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
*Comments:

Click to view abstract

Abstract:A concerning property of our nearly magical LLMs involves the variation of results given the exact same input and deterministic hyper-parameters. While AI has always had a certain level of noisiness from inputs outside of training data, we have generally had deterministic results for any particular input; that is no longer true. While most LLM practitioners are “in the know”, we are unaware of any work that attempts to quantify current LLM stability. We suspect no one has taken the trouble because it is just too boring a paper to execute and write. But we have done it and there are some surprises. What kinds of surprises? The evaluated LLMs are rarely deterministic at the raw output level; they are much more deterministic at the parsed output/answer level, but still rarely 100% stable across 5 re-runs with the same data input. LLM accuracy variation is not normally distributed. Stability varies based on task.
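A stability score of the kind the paper measures, i.e. agreement with the modal parsed answer across repeated runs of the same input, might be computed as follows (the run outputs are hypothetical):

```python
from collections import Counter

def stability(parsed_answers):
    # Fraction of re-runs that agree with the most common parsed answer.
    counts = Counter(parsed_answers)
    modal_answer, n = counts.most_common(1)[0]
    return modal_answer, n / len(parsed_answers)

# Hypothetical parsed answers from 5 re-runs with identical input
# and deterministic decoding settings.
runs = ["B", "B", "A", "B", "B"]
answer, score = stability(runs)
```

A perfectly stable model would score 1.0 on every input; the paper's finding is that even at the parsed-answer level, 5/5 agreement is rare.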

[LG-70] Winning Amazon KDD Cup'24

Link: https://arxiv.org/abs/2408.04658
Authors: Chris Deotte, Ivan Sorokin, Ahmet Erdem, Benedikt Schifferer, Gilberto Titericz Jr, Simon Jegou
Keywords-EN: Multi Task Online, Online Shopping Challenge, Task Online Shopping, Online Shopping, Multi Task
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:This paper describes the winning solution to all 5 tasks of the Amazon KDD Cup 2024 Multi Task Online Shopping Challenge for LLMs. The challenge was to build a useful assistant answering questions in the domain of online shopping. The competition contained 57 diverse tasks, covering 5 different task types (e.g. multiple choice) across 4 different tracks (e.g. multi-lingual). Our solution is a single model per track. We fine-tune Qwen2-72B-Instruct on our own training dataset. As the competition released only 96 example questions, we developed our own training dataset by processing multiple public datasets or using Large Language Models for data augmentation and synthetic data generation. We apply wise-ft to account for distribution shifts and ensemble multiple LoRA adapters in one model. We employed Logits Processors to constrain the model output to relevant tokens for the tasks. AWQ 4-bit quantization and vLLM are used during inference to predict the test dataset within the time constraints of 20 to 140 minutes, depending on the track. Our solution achieved first place in each individual track and first place overall in Amazon's KDD Cup 2024.

[LG-71] Leveraging Large Language Models with Chain-of-Thought and Prompt Engineering for Traffic Crash Severity Analysis and Inference

Link: https://arxiv.org/abs/2408.04652
Authors: Hao Zhen, Yucheng Shi, Yongcan Huang, Jidong J. Yang, Ninghao Liu
Keywords-EN: Large Language Models, Large Language, crash severity inference, Harnessing the power, power of Large
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments: 20 pages, 12 figures, 3 tables

Click to view abstract

Abstract:Harnessing the power of Large Language Models (LLMs), this study explores the use of three state-of-the-art LLMs, specifically GPT-3.5-turbo, LLaMA3-8B, and LLaMA3-70B, for crash severity inference, framing it as a classification task. We generate textual narratives from original traffic crash tabular data using a pre-built template infused with domain knowledge. Additionally, we incorporated Chain-of-Thought (CoT) reasoning to guide the LLMs in analyzing the crash causes and then inferring the severity. This study also examines the impact of prompt engineering specifically designed for crash severity inference. The LLMs were tasked with crash severity inference to: (1) evaluate the models' capabilities in crash severity analysis, (2) assess the effectiveness of CoT and domain-informed prompt engineering, and (3) examine the reasoning abilities within the CoT framework. Our results showed that LLaMA3-70B consistently outperformed the other models, particularly in zero-shot settings. The CoT and prompt engineering techniques significantly enhanced performance, improving logical reasoning and addressing alignment issues. Notably, CoT offers valuable insights into LLMs' reasoning processes, unleashing their capacity to consider diverse factors such as environmental conditions, driver behavior, and vehicle characteristics in severity analysis and inference.
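The template-plus-CoT prompting described above might look like the following sketch. The field names, template wording, and severity labels are invented for illustration, not the paper's actual template:

```python
# Hypothetical template turning one tabular crash record into a
# narrative plus a chain-of-thought instruction.
TEMPLATE = (
    "Crash report: On a {surface} road in {weather} weather, a {vehicle} "
    "crashed at {speed} mph.\n"
    "Let's think step by step about the likely causes, then classify the "
    "severity as one of: minor, severe, fatal."
)

def build_prompt(record):
    # Fill the domain-knowledge template from one tabular row.
    return TEMPLATE.format(**record)

prompt = build_prompt(
    {"surface": "wet", "weather": "rainy", "vehicle": "sedan", "speed": 65}
)
```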

[LG-72] Building Trust in Mental Health Chatbots: Safety Metrics and LLM-Based Evaluation Tools

Link: https://arxiv.org/abs/2408.04650
Authors: Jung In Park, Mahyar Abbasian, Iman Azimi, Dawn Bounds, Angela Jun, Jaesu Han, Robert McCarron, Jessica Borelli, Jia Li, Mona Mahmoudi, Carmen Wiedenhoeft, Amir Rahmani
Keywords-EN: increasingly popular due, mental health chatbots, mental health, human-like interactions, aims to develop
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Objective: This study aims to develop and validate an evaluation framework to ensure the safety and reliability of mental health chatbots, which are increasingly popular due to their accessibility, human-like interactions, and context-aware support. Materials and Methods: We created an evaluation framework with 100 benchmark questions and ideal responses, and five guideline questions for chatbot responses. This framework, validated by mental health experts, was tested on a GPT-3.5-turbo-based chatbot. Automated evaluation methods explored included large language model (LLM)-based scoring, an agentic approach using real-time data, and embedding models to compare chatbot responses against ground truth standards. Results: The results highlight the importance of guidelines and ground truth for improving LLM evaluation accuracy. The agentic method, dynamically accessing reliable information, demonstrated the best alignment with human assessments. Adherence to a standardized, expert-validated framework significantly enhanced chatbot response safety and reliability. Discussion: Our findings emphasize the need for comprehensive, expert-tailored safety evaluation metrics for mental health chatbots. While LLMs have significant potential, careful implementation is necessary to mitigate risks. The superior performance of the agentic approach underscores the importance of real-time data access in enhancing chatbot reliability. Conclusion: The study validated an evaluation framework for mental health chatbots, proving its effectiveness in improving safety and reliability. Future work should extend evaluations to accuracy, bias, empathy, and privacy to ensure holistic assessment and responsible integration into healthcare. Standardized evaluations will build trust among users and professionals, facilitating broader adoption and improved mental health support through technology.

[LG-73] Distinguishing Chatbot from Human

Link: https://arxiv.org/abs/2408.04647
Authors: Gauri Anil Godghase, Rishit Agrawal, Tanush Obili, Mark Stamp
Keywords-EN: generative Artificial Intelligence, Generative Pre-trained Transformer, Large Language Models, Artificial Intelligence, Pre-trained Transformer
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:There have been many recent advances in the fields of generative Artificial Intelligence (AI) and Large Language Models (LLM), with the Generative Pre-trained Transformer (GPT) model being a leading “chatbot.” LLM-based chatbots have become so powerful that it may seem difficult to differentiate between human-written and machine-generated text. To analyze this problem, we have developed a new dataset consisting of more than 750,000 human-written paragraphs, with a corresponding chatbot-generated paragraph for each. Based on this dataset, we apply Machine Learning (ML) techniques to determine the origin of text (human or chatbot). Specifically, we consider two methodologies for tackling this issue: feature analysis and embeddings. Our feature analysis approach involves extracting a collection of features from the text for classification. We also explore the use of contextual embeddings and transformer-based architectures to train classification models. Our proposed solutions offer high classification accuracy and serve as useful tools for textual analysis, resulting in a better understanding of chatbot-generated text in this era of advanced AI technology.
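A feature-analysis pipeline of the kind described would start from hand-crafted text statistics like these (an illustrative feature set, not the paper's), which a downstream classifier could then use to separate human-written from chatbot-generated paragraphs:

```python
# Illustrative hand-crafted features for human-vs-chatbot classification.
def text_features(paragraph):
    words = paragraph.split()
    sentences = [s for s in paragraph.split(".") if s.strip()]
    return {
        "num_words": len(words),
        "avg_word_len": sum(len(w) for w in words) / max(len(words), 1),
        "avg_sentence_len": len(words) / max(len(sentences), 1),
        # Lexical diversity: unique lowercased tokens over total tokens.
        "type_token_ratio": len({w.lower() for w in words}) / max(len(words), 1),
    }

feats = text_features("The cat sat. The cat ran.")
```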

[LG-74] Efficacy of Large Language Models in Systematic Reviews

Link: https://arxiv.org/abs/2408.04646
Authors: Aaditya Shah, Shridhar Mehendale, Siddha Kanthi
Keywords-EN: Large Language Models, Large Language, relationship between Environmental, effectiveness of Large, interpreting existing literature
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:This study investigates the effectiveness of Large Language Models (LLMs) in interpreting existing literature through a systematic review of the relationship between Environmental, Social, and Governance (ESG) factors and financial performance. The primary objective is to assess how LLMs can replicate a systematic review on a corpus of ESG-focused papers. We compiled and hand-coded a database of 88 relevant papers published from March 2020 to May 2024. Additionally, we used a set of 238 papers from a previous systematic review of ESG literature from January 2015 to February 2020. We evaluated two current state-of-the-art LLMs, Meta AI’s Llama 3 8B and OpenAI’s GPT-4o, on the accuracy of their interpretations relative to human-made classifications on both sets of papers. We then compared these results to a “Custom GPT” and a fine-tuned GPT-4o Mini model using the corpus of 238 papers as training data. The fine-tuned GPT-4o Mini model outperformed the base LLMs by 28.3% on average in overall accuracy on prompt 1. At the same time, the “Custom GPT” showed a 3.0% and 15.7% improvement on average in overall accuracy on prompts 2 and 3, respectively. Our findings reveal promising results for investors and agencies to leverage LLMs to summarize complex evidence related to ESG investing, thereby enabling quicker decision-making and a more efficient market.

[LG-75] Risks Causes and Mitigations of Widespread Deployments of Large Language Models (LLMs): A Survey

链接: https://arxiv.org/abs/2408.04643
作者: Md Nazmus Sakib,Md Athikul Islam,Royal Pathak,Md Mashrur Arifin
关键词-EN: Natural Language Processing, transformed Natural Language, Large Language Models, significantly transformed Natural, Language Processing
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: Accepted to 2nd International Conference on Artificial Intelligence, Blockchain, and Internet of Things (AIBThings-2024), September 07-08, 2024, Michigan, USA

点击查看摘要

Abstract:Recent advancements in Large Language Models (LLMs), such as ChatGPT and LLaMA, have significantly transformed Natural Language Processing (NLP) with their outstanding abilities in text generation, summarization, and classification. Nevertheless, their widespread adoption introduces numerous challenges, including issues related to academic integrity, copyright, environmental impacts, and ethical considerations such as data bias, fairness, and privacy. The rapid evolution of LLMs also raises concerns regarding the reliability and generalizability of their evaluations. This paper offers a comprehensive survey of the literature on these subjects, systematically gathered and synthesized from Google Scholar. Our study provides an in-depth analysis of the risks associated with specific LLMs, identifying sub-risks, their causes, and potential solutions. Furthermore, we explore the broader challenges related to LLMs, detailing their causes and proposing mitigation strategies. Through this literature analysis, our survey aims to deepen the understanding of the implications and complexities surrounding these powerful models.

[LG-76] GPT-3 Powered Information Extraction for Building Robust Knowledge Bases

链接: https://arxiv.org/abs/2408.04641
作者: Ritabrata Roy Choudhury,Soumik Dey
关键词-EN: language model, information extraction, knowledge base development, information, suggested
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This work uses the state-of-the-art language model GPT-3 to offer a novel method of information extraction for knowledge base development. The suggested method attempts to solve the difficulties associated with obtaining relevant entities and relationships from unstructured text in order to extract structured information. We conduct experiments on a huge corpus of text from diverse fields to assess the performance of our suggested technique. The evaluation measures, which are frequently employed in information extraction tasks, include precision, recall, and F1-score. The findings demonstrate that GPT-3 can be used to efficiently and accurately extract pertinent and correct information from text, hence increasing the precision and productivity of knowledge base creation. We also assess how well our suggested approach performs in comparison to the most advanced information extraction techniques already in use. The findings show that by utilizing only a small number of instances in in-context learning, our suggested strategy yields competitive outcomes with notable savings in terms of data annotation and engineering expense. Additionally, we use our proposed method to retrieve Biomedical information, demonstrating its practicality in a real-world setting. All things considered, our suggested method offers a viable way to overcome the difficulties involved in obtaining structured data from unstructured text in order to create knowledge bases. It can greatly increase the precision and effectiveness of information extraction, which is necessary for many applications including chatbots, recommendation engines, and question-answering systems.
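
The in-context-learning setup the abstract relies on can be sketched as few-shot prompt construction; the prompt wording and example triples below are hypothetical, not taken from the paper.

```python
# Hedged sketch of few-shot in-context learning for information extraction:
# build a prompt asking the model to emit (subject, relation, object) triples.
# Example sentences and wording are invented for illustration.

FEW_SHOT = [
    ("Aspirin reduces fever.", "(Aspirin, treats, fever)"),
    ("Insulin is produced by the pancreas.", "(pancreas, produces, Insulin)"),
]

def build_prompt(sentence):
    lines = ["Extract (subject, relation, object) triples from the text."]
    for text, triple in FEW_SHOT:
        lines.append(f"Text: {text}\nTriples: {triple}")
    lines.append(f"Text: {sentence}\nTriples:")
    return "\n\n".join(lines)

prompt = build_prompt("Metformin lowers blood glucose.")
print(prompt)
```

The completion the LLM produces after the final `Triples:` marker is the extracted structured output; only a handful of such demonstrations are needed, which is the annotation saving the abstract highlights.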

[LG-77] Abstractive summarization from Audio Transcription

链接: https://arxiv.org/abs/2408.04639
作者: Ilia Derkach
关键词-EN: gaining popularity, ranging from text, answers to queries, text translation, translation to generating
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: 36 pages, Master’s thesis, 14 figures

点击查看摘要

Abstract:Currently, large language models are gaining popularity, their achievements are used in many areas, ranging from text translation to generating answers to queries. However, the main problem with these new machine learning algorithms is that training such models requires large computing resources that only large IT companies have. To avoid this problem, a number of methods (LoRA, quantization) have been proposed so that existing models can be effectively fine-tuned for specific tasks. In this paper, we propose an E2E (end to end) audio summarization model using these techniques. In addition, this paper examines the effectiveness of these approaches to the problem under consideration and draws conclusions about the applicability of these methods.
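
The LoRA method mentioned above avoids full fine-tuning by learning a low-rank correction to each frozen weight matrix: W_eff = W + (alpha / r) · B·A with rank r much smaller than the matrix dimensions. A minimal dependency-free sketch of the effective weight:

```python
# LoRA sketch: instead of updating a full d_out x d_in matrix W, train a
# low-rank adapter B (d_out x r) and A (r x d_in), so the effective weight
# is W + (alpha / r) * B @ A. Plain-Python matrices keep it dependency-free.

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def lora_effective_weight(W, A, B, alpha):
    r = len(B[0])                 # adapter rank
    delta = matmul(B, A)          # d_out x d_in low-rank update
    s = alpha / r
    return [[W[i][j] + s * delta[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]

# 2x2 frozen weight with a rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]        # r x d_in
B = [[1.0], [0.0]]      # d_out x r
print(lora_effective_weight(W, A, B, alpha=1.0))  # [[2.0, 2.0], [0.0, 1.0]]
```

Only A and B are trained, so trainable parameters drop from d_out·d_in to r·(d_out + d_in), which is what makes fine-tuning feasible without large compute.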

[LG-78] Decoding Quantum LDPC Codes Using Graph Neural Networks

链接: https://arxiv.org/abs/2408.05170
作者: Vukan Ninkovic,Ognjen Kundacina,Dejan Vukobratovic,Christian Häger,Alexandre Graell i Amat
关键词-EN: Graph Neural Networks, Quantum Low-Density Parity-Check, Neural Networks, Quantum Low-Density, Graph Neural
类目: Quantum Physics (quant-ph); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注: Accepted for GLOBECOM 2024

点击查看摘要

Abstract:In this paper, we propose a novel decoding method for Quantum Low-Density Parity-Check (QLDPC) codes based on Graph Neural Networks (GNNs). Similar to the Belief Propagation (BP)-based QLDPC decoders, the proposed GNN-based QLDPC decoder exploits the sparse graph structure of QLDPC codes and can be implemented as a message-passing decoding algorithm. We compare the proposed GNN-based decoding algorithm against selected classes of both conventional and neural-enhanced QLDPC decoding algorithms across several QLDPC code designs. The simulation results demonstrate excellent performance of GNN-based decoders along with their low complexity compared to competing methods.

[LG-79] Concept learning of parameterized quantum models from limited measurements

链接: https://arxiv.org/abs/2408.05116
作者: Beng Yee Gan,Po-Wei Huang,Elies Gil-Fuster,Patrick Rebentrost
关键词-EN: learning quantum states, quantum states, natural variant, learning, quantum
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 16 + 8 pages, 4 figures

点击查看摘要

Abstract:Classical learning of the expectation values of observables for quantum states is a natural variant of learning quantum states or channels. While learning-theoretic frameworks establish the sample complexity and the number of measurement shots per sample required for learning such statistical quantities, the interplay between these two variables has not been adequately quantified before. In this work, we take the probabilistic nature of quantum measurements into account in classical modelling and discuss these quantities under a single unified learning framework. We provide provable guarantees for learning parameterized quantum models that also quantify the asymmetrical effects and interplay of the two variables on the performance of learning algorithms. These results show that while increasing the sample size enhances the learning performance of classical machines, even with single-shot estimates, the improvements from increasing measurements become asymptotically trivial beyond a constant factor. We further apply our framework and theoretical guarantees to study the impact of measurement noise on the classical surrogation of parameterized quantum circuit models. Our work provides new tools to analyse the operational influence of finite measurement noise in the classical learning of quantum systems.

[LG-80] Variational Bayesian Phylogenetic Inference with Semi-implicit Branch Length Distributions

链接: https://arxiv.org/abs/2408.05058
作者: Tianyu Xie,Frederick A. Matsen IV,Marc A. Suchard,Cheng Zhang
关键词-EN: evolutionary history relating, Bayesian phylogenetic inference, modern Bayesian phylogenetic, Reconstructing the evolutionary, Bayesian phylogenetic
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 26 pages, 7 figures

点击查看摘要

Abstract:Reconstructing the evolutionary history relating a collection of molecular sequences is the main subject of modern Bayesian phylogenetic inference. However, the commonly used Markov chain Monte Carlo methods can be inefficient due to the complicated space of phylogenetic trees, especially when the number of sequences is large. An alternative approach is variational Bayesian phylogenetic inference (VBPI) which transforms the inference problem into an optimization problem. While effective, the default diagonal lognormal approximation for the branch lengths of the tree used in VBPI is often insufficient to capture the complexity of the exact posterior. In this work, we propose a more flexible family of branch length variational posteriors based on semi-implicit hierarchical distributions using graph neural networks. We show that this semi-implicit construction emits straightforward permutation equivariant distributions, and therefore can handle the non-Euclidean branch length space across different tree topologies with ease. To deal with the intractable marginal probability of semi-implicit variational distributions, we develop several alternative lower bounds for stochastic optimization. We demonstrate the effectiveness of our proposed method over baseline methods on benchmark data examples, in terms of both marginal likelihood estimation and branch length posterior approximation.

[LG-81] Integrating Edge Information into Ground Truth for the Segmentation of the Optic Disc and Cup from Fundus Images

链接: https://arxiv.org/abs/2408.05052
作者: Yoga Sri Varshan V,Hitesh Gupta Kattamuri,Subin Sahayam,Umarani Jayaraman
关键词-EN: Optic disc, Optic, optic disc-cup ground, ground truth, myocardial infarction
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Optic disc and cup segmentation helps in the diagnosis of glaucoma, myocardial infarction, and diabetic retinopathy. Most deep learning methods developed to perform segmentation tasks are built on top of a U-Net-based model architecture. Nevertheless, U-Net and its variants have a tendency to over-segment/under-segment the required regions of interest. Since the most important outcome is the value of cup-to-disc ratio and not the segmented regions themselves, we are more concerned about the boundaries rather than the regions under the boundaries. This makes learning edges important as compared to learning the regions. In the proposed work, the authors aim to extract both edges of the optic disc and cup from the ground truth using a Laplacian filter. Next, edges are reconstructed to obtain an edge ground truth in addition to the optic disc-cup ground truth. Utilizing both ground truths, the authors study several U-Net and its variant architectures with and without optic disc and cup edges as target, along with the optic disc-cup ground truth for segmentation. The authors have used the REFUGE benchmark dataset and the Drishti-GS dataset to perform the study, and the results are tabulated for the dice and the Hausdorff distance metrics. In the case of the REFUGE dataset, the optic disc mean dice score has improved from 0.7425 to 0.8859 while the mean Hausdorff distance has reduced from 6.5810 to 3.0540 for the baseline U-Net model. Similarly, the optic cup mean dice score has improved from 0.6970 to 0.8639 while the mean Hausdorff distance has reduced from 5.2340 to 2.6323 for the same model. Similar improvement has been observed for the Drishti-GS dataset as well. Compared to the baseline U-Net and its variants (i.e., the Attention U-Net and the U-Net++), the models that learn integrated edges along with the optic disc and cup regions performed well in both validation and testing datasets.
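
The edge-ground-truth construction described above can be sketched as a discrete Laplacian over a binary mask, keeping mask pixels with a nonzero response; the Dice score used for evaluation is included, and the toy "disc" mask is illustrative.

```python
# Sketch of the edge ground truth: apply a discrete Laplacian to a binary
# mask and keep mask pixels with a nonzero response, which traces the
# region boundary. The Dice score used for evaluation is included.

def laplacian(mask):
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            out[i][j] = (mask[i-1][j] + mask[i+1][j] + mask[i][j-1]
                         + mask[i][j+1] - 4 * mask[i][j])
    return out

def edge_mask(mask):
    lap = laplacian(mask)
    return [[1 if mask[i][j] and lap[i][j] != 0 else 0
             for j in range(len(mask[0]))] for i in range(len(mask))]

def dice(a, b):
    inter = sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    total = sum(map(sum, a)) + sum(map(sum, b))
    return 2 * inter / total

# Toy 5x5 "disc": a 3x3 block of foreground pixels.
disc = [[0]*5, [0,1,1,1,0], [0,1,1,1,0], [0,1,1,1,0], [0]*5]
edges = edge_mask(disc)          # ring of 8 boundary pixels; center excluded
print(sum(map(sum, edges)), dice(disc, disc))
```

Interior pixels have a zero Laplacian response (all four neighbors match), so only the boundary survives, which is exactly the edge target the paper trains on alongside the region masks.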

[LG-82] CROCODILE: Causality aids RObustness via COntrastive DIsentangled LEarning MICCAI2024

链接: https://arxiv.org/abs/2408.04949
作者: Gianluca Carloni,Sotirios A Tsaftaris,Sara Colantonio
关键词-EN: classifiers perform poorly, image classifiers perform, perform poorly, poorly when applied, classifiers perform
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: MICCAI 2024 UNSURE Workshop, Accepted for presentation, Submitted Manuscript Version, 10 pages

点击查看摘要

Abstract:Due to domain shift, deep learning image classifiers perform poorly when applied to a domain different from the training one. For instance, a classifier trained on chest X-ray (CXR) images from one hospital may not generalize to images from another hospital due to variations in scanner settings or patient characteristics. In this paper, we introduce our CROCODILE framework, showing how tools from causality can foster a model’s robustness to domain shift via feature disentanglement, contrastive learning losses, and the injection of prior knowledge. This way, the model relies less on spurious correlations, better learns the mechanism mapping images to predictions, and outperforms baselines on out-of-distribution (OOD) data. We apply our method to multi-label lung disease classification from CXRs, utilizing over 750,000 images from four datasets. Our bias-mitigation method improves domain generalization and fairness, broadening the applicability and reliability of deep learning models for a safer medical image analysis. Find our code at: this https URL.

[LG-83] Variance-based sensitivity analysis in the presence of correlated input variables

链接: https://arxiv.org/abs/2408.04933
作者: Thomas Most
关键词-EN: based sensitivity indices, variance based sensitivity, classical Sobol, sensitivity indices, paper we propose
类目: Methodology (stat.ME); Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注: presented at 5th International Conference on Reliable Engineering Computing (REC), Brno, Czech Republic, 13-15 June, 2012

点击查看摘要

Abstract:In this paper we propose an extension of the classical Sobol’ estimator for the estimation of variance based sensitivity indices. The approach assumes a linear correlation model between the input variables which is used to decompose the contribution of an input variable into a correlated and an uncorrelated part. This method provides sampling matrices following the original joint probability distribution which are used directly to compute the model output without any assumptions or approximations of the model response function.
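
For reference, the classical (independent-input) Sobol' pick-freeze estimator that the paper extends can be sketched as follows; the toy model and sample size are illustrative, and no correlation between inputs is modeled here.

```python
# Classical Sobol' pick-freeze estimator for first-order indices:
# S_i = Cov(f(A), f(B with column i taken from A)) / Var(f(A)).
# Toy additive model Y = X1 + 2*X2 with independent uniform(0,1) inputs,
# so analytically S1 = 1/5 and S2 = 4/5.

import random

def sobol_first_order(f, dim, n, rng):
    A = [[rng.random() for _ in range(dim)] for _ in range(n)]
    B = [[rng.random() for _ in range(dim)] for _ in range(n)]
    yA = [f(x) for x in A]
    mean = sum(yA) / n
    var = sum((y - mean) ** 2 for y in yA) / n
    S = []
    for i in range(dim):
        # "Pick-freeze": B with its i-th coordinate frozen to A's.
        Bi = [b[:i] + [a[i]] + b[i+1:] for a, b in zip(A, B)]
        yBi = [f(x) for x in Bi]
        cov = sum(ya * yb for ya, yb in zip(yA, yBi)) / n - mean ** 2
        S.append(cov / var)
    return S

rng = random.Random(0)
S = sobol_first_order(lambda x: x[0] + 2 * x[1], dim=2, n=20000, rng=rng)
print([round(s, 2) for s in S])   # close to [0.2, 0.8]
```

The paper's contribution is to decompose each input's contribution into correlated and uncorrelated parts under a linear correlation model, which this independent-input sketch does not attempt.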

[LG-84] Causal Discovery of Linear Non-Gaussian Causal Models with Unobserved Confounding

链接: https://arxiv.org/abs/2408.04907
作者: Daniela Schkoda,Elina Robeva,Mathias Drton
关键词-EN: linear non-Gaussian structural, non-Gaussian structural equation, structural equation models, involve latent confounding, linear non-Gaussian
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:We consider linear non-Gaussian structural equation models that involve latent confounding. In this setting, the causal structure is identifiable, but, in general, it is not possible to identify the specific causal effects. Instead, a finite number of different causal effects result in the same observational distribution. Most existing algorithms for identifying these causal effects use overcomplete independent component analysis (ICA), which often suffers from convergence to local optima. Furthermore, the number of latent variables must be known a priori. To address these issues, we propose an algorithm that operates recursively rather than using overcomplete ICA. The algorithm first infers a source, estimates the effect of the source and its latent parents on their descendants, and then eliminates their influence from the data. For both source identification and effect size estimation, we use rank conditions on matrices formed from higher-order cumulants. We prove asymptotic correctness under the mild assumption that locally, the number of latent variables never exceeds the number of observed variables. Simulation studies demonstrate that our method achieves comparable performance to overcomplete ICA even though it does not know the number of latents in advance.

[LG-85] A Pipeline for Data-Driven Learning of Topological Features with Applications to Protein Stability Prediction

链接: https://arxiv.org/abs/2408.04847
作者: Amish Mishra,Francis Motta
关键词-EN: learn interpretable topological, topological features, synthetic mini proteins, parsimonious models trained, SME features
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
*备注: 13 figures, 23 pages (without appendix and references)

点击查看摘要

Abstract:In this paper, we propose a data-driven method to learn interpretable topological features of biomolecular data and demonstrate the efficacy of parsimonious models trained on topological features in predicting the stability of synthetic mini proteins. We compare models that leverage automatically-learned structural features against models trained on a large set of biophysical features determined by subject-matter experts (SME). Our models, based only on topological features of the protein structures, achieved 92%-99% of the performance of SME-based models in terms of the average precision score. By interrogating model performance and feature importance metrics, we extract numerous insights that uncover high correlations between topological features and SME features. We further showcase how combining topological features and SME features can lead to improved model performance over either feature set used in isolation, suggesting that, in some settings, topological features may provide new discriminating information not captured in existing SME features that are useful for protein stability prediction.

[LG-86] Improved Robustness for Deep Learning-based Segmentation of Multi-Center Myocardial Perfusion MRI Datasets Using Data Adaptive Uncertainty-guided Space-time Analysis

链接: https://arxiv.org/abs/2408.04805
作者: Dilek M. Yalcinkaya,Khalid Youssef,Bobak Heydari,Janet Wei,Noel Bairey Merz,Robert Judd,Rohan Dharmakumar,Orlando P. Simonetti,Jonathan W. Weinsaft,Subha V. Raman,Behzad Sharif
关键词-EN: MRI datasets enables, DAUGS analysis approach, proposed DAUGS analysis, perfusion MRI datasets, datasets
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Medical Physics (physics.med-ph)
*备注: Accepted for publication in JCMR, 2024

点击查看摘要

Abstract:Background. Fully automatic analysis of myocardial perfusion MRI datasets enables rapid and objective reporting of stress/rest studies in patients with suspected ischemic heart disease. Developing deep learning techniques that can analyze multi-center datasets despite limited training data and variations in software and hardware is an ongoing challenge. Methods. Datasets from 3 medical centers acquired at 3T (n = 150 subjects) were included: an internal dataset (inD; n = 95) and two external datasets (exDs; n = 55) used for evaluating the robustness of the trained deep neural network (DNN) models against differences in pulse sequence (exD-1) and scanner vendor (exD-2). A subset of inD (n = 85) was used for training/validation of a pool of DNNs for segmentation, all using the same spatiotemporal U-Net architecture and hyperparameters but with different parameter initializations. We employed a space-time sliding-patch analysis approach that automatically yields a pixel-wise “uncertainty map” as a byproduct of the segmentation process. In our approach, a given test case is segmented by all members of the DNN pool and the resulting uncertainty maps are leveraged to automatically select the “best” one among the pool of solutions. Results. The proposed DAUGS analysis approach performed similarly to the established approach on the internal dataset (p = n.s.) whereas it significantly outperformed on the external datasets (p < 0.005 for exD-1 and exD-2). Moreover, the number of image series with “failed” segmentation was significantly lower for the proposed vs. the established approach (4.3% vs. 17.1%, p < 0.0005). Conclusions. The proposed DAUGS analysis approach has the potential to improve the robustness of deep learning methods for segmentation of multi-center stress perfusion datasets with variations in the choice of pulse sequence, site location or scanner vendor.
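
The pool-plus-uncertainty-map idea can be illustrated with a simplified sketch; the selection rule below (pick the member closest to the pool consensus) is an invented stand-in for the DAUGS selection procedure, not the paper's algorithm.

```python
# Illustrative sketch (not the DAUGS algorithm itself): given per-pixel
# foreground probabilities from a pool of segmentation DNNs, derive a
# pixel-wise uncertainty map as the across-pool standard deviation and
# pick the member that agrees most with the pool consensus.

from statistics import pstdev, mean

def uncertainty_map(pool):
    h, w = len(pool[0]), len(pool[0][0])
    return [[pstdev(m[i][j] for m in pool) for j in range(w)] for i in range(h)]

def select_member(pool):
    h, w = len(pool[0]), len(pool[0][0])
    consensus = [[mean(m[i][j] for m in pool) for j in range(w)] for i in range(h)]
    def disagreement(m):
        return sum(abs(m[i][j] - consensus[i][j]) for i in range(h) for j in range(w))
    return min(range(len(pool)), key=lambda k: disagreement(pool[k]))

pool = [
    [[0.9, 0.1], [0.9, 0.1]],   # member 0
    [[0.8, 0.2], [0.8, 0.2]],   # member 1 (closest to consensus)
    [[0.1, 0.9], [0.1, 0.9]],   # member 2 (outlier)
]
print(select_member(pool))
```

High per-pixel standard deviation flags regions where the pool disagrees, which is the same signal the paper exploits to flag and avoid failed segmentations.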

[LG-87] A Density Ratio Super Learner

链接: https://arxiv.org/abs/2408.04796
作者: Wencheng Wu,David Benkeser
关键词-EN: including causal inference, statistics fields, density probability functions, great interest, causal inference
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 10 pages, 3 figures, 2 tables

点击查看摘要

Abstract:The estimation of the ratio of two probability density functions is of great interest in many statistics fields, including causal inference. In this study, we develop an ensemble estimator of density ratios with a novel loss function based on super learning. We show that this novel loss function is qualified for building super learners. Two simulations corresponding to mediation analysis and longitudinal modified treatment policy in causal inference, where density ratios are nuisance parameters, are conducted to show our density ratio super learner’s performance empirically.
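
One standard building block for density ratio estimation, on which candidate learners in such an ensemble often rely, is the classifier trick: with equal-size samples from p and q, a probabilistic classifier eta(x) = P(x came from p | x) yields p(x)/q(x) = eta(x)/(1 − eta(x)). With two known 1-D Gaussians the Bayes-optimal classifier is available in closed form, so the identity can be checked exactly:

```python
# Density-ratio-by-classification sketch: for equal priors, the Bayes
# classifier satisfies eta(x) = p(x) / (p(x) + q(x)), hence
# p(x)/q(x) = eta(x) / (1 - eta(x)). Densities here are known 1-D
# Gaussians so the identity can be verified directly.

from math import exp, sqrt, pi, isclose

def gauss(x, mu, sigma):
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

def eta(x):  # Bayes-optimal classifier, equal priors: p = N(0,1), q = N(1,1)
    p, q = gauss(x, 0.0, 1.0), gauss(x, 1.0, 1.0)
    return p / (p + q)

def ratio_from_classifier(x):
    e = eta(x)
    return e / (1 - e)

x = 0.3
direct = gauss(x, 0.0, 1.0) / gauss(x, 1.0, 1.0)
print(isclose(ratio_from_classifier(x), direct))
```

In practice eta is a fitted model rather than the Bayes classifier, and the super learner's job is to weight several such candidate estimators against a suitable loss.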

[LG-88] Segmentation of Mental Foramen in Orthopantomographs: A Deep Learning Approach

链接: https://arxiv.org/abs/2408.04763
作者: Haider Raza,Mohsin Ali,Vishal Krishna Singh,Agustin Wahjuningrum,Rachel Sarig,Akhilanand Chaurasia
关键词-EN: impacted tooth removal, Mental Foramen, Precise identification, cyst surgeries, tooth removal
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 9 pages

点击查看摘要

Abstract:Precise identification and detection of the Mental Foramen are crucial in dentistry, impacting procedures such as impacted tooth removal, cyst surgeries, and implants. Accurately identifying this anatomical feature facilitates post-surgery issues and improves patient outcomes. Moreover, this study aims to accelerate dental procedures, elevating patient care and healthcare efficiency in dentistry. This research used Deep Learning methods to accurately detect and segment the Mental Foramen from panoramic radiograph images. Two mask types, circular and square, were used during model training. Multiple segmentation models were employed to identify and segment the Mental Foramen, and their effectiveness was evaluated using diverse metrics. An in-house dataset comprising 1000 panoramic radiographs was created for this study. Our experiments demonstrated that the Classical UNet model performed exceptionally well on the test data, achieving a Dice Coefficient of 0.79 and an Intersection over Union (IoU) of 0.67. Moreover, ResUNet++ and UNet Attention models showed competitive performance, with Dice scores of 0.675 and 0.676, and IoU values of 0.683 and 0.671, respectively. We also investigated transfer learning models with varied backbone architectures, finding LinkNet to produce the best outcomes. In conclusion, our research highlights the efficacy of the classical Unet model in accurately identifying and outlining the Mental Foramen in panoramic radiographs. While vital, this task is comparatively simpler than segmenting complex medical datasets such as brain tumours or skin cancer, given their diverse sizes and shapes. This research also holds value in optimizing dental practice, benefiting practitioners and patients.

[LG-89] Learning the Simplicity of Scattering Amplitudes

链接: https://arxiv.org/abs/2408.04720
作者: Clifford Cheung,Aurélien Dersy,Matthew D. Schwartz
关键词-EN: theoretical high-energy physics, complex expressions lies, scientific progress, high-energy physics, reorganization of complex
类目: High Energy Physics - Theory (hep-th); Machine Learning (cs.LG); High Energy Physics - Phenomenology (hep-ph)
*备注: 25+15 pages, 9+6 figures

点击查看摘要

Abstract:The simplification and reorganization of complex expressions lies at the core of scientific progress, particularly in theoretical high-energy physics. This work explores the application of machine learning to a particular facet of this challenge: the task of simplifying scattering amplitudes expressed in terms of spinor-helicity variables. We demonstrate that an encoder-decoder transformer architecture achieves impressive simplification capabilities for expressions composed of handfuls of terms. Lengthier expressions are implemented in an additional embedding network, trained using contrastive learning, which isolates subexpressions that are more likely to simplify. The resulting framework is capable of reducing expressions with hundreds of terms - a regular occurrence in quantum field theory calculations - to vastly simpler equivalent expressions. Starting from lengthy input expressions, our networks can generate the Parke-Taylor formula for five-point gluon scattering, as well as new compact expressions for five-point amplitudes involving scalars and gravitons. An interactive demonstration can be found at https://spinorhelicity.streamlit.app .

信息检索

[IR-0] A Hybrid RAG System with Comprehensive Enhancement on Complex Reasoning KDD

链接: https://arxiv.org/abs/2408.05141
作者: Ye Yuan,Chengwu Liu,Jingyang Yuan,Gongbo Sun,Siqi Li,Ming Zhang
关键词-EN: external knowledge bases, framework enabling large, enabling large language, integrating external knowledge, Retrieval-augmented generation
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
*备注: Technical report for 3rd prize in Task 1 of Meta CRAG KDD Cup 2024

点击查看摘要

Abstract:Retrieval-augmented generation (RAG) is a framework enabling large language models (LLMs) to enhance their accuracy and reduce hallucinations by integrating external knowledge bases. In this paper, we introduce a hybrid RAG system enhanced through a comprehensive suite of optimizations that significantly improve retrieval quality, augment reasoning capabilities, and refine numerical computation ability. We refined the text chunks and tables in web pages, added attribute predictors to reduce hallucinations, built an LLM Knowledge Extractor and a Knowledge Graph Extractor, and finally built a reasoning strategy with all the references. We evaluated our system on the CRAG dataset through the Meta CRAG KDD Cup 2024 Competition. Both the local and online evaluations demonstrate that our system significantly enhances complex reasoning capabilities. In local evaluations, we have significantly improved accuracy and reduced error rates compared to the baseline model, achieving a notable increase in scores. Meanwhile, we have attained outstanding results in online assessments, demonstrating the performance and generalization capabilities of the proposed system. The source code for our system is released at this https URL.
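
The retrieve-then-generate loop at the core of any RAG system can be sketched as follows; token-overlap scoring stands in for a real retriever, and the corpus and prompt wording are illustrative.

```python
# Minimal RAG sketch: score passages against the query (token overlap
# standing in for a real dense or hybrid retriever), then splice the top
# hits into the generation prompt as grounding context.

def retrieve(query, corpus, k=2):
    q = set(query.lower().split())
    scored = sorted(corpus, key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def build_prompt(query, corpus):
    context = "\n".join(f"- {d}" for d in retrieve(query, corpus))
    return (f"Answer using only the context below.\n"
            f"Context:\n{context}\nQuestion: {query}\nAnswer:")

corpus = [
    "The CRAG benchmark evaluates retrieval-augmented generation systems.",
    "Knowledge graphs store entities and their relations.",
    "Bananas are rich in potassium.",
]
prompt = build_prompt("What does the CRAG benchmark evaluate?", corpus)
print("CRAG benchmark" in prompt)
```

The paper's system layers much more on top of this loop (chunk refinement, attribute predictors, knowledge-graph extraction, a reasoning strategy), but the final LLM call still consumes a context-plus-question prompt of this shape.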

[IR-1] A GNN Model with Adaptive Weights for Session-Based Recommendation Systems

链接: https://arxiv.org/abs/2408.05051
作者: Begüm Özbay,Dr. Resul Tugay,Prof. Dr. Şule Gündüz Öğüdücü
关键词-EN: users’ interests based, recommendation systems aim, Session-based recommendation systems, model users’ interests, session-based recommendation model
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注: 7 pages, 7 tables, 2 figures, and 3 equations

点击查看摘要

Abstract:Session-based recommendation systems aim to model users’ interests based on their sequential interactions to predict the next item in an ongoing session. In this work, we present a novel approach that can be used in session-based recommendations (SBRs). Our goal is to enhance the prediction accuracy of an existing session-based recommendation model, the SR-GNN model, by introducing an adaptive weighting mechanism applied to the graph neural network (GNN) vectors. This mechanism is designed to incorporate various types of side information obtained through different methods during the study. Items are assigned varying degrees of importance within each session as a result of the weighting mechanism. We hypothesize that this adaptive weighting strategy will contribute to more accurate predictions and thus improve the overall performance of SBRs in different scenarios. The adaptive weighting strategy can be utilized to address the cold start problem in SBRs by dynamically adjusting the importance of items in each session, thus providing better recommendations in cold start situations, such as for new users or newly added items. Our experimental evaluations on the Dressipi dataset demonstrate the effectiveness of the proposed approach compared to traditional models in enhancing the user experience and highlighting its potential to optimize the recommendation results in real-world applications.
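
The adaptive-weighting idea can be sketched as scoring each session item, softmax-normalizing the scores, and forming the session representation as the weighted sum; the dwell-time-style scores below are a hypothetical side-information signal, and the paper computes its weights inside a GNN rather than this way.

```python
# Sketch of adaptive item weighting for a session: softmax over per-item
# scores (here a hypothetical side-information signal such as dwell time),
# then a weighted sum of item embeddings as the session representation.

from math import exp

def softmax(scores):
    m = max(scores)
    exps = [exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def session_embedding(item_vecs, scores):
    w = softmax(scores)
    dim = len(item_vecs[0])
    return [sum(wi * v[d] for wi, v in zip(w, item_vecs)) for d in range(dim)]

items = [[1.0, 0.0], [0.0, 1.0]]            # toy item embeddings
emb = session_embedding(items, scores=[2.0, 0.0])  # first item dominates
print([round(x, 3) for x in emb])
```

Because the weights depend on per-session signals rather than being fixed, new items with little interaction history can still receive meaningful weight, which is how such schemes help with the cold-start situations the abstract mentions.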

[IR-2] MIDI-to-Tab: Guitar Tablature Inference via Masked Language Modeling

链接: https://arxiv.org/abs/2408.05024
作者: Drew Edwards,Xavier Riley,Pedro Sarmento,Simon Dixon
关键词-EN: traditional music notation, indicating precisely, enrich the structure, structure of traditional, notation by assigning
类目: Sound (cs.SD); Computation and Language (cs.CL); Information Retrieval (cs.IR)
*备注: Reviewed pre-print accepted for publication at ISMIR 2024

点击查看摘要

Abstract:Guitar tablatures enrich the structure of traditional music notation by assigning each note to a string and fret of a guitar in a particular tuning, indicating precisely where to play the note on the instrument. The problem of generating tablature from a symbolic music representation involves inferring this string and fret assignment per note across an entire composition or performance. On the guitar, multiple string-fret assignments are possible for most pitches, which leads to a large combinatorial space that prevents exhaustive search approaches. Most modern methods use constraint-based dynamic programming to minimize some cost function (e.g. hand position movement). In this work, we introduce a novel deep learning solution to symbolic guitar tablature estimation. We train an encoder-decoder Transformer model in a masked language modeling paradigm to assign notes to strings. The model is first pre-trained on DadaGP, a dataset of over 25K tablatures, and then fine-tuned on a curated set of professionally transcribed guitar performances. Given the subjective nature of assessing tablature quality, we conduct a user study amongst guitarists, wherein we ask participants to rate the playability of multiple versions of tablature for the same four-bar excerpt. The results indicate our system significantly outperforms competing algorithms.
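
The combinatorial space the abstract mentions is easy to see by enumerating the legal (string, fret) pairs for a pitch in standard tuning; the open-string MIDI values below are the standard ones, while the 24-fret limit is an assumption.

```python
# Why tablature inference is combinatorial: in standard tuning most pitches
# map to several (string, fret) pairs, and a model must pick one per note.
# Open-string MIDI pitches for standard tuning, low E to high E.

OPEN_STRINGS = {"E2": 40, "A2": 45, "D3": 50, "G3": 55, "B3": 59, "E4": 64}

def assignments(midi_pitch, max_fret=24):
    """All playable (string, fret) pairs for a MIDI pitch."""
    return [(name, midi_pitch - open_pitch)
            for name, open_pitch in OPEN_STRINGS.items()
            if 0 <= midi_pitch - open_pitch <= max_fret]

print(assignments(64))   # E4 is playable on all six strings
```

With up to six options per note, a phrase of n notes has up to 6^n candidate tablatures, which is why exhaustive search fails and sequence models or dynamic programming are used instead.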

[IR-3] Early Exit Strategies for Approximate k-NN Search in Dense Retrieval CIKM2024

Link: https://arxiv.org/abs/2408.04981
Authors: Francesco Busolin, Claudio Lucchese, Franco Maria Nardini, Salvatore Orlando, Raffaele Perego, Salvatore Trani
Keywords-EN: Learned dense representations, Learned dense, performing approximate, approximate k nearest-neighbors, popular family
Subjects: Information Retrieval (cs.IR)
Comments: 6 pages, published at CIKM 2024

Click to view abstract

Abstract:Learned dense representations are a popular family of techniques for encoding queries and documents using high-dimensional embeddings, which enable retrieval by performing approximate k nearest-neighbors search (A-kNN). A popular technique for making A-kNN search efficient is based on a two-level index, where the embeddings of documents are clustered offline and, at query processing, a fixed number N of clusters closest to the query is visited exhaustively to compute the result set. In this paper, we build upon state-of-the-art for early exit A-kNN and propose an unsupervised method based on the notion of patience, which can reach competitive effectiveness with large efficiency gains. Moreover, we discuss a cascade approach where we first identify queries that find their nearest neighbor within the closest t N clusters, and then we decide how many more to visit based on our patience approach or other state-of-the-art strategies. Reproducible experiments employing state-of-the-art dense retrieval models and publicly available resources show that our techniques improve the A-kNN efficiency with up to 5x speedups while achieving negligible effectiveness losses. All the code used is available at this https URL
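A minimal sketch of the patience idea described above, assuming an unsupervised stopping rule over the per-cluster best distances (the names and the exact stopping condition are assumptions, not the paper's implementation):

```python
def patience_early_exit(best_dist_per_cluster, patience):
    """Visit clusters in nearest-first order and stop once the running best
    distance has not improved for `patience` consecutive clusters.

    best_dist_per_cluster: the smallest query-to-document distance found in
    each successively visited cluster.
    Returns (best_distance, clusters_visited).
    """
    best = float("inf")
    stale = 0
    visited = 0
    for d in best_dist_per_cluster:
        visited += 1
        if d < best:
            best = d
            stale = 0  # an improvement resets the patience counter
        else:
            stale += 1
            if stale >= patience:
                break  # early exit: no improvement for `patience` clusters
    return best, visited
```

The speedup comes from `visited` typically being far smaller than the fixed cluster budget N, at the cost of occasionally exiting before the true nearest neighbor's cluster.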

[IR-4] Relevance Filtering for Embedding-based Retrieval CIKM2024

Link: https://arxiv.org/abs/2408.04887
Authors: Nicholas Rossi, Juexin Lin, Feng Liu, Zhen Yang, Tony Lee, Alessandro Magnani, Ciya Liao
Keywords-EN: Approximate Nearest Neighbor, Approximate Nearest, Nearest Neighbor, enables efficient retrieval, search enables efficient
Subjects: Information Retrieval (cs.IR)
Comments: 8 pages, 3 figures, CIKM 2024

Click to view abstract

Abstract:In embedding-based retrieval, Approximate Nearest Neighbor (ANN) search enables efficient retrieval of similar items from large-scale datasets. While maximizing recall of relevant items is usually the goal of retrieval systems, a low precision may lead to a poor search experience. Unlike lexical retrieval, which inherently limits the size of the retrieved set through keyword matching, dense retrieval via ANN search has no natural cutoff. Moreover, the cosine similarity scores of embedding vectors are often optimized via contrastive or ranking losses, which make them difficult to interpret. Consequently, relying on top-K or cosine-similarity cutoff is often insufficient to filter out irrelevant results effectively. This issue is prominent in product search, where the number of relevant products is often small. This paper introduces a novel relevance filtering component (called “Cosine Adapter”) for embedding-based retrieval to address this challenge. Our approach maps raw cosine similarity scores to interpretable scores using a query-dependent mapping function. We then apply a global threshold on the mapped scores to filter out irrelevant results. We are able to significantly increase the precision of the retrieved set, at the expense of a small loss of recall. The effectiveness of our approach is demonstrated through experiments on both public MS MARCO dataset and internal Walmart product search data. Furthermore, online A/B testing on the Walmart site validates the practical value of our approach in real-world e-commerce settings.
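One way the described query-dependent mapping plus global threshold could look, assuming a per-query sigmoid calibration (how the per-query parameters are estimated is not specified here, so this is only a hedged sketch):

```python
import math

def cosine_adapter(cos_scores, query_scale, query_shift, threshold=0.5):
    """Map raw cosine similarities to interpretable relevance scores with a
    query-dependent sigmoid, then filter with one global threshold.

    query_scale / query_shift: hypothetical per-query calibration parameters.
    Returns (mapped_scores, indices_kept).
    """
    mapped = [1.0 / (1.0 + math.exp(-(query_scale * s + query_shift)))
              for s in cos_scores]
    # A single global cutoff now applies across queries, because the mapped
    # scores live on a common, calibrated scale.
    kept = [i for i, p in enumerate(mapped) if p >= threshold]
    return mapped, kept
```

The point of the mapping is that a fixed threshold on raw cosine scores behaves differently per query, whereas a calibrated score makes one global cutoff meaningful.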

[IR-5] Enhancing Relevance of Embedding-based Retrieval at Walmart CIKM2024

Link: https://arxiv.org/abs/2408.04884
Authors: Juexin Lin, Sachin Yadav, Feng Liu, Nicholas Rossi, Praveen Reddy Suram, Satya Chembolu, Prijith Chandran, Hrushikesh Mohapatra, Tony Lee, Alessandro Magnani, Ciya Liao
Keywords-EN: Embedding-based neural retrieval, Embedding-based neural, customer search queries, effective search retrieval, search retrieval method
Subjects: Information Retrieval (cs.IR)
Comments: 8 pages, 3 figures, CIKM 2024

Click to view abstract

Abstract:Embedding-based neural retrieval (EBR) is an effective search retrieval method in product search for tackling the vocabulary gap between customer search queries and products. The initial launch of our EBR system at Walmart yielded significant gains in relevance and add-to-cart rates [1]. However, despite EBR generally retrieving more relevant products for reranking, we have observed numerous instances of relevance degradation. Enhancing retrieval performance is crucial, as it directly influences product reranking and affects the customer shopping experience. Factors contributing to these degradations include false positives/negatives in the training data and the inability to handle query misspellings. To address these issues, we present several approaches to further strengthen the capabilities of our EBR model in terms of retrieval relevance. We introduce a Relevance Reward Model (RRM) based on human relevance feedback. We utilize RRM to remove noise from the training data and distill it into our EBR model through a multi-objective loss. In addition, we present the techniques to increase the performance of our EBR model, such as typo-aware training, and semi-positive generation. The effectiveness of our EBR is demonstrated through offline relevance evaluation, online AB tests, and successful deployments to live production. [1] Alessandro Magnani, Feng Liu, Suthee Chaidaroon, Sachin Yadav, Praveen Reddy Suram, Ajit Puthenputhussery, Sijie Chen, Min Xie, Anirudh Kashi, Tony Lee, et al. 2022. Semantic retrieval at walmart. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 3495-3503. 
ACM classes: H.3.3 · Related DOI: https://doi.org/10.1145/3627673.3680047
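As a speculative sketch of distilling a Relevance Reward Model (RRM) into the EBR model through a multi-objective loss (the paper's actual loss is not given here; this combines a base ranking loss with an MSE distillation term under assumed names):

```python
def multi_objective_loss(rank_loss, rrm_scores, model_scores, alpha=0.5):
    """Blend a base ranking loss with a distillation term that pulls the
    EBR model's relevance scores toward the RRM's scores.

    rank_loss: scalar value of the usual contrastive/ranking objective.
    rrm_scores / model_scores: aligned per-pair relevance scores.
    alpha: hypothetical trade-off weight between the two objectives.
    """
    # Mean squared error between model scores and RRM "teacher" scores.
    distill = sum((m - r) ** 2
                  for m, r in zip(model_scores, rrm_scores)) / len(rrm_scores)
    return (1 - alpha) * rank_loss + alpha * distill
```

When the model already agrees with the RRM the distillation term vanishes and only the ranking objective remains, which is the usual behavior of such a blended loss.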

[IR-6] Hybrid Student-Teacher Large Language Model Refinement for Cancer Toxicity Symptom Extraction

Link: https://arxiv.org/abs/2408.04775
Authors: Reza Khanmohammadi, Ahmed I. Ghanem, Kyle Verdecchia, Ryan Hall, Mohamed Elshaikh, Benjamin Movsas, Hassan Bagher-Ebadian, Bing Luo, Indrin J. Chetty, Tuka Alhanai, Kundan Thind, Mohammad M. Ghassemi
Keywords-EN: Large Language Models, Large Language, offer significant potential, computational limitations, Language Models
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:

Click to view abstract

Abstract:Large Language Models (LLMs) offer significant potential for clinical symptom extraction, but their deployment in healthcare settings is constrained by privacy concerns, computational limitations, and operational costs. This study investigates the optimization of compact LLMs for cancer toxicity symptom extraction using a novel iterative refinement approach. We employ a student-teacher architecture, utilizing Zephyr-7b-beta and Phi3-mini-128 as student models and GPT-4o as the teacher, to dynamically select between prompt refinement, Retrieval-Augmented Generation (RAG), and fine-tuning strategies. Our experiments on 294 clinical notes covering 12 post-radiotherapy toxicity symptoms demonstrate the effectiveness of this approach. The RAG method proved most efficient, improving average accuracy scores from 0.32 to 0.73 for Zephyr-7b-beta and from 0.40 to 0.87 for Phi3-mini-128 during refinement. In the test set, both models showed an approximate 0.20 increase in accuracy across symptoms. Notably, this improvement was achieved at a cost 45 times lower than GPT-4o for Zephyr and 79 times lower for Phi-3. These results highlight the potential of iterative refinement techniques in enhancing the capabilities of compact LLMs for clinical applications, offering a balance between performance, cost-effectiveness, and privacy preservation in healthcare settings.

[IR-7] 3DLNews: A Three-decade Dataset of US Local News Articles CIKM2024

Link: https://arxiv.org/abs/2408.04716
Authors: Gangani Ariyarathne, Alexander C. Nwala
Keywords-EN: United States spanning, United States, spanning the period, States spanning, United
Subjects: Information Retrieval (cs.IR)
Comments: This is a technical report for a resource paper accepted at CIKM 2024

Click to view abstract

Abstract:We present 3DLNews, a novel dataset with local news articles from the United States spanning the period from 1996 to 2024. It contains almost 1 million URLs (with HTML text) from over 14,000 local newspapers, TV, and radio stations across all 50 states, and provides a broad snapshot of the US local news landscape. The dataset was collected by scraping Google and Twitter search results. We employed a multi-step filtering process to remove non-news article links and enriched the dataset with metadata such as the names and geo-coordinates of the source news media organizations, article publication dates, etc. Furthermore, we demonstrated the utility of 3DLNews by outlining four applications.

[IR-8] ACL Ready: RAG Based Assistant for the ACL Checklist

Link: https://arxiv.org/abs/2408.04675
Authors: Michael Galarnyk, Rutwik Routu, Kosha Bheda, Priyanshu Mehta, Agam Shah, Sudheer Chava
Keywords-EN: ARR Responsible NLP, Responsible NLP Research, NLP Research checklist, ARR Responsible, Responsible NLP
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:

Click to view abstract

Abstract:The ARR Responsible NLP Research checklist website states that the “checklist is designed to encourage best practices for responsible research, addressing issues of research ethics, societal impact and reproducibility.” Answering the questions is an opportunity for authors to reflect on their work and make sure any shared scientific assets follow best practices. Ideally, considering the checklist before submission can favorably impact the writing of a research paper. However, the checklist is often filled out at the last moment. In this work, we introduce ACLReady, a retrieval-augmented language model application that can be used to empower authors to reflect on their work and assist authors with the ACL checklist. To test the effectiveness of the system, we conducted a qualitative study with 13 users which shows that 92% of users found the application useful and easy to use as well as 77% of the users found that the application provided the information they expected. Our code is publicly available under the CC BY-NC 4.0 license on GitHub.

[IR-9] Forecasting Live Chat Intent from Browsing History CIKM2024

Link: https://arxiv.org/abs/2408.04668
Authors: Se-eun Yoon, Ahmad Bin Rabiah, Zaid Alibadi, Surya Kallumadi, Julian McAuley
Keywords-EN: online live chat, live chat agents, Customers reach, requesting a return, browsing history
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: CIKM 2024

Click to view abstract

Abstract:Customers reach out to online live chat agents with various intents, such as asking about product details or requesting a return. In this paper, we propose the problem of predicting user intent from browsing history and address it through a two-stage approach. The first stage classifies a user’s browsing history into high-level intent categories. Here, we represent each browsing history as a text sequence of page attributes and use the ground-truth class labels to fine-tune pretrained Transformers. The second stage provides a large language model (LLM) with the browsing history and predicted intent class to generate fine-grained intents. For automatic evaluation, we use a separate LLM to judge the similarity between generated and ground-truth intents, which closely aligns with human judgments. Our two-stage approach yields significant performance gains compared to generating intents without the classification stage.
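A hedged sketch of the second stage's input assembly, combining the page-attribute browsing sequence with the stage-1 intent class before handing it to an LLM (the prompt wording and function name are illustrative, not from the paper):

```python
def build_intent_prompt(browsing_history, predicted_class):
    """Assemble the stage-2 LLM prompt from the page-attribute sequence
    (stage-1 input) and the high-level intent class (stage-1 output).

    browsing_history: ordered list of page-attribute strings.
    predicted_class: coarse intent label from the fine-tuned classifier.
    """
    pages = " -> ".join(browsing_history)
    return (
        f"Browsing history: {pages}\n"
        f"High-level intent: {predicted_class}\n"
        "Generate the user's fine-grained intent:"
    )
```

Conditioning the generator on the coarse class is what the abstract credits for the gains over generating intents directly from the raw history.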

[IR-10] Towards Semantic Markup of Mathematical Documents via User Interaction

Link: https://arxiv.org/abs/2408.04656
Authors: Luka Vrečar, Joe Wells, Fairouz Kamareddine
Keywords-EN: Mathematical documents written, semantic markup, Mathematical documents, semantic, documents written
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: Submitted to the CICM 2024 conference, due to be published in Volume 14960 of Springer’s Lecture Notes in Computer Science

Click to view abstract

Abstract:Mathematical documents written in LaTeX often contain ambiguities. We can resolve some of them via semantic markup using, e.g., sTeX, which also has other potential benefits, such as interoperability with computer algebra systems, proof systems, and increased accessibility. However, semantic markup is more involved than “regular” typesetting and presents a challenge for authors of mathematical documents. We aim to smooth out the transition from plain LaTeX to semantic markup by developing semi-automatic tools for authors. In this paper we present an approach to semantic markup of formulas by (semi-)automatically generating grammars from existing sTeX macro definitions and parsing mathematical formulas with them. We also present a GUI-based tool for the disambiguation of parse results and showcase its functionality and potential using a grammar for parsing untyped λ-terms.

[IR-11] PLUGH: A Benchmark for Spatial Understanding and Reasoning in Large Language Models ACL2024

Link: https://arxiv.org/abs/2408.04648
Authors: Alexey Tikhonov
Keywords-EN: Large Language Models, input texts extracted, Large Language, present PLUGH, Language Models
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: Wordplay Workshop @ ACL 2024

Click to view abstract

Abstract:We present PLUGH (this https URL), a modern benchmark that currently consists of 5 tasks, each with 125 input texts extracted from 48 different games and representing 61 different (non-isomorphic) spatial graphs to assess the abilities of Large Language Models (LLMs) for spatial understanding and reasoning. Our evaluation of API-based and open-sourced LLMs shows that while some commercial LLMs exhibit strong reasoning abilities, open-sourced competitors can demonstrate almost the same level of quality; however, all models still have significant room for improvement. We identify typical reasons for LLM failures and discuss possible ways to deal with them. Datasets and evaluation code are released (this https URL).

[IR-12] Abstractive summarization from Audio Transcription

Link: https://arxiv.org/abs/2408.04639
Authors: Ilia Derkach
Keywords-EN: gaining popularity, ranging from text, answers to queries, text translation, translation to generating
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: 36 pages, Master’s thesis, 14 figures

Click to view abstract

Abstract:Currently, large language models are gaining popularity, their achievements are used in many areas, ranging from text translation to generating answers to queries. However, the main problem with these new machine learning algorithms is that training such models requires large computing resources that only large IT companies have. To avoid this problem, a number of methods (LoRA, quantization) have been proposed so that existing models can be effectively fine-tuned for specific tasks. In this paper, we propose an E2E (end to end) audio summarization model using these techniques. In addition, this paper examines the effectiveness of these approaches to the problem under consideration and draws conclusions about the applicability of these methods.

Attachment Download

Click to download today’s full paper list