本篇博文主要展示每日从Arxiv论文网站获取的最新论文列表,每天早上11:30点定时自动更新,主要按照NLP、CV、ML、AI、IR五个大方向区分,若需要邮件定时接收,请在评论区留下你的邮箱号。

说明:每日论文数据从arxiv网站获取,每天早上11:30左右定时自动更新。

友情提示: 如何您需要邮箱接收每日论文数据,请在评论处留下你的邮箱,同样每天11:30左右邮件定时自动发送。

目录

概览 (2024-07-08)

今日共更新453篇论文,其中:

  • 自然语言处理87篇(Computation and Language (cs.CL))
  • 计算机视觉115篇(Computer Vision and Pattern Recognition (cs.CV))
  • 人工智能129篇(Artificial Intelligence (cs.AI))
  • 机器学习127篇(Machine Learning (cs.LG))

自然语言处理

[NLP-0] Planetarium: A Rigorous Benchmark for Translating Text to Structured Planning Languages
[NLP-0] 天文馆:将文本翻译为结构化规划语言的严格基准

链接: https://arxiv.org/abs/2407.03321
作者: Max Zuo,Francisco Piedrahita Velez,Xiaochen Li,Michael L. Littman,Stephen H. Bach
关键词: PDDL code, PDDL, generated PDDL code, language, natural language descriptions
中文关键词: PDDL代码、PDDL、生成的PDDL代码、语言、自然语言描述
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Many recent works have explored using language models for planning problems. One line of research focuses on translating natural language descriptions of planning tasks into structured planning languages, such as the planning domain definition language (PDDL). While this approach is promising, accurately measuring the quality of generated PDDL code continues to pose significant challenges. First, generated PDDL code is typically evaluated using planning validators that check whether the problem can be solved with a planner. This method is insufficient because a language model might generate valid PDDL code that does not align with the natural language description of the task. Second, existing evaluation sets often have natural language descriptions of the planning task that closely resemble the ground truth PDDL, reducing the challenge of the task. To bridge this gap, we introduce \benchmarkName, a benchmark designed to evaluate language models’ ability to generate PDDL code from natural language descriptions of planning tasks. We begin by creating a PDDL equivalence algorithm that rigorously evaluates the correctness of PDDL code generated by language models by flexibly comparing it against a ground truth PDDL. Then, we present a dataset of 132,037 text-to-PDDL pairs across 13 different tasks, with varying levels of difficulty. Finally, we evaluate several API-access and open-weight language models that reveal this task’s complexity. For example, 87.6% of the PDDL problem descriptions generated by GPT-4o are syntactically parseable, 82.2% are valid, solve-able problems, but only 35.1% are semantically correct, highlighting the need for a more rigorous benchmark for this problem.
摘要:最近的许多著作都探索了使用语言模型来解决规划问题。一种研究集中于将规划任务的自然语言描述翻译成结构化规划语言,例如规划领域定义语言(PDDL)。虽然这种方法很有希望,但准确地测量生成的PDDL代码的质量仍然是一个巨大的挑战。首先,生成的PDDL代码通常使用规划验证器进行评估,该验证器检查问题是否可以通过规划器解决。此方法是不够的,因为语言模型可能会生成与任务的自然语言描述不一致的有效PDDL代码。其次,现有的评估集通常具有与基本事实PDDL非常相似的对计划任务的自然语言描述,从而减少了任务的挑战。为了弥补这一差距,我们引入了\Benchmark MarkName,这是一个旨在评估语言模型从计划任务的自然语言描述生成PDDL代码的能力的基准。我们首先创建PDDL等价算法,该算法通过灵活地将语言模型生成的PDDL代码与基本事实PDDL进行比较来严格评估PDDL代码的正确性。然后,我们提供了一个包含132,037个文本到PDDL对的数据集,涉及13个不同的任务,难度各不相同。最后,我们评估了几个API访问和开放权重语言模型,它们揭示了这项任务的复杂性。例如,GPT-4o生成的PDDL问题描述中有87.6%在语法上是可解析的,82.2%是有效的、可解决的问题,但只有35.1%在语义上是正确的,这突显了对这个问题需要更严格的基准。

[NLP-1] InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output
[NLP-1] InternLM-XComposer-2.5:支持长上下文输入和输出的多功能大视觉语言模型

链接: https://arxiv.org/abs/2407.03320
作者: Pan Zhang,Xiaoyi Dong,Yuhang Zang,Yuhang Cao,Rui Qian,Lin Chen,Qipeng Guo,Haodong Duan,Bin Wang,Linke Ouyang,Songyang Zhang,Wenwei Zhang,Yining Li,Yang Gao,Peng Sun,Xinyue Zhang,Wei Li,Jingwen Li,Wenhai Wang,Hang Yan,Conghui He,Xingcheng Zhang,Kai Chen,Jifeng Dai,Yu Qiao,Dahua Lin,Jiaqi Wang
关键词: versatile large-vision language, supports long-contextual input, large-vision language model, versatile large-vision, large-vision language
中文关键词: 通用大视野语言,支持长上下文输入,大视野语言模型,通用大视野,大视野语言
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: Technical Report. this https URL

点击查看摘要

Abstract:We present InternLM-XComposer-2.5 (IXC-2.5), a versatile large-vision language model that supports long-contextual input and output. IXC-2.5 excels in various text-image comprehension and composition applications, achieving GPT-4V level capabilities with merely 7B LLM backend. Trained with 24K interleaved image-text contexts, it can seamlessly extend to 96K long contexts via RoPE extrapolation. This long-context capability allows IXC-2.5 to excel in tasks requiring extensive input and output contexts. Compared to its previous 2.0 version, InternLM-XComposer-2.5 features three major upgrades in vision-language comprehension: (1) Ultra-High Resolution Understanding, (2) Fine-Grained Video Understanding, and (3) Multi-Turn Multi-Image Dialogue. In addition to comprehension, IXC-2.5 extends to two compelling applications using extra LoRA parameters for text-image composition: (1) Crafting Webpages and (2) Composing High-Quality Text-Image Articles. IXC-2.5 has been evaluated on 28 benchmarks, outperforming existing open-source state-of-the-art models on 16 benchmarks. It also surpasses or competes closely with GPT-4V and Gemini Pro on 16 key tasks. The InternLM-XComposer-2.5 is publicly available at this https URL.
摘要:我们提出了InternLM-XComposer-2.5(IXC-2.5),这是一个支持长上下文输入和输出的通用大视觉语言模型。IXC-2.5在各种文本图像理解和合成应用中表现出色,仅需7B LLM后端即可实现GPT-4V级别的功能。训练了24K交错的图文上下文,它可以通过绳索外推无缝扩展到96K长的上下文。这种长上下文功能使IXC-2.5能够在需要大量输入和输出上下文的任务中表现出色。与之前的2.0版本相比,InternLM-XComposer-2.5在视觉语言理解方面有三个主要升级:(1)超高分辨率理解,(2)细粒度视频理解,(3)多轮多图像对话。除了理解之外,IXC-2.5还扩展到使用额外的Lora参数进行文本-图像合成的两个引人注目的应用程序:(1)制作网页和(2)撰写高质量的文本-图像文章。IXC-2.5已经在28个基准上进行了评估,在16个基准上超过了现有的开源最先进的模型。它还在16项关键任务上超过或接近GPT-4V和Gemini Pro。InternLM-XComposer-2.5在此HTTPS URL上公开提供。

[NLP-2] BACON: Supercharge Your VLM with Bag-of-Concept Graph to Mitigate Hallucinations
[NLP-2] BACON:使用概念袋图表增强您的VLM以缓解幻觉

链接: https://arxiv.org/abs/2407.03314
作者: Zhantao Yang,Ruili Feng,Keyu Yan,Huangji Wang,Zhicai Wang,Shangwen Zhu,Han Zhang,Jie Xiao,Pingyu Wu,Kai Zhu,Jixuan Chen,Chen-Wei Xie,Chaojie Mao,Yue Yang,Hongyang Zhang,Yu Liu,Fan Cheng
关键词: Vision Language Models, Vision Language, limited linguistic abilities, Language Models, visual question answering
中文关键词: 视觉语言模型,视觉语言,语言能力有限,语言模型,视觉问答
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Databases (cs.DB)
备注:

点击查看摘要

Abstract:This paper presents Bag-of-Concept Graph (BACON) to gift models with limited linguistic abilities to taste the privilege of Vision Language Models (VLMs) and boost downstream tasks such as detection, visual question answering (VQA), and image generation. Since the visual scenes in physical worlds are structured with complex relations between objects, BACON breaks down annotations into basic minimum elements and presents them in a graph structure. Element-wise style enables easy understanding, and structural composition liberates difficult locating. Careful prompt design births the BACON captions with the help of public-available VLMs and segmentation methods. In this way, we gather a dataset with 100K annotated images, which endow VLMs with remarkable capabilities, such as accurately generating BACON, transforming prompts into BACON format, envisioning scenarios in the style of BACONr, and dynamically modifying elements within BACON through interactive dialogue and more. Wide representative experiments, including detection, VQA, and image generation tasks, tell BACON as a lifeline to achieve previous out-of-reach tasks or excel in their current cutting-edge solutions.
摘要:本文将BAG-of-Concept Graph(BACON)用于语言能力有限的礼物模型,以体验视觉语言模型(VLMS)的特权,并促进后续任务,如检测、视觉问答和图像生成。由于现实世界中的视觉场景是由对象之间的复杂关系构成的,因此培根将注释分解为基本的最小元素,并以图形结构呈现它们。元素风格使人容易理解,结构组成解放了难以定位的空间。精心的即时设计在公共可用的VLM和分割方法的帮助下产生了培根字幕。通过这种方式,我们收集了一个包含100K注释图像的数据集,这赋予了VLM非凡的功能,例如准确地生成BACON、将提示转换为BACON格式、以BACONR风格设想场景,以及通过交互对话动态修改BACON中的元素等。广泛的代表性实验,包括检测、VQA和图像生成任务,告诉培根作为生命线来实现以前无法实现的任务,或者在当前的尖端解决方案中脱颖而出。

[NLP-3] A Review of the Applications of Deep Learning-Based Emergent Communication
[NLP-3] 基于深度学习的紧急通信应用回顾

链接: https://arxiv.org/abs/2407.03302
作者: Brendon Boldt,David Mortensen
关键词: deep multi-agent reinforcement, reinforcement learning environments, multi-agent reinforcement learning, human language-like communication, language-like communication systems
中文关键词: 深度多智能体强化、强化学习环境、多智能体强化学习、类人类语言通信、类语言通信系统
类目: Computation and Language (cs.CL)
备注: 49 pages, 15 figures

点击查看摘要

Abstract:Emergent communication, or emergent language, is the field of research which studies how human language-like communication systems emerge de novo in deep multi-agent reinforcement learning environments. The possibilities of replicating the emergence of a complex behavior like language have strong intuitive appeal, yet it is necessary to complement this with clear notions of how such research can be applicable to other fields of science, technology, and engineering. This paper comprehensively reviews the applications of emergent communication research across machine learning, natural language processing, linguistics, and cognitive science. Each application is illustrated with a description of its scope, an explication of emergent communication’s unique role in addressing it, a summary of the extant literature working towards the application, and brief recommendations for near-term research directions.
摘要:紧急通信或紧急语言是研究类人类语言的通信系统如何在深度多智能体强化学习环境中重新出现的研究领域。复制语言等复杂行为出现的可能性具有强大的直觉吸引力,但有必要用此类研究如何适用于其他科学、技术和工程领域的明确概念来补充这一点。本文全面回顾了新兴传播研究在机器学习、自然语言处理、语言学和认知科学领域的应用。每个应用程序都对其范围进行了说明,解释了紧急沟通在解决这个问题方面的独特作用,现有文献的总结,以及对近期研究方向的简要建议。

[NLP-4] LLM Internal States Reveal Hallucination Risk Faced With a Query
[NLP-4] LLM内部状态揭示面临质疑的幻觉风险

链接: https://arxiv.org/abs/2407.03282
作者: Ziwei Ji,Delong Chen,Etsuko Ishii,Samuel Cahyawijaya,Yejin Bang,Bryan Wilie,Pascale Fung
关键词: Large Language Models, problem of Large, Language Models, Large Language, Natural Language Generation
中文关键词: 大型语言模型、大型问题、语言模型、大型语言、自然语言生成
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The hallucination problem of Large Language Models (LLMs) significantly limits their reliability and trustworthiness. Humans have a self-awareness process that allows us to recognize what we don’t know when faced with queries. Inspired by this, our paper investigates whether LLMs can estimate their own hallucination risk before response generation. We analyze the internal mechanisms of LLMs broadly both in terms of training data sources and across 15 diverse Natural Language Generation (NLG) tasks, spanning over 700 datasets. Our empirical analysis reveals two key insights: (1) LLM internal states indicate whether they have seen the query in training data or not; and (2) LLM internal states show they are likely to hallucinate or not regarding the query. Our study explores particular neurons, activation layers, and tokens that play a crucial role in the LLM perception of uncertainty and hallucination risk. By a probing estimator, we leverage LLM self-assessment, achieving an average hallucination estimation accuracy of 84.32% at run time.
摘要:大型语言模型的幻觉问题严重限制了它们的可靠性和可信性。人类有一个自我意识的过程,当我们面对问题时,它允许我们识别我们不知道的东西。受此启发,我们的论文考察了低收入者是否能够在反应产生之前估计自己的幻觉风险。我们从训练数据源和跨越700多个数据集的15个不同的自然语言生成(NLG)任务方面广泛分析了LLMS的内部机制。我们的实证分析揭示了两个关键的见解:(1)LLM内部状态表明他们是否在训练数据中看到了查询;(2)LLM内部状态表明他们可能对查询产生幻觉。我们的研究探索了特定的神经元、激活层和标记,它们在LLM对不确定性和幻觉风险的感知中起着至关重要的作用。通过探测估计器,我们利用LLM自我评估,在运行时实现了84.32%的平均幻觉估计精度。

[NLP-5] Evaluating Automatic Metrics with Incremental Machine Translation Systems
[NLP-5] 使用增量式机器翻译系统评估自动收件箱

链接: https://arxiv.org/abs/2407.03277
作者: Guojun Wu,Shay B. Cohen,Rico Sennrich
关键词: comprising commercial machine, gathered weekly, dataset comprising commercial, commercial machine translations, translation directions
中文关键词: 包括商业机器,每周收集,包括商业、商业机器翻译、翻译方向的数据集
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We introduce a dataset comprising commercial machine translations, gathered weekly over six years across 12 translation directions. Since human A/B testing is commonly used, we assume commercial systems improve over time, which enables us to evaluate machine translation (MT) metrics based on their preference for more recent translations. Our study confirms several previous findings in MT metrics research and demonstrates the dataset’s value as a testbed for metric evaluation. We release our code at this https URL
摘要:我们引入了一个由商业机器翻译组成的数据集,在六年内每周收集12个翻译方向。由于人工A/B测试很常见,我们假设商业系统会随着时间的推移而改进,这使我们能够根据他们对最近翻译的偏好来评估机器翻译(MT)指标。我们的研究证实了MT指标研究之前的几项发现,并证明了该数据集作为指标评估测试平台的价值。我们在此https URL上发布我们的代码

[NLP-6] How Similar Are Elected Politicians and Their Constituents? Quantitative Evidence From Online Social Networks
[NLP-6] 民选政治家及其选民有多相似?来自在线社交网络的定量证据

链接: https://arxiv.org/abs/2407.03255
作者: Waleed Iqbal,Gareth Tyson,Ignacio Castro
关键词: elected political representatives, elected politicians, similar, elected politicians tend, USA Representatives
中文关键词: 民选政治代表,民选政治家,类似的,民选政治家倾向,美国代表
类目: ocial and Information Networks (cs.SI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:How similar are politicians to those who vote for them? This is a critical question at the heart of democratic representation and particularly relevant at times when political dissatisfaction and populism are on the rise. To answer this question we compare the online discourse of elected politicians and their constituents. We collect a two and a half years (September 2020 - February 2023) constituency-level dataset for USA and UK that includes: (i) the Twitter timelines (5.6 Million tweets) of elected political representatives (595 UK Members of Parliament and 433 USA Representatives), (ii) the Nextdoor posts (21.8 Million posts) of the constituency (98.4% USA and 91.5% UK constituencies). We find that elected politicians tend to be equally similar to their constituents in terms of content and style regardless of whether a constituency elects a right or left-wing politician. The size of the electoral victory and the level of income of a constituency shows a nuanced picture. The narrower the electoral victory, the more similar the style and the more dissimilar the content is. The lower the income of a constituency, the more similar the content is. In terms of style, poorer constituencies tend to have a more similar sentiment and more dissimilar psychological text traits (i.e. measured with LIWC categories).
摘要:政客和那些投票给他们的人有多相似?这是民主代表制核心的一个关键问题,在政治不满和民粹主义抬头的时候尤其重要。为了回答这个问题,我们比较了当选政客及其选民的在线话语。我们收集了美国和英国两年半(2020年9月至2023年2月)选区级别的数据集,其中包括:(I)当选政治代表(595名英国国会议员和433名美国众议员)的推特时间表(560万条推文),(Ii)选区的邻门帖子(2180万条帖子)(98.4%的美国选区和91.5%的英国选区)。我们发现,无论一个选区是选举右翼还是左翼政客,民选政客在内容和风格上都倾向于与选民一样相似。选举胜利的大小和选区的收入水平显示了一幅微妙的图景。选举胜利越小,风格就越相似,内容就越不相似。一个选区的收入越低,内容就越相似。在风格方面,较贫穷的选民往往有更相似的情感和更不同的心理文本特征(即用LIWC类别衡量)。

[NLP-7] STF: Sentence Transformer Fine-Tuning For Topic Categorization With Limited Data
[NLP-7] STF:句子Transformer微调,用于有限数据的主题分类

链接: https://arxiv.org/abs/2407.03253
作者: Kheir Eddine Daouadi,Yaakoub Boualleg,Oussama Guehairia
关键词: considerable research attention, attracts considerable research, tweets attracts considerable, Sentence Transformers, attracts considerable
中文关键词: 相当多的研究关注,吸引了相当多的研究,推文吸引了相当多的,句子变形金刚,吸引了相当多的
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Nowadays, topic classification from tweets attracts considerable research attention. Different classification systems have been suggested thanks to these research efforts. Nevertheless, they face major challenges owing to low performance metrics due to the limited amount of labeled data. We propose Sentence Transformers Fine-tuning (STF), a topic detection system that leverages pretrained Sentence Transformers models and fine-tuning to classify topics from tweets accurately. Moreover, extensive parameter sensitivity analyses were conducted to finetune STF parameters for our topic classification task to achieve the best performance results. Experiments on two benchmark datasets demonstrated that (1) the proposed STF can be effectively used for classifying tweet topics and outperforms the latest state-of-the-art approaches, and (2) the proposed STF does not require a huge amount of labeled tweets to achieve good accuracy, which is a limitation of many state-of-the-art approaches. Our main contribution is the achievement of promising results in tweet topic classification by applying pretrained sentence transformers language models.
摘要:当前,推文主题分类研究受到了广泛的关注。由于这些研究努力,已经提出了不同的分类系统。然而,由于标签数据的数量有限,性能指标较低,它们面临着重大挑战。我们提出了句子转换器精调(STF),这是一个话题检测系统,利用预先训练的句子转换器模型和微调来准确地从推文中分类主题。此外,还进行了广泛的参数敏感度分析,以微调主题分类任务的STF参数,以获得最佳的性能结果。在两个基准数据集上的实验表明:(1)提出的STF能够有效地用于推文主题的分类,并优于最新的最新方法;(2)提出的STF不需要大量的标签推文来达到良好的准确率,这是许多最先进方法的局限性。我们的主要贡献是通过应用预先训练的句子转换器语言模型在推文主题分类方面取得了良好的结果。

[NLP-8] CATT: Character-based Arabic Tashkeel Transformer
[NLP-8] CATT:基于直升机的阿拉伯语Tashkeel Transformer

链接: https://arxiv.org/abs/2407.03236
作者: Faris Alasmary,Orjuwan Zaafarani,Ahmad Ghannam
关键词: Arabic Text Diacritization, Arabic Text, Arabic text processing, Text Diacritization, improving Arabic text
中文关键词: 阿拉伯文本变音化,阿拉伯文本,阿拉伯文本处理,文本变音化,改进阿拉伯文本
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Tashkeel, or Arabic Text Diacritization (ATD), greatly enhances the comprehension of Arabic text by removing ambiguity and minimizing the risk of misinterpretations caused by its absence. It plays a crucial role in improving Arabic text processing, particularly in applications such as text-to-speech and machine translation. This paper introduces a new approach to training ATD models. First, we finetuned two transformers, encoder-only and encoder-decoder, that were initialized from a pretrained character-based BERT. Then, we applied the Noisy-Student approach to boost the performance of the best model. We evaluated our models alongside 11 commercial and open-source models using two manually labeled benchmark datasets: WikiNews and our CATT dataset. Our findings show that our top model surpasses all evaluated models by relative Diacritic Error Rates (DERs) of 30.83% and 35.21% on WikiNews and CATT, respectively, achieving state-of-the-art in ATD. In addition, we show that our model outperforms GPT-4-turbo on CATT dataset by a relative DER of 9.36%. We open-source our CATT models and benchmark dataset for the research community\footnotethis https URL.
摘要:Tashkeel,或阿拉伯文本分音化(ATD),通过消除歧义并最大限度地减少因缺乏歧义而导致的误解风险,大大增强了对阿拉伯文本的理解。它在改进阿拉伯语文本处理方面发挥着至关重要的作用,特别是在文本到语音和机器翻译等应用中。本文介绍了一种训练ATD模型的新方法。首先,我们优化了两个转换器,仅编码器和编解码器,这两个转换器是从预先训练的基于字符的BERT初始化的。然后,我们应用噪声-学生方法来提高最优模型的性能。我们使用两个手动标记的基准数据集:Wikinews和我们的CATT数据集,与11个商业和开源模型一起评估了我们的模型。我们的结果表明,在维基新闻和CATT上,我们的顶级模型比所有评估模型的相对发音错误率(DER)分别高出30.83%和35.21%,达到了ATD的最高水平。此外,我们还表明,在CATT数据集上,我们的模型比GPT-4-TURBO的性能高出9.36%。我们为研究社区开放了我们的CATT模型和基准数据集\Footnote This HTTPS URL。

[NLP-9] Self-Evaluation as a Defense Against Adversarial Attacks on LLMs
[NLP-9] 自我评估作为对LLM的对抗攻击的防御

链接: https://arxiv.org/abs/2407.03234
作者: Hannah Brown,Leon Lin,Kenji Kawaguchi,Michael Shieh
关键词: human-facing settings, deployed in sensitive, LLMs are deployed, answer unsafe, answer unsafe prompts
中文关键词: 面向人类的设置,部署在敏感环境中,部署了LLM,回答不安全的问题,回答不安全的提示
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注: 8 pages, 7 figures

点击查看摘要

Abstract:When LLMs are deployed in sensitive, human-facing settings, it is crucial that they do not output unsafe, biased, or privacy-violating outputs. For this reason, models are both trained and instructed to refuse to answer unsafe prompts such as “Tell me how to build a bomb.” We find that, despite these safeguards, it is possible to break model defenses simply by appending a space to the end of a model’s input. In a study of eight open-source models, we demonstrate that this acts as a strong enough attack to cause the majority of models to generate harmful outputs with very high success rates. We examine the causes of this behavior, finding that the contexts in which single spaces occur in tokenized training data encourage models to generate lists when prompted, overriding training signals to refuse to answer unsafe requests. Our findings underscore the fragile state of current model alignment and promote the importance of developing more robust alignment methods. Code and data will be made available at this https URL.
摘要:当LLM部署在敏感的、面向人的环境中时,至关重要的是它们不会输出不安全、有偏见或违反隐私的输出。出于这个原因,模特们既接受了培训,又被指示拒绝回答不安全的提示,比如“告诉我如何制造炸弹。”我们发现,尽管有这些保障措施,但只需在模型输入的末尾添加一个空格,就可以打破模型的防御。在对八个开源模型的研究中,我们证明了这是一种足够强大的攻击,足以导致大多数模型产生非常高的成功率的有害输出。我们研究了这种行为的原因,发现在标记化的训练数据中出现单个空格的上下文鼓励模型在得到提示时生成列表,从而覆盖拒绝回答不安全请求的训练信号。我们的发现强调了当前模型比对的脆弱状态,并促进了开发更稳健的比对方法的重要性。代码和数据将在此HTTPS URL上提供。

[NLP-10] Single Character Perturbations Break LLM Alignment
[NLP-10] 单一字符扰动打破了LLM一致性

链接: https://arxiv.org/abs/2407.03232
作者: Leon Lin,Hannah Brown,Kenji Kawaguchi,Michael Shieh
关键词: human-facing settings, deployed in sensitive, LLMs are deployed, answer unsafe, answer unsafe prompts
中文关键词: 面向人类的设置,部署在敏感环境中,部署了LLM,回答不安全的问题,回答不安全的提示
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 8 pages, 6 figures

点击查看摘要

Abstract:When LLMs are deployed in sensitive, human-facing settings, it is crucial that they do not output unsafe, biased, or privacy-violating outputs. For this reason, models are both trained and instructed to refuse to answer unsafe prompts such as “Tell me how to build a bomb.” We find that, despite these safeguards, it is possible to break model defenses simply by appending a space to the end of a model’s input. In a study of eight open-source models, we demonstrate that this acts as a strong enough attack to cause the majority of models to generate harmful outputs with very high success rates. We examine the causes of this behavior, finding that the contexts in which single spaces occur in tokenized training data encourage models to generate lists when prompted, overriding training signals to refuse to answer unsafe requests. Our findings underscore the fragile state of current model alignment and promote the importance of developing more robust alignment methods. Code and data will be available at this https URL.
摘要:当LLM部署在敏感的、面向人的环境中时,至关重要的是它们不会输出不安全、有偏见或违反隐私的输出。出于这个原因,模特们既接受了培训,又被指示拒绝回答不安全的提示,比如“告诉我如何制造炸弹。”我们发现,尽管有这些保障措施,但只需在模型输入的末尾添加一个空格,就可以打破模型的防御。在对八个开源模型的研究中,我们证明了这是一种足够强大的攻击,足以导致大多数模型产生非常高的成功率的有害输出。我们研究了这种行为的原因,发现在标记化的训练数据中出现单个空格的上下文鼓励模型在得到提示时生成列表,从而覆盖拒绝回答不安全请求的训练信号。我们的发现强调了当前模型比对的脆弱状态,并促进了开发更稳健的比对方法的重要性。代码和数据将在此HTTPS URL上提供。

[NLP-11] Improving Retrieval-augmented Text-to-SQL with AST-based Ranking and Schema Pruning
[NLP-11] 使用基于AST的排名和模式修剪改进检索增强文本到SQL

链接: https://arxiv.org/abs/2407.03227
作者: Zhili Shen,Pavlos Vougiouklis,Chenxin Diao,Kaustubh Vyas,Yuanyi Ji,Jeff Z. Pan
关键词: Large Language Models, perspective of Large, Large Language, abstract syntax trees, Large
中文关键词: 大型语言模型,大型观点,大型语言,抽象语法树,大型
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB)
备注:

点击查看摘要

Abstract:We focus on Text-to-SQL semantic parsing from the perspective of Large Language Models. Motivated by challenges related to the size of commercial database schemata and the deployability of business intelligence solutions, we propose an approach that dynamically retrieves input database information and uses abstract syntax trees to select few-shot examples for in-context learning. Furthermore, we investigate the extent to which an in-parallel semantic parser can be leveraged for generating \textitapproximated versions of the expected SQL queries, to support our retrieval. We take this approach to the extreme–we adapt a model consisting of less than 500 M parameters, to act as an extremely efficient approximator, enhancing it with the ability to process schemata in a parallelised manner. We apply our approach to monolingual and cross-lingual benchmarks for semantic parsing, showing improvements over state-of-the-art baselines. Comprehensive experiments highlight the contribution of modules involved in this retrieval-augmented generation setting, revealing interesting directions for future work. Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB) Cite as: arXiv:2407.03227 [cs.CL] (or arXiv:2407.03227v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2407.03227 Focus to learn more arXiv-issued DOI via DataCite
摘要:我们从大型语言模型的角度来研究Text-to-SQL语义分析。考虑到商业数据库模式的规模和商业智能解决方案的可部署性带来的挑战,我们提出了一种动态检索输入数据库信息并使用抽象语法树选择少量实例进行情景学习的方法。此外,我们还研究了可以在多大程度上利用并行语义解析器来生成预期的SQL查询的近似版本,以支持我们的检索。我们将这种方法发挥到了极致–我们采用了一个由少于500M参数组成的模型,作为一个非常有效的近似器,通过以并行方式处理模式的能力来增强它。我们将我们的方法应用于语义分析的单语言和跨语言基准测试,显示出与最先进的基线相比的改进。综合实验突出了在这种提取增强的生成环境中涉及的模块的贡献,揭示了未来工作的有趣方向。主题:计算与语言(cs.CL);人工智能(cs.AI);数据库(cs.DB)引用为:arxiv:2407.03227cs.CLhttps://doi.org/10.48550/arXiv.2407.03227 Focus通过DataCite了解更多arxiv发布的文档

[NLP-12] How Does Quantization Affect Multilingual LLMs?
[NLP-12] 量化如何影响多语言LLM?

链接: https://arxiv.org/abs/2407.03211
作者: Kelly Marchisio,Saurabh Dash,Hongyu Chen,Dennis Aumiller,Ahmet Üstün,Sara Hooker,Sebastian Ruder
关键词: improve inference speed, techniques are widely, improve inference, inference speed, speed and deployment
中文关键词: 提高推理速度,技术广泛,提高推理、推理速度、速度和部署
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Quantization techniques are widely used to improve inference speed and deployment of large language models. While a wide body of work examines the impact of quantized LLMs on English tasks, none have examined the effect of quantization across languages. We conduct a thorough analysis of quantized multilingual LLMs, focusing on their performance across languages and at varying scales. We use automatic benchmarks, LLM-as-a-Judge methods, and human evaluation, finding that (1) harmful effects of quantization are apparent in human evaluation, and automatic metrics severely underestimate the detriment: a 1.7% average drop in Japanese across automatic tasks corresponds to a 16.0% drop reported by human evaluators on realistic prompts; (2) languages are disparately affected by quantization, with non-Latin script languages impacted worst; and (3) challenging tasks such as mathematical reasoning degrade fastest. As the ability to serve low-compute models is critical for wide global adoption of NLP technologies, our results urge consideration of multilingual performance as a key evaluation criterion for efficient models.
摘要:量化技术被广泛应用于提高推理速度和部署大型语言模型。虽然大量的工作研究了量化的LLM对英语任务的影响,但还没有人研究量化在不同语言之间的影响。我们对量化的多语言最小二乘模型进行了深入的分析,重点分析了它们在不同语言和不同尺度上的性能。我们使用自动基准、LLM-as-a-Court方法和人工评估,发现(1)量化的有害影响在人类评估中是显而易见的,而自动度量严重低估了危害:在自动任务中,日语平均下降1.7%,而人工评估者在现实提示中报告的下降了16.0%;(2)语言受量化的影响不同,非拉丁文字语言受到的影响最大;以及(3)数学推理等具有挑战性的任务退化最快。由于为低计算模型提供服务的能力对于全球广泛采用NLP技术至关重要,我们的结果敦促考虑将多语言性能作为有效模型的关键评估标准。

[NLP-13] CiteAssist: A System for Automated Preprint Citation and BibTeX Generation
[NLP-13] CiteAID:自动预印本引用和BibTeX生成系统

链接: https://arxiv.org/abs/2407.03192
作者: Lars Benedikt Kaesberg,Terry Ruas,Jan Philip Wahle,Bela Gipp
关键词: streamlining the process, automate the generation, process of bibliographic, bibliographic annotation, BibTeX entries
中文关键词: 简化流程,自动化书目、书目注释、BibTeX条目的生成和流程
类目: Digital Libraries (cs.DL); Computation and Language (cs.CL)
备注: Published at SDProc @ ACL 2024

点击查看摘要

Abstract:We present CiteAssist, a system to automate the generation of BibTeX entries for preprints, streamlining the process of bibliographic annotation. Our system extracts metadata, such as author names, titles, publication dates, and keywords, to create standardized annotations within the document. CiteAssist automatically attaches the BibTeX citation to the end of a PDF and links it on the first page of the document so other researchers gain immediate access to the correct citation of the article. This method promotes platform flexibility by ensuring that annotations remain accessible regardless of the repository used to publish or access the preprint. The annotations remain available even if the preprint is viewed externally to CiteAssist. Additionally, the system adds relevant related papers based on extracted keywords to the preprint, providing researchers with additional publications besides those in related work for further reading. Researchers can enhance their preprints organization and reference management workflows through a free and publicly available web interface.
摘要:我们介绍了CiteAsset,这是一个自动生成BibTeX条目以用于预印本的系统,简化了书目注释的过程。我们的系统提取元数据,如作者姓名、标题、出版日期和关键字,以在文档中创建标准化批注。CiteAsset自动将BibTeX引文附加到PDF的末尾,并将其链接到文档的第一页,这样其他研究人员就可以立即访问文章的正确引文。这种方法通过确保注释保持可访问,而不考虑用于发布或访问预印本的存储库,从而提高了平台的灵活性。即使在CiteAsset外部查看预印本,注释仍可用。此外,该系统还根据提取的关键词将相关论文添加到预印本中,为研究人员提供除相关工作出版物外的其他出版物供进一步阅读。研究人员可以通过免费和公开的网络界面增强其预印本组织和参考资料管理工作流程。

[NLP-14] Cutting through the noise to motivate people: A comprehensive analysis of COVID-19 social media posts de/motivating vaccination
[NLP-14] 消除噪音以激励人们:对COVID-19社交媒体上鼓励/激励疫苗接种的帖子的全面分析

链接: https://arxiv.org/abs/2407.03190
作者: Ashiqur Rahman,Ehsan Mohammadi,Hamed Alhoori
关键词: healthcare information system, pandemic exposed significant, exposed significant weaknesses, pandemic exposed, information system
中文关键词: 医疗保健信息系统,大流行暴露重大,暴露重大弱点,大流行暴露,信息系统
类目: Computers and Society (cs.CY); Computation and Language (cs.CL); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
备注: 51 pages, 13 figures, 12 tables. Accepted at Natural Language Processing Journal

点击查看摘要

Abstract:The COVID-19 pandemic exposed significant weaknesses in the healthcare information system. The overwhelming volume of misinformation on social media and other socioeconomic factors created extraordinary challenges to motivate people to take proper precautions and get vaccinated. In this context, our work explored a novel direction by analyzing an extensive dataset collected over two years, identifying the topics de/motivating the public about COVID-19 vaccination. We analyzed these topics based on time, geographic location, and political orientation. We noticed that while the motivating topics remain the same over time and geographic location, the demotivating topics rapidly. We also identified that intrinsic motivation, rather than external mandate, is more advantageous to inspire the public. This study addresses scientific communication and public motivation in social media. It can help public health officials, policymakers, and social media platforms develop more effective messaging strategies to cut through the noise of misinformation and educate the public about scientific findings.
摘要:新冠肺炎疫情暴露了医疗信息系统的重大弱点。社交媒体上铺天盖地的错误信息和其他社会经济因素给激励人们采取适当预防措施和接种疫苗带来了非同寻常的挑战。在这种背景下,我们的工作探索了一个新的方向,通过分析两年来收集的大量数据集,确定了降低/激励公众对新冠肺炎疫苗接种的主题。我们根据时间、地理位置和政治取向分析了这些话题。我们注意到,虽然激励性话题随着时间和地理位置的变化保持不变,但激励性话题很快就会消失。我们还发现,内部动机比外部授权更有利于激励公众。这项研究探讨了社交媒体中的科学传播和公众动机。它可以帮助公共卫生官员、政策制定者和社交媒体平台开发更有效的消息策略,以消除错误信息的噪音,并教育公众有关科学发现的知识。

[NLP-15] Fine-Tuning with Divergent Chains of Thought Boosts Reasoning Through Self-Correction in Language Models
[NLP-15] 用分歧思想链进行微调通过语言模型的自我纠正来促进推理

链接: https://arxiv.org/abs/2407.03181
作者: Haritz Puerto,Tilek Chubakov,Xiaodan Zhu,Harish Tayyar Madabushi,Iryna Gurevych
关键词: Large Language Model, Large Language, intermediary reasoning steps, generate intermediary reasoning, Requiring a Large
中文关键词: 大型语言模型,大型语言,中间推理步骤,生成中间推理,需要大型
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Requiring a Large Language Model to generate intermediary reasoning steps has been shown to be an effective way of boosting performance. In fact, it has been found that instruction tuning on these intermediary reasoning steps improves model performance. In this work, we present a novel method of further improving performance by requiring models to compare multiple reasoning chains before generating a solution in a single inference step. We call this method Divergent CoT (DCoT). We find that instruction tuning on DCoT datasets boosts the performance of even smaller, and therefore more accessible, LLMs. Through a rigorous set of experiments spanning a wide range of tasks that require various reasoning types, we show that fine-tuning on DCoT consistently improves performance over the CoT baseline across model families and scales (1.3B to 70B). Through a combination of empirical and manual evaluation, we additionally show that these performance gains stem from models generating multiple divergent reasoning chains in a single inference step, indicative of the enabling of self-correction in language models. Our code and data are publicly available at this https URL.
摘要:需要大型语言模型来生成中间推理步骤已被证明是提高性能的有效方法。事实上,人们已经发现,在这些中间推理步骤上进行指令调整可以提高模型的性能。在这项工作中,我们提出了一种新的方法,通过要求模型在单个推理步骤中生成解之前比较多个推理链来进一步提高性能。我们称这种方法为分歧式COT(DCoT)。我们发现,DCoT数据集上的指令调优提高了更小的、因此更容易访问的LLM的性能。通过跨越需要各种推理类型的广泛任务的一组严格的实验,我们表明在DCoT上的微调在模型系列和比例(1.3B到70B)中持续改善COT基线的性能。通过经验和人工评估的结合,我们进一步表明,这些性能收益来自于在单个推理步骤中生成多个发散推理链的模型,这表明语言模型能够自我纠正。我们的代码和数据在此HTTPS URL上公开可用。

[NLP-16] Investigating Decoder-only Large Language Models for Speech-to-text Translation
[NLP-16] 研究语音到文本翻译的纯解码器大型语言模型

链接: https://arxiv.org/abs/2407.03169
作者: Chao-Wei Huang,Hui Lu,Hongyu Gong,Hirofumi Inaguma,Ilia Kulikov,Ruslan Mavlyutov,Sravya Popuri
关键词: exceptional reasoning capabilities, Large language models, Large language, enhancing speech-related tasks, reasoning capabilities
中文关键词: 卓越的推理能力,大型语言模型,大型语言,增强语音相关任务,推理能力
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Accepted to Interspeech 2024

点击查看摘要

Abstract:Large language models (LLMs), known for their exceptional reasoning capabilities, generalizability, and fluency across diverse domains, present a promising avenue for enhancing speech-related tasks. In this paper, we focus on integrating decoder-only LLMs to the task of speech-to-text translation (S2TT). We propose a decoder-only architecture that enables the LLM to directly consume the encoded speech representation and generate the text translation. Additionally, we investigate the effects of different parameter-efficient fine-tuning techniques and task formulation. Our model achieves state-of-the-art performance on CoVoST 2 and FLEURS among models trained without proprietary data. We also conduct analyses to validate the design choices of our proposed model and bring insights to the integration of LLMs to S2TT.
摘要:大型语言模型(LLM)以其出色的推理能力、概括性和跨不同领域的流畅性而闻名,为增强语音相关任务提供了一条有希望的途径。在本文中,我们重点关注将仅解码器的LLM集成到语音到文本翻译(S2 TT)任务中。我们提出了一种仅解码器的架构,使LLM能够直接消费编码语音表示并生成文本翻译。此外,我们还研究了不同参数高效微调技术和任务公式的影响。在没有专有数据训练的模型中,我们的模型在CoVoST 2和FLEURS上实现了最先进的性能。我们还进行分析来验证我们提出的模型的设计选择,并为LLM到S2 TT的集成提供见解。

[NLP-17] SOS! Soft Prompt Attack Against Open-Source Large Language Models
[NLP-17] 求救!针对开源大型语言模型的软提示攻击

链接: https://arxiv.org/abs/2407.03160
作者: Ziqing Yang,Michael Backes,Yang Zhang,Ahmed Salem
关键词: Open-source large language, large language models, public and industry, Open-source large, large language
中文关键词: 开源大型语言、大型语言模型、公共和行业、开源大型语言
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Open-source large language models (LLMs) have become increasingly popular among both the general public and industry, as they can be customized, fine-tuned, and freely used. However, some open-source LLMs require approval before usage, which has led to third parties publishing their own easily accessible versions. Similarly, third parties have been publishing fine-tuned or quantized variants of these LLMs. These versions are particularly appealing to users because of their ease of access and reduced computational resource demands. This trend has increased the risk of training time attacks, compromising the integrity and security of LLMs. In this work, we present a new training time attack, SOS, which is designed to be low in computational demand and does not require clean data or modification of the model weights, thereby maintaining the model’s utility intact. The attack addresses security issues in various scenarios, including the backdoor attack, jailbreak attack, and prompt stealing attack. Our experimental findings demonstrate that the proposed attack is effective across all evaluated targets. Furthermore, we present the other side of our SOS technique, namely the copyright token – a novel technique that enables users to mark their copyrighted content and prevent models from using it.
摘要:开源的大型语言模型(LLM)因其可定制、可调和可自由使用而在普通公众和行业中越来越受欢迎。然而,一些开源的LLM在使用之前需要获得批准,这导致第三方发布了他们自己的易于访问的版本。同样,第三方一直在发布这些LLM的微调或量化变体。这些版本对用户特别有吸引力,因为它们易于访问并减少了计算资源需求。这一趋势增加了训练时间攻击的风险,损害了LLMS的完整性和安全性。在这项工作中,我们提出了一种新的训练时间攻击,SOS,它被设计为计算量低,不需要干净的数据或修改模型权重,从而保持了模型的实用性。该攻击针对各种场景的安全问题,包括后门攻击、越狱攻击、提示窃取攻击。我们的实验结果表明,所提出的攻击对所有被评估目标都是有效的。此外,我们还介绍了SOS技术的另一面,即版权令牌–这是一种使用户能够标记其受版权保护的内容并防止模型使用它的新技术。

[NLP-18] Let the Code LLM Edit Itself When You Edit the Code
[NLP-18] 编辑代码时让代码LLM自行编辑

链接: https://arxiv.org/abs/2407.03157
作者: Zhenyu He,Jun Zhang,Shengjie Luo,Jingjing Xu,Zhi Zhang,Di He
关键词: developer edits existing, edits existing code, large language model, investigate a typical, typical scenario
中文关键词: 开发人员编辑现有的、编辑现有的代码、大型语言模型、调查典型的、典型的场景
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
备注: Preprint. Work in Progress

点击查看摘要

Abstract:In this work, we investigate a typical scenario in code generation where a developer edits existing code in real time and requests a code assistant, e.g., a large language model, to re-predict the next token or next line on the fly. Naively, the LLM needs to re-encode the entire KV cache to provide an accurate prediction. However, this process is computationally expensive, especially when the sequence length is long. Simply encoding the edited subsequence and integrating it to the original KV cache meets the temporal confusion problem, leading to significantly worse performance. We address this efficiency and accuracy trade-off by introducing \underline\textbfPositional \textbfIntegrity \textbfEncoding (PIE). Building upon the rotary positional encoding, PIE first removes the rotary matrices in the Key cache that introduce temporal confusion and then reapplies the correct rotary matrices. This process ensures that positional relationships between tokens are correct and requires only a single round of matrix multiplication. We validate the effectiveness of PIE through extensive experiments on the RepoBench-C-8k dataset, utilizing DeepSeek-Coder models with 1.3B, 6.7B, and 33B parameters. Our evaluation includes three real-world coding tasks: code insertion, code deletion, and multi-place code editing. Results demonstrate that PIE reduces computational overhead by over 85% compared to the standard full recomputation approach across all model sizes and tasks while well approximating the model performance.
摘要:在这项工作中,我们研究了代码生成中的一个典型场景,其中开发人员实时编辑现有代码,并请求代码助手(例如大型语言模型)动态重新预测下一个令牌或下一行。天真的是,LLM需要对整个KV缓存进行重新编码,以提供准确的预测。然而,这个过程在计算上是昂贵的,特别是当序列长度很长时。简单地对编辑后的子序列进行编码并将其集成到原始KV缓存中会遇到时间混淆问题,从而导致性能显著下降。我们通过引入\下划线\文本bf位置\文本bf完整性\文本bf编码(PIE)来解决效率和精度之间的权衡。在旋转位置编码的基础上,PIE首先删除密钥缓存中引入时间混淆的旋转矩阵,然后重新应用正确的旋转矩阵。该过程确保了令牌之间的位置关系是正确的,并且只需要单轮矩阵乘法。利用DeepSeek-Coder模型和1.3B、6.7B和33B参数,在RepoBch-C-8k数据集上进行了大量的实验,验证了PIE算法的有效性。我们的评估包括三个真实世界的编码任务:代码插入、代码删除和多位代码编辑。结果表明,与标准的完全重新计算方法相比,PIE在所有模型大小和任务的情况下减少了85%以上的计算开销,同时很好地接近了模型性能。

[NLP-19] Enhancing Translation Accuracy of Large Language Models through Continual Pre-Training on Parallel Data
[NLP-19] 通过并行数据的连续预训练提高大型语言模型的翻译准确性

链接: https://arxiv.org/abs/2407.03145
作者: Minato Kondo,Takehito Utsuro,Masaaki Nagata
关键词: pre-trained large language, high-quality parallel data, parallel data, pre-trained large, continually pre-trained
中文关键词: 预训练的大型语言、高质量并行数据、并行数据、预训练的大型、持续预训练的
类目: Computation and Language (cs.CL)
备注: IWSLT2024, 18 pages

点击查看摘要

Abstract:In this paper, we propose a two-phase training approach where pre-trained large language models are continually pre-trained on parallel data and then supervised fine-tuned with a small amount of high-quality parallel data. To investigate the effectiveness of our proposed approach, we conducted continual pre-training with a 3.8B-parameter model and parallel data across eight different formats. We evaluate these methods on thirteen test sets for Japanese-to-English and English-to-Japanese translation. The results demonstrate that when utilizing parallel data in continual pre-training, it is essential to alternate between source and target sentences. Additionally, we demonstrated that the translation accuracy improves only for translation directions where the order of source and target sentences aligns between continual pre-training data and inference. In addition, we demonstrate that the LLM-based translation model is more robust in translating spoken language and achieves higher accuracy with less training data compared to supervised encoder-decoder models. We also show that the highest accuracy is achieved when the data for continual pre-training consists of interleaved source and target sentences and when tags are added to the source sentences.
摘要:在本文中,我们提出了一种两阶段训练方法,即预先训练好的大型语言模型在并行数据上持续预训练,然后用少量高质量的并行数据进行监督微调。为了考察我们提出的方法的有效性,我们用一个3.8B参数模型和八种不同格式的并行数据进行了连续的预训练。我们在13个日语-英语和英语-日语翻译测试集上对这些方法进行了评估。结果表明,在连续预训练中利用平行数据时,源句和目标句之间的交替是必要的。此外,我们还证明了只有在源句和目标句的顺序在连续的预训练数据和推理之间保持一致的情况下,翻译精度才会提高。此外,与有监督的编解码器模型相比,基于LLM的翻译模型在翻译口语时具有更强的健壮性,并且以更少的训练数据获得了更高的准确率。我们还表明,当连续预训练的数据由交错的源句和目标句组成时,以及在源句中添加标签时,准确率最高。

[NLP-20] Speaker- and Text-Independent Estimation of Articulatory Movements and Phoneme Alignments from Speech
[NLP-20] 言语中的发音运动和音素对齐的与说话者和文本无关的估计

链接: https://arxiv.org/abs/2407.03132
作者: Tobias Weise,Philipp Klumpp,Kubilay Can Demir,Paula Andrea Pérez-Toro,Maria Schuster,Elmar Noeth,Bjoern Heismann,Andreas Maier,Seung Hee Yang
关键词: previously treated separately, motion estimation, previously treated, treated separately, speech inversion
中文关键词: 先前单独处理、运动估计、先前处理、单独处理、语音倒置
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
备注: to be published in Interspeech 2024 proceedings

点击查看摘要

Abstract:This paper introduces a novel combination of two tasks, previously treated separately: acoustic-to-articulatory speech inversion (AAI) and phoneme-to-articulatory (PTA) motion estimation. We refer to this joint task as acoustic phoneme-to-articulatory speech inversion (APTAI) and explore two different approaches, both working speaker- and text-independently during inference. We use a multi-task learning setup, with the end-to-end goal of taking raw speech as input and estimating the corresponding articulatory movements, phoneme sequence, and phoneme alignment. While both proposed approaches share these same requirements, they differ in their way of achieving phoneme-related predictions: one is based on frame classification, the other on a two-staged training procedure and forced alignment. We reach competitive performance of 0.73 mean correlation for the AAI task and achieve up to approximately 87% frame overlap compared to a state-of-the-art text-dependent phoneme force aligner.
摘要:本文提出了一种新的组合方法:声学-发音语音反转(AAI)和音素-发音(PTA)运动估计。我们将这一联合任务称为声学音素到发音语音反转(APTAI),并探索了两种不同的方法,在推理过程中独立于说话人和文本工作。我们使用多任务学习设置,端到端的目标是将原始语音作为输入,并估计相应的发音动作、音素序列和音素排列。虽然这两种方法都有相同的要求,但它们实现音素相关预测的方式不同:一种是基于帧分类,另一种是基于两个阶段的训练过程和强制对齐。对于AAI任务,我们达到了0.73的平均相关性的竞争性能,并且与最先进的依赖于文本的音素力对齐器相比,实现了高达87%的帧重叠。

[NLP-21] Social Bias Evaluation for Large Language Models Requires Prompt Variations
[NLP-21] 大型语言模型的社会偏见评估需要迅速变化

链接: https://arxiv.org/abs/2407.03129
作者: Rem Hida,Masahiro Kaneko,Naoaki Okazaki
关键词: Warning, Large Language Models, LLMs, social bias, social
中文关键词: 警告、大型语言模型、法学硕士、社会偏见、社会
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Warning: This paper contains examples of stereotypes and biases. Large Language Models (LLMs) exhibit considerable social biases, and various studies have tried to evaluate and mitigate these biases accurately. Previous studies use downstream tasks as prompts to examine the degree of social biases for evaluation and mitigation. While LLMs’ output highly depends on prompts, previous studies evaluating and mitigating bias have often relied on a limited variety of prompts. In this paper, we investigate the sensitivity of LLMs when changing prompt variations (task instruction and prompt, few-shot examples, debias-prompt) by analyzing task performance and social bias of LLMs. Our experimental results reveal that LLMs are highly sensitive to prompts to the extent that the ranking of LLMs fluctuates when comparing models for task performance and social bias. Additionally, we show that LLMs have tradeoffs between performance and social bias caused by the prompts. Less bias from prompt setting may result in reduced performance. Moreover, the ambiguity of instances is one of the reasons for this sensitivity to prompts in advanced LLMs, leading to various outputs. We recommend using diverse prompts, as in this study, to compare the effects of prompts on social bias in LLMs.
摘要:警告:本文包含刻板印象和偏见的例子。大型语言模型(LLM)表现出相当大的社会偏见,各种研究试图准确地评估和缓解这些偏见。以前的研究使用下游任务作为提示,检查社会偏见的程度,以进行评估和缓解。虽然LLMS的输出高度依赖于提示,但之前评估和减轻偏见的研究往往依赖于有限种类的提示。本文通过对LLMS的任务绩效和社会偏向的分析,考察了LLMS在改变提示变量(任务指示和提示、小镜头例子、去偏向-提示)时的敏感性。我们的实验结果表明,LLM对提示高度敏感,当比较任务绩效和社会偏见的模型时,LLM的排名会发生波动。此外,我们还发现,LLM在表现和提示引起的社会偏见之间存在权衡。对提示设置的偏差较小可能会导致性能降低。此外,实例的模糊性是高级LLMS中对提示如此敏感的原因之一,从而导致各种输出。我们建议使用不同的提示,如本研究,来比较提示对LLMS中的社会偏见的影响。

[NLP-22] KeyVideoLLM: Towards Large-scale Video Keyframe Selection
[NLP-22] KeyVideoLLM:迈向大规模视频关键帧选择

链接: https://arxiv.org/abs/2407.03104
作者: Hao Liang,Jiapeng Li,Tianyi Bai,Chong Chen,Conghui He,Bin Cui,Wentao Zhang
关键词: Large Language Models, Video Large Language, increasingly important, understanding large-scale video, rise of web
中文关键词: 大型语言模型,视频大型语言,越来越重要,理解大规模视频,网络的崛起
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Recently, with the rise of web videos, managing and understanding large-scale video datasets has become increasingly important. Video Large Language Models (VideoLLMs) have emerged in recent years due to their strong video understanding capabilities. However, training and inference processes for VideoLLMs demand vast amounts of data, presenting significant challenges to data management, particularly regarding efficiency, robustness, and effectiveness. In this work, we present KeyVideoLLM, a text-video frame similarity-based keyframe selection method designed to manage VideoLLM data efficiently, robustly, and effectively. Specifically, KeyVideoLLM achieves a remarkable data compression rate of up to 60.9 times, substantially lowering disk space requirements, which proves its high efficiency. Additionally, it maintains a 100% selection success rate across all video formats and scales, enhances processing speed by up to 200 times compared to existing keyframe selection methods, and does not require hyperparameter tuning. Beyond its outstanding efficiency and robustness, KeyVideoLLM further improves model performance in video question-answering tasks during both training and inference stages. Notably, it consistently achieved the state-of-the-art (SoTA) experimental results on diverse datasets.
摘要:近年来,随着网络视频的兴起,管理和理解大规模视频数据集变得越来越重要。视频大语言模型(Video Large Language Model,简称视频大语言模型)由于具有较强的视频理解能力,近年来应运而生。然而,视频LLMS的训练和推理过程需要大量数据,这给数据管理带来了巨大的挑战,特别是在效率、稳健性和有效性方面。在这项工作中,我们提出了一种基于文本-视频帧相似度的关键帧选择方法KeyVideoLLM,旨在高效、健壮、有效地管理视频LLM数据。具体来说,KeyVideoLLM实现了高达60.9倍的出色数据压缩比,大幅降低了对磁盘空间的需求,证明了其高效率。此外,它在所有视频格式和规模上保持100%的选择成功率,与现有的关键帧选择方法相比,处理速度提高高达200倍,并且不需要超参数调整。除了突出的效率和健壮性外,KeyVideoLLM还在训练和推理阶段进一步提高了模型在视频问答任务中的性能。值得注意的是,它在不同的数据集上始终获得了最先进的(SOTA)实验结果。

[NLP-23] Cactus: Towards Psychological Counseling Conversations using Cognitive Behavioral Theory
[NLP-23] 仙人掌:使用认知行为理论进行心理咨询对话

链接: https://arxiv.org/abs/2407.03103
作者: Suyeon Lee,Sunghwan Kim,Minju Kim,Dongjin Kang,Dongil Yang,Harim Kim,Minseok Kang,Dayi Jung,Min Hee Kim,Seungbeen Lee,Kyoung-Mee Chung,Youngjae Yu,Dongha Lee,Jinyoung Yeo
关键词: individuals express concerns, mental health, significantly increased, individuals express, express concerns
中文关键词: 个人表达担忧,心理健康,显着增加,个人表达,表达担忧
类目: Computation and Language (cs.CL)
备注: Under Review

点击查看摘要

Abstract:Recently, the demand for psychological counseling has significantly increased as more individuals express concerns about their mental health. This surge has accelerated efforts to improve the accessibility of counseling by using large language models (LLMs) as counselors. To ensure client privacy, training open-source LLMs faces a key challenge: the absence of realistic counseling datasets. To address this, we introduce Cactus, a multi-turn dialogue dataset that emulates real-life interactions using the goal-oriented and structured approach of Cognitive Behavioral Therapy (CBT). We create a diverse and realistic dataset by designing clients with varied, specific personas, and having counselors systematically apply CBT techniques in their interactions. To assess the quality of our data, we benchmark against established psychological criteria used to evaluate real counseling sessions, ensuring alignment with expert evaluations. Experimental results demonstrate that Camel, a model trained with Cactus, outperforms other models in counseling skills, highlighting its effectiveness and potential as a counseling agent. We make our data, model, and code publicly available.
摘要:近年来,随着越来越多的人对自己的心理健康表示担忧,对心理咨询的需求显著增加。这股热潮加快了通过使用大型语言模型(LLM)作为咨询师来提高咨询可及性的努力。为了确保客户隐私,培训开源LLM面临一个关键挑战:缺乏现实的咨询数据集。为了解决这个问题,我们引入了仙人掌,这是一个多轮对话数据集,使用认知行为疗法(CBT)的目标导向和结构化方法模拟现实生活中的互动。我们通过设计具有不同、特定角色的客户,并让咨询师在他们的互动中系统地应用CBT技术,创建了一个多样化和现实的数据集。为了评估我们的数据质量,我们以用于评估真实咨询课程的既定心理学标准为基准,确保与专家评估保持一致。实验结果表明,与仙人掌一起训练的模型骆驼在咨询技能上优于其他模型,突显了其作为咨询代理的有效性和潜力。我们公开我们的数据、模型和代码。

[NLP-24] A Case Study on Context-Aware Neural Machine Translation with Multi-Task Learning
[NLP-24] 基于多任务学习的上下文感知神经机器翻译案例研究

链接: https://arxiv.org/abs/2407.03076
作者: Ramakrishna Appicharla,Baban Gain,Santanu Pal,Asif Ekbal,Pushpak Bhattacharyya
关键词: neural machine translation, document-level neural machine, machine translation, neural machine, approaches are common
中文关键词: 神经机器翻译,文档级神经机器,机器翻译,神经机器,方法很常见
类目: Computation and Language (cs.CL)
备注: Accepted to EAMT 2024 (poster)

点击查看摘要

Abstract:In document-level neural machine translation (DocNMT), multi-encoder approaches are common in encoding context and source sentences. Recent studies \citeli-etal-2020-multi-encoder have shown that the context encoder generates noise and makes the model robust to the choice of context. This paper further investigates this observation by explicitly modelling context encoding through multi-task learning (MTL) to make the model sensitive to the choice of context. We conduct experiments on cascade MTL architecture, which consists of one encoder and two decoders. Generation of the source from the context is considered an auxiliary task, and generation of the target from the source is the main task. We experimented with German–English language pairs on News, TED, and Europarl corpora. Evaluation results show that the proposed MTL approach performs better than concatenation-based and multi-encoder DocNMT models in low-resource settings and is sensitive to the choice of context. However, we observe that the MTL models are failing to generate the source from the context. These observations align with the previous studies, and this might suggest that the available document-level parallel corpora are not context-aware, and a robust sentence-level model can outperform the context-aware models.
摘要:在文档级神经机器翻译(DocNMT)中,上下文和源句的编码通常采用多个编码者的方法。最近的研究表明,上下文编码器会产生噪声,使模型对上下文的选择具有健壮性。本文通过多任务学习(MTL)对上下文编码进行显式建模,使模型对上下文的选择敏感,进一步研究了这一现象。我们在级联MTL结构上进行了实验,该结构由一个编码器和两个解码器组成。从上下文生成源被认为是辅助任务,而从源生成目标是主要任务。我们在新闻、TED和Europarl语料库上进行了德语-英语语言对的实验。评估结果表明,该方法在低资源环境下的性能优于基于级联和多编码器的DocNMT模型,并且对上下文的选择较为敏感。然而,我们观察到MTL模型无法从上下文生成源。这些观察结果与之前的研究一致,这可能表明现有的文档级平行语料库不是上下文感知的,稳健的句子级模型可以优于上下文感知模型。

[NLP-25] ALTER: Augmentation for Large-Table-Based Reasoning
[NLP-25] ALTER:基于大表的推理的增强

链接: https://arxiv.org/abs/2407.03061
作者: Han Zhang,Yuheng Ma,Hanfang Yang
关键词: large language models, extensive research, research has explored, struggle with scalability, scalability when applied
中文关键词: 大型语言模型、广泛的研究、探索的研究、与可扩展性作斗争、应用时的可扩展性
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:While extensive research has explored the use of large language models (LLMs) for table-based reasoning, most approaches struggle with scalability when applied to large tables. To maintain the superior comprehension abilities of LLMs in these scenarios, we introduce ALTER(Augmentation for Large-Table-Based Reasoning)-a framework designed to harness the latent augmentation potential in both free-form natural language (NL) questions, via the query augmentor, and semi-structured tabular data, through the table augmentor. By utilizing only a small subset of relevant data from the table and supplementing it with pre-augmented schema, semantic, and literal information, ALTER achieves outstanding performance on table-based reasoning benchmarks. We also provide a detailed analysis of large-table scenarios, comparing different methods and various partitioning principles. In these scenarios, our method outperforms all other approaches and exhibits robustness and efficiency against perturbations.
摘要:虽然广泛的研究已经探索了使用大型语言模型(LLM)进行基于表的推理,但大多数方法在应用于大型表时都难以实现可伸缩性。为了在这些场景中保持LLMS的卓越理解能力,我们引入了ALTER(基于大表推理的增强)-一个框架,旨在通过查询增强器利用自由形式自然语言(NL)问题的潜在增强潜力,通过表增强器利用半结构化表格数据的潜在增强潜力。通过仅利用表中相关数据的一小部分,并使用预先扩充的模式、语义和文字信息对其进行补充,ALTER在基于表的推理基准测试中实现了出色的性能。我们还详细分析了大表场景,比较了不同的方法和不同的分区原则。在这些场景中,我们的方法比所有其他方法都要好,并且表现出对扰动的稳健性和效率。

[NLP-26] Improving Conversational Abilities of Quantized Large Language Models via Direct Preference Alignment
[NLP-26] 通过直接偏好对齐提高量化大型语言模型的对话能力

链接: https://arxiv.org/abs/2407.03051
作者: Janghwan Lee,Seongmin Park,Sukjin Hong,Minsoo Kim,Du-Seong Chang,Jungwook Choi
关键词: closely mirroring human, generate pertinent sentences, grasp contextual nuances, large language models, human feedback
中文关键词: 密切反映人类、生成相关句子、掌握上下文细微差别、大型语言模型、人类反馈
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The rapid advancement of large language models (LLMs) has facilitated their transformation into conversational chatbots that can grasp contextual nuances and generate pertinent sentences, closely mirroring human values through advanced techniques such as instruction tuning and reinforcement learning from human feedback (RLHF). However, the computational efficiency required for LLMs, achieved through techniques like post-training quantization (PTQ), presents challenges such as token-flipping that can impair chatbot performance. In response, we propose a novel preference alignment approach, quantization-aware direct preference optimization (QDPO), that aligns quantized LLMs with their full-precision counterparts, improving conversational abilities. Evaluated on two instruction-tuned LLMs in various languages, QDPO demonstrated superior performance in improving conversational abilities compared to established PTQ and knowledge-distillation fine-tuning techniques, marking a significant step forward in the development of efficient and effective conversational LLMs.
摘要:大语言模型(LLMS)的迅速发展促进了它们向对话型聊天机器人的转变,这些聊天机器人可以捕捉语境的细微差别并生成中肯的句子,通过教学调整和从人类反馈中强化学习(RLHF)等先进技术紧密地反映了人类的价值观。然而,通过训练后量化(PTQ)等技术实现的LLMS所需的计算效率带来了诸如令牌翻转等挑战,这可能会影响聊天机器人的性能。作为回应,我们提出了一种新的偏好对齐方法–量化感知的直接偏好优化(QDPO),它将量化的LLM与其全精度的偏好对齐,从而提高会话能力。QDPO在两个不同语言的指令调整的LLM上进行了评估,与现有的PTQ和知识精炼微调技术相比,QDPO在提高会话能力方面表现出了更好的表现,标志着在开发高效和有效的会话LLM方面迈出了重要的一步。

[NLP-27] JailbreakHunter: A Visual Analytics Approach for Jailbreak Prompts Discovery from Large-Scale Human-LLM Conversational Datasets
[NLP-27] 越狱猎人:越狱的视觉分析方法从大规模人类LLM对话数据集中进行发现

链接: https://arxiv.org/abs/2407.03045
作者: Zhihua Jin,Shiyi Liu,Haotian Li,Xun Zhao,Huamin Qu
关键词: Large Language Models, Large Language, Language Models, gained significant attention, Jailbreak prompts
中文关键词: 越狱提示,大型语言模型,大型语言,语言模型,受到了极大关注
类目: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 18 pages, 9 figures

点击查看摘要

Abstract:Large Language Models (LLMs) have gained significant attention but also raised concerns due to the risk of misuse. Jailbreak prompts, a popular type of adversarial attack towards LLMs, have appeared and constantly evolved to breach the safety protocols of LLMs. To address this issue, LLMs are regularly updated with safety patches based on reported jailbreak prompts. However, malicious users often keep their successful jailbreak prompts private to exploit LLMs. To uncover these private jailbreak prompts, extensive analysis of large-scale conversational datasets is necessary to identify prompts that still manage to bypass the system’s defenses. This task is highly challenging due to the immense volume of conversation data, diverse characteristics of jailbreak prompts, and their presence in complex multi-turn conversations. To tackle these challenges, we introduce JailbreakHunter, a visual analytics approach for identifying jailbreak prompts in large-scale human-LLM conversational datasets. We have designed a workflow with three analysis levels: group-level, conversation-level, and turn-level. Group-level analysis enables users to grasp the distribution of conversations and identify suspicious conversations using multiple criteria, such as similarity with reported jailbreak prompts in previous research and attack success rates. Conversation-level analysis facilitates the understanding of the progress of conversations and helps discover jailbreak prompts within their conversation contexts. Turn-level analysis allows users to explore the semantic similarity and token overlap between a singleturn prompt and the reported jailbreak prompts, aiding in the identification of new jailbreak strategies. The effectiveness and usability of the system were verified through multiple case studies and expert interviews.
摘要:大型语言模型得到了广泛的关注,但由于存在误用的风险,也引起了人们的关注。越狱提示是一种流行的针对LLMS的对抗性攻击类型,已经出现并不断演变为违反LLMS的安全协议。为了解决这个问题,LLM会根据报告的越狱提示定期更新安全补丁。然而,恶意用户通常会将他们成功的越狱提示保密,以利用LLMS。为了发现这些私密的越狱提示,有必要对大规模对话数据集进行广泛分析,以确定仍然设法绕过系统防御的提示。由于对话数据量巨大,越狱提示的特点多种多样,而且它们存在于复杂的多轮对话中,这项任务具有极大的挑战性。为了应对这些挑战,我们引入了JailBreakHunter,这是一种视觉分析方法,用于在大规模的人-LLM对话数据集中识别越狱提示。我们设计了一个具有三个分析级别的工作流:小组级别、会话级别和话轮级别。组级分析使用户能够掌握对话的分布,并使用多种标准识别可疑对话,例如与之前研究中报告的越狱提示相似,以及攻击成功率。会话级别的分析有助于了解会话的进度,并帮助发现会话上下文中的越狱提示。话轮水平分析允许用户探索单一URN提示和报告的越狱提示之间的语义相似性和标记重叠,有助于识别新的越狱策略。通过多个案例研究和专家访谈,验证了该系统的有效性和可用性。

[NLP-28] Raw Text is All you Need: Knowledge-intensive Multi-turn Instruction Tuning for Large Language Model
[NLP-28] 原始文本即可满足您的需求:大型语言模型的知识密集型多回合指令调优

链接: https://arxiv.org/abs/2407.03040
作者: Xia Hou,Qifeng Li,Jian Yang,Tongliang Li,Linzheng Chai,Xianjie Wu,Hangyuan Ji,Zhoujun Li,Jixuan Nie,Jingbo Dun,Wenfeng Song
关键词: effective technique aligns, Instruction tuning, large language models, human preference, effective technique
中文关键词: 有效的技术对齐、指令调优、大型语言模型、人类偏好、有效的技术
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 11 pages, 3 figures

点击查看摘要

Abstract:Instruction tuning as an effective technique aligns the outputs of large language models (LLMs) with human preference. But how to generate the seasonal multi-turn dialogues from raw documents for instruction tuning still requires further exploration. In this paper, we present a novel framework named R2S that leverages the CoD-Chain of Dialogue logic to guide large language models (LLMs) in generating knowledge-intensive multi-turn dialogues for instruction tuning. By integrating raw documents from both open-source datasets and domain-specific web-crawled documents into a benchmark K-BENCH, we cover diverse areas such as Wikipedia (English), Science (Chinese), and Artifacts (Chinese). Our approach first decides the logic flow of the current dialogue and then prompts LLMs to produce key phrases for sourcing relevant response content. This methodology enables the creation of the G I NSTRUCT instruction dataset, retaining raw document knowledge within dialoguestyle interactions. Utilizing this dataset, we fine-tune GLLM, a model designed to transform raw documents into structured multi-turn dialogues, thereby injecting comprehensive domain knowledge into the SFT model for enhanced instruction tuning. This work signifies a stride towards refining the adaptability and effectiveness of LLMs in processing and generating more accurate, contextually nuanced responses across various fields.
摘要:指令调优作为一种有效的技术,使大型语言模型的输出与人的偏好保持一致。但如何从原始文档中生成季节性的多语轮对话,以进行教学调整,还需要进一步的探索。在本文中,我们提出了一种新的框架R2S,它利用对话逻辑的编码链来指导大型语言模型(LLM)生成知识密集型的多话轮对话,用于教学优化。通过将来自开源数据集的原始文档和特定于领域的Web爬行文档集成到基准K-BENCH中,我们涵盖了不同的领域,如维基百科(英文)、科学(中文)和构件(中文)。我们的方法首先决定当前对话的逻辑流程,然后提示LLMS产生关键短语来获取相关响应内容。这种方法能够创建G I NSTRUCT指令数据集,在对话样式交互中保留原始文档知识。利用这个数据集,我们对GLLM模型进行了微调,该模型旨在将原始文档转换为结构化的多轮对话,从而将全面的领域知识注入SFT模型中,以增强教学优化。这项工作标志着在处理和生成不同领域的更准确、背景细微差别的反应方面,向完善LLMS的适应性和有效性迈出了一大步。

[NLP-29] On the Client Preference of LLM Fine-tuning in Federated Learning
[NLP-29] 联邦学习中LLM微调的客户偏好

链接: https://arxiv.org/abs/2407.03038
作者: Feijie Wu,Xiaoze Liu,Haoyu Wang,Xingchen Wang,Jing Gao
关键词: large language model, pretrained large language, Reinforcement learning, preference datasets, fine-tunes a pretrained
中文关键词: 大型语言模型、预训练大型语言、强化学习、偏好数据集、微调预训练
类目: Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
备注: Work in progress

点击查看摘要

Abstract:Reinforcement learning with human feedback (RLHF) fine-tunes a pretrained large language model (LLM) using preference datasets, enabling the LLM to generate outputs that align with human preferences. Given the sensitive nature of these preference datasets held by various clients, there is a need to implement RLHF within a federated learning (FL) framework, where clients are reluctant to share their data due to privacy concerns. To address this, we introduce a feasible framework in which clients collaboratively train a binary selector with their preference datasets using our proposed FedBis. With a well-trained selector, we can further enhance the LLM that generates human-preferred completions. Meanwhile, we propose a novel algorithm, FedBiscuit, that trains multiple selectors by organizing clients into balanced and disjoint clusters based on their preferences. Compared to the FedBis, FedBiscuit demonstrates superior performance in simulating human preferences for pairwise completions. Our extensive experiments on federated human preference datasets – marking the first benchmark to address heterogeneous data partitioning among clients – demonstrate that FedBiscuit outperforms FedBis and even surpasses traditional centralized training.
摘要:带人类反馈的强化学习(RLHF)使用偏好数据集对预先训练的大型语言模型(LLM)进行微调,使LLM能够生成与人类偏好一致的输出。考虑到这些由不同客户持有的偏好数据集的敏感性质,需要在联合学习(FL)框架内实施RLHF,其中客户出于隐私考虑不愿共享他们的数据。为了解决这个问题,我们引入了一个可行的框架,在这个框架中,客户使用我们提出的FedBis利用他们的偏好数据集协作训练二进制选择器。有了训练有素的选择器,我们可以进一步增强生成人类偏好的补全的LLM。同时,我们提出了一种新的算法FedBiscuit,该算法通过根据客户的偏好将他们组织成平衡的和不相交的簇来训练多个选择者。与FedBis相比,FedBiscuit在模拟人类对成对完成的偏好方面表现出了优越的性能。我们在联合人类偏好数据集上的广泛实验–标志着解决客户之间异类数据分区的第一个基准–证明了FedBiscuit的性能优于FedBis,甚至超过了传统的集中式训练。

[NLP-30] Strategies for Arabic Readability Modeling
[NLP-30] 阿拉伯语可读性建模策略

链接: https://arxiv.org/abs/2407.03032
作者: Juan Piñeros Liberato,Bashar Alhafni,Muhamed Al Khalil,Nizar Habash
关键词: building NLP applications, Automatic readability assessment, Arabic readability assessment, building NLP, NLP applications
中文关键词: 构建NLP应用程序、自动可读性评估、阿拉伯语可读性评估、构建NLP、NLP应用程序
类目: Computation and Language (cs.CL)
备注: Accepted to ArabicNLP 2024, ACL

点击查看摘要

Abstract:Automatic readability assessment is relevant to building NLP applications for education, content analysis, and accessibility. However, Arabic readability assessment is a challenging task due to Arabic’s morphological richness and limited readability resources. In this paper, we present a set of experimental results on Arabic readability assessment using a diverse range of approaches, from rule-based methods to Arabic pretrained language models. We report our results on a newly created corpus at different textual granularity levels (words and sentence fragments). Our results show that combining different techniques yields the best results, achieving an overall macro F1 score of 86.7 at the word level and 87.9 at the fragment level on a blind test set. We make our code, data, and pretrained models publicly available.
摘要:自动可读性评估与构建教育、内容分析和无障碍性的NLP应用程序相关。然而,由于阿拉伯语的形态丰富和可读性资源有限,阿拉伯语的可读性评估是一项具有挑战性的任务。在本文中,我们使用各种方法(从基于规则的方法到阿拉伯语预训练语言模型),提供了一组关于阿拉伯语可读性评估的实验结果。我们在不同文本粒度级别(单词和句子片段)的新创建的文集上报告结果。我们的结果表明,结合不同的技术可以产生最好的结果,在盲测试集上,宏F1总体得分在单词级别为86.7,片段级别为87.9。我们公开我们的代码、数据和预训练模型。

[NLP-31] Exploiting Dialect Identification in Automatic Dialectal Text Normalization
[NLP-31] 利用方言识别技术实现方言文本自动规范化

链接: https://arxiv.org/abs/2407.03020
作者: Bashar Alhafni,Sarah Al-Towaity,Ziyad Fawzy,Fatema Nassar,Fadhl Eryani,Houda Bouamor,Nizar Habash
关键词: native Arabic speakers, Dialectal Arabic, primary spoken language, Arabic, daily communication
中文关键词: 母语是阿拉伯语,阿拉伯语方言,主要口语,阿拉伯语,日常沟通
类目: Computation and Language (cs.CL)
备注: Accepted to ArabicNLP 2024, ACL

点击查看摘要

Abstract:Dialectal Arabic is the primary spoken language used by native Arabic speakers in daily communication. The rise of social media platforms has notably expanded its use as a written language. However, Arabic dialects do not have standard orthographies. This, combined with the inherent noise in user-generated content on social media, presents a major challenge to NLP applications dealing with Dialectal Arabic. In this paper, we explore and report on the task of CODAfication, which aims to normalize Dialectal Arabic into the Conventional Orthography for Dialectal Arabic (CODA). We work with a unique parallel corpus of multiple Arabic dialects focusing on five major city dialects. We benchmark newly developed pretrained sequence-to-sequence models on the task of CODAfication. We further show that using dialect identification information improves the performance across all dialects. We make our code, data, and pretrained models publicly available.
摘要:阿拉伯语方言是以阿拉伯语为母语的人在日常交流中使用的主要口语。社交媒体平台的兴起显着扩大了其作为书面语言的用途。然而,阿拉伯方言没有标准的拼写法。这一点,再加上社交媒体上用户生成的内容中的固有噪音,给处理阿拉伯语方言的NLP应用程序带来了重大挑战。在本文中,我们探讨并报告了CODAtification的任务,该任务旨在将阿拉伯语方言规范化为阿拉伯语方言传统正法(CODA)。我们与多种阿拉伯语方言的独特平行数据库合作,重点关注五个主要城市方言。我们对新开发的预训练序列到序列模型进行了基准测试,以实现CODAfification任务。我们进一步表明,使用方言识别信息可以提高所有方言的性能。我们公开我们的代码、数据和预训练模型。

[NLP-32] What Affects the Stability of Tool Learning? An Empirical Study on the Robustness of Tool Learning Frameworks
[NLP-32] 什么影响工具学习的稳定性?工具学习框架稳健性的实证研究

链接: https://arxiv.org/abs/2407.03007
作者: Chengrui Huang,Zhengliang Shi,Yuntao Wen,Xiuying Chen,Peng Han,Shen Gao,Shuo Shang
关键词: large language models, Tool learning methods, Tool learning, methods have enhanced, enhanced the ability
中文关键词: 大语言模型,工具学习方法,工具学习,方法有所增强,能力增强
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 19 pages, 9 figures

点击查看摘要

Abstract:Tool learning methods have enhanced the ability of large language models (LLMs) to interact with real-world applications. Many existing works fine-tune LLMs or design prompts to enable LLMs to select appropriate tools and correctly invoke them to meet user requirements. However, it is observed in previous works that the performance of tool learning varies from tasks, datasets, training settings, and algorithms. Without understanding the impact of these factors, it can lead to inconsistent results, inefficient model deployment, and suboptimal tool utilization, ultimately hindering the practical integration and scalability of LLMs in real-world scenarios. Therefore, in this paper, we explore the impact of both internal and external factors on the performance of tool learning frameworks. Through extensive experiments on two benchmark datasets, we find several insightful conclusions for future work, including the observation that LLMs can benefit significantly from increased trial and exploration. We believe our empirical study provides a new perspective for future tool learning research.
摘要:工具学习方法增强了大型语言模型与现实世界应用程序交互的能力。许多现有的工作微调LLM或设计提示,以使LLM能够选择适当的工具并正确地调用它们,以满足用户需求。然而,在以前的工作中观察到,工具学习的性能因任务、数据集、训练设置和算法而异。如果不了解这些因素的影响,可能会导致结果不一致、模型部署效率低下、工具利用率不佳,最终阻碍LLMS在现实场景中的实际集成和可扩展性。因此,本文探讨了内部因素和外部因素对工具学习框架绩效的影响。通过在两个基准数据集上的广泛实验,我们为未来的工作找到了几个有洞察力的结论,包括观察到,低成本管理可以从更多的试验和探索中显著受益。我们相信我们的实证研究为未来的工具学习研究提供了一个新的视角。

[NLP-33] Human-like Linguistic Biases in Neural Speech Models: Phonetic Categorization and Phonotactic Constraints in Wav2Vec2.0
[NLP-33] 神经语音模型中的类人语言偏见:Wav2Vec2.0中的语音分类和音素约束

链接: https://arxiv.org/abs/2407.03005
作者: Marianne de Heer Kloots,Willem Zuidema
关键词: deep neural speech, models, deep neural, Abstract, speech
中文关键词: 深度神经语音,模型,深度神经,抽象,语音
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Accepted to Interspeech 2024. For code and materials, see this https URL

点击查看摘要

Abstract:What do deep neural speech models know about phonology? Existing work has examined the encoding of individual linguistic units such as phonemes in these models. Here we investigate interactions between units. Inspired by classic experiments on human speech perception, we study how Wav2Vec2 resolves phonotactic constraints. We synthesize sounds on an acoustic continuum between /l/ and /r/ and embed them in controlled contexts where only /l/, only /r/, or neither occur in English. Like humans, Wav2Vec2 models show a bias towards the phonotactically admissable category in processing such ambiguous sounds. Using simple measures to analyze model internals on the level of individual stimuli, we find that this bias emerges in early layers of the model’s Transformer module. This effect is amplified by ASR finetuning but also present in fully self-supervised models. Our approach demonstrates how controlled stimulus designs can help localize specific linguistic knowledge in neural speech models.
摘要:深层神经语音模型对音系学了解多少?现有的工作已经研究了这些模型中个别语言单位的编码,如音素。在这里,我们研究单位之间的相互作用。受人类语音感知经典实验的启发,我们研究了Wav2Vec2是如何解决语音定向限制的。我们在/L/和/r/之间的声学连续体上合成声音,并将它们嵌入到只有/L/、只有/r/或两者都不在英语中出现的受控语境中。像人类一样,Wav2Vec2模型在处理这种模棱两可的声音时,显示出偏向于音素可接受的类别。使用简单的方法在单个刺激的水平上分析模型的内部,我们发现这种偏差出现在模型的变压器模块的早期层。这一效应被ASR微调放大,但也存在于完全自我监督的模型中。我们的方法展示了受控刺激设计如何帮助在神经语音模型中本地化特定的语言知识。

[NLP-34] SemioLLM: Assessing Large Language Models for Semiological Analysis in Epilepsy Research
[NLP-34] SemioLLM:评估癫痫研究中用于符号学分析的大型语言模型

链接: https://arxiv.org/abs/2407.03004
作者: Meghal Dani,Muthu Jeyanthi Prakash,Zeynep Akata,Stefanie Liebe
关键词: Large Language Models, Large Language, shown promising results, medical question-answering datasets, encode general medical
中文关键词: 大型语言模型,大型语言,显示出有希望的结果,医学问答数据集,编码一般医学
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models have shown promising results in their ability to encode general medical knowledge in standard medical question-answering datasets. However, their potential application in clinical practice requires evaluation in domain-specific tasks, where benchmarks are largely missing. In this study semioLLM, we test the ability of state-of-the-art LLMs (GPT-3.5, GPT-4, Mixtral 8x7B, and Qwen-72chat) to leverage their internal knowledge and reasoning for epilepsy diagnosis. Specifically, we obtain likelihood estimates linking unstructured text descriptions of seizures to seizure-generating brain regions, using an annotated clinical database containing 1269 entries. We evaluate the LLM’s performance, confidence, reasoning, and citation abilities in comparison to clinical evaluation. Models achieve above-chance classification performance with prompt engineering significantly improving their outcome, with some models achieving close-to-clinical performance and reasoning. However, our analyses also reveal significant pitfalls with several models being overly confident while showing poor performance, as well as exhibiting citation errors and hallucinations. In summary, our work provides the first extensive benchmark comparing current SOTA LLMs in the medical domain of epilepsy and highlights their ability to leverage unstructured texts from patients’ medical history to aid diagnostic processes in health care.
摘要:大型语言模型在标准医学问答数据集中编码普通医学知识的能力方面显示出了良好的效果。然而,它们在临床实践中的潜在应用需要在特定领域的任务中进行评估,其中基准在很大程度上是缺乏的。在这项研究中,我们测试了最先进的LLM(GPT-3.5、GPT-4、Mixtral 8x7B和Qwen-72chat)利用其内部知识和推理进行癫痫诊断的能力。具体地说,我们使用一个包含1269个条目的带注释的临床数据库,获得了将癫痫发作的非结构化文本描述与癫痫发作产生的大脑区域联系起来的可能性估计。我们评估了LLM的表现、信心、推理和引用能力,并与临床评估进行了比较。模型实现了机会以上的分类性能,迅速的工程显著改善了它们的结果,一些模型实现了接近临床的性能和推理。然而,我们的分析也揭示了重大陷阱,几个模型过于自信,表现出糟糕的性能,以及表现出引用错误和幻觉。总而言之,我们的工作提供了第一个广泛的基准,比较了当前癫痫医学领域的SOTA LLM,并突出了它们利用患者病史的非结构化文本来帮助医疗保健中的诊断过程的能力。

[NLP-35] VIVA: A Benchmark for Vision-Grounded Decision-Making with Human Values
[NLP-35] VIVA:具有人类价值观的基于愿景的决策基准

链接: https://arxiv.org/abs/2407.03000
作者: Zhe Hu,Yixiao Ren,Jing Li,Yu Yin
关键词: VIsion-grounded decision-making driven, paper introduces VIVA, paper introduces, benchmark for VIsion-grounded, VIsion-grounded decision-making
中文关键词: 基于视觉的决策驱动,论文介绍VIVA,论文介绍,基于视觉的基准,基于视觉的决策
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper introduces VIVA, a benchmark for VIsion-grounded decision-making driven by human VAlues. While most large vision-language models (VLMs) focus on physical-level skills, our work is the first to examine their multimodal capabilities in leveraging human values to make decisions under a vision-depicted situation. VIVA contains 1,062 images depicting diverse real-world situations and the manually annotated decisions grounded in them. Given an image there, the model should select the most appropriate action to address the situation and provide the relevant human values and reason underlying the decision. Extensive experiments based on VIVA show the limitation of VLMs in using human values to make multimodal decisions. Further analyses indicate the potential benefits of exploiting action consequences and predicted human values.
摘要:本文介绍了VIVA,这是由人类吸血鬼驱动的基于视觉的决策的基准。虽然大多数大型视觉语言模型(VLM)关注物理层面的技能,但我们的工作是第一个检查它们在利用人类价值观在视觉描述的情况下做出决策的多模式能力的。VIVA包含1,062张图像,描绘了不同的现实世界情况以及基于这些情况的手动注释决策。给定那里的图像,模型应该选择最适当的行动来解决这种情况,并提供相关的人类价值观和决策背后的理由。基于VIVA的大量实验表明了VLM在使用人类价值观做出多模式决策方面的局限性。进一步的分析表明,利用行动后果和预测人类价值的潜在好处。

[NLP-36] Are Large Language Models Consistent over Value-laden Questions?
[NLP-36] 大型语言模型与充满价值的问题是否一致?

链接: https://arxiv.org/abs/2407.02996
作者: Jared Moore,Tanvi Deshpande,Diyi Yang
关键词: bias their survey, Large language models, models, Large language, topics
中文关键词: 偏见他们的调查,大型语言模型,模型,大型语言,主题
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 8 pages, 9 figures

点击查看摘要

Abstract:Large language models (LLMs) appear to bias their survey answers toward certain values. Nonetheless, some argue that LLMs are too inconsistent to simulate particular values. Are they? To answer, we first define value consistency as the similarity of answers across (1) paraphrases of one question, (2) related questions under one topic, (3) multiple-choice and open-ended use-cases of one question, and (4) multilingual translations of a question to English, Chinese, German, and Japanese. We apply these measures to a few large ( =34b ), open LLMs including llama-3, as well as gpt-4o, using eight thousand questions spanning more than 300 topics. Unlike prior work, we find that models are relatively consistent across paraphrases, use-cases, translations, and within a topic. Still, some inconsistencies remain. Models are more consistent on uncontroversial topics (e.g., in the U.S., “Thanksgiving”) than on controversial ones (“euthanasia”). Base models are both more consistent compared to fine-tuned models and are uniform in their consistency across topics, while fine-tuned models are more inconsistent about some topics (“euthanasia”) than others (“women’s rights”) like our human subjects (n=165).
摘要:大型语言模型(LLM)的调查结果似乎偏向于某些价值观。尽管如此,一些人认为,LLM太不一致,无法模拟特定的值。是吗?为了回答问题,我们首先将值一致性定义为(1)一个问题的释义,(2)一个主题下的相关问题,(3)一个问题的多项选择和开放式用例,以及(4)问题到英语、汉语、德语和日语的多语言翻译的答案的相似性。我们将这些措施应用于几个大型(=34b)、开放的LLM,包括Llama-3以及GPT-40,使用了跨越300多个主题的8000个问题。与以前的工作不同,我们发现模型在释义、用例、翻译和主题内相对一致。尽管如此,仍有一些不一致之处。在没有争议的话题(例如,在美国,“感恩节”)上,模特比在有争议的话题(“安乐死”)上更一致。与微调模型相比,基本模型更一致,而且在各个主题上的一致性也是一致的,而微调模型在某些主题(“安乐死”)上比在其他主题(“妇女权利”)上更不一致,比如我们的人类受试者(n=165)。

[NLP-37] LoRA-Guard: Parameter-Efficient Guardrail Adaptation for Content Moderation of Large Language Models
[NLP-37] LoRA-Guard:大型语言模型内容审核的参数高效保障性调整

链接: https://arxiv.org/abs/2407.02987
作者: Hayder Elesedy,Pedro M. Esperança,Silviu Vlad Oprea,Mete Ozay
关键词: alternative to safety, safety alignment, large language models, Abstract, content moderation
中文关键词: 安全替代方案、安全一致、大型语言模型、抽象、内容审核
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Guardrails have emerged as an alternative to safety alignment for content moderation of large language models (LLMs). Existing model-based guardrails have not been designed for resource-constrained computational portable devices, such as mobile phones, more and more of which are running LLM-based applications locally. We introduce LoRA-Guard, a parameter-efficient guardrail adaptation method that relies on knowledge sharing between LLMs and guardrail models. LoRA-Guard extracts language features from the LLMs and adapts them for the content moderation task using low-rank adapters, while a dual-path design prevents any performance degradation on the generative task. We show that LoRA-Guard outperforms existing approaches with 100-1000x lower parameter overhead while maintaining accuracy, enabling on-device content moderation.
摘要:护栏已成为大型语言模型(LLM)内容审核的安全对齐的替代方案。现有的基于模型的护栏并未针对手机等资源受限的计算便携式设备设计,其中越来越多的设备在本地运行基于LLM的应用程序。我们引入LoRA-Guard,这是一种参数高效的护栏适应方法,依赖于LLM和护栏模型之间的知识共享。LoRA-Guard从LLM中提取语言功能,并使用低级别适配器将其调整为内容审核任务,而双路径设计可防止生成任务的任何性能下降。我们表明,LoRA-Guard的性能优于现有方法,参数负载降低了100- 1000倍,同时保持准确性,实现设备上内容审核。

[NLP-38] Mast Kalandar at SemEval-2024 Task 8: On the Trail of Textual Origins: RoBERTa-BiLSTM Approach to Detect AI-Generated Text
[NLP-38] Mast Kalandar在SemEval-2024任务8:文本起源的踪迹:检测人工智能生成文本的RoBERTa-BiLSTM方法

链接: https://arxiv.org/abs/2407.02978
作者: Jainit Sushil Bafna,Hardik Mittal,Suyash Sethia,Manish Shrivastava,Radhika Mamidi
关键词: Large Language Models, Large Language, diverse user queries, showcased impressive abilities, generating fluent responses
中文关键词: 大型语言模型、大型语言、多样化的用户查询,展示了令人印象深刻的能力,生成流畅的响应
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: SemEval-2024

点击查看摘要

Abstract:Large Language Models (LLMs) have showcased impressive abilities in generating fluent responses to diverse user queries. However, concerns regarding the potential misuse of such texts in journalism, educational, and academic contexts have surfaced. SemEval 2024 introduces the task of Multigenerator, Multidomain, and Multilingual Black-Box Machine-Generated Text Detection, aiming to develop automated systems for identifying machine-generated text and detecting potential misuse. In this paper, we i) propose a RoBERTa-BiLSTM based classifier designed to classify text into two categories: AI-generated or human ii) conduct a comparative study of our model with baseline approaches to evaluate its effectiveness. This paper contributes to the advancement of automatic text detection systems in addressing the challenges posed by machine-generated text misuse. Our architecture ranked 46th on the official leaderboard with an accuracy of 80.83 among 125.
摘要:大型语言模型(LLM)在对不同的用户查询生成流畅响应方面展示了令人印象深刻的能力。然而,人们对此类文本在新闻、教育和学术领域可能被滥用的担忧已经浮出水面。SemEval 2024引入了多生成器、多域和多语言黑匣子机器生成文本检测的任务,旨在开发用于识别机器生成文本和检测潜在滥用的自动化系统。在本文中,我们i)提出了一个基于RoBERTa-BiLSTM的分类器,旨在将文本分为两类:人工智能生成的文本或人类文本ii)使用基线方法对我们的模型进行比较研究以评估其有效性。本文有助于自动文本检测系统的进步,以应对机器生成文本滥用带来的挑战。我们的架构在官方排行榜上排名第46位,准确率为80.83(125个)。

[NLP-39] Large Language Models as Evaluators for Scientific Synthesis
[NLP-39] 大型语言模型作为科学综合的评估者

链接: https://arxiv.org/abs/2407.02977
作者: Julia Evans,Jennifer D’Souza,Sören Auer
关键词: Large Language Models, Large Language, Language Models, Mistral model ability, open-source Mistral model
中文关键词: 大型语言模型,大型语言,语言模型,Mistral模型能力,开源Mistral模型
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
备注: 4 pages, forthcoming as part of the KONVENS 2024 proceedings this https URL

点击查看摘要

Abstract:Our study explores how well the state-of-the-art Large Language Models (LLMs), like GPT-4 and Mistral, can assess the quality of scientific summaries or, more fittingly, scientific syntheses, comparing their evaluations to those of human annotators. We used a dataset of 100 research questions and their syntheses made by GPT-4 from abstracts of five related papers, checked against human quality ratings. The study evaluates both the closed-source GPT-4 and the open-source Mistral model’s ability to rate these summaries and provide reasons for their judgments. Preliminary results show that LLMs can offer logical explanations that somewhat match the quality ratings, yet a deeper statistical analysis shows a weak correlation between LLM and human ratings, suggesting the potential and current limitations of LLMs in scientific synthesis evaluation.
摘要:我们的研究探讨了GPT-4和Mistral等最先进的大型语言模型(LLM)如何评估科学摘要或更合适的科学合成的质量,并将其评估与人类注释者的评估进行比较。我们使用了包含100个研究问题的数据集,以及GPT-4从五篇相关论文的摘要中进行的综合,并与人类质量评级进行了检查。该研究评估了封闭源GPT-4和开放源Mistral模型对这些摘要进行评级并为其判断提供理由的能力。初步结果表明,LLM可以提供与质量评级在一定程度上匹配的逻辑解释,但更深入的统计分析表明LLM和人类评级之间的相关性较弱,这表明LLM在科学综合评估中的潜在和当前局限性。

[NLP-40] FSM: A Finite State Machine Based Zero-Shot Prompting Paradigm for Multi-Hop Question Answering
[NLP-40] RSM:基于有限状态机的多跳问题回答零镜头预算范式

链接: https://arxiv.org/abs/2407.02964
作者: Xiaochen Wang,Junqing He,Zhe yang,Yiru Wang,Xiangdi Meng,Kunhao Pan,Zhifang Sui
关键词: Large Language Models, simple nature language, nature language inference, Large Language, Language Models
中文关键词: 大型语言模型、简单自然语言、自然语言推理、大型语言、语言模型
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) with chain-of-thought (COT) prompting have demonstrated impressive abilities on simple nature language inference tasks. However, they tend to perform poorly on Multi-hop Question Answering (MHQA) tasks due to several challenges, including hallucination, error propagation and limited context length. We propose a prompting method, Finite State Machine (FSM) to enhance the reasoning capabilities of LLM for complex tasks in addition to improved effectiveness and trustworthiness. Different from COT methods, FSM addresses MHQA by iteratively decomposing a question into multi-turn sub-questions, and self-correcting in time, improving the accuracy of answers in each step. Specifically, FSM addresses one sub-question at a time and decides on the next step based on its current result and state, in an automaton-like format. Experiments on benchmarks show the effectiveness of our method. Although our method performs on par with the baseline on relatively simpler datasets, it excels on challenging datasets like Musique. Moreover, this approach mitigates the hallucination phenomenon, wherein the correct final answer can be recovered despite errors in intermediate reasoning. Furthermore, our method improves LLMs’ ability to follow specified output format requirements, significantly reducing the difficulty of answer interpretation and the need for reformatting.
摘要:具有思维链(CoT)提示的大语言模型(LLM)在简单自然语言推理任务中表现出了令人印象深刻的能力。然而,由于幻觉、错误传播和有限的语境长度等几个挑战,他们在多跳问答(MHQA)任务中往往表现不佳。我们提出了一种有限状态机(FSM)的提示方法来增强LLM对复杂任务的推理能力,同时提高了有效性和可信度。与COT方法不同,有限状态机通过迭代将问题分解成多话轮的子问题,并及时自校正,提高了每一步答案的准确性。具体地说,FSM一次解决一个子问题,并以类似自动机的格式基于其当前结果和状态决定下一步。在基准测试上的实验表明了该方法的有效性。虽然我们的方法在相对简单的数据集上的性能与基线相当,但它在Musique等具有挑战性的数据集上表现出色。此外,这种方法减轻了幻觉现象,在这种现象中,正确的最终答案可以在中间推理错误的情况下恢复。此外,我们的方法提高了LLMS遵循特定输出格式要求的能力,显著降低了答案解释的难度和重新格式化的需要。

[NLP-41] ObfuscaTune: Obfuscated Offsite Fine-tuning and Inference of Proprietary LLMs on Private Datasets
[NLP-41] ObfuscaButton:模糊的非现场微调和私有数据集中专有LLM的推断

链接: https://arxiv.org/abs/2407.02960
作者: Ahmed Frikha,Nassim Walha,Ricardo Mendes,Krishna Kanth Nakka,Xue Jiang,Xuebing Zhou
关键词: proprietary LLM owned, data owner entity, model provider entity, proprietary LLM, LLM owned
中文关键词: 专有LLM拥有、数据所有者实体、模型提供商实体、专有LLM、LLM拥有
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Preprint

点击查看摘要

Abstract:This work addresses the timely yet underexplored problem of performing inference and finetuning of a proprietary LLM owned by a model provider entity on the confidential/private data of another data owner entity, in a way that ensures the confidentiality of both the model and the data. Hereby, the finetuning is conducted offsite, i.e., on the computation infrastructure of a third-party cloud provider. We tackle this problem by proposing ObfuscaTune, a novel, efficient and fully utility-preserving approach that combines a simple yet effective obfuscation technique with an efficient usage of confidential computing (only 5% of the model parameters are placed on TEE). We empirically demonstrate the effectiveness of ObfuscaTune by validating it on GPT-2 models with different sizes on four NLP benchmark datasets. Finally, we compare to a naïve version of our approach to highlight the necessity of using random matrices with low condition numbers in our approach to reduce errors induced by the obfuscation.
摘要:这项工作解决了及时但未被探索的问题,即以确保模型和数据的机密性的方式,对模型提供者实体拥有的专有LLM执行推理和精调,以对另一个数据所有者实体的机密/私有数据执行。因此,微调在场外进行,即在第三方云提供商的计算基础设施上进行。为了解决这个问题,我们提出了一种新颖、高效且完全保持效用的方法–ObfuscaTune,它结合了简单而有效的混淆技术和机密计算的高效使用(只有5%的模型参数被放置在TEE上)。我们通过在四个NLP基准数据集上对不同大小的GPT-2模型进行验证,验证了该算法的有效性。最后,我们比较了我们的方法的一个天真的版本,以强调在我们的方法中使用具有低条件数的随机矩阵的必要性,以减少由于混淆而导致的错误。

[NLP-42] IncogniText: Privacy-enhancing Conditional Text Anonymization via LLM-based Private Attribute Randomization
[NLP-42] Incognitext:通过基于LLM的私有属性随机化增强隐私的条件文本解析

链接: https://arxiv.org/abs/2407.02956
作者: Ahmed Frikha,Nassim Walha,Krishna Kanth Nakka,Ricardo Mendes,Xue Jiang,Xuebing Zhou
关键词: correctly inferring private, inferring private attributes, meaning and semantics, address the problem, prevent adversaries
中文关键词: 正确推断私人,推断私人属性、含义和语义,解决问题,防止对手
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Preprint

点击查看摘要

Abstract:In this work, we address the problem of text anonymization where the goal is to prevent adversaries from correctly inferring private attributes of the author, while keeping the text utility, i.e., meaning and semantics. We propose IncogniText, a technique that anonymizes the text to mislead a potential adversary into predicting a wrong private attribute value. Our empirical evaluation shows a reduction of private attribute leakage by more than 90%. Finally, we demonstrate the maturity of IncogniText for real-world applications by distilling its anonymization capability into a set of LoRA parameters associated with an on-device model.
摘要:在这项工作中,我们解决了文本匿名化的问题,目标是防止对手正确推断作者的私人属性,同时保持文本效用,即意义和语义。我们提出了Incognitext,这是一种匿名文本以误导潜在对手预测错误的私有属性值的技术。我们的经验评估显示,私人属性泄露减少了90%以上。最后,我们通过将其匿名化能力提炼为与设备上模型相关的一组LoRA参数,展示了Incognitext在现实世界应用程序中的成熟度。

[NLP-43] PII-Compass: Guiding LLM training data extraction prompts towards the target PII via grounding
[NLP-43] PII-Compass:通过接地引导LLM训练数据提取提示指向目标PRI

链接: https://arxiv.org/abs/2407.02943
作者: Krishna Kanth Nakka,Ahmed Frikha,Ricardo Mendes,Xue Jiang,Xuebing Zhou
关键词: large models stem, increased size, impactful advances, advances in large, large models
中文关键词: 大型模型茎、尺寸增加、有影响力的进步、大型模型的进步
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted at ACL 2024

点击查看摘要

Abstract:The latest and most impactful advances in large models stem from their increased size. Unfortunately, this translates into an improved memorization capacity, raising data privacy concerns. Specifically, it has been shown that models can output personal identifiable information (PII) contained in their training data. However, reported PIII extraction performance varies widely, and there is no consensus on the optimal methodology to evaluate this risk, resulting in underestimating realistic adversaries. In this work, we empirically demonstrate that it is possible to improve the extractability of PII by over ten-fold by grounding the prefix of the manually constructed extraction prompt with in-domain data. Our approach, PII-Compass, achieves phone number extraction rates of 0.92%, 3.9%, and 6.86% with 1, 128, and 2308 queries, respectively, i.e., the phone number of 1 person in 15 is extractable.
摘要:大型型号最新、最具影响力的进步源于其尺寸的增加。不幸的是,这转化为记忆能力的提高,引发了数据隐私问题。具体来说,已经证明模型可以输出其训练数据中包含的个人可识别信息(PRI)。然而,报告的PIII提取性能差异很大,并且对于评估该风险的最佳方法没有共识,导致低估了现实对手。在这项工作中,我们经验证明,通过将手动构建的提取提示的前置数据与域内数据接地,可以将PRI的提取能力提高十倍以上。我们的方法PII-Compass在1、128和2308个查询时分别实现了0.92%、3.9%和6.86%的电话号码提取率,即每15个人中有1个的电话号码是可提取的。

[NLP-44] Probing the Feasibility of Multilingual Speaker Anonymization
[NLP-44] 探讨多语言说话人语音化的可行性

链接: https://arxiv.org/abs/2407.02937
作者: Sarina Meyer,Florian Lux,Ngoc Thang Vu
关键词: speaker remains hidden, remains hidden, recordings are modified, English data, speaker remains
中文关键词: 扬声器保持隐藏,保持隐藏,录音被修改,英语数据,扬声器保持
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: accepted at Interspeech 2024

点击查看摘要

Abstract:In speaker anonymization, speech recordings are modified in a way that the identity of the speaker remains hidden. While this technology could help to protect the privacy of individuals around the globe, current research restricts this by focusing almost exclusively on English data. In this study, we extend a state-of-the-art anonymization system to nine languages by transforming language-dependent components to their multilingual counterparts. Experiments testing the robustness of the anonymized speech against privacy attacks and speech deterioration show an overall success of this system for all languages. The results suggest that speaker embeddings trained on English data can be applied across languages, and that the anonymization performance for a language is mainly affected by the quality of the speech synthesis component used for it.
摘要:在发言人匿名化中,语音记录被修改,以保持发言人身份的隐藏方式。虽然这项技术可以帮助保护全球各地个人的隐私,但目前的研究几乎只关注英语数据,从而限制了这一点。在这项研究中,我们通过将语言相关组件转换为多语言对应组件,将最先进的匿名系统扩展到九种语言。测试匿名语音对抗隐私攻击和语音恶化的鲁棒性的实验表明,该系统对所有语言都取得了总体成功。结果表明,在英语数据上训练的说话者嵌入可以应用于跨语言,并且语言的匿名化性能主要受到用于该语言的语音合成组件质量的影响。

[NLP-45] GraCoRe: Benchmarking Graph Comprehension and Complex Reasoning in Large Language Models
[NLP-45] GraCoRe:大型语言模型中的图理解和复杂推理基准

链接: https://arxiv.org/abs/2407.02936
作者: Zike Yuan,Ming Liu,Hui Wang,Bing Qin
关键词: Large Language Models, Large Language, abilities of Large, Language Models, graph comprehension
中文关键词: 大型语言模型,大型语言,大型能力,语言模型,图形理解
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Evaluating the graph comprehension and reasoning abilities of Large Language Models (LLMs) is challenging and often incomplete. Existing benchmarks focus primarily on pure graph understanding, lacking a comprehensive evaluation across all graph types and detailed capability definitions. This paper presents GraCoRe, a benchmark for systematically assessing LLMs’ graph comprehension and reasoning. GraCoRe uses a three-tier hierarchical taxonomy to categorize and test models on pure graph and heterogeneous graphs, subdividing capabilities into 10 distinct areas tested through 19 tasks. Our benchmark includes 11 datasets with 5,140 graphs of varying complexity. We evaluated three closed-source and seven open-source LLMs, conducting thorough analyses from both ability and task perspectives. Key findings reveal that semantic enrichment enhances reasoning performance, node ordering impacts task success, and the ability to process longer texts does not necessarily improve graph comprehension or reasoning. GraCoRe is open-sourced at this https URL
摘要:评估大型语言模型的图形理解和推理能力是具有挑战性的,而且往往是不完整的。现有的基准测试主要关注纯粹的图形理解,缺乏对所有图形类型的全面评估和详细的能力定义。本文介绍了一个用于系统评价LLMS图形理解和推理能力的基准测试工具GraCoRe。GraCoRe使用三层分层分类来对纯图形和异类图形上的模型进行分类和测试,将功能细分为10个不同的领域,通过19个任务进行测试。我们的基准包括11个数据集,具有5,140个不同复杂性的图。我们评估了三个封闭源代码和七个开放源代码的LLM,从能力和任务的角度进行了深入的分析。主要研究结果表明,语义丰富提高了推理成绩,节点顺序影响任务成功,处理较长文本的能力并不一定提高图形理解或推理能力。GraCoRe在这个HTTPS URL上是开源的

[NLP-46] owards Negotiative Dialogue for the Talkamatic Dialogue Manager
[NLP-46] owards Talkamatic对话经理的谈判对话

链接: https://arxiv.org/abs/2407.02917
作者: Staffan Larsson,Alexander Berman,David Hjelm
关键词: Talkamatic Dialogue Manager, Dialogue Manager, Talkamatic Dialogue, paper describes, describes a number
中文关键词: Talkamatic对话经理,对话经理,Talkamatic对话,论文描述,描述了一个数字
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The paper describes a number of dialogue phenomena associated with negotiative dialogue, as implemented in a development version of the Talkamatic Dialogue Manager (TDM). This implementation is an initial step towards full coverage of general features of negotiative dialogue in TDM.
摘要:本文描述了许多与谈判对话相关的对话现象,这些对话现象在Talkamatic Dialogue Manager(STM)的开发版本中实现。这一实施是全面覆盖DM谈判对话一般特征的第一步。

[NLP-47] ranslatotron-V(ison): An End-to-End Model for In-Image Machine Translation
[NLP-47] ranslatotron-V(ison):图像内机器翻译的端到端模型

链接: https://arxiv.org/abs/2407.02894
作者: Zhibin Lan,Liqiang Niu,Fandong Meng,Jie Zhou,Min Zhang,Jinsong Su
关键词: In-image machine translation, In-image machine, image, aims to translate, IIMT model
中文关键词: 图像内机器翻译,图像内机器,图像,旨在翻译,IIMG模型
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to ACL 2024 Findings

点击查看摘要

Abstract:In-image machine translation (IIMT) aims to translate an image containing texts in source language into an image containing translations in target language. In this regard, conventional cascaded methods suffer from issues such as error propagation, massive parameters, and difficulties in deployment and retaining visual characteristics of the input image. Thus, constructing end-to-end models has become an option, which, however, faces two main challenges: 1) the huge modeling burden, as it is required to simultaneously learn alignment across languages and preserve the visual characteristics of the input image; 2) the difficulties of directly predicting excessively lengthy pixel sequences. In this paper, we propose \textitTranslatotron-V(ision), an end-to-end IIMT model consisting of four modules. In addition to an image encoder, and an image decoder, our model contains a target text decoder and an image tokenizer. Among them, the target text decoder is used to alleviate the language alignment burden, and the image tokenizer converts long sequences of pixels into shorter sequences of visual tokens, preventing the model from focusing on low-level visual features. Besides, we present a two-stage training framework for our model to assist the model in learning alignment across modalities and languages. Finally, we propose a location-aware evaluation metric called Structure-BLEU to assess the translation quality of the generated images. Experimental results demonstrate that our model achieves competitive performance compared to cascaded models with only 70.9% of parameters, and significantly outperforms the pixel-level end-to-end IIMT model.
摘要:像内机器翻译旨在将包含源语言文本的图像翻译成包含目标语言译文的图像。在这一点上,传统的级联方法存在诸如错误传播、大量参数以及难以部署和保持输入图像的视觉特征等问题。因此,构建端到端模型已经成为一种选择,然而,它面临着两个主要挑战:1)巨大的建模负担,因为需要同时学习跨语言的对齐并保持输入图像的视觉特征;2)直接预测过长的像素序列的困难。本文提出了一种端到端的IIMT模型Translatotron-V(ISION),它由四个模块组成。除了一个图像编码器和一个图像解码器,我们的模型还包含一个目标文本解码器和一个图像标记器。其中,目标文本解码器用于减轻语言对齐负担,图像标记器将较长的像素序列转换为较短的视觉标记序列,避免了模型对低层视觉特征的关注。此外,我们还为我们的模型提出了一个两阶段的培训框架,以帮助该模型学习跨形态和语言的对齐。最后,我们提出了一种基于位置感知的评价指标–结构-BLEU来评价生成图像的翻译质量。实验结果表明,与仅使用70.9个参数的级联模型相比,该模型具有与之相当的性能,并且显著优于像素级的端到端IIMT模型。

[NLP-48] GPTQT: Quantize Large Language Models Twice to Push the Efficiency
[NLP-48] GPTQT:两次量化大型语言模型以提高效率

链接: https://arxiv.org/abs/2407.02891
作者: Yipin Guo,Yilin Lang,Qinyuan Ren
关键词: generative Large Language, Large Language Models, Large Language, require significant computing, large size
中文关键词: 生成式大型语言、大型语言模型、大型语言,需要大量计算、大尺寸
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted by 11th IEEE International Conference on Cybernetics and Intelligent Systems

点击查看摘要

Abstract:Due to their large size, generative Large Language Models (LLMs) require significant computing and storage resources. This paper introduces a new post-training quantization method, GPTQT, to reduce memory usage and enhance processing speed by expressing the weight of LLM in 3bit/2bit. Practice has shown that minimizing the quantization error of weights is ineffective, leading to overfitting. Therefore, GPTQT employs a progressive two-step approach: initially quantizing weights using Linear quantization to a relatively high bit, followed by converting obtained int weight to lower bit binary coding. A re-explore strategy is proposed to optimize initial scaling factor. During inference, these steps are merged into pure binary coding, enabling efficient computation. Testing across various models and datasets confirms GPTQT’s effectiveness. Compared to the strong 3-bit quantization baseline, GPTQT further reduces perplexity by 4.01 on opt-66B and increases speed by 1.24 times on opt-30b. The results on Llama2 show that GPTQT is currently the best binary coding quantization method for such kind of LLMs.
摘要:由于生成式大型语言模型(LLM)的规模较大,因此需要大量的计算和存储资源。本文介绍了一种新的训练后量化方法GPTQT,该方法通过用3bit/2bit来表示LLM的权重,从而减少了内存消耗,提高了处理速度。实践证明,最小化权值的量化误差是无效的,会导致过拟合。因此,GPTQT采用渐进的两步方法:首先使用线性量化将权重量化到相对较高的比特,然后将获得的INT权重转换为较低比特的二进制编码。提出了一种重新探索策略来优化初始比例因子。在推理过程中,这些步骤被合并为纯二进制编码,从而实现高效计算。对各种模型和数据集的测试证实了GPTQT的有效性。与强大的3位量化基线相比,GPTQT在OPT-66B上进一步降低了4.01的困惑,在OPT-30B上提高了1.24倍的速度。在Llama2上的结果表明,GPTQT是目前适用于这类LLMS的最佳二进制编码量化方法。

[NLP-49] CogErgLLM: Exploring Large Language Model Systems Design Perspective Using Cognitive Ergonomics
[NLP-49] CogErgLLM:使用认知人体工程学探索大型语言模型系统设计视角

链接: https://arxiv.org/abs/2407.02885
作者: Azmine Toushik Wasi
关键词: enhancing safety, essential for enhancing, Integrating cognitive ergonomics, LLM, LLM design
中文关键词: 增强安全性,对于增强至关重要,集成认知人体工程学,LLM,LLM设计
类目: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
备注: 8 Page, 3 Figures. Accepted to Large Language Models and Cognition @ ICML 2024 ( this https URL )

点击查看摘要

Abstract:Integrating cognitive ergonomics with LLMs is essential for enhancing safety, reliability, and user satisfaction in human-AI interactions. Current LLM design often lacks this integration, leading to systems that may not fully align with human cognitive capabilities and limitations. Insufficient focus on incorporating cognitive science methods exacerbates biases in LLM outputs, while inconsistent application of user-centered design principles results in sub-optimal user experiences. To address these challenges, our position paper explores the critical integration of cognitive ergonomics principles into LLM design, aiming to provide a comprehensive framework and practical guidelines for ethical LLM development. Through our contributions, we seek to advance understanding and practice in integrating cognitive ergonomics into LLM systems, fostering safer, more reliable, and ethically sound human-AI interactions.
摘要:将认知人体工程学与LLM集成对于提高人机交互的安全性、可靠性和用户满意度至关重要。当前的LLM设计通常缺乏这种集成,导致系统可能不完全符合人类认知能力和限制。对纳入认知科学方法的关注不足加剧了LLM输出的偏见,而以用户为中心的设计原则的不一致应用会导致次优的用户体验。为了应对这些挑战,我们的立场文件探讨了认知人体工程学原则与LLM设计的关键整合,旨在为道德LLM发展提供全面的框架和实用指南。通过我们的贡献,我们寻求促进将认知人体工程学集成到LLM系统中的理解和实践,促进更安全、更可靠且符合道德规范的人机交互。

[NLP-50] CoIR: A Comprehensive Benchmark for Code Information Retrieval Models
[NLP-50] CoIR:代码信息检索模型的综合基准

链接: https://arxiv.org/abs/2407.02883
作者: Xiangyang Li,Kuicai Dong,Yi Quan Lee,Wei Xia,Yichun Yin,Hao Zhang,Yong Liu,Yasheng Wang,Ruiming Tang
关键词: predominantly handle queries, success of Information, textbf, code retrieval, Information Retrieval
中文关键词: 主要处理查询、成功的信息、文本BF、代码检索、信息检索
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Despite the substantial success of Information Retrieval (IR) in various NLP tasks, most IR systems predominantly handle queries and corpora in natural language, neglecting the domain of code retrieval. Code retrieval is critically important yet remains under-explored, with existing methods and benchmarks inadequately representing the diversity of code in various domains and tasks. Addressing this gap, we present \textbf\name (\textbfCode \textbfInformation \textbfRetrieval Benchmark), a robust and comprehensive benchmark specifically designed to assess code retrieval capabilities. \name comprises \textbften meticulously curated code datasets, spanning \textbfeight distinctive retrieval tasks across \textbfseven diverse domains. We first discuss the construction of \name and its diverse dataset composition. Further, we evaluate nine widely used retrieval models using \name, uncovering significant difficulties in performing code retrieval tasks even with state-of-the-art systems. To facilitate easy adoption and integration within existing research workflows, \name has been developed as a user-friendly Python framework, readily installable via pip. It shares same data schema as other popular benchmarks like MTEB and BEIR, enabling seamless cross-benchmark evaluations. Through \name, we aim to invigorate research in the code retrieval domain, providing a versatile benchmarking tool that encourages further development and exploration of code retrieval systems\footnote\url this https URL.
摘要:尽管信息检索在各种自然语言处理任务中取得了很大的成功,但大多数信息检索系统主要处理自然语言的查询和语料库,而忽略了代码检索领域。代码检索是至关重要的,但仍未得到充分探索,现有的方法和基准不足以代表不同领域和任务中代码的多样性。为了解决这一差距,我们提出了\extbf\name(\textbfCode\extbf Information\extbf Retrival Benchmark),这是一个专门为评估代码检索能力而设计的健壮而全面的基准测试。名称由经过精心管理的代码数据集组成,跨越七个不同的域执行不同的检索任务。我们首先讨论\NAME的构造及其不同的数据集组成。此外,我们使用\NAME评估了九个广泛使用的检索模型,发现即使使用最先进的系统也很难执行代码检索任务。为了便于在现有研究工作流中轻松采用和集成,\NAME已开发为用户友好的Python框架,可通过pip轻松安装。它与其他流行的基准测试(如MTEB和BEIR)共享相同的数据模式,支持无缝的交叉基准评估。通过\NAME,我们的目标是促进代码检索领域的研究,提供一个通用的基准测试工具,鼓励进一步开发和探索代码检索系统\Footnote\url此HTTPS URL。

[NLP-51] Contrast then Memorize: Semantic Neighbor Retrieval-Enhanced Inductive Multimodal Knowledge Graph Completion
[NLP-51] 对比然后简化:语义邻居检索-增强的归纳多模式知识图完成

链接: https://arxiv.org/abs/2407.02867
作者: Yu Zhao,Ying Zhang,Baohang Zhou,Xinying Qian,Kehui Song,Xiangrui Cai
关键词: Knowledge Graph Completion, Multimodal Knowledge Graph, Graph Completion, emerged for Multimodal, inductive MKGC
中文关键词: 知识图完成、多模式知识图、图完成,为多模式、归纳MKGC而出现
类目: Multimedia (cs.MM); Computation and Language (cs.CL)
备注: Accepted by SIGIR 2024

点击查看摘要

Abstract:A large number of studies have emerged for Multimodal Knowledge Graph Completion (MKGC) to predict the missing links in MKGs. However, fewer studies have been proposed to study the inductive MKGC (IMKGC) involving emerging entities unseen during training. Existing inductive approaches focus on learning textual entity representations, which neglect rich semantic information in visual modality. Moreover, they focus on aggregating structural neighbors from existing KGs, which of emerging entities are usually limited. However, the semantic neighbors are decoupled from the topology linkage and usually imply the true target entity. In this paper, we propose the IMKGC task and a semantic neighbor retrieval-enhanced IMKGC framework CMR, where the contrast brings the helpful semantic neighbors close, and then the memorize supports semantic neighbor retrieval to enhance inference. Specifically, we first propose a unified cross-modal contrastive learning to simultaneously capture the textual-visual and textual-textual correlations of query-entity pairs in a unified representation space. The contrastive learning increases the similarity of positive query-entity pairs, therefore making the representations of helpful semantic neighbors close. Then, we explicitly memorize the knowledge representations to support the semantic neighbor retrieval. At test time, we retrieve the nearest semantic neighbors and interpolate them to the query-entity similarity distribution to augment the final prediction. Extensive experiments validate the effectiveness of CMR on three inductive MKGC datasets. Codes are available at this https URL.
摘要:多通道知识图补全(MKGC)用于预测多通道知识图中的缺失环节,已有大量研究涌现。然而,很少有人提出研究诱导性MKGC(IMKGC),涉及在训练中看不到的新兴实体。现有的归纳方法侧重于文本实体表征的学习,忽略了视觉通道中丰富的语义信息。此外,它们专注于从现有KG聚合结构邻居,而新兴实体的结构邻居通常是有限的。然而,语义邻居与拓扑链是解耦的,并且通常隐含着真实的目标实体。在本文中,我们提出了IMKGC任务和一个增强语义邻居检索的IMKGC框架CMR,其中对比使有用的语义邻居更接近,然后记忆支持语义邻居检索以增强推理。具体地说,我们首先提出了一种统一的跨通道对比学习,以同时捕获查询实体对在统一表示空间中的文本-视觉和文本-文本相关性。对比学习提高了正向查询-实体对的相似度,从而使有帮助的语义邻居的表示更接近。然后,显式记忆知识表示,以支持语义邻居检索。在测试时,我们检索最近的语义邻居,并将它们插入到查询实体相似度分布中,以增强最终的预测。在三个归纳MKGC数据集上的大量实验验证了CMR的有效性。代码可在此HTTPS URL中找到。

[NLP-52] Safe Unlearning: A Surprisingly Effective and Generalizable Solution to Defend Against Jailbreak Attacks
[NLP-52] 安全忘记学习:防御越狱攻击的一种令人惊讶的有效且可推广的解决方案

链接: https://arxiv.org/abs/2407.02855
作者: Zhexin Zhang,Junxiao Yang,Pei Ke,Shiyao Cui,Chujie Zheng,Hongning Wang,Minlie Huang
关键词: jailbreak attacks, Attack Success Rate, jailbreak, harmful, harmful questions
中文关键词: 越狱攻击、攻击成功率、越狱、有害、有害的问题
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 15 pages

点击查看摘要

Abstract:LLMs are known to be vulnerable to jailbreak attacks, even after safety alignment. An important observation is that, while different types of jailbreak attacks can generate significantly different queries, they mostly result in similar responses that are rooted in the same harmful knowledge (e.g., detailed steps to make a bomb). Therefore, we conjecture that directly unlearn the harmful knowledge in the LLM can be a more effective way to defend against jailbreak attacks than the mainstream supervised fine-tuning (SFT) based approaches. Our extensive experiments confirmed our insight and suggested surprising generalizability of our unlearning-based approach: using only 20 raw harmful questions \emphwithout any jailbreak prompt during training, our solution reduced the Attack Success Rate (ASR) in Vicuna-7B on \emphout-of-distribution (OOD) harmful questions wrapped with various complex jailbreak prompts from 82.6% to 7.7%. This significantly outperforms Llama2-7B-Chat, which is fine-tuned on about 0.1M safety alignment samples but still has an ASR of 21.9% even under the help of an additional safety system prompt. Further analysis reveals that the generalization ability of our solution stems from the intrinsic relatedness among harmful responses across harmful questions (e.g., response patterns, shared steps and actions, and similarity among their learned representations in the LLM). Our code is available at \urlthis https URL.
摘要:众所周知,即使在安全对准之后,LLM也容易受到越狱攻击。一个重要的观察是,虽然不同类型的越狱攻击可以产生显著不同的查询,但它们大多导致植根于相同有害知识(例如,制造炸弹的详细步骤)的相似响应。因此,我们推测,与主流的基于监督微调(SFT)的方法相比,直接忘记LLM中的有害知识可能是一种更有效的防御越狱攻击的方法。我们广泛的实验证实了我们的洞察力,并建议我们基于遗忘的方法具有惊人的概括性:仅使用20个原始有害问题,在训练期间没有任何越狱提示的情况下,我们的解决方案将维库纳-7B对包裹着各种复杂越狱提示的有害问题的攻击成功率(ASR)从82.6%降低到7.7%。这大大超过了Llama2-7B-Chat,后者在大约0.1万个安全对准样本上进行了微调,但即使在额外的安全系统提示的帮助下,ASR仍为21.9%。进一步的分析表明,我们的解决方案的泛化能力源于跨有害问题的有害响应之间的内在关联(例如,响应模式、共享的步骤和动作,以及它们在LLM中学习的表示之间的相似性)。我们的代码位于此HTTPS URL。

[NLP-53] Universal Gloss-level Representation for Gloss-free Sign Language Translation and Production
[NLP-53] 无光泽手语翻译和制作的通用光泽级表示

链接: https://arxiv.org/abs/2407.02854
作者: Eui Jun Hwang,Sukmin Cho,Huije Lee,Youngwoo Yoon,Jong C. Park
关键词: Sign language, presents unique challenges, spoken language words, sign language motion, mapping sign language
中文关键词: 手语,提出独特的挑战,口语单词,手语动作,映射手语
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 5 figures

点击查看摘要

Abstract:Sign language, essential for the deaf and hard-of-hearing, presents unique challenges in translation and production due to its multimodal nature and the inherent ambiguity in mapping sign language motion to spoken language words. Previous methods often rely on gloss annotations, requiring time-intensive labor and specialized expertise in sign language. Gloss-free methods have emerged to address these limitations, but they often depend on external sign language data or dictionaries, failing to completely eliminate the need for gloss annotations. There is a clear demand for a comprehensive approach that can supplant gloss annotations and be utilized for both Sign Language Translation (SLT) and Sign Language Production (SLP). We introduce Universal Gloss-level Representation (UniGloR), a unified and self-supervised solution for both SLT and SLP, trained on multiple datasets including PHOENIX14T, How2Sign, and NIASL2021. Our results demonstrate UniGloR’s effectiveness in the translation and production tasks. We further report an encouraging result for the Sign Language Recognition (SLR) on previously unseen data. Our study suggests that self-supervised learning can be made in a unified manner, paving the way for innovative and practical applications in future research.
摘要:手语是聋人和听力障碍者的必备语言,由于其多通道的特性以及在将手语动作映射到口语词汇方面固有的模糊性,给翻译和制作带来了独特的挑战。以前的方法通常依赖于注解,需要耗时的劳动和专门的手语专业知识。无光泽的方法已经出现,以解决这些限制,但它们通常依赖于外部手语数据或词典,未能完全消除对光泽注释的需求。手语翻译(SLT)和手语制作(SLP)都需要一种可以取代注解并可用于手语翻译(SLT)和手语制作(SLP)的综合方法。我们介绍了通用光泽度表示(UniGloR),这是一个针对SLT和SLP的统一和自我监督的解决方案,在包括PHOENIX14T、How2Sign和NIASL2021在内的多个数据集上进行了训练。我们的结果证明了UniGloR在翻译和制作任务中的有效性。我们进一步报告了基于以前未见数据的手语识别(SLR)的令人鼓舞的结果。我们的研究表明,自我监督学习可以在统一的方式下进行,为未来研究的创新和实际应用铺平了道路。

[NLP-54] MindBench: A Comprehensive Benchmark for Mind Map Structure Recognition and Analysis
[NLP-54] MindBench:思维导图结构识别和分析的综合基准

链接: https://arxiv.org/abs/2407.02842
作者: Lei Chen,Feng Yan,Yujie Zhong,Shaoxiang Chen,Zequn Jie,Lin Ma
关键词: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, Language Models
中文关键词: 多模式大型语言,大型语言模型,多模式大型,大型语言,语言模型
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: technical report

点击查看摘要

Abstract:Multimodal Large Language Models (MLLM) have made significant progress in the field of document analysis. Despite this, existing benchmarks typically focus only on extracting text and simple layout information, neglecting the complex interactions between elements in structured documents such as mind maps and flowcharts. To address this issue, we introduce the new benchmark named MindBench, which not only includes meticulously constructed bilingual authentic or synthetic images, detailed annotations, evaluation metrics and baseline models, but also specifically designs five types of structured understanding and parsing tasks. These tasks include full parsing, partial parsing, position-related parsing, structured Visual Question Answering (VQA), and position-related VQA, covering key areas such as text recognition, spatial awareness, relationship discernment, and structured parsing. Extensive experimental results demonstrate the substantial potential and significant room for improvement in current models’ ability to handle structured document information. We anticipate that the launch of MindBench will significantly advance research and application development in structured document analysis technology. MindBench is available at: this https URL.
摘要:多通道大语言模型(MLLM)在文档分析领域取得了重大进展。尽管如此,现有的基准测试通常只专注于提取文本和简单的布局信息,而忽略了思维导图和流程图等结构化文档中元素之间的复杂交互。为了解决这个问题,我们引入了新的基准测试MindBtch,它不仅包括精心构建的双语真实或合成图像、详细的注释、评估指标和基线模型,而且还专门设计了五种类型的结构化理解和句法分析任务。这些任务包括完全句法分析、部分句法分析、位置相关句法分析、结构化视觉问答(VQA)和位置相关VQA,涵盖文本识别、空间感知、关系识别和结构化句法分析等关键领域。广泛的实验结果表明,当前模型处理结构化文档信息的能力有很大的潜力和巨大的改进空间。我们预计,MindBitch的推出将极大地推动结构化文档分析技术的研究和应用开发。可通过以下网址获得MindBtch:这个HTTPS URL。

[NLP-55] Comparing Feature-based and Context-aware Approaches to PII Generalization Level Prediction
[NLP-55] 比较基于环境和上下文感知方法与PIP概括级别预测

链接: https://arxiv.org/abs/2407.02837
作者: Kailin Zhang,Xinying Qiu
关键词: Protecting Personal Identifiable, Personal Identifiable Information, Protecting Personal, uneven data distributions, Personal Identifiable
中文关键词: 保护个人可识别信息、保护个人、不均匀数据分布、个人可识别信息
类目: Computation and Language (cs.CL)
备注: Accepted to IALP 2024

点击查看摘要

Abstract:Protecting Personal Identifiable Information (PII) in text data is crucial for privacy, but current PII generalization methods face challenges such as uneven data distributions and limited context awareness. To address these issues, we propose two approaches: a feature-based method using machine learning to improve performance on structured inputs, and a novel context-aware framework that considers the broader context and semantic relationships between the original text and generalized candidates. The context-aware approach employs Multilingual-BERT for text representation, functional transformations, and mean squared error scoring to evaluate candidates. Experiments on the WikiReplace dataset demonstrate the effectiveness of both methods, with the context-aware approach outperforming the feature-based one across different scales. This work contributes to advancing PII generalization techniques by highlighting the importance of feature selection, ensemble learning, and incorporating contextual information for better privacy protection in text anonymization.
摘要:保护文本数据中的个人身份信息(PII)对于隐私是至关重要的,但现有的PII泛化方法面临着数据分布不均匀和上下文感知有限等挑战。为了解决这些问题,我们提出了两种方法:一种是基于特征的方法,使用机器学习来提高结构化输入的性能;另一种是新的上下文感知框架,它考虑了原始文本和广义候选之间更广泛的上下文和语义关系。上下文感知方法使用多语言BERT进行文本表示、函数转换和均方误差评分来评估候选对象。在WikiReplace数据集上的实验证明了这两种方法的有效性,其中上下文感知方法在不同尺度上的表现优于基于特征的方法。这项工作通过强调特征选择、集成学习和在文本匿名化中纳入上下文信息以更好地保护隐私的重要性,促进了PII泛化技术的进步。

[NLP-56] Aspect-Based Sentiment Analysis Techniques: A Comparative Study
[NLP-56] 基于天线的情绪分析技术:比较研究

链接: https://arxiv.org/abs/2407.02834
作者: Dineth Jayakody,Koshila Isuranda,A V A Malkith,Nisansa de Silva,Sachintha Rajith Ponnamperuma,G G N Sandamali,K L K Sudheera
关键词: unequivocally major sources, digitalisation era, insights for businesses, feedback and online, unequivocally major
中文关键词: 明确的主要来源、数字化时代、企业见解、反馈和在线,明确的主要
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Since the dawn of the digitalisation era, customer feedback and online reviews are unequivocally major sources of insights for businesses. Consequently, conducting comparative analyses of such sources has become the de facto modus operandi of any business that wishes to give itself a competitive edge over its peers and improve customer loyalty. Sentiment analysis is one such method instrumental in gauging public interest, exposing market trends, and analysing competitors. While traditional sentiment analysis focuses on overall sentiment, as the needs advance with time, it has become important to explore public opinions and sentiments on various specific subjects, products and services mentioned in the reviews on a finer-granular level. To this end, Aspect-based Sentiment Analysis (ABSA), supported by advances in Artificial Intelligence (AI) techniques which have contributed to a paradigm shift from simple word-level analysis to tone and context-aware analyses, focuses on identifying specific aspects within the text and determining the sentiment associated with each aspect. In this study, we compare several deep-NN methods for ABSA on two benchmark datasets (Restaurant14 and Laptop-14) and found that FAST LSA obtains the best overall results of 87.6% and 82.6% accuracy but does not pass LSA+DeBERTa which reports 90.33% and 86.21% accuracy respectively.
摘要:自数字化时代开始以来,客户反馈和在线评论无疑是企业洞察情况的主要来源。因此,对这些来源进行比较分析已成为任何希望获得相对于同行的竞争优势并提高客户忠诚度的企业事实上的运作方式。情绪分析就是这样一种方法,有助于衡量公共利益、揭示市场趋势和分析竞争对手。虽然传统的情绪分析侧重于整体情绪,但随着需求与时俱进,在更细微的层面上探索评论中提到的各种特定主题、产品和服务的公众舆论和情绪变得重要。为此,在人工智能(AI)技术的进步的支持下,基于方面的情感分析(ABSA)专注于识别文本中的特定方面,并确定与每个方面相关联的情感。人工智能技术的进步促进了从简单的词级分析到语气和上下文感知分析的范式转变。在本研究中,我们在两个基准数据集(Restaurant14和Laptop-14)上对几种深度神经网络方法进行了比较,发现FAST LSA获得了最好的总体结果,准确率分别为87.6%和82.6%,但没有通过LSA+DeBERTa,后者的准确率分别为90.33%和86.21%。

[NLP-57] LANE: Logic Alignment of Non-tuning Large Language Models and Online Recommendation Systems for Explainable Reason Generation
[NLP-57] LANE:用于可解释原因生成的非调优大型语言模型和在线推荐系统的逻辑对齐

链接: https://arxiv.org/abs/2407.02833
作者: Hongke Zhao,Songming Zheng,Likang Wu,Bowen Yu,Jing Wang
关键词: enhancing user trust, trust and satisfaction, crucial for enhancing, recommendation, LLM models
中文关键词: 增强用户信任、信任和满意度,对于增强、推荐、LLM模型至关重要
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The explainability of recommendation systems is crucial for enhancing user trust and satisfaction. Leveraging large language models (LLMs) offers new opportunities for comprehensive recommendation logic generation. However, in existing related studies, fine-tuning LLM models for recommendation tasks incurs high computational costs and alignment issues with existing systems, limiting the application potential of proven proprietary/closed-source LLM models, such as GPT-4. In this work, our proposed effective strategy LANE aligns LLMs with online recommendation systems without additional LLMs tuning, reducing costs and improving explainability. This innovative approach addresses key challenges in integrating language models with recommendation systems while fully utilizing the capabilities of powerful proprietary models. Specifically, our strategy operates through several key components: semantic embedding, user multi-preference extraction using zero-shot prompting, semantic alignment, and explainable recommendation generation using Chain of Thought (CoT) prompting. By embedding item titles instead of IDs and utilizing multi-head attention mechanisms, our approach aligns the semantic features of user preferences with those of candidate items, ensuring coherent and user-aligned recommendations. Sufficient experimental results including performance comparison, questionnaire voting, and visualization cases prove that our method can not only ensure recommendation performance, but also provide easy-to-understand and reasonable recommendation logic.
摘要:推荐系统的可解释性是提高用户信任度和满意度的关键。利用大型语言模型(LLM)为全面的推荐逻辑生成提供了新的机会。然而,在现有的相关研究中,针对推荐任务微调LLM模型会招致高昂的计算成本和与现有系统的匹配问题,限制了经过验证的专有/闭源LLM模型的应用潜力,如GPT-4。在这项工作中,我们提出的有效策略Lane在不需要额外调整LLMS的情况下将LLMS与在线推荐系统相结合,从而降低了成本并提高了可解释性。这种创新的方法解决了在充分利用强大的专有模型的能力的同时将语言模型与推荐系统集成的关键挑战。具体地说,我们的策略通过几个关键组件运行:语义嵌入,使用零镜头提示的用户多偏好提取,语义对齐,以及使用思想链(COT)提示生成可解释的推荐。通过嵌入条目标题而不是ID,并利用多头注意机制,我们的方法将用户偏好的语义特征与候选条目的语义特征对齐,从而确保连贯和用户对齐的推荐。通过性能比较、问卷投票、可视化案例等实验证明,该方法不仅能够保证推荐性能,而且能够提供易于理解和合理的推荐逻辑。

[NLP-58] Investigating the Contextualised Word Embedding Dimensions Responsible for Contextual and Temporal Semantic Changes
[NLP-58] 调查负责上下文和时间语义变化的上下文词嵌入维度

链接: https://arxiv.org/abs/2407.02820
作者: Taichi Aida,Danushka Bollegala
关键词: Principal Component Analysis, Independent Component Analysis, Component Analysis, temporal semantic change, semantic changes
中文关键词: 主成分分析、独立成分分析、成分分析、时间语义变化、语义变化
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Words change their meaning over time as well as in different contexts. The sense-aware contextualised word embeddings (SCWEs) such as the ones produced by XL-LEXEME by fine-tuning masked langauge models (MLMs) on Word-in-Context (WiC) data attempt to encode such semantic changes of words within the contextualised word embedding (CWE) spaces. Despite the superior performance of SCWEs in contextual/temporal semantic change detection (SCD) benchmarks, it remains unclear as to how the meaning changes are encoded in the embedding space. To study this, we compare pre-trained CWEs and their fine-tuned versions on contextual and temporal semantic change benchmarks under Principal Component Analysis (PCA) and Independent Component Analysis (ICA) transformations. Our experimental results reveal several novel insights such as (a) although there exist a smaller number of axes that are responsible for semantic changes of words in the pre-trained CWE space, this information gets distributed across all dimensions when fine-tuned, and (b) in contrast to prior work studying the geometry of CWEs, we find that PCA to better represent semantic changes than ICA. Source code is available at this https URL .
摘要:词语的意义随着时间的推移而变化,在不同的语境中也是如此。感知上下文单词嵌入(SCWE),例如由XL-Lememe通过微调上下文中单词(WIC)数据上的掩蔽语言模型(MLM)产生的那些,试图在上下文单词嵌入(CWE)空间内编码单词的这种语义变化。尽管SCWE在语境/时间语义变化检测(SCD)基准测试中表现优异,但意义变化是如何在嵌入空间中编码的仍不清楚。为了研究这一点,我们在主成分分析(PCA)和独立成分分析(ICA)变换下,比较了预先训练的CWE和它们的微调版本在上下文和时间语义变化基准上的差异。我们的实验结果揭示了一些新的见解,例如:(A)虽然在预先训练的CWE空间中,负责单词语义变化的轴数量较少,但当微调时,这些信息可以分布在所有维度上;(B)与以前研究CWE几何结构的工作相比,我们发现PCA比ICA更能更好地表示语义变化。源代码可在此HTTPS URL上找到。

[NLP-59] Efficient Training of Language Models with Compact and Consistent Next Token Distributions
[NLP-59] 通过紧凑且一致的下一个代币分布有效训练语言模型

链接: https://arxiv.org/abs/2407.02819
作者: Ashutosh Sathe,Sunita Sarawagi
关键词: statistically sound objective, Maximizing the likelihood, statistically sound, sound objective, gram
中文关键词: 统计上合理的目标,最大化可能性,统计上合理的,合理的目标,克
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: ACL 2024

点击查看摘要

Abstract:Maximizing the likelihood of the next token is an established, statistically sound objective for pre-training language models. In this paper we show that we can train better models faster by pre-aggregating the corpus with a collapsed n -gram distribution. Previous studies have proposed corpus-level n -gram statistics as a regularizer; however, the construction and querying of such n -grams, if done naively, prove to be costly and significantly impede training speed, thereby limiting their application in modern large language model pre-training. We introduce an alternative compact representation of the next token distribution that, in expectation, aligns with the complete n -gram distribution while markedly reducing variance across mini-batches compared to the standard next-token loss. Empirically, we demonstrate that both the n -gram regularized model and our approximation yield substantial improvements in model quality and convergence rate compared to existing methods. Furthermore, our approximation facilitates scalability of gains to larger datasets and models compared to the straightforward n -gram regularization method. Comments: ACL 2024 Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG) Cite as: arXiv:2407.02819 [cs.CL] (or arXiv:2407.02819v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2407.02819 Focus to learn more arXiv-issued DOI via DataCite
摘要:对于预训练语言模型来说,最大化下一个令牌的可能性是一个既定的、在统计上合理的目标。在本文中,我们证明了通过预先聚合具有折叠n元语法分布的语料库,我们可以更快地训练更好的模型。以前的研究已经提出了语料库级别的n元语法统计作为正则化方法;然而,这种n元语法的构建和查询如果做得很幼稚,被证明是昂贵的,并且显著阻碍了训练速度,从而限制了它们在现代大型语言模型预训练中的应用。我们引入了下一个令牌分布的另一种紧凑表示,在预期中,该表示与完整的n元语法分布一致,同时与标准的下一个令牌丢失相比,显著减少了小批次之间的方差。实验证明,与现有方法相比,n元语法正则化模型和我们的近似方法在模型质量和收敛速度方面都有显著的改善。此外,与直接的n元语法正则化方法相比,我们的近似便于将增益扩展到更大的数据集和模型。评论:ACL2024主题:计算和语言(cs.CL);机器学习(cs.lg)引用为:arxiv:2407.02819cs.CLhttps://doi.org/10.48550/arXiv.2407.02819 Focus通过DataCite了解更多arxiv发布的目标文件

[NLP-60] Images Speak Louder than Words: Understanding and Mitigating Bias in Vision-Language Model from a Causal Mediation Perspective
[NLP-60] 图像比言语更响亮:从因果调解的角度理解和减轻视觉语言模型中的偏见

链接: https://arxiv.org/abs/2407.02814
作者: Zhaotian Weng,Zijun Gao,Jerone Andrews,Jieyu Zhao
关键词: inadvertently learn biases, Vision-language models, correlating gender information, pre-trained on extensive, objects or scenarios
中文关键词: 无意中学习偏见、视觉语言模型、关联性别信息、对广泛的对象或场景进行预训练
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision-language models (VLMs) pre-trained on extensive datasets can inadvertently learn biases by correlating gender information with specific objects or scenarios. Current methods, which focus on modifying inputs and monitoring changes in the model’s output probability scores, often struggle to comprehensively understand bias from the perspective of model components. We propose a framework that incorporates causal mediation analysis to measure and map the pathways of bias generation and propagation within VLMs. This approach allows us to identify the direct effects of interventions on model bias and the indirect effects of interventions on bias mediated through different model components. Our results show that image features are the primary contributors to bias, with significantly higher impacts than text features, specifically accounting for 32.57% and 12.63% of the bias in the MSCOCO and PASCAL-SENTENCE datasets, respectively. Notably, the image encoder’s contribution surpasses that of the text encoder and the deep fusion encoder. Further experimentation confirms that contributions from both language and vision modalities are aligned and non-conflicting. Consequently, focusing on blurring gender representations within the image encoder, which contributes most to the model bias, reduces bias efficiently by 22.03% and 9.04% in the MSCOCO and PASCAL-SENTENCE datasets, respectively, with minimal performance loss or increased computational demands.
摘要:在大量数据集上预先训练的视觉语言模型会通过将性别信息与特定对象或场景相关联而无意中学习到偏见。目前的方法侧重于修改输入和监控模型输出概率分数的变化,往往难以从模型组件的角度全面理解偏差。我们提出了一个包含因果中介分析的框架,以测量和绘制VLM中偏差产生和传播的路径。这种方法使我们能够确定干预措施对模型偏差的直接影响,以及干预措施通过不同的模型成分对偏差的间接影响。结果表明,图像特征是影响偏误的主要因素,图像特征的影响显著高于文本特征,在MSCOCO和PASCAL句子数据集中,图像特征对偏误的贡献率分别为32.57%和12.63%。值得注意的是,图像编码器的贡献超过了文本编码器和深度融合编码器。进一步的实验证实,语言和视觉通道的贡献是一致的和不冲突的。因此,在MSCOCO和PASCAL语句数据集中,聚焦于对模型偏差贡献最大的图像编码器中的性别模糊表示,分别有效地减少了22.03%和9.04%的偏差,而性能损失或增加的计算量最小。

[NLP-61] 52B to 1T: Lessons Learned via Tele-FLM Series
[NLP-61] 52 B至1 T:通过Tele-FLM系列中学到的教训

链接: https://arxiv.org/abs/2407.02783
作者: Xiang Li,Yiqun Yao,Xin Jiang,Xuezhi Fang,Chao Wang,Xinzhang Liu,Zihan Wang,Yu Zhao,Xin Wang,Yuyao Huang,Shuangyong Song,Yongxiang Li,Zheng Zhang,Bo Zhao,Aixin Sun,Yequan Wang,Zhongjiang He,Zhongyuan Wang,Xuelong Li,Tiejun Huang
关键词: Artificial General Intelligence, Large Language Models, Large Language, General Intelligence, Artificial General
中文关键词: 人工通用智能,大型语言模型,大型语言,通用智能,人工通用
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: For the Tele-FLM-52B tech report, see also 2404.16645

点击查看摘要

Abstract:Large Language Models (LLMs) represent a significant stride toward Artificial General Intelligence. As scaling laws underscore the potential of increasing model sizes, the academic community has intensified its investigations into LLMs with capacities exceeding 50 billion parameters. This technical report builds on our prior work with Tele-FLM (also known as FLM-2), a publicly available 52-billion-parameter model. We delve into two primary areas: we first discuss our observation of Supervised Fine-tuning (SFT) on Tele-FLM-52B, which supports the “less is more” approach for SFT data construction; second, we demonstrate our experiments and analyses on the best practices for progressively growing a model from 52 billion to 102 billion, and subsequently to 1 trillion parameters. We will open-source a 1T model checkpoint, namely Tele-FLM-1T, to advance further training and research.
摘要:大型语言模型(LLM)代表着向人工通用智能迈出的重大一步。由于缩放定律强调了增加模型大小的潜力,学术界加强了对容量超过500亿个参数的LLM的研究。本技术报告基于我们之前与Tele-FLM(也称为FLM-2)的工作,Tele-FLM是一个公开可用的520亿参数模型。我们深入研究了两个主要领域:首先讨论我们对Tele-FLM-52 B上的监督微调(SFT)的观察,该方法支持SFT数据构建的“少即是多”的方法;其次,我们展示了我们对最佳实践的实验和分析,逐步将模型从520亿增长到1020亿,然后增加到1万亿个参数。我们将开源1 T型号检查站,即Tele-FLM-1 T,以推进进一步的培训和研究。

[NLP-62] A Framework for Quantum Finite-State Languages with Density Mapping
[NLP-62] 具有密度映射的量子双状态语言框架

链接: https://arxiv.org/abs/2407.02776
作者: SeungYeop Baik,Sicheol Sung,Yo-Sub Han
关键词: sequential input strings, quantum finite-state automaton, theoretical model designed, quantum, finite-state automaton
中文关键词: 顺序输入串,量子有限状态自动机,设计的理论模型,量子,有限状态自动机
类目: Computation and Language (cs.CL); Quantum Physics (quant-ph)
备注: 14 pages, 5 figures

点击查看摘要

Abstract:A quantum finite-state automaton (QFA) is a theoretical model designed to simulate the evolution of a quantum system with finite memory in response to sequential input strings. We define the language of a QFA as the set of strings that lead the QFA to an accepting state when processed from its initial state. QFAs exemplify how quantum computing can achieve greater efficiency compared to classical computing. While being one of the simplest quantum models, QFAs are still notably challenging to construct from scratch due to the preliminary knowledge of quantum mechanics required for superimposing unitary constraints on the automata. Furthermore, even when QFAs are correctly assembled, the limitations of a current quantum computer may cause fluctuations in the simulation results depending on how an assembled QFA is translated into a quantum circuit. We present a framework that provides a simple and intuitive way to build QFAs and maximize the simulation accuracy. Our framework relies on two methods: First, it offers a predefined construction for foundational types of QFAs that recognize special languages MOD and EQU. They play a role of basic building blocks for more complex QFAs. In other words, one can obtain more complex QFAs from these foundational automata using standard language operations. Second, we improve the simulation accuracy by converting these QFAs into quantum circuits such that the resulting circuits perform well on noisy quantum computers. Our framework is available at this https URL. Comments: 14 pages, 5 figures Subjects: Computation and Language (cs.CL); Quantum Physics (quant-ph) Cite as: arXiv:2407.02776 [cs.CL] (or arXiv:2407.02776v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2407.02776 Focus to learn more arXiv-issued DOI via DataCite
摘要:量子有限态自动机(QFA)是一种理论模型,用于模拟具有有限记忆的量子系统对序列输入串的响应。我们将QFA的语言定义为一组字符串,当QFA从其初始状态进行处理时,这些字符串会引导QFA进入接受状态。QFA举例说明了与经典计算相比,量子计算如何实现更高的效率。虽然QFA是最简单的量子模型之一,但由于在自动机上叠加么正约束所需的量子力学初步知识,从零开始构建QFA仍然具有显著的挑战性。此外,即使QFA被正确组装,当前量子计算机的限制也可能导致模拟结果的波动,这取决于组装的QFA是如何转化为量子电路的。我们提出了一个框架,它提供了一种简单直观的方法来构建QFA并最大化模拟精度。我们的框架依赖于两种方法:首先,它为识别特殊语言MOD和EQU的基本类型的QFA提供了预定义的构造。它们扮演着更复杂的QFA的基本构建块的角色。换句话说,人们可以使用标准的语言操作从这些基本自动机中获得更复杂的QFA。其次,我们通过将这些QFA转换成量子电路来提高模拟精度,从而使得到的电路在有噪声的量子计算机上运行良好。我们的框架可以在这个https URL上找到。评论:14页,5位数字主题:计算和语言(cs.CL);量子物理(quant-ph)引用为:arxiv:2407.02776cs.CLhttps://doi.org/10.48550/arXiv.2407.02776 Focus通过DataCite了解更多arxiv发布的目录

[NLP-63] MLKD-BERT: Multi-level Knowledge Distillation for Pre-trained Language Models
[NLP-63] MLKD-BERT:预训练语言模型的多层知识提炼

链接: https://arxiv.org/abs/2407.02775
作者: Ying Zhang,Ziheng Yang,Shufan Ji
关键词: language model compression, pre-trained language model, Knowledge distillation, knowledge distillation methods, effective technique
中文关键词: 语言模型压缩、预训练语言模型、知识提炼、知识提炼方法、有效技术
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Knowledge distillation is an effective technique for pre-trained language model compression. Although existing knowledge distillation methods perform well for the most typical model BERT, they could be further improved in two aspects: the relation-level knowledge could be further explored to improve model performance; and the setting of student attention head number could be more flexible to decrease inference time. Therefore, we are motivated to propose a novel knowledge distillation method MLKD-BERT to distill multi-level knowledge in teacher-student framework. Extensive experiments on GLUE benchmark and extractive question answering tasks demonstrate that our method outperforms state-of-the-art knowledge distillation methods on BERT. In addition, MLKD-BERT can flexibly set student attention head number, allowing for substantial inference time decrease with little performance drop.
摘要:知识提炼是预训练语言模型压缩的有效技术。尽管现有的知识提炼方法对于最典型的模型BERT表现良好,但可以在两个方面进一步改进:可以进一步探索关系层面的知识以提高模型性能;学生注意力头数的设置可以更加灵活,以减少推理时间。因此,我们有动力提出一种新颖的知识提炼方法MLKD-BERT,以提炼师生框架中的多层次知识。对GLUE基准测试和提取式问答任务的大量实验表明,我们的方法优于BERT上最先进的知识提炼方法。此外,MLKD-BERT可以灵活设置学生注意力头数,从而大幅减少推理时间,同时性能下降很小。

[NLP-64] Automatic gradient descent with generalized Newtons method
[NLP-64] 用广义牛顿法自动梯度下降

链接: https://arxiv.org/abs/2407.02772
作者: Zhiqi Bu,Shiyun Xu
关键词: generalized Newton method, SGD and Adam, generalized Newton, Hessian-informed approach, Newton method
中文关键词: 广义牛顿方法,SGD和Adam,广义牛顿,黑森知情方法,牛顿方法
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We propose the generalized Newton’s method (GeN) – a Hessian-informed approach that applies to any optimizer such as SGD and Adam, and covers the Newton-Raphson method as a sub-case. Our method automatically and dynamically selects the learning rate that accelerates the convergence, without the intensive tuning of the learning rate scheduler. In practice, out method is easily implementable, since it only requires additional forward passes with almost zero computational overhead (in terms of training time and memory cost), if the overhead is amortized over many iterations. We present extensive experiments on language and vision tasks (e.g. GPT and ResNet) to showcase that GeN optimizers match the state-of-the-art performance, which was achieved with carefully tuned learning rate schedulers. Code to be released at \urlthis https URL.
摘要:我们提出了广义牛顿方法(GeN)–一种受黑森启发的方法,适用于任何优化器,例如BCD和Adam,并将牛顿-拉夫森方法作为子案例涵盖。我们的方法自动动态地选择加速收敛的学习率,而无需对学习率调度器进行密集调整。在实践中,out方法很容易实现,因为它只需要额外的前向传递,而计算费用(在训练时间和内存成本方面)几乎为零,前提是该费用在多次迭代中摊销。我们对语言和视觉任务(例如GPT和ResNet)进行了广泛的实验,以展示GeN优化器与最先进的性能相匹配,这是通过精心调整的学习率优化器实现的。代码将在\url这个https URL中发布。

[NLP-65] Emotion and Intent Joint Understanding in Multimodal Conversation: A Benchmarking Dataset
[NLP-65] 多模式对话中的情感和意图共同理解:基准数据集

链接: https://arxiv.org/abs/2407.02751
作者: Rui Liu,Haolin Zuo,Zheng Lian,Xiaofen Xing,Björn W. Schuller,Haizhou Li
关键词: semantic information manifested, multimodal conversational history, Intent Joint Understanding, aims to decode, conversational history
中文关键词: 表现出的语义信息、多模式对话历史、意图联合理解、旨在解码、对话历史
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 26 pages, 8 figures, 12 tables, NeurIPS 2024 Dataset and Benchmark Track

点击查看摘要

Abstract:Emotion and Intent Joint Understanding in Multimodal Conversation (MC-EIU) aims to decode the semantic information manifested in a multimodal conversational history, while inferring the emotions and intents simultaneously for the current utterance. MC-EIU is enabling technology for many human-computer interfaces. However, there is a lack of available datasets in terms of annotation, modality, language diversity, and accessibility. In this work, we propose an MC-EIU dataset, which features 7 emotion categories, 9 intent categories, 3 modalities, i.e., textual, acoustic, and visual content, and two languages, i.e., English and Mandarin. Furthermore, it is completely open-source for free access. To our knowledge, MC-EIU is the first comprehensive and rich emotion and intent joint understanding dataset for multimodal conversation. Together with the release of the dataset, we also develop an Emotion and Intent Interaction (EI ^2 ) network as a reference system by modeling the deep correlation between emotion and intent in the multimodal conversation. With comparative experiments and ablation studies, we demonstrate the effectiveness of the proposed EI ^2 method on the MC-EIU dataset. The dataset and codes will be made available at: this https URL.
摘要:多通道会话中的情感和意图联合理解(MC-EIU)旨在破译多通道会话历史中表现出来的语义信息,同时对当前话语进行情感和意图的推理。MC-EIU正在为许多人机界面提供技术支持。然而,在注释、情态、语言多样性和可访问性方面,缺乏可用的数据集。在这项工作中,我们提出了一个MC-EIU数据集,它包括7个情感类别,9个意图类别,3个模态,即文本,声音和视觉内容,以及两种语言,即英语和普通话。此外,它是完全开源的,可以免费访问。据我们所知,MC-EIU是第一个全面的、丰富的情感和意图联合理解数据集,用于多通道对话。在数据集发布的同时,我们还通过对多通道对话中情感和意图之间的深度关联进行建模,开发了一个情感与意图交互(EI^2)网络作为参考系统。通过对比实验和烧蚀研究,我们在MC-EIU数据集上验证了所提出的EI^2方法的有效性。数据集和代码将在以下网址提供:This HTTPS URL。

[NLP-66] Learning to Reduce: Towards Improving Performance of Large Language Models on Structured Data
[NLP-66] 学习减少:提高结构化数据上大型语言模型的性能

链接: https://arxiv.org/abs/2407.02750
作者: Younghun Lee,Sungchul Kim,Ryan A. Rossi,Tong Yu,Xiang Chen
关键词: Large Language Models, Large Language, achieving competent performance, existing work shows, Learning to Reduce
中文关键词: 大型语言模型、大型语言、实现称职的绩效、现有工作展示、学习减少
类目: Computation and Language (cs.CL)
备注: ICML 2024 Workshop on Long-Context Foundation Models, Vienna, Austria 2024. arXiv admin note: substantial text overlap with arXiv:2402.14195

点击查看摘要

Abstract:Large Language Models (LLMs) have been achieving competent performance on a wide range of downstream tasks, yet existing work shows that inference on structured data is challenging for LLMs. This is because LLMs need to either understand long structured data or select the most relevant evidence before inference, and both approaches are not trivial. This paper proposes a framework, Learning to Reduce, that fine-tunes a language model with On-Policy Learning to generate a reduced version of an input structured data. When compared to state-of-the-art LLMs like GPT-4, Learning to Reduce not only achieves outstanding performance in reducing the input, but shows generalizability on different datasets. We further show that the model fine-tuned with our framework helps LLMs better perform on table QA tasks especially when the context is longer.
摘要:大型语言模型(LLM)一直在广泛的下游任务中取得出色的性能,但现有工作表明,对结构化数据的推理对LLM来说具有挑战性。这是因为LLM需要理解长结构化数据,或者在推理之前选择最相关的证据,而且这两种方法都不是简单的。本文提出了一个名为Learning to Reduce的框架,该框架通过On-Policy Learning微调语言模型,以生成输入结构化数据的简化版本。与GPT-4等最先进的LLM相比,Learning to Reduce不仅在减少输入方面取得了出色的性能,而且在不同数据集上表现出了普遍性。我们进一步表明,用我们的框架微调的模型有助于LLM更好地执行表QA任务,尤其是当上下文更长时。

[NLP-67] A Comparative Study of DSL Code Generation: Fine-Tuning vs. Optimized Retrieval Augmentation
[NLP-67] DSA代码生成的比较研究:微调与优化检索增强

链接: https://arxiv.org/abs/2407.02742
作者: Nastaran Bassamzadeh,Chhaya Methani
关键词: Large Language Models, Natural Language, Large Language, made significant progress, Domain Specific Languages
中文关键词: 大型语言模型、自然语言、大型语言、取得重大进展、领域特定语言
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 8 pages, 1 figure

点击查看摘要

Abstract:Natural Language to Code Generation has made significant progress in recent years with the advent of Large Language Models(LLMs). While generation for general-purpose languages like C, C++, and Python has improved significantly, LLMs struggle with custom function names in Domain Specific Languages or DSLs. This leads to higher hallucination rates and syntax errors, specially for DSLs having a high number of custom function names. Additionally, constant updates to function names add to the challenge as LLMs need to stay up-to-date. In this paper, we present optimizations for using Retrieval Augmented Generation (or RAG) with LLMs for DSL generation along with an ablation study comparing these strategies. We generated a train as well as test dataset with a DSL to represent automation tasks across roughly 700 APIs in public domain. We used the training dataset to fine-tune a Codex model for this DSL. Our results showed that the fine-tuned model scored the best on code similarity metric. With our RAG optimizations, we achieved parity for similarity metric. The compilation rate, however, showed that both the models still got the syntax wrong many times, with RAG-based method being 2 pts better. Conversely, hallucination rate for RAG model lagged by 1 pt for API names and by 2 pts for API parameter keys. We conclude that an optimized RAG model can match the quality of fine-tuned models and offer advantages for new, unseen APIs.
摘要:随着大型语言模型(LLM)的出现,自然语言到代码生成技术取得了长足的进步。虽然C、C++和Python等通用语言的生成已经有了显著的改进,但LLM在域特定语言或DSL中的定制函数名称方面存在困难。这会导致更高的幻觉率和语法错误,特别是对于具有大量自定义函数名称的DSL。此外,函数名称的不断更新增加了挑战,因为LLM需要保持最新。在这篇文章中,我们提出了使用检索增强生成(RAG)和LLMS来生成DSL的优化方案,并对这些策略进行了烧蚀研究。我们生成了一个带有DSL的训练和测试数据集,以表示公共领域中大约700个API的自动化任务。我们使用训练数据集对此DSL的Codex模型进行了微调。结果表明,微调后的模型在代码相似性度量上得分最高。通过RAG优化,我们实现了相似性度量的等价性。然而,编译速度表明,两种模型仍然多次出现语法错误,其中基于RAG的方法要好2分。相反,RAG模型的幻觉率对于API名称落后1分,对于API参数键落后2分。我们得出结论,优化的RAG模型可以与微调模型的质量相匹配,并为新的、看不见的API提供优势。

[NLP-68] MentalAgora: A Gateway to Advanced Personalized Care in Mental Health through Multi-Agent Debating and Attribute Control
[NLP-68] MentalAgora:通过多主体辩论和属性控制实现心理健康高级个性化护理的门户

链接: https://arxiv.org/abs/2407.02736
作者: Yeonji Lee,Sangjun Park,Kyunghyun Cho,JinYeong Bak
关键词: issues globally escalate, health issues globally, globally escalate, digital support systems, issues globally
中文关键词: 全球问题升级,全球健康问题,全球升级,数字支持系统,全球问题
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:As mental health issues globally escalate, there is a tremendous need for advanced digital support systems. We introduce MentalAgora, a novel framework employing large language models enhanced by interaction between multiple agents for tailored mental health support. This framework operates through three stages: strategic debating, tailored counselor creation, and response generation, enabling the dynamic customization of responses based on individual user preferences and therapeutic needs. We conduct experiments utilizing a high-quality evaluation dataset TherapyTalk crafted with mental health professionals, shwoing that MentalAgora generates expert-aligned and user preference-enhanced responses. Our evaluations, including experiments and user studies, demonstrate that MentalAgora aligns with professional standards and effectively meets user preferences, setting a new benchmark for digital mental health interventions.
摘要:随着全球心理健康问题的升级,对先进的数字支持系统的需求巨大。我们引入MentalAgora,这是一个新颖的框架,采用大型语言模型,通过多个代理之间的交互来增强,以提供量身定制的心理健康支持。该框架通过三个阶段运作:战略辩论、定制咨询师创建和响应生成,从而能够根据个人用户偏好和治疗需求动态定制响应。我们利用由心理健康专业人士制作的高质量评估数据集TherapyTalk进行实验,并提醒MentalAgora会生成与专家一致且用户偏好增强的响应。我们的评估(包括实验和用户研究)表明,MentalAgora符合专业标准并有效满足用户偏好,为数字心理健康干预措施树立了新的基准。

[NLP-69] -Health CSIRO at “Discharge Me!” 2024: Generating Discharge Summary Sections with Fine-tuned Language Models
[NLP-69] - 健康CSIRO在“释放我!“2024年:使用微调的语言模型生成放电摘要部分

链接: https://arxiv.org/abs/2407.02723
作者: Jinghui Liu,Aaron Nicolson,Jason Dowling,Bevan Koopman,Anthony Nguyen
关键词: Clinical documentation, Streamlining Discharge Documentation, clinicians’ daily work, amount of time, important aspect
中文关键词: 临床文件、简化出院文件、临床医生的日常工作、时间量、重要方面
类目: Computation and Language (cs.CL)
备注: BioNLP @ ACL 2024

点击查看摘要

Abstract:Clinical documentation is an important aspect of clinicians’ daily work and often demands a significant amount of time. The BioNLP 2024 Shared Task on Streamlining Discharge Documentation (Discharge Me!) aims to alleviate this documentation burden by automatically generating discharge summary sections, including brief hospital course and discharge instruction, which are often time-consuming to synthesize and write manually. We approach the generation task by fine-tuning multiple open-sourced language models (LMs), including both decoder-only and encoder-decoder LMs, with various configurations on input context. We also examine different setups for decoding algorithms, model ensembling or merging, and model specialization. Our results show that conditioning on the content of discharge summary prior to the target sections is effective for the generation task. Furthermore, we find that smaller encoder-decoder LMs can work as well or even slightly better than larger decoder based LMs fine-tuned through LoRA. The model checkpoints from our team (aehrc) are openly available.
摘要:临床病历是临床医生日常工作的一个重要方面,通常需要大量的时间。BioNLP 2024关于简化出院文件(出院!)的共享任务旨在通过自动生成出院总结部分来减轻这种文件负担,其中包括简要的医院病程和出院说明,这些部分的合成和手动编写往往很耗时。我们通过微调多个开源语言模型(LMS)来实现生成任务,包括仅解码器和编解码器LMS,并根据输入上下文进行不同的配置。我们还研究了解码算法、模型集成或合并以及模型专门化的不同设置。我们的结果表明,在目标截面之前对放电总结的内容进行限制对于生成任务是有效的。此外,我们发现,较小的编解码器LMS可以与通过LORA微调的基于较大解码器的LMS工作得一样甚至略好。我们团队的模型检查站(Aehrc)是公开提供的。

[NLP-70] Boosting Biomedical Concept Extraction by Rule-Based Data Augmentation
[NLP-70] 通过基于规则的数据增强促进生物医学概念提取

链接: https://arxiv.org/abs/2407.02719
作者: Qiwei Shao,Fengran Mo,Jian-Yun Nie
关键词: Document-level biomedical concept, Document-level biomedical, identifying biomedical concepts, biomedical concepts mentioned, identifying biomedical
中文关键词: 文档级生物医学概念,文档级生物医学,识别生物医学概念,提到的生物医学概念,识别生物医学
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Document-level biomedical concept extraction is the task of identifying biomedical concepts mentioned in a given document. Recent advancements have adapted pre-trained language models for this task. However, the scarcity of domain-specific data and the deviation of concepts from their canonical names often hinder these models’ effectiveness. To tackle this issue, we employ MetaMapLite, an existing rule-based concept mapping system, to generate additional pseudo-annotated data from PubMed and PMC. The annotated data are used to augment the limited training data. Through extensive experiments, this study demonstrates the utility of a manually crafted concept mapping tool for training a better concept extraction model.
摘要:文档级生物医学概念提取是识别给定文档中提到的生物医学概念的任务。最近的进步已经为这项任务调整了预先训练的语言模型。然而,特定领域数据的稀缺以及概念与规范名称的偏差常常阻碍这些模型的有效性。为了解决这个问题,我们使用MetaMapLite(一种现有的基于规则的概念映射系统)来从PubMed和PDC生成额外的伪注释数据。注释数据用于扩充有限的训练数据。通过大量实验,本研究证明了手动制作的概念映射工具用于训练更好的概念提取模型的实用性。

[NLP-71] LLM-Select: Feature Selection with Large Language Models
[NLP-71] LLLM-select:使用大型语言模型的特征选择

链接: https://arxiv.org/abs/2407.02694
作者: Daniel P. Jeong,Zachary C. Lipton,Pradeep Ravikumar
关键词: large language models, prediction task, demonstrate a surprising, surprising capability, capability of large
中文关键词: 大型语言模型,预测任务,展示了令人惊讶的、令人惊讶的能力,大的能力
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
备注: Preprint

点击查看摘要

Abstract:In this paper, we demonstrate a surprising capability of large language models (LLMs): given only input feature names and a description of a prediction task, they are capable of selecting the most predictive features, with performance rivaling the standard tools of data science. Remarkably, these models exhibit this capacity across various query mechanisms. For example, we zero-shot prompt an LLM to output a numerical importance score for a feature (e.g., “blood pressure”) in predicting an outcome of interest (e.g., “heart failure”), with no additional context. In particular, we find that the latest models, such as GPT-4, can consistently identify the most predictive features regardless of the query mechanism and across various prompting strategies. We illustrate these findings through extensive experiments on real-world data, where we show that LLM-based feature selection consistently achieves strong performance competitive with data-driven methods such as the LASSO, despite never having looked at the downstream training data. Our findings suggest that LLMs may be useful not only for selecting the best features for training but also for deciding which features to collect in the first place. This could potentially benefit practitioners in domains like healthcare, where collecting high-quality data comes at a high cost.
摘要:在本文中,我们展示了大型语言模型令人惊讶的能力:只要给出输入的特征名称和预测任务的描述,它们就能够选择最具预测性的特征,其性能可与数据科学的标准工具相媲美。值得注意的是,这些模型跨各种查询机制展示了这种能力。例如,我们在没有附加上下文的情况下,零射提示LLM在预测感兴趣的结果(例如,心力衰竭)时,输出特征(例如,血压)的数值重要性分数。特别是,我们发现最新的模型,如GPT-4,可以一致地识别最具预测性的特征,而无论查询机制如何,并跨越各种提示策略。我们通过对真实世界数据的广泛实验来说明这些发现,其中我们表明,尽管从未查看过下游的训练数据,基于LLM的特征选择始终获得与数据驱动的方法(如套索)相竞争的强大性能。我们的发现表明,LLMS可能不仅对于选择用于训练的最佳特征很有用,而且对于决定首先收集哪些特征也是有用的。这可能会让医疗保健等领域的从业者受益,在这些领域,收集高质量的数据需要付出高昂的成本。

[NLP-72] Reasoning in Large Language Models: A Geometric Perspective
[NLP-72] 大型语言模型中的推理:几何视角

链接: https://arxiv.org/abs/2407.02678
作者: Romain Cosentino,Sarath Shekkizhar
关键词: large language models, real-world applications hinges, applications hinges critically, language models, large language
中文关键词: 大型语言模型、现实世界应用程序枢纽、关键应用程序枢纽、语言模型、大型语言
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The advancement of large language models (LLMs) for real-world applications hinges critically on enhancing their reasoning capabilities. In this work, we explore the reasoning abilities of large language models (LLMs) through their geometrical understanding. We establish a connection between the expressive power of LLMs and the density of their self-attention graphs. Our analysis demonstrates that the density of these graphs defines the intrinsic dimension of the inputs to the MLP blocks. We demonstrate through theoretical analysis and toy examples that a higher intrinsic dimension implies a greater expressive capacity of the LLM. We further provide empirical evidence linking this geometric framework to recent advancements in methods aimed at enhancing the reasoning capabilities of LLMs.
摘要:现实世界应用程序的大型语言模型(LLM)的进步关键取决于增强其推理能力。在这项工作中,我们通过大型语言模型(LLM)的几何理解来探索它们的推理能力。我们在LLM的表达能力与其自我注意力图表的密度之间建立了联系。我们的分析表明,这些图的密度定义了MLP块输入的内在维度。我们通过理论分析和玩具例子证明,更高的内在维度意味着LLM的表达能力更强。我们进一步提供了经验证据,将这个几何框架与旨在增强LLM推理能力的方法的最新进展联系起来。

[NLP-73] Supporters and Skeptics: LLM-based Analysis of Engagement with Mental Health (Mis)Information Content on Video-sharing Platforms
[NLP-73] 支持者和怀疑者:基于法学硕士的视频共享平台上心理健康(Mis)信息内容参与分析

链接: https://arxiv.org/abs/2407.02662
作者: Viet Cuong Nguyen,Mini Jain,Abhijat Chauhan,Heather Jaime Soled,Santiago Alvarez Lesmes,Zihang Li,Michael L. Birnbaum,Sunny X. Tang,Srijan Kumar,Munmun De Choudhury
关键词: mental health, mental health misinformation, mental illness, mental, health
中文关键词: 心理健康,心理健康错误信息,精神疾病,心理,健康
类目: ocial and Information Networks (cs.SI); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: 12 pages, in submission to ICWSM

点击查看摘要

Abstract:Over one in five adults in the US lives with a mental illness. In the face of a shortage of mental health professionals and offline resources, online short-form video content has grown to serve as a crucial conduit for disseminating mental health help and resources. However, the ease of content creation and access also contributes to the spread of misinformation, posing risks to accurate diagnosis and treatment. Detecting and understanding engagement with such content is crucial to mitigating their harmful effects on public health. We perform the first quantitative study of the phenomenon using YouTube Shorts and Bitchute as the sites of study. We contribute MentalMisinfo, a novel labeled mental health misinformation (MHMisinfo) dataset of 739 videos (639 from Youtube and 100 from Bitchute) and 135372 comments in total, using an expert-driven annotation schema. We first found that few-shot in-context learning with large language models (LLMs) are effective in detecting MHMisinfo videos. Next, we discover distinct and potentially alarming linguistic patterns in how audiences engage with MHMisinfo videos through commentary on both video-sharing platforms. Across the two platforms, comments could exacerbate prevailing stigma with some groups showing heightened susceptibility to and alignment with MHMisinfo. We discuss technical and public health-driven adaptive solutions to tackling the “epidemic” of mental health misinformation online.
摘要:在美国,超过五分之一的成年人患有精神疾病。面对心理健康专业人员和线下资源的短缺,在线短视频内容已成为传播心理健康帮助和资源的重要渠道。然而,内容创建和获取的便利性也助长了错误信息的传播,给准确的诊断和治疗带来了风险。检测和了解对此类内容的参与对于减轻其对公众健康的有害影响至关重要。我们首次对这一现象进行了定量研究,使用了YouTube短片和Bitchutt作为研究地点。我们贡献了MentalMisinfo,一个新颖的标记精神健康错误信息的数据集,使用专家驱动的注释模式,包含739个视频(639个来自YouTube639个来自Bitchute)和总共135372条评论。我们首先发现,使用大语言模型(LLM)的少镜头上下文学习在检测MHMisinfo视频方面是有效的。接下来,我们通过两个视频分享平台上的评论,在观众如何参与MHMisinfo视频的过程中发现了独特的、潜在的令人担忧的语言模式。在这两个平台上,评论可能会加剧普遍存在的污名,一些群体表现出对MHMisinfo的高度敏感和与之一致。我们讨论了技术和公共卫生驱动的适应性解决方案,以应对在线精神健康错误信息的“流行病”。

[NLP-74] Ensuring Responsible Sourcing of Large Language Model Training Data Through Knowledge Graph Comparison
[NLP-74] 通过知识图谱比较确保负责任地采购大型语言模型训练数据

链接: https://arxiv.org/abs/2407.02659
作者: Devam Mondal,Carlo Lipizzi
关键词: plagiarism allegations Brough, recent plagiarism allegations, large language model, Resource Description Framework, plagiarism detection system
中文关键词: 抄袭指控Brough、最近的抄袭指控、大型语言模型、资源描述框架、抄袭检测系统
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:In light of recent plagiarism allegations Brough by publishers, newspapers, and other creators of copyrighted corpora against large language model (LLM) developers, we propose a novel system, a variant of a plagiarism detection system, that assesses whether a knowledge source has been used in the training or fine-tuning of a large language model. Unlike current methods, we utilize an approach that uses Resource Description Framework (RDF) triples to create knowledge graphs from both a source document and a LLM continuation of that document. These graphs are then analyzed with respect to content using cosine similarity and with respect to structure using a normalized version of graph edit distance that shows the degree of isomorphism. Unlike traditional systems that focus on content matching and keyword identification between a source and target corpus, our approach enables a broader evaluation of similarity and thus a more accurate comparison of the similarity between a source document and LLM continuation by focusing on relationships between ideas and their organization with regards to others. Additionally, our approach does not require access to LLM metrics like perplexity that may be unavailable in closed large language modeling “black-box” systems, as well as the training corpus. A prototype of our system will be found on a hyperlinked GitHub repository.
摘要:针对最近出版商、报纸和其他受版权保护的语料库创建者对大型语言模型(LLM)开发者提出的抄袭指控,我们提出了一个新的系统,该系统是抄袭检测系统的一个变体,用于评估知识来源是否被用于大型语言模型的训练或微调。与当前的方法不同,我们利用一种使用资源描述框架(RDF)三元组的方法来从源文档和该文档的LLM延续创建知识图。然后使用余弦相似性相对于内容和使用显示同构程度的归一化版本的图形编辑距离来分析这些图形。与专注于源语料库和目标语料库之间的内容匹配和关键字识别的传统系统不同,我们的方法通过关注想法及其组织与他人之间的关系,实现了更广泛的相似性评估,从而更准确地比较了源文档和LLM延续之间的相似性。此外,我们的方法不需要访问LLM度量,如困惑,这在封闭的大型语言建模“黑盒”系统中可能是不可用的,以及训练语料库。我们的系统原型将在一个超链接的GitHub储存库中找到。

[NLP-75] A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models
[NLP-75] 基于转换器的语言模型的机械可解释性的实践评论

链接: https://arxiv.org/abs/2407.02646
作者: Daking Rai,Yilun Zhou,Shi Feng,Abulhair Saparov,Ziyu Yao
关键词: Mechanistic interpretability, neural network model, internal computations, emerging sub-field, neural network
中文关键词: 机械可解释性、神经网络模型、内部计算、新兴子领域、神经网络
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 11 pages, 11 figures, Preprint

点击查看摘要

Abstract:Mechanistic interpretability (MI) is an emerging sub-field of interpretability that seeks to understand a neural network model by reverse-engineering its internal computations. Recently, MI has garnered significant attention for interpreting transformer-based language models (LMs), resulting in many novel insights yet introducing new challenges. However, there has not been work that comprehensively reviews these insights and challenges, particularly as a guide for newcomers to this field. To fill this gap, we present a comprehensive survey outlining fundamental objects of study in MI, techniques that have been used for its investigation, approaches for evaluating MI results, and significant findings and applications stemming from the use of MI to understand LMs. In particular, we present a roadmap for beginners to navigate the field and leverage MI for their benefit. Finally, we also identify current gaps in the field and discuss potential future directions.
摘要:机械可解释性(MI)是一个新兴的可解释性子领域,旨在通过对其内部计算进行反向工程来理解神经网络模型。最近,MI在解释基于转换器的语言模型(LM)方面引起了广泛关注,这带来了许多新颖的见解,但也带来了新的挑战。然而,目前还没有全面审查这些见解和挑战的工作,特别是作为该领域新人的指南。为了填补这一空白,我们提出了一项全面的调查,概述了MI的基本研究对象、用于其调查的技术、评估MI结果的方法,以及使用MI来理解LM所产生的重要发现和应用。特别是,我们为初学者提供了一份路线图,帮助他们在该领域探索并利用MI来谋取利益。最后,我们还找出该领域当前的差距并讨论潜在的未来方向。

[NLP-76] Change My Frame: Reframing in the Wild in r/ChangeMyView
[NLP-76] 改变我的框架:r/ChangeMyView中的野外重新框架

链接: https://arxiv.org/abs/2407.02637
作者: Arturo Martínez Peguero,Taro Watanabe
关键词: text style transfer, Recent work, style transfer, optimistic reframes, text style
中文关键词: 文本风格转移,最近的作品,风格转移,乐观的重新框架,文本风格
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 3 pages, NAACL 2024 workshop

点击查看摘要

Abstract:Recent work in reframing, within the scope of text style transfer, has so far made use of out-of-context, task-prompted utterances in order to produce neutralizing or optimistic reframes. Our work aims to generalize reframing based on the subreddit r/ChangeMyView (CMV). We build a dataset that leverages CMV’s community’s interactions and conventions to identify high-value, community-recognized utterances that produce changes of perspective. With this data, we widen the scope of the direction of reframing since the changes in perspective do not only occur in neutral or positive directions. We fine tune transformer-based models, make use of a modern LLM to refine our dataset, and explore challenges in the dataset creation and evaluation around this type of reframing.
摘要:最近在文本风格转换范围内的重新框架工作迄今为止都使用了脱离上下文、任务提示的话语来产生中和或乐观的重新框架。我们的工作旨在基于subreddit r/ChangeMyView(CMV)来概括重建。我们构建了一个数据集,利用CMV社区的互动和惯例来识别产生观点变化的高价值、社区认可的话语。有了这些数据,我们扩大了重构方向的范围,因为视角的变化不仅仅发生在中性或积极的方向上。我们微调基于转换器的模型,利用现代LLM来完善我们的数据集,并探索围绕这种类型的重构的数据集创建和评估中的挑战。

[NLP-77] Nollywood: Lets Go to the Movies!
[NLP-77] 尼莱坞:我们去看电影吧!

链接: https://arxiv.org/abs/2407.02631
作者: John E. Ortega,Ibrahim Said Ahmad,William Chen
关键词: Bollywood from India, idea of Bollywood, series of outstanding, Nigerian English speech, translate Nigerian English
中文关键词: 来自印度的宝莱坞,宝莱坞的想法,一系列杰出的尼日利亚英语演讲,翻译尼日利亚英语
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: 8 pages, 4 figures, 2 tables

点击查看摘要

Abstract:Nollywood, based on the idea of Bollywood from India, is a series of outstanding movies that originate from Nigeria. Unfortunately, while the movies are in English, they are hard to understand for many native speakers due to the dialect of English that is spoken. In this article, we accomplish two goals: (1) create a phonetic sub-title model that is able to translate Nigerian English speech to American English and (2) use the most advanced toxicity detectors to discover how toxic the speech is. Our aim is to highlight the text in these videos which is often times ignored for lack of dialectal understanding due the fact that many people in Nigeria speak a native language like Hausa at home.
摘要:《奈莱坞》是一部源自尼日利亚的优秀电影,以印度宝莱坞为理念,改编自印度宝莱坞。不幸的是,虽然电影是英语的,但由于所使用的英语方言,许多以英语为母语的人很难理解。在本文中,我们实现了两个目标:(1)创建一个能够将尼日利亚英语语音翻译为美式英语的语音副标题模型;(2)使用最先进的毒性检测器来发现语音的毒性有多大。我们的目标是突出这些视频中的文本,由于尼日利亚的许多人在家里说豪萨语等母语,这些文本经常因缺乏方言理解而被忽视。

[NLP-78] Uplifting Lower-Income Data: Strategies for Socioeconomic Perspective Shifts in Vision-Language Models
[NLP-78] 提升低收入数据:社会经济视角的策略视觉语言模型的转变

链接: https://arxiv.org/abs/2407.02623
作者: Joan Nwatu,Oana Ignat,Rada Mihalcea
关键词: formulate translated non-English, socioeconomic integrated prompts, socioeconomic integrated, integrated prompts improve, address this issue
中文关键词: 制定翻译的非英语、社会经济综合提示、社会经济综合、综合提示改进、解决这个问题
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:To address this issue, we formulate translated non-English, geographic, and socioeconomic integrated prompts and evaluate their impact on VL model performance for data from different countries and income groups. Our findings show that geographic and socioeconomic integrated prompts improve VL performance on lower-income data and favor the retrieval of topic appearances commonly found in data from low-income households. From our analyses, we identify and highlight contexts where these strategies yield the most improvements. Our model analysis code is publicly available at this https URL .
摘要:为了解决这个问题,我们制定了翻译的非英语、地理和社会经济综合提示,并评估它们对来自不同国家和收入群体的数据的DL模型性能的影响。我们的研究结果表明,地理和社会经济综合提示可以提高低收入数据上的低收入数据表现,并有利于检索低收入家庭数据中常见的主题外观。从我们的分析中,我们确定并强调这些策略能带来最大改进的背景。我们的模型分析代码可在此https URL上公开获取。

[NLP-79] D-Rax: Domain-specific Radiologic assistant leveraging multi-modal data and eXpert model predictions
[NLP-79] D-Rax:利用多模式数据和eXpress模型预测的特定领域放射学助理

链接: https://arxiv.org/abs/2407.02604
作者: Hareem Nisar,Syed Muhammad Anwar,Zhifan Jiang,Abhijeet Parida,Vishwesh Nath,Holger R. Roth,Marius George Linguraru
关键词: Large vision language, general-purpose use cases, progressed incredibly, incredibly from research, research to applicability
中文关键词: 大型视觉语言,通用用例,从研究、研究到适用性,取得了令人难以置信的进步
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Large vision language models (VLMs) have progressed incredibly from research to applicability for general-purpose use cases. LLaVA-Med, a pioneering large language and vision assistant for biomedicine, can perform multi-modal biomedical image and data analysis to provide a natural language interface for radiologists. While it is highly generalizable and works with multi-modal data, it is currently limited by well-known challenges that exist in the large language model space. Hallucinations and imprecision in responses can lead to misdiagnosis which currently hinder the clinical adaptability of VLMs. To create precise, user-friendly models in healthcare, we propose D-Rax – a domain-specific, conversational, radiologic assistance tool that can be used to gain insights about a particular radiologic image. In this study, we enhance the conversational analysis of chest X-ray (CXR) images to support radiological reporting, offering comprehensive insights from medical imaging and aiding in the formulation of accurate diagnosis. D-Rax is achieved by fine-tuning the LLaVA-Med architecture on our curated enhanced instruction-following data, comprising of images, instructions, as well as disease diagnosis and demographic predictions derived from MIMIC-CXR imaging data, CXR-related visual question answer (VQA) pairs, and predictive outcomes from multiple expert AI models. We observe statistically significant improvement in responses when evaluated for both open and close-ended conversations. Leveraging the power of state-of-the-art diagnostic models combined with VLMs, D-Rax empowers clinicians to interact with medical images using natural language, which could potentially streamline their decision-making process, enhance diagnostic accuracy, and conserve their time.
摘要:大型视觉语言模型(VLM)从研究到适用于通用用例已经取得了令人难以置信的进步。LLaVA-Med是一款开创性的大型生物医学语言和视觉助手,可以执行多模式生物医学图像和数据分析,为放射科医生提供自然语言界面。虽然它具有高度的通用性,并且可以处理多模式数据,但它目前受到大型语言模型空间中存在的众所周知挑战的限制。幻觉和反应不准确会导致误诊,目前阻碍了VLMS的临床适应性。为了在医疗保健中创建精确的、用户友好的模型,我们提出了D-Rax–一种特定于领域的、对话式的放射辅助工具,可用于深入了解特定的放射图像。在这项研究中,我们增强了对胸部X光(CXR)图像的对话分析,以支持放射学报告,提供来自医学成像的全面见解,并帮助制定准确的诊断。D-Rax是通过在我们精心策划的增强型指令遵循数据上微调LLaVA-Med架构来实现的,这些数据包括图像、指令以及来自MIMIC-CXR成像数据的疾病诊断和人口预测、与CXR相关的可视问题回答(VQA)对以及来自多个专家AI模型的预测结果。在对开放式和封闭式对话进行评估时,我们观察到在统计上有显著的改善。D-Rax利用最先进的诊断模型与VLM相结合的强大功能,使临床医生能够使用自然语言与医学图像进行交互,这可能会简化他们的决策过程,提高诊断准确性,并节省他们的时间。

[NLP-80] owards More Realistic Extraction Attacks: An Adversarial Perspective
[NLP-80] owards更现实的提取攻击:对抗的角度

链接: https://arxiv.org/abs/2407.02596
作者: Yash More,Prakhar Ganesh,Golnoosh Farnadi
关键词: memorizing large parts, prone to memorizing, memorizing large, large parts, Language models
中文关键词: 记住大部分,容易记住,记住很大很大的部分,语言模型
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: To be presented at PrivateNLP@ACL2024

点击查看摘要

Abstract:Language models are prone to memorizing large parts of their training data, making them vulnerable to extraction attacks. Existing research on these attacks remains limited in scope, often studying isolated trends rather than the real-world interactions with these models. In this paper, we revisit extraction attacks from an adversarial perspective, exploiting the brittleness of language models. We find significant churn in extraction attack trends, i.e., even minor, unintuitive changes to the prompt, or targeting smaller models and older checkpoints, can exacerbate the risks of extraction by up to 2-4 \times . Moreover, relying solely on the widely accepted verbatim match underestimates the extent of extracted information, and we provide various alternatives to more accurately capture the true risks of extraction. We conclude our discussion with data deduplication, a commonly suggested mitigation strategy, and find that while it addresses some memorization concerns, it remains vulnerable to the same escalation of extraction risks against a real-world adversary. Our findings highlight the necessity of acknowledging an adversary’s true capabilities to avoid underestimating extraction risks.
摘要:语言模型容易记住大部分训练数据,容易受到抽取攻击。现有对这些攻击的研究范围仍然有限,往往研究孤立的趋势,而不是与这些模型的现实世界互动。在本文中,我们从敌意的角度重新审视提取攻击,利用语言模型的脆弱性。我们发现提取攻击趋势中的显著波动,即提示即使是微小的、不直观的更改,或者以较小的模型和较旧的检查点为目标,都可能使提取的风险增加2-4倍。此外,仅依靠被广泛接受的逐字匹配低估了提取信息的程度,我们提供了各种替代方案来更准确地捕获提取的真实风险。我们以重复数据删除结束我们的讨论,这是一种通常建议的缓解策略,并发现虽然它解决了一些记忆问题,但它仍然容易受到针对现实世界对手的提取风险的相同升级的影响。我们的发现突显了承认对手真实能力的必要性,以避免低估开采风险。

[NLP-81] RLHF Can Speak Many Languages: Unlocking Multilingual Preference Optimization for LLMs
[NLP-81] RL HF可以说多种语言:解锁LLM的多语言偏好优化

链接: https://arxiv.org/abs/2407.02552
作者: John Dang,Arash Ahmadian,Kelly Marchisio,Julia Kreutzer,Ahmet Üstün,Sara Hooker
关键词: standard final stage, standard final, final stage, English and Chinese, large language models
中文关键词: 标准决赛阶段,标准决赛,决赛阶段,英语和中文,大型语言模型
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Preference optimization techniques have become a standard final stage for training state-of-art large language models (LLMs). However, despite widespread adoption, the vast majority of work to-date has focused on first-class citizen languages like English and Chinese. This captures a small fraction of the languages in the world, but also makes it unclear which aspects of current state-of-the-art research transfer to a multilingual setting. In this work, we perform an exhaustive study to achieve a new state-of-the-art in aligning multilingual LLMs. We introduce a novel, scalable method for generating high-quality multilingual feedback data to balance data coverage. We establish the benefits of cross-lingual transfer and increased dataset size in preference training. Our preference-trained model achieves a 54.4% win-rate against Aya 23 8B, the current state-of-the-art multilingual LLM in its parameter class, and a 69.5% win-rate or higher against widely used models like Gemma-1.1-7B-it, Llama-3-8B-Instruct, Mistral-7B-Instruct-v0.3. As a result of our study, we expand the frontier of alignment techniques to 23 languages covering half of the world’s population.
摘要:偏好优化技术已经成为训练最先进的大型语言模型的标准最后阶段。然而,尽管被广泛采用,到目前为止,绝大多数工作都集中在英语和中文等一等公民语言上。这只涵盖了世界上一小部分语言,但也让人不清楚当前最先进的研究的哪些方面会转移到多语言环境中。在这项工作中,我们进行了详尽的研究,以实现多语言LLM对齐的新技术。我们介绍了一种新的、可扩展的方法来生成高质量的多语言反馈数据,以平衡数据覆盖。我们在偏好训练中确定了跨语言迁移和增加数据集大小的好处。我们的偏好训练模型对Aya 23 8B的胜率为54.4%,对其参数类中最先进的多语言LLM的胜率为54.4%,对Gema-1.1-7B-It、Llama-3-8B-Indict、Mistral-7B-Indict-v0.3等广泛使用的模型的胜率为69.5%或更高。作为我们研究的结果,我们将对齐技术的前沿扩展到覆盖世界一半人口的23种语言。

[NLP-82] owards the Next Frontier in Speech Representation Learning Using Disentanglement
[NLP-82] 使用解纠缠来了解语音表示学习的下一个前沿

链接: https://arxiv.org/abs/2407.02543
作者: Varun Krishna,Sriram Ganapathy
关键词: frame-level masked prediction, masked prediction, speech, speech regions, learning
中文关键词: 帧级掩蔽预测、掩蔽预测、语音、语音区域、学习
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:The popular frameworks for self-supervised learning of speech representations have largely focused on frame-level masked prediction of speech regions. While this has shown promising downstream task performance for speech recognition and related tasks, this has largely ignored factors of speech that are encoded at coarser level, like characteristics of the speaker or channel that remain consistent through-out a speech utterance. In this work, we propose a framework for Learning Disentangled Self Supervised (termed as Learn2Diss) representations of speech, which consists of frame-level and an utterance-level encoder modules. The two encoders are initially learned independently, where the frame-level model is largely inspired by existing self supervision techniques, thereby learning pseudo-phonemic representations, while the utterance-level encoder is inspired by constrastive learning of pooled embeddings, thereby learning pseudo-speaker representations. The joint learning of these two modules consists of disentangling the two encoders using a mutual information based criterion. With several downstream evaluation experiments, we show that the proposed Learn2Diss achieves state-of-the-art results on a variety of tasks, with the frame-level encoder representations improving semantic tasks, while the utterance-level representations improve non-semantic tasks.
摘要:目前流行的语音表征自监督学习框架主要集中在对语音区域进行帧级掩蔽预测。虽然这已经显示了语音识别和相关任务的有希望的下游任务性能,但这在很大程度上忽略了在较粗级别编码的语音因素,例如说话者或通道的特征,其在整个语音发声中保持一致。在这项工作中,我们提出了一种学习解缠自监督语音表示的框架,该框架由帧级别和发音级别的编码器模块组成。这两个编码器最初是独立学习的,其中帧级别的模型在很大程度上受到现有自我监督技术的启发,从而学习伪音素表示,而发声级别的编码器则受到池嵌入的对比学习的启发,从而学习伪说话人表示。这两个模块的联合学习包括使用基于互信息的准则来解开两个编码器的纠缠。实验表明,Learn2Diss在不同的任务上取得了最好的结果,其中帧级别的编码表示改善了语义任务,而发音级别的表示改善了非语义任务。

[NLP-83] Actionable Cyber Threat Intelligence using Knowledge Graphs and Large Language Models
[NLP-83] 使用知识图和大型语言模型的可操作网络威胁情报

链接: https://arxiv.org/abs/2407.02528
作者: Romy Fieblinger,Md Tanvirul Alam,Nidhi Rastogi
关键词: Cyber Threat Intelligence, unstructured Cyber Threat, Cyber threats, constantly evolving, Threat Intelligence
中文关键词: 网络威胁情报,非结构化网络威胁,网络威胁,不断发展,威胁情报
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 6th Workshop on Attackers and Cyber-Crime Operations, 12 pages, 1 figure, 9 tables

点击查看摘要

Abstract:Cyber threats are constantly evolving. Extracting actionable insights from unstructured Cyber Threat Intelligence (CTI) data is essential to guide cybersecurity decisions. Increasingly, organizations like Microsoft, Trend Micro, and CrowdStrike are using generative AI to facilitate CTI extraction. This paper addresses the challenge of automating the extraction of actionable CTI using advancements in Large Language Models (LLMs) and Knowledge Graphs (KGs). We explore the application of state-of-the-art open-source LLMs, including the Llama 2 series, Mistral 7B Instruct, and Zephyr for extracting meaningful triples from CTI texts. Our methodology evaluates techniques such as prompt engineering, the guidance framework, and fine-tuning to optimize information extraction and structuring. The extracted data is then utilized to construct a KG, offering a structured and queryable representation of threat intelligence. Experimental results demonstrate the effectiveness of our approach in extracting relevant information, with guidance and fine-tuning showing superior performance over prompt engineering. However, while our methods prove effective in small-scale tests, applying LLMs to large-scale data for KG construction and Link Prediction presents ongoing challenges.
摘要:网络威胁不断演变。从非结构化网络威胁情报(CTI)数据中提取可操作的见解对于指导网络安全决策至关重要。越来越多的组织,如微软、趋势科技和CrowdStrike,正在使用生成性人工智能来促进CTI提取。本文讨论了使用大型语言模型(LLM)和知识图(KG)中的改进来自动提取可操作的CTI的挑战。我们探索了最先进的开源LLMS的应用,包括Llama 2系列、Mistral 7B指令和Zephr,用于从CTI文本中提取有意义的三元组。我们的方法评估诸如即时工程、指导框架和微调等技术,以优化信息提取和结构。然后利用提取的数据构建KG,提供结构化和可查询的威胁情报表示。实验结果表明,该方法在提取相关信息方面是有效的,其中制导和微调的性能优于即时工程。然而,尽管我们的方法在小规模测试中被证明是有效的,但将LLMS应用于KG构建和链接预测的大规模数据仍然存在挑战。

[NLP-84] INDICT: Code Generation with Internal Dialogues of Critiques for Both Security and Helpfulness
[NLP-84] INDICT:通过内部对安全性和帮助性的批评对话生成代码

链接: https://arxiv.org/abs/2407.02518
作者: Hung Le,Yingbo Zhou,Caiming Xiong,Silvio Savarese,Doyen Sahoo
关键词: Large language models, Large language, intentions and requirements, natural language instructions, typically trained
中文关键词: 大型语言模型、大型语言、意图和要求、自然语言指令,通常经过训练
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Multiagent Systems (cs.MA); Programming Languages (cs.PL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) for code are typically trained to align with natural language instructions to closely follow their intentions and requirements. However, in many practical scenarios, it becomes increasingly challenging for these models to navigate the intricate boundary between helpfulness and safety, especially against highly complex yet potentially malicious instructions. In this work, we introduce INDICT: a new framework that empowers LLMs with Internal Dialogues of Critiques for both safety and helpfulness guidance. The internal dialogue is a dual cooperative system between a safety-driven critic and a helpfulness-driven critic. Each critic provides analysis against the given task and corresponding generated response, equipped with external knowledge queried through relevant code snippets and tools like web search and code interpreter. We engage the dual critic system in both code generation stage as well as code execution stage, providing preemptive and post-hoc guidance respectively to LLMs. We evaluated INDICT on 8 diverse tasks across 8 programming languages from 5 benchmarks, using LLMs from 7B to 70B parameters. We observed that our approach can provide an advanced level of critiques of both safety and helpfulness analysis, significantly improving the quality of output codes ( +10% absolute improvements in all models).
摘要:代码的大型语言模型(LLM)通常经过训练,以与自然语言指令保持一致,以密切遵循其意图和要求。然而,在许多实际场景中,对于这些模型来说,在有用和安全之间导航的复杂边界变得越来越具有挑战性,特别是针对高度复杂但可能是恶意的指令。在这项工作中,我们介绍了DIRECT:一个新的框架,它赋予LLMS以内部批评对话,以提供安全和帮助指导。内部对话是安全驱动的批评者和帮助驱动的批评者之间的双重合作系统。每个批评者提供针对给定任务的分析和相应的生成响应,配备了通过相关代码片段和工具(如网络搜索和代码解释器)查询的外部知识。我们在代码生成阶段和代码执行阶段都使用双重Critic系统,分别为LLM提供先发制人和事后指导。我们使用从7B到70B参数的LLMS,对来自5个基准的8种编程语言的8种不同任务进行了评估。我们观察到,我们的方法可以为安全性和有用性分析提供更高水平的批评,显著提高输出代码的质量(在所有模型中都有+10%的绝对改进)。

[NLP-85] LOGIC-LM: Multi-Step Refinement for Symbolic Formulations
[NLP-85] LOGIC-LM:符号公式的多步骤细化

链接: https://arxiv.org/abs/2407.02514
作者: Shashank Kirtania,Priyanshu Gupta,Arjun Radhakirshna
关键词: Large Language Models, limitations of Large, Language Models, complex reasoning tasks, Large Language
中文关键词: 大型语言模型、大型语言的局限性、语言模型、复杂推理任务、大型语言
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In this paper we examine the limitations of Large Language Models (LLMs) for complex reasoning tasks. Although recent works have started to employ formal languages as an intermediate representation for reasoning tasks, they often face challenges in accurately generating and refining these formal specifications to ensure correctness. To address these issues, this paper proposes Logic-LM++, an improvement on Logic-LM . It uses the ability of LLMs to do pairwise comparisons, allowing the evaluation of the refinements suggested by the LLM. The paper demonstrates that Logic-LM++ outperforms Logic-LM and other contemporary techniques across natural language reasoning tasks on three datasets, FOLIO, ProofWriter and AR-LSAT, with an average improvement of 18.5% on standard prompting, 12.3% on chain of thought prompting and 5% on Logic-LM.
摘要:在本文中,我们研究了大型语言模型(LLM)对于复杂推理任务的局限性。尽管最近的作品已经开始使用形式语言作为推理任务的中间表示,但它们在准确生成和完善这些形式规范以确保正确性方面经常面临挑战。为了解决这些问题,本文提出了Logic-LM++,这是对Logic-LM的改进。它利用LLM进行成对比较的能力,允许对LLM建议的改进进行评估。该论文证明,在三个数据集(FOLIO、ProofWriter和AR-LSAT)上,Logic-LM++在自然语言推理任务中优于Logic-LM和其他当代技术,标准提示平均提高18.5%,思想链提示提高12.3%,Logic-LM提高5%。

[NLP-86] LLM-A*: Large Language Model Enhanced Incremental Heuristic Search on Path Planning
[NLP-86] LLM-A*:路径规划上的大型语言模型增强增量启发式搜索

链接: https://arxiv.org/abs/2407.02511
作者: Silin Meng,Yiwei Wang,Cheng-Fu Yang,Nanyun Peng,Kai-Wei Chang
关键词: fundamental scientific problem, autonomous navigation, requiring the derivation, avoiding obstacles, fundamental scientific
中文关键词: 基础科学问题,自主导航,需要推导,避免障碍,基础科学
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Submitted to The 2024 Conference on Empirical Methods in Natural Language Processing

点击查看摘要

Abstract:Path planning is a fundamental scientific problem in robotics and autonomous navigation, requiring the derivation of efficient routes from starting to destination points while avoiding obstacles. Traditional algorithms like A* and its variants are capable of ensuring path validity but suffer from significant computational and memory inefficiencies as the state space grows. Conversely, large language models (LLMs) excel in broader environmental analysis through contextual understanding, providing global insights into environments. However, they fall short in detailed spatial and temporal reasoning, often leading to invalid or inefficient routes. In this work, we propose LLM-A*, an new LLM based route planning method that synergistically combines the precise pathfinding capabilities of A* with the global reasoning capability of LLMs. This hybrid approach aims to enhance pathfinding efficiency in terms of time and space complexity while maintaining the integrity of path validity, especially in large-scale scenarios. By integrating the strengths of both methodologies, LLM-A* addresses the computational and memory limitations of conventional algorithms without compromising on the validity required for effective pathfinding.
摘要:路径规划是机器人学和自主导航中的一个基本科学问题,它要求在避开障碍物的同时获得从起点到终点的有效路径。A及其变体等传统算法能够确保路径有效性,但随着状态空间的增长,存在严重的计算和内存效率低下问题。相反,大型语言模型(LLM)通过上下文理解在更广泛的环境分析方面表现出色,提供对环境的全局洞察。然而,它们缺乏详细的空间和时间推理,经常导致无效或低效的路线。在这项工作中,我们提出了一种新的基于LLM的路径规划方法LLM-A,它协同结合了A的精确寻路能力和LLMS的全局推理能力。这种混合方法的目的是在保持路径有效性完整性的同时,在时间和空间复杂度方面提高寻路效率,特别是在大规模场景中。通过整合两种方法的优点,LLM-A解决了传统算法的计算和内存限制,而不会影响有效寻路所需的有效性。

计算机视觉

[CV-0] InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output

链接: https://arxiv.org/abs/2407.03320
作者: Pan Zhang,Xiaoyi Dong,Yuhang Zang,Yuhang Cao,Rui Qian,Lin Chen,Qipeng Guo,Haodong Duan,Bin Wang,Linke Ouyang,Songyang Zhang,Wenwei Zhang,Yining Li,Yang Gao,Peng Sun,Xinyue Zhang,Wei Li,Jingwen Li,Wenhai Wang,Hang Yan,Conghui He,Xingcheng Zhang,Kai Chen,Jifeng Dai,Yu Qiao,Dahua Lin,Jiaqi Wang
关键词: versatile large-vision language, supports long-contextual input, large-vision language model, versatile large-vision, large-vision language
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
*备注: Technical Report. this https URL

点击查看摘要

Abstract:We present InternLM-XComposer-2.5 (IXC-2.5), a versatile large-vision language model that supports long-contextual input and output. IXC-2.5 excels in various text-image comprehension and composition applications, achieving GPT-4V level capabilities with merely 7B LLM backend. Trained with 24K interleaved image-text contexts, it can seamlessly extend to 96K long contexts via RoPE extrapolation. This long-context capability allows IXC-2.5 to excel in tasks requiring extensive input and output contexts. Compared to its previous 2.0 version, InternLM-XComposer-2.5 features three major upgrades in vision-language comprehension: (1) Ultra-High Resolution Understanding, (2) Fine-Grained Video Understanding, and (3) Multi-Turn Multi-Image Dialogue. In addition to comprehension, IXC-2.5 extends to two compelling applications using extra LoRA parameters for text-image composition: (1) Crafting Webpages and (2) Composing High-Quality Text-Image Articles. IXC-2.5 has been evaluated on 28 benchmarks, outperforming existing open-source state-of-the-art models on 16 benchmarks. It also surpasses or competes closely with GPT-4V and Gemini Pro on 16 key tasks. The InternLM-XComposer-2.5 is publicly available at this https URL.

[CV-1] BACON: Supercharge Your VLM with Bag-of-Concept Graph to Mitigate Hallucinations

链接: https://arxiv.org/abs/2407.03314
作者: Zhantao Yang,Ruili Feng,Keyu Yan,Huangji Wang,Zhicai Wang,Shangwen Zhu,Han Zhang,Jie Xiao,Pingyu Wu,Kai Zhu,Jixuan Chen,Chen-Wei Xie,Chaojie Mao,Yue Yang,Hongyang Zhang,Yu Liu,Fan Cheng
关键词: Vision Language Models, Vision Language, limited linguistic abilities, Language Models, visual question answering
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Databases (cs.DB)
*备注:

点击查看摘要

Abstract:This paper presents Bag-of-Concept Graph (BACON) to gift models with limited linguistic abilities to taste the privilege of Vision Language Models (VLMs) and boost downstream tasks such as detection, visual question answering (VQA), and image generation. Since the visual scenes in physical worlds are structured with complex relations between objects, BACON breaks down annotations into basic minimum elements and presents them in a graph structure. Element-wise style enables easy understanding, and structural composition liberates difficult locating. Careful prompt design births the BACON captions with the help of public-available VLMs and segmentation methods. In this way, we gather a dataset with 100K annotated images, which endow VLMs with remarkable capabilities, such as accurately generating BACON, transforming prompts into BACON format, envisioning scenarios in the style of BACONr, and dynamically modifying elements within BACON through interactive dialogue and more. Wide representative experiments, including detection, VQA, and image generation tasks, tell BACON as a lifeline to achieve previous out-of-reach tasks or excel in their current cutting-edge solutions.

[CV-2] Advanced Smart City Monitoring: Real-Time Identification of Indian Citizen Attributes

链接: https://arxiv.org/abs/2407.03305
作者: Shubham Kale,Shashank Sharma,Abhilash Khuntia
关键词: smart surveillance system, Indian cities, analyze people attributes, real time, project focuses
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 6 pages , 8 figure , changed title and some alignment issue were resolved, but other contents remains same

点击查看摘要

Abstract:This project focuses on creating a smart surveillance system for Indian cities that can identify and analyze people’s attributes in real time. Using advanced technologies like artificial intelligence and machine learning, the system can recognize attributes such as upper body color, what the person is wearing, accessories they are wearing, headgear, etc., and analyze behavior through cameras installed around the city.

[CV-3] DisCo-Diff: Enhancing Continuous Diffusion Models with Discrete Latents

链接: https://arxiv.org/abs/2407.03300
作者: Yilun Xu,Gabriele Corso,Tommi Jaakkola,Arash Vahdat,Karsten Kreis
关键词: Gaussian distribution, Diffusion models, discrete latents, Diffusion, Latent Variable Diffusion
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: project page: this https URL

点击查看摘要

Abstract:Diffusion models (DMs) have revolutionized generative learning. They utilize a diffusion process to encode data into a simple Gaussian distribution. However, encoding a complex, potentially multimodal data distribution into a single continuous Gaussian distribution arguably represents an unnecessarily challenging learning problem. We propose Discrete-Continuous Latent Variable Diffusion Models (DisCo-Diff) to simplify this task by introducing complementary discrete latent variables. We augment DMs with learnable discrete latents, inferred with an encoder, and train DM and encoder end-to-end. DisCo-Diff does not rely on pre-trained networks, making the framework universally applicable. The discrete latents significantly simplify learning the DM’s complex noise-to-data mapping by reducing the curvature of the DM’s generative ODE. An additional autoregressive transformer models the distribution of the discrete latents, a simple step because DisCo-Diff requires only few discrete variables with small codebooks. We validate DisCo-Diff on toy data, several image synthesis tasks as well as molecular docking, and find that introducing discrete latents consistently improves model performance. For example, DisCo-Diff achieves state-of-the-art FID scores on class-conditioned ImageNet-64/128 datasets with ODE sampler.

[CV-4] Improved Noise Schedule for Diffusion Training

链接: https://arxiv.org/abs/2407.03297
作者: Tiankai Hang,Shuyang Gu
关键词: generating visual signals, visual signals, facto choice, choice for generating, generating visual
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Diffusion models have emerged as the de facto choice for generating visual signals. However, training a single model to predict noise across various levels poses significant challenges, necessitating numerous iterations and incurring significant computational costs. Various approaches, such as loss weighting strategy design and architectural refinements, have been introduced to expedite convergence. In this study, we propose a novel approach to design the noise schedule for enhancing the training of diffusion models. Our key insight is that the importance sampling of the logarithm of the Signal-to-Noise ratio (logSNR), theoretically equivalent to a modified noise schedule, is particularly beneficial for training efficiency when increasing the sample frequency around \log \textSNR=0 . We empirically demonstrate the superiority of our noise schedule over the standard cosine schedule. Furthermore, we highlight the advantages of our noise schedule design on the ImageNet benchmark, showing that the designed schedule consistently benefits different prediction targets.

[CV-5] Biomechanics-informed Non-rigid Medical Image Registration and its Inverse Material Property Estimation with Linear and Nonlinear Elasticity

链接: https://arxiv.org/abs/2407.03292
作者: Zhe Min,Zachary M.C. Baum,Shaheer U. Saeed,Mark Emberton,Dean C. Barratt,Zeike A. Taylor,Yipeng Hu
关键词: physics-informed neural networks, soft tissues, neural networks, paper investigates, material properties
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This paper investigates both biomechanical-constrained non-rigid medical image registrations and accurate identifications of material properties for soft tissues, using physics-informed neural networks (PINNs). The complex nonlinear elasticity theory is leveraged to formally establish the partial differential equations (PDEs) representing physics laws of biomechanical constraints that need to be satisfied, with which registration and identification tasks are treated as forward (i.e., data-driven solutions of PDEs) and inverse (i.e., parameter estimation) problems under PINNs respectively. Two net configurations (i.e., Cfg1 and Cfg2) have also been compared for both linear and nonlinear physics model. Two sets of experiments have been conducted, using pairs of undeformed and deformed MR images from clinical cases of prostate cancer biopsy. Our contributions are summarised as follows. 1) We developed a learning-based biomechanical-constrained non-rigid registration algorithm using PINNs, where linear elasticity is generalised to the nonlinear version. 2) We demonstrated extensively that nonlinear elasticity shows no statistical significance against linear models in computing point-wise displacement vectors but their respective benefits may depend on specific patients, with finite-element (FE) computed ground-truth. 3) We formulated and solved the inverse parameter estimation problem, under the joint optimisation scheme of registration and parameter identification using PINNs, whose solutions can be accurately found by locating saddle points. Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2407.03292 [cs.CV] (or arXiv:2407.03292v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2407.03292 Focus to learn more arXiv-issued DOI via DataCite

[CV-6] VCHAR:Variance-Driven Complex Human Activity Recognition framework with Generative Representation

链接: https://arxiv.org/abs/2407.03291
作者: Yuan Sun,Navid Salami Pargoo,Taqiya Ehsan,Zhao Zhang Jorge Ortiz
关键词: Complex human activity, human activity recognition, complex activity recognition, activity recognition, human activity
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Complex human activity recognition (CHAR) remains a pivotal challenge within ubiquitous computing, especially in the context of smart environments. Existing studies typically require meticulous labeling of both atomic and complex activities, a task that is labor-intensive and prone to errors due to the scarcity and inaccuracies of available datasets. Most prior research has focused on datasets that either precisely label atomic activities or, at minimum, their sequence approaches that are often impractical in real world this http URL response, we introduce VCHAR (Variance-Driven Complex Human Activity Recognition), a novel framework that treats the outputs of atomic activities as a distribution over specified intervals. Leveraging generative methodologies, VCHAR elucidates the reasoning behind complex activity classifications through video-based explanations, accessible to users without prior machine learning expertise. Our evaluation across three publicly available datasets demonstrates that VCHAR enhances the accuracy of complex activity recognition without necessitating precise temporal or sequential labeling of atomic activities. Furthermore, user studies confirm that VCHAR’s explanations are more intelligible compared to existing methods, facilitating a broader understanding of complex activity recognition among non-experts.

[CV-7] For a semiotic AI: Bridging computer vision and visual semiotics for computational observation of large scale facial image archives

链接: https://arxiv.org/abs/2407.03268
作者: Lia Morra,Antonio Santangelo,Pietro Basci,Luca Piano,Fabio Garcea,Fabrizio Lamberti,Massimo Leone
关键词: arguably changing, networks are creating, bodies is arguably, Social networks, human faces
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Social networks are creating a digital world in which the cognitive, emotional, and pragmatic value of the imagery of human faces and bodies is arguably changing. However, researchers in the digital humanities are often ill-equipped to study these phenomena at scale. This work presents FRESCO (Face Representation in E-Societies through Computational Observation), a framework designed to explore the socio-cultural implications of images on social media platforms at scale. FRESCO deconstructs images into numerical and categorical variables using state-of-the-art computer vision techniques, aligning with the principles of visual semiotics. The framework analyzes images across three levels: the plastic level, encompassing fundamental visual features like lines and colors; the figurative level, representing specific entities or concepts; and the enunciation level, which focuses particularly on constructing the point of view of the spectator and observer. These levels are analyzed to discern deeper narrative layers within the imagery. Experimental validation confirms the reliability and utility of FRESCO, and we assess its consistency and precision across two public datasets. Subsequently, we introduce the FRESCO score, a metric derived from the framework’s output that serves as a reliable measure of similarity in image content.

[CV-8] A Unified Framework for 3D Scene Understanding

链接: https://arxiv.org/abs/2407.03263
作者: Wei Xu,Chunsheng Shi,Sifan Tu,Xin Zhou,Dingkang Liang,Xiang Bai
关键词: open-vocabulary semantic segmentation, achieves panoptic, single model, open-vocabulary semantic, semantic segmentation tasks
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: The code will be available at this https URL

点击查看摘要

Abstract:We propose UniSeg3D, a unified 3D segmentation framework that achieves panoptic, semantic, instance, interactive, referring, and open-vocabulary semantic segmentation tasks within a single model. Most previous 3D segmentation approaches are specialized for a specific task, thereby limiting their understanding of 3D scenes to a task-specific perspective. In contrast, the proposed method unifies six tasks into unified representations processed by the same Transformer. It facilitates inter-task knowledge sharing and, therefore, promotes comprehensive 3D scene understanding. To take advantage of multi-task unification, we enhance the performance by leveraging task connections. Specifically, we design a knowledge distillation method and a contrastive learning method to transfer task-specific knowledge across different tasks. Benefiting from extensive inter-task knowledge sharing, our UniSeg3D becomes more powerful. Experiments on three benchmarks, including the ScanNet20, ScanRefer, and ScanNet200, demonstrate that the UniSeg3D consistently outperforms current SOTA methods, even those specialized for individual tasks. We hope UniSeg3D can serve as a solid unified baseline and inspire future work. The code will be available at this https URL.

[CV-9] ACTRESS: Active Retraining for Semi-supervised Visual Grounding

链接: https://arxiv.org/abs/2407.03251
作者: Weitai Kang,Mengxue Qu,Yunchao Wei,Yan Yan
关键词: Semi-Supervised Visual Grounding, Visual Grounding, sparse labeled data, visual grounding models, multimodel understanding
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Semi-Supervised Visual Grounding (SSVG) is a new challenge for its sparse labeled data with the need for multimodel understanding. A previous study, RefTeacher, makes the first attempt to tackle this task by adopting the teacher-student framework to provide pseudo confidence supervision and attention-based supervision. However, this approach is incompatible with current state-of-the-art visual grounding models, which follow the Transformer-based pipeline. These pipelines directly regress results without region proposals or foreground binary classification, rendering them unsuitable for fitting in RefTeacher due to the absence of confidence scores. Furthermore, the geometric difference in teacher and student inputs, stemming from different data augmentations, induces natural misalignment in attention-based constraints. To establish a compatible SSVG framework, our paper proposes the ACTive REtraining approach for Semi-Supervised Visual Grounding, abbreviated as ACTRESS. Initially, the model is enhanced by incorporating an additional quantized detection head to expose its detection confidence. Building upon this, ACTRESS consists of an active sampling strategy and a selective retraining strategy. The active sampling strategy iteratively selects high-quality pseudo labels by evaluating three crucial aspects: Faithfulness, Robustness, and Confidence, optimizing the utilization of unlabeled data. The selective retraining strategy retrains the model with periodic re-initialization of specific parameters, facilitating the model’s escape from local minima. Extensive experiments demonstrates our superior performance on widely-used benchmark datasets.

[CV-10] Visual Grounding with Attention-Driven Constraint Balancing

链接: https://arxiv.org/abs/2407.03243
作者: Weitai Kang,Luowei Zhou,Junyi Wu,Changchang Sun,Yan Yan
关键词: Grounding task necessitates, Unlike Object Detection, Visual Grounding task, Unlike Object, Grounding task
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Unlike Object Detection, Visual Grounding task necessitates the detection of an object described by complex free-form language. To simultaneously model such complex semantic and visual representations, recent state-of-the-art studies adopt transformer-based models to fuse features from both modalities, further introducing various modules that modulate visual features to align with the language expressions and eliminate the irrelevant redundant information. However, their loss function, still adopting common Object Detection losses, solely governs the bounding box regression output, failing to fully optimize for the above objectives. To tackle this problem, in this paper, we first analyze the attention mechanisms of transformer-based models. Building upon this, we further propose a novel framework named Attention-Driven Constraint Balancing (AttBalance) to optimize the behavior of visual features within language-relevant regions. Extensive experimental results show that our method brings impressive improvements. Specifically, we achieve constant improvements over five different models evaluated on four different benchmarks. Moreover, we attain a new state-of-the-art performance by integrating our method into QRNet.

[CV-11] Cyclic Refiner: Object-Aware Temporal Representation Learning for Multi-View 3D Detection and Tracking

链接: https://arxiv.org/abs/2407.03240
作者: Mingzhe Guo,Zhipeng Zhang,Liping Jing,Yuan He,Ke Wang,Heng Fan
关键词: cyclic learning model, cyclic learning, multi-view representation learning, temporal fusion, learning
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by IJCV

点击查看摘要

Abstract:We propose a unified object-aware temporal learning framework for multi-view 3D detection and tracking tasks. Having observed that the efficacy of the temporal fusion strategy in recent multi-view perception methods may be weakened by distractors and background clutters in historical frames, we propose a cyclic learning mechanism to improve the robustness of multi-view representation learning. The essence is constructing a backward bridge to propagate information from model predictions (e.g., object locations and sizes) to image and BEV features, which forms a circle with regular inference. After backward refinement, the responses of target-irrelevant regions in historical frames would be suppressed, decreasing the risk of polluting future frames and improving the object awareness ability of temporal fusion. We further tailor an object-aware association strategy for tracking based on the cyclic learning model. The cyclic learning model not only provides refined features, but also delivers finer clues (e.g., scale level) for tracklet association. The proposed cycle learning method and association module together contribute a novel and unified multi-task framework. Experiments on nuScenes show that the proposed model achieves consistent performance gains over baselines of different designs (i.e., dense query-based BEVFormer, sparse query-based SparseBEV and LSS-based BEVDet4D) on both detection and tracking evaluation.

[CV-12] MHNet: Multi-view High-order Network for Diagnosing Neurodevelopmental Disorders Using Resting-state fMRI

链接: https://arxiv.org/abs/2407.03217
作者: Yueyang Li,Weiming Zeng,Wenhao Dong,Luhui Cai,Lei Wang,Hongyu Chen,Hongjie Yan,Lingbin Bian,Nizhuan Wang
关键词: ASD and ADHD, diagnosing neurodevelopmental disorders, Space Features Extraction, Deep learning models, high-order features
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 18 pages

点击查看摘要

Abstract:Background: Deep learning models have shown promise in diagnosing neurodevelopmental disorders (NDD) like ASD and ADHD. However, many models either use graph neural networks (GNN) to construct single-level brain functional networks (BFNs) or employ spatial convolution filtering for local information extraction from rs-fMRI data, often neglecting high-order features crucial for NDD classification. Methods: We introduce a Multi-view High-order Network (MHNet) to capture hierarchical and high-order features from multi-view BFNs derived from rs-fMRI data for NDD prediction. MHNet has two branches: the Euclidean Space Features Extraction (ESFE) module and the Non-Euclidean Space Features Extraction (Non-ESFE) module, followed by a Feature Fusion-based Classification (FFC) module for NDD identification. ESFE includes a Functional Connectivity Generation (FCG) module and a High-order Convolutional Neural Network (HCNN) module to extract local and high-order features from BFNs in Euclidean space. Non-ESFE comprises a Generic Internet-like Brain Hierarchical Network Generation (G-IBHN-G) module and a High-order Graph Neural Network (HGNN) module to capture topological and high-order features in non-Euclidean space. Results: Experiments on three public datasets show that MHNet outperforms state-of-the-art methods using both AAL1 and Brainnetome Atlas templates. Extensive ablation studies confirm the superiority of MHNet and the effectiveness of using multi-view fMRI information and high-order features. Our study also offers atlas options for constructing more sophisticated hierarchical networks and explains the association between key brain regions and NDD. Conclusion: MHNet leverages multi-view feature learning from both Euclidean and non-Euclidean spaces, incorporating high-order information from BFNs to enhance NDD classification performance.

[CV-13] Learning Disentangled Representation in Object-Centric Models for Visual Dynamics Prediction via Transformers

链接: https://arxiv.org/abs/2407.03216
作者: Sanket Gandhi,Atul,Samanyu Mahajan,Vishal Sharma,Rushil Gupta,Arnab Kumar Mondal,Parag Singla
关键词: Recent work, bringing interpretability, learning disentangled representation, disentangled representation, Recent
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recent work has shown that object-centric representations can greatly help improve the accuracy of learning dynamics while also bringing interpretability. In this work, we take this idea one step further, ask the following question: “can learning disentangled representation further improve the accuracy of visual dynamics prediction in object-centric models?” While there has been some attempt to learn such disentangled representations for the case of static images \citepnsb, to the best of our knowledge, ours is the first work which tries to do this in a general setting for video, without making any specific assumptions about the kind of attributes that an object might have. The key building block of our architecture is the notion of a \em block, where several blocks together constitute an object. Each block is represented as a linear combination of a given number of learnable concept vectors, which is iteratively refined during the learning process. The blocks in our model are discovered in an unsupervised manner, by attending over object masks, in a style similar to discovery of slots \citepslot_attention, for learning a dense object-centric representation. We employ self-attention via transformers over the discovered blocks to predict the next state resulting in discovery of visual dynamics. We perform a series of experiments on several benchmark 2-D, and 3-D datasets demonstrating that our architecture (1) can discover semantically meaningful blocks (2) help improve accuracy of dynamics prediction compared to SOTA object-centric models (3) perform significantly better in OOD setting where the specific attribute combinations are not seen earlier during training. Our experiments highlight the importance discovery of disentangled representation for visual dynamics prediction.

[CV-14] Category-Aware Dynamic Label Assignment with High-Quality Oriented Proposal

链接: https://arxiv.org/abs/2407.03205
作者: Mingkui Feng,Hancheng Yu,Xiaoyu Dang,Ming Zhou
关键词: exhibit arbitrary orientations, typically embedded, arbitrary oriented objects, conformer RPN head, arbitrary orientations
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Objects in aerial images are typically embedded in complex backgrounds and exhibit arbitrary orientations. When employing oriented bounding boxes (OBB) to represent arbitrary oriented objects, the periodicity of angles could lead to discontinuities in label regression values at the boundaries, inducing abrupt fluctuations in the loss function. To address this problem, an OBB representation based on the complex plane is introduced in the oriented detection framework, and a trigonometric loss function is proposed. Moreover, leveraging prior knowledge of complex background environments and significant differences in large objects in aerial images, a conformer RPN head is constructed to predict angle information. The proposed loss function and conformer RPN head jointly generate high-quality oriented proposals. A category-aware dynamic label assignment based on predicted category feedback is proposed to address the limitations of solely relying on IoU for proposal label assignment. This method makes negative sample selection more representative, ensuring consistency between classification and regression features. Experiments were conducted on four realistic oriented detection datasets, and the results demonstrate superior performance in oriented object detection with minimal parameter tuning and time costs. Specifically, mean average precision (mAP) scores of 82.02%, 71.99%, 69.87%, and 98.77% were achieved on the DOTA-v1.0, DOTA-v1.5, DIOR-R, and HRSC2016 datasets, respectively.

[CV-15] Expressive Gaussian Human Avatars from Monocular RGB Video

链接: https://arxiv.org/abs/2407.03204
作者: Hezhen Hu,Zhiwen Fan,Tianhao Wu,Yihan Xi,Seoyoung Lee,Georgios Pavlakos,Zhangyang Wang
关键词: digital human representations, Nuanced expressiveness, realism and vitality, vitality of digital, human representations
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Nuanced expressiveness, particularly through fine-grained hand and facial expressions, is pivotal for enhancing the realism and vitality of digital human representations. In this work, we focus on investigating the expressiveness of human avatars when learned from monocular RGB video; a setting that introduces new challenges in capturing and animating fine-grained details. To this end, we introduce EVA, a drivable human model that meticulously sculpts fine details based on 3D Gaussians and SMPL-X, an expressive parametric human model. Focused on enhancing expressiveness, our work makes three key contributions. First, we highlight the critical importance of aligning the SMPL-X model with RGB frames for effective avatar learning. Recognizing the limitations of current SMPL-X prediction methods for in-the-wild videos, we introduce a plug-and-play module that significantly ameliorates misalignment issues. Second, we propose a context-aware adaptive density control strategy, which is adaptively adjusting the gradient thresholds to accommodate the varied granularity across body parts. Last but not least, we develop a feedback mechanism that predicts per-pixel confidence to better guide the learning of 3D Gaussians. Extensive experiments on two benchmarks demonstrate the superiority of our framework both quantitatively and qualitatively, especially on the fine-grained hand and facial details. See the project website at \urlthis https URL

[CV-16] SegVG: Transferring Object Bounding Box to Segmentation for Visual Grounding

链接: https://arxiv.org/abs/2407.03200
作者: Weitai Kang,Gaowen Liu,Mubarak Shah,Yan Yan
关键词: Object Detection, Visual Grounding deals, Visual Grounding, deals with detecting, detecting a bounding
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to ECCV 2024

点击查看摘要

Abstract:Different from Object Detection, Visual Grounding deals with detecting a bounding box for each text-image pair. This one box for each text-image data provides sparse supervision signals. Although previous works achieve impressive results, their passive utilization of annotation, i.e. the sole use of the box annotation as regression ground truth, results in a suboptimal performance. In this paper, we present SegVG, a novel method transfers the box-level annotation as Segmentation signals to provide an additional pixel-level supervision for Visual Grounding. Specifically, we propose the Multi-layer Multi-task Encoder-Decoder as the target grounding stage, where we learn a regression query and multiple segmentation queries to ground the target by regression and segmentation of the box in each decoding layer, respectively. This approach allows us to iteratively exploit the annotation as signals for both box-level regression and pixel-level segmentation. Moreover, as the backbones are typically initialized by pretrained parameters learned from unimodal tasks and the queries for both regression and segmentation are static learnable embeddings, a domain discrepancy remains among these three types of features, which impairs subsequent target grounding. To mitigate this discrepancy, we introduce the Triple Alignment module, where the query, text, and vision tokens are triangularly updated to share the same space by triple attention mechanism. Extensive experiments on five widely used datasets validate our state-of-the-art (SOTA) performance.

[CV-17] DyFADet: Dynamic Feature Aggregation for Temporal Action Detection

链接: https://arxiv.org/abs/2407.03197
作者: Le Yang,Ziwei Zheng,Yizeng Han,Hao Cheng,Shiji Song,Gao Huang,Fan Li
关键词: shared-weights detection heads, Temporal Action Detection, Recent proposed neural, neural network-based Temporal, network-based Temporal Action
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: ECCV 2024

点击查看摘要

Abstract:Recent proposed neural network-based Temporal Action Detection (TAD) models are inherently limited to extracting the discriminative representations and modeling action instances with various lengths from complex scenes by shared-weights detection heads. Inspired by the successes in dynamic neural networks, in this paper, we build a novel dynamic feature aggregation (DFA) module that can simultaneously adapt kernel weights and receptive fields at different timestamps. Based on DFA, the proposed dynamic encoder layer aggregates the temporal features within the action time ranges and guarantees the discriminability of the extracted representations. Moreover, using DFA helps to develop a Dynamic TAD head (DyHead), which adaptively aggregates the multi-scale features with adjusted parameters and learned receptive fields better to detect the action instances with diverse ranges from videos. With the proposed encoder layer and DyHead, a new dynamic TAD model, DyFADet, achieves promising performance on a series of challenging TAD benchmarks, including HACS-Segment, THUMOS14, ActivityNet-1.3, Epic-Kitchen 100, Ego4D-Moment QueriesV1.0, and FineAction. Code is released to this https URL.

[CV-18] Motion meets Attention: Video Motion Prompts

链接: https://arxiv.org/abs/2407.03179
作者: Qixiang Chen,Lei Wang,Piotr Koniusz,Tom Gedeon
关键词: rich spatio-temporal information, spatio-temporal information, rich spatio-temporal, motion, attention
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Research report

点击查看摘要

Abstract:Videos contain rich spatio-temporal information. Traditional methods for extracting motion, used in tasks such as action recognition, often rely on visual contents rather than precise motion features. This phenomenon is referred to as ‘blind motion extraction’ behavior, which proves inefficient in capturing motions of interest due to a lack of motion-guided cues. Recently, attention mechanisms have enhanced many computer vision tasks by effectively highlighting salient visual areas. Inspired by this, we propose using a modified Sigmoid function with learnable slope and shift parameters as an attention mechanism to activate and modulate motion signals derived from frame differencing maps. This approach generates a sequence of attention maps that enhance the processing of motion-related video content. To ensure temporally continuity and smoothness of the attention maps, we apply pair-wise temporal attention variation regularization to remove unwanted motions (e.g., noise) while preserving important ones. We then perform Hadamard product between each pair of attention maps and the original video frames to highlight the evolving motions of interest over time. These highlighted motions, termed video motion prompts, are subsequently used as inputs to the model instead of the original video frames. We formalize this process as a motion prompt layer and incorporate the regularization term into the loss function to learn better motion prompts. This layer serves as an adapter between the model and the video data, bridging the gap between traditional ‘blind motion extraction’ and the extraction of relevant motions of interest.

[CV-19] Relating CNN-Transformer Fusion Network for Change Detection

链接: https://arxiv.org/abs/2407.03178
作者: Yuhao Gao,Gensheng Pei,Mengmeng Sheng,Zeren Sun,Tao Chen,Yazhou Yao
关键词: revolutionized remote sensing, incomplete change learning, convolutional neural networks, miss crucial features, crucial features due
类目: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: accepted by IEEE Conference on Multimedia Expo

点击查看摘要

Abstract:While deep learning, particularly convolutional neural networks (CNNs), has revolutionized remote sensing (RS) change detection (CD), existing approaches often miss crucial features due to neglecting global context and incomplete change learning. Additionally, transformer networks struggle with low-level details. RCTNet addresses these limitations by introducing \textbf(1) an early fusion backbone to exploit both spatial and temporal features early on, \textbf(2) a Cross-Stage Aggregation (CSA) module for enhanced temporal representation, \textbf(3) a Multi-Scale Feature Fusion (MSF) module for enriched feature extraction in the decoder, and \textbf(4) an Efficient Self-deciphering Attention (ESA) module utilizing transformers to capture global information and fine-grained details for accurate change detection. Extensive experiments demonstrate RCTNet’s clear superiority over traditional RS image CD methods, showing significant improvement and an optimal balance between accuracy and computational cost.

[CV-20] IMC 2024 Methods Solutions Review

链接: https://arxiv.org/abs/2407.03172
作者: Shyam Gupta,Dhanisha Sharma,Songling Huang
关键词: focuses on solving, image reconstruction problem, Image Matching Challenge, Kaggle, Image Matching
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Applications (stat.AP)
*备注: 8 Pages, 9 figures

点击查看摘要

Abstract:For the past three years, Kaggle has been hosting the Image Matching Challenge, which focuses on solving a 3D image reconstruction problem using a collection of 2D images. Each year, this competition fosters the development of innovative and effective methodologies by its participants. In this paper, we introduce an advanced ensemble technique that we developed, achieving a score of 0.153449 on the private leaderboard and securing the 160th position out of over 1,000 participants. Additionally, we conduct a comprehensive review of existing methods and techniques employed by top-performing teams in the competition. Our solution, alongside the insights gathered from other leading approaches, contributes to the ongoing advancement in the field of 3D image reconstruction. This research provides valuable knowledge for future participants and researchers aiming to excel in similar image matching and reconstruction challenges.

[CV-21] LivePortrait: Efficient Portrait Animation with Stitching and Retargeting Control

链接: https://arxiv.org/abs/2407.03168
作者: Jianzhu Guo,Dingyun Zhang,Xiaoqiang Liu,Zhizhou Zhong,Yuan Zhang,Pengfei Wan,Di Zhang
关键词: single source image, Portrait Animation aims, portrait animation framework, lifelike video, driving video
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Portrait Animation aims to synthesize a lifelike video from a single source image, using it as an appearance reference, with motion (i.e., facial expressions and head pose) derived from a driving video, audio, text, or generation. Instead of following mainstream diffusion-based methods, we explore and extend the potential of the implicit-keypoint-based framework, which effectively balances computational efficiency and controllability. Building upon this, we develop a video-driven portrait animation framework named LivePortrait with a focus on better generalization, controllability, and efficiency for practical usage. To enhance the generation quality and generalization ability, we scale up the training data to about 69 million high-quality frames, adopt a mixed image-video training strategy, upgrade the network architecture, and design better motion transformation and optimization objectives. Additionally, we discover that compact implicit keypoints can effectively represent a kind of blendshapes and meticulously propose a stitching and two retargeting modules, which utilize a small MLP with negligible computational overhead, to enhance the controllability. Experimental results demonstrate the efficacy of our framework even compared to diffusion-based methods. The generation speed remarkably reaches 12.8ms on an RTX 4090 GPU with PyTorch. The inference code and models are available at this https URL

[CV-22] Consistent Point Orientation for Manifold Surfaces via Boundary Integration

链接: https://arxiv.org/abs/2407.03165
作者: Weizhou Liu,Xingce Wang,Haichuan Zhao,Xingfei Xue,Zhongke Wu,Xuequan Lu,Ying He
关键词: globally consistent normals, generating globally consistent, point clouds sampled, globally consistent, consistent normals
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
*备注: accepted in siggraph2024

点击查看摘要

Abstract:This paper introduces a new approach for generating globally consistent normals for point clouds sampled from manifold surfaces. Given that the generalized winding number (GWN) field generated by a point cloud with globally consistent normals is a solution to a PDE with jump boundary conditions and possesses harmonic properties, and the Dirichlet energy of the GWN field can be defined as an integral over the boundary surface, we formulate a boundary energy derived from the Dirichlet energy of the GWN. Taking as input a point cloud with randomly oriented normals, we optimize this energy to restore the global harmonicity of the GWN field, thereby recovering the globally consistent normals. Experiments show that our method outperforms state-of-the-art approaches, exhibiting enhanced robustness to noise, outliers, complex topologies, and thin structures. Our code can be found at \urlthis https URL.

[CV-23] Global Context Modeling in YOLOv8 for Pediatric Wrist Fracture Detection

链接: https://arxiv.org/abs/2407.03163
作者: Rui-Yang Ju,Chun-Tse Chien,Chia-Min Lin,Jen-Shiun Chiang
关键词: interpret X-ray images, suffer wrist injuries, Children often suffer, fracture injuring radiologists, interpret X-ray
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Children often suffer wrist injuries in daily life, while fracture injuring radiologists usually need to analyze and interpret X-ray images before surgical treatment by surgeons. The development of deep learning has enabled neural network models to work as computer-assisted diagnosis (CAD) tools to help doctors and experts in diagnosis. Since the YOLOv8 models have obtained the satisfactory success in object detection tasks, it has been applied to fracture detection. The Global Context (GC) block effectively models the global context in a lightweight way, and incorporating it into YOLOv8 can greatly improve the model performance. This paper proposes the YOLOv8+GC model for fracture detection, which is an improved version of the YOLOv8 model with the GC block. Experimental results demonstrate that compared to the original YOLOv8 model, the proposed YOLOv8-GC model increases the mean average precision calculated at intersection over union threshold of 0.5 (mAP 50) from 63.58% to 66.32% on the GRAZPEDWRI-DX dataset, achieving the state-of-the-art (SOTA) level. The implementation code for this work is available on GitHub at this https URL.

[CV-24] Bunny-VisionPro: Real-Time Bimanual Dexterous Teleoperation for Imitation Learning

链接: https://arxiv.org/abs/2407.03162
作者: Runyu Ding,Yuzhe Qin,Jiyue Zhu,Chengzhe Jia,Shiqi Yang,Ruihan Yang,Xiaojuan Qi,Xiaolong Wang
关键词: collecting human demonstrations, remains a challenge, collecting human, controlling robots, dexterous hands remains
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: project page: this https URL

点击查看摘要

Abstract:Teleoperation is a crucial tool for collecting human demonstrations, but controlling robots with bimanual dexterous hands remains a challenge. Existing teleoperation systems struggle to handle the complexity of coordinating two hands for intricate manipulations. We introduce Bunny-VisionPro, a real-time bimanual dexterous teleoperation system that leverages a VR headset. Unlike previous vision-based teleoperation systems, we design novel low-cost devices to provide haptic feedback to the operator, enhancing immersion. Our system prioritizes safety by incorporating collision and singularity avoidance while maintaining real-time performance through innovative designs. Bunny-VisionPro outperforms prior systems on a standard task suite, achieving higher success rates and reduced task completion times. Moreover, the high-quality teleoperation demonstrations improve downstream imitation learning performance, leading to better generalizability. Notably, Bunny-VisionPro enables imitation learning with challenging multi-stage, long-horizon dexterous manipulation tasks, which have rarely been addressed in previous work. Our system’s ability to handle bimanual manipulations while prioritizing safety and real-time performance makes it a powerful tool for advancing dexterous manipulation and imitation learning.

[CV-25] Efficient Shapley Values for Attributing Global Properties of Diffusion Models to Data Group

链接: https://arxiv.org/abs/2407.03153
作者: Chris Lin,Mingyu Lu,Chanwoo Kim,Su-In Lee
关键词: ensure fair acknowledgment, real-world settings, harmful content, high-quality training data, deployed in real-world
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:As diffusion models are deployed in real-world settings, data attribution is needed to ensure fair acknowledgment for contributors of high-quality training data and to identify sources of harmful content. Previous work focuses on identifying individual training samples important for the generation of a given image. However, instead of focusing on a given generated image, some use cases require understanding global properties of the distribution learned by a diffusion model (e.g., demographic diversity). Furthermore, training data for diffusion models are often contributed in groups rather than separately (e.g., multiple artworks from the same artist). Hence, here we tackle the problem of attributing global properties of diffusion models to groups of training data. Specifically, we develop a method to efficiently estimate Shapley values by leveraging model pruning and fine-tuning. We empirically demonstrate the utility of our method with three use cases: (i) global image quality for a DDPM trained on a CIFAR dataset, (ii) demographic diversity for an LDM trained on CelebA-HQ, and (iii) overall aesthetic quality for a Stable Diffusion model LoRA-finetuned on Post-Impressionist artworks.

[CV-26] Stereo Risk: A Continuous Modeling Approach to Stereo Matching

链接: https://arxiv.org/abs/2407.03152
作者: Ce Liu,Suryansh Kumar,Shuhang Gu,Radu Timofte,Yao Yao,Luc Van Gool
关键词: introduce Stereo Risk, classical stereo-matching problem, Stereo Risk, computer vision, scene disparity
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted as an Oral Paper at ICML 2024. Draft info: 18 pages, 6 Figure, 16 Tables

点击查看摘要

Abstract:We introduce Stereo Risk, a new deep-learning approach to solve the classical stereo-matching problem in computer vision. As it is well-known that stereo matching boils down to a per-pixel disparity estimation problem, the popular state-of-the-art stereo-matching approaches widely rely on regressing the scene disparity values, yet via discretization of scene disparity values. Such discretization often fails to capture the nuanced, continuous nature of scene depth. Stereo Risk departs from the conventional discretization approach by formulating the scene disparity as an optimal solution to a continuous risk minimization problem, hence the name “stereo risk”. We demonstrate that L^1 minimization of the proposed continuous risk function enhances stereo-matching performance for deep networks, particularly for disparities with multi-modal probability distributions. Furthermore, to enable the end-to-end network training of the non-differentiable L^1 risk optimization, we exploited the implicit function theorem, ensuring a fully differentiable network. A comprehensive analysis demonstrates our method’s theoretical soundness and superior performance over the state-of-the-art methods across various benchmark datasets, including KITTI 2012, KITTI 2015, ETH3D, SceneFlow, and Middlebury 2014.

[CV-27] Enhancing Class Fairness in Classification with A Two-Player Game Approach

链接: https://arxiv.org/abs/2407.03146
作者: Yunpeng Jiang,Paul Weng,Yutong Ban
关键词: machine learning tasks, Data augmentation, widely applied, shown its benefits, machine learning
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Data augmentation is widely applied and has shown its benefits in different machine learning tasks. However, as recently observed in some downstream tasks, data augmentation may introduce an unfair impact on classifications. While it can improve the performance of some classes, it can actually be detrimental for other classes, which can be problematic in some application domains. In this paper, to counteract this phenomenon, we propose a FAir Classification approach with a Two-player game (FACT). We first formulate the training of a classifier with data augmentation as a fair optimization problem, which can be further written as an adversarial two-player game. Following this formulation, we propose a novel multiplicative weight optimization algorithm, for which we theoretically prove that it can converge to a solution that is fair over classes. Interestingly, our formulation also reveals that this fairness issue over classes is not due to data augmentation only, but is in fact a general phenomenon. Our empirical experiments demonstrate that the performance of our learned classifiers is indeed more fairly distributed over classes in five datasets, with only limited impact on the average accuracy.

[CV-28] Venomancer: Towards Imperceptible and Target-on-Demand Backdoor Attacks in Federated Learning

链接: https://arxiv.org/abs/2407.03144
作者: Son Nguyen,Thinh Nguyen,Khoa Doan,Kok-Seng Wong
关键词: distributed machine learning, machine learning approach, Federated Learning, maintains data privacy, machine learning
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Federated Learning (FL) is a distributed machine learning approach that maintains data privacy by training on decentralized data sources. Similar to centralized machine learning, FL is also susceptible to backdoor attacks. Most backdoor attacks in FL assume a predefined target class and require control over a large number of clients or knowledge of benign clients’ information. Furthermore, they are not imperceptible and are easily detected by human inspection due to clear artifacts left on the poison data. To overcome these challenges, we propose Venomancer, an effective backdoor attack that is imperceptible and allows target-on-demand. Specifically, imperceptibility is achieved by using a visual loss function to make the poison data visually indistinguishable from the original data. Target-on-demand property allows the attacker to choose arbitrary target classes via conditional adversarial training. Additionally, experiments showed that the method is robust against state-of-the-art defenses such as Norm Clipping, Weak DP, Krum, and Multi-Krum. The source code is available at https://anonymous.4open.science/r/Venomancer-3426.

[CV-29] Machine Learning Models for Improved Tracking from Range-Doppler Map Images

链接: https://arxiv.org/abs/2407.03140
作者: Elizabeth Hou,Ross Greenwood,Piyush Kumar
关键词: Statistical tracking filters, accurate target measurements, Moving Target Indicator, Ground Moving Target, tracking filters depend
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Statistical tracking filters depend on accurate target measurements and uncertainty estimates for good tracking performance. In this work, we propose novel machine learning models for target detection and uncertainty estimation in range-Doppler map (RDM) images for Ground Moving Target Indicator (GMTI) radars. We show that by using the outputs of these models, we can significantly improve the performance of a multiple hypothesis tracker for complex multi-target air-to-ground tracking scenarios.

[CV-30] owards Efficient Pixel Labeling for Industrial Anomaly Detection and Localization

链接: https://arxiv.org/abs/2407.03130
作者: Hanxi Li,Jingqi Wu,Lin Yuanbo Wu,Hao Chen,Deyin Liu,Chunhua Shen
关键词: anomalous pixels proves, costly endeavor, realm of practical, anomalous pixels, pixels proves
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 18 pages, 5 figures

点击查看摘要

Abstract:In the realm of practical Anomaly Detection (AD) tasks, manual labeling of anomalous pixels proves to be a costly endeavor. Consequently, many AD methods are crafted as one-class classifiers, tailored for training sets completely devoid of anomalies, ensuring a more cost-effective approach. While some pioneering work has demonstrated heightened AD accuracy by incorporating real anomaly samples in training, this enhancement comes at the price of labor-intensive labeling processes. This paper strikes the balance between AD accuracy and labeling expenses by introducing ADClick, a novel Interactive Image Segmentation (IIS) algorithm. ADClick efficiently generates “ground-truth” anomaly masks for real defective images, leveraging innovative residual features and meticulously crafted language prompts. Notably, ADClick showcases a significantly elevated generalization capacity compared to existing state-of-the-art IIS approaches. Functioning as an anomaly labeling tool, ADClick generates high-quality anomaly labels (AP = 94.1% on MVTec AD) based on only 3 to 5 manual click annotations per training image. Furthermore, we extend the capabilities of ADClick into ADClick-Seg, an enhanced model designed for anomaly detection and localization. By fine-tuning the ADClick-Seg model using the weak labels inferred by ADClick, we establish the state-of-the-art performances in supervised AD tasks (AP = 86.4% on MVTec AD and AP = 78.4% , PRO = 98.6% on KSDD2).

[CV-31] L_p-norm Distortion-Efficient Adversarial Attack

链接: https://arxiv.org/abs/2407.03115
作者: Chao Zhou,Yuan-Gen Wang,Zi-jia Wang,Xiangui Kang
关键词: well-trained model misclassified, norm, Adversarial, norm distortion, norm based methods
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Adversarial examples have shown a powerful ability to make a well-trained model misclassified. Current mainstream adversarial attack methods only consider one of the distortions among L_0 -norm, L_2 -norm, and L_\infty -norm. L_0 -norm based methods cause large modification on a single pixel, resulting in naked-eye visible detection, while L_2 -norm and L_\infty -norm based methods suffer from weak robustness against adversarial defense since they always diffuse tiny perturbations to all pixels. A more realistic adversarial perturbation should be sparse and imperceptible. In this paper, we propose a novel L_p -norm distortion-efficient adversarial attack, which not only owns the least L_2 -norm loss but also significantly reduces the L_0 -norm distortion. To this aim, we design a new optimization scheme, which first optimizes an initial adversarial perturbation under L_2 -norm constraint, and then constructs a dimension unimportance matrix for the initial perturbation. Such a dimension unimportance matrix can indicate the adversarial unimportance of each dimension of the initial perturbation. Furthermore, we introduce a new concept of adversarial threshold for the dimension unimportance matrix. The dimensions of the initial perturbation whose unimportance is higher than the threshold will be all set to zero, greatly decreasing the L_0 -norm distortion. Experimental results on three benchmark datasets show that under the same query budget, the adversarial examples generated by our method have lower L_0 -norm and L_2 -norm distortion than the state-of-the-art. Especially for the MNIST dataset, our attack reduces 8.1 % L_2 -norm distortion meanwhile remaining 47 % pixels unattacked. This demonstrates the superiority of the proposed method over its competitors in terms of adversarial robustness and visual imperceptibility.

[CV-32] Anti-Collapse Loss for Deep Metric Learning Based on Coding Rate Metric

链接: https://arxiv.org/abs/2407.03106
作者: Xiruo Jiang,Yazhou Yao,Xili Dai,Fumin Shen,Xian-Sheng Hua,Heng-Tao Shen
关键词: Deep metric learning, Deep metric, discriminative high-dimensional embedding, metric learning, aims to learn
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: accepted by IEEE Transactions on Multimedia

点击查看摘要

Abstract:Deep metric learning (DML) aims to learn a discriminative high-dimensional embedding space for downstream tasks like classification, clustering, and retrieval. Prior literature predominantly focuses on pair-based and proxy-based methods to maximize inter-class discrepancy and minimize intra-class diversity. However, these methods tend to suffer from the collapse of the embedding space due to their over-reliance on label information. This leads to sub-optimal feature representation and inferior model performance. To maintain the structure of embedding space and avoid feature collapse, we propose a novel loss function called Anti-Collapse Loss. Specifically, our proposed loss primarily draws inspiration from the principle of Maximal Coding Rate Reduction. It promotes the sparseness of feature clusters in the embedding space to prevent collapse by maximizing the average coding rate of sample features or class proxies. Moreover, we integrate our proposed loss with pair-based and proxy-based methods, resulting in notable performance improvement. Comprehensive experiments on benchmark datasets demonstrate that our proposed method outperforms existing state-of-the-art methods. Extensive ablation studies verify the effectiveness of our method in preventing embedding space collapse and promoting generalization performance.

[CV-33] KeyVideoLLM: Towards Large-scale Video Keyframe Selection

链接: https://arxiv.org/abs/2407.03104
作者: Hao Liang,Jiapeng Li,Tianyi Bai,Chong Chen,Conghui He,Bin Cui,Wentao Zhang
关键词: Large Language Models, Video Large Language, increasingly important, understanding large-scale video, rise of web
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Multimedia (cs.MM)
*备注:

点击查看摘要

Abstract:Recently, with the rise of web videos, managing and understanding large-scale video datasets has become increasingly important. Video Large Language Models (VideoLLMs) have emerged in recent years due to their strong video understanding capabilities. However, training and inference processes for VideoLLMs demand vast amounts of data, presenting significant challenges to data management, particularly regarding efficiency, robustness, and effectiveness. In this work, we present KeyVideoLLM, a text-video frame similarity-based keyframe selection method designed to manage VideoLLM data efficiently, robustly, and effectively. Specifically, KeyVideoLLM achieves a remarkable data compression rate of up to 60.9 times, substantially lowering disk space requirements, which proves its high efficiency. Additionally, it maintains a 100% selection success rate across all video formats and scales, enhances processing speed by up to 200 times compared to existing keyframe selection methods, and does not require hyperparameter tuning. Beyond its outstanding efficiency and robustness, KeyVideoLLM further improves model performance in video question-answering tasks during both training and inference stages. Notably, it consistently achieved the state-of-the-art (SoTA) experimental results on diverse datasets.

[CV-34] Improving Zero-shot Generalization of Learned Prompts via Unsupervised Knowledge Distillation

链接: https://arxiv.org/abs/2407.03056
作者: Marco Mistretta,Alberto Baldrati,Marco Bertini,Andrew D. Bagdanov
关键词: limited data, Vision-Language Models, Prompt learning, unseen tasks, downstream tasks
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted for publication at ECCV24

点击查看摘要

Abstract:Vision-Language Models (VLMs) demonstrate remarkable zero-shot generalization to unseen tasks, but fall short of the performance of supervised methods in generalizing to downstream tasks with limited data. Prompt learning is emerging as a parameter-efficient method for adapting VLMs, but state-of-the-art approaches require annotated samples. In this paper we propose a novel approach to prompt learning based on unsupervised knowledge distillation from more powerful models. Our approach, which we call Knowledge Distillation Prompt Learning (KDPL), can be integrated into existing prompt learning techniques and eliminates the need for labeled examples during adaptation. Our experiments on more than ten standard benchmark datasets demonstrate that KDPL is very effective at improving generalization of learned prompts for zero-shot domain generalization, zero-shot cross-dataset generalization, and zero-shot base-to-novel class generalization problems. KDPL requires no ground-truth labels for adaptation, and moreover we show that even in the absence of any knowledge of training class names it can be used to effectively transfer knowledge. The code is publicly available at this https URL.

[CV-35] SlerpFace: Face Template Protection via Spherical Linear Interpolation

链接: https://arxiv.org/abs/2407.03043
作者: Zhizhou Zhong,Yuxi Mi,Yuge Huang,Jianqing Xu,Guodong Mu,Shouhong Ding,Jingyun Zhang,Rizen Guo,Yunsheng Wu,Shuigeng Zhou
关键词: Contemporary face recognition, Contemporary face, identify persons, Contemporary, face
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: face template protection

点击查看摘要

Abstract:Contemporary face recognition systems use feature templates extracted from face images to identify persons. To enhance privacy, face template protection techniques are widely employed to conceal sensitive identity and appearance information stored in the template. This paper identifies an emerging privacy attack form utilizing diffusion models that could nullify prior protection, referred to as inversion attacks. The attack can synthesize high-quality, identity-preserving face images from templates, revealing persons’ appearance. Based on studies of the diffusion model’s generative capability, this paper proposes a defense to deteriorate the attack, by rotating templates to a noise-like distribution. This is achieved efficiently by spherically and linearly interpolating templates, or slerp, on their located hypersphere. This paper further proposes to group-wisely divide and drop out templates’ feature dimensions, to enhance the irreversibility of rotated templates. The division of groups and dropouts within each group are learned in a recognition-favored way. The proposed techniques are concretized as a novel face template protection technique, SlerpFace. Extensive experiments show that SlerpFace provides satisfactory recognition accuracy and comprehensive privacy protection against inversion and other attack forms, superior to prior arts.

[CV-36] Position and Altitude of the Nao Camera Head from Two Points on the Soccer Field plus the Gravitational Direction

链接: https://arxiv.org/abs/2407.03041
作者: Stijn Oomes,Arnoud Visser
关键词: play soccer, Standard Platform League, good estimate, gravitational direction, current position
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注: to be published in the Proceedings of the RoboCup 2024 symposium - 12 pages

点击查看摘要

Abstract:To be able to play soccer, a robot needs a good estimate of its current position on the field. Ideally, multiple features are visible that have known locations. By applying trigonometry we can estimate the viewpoint from where this observation was actually made. Given that the Nao robots of the Standard Platform League have quite a limited field of view, a given camera frame typically only allows for one or two points to be recognized. In this paper we propose a method for determining the (x, y) coordinates on the field and the height h of the camera from the geometry of a simplified tetrahedron. This configuration is formed by two observed points on the ground plane plus the gravitational direction. When the distance between the two points is known, and the directions to the points plus the gravitational direction are measured, all dimensions of the tetrahedron can be determined. By performing these calculations with rational trigonometry instead of classical trigonometry, the computations turn out to be 28.7% faster, with equal numerical accuracy. The position of the head of the Nao can also be externally measured with the OptiTrack system. The difference between externally measured and internally predicted position from sensor data gives us mean absolute errors in the 3-6 centimeters range, when we estimated the gravitational direction from the vanishing point of the outer edges of the goal posts. Comments: to be published in the Proceedings of the RoboCup 2024 symposium - 12 pages Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV) MSC classes: 68 ACMclasses: I.4.8 Cite as: arXiv:2407.03041 [cs.RO] (or arXiv:2407.03041v1 [cs.RO] for this version) https://doi.org/10.48550/arXiv.2407.03041 Focus to learn more arXiv-issued DOI via DataCite

[CV-37] SAFT: Towards Out-of-Distribution Generalization in Fine-Tuning

链接: https://arxiv.org/abs/2407.03036
作者: Bac Nguyen,Stefan Uhlich,Fabien Cardinaux,Lukas Mauch,Marzieh Edraki,Aaron Courville
关键词: Handling distribution shifts, Handling distribution, poses a significant, distribution shifts, shifts from training
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Handling distribution shifts from training data, known as out-of-distribution (OOD) generalization, poses a significant challenge in the field of machine learning. While a pre-trained vision-language model like CLIP has demonstrated remarkable zero-shot performance, further adaptation of the model to downstream tasks leads to undesirable degradation for OOD data. In this work, we introduce Sparse Adaptation for Fine-Tuning (SAFT), a method that prevents fine-tuning from forgetting the general knowledge in the pre-trained model. SAFT only updates a small subset of important parameters whose gradient magnitude is large, while keeping the other parameters frozen. SAFT is straightforward to implement and conceptually simple. Extensive experiments show that with only 0.1% of the model parameters, SAFT can significantly improve the performance of CLIP. It consistently outperforms baseline methods across several benchmarks. On the few-shot learning benchmark of ImageNet and its variants, SAFT gives a gain of 5.15% on average over the conventional fine-tuning method in OOD settings.

[CV-38] ISWSST: Index-space-wave State Superposition Transformers for Multispectral Remotely Sensed Imagery Semantic Segmentation

链接: https://arxiv.org/abs/2407.03033
作者: Chang Li,Pengfei Zhang,Yu Wang
关键词: remotely sensed imagery, encoder generally leads, single domain feature, multispectral remotely sensed, MSRSI semantic segmentation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Currently the semantic segmentation task of multispectral remotely sensed imagery (MSRSI) faces the following problems: 1) Usually, only single domain feature (i.e., space domain or frequency domain) is considered; 2) downsampling operation in encoder generally leads to the accuracy loss of edge extraction; 3) multichannel features of MSRSI are not fully considered; and 4) prior knowledge of remote sensing is not fully utilized. To solve the aforementioned issues, an index-space-wave state superposition Transformer (ISWSST) is the first to be proposed for MSRSI semantic segmentation by the inspiration from quantum mechanics, whose superiority is as follows: 1) index, space and wave states are superposed or fused to simulate quantum superposition by adaptively voting decision (i.e., ensemble learning idea) for being a stronger classifier and improving the segmentation accuracy; 2) a lossless wavelet pyramid encoder-decoder module is designed to losslessly reconstruct image and simulate quantum entanglement based on wavelet transform and inverse wavelet transform for avoiding the edge extraction loss; 3) combining multispectral features (i.e. remote sensing index and channel attention mechanism) is proposed to accurately extract ground objects from original resolution images; and 4) quantum mechanics are introduced to interpret the underlying superiority of ISWSST. Experiments show that ISWSST is validated and superior to the state-of-the-art architectures for the MSRSI segmentation task, which improves the segmentation and edge extraction accuracy effectively. Codes will be available publicly after our paper is accepted.

[CV-39] An Organism Starts with a Single Pix-Cell: A Neural Cellular Diffusion for High-Resolution Image Synthesis

链接: https://arxiv.org/abs/2407.03018
作者: Marawan Elbatel,Konstantinos Kamnitsas,Xiaomeng Li
关键词: Generative Adversarial Networks, Generative modeling seeks, Generative modeling, enabling synthesis, Generative Cellular Automata
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: MICCAI 2024

点击查看摘要

Abstract:Generative modeling seeks to approximate the statistical properties of real data, enabling synthesis of new data that closely resembles the original distribution. Generative Adversarial Networks (GANs) and Denoising Diffusion Probabilistic Models (DDPMs) represent significant advancements in generative modeling, drawing inspiration from game theory and thermodynamics, respectively. Nevertheless, the exploration of generative modeling through the lens of biological evolution remains largely untapped. In this paper, we introduce a novel family of models termed Generative Cellular Automata (GeCA), inspired by the evolution of an organism from a single cell. GeCAs are evaluated as an effective augmentation tool for retinal disease classification across two imaging modalities: Fundus and Optical Coherence Tomography (OCT). In the context of OCT imaging, where data is scarce and the distribution of classes is inherently skewed, GeCA significantly boosts the performance of 11 different ophthalmological conditions, achieving a 12% increase in the average F1 score compared to conventional baselines. GeCAs outperform both diffusion methods that incorporate UNet or state-of-the art variants with transformer-based denoising models, under similar parameter constraints. Code is available at: this https URL.

[CV-40] Context-Aware Video Instance Segmentation

链接: https://arxiv.org/abs/2407.03010
作者: Seunghun Lee,Jiwan Seo,Kiljoon Han,Minwoo Choi,Sunghoon Im
关键词: enhance instance association, Video Instance Segmentation, Context-Aware Instance Tracker, integrating contextual information, contextual information adjacent
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Project page: this https URL

点击查看摘要

Abstract:In this paper, we introduce the Context-Aware Video Instance Segmentation (CAVIS), a novel framework designed to enhance instance association by integrating contextual information adjacent to each object. To efficiently extract and leverage this information, we propose the Context-Aware Instance Tracker (CAIT), which merges contextual data surrounding the instances with the core instance features to improve tracking accuracy. Additionally, we introduce the Prototypical Cross-frame Contrastive (PCC) loss, which ensures consistency in object-level features across frames, thereby significantly enhancing instance matching accuracy. CAVIS demonstrates superior performance over state-of-the-art methods on all benchmark datasets in video instance segmentation (VIS) and video panoptic segmentation (VPS). Notably, our method excels on the OVIS dataset, which is known for its particularly challenging videos.

[CV-41] Model Guidance via Explanations Turns Image Classifiers into Segmentation Models

链接: https://arxiv.org/abs/2407.03009
作者: Xiaoyan Yu,Jannik Franzen,Wojciech Samek,Marina M.-C. Höhne,Dagmar Kainmueller
关键词: Grad-CAM and LRP, explainable AI methods, methods like Grad-CAM, observed to resemble, image classification networks
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Heatmaps generated on inputs of image classification networks via explainable AI methods like Grad-CAM and LRP have been observed to resemble segmentations of input images in many cases. Consequently, heatmaps have also been leveraged for achieving weakly supervised segmentation with image-level supervision. On the other hand, losses can be imposed on differentiable heatmaps, which has been shown to serve for (1)~improving heatmaps to be more human-interpretable, (2)~regularization of networks towards better generalization, (3)~training diverse ensembles of networks, and (4)~for explicitly ignoring confounding input features. Due to the latter use case, the paradigm of imposing losses on heatmaps is often referred to as “Right for the right reasons”. We unify these two lines of research by investigating semi-supervised segmentation as a novel use case for the Right for the Right Reasons paradigm. First, we show formal parallels between differentiable heatmap architectures and standard encoder-decoder architectures for image segmentation. Second, we show that such differentiable heatmap architectures yield competitive results when trained with standard segmentation losses. Third, we show that such architectures allow for training with weak supervision in the form of image-level labels and small numbers of pixel-level labels, outperforming comparable encoder-decoder models. Code is available at \urlthis https URL.

[CV-42] Align and Aggregate: Compositional Reasoning with Video Alignment and Answer Aggregation for Video Question-Answering

链接: https://arxiv.org/abs/2407.03008
作者: Zhaohe Liao,Jiangtong Li,Li Niu,Liqing Zhang
关键词: recent progress made, consistent compositional reasoning, perform consistent compositional, methods typically function, compositional consistency
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 10 pages,CVPR

点击查看摘要

Abstract:Despite the recent progress made in Video Question-Answering (VideoQA), these methods typically function as black-boxes, making it difficult to understand their reasoning processes and perform consistent compositional reasoning. To address these challenges, we propose a \textitmodel-agnostic Video Alignment and Answer Aggregation (VA ^3 ) framework, which is capable of enhancing both compositional consistency and accuracy of existing VidQA methods by integrating video aligner and answer aggregator modules. The video aligner hierarchically selects the relevant video clips based on the question, while the answer aggregator deduces the answer to the question based on its sub-questions, with compositional consistency ensured by the information flow along question decomposition graph and the contrastive learning strategy. We evaluate our framework on three settings of the AGQA-Decomp dataset with three baseline methods, and propose new metrics to measure the compositional consistency of VidQA methods more comprehensively. Moreover, we propose a large language model (LLM) based automatic question decomposition pipeline to apply our framework to any VidQA dataset. We extend MSVD and NExT-QA datasets with it to evaluate our VA ^3 framework on broader scenarios. Extensive experiments show that our framework improves both compositional consistency and accuracy of existing methods, leading to more interpretable real-world VidQA models.

[CV-43] Frequency-Controlled Diffusion Model for Versatile Text-Guided Image-to-Image Translation

链接: https://arxiv.org/abs/2407.03006
作者: Xiang Gao,Zhengbo Xu,Junhan Zhao,Jiaying Liu
关键词: user-provided text prompts, allowing open-domain image, Latent Diffusion Model, Discrete Cosine Transform, diffusion model
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Proceedings of the 38th AAAI Conference on Artificial Intelligence (AAAI 2024)

点击查看摘要

Abstract:Recently, large-scale text-to-image (T2I) diffusion models have emerged as a powerful tool for image-to-image translation (I2I), allowing open-domain image translation via user-provided text prompts. This paper proposes frequency-controlled diffusion model (FCDiffusion), an end-to-end diffusion-based framework that contributes a novel solution to text-guided I2I from a frequency-domain perspective. At the heart of our framework is a feature-space frequency-domain filtering module based on Discrete Cosine Transform, which filters the latent features of the source image in the DCT domain, yielding filtered image features bearing different DCT spectral bands as different control signals to the pre-trained Latent Diffusion Model. We reveal that control signals of different DCT spectral bands bridge the source image and the T2I generated image in different correlations (e.g., style, structure, layout, contour, etc.), and thus enable versatile I2I applications emphasizing different I2I correlations, including style-guided content creation, image semantic manipulation, image scene translation, and image style translation. Different from related approaches, FCDiffusion establishes a unified text-guided I2I framework suitable for diverse image translation tasks simply by switching among different frequency control branches at inference time. The effectiveness and superiority of our method for text-guided I2I are demonstrated with extensive experiments both qualitatively and quantitatively. The code is publicly available at: this https URL.

[CV-44] VIVA: A Benchmark for Vision-Grounded Decision-Making with Human Values

链接: https://arxiv.org/abs/2407.03000
作者: Zhe Hu,Yixiao Ren,Jing Li,Yu Yin
关键词: VIsion-grounded decision-making driven, paper introduces VIVA, paper introduces, benchmark for VIsion-grounded, VIsion-grounded decision-making
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This paper introduces VIVA, a benchmark for VIsion-grounded decision-making driven by human VAlues. While most large vision-language models (VLMs) focus on physical-level skills, our work is the first to examine their multimodal capabilities in leveraging human values to make decisions under a vision-depicted situation. VIVA contains 1,062 images depicting diverse real-world situations and the manually annotated decisions grounded in them. Given an image there, the model should select the most appropriate action to address the situation and provide the relevant human values and reason underlying the decision. Extensive experiments based on VIVA show the limitation of VLMs in using human values to make multimodal decisions. Further analyses indicate the potential benefits of exploiting action consequences and predicted human values.

[CV-45] Graph and Skipped Transformer: Exploiting Spatial and Temporal Modeling Capacities for Efficient 3D Human Pose Estimation

链接: https://arxiv.org/abs/2407.02990
作者: Mengmeng Cui,Kunbo Zhang,Zhenan Sun
关键词: Human Pose Estimation, widespread research interest, attracted widespread research, Human Pose, Pose Estimation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

点击查看摘要

Abstract:In recent years, 2D-to-3D pose uplifting in monocular 3D Human Pose Estimation (HPE) has attracted widespread research interest. GNN-based methods and Transformer-based methods have become mainstream architectures due to their advanced spatial and temporal feature learning capacities. However, existing approaches typically construct joint-wise and frame-wise attention alignments in spatial and temporal domains, resulting in dense connections that introduce considerable local redundancy and computational overhead. In this paper, we take a global approach to exploit spatio-temporal information and realise efficient 3D HPE with a concise Graph and Skipped Transformer architecture. Specifically, in Spatial Encoding stage, coarse-grained body parts are deployed to construct Spatial Graph Network with a fully data-driven adaptive topology, ensuring model flexibility and generalizability across various poses. In Temporal Encoding and Decoding stages, a simple yet effective Skipped Transformer is proposed to capture long-range temporal dependencies and implement hierarchical feature aggregation. A straightforward Data Rolling strategy is also developed to introduce dynamic information into 2D pose sequence. Extensive experiments are conducted on Human3.6M, MPI-INF-3DHP and Human-Eva benchmarks. G-SFormer series methods achieve superior performances compared with previous state-of-the-arts with only around ten percent of parameters and significantly reduced computational complexity. Additionally, G-SFormer also exhibits outstanding robustness to inaccuracies in detected 2D poses.

[CV-46] YOLOv5 YOLOv8 and YOLOv10: The Go-To Detectors for Real-time Vision

链接: https://arxiv.org/abs/2407.02988
作者: Muhammad Hussain
关键词: object detection algorithm, paper presents, presents a comprehensive, object detection, detection algorithm
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:This paper presents a comprehensive review of the evolution of the YOLO (You Only Look Once) object detection algorithm, focusing on YOLOv5, YOLOv8, and YOLOv10. We analyze the architectural advancements, performance improvements, and suitability for edge deployment across these versions. YOLOv5 introduced significant innovations such as the CSPDarknet backbone and Mosaic Augmentation, balancing speed and accuracy. YOLOv8 built upon this foundation with enhanced feature extraction and anchor-free detection, improving versatility and performance. YOLOv10 represents a leap forward with NMS-free training, spatial-channel decoupled downsampling, and large-kernel convolutions, achieving state-of-the-art performance with reduced computational overhead. Our findings highlight the progressive enhancements in accuracy, efficiency, and real-time performance, particularly emphasizing their applicability in resource-constrained environments. This review provides insights into the trade-offs between model complexity and detection accuracy, offering guidance for selecting the most appropriate YOLO version for specific edge computing applications.

[CV-47] Unified Anomaly Detection methods on Edge Device using Knowledge Distillation and Quantization

链接: https://arxiv.org/abs/2407.02968
作者: Sushovan Jena,Arya Pulkit,Kajal Singh,Anoushka Banerjee,Sharad Joshi,Ananth Ganesh,Dinesh Singh,Arnav Bhavsar
关键词: visual inspection systems, fully integrated visual, integrated visual inspection, manufacturing in Industry, imperative for high-throughput
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Emerging Technologies (cs.ET)
*备注: 20 pages

点击查看摘要

Abstract:With the rapid advances in deep learning and smart manufacturing in Industry 4.0, there is an imperative for high-throughput, high-performance, and fully integrated visual inspection systems. Most anomaly detection approaches using defect detection datasets, such as MVTec AD, employ one-class models that require fitting separate models for each class. On the contrary, unified models eliminate the need for fitting separate models for each class and significantly reduce cost and memory requirements. Thus, in this work, we experiment with considering a unified multi-class setup. Our experimental study shows that multi-class models perform at par with one-class models for the standard MVTec AD dataset. Hence, this indicates that there may not be a need to learn separate object/class-wise models when the object classes are significantly different from each other, as is the case of the dataset considered. Furthermore, we have deployed three different unified lightweight architectures on the CPU and an edge device (NVIDIA Jetson Xavier NX). We analyze the quantized multi-class anomaly detection models in terms of latency and memory requirements for deployment on the edge device while comparing quantization-aware training (QAT) and post-training quantization (PTQ) for performance at different precision widths. In addition, we explored two different methods of calibration required in post-training scenarios and show that one of them performs notably better, highlighting its importance for unsupervised tasks. Due to quantization, the performance drop in PTQ is further compensated by QAT, which yields at par performance with the original 32-bit Floating point in two of the models considered.

[CV-48] 3D Multimodal Image Registration for Plant Phenotyping

链接: https://arxiv.org/abs/2407.02946
作者: Eric Stumpe,Gernot Bodner,Francesco Flagiello,Matthias Zeppelzauer
关键词: offers promising benefits, combined multimodal monitoring, multimodal monitoring system, multiple camera technologies, phenotyping offers promising
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 53 pages, 13 Figures, preprint submitted to Computers and Electronics in Agriculture

点击查看摘要

Abstract:The use of multiple camera technologies in a combined multimodal monitoring system for plant phenotyping offers promising benefits. Compared to configurations that only utilize a single camera technology, cross-modal patterns can be recorded that allow a more comprehensive assessment of plant phenotypes. However, the effective utilization of cross-modal patterns is dependent on precise image registration to achieve pixel-accurate alignment, a challenge often complicated by parallax and occlusion effects inherent in plant canopy imaging. In this study, we propose a novel multimodal 3D image registration method that addresses these challenges by integrating depth information from a time-of-flight camera into the registration process. By leveraging depth data, our method mitigates parallax effects and thus facilitates more accurate pixel alignment across camera modalities. Additionally, we introduce an automated mechanism to identify and differentiate different types of occlusions, thereby minimizing the introduction of registration errors. To evaluate the efficacy of our approach, we conduct experiments on a diverse image dataset comprising six distinct plant species with varying leaf geometries. Our results demonstrate the robustness of the proposed registration algorithm, showcasing its ability to achieve accurate alignment across different plant types and camera compositions. Compared to previous methods it is not reliant on detecting plant specific image features and can thereby be utilized for a wide variety of applications in plant sciences. The registration approach principally scales to arbitrary numbers of cameras with different resolutions and wavelengths. Overall, our study contributes to advancing the field of plant phenotyping by offering a robust and reliable solution for multimodal image registration. Comments: 53 pages, 13 Figures, preprint submitted to Computers and Electronics in Agriculture Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2407.02946 [cs.CV] (or arXiv:2407.02946v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2407.02946 Focus to learn more arXiv-issued DOI via DataCite

[CV-49] VEGS: View Extrapolation of Urban Scenes in 3D Gaussian Splatting using Learned Priors

链接: https://arxiv.org/abs/2407.02945
作者: Sungwon Hwang,Min-Jung Kim,Taewoong Kang,Jayeon Kang,Jaegul Choo
关键词: Neural rendering-based urban, Neural rendering-based, methods commonly rely, training camera, moving forward
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: The first two authors contributed equally. Project Page: this https URL

点击查看摘要

Abstract:Neural rendering-based urban scene reconstruction methods commonly rely on images collected from driving vehicles with cameras facing and moving forward. Although these methods can successfully synthesize from views similar to training camera trajectory, directing the novel view outside the training camera distribution does not guarantee on-par performance. In this paper, we tackle the Extrapolated View Synthesis (EVS) problem by evaluating the reconstructions on views such as looking left, right or downwards with respect to training camera distributions. To improve rendering quality for EVS, we initialize our model by constructing dense LiDAR map, and propose to leverage prior scene knowledge such as surface normal estimator and large-scale diffusion model. Qualitative and quantitative comparisons demonstrate the effectiveness of our methods on EVS. To the best of our knowledge, we are the first to address the EVS problem in urban scene reconstruction. Link to our project page: this https URL.

[CV-50] PosMLP-Video: Spatial and Temporal Relative Position Encoding for Efficient Video Recognition

链接: https://arxiv.org/abs/2407.02934
作者: Yanbin Hao,Diansong Zhou,Zhicai Wang,Chong-Wah Ngo,Meng Wang
关键词: vision Transformers, demonstrated remarkable performance, recent years, demonstrated remarkable, Transformers and MLPs
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In recent years, vision Transformers and MLPs have demonstrated remarkable performance in image understanding tasks. However, their inherently dense computational operators, such as self-attention and token-mixing layers, pose significant challenges when applied to spatio-temporal video data. To address this gap, we propose PosMLP-Video, a lightweight yet powerful MLP-like backbone for video recognition. Instead of dense operators, we use efficient relative positional encoding (RPE) to build pairwise token relations, leveraging small-sized parameterized relative position biases to obtain each relation score. Specifically, to enable spatio-temporal modeling, we extend the image PosMLP’s positional gating unit to temporal, spatial, and spatio-temporal variants, namely PoTGU, PoSGU, and PoSTGU, respectively. These gating units can be feasibly combined into three types of spatio-temporal factorized positional MLP blocks, which not only decrease model complexity but also maintain good performance. Additionally, we enrich relative positional relationships by using channel grouping. Experimental results on three video-related tasks demonstrate that PosMLP-Video achieves competitive speed-accuracy trade-offs compared to the previous state-of-the-art models. In particular, PosMLP-Video pre-trained on ImageNet1K achieves 59.0%/70.3% top-1 accuracy on Something-Something V1/V2 and 82.1% top-1 accuracy on Kinetics-400 while requiring much fewer parameters and FLOPs than other models. The code is released at this https URL.

[CV-51] EgoFlowNet: Non-Rigid Scene Flow from Point Clouds with Ego-Motion Support

链接: https://arxiv.org/abs/2407.02920
作者: Ramy Battrawy,René Schuster,Didier Stricker
关键词: Recent weakly-supervised methods, Recent weakly-supervised, reasoning on object-level, LiDAR point clouds, clouds are limited
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: This paper is published in BMVC2023 (pp. 441-443)

点击查看摘要

Abstract:Recent weakly-supervised methods for scene flow estimation from LiDAR point clouds are limited to explicit reasoning on object-level. These methods perform multiple iterative optimizations for each rigid object, which makes them vulnerable to clustering robustness. In this paper, we propose our EgoFlowNet - a point-level scene flow estimation network trained in a weakly-supervised manner and without object-based abstraction. Our approach predicts a binary segmentation mask that implicitly drives two parallel branches for ego-motion and scene flow. Unlike previous methods, we provide both branches with all input points and carefully integrate the binary mask into the feature extraction and losses. We also use a shared cost volume with local refinement that is updated at multiple scales without explicit clustering or rigidity assumptions. On realistic KITTI scenes, we show that our EgoFlowNet performs better than state-of-the-art methods in the presence of ground surface points.

[CV-52] Free-SurGS: SfM-Free 3D Gaussian Splatting for Surgical Scene Reconstruction

链接: https://arxiv.org/abs/2407.02918
作者: Jiaxin Guo,Jiangliu Wang,Di Kang,Wenzhen Dong,Wenting Wang,Yun-hui Liu
关键词: enhance surgeons’ visibility, surgical scenes plays, computer-assisted surgery, holding a promise, surgeons’ visibility
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注: Accepted to MICCAI 2024

点击查看摘要

Abstract:Real-time 3D reconstruction of surgical scenes plays a vital role in computer-assisted surgery, holding a promise to enhance surgeons’ visibility. Recent advancements in 3D Gaussian Splatting (3DGS) have shown great potential for real-time novel view synthesis of general scenes, which relies on accurate poses and point clouds generated by Structure-from-Motion (SfM) for initialization. However, 3DGS with SfM fails to recover accurate camera poses and geometry in surgical scenes due to the challenges of minimal textures and photometric inconsistencies. To tackle this problem, in this paper, we propose the first SfM-free 3DGS-based method for surgical scene reconstruction by jointly optimizing the camera poses and scene representation. Based on the video continuity, the key of our method is to exploit the immediate optical flow priors to guide the projection flow derived from 3D Gaussians. Unlike most previous methods relying on photometric loss only, we formulate the pose estimation problem as minimizing the flow loss between the projection flow and optical flow. A consistency check is further introduced to filter the flow outliers by detecting the rigid and reliable points that satisfy the epipolar geometry. During 3D Gaussian optimization, we randomly sample frames to optimize the scene representations to grow the 3D Gaussian progressively. Experiments on the SCARED dataset demonstrate our superior performance over existing methods in novel view synthesis and pose estimation with high efficiency. Code is available at this https URL.

[CV-53] Domain-independent detection of known anomalies

链接: https://arxiv.org/abs/2407.02910
作者: Jonas Bühler,Jonas Fehrenbach,Lucas Steinmann,Christian Nauck,Marios Koulakis
关键词: industrial quality inspection, persistent obstacle, quality inspection, Spatial Embedding MLP, previously unseen
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted as extended abstract in CVPR 2024 workshop VAND 2.0

点击查看摘要

Abstract:One persistent obstacle in industrial quality inspection is the detection of anomalies. In real-world use cases, two problems must be addressed: anomalous data is sparse and the same types of anomalies need to be detected on previously unseen objects. Current anomaly detection approaches can be trained with sparse nominal data, whereas domain generalization approaches enable detecting objects in previously unseen domains. Utilizing those two observations, we introduce the hybrid task of domain generalization on sparse classes. To introduce an accompanying dataset for this task, we present a modification of the well-established MVTec AD dataset by generating three new datasets. In addition to applying existing methods for benchmark, we design two embedding-based approaches, Spatial Embedding MLP (SEMLP) and Labeled PatchCore. Overall, SEMLP achieves the best performance with an average image-level AUROC of 87.2 % vs. 80.4 % by MIRO. The new and openly available datasets allow for further research to improve industrial anomaly detection.

[CV-54] Single Image Rolling Shutter Removal with Diffusion Models

链接: https://arxiv.org/abs/2407.02906
作者: Zhanglei Yang,Haipeng Li,Mingbo Hong,Bing Zeng,Shuaicheng Liu
关键词: single-frame Rolling Shutter, single-frame Rolling, Diffusion Models-based method, Rolling Shutter, Diffusion Models-based
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We present RS-Diffusion, the first Diffusion Models-based method for single-frame Rolling Shutter (RS) correction. RS artifacts compromise visual quality of frames due to the row wise exposure of CMOS sensors. Most previous methods have focused on multi-frame approaches, using temporal information from consecutive frames for the motion rectification. However, few approaches address the more challenging but important single frame RS correction. In this work, we present an ``image-to-motion’’ framework via diffusion techniques, with a designed patch-attention module. In addition, we present the RS-Real dataset, comprised of captured RS frames alongside their corresponding Global Shutter (GS) ground-truth pairs. The GS frames are corrected from the RS ones, guided by the corresponding Inertial Measurement Unit (IMU) gyroscope data acquired during capture. Experiments show that our RS-Diffusion surpasses previous single RS correction methods. Our method and proposed RS-Real dataset lay a solid foundation for advancing the field of RS correction.

[CV-55] An Uncertainty-guided Tiered Self-training Framework for Active Source-free Domain Adaptation in Prostate Segmentation

链接: https://arxiv.org/abs/2407.02893
作者: Zihao Luo,Xiangde Luo,Zijun Gao,Guotai Wang
关键词: exhibited remarkable efficacy, achieving robust generalization, Source-free Domain Adaptation, Domain Adaptation, exhibited remarkable
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 11 pages, 3 figures, 2 tables, accept to MICCAI 2024

点击查看摘要

Abstract:Deep learning models have exhibited remarkable efficacy in accurately delineating the prostate for diagnosis and treatment of prostate diseases, but challenges persist in achieving robust generalization across different medical centers. Source-free Domain Adaptation (SFDA) is a promising technique to adapt deep segmentation models to address privacy and security concerns while reducing domain shifts between source and target domains. However, recent literature indicates that the performance of SFDA remains far from satisfactory due to unpredictable domain gaps. Annotating a few target domain samples is acceptable, as it can lead to significant performance improvement with a low annotation cost. Nevertheless, due to extremely limited annotation budgets, careful consideration is needed in selecting samples for annotation. Inspired by this, our goal is to develop Active Source-free Domain Adaptation (ASFDA) for medical image segmentation. Specifically, we propose a novel Uncertainty-guided Tiered Self-training (UGTST) framework, consisting of efficient active sample selection via entropy-based primary local peak filtering to aggregate global uncertainty and diversity-aware redundancy filter, coupled with a tiered self-learning strategy, achieves stable domain adaptation. Experimental results on cross-center prostate MRI segmentation datasets revealed that our method yielded marked advancements, with a mere 5% annotation, exhibiting an average Dice score enhancement of 9.78% and 7.58% in two target domains compared with state-of-the-art methods, on par with fully supervised learning. Code is available at:this https URL

[CV-56] Explicitly Guided Information Interaction Network for Cross-modal Point Cloud Completion

链接: https://arxiv.org/abs/2407.02887
作者: Hang Xu,Chen Long,Wenxiao Zhang,Yuan Liu,Zhen Cao,Zhen Dong,Bisheng Yang
关键词: View-guided Point cloud, Guided Information Interaction, Explicitly Guided Information, single view image, Point cloud Completion
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: ECCV 2024

点击查看摘要

Abstract:In this paper, we explore a novel framework, EGIInet (Explicitly Guided Information Interaction Network), a model for View-guided Point cloud Completion (ViPC) task, which aims to restore a complete point cloud from a partial one with a single view image. In comparison with previous methods that relied on the global semantics of input images, EGIInet efficiently combines the information from two modalities by leveraging the geometric nature of the completion task. Specifically, we propose an explicitly guided information interaction strategy supported by modal alignment for point cloud completion. First, in contrast to previous methods which simply use 2D and 3D backbones to encode features respectively, we unified the encoding process to promote modal alignment. Second, we propose a novel explicitly guided information interaction strategy that could help the network identify critical information within images, thus achieving better guidance for completion. Extensive experiments demonstrate the effectiveness of our framework, and we achieved a new state-of-the-art (+16% CD over XMFnet) in benchmark datasets despite using fewer parameters than the previous methods. The pre-trained model and code and are available at this https URL.

[CV-57] ShiftAddAug: Augment Multiplication-Free Tiny Neural Network with Hybrid Computation

链接: https://arxiv.org/abs/2407.02881
作者: Yipin Guo,Zihao Li,Yilin Lang,Qinyuan Ren
关键词: Shift and Add, compatibility with hardware, gained prominence, Operators devoid, Add
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by 2024 CVPR Workshop : Efficient Deep Learning for Computer Vision

点击查看摘要

Abstract:Operators devoid of multiplication, such as Shift and Add, have gained prominence for their compatibility with hardware. However, neural networks (NNs) employing these operators typically exhibit lower accuracy compared to conventional NNs with identical structures. ShiftAddAug uses costly multiplication to augment efficient but less powerful multiplication-free operators, improving performance without any inference overhead. It puts a ShiftAdd tiny NN into a large multiplicative model and encourages it to be trained as a sub-model to obtain additional supervision. In order to solve the weight discrepancy problem between hybrid operators, a new weight sharing method is proposed. Additionally, a novel two stage neural architecture search is used to obtain better augmentation effects for smaller but stronger multiplication-free tiny neural networks. The superiority of ShiftAddAug is validated through experiments in image classification and semantic segmentation, consistently delivering noteworthy enhancements. Remarkably, it secures up to a 4.95% increase in accuracy on the CIFAR100 compared to its directly trained counterparts, even surpassing the performance of multiplicative NNs.

[CV-58] Knowledge Composition using Task Vectors with Learned Anisotropic Scaling

链接: https://arxiv.org/abs/2407.02880
作者: Frederic Z. Zhang,Paul Albert,Cristian Rodriguez-Opazo,Anton van den Hengel,Ehsan Abbasnejad
关键词: produce strong generic, models produce strong, Pre-trained models produce, task vectors, strong generic representations
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Pre-trained models produce strong generic representations that can be adapted via fine-tuning. The learned weight difference relative to the pre-trained model, known as a task vector, characterises the direction and stride of fine-tuning. The significance of task vectors is such that simple arithmetic operations on them can be used to combine diverse representations from different domains. This paper builds on these properties of task vectors and aims to answer (1) whether components of task vectors, particularly parameter blocks, exhibit similar characteristics, and (2) how such blocks can be used to enhance knowledge composition and transfer. To this end, we introduce aTLAS, an algorithm that linearly combines parameter blocks with different learned coefficients, resulting in anisotropic scaling at the task vector level. We show that such linear combinations explicitly exploit the low intrinsic dimensionality of pre-trained models, with only a few coefficients being the learnable parameters. Furthermore, composition of parameter blocks leverages the already learned representations, thereby reducing the dependency on large amounts of data. We demonstrate the effectiveness of our method in task arithmetic, few-shot recognition and test-time adaptation, with supervised or unsupervised objectives. In particular, we show that (1) learned anisotropic scaling allows task vectors to be more disentangled, causing less interference in composition; (2) task vector composition excels with scarce or no labeled data and is less prone to domain shift, thus leading to better generalisability; (3) mixing the most informative parameter blocks across different task vectors prior to training can reduce the memory footprint and improve the flexibility of knowledge transfer. Moreover, we show the potential of aTLAS as a PEFT method, particularly with less data, and demonstrate that its scalibility.

[CV-59] Fast maneuver recovery from aerial observation: trajectory clustering and outliers rejection

链接: https://arxiv.org/abs/2407.02863
作者: Nelson de Moura(ASTRA),Augustin Gervreau-Mercier(ASTRA),Fernando Garrido(ASTRA),Fawzi Nashashibi(ASTRA)
关键词: open problem, Vulnerable Road Users, road user models, realistically reproduce, reproduce a credible
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:The implementation of road user models that realistically reproduce a credible behavior in a multi-agentsimulation is still an open problem. A data-driven approach consists on to deduce behaviors that may exist in real situation to obtain different types of trajectories from a large set of observations. The data, and its classification, could then be used to train models capable to extrapolate such behavior. Cars and two different types of Vulnerable Road Users (VRU) will be considered by the trajectory clustering methods proposed: pedestrians and cyclists. The results reported here evaluate methods to extract well-defined trajectory classes from raw data without the use of map information while also separating ‘‘eccentric’’ or incomplete trajectories from the ones that are complete and representative in any scenario. Two environments will serve as test for the methods develop, three different intersections and one roundabout. The resulting clusters of trajectories can then be used for prediction or learning tasks or discarded if it is composed by outliers.

[CV-60] Universal Gloss-level Representation for Gloss-free Sign Language Translation and Production

链接: https://arxiv.org/abs/2407.02854
作者: Eui Jun Hwang,Sukmin Cho,Huije Lee,Youngwoo Yoon,Jong C. Park
关键词: Sign language, presents unique challenges, spoken language words, sign language motion, mapping sign language
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注: 14 pages, 5 figures

点击查看摘要

Abstract:Sign language, essential for the deaf and hard-of-hearing, presents unique challenges in translation and production due to its multimodal nature and the inherent ambiguity in mapping sign language motion to spoken language words. Previous methods often rely on gloss annotations, requiring time-intensive labor and specialized expertise in sign language. Gloss-free methods have emerged to address these limitations, but they often depend on external sign language data or dictionaries, failing to completely eliminate the need for gloss annotations. There is a clear demand for a comprehensive approach that can supplant gloss annotations and be utilized for both Sign Language Translation (SLT) and Sign Language Production (SLP). We introduce Universal Gloss-level Representation (UniGloR), a unified and self-supervised solution for both SLT and SLP, trained on multiple datasets including PHOENIX14T, How2Sign, and NIASL2021. Our results demonstrate UniGloR’s effectiveness in the translation and production tasks. We further report an encouraging result for the Sign Language Recognition (SLR) on previously unseen data. Our study suggests that self-supervised learning can be made in a unified manner, paving the way for innovative and practical applications in future research.

[CV-61] Plant Doctor: A hybrid machine learning and image segmentation software to quantify plant damage in video footage

链接: https://arxiv.org/abs/2407.02853
作者: Marc Josep Montagut Marques,Liu Mingxin,Kuri Thomas Shiojiri,Tomika Hagiwara,Kayo Hirose,Kaori Shiojiri,Shinjiro Umezu
关键词: Artificial intelligence, fields including agriculture, diagnostic processes, benefiting various fields, intelligence has significantly
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 29 pages, 10 figures, 2 tables

点击查看摘要

Abstract:Artificial intelligence has significantly advanced the automation of diagnostic processes, benefiting various fields including agriculture. This study introduces an AI-based system for the automatic diagnosis of urban street plants using video footage obtained with accessible camera devices. The system aims to monitor plant health on a day-to-day basis, aiding in the control of disease spreading in urban areas. By combining two machine vision algorithms, YOLOv8 and DeepSORT, the system efficiently identifies and tracks individual leaves, extracting the optimal images for health analysis. YOLOv8, chosen for its speed and computational efficiency, locates leaves, while DeepSORT ensures robust tracking in complex environments. For detailed health assessment, DeepLabV3Plus, a convolutional neural network, is employed to segment and quantify leaf damage caused by bacteria, pests, and fungi. The hybrid system, named Plant Doctor, has been trained and validated using a diverse dataset including footage from Tokyo urban plants. The results demonstrate the robustness and accuracy of the system in diagnosing leaf damage, with potential applications in large scale urban flora illness monitoring. This approach provides a non-invasive, efficient, and scalable solution for urban tree health management, supporting sustainable urban ecosystems.

[CV-62] Multi-Task Domain Adaptation for Language Grounding with 3D Objects

链接: https://arxiv.org/abs/2407.02846
作者: Penglei Sun,Yaoxian Song,Xinglin Pan,Peijie Dong,Xiaofei Yang,Qiang Wang,Zhixu Li,Tiefeng Li,Xiaowen Chu
关键词: object-level language grounding, pre-trained models, geometric priors, language grounding, works on object-level
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:The existing works on object-level language grounding with 3D objects mostly focus on improving performance by utilizing the off-the-shelf pre-trained models to capture features, such as viewpoint selection or geometric priors. However, they have failed to consider exploring the cross-modal representation of language-vision alignment in the cross-domain field. To answer this problem, we propose a novel method called Domain Adaptation for Language Grounding (DA4LG) with 3D objects. Specifically, the proposed DA4LG consists of a visual adapter module with multi-task learning to realize vision-language alignment by comprehensive multimodal feature representation. Experimental results demonstrate that DA4LG competitively performs across visual and non-visual language descriptions, independent of the completeness of observation. DA4LG achieves state-of-the-art performance in the single-view setting and multi-view setting with the accuracy of 83.8% and 86.8% respectively in the language grounding benchmark SNARE. The simulation experiments show the well-practical and generalized performance of DA4LG compared to the existing methods. Our project is available at this https URL.

[CV-63] MindBench: A Comprehensive Benchmark for Mind Map Structure Recognition and Analysis

链接: https://arxiv.org/abs/2407.02842
作者: Lei Chen,Feng Yan,Yujie Zhong,Shaoxiang Chen,Zequn Jie,Lin Ma
关键词: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, Language Models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: technical report

点击查看摘要

Abstract:Multimodal Large Language Models (MLLM) have made significant progress in the field of document analysis. Despite this, existing benchmarks typically focus only on extracting text and simple layout information, neglecting the complex interactions between elements in structured documents such as mind maps and flowcharts. To address this issue, we introduce the new benchmark named MindBench, which not only includes meticulously constructed bilingual authentic or synthetic images, detailed annotations, evaluation metrics and baseline models, but also specifically designs five types of structured understanding and parsing tasks. These tasks include full parsing, partial parsing, position-related parsing, structured Visual Question Answering (VQA), and position-related VQA, covering key areas such as text recognition, spatial awareness, relationship discernment, and structured parsing. Extensive experimental results demonstrate the substantial potential and significant room for improvement in current models’ ability to handle structured document information. We anticipate that the launch of MindBench will significantly advance research and application development in structured document analysis technology. MindBench is available at: this https URL.

[CV-64] A Pairwise DomMix Attentive Adversarial Network for Unsupervised Domain Adaptive Object Detection

链接: https://arxiv.org/abs/2407.02835
作者: Jie Shao,Jiacheng Wu,Wenzhong Shen,Cheng Yang
关键词: Adaptive Object Detection, Object Detection, Domain Adaptive Object, Adaptive Object, Unsupervised Domain Adaptive
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: has published on IEEE Signal Processing Letters, 2023

点击查看摘要

Abstract:Unsupervised Domain Adaptive Object Detection (DAOD) could adapt a model trained on a source domain to an unlabeled target domain for object detection. Existing unsupervised DAOD methods usually perform feature alignments from the target to the source. Unidirectional domain transfer would omit information about the target samples and result in suboptimal adaptation when there are large domain shifts. Therefore, we propose a pairwise attentive adversarial network with a Domain Mixup (DomMix) module to mitigate the aforementioned challenges. Specifically, a deep-level mixup is employed to construct an intermediate domain that allows features from both domains to share their differences. Then a pairwise attentive adversarial network is applied with attentive encoding on both image-level and instance-level features at different scales and optimizes domain alignment by adversarial learning. This allows the network to focus on regions with disparate contextual information and learn their similarities between different domains. Extensive experiments are conducted on several benchmark datasets, demonstrating the superiority of our proposed method.

[CV-65] Style Alignment based Dynamic Observation Method for UAV-View Geo-localization

链接: https://arxiv.org/abs/2407.02832
作者: Jie Shao,LingHao Jiang
关键词: reference dataset consisting, estimate the localization, visual style, UAV-view geo-localization, query satellite
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: has published on IEEE Transactions on Geoscience and Remote Sensing, 2023

点击查看摘要

Abstract:The task of UAV-view geo-localization is to estimate the localization of a query satellite/drone image by matching it against a reference dataset consisting of drone/satellite images. Though tremendous strides have been made in feature alignment between satellite and drone views, vast differences in both inter and intra-class due to changes in viewpoint, altitude, and lighting remain a huge challenge. In this paper, a style alignment based dynamic observation method for UAV-view geo-localization is proposed to meet the above challenges from two perspectives: visual style transformation and surrounding noise control. Specifically, we introduce a style alignment strategy to transfrom the diverse visual style of drone-view images into a unified satellite images visual style. Then a dynamic observation module is designed to evaluate the spatial distribution of images by mimicking human observation habits. It is featured by the hierarchical attention block (HAB) with a dual-square-ring stream structure, to reduce surrounding noise and geographical deformation. In addition, we propose a deconstruction loss to push away features of different geo-tags and squeeze knowledge from unmatched images by correlation calculation. The experimental results demonstrate the state-of-the-art performance of our model on benchmarked datasets. In particular, when compared to the prior art on University-1652, our results surpass the best of them (FSRA), while only requiring 2x fewer parameters. Code will be released at this https URL_DOM

[CV-66] A Radiometric Correction based Optical Modeling Approach to Removing Reflection Noise in TLS Point Clouds of Urban Scenes

链接: https://arxiv.org/abs/2407.02830
作者: Li Fang,Tianyu Li,Yanghong Lin,Shudong Zhou,Wei Yao
关键词: computer vision tasks, autonomous driving, vital in computer, computer vision, vision tasks
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:Point clouds are vital in computer vision tasks such as 3D reconstruction, autonomous driving, and robotics. However, TLS-acquired point clouds often contain virtual points from reflective surfaces, causing disruptions. This study presents a reflection noise elimination algorithm for TLS point clouds. Our innovative reflection plane detection algorithm, based on geometry-optical models and physical properties, identifies and categorizes reflection points per optical reflection theory. We’ve adapted the LSFH feature descriptor to retain reflection features, mitigating interference from symmetrical architectural structures. By incorporating the Hausdorff feature distance, the algorithm enhances resilience to ghosting and deformation, improving virtual point detection accuracy. Extensive experiments on the 3DRN benchmark dataset, featuring diverse urban environments with virtual TLS reflection noise, show our algorithm improves precision and recall rates for 3D points in reflective regions by 57.03% and 31.80%, respectively. Our method achieves a 9.17% better outlier detection rate and 5.65% higher accuracy than leading methods. Access the 3DRN dataset at (this https URL).

[CV-67] Images Speak Louder than Words: Understanding and Mitigating Bias in Vision-Language Model from a Causal Mediation Perspective

链接: https://arxiv.org/abs/2407.02814
作者: Zhaotian Weng,Zijun Gao,Jerone Andrews,Jieyu Zhao
关键词: inadvertently learn biases, Vision-language models, correlating gender information, pre-trained on extensive, objects or scenarios
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Vision-language models (VLMs) pre-trained on extensive datasets can inadvertently learn biases by correlating gender information with specific objects or scenarios. Current methods, which focus on modifying inputs and monitoring changes in the model’s output probability scores, often struggle to comprehensively understand bias from the perspective of model components. We propose a framework that incorporates causal mediation analysis to measure and map the pathways of bias generation and propagation within VLMs. This approach allows us to identify the direct effects of interventions on model bias and the indirect effects of interventions on bias mediated through different model components. Our results show that image features are the primary contributors to bias, with significantly higher impacts than text features, specifically accounting for 32.57% and 12.63% of the bias in the MSCOCO and PASCAL-SENTENCE datasets, respectively. Notably, the image encoder’s contribution surpasses that of the text encoder and the deep fusion encoder. Further experimentation confirms that contributions from both language and vision modalities are aligned and non-conflicting. Consequently, focusing on blurring gender representations within the image encoder, which contributes most to the model bias, reduces bias efficiently by 22.03% and 9.04% in the MSCOCO and PASCAL-SENTENCE datasets, respectively, with minimal performance loss or increased computational demands.

[CV-68] Data Overfitting for On-Device Super-Resolution with Dynamic Algorithm and Compiler Co-Design

链接: https://arxiv.org/abs/2407.02813
作者: Gen Li,Zhihao Shu,Jie Ji,Minghai Qin,Fatemeh Afghah,Wei Niu,Xiaolong Ma
关键词: computer vision applications, vision applications, frequently employed, variety of computer, computer vision
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: ECCV2024

点击查看摘要

Abstract:Deep neural networks (DNNs) are frequently employed in a variety of computer vision applications. Nowadays, an emerging trend in the current video distribution system is to take advantage of DNN’s overfitting properties to perform video resolution upscaling. By splitting videos into chunks and applying a super-resolution (SR) model to overfit each chunk, this scheme of SR models plus video chunks is able to replace traditional video transmission to enhance video quality and transmission efficiency. However, many models and chunks are needed to guarantee high performance, which leads to tremendous overhead on model switching and memory footprints at the user end. To resolve such problems, we propose a Dynamic Deep neural network assisted by a Content-Aware data processing pipeline to reduce the model number down to one (Dy-DCA), which helps promote performance while conserving computational resources. Additionally, to achieve real acceleration on the user end, we designed a framework that optimizes dynamic features (e.g., dynamic shapes, sizes, and control flow) in Dy-DCA to enable a series of compilation optimizations, including fused code generation, static execution planning, etc. By employing such techniques, our method achieves better PSNR and real-time performance (33 FPS) on an off-the-shelf mobile phone. Meanwhile, assisted by our compilation optimization, we achieve a 1.7 \times speedup while saving up to 1.61 \times memory consumption. Code available in this https URL.

[CV-69] Solving Motion Planning Tasks with a Scalable Generative Model

链接: https://arxiv.org/abs/2407.02797
作者: Yihan Hu,Siqi Chai,Zhening Yang,Jingyu Qian,Kun Li,Wenxin Shao,Haichao Zhang,Wei Xu,Qiang Liu
关键词: autonomous driving systems, system scalability, millions of vehicles, safety and reducing, engineering cost
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
*备注: ECCV2024

点击查看摘要

Abstract:As autonomous driving systems being deployed to millions of vehicles, there is a pressing need of improving the system’s scalability, safety and reducing the engineering cost. A realistic, scalable, and practical simulator of the driving world is highly desired. In this paper, we present an efficient solution based on generative models which learns the dynamics of the driving scenes. With this model, we can not only simulate the diverse futures of a given driving scenario but also generate a variety of driving scenarios conditioned on various prompts. Our innovative design allows the model to operate in both full-Autoregressive and partial-Autoregressive modes, significantly improving inference and training speed without sacrificing generative capability. This efficiency makes it ideal for being used as an online reactive environment for reinforcement learning, an evaluator for planning policies, and a high-fidelity simulator for testing. We evaluated our model against two real-world datasets: the Waymo motion dataset and the nuPlan dataset. On the simulation realism and scene generation benchmark, our model achieves the state-of-the-art performance. And in the planning benchmarks, our planner outperforms the prior arts. We conclude that the proposed generative model may serve as a foundation for a variety of motion planning tasks, including data generation, simulation, planning, and online training. Source code is public at this https URL

[CV-70] Eulers Elastica Based Cartoon-Smooth-Texture Image Decomposition

链接: https://arxiv.org/abs/2407.02794
作者: Roy Y. He,Hao Liu
关键词: representing sharp boundaries, capturing soft shadows, decomposing grayscale images, structural part, smooth part
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We propose a novel model for decomposing grayscale images into three distinct components: the structural part, representing sharp boundaries and regions with strong light-to-dark transitions; the smooth part, capturing soft shadows and shades; and the oscillatory part, characterizing textures and noise. To capture the homogeneous structures, we introduce a combination of L^0 -gradient and curvature regularization on level lines. This new regularization term enforces strong sparsity on the image gradient while reducing the undesirable staircase effects as well as preserving the geometry of contours. For the smoothly varying component, we utilize the L^2 -norm of the Laplacian that favors isotropic smoothness. To capture the oscillation, we use the inverse Sobolev seminorm. To solve the associated minimization problem, we design an efficient operator-splitting algorithm. Our algorithm effectively addresses the challenging non-convex non-smooth problem by separating it into sub-problems. Each sub-problem can be solved either directly using closed-form solutions or efficiently using the Fast Fourier Transform (FFT). We provide systematic experiments, including ablation and comparison studies, to analyze our model’s behaviors and demonstrate its effectiveness as well as efficiency.

[CV-71] Learning Positional Attention for Sequential Recommendation

链接: https://arxiv.org/abs/2407.02793
作者: Fan Luo,Juan Zhang,Shenghui Xu
关键词: sequential recommendation tasks, achieved remarkable performance, networks have achieved, recommendation tasks, achieved remarkable
类目: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Self-attention-based networks have achieved remarkable performance in sequential recommendation tasks. A crucial component of these models is positional encoding. In this study, we delve into the learned positional embedding, demonstrating that it often captures the distance between tokens. Building on this insight, we introduce novel attention models that directly learn positional relations. Extensive experiments reveal that our proposed models, \textbfPARec and \textbfFPARec outperform previous self-attention-based approaches.Our code is available at the link for anonymous review: https://anonymous.4open.science/ r/FPARec-2C55/

[CV-72] Foster Adaptivity and Balance in Learning with Noisy Labels

链接: https://arxiv.org/abs/2407.02778
作者: Mengmeng Sheng,Zeren Sun,Tao Chen,Shuchao Pang,Yucheng Wang,Yazhou Yao
关键词: deep neural networks, supervised models due, Label noise, posing a practical, neural networks
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: accepted by the European Conference on Computer Vision (ECCV), 2024

点击查看摘要

Abstract:Label noise is ubiquitous in real-world scenarios, posing a practical challenge to supervised models due to its effect in hurting the generalization performance of deep neural networks. Existing methods primarily employ the sample selection paradigm and usually rely on dataset-dependent prior knowledge (\eg, a pre-defined threshold) to cope with label noise, inevitably degrading the adaptivity. Moreover, existing methods tend to neglect the class balance in selecting samples, leading to biased model performance. To this end, we propose a simple yet effective approach named \textbfSED to deal with label noise in a \textbfSelf-adaptiv\textbfE and class-balance\textbfD manner. Specifically, we first design a novel sample selection strategy to empower self-adaptivity and class balance when identifying clean and noisy data. A mean-teacher model is then employed to correct labels of noisy samples. Subsequently, we propose a self-adaptive and class-balanced sample re-weighting mechanism to assign different weights to detected noisy samples. Finally, we additionally employ consistency regularization on selected clean samples to improve model generalization performance. Extensive experimental results on synthetic and real-world datasets demonstrate the effectiveness and superiority of our proposed method. The source code has been made available at this https URL.

[CV-73] Automatic gradient descent with generalized Newtons method

链接: https://arxiv.org/abs/2407.02772
作者: Zhiqi Bu,Shiyun Xu
关键词: generalized Newton method, SGD and Adam, generalized Newton, Hessian-informed approach, Newton method
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We propose the generalized Newton’s method (GeN) – a Hessian-informed approach that applies to any optimizer such as SGD and Adam, and covers the Newton-Raphson method as a sub-case. Our method automatically and dynamically selects the learning rate that accelerates the convergence, without the intensive tuning of the learning rate scheduler. In practice, out method is easily implementable, since it only requires additional forward passes with almost zero computational overhead (in terms of training time and memory cost), if the overhead is amortized over many iterations. We present extensive experiments on language and vision tasks (e.g. GPT and ResNet) to showcase that GeN optimizers match the state-of-the-art performance, which was achieved with carefully tuned learning rate schedulers. Code to be released at \urlthis https URL.

[CV-74] Fine-Grained Scene Image Classification with Modality-Agnostic Adapter

链接: https://arxiv.org/abs/2407.02769
作者: Yiqun Wang,Zhao Zhou,Xiangcheng Du,Xingjiao Wu,Yingbin Zheng,Cheng Jin
关键词: scene image classification, fine-grained scene image, global visual features, previous works lay, multi-modal feature fusion
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:When dealing with the task of fine-grained scene image classification, most previous works lay much emphasis on global visual features when doing multi-modal feature fusion. In other words, models are deliberately designed based on prior intuitions about the importance of different modalities. In this paper, we present a new multi-modal feature fusion approach named MAA (Modality-Agnostic Adapter), trying to make the model learn the importance of different modalities in different cases adaptively, without giving a prior setting in the model architecture. More specifically, we eliminate the modal differences in distribution and then use a modality-agnostic Transformer encoder for a semantic-level feature fusion. Our experiments demonstrate that MAA achieves state-of-the-art results on benchmarks by applying the same modalities with previous methods. Besides, it is worth mentioning that new modalities can be easily added when using MAA and further boost the performance. Code is available at this https URL.

[CV-75] Knowledge Transfer with Simulated Inter-Image Erasing for Weakly Supervised Semantic Segmentation

链接: https://arxiv.org/abs/2407.02768
作者: Tao Chen,XiRuo Jiang,Gensheng Pei,Zeren Sun,Yucheng Wang,Yazhou Yao
关键词: supervised semantic segmentation, weakly supervised semantic, activate integral object, supervised semantic, semantic segmentation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: accepted by the European Conference on Computer Vision (ECCV), 2024

点击查看摘要

Abstract:Though adversarial erasing has prevailed in weakly supervised semantic segmentation to help activate integral object regions, existing approaches still suffer from the dilemma of under-activation and over-expansion due to the difficulty in determining when to stop erasing. In this paper, we propose a \textbfKnowledge \textbfTransfer with \textbfSimulated Inter-Image \textbfErasing (KTSE) approach for weakly supervised semantic segmentation to alleviate the above problem. In contrast to existing erasing-based methods that remove the discriminative part for more object discovery, we propose a simulated inter-image erasing scenario to weaken the original activation by introducing extra object information. Then, object knowledge is transferred from the anchor image to the consequent less activated localization map to strengthen network localization ability. Considering the adopted bidirectional alignment will also weaken the anchor image activation if appropriate constraints are missing, we propose a self-supervised regularization module to maintain the reliable activation in discriminative regions and improve the inter-class object boundary recognition for complex images with multiple categories of objects. In addition, we resort to intra-image erasing and propose a multi-granularity alignment module to gently enlarge the object activation to boost the object knowledge transfer. Extensive experiments and ablation studies on PASCAL VOC 2012 and COCO datasets demonstrate the superiority of our proposed approach. Source codes and models are available at this https URL.

[CV-76] ADFQ-ViT: Activation-Distribution-Friendly Post-Training Quantization for Vision Transformers

链接: https://arxiv.org/abs/2407.02763
作者: Yanfeng Jiang,Ning Sun,Xueshuo Xie,Fei Yang,Tao Li
关键词: impeding effective inference, exhibited exceptional performance, size incurs significantly, incurs significantly increased, significantly increased memory
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 28 pages,9 figures

点击查看摘要

Abstract:Vision Transformers (ViTs) have exhibited exceptional performance across diverse computer vision tasks, while their substantial parameter size incurs significantly increased memory and computational demands, impeding effective inference on resource-constrained devices. Quantization has emerged as a promising solution to mitigate these challenges, yet existing methods still suffer from significant accuracy loss at low-bit. We attribute this issue to the distinctive distributions of post-LayerNorm and post-GELU activations within ViTs, rendering conventional hardware-friendly quantizers ineffective, particularly in low-bit scenarios. To address this issue, we propose a novel framework called Activation-Distribution-Friendly post-training Quantization for Vision Transformers, ADFQ-ViT. Concretely, we introduce the Per-Patch Outlier-aware Quantizer to tackle irregular outliers in post-LayerNorm activations. This quantizer refines the granularity of the uniform quantizer to a per-patch level while retaining a minimal subset of values exceeding a threshold at full-precision. To handle the non-uniform distributions of post-GELU activations between positive and negative regions, we design the Shift-Log2 Quantizer, which shifts all elements to the positive region and then applies log2 quantization. Moreover, we present the Attention-score enhanced Module-wise Optimization which adjusts the parameters of each quantizer by reconstructing errors to further mitigate quantization error. Extensive experiments demonstrate ADFQ-ViT provides significant improvements over various baselines in image classification, object detection, and instance segmentation tasks at 4-bit. Specifically, when quantizing the ViT-B model to 4-bit, we achieve a 10.23% improvement in Top-1 accuracy on the ImageNet dataset.

[CV-77] Differential Encoding for Improved Representation Learning over Graphs

链接: https://arxiv.org/abs/2407.02758
作者: Haimin Zhang,Jiahao Xia,Min Xu
关键词: global attention mechanism, Combining the message-passing, node, global attention, message-passing paradigm
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Social and Information Networks (cs.SI)
*备注:

点击查看摘要

Abstract:Combining the message-passing paradigm with the global attention mechanism has emerged as an effective framework for learning over graphs. The message-passing paradigm and the global attention mechanism fundamentally generate node embeddings based on information aggregated from a node’s local neighborhood or from the whole graph. The most basic and commonly used aggregation approach is to take the sum of information from a node’s local neighbourhood or from the whole graph. However, it is unknown if the dominant information is from a node itself or from the node’s neighbours (or the rest of the graph nodes). Therefore, there exists information lost at each layer of embedding generation, and this information lost could be accumulated and become more serious when more layers are used in the model. In this paper, we present a differential encoding method to address the issue of information lost. The idea of our method is to encode the differential representation between the information from a node’s neighbours (or the rest of the graph nodes) and that from the node itself. The obtained differential encoding is then combined with the original aggregated local or global representation to generate the updated node embedding. By integrating differential encodings, the representational ability of generated node embeddings is improved. The differential encoding method is empirically evaluated on different graph tasks on seven benchmark datasets. The results show that it is a general method that improves the message-passing update and the global attention update, advancing the state-of-the-art performance for graph representation learning on these datasets.

[CV-78] ZEAL: Surgical Skill Assessment with Zero-shot Tool Inference Using Unified Foundation Model

链接: https://arxiv.org/abs/2407.02738
作者: Satoshi Kondo
关键词: Surgical skill assessment, ensuring patient safety, unifiEd foundAtion modeL, ZEAL, skill assessment
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:Surgical skill assessment is paramount for ensuring patient safety and enhancing surgical outcomes. This study addresses the need for efficient and objective evaluation methods by introducing ZEAL (surgical skill assessment with Zero-shot surgical tool segmentation with a unifiEd foundAtion modeL). ZEAL uses segmentation masks of surgical instruments obtained through a unified foundation model for proficiency assessment. Through zero-shot inference with text prompts, ZEAL predicts segmentation masks, capturing essential features of both instruments and surroundings. Utilizing sparse convolutional neural networks and segmentation masks, ZEAL extracts feature vectors for foreground (instruments) and background. Long Short-Term Memory (LSTM) networks encode temporal dynamics, modeling sequential data and dependencies in surgical videos. Combining LSTM-encoded vectors, ZEAL produces a surgical skill score, offering an objective measure of proficiency. Comparative analysis with conventional methods using open datasets demonstrates ZEAL’s superiority, affirming its potential in advancing surgical training and evaluation. This innovative approach to surgical skill assessment addresses challenges in traditional supervised learning techniques, paving the way for enhanced surgical care quality and patient outcomes.

[CV-79] MedVH: Towards Systematic Evaluation of Hallucination for Large Vision Language Models in the Medical Context

链接: https://arxiv.org/abs/2407.02730
作者: Zishan Gu,Changchang Yin,Fenglin Liu,Ping Zhang
关键词: Large Vision Language, Vision Language Models, Vision Language, Large Vision, recently achieved superior
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large Vision Language Models (LVLMs) have recently achieved superior performance in various tasks on natural image and text data, which inspires a large amount of studies for LVLMs fine-tuning and training. Despite their advancements, there has been scant research on the robustness of these models against hallucination when fine-tuned on smaller datasets. In this study, we introduce a new benchmark dataset, the Medical Visual Hallucination Test (MedVH), to evaluate the hallucination of domain-specific LVLMs. MedVH comprises five tasks to evaluate hallucinations in LVLMs within the medical context, which includes tasks for comprehensive understanding of textual and visual input, as well as long textual response generation. Our extensive experiments with both general and medical LVLMs reveal that, although medical LVLMs demonstrate promising performance on standard medical tasks, they are particularly susceptible to hallucinations, often more so than the general models, raising significant concerns about the reliability of these domain-specific models. For medical LVLMs to be truly valuable in real-world applications, they must not only accurately integrate medical knowledge but also maintain robust reasoning abilities to prevent hallucination. Our work paves the way for future evaluations of these studies.

[CV-80] Model and Feature Diversity for Bayesian Neural Networks in Mutual Learning

链接: https://arxiv.org/abs/2407.02721
作者: Cuong Pham,Cuong C. Nguyen,Trung Le,Dinh Phung,Gustavo Carneiro,Thanh-Toan Do
关键词: Bayesian Neural Networks, enabling uncertainty quantification, Bayesian Neural, offer probability distributions, Neural Networks
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to NeurIPS 2023

点击查看摘要

Abstract:Bayesian Neural Networks (BNNs) offer probability distributions for model parameters, enabling uncertainty quantification in predictions. However, they often underperform compared to deterministic neural networks. Utilizing mutual learning can effectively enhance the performance of peer BNNs. In this paper, we propose a novel approach to improve BNNs performance through deep mutual learning. The proposed approaches aim to increase diversity in both network parameter distributions and feature distributions, promoting peer networks to acquire distinct features that capture different characteristics of the input, which enhances the effectiveness of mutual learning. Experimental results demonstrate significant improvements in the classification accuracy, negative log-likelihood, and expected calibration error when compared to traditional mutual learning for BNNs.

[CV-81] Light-weight Fine-tuning Method for Defending Adversarial Noise in Pre-trained Medical Vision-Language Models

链接: https://arxiv.org/abs/2407.02716
作者: Xu Han,Linghao Jin,Xuezhe Ma,Xiaofeng Liu
关键词: textual depiction synergy, shown remarkable capabilities, pre-trained Vision-Language Models, Fine-tuning pre-trained Vision-Language, depiction synergy
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Fine-tuning pre-trained Vision-Language Models (VLMs) has shown remarkable capabilities in medical image and textual depiction synergy. Nevertheless, many pre-training datasets are restricted by patient privacy concerns, potentially containing noise that can adversely affect downstream performance. Moreover, the growing reliance on multi-modal generation exacerbates this issue because of its susceptibility to adversarial attacks. To investigate how VLMs trained on adversarial noisy data perform on downstream medical tasks, we first craft noisy upstream datasets using multi-modal adversarial attacks. Through our comprehensive analysis, we unveil that moderate noise enhances model robustness and transferability, but increasing noise levels negatively impact downstream task performance. To mitigate this issue, we propose rectify adversarial noise (RAN) framework, a recipe designed to effectively defend adversarial attacks and rectify the influence of upstream noise during fine-tuning.

[CV-82] Advancing Compressed Video Action Recognition through Progressive Knowledge Distillation

链接: https://arxiv.org/abs/2407.02713
作者: Efstathia Soufleri,Deepak Ravikumar,Kaushik Roy
关键词: Compressed video action, video action recognition, recognition classifies video, classifies video samples, action recognition classifies
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Compressed video action recognition classifies video samples by leveraging the different modalities in compressed videos, namely motion vectors, residuals, and intra-frames. For this purpose, three neural networks are deployed, each dedicated to processing one modality. Our observations indicate that the network processing intra-frames tend to converge to a flatter minimum than the network processing residuals, which in turn converges to a flatter minimum than the motion vector network. This hierarchy in convergence motivates our strategy for knowledge transfer among modalities to achieve flatter minima, which are generally associated with better generalization. With this insight, we propose Progressive Knowledge Distillation (PKD), a technique that incrementally transfers knowledge across the modalities. This method involves attaching early exits (Internal Classifiers - ICs) to the three networks. PKD distills knowledge starting from the motion vector network, followed by the residual, and finally, the intra-frame network, sequentially improving IC accuracy. Further, we propose the Weighted Inference with Scaled Ensemble (WISE), which combines outputs from the ICs using learned weights, boosting accuracy during inference. Our experiments demonstrate the effectiveness of training the ICs with PKD compared to standard cross-entropy-based training, showing IC accuracy improvements of up to 5.87% and 11.42% on the UCF-101 and HMDB-51 datasets, respectively. Additionally, WISE improves accuracy by up to 4.28% and 9.30% on UCF-101 and HMDB-51, respectively.

[CV-83] Funny Valen-Tine: Solving visual abstract reasoning problems through defining the solution distribution

链接: https://arxiv.org/abs/2407.02688
作者: Ruizhuo Song,Beiming Yuan
关键词: hold immense importance, Raven Progressive Matrices, problems hold immense, RPM involving image, reasoning problems hold
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 14 pages, 20 figures, 3 tables

点击查看摘要

Abstract:Visual abstract reasoning problems hold immense importance in the field of image processing. Both Bongard-Logo and Raven’s Progressive Matrices (RPM) belong to this domain, with Bongard-Logo categorized as image clustering reasoning and RPM involving image progression pattern reasoning. This paper introduces Valen, a novel baseline model under probabilistic highlighting models. Valen exhibits remarkable performance in solving both RPM and Bongard-Logo problems, offering a versatile solution. Our investigation delves into the underlying mechanisms of probability-highlighting solvers, realizing they approximate solutions to reasoning problem instances as distributions delineated by primary and auxiliary samples. We propose that the learning objective is not the distribution of correct solutions but one defined by both primary and auxiliary samples. To bridge discrepancies, we introduced the Tine method, an adversarial learning-based approach to assist Valen in estimating a solution distribution closer to the correct one, albeit with issues like unstable training. Reflecting on Tine, we propose modeling the sample distribution of reasoning problems as a mixture of Gaussian distributions, leading to the Funny method. This effectively enables Valen to capture the true form of the correct solution distribution. Furthermore, we designed the SBR method to model the distribution of progressive patterns representation similarly. Overall, the Funny, Tine, and SBR methods significantly improve Valen’s performance, providing new ideas and methods for studying visual abstract reasoning problems.

[CV-84] No Training No Problem: Rethinking Classifier-Free Guidance for Diffusion Models

链接: https://arxiv.org/abs/2407.02687
作者: Seyedmorteza Sadat,Manuel Kansy,Otmar Hilliges,Romann M. Weber
关键词: CFG, Classifier-free guidance, conditional diffusion models, diffusion, diffusion models
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Classifier-free guidance (CFG) has become the standard method for enhancing the quality of conditional diffusion models. However, employing CFG requires either training an unconditional model alongside the main diffusion model or modifying the training procedure by periodically inserting a null condition. There is also no clear extension of CFG to unconditional models. In this paper, we revisit the core principles of CFG and introduce a new method, independent condition guidance (ICG), which provides the benefits of CFG without the need for any special training procedures. Our approach streamlines the training process of conditional diffusion models and can also be applied during inference on any pre-trained conditional model. Additionally, by leveraging the time-step information encoded in all diffusion networks, we propose an extension of CFG, called time-step guidance (TSG), which can be applied to any diffusion model, including unconditional ones. Our guidance techniques are easy to implement and have the same sampling cost as CFG. Through extensive experiments, we demonstrate that ICG matches the performance of standard CFG across various conditional diffusion models. Moreover, we show that TSG improves generation quality in a manner similar to CFG, without relying on any conditional information.

[CV-85] Open Panoramic Segmentation

链接: https://arxiv.org/abs/2407.02685
作者: Junwei Zheng,Ruiping Liu,Yufan Chen,Kunyu Peng,Chengzhi Wu,Kailun Yang,Jiaming Zhang,Rainer Stiefelhagen
关键词: encompass omnidirectional spatial, omnidirectional spatial information, spatial information crucial, field of view, panoramic semantic segmentation
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by ECCV 2024. Project page: this https URL

点击查看摘要

Abstract:Panoramic images, capturing a 360° field of view (FoV), encompass omnidirectional spatial information crucial for scene understanding. However, it is not only costly to obtain training-sufficient dense-annotated panoramas but also application-restricted when training models in a close-vocabulary setting. To tackle this problem, in this work, we define a new task termed Open Panoramic Segmentation (OPS), where models are trained with FoV-restricted pinhole images in the source domain in an open-vocabulary setting while evaluated with FoV-open panoramic images in the target domain, enabling the zero-shot open panoramic semantic segmentation ability of models. Moreover, we propose a model named OOOPS with a Deformable Adapter Network (DAN), which significantly improves zero-shot panoramic semantic segmentation performance. To further enhance the distortion-aware modeling ability from the pinhole source domain, we propose a novel data augmentation method called Random Equirectangular Projection (RERP) which is specifically designed to address object deformations in advance. Surpassing other state-of-the-art open-vocabulary semantic segmentation approaches, a remarkable performance boost on three panoramic datasets, WildPASS, Stanford2D3D, and Matterport3D, proves the effectiveness of our proposed OOOPS model with RERP on the OPS task, especially +2.2% on outdoor WildPASS and +2.4% mIoU on indoor Stanford2D3D. The code will be available at this https URL.

[CV-86] Generalized Event Cameras

链接: https://arxiv.org/abs/2407.02683
作者: Varun Sundar,Matthew Dutson,Andrei Ardelean,Claudio Bruschini,Edoardo Charbon,Mohit Gupta
关键词: high time resolution, minimal bandwidth requirements, Event cameras, Event cameras capture, Event
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
*备注: CVPR 2024

点击查看摘要

Abstract:Event cameras capture the world at high time resolution and with minimal bandwidth requirements. However, event streams, which only encode changes in brightness, do not contain sufficient scene information to support a wide variety of downstream tasks. In this work, we design generalized event cameras that inherently preserve scene intensity in a bandwidth-efficient manner. We generalize event cameras in terms of when an event is generated and what information is transmitted. To implement our designs, we turn to single-photon sensors that provide digital access to individual photon detections; this modality gives us the flexibility to realize a rich space of generalized event cameras. Our single-photon event cameras are capable of high-speed, high-fidelity imaging at low readout rates. Consequently, these event cameras can support plug-and-play downstream inference, without capturing new event datasets or designing specialized event-vision models. As a practical implication, our designs, which involve lightweight and near-sensor-compatible computations, provide a way to use single-photon sensors without exorbitant bandwidth costs.

[CV-87] Adversarial Magnification to Deceive Deepfake Detection through Super Resolution

链接: https://arxiv.org/abs/2407.02670
作者: Davide Alessandro Coccomini,Roberto Caldelli,Giuseppe Amato,Fabrizio Falchi,Claudio Gennaro
关键词: manipulated media content, posing significant challenges, rapidly advancing, posing significant, media content
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Deepfake technology is rapidly advancing, posing significant challenges to the detection of manipulated media content. Parallel to that, some adversarial attack techniques have been developed to fool the deepfake detectors and make deepfakes even more difficult to be detected. This paper explores the application of super resolution techniques as a possible adversarial attack in deepfake detection. Through our experiments, we demonstrate that minimal changes made by these methods in the visual appearance of images can have a profound impact on the performance of deepfake detection systems. We propose a novel attack using super resolution as a quick, black-box and effective method to camouflage fake images and/or generate false alarms on pristine images. Our results indicate that the usage of super resolution can significantly impair the accuracy of deepfake detectors, thereby highlighting the vulnerability of such systems to adversarial attacks. The code to reproduce our experiments is available at: this https URL

[CV-88] MomentsNeRF: Leveraging Orthogonal Moments for Few-Shot Neural Rendering

链接: https://arxiv.org/abs/2407.02668
作者: Ahmad AlMughrabi,Ricardo Marques,Petia Radeva
关键词: scene using Orthogonal, Orthogonal Moments, few-shot neural rendering, neural rendering frameworks, few-shot neural
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We propose MomentsNeRF, a novel framework for one- and few-shot neural rendering that predicts a neural representation of a 3D scene using Orthogonal Moments. Our architecture offers a new transfer learning method to train on multi-scenes and incorporate a per-scene optimization using one or a few images at test time. Our approach is the first to successfully harness features extracted from Gabor and Zernike moments, seamlessly integrating them into the NeRF architecture. We show that MomentsNeRF performs better in synthesizing images with complex textures and shapes, achieving a significant noise reduction, artifact elimination, and completing the missing parts compared to the recent one- and few-shot neural rendering frameworks. Extensive experiments on the DTU and Shapenet datasets show that MomentsNeRF improves the state-of-the-art by 3.39;dB;PSNR, 11.1% SSIM, 17.9% LPIPS, and 8.3% DISTS metrics. Moreover, it outperforms state-of-the-art performance for both novel view synthesis and single-image 3D view reconstruction. The source code is accessible at: this https URL.

[CV-89] SMILe: Leveraging Submodular Mutual Information For Robust Few-Shot Object Detection

链接: https://arxiv.org/abs/2407.02665
作者: Anay Majee,Ryan Sharp,Rishabh Iyer
关键词: Few-Shot Object Detection, Submodular Mutual Information, Object Detection, learning based FSOD, Mutual Information Learning
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to ECCV 2024, 16 pages, 5 figures, 7 tables

点击查看摘要

Abstract:Confusion and forgetting of object classes have been challenges of prime interest in Few-Shot Object Detection (FSOD). To overcome these pitfalls in metric learning based FSOD techniques, we introduce a novel Submodular Mutual Information Learning (SMILe) framework which adopts combinatorial mutual information functions to enforce the creation of tighter and discriminative feature clusters in FSOD. Our proposed approach generalizes to several existing approaches in FSOD, agnostic of the backbone architecture demonstrating elevated performance gains. A paradigm shift from instance based objective functions to combinatorial objectives in SMILe naturally preserves the diversity within an object class resulting in reduced forgetting when subjected to few training examples. Furthermore, the application of mutual information between the already learnt (base) and newly added (novel) objects ensures sufficient separation between base and novel classes, minimizing the effect of class confusion. Experiments on popular FSOD benchmarks, PASCAL-VOC and MS-COCO show that our approach generalizes to State-of-the-Art (SoTA) approaches improving their novel class performance by up to 5.7% (3.3 mAP points) and 5.4% (2.6 mAP points) on the 10-shot setting of VOC (split 3) and 30-shot setting of COCO datasets respectively. Our experiments also demonstrate better retention of base class performance and up to 2x faster convergence over existing approaches agnostic of the underlying architecture.

[CV-90] Spectral Graph Reasoning Network for Hyperspectral Image Classification

链接: https://arxiv.org/abs/2407.02647
作者: Huiling Wang
关键词: achieved remarkable performance, Convolutional neural networks, hyperspectral image, achieved remarkable, remarkable performance
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Convolutional neural networks (CNNs) have achieved remarkable performance in hyperspectral image (HSI) classification over the last few years. Despite the progress that has been made, rich and informative spectral information of HSI has been largely underutilized by existing methods which employ convolutional kernels with limited size of receptive field in the spectral domain. To address this issue, we propose a spectral graph reasoning network (SGR) learning framework comprising two crucial modules: 1) a spectral decoupling module which unpacks and casts multiple spectral embeddings into a unified graph whose node corresponds to an individual spectral feature channel in the embedding space; the graph performs interpretable reasoning to aggregate and align spectral information to guide learning spectral-specific graph embeddings at multiple contextual levels 2) a spectral ensembling module explores the interactions and interdependencies across graph embedding hierarchy via a novel recurrent graph propagation mechanism. Experiments on two HSI datasets demonstrate that the proposed architecture can significantly improve the classification accuracy compared with the existing methods with a sizable margin.

[CV-91] Holistically-Nested Structure-Aware Graph Neural Network for Road Extraction

链接: https://arxiv.org/abs/2407.02639
作者: Tinghuai Wang,Guangming Wang,Kuan Eeik Tan
关键词: made significant advances, Convolutional neural networks, Convolutional neural, made significant, significant advances
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Convolutional neural networks (CNN) have made significant advances in detecting roads from satellite images. However, existing CNN approaches are generally repurposed semantic segmentation architectures and suffer from the poor delineation of long and curved regions. Lack of overall road topology and structure information further deteriorates their performance on challenging remote sensing images. This paper presents a novel multi-task graph neural network (GNN) which simultaneously detects both road regions and road borders; the inter-play between these two tasks unlocks superior performance from two perspectives: (1) the hierarchically detected road borders enable the network to capture and encode holistic road structure to enhance road connectivity (2) identifying the intrinsic correlation of semantic landcover regions mitigates the difficulty in recognizing roads cluttered by regions with similar appearance. Experiments on challenging dataset demonstrate that the proposed architecture can improve the road border delineation and road extraction accuracy compared with the existing methods.

[CV-92] HOIMotion: Forecasting Human Motion During Human-Object Interactions Using Egocentric 3D Object Bounding Boxes

链接: https://arxiv.org/abs/2407.02633
作者: Zhiming Hu,Zheming Yin,Daniel Haeufle,Syn Schmitt,Andreas Bulling
关键词: human motion forecasting, past body poses, object bounding boxes, motion forecasting, body poses
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted at ISMAR 2024 TVCG-track, this http URL . arXiv admin note: text overlap with arXiv:2403.09885

点击查看摘要

Abstract:We present HOIMotion - a novel approach for human motion forecasting during human-object interactions that integrates information about past body poses and egocentric 3D object bounding boxes. Human motion forecasting is important in many augmented reality applications but most existing methods have only used past body poses to predict future motion. HOIMotion first uses an encoder-residual graph convolutional network (GCN) and multi-layer perceptrons to extract features from body poses and egocentric 3D object bounding boxes, respectively. Our method then fuses pose and object features into a novel pose-object graph and uses a residual-decoder GCN to forecast future body motion. We extensively evaluate our method on the Aria digital twin (ADT) and MoGaze datasets and show that HOIMotion consistently outperforms state-of-the-art methods by a large margin of up to 8.7% on ADT and 7.2% on MoGaze in terms of mean per joint position error. Complementing these evaluations, we report a human study (N=20) that shows that the improvements achieved by our method result in forecasted poses being perceived as both more precise and more realistic than those of existing methods. Taken together, these results reveal the significant information content available in egocentric 3D object bounding boxes for human motion forecasting and the effectiveness of our method in exploiting this information.

[CV-93] Uplifting Lower-Income Data: Strategies for Socioeconomic Perspective Shifts in Vision-Language Models

链接: https://arxiv.org/abs/2407.02623
作者: Joan Nwatu,Oana Ignat,Rada Mihalcea
关键词: formulate translated non-English, socioeconomic integrated prompts, socioeconomic integrated, integrated prompts improve, address this issue
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:To address this issue, we formulate translated non-English, geographic, and socioeconomic integrated prompts and evaluate their impact on VL model performance for data from different countries and income groups. Our findings show that geographic and socioeconomic integrated prompts improve VL performance on lower-income data and favor the retrieval of topic appearances commonly found in data from low-income households. From our analyses, we identify and highlight contexts where these strategies yield the most improvements. Our model analysis code is publicly available at this https URL .

[CV-94] Meta 3D Gen

链接: https://arxiv.org/abs/2407.02599
作者: Raphael Bensadoun,Tom Monnier,Yanir Kleiman,Filippos Kokkinos,Yawar Siddiqui,Mahendra Kariya,Omri Harosh,Roman Shapovalov,Benjamin Graham,Emilien Garreau,Animesh Karnewar,Ang Cao,Idan Azuri,Iurii Makarov,Eric-Tuan Le,Antoine Toisoul,David Novotny,Oran Gafni,Natalia Neverova,Andrea Vedaldi
关键词: fast pipeline, introduce Meta, Gen, Meta, asset
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce Meta 3D Gen (3DGen), a new state-of-the-art, fast pipeline for text-to-3D asset generation. 3DGen offers 3D asset creation with high prompt fidelity and high-quality 3D shapes and textures in under a minute. It supports physically-based rendering (PBR), necessary for 3D asset relighting in real-world applications. Additionally, 3DGen supports generative retexturing of previously generated (or artist-created) 3D shapes using additional textual inputs provided by the user. 3DGen integrates key technical components, Meta 3D AssetGen and Meta 3D TextureGen, that we developed for text-to-3D and text-to-texture generation, respectively. By combining their strengths, 3DGen represents 3D objects simultaneously in three ways: in view space, in volumetric space, and in UV (or texture) space. The integration of these two techniques achieves a win rate of 68% with respect to the single-stage model. We compare 3DGen to numerous industry baselines, and show that it outperforms them in terms of prompt fidelity and visual quality for complex textual prompts, while being significantly faster.

[CV-95] AutoSplat: Constrained Gaussian Splatting for Autonomous Driving Scene Reconstruction

链接: https://arxiv.org/abs/2407.02598
作者: Mustafa Khan,Hamidreza Fazlali,Dhruv Sharma,Tongtong Cao,Dongfeng Bai,Yuan Ren,Bingbing Liu
关键词: simulating safety-critical scenarios, advancing autonomous driving, autonomous driving systems, Realistic scene reconstruction, Gaussian Splatting
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Realistic scene reconstruction and view synthesis are essential for advancing autonomous driving systems by simulating safety-critical scenarios. 3D Gaussian Splatting excels in real-time rendering and static scene reconstructions but struggles with modeling driving scenarios due to complex backgrounds, dynamic objects, and sparse views. We propose AutoSplat, a framework employing Gaussian splatting to achieve highly realistic reconstructions of autonomous driving scenes. By imposing geometric constraints on Gaussians representing the road and sky regions, our method enables multi-view consistent simulation of challenging scenarios including lane changes. Leveraging 3D templates, we introduce a reflected Gaussian consistency constraint to supervise both the visible and unseen side of foreground objects. Moreover, to model the dynamic appearance of foreground objects, we estimate residual spherical harmonics for each foreground Gaussian. Extensive experiments on Pandaset and KITTI demonstrate that AutoSplat outperforms state-of-the-art methods in scene reconstruction and novel view synthesis across diverse driving scenarios. Visit our project page at this https URL.

[CV-96] Enabling Student Innovation through Virtual Reality Development

链接: https://arxiv.org/abs/2407.02591
作者: Sherri Harms
关键词: Virtual Reality, including video streaming, major press coverage, coverage that Virtual, multiple industries
类目: General Literature (cs.GL); Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC)
*备注: Published in proceedings and presented at this https URL 10 pages; 3 figures

点击查看摘要

Abstract:It is clear, from the major press coverage that Virtual Reality (VR) development is garnering, that there is a huge amount of development interest in VR across multiple industries, including video streaming, gaming and simulated learning. Even though PC, web, and mobile are still the top platforms for software development, it is important for university computer science (CS) programs to expose students to VR as a development platform. Additionally, it is important for CS students to learn how to learn about new technologies, since change is constant in the CS field. CS curriculum changes happen much slower than the pace of technology adoption. As new technologies are introduced, CS faculty and students often learn together, especially in smaller CS programs. This paper describes how student-led VR projects are used, across the CS curriculum, as basic CS concepts are covered. The student-led VR projects are engaging, and promote learning and creativity. Additionally, each student project inspires more students to try their hand at VR development as well.

[CV-97] Improving Visual Storytelling with Multimodal Large Language Models

链接: https://arxiv.org/abs/2407.02586
作者: Xiaochuan Lin,Xiangyong Chen
关键词: contextually rich stories, emerging field, field that combines, combines images, create engaging
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 10 pages

点击查看摘要

Abstract:Visual storytelling is an emerging field that combines images and narratives to create engaging and contextually rich stories. Despite its potential, generating coherent and emotionally resonant visual stories remains challenging due to the complexity of aligning visual and textual information. This paper presents a novel approach leveraging large language models (LLMs) and large vision-language models (LVLMs) combined with instruction tuning to address these challenges. We introduce a new dataset comprising diverse visual stories, annotated with detailed captions and multimodal elements. Our method employs a combination of supervised and reinforcement learning to fine-tune the model, enhancing its narrative generation capabilities. Quantitative evaluations using GPT-4 and qualitative human assessments demonstrate that our approach significantly outperforms existing models, achieving higher scores in narrative coherence, relevance, emotional depth, and overall quality. The results underscore the effectiveness of instruction tuning and the potential of LLMs/LVLMs in advancing visual storytelling.

[CV-98] Novel Human Machine Interface via Robust Hand Gesture Recognition System using Channel Pruned YOLOv5s Model

链接: https://arxiv.org/abs/2407.02585
作者: Abir Sen,Tapas Kumar Mishra,Ratnakar Dash
关键词: smart home automation, human-computer interaction experience, home automation systems, interaction experience, virtual reality
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Hand gesture recognition (HGR) is a vital component in enhancing the human-computer interaction experience, particularly in multimedia applications, such as virtual reality, gaming, smart home automation systems, etc. Users can control and navigate through these applications seamlessly by accurately detecting and recognizing gestures. However, in a real-time scenario, the performance of the gesture recognition system is sometimes affected due to the presence of complex background, low-light illumination, occlusion problems, etc. Another issue is building a fast and robust gesture-controlled human-computer interface (HCI) in the real-time scenario. The overall objective of this paper is to develop an efficient hand gesture detection and classification model using a channel-pruned YOLOv5-small model and utilize the model to build a gesture-controlled HCI with a quick response time (in ms) and higher detection speed (in fps). First, the YOLOv5s model is chosen for the gesture detection task. Next, the model is simplified by using a channel-pruned algorithm. After that, the pruned model is further fine-tuned to ensure detection efficiency. We have compared our suggested scheme with other state-of-the-art works, and it is observed that our model has shown superior results in terms of mAP (mean average precision), precision (%), recall (%), and F1-score (%), fast inference time (in ms), and detection speed (in fps). Our proposed method paves the way for deploying a pruned YOLOv5s model for a real-time gesture-command-based HCI to control some applications, such as the VLC media player, Spotify player, etc., using correctly classified gesture commands in real-time scenarios. The average detection speed of our proposed system has reached more than 60 frames per second (fps) in real-time, which meets the perfect requirement in real-time application control.

[CV-99] Robust ADAS: Enhancing Robustness of Machine Learning-based Advanced Driver Assistance Systems for Adverse Weather

链接: https://arxiv.org/abs/2407.02581
作者: Muhammad Zaeem Shahzad,Muhammad Abdullah Hanif,Muhammad Shafique
关键词: Machine Learning-based Advanced, deploying Machine Learning-based, Advanced Driver Assistance, Learning-based Advanced Driver, Machine Learning-based
类目: Computer Vision and Pattern Recognition (cs.CV)
*备注: 7 pages, 10 figures, 1 table

点击查看摘要

Abstract:In the realm of deploying Machine Learning-based Advanced Driver Assistance Systems (ML-ADAS) into real-world scenarios, adverse weather conditions pose a significant challenge. Conventional ML models trained on clear weather data falter when faced with scenarios like extreme fog or heavy rain, potentially leading to accidents and safety hazards. This paper addresses this issue by proposing a novel approach: employing a Denoising Deep Neural Network as a preprocessing step to transform adverse weather images into clear weather images, thereby enhancing the robustness of ML-ADAS systems. The proposed method eliminates the need for retraining all subsequent Depp Neural Networks (DNN) in the ML-ADAS pipeline, thus saving computational resources and time. Moreover, it improves driver visualization, which is critical for safe navigation in adverse weather conditions. By leveraging the UNet architecture trained on an augmented KITTI dataset with synthetic adverse weather images, we develop the Weather UNet (WUNet) DNN to remove weather artifacts. Our study demonstrates substantial performance improvements in object detection with WUNet preprocessing under adverse weather conditions. Notably, in scenarios involving extreme fog, our proposed solution improves the mean Average Precision (mAP) score of the YOLOv8n from 4% to 70%.

[CV-100] Image-to-Text Logic Jailbreak: Your Imagination can Help You Do Anything

链接: https://arxiv.org/abs/2407.02534
作者: Xiaotian Zou,Yongkang Chen
关键词: Large Visual Language, large language models, Visual Language Models, large language, achieved remarkable success
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Large Visual Language Models (VLMs) such as GPT-4 have achieved remarkable success in generating comprehensive and nuanced responses, surpassing the capabilities of large language models. However, with the integration of visual inputs, new security concerns emerge, as malicious attackers can exploit multiple modalities to achieve their objectives. This has led to increasing attention on the vulnerabilities of VLMs to jailbreak. Most existing research focuses on generating adversarial images or nonsensical image collections to compromise these models. However, the challenge of leveraging meaningful images to produce targeted textual content using the VLMs’ logical comprehension of images remains unexplored. In this paper, we explore the problem of logical jailbreak from meaningful images to text. To investigate this issue, we introduce a novel dataset designed to evaluate flowchart image jailbreak. Furthermore, we develop a framework for text-to-text jailbreak using VLMs. Finally, we conduct an extensive evaluation of the framework on GPT-4o and GPT-4-vision-preview, with jailbreak rates of 92.8% and 70.0%, respectively. Our research reveals significant vulnerabilities in current VLMs concerning image-to-text jailbreak. These findings underscore the need for a deeper examination of the security flaws in VLMs before their practical deployment.

[CV-101] Human-Agent Joint Learning for Efficient Robot Manipulation Skill Acquisition

链接: https://arxiv.org/abs/2407.00299
作者: Shengcheng Luo,Quanquan Peng,Jun Lv,Kaiwen Hong,Katherine Rose Driggs-Campbell,Cewu Lu,Yong-Lu Li
关键词: Employing a teleoperation, gathering demonstrations offers, offers the potential, teleoperation system, teleoperation system poses
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: 8 pages, 6 figures

点击查看摘要

Abstract:Employing a teleoperation system for gathering demonstrations offers the potential for more efficient learning of robot manipulation. However, teleoperating a robot arm equipped with a dexterous hand or gripper, via a teleoperation system poses significant challenges due to its high dimensionality, complex motions, and differences in physiological structure. In this study, we introduce a novel system for joint learning between human operators and robots, that enables human operators to share control of a robot end-effector with a learned assistive agent, facilitating simultaneous human demonstration collection and robot manipulation teaching. In this setup, as data accumulates, the assistive agent gradually learns. Consequently, less human effort and attention are required, enhancing the efficiency of the data collection process. It also allows the human operator to adjust the control ratio to achieve a trade-off between manual and automated control. We conducted experiments in both simulated environments and physical real-world settings. Through user studies and quantitative evaluations, it is evident that the proposed system could enhance data collection efficiency and reduce the need for human adaptation while ensuring the collected data is of sufficient quality for downstream tasks. Videos are available at this https URL. Comments: 8 pages, 6 figures Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG) Cite as: arXiv:2407.00299 [cs.RO] (or arXiv:2407.00299v2 [cs.RO] for this version) https://doi.org/10.48550/arXiv.2407.00299 Focus to learn more arXiv-issued DOI via DataCite

[CV-102] HoloHisto: End-to-end Gigapixel WSI Segmentation with 4K Resolution Sequential Tokenization

链接: https://arxiv.org/abs/2407.03307
作者: Yucheng Tang,Yufan He,Vishwesh Nath,Pengfeig Guo,Ruining Deng,Tianyuan Yao,Quan Liu,Can Cui,Mengmeng Yin,Ziyue Xu,Holger Roth,Daguang Xu,Haichun Yang,Yuankai Huo
关键词: initially segmenting high-resolution, segmentation typically involves, deep learning-based image, two-stage process, initially segmenting
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:In digital pathology, the traditional method for deep learning-based image segmentation typically involves a two-stage process: initially segmenting high-resolution whole slide images (WSI) into smaller patches (e.g., 256x256, 512x512, 1024x1024) and subsequently reconstructing them to their original scale. This method often struggles to capture the complex details and vast scope of WSIs. In this paper, we propose the holistic histopathology (HoloHisto) segmentation method to achieve end-to-end segmentation on gigapixel WSIs, whose maximum resolution is above 80,000 \times 70,000 pixels. HoloHisto fundamentally shifts the paradigm of WSI segmentation to an end-to-end learning fashion with 1) a large (4K) resolution base patch for elevated visual information inclusion and efficient processing, and 2) a novel sequential tokenization mechanism to properly model the contextual relationships and efficiently model the rich information from the 4K input. To our best knowledge, HoloHisto presents the first holistic approach for gigapixel resolution WSI segmentation, supporting direct I/O of complete WSI and their corresponding gigapixel masks. Under the HoloHisto platform, we unveil a random 4K sampler that transcends ultra-high resolution, delivering 31 and 10 times more pixels than standard 2D and 3D patches, respectively, for advancing computational capabilities. To facilitate efficient 4K resolution dense prediction, we leverage sequential tokenization, utilizing a pre-trained image tokenizer to group image features into a discrete token grid. To assess the performance, our team curated a new kidney pathology image segmentation (KPIs) dataset with WSI-level glomeruli segmentation from whole mouse kidneys. From the results, HoloHisto-4K delivers remarkable performance gains over previous state-of-the-art models.

[CV-103] Solving the inverse problem of microscopy deconvolution with a residual Beylkin-Coifman-Rokhlin neural network

链接: https://arxiv.org/abs/2407.03239
作者: Rui Li,Mikhail Kudryashev,Artur Yakimovich
关键词: refers to recovering, revealing the ground, truth of samples, Optic deconvolution, recovering the object
类目: Quantitative Methods (q-bio.QM); Computer Vision and Pattern Recognition (cs.CV)
*备注: 17 pages, 8 figures

点击查看摘要

Abstract:Optic deconvolution in light microscopy (LM) refers to recovering the object details from images, revealing the ground truth of samples. Traditional explicit methods in LM rely on the point spread function (PSF) during image acquisition. Yet, these approaches often fall short due to inaccurate PSF models and noise artifacts, hampering the overall restoration quality. In this paper, we approached the optic deconvolution as an inverse problem. Motivated by the nonstandard-form compression scheme introduced by Beylkin, Coifman, and Rokhlin (BCR), we proposed an innovative physics-informed neural network Multi-Stage Residual-BCR Net (m-rBCR) to approximate the optic deconvolution. We validated the m-rBCR model on four microscopy datasets - two simulated microscopy datasets from ImageNet and BioSR, real dSTORM microscopy images, and real widefield microscopy images. In contrast to the explicit deconvolution methods (e.g. Richardson-Lucy) and other state-of-the-art NN models (U-Net, DDPM, CARE, DnCNN, ESRGAN, RCAN, Noise2Noise, MPRNet, and MIMO-U-Net), the m-rBCR model demonstrates superior performance to other candidates by PSNR and SSIM in two real microscopy datasets and the simulated BioSR dataset. In the simulated ImageNet dataset, m-rBCR ranks the second-best place (right after MIMO-U-Net). With the backbone from the optical physics, m-rBCR exploits the trainable parameters with better performances (from ~30 times fewer than the benchmark MIMO-U-Net to ~210 times than ESRGAN). This enables m-rBCR to achieve a shorter runtime (from ~3 times faster than MIMO-U-Net to ~300 times faster than DDPM). To summarize, by leveraging physics constraints our model reduced potentially redundant parameters significantly in expertise-oriented NN candidates and achieved high efficiency with superior performance.

[CV-104] IM-MoCo: Self-supervised MRI Motion Correction using Motion-Guided Implicit Neural Representations

链接: https://arxiv.org/abs/2407.02974
作者: Ziad Al-Haj Hemidi,Christian Weihsbach,Mattias P. Heinrich
关键词: Magnetic Resonance Imaging, Resonance Imaging, Magnetic Resonance, long acquisition times, arise due
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Submitted to MICCAI 2024 (Before peer review version)

点击查看摘要

Abstract:Motion artifacts in Magnetic Resonance Imaging (MRI) arise due to relatively long acquisition times and can compromise the clinical utility of acquired images. Traditional motion correction methods often fail to address severe motion, leading to distorted and unreliable results. Deep Learning (DL) alleviated such pitfalls through generalization with the cost of vanishing structures and hallucinations, making it challenging to apply in the medical field where hallucinated structures can tremendously impact the diagnostic outcome. In this work, we present an instance-wise motion correction pipeline that leverages motion-guided Implicit Neural Representations (INRs) to mitigate the impact of motion artifacts while retaining anatomical structure. Our method is evaluated using the NYU fastMRI dataset with different degrees of simulated motion severity. For the correction alone, we can improve over state-of-the-art image reconstruction methods by +5% SSIM, +5:db PSNR, and +14% HaarPSI. Clinical relevance is demonstrated by a subsequent experiment, where our method improves classification outcomes by at least +1.5 accuracy percentage points compared to motion-corrupted images.

[CV-105] Recompression Based JPEG Tamper Detection and Localization Using Deep Neural Network Eliminating Compression Factor Dependency

链接: https://arxiv.org/abs/2407.02942
作者: Jamimamul Bakas,Praneta Rawat,Kalyan Kokkalla,Ruchira Naskar
关键词: dual compression characteristics, compression based forgery, modified illegitimately, giving rise, compression based
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: 24 pages, conference

点击查看摘要

Abstract:In this work, we deal with the problem of re compression based image forgery detection, where some regions of an image are modified illegitimately, hence giving rise to presence of dual compression characteristics within a single image. There have been some significant researches in this direction, in the last decade. However, almost all existing techniques fail to detect this form of forgery, when the first compression factor is greater than the second. We address this problem in re compression based forgery detection, here Recently, Machine Learning techniques have started gaining a lot of importance in the domain of digital image forensics. In this work, we propose a Convolution Neural Network based deep learning architecture, which is capable of detecting the presence of re compression based forgery in JPEG images. The proposed architecture works equally efficiently, even in cases where the first compression ratio is greater than the second. In this work, we also aim to localize the regions of image manipulation based on re compression features, using the trained neural network. Our experimental results prove that the proposed method outperforms the state of the art, with respect to forgery detection and localization accuracy.

[CV-106] Explainable vertebral fracture analysis with uncertainty estimation using differentiable rule-based classification

链接: https://arxiv.org/abs/2407.02926
作者: Victor Wåhlstrand Skärström,Lisa Johansson,Jennifer Alvén,Mattias Lorentzon,Ida Häggström
关键词: deep neural networks, incorporating vertebra detection, vertebral fracture assessment, explainable vertebral fracture, neural networks
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: To be published in MICCAI 2024 conference proceedings

点击查看摘要

Abstract:We present a novel method for explainable vertebral fracture assessment (XVFA) in low-dose radiographs using deep neural networks, incorporating vertebra detection and keypoint localization with uncertainty estimates. We incorporate Genant’s semi-quantitative criteria as a differentiable rule-based means of classifying both vertebra fracture grade and morphology. Unlike previous work, XVFA provides explainable classifications relatable to current clinical methodology, as well as uncertainty estimations, while at the same time surpassing state-of-the art methods with a vertebra-level sensitivity of 93% and end-to-end AUC of 97% in a challenging setting. Moreover, we compare intra-reader agreement with model uncertainty estimates, with model reliability on par with human annotators.

[CV-107] Non-Adversarial Learning: Vector-Quantized Common Latent Space for Multi-Sequence MRI

链接: https://arxiv.org/abs/2407.02911
作者: Luyi Han,Tao Tan,Tianyu Zhang,Xin Wang,Yuan Gao,Chunyao Lu,Xinglong Liang,Haoran Dou,Yunzhi Huang,Ritse Mann
关键词: lacking paired samples, models translate MRI, Adversarial learning, paired samples, latent space
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Adversarial learning helps generative models translate MRI from source to target sequence when lacking paired samples. However, implementing MRI synthesis with adversarial learning in clinical settings is challenging due to training instability and mode collapse. To address this issue, we leverage intermediate sequences to estimate the common latent space among multi-sequence MRI, enabling the reconstruction of distinct sequences from the common latent space. We propose a generative model that compresses discrete representations of each sequence to estimate the Gaussian distribution of vector-quantized common (VQC) latent space between multiple sequences. Moreover, we improve the latent space consistency with contrastive learning and increase model stability by domain augmentation. Experiments using BraTS2021 dataset show that our non-adversarial model outperforms other GAN-based methods, and VQC latent space aids our model to achieve (1) anti-interference ability, which can eliminate the effects of noise, bias fields, and artifacts, and (2) solid semantic representation ability, with the potential of one-shot segmentation. Our code is publicly available.

[CV-108] Self-supervised Vision Transformer are Scalable Generative Models for Domain Generalization

链接: https://arxiv.org/abs/2407.02900
作者: Sebastian Doerrich,Francesco Di Salvo,Christian Ledig
关键词: impactful clinical applications, achieving robust generalization, notable advancements, deep learning, techniques into impactful
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted at MICCAI 2024. This is the submitted manuscript with added link to github repo and funding acknowledgements. No further post submission improvements or corrections were integrated. Final version not published yet

点击查看摘要

Abstract:Despite notable advancements, the integration of deep learning (DL) techniques into impactful clinical applications, particularly in the realm of digital histopathology, has been hindered by challenges associated with achieving robust generalization across diverse imaging domains and characteristics. Traditional mitigation strategies in this field such as data augmentation and stain color normalization have proven insufficient in addressing this limitation, necessitating the exploration of alternative methodologies. To this end, we propose a novel generative method for domain generalization in histopathology images. Our method employs a generative, self-supervised Vision Transformer to dynamically extract characteristics of image patches and seamlessly infuse them into the original images, thereby creating novel, synthetic images with diverse attributes. By enriching the dataset with such synthesized images, we aim to enhance its holistic nature, facilitating improved generalization of DL models to unseen domains. Extensive experiments conducted on two distinct histopathology datasets demonstrate the effectiveness of our proposed approach, outperforming the state of the art substantially, on the Camelyon17-wilds challenge dataset (+2%) and on a second epithelium-stroma dataset (+26%). Furthermore, we emphasize our method’s ability to readily scale with increasingly available unlabeled data samples and more complex, higher parametric architectures. Source code is available at this https URL .

[CV-109] LMBF-Net: A Lightweight Multipath Bidirectional Focal Attention Network for Multifeatures Segmentation

链接: https://arxiv.org/abs/2407.02871
作者: Tariq M Khan,Shahzaib Iqbal,Syed S. Naqvi,Imran Razzak,Erik Meijering
关键词: irreversible vision loss, treated early, irreversible vision, vision loss, diagnosed and treated
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Retinal diseases can cause irreversible vision loss in both eyes if not diagnosed and treated early. Since retinal diseases are so complicated, retinal imaging is likely to show two or more abnormalities. Current deep learning techniques for segmenting retinal images with many labels and attributes have poor detection accuracy and generalisability. This paper presents a multipath convolutional neural network for multifeature segmentation. The proposed network is lightweight and spatially sensitive to information. A patch-based implementation is used to extract local image features, and focal modulation attention blocks are incorporated between the encoder and the decoder for improved segmentation. Filter optimisation is used to prevent filter overlaps and speed up model convergence. A combination of convolution operations and group convolution operations is used to reduce computational costs. This is the first robust and generalisable network capable of segmenting multiple features of fundus images (including retinal vessels, microaneurysms, optic discs, haemorrhages, hard exudates, and soft exudates). The results of our experimental evaluation on more than ten publicly available datasets with multiple features show that the proposed network outperforms recent networks despite having a small number of learnable parameters.

[CV-110] Multi-Attention Integrated Deep Learning Frameworks for Enhanced Breast Cancer Segmentation and Identification

链接: https://arxiv.org/abs/2407.02844
作者: Pandiyaraju V,Shravan Venkatraman,Pavan Kumar S,Santhosh Malarvannan,Kannan A
关键词: claiming numerous lives, lives globally, Breast cancer poses, numerous lives, claiming numerous
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 29 pages, 15 figures, 6 tables

点击查看摘要

Abstract:Breast cancer poses a profound threat to lives globally, claiming numerous lives each year. Therefore, timely detection is crucial for early intervention and improved chances of survival. Accurately diagnosing and classifying breast tumors using ultrasound images is a persistent challenge in medicine, demanding cutting-edge solutions for improved treatment strategies. This research introduces multiattention-enhanced deep learning (DL) frameworks designed for the classification and segmentation of breast cancer tumors from ultrasound images. A spatial channel attention mechanism is proposed for segmenting tumors from ultrasound images, utilizing a novel LinkNet DL framework with an InceptionResNet backbone. Following this, the paper proposes a deep convolutional neural network with an integrated multi-attention framework (DCNNIMAF) to classify the segmented tumor as benign, malignant, or normal. From experimental results, it is observed that the segmentation model has recorded an accuracy of 98.1%, with a minimal loss of 0.6%. It has also achieved high Intersection over Union (IoU) and Dice Coefficient scores of 96.9% and 97.2%, respectively. Similarly, the classification model has attained an accuracy of 99.2%, with a low loss of 0.31%. Furthermore, the classification framework has achieved outstanding F1-Score, precision, and recall values of 99.1%, 99.3%, and 99.1%, respectively. By offering a robust framework for early detection and accurate classification of breast cancer, this proposed work significantly advances the field of medical image analysis, potentially improving diagnostic precision and patient outcomes.

[CV-111] Highly Accelerated MRI via Implicit Neural Representation Guided Posterior Sampling of Diffusion Models

链接: https://arxiv.org/abs/2407.02744
作者: Jiayue Chu,Chenhe Du,Xiyue Lin,Yuyao Zhang,Hongjiang Wei
关键词: Reconstructing high-fidelity magnetic, high-fidelity magnetic resonance, reduce scan time, Reconstructing high-fidelity, magnetic resonance
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Reconstructing high-fidelity magnetic resonance (MR) images from under-sampled k-space is a commonly used strategy to reduce scan time. The posterior sampling of diffusion models based on the real measurement data holds significant promise of improved reconstruction accuracy. However, traditional posterior sampling methods often lack effective data consistency guidance, leading to inaccurate and unstable reconstructions. Implicit neural representation (INR) has emerged as a powerful paradigm for solving inverse problems by modeling a signal’s attributes as a continuous function of spatial coordinates. In this study, we present a novel posterior sampler for diffusion models using INR, named DiffINR. The INR-based component incorporates both the diffusion prior distribution and the MRI physical model to ensure high data fidelity. DiffINR demonstrates superior performance on experimental datasets with remarkable accuracy, even under high acceleration factors (up to R=12 in single-channel reconstruction). Notably, our proposed framework can be a generalizable framework to solve inverse problems in other medical imaging tasks.

[CV-112] Depth-Aware Endoscopic Video Inpainting

链接: https://arxiv.org/abs/2407.02675
作者: Francis Xiatian Zhang,Shuang Chen,Xianghua Xie,Hubert P. H. Shum
关键词: corrupted video content, Video inpainting fills, endoscopic video inpainting, Video inpainting, endoscopic video
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by MICCAI 2024

点击查看摘要

Abstract:Video inpainting fills in corrupted video content with plausible replacements. While recent advances in endoscopic video inpainting have shown potential for enhancing the quality of endoscopic videos, they mainly repair 2D visual information without effectively preserving crucial 3D spatial details for clinical reference. Depth-aware inpainting methods attempt to preserve these details by incorporating depth information. Still, in endoscopic contexts, they face challenges including reliance on pre-acquired depth maps, less effective fusion designs, and ignorance of the fidelity of 3D spatial details. To address them, we introduce a novel Depth-aware Endoscopic Video Inpainting (DAEVI) framework. It features a Spatial-Temporal Guided Depth Estimation module for direct depth estimation from visual features, a Bi-Modal Paired Channel Fusion module for effective channel-by-channel fusion of visual and depth information, and a Depth Enhanced Discriminator to assess the fidelity of the RGB-D sequence comprised of the inpainted frames and estimated depth images. Experimental evaluations on established benchmarks demonstrate our framework’s superiority, achieving a 2% improvement in PSNR and a 6% reduction in MSE compared to state-of-the-art methods. Qualitative analyses further validate its enhanced ability to inpaint fine details, highlighting the benefits of integrating depth information into endoscopic inpainting.

[CV-113] Lung-CADex: Fully automatic Zero-Shot Detection and Classification of Lung Nodules in Thoracic CT Images

链接: https://arxiv.org/abs/2407.02625
作者: Furqan Shaukat,Syed Muhammad Anwar,Abhijeet Parida,Van Khanh Lam,Marius George Linguraru,Mubarak Shah
关键词: life for decades, major threats, threats to human, human life, Visual Language models
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Lung cancer has been one of the major threats to human life for decades. Computer-aided diagnosis can help with early lung nodul detection and facilitate subsequent nodule characterization. Large Visual Language models (VLMs) have been found effective for multiple downstream medical tasks that rely on both imaging and text data. However, lesion level detection and subsequent diagnosis using VLMs have not been explored yet. We propose CADe, for segmenting lung nodules in a zero-shot manner using a variant of the Segment Anything Model called MedSAM. CADe trains on a prompt suite on input computed tomography (CT) scans by using the CLIP text encoder through prefix tuning. We also propose, CADx, a method for the nodule characterization as benign/malignant by making a gallery of radiomic features and aligning image-feature pairs through contrastive learning. Training and validation of CADe and CADx have been done using one of the largest publicly available datasets, called LIDC. To check the generalization ability of the model, it is also evaluated on a challenging dataset, LUNGx. Our experimental results show that the proposed methods achieve a sensitivity of 0.86 compared to 0.76 that of other fully supervised methods.The source code, datasets and pre-processed data can be accessed using the link:

[CV-114] Deep Learning Based Apparent Diffusion Coefficient Map Generation from Multi-parametric MR Images for Patients with Diffuse Gliomas

链接: https://arxiv.org/abs/2407.02616
作者: Zach Eidex,Mojtaba Safari,Jacob Wynne,Richard L.J. Qiu,Tonghe Wang,David Viar Hernandez,Hui-Kuo Shu,Hui Mao,Xiaofeng Yang
关键词: Apparent diffusion coefficient, Apparent diffusion, ADC maps, ADC, preprocessed ADC maps
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
*备注: arXiv admin note: text overlap with arXiv:2311.15044

点击查看摘要

Abstract:Purpose: Apparent diffusion coefficient (ADC) maps derived from diffusion weighted (DWI) MRI provides functional measurements about the water molecules in tissues. However, DWI is time consuming and very susceptible to image artifacts, leading to inaccurate ADC measurements. This study aims to develop a deep learning framework to synthesize ADC maps from multi-parametric MR images. Methods: We proposed the multiparametric residual vision transformer model (MPR-ViT) that leverages the long-range context of ViT layers along with the precision of convolutional operators. Residual blocks throughout the network significantly increasing the representational power of the model. The MPR-ViT model was applied to T1w and T2- fluid attenuated inversion recovery images of 501 glioma cases from a publicly available dataset including preprocessed ADC maps. Selected patients were divided into training (N=400), validation (N=50) and test (N=51) sets, respectively. Using the preprocessed ADC maps as ground truth, model performance was evaluated and compared against the Vision Convolutional Transformer (VCT) and residual vision transformer (ResViT) models. Results: The results are as follows using T1w + T2-FLAIR MRI as inputs: MPR-ViT - PSNR: 31.0 +/- 2.1, MSE: 0.009 +/- 0.0005, SSIM: 0.950 +/- 0.015. In addition, ablation studies showed the relative impact on performance of each input sequence. Both qualitative and quantitative results indicate that the proposed MR- ViT model performs favorably against the ground truth data. Conclusion: We show that high-quality ADC maps can be synthesized from structural MRI using a MPR- VCT model. Our predicted images show better conformality to the ground truth volume than ResViT and VCT predictions. These high-quality synthetic ADC maps would be particularly useful for disease diagnosis and intervention, especially when ADC maps have artifacts or are unavailable.

机器学习

[LG-0] Planetarium: A Rigorous Benchmark for Translating Text to Structured Planning Languages

链接: https://arxiv.org/abs/2407.03321
作者: Max Zuo,Francisco Piedrahita Velez,Xiaochen Li,Michael L. Littman,Stephen H. Bach
关键词: PDDL code, PDDL, generated PDDL code, language, natural language descriptions
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Many recent works have explored using language models for planning problems. One line of research focuses on translating natural language descriptions of planning tasks into structured planning languages, such as the planning domain definition language (PDDL). While this approach is promising, accurately measuring the quality of generated PDDL code continues to pose significant challenges. First, generated PDDL code is typically evaluated using planning validators that check whether the problem can be solved with a planner. This method is insufficient because a language model might generate valid PDDL code that does not align with the natural language description of the task. Second, existing evaluation sets often have natural language descriptions of the planning task that closely resemble the ground truth PDDL, reducing the challenge of the task. To bridge this gap, we introduce \benchmarkName, a benchmark designed to evaluate language models’ ability to generate PDDL code from natural language descriptions of planning tasks. We begin by creating a PDDL equivalence algorithm that rigorously evaluates the correctness of PDDL code generated by language models by flexibly comparing it against a ground truth PDDL. Then, we present a dataset of 132,037 text-to-PDDL pairs across 13 different tasks, with varying levels of difficulty. Finally, we evaluate several API-access and open-weight language models that reveal this task’s complexity. For example, 87.6% of the PDDL problem descriptions generated by GPT-4o are syntactically parseable, 82.2% are valid, solve-able problems, but only 35.1% are semantically correct, highlighting the need for a more rigorous benchmark for this problem.

[LG-1] Value-Penalized Auxiliary Control from Examples for Learning without Rewards or Demonstrations

链接: https://arxiv.org/abs/2407.03311
作者: Trevor Ablett,Bryan Chan,Jayce Haoran Wang,Jonathan Kelly
关键词: full expert-demonstration trajectories, hand-crafted reward functions, expert-demonstration trajectories, difficult to acquire, hand-crafted reward
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Submitted to the Conference on Robot Learning (CoRL’24), Munich, Germany, Nov. 6-9, 2024

点击查看摘要

Abstract:Learning from examples of success is an appealing approach to reinforcement learning that eliminates many of the disadvantages of using hand-crafted reward functions or full expert-demonstration trajectories, both of which can be difficult to acquire, biased, or suboptimal. However, learning from examples alone dramatically increases the exploration challenge, especially for complex tasks. This work introduces value-penalized auxiliary control from examples (VPACE); we significantly improve exploration in example-based control by adding scheduled auxiliary control and examples of auxiliary tasks. Furthermore, we identify a value-calibration problem, where policy value estimates can exceed their theoretical limits based on successful data. We resolve this problem, which is exacerbated by learning auxiliary tasks, through the addition of an above-success-level value penalty. Across three simulated and one real robotic manipulation environment, and 21 different main tasks, we show that our approach substantially improves learning efficiency. Videos, code, and datasets are available at this https URL.

[LG-2] Universal Length Generalization with Turing Programs

链接: https://arxiv.org/abs/2407.03310
作者: Kaiying Hou,David Brandfonbrener,Sham Kakade,Samy Jelassi,Eran Malach
关键词: large language models, short training sequences, long test sequences, current large language, Turing Programs
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Length generalization refers to the ability to extrapolate from short training sequences to long test sequences and is a challenge for current large language models. While prior work has proposed some architecture or data format changes to achieve length generalization, these proposals typically apply to a limited set of tasks. Building on prior scratchpad and Chain-of-Thought (CoT) techniques, we propose Turing Programs, a novel CoT strategy that decomposes an algorithmic task into steps mimicking the computation of a Turing Machine. This framework is both universal, as it can accommodate any algorithmic task, and simple, requiring only copying text from the context with small modifications. We show that by using Turing Programs, we obtain robust length generalization on a range of algorithmic tasks: addition, multiplication and in-context SGD. We then demonstrate that transformers achieve length generalization on random Turing Programs, suggesting that length generalization is possible for any algorithmic task. Finally, we theoretically prove that transformers can implement Turing Programs, constructing a simple RASP (Weiss et al.) program that simulates an arbitrary Turing machine.

[LG-3] DisCo-Diff: Enhancing Continuous Diffusion Models with Discrete Latents

链接: https://arxiv.org/abs/2407.03300
作者: Yilun Xu,Gabriele Corso,Tommi Jaakkola,Arash Vahdat,Karsten Kreis
关键词: Gaussian distribution, Diffusion models, discrete latents, Diffusion, Latent Variable Diffusion
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: project page: this https URL

点击查看摘要

Abstract:Diffusion models (DMs) have revolutionized generative learning. They utilize a diffusion process to encode data into a simple Gaussian distribution. However, encoding a complex, potentially multimodal data distribution into a single continuous Gaussian distribution arguably represents an unnecessarily challenging learning problem. We propose Discrete-Continuous Latent Variable Diffusion Models (DisCo-Diff) to simplify this task by introducing complementary discrete latent variables. We augment DMs with learnable discrete latents, inferred with an encoder, and train DM and encoder end-to-end. DisCo-Diff does not rely on pre-trained networks, making the framework universally applicable. The discrete latents significantly simplify learning the DM’s complex noise-to-data mapping by reducing the curvature of the DM’s generative ODE. An additional autoregressive transformer models the distribution of the discrete latents, a simple step because DisCo-Diff requires only few discrete variables with small codebooks. We validate DisCo-Diff on toy data, several image synthesis tasks as well as molecular docking, and find that introducing discrete latents consistently improves model performance. For example, DisCo-Diff achieves state-of-the-art FID scores on class-conditioned ImageNet-64/128 datasets with ODE sampler.

[LG-4] Correlated Privacy Mechanisms for Differentially Private Distributed Mean Estimation

链接: https://arxiv.org/abs/2407.03289
作者: Sajani Vithana,Viveck R. Cadambe,Flavio P. Calmon,Haewon Jeong
关键词: Differentially private distributed, privacy-preserving federated learning, dimensional vectors held, fundamental building block, Differentially private
类目: Information Theory (cs.IT); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Differentially private distributed mean estimation (DP-DME) is a fundamental building block in privacy-preserving federated learning, where a central server estimates the mean of d -dimensional vectors held by n users while ensuring (\epsilon,\delta) -DP. Local differential privacy (LDP) and distributed DP with secure aggregation (SecAgg) are the most common notions of DP used in DP-DME settings with an untrusted server. LDP provides strong resilience to dropouts, colluding users, and malicious server attacks, but suffers from poor utility. In contrast, SecAgg-based DP-DME achieves an O(n) utility gain over LDP in DME, but requires increased communication and computation overheads and complex multi-round protocols to handle dropouts and malicious attacks. In this work, we propose CorDP-DME, a novel DP-DME mechanism that spans the gap between DME with LDP and distributed DP, offering a favorable balance between utility and resilience to dropout and collusion. CorDP-DME is based on correlated Gaussian noise, ensuring DP without the perfect conditional privacy guarantees of SecAgg-based approaches. We provide an information-theoretic analysis of CorDP-DME, and derive theoretical guarantees for utility under any given privacy parameters and dropout/colluding user thresholds. Our results demonstrate that (anti) correlated Gaussian DP mechanisms can significantly improve utility in mean estimation tasks compared to LDP – even in adversarial settings – while maintaining better resilience to dropouts and attacks compared to distributed DP.

[LG-5] Nearly Linear Sparsification of ell_p Subspace Approximation

链接: https://arxiv.org/abs/2407.03262
作者: David P. Woodruff,Taisuke Yasuda
关键词: principal component analysis, median hyperplane problem, center hyperplane problem, NP-hard low rank, hyperplane problem
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The \ell_p subspace approximation problem is an NP-hard low rank approximation problem that generalizes the median hyperplane problem ( p = 1 ), principal component analysis ( p = 2 ), and the center hyperplane problem ( p = \infty ). A popular approach to cope with the NP-hardness of this problem is to compute a strong coreset, which is a small weighted subset of the input points which simultaneously approximates the cost of every k -dimensional subspace, typically to (1+\varepsilon) relative error for a small constant \varepsilon . We obtain the first algorithm for constructing a strong coreset for \ell_p subspace approximation with a nearly optimal dependence on the rank parameter k , obtaining a nearly linear bound of \tilde O(k)\mathrmpoly(\varepsilon^-1) for p2 and \tilde O(k^p/2)\mathrmpoly(\varepsilon^-1) for p2 . Prior constructions either achieved a similar size bound but produced a coreset with a modification of the original points [SW18, FKW21], or produced a coreset of the original points but lost \mathrmpoly(k) factors in the coreset size [HV20, WY23]. Our techniques also lead to the first nearly optimal online strong coresets for \ell_p subspace approximation with similar bounds as the offline setting, resolving a problem of [WY23]. All prior approaches lose \mathrmpoly(k) factors in this setting, even when allowed to modify the original points. Subjects: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG) Cite as: arXiv:2407.03262 [cs.DS] (or arXiv:2407.03262v1 [cs.DS] for this version) https://doi.org/10.48550/arXiv.2407.03262 Focus to learn more arXiv-issued DOI via DataCite

[LG-6] Magnetic Hysteresis Modeling with Neural Operators

链接: https://arxiv.org/abs/2407.03261
作者: Abhishek Chandra,Bram Daniels,Mitrofan Curti,Koen Tiels,Elena A. Lomonova
关键词: facilitating optimal designs, Fourier neural operator, facilitating optimal, optimal designs, magnetic
类目: Machine Learning (cs.LG); Robotics (cs.RO); Signal Processing (eess.SP); Systems and Control (eess.SY)
*备注: 8 pages, 5 figures

点击查看摘要

Abstract:Hysteresis modeling is crucial to comprehend the behavior of magnetic devices, facilitating optimal designs. Hitherto, deep learning-based methods employed to model hysteresis, face challenges in generalizing to novel input magnetic fields. This paper addresses the generalization challenge by proposing neural operators for modeling constitutive laws that exhibit magnetic hysteresis by learning a mapping between magnetic fields. In particular, two prominent neural operators – deep operator network and Fourier neural operator – are employed to predict novel first-order reversal curves and minor loops, where novel means they are not used to train the model. In addition, a rate-independent Fourier neural operator is proposed to predict material responses at sampling rates different from those used during training to incorporate the rate-independent characteristics of magnetic hysteresis. The presented numerical experiments demonstrate that neural operators efficiently model magnetic hysteresis, outperforming the traditional neural recurrent methods on various metrics and generalizing to novel magnetic fields. The findings emphasize the advantages of using neural operators for modeling hysteresis under varying magnetic conditions, underscoring their importance in characterizing magnetic material based devices.

[LG-7] Modern Neighborhood Components Analysis: A Deep Tabular Baseline Two Decades Later

链接: https://arxiv.org/abs/2407.03257
作者: Han-Jia Ye,Huai-Hong Yin,De-Chuan Zhan
关键词: Neighborhood Component Analysis, shown promising results, promising results compared, growing success, domains has prompted
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The growing success of deep learning in various domains has prompted investigations into its application to tabular data, where deep models have shown promising results compared to traditional tree-based methods. In this paper, we revisit Neighborhood Component Analysis (NCA), a classic tabular prediction method introduced in 2004, designed to learn a linear projection that captures semantic similarities between instances. We find that minor modifications, such as adjustments to the learning objectives and the integration of deep learning architectures, significantly enhance NCA’s performance, enabling it to surpass most modern deep tabular models. Additionally, we introduce a stochastic neighbor sampling strategy that improves both the efficiency and predictive accuracy of our proposed ModernNCA – sampling only a subset of neighbors during training, while utilizing the entire neighborhood during inference. Extensive experiments demonstrate that our ModernNCA achieves state-of-the-art results in both classification and regression tasks across various tabular datasets, outperforming both tree-based and other deep tabular models, while also reducing training time and model size.

[LG-8] When big data actually are low-rank or entrywise approximation of certain function-generated matrices

链接: https://arxiv.org/abs/2407.03250
作者: Stanislav Budzinskiy
关键词: article concerns low-rank, concerns low-rank approximation, article concerns, sampling a smooth, dimensional variables
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注: Fixed the definition of the function f1f_1 in Section 7.2 and figure captions

点击查看摘要

Abstract:The article concerns low-rank approximation of matrices generated by sampling a smooth function of two m -dimensional variables. We refute an argument made in the literature that, for a specific class of analytic functions, such matrices admit accurate entrywise approximation of rank that is independent of m . We provide a theoretical explanation of the numerical results presented in support of this argument, describing three narrower classes of functions for which n \times n function-generated matrices can be approximated within an entrywise error of order \varepsilon with rank \mathcalO(\log(n) \varepsilon^-2 \mathrmpolylog(\varepsilon^-1)) that is independent of the dimension m : (i) functions of the inner product of the two variables, (ii) functions of the squared Euclidean distance between the variables, and (iii) shift-invariant positive-definite kernels. We extend our argument to low-rank tensor-train approximation of tensors generated with functions of the multi-linear product of their m -dimensional variables. We discuss our results in the context of low-rank approximation of attention in transformer neural networks.

[LG-9] rrain Classification Enhanced with Uncertainty for Space Exploration Robots from Proprioceptive Data

链接: https://arxiv.org/abs/2407.03241
作者: Mariela De Lucas Álvarez,Jichen Guo,Raul Domínguez,Matias Valdenegro-Toro
关键词: space exploration, Neural Networks, Implementing Neural Network, Terrain Classification, Neural Network classifiers
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 6 pages, 4 figures. LatinX in AI Workshop @ ICML 2023 Camera Ready

点击查看摘要

Abstract:Terrain Classification is an essential task in space exploration, where unpredictable environments are difficult to observe using only exteroceptive sensors such as vision. Implementing Neural Network classifiers can have high performance but can be deemed untrustworthy as they lack transparency, which makes them unreliable for taking high-stakes decisions during mission planning. We address this by proposing Neural Networks with Uncertainty Quantification in Terrain Classification. We enable our Neural Networks with Monte Carlo Dropout, DropConnect, and Flipout in time series-capable architectures using only proprioceptive data as input. We use Bayesian Optimization with Hyperband for efficient hyperparameter optimization to find optimal models for trustworthy terrain classification.

[LG-10] Self-Evaluation as a Defense Against Adversarial Attacks on LLMs

链接: https://arxiv.org/abs/2407.03234
作者: Hannah Brown,Leon Lin,Kenji Kawaguchi,Michael Shieh
关键词: human-facing settings, deployed in sensitive, LLMs are deployed, answer unsafe, answer unsafe prompts
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
*备注: 8 pages, 7 figures

点击查看摘要

Abstract:When LLMs are deployed in sensitive, human-facing settings, it is crucial that they do not output unsafe, biased, or privacy-violating outputs. For this reason, models are both trained and instructed to refuse to answer unsafe prompts such as “Tell me how to build a bomb.” We find that, despite these safeguards, it is possible to break model defenses simply by appending a space to the end of a model’s input. In a study of eight open-source models, we demonstrate that this acts as a strong enough attack to cause the majority of models to generate harmful outputs with very high success rates. We examine the causes of this behavior, finding that the contexts in which single spaces occur in tokenized training data encourage models to generate lists when prompted, overriding training signals to refuse to answer unsafe requests. Our findings underscore the fragile state of current model alignment and promote the importance of developing more robust alignment methods. Code and data will be made available at this https URL.

[LG-11] Single Character Perturbations Break LLM Alignment

链接: https://arxiv.org/abs/2407.03232
作者: Leon Lin,Hannah Brown,Kenji Kawaguchi,Michael Shieh
关键词: human-facing settings, deployed in sensitive, LLMs are deployed, answer unsafe, answer unsafe prompts
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
*备注: 8 pages, 6 figures

点击查看摘要

Abstract:When LLMs are deployed in sensitive, human-facing settings, it is crucial that they do not output unsafe, biased, or privacy-violating outputs. For this reason, models are both trained and instructed to refuse to answer unsafe prompts such as “Tell me how to build a bomb.” We find that, despite these safeguards, it is possible to break model defenses simply by appending a space to the end of a model’s input. In a study of eight open-source models, we demonstrate that this acts as a strong enough attack to cause the majority of models to generate harmful outputs with very high success rates. We examine the causes of this behavior, finding that the contexts in which single spaces occur in tokenized training data encourage models to generate lists when prompted, overriding training signals to refuse to answer unsafe requests. Our findings underscore the fragile state of current model alignment and promote the importance of developing more robust alignment methods. Code and data will be available at this https URL.

[LG-12] How Does Quantization Affect Multilingual LLMs?

链接: https://arxiv.org/abs/2407.03211
作者: Kelly Marchisio,Saurabh Dash,Hongyu Chen,Dennis Aumiller,Ahmet Üstün,Sara Hooker,Sebastian Ruder
关键词: improve inference speed, techniques are widely, improve inference, inference speed, speed and deployment
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Quantization techniques are widely used to improve inference speed and deployment of large language models. While a wide body of work examines the impact of quantized LLMs on English tasks, none have examined the effect of quantization across languages. We conduct a thorough analysis of quantized multilingual LLMs, focusing on their performance across languages and at varying scales. We use automatic benchmarks, LLM-as-a-Judge methods, and human evaluation, finding that (1) harmful effects of quantization are apparent in human evaluation, and automatic metrics severely underestimate the detriment: a 1.7% average drop in Japanese across automatic tasks corresponds to a 16.0% drop reported by human evaluators on realistic prompts; (2) languages are disparately affected by quantization, with non-Latin script languages impacted worst; and (3) challenging tasks such as mathematical reasoning degrade fastest. As the ability to serve low-compute models is critical for wide global adoption of NLP technologies, our results urge consideration of multilingual performance as a key evaluation criterion for efficient models.

[LG-13] Combining AI Control Systems and Human Decision Support via Robustness and Criticality

链接: https://arxiv.org/abs/2407.03210
作者: Walt Woods,Alexander Grushin,Simon Khan,Alvaro Velasquez
关键词: AI-enabled capabilities, real world, capabilities are reaching, reaching the requisite, requisite level
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
*备注: 19 pages, 12 figures

点击查看摘要

Abstract:AI-enabled capabilities are reaching the requisite level of maturity to be deployed in the real world, yet do not always make correct or safe decisions. One way of addressing these concerns is to leverage AI control systems alongside and in support of human decisions, relying on the AI control system in safe situations while calling on a human co-decider for critical situations. We extend a methodology for adversarial explanations (AE) to state-of-the-art reinforcement learning frameworks, including MuZero. Multiple improvements to the base agent architecture are proposed. We demonstrate how this technology has two applications: for intelligent decision tools and to enhance training / learning frameworks. In a decision support context, adversarial explanations help a user make the correct decision by highlighting those contextual factors that would need to change for a different AI-recommended decision. As another benefit of adversarial explanations, we show that the learned AI control system demonstrates robustness against adversarial tampering. Additionally, we supplement AE by introducing strategically similar autoencoders (SSAs) to help users identify and understand all salient factors being considered by the AI system. In a training / learning framework, this technology can improve both the AI’s decisions and explanations through human interaction. Finally, to identify when AI decisions would most benefit from human oversight, we tie this combined system to our prior art on statistically verified analyses of the criticality of decisions at any point in time.

[LG-14] Prediction Instability in Machine Learning Ensembles

链接: https://arxiv.org/abs/2407.03194
作者: Jeremy Kedziora
关键词: machine learning ensembles, machine learning, learning ensembles predictions, multiple models, models
类目: Machine Learning (cs.LG)
*备注: 15 pages, uses a modified version of ICML2024.sty

点击查看摘要

Abstract:In machine learning ensembles predictions from multiple models are aggregated. Despite widespread use and strong performance of ensembles in applied problems little is known about the mathematical properties of aggregating models and associated consequences for safe, explainable use of such models. In this paper we prove a theorem that shows that any ensemble will exhibit at least one of the following forms of prediction instability. It will either ignore agreement among all underlying models, change its mind when none of the underlying models have done so, or be manipulable through inclusion or exclusion of options it would never actually predict. As a consequence, ensemble aggregation procedures will always need to balance the benefits of information use against the risk of these prediction instabilities. This analysis also sheds light on what specific forms of prediction instability to expect from particular ensemble algorithms; for example popular tree ensembles like random forest, or xgboost will violate basic, intuitive monotonicity and fairness properties.

[LG-15] Cutting through the noise to motivate people: A comprehensive analysis of COVID-19 social media posts de/motivating vaccination

链接: https://arxiv.org/abs/2407.03190
作者: Ashiqur Rahman,Ehsan Mohammadi,Hamed Alhoori
关键词: healthcare information system, pandemic exposed significant, exposed significant weaknesses, pandemic exposed, information system
类目: Computers and Society (cs.CY); Computation and Language (cs.CL); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注: 51 pages, 13 figures, 12 tables. Accepted at Natural Language Processing Journal

点击查看摘要

Abstract:The COVID-19 pandemic exposed significant weaknesses in the healthcare information system. The overwhelming volume of misinformation on social media and other socioeconomic factors created extraordinary challenges to motivate people to take proper precautions and get vaccinated. In this context, our work explored a novel direction by analyzing an extensive dataset collected over two years, identifying the topics de/motivating the public about COVID-19 vaccination. We analyzed these topics based on time, geographic location, and political orientation. We noticed that while the motivating topics remain the same over time and geographic location, the demotivating topics rapidly. We also identified that intrinsic motivation, rather than external mandate, is more advantageous to inspire the public. This study addresses scientific communication and public motivation in social media. It can help public health officials, policymakers, and social media platforms develop more effective messaging strategies to cut through the noise of misinformation and educate the public about scientific findings.

[LG-16] Multiple-Resolution Tokenization for Time Series Forecasting with an Application to Pricing

链接: https://arxiv.org/abs/2407.03185
作者: Egon Peršak,Miguel F. Anjos,Sebastian Lautz,Aleksandar Kolev
关键词: time series forecasting, time series tokenisation, real-world prediction problem, time series, pricing domain
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We propose a transformer architecture for time series forecasting with a focus on time series tokenisation and apply it to a real-world prediction problem from the pricing domain. Our architecture aims to learn effective representations at many scales across all available data simultaneously. The model contains a number of novel modules: a differentiated form of time series patching which employs multiple resolutions, a multiple-resolution module for time-varying known variables, a mixer-based module for capturing cross-series information, and a novel output head with favourable scaling to account for the increased number of tokens. We present an application of this model to a real world prediction problem faced by the markdown team at a very large retailer. On the experiments conducted our model outperforms in-house models and the selected existing deep learning architectures.

[LG-17] Motion meets Attention: Video Motion Prompts

链接: https://arxiv.org/abs/2407.03179
作者: Qixiang Chen,Lei Wang,Piotr Koniusz,Tom Gedeon
关键词: rich spatio-temporal information, spatio-temporal information, rich spatio-temporal, motion, attention
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Research report

点击查看摘要

Abstract:Videos contain rich spatio-temporal information. Traditional methods for extracting motion, used in tasks such as action recognition, often rely on visual contents rather than precise motion features. This phenomenon is referred to as ‘blind motion extraction’ behavior, which proves inefficient in capturing motions of interest due to a lack of motion-guided cues. Recently, attention mechanisms have enhanced many computer vision tasks by effectively highlighting salient visual areas. Inspired by this, we propose using a modified Sigmoid function with learnable slope and shift parameters as an attention mechanism to activate and modulate motion signals derived from frame differencing maps. This approach generates a sequence of attention maps that enhance the processing of motion-related video content. To ensure temporally continuity and smoothness of the attention maps, we apply pair-wise temporal attention variation regularization to remove unwanted motions (e.g., noise) while preserving important ones. We then perform Hadamard product between each pair of attention maps and the original video frames to highlight the evolving motions of interest over time. These highlighted motions, termed video motion prompts, are subsequently used as inputs to the model instead of the original video frames. We formalize this process as a motion prompt layer and incorporate the regularization term into the loss function to learn better motion prompts. This layer serves as an adapter between the model and the video data, bridging the gap between traditional ‘blind motion extraction’ and the extraction of relevant motions of interest.

[LG-18] Relating CNN-Transformer Fusion Network for Change Detection

链接: https://arxiv.org/abs/2407.03178
作者: Yuhao Gao,Gensheng Pei,Mengmeng Sheng,Zeren Sun,Tao Chen,Yazhou Yao
关键词: revolutionized remote sensing, incomplete change learning, convolutional neural networks, miss crucial features, crucial features due
类目: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: accepted by IEEE Conference on Multimedia Expo

点击查看摘要

Abstract:While deep learning, particularly convolutional neural networks (CNNs), has revolutionized remote sensing (RS) change detection (CD), existing approaches often miss crucial features due to neglecting global context and incomplete change learning. Additionally, transformer networks struggle with low-level details. RCTNet addresses these limitations by introducing \textbf(1) an early fusion backbone to exploit both spatial and temporal features early on, \textbf(2) a Cross-Stage Aggregation (CSA) module for enhanced temporal representation, \textbf(3) a Multi-Scale Feature Fusion (MSF) module for enriched feature extraction in the decoder, and \textbf(4) an Efficient Self-deciphering Attention (ESA) module utilizing transformers to capture global information and fine-grained details for accurate change detection. Extensive experiments demonstrate RCTNet’s clear superiority over traditional RS image CD methods, showing significant improvement and an optimal balance between accuracy and computational cost.

[LG-19] Bunny-VisionPro: Real-Time Bimanual Dexterous Teleoperation for Imitation Learning

链接: https://arxiv.org/abs/2407.03162
作者: Runyu Ding,Yuzhe Qin,Jiyue Zhu,Chengzhe Jia,Shiqi Yang,Ruihan Yang,Xiaojuan Qi,Xiaolong Wang
关键词: collecting human demonstrations, remains a challenge, collecting human, controlling robots, dexterous hands remains
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: project page: this https URL

点击查看摘要

Abstract:Teleoperation is a crucial tool for collecting human demonstrations, but controlling robots with bimanual dexterous hands remains a challenge. Existing teleoperation systems struggle to handle the complexity of coordinating two hands for intricate manipulations. We introduce Bunny-VisionPro, a real-time bimanual dexterous teleoperation system that leverages a VR headset. Unlike previous vision-based teleoperation systems, we design novel low-cost devices to provide haptic feedback to the operator, enhancing immersion. Our system prioritizes safety by incorporating collision and singularity avoidance while maintaining real-time performance through innovative designs. Bunny-VisionPro outperforms prior systems on a standard task suite, achieving higher success rates and reduced task completion times. Moreover, the high-quality teleoperation demonstrations improve downstream imitation learning performance, leading to better generalizability. Notably, Bunny-VisionPro enables imitation learning with challenging multi-stage, long-horizon dexterous manipulation tasks, which have rarely been addressed in previous work. Our system’s ability to handle bimanual manipulations while prioritizing safety and real-time performance makes it a powerful tool for advancing dexterous manipulation and imitation learning.

[LG-20] SOS! Soft Prompt Attack Against Open-Source Large Language Models

链接: https://arxiv.org/abs/2407.03160
作者: Ziqing Yang,Michael Backes,Yang Zhang,Ahmed Salem
关键词: Open-source large language, large language models, public and industry, Open-source large, large language
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Open-source large language models (LLMs) have become increasingly popular among both the general public and industry, as they can be customized, fine-tuned, and freely used. However, some open-source LLMs require approval before usage, which has led to third parties publishing their own easily accessible versions. Similarly, third parties have been publishing fine-tuned or quantized variants of these LLMs. These versions are particularly appealing to users because of their ease of access and reduced computational resource demands. This trend has increased the risk of training time attacks, compromising the integrity and security of LLMs. In this work, we present a new training time attack, SOS, which is designed to be low in computational demand and does not require clean data or modification of the model weights, thereby maintaining the model’s utility intact. The attack addresses security issues in various scenarios, including the backdoor attack, jailbreak attack, and prompt stealing attack. Our experimental findings demonstrate that the proposed attack is effective across all evaluated targets. Furthermore, we present the other side of our SOS technique, namely the copyright token – a novel technique that enables users to mark their copyrighted content and prevent models from using it.

[LG-21] Let the Code LLM Edit Itself When You Edit the Code

链接: https://arxiv.org/abs/2407.03157
作者: Zhenyu He,Jun Zhang,Shengjie Luo,Jingjing Xu,Zhi Zhang,Di He
关键词: developer edits existing, edits existing code, large language model, investigate a typical, typical scenario
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注: Preprint. Work in Progress

点击查看摘要

Abstract:In this work, we investigate a typical scenario in code generation where a developer edits existing code in real time and requests a code assistant, e.g., a large language model, to re-predict the next token or next line on the fly. Naively, the LLM needs to re-encode the entire KV cache to provide an accurate prediction. However, this process is computationally expensive, especially when the sequence length is long. Simply encoding the edited subsequence and integrating it to the original KV cache meets the temporal confusion problem, leading to significantly worse performance. We address this efficiency and accuracy trade-off by introducing \underline\textbfPositional \textbfIntegrity \textbfEncoding (PIE). Building upon the rotary positional encoding, PIE first removes the rotary matrices in the Key cache that introduce temporal confusion and then reapplies the correct rotary matrices. This process ensures that positional relationships between tokens are correct and requires only a single round of matrix multiplication. We validate the effectiveness of PIE through extensive experiments on the RepoBench-C-8k dataset, utilizing DeepSeek-Coder models with 1.3B, 6.7B, and 33B parameters. Our evaluation includes three real-world coding tasks: code insertion, code deletion, and multi-place code editing. Results demonstrate that PIE reduces computational overhead by over 85% compared to the standard full recomputation approach across all model sizes and tasks while well approximating the model performance.

[LG-22] Reinforcement Learning for Sequence Design Leveraging Protein Language Models

链接: https://arxiv.org/abs/2407.03154
作者: Jithendaraa Subramanian,Shivakanth Sujit,Niloy Irtisam,Umong Sain,Derek Nowrouzezahrai,Samira Ebrahimi Kahou,Riashat Islam
关键词: amino acid sequences, Protein sequence design, protein engineering problems, determined by amino, drug discovery
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)
*备注: 22 pages, 7 figures, 4 tables

点击查看摘要

Abstract:Protein sequence design, determined by amino acid sequences, are essential to protein engineering problems in drug discovery. Prior approaches have resorted to evolutionary strategies or Monte-Carlo methods for protein design, but often fail to exploit the structure of the combinatorial search space, to generalize to unseen sequences. In the context of discrete black box optimization over large search spaces, learning a mutation policy to generate novel sequences with reinforcement learning is appealing. Recent advances in protein language models (PLMs) trained on large corpora of protein sequences offer a potential solution to this problem by scoring proteins according to their biological plausibility (such as the TM-score). In this work, we propose to use PLMs as a reward function to generate new sequences. Yet the PLM can be computationally expensive to query due to its large size. To this end, we propose an alternative paradigm where optimization can be performed on scores from a smaller proxy model that is periodically finetuned, jointly while learning the mutation policy. We perform extensive experiments on various sequence lengths to benchmark RL-based approaches, and provide comprehensive evaluations along biological plausibility and diversity of the protein. Our experimental results include favorable evaluations of the proposed sequences, along with high diversity scores, demonstrating that RL is a strong candidate for biological sequence design. Finally, we provide a modular open source implementation can be easily integrated in most RL training loops, with support for replacing the reward model with other PLMs, to spur further research in this domain. The code for all experiments is provided in the supplementary material.

[LG-23] Efficient Shapley Values for Attributing Global Properties of Diffusion Models to Data Group

链接: https://arxiv.org/abs/2407.03153
作者: Chris Lin,Mingyu Lu,Chanwoo Kim,Su-In Lee
关键词: ensure fair acknowledgment, real-world settings, harmful content, high-quality training data, deployed in real-world
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:As diffusion models are deployed in real-world settings, data attribution is needed to ensure fair acknowledgment for contributors of high-quality training data and to identify sources of harmful content. Previous work focuses on identifying individual training samples important for the generation of a given image. However, instead of focusing on a given generated image, some use cases require understanding global properties of the distribution learned by a diffusion model (e.g., demographic diversity). Furthermore, training data for diffusion models are often contributed in groups rather than separately (e.g., multiple artworks from the same artist). Hence, here we tackle the problem of attributing global properties of diffusion models to groups of training data. Specifically, we develop a method to efficiently estimate Shapley values by leveraging model pruning and fine-tuning. We empirically demonstrate the utility of our method with three use cases: (i) global image quality for a DDPM trained on a CIFAR dataset, (ii) demographic diversity for an LDM trained on CelebA-HQ, and (iii) overall aesthetic quality for a Stable Diffusion model LoRA-finetuned on Post-Impressionist artworks.

[LG-24] Stereo Risk: A Continuous Modeling Approach to Stereo Matching

链接: https://arxiv.org/abs/2407.03152
作者: Ce Liu,Suryansh Kumar,Shuhang Gu,Radu Timofte,Yao Yao,Luc Van Gool
关键词: introduce Stereo Risk, classical stereo-matching problem, Stereo Risk, computer vision, scene disparity
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted as an Oral Paper at ICML 2024. Draft info: 18 pages, 6 Figure, 16 Tables

点击查看摘要

Abstract:We introduce Stereo Risk, a new deep-learning approach to solve the classical stereo-matching problem in computer vision. As it is well-known that stereo matching boils down to a per-pixel disparity estimation problem, the popular state-of-the-art stereo-matching approaches widely rely on regressing the scene disparity values, yet via discretization of scene disparity values. Such discretization often fails to capture the nuanced, continuous nature of scene depth. Stereo Risk departs from the conventional discretization approach by formulating the scene disparity as an optimal solution to a continuous risk minimization problem, hence the name “stereo risk”. We demonstrate that L^1 minimization of the proposed continuous risk function enhances stereo-matching performance for deep networks, particularly for disparities with multi-modal probability distributions. Furthermore, to enable the end-to-end network training of the non-differentiable L^1 risk optimization, we exploited the implicit function theorem, ensuring a fully differentiable network. A comprehensive analysis demonstrates our method’s theoretical soundness and superior performance over the state-of-the-art methods across various benchmark datasets, including KITTI 2012, KITTI 2015, ETH3D, SceneFlow, and Middlebury 2014.

[LG-25] Enhancing Class Fairness in Classification with A Two-Player Game Approach

链接: https://arxiv.org/abs/2407.03146
作者: Yunpeng Jiang,Paul Weng,Yutong Ban
关键词: machine learning tasks, Data augmentation, widely applied, shown its benefits, machine learning
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Data augmentation is widely applied and has shown its benefits in different machine learning tasks. However, as recently observed in some downstream tasks, data augmentation may introduce an unfair impact on classifications. While it can improve the performance of some classes, it can actually be detrimental for other classes, which can be problematic in some application domains. In this paper, to counteract this phenomenon, we propose a FAir Classification approach with a Two-player game (FACT). We first formulate the training of a classifier with data augmentation as a fair optimization problem, which can be further written as an adversarial two-player game. Following this formulation, we propose a novel multiplicative weight optimization algorithm, for which we theoretically prove that it can converge to a solution that is fair over classes. Interestingly, our formulation also reveals that this fairness issue over classes is not due to data augmentation only, but is in fact a general phenomenon. Our empirical experiments demonstrate that the performance of our learned classifiers is indeed more fairly distributed over classes in five datasets, with only limited impact on the average accuracy.

[LG-26] Quantifying the Cross-sectoral Intersecting Discrepancies within Multiple Groups Using Latent Class Analysis Towards Fairness

链接: https://arxiv.org/abs/2407.03133
作者: Yingfang Yuan,Kefan Chen,Mehdi Rizvi,Lynne Baillie,Wei Pang
关键词: growing interest, interest in fair, cross-sectoral intersecting discrepancies, discrepancies, intersecting discrepancies
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:The growing interest in fair AI development is evident. The ‘‘Leave No One Behind’’ initiative urges us to address multiple and intersecting forms of inequality in accessing services, resources, and opportunities, emphasising the significance of fairness in AI. This is particularly relevant as an increasing number of AI tools are applied to decision-making processes, such as resource allocation and service scheme development, across various sectors such as health, energy, and housing. Therefore, exploring joint inequalities in these sectors is significant and valuable for thoroughly understanding overall inequality and unfairness. This research introduces an innovative approach to quantify cross-sectoral intersecting discrepancies among user-defined groups using latent class analysis. These discrepancies can be used to approximate inequality and provide valuable insights to fairness issues. We validate our approach using both proprietary and public datasets, including EVENS and Census 2021 (England \ Wales) datasets, to examine cross-sectoral intersecting discrepancies among different ethnic groups. We also verify the reliability of the quantified discrepancy by conducting a correlation analysis with a government public metric. Our findings reveal significant discrepancies between minority ethnic groups, highlighting the need for targeted interventions in real-world AI applications. Additionally, we demonstrate how the proposed approach can be used to provide insights into the fairness of machine learning.

[LG-27] Speaker- and Text-Independent Estimation of Articulatory Movements and Phoneme Alignments from Speech

链接: https://arxiv.org/abs/2407.03132
作者: Tobias Weise,Philipp Klumpp,Kubilay Can Demir,Paula Andrea Pérez-Toro,Maria Schuster,Elmar Noeth,Bjoern Heismann,Andreas Maier,Seung Hee Yang
关键词: previously treated separately, motion estimation, previously treated, treated separately, speech inversion
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: to be published in Interspeech 2024 proceedings

点击查看摘要

Abstract:This paper introduces a novel combination of two tasks, previously treated separately: acoustic-to-articulatory speech inversion (AAI) and phoneme-to-articulatory (PTA) motion estimation. We refer to this joint task as acoustic phoneme-to-articulatory speech inversion (APTAI) and explore two different approaches, both working speaker- and text-independently during inference. We use a multi-task learning setup, with the end-to-end goal of taking raw speech as input and estimating the corresponding articulatory movements, phoneme sequence, and phoneme alignment. While both proposed approaches share these same requirements, they differ in their way of achieving phoneme-related predictions: one is based on frame classification, the other on a two-staged training procedure and forced alignment. We reach competitive performance of 0.73 mean correlation for the AAI task and achieve up to approximately 87% frame overlap compared to a state-of-the-art text-dependent phoneme force aligner.

[LG-28] Foundations and Frontiers of Graph Learning Theory

链接: https://arxiv.org/abs/2407.03125
作者: Yu Huang,Min Zhou,Menglin Yang,Zhen Wang,Muhan Zhang,Jie Wang,Hong Xie,Hao Wang,Defu Lian,Enhong Chen
关键词: Recent advancements, Graph Neural Networks, complex structures, neural network architectures, analyze data
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 36pages,273references

点击查看摘要

Abstract:Recent advancements in graph learning have revolutionized the way to understand and analyze data with complex structures. Notably, Graph Neural Networks (GNNs), i.e. neural network architectures designed for learning graph representations, have become a popular paradigm. With these models being usually characterized by intuition-driven design or highly intricate components, placing them within the theoretical analysis framework to distill the core concepts, helps understand the key principles that drive the functionality better and guide further development. Given this surge in interest, this article provides a comprehensive summary of the theoretical foundations and breakthroughs concerning the approximation and learning behaviors intrinsic to prevalent graph learning models. Encompassing discussions on fundamental aspects such as expressiveness power, generalization, optimization, and unique phenomena such as over-smoothing and over-squashing, this piece delves into the theoretical foundations and frontier driving the evolution of graph learning. In addition, this article also presents several challenges and further initiates discussions on possible solutions.

[LG-29] Can machine learning solve the challenge of adaptive learning and the individualization of learning paths? A field experiment in an online learning platform

链接: https://arxiv.org/abs/2407.03118
作者: Tim Klausmann,Marius Köppel,Daniel Schunk,Isabell Zipperle
关键词: digital technologies promises, technologies promises large, social benefits, technologies promises, learning contents based
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The individualization of learning contents based on digital technologies promises large individual and social benefits. However, it remains an open question how this individualization can be implemented. To tackle this question we conduct a randomized controlled trial on a large digital self-learning platform. We develop an algorithm based on two convolutional neural networks that assigns tasks to 4,365 learners according to their learning paths. Learners are randomized into three groups: two treatment groups – a group-based adaptive treatment group and an individual adaptive treatment group – and one control group. We analyze the difference between the three groups with respect to effort learners provide and their performance on the platform. Our null results shed light on the multiple challenges associated with the individualization of learning paths.

[LG-30] Compressed Latent Replays for Lightweight Continual Learning on Spiking Neural Networks

链接: https://arxiv.org/abs/2407.03111
作者: Alberto Dequino,Alessio Carpegna,Davide Nadalini,Alessandro Savino,Luca Benini,Stefano Di Carlo,Francesco Conti
关键词: Rehearsal-based Continual Learning, Deep Neural Networks, Spiking Neural Networks, Rehearsal-based Continual, Continual Learning
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Rehearsal-based Continual Learning (CL) has been intensely investigated in Deep Neural Networks (DNNs). However, its application in Spiking Neural Networks (SNNs) has not been explored in depth. In this paper we introduce the first memory-efficient implementation of Latent Replay (LR)-based CL for SNNs, designed to seamlessly integrate with resource-constrained devices. LRs combine new samples with latent representations of previously learned data, to mitigate forgetting. Experiments on the Heidelberg SHD dataset with Sample and Class-Incremental tasks reach a Top-1 accuracy of 92.5% and 92%, respectively, without forgetting the previously learned information. Furthermore, we minimize the LRs’ requirements by applying a time-domain compression, reducing by two orders of magnitude their memory requirement, with respect to a naive rehearsal setup, with a maximum accuracy drop of 4%. On a Multi-Class-Incremental task, our SNN learns 10 new classes from an initial set of 10, reaching a Top-1 accuracy of 78.4% on the full test set.

[LG-31] How Reliable and Stable are Explanations of XAI Methods?

链接: https://arxiv.org/abs/2407.03108
作者: José Ribeiro,Lucas Cardoso,Vitor Santos,Eduardo Carvalho,Níkolas Carneiro,Ronnie Alves
关键词: Explainable Artificial Intelligence, Black box models, XAI Methods, living in society, Black box
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 15 pages, 6 figures, submitted to BRACIS 2024

点击查看摘要

Abstract:Black box models are increasingly being used in the daily lives of human beings living in society. Along with this increase, there has been the emergence of Explainable Artificial Intelligence (XAI) methods aimed at generating additional explanations regarding how the model makes certain predictions. In this sense, methods such as Dalex, Eli5, eXirt, Lofo and Shap emerged as different proposals and methodologies for generating explanations of black box models in an agnostic way. Along with the emergence of these methods, questions arise such as “How Reliable and Stable are XAI Methods?”. With the aim of shedding light on this main question, this research creates a pipeline that performs experiments using the diabetes dataset and four different machine learning models (LGBM, MLP, DT and KNN), creating different levels of perturbations of the test data and finally generates explanations from the eXirt method regarding the confidence of the models and also feature relevances ranks from all XAI methods mentioned, in order to measure their stability in the face of perturbations. As a result, it was found that eXirt was able to identify the most reliable models among all those used. It was also found that current XAI methods are sensitive to perturbations, with the exception of one specific method.

[LG-32] On Generalization for Generative Flow Networks

链接: https://arxiv.org/abs/2407.03105
作者: Anas Krichel,Nikolay Malkin,Salem Lahlou,Yoshua Bengio
关键词: Generative Flow Networks, Generative Flow, Flow Networks, innovative learning paradigm, learning paradigm designed
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Generative Flow Networks (GFlowNets) have emerged as an innovative learning paradigm designed to address the challenge of sampling from an unnormalized probability distribution, called the reward function. This framework learns a policy on a constructed graph, which enables sampling from an approximation of the target probability distribution through successive steps of sampling from the learned policy. To achieve this, GFlowNets can be trained with various objectives, each of which can lead to the model s ultimate goal. The aspirational strength of GFlowNets lies in their potential to discern intricate patterns within the reward function and their capacity to generalize effectively to novel, unseen parts of the reward function. This paper attempts to formalize generalization in the context of GFlowNets, to link generalization with stability, and also to design experiments that assess the capacity of these models to uncover unseen parts of the reward function. The experiments will focus on length generalization meaning generalization to states that can be constructed only by longer trajectories than those seen in training.

[LG-33] Conformal Prediction for Causal Effects of Continuous Treatments

链接: https://arxiv.org/abs/2407.03094
作者: Maresa Schröder,Dennis Frauen,Jonas Schweisthal,Konstantin Heß,Valentyn Melnychuk,Stefan Feuerriegel
关键词: conformal prediction, personalized medicine, prediction, crucial for safety-critical, safety-critical applications
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Uncertainty quantification of causal effects is crucial for safety-critical applications such as personalized medicine. A powerful approach for this is conformal prediction, which has several practical benefits due to model-agnostic finite-sample guarantees. Yet, existing methods for conformal prediction of causal effects are limited to binary/discrete treatments and make highly restrictive assumptions such as known propensity scores. In this work, we provide a novel conformal prediction method for potential outcomes of continuous treatments. We account for the additional uncertainty introduced through propensity estimation so that our conformal prediction intervals are valid even if the propensity score is unknown. Our contributions are three-fold: (1) We derive finite-sample prediction intervals for potential outcomes of continuous treatments. (2) We provide an algorithm for calculating the derived intervals. (3) We demonstrate the effectiveness of the conformal prediction intervals in experiments on synthetic and real-world datasets. To the best of our knowledge, we are the first to propose conformal prediction for continuous treatments when the propensity score is unknown and must be estimated from data.

[LG-34] Revisiting the Performance of Deep Learning-Based Vulnerability Detection on Realistic Datasets

链接: https://arxiv.org/abs/2407.03093
作者: Partha Chakraborty,Krishna Kanth Arumugam,Mahmoud Alfadel,Meiyappan Nagappan,Shane McIntosh
关键词: everyday software systems, software vulnerabilities, everyday software, software systems, vulnerabilities on everyday
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The impact of software vulnerabilities on everyday software systems is significant. Despite deep learning models being proposed for vulnerability detection, their reliability is questionable. Prior evaluations show high recall/F1 scores of up to 99%, but these models underperform in practical scenarios, particularly when assessed on entire codebases rather than just the fixing commit. This paper introduces Real-Vul, a comprehensive dataset representing real-world scenarios for evaluating vulnerability detection models. Evaluating DeepWukong, LineVul, ReVeal, and IVDetect shows a significant drop in performance, with precision decreasing by up to 95 percentage points and F1 scores by up to 91 points. Furthermore, Model performance fluctuates based on vulnerability characteristics, with better F1 scores for information leaks or code injection than for path resolution or predictable return values. The results highlight a significant performance gap that needs addressing before deploying deep learning-based vulnerability detection in practical settings. Overfitting is identified as a key issue, and an augmentation technique is proposed, potentially improving performance by up to 30%. Contributions include a dataset creation approach for better model evaluation, Real-Vul dataset, and empirical evidence of deep learning models struggling in real-world settings.

[LG-35] Effective Heterogeneous Federated Learning via Efficient Hypernetwork-based Weight Generation

链接: https://arxiv.org/abs/2407.03086
作者: Yujin Shin,Kichang Lee,Sungmin Lee,You Rim Choi,Hyung-Sin Kim,JeongGil Ko
关键词: faces challenges due, learning leverages distributed, leverages distributed client, leverages distributed, faces challenges
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:While federated learning leverages distributed client resources, it faces challenges due to heterogeneous client capabilities. This necessitates allocating models suited to clients’ resources and careful parameter aggregation to accommodate this heterogeneity. We propose HypeMeFed, a novel federated learning framework for supporting client heterogeneity by combining a multi-exit network architecture with hypernetwork-based model weight generation. This approach aligns the feature spaces of heterogeneous model layers and resolves per-layer information disparity during weight aggregation. To practically realize HypeMeFed, we also propose a low-rank factorization approach to minimize computation and memory overhead associated with hypernetworks. Our evaluations on a real-world heterogeneous device testbed indicate that HypeMeFed enhances accuracy by 5.12% over FedAvg, reduces the hypernetwork memory requirements by 98.22%, and accelerates its operations by 1.86 times compared to a naive hypernetwork approach. These results demonstrate HypeMeFed’s effectiveness in leveraging and engaging heterogeneous clients for federated learning.

[LG-36] Stable Heterogeneous Treatment Effect Estimation across Out-of-Distribution Populations

链接: https://arxiv.org/abs/2407.03082
作者: Yuling Zhang,Anpeng Wu,Kun Kuang,Liang Du,Zixun Sun,Zhi Wang
关键词: Heterogeneous treatment effect, Heterogeneous treatment, treatment effect, stable HTE estimation, HTE estimation
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted by ICDE’2024

点击查看摘要

Abstract:Heterogeneous treatment effect (HTE) estimation is vital for understanding the change of treatment effect across individuals or subgroups. Most existing HTE estimation methods focus on addressing selection bias induced by imbalanced distributions of confounders between treated and control units, but ignore distribution shifts across populations. Thereby, their applicability has been limited to the in-distribution (ID) population, which shares a similar distribution with the training dataset. In real-world applications, where population distributions are subject to continuous changes, there is an urgent need for stable HTE estimation across out-of-distribution (OOD) populations, which, however, remains an open problem. As pioneers in resolving this problem, we propose a novel Stable Balanced Representation Learning with Hierarchical-Attention Paradigm (SBRL-HAP) framework, which consists of 1) Balancing Regularizer for eliminating selection bias, 2) Independence Regularizer for addressing the distribution shift issue, 3) Hierarchical-Attention Paradigm for coordination between balance and independence. In this way, SBRL-HAP regresses counterfactual outcomes using ID data, while ensuring the resulting HTE estimation can be successfully generalized to out-of-distribution scenarios, thereby enhancing the model’s applicability in real-world settings. Extensive experiments conducted on synthetic and real-world datasets demonstrate the effectiveness of our SBRL-HAP in achieving stable HTE estimation across OOD populations, with an average 10% reduction in the error metric PEHE and 11% decrease in the ATE bias, compared to the SOTA methods.

[LG-37] Artificial Inductive Bias for Synthetic Tabular Data Generation in Data-Scarce Scenarios

链接: https://arxiv.org/abs/2407.03080
作者: Patricia A. Apellániz,Ana Jiménez,Borja Arroyo Galende,Juan Parras,Santiago Zazo
关键词: Deep Generative Models, generation using Deep, substantial training data, synthetic tabular data, Deep Generative
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 19 pages, 6 Figures

点击查看摘要

Abstract:While synthetic tabular data generation using Deep Generative Models (DGMs) offers a compelling solution to data scarcity and privacy concerns, their effectiveness relies on substantial training data, often unavailable in real-world applications. This paper addresses this challenge by proposing a novel methodology for generating realistic and reliable synthetic tabular data with DGMs in limited real-data environments. Our approach proposes several ways to generate an artificial inductive bias in a DGM through transfer learning and meta-learning techniques. We explore and compare four different methods within this framework, demonstrating that transfer learning strategies like pre-training and model averaging outperform meta-learning approaches, like Model-Agnostic Meta-Learning, and Domain Randomized Search. We validate our approach using two state-of-the-art DGMs, namely, a Variational Autoencoder and a Generative Adversarial Network, to show that our artificial inductive bias fuels superior synthetic data quality, as measured by Jensen-Shannon divergence, achieving relative gains of up to 50% when using our proposed approach. This methodology has broad applicability in various DGMs and machine learning tasks, particularly in areas like healthcare and finance, where data scarcity is often a critical issue.

[LG-38] Warm-up Free Policy Optimization: Improved Regret in Linear Markov Decision Processes

链接: https://arxiv.org/abs/2407.03065
作者: Asaf Cassel,Aviv Rosenberg
关键词: popular Reinforcement Learning, Policy Optimization, Reinforcement Learning, popular Reinforcement, Markov Decision Process
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Policy Optimization (PO) methods are among the most popular Reinforcement Learning (RL) algorithms in practice. Recently, Sherman et al. [2023a] proposed a PO-based algorithm with rate-optimal regret guarantees under the linear Markov Decision Process (MDP) model. However, their algorithm relies on a costly pure exploration warm-up phase that is hard to implement in practice. This paper eliminates this undesired warm-up phase, replacing it with a simple and efficient contraction mechanism. Our PO algorithm achieves rate-optimal regret with improved dependence on the other parameters of the problem (horizon and function approximation dimension) in two fundamental settings: adversarial losses with full-information feedback and stochastic losses with bandit feedback.

[LG-39] FairJob: A Real-World Dataset for Fairness in Online Systems

链接: https://arxiv.org/abs/2407.03059
作者: Mariia Vladimirova,Federico Pavone,Eustache Diemert
关键词: designed to foster, real-world scenarios, foster research, research in algorithmic, introduce a fairness-aware
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (stat.ML)
*备注: 24 pages, 15 figures

点击查看摘要

Abstract:We introduce a fairness-aware dataset for job recommendation in advertising, designed to foster research in algorithmic fairness within real-world scenarios. It was collected and prepared to comply with privacy standards and business confidentiality. An additional challenge is the lack of access to protected user attributes such as gender, for which we propose a solution to obtain a proxy estimate. Despite being anonymized and including a proxy for a sensitive attribute, our dataset preserves predictive power and maintains a realistic and challenging benchmark. This dataset addresses a significant gap in the availability of fairness-focused resources for high-impact domains like advertising – the actual impact being having access or not to precious employment opportunities, where balancing fairness and utility is a common industrial challenge. We also explore various stages in the advertising process where unfairness can occur and introduce a method to compute a fair utility metric for the job recommendations in online systems case from a biased dataset. Experimental evaluations of bias mitigation techniques on the released dataset demonstrate potential improvements in fairness and the associated trade-offs with utility.

[LG-40] Improving Zero-shot Generalization of Learned Prompts via Unsupervised Knowledge Distillation

链接: https://arxiv.org/abs/2407.03056
作者: Marco Mistretta,Alberto Baldrati,Marco Bertini,Andrew D. Bagdanov
关键词: limited data, Vision-Language Models, Prompt learning, unseen tasks, downstream tasks
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted for publication at ECCV24

点击查看摘要

Abstract:Vision-Language Models (VLMs) demonstrate remarkable zero-shot generalization to unseen tasks, but fall short of the performance of supervised methods in generalizing to downstream tasks with limited data. Prompt learning is emerging as a parameter-efficient method for adapting VLMs, but state-of-the-art approaches require annotated samples. In this paper we propose a novel approach to prompt learning based on unsupervised knowledge distillation from more powerful models. Our approach, which we call Knowledge Distillation Prompt Learning (KDPL), can be integrated into existing prompt learning techniques and eliminates the need for labeled examples during adaptation. Our experiments on more than ten standard benchmark datasets demonstrate that KDPL is very effective at improving generalization of learned prompts for zero-shot domain generalization, zero-shot cross-dataset generalization, and zero-shot base-to-novel class generalization problems. KDPL requires no ground-truth labels for adaptation, and moreover we show that even in the absence of any knowledge of training class names it can be used to effectively transfer knowledge. The code is publicly available at this https URL.

[LG-41] JailbreakHunter: A Visual Analytics Approach for Jailbreak Prompts Discovery from Large-Scale Human-LLM Conversational Datasets

链接: https://arxiv.org/abs/2407.03045
作者: Zhihua Jin,Shiyi Liu,Haotian Li,Xun Zhao,Huamin Qu
关键词: Large Language Models, Large Language, Language Models, gained significant attention, Jailbreak prompts
类目: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: 18 pages, 9 figures

点击查看摘要

Abstract:Large Language Models (LLMs) have gained significant attention but also raised concerns due to the risk of misuse. Jailbreak prompts, a popular type of adversarial attack towards LLMs, have appeared and constantly evolved to breach the safety protocols of LLMs. To address this issue, LLMs are regularly updated with safety patches based on reported jailbreak prompts. However, malicious users often keep their successful jailbreak prompts private to exploit LLMs. To uncover these private jailbreak prompts, extensive analysis of large-scale conversational datasets is necessary to identify prompts that still manage to bypass the system’s defenses. This task is highly challenging due to the immense volume of conversation data, diverse characteristics of jailbreak prompts, and their presence in complex multi-turn conversations. To tackle these challenges, we introduce JailbreakHunter, a visual analytics approach for identifying jailbreak prompts in large-scale human-LLM conversational datasets. We have designed a workflow with three analysis levels: group-level, conversation-level, and turn-level. Group-level analysis enables users to grasp the distribution of conversations and identify suspicious conversations using multiple criteria, such as similarity with reported jailbreak prompts in previous research and attack success rates. Conversation-level analysis facilitates the understanding of the progress of conversations and helps discover jailbreak prompts within their conversation contexts. Turn-level analysis allows users to explore the semantic similarity and token overlap between a singleturn prompt and the reported jailbreak prompts, aiding in the identification of new jailbreak strategies. The effectiveness and usability of the system were verified through multiple case studies and expert interviews.

[LG-42] On the Client Preference of LLM Fine-tuning in Federated Learning

链接: https://arxiv.org/abs/2407.03038
作者: Feijie Wu,Xiaoze Liu,Haoyu Wang,Xingchen Wang,Jing Gao
关键词: large language model, pretrained large language, Reinforcement learning, preference datasets, fine-tunes a pretrained
类目: Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: Work in progress

点击查看摘要

Abstract:Reinforcement learning with human feedback (RLHF) fine-tunes a pretrained large language model (LLM) using preference datasets, enabling the LLM to generate outputs that align with human preferences. Given the sensitive nature of these preference datasets held by various clients, there is a need to implement RLHF within a federated learning (FL) framework, where clients are reluctant to share their data due to privacy concerns. To address this, we introduce a feasible framework in which clients collaboratively train a binary selector with their preference datasets using our proposed FedBis. With a well-trained selector, we can further enhance the LLM that generates human-preferred completions. Meanwhile, we propose a novel algorithm, FedBiscuit, that trains multiple selectors by organizing clients into balanced and disjoint clusters based on their preferences. Compared to the FedBis, FedBiscuit demonstrates superior performance in simulating human preferences for pairwise completions. Our extensive experiments on federated human preference datasets – marking the first benchmark to address heterogeneous data partitioning among clients – demonstrate that FedBiscuit outperforms FedBis and even surpasses traditional centralized training.

[LG-43] LoRA-Guard: Parameter-Efficient Guardrail Adaptation for Content Moderation of Large Language Models

链接: https://arxiv.org/abs/2407.02987
作者: Hayder Elesedy,Pedro M. Esperança,Silviu Vlad Oprea,Mete Ozay
关键词: alternative to safety, safety alignment, large language models, Abstract, content moderation
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Guardrails have emerged as an alternative to safety alignment for content moderation of large language models (LLMs). Existing model-based guardrails have not been designed for resource-constrained computational portable devices, such as mobile phones, more and more of which are running LLM-based applications locally. We introduce LoRA-Guard, a parameter-efficient guardrail adaptation method that relies on knowledge sharing between LLMs and guardrail models. LoRA-Guard extracts language features from the LLMs and adapts them for the content moderation task using low-rank adapters, while a dual-path design prevents any performance degradation on the generative task. We show that LoRA-Guard outperforms existing approaches with 100-1000x lower parameter overhead while maintaining accuracy, enabling on-device content moderation.

[LG-44] Semantically Rich Local Dataset Generation for Explainable AI in Genomics

链接: https://arxiv.org/abs/2407.02984
作者: Pedro Barbosa,Rosina Savisaar,Alcides Fonseca
关键词: Black box deep, gene regulatory mechanisms, Black box, box deep learning, regulatory mechanisms
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Genomics (q-bio.GN)
*备注:

点击查看摘要

Abstract:Black box deep learning models trained on genomic sequences excel at predicting the outcomes of different gene regulatory mechanisms. Therefore, interpreting these models may provide novel insights into the underlying biology, supporting downstream biomedical applications. Due to their complexity, interpretable surrogate models can only be built for local explanations (e.g., a single instance). However, accomplishing this requires generating a dataset in the neighborhood of the input, which must maintain syntactic similarity to the original data while introducing semantic variability in the model’s predictions. This task is challenging due to the complex sequence-to-function relationship of DNA. We propose using Genetic Programming to generate datasets by evolving perturbations in sequences that contribute to their semantic diversity. Our custom, domain-guided individual representation effectively constrains syntactic similarity, and we provide two alternative fitness functions that promote diversity with no computational effort. Applied to the RNA splicing domain, our approach quickly achieves good diversity and significantly outperforms a random baseline in exploring the search space, as shown by our proof-of-concept, short RNA sequence. Furthermore, we assess its generalizability and demonstrate scalability to larger sequences, resulting in a ~30% improvement over the baseline. Subjects: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Genomics (q-bio.GN) Cite as: arXiv:2407.02984 [cs.LG] (or arXiv:2407.02984v2 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2407.02984 Focus to learn more arXiv-issued DOI via DataCite

[LG-45] owards a Scalable Reference-Free Evaluation of Generative Models

链接: https://arxiv.org/abs/2407.02961
作者: Azim Ospanov,Jingwei Zhang,Mohammad Jalali,Xuenan Cao,Andrej Bogdanov,Farzan Farnia
关键词: generally difficult due, applicable reference datasets, generative models, large-scale generative models, standard evaluation scores
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:While standard evaluation scores for generative models are mostly reference-based, a reference-dependent assessment of generative models could be generally difficult due to the unavailability of applicable reference datasets. Recently, the reference-free entropy scores, VENDI and RKE, have been proposed to evaluate the diversity of generated data. However, estimating these scores from data leads to significant computational costs for large-scale generative models. In this work, we leverage the random Fourier features framework to reduce the computational price and propose the Fourier-based Kernel Entropy Approximation (FKEA) method. We utilize FKEA’s approximated eigenspectrum of the kernel matrix to efficiently estimate the mentioned entropy scores. Furthermore, we show the application of FKEA’s proxy eigenvectors to reveal the method’s identified modes in evaluating the diversity of produced samples. We provide a stochastic implementation of the FKEA assessment algorithm with a complexity O(n) linearly growing with sample size n . We extensively evaluate FKEA’s numerical performance in application to standard image, text, and video datasets. Our empirical results indicate the method’s scalability and interpretability applied to large-scale generative models. The codebase is available at this https URL.

[LG-46] ObfuscaTune: Obfuscated Offsite Fine-tuning and Inference of Proprietary LLMs on Private Datasets

链接: https://arxiv.org/abs/2407.02960
作者: Ahmed Frikha,Nassim Walha,Ricardo Mendes,Krishna Kanth Nakka,Xue Jiang,Xuebing Zhou
关键词: proprietary LLM owned, data owner entity, model provider entity, proprietary LLM, LLM owned
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: Preprint

点击查看摘要

Abstract:This work addresses the timely yet underexplored problem of performing inference and finetuning of a proprietary LLM owned by a model provider entity on the confidential/private data of another data owner entity, in a way that ensures the confidentiality of both the model and the data. Hereby, the finetuning is conducted offsite, i.e., on the computation infrastructure of a third-party cloud provider. We tackle this problem by proposing ObfuscaTune, a novel, efficient and fully utility-preserving approach that combines a simple yet effective obfuscation technique with an efficient usage of confidential computing (only 5% of the model parameters are placed on TEE). We empirically demonstrate the effectiveness of ObfuscaTune by validating it on GPT-2 models with different sizes on four NLP benchmark datasets. Finally, we compare to a naïve version of our approach to highlight the necessity of using random matrices with low condition numbers in our approach to reduce errors induced by the obfuscation.

[LG-47] IncogniText: Privacy-enhancing Conditional Text Anonymization via LLM-based Private Attribute Randomization

链接: https://arxiv.org/abs/2407.02956
作者: Ahmed Frikha,Nassim Walha,Krishna Kanth Nakka,Ricardo Mendes,Xue Jiang,Xuebing Zhou
关键词: correctly inferring private, inferring private attributes, meaning and semantics, address the problem, prevent adversaries
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: Preprint

点击查看摘要

Abstract:In this work, we address the problem of text anonymization where the goal is to prevent adversaries from correctly inferring private attributes of the author, while keeping the text utility, i.e., meaning and semantics. We propose IncogniText, a technique that anonymizes the text to mislead a potential adversary into predicting a wrong private attribute value. Our empirical evaluation shows a reduction of private attribute leakage by more than 90%. Finally, we demonstrate the maturity of IncogniText for real-world applications by distilling its anonymization capability into a set of LoRA parameters associated with an on-device model.

[LG-48] PII-Compass: Guiding LLM training data extraction prompts towards the target PII via grounding

链接: https://arxiv.org/abs/2407.02943
作者: Krishna Kanth Nakka,Ahmed Frikha,Ricardo Mendes,Xue Jiang,Xuebing Zhou
关键词: large models stem, increased size, impactful advances, advances in large, large models
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: Accepted at ACL 2024

点击查看摘要

Abstract:The latest and most impactful advances in large models stem from their increased size. Unfortunately, this translates into an improved memorization capacity, raising data privacy concerns. Specifically, it has been shown that models can output personal identifiable information (PII) contained in their training data. However, reported PIII extraction performance varies widely, and there is no consensus on the optimal methodology to evaluate this risk, resulting in underestimating realistic adversaries. In this work, we empirically demonstrate that it is possible to improve the extractability of PII by over ten-fold by grounding the prefix of the manually constructed extraction prompt with in-domain data. Our approach, PII-Compass, achieves phone number extraction rates of 0.92%, 3.9%, and 6.86% with 1, 128, and 2308 queries, respectively, i.e., the phone number of 1 person in 15 is extractable.

[LG-49] he More the Merrier? Navigating Accuracy vs. Energy Efficiency Design Trade-Offs in Ensemble Learning Systems

链接: https://arxiv.org/abs/2407.02914
作者: Rafiullah Omar,Justus Bogner,Henry Muccini,Patricia Lago,Silverio Martínez-Fernández,Xavier Franch
关键词: effective ML-enabled systems, Machine learning, energy consumption, energy, ML-enabled systems
类目: Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注: Currently under review at a journal

点击查看摘要

Abstract:Background: Machine learning (ML) model composition is a popular technique to mitigate shortcomings of a single ML model and to design more effective ML-enabled systems. While ensemble learning, i.e., forwarding the same request to several models and fusing their predictions, has been studied extensively for accuracy, we have insufficient knowledge about how to design energy-efficient ensembles. Objective: We therefore analyzed three types of design decisions for ensemble learning regarding a potential trade-off between accuracy and energy consumption: a) ensemble size, i.e., the number of models in the ensemble, b) fusion methods (majority voting vs. a meta-model), and c) partitioning methods (whole-dataset vs. subset-based training). Methods: By combining four popular ML algorithms for classification in different ensembles, we conducted a full factorial experiment with 11 ensembles x 4 datasets x 2 fusion methods x 2 partitioning methods (176 combinations). For each combination, we measured accuracy (F1-score) and energy consumption in J (for both training and inference). Results: While a larger ensemble size significantly increased energy consumption (size 2 ensembles consumed 37.49% less energy than size 3 ensembles, which in turn consumed 26.96% less energy than the size 4 ensembles), it did not significantly increase accuracy. Furthermore, majority voting outperformed meta-model fusion both in terms of accuracy (Cohen’s d of 0.38) and energy consumption (Cohen’s d of 0.92). Lastly, subset-based training led to significantly lower energy consumption (Cohen’s d of 0.91), while training on the whole dataset did not increase accuracy significantly. Conclusions: From a Green AI perspective, we recommend designing ensembles of small size (2 or maximum 3 models), using subset-based training, majority voting, and energy-efficient ML algorithms like decision trees, Naive Bayes, or KNN.

[LG-50] SFC: Achieve Accurate Fast Convolution under Low-precision Arithmetic

链接: https://arxiv.org/abs/2407.02913
作者: Liulu He,Yufei Zhao,Rui Gao,Yuan Du,Li Du
关键词: accelerate convolution operations, efficiently accelerate convolution, Discrete Fourier Transform, efficiently accelerate, operations in deep
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV); Signal Processing (eess.SP); Numerical Analysis (math.NA)
*备注: ICML 2024

点击查看摘要

Abstract:Fast convolution algorithms, including Winograd and FFT, can efficiently accelerate convolution operations in deep models. However, these algorithms depend on high-precision arithmetic to maintain inference accuracy, which conflicts with the model quantization. To resolve this conflict and further improve the efficiency of quantized convolution, we proposes SFC, a new algebra transform for fast convolution by extending the Discrete Fourier Transform (DFT) with symbolic computing, in which only additions are required to perform the transformation at specific transform points, avoiding the calculation of irrational number and reducing the requirement for precision. Additionally, we enhance convolution efficiency by introducing correction terms to convert invalid circular convolution outputs of the Fourier method into effective ones. The numerical error analysis is presented for the first time in this type of work and proves that our algorithms can provide a 3.68x multiplication reduction for 3x3 convolution, while the Winograd algorithm only achieves a 2.25x reduction with similarly low numerical errors. Experiments carried out on benchmarks and FPGA show that our new algorithms can further improve the computation efficiency of quantized models while maintaining accuracy, surpassing both the quantization-alone method and existing works on fast convolution quantization.

[LG-51] he Shortcomings of Force-from-Motion in Robot Learning

链接: https://arxiv.org/abs/2407.02904
作者: Elie Aljalbout,Felix Frank,Patrick van der Smagt,Alexandros Paraschos
关键词: Robotic manipulation requires, manipulation requires accurate, requires accurate motion, Robotic manipulation, physical interaction control
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Robotic manipulation requires accurate motion and physical interaction control. However, current robot learning approaches focus on motion-centric action spaces that do not explicitly give the policy control over the interaction. In this paper, we discuss the repercussions of this choice and argue for more interaction-explicit action spaces in robot learning.

[LG-52] GPTQT: Quantize Large Language Models Twice to Push the Efficiency

链接: https://arxiv.org/abs/2407.02891
作者: Yipin Guo,Yilin Lang,Qinyuan Ren
关键词: generative Large Language, Large Language Models, Large Language, require significant computing, large size
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: Accepted by 11th IEEE International Conference on Cybernetics and Intelligent Systems

点击查看摘要

Abstract:Due to their large size, generative Large Language Models (LLMs) require significant computing and storage resources. This paper introduces a new post-training quantization method, GPTQT, to reduce memory usage and enhance processing speed by expressing the weight of LLM in 3bit/2bit. Practice has shown that minimizing the quantization error of weights is ineffective, leading to overfitting. Therefore, GPTQT employs a progressive two-step approach: initially quantizing weights using Linear quantization to a relatively high bit, followed by converting obtained int weight to lower bit binary coding. A re-explore strategy is proposed to optimize initial scaling factor. During inference, these steps are merged into pure binary coding, enabling efficient computation. Testing across various models and datasets confirms GPTQT’s effectiveness. Compared to the strong 3-bit quantization baseline, GPTQT further reduces perplexity by 4.01 on opt-66B and increases speed by 1.24 times on opt-30b. The results on Llama2 show that GPTQT is currently the best binary coding quantization method for such kind of LLMs.

[LG-53] Joint Optimization of Resource Allocation and Data Selection for Fast and Cost-Efficient Federated Edge Learning

链接: https://arxiv.org/abs/2407.02888
作者: Yunjian Jia,Zhen Huang,Jiping Yan,Yulu Zhang,Kun Luo,Wanli Wen
关键词: Deploying federated learning, federated edge learning, edge introduces federated, introduces federated edge, wireless edge introduces
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Deploying federated learning at the wireless edge introduces federated edge learning (FEEL). Given FEEL’s limited communication resources and potential mislabeled data on devices, improper resource allocation or data selection can hurt convergence speed and increase training costs. Thus, to realize an efficient FEEL system, this paper emphasizes jointly optimizing resource allocation and data selection. Specifically, in this work, through rigorously modeling the training process and deriving an upper bound on FEEL’s one-round convergence rate, we establish a problem of joint resource allocation and data selection, which, unfortunately, cannot be solved directly. Toward this end, we equivalently transform the original problem into a solvable form via a variable substitution and then break it into two subproblems, that is, the resource allocation problem and the data selection problem. The two subproblems are mixed-integer non-convex and integer non-convex problems, respectively, and achieving their optimal solutions is a challenging task. Based on the matching theory and applying the convex-concave procedure and gradient projection methods, we devise a low-complexity suboptimal algorithm for the two subproblems, respectively. Finally, the superiority of our proposed scheme of joint resource allocation and data selection is validated by numerical results.

[LG-54] ShiftAddAug: Augment Multiplication-Free Tiny Neural Network with Hybrid Computation

链接: https://arxiv.org/abs/2407.02881
作者: Yipin Guo,Zihao Li,Yilin Lang,Qinyuan Ren
关键词: Shift and Add, compatibility with hardware, gained prominence, Operators devoid, Add
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by 2024 CVPR Workshop : Efficient Deep Learning for Computer Vision

点击查看摘要

Abstract:Operators devoid of multiplication, such as Shift and Add, have gained prominence for their compatibility with hardware. However, neural networks (NNs) employing these operators typically exhibit lower accuracy compared to conventional NNs with identical structures. ShiftAddAug uses costly multiplication to augment efficient but less powerful multiplication-free operators, improving performance without any inference overhead. It puts a ShiftAdd tiny NN into a large multiplicative model and encourages it to be trained as a sub-model to obtain additional supervision. In order to solve the weight discrepancy problem between hybrid operators, a new weight sharing method is proposed. Additionally, a novel two stage neural architecture search is used to obtain better augmentation effects for smaller but stronger multiplication-free tiny neural networks. The superiority of ShiftAddAug is validated through experiments in image classification and semantic segmentation, consistently delivering noteworthy enhancements. Remarkably, it secures up to a 4.95% increase in accuracy on the CIFAR100 compared to its directly trained counterparts, even surpassing the performance of multiplicative NNs.

[LG-55] Knowledge Composition using Task Vectors with Learned Anisotropic Scaling

链接: https://arxiv.org/abs/2407.02880
作者: Frederic Z. Zhang,Paul Albert,Cristian Rodriguez-Opazo,Anton van den Hengel,Ehsan Abbasnejad
关键词: produce strong generic, models produce strong, Pre-trained models produce, task vectors, strong generic representations
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Pre-trained models produce strong generic representations that can be adapted via fine-tuning. The learned weight difference relative to the pre-trained model, known as a task vector, characterises the direction and stride of fine-tuning. The significance of task vectors is such that simple arithmetic operations on them can be used to combine diverse representations from different domains. This paper builds on these properties of task vectors and aims to answer (1) whether components of task vectors, particularly parameter blocks, exhibit similar characteristics, and (2) how such blocks can be used to enhance knowledge composition and transfer. To this end, we introduce aTLAS, an algorithm that linearly combines parameter blocks with different learned coefficients, resulting in anisotropic scaling at the task vector level. We show that such linear combinations explicitly exploit the low intrinsic dimensionality of pre-trained models, with only a few coefficients being the learnable parameters. Furthermore, composition of parameter blocks leverages the already learned representations, thereby reducing the dependency on large amounts of data. We demonstrate the effectiveness of our method in task arithmetic, few-shot recognition and test-time adaptation, with supervised or unsupervised objectives. In particular, we show that (1) learned anisotropic scaling allows task vectors to be more disentangled, causing less interference in composition; (2) task vector composition excels with scarce or no labeled data and is less prone to domain shift, thus leading to better generalisability; (3) mixing the most informative parameter blocks across different task vectors prior to training can reduce the memory footprint and improve the flexibility of knowledge transfer. Moreover, we show the potential of aTLAS as a PEFT method, particularly with less data, and demonstrate that its scalibility.

[LG-56] Membership Inference Attacks Against Time-Series Models

链接: https://arxiv.org/abs/2407.02870
作者: Noam Koren,Abigail Goldsteen,Ariel Farkash,Guy Amit
关键词: Analyzing time-series data, Analyzing time-series, personal information, Analyzing, privacy concerns
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 16 pages

点击查看摘要

Abstract:Analyzing time-series data that may contain personal information, particularly in the medical field, presents serious privacy concerns. Sensitive health data from patients is often used to train machine-learning models for diagnostics and ongoing care. Assessing the privacy risk of such models is crucial to making knowledgeable decisions on whether to use a model in production, share it with third parties, or deploy it in patients homes. Membership Inference Attacks (MIA) are a key method for this kind of evaluation, however time-series prediction models have not been thoroughly studied in this context. We explore existing MIA techniques on time-series models, and introduce new features, focusing on the seasonality and trend components of the data. Seasonality is estimated using a multivariate Fourier transform, and a low-degree polynomial is used to approximate trends. We applied these techniques to various types of time-series models, using datasets from the health domain. Our results demonstrate that these new features enhance the effectiveness of MIAs in identifying membership, improving the understanding of privacy risks in medical data applications.

[LG-57] A Self-Supervised Task for Fault Detection in Satellite Multivariate Time Series

链接: https://arxiv.org/abs/2407.02861
作者: Carlo Cena,Silvia Bucci,Alessandro Balossino,Marcello Chiaberge
关键词: safeguarding valuable assets, ensuring mission success, Physics-Informed Real NVP, Real NVP neural, due to environmental
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注: SPAICE: AI in and for Space, 2024

点击查看摘要

Abstract:In the space sector, due to environmental conditions and restricted accessibility, robust fault detection methods are imperative for ensuring mission success and safeguarding valuable assets. This work proposes a novel approach leveraging Physics-Informed Real NVP neural networks, renowned for their ability to model complex and high-dimensional distributions, augmented with a self-supervised task based on sensors’ data permutation. It focuses on enhancing fault detection within the satellite multivariate time series. The experiments involve various configurations, including pre-training with self-supervision, multi-task learning, and standalone self-supervised training. Results indicate significant performance improvements across all settings. In particular, employing only the self-supervised loss yields the best overall results, suggesting its efficacy in guiding the network to extract relevant features for fault detection. This study presents a promising direction for improving fault detection in space systems and warrants further exploration in other datasets and applications.

[LG-58] Early-Stage Anomaly Detection: A Study of Model Performance on Complete vs. Partial Flows

链接: https://arxiv.org/abs/2407.02856
作者: Adrian Pekar,Richard Jozsa
关键词: specifically Random Forest, Random Forest, specifically Random, machine learning models, complete flow records
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: 9 pages, 5 tables, 2 figures

点击查看摘要

Abstract:This study investigates the efficacy of machine learning models, specifically Random Forest, in anomaly detection systems when trained on complete flow records and tested on partial flow data. We explore the performance disparity that arises when models are applied to incomplete data typical in real-world, real-time network environments. Our findings demonstrate a significant decline in model performance, with precision and recall dropping by up to 30% under certain conditions when models trained on complete flows are tested against partial flows. Conversely, models trained and tested on consistently complete or partial datasets maintain robustness, highlighting the importance of dataset consistency in training. The study reveals that a minimum of 7 packets in the test set is required for maintaining reliable detection rates. These results underscore the need for tailored training strategies that can effectively adapt to the dynamics of partial data, enhancing the practical applicability of anomaly detection systems in operational settings.

[LG-59] Safe Unlearning: A Surprisingly Effective and Generalizable Solution to Defend Against Jailbreak Attacks

链接: https://arxiv.org/abs/2407.02855
作者: Zhexin Zhang,Junxiao Yang,Pei Ke,Shiyao Cui,Chujie Zheng,Hongning Wang,Minlie Huang
关键词: jailbreak attacks, Attack Success Rate, jailbreak, harmful, harmful questions
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: 15 pages

点击查看摘要

Abstract:LLMs are known to be vulnerable to jailbreak attacks, even after safety alignment. An important observation is that, while different types of jailbreak attacks can generate significantly different queries, they mostly result in similar responses that are rooted in the same harmful knowledge (e.g., detailed steps to make a bomb). Therefore, we conjecture that directly unlearn the harmful knowledge in the LLM can be a more effective way to defend against jailbreak attacks than the mainstream supervised fine-tuning (SFT) based approaches. Our extensive experiments confirmed our insight and suggested surprising generalizability of our unlearning-based approach: using only 20 raw harmful questions \emphwithout any jailbreak prompt during training, our solution reduced the Attack Success Rate (ASR) in Vicuna-7B on \emphout-of-distribution (OOD) harmful questions wrapped with various complex jailbreak prompts from 82.6% to 7.7%. This significantly outperforms Llama2-7B-Chat, which is fine-tuned on about 0.1M safety alignment samples but still has an ASR of 21.9% even under the help of an additional safety system prompt. Further analysis reveals that the generalization ability of our solution stems from the intrinsic relatedness among harmful responses across harmful questions (e.g., response patterns, shared steps and actions, and similarity among their learned representations in the LLM). Our code is available at \urlthis https URL.

[LG-60] LANE: Logic Alignment of Non-tuning Large Language Models and Online Recommendation Systems for Explainable Reason Generation

链接: https://arxiv.org/abs/2407.02833
作者: Hongke Zhao,Songming Zheng,Likang Wu,Bowen Yu,Jing Wang
关键词: enhancing user trust, trust and satisfaction, crucial for enhancing, recommendation, LLM models
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The explainability of recommendation systems is crucial for enhancing user trust and satisfaction. Leveraging large language models (LLMs) offers new opportunities for comprehensive recommendation logic generation. However, in existing related studies, fine-tuning LLM models for recommendation tasks incurs high computational costs and alignment issues with existing systems, limiting the application potential of proven proprietary/closed-source LLM models, such as GPT-4. In this work, our proposed effective strategy LANE aligns LLMs with online recommendation systems without additional LLMs tuning, reducing costs and improving explainability. This innovative approach addresses key challenges in integrating language models with recommendation systems while fully utilizing the capabilities of powerful proprietary models. Specifically, our strategy operates through several key components: semantic embedding, user multi-preference extraction using zero-shot prompting, semantic alignment, and explainable recommendation generation using Chain of Thought (CoT) prompting. By embedding item titles instead of IDs and utilizing multi-head attention mechanisms, our approach aligns the semantic features of user preferences with those of candidate items, ensuring coherent and user-aligned recommendations. Sufficient experimental results including performance comparison, questionnaire voting, and visualization cases prove that our method can not only ensure recommendation performance, but also provide easy-to-understand and reasonable recommendation logic.

[LG-61] Convergence of Implicit Gradient Descent for Training Two-Layer Physics-Informed Neural Networks

链接: https://arxiv.org/abs/2407.02827
作者: Xianliang Xu,Zhongyi Huang,Ye Li
关键词: physics-informed neural networks, training physics-informed neural, Optimization algorithms, implicit gradient descent, gradient descent algorithm
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Optimization algorithms is crucial in training physics-informed neural networks (PINNs), unsuitable methods may lead to poor solutions. Compared to the common gradient descent algorithm, implicit gradient descent (IGD) outperforms it in handling some multi-scale problems. In this paper, we provide convergence analysis for the implicit gradient descent for training over-parametrized two-layer PINNs. We first demonstrate the positive definiteness of Gram matrices for general smooth activation functions, like sigmoidal function, softplus function, tanh function and so on. Then the over-parameterization allows us to show that the randomly initialized IGD converges a globally optimal solution at a linear convergence rate. Moreover, due to the different training dynamics, the learning rate of IGD can be chosen independent of the sample size and the least eigenvalue of the Gram matrix.

[LG-62] Representation learning with CGAN for casual inference

链接: https://arxiv.org/abs/2407.02825
作者: Zhaotian Weng,Jianbo Hong,Lan Wang
关键词: Generative Adversarial Nets, Conditional Generative Adversarial, improve conditional image, conditional image generation, image generation performance
类目: Machine Learning (cs.LG)
*备注: Proceedings of the 3rd International Conference on Signal Processing and Machine Learning

点击查看摘要

Abstract:Conditional Generative Adversarial Nets (CGAN) is often used to improve conditional image generation performance. However, there is little research on Representation learning with CGAN for causal inference. This paper proposes a new method for finding representation learning functions by adopting the adversarial idea. We apply the pattern of CGAN and theoretically emonstrate the feasibility of finding a suitable representation function in the context of two distributions being balanced. The theoretical result shows that when two distributions are balanced, the ideal representation function can be found and thus can be used to further research.

[LG-63] Effect of a Process Mining based Pre-processing Step in Prediction of the Critical Health Outcomes

链接: https://arxiv.org/abs/2407.02821
作者: Negin Ashrafi,Armin Abdollahi,Greg Placencia,Maryam Pishgar
关键词: Predicting critical health, Predicting critical, improving survivability, patient mortality, readmission is essential
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Predicting critical health outcomes such as patient mortality and hospital readmission is essential for improving survivability. However, healthcare datasets have many concurrences that create complexities, leading to poor predictions. Consequently, pre-processing the data is crucial to improve its quality. In this study, we use an existing pre-processing algorithm, concatenation, to improve data quality by decreasing the complexity of datasets. Sixteen healthcare datasets were extracted from two databases - MIMIC III and University of Illinois Hospital - converted to the event logs, they were then fed into the concatenation algorithm. The pre-processed event logs were then fed to the Split Miner (SM) algorithm to produce a process model. Process model quality was evaluated before and after concatenation using the following metrics: fitness, precision, F-Measure, and complexity. The pre-processed event logs were also used as inputs to the Decay Replay Mining (DREAM) algorithm to predict critical outcomes. We compared predicted results before and after applying the concatenation algorithm using Area Under the Curve (AUC) and Confidence Intervals (CI). Results indicated that the concatenation algorithm improved the quality of the process models and predictions of the critical health outcomes.

[LG-64] Efficient Training of Language Models with Compact and Consistent Next Token Distributions

链接: https://arxiv.org/abs/2407.02819
作者: Ashutosh Sathe,Sunita Sarawagi
关键词: statistically sound objective, Maximizing the likelihood, statistically sound, sound objective, gram
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: ACL 2024

点击查看摘要

Abstract:Maximizing the likelihood of the next token is an established, statistically sound objective for pre-training language models. In this paper we show that we can train better models faster by pre-aggregating the corpus with a collapsed n -gram distribution. Previous studies have proposed corpus-level n -gram statistics as a regularizer; however, the construction and querying of such n -grams, if done naively, prove to be costly and significantly impede training speed, thereby limiting their application in modern large language model pre-training. We introduce an alternative compact representation of the next token distribution that, in expectation, aligns with the complete n -gram distribution while markedly reducing variance across mini-batches compared to the standard next-token loss. Empirically, we demonstrate that both the n -gram regularized model and our approximation yield substantial improvements in model quality and convergence rate compared to existing methods. Furthermore, our approximation facilitates scalability of gains to larger datasets and models compared to the straightforward n -gram regularization method. Comments: ACL 2024 Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG) Cite as: arXiv:2407.02819 [cs.CL] (or arXiv:2407.02819v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2407.02819 Focus to learn more arXiv-issued DOI via DataCite

[LG-65] Data Overfitting for On-Device Super-Resolution with Dynamic Algorithm and Compiler Co-Design

链接: https://arxiv.org/abs/2407.02813
作者: Gen Li,Zhihao Shu,Jie Ji,Minghai Qin,Fatemeh Afghah,Wei Niu,Xiaolong Ma
关键词: computer vision applications, vision applications, frequently employed, variety of computer, computer vision
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: ECCV2024

点击查看摘要

Abstract:Deep neural networks (DNNs) are frequently employed in a variety of computer vision applications. Nowadays, an emerging trend in the current video distribution system is to take advantage of DNN’s overfitting properties to perform video resolution upscaling. By splitting videos into chunks and applying a super-resolution (SR) model to overfit each chunk, this scheme of SR models plus video chunks is able to replace traditional video transmission to enhance video quality and transmission efficiency. However, many models and chunks are needed to guarantee high performance, which leads to tremendous overhead on model switching and memory footprints at the user end. To resolve such problems, we propose a Dynamic Deep neural network assisted by a Content-Aware data processing pipeline to reduce the model number down to one (Dy-DCA), which helps promote performance while conserving computational resources. Additionally, to achieve real acceleration on the user end, we designed a framework that optimizes dynamic features (e.g., dynamic shapes, sizes, and control flow) in Dy-DCA to enable a series of compilation optimizations, including fused code generation, static execution planning, etc. By employing such techniques, our method achieves better PSNR and real-time performance (33 FPS) on an off-the-shelf mobile phone. Meanwhile, assisted by our compilation optimization, we achieve a 1.7 \times speedup while saving up to 1.61 \times memory consumption. Code available in this https URL.

[LG-66] SPLITZ: Certifiable Robustness via Split Lipschitz Randomized Smoothing

链接: https://arxiv.org/abs/2407.02811
作者: Meiyu Zhong,Ravi Tandon
关键词: SPLITZ, textit, Certifiable robustness, change the prediction, small Lipschitz constants
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注:

点击查看摘要

Abstract:Certifiable robustness gives the guarantee that small perturbations around an input to a classifier will not change the prediction. There are two approaches to provide certifiable robustness to adversarial examples: a) explicitly training classifiers with small Lipschitz constants, and b) Randomized smoothing, which adds random noise to the input to create a smooth classifier. We propose \textitSPLITZ, a practical and novel approach which leverages the synergistic benefits of both the above ideas into a single framework. Our main idea is to \textitsplit a classifier into two halves, constrain the Lipschitz constant of the first half, and smooth the second half via randomization. Motivation for \textitSPLITZ comes from the observation that many standard deep networks exhibit heterogeneity in Lipschitz constants across layers. \textitSPLITZ can exploit this heterogeneity while inheriting the scalability of randomized smoothing. We present a principled approach to train \textitSPLITZ and provide theoretical analysis to derive certified robustness guarantees during inference. We present a comprehensive comparison of robustness-accuracy tradeoffs and show that \textitSPLITZ consistently improves upon existing state-of-the-art approaches on MNIST and CIFAR-10 datasets. For instance, with \ell_2 norm perturbation budget of \textbf \epsilon=1 , \textitSPLITZ achieves \textbf43.2% top-1 test accuracy on CIFAR-10 dataset compared to state-of-art top-1 test accuracy \textbf39.8%

[LG-67] Croppable Knowledge Graph Embedding

链接: https://arxiv.org/abs/2407.02779
作者: Yushan Zhu,Wen Zhang,Zhiqiang Liu,Mingyang Chen,Lei Liang,Huajun Chen
关键词: Knowledge Graph Embedding, artificial intelligence tasks, Graph Embedding, Knowledge Graph, intelligence tasks
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Knowledge Graph Embedding (KGE) is a common method for Knowledge Graphs (KGs) to serve various artificial intelligence tasks. The suitable dimensions of the embeddings depend on the storage and computing conditions of the specific application scenarios. Once a new dimension is required, a new KGE model needs to be trained from scratch, which greatly increases the training cost and limits the efficiency and flexibility of KGE in serving various scenarios. In this work, we propose a novel KGE training framework MED, through which we could train once to get a croppable KGE model applicable to multiple scenarios with different dimensional requirements, sub-models of the required dimensions can be cropped out of it and used directly without any additional training. In MED, we propose a mutual learning mechanism to improve the low-dimensional sub-models performance and make the high-dimensional sub-models retain the capacity that low-dimensional sub-models have, an evolutionary improvement mechanism to promote the high-dimensional sub-models to master the knowledge that the low-dimensional sub-models can not learn, and a dynamic loss weight to balance the multiple losses adaptively. Experiments on 3 KGE models over 4 standard KG completion datasets, 3 real application scenarios over a real-world large-scale KG, and the experiments of extending MED to the language model BERT show the effectiveness, high efficiency, and flexible extensibility of MED.

[LG-68] Foster Adaptivity and Balance in Learning with Noisy Labels

链接: https://arxiv.org/abs/2407.02778
作者: Mengmeng Sheng,Zeren Sun,Tao Chen,Shuchao Pang,Yucheng Wang,Yazhou Yao
关键词: deep neural networks, supervised models due, Label noise, posing a practical, neural networks
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: accepted by the European Conference on Computer Vision (ECCV), 2024

点击查看摘要

Abstract:Label noise is ubiquitous in real-world scenarios, posing a practical challenge to supervised models due to its effect in hurting the generalization performance of deep neural networks. Existing methods primarily employ the sample selection paradigm and usually rely on dataset-dependent prior knowledge (\eg, a pre-defined threshold) to cope with label noise, inevitably degrading the adaptivity. Moreover, existing methods tend to neglect the class balance in selecting samples, leading to biased model performance. To this end, we propose a simple yet effective approach named \textbfSED to deal with label noise in a \textbfSelf-adaptiv\textbfE and class-balance\textbfD manner. Specifically, we first design a novel sample selection strategy to empower self-adaptivity and class balance when identifying clean and noisy data. A mean-teacher model is then employed to correct labels of noisy samples. Subsequently, we propose a self-adaptive and class-balanced sample re-weighting mechanism to assign different weights to detected noisy samples. Finally, we additionally employ consistency regularization on selected clean samples to improve model generalization performance. Extensive experimental results on synthetic and real-world datasets demonstrate the effectiveness and superiority of our proposed method. The source code has been made available at this https URL.

[LG-69] MLKD-BERT: Multi-level Knowledge Distillation for Pre-trained Language Models

链接: https://arxiv.org/abs/2407.02775
作者: Ying Zhang,Ziheng Yang,Shufan Ji
关键词: language model compression, pre-trained language model, Knowledge distillation, knowledge distillation methods, effective technique
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Knowledge distillation is an effective technique for pre-trained language model compression. Although existing knowledge distillation methods perform well for the most typical model BERT, they could be further improved in two aspects: the relation-level knowledge could be further explored to improve model performance; and the setting of student attention head number could be more flexible to decrease inference time. Therefore, we are motivated to propose a novel knowledge distillation method MLKD-BERT to distill multi-level knowledge in teacher-student framework. Extensive experiments on GLUE benchmark and extractive question answering tasks demonstrate that our method outperforms state-of-the-art knowledge distillation methods on BERT. In addition, MLKD-BERT can flexibly set student attention head number, allowing for substantial inference time decrease with little performance drop.

[LG-70] Automatic gradient descent with generalized Newtons method

链接: https://arxiv.org/abs/2407.02772
作者: Zhiqi Bu,Shiyun Xu
关键词: generalized Newton method, SGD and Adam, generalized Newton, Hessian-informed approach, Newton method
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:We propose the generalized Newton’s method (GeN) – a Hessian-informed approach that applies to any optimizer such as SGD and Adam, and covers the Newton-Raphson method as a sub-case. Our method automatically and dynamically selects the learning rate that accelerates the convergence, without the intensive tuning of the learning rate scheduler. In practice, out method is easily implementable, since it only requires additional forward passes with almost zero computational overhead (in terms of training time and memory cost), if the overhead is amortized over many iterations. We present extensive experiments on language and vision tasks (e.g. GPT and ResNet) to showcase that GeN optimizers match the state-of-the-art performance, which was achieved with carefully tuned learning rate schedulers. Code to be released at \urlthis https URL.

[LG-71] Large language models physics-based modeling experimental measurements: the trinity of data-scarce learning of polymer properties

链接: https://arxiv.org/abs/2407.02770
作者: Ning Liu,Siavash Jafarzadeh,Brian Y. Lattimer,Shuna Ni,Jim Lua,Yue Yu
关键词: Large language models, Large language, material modeling paradigm, bear promise, paradigm for evaluation
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) bear promise as a fast and accurate material modeling paradigm for evaluation, analysis, and design. Their vast number of trainable parameters necessitates a wealth of data to achieve accuracy and mitigate overfitting. However, experimental measurements are often limited and costly to obtain in sufficient quantities for finetuning. To this end, we present a physics-based training pipeline that tackles the pathology of data scarcity. The core enabler is a physics-based modeling framework that generates a multitude of synthetic data to align the LLM to a physically consistent initial state before finetuning. Our framework features a two-phase training strategy: (1) utilizing the large-in-amount while less accurate synthetic data for supervised pretraining, and (2) finetuning the phase-1 model with limited experimental data. We empirically demonstrate that supervised pretraining is vital to obtaining accurate finetuned LLMs, via the lens of learning polymer flammability metrics where cone calorimeter data is sparse.

[LG-72] SF-GNN: Self Filter for Message Lossless Propagation in Deep Graph Neural Network

链接: https://arxiv.org/abs/2407.02762
作者: Yushan Zhu,Wen Zhang,Yajing Xu,Zhen Yao,Mingyang Chen,Huajun Chen
关键词: Graph Neural Network, Neural Network, encoding graph structure, graph structure information, Graph Neural
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Graph Neural Network (GNN), with the main idea of encoding graph structure information of graphs by propagation and aggregation, has developed rapidly. It achieved excellent performance in representation learning of multiple types of graphs such as homogeneous graphs, heterogeneous graphs, and more complex graphs like knowledge graphs. However, merely stacking GNN layers may not improve the model’s performance and can even be detrimental. For the phenomenon of performance degradation in deep GNNs, we propose a new perspective. Unlike the popular explanations of over-smoothing or over-squashing, we think the issue arises from the interference of low-quality node representations during message propagation. We introduce a simple and general method, SF-GNN, to address this problem. In SF-GNN, we define two representations for each node, one is the node representation that represents the feature of the node itself, and the other is the message representation specifically for propagating messages to neighbor nodes. A self-filter module evaluates the quality of the node representation and decides whether to integrate it into the message propagation based on this quality assessment. Experiments on node classification tasks for both homogeneous and heterogeneous graphs, as well as link prediction tasks on knowledge graphs, demonstrate that our method can be applied to various GNN models and outperforms state-of-the-art baseline methods in addressing deep GNN degradation.

[LG-73] Multi-Scenario Combination Based on Multi-Agent Reinforcement Learning to Optimize the Advertising Recommendation System

链接: https://arxiv.org/abs/2407.02759
作者: Yang Zhao,Chang Zhou,Jin Cao,Yi Zhao,Shaobo Liu,Chiyu Cheng,Xingchen Li
关键词: paper explores multi-scenario, explores multi-scenario optimization, multi-agent reinforcement learning, reinforcement learning, Deterministic Policy Gradient
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted by 2024 5th International Conference on Artificial Intelligence and Electromechanical Automation IEEE (ISBN: 979-8-3503-6617-4)

点击查看摘要

Abstract:This paper explores multi-scenario optimization on large platforms using multi-agent reinforcement learning (MARL). We address this by treating scenarios like search, recommendation, and advertising as a cooperative, partially observable multi-agent decision problem. We introduce the Multi-Agent Recurrent Deterministic Policy Gradient (MARDPG) algorithm, which aligns different scenarios under a shared objective and allows for strategy communication to boost overall performance. Our results show marked improvements in metrics such as click-through rate (CTR), conversion rate, and total sales, confirming our method’s efficacy in practical settings.

[LG-74] Differential Encoding for Improved Representation Learning over Graphs

链接: https://arxiv.org/abs/2407.02758
作者: Haimin Zhang,Jiahao Xia,Min Xu
关键词: global attention mechanism, Combining the message-passing, node, global attention, message-passing paradigm
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Social and Information Networks (cs.SI)
*备注:

点击查看摘要

Abstract:Combining the message-passing paradigm with the global attention mechanism has emerged as an effective framework for learning over graphs. The message-passing paradigm and the global attention mechanism fundamentally generate node embeddings based on information aggregated from a node’s local neighborhood or from the whole graph. The most basic and commonly used aggregation approach is to take the sum of information from a node’s local neighbourhood or from the whole graph. However, it is unknown if the dominant information is from a node itself or from the node’s neighbours (or the rest of the graph nodes). Therefore, there exists information lost at each layer of embedding generation, and this information lost could be accumulated and become more serious when more layers are used in the model. In this paper, we present a differential encoding method to address the issue of information lost. The idea of our method is to encode the differential representation between the information from a node’s neighbours (or the rest of the graph nodes) and that from the node itself. The obtained differential encoding is then combined with the original aggregated local or global representation to generate the updated node embedding. By integrating differential encodings, the representational ability of generated node embeddings is improved. The differential encoding method is empirically evaluated on different graph tasks on seven benchmark datasets. The results show that it is a general method that improves the message-passing update and the global attention update, advancing the state-of-the-art performance for graph representation learning on these datasets.

[LG-75] Curvature Clues: Decoding Deep Learning Privacy with Input Loss Curvature

链接: https://arxiv.org/abs/2407.02747
作者: Deepak Ravikumar,Efstathia Soufleri,Kaushik Roy
关键词: input loss curvature, loss curvature, input loss, loss, termed input loss
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:In this paper, we explore the properties of loss curvature with respect to input data in deep neural networks. Curvature of loss with respect to input (termed input loss curvature) is the trace of the Hessian of the loss with respect to the input. We investigate how input loss curvature varies between train and test sets, and its implications for train-test distinguishability. We develop a theoretical framework that derives an upper bound on the train-test distinguishability based on privacy and the size of the training set. This novel insight fuels the development of a new black box membership inference attack utilizing input loss curvature. We validate our theoretical findings through experiments in computer vision classification tasks, demonstrating that input loss curvature surpasses existing methods in membership inference effectiveness. Our analysis highlights how the performance of membership inference attack (MIA) methods varies with the size of the training set, showing that curvature-based MIA outperforms other methods on sufficiently large datasets. This condition is often met by real datasets, as demonstrated by our results on CIFAR10, CIFAR100, and ImageNet. These findings not only advance our understanding of deep neural network behavior but also improve the ability to test privacy-preserving techniques in machine learning.

[LG-76] Model and Feature Diversity for Bayesian Neural Networks in Mutual Learning

链接: https://arxiv.org/abs/2407.02721
作者: Cuong Pham,Cuong C. Nguyen,Trung Le,Dinh Phung,Gustavo Carneiro,Thanh-Toan Do
关键词: Bayesian Neural Networks, enabling uncertainty quantification, Bayesian Neural, offer probability distributions, Neural Networks
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted to NeurIPS 2023

点击查看摘要

Abstract:Bayesian Neural Networks (BNNs) offer probability distributions for model parameters, enabling uncertainty quantification in predictions. However, they often underperform compared to deterministic neural networks. Utilizing mutual learning can effectively enhance the performance of peer BNNs. In this paper, we propose a novel approach to improve BNNs performance through deep mutual learning. The proposed approaches aim to increase diversity in both network parameter distributions and feature distributions, promoting peer networks to acquire distinct features that capture different characteristics of the input, which enhances the effectiveness of mutual learning. Experimental results demonstrate significant improvements in the classification accuracy, negative log-likelihood, and expected calibration error when compared to traditional mutual learning for BNNs.

[LG-77] Light-weight Fine-tuning Method for Defending Adversarial Noise in Pre-trained Medical Vision-Language Models

链接: https://arxiv.org/abs/2407.02716
作者: Xu Han,Linghao Jin,Xuezhe Ma,Xiaofeng Liu
关键词: textual depiction synergy, shown remarkable capabilities, pre-trained Vision-Language Models, Fine-tuning pre-trained Vision-Language, depiction synergy
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Fine-tuning pre-trained Vision-Language Models (VLMs) has shown remarkable capabilities in medical image and textual depiction synergy. Nevertheless, many pre-training datasets are restricted by patient privacy concerns, potentially containing noise that can adversely affect downstream performance. Moreover, the growing reliance on multi-modal generation exacerbates this issue because of its susceptibility to adversarial attacks. To investigate how VLMs trained on adversarial noisy data perform on downstream medical tasks, we first craft noisy upstream datasets using multi-modal adversarial attacks. Through our comprehensive analysis, we unveil that moderate noise enhances model robustness and transferability, but increasing noise levels negatively impact downstream task performance. To mitigate this issue, we propose rectify adversarial noise (RAN) framework, a recipe designed to effectively defend adversarial attacks and rectify the influence of upstream noise during fine-tuning.

[LG-78] Advancing Compressed Video Action Recognition through Progressive Knowledge Distillation

链接: https://arxiv.org/abs/2407.02713
作者: Efstathia Soufleri,Deepak Ravikumar,Kaushik Roy
关键词: Compressed video action, video action recognition, recognition classifies video, classifies video samples, action recognition classifies
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Compressed video action recognition classifies video samples by leveraging the different modalities in compressed videos, namely motion vectors, residuals, and intra-frames. For this purpose, three neural networks are deployed, each dedicated to processing one modality. Our observations indicate that the network processing intra-frames tend to converge to a flatter minimum than the network processing residuals, which in turn converges to a flatter minimum than the motion vector network. This hierarchy in convergence motivates our strategy for knowledge transfer among modalities to achieve flatter minima, which are generally associated with better generalization. With this insight, we propose Progressive Knowledge Distillation (PKD), a technique that incrementally transfers knowledge across the modalities. This method involves attaching early exits (Internal Classifiers - ICs) to the three networks. PKD distills knowledge starting from the motion vector network, followed by the residual, and finally, the intra-frame network, sequentially improving IC accuracy. Further, we propose the Weighted Inference with Scaled Ensemble (WISE), which combines outputs from the ICs using learned weights, boosting accuracy during inference. Our experiments demonstrate the effectiveness of training the ICs with PKD compared to standard cross-entropy-based training, showing IC accuracy improvements of up to 5.87% and 11.42% on the UCF-101 and HMDB-51 datasets, respectively. Additionally, WISE improves accuracy by up to 4.28% and 9.30% on UCF-101 and HMDB-51, respectively.

[LG-79] Practical Guide for Causal Pathways and Sub-group Disparity Analysis

链接: https://arxiv.org/abs/2407.02702
作者: Farnaz Kohankhaki,Shaina Raza,Oluwanifemi Bamgbose,Deval Pandya,Elham Dolatabadi
关键词: unveil intricate relationships, causal disparity analysis, real-world observational data, sensitive attributes, introduce the application
类目: Computers and Society (cs.CY); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:In this study, we introduce the application of causal disparity analysis to unveil intricate relationships and causal pathways between sensitive attributes and the targeted outcomes within real-world observational data. Our methodology involves employing causal decomposition analysis to quantify and examine the causal interplay between sensitive attributes and outcomes. We also emphasize the significance of integrating heterogeneity assessment in causal disparity analysis to gain deeper insights into the impact of sensitive attributes within specific sub-groups on outcomes. Our two-step investigation focuses on datasets where race serves as the sensitive attribute. The results on two datasets indicate the benefit of leveraging causal analysis and heterogeneity assessment not only for quantifying biases in the data but also for disentangling their influences on outcomes. We demonstrate that the sub-groups identified by our approach to be affected the most by disparities are the ones with the largest ML classification errors. We also show that grouping the data only based on a sensitive attribute is not enough, and through these analyses, we can find sub-groups that are directly affected by disparities. We hope that our findings will encourage the adoption of such methodologies in future ethical AI practices and bias audits, fostering a more equitable and fair technological landscape.

[LG-80] Output Range Analysis for Deep Neural Networks based on Simulated Annealing Processes

链接: https://arxiv.org/abs/2407.02700
作者: Helder Rojas,Nilton Rojas,Espinoza J. B.,Luis Huamanchumo
关键词: Deep Neural Networks, Simulated Annealing, Residual Neural Networks, Deep Neural, estimation for Deep
类目: Machine Learning (cs.LG); Probability (math.PR); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:This paper tackles the challenging problem of output range estimation for Deep Neural Networks (DNNs), introducing a novel algorithm based on Simulated Annealing (SA). Our approach addresses the lack of local geometric information and high non-linearity in DNNs, making it versatile across various architectures, especially Residual Neural Networks (ResNets). We present a straightforward, implementation-friendly algorithm that avoids restrictive assumptions about network architecture. Through theoretical analysis and experimental evaluations, including tests on the Ackley function, we demonstrate our algorithm’s effectiveness in navigating complex, non-convex surfaces and accurately estimating DNN output ranges. Futhermore, the Python codes of this experimental evaluation that support our results are available in our GitHub repository (this https URL).

[LG-81] LLM-Select: Feature Selection with Large Language Models

链接: https://arxiv.org/abs/2407.02694
作者: Daniel P. Jeong,Zachary C. Lipton,Pradeep Ravikumar
关键词: large language models, prediction task, demonstrate a surprising, surprising capability, capability of large
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
*备注: Preprint

点击查看摘要

Abstract:In this paper, we demonstrate a surprising capability of large language models (LLMs): given only input feature names and a description of a prediction task, they are capable of selecting the most predictive features, with performance rivaling the standard tools of data science. Remarkably, these models exhibit this capacity across various query mechanisms. For example, we zero-shot prompt an LLM to output a numerical importance score for a feature (e.g., “blood pressure”) in predicting an outcome of interest (e.g., “heart failure”), with no additional context. In particular, we find that the latest models, such as GPT-4, can consistently identify the most predictive features regardless of the query mechanism and across various prompting strategies. We illustrate these findings through extensive experiments on real-world data, where we show that LLM-based feature selection consistently achieves strong performance competitive with data-driven methods such as the LASSO, despite never having looked at the downstream training data. Our findings suggest that LLMs may be useful not only for selecting the best features for training but also for deciding which features to collect in the first place. This could potentially benefit practitioners in domains like healthcare, where collecting high-quality data comes at a high cost.

[LG-82] Accelerating Distributed Optimization: A Primal-Dual Perspective on Local Steps

链接: https://arxiv.org/abs/2407.02689
作者: Junchi Yang,Murat Yildirim,Qiu Feng
关键词: poses significant challenges, data distributions poses, distributions poses significant, Stochastic Gradient Descent, distributed machine learning
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:In distributed machine learning, efficient training across multiple agents with different data distributions poses significant challenges. Even with a centralized coordinator, current algorithms that achieve optimal communication complexity typically require either large minibatches or compromise on gradient complexity. In this work, we tackle both centralized and decentralized settings across strongly convex, convex, and nonconvex objectives. We first demonstrate that a basic primal-dual method, (Accelerated) Gradient Ascent Multiple Stochastic Gradient Descent (GA-MSGD), applied to the Lagrangian of distributed optimization inherently incorporates local updates, because the inner loops of running Stochastic Gradient Descent on the primal variable require no inter-agent communication. Notably, for strongly convex objectives, we show (Accelerated) GA-MSGD achieves linear convergence in communication rounds despite the Lagrangian being only linear in the dual variables. This is due to a unique structural property where the dual variable is confined to the span of the coupling matrix, rendering the dual problem strongly concave. When integrated with the Catalyst framework, our approach achieves nearly optimal communication complexity across various settings without the need for minibatches. Moreover, in stochastic decentralized problems, it attains communication complexities comparable to those in deterministic settings, improving over existing algorithms.

[LG-83] No Training No Problem: Rethinking Classifier-Free Guidance for Diffusion Models

链接: https://arxiv.org/abs/2407.02687
作者: Seyedmorteza Sadat,Manuel Kansy,Otmar Hilliges,Romann M. Weber
关键词: CFG, Classifier-free guidance, conditional diffusion models, diffusion, diffusion models
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Classifier-free guidance (CFG) has become the standard method for enhancing the quality of conditional diffusion models. However, employing CFG requires either training an unconditional model alongside the main diffusion model or modifying the training procedure by periodically inserting a null condition. There is also no clear extension of CFG to unconditional models. In this paper, we revisit the core principles of CFG and introduce a new method, independent condition guidance (ICG), which provides the benefits of CFG without the need for any special training procedures. Our approach streamlines the training process of conditional diffusion models and can also be applied during inference on any pre-trained conditional model. Additionally, by leveraging the time-step information encoded in all diffusion networks, we propose an extension of CFG, called time-step guidance (TSG), which can be applied to any diffusion model, including unconditional ones. Our guidance techniques are easy to implement and have the same sampling cost as CFG. Through extensive experiments, we demonstrate that ICG matches the performance of standard CFG across various conditional diffusion models. Moreover, we show that TSG improves generation quality in a manner similar to CFG, without relying on any conditional information.

[LG-84] Uniform Transformation: Refining Latent Representation in Variational Autoencoders

链接: https://arxiv.org/abs/2407.02681
作者: Ye Shi,C.S. George Lee
关键词: Variational Autoencoders, Kernel Density Estimation, Probability Integral Transform, Gaussian Kernel Density, problem in Variational
类目: Machine Learning (cs.LG); Image and Video Processing (eess.IV); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: Accepted by 2024 IEEE 20th International Conference on Automation Science and Engineering

点击查看摘要

Abstract:Irregular distribution in latent space causes posterior collapse, misalignment between posterior and prior, and ill-sampling problem in Variational Autoencoders (VAEs). In this paper, we introduce a novel adaptable three-stage Uniform Transformation (UT) module – Gaussian Kernel Density Estimation (G-KDE) clustering, non-parametric Gaussian Mixture (GM) Modeling, and Probability Integral Transform (PIT) – to address irregular latent distributions. By reconfiguring irregular distributions into a uniform distribution in the latent space, our approach significantly enhances the disentanglement and interpretability of latent representations, overcoming the limitation of traditional VAE models in capturing complex data structures. Empirical evaluations demonstrated the efficacy of our proposed UT module in improving disentanglement metrics across benchmark datasets – dSprites and MNIST. Our findings suggest a promising direction for advancing representation learning techniques, with implication for future research in extending this framework to more sophisticated datasets and downstream tasks.

[LG-85] Ensuring Responsible Sourcing of Large Language Model Training Data Through Knowledge Graph Comparison

链接: https://arxiv.org/abs/2407.02659
作者: Devam Mondal,Carlo Lipizzi
关键词: plagiarism allegations Brough, recent plagiarism allegations, large language model, Resource Description Framework, plagiarism detection system
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In light of recent plagiarism allegations Brough by publishers, newspapers, and other creators of copyrighted corpora against large language model (LLM) developers, we propose a novel system, a variant of a plagiarism detection system, that assesses whether a knowledge source has been used in the training or fine-tuning of a large language model. Unlike current methods, we utilize an approach that uses Resource Description Framework (RDF) triples to create knowledge graphs from both a source document and a LLM continuation of that document. These graphs are then analyzed with respect to content using cosine similarity and with respect to structure using a normalized version of graph edit distance that shows the degree of isomorphism. Unlike traditional systems that focus on content matching and keyword identification between a source and target corpus, our approach enables a broader evaluation of similarity and thus a more accurate comparison of the similarity between a source document and LLM continuation by focusing on relationships between ideas and their organization with regards to others. Additionally, our approach does not require access to LLM metrics like perplexity that may be unavailable in closed large language modeling “black-box” systems, as well as the training corpus. A prototype of our system will be found on a hyperlinked GitHub repository.

[LG-86] Large Scale Hierarchical Industrial Demand Time-Series Forecasting incorporating Sparsity

链接: https://arxiv.org/abs/2407.02657
作者: Harshavardhan Kamarthi,Aditya B. Sasanur,Xinjie Tong,Xingyu Zhou,James Peters,Joe Czyzyk,B. Aditya Prakash
关键词: simultaneously forecast multiple, important problem, hierarchical relation, demand forecasting, forecasting
类目: Machine Learning (cs.LG); Methodology (stat.ME)
*备注: Accepted at KDD 2024

点击查看摘要

Abstract:Hierarchical time-series forecasting (HTSF) is an important problem for many real-world business applications where the goal is to simultaneously forecast multiple time-series that are related to each other via a hierarchical relation. Recent works, however, do not address two important challenges that are typically observed in many demand forecasting applications at large companies. First, many time-series at lower levels of the hierarchy have high sparsity i.e., they have a significant number of zeros. Most HTSF methods do not address this varying sparsity across the hierarchy. Further, they do not scale well to the large size of the real-world hierarchy typically unseen in benchmarks used in literature. We resolve both these challenges by proposing HAILS, a novel probabilistic hierarchical model that enables accurate and calibrated probabilistic forecasts across the hierarchy by adaptively modeling sparse and dense time-series with different distributional assumptions and reconciling them to adhere to hierarchical constraints. We show the scalability and effectiveness of our methods by evaluating them against real-world demand forecasting datasets. We deploy HAILS at a large chemical manufacturing company for a product demand forecasting application with over ten thousand products and observe a significant 8.5% improvement in forecast accuracy and 23% better improvement for sparse time-series. The enhanced accuracy and scalability make HAILS a valuable tool for improved business planning and customer experience.

[LG-87] Learning Graph Structures and Uncertainty for Accurate and Calibrated Time-series Forecasting

链接: https://arxiv.org/abs/2407.02641
作者: Harshavardhan Kamarthi,Lingkai Kong,Alexander Rodriguez,Chao Zhang,B Aditya Prakash
关键词: Multi-variate time series, time series forecasting, Multi-variate time, time series, series forecasting
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Multi-variate time series forecasting is an important problem with a wide range of applications. Recent works model the relations between time-series as graphs and have shown that propagating information over the relation graph can improve time series forecasting. However, in many cases, relational information is not available or is noisy and reliable. Moreover, most works ignore the underlying uncertainty of time-series both for structure learning and deriving the forecasts resulting in the structure not capturing the uncertainty resulting in forecast distributions with poor uncertainty estimates. We tackle this challenge and introduce STOIC, that leverages stochastic correlations between time-series to learn underlying structure between time-series and to provide well-calibrated and accurate forecasts. Over a wide-range of benchmark datasets STOIC provides around 16% more accurate and 14% better-calibrated forecasts. STOIC also shows better adaptation to noise in data during inference and captures important and useful relational information in various benchmarks. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2407.02641 [cs.LG] (or arXiv:2407.02641v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2407.02641 Focus to learn more arXiv-issued DOI via DataCite

[LG-88] owards Federated Learning with On-device Training and Communication in 8-bit Floating Point

链接: https://arxiv.org/abs/2407.02610
作者: Bokun Wang,Axel Berg,Durmus Alp Emre Acar,Chuteng Zhou
关键词: reduced computational overhead, Recent work, computational overhead compared, floating point, efficiently training neural
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Recent work has shown that 8-bit floating point (FP8) can be used for efficiently training neural networks with reduced computational overhead compared to training in FP32/FP16. In this work, we investigate the use of FP8 training in a federated learning context. This brings not only the usual benefits of FP8 which are desirable for on-device training at the edge, but also reduces client-server communication costs due to significant weight compression. We present a novel method for combining FP8 client training while maintaining a global FP32 server model and provide convergence analysis. Experiments with various machine learning models and datasets show that our method consistently yields communication reductions of at least 2.9x across a variety of tasks and models compared to an FP32 baseline.

[LG-89] D-Rax: Domain-specific Radiologic assistant leveraging multi-modal data and eXpert model predictions

链接: https://arxiv.org/abs/2407.02604
作者: Hareem Nisar,Syed Muhammad Anwar,Zhifan Jiang,Abhijeet Parida,Vishwesh Nath,Holger R. Roth,Marius George Linguraru
关键词: Large vision language, general-purpose use cases, progressed incredibly, incredibly from research, research to applicability
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注:

点击查看摘要

Abstract:Large vision language models (VLMs) have progressed incredibly from research to applicability for general-purpose use cases. LLaVA-Med, a pioneering large language and vision assistant for biomedicine, can perform multi-modal biomedical image and data analysis to provide a natural language interface for radiologists. While it is highly generalizable and works with multi-modal data, it is currently limited by well-known challenges that exist in the large language model space. Hallucinations and imprecision in responses can lead to misdiagnosis which currently hinder the clinical adaptability of VLMs. To create precise, user-friendly models in healthcare, we propose D-Rax – a domain-specific, conversational, radiologic assistance tool that can be used to gain insights about a particular radiologic image. In this study, we enhance the conversational analysis of chest X-ray (CXR) images to support radiological reporting, offering comprehensive insights from medical imaging and aiding in the formulation of accurate diagnosis. D-Rax is achieved by fine-tuning the LLaVA-Med architecture on our curated enhanced instruction-following data, comprising of images, instructions, as well as disease diagnosis and demographic predictions derived from MIMIC-CXR imaging data, CXR-related visual question answer (VQA) pairs, and predictive outcomes from multiple expert AI models. We observe statistically significant improvement in responses when evaluated for both open and close-ended conversations. Leveraging the power of state-of-the-art diagnostic models combined with VLMs, D-Rax empowers clinicians to interact with medical images using natural language, which could potentially streamline their decision-making process, enhance diagnostic accuracy, and conserve their time.

[LG-90] Linear Submodular Maximization with Bandit Feedback

链接: https://arxiv.org/abs/2407.02601
作者: Wenjing Chen,Victoria G. Crawford
关键词: variety of contexts, feedback has recently, recently been studied, Submodular optimization, submodular function exhibits
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
*备注:

点击查看摘要

Abstract:Submodular optimization with bandit feedback has recently been studied in a variety of contexts. In a number of real-world applications such as diversified recommender systems and data summarization, the submodular function exhibits additional linear structure. We consider developing approximation algorithms for the maximization of a submodular objective function f:2^U\to\mathbbR_\geq 0 , where f=\sum_i=1^dw_iF_i . It is assumed that we have value oracle access to the functions F_i , but the coefficients w_i are unknown, and f can only be accessed via noisy queries. We develop algorithms for this setting inspired by adaptive allocation algorithms in the best-arm identification for linear bandit, with approximation guarantees arbitrarily close to the setting where we have value oracle access to f . Finally, we empirically demonstrate that our algorithms make vast improvements in terms of sample efficiency compared to algorithms that do not exploit the linear structure of f on instances of move recommendation.

[LG-91] Meta 3D Gen

链接: https://arxiv.org/abs/2407.02599
作者: Raphael Bensadoun,Tom Monnier,Yanir Kleiman,Filippos Kokkinos,Yawar Siddiqui,Mahendra Kariya,Omri Harosh,Roman Shapovalov,Benjamin Graham,Emilien Garreau,Animesh Karnewar,Ang Cao,Idan Azuri,Iurii Makarov,Eric-Tuan Le,Antoine Toisoul,David Novotny,Oran Gafni,Natalia Neverova,Andrea Vedaldi
关键词: fast pipeline, introduce Meta, Gen, Meta, asset
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce Meta 3D Gen (3DGen), a new state-of-the-art, fast pipeline for text-to-3D asset generation. 3DGen offers 3D asset creation with high prompt fidelity and high-quality 3D shapes and textures in under a minute. It supports physically-based rendering (PBR), necessary for 3D asset relighting in real-world applications. Additionally, 3DGen supports generative retexturing of previously generated (or artist-created) 3D shapes using additional textual inputs provided by the user. 3DGen integrates key technical components, Meta 3D AssetGen and Meta 3D TextureGen, that we developed for text-to-3D and text-to-texture generation, respectively. By combining their strengths, 3DGen represents 3D objects simultaneously in three ways: in view space, in volumetric space, and in UV (or texture) space. The integration of these two techniques achieves a win rate of 68% with respect to the single-stage model. We compare 3DGen to numerous industry baselines, and show that it outperforms them in terms of prompt fidelity and visual quality for complex textual prompts, while being significantly faster.

[LG-92] owards More Realistic Extraction Attacks: An Adversarial Perspective

链接: https://arxiv.org/abs/2407.02596
作者: Yash More,Prakhar Ganesh,Golnoosh Farnadi
关键词: memorizing large parts, prone to memorizing, memorizing large, large parts, Language models
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: To be presented at PrivateNLP@ACL2024

点击查看摘要

Abstract:Language models are prone to memorizing large parts of their training data, making them vulnerable to extraction attacks. Existing research on these attacks remains limited in scope, often studying isolated trends rather than the real-world interactions with these models. In this paper, we revisit extraction attacks from an adversarial perspective, exploiting the brittleness of language models. We find significant churn in extraction attack trends, i.e., even minor, unintuitive changes to the prompt, or targeting smaller models and older checkpoints, can exacerbate the risks of extraction by up to 2-4 \times . Moreover, relying solely on the widely accepted verbatim match underestimates the extent of extracted information, and we provide various alternatives to more accurately capture the true risks of extraction. We conclude our discussion with data deduplication, a commonly suggested mitigation strategy, and find that while it addresses some memorization concerns, it remains vulnerable to the same escalation of extraction risks against a real-world adversary. Our findings highlight the necessity of acknowledging an adversary’s true capabilities to avoid underestimating extraction risks.

[LG-93] RLHF Can Speak Many Languages: Unlocking Multilingual Preference Optimization for LLMs

链接: https://arxiv.org/abs/2407.02552
作者: John Dang,Arash Ahmadian,Kelly Marchisio,Julia Kreutzer,Ahmet Üstün,Sara Hooker
关键词: standard final stage, standard final, final stage, English and Chinese, large language models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Preference optimization techniques have become a standard final stage for training state-of-art large language models (LLMs). However, despite widespread adoption, the vast majority of work to-date has focused on first-class citizen languages like English and Chinese. This captures a small fraction of the languages in the world, but also makes it unclear which aspects of current state-of-the-art research transfer to a multilingual setting. In this work, we perform an exhaustive study to achieve a new state-of-the-art in aligning multilingual LLMs. We introduce a novel, scalable method for generating high-quality multilingual feedback data to balance data coverage. We establish the benefits of cross-lingual transfer and increased dataset size in preference training. Our preference-trained model achieves a 54.4% win-rate against Aya 23 8B, the current state-of-the-art multilingual LLM in its parameter class, and a 69.5% win-rate or higher against widely used models like Gemma-1.1-7B-it, Llama-3-8B-Instruct, Mistral-7B-Instruct-v0.3. As a result of our study, we expand the frontier of alignment techniques to 23 languages covering half of the world’s population.

[LG-94] Diffusion Models for Tabular Data Imputation and Synthetic Data Generation

链接: https://arxiv.org/abs/2407.02549
作者: Mario Villaizán-Vallelado,Matteo Salvatori,Carlos Segura,Ioannis Arapakis
关键词: hinder accurate analysis, Data, healthcare and finance, analysis and decision-making, hinder accurate
类目: Machine Learning (cs.LG)
*备注: 25 pages, 7 figures, 6 tables

点击查看摘要

Abstract:Data imputation and data generation have important applications for many domains, like healthcare and finance, where incomplete or missing data can hinder accurate analysis and decision-making. Diffusion models have emerged as powerful generative models capable of capturing complex data distributions across various data modalities such as image, audio, and time series data. Recently, they have been also adapted to generate tabular data. In this paper, we propose a diffusion model for tabular data that introduces three key enhancements: (1) a conditioning attention mechanism, (2) an encoder-decoder transformer as the denoising network, and (3) dynamic masking. The conditioning attention mechanism is designed to improve the model’s ability to capture the relationship between the condition and synthetic data. The transformer layers help model interactions within the condition (encoder) or synthetic data (decoder), while dynamic masking enables our model to efficiently handle both missing data imputation and synthetic data generation tasks within a unified framework. We conduct a comprehensive evaluation by comparing the performance of diffusion models with transformer conditioning against state-of-the-art techniques, such as Variational Autoencoders, Generative Adversarial Networks and Diffusion Models, on benchmark datasets. Our evaluation focuses on the assessment of the generated samples with respect to three important criteria, namely: (1) Machine Learning efficiency, (2) statistical similarity, and (3) privacy risk mitigation. For the task of data imputation, we consider the efficiency of the generated samples across different levels of missing features.

[LG-95] Domain Generalizable Knowledge Tracing via Concept Aggregation and Relation-Based Attention

链接: https://arxiv.org/abs/2407.02547
作者: Yuquan Xie,Wanqi Yang,Jinyu Wei,Ming Yang,Yang Gao
关键词: education systems, monitor students’ knowledge, students’ knowledge states, Knowledge Tracing, aiming to monitor
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Knowledge Tracing (KT) is a critical task in online education systems, aiming to monitor students’ knowledge states throughout a learning period. Common KT approaches involve predicting the probability of a student correctly answering the next question based on their exercise history. However, these methods often suffer from performance degradation when faced with the scarcity of student interactions in new education systems. To address this, we leverage student interactions from existing education systems to mitigate performance degradation caused by limited training data. Nevertheless, these interactions exhibit significant differences since they are derived from different education systems. To address this issue, we propose a domain generalization approach for knowledge tracing, where existing education systems are considered source domains, and new education systems with limited data are considered target domains. Additionally, we design a domain-generalizable knowledge tracing framework (DGKT) that can be applied to any KT model. Specifically, we present a concept aggregation approach designed to reduce conceptual disparities within sequences of student interactions from diverse domains. To further mitigate domain discrepancies, we introduce a novel normalization module called Sequence Instance Normalization (SeqIN). Moreover, to fully leverage exercise information, we propose a new knowledge tracing model tailored for the domain generalization KT task, named Domain-Generalizable Relation-based Knowledge Tracing (DGRKT). Extensive experiments across five benchmark datasets demonstrate that the proposed method performs well despite limited training data.

[LG-96] Adaptive Autopilot: Constrained DRL for Diverse Driving Behaviors

链接: https://arxiv.org/abs/2407.02546
作者: Dinesh Cyril Selvaraj,Christian Vitale,Tania Panayiotou,Panayiotis Kolios,Carla Fabiana Chiasserini,Georgios Ellinas
关键词: autonomous vehicles, behavior is vital, pursuit of autonomous, achieving human-like driving, human-like driving behavior
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 8 pages, 9 figures

点击查看摘要

Abstract:In pursuit of autonomous vehicles, achieving human-like driving behavior is vital. This study introduces adaptive autopilot (AA), a unique framework utilizing constrained-deep reinforcement learning (C-DRL). AA aims to safely emulate human driving to reduce the necessity for driver intervention. Focusing on the car-following scenario, the process involves (i) extracting data from the highD natural driving study and categorizing it into three driving styles using a rule-based classifier; (ii) employing deep neural network (DNN) regressors to predict human-like acceleration across styles; and (iii) using C-DRL, specifically the soft actor-critic Lagrangian technique, to learn human-like safe driving policies. Results indicate effectiveness in each step, with the rule-based classifier distinguishing driving styles, the regressor model accurately predicting acceleration, outperforming traditional car-following models, and C-DRL agents learning optimal policies for humanlike driving across styles.

[LG-97] owards the Next Frontier in Speech Representation Learning Using Disentanglement

链接: https://arxiv.org/abs/2407.02543
作者: Varun Krishna,Sriram Ganapathy
关键词: frame-level masked prediction, masked prediction, speech, speech regions, learning
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:The popular frameworks for self-supervised learning of speech representations have largely focused on frame-level masked prediction of speech regions. While this has shown promising downstream task performance for speech recognition and related tasks, this has largely ignored factors of speech that are encoded at coarser level, like characteristics of the speaker or channel that remain consistent through-out a speech utterance. In this work, we propose a framework for Learning Disentangled Self Supervised (termed as Learn2Diss) representations of speech, which consists of frame-level and an utterance-level encoder modules. The two encoders are initially learned independently, where the frame-level model is largely inspired by existing self supervision techniques, thereby learning pseudo-phonemic representations, while the utterance-level encoder is inspired by constrastive learning of pooled embeddings, thereby learning pseudo-speaker representations. The joint learning of these two modules consists of disentangling the two encoders using a mutual information based criterion. With several downstream evaluation experiments, we show that the proposed Learn2Diss achieves state-of-the-art results on a variety of tasks, with the frame-level encoder representations improving semantic tasks, while the utterance-level representations improve non-semantic tasks.

[LG-98] ECAT: A Entire space Continual and Adaptive Transfer Learning Framework for Cross-Domain Recommendation

链接: https://arxiv.org/abs/2407.02542
作者: Chaoqun Hou,Yuanhang Zhou,Yi Cao,Tong Liu
关键词: entire space, designed to meet, meet the diverse, diverse interests, industrial recommendation systems
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In industrial recommendation systems, there are several mini-apps designed to meet the diverse interests and needs of users. The sample space of them is merely a small subset of the entire space, making it challenging to train an efficient model. In recent years, there have been many excellent studies related to cross-domain recommendation aimed at mitigating the problem of data sparsity. However, few of them have simultaneously considered the adaptability of both sample and representation continual transfer setting to the target task. To overcome the above issue, we propose a Entire space Continual and Adaptive Transfer learning framework called ECAT which includes two core components: First, as for sample transfer, we propose a two-stage method that realizes a coarse-to-fine process. Specifically, we perform an initial selection through a graph-guided method, followed by a fine-grained selection using domain adaptation method. Second, we propose an adaptive knowledge distillation method for continually transferring the representations from a model that is well-trained on the entire space dataset. ECAT enables full utilization of the entire space samples and representations under the supervision of the target task, while avoiding negative migration. Comprehensive experiments on real-world industrial datasets from Taobao show that ECAT advances state-of-the-art performance on offline metrics, and brings +13.6% CVR and +8.6% orders for Baiyibutie, a famous mini-app of Taobao.

[LG-99] Research on Autonomous Robots Navigation based on Reinforcement Learning

链接: https://arxiv.org/abs/2407.02539
作者: Zixiang Wang,Hao Yan,Yining Wang,Zhengjia Xu,Zhuoyue Wang,Zhizhong Wu
关键词: Reinforcement learning continuously, demonstrating strong adaptive, learning continuously optimizes, Proximal Policy Optimization, demonstrating strong
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Reinforcement learning continuously optimizes decision-making based on real-time feedback reward signals through continuous interaction with the environment, demonstrating strong adaptive and self-learning capabilities. In recent years, it has become one of the key methods to achieve autonomous navigation of robots. In this work, an autonomous robot navigation method based on reinforcement learning is introduced. We use the Deep Q Network (DQN) and Proximal Policy Optimization (PPO) models to optimize the path planning and decision-making process through the continuous interaction between the robot and the environment, and the reward signals with real-time feedback. By combining the Q-value function with the deep neural network, deep Q network can handle high-dimensional state space, so as to realize path planning in complex environments. Proximal policy optimization is a strategy gradient-based method, which enables robots to explore and utilize environmental information more efficiently by optimizing policy functions. These methods not only improve the robot’s navigation ability in the unknown environment, but also enhance its adaptive and self-learning capabilities. Through multiple training and simulation experiments, we have verified the effectiveness and robustness of these models in various complex scenarios.

[LG-100] Reducing False Discoveries in Statistically-Significant Regional-Colocation Mining: A Summary of Results

链接: https://arxiv.org/abs/2407.02536
作者: Subhankar Ghosh,Jayant Gupta,Arun Sharma,Shuai An,Shashi Shekhar
关键词: spatial feature types, emph, feature types, feature instances, spatial feature
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR); General Economics (econ.GN); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:Given a set \emphS of spatial feature types, its feature instances, a study area, and a neighbor relationship, the goal is to find pairs a region ( r_g ), a subset \emphC of \emphS such that \emphC is a statistically significant regional-colocation pattern in r_g . This problem is important for applications in various domains including ecology, economics, and sociology. The problem is computationally challenging due to the exponential number of regional colocation patterns and candidate regions. Previously, we proposed a miner \citehttps://doi.org/10.1145/3557989.3566158 that finds statistically significant regional colocation patterns. However, the numerous simultaneous statistical inferences raise the risk of false discoveries (also known as the multiple comparisons problem) and carry a high computational cost. We propose a novel algorithm, namely, multiple comparisons regional colocation miner (MultComp-RCM) which uses a Bonferroni correction. Theoretical analysis, experimental evaluation, and case study results show that the proposed method reduces both the false discovery rate and computational cost.

[LG-101] Performance Comparison of Deep RL Algorithms for Mixed Traffic Cooperative Lane-Changing

链接: https://arxiv.org/abs/2407.02521
作者: Xue Yao,Shengren Hou,Serge P. Hoogendoorn,Simeon C. Calvert
关键词: challenging scenario, scenario for connected, connected and automated, complex dynamics, dynamics and high
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 6 pages, 5 figures, IEEE conference

点击查看摘要

Abstract:Lane-changing (LC) is a challenging scenario for connected and automated vehicles (CAVs) because of the complex dynamics and high uncertainty of the traffic environment. This challenge can be handled by deep reinforcement learning (DRL) approaches, leveraging their data-driven and model-free nature. Our previous work proposed a cooperative lane-changing in mixed traffic (CLCMT) mechanism based on TD3 to facilitate an optimal lane-changing strategy. This study enhances the current CLCMT mechanism by considering both the uncertainty of the human-driven vehicles (HVs) and the microscopic interactions between HVs and CAVs. The state-of-the-art (SOTA) DRL algorithms including DDPG, TD3, SAC, and PPO are utilized to deal with the formulated MDP with continuous actions. Performance comparison among the four DRL algorithms demonstrates that DDPG, TD3, and PPO algorithms can deal with uncertainty in traffic environments and learn well-performed LC strategies in terms of safety, efficiency, comfort, and ecology. The PPO algorithm outperforms the other three algorithms, regarding a higher reward, fewer exploration mistakes and crashes, and a more comfortable and ecology LC strategy. The improvements promise CLCMT mechanism greater advantages in the LC motion planning of CAVs.

[LG-102] RaCIL: Ray Tracing based Multi-UAV Obstacle Avoidance through Composite Imitation Learning

链接: https://arxiv.org/abs/2407.02520
作者: Harsh Bansal,Vyom Goyal,Bhaskar Joshi,Akhil Gupta,Harikumar Kandath
关键词: Unmanned Aerial Vehicles, Proximal Policy Optimization, Generative Adversarial Imitation, Adversarial Imitation Learning, combines Proximal Policy
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this study, we address the challenge of obstacle avoidance for Unmanned Aerial Vehicles (UAVs) through an innovative composite imitation learning approach that combines Proximal Policy Optimization (PPO) with Behavior Cloning (BC) and Generative Adversarial Imitation Learning (GAIL), enriched by the integration of ray-tracing techniques. Our research underscores the significant role of ray-tracing in enhancing obstacle detection and avoidance capabilities. Moreover, we demonstrate the effectiveness of incorporating GAIL in coordinating the flight paths of two UAVs, showcasing improved collision avoidance capabilities. Extending our methodology, we apply our combined PPO, BC, GAIL, and ray-tracing framework to scenarios involving four UAVs, illustrating its scalability and adaptability to more complex scenarios. The findings indicate that our approach not only improves the reliability of basic PPO based obstacle avoidance but also paves the way for advanced autonomous UAV operations in crowded or dynamic environments.

[LG-103] Anvil: An integration of artificial intelligence sampling techniques and a combined CAD-CFD tool

链接: https://arxiv.org/abs/2407.02519
作者: Harsh Vardhan,Umesh Timalsina,Michael Sandborn,David Hyde,Peter Volgyesi,Janos Sztipanovits
关键词: integrated CAD-CFD tool, AI-based optimization method, Bayesian optimization, open-source integrated CAD-CFD, CFD analysis
类目: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:In this work, we introduce an open-source integrated CAD-CFD tool, Anvil, which combines FreeCAD for CAD modeling and OpenFOAM for CFD analysis, along with an AI-based optimization method (Bayesian optimization) and other sampling algorithms. Anvil serves as a scientific machine learning tool for shape optimization in three modes: data generation, CFD evaluation, and shape optimization. In data generation mode, it automatically runs CFD evaluations and generates data for training a surrogate model. In optimization mode, it searches for the optimal design under given requirements and optimization metrics. In CFD mode, a single CAD file can be evaluated with a single OpenFOAM run. To use Anvil, experimenters provide a JSON configuration file and a parametric CAD seed design. Anvil can be used to study solid-fluid dynamics for any subsonic flow conditions and has been demonstrated in various simulation and optimization use cases. The open-source code for the tool, installation process, artifacts (such as CAD seed designs and example STL models), experimentation results, and detailed documentation can be found at \urlthis https URL.

[LG-104] Detecting Stimuli with Novel Temporal Patterns to Accelerate Functional Coverage Closure

链接: https://arxiv.org/abs/2407.02510
作者: Xuan Zheng,Tim Blackmore,James Buckingham,Kerstin Eder
关键词: industrial digital designs, simulation-based verification, demonstrated their effectiveness, effectiveness in accelerating, accelerating the closure
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Novel test selectors have demonstrated their effectiveness in accelerating the closure of functional coverage for various industrial digital designs in simulation-based verification. The primary advantages of these test selectors include performance that is not impacted by coverage holes, straightforward implementation, and relatively low computational expense. However, the detection of stimuli with novel temporal patterns remains largely unexplored. This paper introduces two novel test selectors designed to identify such stimuli. The experiments reveal that both test selectors can accelerate the functional coverage for a commercial bus bridge, compared to random test selection. Specifically, one selector achieves a 26.9% reduction in the number of simulated tests required to reach 98.5% coverage, outperforming the savings achieved by two previously published test selectors by factors of 13 and 2.68, respectively.

[LG-105] Variables are a Curse in Software Vulnerability Prediction

链接: https://arxiv.org/abs/2407.02509
作者: Jinghua Groppe,Sven Groppe,Ralf Möller
关键词: original text, Deep learning-based approaches, software vulnerability prediction, text representation, code text
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deep learning-based approaches for software vulnerability prediction currently mainly rely on the original text of software code as the feature of nodes in the graph of code and thus could learn a representation that is only specific to the code text, rather than the representation that depicts the ‘intrinsic’ functionality of a program hidden in the text representation. One curse that causes this problem is an infinite number of possibilities to name a variable. In order to lift the curse, in this work we introduce a new type of edge called name dependence, a type of abstract syntax graph based on the name dependence, and an efficient node representation method named 3-property encoding scheme. These techniques will allow us to remove the concrete variable names from code, and facilitate deep learning models to learn the functionality of software hidden in diverse code expressions. The experimental results show that the deep learning models built on these techniques outperform the ones based on existing approaches not only in the prediction of vulnerabilities but also in the memory need. The factor of memory usage reductions of our techniques can be up to the order of 30,000 in comparison to existing approaches.

[LG-106] Sample-efficient Imitative Multi-token Decision Transformer for Generalizable Real World Driving

链接: https://arxiv.org/abs/2407.02508
作者: Hang Zhou,Dan Xu,Yiding Ji
关键词: shown remarkable promise, make informed decisions, harnessing the power, sequence modeling, modeling has shown
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reinforcement learning via sequence modeling has shown remarkable promise in autonomous systems, harnessing the power of offline datasets to make informed decisions in simulated environments. However, the full potential of such methods in complex dynamic environments remain to be discovered. In autonomous driving domain, learning-based agents face significant challenges when transferring knowledge from simulated to real-world settings and the performance is also significantly impacted by data distribution shift. To address these issue, we propose Sample-efficient Imitative Multi-token Decision Transformer (SimDT). SimDT introduces multi-token prediction, imitative online learning and prioritized experience replay to Decision Transformer. The performance is evaluated through empirical experiments and results exceed popular imitation and reinforcement learning algorithms on Waymax benchmark.

[LG-107] A MgNO Method for Multiphase Flow in Porous Media

链接: https://arxiv.org/abs/2407.02505
作者: Xinliang Liu,Xia Yang,Chen-Song Zhang,Lian Zhang,Li Zhao
关键词: Multigrid Neural Operator, operator architecture inspired, neural operator architecture, Neural Operator, Multigrid Neural
类目: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
*备注:

点击查看摘要

Abstract:This research investigates the application of Multigrid Neural Operator (MgNO), a neural operator architecture inspired by multigrid methods, in the simulation for multiphase flow within porous media. The architecture is adjusted to manage a variety of crucial factors, such as permeability and porosity heterogeneity. The study extendes MgNO to time-dependent porous media flow problems and validate its accuracy in predicting essential aspects of multiphase flows. Furthermore, the research provides a detailed comparison between MgNO and Fourier Neural Opeartor (FNO), which is one of the most popular neural operator methods, on their performance regarding prediction error accumulation over time. This aspect provides valuable insights into the models’ long-term predictive stability and reliability. The study demonstrates MgNO’s capability to effectively simulate multiphase flow problems, offering considerable time savings compared to traditional simulation methods, marking an advancement in integrating data-driven methodologies in geoscience applications.

[LG-108] Data-driven Power Flow Linearization: Theory

链接: https://arxiv.org/abs/2407.02501
作者: Mengshuo Jia,Gabriela Hug,Ning Zhang,Zhaojian Wang,Yi Wang,Chongqing Kang
关键词: data-driven power flow, gaining increased attention, domain gaining increased, two-part tutorial dives, DPFL
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Systems and Control (eess.SY); Applications (stat.AP)
*备注: 20 pages

点击查看摘要

Abstract:This two-part tutorial dives into the field of data-driven power flow linearization (DPFL), a domain gaining increased attention. DPFL stands out for its higher approximation accuracy, wide adaptability, and better ability to implicitly incorporate the latest system attributes. This renders DPFL a potentially superior option for managing the significant fluctuations from renewable energy sources, a step towards realizing a more sustainable energy future, by translating the higher model accuracy into increased economic efficiency and less energy losses. To conduct a deep and rigorous reexamination, this tutorial first classifies existing DPFL methods into DPFL training algorithms and supportive techniques. Their mathematical models, analytical solutions, capabilities, limitations, and generalizability are systematically examined, discussed, and summarized. In addition, this tutorial reviews existing DPFL experiments, examining the settings of test systems, the fidelity of datasets, and the comparison made among a limited number of DPFL methods. Further, this tutorial implements extensive numerical comparisons of all existing DPFL methods (40 methods in total) and four classic physics-driven approaches, focusing on their generalizability, applicability, accuracy, and computational efficiency. Through these simulationmethodss, this tutorial aims to reveal the actual performance of all the methods (including the performances exposed to data noise or outliers), guiding the selection of appropriate linearization methods. Furthermore, this tutorial discusses future directions based on the theoretical and numerical insights gained. As the first part, this paper reexamines DPFL theories, covering all the training algorithms and supportive techniques. Capabilities, limitations, and aspects of generalizability, which were previously unmentioned in the literature, have been identified.

[LG-109] FedEx: Expediting Federated Learning over Heterogeneous Mobile Devices by Overlapping and Participant Selection

链接: https://arxiv.org/abs/2407.00943
作者: Jiaxiang Geng,Boyu Li,Xiaoqi Qin,Yixuan Li,Liang Li,Yanzhao Hou,Miao Pan
关键词: numerous intrigued applications, intrigued applications ignited, heterogeneous mobile devices, Training latency, success of numerous
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: 21 pages, 10 figures, Submitted to Sensys2024

点击查看摘要

Abstract:Training latency is critical for the success of numerous intrigued applications ignited by federated learning (FL) over heterogeneous mobile devices. By revolutionarily overlapping local gradient transmission with continuous local computing, FL can remarkably reduce its training latency over homogeneous clients, yet encounter severe model staleness, model drifts, memory cost and straggler issues in heterogeneous environments. To unleash the full potential of overlapping, we propose, FedEx, a novel \underlinefederated learning approach to \underlineexpedite FL training over mobile devices under data, computing and wireless heterogeneity. FedEx redefines the overlapping procedure with staleness ceilings to constrain memory consumption and make overlapping compatible with participation selection (PS) designs. Then, FedEx characterizes the PS utility function by considering the latency reduced by overlapping, and provides a holistic PS solution to address the straggler issue. FedEx also introduces a simple but effective metric to trigger overlapping, in order to avoid model drifts. Experimental results show that compared with its peer designs, FedEx demonstrates substantial reductions in FL training latency over heterogeneous mobile devices with limited memory cost.

[LG-110] Human-Agent Joint Learning for Efficient Robot Manipulation Skill Acquisition

链接: https://arxiv.org/abs/2407.00299
作者: Shengcheng Luo,Quanquan Peng,Jun Lv,Kaiwen Hong,Katherine Rose Driggs-Campbell,Cewu Lu,Yong-Lu Li
关键词: Employing a teleoperation, gathering demonstrations offers, offers the potential, teleoperation system, teleoperation system poses
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: 8 pages, 6 figures

点击查看摘要

Abstract:Employing a teleoperation system for gathering demonstrations offers the potential for more efficient learning of robot manipulation. However, teleoperating a robot arm equipped with a dexterous hand or gripper, via a teleoperation system poses significant challenges due to its high dimensionality, complex motions, and differences in physiological structure. In this study, we introduce a novel system for joint learning between human operators and robots, that enables human operators to share control of a robot end-effector with a learned assistive agent, facilitating simultaneous human demonstration collection and robot manipulation teaching. In this setup, as data accumulates, the assistive agent gradually learns. Consequently, less human effort and attention are required, enhancing the efficiency of the data collection process. It also allows the human operator to adjust the control ratio to achieve a trade-off between manual and automated control. We conducted experiments in both simulated environments and physical real-world settings. Through user studies and quantitative evaluations, it is evident that the proposed system could enhance data collection efficiency and reduce the need for human adaptation while ensuring the collected data is of sufficient quality for downstream tasks. Videos are available at this https URL. Comments: 8 pages, 6 figures Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG) Cite as: arXiv:2407.00299 [cs.RO] (or arXiv:2407.00299v2 [cs.RO] for this version) https://doi.org/10.48550/arXiv.2407.00299 Focus to learn more arXiv-issued DOI via DataCite

[LG-111] Understanding Biases in ChatGPT-based Recommender Systems: Provider Fairness Temporal Stability and Recency

链接: https://arxiv.org/abs/2401.10545
作者: Yashar Deldjoo
关键词: provider fairness, fairness, ChatGPT-based recommender systems, thousand API calls, ICL
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper explores the biases in ChatGPT-based recommender systems, focusing on provider fairness (item-side fairness). Through extensive experiments and over a thousand API calls, we investigate the impact of prompt design strategies-including structure, system role, and intent-on evaluation metrics such as provider fairness, catalog coverage, temporal stability, and recency. The first experiment examines these strategies in classical top-K recommendations, while the second evaluates sequential in-context learning (ICL). In the first experiment, we assess seven distinct prompt scenarios on top-K recommendation accuracy and fairness. Accuracy-oriented prompts, like Simple and Chain-of-Thought (COT), outperform diversification prompts, which, despite enhancing temporal freshness, reduce accuracy by up to 50%. Embedding fairness into system roles, such as “act as a fair recommender,” proved more effective than fairness directives within prompts. Diversification prompts led to recommending newer movies, offering broader genre distribution compared to traditional collaborative filtering (CF) models. The second experiment explores sequential ICL, comparing zero-shot and few-shot ICL. Results indicate that including user demographic information in prompts affects model biases and stereotypes. However, ICL did not consistently improve item fairness and catalog coverage over zero-shot learning. Zero-shot learning achieved higher NDCG and coverage, while ICL-2 showed slight improvements in hit rate (HR) when age-group context was included. Our study provides insights into biases of RecLLMs, particularly in provider fairness and catalog coverage. By examining prompt design, learning strategies, and system roles, we highlight the potential and challenges of integrating LLMs into recommendation systems. Further details can be found at this https URL. Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG) Cite as: arXiv:2401.10545 [cs.IR] (or arXiv:2401.10545v3 [cs.IR] for this version) https://doi.org/10.48550/arXiv.2401.10545 Focus to learn more arXiv-issued DOI via DataCite

[LG-112] Vertex Exchange Method for a Class of Quadratic Programming Problems

链接: https://arxiv.org/abs/2407.03294
作者: Ling Liang,Kim-Chuan Toh,Haizhao Yang
关键词: quadratic program subject, strongly convex quadratic, convex quadratic program, vertex exchange method, generalized simplex constraint
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注: 32 pages, 5 tables

点击查看摘要

Abstract:A vertex exchange method is proposed for solving the strongly convex quadratic program subject to the generalized simplex constraint. We conduct rigorous convergence analysis for the proposed algorithm and demonstrate its essential roles in solving some important classes of constrained convex optimization. To get a feasible initial point to execute the algorithm, we also present and analyze a highly efficient semismooth Newton method for computing the projection onto the generalized simplex. The excellent practical performance of the proposed algorithms is demonstrated by a set of extensive numerical experiments. Our theoretical and numerical results further motivate the potential applications of the considered model and the proposed algorithms.

[LG-113] Do Quantum Neural Networks have Simplicity Bias?

链接: https://arxiv.org/abs/2407.03266
作者: Jessica Pointing
关键词: deep neural networks, inductive bias, neural networks, strong inductive bias, bias
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Applied Physics (physics.app-ph)
*备注: 9 pages, 42 pages with appendices

点击查看摘要

Abstract:One hypothesis for the success of deep neural networks (DNNs) is that they are highly expressive, which enables them to be applied to many problems, and they have a strong inductive bias towards solutions that are simple, known as simplicity bias, which allows them to generalise well on unseen data because most real-world data is structured (i.e. simple). In this work, we explore the inductive bias and expressivity of quantum neural networks (QNNs), which gives us a way to compare their performance to those of DNNs. Our results show that it is possible to have simplicity bias with certain QNNs, but we prove that this type of QNN limits the expressivity of the QNN. We also show that it is possible to have QNNs with high expressivity, but they either have no inductive bias or a poor inductive bias and result in a worse generalisation performance compared to DNNs. We demonstrate that an artificial (restricted) inductive bias can be produced by intentionally restricting the expressivity of a QNN. Our results suggest a bias-expressivity tradeoff. Our conclusion is that the QNNs we studied can not generally offer an advantage over DNNs, because these QNNs either have a poor inductive bias or poor expressivity compared to DNNs.

[LG-114] Incremental Gauss–Newton Methods with Superlinear Convergence Rates

链接: https://arxiv.org/abs/2407.03195
作者: Zhiling Zhou,Zhuanghua Liu,Chengchang Liu,Luo Luo
关键词: Hölder continuous Jacobians, continuous Jacobians, equations with Hölder, Hölder continuous, solving large-scale nonlinear
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注: 37 pages, 9 figures

点击查看摘要

Abstract:This paper addresses the challenge of solving large-scale nonlinear equations with Hölder continuous Jacobians. We introduce a novel Incremental Gauss–Newton (IGN) method within explicit superlinear convergence rate, which outperforms existing methods that only achieve linear convergence rate. In particular, we formulate our problem by the nonlinear least squares with finite-sum structure, and our method incrementally iterates with the information of one component in each round. We also provide a mini-batch extension to our IGN method that obtains an even faster superlinear convergence rate. Furthermore, we conduct numerical experiments to show the advantages of the proposed methods.

[LG-115] Spatio-Temporal Adaptive Diffusion Models for EEG Super-Resolution in Epilepsy Diagnosis

链接: https://arxiv.org/abs/2407.03089
作者: Tong Zhou,Shuqiang Wang
关键词: EEG, EEG devices improve, high-density EEG, Electroencephalogram, spatial resolution
类目: ignal Processing (eess.SP); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注:

点击查看摘要

Abstract:Electroencephalogram (EEG) technology, particularly high-density EEG (HD EEG) devices, is widely used in fields such as neuroscience. HD EEG devices improve the spatial resolution of EEG by placing more electrodes on the scalp, meeting the requirements of clinical diagnostic applications such as epilepsy focus localization. However, this technique faces challenges such as high acquisition costs and limited usage scenarios. In this paper, spatio-temporal adaptive diffusion models (STADMs) are proposed to pioneer the use of diffusion models for achieving spatial SR reconstruction from low-resolution (LR, 64 channels or fewer) EEG to high-resolution (HR, 256 channels) EEG. Specifically, a spatio-temporal condition module is designed to extract the spatio-temporal features of LR EEG, which then serve as conditional inputs to guide the reverse denoising process of diffusion models. Additionally, a multi-scale Transformer denoising module is constructed to leverage multi-scale convolution blocks and cross-attention-based diffusion Transformer blocks for conditional guidance to generate subject-adaptive SR EEG. Experimental results demonstrate that the proposed method effectively enhances the spatial resolution of LR EEG and quantitatively outperforms existing methods. Furthermore, STADMs demonstrate their value by applying synthetic SR EEG to classification and source localization tasks of epilepsy patients, indicating their potential to significantly improve the spatial resolution of LR EEG.

[LG-116] IM-MoCo: Self-supervised MRI Motion Correction using Motion-Guided Implicit Neural Representations

链接: https://arxiv.org/abs/2407.02974
作者: Ziad Al-Haj Hemidi,Christian Weihsbach,Mattias P. Heinrich
关键词: Magnetic Resonance Imaging, Resonance Imaging, Magnetic Resonance, long acquisition times, arise due
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Submitted to MICCAI 2024 (Before peer review version)

点击查看摘要

Abstract:Motion artifacts in Magnetic Resonance Imaging (MRI) arise due to relatively long acquisition times and can compromise the clinical utility of acquired images. Traditional motion correction methods often fail to address severe motion, leading to distorted and unreliable results. Deep Learning (DL) alleviated such pitfalls through generalization with the cost of vanishing structures and hallucinations, making it challenging to apply in the medical field where hallucinated structures can tremendously impact the diagnostic outcome. In this work, we present an instance-wise motion correction pipeline that leverages motion-guided Implicit Neural Representations (INRs) to mitigate the impact of motion artifacts while retaining anatomical structure. Our method is evaluated using the NYU fastMRI dataset with different degrees of simulated motion severity. For the correction alone, we can improve over state-of-the-art image reconstruction methods by +5% SSIM, +5:db PSNR, and +14% HaarPSI. Clinical relevance is demonstrated by a subsequent experiment, where our method improves classification outcomes by at least +1.5 accuracy percentage points compared to motion-corrupted images.

[LG-117] Self-supervised Vision Transformer are Scalable Generative Models for Domain Generalization

链接: https://arxiv.org/abs/2407.02900
作者: Sebastian Doerrich,Francesco Di Salvo,Christian Ledig
关键词: impactful clinical applications, achieving robust generalization, notable advancements, deep learning, techniques into impactful
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: Accepted at MICCAI 2024. This is the submitted manuscript with added link to github repo and funding acknowledgements. No further post submission improvements or corrections were integrated. Final version not published yet

点击查看摘要

Abstract:Despite notable advancements, the integration of deep learning (DL) techniques into impactful clinical applications, particularly in the realm of digital histopathology, has been hindered by challenges associated with achieving robust generalization across diverse imaging domains and characteristics. Traditional mitigation strategies in this field such as data augmentation and stain color normalization have proven insufficient in addressing this limitation, necessitating the exploration of alternative methodologies. To this end, we propose a novel generative method for domain generalization in histopathology images. Our method employs a generative, self-supervised Vision Transformer to dynamically extract characteristics of image patches and seamlessly infuse them into the original images, thereby creating novel, synthetic images with diverse attributes. By enriching the dataset with such synthesized images, we aim to enhance its holistic nature, facilitating improved generalization of DL models to unseen domains. Extensive experiments conducted on two distinct histopathology datasets demonstrate the effectiveness of our proposed approach, outperforming the state of the art substantially, on the Camelyon17-wilds challenge dataset (+2%) and on a second epithelium-stroma dataset (+26%). Furthermore, we emphasize our method’s ability to readily scale with increasingly available unlabeled data samples and more complex, higher parametric architectures. Source code is available at this https URL .

[LG-118] Multi-Attention Integrated Deep Learning Frameworks for Enhanced Breast Cancer Segmentation and Identification

链接: https://arxiv.org/abs/2407.02844
作者: Pandiyaraju V,Shravan Venkatraman,Pavan Kumar S,Santhosh Malarvannan,Kannan A
关键词: claiming numerous lives, lives globally, Breast cancer poses, numerous lives, claiming numerous
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注: 29 pages, 15 figures, 6 tables

点击查看摘要

Abstract:Breast cancer poses a profound threat to lives globally, claiming numerous lives each year. Therefore, timely detection is crucial for early intervention and improved chances of survival. Accurately diagnosing and classifying breast tumors using ultrasound images is a persistent challenge in medicine, demanding cutting-edge solutions for improved treatment strategies. This research introduces multiattention-enhanced deep learning (DL) frameworks designed for the classification and segmentation of breast cancer tumors from ultrasound images. A spatial channel attention mechanism is proposed for segmenting tumors from ultrasound images, utilizing a novel LinkNet DL framework with an InceptionResNet backbone. Following this, the paper proposes a deep convolutional neural network with an integrated multi-attention framework (DCNNIMAF) to classify the segmented tumor as benign, malignant, or normal. From experimental results, it is observed that the segmentation model has recorded an accuracy of 98.1%, with a minimal loss of 0.6%. It has also achieved high Intersection over Union (IoU) and Dice Coefficient scores of 96.9% and 97.2%, respectively. Similarly, the classification model has attained an accuracy of 99.2%, with a low loss of 0.31%. Furthermore, the classification framework has achieved outstanding F1-Score, precision, and recall values of 99.1%, 99.3%, and 99.1%, respectively. By offering a robust framework for early detection and accurate classification of breast cancer, this proposed work significantly advances the field of medical image analysis, potentially improving diagnostic precision and patient outcomes.

[LG-119] Development of Machine Learning Classifiers for Blood-based Diagnosis and Prognosis of Suspected Acute Infections and Sepsis

链接: https://arxiv.org/abs/2407.02737
作者: Ljubomir Buturovic,Michael Mayhew,Roland Luethy,Kirindi Choi,Uros Midic,Nandita Damaraju,Yehudit Hasin-Brumshtein,Amitesh Pratap,Rhys M. Adams,Joao Fonseca,Ambika Srinath,Paul Fleming,Claudia Pereira,Oliver Liesenfeld,Purvesh Khatri,Timothy Sweeney
关键词: applied machine learning, emergency departments, unmet medical, rapid and accurate, sepsis in emergency
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
*备注: 16 pages, 6 figures

点击查看摘要

Abstract:We applied machine learning to the unmet medical need of rapid and accurate diagnosis and prognosis of acute infections and sepsis in emergency departments. Our solution consists of a Myrna ™ Instrument and embedded TriVerity ™ classifiers. The instrument measures abundances of 29 messenger RNAs in patient’s blood, subsequently used as features for machine learning. The classifiers convert the input features to an intuitive test report comprising the separate likelihoods of (1) a bacterial infection (2) a viral infection, and (3) severity (need for Intensive Care Unit-level care). In internal validation, the system achieved AUROC = 0.83 on the three-class disease diagnosis (bacterial, viral, or non-infected) and AUROC = 0.77 on binary prognosis of disease severity. The Myrna, TriVerity system was granted breakthrough device designation by the United States Food and Drug Administration (FDA). This engineering manuscript teaches the standard and novel machine learning methods used to translate an academic research concept to a clinical product aimed at improving patient care, and discusses lessons learned.

[LG-120] UAV-assisted Distributed Learning for Environmental Monitoring in Rural Environments

链接: https://arxiv.org/abs/2407.02693
作者: Vukan Ninkovic,Dejan Vukobratovic,Dragisa Miskovic
关键词: data privacy preservation, Distributed learning, offering benefits, workload alleviation, data privacy
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Distributed learning and inference algorithms have become indispensable for IoT systems, offering benefits such as workload alleviation, data privacy preservation, and reduced latency. This paper introduces an innovative approach that utilizes unmanned aerial vehicles (UAVs) as a coverage extension relay for IoT environmental monitoring in rural areas. Our method integrates a split learning (SL) strategy between edge devices, a UAV and a server to enhance adaptability and performance of inference mechanisms. By employing UAVs as a relay and by incorporating SL, we address connectivity and resource constraints for applications of learning in IoT in remote settings. Our system model accounts for diverse channel conditions to determine the most suitable transmission strategy for optimal system behaviour. Through simulation analysis, the proposed approach demonstrates its robustness and adaptability, even excelling under adverse channel conditions. Integrating UAV relaying and the SL paradigm offers significant flexibility to the server, enabling adaptive strategies that consider various trade-offs beyond simply minimizing overall inference quality.

[LG-121] Joint Segmentation and Image Reconstruction with Error Prediction in Photoacoustic Imaging using Deep Learning

链接: https://arxiv.org/abs/2407.02653
作者: Ruibo Shang,Geoffrey P. Luke,Matthew O’Donnell
关键词: Deep learning, improve photoacoustic, Deep, image reconstruction, predictions
类目: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
*备注: 31 pages, 8 figures

点击查看摘要

Abstract:Deep learning has been used to improve photoacoustic (PA) image reconstruction. One major challenge is that errors cannot be quantified to validate predictions when ground truth is unknown. Validation is key to quantitative applications, especially using limited-bandwidth ultrasonic linear detector arrays. Here, we propose a hybrid Bayesian convolutional neural network (Hybrid-BCNN) to jointly predict PA image and segmentation with error (uncertainty) predictions. Each output pixel represents a probability distribution where error can be quantified. The Hybrid-BCNN was trained with simulated PA data and applied to both simulations and experiments. Due to the sparsity of PA images, segmentation focuses Hybrid-BCNN on minimizing the loss function in regions with PA signals for better predictions. The results show that accurate PA segmentations and images are obtained, and error predictions are highly statistically correlated to actual errors. To leverage error predictions, confidence processing created PA images above a specific confidence level.

[LG-122] Lung-CADex: Fully automatic Zero-Shot Detection and Classification of Lung Nodules in Thoracic CT Images

链接: https://arxiv.org/abs/2407.02625
作者: Furqan Shaukat,Syed Muhammad Anwar,Abhijeet Parida,Van Khanh Lam,Marius George Linguraru,Mubarak Shah
关键词: life for decades, major threats, threats to human, human life, Visual Language models
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Lung cancer has been one of the major threats to human life for decades. Computer-aided diagnosis can help with early lung nodul detection and facilitate subsequent nodule characterization. Large Visual Language models (VLMs) have been found effective for multiple downstream medical tasks that rely on both imaging and text data. However, lesion level detection and subsequent diagnosis using VLMs have not been explored yet. We propose CADe, for segmenting lung nodules in a zero-shot manner using a variant of the Segment Anything Model called MedSAM. CADe trains on a prompt suite on input computed tomography (CT) scans by using the CLIP text encoder through prefix tuning. We also propose, CADx, a method for the nodule characterization as benign/malignant by making a gallery of radiomic features and aligning image-feature pairs through contrastive learning. Training and validation of CADe and CADx have been done using one of the largest publicly available datasets, called LIDC. To check the generalization ability of the model, it is also evaluated on a challenging dataset, LUNGx. Our experimental results show that the proposed methods achieve a sensitivity of 0.86 compared to 0.76 that of other fully supervised methods.The source code, datasets and pre-processed data can be accessed using the link:

[LG-123] Product Geometries on Cholesky Manifolds with Applications to SPD Manifolds

链接: https://arxiv.org/abs/2407.02607
作者: Ziheng Chen,Yue Song,Xiao-Jun Wu,Nicu Sebe
关键词: Symmetric Positive Definite, positive diagonal elements, Positive Definite, Diagonal Power Euclidean, Symmetric Positive
类目: Differential Geometry (math.DG); Machine Learning (cs.LG); Metric Geometry (math.MG)
*备注: 25 pages, 1 figures

点击查看摘要

Abstract:This paper presents two new metrics on the Symmetric Positive Definite (SPD) manifold via the Cholesky manifold, i.e., the space of lower triangular matrices with positive diagonal elements. We first unveil that the existing popular Riemannian metric on the Cholesky manifold can be generally characterized as the product metric of a Euclidean metric and a Riemannian metric on the space of n-dimensional positive vectors. Based on this analysis, we propose two novel metrics on the Cholesky manifolds, i.e., Diagonal Power Euclidean Metric and Diagonal Generalized Bures-Wasserstein Metric, which are numerically stabler than the existing Cholesky metric. We also discuss the gyro structures and deformed metrics associated with our metrics. The gyro structures connect the linear and geometric properties, while the deformed metrics interpolate between our proposed metrics and the existing metric. Further, by Cholesky decomposition, the proposed deformed metrics and gyro structures are pulled back to SPD manifolds. Compared with existing Riemannian metrics on SPD manifolds, our metrics are easy to use, computationally efficient, and numerically stable.

[LG-124] Analytical Solution of a Three-layer Network with a Matrix Exponential Activation Function

链接: https://arxiv.org/abs/2407.02540
作者: Kuo Gai,Shihua Zhang
关键词: deeper networks tend, understood theoretically, Machine Learning, powerful than shallow, deeper networks
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 8 pages,1 figure

点击查看摘要

Abstract:In practice, deeper networks tend to be more powerful than shallow ones, but this has not been understood theoretically. In this paper, we find the analytical solution of a three-layer network with a matrix exponential activation function, i.e., f(X)=W_3\exp(W_2\exp(W_1X)), X\in \mathbbC^d\times d have analytical solutions for the equations Y_1=f(X_1),Y_2=f(X_2) for X_1,X_2,Y_1,Y_2 with only invertible assumptions. Our proof shows the power of depth and the use of a non-linear activation function, since one layer network can only solve one equation,i.e., Y=WX . Comments: 8 pages,1 figure Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) Cite as: arXiv:2407.02540 [stat.ML] (or arXiv:2407.02540v1 [stat.ML] for this version) https://doi.org/10.48550/arXiv.2407.02540 Focus to learn more arXiv-issued DOI via DataCite

[LG-125] CGRclust: Chaos Game Representation for Twin Contrastive Clustering of Unlabelled DNA Sequences

链接: https://arxiv.org/abs/2407.02538
作者: Fatemeh Alipour,Kathleen A. Hill,Lila Kari
关键词: Chaos Game Representations, Game Representations, Chaos Game, convolutional neural networks, DNA sequences
类目: Genomics (q-bio.GN); Machine Learning (cs.LG)
*备注: 29 pages, 4 figures

点击查看摘要

Abstract:This study proposes CGRclust, a novel combination of unsupervised twin contrastive clustering of Chaos Game Representations (CGR) of DNA sequences, with convolutional neural networks (CNNs). To the best of our knowledge, CGRclust is the first method to use unsupervised learning for image classification (herein applied to two-dimensional CGR images) for clustering datasets of DNA sequences. CGRclust overcomes the limitations of traditional sequence classification methods by leveraging unsupervised twin contrastive learning to detect distinctive sequence patterns, without requiring DNA sequence alignment or biological/taxonomic labels. CGRclust accurately clustered twenty-five diverse datasets, with sequence lengths ranging from 664 bp to 100 kbp, including mitochondrial genomes of fish, fungi, and protists, as well as viral whole genome assemblies and synthetic DNA sequences. Compared with three recent clustering methods for DNA sequences (DeLUCS, iDeLUCS, and MeShClust v3.0.), CGRclust is the only method that surpasses 81.70% accuracy across all four taxonomic levels tested for mitochondrial DNA genomes of fish. Moreover, CGRclust also consistently demonstrates superior performance across all the viral genomic datasets. The high clustering accuracy of CGRclust on these twenty-five datasets, which vary significantly in terms of sequence length, number of genomes, number of clusters, and level of taxonomy, demonstrates its robustness, scalability, and versatility.

[LG-126] Livestock feeding behaviour: A review on automated systems for ruminant monitoring

链接: https://arxiv.org/abs/2312.09259
作者: José Chelotti,Luciano Martinez-Rau,Mariano Ferrero,Leandro Vignolo,Julio Galli,Alejandra Planisich,H. Leonardo Rufiner,Leonardo Giovanini
关键词: Livestock feeding behaviour, feeding behaviour, influential research area, Livestock feeding, husbandry and agriculture
类目: ignal Processing (eess.SP); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS); Image and Video Processing (eess.IV)
*备注: Preprint submitted to the journal biosystems engineering

点击查看摘要

Abstract:Livestock feeding behaviour is an influential research area for those involved in animal husbandry and agriculture. In recent years, there has been a growing interest in automated systems for monitoring the behaviour of ruminants. Despite the developments accomplished in the last decade, there is still much to do and learn about the methods for measuring and analysing livestock feeding behaviour. Automated monitoring systems mainly use motion, acoustic, and image sensors to collect animal behavioural data. The performance evaluation of existing methods is a complex task and direct comparisons between studies are difficult. Several factors prevent a direct comparison, starting from the diversity of data and performance metrics used in the experiments. To the best of our knowledge, this work represents the first tutorial-style review on the analysis of the feeding behaviour of ruminants, emphasising the relationship between sensing methodologies, signal processing, and computational intelligence methods. It assesses the main sensing methodologies (i.e. based on movement, sound, images/videos, and pressure) and the main techniques to measure and analyse the signals associated with feeding behaviour, evaluating their use in different settings and situations. It also highlights the potentiality of automated monitoring systems to provide valuable information that improves our understanding of livestock feeding behaviour. The relevance of these systems is increasingly important due to their impact on production systems and research. Finally, the paper closes by discussing future challenges and opportunities in livestock feeding behaviour monitoring.

信息检索

[IR-0] CoIR: A Comprehensive Benchmark for Code Information Retrieval Models

链接: https://arxiv.org/abs/2407.02883
作者: Xiangyang Li,Kuicai Dong,Yi Quan Lee,Wei Xia,Yichun Yin,Hao Zhang,Yong Liu,Yasheng Wang,Ruiming Tang
关键词: predominantly handle queries, success of Information, textbf, code retrieval, Information Retrieval
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Despite the substantial success of Information Retrieval (IR) in various NLP tasks, most IR systems predominantly handle queries and corpora in natural language, neglecting the domain of code retrieval. Code retrieval is critically important yet remains under-explored, with existing methods and benchmarks inadequately representing the diversity of code in various domains and tasks. Addressing this gap, we present \textbf\name (\textbfCode \textbfInformation \textbfRetrieval Benchmark), a robust and comprehensive benchmark specifically designed to assess code retrieval capabilities. \name comprises \textbften meticulously curated code datasets, spanning \textbfeight distinctive retrieval tasks across \textbfseven diverse domains. We first discuss the construction of \name and its diverse dataset composition. Further, we evaluate nine widely used retrieval models using \name, uncovering significant difficulties in performing code retrieval tasks even with state-of-the-art systems. To facilitate easy adoption and integration within existing research workflows, \name has been developed as a user-friendly Python framework, readily installable via pip. It shares same data schema as other popular benchmarks like MTEB and BEIR, enabling seamless cross-benchmark evaluations. Through \name, we aim to invigorate research in the code retrieval domain, providing a versatile benchmarking tool that encourages further development and exploration of code retrieval systems\footnote\url this https URL.

[IR-1] CRUISE on Quantum Computing for Feature Selection in Recommender Systems

链接: https://arxiv.org/abs/2407.02839
作者: Jiayang Niu,Jie Li,Ke Deng,Yongli Ren
关键词: Recommender Systems, worthwhile research topic, Systems that classical, Quantum Computers, classical computers
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注: accepted by QuantumCLEF 2024

点击查看摘要

Abstract:Using Quantum Computers to solve problems in Recommender Systems that classical computers cannot address is a worthwhile research topic. In this paper, we use Quantum Annealers to address the feature selection problem in recommendation algorithms. This feature selection problem is a Quadratic Unconstrained Binary Optimization(QUBO) problem. By incorporating Counterfactual Analysis, we significantly improve the performance of the item-based KNN recommendation algorithm compared to using pure Mutual Information. Extensive experiments have demonstrated that the use of Counterfactual Analysis holds great promise for addressing such problems.

[IR-2] LANE: Logic Alignment of Non-tuning Large Language Models and Online Recommendation Systems for Explainable Reason Generation

链接: https://arxiv.org/abs/2407.02833
作者: Hongke Zhao,Songming Zheng,Likang Wu,Bowen Yu,Jing Wang
关键词: enhancing user trust, trust and satisfaction, crucial for enhancing, recommendation, LLM models
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The explainability of recommendation systems is crucial for enhancing user trust and satisfaction. Leveraging large language models (LLMs) offers new opportunities for comprehensive recommendation logic generation. However, in existing related studies, fine-tuning LLM models for recommendation tasks incurs high computational costs and alignment issues with existing systems, limiting the application potential of proven proprietary/closed-source LLM models, such as GPT-4. In this work, our proposed effective strategy LANE aligns LLMs with online recommendation systems without additional LLMs tuning, reducing costs and improving explainability. This innovative approach addresses key challenges in integrating language models with recommendation systems while fully utilizing the capabilities of powerful proprietary models. Specifically, our strategy operates through several key components: semantic embedding, user multi-preference extraction using zero-shot prompting, semantic alignment, and explainable recommendation generation using Chain of Thought (CoT) prompting. By embedding item titles instead of IDs and utilizing multi-head attention mechanisms, our approach aligns the semantic features of user preferences with those of candidate items, ensuring coherent and user-aligned recommendations. Sufficient experimental results including performance comparison, questionnaire voting, and visualization cases prove that our method can not only ensure recommendation performance, but also provide easy-to-understand and reasonable recommendation logic.

[IR-3] Learning Positional Attention for Sequential Recommendation

链接: https://arxiv.org/abs/2407.02793
作者: Fan Luo,Juan Zhang,Shenghui Xu
关键词: sequential recommendation tasks, achieved remarkable performance, networks have achieved, recommendation tasks, achieved remarkable
类目: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Self-attention-based networks have achieved remarkable performance in sequential recommendation tasks. A crucial component of these models is positional encoding. In this study, we delve into the learned positional embedding, demonstrating that it often captures the distance between tokens. Building on this insight, we introduce novel attention models that directly learn positional relations. Extensive experiments reveal that our proposed models, \textbfPARec and \textbfFPARec outperform previous self-attention-based approaches.Our code is available at the link for anonymous review: https://anonymous.4open.science/ r/FPARec-2C55/

[IR-4] Supporting Cross-language Cross-project Bug Localization Using Pre-trained Language Models

链接: https://arxiv.org/abs/2407.02732
作者: Mahinthan Chandramohan,Dai Quoc Nguyen,Padmanabhan Krishnan,Jovan Jancic
关键词: Automatically locating, challenge for developers, remains a significant, significant challenge, large codebase remains
类目: oftware Engineering (cs.SE); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Automatically locating a bug within a large codebase remains a significant challenge for developers. Existing techniques often struggle with generalizability and deployment due to their reliance on application-specific data and large model sizes. This paper proposes a novel pre-trained language model (PLM) based technique for bug localization that transcends project and language boundaries. Our approach leverages contrastive learning to enhance the representation of bug reports and source code. It then utilizes a novel ranking approach that combines commit messages and code segments. Additionally, we introduce a knowledge distillation technique that reduces model size for practical deployment without compromising performance. This paper presents several key benefits. By incorporating code segment and commit message analysis alongside traditional file-level examination, our technique achieves better bug localization accuracy. Furthermore, our model excels at generalizability - trained on code from various projects and languages, it can effectively identify bugs in unseen codebases. To address computational limitations, we propose a CPU-compatible solution. In essence, proposed work presents a highly effective, generalizable, and efficient bug localization technique with the potential to real-world deployment. Subjects: Software Engineering (cs.SE); Information Retrieval (cs.IR) Cite as: arXiv:2407.02732 [cs.SE] (or arXiv:2407.02732v1 [cs.SE] for this version) https://doi.org/10.48550/arXiv.2407.02732 Focus to learn more arXiv-issued DOI via DataCite

[IR-5] ECAT: A Entire space Continual and Adaptive Transfer Learning Framework for Cross-Domain Recommendation

链接: https://arxiv.org/abs/2407.02542
作者: Chaoqun Hou,Yuanhang Zhou,Yi Cao,Tong Liu
关键词: entire space, designed to meet, meet the diverse, diverse interests, industrial recommendation systems
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In industrial recommendation systems, there are several mini-apps designed to meet the diverse interests and needs of users. The sample space of them is merely a small subset of the entire space, making it challenging to train an efficient model. In recent years, there have been many excellent studies related to cross-domain recommendation aimed at mitigating the problem of data sparsity. However, few of them have simultaneously considered the adaptability of both sample and representation continual transfer setting to the target task. To overcome the above issue, we propose a Entire space Continual and Adaptive Transfer learning framework called ECAT which includes two core components: First, as for sample transfer, we propose a two-stage method that realizes a coarse-to-fine process. Specifically, we perform an initial selection through a graph-guided method, followed by a fine-grained selection using domain adaptation method. Second, we propose an adaptive knowledge distillation method for continually transferring the representations from a model that is well-trained on the entire space dataset. ECAT enables full utilization of the entire space samples and representations under the supervision of the target task, while avoiding negative migration. Comprehensive experiments on real-world industrial datasets from Taobao show that ECAT advances state-of-the-art performance on offline metrics, and brings +13.6% CVR and +8.6% orders for Baiyibutie, a famous mini-app of Taobao.

[IR-6] Reducing False Discoveries in Statistically-Significant Regional-Colocation Mining: A Summary of Results

链接: https://arxiv.org/abs/2407.02536
作者: Subhankar Ghosh,Jayant Gupta,Arun Sharma,Shuai An,Shashi Shekhar
关键词: spatial feature types, emph, feature types, feature instances, spatial feature
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR); General Economics (econ.GN); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:Given a set \emphS of spatial feature types, its feature instances, a study area, and a neighbor relationship, the goal is to find pairs a region ( r_g ), a subset \emphC of \emphS such that \emphC is a statistically significant regional-colocation pattern in r_g . This problem is important for applications in various domains including ecology, economics, and sociology. The problem is computationally challenging due to the exponential number of regional colocation patterns and candidate regions. Previously, we proposed a miner \citehttps://doi.org/10.1145/3557989.3566158 that finds statistically significant regional colocation patterns. However, the numerous simultaneous statistical inferences raise the risk of false discoveries (also known as the multiple comparisons problem) and carry a high computational cost. We propose a novel algorithm, namely, multiple comparisons regional colocation miner (MultComp-RCM) which uses a Bonferroni correction. Theoretical analysis, experimental evaluation, and case study results show that the proposed method reduces both the false discovery rate and computational cost.

[IR-7] Understanding Biases in ChatGPT-based Recommender Systems: Provider Fairness Temporal Stability and Recency

链接: https://arxiv.org/abs/2401.10545
作者: Yashar Deldjoo
关键词: provider fairness, fairness, ChatGPT-based recommender systems, thousand API calls, ICL
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper explores the biases in ChatGPT-based recommender systems, focusing on provider fairness (item-side fairness). Through extensive experiments and over a thousand API calls, we investigate the impact of prompt design strategies-including structure, system role, and intent-on evaluation metrics such as provider fairness, catalog coverage, temporal stability, and recency. The first experiment examines these strategies in classical top-K recommendations, while the second evaluates sequential in-context learning (ICL). In the first experiment, we assess seven distinct prompt scenarios on top-K recommendation accuracy and fairness. Accuracy-oriented prompts, like Simple and Chain-of-Thought (COT), outperform diversification prompts, which, despite enhancing temporal freshness, reduce accuracy by up to 50%. Embedding fairness into system roles, such as “act as a fair recommender,” proved more effective than fairness directives within prompts. Diversification prompts led to recommending newer movies, offering broader genre distribution compared to traditional collaborative filtering (CF) models. The second experiment explores sequential ICL, comparing zero-shot and few-shot ICL. Results indicate that including user demographic information in prompts affects model biases and stereotypes. However, ICL did not consistently improve item fairness and catalog coverage over zero-shot learning. Zero-shot learning achieved higher NDCG and coverage, while ICL-2 showed slight improvements in hit rate (HR) when age-group context was included. Our study provides insights into biases of RecLLMs, particularly in provider fairness and catalog coverage. By examining prompt design, learning strategies, and system roles, we highlight the potential and challenges of integrating LLMs into recommendation systems. Further details can be found at this https URL. Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG) Cite as: arXiv:2401.10545 [cs.IR] (or arXiv:2401.10545v3 [cs.IR] for this version) https://doi.org/10.48550/arXiv.2401.10545 Focus to learn more arXiv-issued DOI via DataCite

人工智能

[AI-0] Planetarium: A Rigorous Benchmark for Translating Text to Structured Planning Languages

链接: https://arxiv.org/abs/2407.03321
作者: Max Zuo,Francisco Piedrahita Velez,Xiaochen Li,Michael L. Littman,Stephen H. Bach
关键词: PDDL code, PDDL, generated PDDL code, language, natural language descriptions
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Many recent works have explored using language models for planning problems. One line of research focuses on translating natural language descriptions of planning tasks into structured planning languages, such as the planning domain definition language (PDDL). While this approach is promising, accurately measuring the quality of generated PDDL code continues to pose significant challenges. First, generated PDDL code is typically evaluated using planning validators that check whether the problem can be solved with a planner. This method is insufficient because a language model might generate valid PDDL code that does not align with the natural language description of the task. Second, existing evaluation sets often have natural language descriptions of the planning task that closely resemble the ground truth PDDL, reducing the challenge of the task. To bridge this gap, we introduce \benchmarkName, a benchmark designed to evaluate language models’ ability to generate PDDL code from natural language descriptions of planning tasks. We begin by creating a PDDL equivalence algorithm that rigorously evaluates the correctness of PDDL code generated by language models by flexibly comparing it against a ground truth PDDL. Then, we present a dataset of 132,037 text-to-PDDL pairs across 13 different tasks, with varying levels of difficulty. Finally, we evaluate several API-access and open-weight language models that reveal this task’s complexity. For example, 87.6% of the PDDL problem descriptions generated by GPT-4o are syntactically parseable, 82.2% are valid, solve-able problems, but only 35.1% are semantically correct, highlighting the need for a more rigorous benchmark for this problem.

[AI-1] Value-Penalized Auxiliary Control from Examples for Learning without Rewards or Demonstrations

链接: https://arxiv.org/abs/2407.03311
作者: Trevor Ablett,Bryan Chan,Jayce Haoran Wang,Jonathan Kelly
关键词: full expert-demonstration trajectories, hand-crafted reward functions, expert-demonstration trajectories, difficult to acquire, hand-crafted reward
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Submitted to the Conference on Robot Learning (CoRL’24), Munich, Germany, Nov. 6-9, 2024

点击查看摘要

Abstract:Learning from examples of success is an appealing approach to reinforcement learning that eliminates many of the disadvantages of using hand-crafted reward functions or full expert-demonstration trajectories, both of which can be difficult to acquire, biased, or suboptimal. However, learning from examples alone dramatically increases the exploration challenge, especially for complex tasks. This work introduces value-penalized auxiliary control from examples (VPACE); we significantly improve exploration in example-based control by adding scheduled auxiliary control and examples of auxiliary tasks. Furthermore, we identify a value-calibration problem, where policy value estimates can exceed their theoretical limits based on successful data. We resolve this problem, which is exacerbated by learning auxiliary tasks, through the addition of an above-success-level value penalty. Across three simulated and one real robotic manipulation environment, and 21 different main tasks, we show that our approach substantially improves learning efficiency. Videos, code, and datasets are available at this https URL.

[AI-2] DisCo-Diff: Enhancing Continuous Diffusion Models with Discrete Latents

链接: https://arxiv.org/abs/2407.03300
作者: Yilun Xu,Gabriele Corso,Tommi Jaakkola,Arash Vahdat,Karsten Kreis
关键词: Gaussian distribution, Diffusion models, discrete latents, Diffusion, Latent Variable Diffusion
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: project page: this https URL

点击查看摘要

Abstract:Diffusion models (DMs) have revolutionized generative learning. They utilize a diffusion process to encode data into a simple Gaussian distribution. However, encoding a complex, potentially multimodal data distribution into a single continuous Gaussian distribution arguably represents an unnecessarily challenging learning problem. We propose Discrete-Continuous Latent Variable Diffusion Models (DisCo-Diff) to simplify this task by introducing complementary discrete latent variables. We augment DMs with learnable discrete latents, inferred with an encoder, and train DM and encoder end-to-end. DisCo-Diff does not rely on pre-trained networks, making the framework universally applicable. The discrete latents significantly simplify learning the DM’s complex noise-to-data mapping by reducing the curvature of the DM’s generative ODE. An additional autoregressive transformer models the distribution of the discrete latents, a simple step because DisCo-Diff requires only few discrete variables with small codebooks. We validate DisCo-Diff on toy data, several image synthesis tasks as well as molecular docking, and find that introducing discrete latents consistently improves model performance. For example, DisCo-Diff achieves state-of-the-art FID scores on class-conditioned ImageNet-64/128 datasets with ODE sampler.

[AI-3] Improved Noise Schedule for Diffusion Training

链接: https://arxiv.org/abs/2407.03297
作者: Tiankai Hang,Shuyang Gu
关键词: generating visual signals, visual signals, facto choice, choice for generating, generating visual
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Diffusion models have emerged as the de facto choice for generating visual signals. However, training a single model to predict noise across various levels poses significant challenges, necessitating numerous iterations and incurring significant computational costs. Various approaches, such as loss weighting strategy design and architectural refinements, have been introduced to expedite convergence. In this study, we propose a novel approach to design the noise schedule for enhancing the training of diffusion models. Our key insight is that the importance sampling of the logarithm of the Signal-to-Noise ratio (logSNR), theoretically equivalent to a modified noise schedule, is particularly beneficial for training efficiency when increasing the sample frequency around \log \textSNR=0 . We empirically demonstrate the superiority of our noise schedule over the standard cosine schedule. Furthermore, we highlight the advantages of our noise schedule design on the ImageNet benchmark, showing that the designed schedule consistently benefits different prediction targets.

[AI-4] VCHAR:Variance-Driven Complex Human Activity Recognition framework with Generative Representation

链接: https://arxiv.org/abs/2407.03291
作者: Yuan Sun,Navid Salami Pargoo,Taqiya Ehsan,Zhao Zhang Jorge Ortiz
关键词: Complex human activity, human activity recognition, complex activity recognition, activity recognition, human activity
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Complex human activity recognition (CHAR) remains a pivotal challenge within ubiquitous computing, especially in the context of smart environments. Existing studies typically require meticulous labeling of both atomic and complex activities, a task that is labor-intensive and prone to errors due to the scarcity and inaccuracies of available datasets. Most prior research has focused on datasets that either precisely label atomic activities or, at minimum, their sequence approaches that are often impractical in real world this http URL response, we introduce VCHAR (Variance-Driven Complex Human Activity Recognition), a novel framework that treats the outputs of atomic activities as a distribution over specified intervals. Leveraging generative methodologies, VCHAR elucidates the reasoning behind complex activity classifications through video-based explanations, accessible to users without prior machine learning expertise. Our evaluation across three publicly available datasets demonstrates that VCHAR enhances the accuracy of complex activity recognition without necessitating precise temporal or sequential labeling of atomic activities. Furthermore, user studies confirm that VCHAR’s explanations are more intelligible compared to existing methods, facilitating a broader understanding of complex activity recognition among non-experts.

[AI-5] Bot: Learning to Knot a Tie from Visual Demonstration through a Real-to-Sim-to-Real Approach

链接: https://arxiv.org/abs/2407.03245
作者: Weikun Peng,Jun Lv,Yuwei Zeng,Haonan Chen,Siheng Zhao,Jichen Sun,Cewu Lu,Lin Shao
关键词: long-horizon manipulation actions, highly challenging due, tie high deformation, Hierarchical Feature Matching, manipulation actions
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
*备注: fix few typos

点击查看摘要

Abstract:The tie-knotting task is highly challenging due to the tie’s high deformation and long-horizon manipulation actions. This work presents TieBot, a Real-to-Sim-to-Real learning from visual demonstration system for the robots to learn to knot a tie. We introduce the Hierarchical Feature Matching approach to estimate a sequence of tie’s meshes from the demonstration video. With these estimated meshes used as subgoals, we first learn a teacher policy using privileged information. Then, we learn a student policy with point cloud observation by imitating teacher policy. Lastly, our pipeline learns a residual policy when the learned policy is applied to real-world execution, mitigating the Sim2Real gap. We demonstrate the effectiveness of TieBot in simulation and the real world. In the real-world experiment, a dual-arm robot successfully knots a tie, achieving 50% success rate among 10 trials. Videos can be found this https URL.

[AI-6] Improving Retrieval-augmented Text-to-SQL with AST-based Ranking and Schema Pruning

链接: https://arxiv.org/abs/2407.03227
作者: Zhili Shen,Pavlos Vougiouklis,Chenxin Diao,Kaustubh Vyas,Yuanyi Ji,Jeff Z. Pan
关键词: Large Language Models, perspective of Large, Large Language, abstract syntax trees, Large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB)
*备注:

点击查看摘要

Abstract:We focus on Text-to-SQL semantic parsing from the perspective of Large Language Models. Motivated by challenges related to the size of commercial database schemata and the deployability of business intelligence solutions, we propose an approach that dynamically retrieves input database information and uses abstract syntax trees to select few-shot examples for in-context learning. Furthermore, we investigate the extent to which an in-parallel semantic parser can be leveraged for generating \textitapproximated versions of the expected SQL queries, to support our retrieval. We take this approach to the extreme–we adapt a model consisting of less than 500 M parameters, to act as an extremely efficient approximator, enhancing it with the ability to process schemata in a parallelised manner. We apply our approach to monolingual and cross-lingual benchmarks for semantic parsing, showing improvements over state-of-the-art baselines. Comprehensive experiments highlight the contribution of modules involved in this retrieval-augmented generation setting, revealing interesting directions for future work. Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB) Cite as: arXiv:2407.03227 [cs.CL] (or arXiv:2407.03227v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2407.03227 Focus to learn more arXiv-issued DOI via DataCite

[AI-7] PPO-based Dynamic Control of Uncertain Floating Platforms in the Zero-G Environment

链接: https://arxiv.org/abs/2407.03224
作者: Mahya Ramezani,M. Amin Alandihallaj,Andreas M. Hein
关键词: Proximal Policy Optimization, floating platforms play, play a crucial, crucial role, role in scientific
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
*备注: Pre-print version submitted to 2024 International Conference on Robotics and Automation (ICRA)

点击查看摘要

Abstract:In the field of space exploration, floating platforms play a crucial role in scientific investigations and technological advancements. However, controlling these platforms in zero-gravity environments presents unique challenges, including uncertainties and disturbances. This paper introduces an innovative approach that combines Proximal Policy Optimization (PPO) with Model Predictive Control (MPC) in the zero-gravity laboratory (Zero-G Lab) at the University of Luxembourg. This approach leverages PPO’s reinforcement learning power and MPC’s precision to navigate the complex control dynamics of floating platforms. Unlike traditional control methods, this PPO-MPC approach learns from MPC predictions, adapting to unmodeled dynamics and disturbances, resulting in a resilient control framework tailored to the zero-gravity environment. Simulations and experiments in the Zero-G Lab validate this approach, showcasing the adaptability of the PPO agent. This research opens new possibilities for controlling floating platforms in zero-gravity settings, promising advancements in space exploration.

[AI-8] Learning Disentangled Representation in Object-Centric Models for Visual Dynamics Prediction via Transformers

链接: https://arxiv.org/abs/2407.03216
作者: Sanket Gandhi,Atul,Samanyu Mahajan,Vishal Sharma,Rushil Gupta,Arnab Kumar Mondal,Parag Singla
关键词: Recent work, bringing interpretability, learning disentangled representation, disentangled representation, Recent
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recent work has shown that object-centric representations can greatly help improve the accuracy of learning dynamics while also bringing interpretability. In this work, we take this idea one step further, ask the following question: “can learning disentangled representation further improve the accuracy of visual dynamics prediction in object-centric models?” While there has been some attempt to learn such disentangled representations for the case of static images \citepnsb, to the best of our knowledge, ours is the first work which tries to do this in a general setting for video, without making any specific assumptions about the kind of attributes that an object might have. The key building block of our architecture is the notion of a \em block, where several blocks together constitute an object. Each block is represented as a linear combination of a given number of learnable concept vectors, which is iteratively refined during the learning process. The blocks in our model are discovered in an unsupervised manner, by attending over object masks, in a style similar to discovery of slots \citepslot_attention, for learning a dense object-centric representation. We employ self-attention via transformers over the discovered blocks to predict the next state resulting in discovery of visual dynamics. We perform a series of experiments on several benchmark 2-D, and 3-D datasets demonstrating that our architecture (1) can discover semantically meaningful blocks (2) help improve accuracy of dynamics prediction compared to SOTA object-centric models (3) perform significantly better in OOD setting where the specific attribute combinations are not seen earlier during training. Our experiments highlight the importance discovery of disentangled representation for visual dynamics prediction.

[AI-9] Combining AI Control Systems and Human Decision Support via Robustness and Criticality

链接: https://arxiv.org/abs/2407.03210
作者: Walt Woods,Alexander Grushin,Simon Khan,Alvaro Velasquez
关键词: AI-enabled capabilities, real world, capabilities are reaching, reaching the requisite, requisite level
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
*备注: 19 pages, 12 figures

点击查看摘要

Abstract:AI-enabled capabilities are reaching the requisite level of maturity to be deployed in the real world, yet do not always make correct or safe decisions. One way of addressing these concerns is to leverage AI control systems alongside and in support of human decisions, relying on the AI control system in safe situations while calling on a human co-decider for critical situations. We extend a methodology for adversarial explanations (AE) to state-of-the-art reinforcement learning frameworks, including MuZero. Multiple improvements to the base agent architecture are proposed. We demonstrate how this technology has two applications: for intelligent decision tools and to enhance training / learning frameworks. In a decision support context, adversarial explanations help a user make the correct decision by highlighting those contextual factors that would need to change for a different AI-recommended decision. As another benefit of adversarial explanations, we show that the learned AI control system demonstrates robustness against adversarial tampering. Additionally, we supplement AE by introducing strategically similar autoencoders (SSAs) to help users identify and understand all salient factors being considered by the AI system. In a training / learning framework, this technology can improve both the AI’s decisions and explanations through human interaction. Finally, to identify when AI decisions would most benefit from human oversight, we tie this combined system to our prior art on statistically verified analyses of the criticality of decisions at any point in time.

[AI-10] heoremLlama: Transforming General-Purpose LLMs into Lean4 Experts

链接: https://arxiv.org/abs/2407.03203
作者: Ruida Wang,Jipeng Zhang,Yizhen Jia,Rui Pan,Shizhe Diao,Renjie Pi,Tong Zhang
关键词: Lean significantly impacts, significantly impacts mathematical, Lean significantly, Proving mathematical theorems, Large Language Models
类目: Formal Languages and Automata Theory (cs.FL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Proving mathematical theorems using computer-verifiable formal languages like Lean significantly impacts mathematical reasoning. One approach to formal theorem proving involves generating complete proofs using Large Language Models (LLMs) based on Natural Language (NL) proofs. Similar methods have shown promising results in code generation. However, most modern LLMs exhibit suboptimal performance due to the scarcity of aligned NL and Formal Language (FL) theorem-proving data. This scarcity results in a paucity of methodologies for training LLMs and techniques to fully utilize their capabilities in composing formal proofs. To address the challenges, this paper proposes TheoremLlama, an end-to-end framework to train a general-purpose LLM to become a Lean4 expert. This framework encompasses NL-FL aligned dataset generation methods, training approaches for the LLM formal theorem prover, and techniques for LLM Lean4 proof writing. Using the dataset generation method, we provide Open Bootstrapped Theorems (OBT), an NL-FL aligned and bootstrapped dataset. A key innovation in this framework is the NL-FL bootstrapping method, where NL proofs are integrated into Lean4 code for training datasets, leveraging the NL reasoning ability of LLMs for formal reasoning. The TheoremLlama framework achieves cumulative accuracies of 36.48% and 33.61% on MiniF2F-Valid and Test datasets respectively, surpassing the GPT-4 baseline of 22.95% and 25.41%. We have also open-sourced our model checkpoints and generated dataset, and will soon make all the code publicly available.

[AI-11] MuDiT MuSiT: Alignment with Colloquial Expression in Description-to-Song Generation

链接: https://arxiv.org/abs/2407.03188
作者: Zihao Wang,Haoxuan Liu,Jiaxing Yu,Tao Zhang,Yan Liu,Kejun Zhang
关键词: Amid the rising, human artistic processes, automatic song composition, human-centric automatic song, artistic processes
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
*备注: 19 pages, 5 figures

点击查看摘要

Abstract:Amid the rising intersection of generative AI and human artistic processes, this study probes the critical yet less-explored terrain of alignment in human-centric automatic song composition. We propose a novel task of Colloquial Description-to-Song Generation, which focuses on aligning the generated content with colloquial human expressions. This task is aimed at bridging the gap between colloquial language understanding and auditory expression within an AI model, with the ultimate goal of creating songs that accurately satisfy human auditory expectations and structurally align with musical norms. Current datasets are limited due to their narrow descriptive scope, semantic gaps and inaccuracies. To overcome data scarcity in this domain, we present the Caichong Music Dataset (CaiMD). CaiMD is manually annotated by both professional musicians and amateurs, offering diverse perspectives and a comprehensive understanding of colloquial descriptions. Unlike existing datasets pre-set with expert annotations or auto-generated ones with inherent biases, CaiMD caters more sufficiently to our purpose of aligning AI-generated music with widespread user-desired results. Moreover, we propose an innovative single-stage framework called MuDiT/MuSiT for enabling effective human-machine alignment in song creation. This framework not only achieves cross-modal comprehension between colloquial language and auditory music perceptions but also ensures generated songs align with user-desired results. MuDiT/MuSiT employs one DiT/SiT model for end-to-end generation of musical components like melody, harmony, rhythm, vocals, and instrumentation. The approach ensures harmonious sonic cohesiveness amongst all generated musical components, facilitating better resonance with human auditory expectations.

[AI-12] Multiple-Resolution Tokenization for Time Series Forecasting with an Application to Pricing

链接: https://arxiv.org/abs/2407.03185
作者: Egon Peršak,Miguel F. Anjos,Sebastian Lautz,Aleksandar Kolev
关键词: time series forecasting, time series tokenisation, real-world prediction problem, time series, pricing domain
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We propose a transformer architecture for time series forecasting with a focus on time series tokenisation and apply it to a real-world prediction problem from the pricing domain. Our architecture aims to learn effective representations at many scales across all available data simultaneously. The model contains a number of novel modules: a differentiated form of time series patching which employs multiple resolutions, a multiple-resolution module for time-varying known variables, a mixer-based module for capturing cross-series information, and a novel output head with favourable scaling to account for the increased number of tokens. We present an application of this model to a real world prediction problem faced by the markdown team at a very large retailer. On the experiments conducted our model outperforms in-house models and the selected existing deep learning architectures.

[AI-13] A Formal Model for Artificial Intelligence Applications in Automation Systems

链接: https://arxiv.org/abs/2407.03183
作者: Marvin Schieseck,Philip Topalis,Lasse Reinpold,Felix Gehlhoff,Alexander Fay
关键词: existing technical challenges, unsolved existing technical, automation systems, technical challenges, potential to enhance
类目: ystems and Control (eess.SY); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The integration of Artificial Intelligence (AI) into automation systems has the potential to enhance efficiency and to address currently unsolved existing technical challenges. However, the industry-wide adoption of AI is hindered by the lack of standardized documentation for the complex compositions of automation systems, AI software, production hardware, and their interdependencies. This paper proposes a formal model using standards and ontologies to provide clear and structured documentation of AI applications in automation systems. The proposed information model for artificial intelligence in automation systems (AIAS) utilizes ontology design patterns to map and link various aspects of automation systems and AI software. Validated through a practical example, the model demonstrates its effectiveness in improving documentation practices and aiding the sustainable implementation of AI in industrial settings.

[AI-14] A multi-objective combinatorial optimisation framework for large scale hierarchical population synthesis

链接: https://arxiv.org/abs/2407.03180
作者: Imran Mahmood,Nicholas Bishop,Anisoara Calinescu,Michael Wooldridge,Ioannis Zachos
关键词: agent-based simulations, agents are commonly, synthetic population, synthetic, population
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In agent-based simulations, synthetic populations of agents are commonly used to represent the structure, behaviour, and interactions of individuals. However, generating a synthetic population that accurately reflects real population statistics is a challenging task, particularly when performed at scale. In this paper, we propose a multi objective combinatorial optimisation technique for large scale population synthesis. We demonstrate the effectiveness of our approach by generating a synthetic population for selected regions and validating it on contingency tables from real population data. Our approach supports complex hierarchical structures between individuals and households, is scalable to large populations and achieves minimal contigency table reconstruction error. Hence, it provides a useful tool for policymakers and researchers for simulating the dynamics of complex populations.

[AI-15] Motion meets Attention: Video Motion Prompts

链接: https://arxiv.org/abs/2407.03179
作者: Qixiang Chen,Lei Wang,Piotr Koniusz,Tom Gedeon
关键词: rich spatio-temporal information, spatio-temporal information, rich spatio-temporal, motion, attention
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Research report

点击查看摘要

Abstract:Videos contain rich spatio-temporal information. Traditional methods for extracting motion, used in tasks such as action recognition, often rely on visual contents rather than precise motion features. This phenomenon is referred to as ‘blind motion extraction’ behavior, which proves inefficient in capturing motions of interest due to a lack of motion-guided cues. Recently, attention mechanisms have enhanced many computer vision tasks by effectively highlighting salient visual areas. Inspired by this, we propose using a modified Sigmoid function with learnable slope and shift parameters as an attention mechanism to activate and modulate motion signals derived from frame differencing maps. This approach generates a sequence of attention maps that enhance the processing of motion-related video content. To ensure temporally continuity and smoothness of the attention maps, we apply pair-wise temporal attention variation regularization to remove unwanted motions (e.g., noise) while preserving important ones. We then perform Hadamard product between each pair of attention maps and the original video frames to highlight the evolving motions of interest over time. These highlighted motions, termed video motion prompts, are subsequently used as inputs to the model instead of the original video frames. We formalize this process as a motion prompt layer and incorporate the regularization term into the loss function to learn better motion prompts. This layer serves as an adapter between the model and the video data, bridging the gap between traditional ‘blind motion extraction’ and the extraction of relevant motions of interest.

[AI-16] IMC 2024 Methods Solutions Review

链接: https://arxiv.org/abs/2407.03172
作者: Shyam Gupta,Dhanisha Sharma,Songling Huang
关键词: focuses on solving, image reconstruction problem, Image Matching Challenge, Kaggle, Image Matching
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Applications (stat.AP)
*备注: 8 Pages, 9 figures

点击查看摘要

Abstract:For the past three years, Kaggle has been hosting the Image Matching Challenge, which focuses on solving a 3D image reconstruction problem using a collection of 2D images. Each year, this competition fosters the development of innovative and effective methodologies by its participants. In this paper, we introduce an advanced ensemble technique that we developed, achieving a score of 0.153449 on the private leaderboard and securing the 160th position out of over 1,000 participants. Additionally, we conduct a comprehensive review of existing methods and techniques employed by top-performing teams in the competition. Our solution, alongside the insights gathered from other leading approaches, contributes to the ongoing advancement in the field of 3D image reconstruction. This research provides valuable knowledge for future participants and researchers aiming to excel in similar image matching and reconstruction challenges.

[AI-17] Let the Code LLM Edit Itself When You Edit the Code

链接: https://arxiv.org/abs/2407.03157
作者: Zhenyu He,Jun Zhang,Shengjie Luo,Jingjing Xu,Zhi Zhang,Di He
关键词: developer edits existing, edits existing code, large language model, investigate a typical, typical scenario
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注: Preprint. Work in Progress

点击查看摘要

Abstract:In this work, we investigate a typical scenario in code generation where a developer edits existing code in real time and requests a code assistant, e.g., a large language model, to re-predict the next token or next line on the fly. Naively, the LLM needs to re-encode the entire KV cache to provide an accurate prediction. However, this process is computationally expensive, especially when the sequence length is long. Simply encoding the edited subsequence and integrating it to the original KV cache meets the temporal confusion problem, leading to significantly worse performance. We address this efficiency and accuracy trade-off by introducing \underline\textbfPositional \textbfIntegrity \textbfEncoding (PIE). Building upon the rotary positional encoding, PIE first removes the rotary matrices in the Key cache that introduce temporal confusion and then reapplies the correct rotary matrices. This process ensures that positional relationships between tokens are correct and requires only a single round of matrix multiplication. We validate the effectiveness of PIE through extensive experiments on the RepoBench-C-8k dataset, utilizing DeepSeek-Coder models with 1.3B, 6.7B, and 33B parameters. Our evaluation includes three real-world coding tasks: code insertion, code deletion, and multi-place code editing. Results demonstrate that PIE reduces computational overhead by over 85% compared to the standard full recomputation approach across all model sizes and tasks while well approximating the model performance.

[AI-18] Reinforcement Learning for Sequence Design Leveraging Protein Language Models

链接: https://arxiv.org/abs/2407.03154
作者: Jithendaraa Subramanian,Shivakanth Sujit,Niloy Irtisam,Umong Sain,Derek Nowrouzezahrai,Samira Ebrahimi Kahou,Riashat Islam
关键词: amino acid sequences, Protein sequence design, protein engineering problems, determined by amino, drug discovery
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)
*备注: 22 pages, 7 figures, 4 tables

点击查看摘要

Abstract:Protein sequence design, determined by amino acid sequences, are essential to protein engineering problems in drug discovery. Prior approaches have resorted to evolutionary strategies or Monte-Carlo methods for protein design, but often fail to exploit the structure of the combinatorial search space, to generalize to unseen sequences. In the context of discrete black box optimization over large search spaces, learning a mutation policy to generate novel sequences with reinforcement learning is appealing. Recent advances in protein language models (PLMs) trained on large corpora of protein sequences offer a potential solution to this problem by scoring proteins according to their biological plausibility (such as the TM-score). In this work, we propose to use PLMs as a reward function to generate new sequences. Yet the PLM can be computationally expensive to query due to its large size. To this end, we propose an alternative paradigm where optimization can be performed on scores from a smaller proxy model that is periodically finetuned, jointly while learning the mutation policy. We perform extensive experiments on various sequence lengths to benchmark RL-based approaches, and provide comprehensive evaluations along biological plausibility and diversity of the protein. Our experimental results include favorable evaluations of the proposed sequences, along with high diversity scores, demonstrating that RL is a strong candidate for biological sequence design. Finally, we provide a modular open source implementation can be easily integrated in most RL training loops, with support for replacing the reward model with other PLMs, to spur further research in this domain. The code for all experiments is provided in the supplementary material.

[AI-19] Enhancing Class Fairness in Classification with A Two-Player Game Approach

链接: https://arxiv.org/abs/2407.03146
作者: Yunpeng Jiang,Paul Weng,Yutong Ban
关键词: machine learning tasks, Data augmentation, widely applied, shown its benefits, machine learning
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Data augmentation is widely applied and has shown its benefits in different machine learning tasks. However, as recently observed in some downstream tasks, data augmentation may introduce an unfair impact on classifications. While it can improve the performance of some classes, it can actually be detrimental for other classes, which can be problematic in some application domains. In this paper, to counteract this phenomenon, we propose a FAir Classification approach with a Two-player game (FACT). We first formulate the training of a classifier with data augmentation as a fair optimization problem, which can be further written as an adversarial two-player game. Following this formulation, we propose a novel multiplicative weight optimization algorithm, for which we theoretically prove that it can converge to a solution that is fair over classes. Interestingly, our formulation also reveals that this fairness issue over classes is not due to data augmentation only, but is in fact a general phenomenon. Our empirical experiments demonstrate that the performance of our learned classifiers is indeed more fairly distributed over classes in five datasets, with only limited impact on the average accuracy.

[AI-20] GMM-ResNext: Combining Generative and Discriminative Models for Speaker Verification

链接: https://arxiv.org/abs/2407.03135
作者: Hui Yan,Zhenchun Lei,Changhong Liu,Yong Zhou
关键词: deep learning architecture, deep learning, network architectures, deep learning models, network architectures rely
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:With the development of deep learning, many different network architectures have been explored in speaker verification. However, most network architectures rely on a single deep learning architecture, and hybrid networks combining different architectures have been little studied in ASV tasks. In this paper, we propose the GMM-ResNext model for speaker verification. Conventional GMM does not consider the score distribution of each frame feature over all Gaussian components and ignores the relationship between neighboring speech frames. So, we extract the log Gaussian probability features based on the raw acoustic features and use ResNext-based network as the backbone to extract the speaker embedding. GMM-ResNext combines Generative and Discriminative Models to improve the generalization ability of deep learning models and allows one to more easily specify meaningful priors on model parameters. A two-path GMM-ResNext model based on two gender-related GMMs has also been proposed. The Experimental results show that the proposed GMM-ResNext achieves relative improvements of 48.1% and 11.3% in EER compared with ResNet34 and ECAPA-TDNN on VoxCeleb1-O test set.

[AI-21] Quantifying the Cross-sectoral Intersecting Discrepancies within Multiple Groups Using Latent Class Analysis Towards Fairness

链接: https://arxiv.org/abs/2407.03133
作者: Yingfang Yuan,Kefan Chen,Mehdi Rizvi,Lynne Baillie,Wei Pang
关键词: growing interest, interest in fair, cross-sectoral intersecting discrepancies, discrepancies, intersecting discrepancies
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:The growing interest in fair AI development is evident. The ‘‘Leave No One Behind’’ initiative urges us to address multiple and intersecting forms of inequality in accessing services, resources, and opportunities, emphasising the significance of fairness in AI. This is particularly relevant as an increasing number of AI tools are applied to decision-making processes, such as resource allocation and service scheme development, across various sectors such as health, energy, and housing. Therefore, exploring joint inequalities in these sectors is significant and valuable for thoroughly understanding overall inequality and unfairness. This research introduces an innovative approach to quantify cross-sectoral intersecting discrepancies among user-defined groups using latent class analysis. These discrepancies can be used to approximate inequality and provide valuable insights to fairness issues. We validate our approach using both proprietary and public datasets, including EVENS and Census 2021 (England \ Wales) datasets, to examine cross-sectoral intersecting discrepancies among different ethnic groups. We also verify the reliability of the quantified discrepancy by conducting a correlation analysis with a government public metric. Our findings reveal significant discrepancies between minority ethnic groups, highlighting the need for targeted interventions in real-world AI applications. Additionally, we demonstrate how the proposed approach can be used to provide insights into the fairness of machine learning.

[AI-22] Speaker- and Text-Independent Estimation of Articulatory Movements and Phoneme Alignments from Speech

链接: https://arxiv.org/abs/2407.03132
作者: Tobias Weise,Philipp Klumpp,Kubilay Can Demir,Paula Andrea Pérez-Toro,Maria Schuster,Elmar Noeth,Bjoern Heismann,Andreas Maier,Seung Hee Yang
关键词: previously treated separately, motion estimation, previously treated, treated separately, speech inversion
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: to be published in Interspeech 2024 proceedings

点击查看摘要

Abstract:This paper introduces a novel combination of two tasks, previously treated separately: acoustic-to-articulatory speech inversion (AAI) and phoneme-to-articulatory (PTA) motion estimation. We refer to this joint task as acoustic phoneme-to-articulatory speech inversion (APTAI) and explore two different approaches, both working speaker- and text-independently during inference. We use a multi-task learning setup, with the end-to-end goal of taking raw speech as input and estimating the corresponding articulatory movements, phoneme sequence, and phoneme alignment. While both proposed approaches share these same requirements, they differ in their way of achieving phoneme-related predictions: one is based on frame classification, the other on a two-staged training procedure and forced alignment. We reach competitive performance of 0.73 mean correlation for the AAI task and achieve up to approximately 87% frame overlap compared to a state-of-the-art text-dependent phoneme force aligner.

[AI-23] MVGT: A Multi-view Graph Transformer Based on Spatial Relations for EEG Emotion Recognition

链接: https://arxiv.org/abs/2407.03131
作者: Yanjie Cui,Xiaohong Liu,Jing Liang,Yamin Fu
关键词: medical imaging technique, scalp electrical activity, captures scalp electrical, medical imaging, imaging technique
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Electroencephalography (EEG), a medical imaging technique that captures scalp electrical activity of brain structures via electrodes, has been widely used in affective computing. The spatial domain of EEG is rich in affective information.However, few of the existing studies have simultaneously analyzed EEG signals from multiple perspectives of geometric and anatomical structures in spatial domain. In this paper, we propose a multi-view Graph Transformer (MVGT) based on spatial relations, which integrates information from the temporal, frequency and spatial domains, including geometric and anatomical structures, so as to enhance the expressive power of the model comprehensively.We incorporate the spatial information of EEG channels into the model as encoding, thereby improving its ability to perceive the spatial structure of the channels. Meanwhile, experimental results based on publicly available datasets demonstrate that our proposed model outperforms state-of-the-art methods in recent years. In addition, the results also show that the MVGT could extract information from multiple domains and capture inter-channel relationships in EEG emotion recognition tasks effectively.

[AI-24] Foundations and Frontiers of Graph Learning Theory

链接: https://arxiv.org/abs/2407.03125
作者: Yu Huang,Min Zhou,Menglin Yang,Zhen Wang,Muhan Zhang,Jie Wang,Hong Xie,Hao Wang,Defu Lian,Enhong Chen
关键词: Recent advancements, Graph Neural Networks, complex structures, neural network architectures, analyze data
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 36pages,273references

点击查看摘要

Abstract:Recent advancements in graph learning have revolutionized the way to understand and analyze data with complex structures. Notably, Graph Neural Networks (GNNs), i.e. neural network architectures designed for learning graph representations, have become a popular paradigm. With these models being usually characterized by intuition-driven design or highly intricate components, placing them within the theoretical analysis framework to distill the core concepts, helps understand the key principles that drive the functionality better and guide further development. Given this surge in interest, this article provides a comprehensive summary of the theoretical foundations and breakthroughs concerning the approximation and learning behaviors intrinsic to prevalent graph learning models. Encompassing discussions on fundamental aspects such as expressiveness power, generalization, optimization, and unique phenomena such as over-smoothing and over-squashing, this piece delves into the theoretical foundations and frontier driving the evolution of graph learning. In addition, this article also presents several challenges and further initiates discussions on possible solutions.

[AI-25] Compressed Latent Replays for Lightweight Continual Learning on Spiking Neural Networks

链接: https://arxiv.org/abs/2407.03111
作者: Alberto Dequino,Alessio Carpegna,Davide Nadalini,Alessandro Savino,Luca Benini,Stefano Di Carlo,Francesco Conti
关键词: Rehearsal-based Continual Learning, Deep Neural Networks, Spiking Neural Networks, Rehearsal-based Continual, Continual Learning
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Rehearsal-based Continual Learning (CL) has been intensely investigated in Deep Neural Networks (DNNs). However, its application in Spiking Neural Networks (SNNs) has not been explored in depth. In this paper we introduce the first memory-efficient implementation of Latent Replay (LR)-based CL for SNNs, designed to seamlessly integrate with resource-constrained devices. LRs combine new samples with latent representations of previously learned data, to mitigate forgetting. Experiments on the Heidelberg SHD dataset with Sample and Class-Incremental tasks reach a Top-1 accuracy of 92.5% and 92%, respectively, without forgetting the previously learned information. Furthermore, we minimize the LRs’ requirements by applying a time-domain compression, reducing by two orders of magnitude their memory requirement, with respect to a naive rehearsal setup, with a maximum accuracy drop of 4%. On a Multi-Class-Incremental task, our SNN learns 10 new classes from an initial set of 10, reaching a Top-1 accuracy of 78.4% on the full test set.

[AI-26] A Toolchain for Comprehensive Audio/Video Analysis Using Deep Learning Based Multimodal Approach (A use case of riot or violent context detection)

链接: https://arxiv.org/abs/2407.03110
作者: Lam Pham,Phat Lam,Tin Nguyen,Hieu Tang,Alexander Schindler
关键词: based multimodal approach, Acoustic Scene Classification, leveraging deep learning, deep learning based, learning based multimodal
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:In this paper, we present a toolchain for a comprehensive audio/video analysis by leveraging deep learning based multimodal approach. To this end, different specific tasks of Speech to Text (S2T), Acoustic Scene Classification (ASC), Acoustic Event Detection (AED), Visual Object Detection (VOD), Image Captioning (IC), and Video Captioning (VC) are conducted and integrated into the toolchain. By combining individual tasks and analyzing both audio \ visual data extracted from input video, the toolchain offers various audio/video-based applications: Two general applications of audio/video clustering, comprehensive audio/video summary and a specific application of riot or violent context detection. Furthermore, the toolchain presents a flexible and adaptable architecture that is effective to integrate new models for further audio/video-based applications.

[AI-27] How Reliable and Stable are Explanations of XAI Methods?

链接: https://arxiv.org/abs/2407.03108
作者: José Ribeiro,Lucas Cardoso,Vitor Santos,Eduardo Carvalho,Níkolas Carneiro,Ronnie Alves
关键词: Explainable Artificial Intelligence, Black box models, XAI Methods, living in society, Black box
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 15 pages, 6 figures, submitted to BRACIS 2024

点击查看摘要

Abstract:Black box models are increasingly being used in the daily lives of human beings living in society. Along with this increase, there has been the emergence of Explainable Artificial Intelligence (XAI) methods aimed at generating additional explanations regarding how the model makes certain predictions. In this sense, methods such as Dalex, Eli5, eXirt, Lofo and Shap emerged as different proposals and methodologies for generating explanations of black box models in an agnostic way. Along with the emergence of these methods, questions arise such as “How Reliable and Stable are XAI Methods?”. With the aim of shedding light on this main question, this research creates a pipeline that performs experiments using the diabetes dataset and four different machine learning models (LGBM, MLP, DT and KNN), creating different levels of perturbations of the test data and finally generates explanations from the eXirt method regarding the confidence of the models and also feature relevances ranks from all XAI methods mentioned, in order to measure their stability in the face of perturbations. As a result, it was found that eXirt was able to identify the most reliable models among all those used. It was also found that current XAI methods are sensitive to perturbations, with the exception of one specific method.

[AI-28] Conformal Prediction for Causal Effects of Continuous Treatments

链接: https://arxiv.org/abs/2407.03094
作者: Maresa Schröder,Dennis Frauen,Jonas Schweisthal,Konstantin Heß,Valentyn Melnychuk,Stefan Feuerriegel
关键词: conformal prediction, personalized medicine, prediction, crucial for safety-critical, safety-critical applications
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Uncertainty quantification of causal effects is crucial for safety-critical applications such as personalized medicine. A powerful approach for this is conformal prediction, which has several practical benefits due to model-agnostic finite-sample guarantees. Yet, existing methods for conformal prediction of causal effects are limited to binary/discrete treatments and make highly restrictive assumptions such as known propensity scores. In this work, we provide a novel conformal prediction method for potential outcomes of continuous treatments. We account for the additional uncertainty introduced through propensity estimation so that our conformal prediction intervals are valid even if the propensity score is unknown. Our contributions are three-fold: (1) We derive finite-sample prediction intervals for potential outcomes of continuous treatments. (2) We provide an algorithm for calculating the derived intervals. (3) We demonstrate the effectiveness of the conformal prediction intervals in experiments on synthetic and real-world datasets. To the best of our knowledge, we are the first to propose conformal prediction for continuous treatments when the propensity score is unknown and must be estimated from data.

[AI-29] Revisiting the Performance of Deep Learning-Based Vulnerability Detection on Realistic Datasets

链接: https://arxiv.org/abs/2407.03093
作者: Partha Chakraborty,Krishna Kanth Arumugam,Mahmoud Alfadel,Meiyappan Nagappan,Shane McIntosh
关键词: everyday software systems, software vulnerabilities, everyday software, software systems, vulnerabilities on everyday
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The impact of software vulnerabilities on everyday software systems is significant. Despite deep learning models being proposed for vulnerability detection, their reliability is questionable. Prior evaluations show high recall/F1 scores of up to 99%, but these models underperform in practical scenarios, particularly when assessed on entire codebases rather than just the fixing commit. This paper introduces Real-Vul, a comprehensive dataset representing real-world scenarios for evaluating vulnerability detection models. Evaluating DeepWukong, LineVul, ReVeal, and IVDetect shows a significant drop in performance, with precision decreasing by up to 95 percentage points and F1 scores by up to 91 points. Furthermore, Model performance fluctuates based on vulnerability characteristics, with better F1 scores for information leaks or code injection than for path resolution or predictable return values. The results highlight a significant performance gap that needs addressing before deploying deep learning-based vulnerability detection in practical settings. Overfitting is identified as a key issue, and an augmentation technique is proposed, potentially improving performance by up to 30%. Contributions include a dataset creation approach for better model evaluation, Real-Vul dataset, and empirical evidence of deep learning models struggling in real-world settings.

[AI-30] Effective Heterogeneous Federated Learning via Efficient Hypernetwork-based Weight Generation

链接: https://arxiv.org/abs/2407.03086
作者: Yujin Shin,Kichang Lee,Sungmin Lee,You Rim Choi,Hyung-Sin Kim,JeongGil Ko
关键词: faces challenges due, learning leverages distributed, leverages distributed client, leverages distributed, faces challenges
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:While federated learning leverages distributed client resources, it faces challenges due to heterogeneous client capabilities. This necessitates allocating models suited to clients’ resources and careful parameter aggregation to accommodate this heterogeneity. We propose HypeMeFed, a novel federated learning framework for supporting client heterogeneity by combining a multi-exit network architecture with hypernetwork-based model weight generation. This approach aligns the feature spaces of heterogeneous model layers and resolves per-layer information disparity during weight aggregation. To practically realize HypeMeFed, we also propose a low-rank factorization approach to minimize computation and memory overhead associated with hypernetworks. Our evaluations on a real-world heterogeneous device testbed indicate that HypeMeFed enhances accuracy by 5.12% over FedAvg, reduces the hypernetwork memory requirements by 98.22%, and accelerates its operations by 1.86 times compared to a naive hypernetwork approach. These results demonstrate HypeMeFed’s effectiveness in leveraging and engaging heterogeneous clients for federated learning.

[AI-31] Artificial Inductive Bias for Synthetic Tabular Data Generation in Data-Scarce Scenarios

链接: https://arxiv.org/abs/2407.03080
作者: Patricia A. Apellániz,Ana Jiménez,Borja Arroyo Galende,Juan Parras,Santiago Zazo
关键词: Deep Generative Models, generation using Deep, substantial training data, synthetic tabular data, Deep Generative
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 19 pages, 6 Figures

点击查看摘要

Abstract:While synthetic tabular data generation using Deep Generative Models (DGMs) offers a compelling solution to data scarcity and privacy concerns, their effectiveness relies on substantial training data, often unavailable in real-world applications. This paper addresses this challenge by proposing a novel methodology for generating realistic and reliable synthetic tabular data with DGMs in limited real-data environments. Our approach proposes several ways to generate an artificial inductive bias in a DGM through transfer learning and meta-learning techniques. We explore and compare four different methods within this framework, demonstrating that transfer learning strategies like pre-training and model averaging outperform meta-learning approaches, like Model-Agnostic Meta-Learning, and Domain Randomized Search. We validate our approach using two state-of-the-art DGMs, namely, a Variational Autoencoder and a Generative Adversarial Network, to show that our artificial inductive bias fuels superior synthetic data quality, as measured by Jensen-Shannon divergence, achieving relative gains of up to 50% when using our proposed approach. This methodology has broad applicability in various DGMs and machine learning tasks, particularly in areas like healthcare and finance, where data scarcity is often a critical issue.

[AI-32] Federated Learning for Zero-Day Attack Detection in 5G and Beyond V2X Networks

链接: https://arxiv.org/abs/2407.03070
作者: Abdelaziz Amara korba,Abdelwahab Boualouache,Bouziane Brik,Rabah Rahal,Yacine Ghamri-Doudane,Sidi Mohammed Senouci
关键词: Deploying Connected, Automated Vehicles, Connected and Automated, makes them vulnerable, vulnerable to increasing
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Deploying Connected and Automated Vehicles (CAVs) on top of 5G and Beyond networks (5GB) makes them vulnerable to increasing vectors of security and privacy attacks. In this context, a wide range of advanced machine/deep learning based solutions have been designed to accurately detect security attacks. Specifically, supervised learning techniques have been widely applied to train attack detection models. However, the main limitation of such solutions is their inability to detect attacks different from those seen during the training phase, or new attacks, also called zero-day attacks. Moreover, training the detection model requires significant data collection and labeling, which increases the communication overhead, and raises privacy concerns. To address the aforementioned limits, we propose in this paper a novel detection mechanism that leverages the ability of the deep auto-encoder method to detect attacks relying only on the benign network traffic pattern. Using federated learning, the proposed intrusion detection system can be trained with large and diverse benign network traffic, while preserving the CAVs privacy, and minimizing the communication overhead. The in-depth experiment on a recent network traffic dataset shows that the proposed system achieved a high detection rate while minimizing the false positive rate, and the detection delay.

[AI-33] xApp Distillation: AI-based Conflict Mitigation in B5G O-RAN

链接: https://arxiv.org/abs/2407.03068
作者: Hakan Erdol,Xiaoyang Wang,Robert Piechocki,George Oikonomou,Arjun Parekh
关键词: decision-making algorithms created, machine learning-based, decision-making algorithms, industrial opportunities, advancements of machine
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
*备注: 5 Pages, 4 figures

点击查看摘要

Abstract:The advancements of machine learning-based (ML) decision-making algorithms created various research and industrial opportunities. One of these areas is ML-based near-real-time network management applications (xApps) in Open-Radio Access Network (O-RAN). Normally, xApps are designed solely for the desired objectives, and fine-tuned for deployment. However, telecommunication companies can employ multiple xApps and deploy them in overlapping areas. Consider the different design objectives of xApps, the deployment might cause conflicts. To prevent such conflicts, we proposed the xApp distillation method that distills knowledge from multiple xApps, then uses this knowledge to train a single model that has retained the capabilities of Previous xApps. Performance evaluations show that compared conflict mitigation schemes can cause up to six times more network outages than xApp distillation in some cases.

[AI-34] FairJob: A Real-World Dataset for Fairness in Online Systems

链接: https://arxiv.org/abs/2407.03059
作者: Mariia Vladimirova,Federico Pavone,Eustache Diemert
关键词: designed to foster, real-world scenarios, foster research, research in algorithmic, introduce a fairness-aware
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (stat.ML)
*备注: 24 pages, 15 figures

点击查看摘要

Abstract:We introduce a fairness-aware dataset for job recommendation in advertising, designed to foster research in algorithmic fairness within real-world scenarios. It was collected and prepared to comply with privacy standards and business confidentiality. An additional challenge is the lack of access to protected user attributes such as gender, for which we propose a solution to obtain a proxy estimate. Despite being anonymized and including a proxy for a sensitive attribute, our dataset preserves predictive power and maintains a realistic and challenging benchmark. This dataset addresses a significant gap in the availability of fairness-focused resources for high-impact domains like advertising – the actual impact being having access or not to precious employment opportunities, where balancing fairness and utility is a common industrial challenge. We also explore various stages in the advertising process where unfairness can occur and introduce a method to compute a fair utility metric for the job recommendations in online systems case from a biased dataset. Experimental evaluations of bias mitigation techniques on the released dataset demonstrate potential improvements in fairness and the associated trade-offs with utility.

[AI-35] Improving Zero-shot Generalization of Learned Prompts via Unsupervised Knowledge Distillation

链接: https://arxiv.org/abs/2407.03056
作者: Marco Mistretta,Alberto Baldrati,Marco Bertini,Andrew D. Bagdanov
关键词: limited data, Vision-Language Models, Prompt learning, unseen tasks, downstream tasks
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted for publication at ECCV24

点击查看摘要

Abstract:Vision-Language Models (VLMs) demonstrate remarkable zero-shot generalization to unseen tasks, but fall short of the performance of supervised methods in generalizing to downstream tasks with limited data. Prompt learning is emerging as a parameter-efficient method for adapting VLMs, but state-of-the-art approaches require annotated samples. In this paper we propose a novel approach to prompt learning based on unsupervised knowledge distillation from more powerful models. Our approach, which we call Knowledge Distillation Prompt Learning (KDPL), can be integrated into existing prompt learning techniques and eliminates the need for labeled examples during adaptation. Our experiments on more than ten standard benchmark datasets demonstrate that KDPL is very effective at improving generalization of learned prompts for zero-shot domain generalization, zero-shot cross-dataset generalization, and zero-shot base-to-novel class generalization problems. KDPL requires no ground-truth labels for adaptation, and moreover we show that even in the absence of any knowledge of training class names it can be used to effectively transfer knowledge. The code is publicly available at this https URL.

[AI-36] Enhancements for Real-Time Monte-Carlo Tree Search in General Video Game Playing

链接: https://arxiv.org/abs/2407.03049
作者: Dennis J.N.J. Soemers,Chiara F. Sironi,Torsten Schuster,Mark H.M. Winands
关键词: General Video Game, Artificial Intelligence, General Video, real-time video games, Video Game Playing
类目: Artificial Intelligence (cs.AI)
*备注: Green Open Access version of conference paper published in 2016

点击查看摘要

Abstract:General Video Game Playing (GVGP) is a field of Artificial Intelligence where agents play a variety of real-time video games that are unknown in advance. This limits the use of domain-specific heuristics. Monte-Carlo Tree Search (MCTS) is a search technique for game playing that does not rely on domain-specific knowledge. This paper discusses eight enhancements for MCTS in GVGP; Progressive History, N-Gram Selection Technique, Tree Reuse, Breadth-First Tree Initialization, Loss Avoidance, Novelty-Based Pruning, Knowledge-Based Evaluations, and Deterministic Game Detection. Some of these are known from existing literature, and are either extended or introduced in the context of GVGP, and some are novel enhancements for MCTS. Most enhancements are shown to provide statistically significant increases in win percentages when applied individually. When combined, they increase the average win percentage over sixty different games from 31.0% to 48.4% in comparison to a vanilla MCTS implementation, approaching a level that is competitive with the best agents of the GVG-AI competition in 2015.

[AI-37] Raw Text is All you Need: Knowledge-intensive Multi-turn Instruction Tuning for Large Language Model

链接: https://arxiv.org/abs/2407.03040
作者: Xia Hou,Qifeng Li,Jian Yang,Tongliang Li,Linzheng Chai,Xianjie Wu,Hangyuan Ji,Zhoujun Li,Jixuan Nie,Jingbo Dun,Wenfeng Song
关键词: effective technique aligns, Instruction tuning, large language models, human preference, effective technique
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 11 pages, 3 figures

点击查看摘要

Abstract:Instruction tuning as an effective technique aligns the outputs of large language models (LLMs) with human preference. But how to generate the seasonal multi-turn dialogues from raw documents for instruction tuning still requires further exploration. In this paper, we present a novel framework named R2S that leverages the CoD-Chain of Dialogue logic to guide large language models (LLMs) in generating knowledge-intensive multi-turn dialogues for instruction tuning. By integrating raw documents from both open-source datasets and domain-specific web-crawled documents into a benchmark K-BENCH, we cover diverse areas such as Wikipedia (English), Science (Chinese), and Artifacts (Chinese). Our approach first decides the logic flow of the current dialogue and then prompts LLMs to produce key phrases for sourcing relevant response content. This methodology enables the creation of the G I NSTRUCT instruction dataset, retaining raw document knowledge within dialoguestyle interactions. Utilizing this dataset, we fine-tune GLLM, a model designed to transform raw documents into structured multi-turn dialogues, thereby injecting comprehensive domain knowledge into the SFT model for enhanced instruction tuning. This work signifies a stride towards refining the adaptability and effectiveness of LLMs in processing and generating more accurate, contextually nuanced responses across various fields.

[AI-38] NLP Sampling: Combining MCMC and NLP Methods for Diverse Constrained Sampling

链接: https://arxiv.org/abs/2407.03035
作者: Marc Toussaint,Cornelius V. Braun,Joaquim Ortiz-Haro
关键词: Generating diverse samples, Generating diverse, diverse samples, samples under hard, hard constraints
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Generating diverse samples under hard constraints is a core challenge in many areas. With this work we aim to provide an integrative view and framework to combine methods from the fields of MCMC, constrained optimization, as well as robotics, and gain insights in their strengths from empirical evaluations. We propose NLP Sampling as a general problem formulation, propose a family of restarting two-phase methods as a framework to integrated methods from across the fields, and evaluate them on analytical and robotic manipulation planning problems. Complementary to this, we provide several conceptual discussions, e.g. on the role of Lagrange parameters, global sampling, and the idea of a Diffused NLP and a corresponding model-based denoising sampler.

[AI-39] Qifusion-Net: Layer-adapted Stream/Non-stream Model for End-to-End Multi-Accent Speech Recognition

链接: https://arxiv.org/abs/2407.03026
作者: Jinming Chen,Jingyi Fang,Yuanzhong Zheng,Yaoxuan Wang,Haojun Fei
关键词: achieved promising performance, promising performance, achieved promising, speech recognition, auto speech recognition
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注: accpeted by interspeech 2014, 5 pages, 1 figure

点击查看摘要

Abstract:Currently, end-to-end (E2E) speech recognition methods have achieved promising performance. However, auto speech recognition (ASR) models still face challenges in recognizing multi-accent speech accurately. We propose a layer-adapted fusion (LAF) model, called Qifusion-Net, which does not require any prior knowledge about the target accent. Based on dynamic chunk strategy, our approach enables streaming decoding and can extract frame-level acoustic feature, facilitating fine-grained information fusion. Experiment results demonstrate that our proposed methods outperform the baseline with relative reductions of 22.1 % and 17.2 % in character error rate (CER) across multi accent test datasets on KeSpeech and MagicData-RMAC.

[AI-40] An Organism Starts with a Single Pix-Cell: A Neural Cellular Diffusion for High-Resolution Image Synthesis

链接: https://arxiv.org/abs/2407.03018
作者: Marawan Elbatel,Konstantinos Kamnitsas,Xiaomeng Li
关键词: Generative Adversarial Networks, Generative modeling seeks, Generative modeling, enabling synthesis, Generative Cellular Automata
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
*备注: MICCAI 2024

点击查看摘要

Abstract:Generative modeling seeks to approximate the statistical properties of real data, enabling synthesis of new data that closely resembles the original distribution. Generative Adversarial Networks (GANs) and Denoising Diffusion Probabilistic Models (DDPMs) represent significant advancements in generative modeling, drawing inspiration from game theory and thermodynamics, respectively. Nevertheless, the exploration of generative modeling through the lens of biological evolution remains largely untapped. In this paper, we introduce a novel family of models termed Generative Cellular Automata (GeCA), inspired by the evolution of an organism from a single cell. GeCAs are evaluated as an effective augmentation tool for retinal disease classification across two imaging modalities: Fundus and Optical Coherence Tomography (OCT). In the context of OCT imaging, where data is scarce and the distribution of classes is inherently skewed, GeCA significantly boosts the performance of 11 different ophthalmological conditions, achieving a 12% increase in the average F1 score compared to conventional baselines. GeCAs outperform both diffusion methods that incorporate UNet or state-of-the art variants with transformer-based denoising models, under similar parameter constraints. Code is available at: this https URL.

[AI-41] What Affects the Stability of Tool Learning? An Empirical Study on the Robustness of Tool Learning Frameworks

链接: https://arxiv.org/abs/2407.03007
作者: Chengrui Huang,Zhengliang Shi,Yuntao Wen,Xiuying Chen,Peng Han,Shen Gao,Shuo Shang
关键词: large language models, Tool learning methods, Tool learning, methods have enhanced, enhanced the ability
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 19 pages, 9 figures

点击查看摘要

Abstract:Tool learning methods have enhanced the ability of large language models (LLMs) to interact with real-world applications. Many existing works fine-tune LLMs or design prompts to enable LLMs to select appropriate tools and correctly invoke them to meet user requirements. However, it is observed in previous works that the performance of tool learning varies from tasks, datasets, training settings, and algorithms. Without understanding the impact of these factors, it can lead to inconsistent results, inefficient model deployment, and suboptimal tool utilization, ultimately hindering the practical integration and scalability of LLMs in real-world scenarios. Therefore, in this paper, we explore the impact of both internal and external factors on the performance of tool learning frameworks. Through extensive experiments on two benchmark datasets, we find several insightful conclusions for future work, including the observation that LLMs can benefit significantly from increased trial and exploration. We believe our empirical study provides a new perspective for future tool learning research.

[AI-42] Human-like Linguistic Biases in Neural Speech Models: Phonetic Categorization and Phonotactic Constraints in Wav2Vec2.0

链接: https://arxiv.org/abs/2407.03005
作者: Marianne de Heer Kloots,Willem Zuidema
关键词: deep neural speech, models, deep neural, Abstract, speech
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注: Accepted to Interspeech 2024. For code and materials, see this https URL

点击查看摘要

Abstract:What do deep neural speech models know about phonology? Existing work has examined the encoding of individual linguistic units such as phonemes in these models. Here we investigate interactions between units. Inspired by classic experiments on human speech perception, we study how Wav2Vec2 resolves phonotactic constraints. We synthesize sounds on an acoustic continuum between /l/ and /r/ and embed them in controlled contexts where only /l/, only /r/, or neither occur in English. Like humans, Wav2Vec2 models show a bias towards the phonotactically admissable category in processing such ambiguous sounds. Using simple measures to analyze model internals on the level of individual stimuli, we find that this bias emerges in early layers of the model’s Transformer module. This effect is amplified by ASR finetuning but also present in fully self-supervised models. Our approach demonstrates how controlled stimulus designs can help localize specific linguistic knowledge in neural speech models.

[AI-43] SemioLLM: Assessing Large Language Models for Semiological Analysis in Epilepsy Research

链接: https://arxiv.org/abs/2407.03004
作者: Meghal Dani,Muthu Jeyanthi Prakash,Zeynep Akata,Stefanie Liebe
关键词: Large Language Models, Large Language, shown promising results, medical question-answering datasets, encode general medical
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large Language Models have shown promising results in their ability to encode general medical knowledge in standard medical question-answering datasets. However, their potential application in clinical practice requires evaluation in domain-specific tasks, where benchmarks are largely missing. In this study semioLLM, we test the ability of state-of-the-art LLMs (GPT-3.5, GPT-4, Mixtral 8x7B, and Qwen-72chat) to leverage their internal knowledge and reasoning for epilepsy diagnosis. Specifically, we obtain likelihood estimates linking unstructured text descriptions of seizures to seizure-generating brain regions, using an annotated clinical database containing 1269 entries. We evaluate the LLM’s performance, confidence, reasoning, and citation abilities in comparison to clinical evaluation. Models achieve above-chance classification performance with prompt engineering significantly improving their outcome, with some models achieving close-to-clinical performance and reasoning. However, our analyses also reveal significant pitfalls with several models being overly confident while showing poor performance, as well as exhibiting citation errors and hallucinations. In summary, our work provides the first extensive benchmark comparing current SOTA LLMs in the medical domain of epilepsy and highlights their ability to leverage unstructured texts from patients’ medical history to aid diagnostic processes in health care.

[AI-44] Are Large Language Models Consistent over Value-laden Questions?

链接: https://arxiv.org/abs/2407.02996
作者: Jared Moore,Tanvi Deshpande,Diyi Yang
关键词: bias their survey, Large language models, models, Large language, topics
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: 8 pages, 9 figures

点击查看摘要

Abstract:Large language models (LLMs) appear to bias their survey answers toward certain values. Nonetheless, some argue that LLMs are too inconsistent to simulate particular values. Are they? To answer, we first define value consistency as the similarity of answers across (1) paraphrases of one question, (2) related questions under one topic, (3) multiple-choice and open-ended use-cases of one question, and (4) multilingual translations of a question to English, Chinese, German, and Japanese. We apply these measures to a few large ( =34b ), open LLMs including llama-3, as well as gpt-4o, using eight thousand questions spanning more than 300 topics. Unlike prior work, we find that models are relatively consistent across paraphrases, use-cases, translations, and within a topic. Still, some inconsistencies remain. Models are more consistent on uncontroversial topics (e.g., in the U.S., “Thanksgiving”) than on controversial ones (“euthanasia”). Base models are both more consistent compared to fine-tuned models and are uniform in their consistency across topics, while fine-tuned models are more inconsistent about some topics (“euthanasia”) than others (“women’s rights”) like our human subjects (n=165).

[AI-45] MedPix 2.0: A Comprehensive Multimodal Biomedical Dataset for Advanced AI Applications

链接: https://arxiv.org/abs/2407.02994
作者: Irene Siragusa,Salvatore Contino,Massimo La Ciura,Rosario Alicata,Roberto Pirrone
关键词: developing Artificial Intelligence, Artificial Intelligence applications, Artificial Intelligence, developing Artificial, Intelligence applications
类目: Databases (cs.DB); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The increasing interest in developing Artificial Intelligence applications in the medical domain, suffers from the lack of high-quality dataset, mainly due to privacy-related issues. Moreover, the recent rising of Multimodal Large Language Models (MLLM) leads to a need for multimodal medical datasets, where clinical reports and findings are attached to the corresponding CT or MR scans. This paper illustrates the entire workflow for building the data set MedPix 2.0. Starting from the well-known multimodal dataset MedPix\textsuperscript\textregistered, mainly used by physicians, nurses and healthcare students for Continuing Medical Education purposes, a semi-automatic pipeline was developed to extract visual and textual data followed by a manual curing procedure where noisy samples were removed, thus creating a MongoDB database. Along with the dataset, we developed a GUI aimed at navigating efficiently the MongoDB instance, and obtaining the raw data that can be easily used for training and/or fine-tuning MLLMs. To enforce this point, we also propose a CLIP-based model trained on MedPix 2.0 for scan classification tasks.

[AI-46] LoRA-Guard: Parameter-Efficient Guardrail Adaptation for Content Moderation of Large Language Models

链接: https://arxiv.org/abs/2407.02987
作者: Hayder Elesedy,Pedro M. Esperança,Silviu Vlad Oprea,Mete Ozay
关键词: alternative to safety, safety alignment, large language models, Abstract, content moderation
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Guardrails have emerged as an alternative to safety alignment for content moderation of large language models (LLMs). Existing model-based guardrails have not been designed for resource-constrained computational portable devices, such as mobile phones, more and more of which are running LLM-based applications locally. We introduce LoRA-Guard, a parameter-efficient guardrail adaptation method that relies on knowledge sharing between LLMs and guardrail models. LoRA-Guard extracts language features from the LLMs and adapts them for the content moderation task using low-rank adapters, while a dual-path design prevents any performance degradation on the generative task. We show that LoRA-Guard outperforms existing approaches with 100-1000x lower parameter overhead while maintaining accuracy, enabling on-device content moderation.

[AI-47] Mast Kalandar at SemEval-2024 Task 8: On the Trail of Textual Origins: RoBERTa-BiLSTM Approach to Detect AI-Generated Text

链接: https://arxiv.org/abs/2407.02978
作者: Jainit Sushil Bafna,Hardik Mittal,Suyash Sethia,Manish Shrivastava,Radhika Mamidi
关键词: Large Language Models, Large Language, diverse user queries, showcased impressive abilities, generating fluent responses
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: SemEval-2024

点击查看摘要

Abstract:Large Language Models (LLMs) have showcased impressive abilities in generating fluent responses to diverse user queries. However, concerns regarding the potential misuse of such texts in journalism, educational, and academic contexts have surfaced. SemEval 2024 introduces the task of Multigenerator, Multidomain, and Multilingual Black-Box Machine-Generated Text Detection, aiming to develop automated systems for identifying machine-generated text and detecting potential misuse. In this paper, we i) propose a RoBERTa-BiLSTM based classifier designed to classify text into two categories: AI-generated or human ii) conduct a comparative study of our model with baseline approaches to evaluate its effectiveness. This paper contributes to the advancement of automatic text detection systems in addressing the challenges posed by machine-generated text misuse. Our architecture ranked 46th on the official leaderboard with an accuracy of 80.83 among 125.

[AI-48] Large Language Models as Evaluators for Scientific Synthesis

链接: https://arxiv.org/abs/2407.02977
作者: Julia Evans,Jennifer D’Souza,Sören Auer
关键词: Large Language Models, Large Language, Language Models, Mistral model ability, open-source Mistral model
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
*备注: 4 pages, forthcoming as part of the KONVENS 2024 proceedings this https URL

点击查看摘要

Abstract:Our study explores how well the state-of-the-art Large Language Models (LLMs), like GPT-4 and Mistral, can assess the quality of scientific summaries or, more fittingly, scientific syntheses, comparing their evaluations to those of human annotators. We used a dataset of 100 research questions and their syntheses made by GPT-4 from abstracts of five related papers, checked against human quality ratings. The study evaluates both the closed-source GPT-4 and the open-source Mistral model’s ability to rate these summaries and provide reasons for their judgments. Preliminary results show that LLMs can offer logical explanations that somewhat match the quality ratings, yet a deeper statistical analysis shows a weak correlation between LLM and human ratings, suggesting the potential and current limitations of LLMs in scientific synthesis evaluation.

[AI-49] Zero-X: A Blockchain-Enabled Open-Set Federated Learning Framework for Zero-Day Attack Detection in IoV

链接: https://arxiv.org/abs/2407.02969
作者: Abdelaziz Amara korba,Abdelwahab Boualouache,Yacine Ghamri-Doudane
关键词: Intelligent Transportation Systems, Intelligent Transportation, Transportation Systems, integrates vehicles, Internet of Vehicles
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The Internet of Vehicles (IoV) is a crucial technology for Intelligent Transportation Systems (ITS) that integrates vehicles with the Internet and other entities. The emergence of 5G and the forthcoming 6G networks presents an enormous potential to transform the IoV by enabling ultra-reliable, low-latency, and high-bandwidth communications. Nevertheless, as connectivity expands, cybersecurity threats have become a significant concern. The issue has been further exacerbated by the rising number of zero-day (0-day) attacks, which can exploit unknown vulnerabilities and bypass existing Intrusion Detection Systems (IDSs). In this paper, we propose Zero-X, an innovative security framework that effectively detects both 0-day and N-day attacks. The framework achieves this by combining deep neural networks with Open-Set Recognition (OSR). Our approach introduces a novel scheme that uses blockchain technology to facilitate trusted and decentralized federated learning (FL) of the ZeroX framework. This scheme also prioritizes privacy preservation, enabling both CAVs and Security Operation Centers (SOCs) to contribute their unique knowledge while protecting the privacy of their sensitive data. To the best of our knowledge, this is the first work to leverage OSR in combination with privacy-preserving FL to identify both 0-day and N-day attacks in the realm of IoV. The in-depth experiments on two recent network traffic datasets show that the proposed framework achieved a high detection rate while minimizing the false positive rate. Comparison with related work showed that the Zero-X framework outperforms existing solutions.

[AI-50] Unified Anomaly Detection methods on Edge Device using Knowledge Distillation and Quantization

链接: https://arxiv.org/abs/2407.02968
作者: Sushovan Jena,Arya Pulkit,Kajal Singh,Anoushka Banerjee,Sharad Joshi,Ananth Ganesh,Dinesh Singh,Arnav Bhavsar
关键词: visual inspection systems, fully integrated visual, integrated visual inspection, manufacturing in Industry, imperative for high-throughput
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Emerging Technologies (cs.ET)
*备注: 20 pages

点击查看摘要

Abstract:With the rapid advances in deep learning and smart manufacturing in Industry 4.0, there is an imperative for high-throughput, high-performance, and fully integrated visual inspection systems. Most anomaly detection approaches using defect detection datasets, such as MVTec AD, employ one-class models that require fitting separate models for each class. On the contrary, unified models eliminate the need for fitting separate models for each class and significantly reduce cost and memory requirements. Thus, in this work, we experiment with considering a unified multi-class setup. Our experimental study shows that multi-class models perform at par with one-class models for the standard MVTec AD dataset. Hence, this indicates that there may not be a need to learn separate object/class-wise models when the object classes are significantly different from each other, as is the case of the dataset considered. Furthermore, we have deployed three different unified lightweight architectures on the CPU and an edge device (NVIDIA Jetson Xavier NX). We analyze the quantized multi-class anomaly detection models in terms of latency and memory requirements for deployment on the edge device while comparing quantization-aware training (QAT) and post-training quantization (PTQ) for performance at different precision widths. In addition, we explored two different methods of calibration required in post-training scenarios and show that one of them performs notably better, highlighting its importance for unsupervised tasks. Due to quantization, the performance drop in PTQ is further compensated by QAT, which yields at par performance with the original 32-bit Floating point in two of the models considered.

[AI-51] owards a Scalable Reference-Free Evaluation of Generative Models

链接: https://arxiv.org/abs/2407.02961
作者: Azim Ospanov,Jingwei Zhang,Mohammad Jalali,Xuenan Cao,Andrej Bogdanov,Farzan Farnia
关键词: generally difficult due, applicable reference datasets, generative models, large-scale generative models, standard evaluation scores
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:While standard evaluation scores for generative models are mostly reference-based, a reference-dependent assessment of generative models could be generally difficult due to the unavailability of applicable reference datasets. Recently, the reference-free entropy scores, VENDI and RKE, have been proposed to evaluate the diversity of generated data. However, estimating these scores from data leads to significant computational costs for large-scale generative models. In this work, we leverage the random Fourier features framework to reduce the computational price and propose the Fourier-based Kernel Entropy Approximation (FKEA) method. We utilize FKEA’s approximated eigenspectrum of the kernel matrix to efficiently estimate the mentioned entropy scores. Furthermore, we show the application of FKEA’s proxy eigenvectors to reveal the method’s identified modes in evaluating the diversity of produced samples. We provide a stochastic implementation of the FKEA assessment algorithm with a complexity O(n) linearly growing with sample size n . We extensively evaluate FKEA’s numerical performance in application to standard image, text, and video datasets. Our empirical results indicate the method’s scalability and interpretability applied to large-scale generative models. The codebase is available at this https URL.

[AI-52] ObfuscaTune: Obfuscated Offsite Fine-tuning and Inference of Proprietary LLMs on Private Datasets

链接: https://arxiv.org/abs/2407.02960
作者: Ahmed Frikha,Nassim Walha,Ricardo Mendes,Krishna Kanth Nakka,Xue Jiang,Xuebing Zhou
关键词: proprietary LLM owned, data owner entity, model provider entity, proprietary LLM, LLM owned
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: Preprint

点击查看摘要

Abstract:This work addresses the timely yet underexplored problem of performing inference and finetuning of a proprietary LLM owned by a model provider entity on the confidential/private data of another data owner entity, in a way that ensures the confidentiality of both the model and the data. Hereby, the finetuning is conducted offsite, i.e., on the computation infrastructure of a third-party cloud provider. We tackle this problem by proposing ObfuscaTune, a novel, efficient and fully utility-preserving approach that combines a simple yet effective obfuscation technique with an efficient usage of confidential computing (only 5% of the model parameters are placed on TEE). We empirically demonstrate the effectiveness of ObfuscaTune by validating it on GPT-2 models with different sizes on four NLP benchmark datasets. Finally, we compare to a naïve version of our approach to highlight the necessity of using random matrices with low condition numbers in our approach to reduce errors induced by the obfuscation.

[AI-53] IncogniText: Privacy-enhancing Conditional Text Anonymization via LLM-based Private Attribute Randomization

链接: https://arxiv.org/abs/2407.02956
作者: Ahmed Frikha,Nassim Walha,Krishna Kanth Nakka,Ricardo Mendes,Xue Jiang,Xuebing Zhou
关键词: correctly inferring private, inferring private attributes, meaning and semantics, address the problem, prevent adversaries
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: Preprint

点击查看摘要

Abstract:In this work, we address the problem of text anonymization where the goal is to prevent adversaries from correctly inferring private attributes of the author, while keeping the text utility, i.e., meaning and semantics. We propose IncogniText, a technique that anonymizes the text to mislead a potential adversary into predicting a wrong private attribute value. Our empirical evaluation shows a reduction of private attribute leakage by more than 90%. Finally, we demonstrate the maturity of IncogniText for real-world applications by distilling its anonymization capability into a set of LoRA parameters associated with an on-device model.

[AI-54] PII-Compass: Guiding LLM training data extraction prompts towards the target PII via grounding

链接: https://arxiv.org/abs/2407.02943
作者: Krishna Kanth Nakka,Ahmed Frikha,Ricardo Mendes,Xue Jiang,Xuebing Zhou
关键词: large models stem, increased size, impactful advances, advances in large, large models
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
*备注: Accepted at ACL 2024

点击查看摘要

Abstract:The latest and most impactful advances in large models stem from their increased size. Unfortunately, this translates into an improved memorization capacity, raising data privacy concerns. Specifically, it has been shown that models can output personal identifiable information (PII) contained in their training data. However, reported PIII extraction performance varies widely, and there is no consensus on the optimal methodology to evaluate this risk, resulting in underestimating realistic adversaries. In this work, we empirically demonstrate that it is possible to improve the extractability of PII by over ten-fold by grounding the prefix of the manually constructed extraction prompt with in-domain data. Our approach, PII-Compass, achieves phone number extraction rates of 0.92%, 3.9%, and 6.86% with 1, 128, and 2308 queries, respectively, i.e., the phone number of 1 person in 15 is extractable.

[AI-55] GraCoRe: Benchmarking Graph Comprehension and Complex Reasoning in Large Language Models

链接: https://arxiv.org/abs/2407.02936
作者: Zike Yuan,Ming Liu,Hui Wang,Bing Qin
关键词: Large Language Models, Large Language, abilities of Large, Language Models, graph comprehension
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注:

点击查看摘要

Abstract:Evaluating the graph comprehension and reasoning abilities of Large Language Models (LLMs) is challenging and often incomplete. Existing benchmarks focus primarily on pure graph understanding, lacking a comprehensive evaluation across all graph types and detailed capability definitions. This paper presents GraCoRe, a benchmark for systematically assessing LLMs’ graph comprehension and reasoning. GraCoRe uses a three-tier hierarchical taxonomy to categorize and test models on pure graph and heterogeneous graphs, subdividing capabilities into 10 distinct areas tested through 19 tasks. Our benchmark includes 11 datasets with 5,140 graphs of varying complexity. We evaluated three closed-source and seven open-source LLMs, conducting thorough analyses from both ability and task perspectives. Key findings reveal that semantic enrichment enhances reasoning performance, node ordering impacts task success, and the ability to process longer texts does not necessarily improve graph comprehension or reasoning. GraCoRe is open-sourced at this https URL

[AI-56] owards Negotiative Dialogue for the Talkamatic Dialogue Manager

链接: https://arxiv.org/abs/2407.02917
作者: Staffan Larsson,Alexander Berman,David Hjelm
关键词: Talkamatic Dialogue Manager, Dialogue Manager, Talkamatic Dialogue, paper describes, describes a number
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The paper describes a number of dialogue phenomena associated with negotiative dialogue, as implemented in a development version of the Talkamatic Dialogue Manager (TDM). This implementation is an initial step towards full coverage of general features of negotiative dialogue in TDM.

[AI-57] SFC: Achieve Accurate Fast Convolution under Low-precision Arithmetic

链接: https://arxiv.org/abs/2407.02913
作者: Liulu He,Yufei Zhao,Rui Gao,Yuan Du,Li Du
关键词: accelerate convolution operations, efficiently accelerate convolution, Discrete Fourier Transform, efficiently accelerate, operations in deep
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV); Signal Processing (eess.SP); Numerical Analysis (math.NA)
*备注: ICML 2024

点击查看摘要

Abstract:Fast convolution algorithms, including Winograd and FFT, can efficiently accelerate convolution operations in deep models. However, these algorithms depend on high-precision arithmetic to maintain inference accuracy, which conflicts with the model quantization. To resolve this conflict and further improve the efficiency of quantized convolution, we proposes SFC, a new algebra transform for fast convolution by extending the Discrete Fourier Transform (DFT) with symbolic computing, in which only additions are required to perform the transformation at specific transform points, avoiding the calculation of irrational number and reducing the requirement for precision. Additionally, we enhance convolution efficiency by introducing correction terms to convert invalid circular convolution outputs of the Fourier method into effective ones. The numerical error analysis is presented for the first time in this type of work and proves that our algorithms can provide a 3.68x multiplication reduction for 3x3 convolution, while the Winograd algorithm only achieves a 2.25x reduction with similarly low numerical errors. Experiments carried out on benchmarks and FPGA show that our new algorithms can further improve the computation efficiency of quantized models while maintaining accuracy, surpassing both the quantization-alone method and existing works on fast convolution quantization.

[AI-58] he Shortcomings of Force-from-Motion in Robot Learning

链接: https://arxiv.org/abs/2407.02904
作者: Elie Aljalbout,Felix Frank,Patrick van der Smagt,Alexandros Paraschos
关键词: Robotic manipulation requires, manipulation requires accurate, requires accurate motion, Robotic manipulation, physical interaction control
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Robotic manipulation requires accurate motion and physical interaction control. However, current robot learning approaches focus on motion-centric action spaces that do not explicitly give the policy control over the interaction. In this paper, we discuss the repercussions of this choice and argue for more interaction-explicit action spaces in robot learning.

[AI-59] ranslatotron-V(ison): An End-to-End Model for In-Image Machine Translation

链接: https://arxiv.org/abs/2407.02894
作者: Zhibin Lan,Liqiang Niu,Fandong Meng,Jie Zhou,Min Zhang,Jinsong Su
关键词: In-image machine translation, In-image machine, image, aims to translate, IIMT model
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
*备注: Accepted to ACL 2024 Findings

点击查看摘要

Abstract:In-image machine translation (IIMT) aims to translate an image containing texts in source language into an image containing translations in target language. In this regard, conventional cascaded methods suffer from issues such as error propagation, massive parameters, and difficulties in deployment and retaining visual characteristics of the input image. Thus, constructing end-to-end models has become an option, which, however, faces two main challenges: 1) the huge modeling burden, as it is required to simultaneously learn alignment across languages and preserve the visual characteristics of the input image; 2) the difficulties of directly predicting excessively lengthy pixel sequences. In this paper, we propose \textitTranslatotron-V(ision), an end-to-end IIMT model consisting of four modules. In addition to an image encoder, and an image decoder, our model contains a target text decoder and an image tokenizer. Among them, the target text decoder is used to alleviate the language alignment burden, and the image tokenizer converts long sequences of pixels into shorter sequences of visual tokens, preventing the model from focusing on low-level visual features. Besides, we present a two-stage training framework for our model to assist the model in learning alignment across modalities and languages. Finally, we propose a location-aware evaluation metric called Structure-BLEU to assess the translation quality of the generated images. Experimental results demonstrate that our model achieves competitive performance compared to cascaded models with only 70.9% of parameters, and significantly outperforms the pixel-level end-to-end IIMT model.

[AI-60] GPTQT: Quantize Large Language Models Twice to Push the Efficiency

链接: https://arxiv.org/abs/2407.02891
作者: Yipin Guo,Yilin Lang,Qinyuan Ren
关键词: generative Large Language, Large Language Models, Large Language, require significant computing, large size
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: Accepted by 11th IEEE International Conference on Cybernetics and Intelligent Systems

点击查看摘要

Abstract:Due to their large size, generative Large Language Models (LLMs) require significant computing and storage resources. This paper introduces a new post-training quantization method, GPTQT, to reduce memory usage and enhance processing speed by expressing the weight of LLM in 3bit/2bit. Practice has shown that minimizing the quantization error of weights is ineffective, leading to overfitting. Therefore, GPTQT employs a progressive two-step approach: initially quantizing weights using Linear quantization to a relatively high bit, followed by converting obtained int weight to lower bit binary coding. A re-explore strategy is proposed to optimize initial scaling factor. During inference, these steps are merged into pure binary coding, enabling efficient computation. Testing across various models and datasets confirms GPTQT’s effectiveness. Compared to the strong 3-bit quantization baseline, GPTQT further reduces perplexity by 4.01 on opt-66B and increases speed by 1.24 times on opt-30b. The results on Llama2 show that GPTQT is currently the best binary coding quantization method for such kind of LLMs.

[AI-61] Joint Optimization of Resource Allocation and Data Selection for Fast and Cost-Efficient Federated Edge Learning

链接: https://arxiv.org/abs/2407.02888
作者: Yunjian Jia,Zhen Huang,Jiping Yan,Yulu Zhang,Kun Luo,Wanli Wen
关键词: Deploying federated learning, federated edge learning, edge introduces federated, introduces federated edge, wireless edge introduces
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Deploying federated learning at the wireless edge introduces federated edge learning (FEEL). Given FEEL’s limited communication resources and potential mislabeled data on devices, improper resource allocation or data selection can hurt convergence speed and increase training costs. Thus, to realize an efficient FEEL system, this paper emphasizes jointly optimizing resource allocation and data selection. Specifically, in this work, through rigorously modeling the training process and deriving an upper bound on FEEL’s one-round convergence rate, we establish a problem of joint resource allocation and data selection, which, unfortunately, cannot be solved directly. Toward this end, we equivalently transform the original problem into a solvable form via a variable substitution and then break it into two subproblems, that is, the resource allocation problem and the data selection problem. The two subproblems are mixed-integer non-convex and integer non-convex problems, respectively, and achieving their optimal solutions is a challenging task. Based on the matching theory and applying the convex-concave procedure and gradient projection methods, we devise a low-complexity suboptimal algorithm for the two subproblems, respectively. Finally, the superiority of our proposed scheme of joint resource allocation and data selection is validated by numerical results.

[AI-62] Complex Event Recognition with Symbolic Register Transducers: Extended Technical Report

链接: https://arxiv.org/abs/2407.02884
作者: Elias Alevizos,Alexander Artikis,Georgios Paliouras
关键词: Complex Event Recognition, Event Recognition, Complex Event, SRT, CER
类目: Formal Languages and Automata Theory (cs.FL); Artificial Intelligence (cs.AI); Databases (cs.DB)
*备注: arXiv admin note: substantial text overlap with arXiv:2110.04032

点击查看摘要

Abstract:We present a system for Complex Event Recognition (CER) based on automata. While multiple such systems have been described in the literature, they typically suffer from a lack of clear and denotational semantics, a limitation which often leads to confusion with respect to their expressive power. In order to address this issue, our system is based on an automaton model which is a combination of symbolic and register automata. We extend previous work on these types of automata, in order to construct a formalism with clear semantics and a corresponding automaton model whose properties can be formally investigated. We call such automata Symbolic Register Transducers (SRT). We show that SRT are closed under various operators, but are not in general closed under complement and they are not determinizable. However, they are closed under these operations when a window operator, quintessential in Complex Event Recognition, is used. We show how SRT can be used in CER in order to detect patterns upon streams of events, using our framework that provides declarative and compositional semantics, and that allows for a systematic treatment of such automata. For SRT to work in pattern detection, we allow them to mark events from the input stream as belonging to a complex event or not, hence the name “transducers”. We also present an implementation of SRT which can perform CER. We compare our SRT-based CER engine against other state-of-the-art CER systems and show that it is both more expressive and more efficient.

[AI-63] ShiftAddAug: Augment Multiplication-Free Tiny Neural Network with Hybrid Computation

链接: https://arxiv.org/abs/2407.02881
作者: Yipin Guo,Zihao Li,Yilin Lang,Qinyuan Ren
关键词: Shift and Add, compatibility with hardware, gained prominence, Operators devoid, Add
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注: Accepted by 2024 CVPR Workshop : Efficient Deep Learning for Computer Vision

点击查看摘要

Abstract:Operators devoid of multiplication, such as Shift and Add, have gained prominence for their compatibility with hardware. However, neural networks (NNs) employing these operators typically exhibit lower accuracy compared to conventional NNs with identical structures. ShiftAddAug uses costly multiplication to augment efficient but less powerful multiplication-free operators, improving performance without any inference overhead. It puts a ShiftAdd tiny NN into a large multiplicative model and encourages it to be trained as a sub-model to obtain additional supervision. In order to solve the weight discrepancy problem between hybrid operators, a new weight sharing method is proposed. Additionally, a novel two stage neural architecture search is used to obtain better augmentation effects for smaller but stronger multiplication-free tiny neural networks. The superiority of ShiftAddAug is validated through experiments in image classification and semantic segmentation, consistently delivering noteworthy enhancements. Remarkably, it secures up to a 4.95% increase in accuracy on the CIFAR100 compared to its directly trained counterparts, even surpassing the performance of multiplicative NNs.

[AI-64] Knowledge Composition using Task Vectors with Learned Anisotropic Scaling

链接: https://arxiv.org/abs/2407.02880
作者: Frederic Z. Zhang,Paul Albert,Cristian Rodriguez-Opazo,Anton van den Hengel,Ehsan Abbasnejad
关键词: produce strong generic, models produce strong, Pre-trained models produce, task vectors, strong generic representations
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
*备注:

点击查看摘要

Abstract:Pre-trained models produce strong generic representations that can be adapted via fine-tuning. The learned weight difference relative to the pre-trained model, known as a task vector, characterises the direction and stride of fine-tuning. The significance of task vectors is such that simple arithmetic operations on them can be used to combine diverse representations from different domains. This paper builds on these properties of task vectors and aims to answer (1) whether components of task vectors, particularly parameter blocks, exhibit similar characteristics, and (2) how such blocks can be used to enhance knowledge composition and transfer. To this end, we introduce aTLAS, an algorithm that linearly combines parameter blocks with different learned coefficients, resulting in anisotropic scaling at the task vector level. We show that such linear combinations explicitly exploit the low intrinsic dimensionality of pre-trained models, with only a few coefficients being the learnable parameters. Furthermore, composition of parameter blocks leverages the already learned representations, thereby reducing the dependency on large amounts of data. We demonstrate the effectiveness of our method in task arithmetic, few-shot recognition and test-time adaptation, with supervised or unsupervised objectives. In particular, we show that (1) learned anisotropic scaling allows task vectors to be more disentangled, causing less interference in composition; (2) task vector composition excels with scarce or no labeled data and is less prone to domain shift, thus leading to better generalisability; (3) mixing the most informative parameter blocks across different task vectors prior to training can reduce the memory footprint and improve the flexibility of knowledge transfer. Moreover, we show the potential of aTLAS as a PEFT method, particularly with less data, and demonstrate that its scalibility.

[AI-65] Efficient Fusion and Task Guided Embedding for End-to-end Autonomous Driving

链接: https://arxiv.org/abs/2407.02878
作者: Yipin Guo,Yilin Lang,Qinyuan Ren
关键词: run neural networks, neural networks leveraging, networks leveraging imitation, leveraging imitation learning, imitation learning typically
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: Best Paper Award of the IEEE 13th Data-Driven Control and Learning Systems Conference

点击查看摘要

Abstract:To address the challenges of sensor fusion and safety risk prediction, contemporary closed-loop autonomous driving neural networks leveraging imitation learning typically require a substantial volume of parameters and computational resources to run neural networks. Given the constrained computational capacities of onboard vehicular computers, we introduce a compact yet potent solution named EfficientFuser. This approach employs EfficientViT for visual information extraction and integrates feature maps via cross attention. Subsequently, it utilizes a decoder-only transformer for the amalgamation of multiple features. For prediction purposes, learnable vectors are embedded as tokens to probe the association between the task and sensor features through attention. Evaluated on the CARLA simulation platform, EfficientFuser demonstrates remarkable efficiency, utilizing merely 37.6% of the parameters and 8.7% of the computations compared to the state-of-the-art lightweight method with only 0.4% lower driving score, and the safety score neared that of the leading safety-enhanced method, showcasing its efficacy and potential for practical deployment in autonomous driving systems.

[AI-66] Membership Inference Attacks Against Time-Series Models

链接: https://arxiv.org/abs/2407.02870
作者: Noam Koren,Abigail Goldsteen,Ariel Farkash,Guy Amit
关键词: Analyzing time-series data, Analyzing time-series, personal information, Analyzing, privacy concerns
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 16 pages

点击查看摘要

Abstract:Analyzing time-series data that may contain personal information, particularly in the medical field, presents serious privacy concerns. Sensitive health data from patients is often used to train machine-learning models for diagnostics and ongoing care. Assessing the privacy risk of such models is crucial to making knowledgeable decisions on whether to use a model in production, share it with third parties, or deploy it in patients homes. Membership Inference Attacks (MIA) are a key method for this kind of evaluation, however time-series prediction models have not been thoroughly studied in this context. We explore existing MIA techniques on time-series models, and introduce new features, focusing on the seasonality and trend components of the data. Seasonality is estimated using a multivariate Fourier transform, and a low-degree polynomial is used to approximate trends. We applied these techniques to various types of time-series models, using datasets from the health domain. Our results demonstrate that these new features enhance the effectiveness of MIAs in identifying membership, improving the understanding of privacy risks in medical data applications.

[AI-67] Fast maneuver recovery from aerial observation: trajectory clustering and outliers rejection

链接: https://arxiv.org/abs/2407.02863
作者: Nelson de Moura(ASTRA),Augustin Gervreau-Mercier(ASTRA),Fernando Garrido(ASTRA),Fawzi Nashashibi(ASTRA)
关键词: open problem, Vulnerable Road Users, road user models, realistically reproduce, reproduce a credible
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:The implementation of road user models that realistically reproduce a credible behavior in a multi-agentsimulation is still an open problem. A data-driven approach consists on to deduce behaviors that may exist in real situation to obtain different types of trajectories from a large set of observations. The data, and its classification, could then be used to train models capable to extrapolate such behavior. Cars and two different types of Vulnerable Road Users (VRU) will be considered by the trajectory clustering methods proposed: pedestrians and cyclists. The results reported here evaluate methods to extract well-defined trajectory classes from raw data without the use of map information while also separating ‘‘eccentric’’ or incomplete trajectories from the ones that are complete and representative in any scenario. Two environments will serve as test for the methods develop, three different intersections and one roundabout. The resulting clusters of trajectories can then be used for prediction or learning tasks or discarded if it is composed by outliers.

[AI-68] MindBench: A Comprehensive Benchmark for Mind Map Structure Recognition and Analysis

链接: https://arxiv.org/abs/2407.02842
作者: Lei Chen,Feng Yan,Yujie Zhong,Shaoxiang Chen,Zequn Jie,Lin Ma
关键词: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, Language Models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
*备注: technical report

点击查看摘要

Abstract:Multimodal Large Language Models (MLLM) have made significant progress in the field of document analysis. Despite this, existing benchmarks typically focus only on extracting text and simple layout information, neglecting the complex interactions between elements in structured documents such as mind maps and flowcharts. To address this issue, we introduce the new benchmark named MindBench, which not only includes meticulously constructed bilingual authentic or synthetic images, detailed annotations, evaluation metrics and baseline models, but also specifically designs five types of structured understanding and parsing tasks. These tasks include full parsing, partial parsing, position-related parsing, structured Visual Question Answering (VQA), and position-related VQA, covering key areas such as text recognition, spatial awareness, relationship discernment, and structured parsing. Extensive experimental results demonstrate the substantial potential and significant room for improvement in current models’ ability to handle structured document information. We anticipate that the launch of MindBench will significantly advance research and application development in structured document analysis technology. MindBench is available at: this https URL.

[AI-69] CRUISE on Quantum Computing for Feature Selection in Recommender Systems

链接: https://arxiv.org/abs/2407.02839
作者: Jiayang Niu,Jie Li,Ke Deng,Yongli Ren
关键词: Recommender Systems, worthwhile research topic, Systems that classical, Quantum Computers, classical computers
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注: accepted by QuantumCLEF 2024

点击查看摘要

Abstract:Using Quantum Computers to solve problems in Recommender Systems that classical computers cannot address is a worthwhile research topic. In this paper, we use Quantum Annealers to address the feature selection problem in recommendation algorithms. This feature selection problem is a Quadratic Unconstrained Binary Optimization(QUBO) problem. By incorporating Counterfactual Analysis, we significantly improve the performance of the item-based KNN recommendation algorithm compared to using pure Mutual Information. Extensive experiments have demonstrated that the use of Counterfactual Analysis holds great promise for addressing such problems.

[AI-70] Effect of a Process Mining based Pre-processing Step in Prediction of the Critical Health Outcomes

链接: https://arxiv.org/abs/2407.02821
作者: Negin Ashrafi,Armin Abdollahi,Greg Placencia,Maryam Pishgar
关键词: Predicting critical health, Predicting critical, improving survivability, patient mortality, readmission is essential
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Predicting critical health outcomes such as patient mortality and hospital readmission is essential for improving survivability. However, healthcare datasets have many concurrences that create complexities, leading to poor predictions. Consequently, pre-processing the data is crucial to improve its quality. In this study, we use an existing pre-processing algorithm, concatenation, to improve data quality by decreasing the complexity of datasets. Sixteen healthcare datasets were extracted from two databases - MIMIC III and University of Illinois Hospital - converted to the event logs, they were then fed into the concatenation algorithm. The pre-processed event logs were then fed to the Split Miner (SM) algorithm to produce a process model. Process model quality was evaluated before and after concatenation using the following metrics: fitness, precision, F-Measure, and complexity. The pre-processed event logs were also used as inputs to the Decay Replay Mining (DREAM) algorithm to predict critical outcomes. We compared predicted results before and after applying the concatenation algorithm using Area Under the Curve (AUC) and Confidence Intervals (CI). Results indicated that the concatenation algorithm improved the quality of the process models and predictions of the critical health outcomes.

[AI-71] Images Speak Louder than Words: Understanding and Mitigating Bias in Vision-Language Model from a Causal Mediation Perspective

链接: https://arxiv.org/abs/2407.02814
作者: Zhaotian Weng,Zijun Gao,Jerone Andrews,Jieyu Zhao
关键词: inadvertently learn biases, Vision-language models, correlating gender information, pre-trained on extensive, objects or scenarios
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (